Using bigmemory for a distance matrix

The process of working on metadata and temperature series gives rise to several situations where I need to calculate the distance from every station to every other station. With a small number of stations this can be done easily on the fly with the result stored in a matrix. The matrix has rows and columns equal to the number of stations and it is filled by looping through the various station combinations. Of course one need only calculate the distance for every pair once. When the number of stations increases to 10s of thousands, such as with the Berkeley Earth project, there are two challenges. First is the time that the process takes and second is the memory it consumes. Handling the space problem is relatively straightforward and I decided to use bigmemory to store the data. The speed problem is an entirely different matter, but I’m making a little progress. In addition to storing the data I also want a way to quickly retrieve the neighbors of a given station and I want my solution to be relatively bullet proof and easy to understand.

To solve the problem I define two functions createDistanceMatrix() and getNeighbors(). The coed for each follows:

The function is declared with a variable for the Inventory of stations, a Directory for where the file will be written, and a default filename. Next we indicate that the bigmemory library is required and then proceed to define a local function [deg2rad()] and do some simple input checking. I enforce the use of a specific extension (“.bin”) which I used for all my filebacked bigmatrices. Then we check the inventory to make sure that it has the columns required:columns named “Id”, “Lon” and “Lat”. In my coding Inventories are usually data.frames, but I allow for using a matrix. If you use a matrix I coerce it locally to a data.frame. Finally, I create a new file name that is critical for future “reads” of the data. In bigmemorywhen you write a file a “description file” is created to document the structure of the file. We will use this to attach the file in the future. Proceeding:

These lines require a bit of explanation. To calculate the distance between stations I am going to use a great circle calculation. That calculation repeatedly takes the sin() and cos() of the latitude. For speed I do that once for every station in the inventory. Also, R takes radians so I do that transformation once. ER is the earth’s radius in kilometers. Next we define some parameters for the file we want to create:

The matrix we create and store as a file will be square. In truth I could make the matrix one row or one column smaller, but I’ve left it like this for now. Note as well that I decide to set the options for using rownames and column names. This will allow people to use character Ids for their station Id in an inventory. That will be clearer as we progress. With those preliminaries we are ready to initialize the filebacked big matrix:

Some things to note: I set the rows and columns to be equal. I initialize the data with NA. We only write into the upper diagonal of the matrix and I want the other values to be NA. I set the dimnames so that the rownames are equal to the station Ids. They can be numbers or characters. When you seek on the file you’ll use the station ID. So, you could use characters ( “USC26541328″) for station Ids and the functions will still work. I also set a directory where the file is written, the file name and the decriptor name. Note, the decriptor name is DERIVED from the filename by simply changing the extension: “distance.bin” has a file descriptor of “distance.desc”. Type is set as “double.” Next I do a couple of things that might look confusing. Sometime a station has no Latitude or Longitude. I handle that here by allowing those stations including in the Inventory, but I detect that problem and write a negative distance in the appropriate rows and columns. Our matrix will have 3 values: distance, a negative filler, and NA for elements of the matrix that dont have to be calculated ( lower half of the diagonal ). I also determine which rows have valid Lat and Lon. That will be used as a set for looping controls. Below we set the row and column of stations that have missing Lat/Lon to -9999.

Time to loop over the data. The approach is simple: Imagine we had a 4×4 matrix: we want to generate 1:2;1:3;1:4;2:3;2:4;3:4. Our first loop control goes over the 1-3 and the second goes over 2-4. Because some stations have missing Lat/lon we will skip over them. That is the purpose of validLonLat. It conatins those rows that are valid.

So, we start looping over the first element in validLonLat and we loop to N-1. In the second loop, we skip lower rownumbers. The range value is calculated by using the precomputed values for sinLat() and cosLat() so that I am basically reducing the number of trig functions that have to be called by orders of magnitude. Finally, there is a nasty little warning we have to care about. Sometimes when two stations have the same location acos() will blow up. This happens when acos() is feed a number slightly greater than 1 or less than -1. It happens rarely. The result is a NaN instead of the desired result of zero. Rather than prevent the warning, I decide to allow the calculation and catch and fix the error. I suppose I could trim sinLat or cosLat and prevent it. In anycase when this finishes we have created a bigmatrix on file. Lets test it:

I create a random inventory of 1000 stations. I introduce some random NAs and time the function:

Time difference of 6.371714 mins

Creating a matrix of 1 million elements takes a bit over 6 minutes. For this we are calculating roughly 500K distances. I’m sure part of the slowness of this approach is the repeated writing I am doing to the file. So, I’ll have to investigate ways to buffer this, but for now it works. I could also see ways of using apply(). Now, lets look at how we get data from this file.

To grab a stations neighbors I call “getNeighbors” and pass the filename of the descriptor file. I may change this, bt you’ll get the idea. I also pass the Id. Id is supposed to be passed as a character. If you screw up and pass an integer, I coerce it under the hood and proceed. The code for the function looks like this:

Pretty straightforward. First we “attach” the file to read it using the descriptor file. If “Id” is not a character I coerce it. Then I do some checking to insure that the ID can be found in the rownames of the file and to make sure it is only found once. I will probably add a check to the creating file to make sure that Ids are unique. For now I check it here, but that will change. Then I pull the neighbors for the ID. The neighbors of any given station consists of the cells in a column and cells in a row. The best way to see this is to draw a diagram using a 4×4 matrix as an example: presume you want the 3rd row: To get those values you take rows 1-3 of the 3rd column. cell 3,3 will be NA because we do not calculate the distance from a site to itself. The balance of neighbors will be in 3 row on column 4. neighbors <- c(D[1:dex,dex],D[dex,(dex+1):ncol(D)])

That line collects all the neighbors for an Id including the Id. Next, we take the cell that is NA D[dex,dex] and set it to 0 in neighbors. Finally we take the negative distances and set them to NA. I suppose I could just as easily get rid of the whole setting things to -9999 ploy and probably will in a final version.