I recently needed to create a map of a decent-sized dataset (13,000+ records) that I’d downloaded from the state of New Jersey. I’m new to the large-dataset* game, so I had some tricks to learn, and this seemed like a good opportunity. There’s lots of great information on the web about this, but this post may be helpful if you’re new to the custom mapping game. The biggest problem you’ll face is being on the steep side of the learning curve for several new applications at once. Hang in there.

The location data in my dataset had x and y coordinates that I could tell weren’t latitude and longitude. I wanted to get this on a map, but how to do it?

The data had street address information, though the street and city weren’t in the same field and there was no state column since NJ was assumed. So I imported the file into Google Refine, added a state column, filled it with the ‘fill down’ command, and then concatenated the Street, City, and State columns into a single new column.
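If you’d rather script that step than click through Refine, here’s a minimal Python sketch of the same idea. It isn’t what I actually did; the column names (Street, City, State, and a new Address column) and the sample rows are assumptions for illustration, so adjust them to match your file.

```python
import csv
import io


def add_full_address(rows, state="NJ"):
    """Yield each row with a State column and a combined Address column."""
    for row in rows:
        out = dict(row)
        out["State"] = state  # the source file has no state column
        parts = (out["Street"], out["City"], out["State"])
        # Strip stray whitespace so the geocoder sees a clean address.
        out["Address"] = ", ".join(part.strip() for part in parts)
        yield out


# Hypothetical sample data standing in for the downloaded file.
sample = io.StringIO(
    "Street,City\n"
    "125 West State St,Trenton\n"
    "1 Main St , Newark\n"
)
result = list(add_full_address(csv.DictReader(sample)))

for row in result:
    print(row["Address"])
```

For a real file, swap the `io.StringIO` sample for `open("yourfile.csv", newline="")` and write the rows back out with `csv.DictWriter`.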

My first attempt at a map was to export a .csv file which I brought into Google Fusion Tables. Using the Visualize > Map command, I had Google geocode the street addresses into latitude/longitude location data and plot it on the map. So far so good. But for me at least, the Google (Fusion Tables) maps lack a certain visual polish, not to mention customizability.

Problem is, Fusion Tables won’t let you export the latitude/longitude data anymore (though that apparently used to be a feature). So I needed another way to get that data. There are several services that will do this for you (Google Refine has one, as referenced in this video), but they all limit the number of records you can geocode, and I was well above that limit.
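To get a feel for what those limits mean in practice, here’s a small Python sketch using the numbers from this post: roughly 13,000 records against the 2,500 records/day cap mentioned in the footnote. The function names are mine, and it assumes the cap is a simple per-day count.

```python
import math


def days_to_geocode(record_count, daily_limit):
    """Number of days needed if only daily_limit records can be geocoded per day."""
    return math.ceil(record_count / daily_limit)


def split_into_batches(records, batch_size):
    """Split a list of records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


days = days_to_geocode(13000, 2500)
batches = split_into_batches(list(range(13000)), 2500)
print(days, [len(batch) for batch in batches])
```

In other words, staying inside the free limit would have meant feeding the service one batch a day for about a week, which is why I went looking for another route.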

I happened on John Keefe’s post about re-projecting map data in QGIS. It works, but I’ll warn you that installing QGIS is a bit involved and getting the lat/long data out involves some knowledge of QGIS which I didn’t have at the time.

First you’ll need to install QGIS. That involves installing several frameworks that QGIS depends on. All these are documented on the QGIS OSX download page (note: you can install QGIS for other systems as well, see the QGIS site), but know that it’s not as simple as doing a single app install.

My data was a shapefile, so to get it into QGIS I added it as a layer (Layer > Add Vector Layer). After a little research I determined that those x and y coordinates were in a coordinate reference system (CRS) known as NAD83 / New York East (Ft/US). To get lat/long coordinates, you’ll need to translate or ‘project’ these onto a different CRS, namely WGS 84 (EPSG: 4326). Here’s how to project the data from one to the other (hat tip: John Keefe). It’s worth checking the result later to verify that the projection came out correctly.
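If you’re curious what that projection step does under the hood, here’s a rough, standard-library-only Python sketch. It is not what QGIS runs internally. It assumes the zone is EPSG:2260 (Transverse Mercator on the GRS80 ellipsoid, origin at 38°50′N, 74°30′W, scale factor 0.9999, false easting 150,000 m, units in US survey feet), uses the textbook series formulas for the inverse Transverse Mercator, and treats NAD83 and WGS 84 as the same datum, which ignores a small offset. For real work, let QGIS or a projection library do this.

```python
import math

# Assumed parameters for EPSG:2260, NAD83 / New York East (ftUS).
A = 6378137.0                       # GRS80 semi-major axis (metres)
F = 1 / 298.257222101               # GRS80 flattening
E2 = 2 * F - F * F                  # first eccentricity squared
EP2 = E2 / (1 - E2)                 # second eccentricity squared
LAT0 = math.radians(38 + 50 / 60)   # latitude of origin
LON0 = math.radians(-74.5)          # central meridian
K0 = 0.9999                         # scale factor on the central meridian
FALSE_EASTING_M = 150000.0          # metres (false northing is 0)
US_SURVEY_FOOT_M = 1200 / 3937      # metres per US survey foot


def meridian_arc(phi):
    """Distance in metres along the meridian from the equator to latitude phi."""
    e4, e6 = E2 ** 2, E2 ** 3
    return A * (
        (1 - E2 / 4 - 3 * e4 / 64 - 5 * e6 / 256) * phi
        - (3 * E2 / 8 + 3 * e4 / 32 + 45 * e6 / 1024) * math.sin(2 * phi)
        + (15 * e4 / 256 + 45 * e6 / 1024) * math.sin(4 * phi)
        - (35 * e6 / 3072) * math.sin(6 * phi)
    )


def state_plane_to_lat_lon(x_ft, y_ft):
    """Convert state plane x/y in US survey feet to (latitude, longitude) in degrees."""
    x = x_ft * US_SURVEY_FOOT_M - FALSE_EASTING_M
    y = y_ft * US_SURVEY_FOOT_M

    # Footpoint latitude: the latitude on the central meridian at the same northing.
    m = meridian_arc(LAT0) + y / K0
    mu = m / (A * (1 - E2 / 4 - 3 * E2 ** 2 / 64 - 5 * E2 ** 3 / 256))
    e1 = (1 - math.sqrt(1 - E2)) / (1 + math.sqrt(1 - E2))
    phi1 = (
        mu
        + (3 * e1 / 2 - 27 * e1 ** 3 / 32) * math.sin(2 * mu)
        + (21 * e1 ** 2 / 16 - 55 * e1 ** 4 / 32) * math.sin(4 * mu)
        + (151 * e1 ** 3 / 96) * math.sin(6 * mu)
        + (1097 * e1 ** 4 / 512) * math.sin(8 * mu)
    )

    sin1, cos1, tan1 = math.sin(phi1), math.cos(phi1), math.tan(phi1)
    c1 = EP2 * cos1 ** 2
    t1 = tan1 ** 2
    n1 = A / math.sqrt(1 - E2 * sin1 ** 2)
    r1 = A * (1 - E2) / (1 - E2 * sin1 ** 2) ** 1.5
    d = x / (n1 * K0)

    lat = phi1 - (n1 * tan1 / r1) * (
        d ** 2 / 2
        - (5 + 3 * t1 + 10 * c1 - 4 * c1 ** 2 - 9 * EP2) * d ** 4 / 24
        + (61 + 90 * t1 + 298 * c1 + 45 * t1 ** 2 - 252 * EP2 - 3 * c1 ** 2) * d ** 6 / 720
    )
    lon = LON0 + (
        d
        - (1 + 2 * t1 + c1) * d ** 3 / 6
        + (5 - 2 * c1 + 28 * t1 - 3 * c1 ** 2 + 8 * EP2 + 24 * t1 ** 2) * d ** 5 / 120
    ) / cos1
    return math.degrees(lat), math.degrees(lon)


# The zone's false origin (x = 492,125 ft, y = 0 ft) should map back to
# roughly the latitude of origin and the central meridian.
origin = state_plane_to_lat_lon(492125.0, 0.0)
# A made-up point north-east of the origin, just to exercise the function.
lat, lon = state_plane_to_lat_lon(600000.0, 400000.0)
print(origin, (lat, lon))
```

The useful takeaway is less the math than the inputs: you can’t convert the x and y values until you know which CRS they’re in, because the origin, scale factor and units all come from that definition.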

An aside here: If your experience using maps up to this point is restricted to Google Maps, here’s where things get a bit complicated. Basically, there’s a whole slew of methods of translating spherical map data (the earth) onto a flat surface (a paper map or your screen). The different methods are called coordinate reference systems (CRS). To translate x and y coordinate data you may have into latitude and longitude data, you’ll need to figure out what CRS you’re starting from and ‘project’ it to another one. (Apologies to map aficionados if I butchered that explanation – please set me straight in the comments).

Once your data is in WGS 84 (EPSG: 4326), its coordinates will be latitude and longitude. Here’s how to get it back out of QGIS (v1.8): Right-click on your data layer and choose ‘Save As…’. In my case I chose a .csv file as the format, since I find it the most portable if I’m going to hop the data through another service like Fusion Tables (if I’m building a map with Leaflet, I’ll use GeoJSON, but that’s a subject for another post). Click the Browse button and choose a name and location to save the file to. The CRS field should show ‘WGS 84’ (EPSG: 4326). In the ‘Layers’ text box, type ‘GEOMETRY=AS_XY’**. You should now have a file that contains your original data and the new latitude and longitude coordinates.
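A quick way to check that the export really contains latitude and longitude is to confirm every point lands somewhere near New Jersey. Here’s a hedged Python sketch: it assumes the exported CSV has X (longitude) and Y (latitude) columns, which is what the GEOMETRY=AS_XY option should produce, and the bounding box is only my rough approximation of the state. The sample rows are made up.

```python
import csv
import io

# Approximate bounding box for New Jersey in degrees (a rough assumption).
NJ_BOUNDS = {"min_lon": -75.6, "max_lon": -73.8, "min_lat": 38.8, "max_lat": 41.4}


def find_suspect_rows(rows, bounds=NJ_BOUNDS):
    """Return the file line numbers of rows whose X/Y fall outside the bounds."""
    suspects = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            lon, lat = float(row["X"]), float(row["Y"])
        except (KeyError, TypeError, ValueError):
            suspects.append(line_no)  # missing or non-numeric coordinate
            continue
        in_lon = bounds["min_lon"] <= lon <= bounds["max_lon"]
        in_lat = bounds["min_lat"] <= lat <= bounds["max_lat"]
        if not (in_lon and in_lat):
            suspects.append(line_no)
    return suspects


# Hypothetical export: one good row, one row that is still in projected feet.
sample = io.StringIO(
    "X,Y,Name\n"
    "-74.7597,40.2206,looks fine\n"
    "418000.0,507000.0,still projected\n"
)
suspect_lines = find_suspect_rows(csv.DictReader(sample))
print(suspect_lines)
```

If every row shows up as suspect, the most likely culprit is that the layer was saved in its original CRS rather than WGS 84.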

*I realize that 13,000 records isn’t really a *large* dataset in the greater scheme of things, but it’s the largest I had worked with at that point and was large enough that it exceeded the limits of the free version of BatchGeo and Google Refine’s built-in geocoding service (which as of this writing will only geocode 2,500 records/day).