Martijn van Exel and his OpenStreetMap adventures


Let me explain. Back in 2007, we imported TIGER/Line data from the U.S. Census into OpenStreetMap. TIGER/Line was and is pretty crappy geodata, never meant to make pretty maps with, let alone do frivolous things like routing. But we did it anyway, because it gave us more or less complete base data for the U.S. to work with. And so we worked. And worked and worked. Fixing spaghetti highways. Reclassifying and arguing about reclassifying. Connecting and disconnecting. And now, four years and change later, we have pretty maps!

But we don’t all live in Epcot Center (actually, I don’t think any of us do really) and there’s lots of places where we haven’t been taking as good care of the data. Vast expanses of U.S. territory where the majority of the data in OSM is still TIGER as it was imported all those years ago.

The TIGER deserts.

I want those all to go away, but there’s only so many folks mapping here in the U.S., so we may want to prioritize a little bit. So I wanted to look at the TIGER desert phenomenon a little closer. In particular, I wanted to look at the parts of TIGER deserts that cover towns, and even cities. That’s where people actually do live, and that’s where OSM data is most valuable.

So I set out to identify those the only way I know how: using ArcGIS. The thing is: this kind of job has ‘raster analysis’ plastered all over it, and I just don’t know how to do that using FOSS4G tools. So maybe I’ll just explain what I did in ArcGIS, and then you all can chime in with smart comments about how to do this in PostGIS, R, GRASS, QGIS or whatever free software can do this magic. If you don’t care for all that, just scroll to the bottom for the results.

I created a shapefile first with all OSM highways with TIGER tags in Florida using C++ and osmium. (There’s some good example code to get you started if you’re interested.)

Then, with that loaded into ArcMap, I first created a 5km grid with the predominant (in terms of way length) version number as the cell value.

A second grid for the neighborhood way density:

I reclassified the version grid generously – all cells with 1 or 2 as the predominant TIGER way version got true / 1, the rest false / 0. For distinguishing between built-up and boondock cells, a threshold of 1.8 looked good after some tweaking.

And finally some simple map algebra to combine the two variables into the final result grid:
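I haven’t actually tried this, but I suspect at least the reclassification and map algebra steps could be replicated with GDAL’s gdal_calc.py once the two grids are exported as GeoTIFFs. A hypothetical sketch (the file names are made up):

```shell
# Reclassify: a cell is "TIGER-ish" when the predominant way version is 1 or 2
gdal_calc.py -A tiger_version.tif --outfile=version_flag.tif \
  --calc="(A>=1)*(A<=2)" --NoDataValue=0

# Threshold the density grid at 1.8 to separate built-up from boondock cells
gdal_calc.py -B way_density.tif --outfile=density_flag.tif \
  --calc="B>=1.8" --NoDataValue=0

# Combine: ghost towns = predominantly untouched TIGER AND built-up
gdal_calc.py -A version_flag.tif -B density_flag.tif \
  --outfile=ghost_towns.tif --calc="A*B" --NoDataValue=0
```

Creating the grids themselves from the ways shapefile is the part I’d still want suggestions for.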

So there we are folks – TIGER deserts and TIGER ghost towns in Florida:

TIGER deserts in Florida

TIGER ghost towns in Florida

Hmm. I hope we can figure out a way to improve this analysis so the situation does not look quite so bleak. GIS does not lie though – these are the 5km cells that have a reasonably high way density and TIGER way versions that are predominantly 1 or 2.

So let me know folks – 1) Is this a good approach for identifying TIGER ghost towns and if not, what is? and 2) how do you do this using FOSS4G tools?

I had what I thought was a pretty straightforward use case for OpenStreetMap data:

I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database.

There are about 125,000 of them – for now loosely defined as ‘ways that have the bridge tag’. So on the scale of OpenStreetMap data it’s a really small subset. In terms of the tools and processes needed, the task seems easy enough, and as long as you are satisfied with a one-off solution, it really is. You would need only four things:

This will extract the planet file and pipe the output to osmosis. Osmosis’s --read-xml task consumes the xml stream and passes it to a --bounding-polygon task to clip the data using the US bounding polygon, then to a couple of --tag-filter tasks that throw out all relations and all ways except those tagged ‘bridge=*’ (there’s a negligible number of ways tagged bridge=no, but catching all the different ways of tagging a ‘true’ value here is more work than it’s worth, if you ask me), then to a --used-node task that throws out all nodes except those used by the ways we are keeping, and finally to a --write-pgsql task that writes all the objects to the PostGIS database. (Osmosis can be overwhelming at first with its plethora of tasks and arguments, but if you break it down it’s really quite straightforward. There’s also a graphical wrapper around osmosis called OSMembrane that may help to make this tool easier to understand and master.)
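Pieced together from that description, the one-off pipeline would look something like this (the file names and database credentials are placeholders):

```shell
# One-off extraction: decompress the planet, clip to the US polygon,
# keep only bridge ways and the nodes they use, write to PostGIS
bzcat planet-latest.osm.bz2 | osmosis \
  --read-xml file=- \
  --bounding-polygon file=us.poly \
  --tag-filter reject-relations \
  --tag-filter accept-ways bridge=* \
  --used-node \
  --write-pgsql database=bridges user=osm password=osm
```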

But for me, it didn’t end there.

OpenStreetMap data is continuously updated by more than a half million contributors around the world. People are adding, removing and changing features in OpenStreetMap around the clock. And those changes go straight into the live database. There’s no release schedule, no quality assurance. Every time one of those half a million people clicks ‘save’ in one of the OpenStreetMap editors there is, for all intents and purposes, a new OpenStreetMap version. That means the bridges database I just built is already obsolete even before the import is complete. For my yet-to-be-disclosed purpose, that would not be acceptable. So let me specify my goal a little more precisely:

I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database that stays as up-to-date as possible, reflecting all the latest changes.

There is not one single ready-made solution for this, it turned out, so let me describe how I ended up doing it. It may not be the most common OpenStreetMap data processing use case out there, but the workflow should be useful for thematic overlay maps, for example, if nothing else – even though the final step of importing into a geospatial database may need some tweaking.

After some less successful attempts I settled on the following workflow:

osmupdate – a tool to automate patching local OpenStreetMap data files, including downloading the change files from the server.

The trio osmconvert / osmfilter / osmupdate together can do most of the things osmosis can do, but does them a heck of a lot faster, and is more flexible in a few key aspects that we will see soon.

Let’s go through the numbered steps in the diagram one by one, explaining how each step is executed and how it works.

1. Planet file – The complete, raw OpenStreetMap data. A fresh one is made available every week on the main OpenStreetMap server, but your area of interest may be covered at one of the many mirrors, which can save you some download time and bandwidth. There is no planet mirror for the entire US, so I started with the global planet file. If you have a planet file that matches your area of interest, you can skip step 3 (but not the next step).

2. Bounding polygon – regardless of whether you find an initial planet file that matches your area of interest nicely, you will need a bounding polygon in OSM POLY format for the incremental updates. You’ll find ready-made POLY files in several places, including the OpenStreetMap SVN tree and GeoFabrik (read the README though), or you can create them yourself from any file that OGR can read using ogr2poly.

3. Filter area of interest – to save on disk space, memory usage and processing time, we’re going to work only with the data that is inside our area of interest. There are quite a few ways to create geographical extracts from a planet file, but we’re going to use osmconvert for two reasons: a) it’s fast! (osmosis takes about 4 hours and 45 minutes to do this, osmconvert takes 2 hours. This is on an AMD Phenom II X4 965 machine with 16GB RAM) b) it outputs the o5m format for which the next tool in the chain, osmfilter, is optimized.

bzcat planet-latest.osm.bz2 | ./osmconvert - -B=us.poly -o=us.o5m

4. Filter features of interest – The next step is creating a file that holds only the features that we are interested in. We could have done this together with the previous step in one go, but as the diagram shows, we will need the output file of step 3 (the US planet file) for the incremental update process. Here, osmfilter comes into play.

osmfilter works in much the same way as the osmosis --tag-filter task. It accepts arguments to drop specific feature types, or to keep features that have specific tags. In this case, we want to drop everything (--keep=) but the ways that have the key ‘bridge’ (--keep-ways="bridge="). We have osmfilter output the result in the efficient o5m format. (o5m lies in between the OSM xml and pbf formats in terms of file size, and was designed as a compromise between the two. One of the design goals for the o5m format was the ability to merge two files really fast, something we will be relying on in this process.)
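Put together, the filtering step would look roughly like this (output file name is my own choice):

```shell
# Drop everything except ways carrying a 'bridge' key, writing o5m output
./osmfilter us.o5m --keep= --keep-ways="bridge=" -o=bridges.o5m
```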

5. Convert to pbf – The trio osmconvert / osmfilter / osmupdate is designed to handle file-based data and has no interface for PostGIS, so we need to fall back on osmosis for this step. As osmosis cannot read o5m files, we need to convert to pbf first:
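Reconstructed from the description that follows, the conversion would be something like (file names are placeholders consistent with the earlier steps):

```shell
# Convert o5m to pbf, dropping ways that reference nodes outside the extract;
# the global bounding box works around a --drop-broken-refs quirk
./osmconvert bridges.o5m -b=-180,-90,180,90 --drop-broken-refs -o=bridges.osm.pbf
```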

Wait a minute. A lot more happened there than just a format conversion. Let’s take a step back. Because we’re working with a geographical extract of the planet file, we need to be concerned about referential integrity. Way objects in OpenStreetMap don’t have an inherent geometry attached to them, so any process looking to filter ways based on a bounding box or polygon needs to go back to the referenced nodes and see if they are within the bounds. It then needs to decide what to do with ways that are only partly within the bounds: either cut them at the bounds, dropping all nodes that lie outside (‘hard cut’), include the entire way (‘soft cut’), or drop the entire way. As the --drop-broken-refs argument name suggests, we are doing the last of these here. This means that data is potentially lost near the bounds, which is not what we actually want. We need to do it this way though, because the planet update (step 7) cannot take referential integrity into account without resorting to additional (expensive) API calls. (Consider this case: a way is entirely outside the bounds at t0. Between updates, one of its nodes is moved inside the bounds, so the way should now be included in the extract. But the old file does not contain the rest of the nodes comprising that way, nor are they in the delta files used in the update process – so the full geometry of the new way cannot be known.)

One way to compensate for the data loss is by buffering the bounding polygon. That would yield false positives, but that may be acceptable. It is how I solved this. What’s best for your case depends on your scenario.

The -b=-180,-90,180,90 option defining a global bounding box seems superfluous, but it is actually necessary to circumvent a bug in the --drop-broken-refs task that would otherwise leave only nodes in the data.

6. Initial database import – This is a straightforward step that can be done with a simple osmosis command:
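The import command would look something like this (database name and credentials are placeholders):

```shell
# Read the pbf extract and write it directly into the PostGIS database
osmosis --rb bridges.osm.pbf --wp database=bridges user=osm password=osm
```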

This reads the pbf file we just created (--rb) and writes it directly to the database ‘bridges’ using the credentials provided (--wp). If you want way geometries, be sure to load the linestring schema extension in addition to the snapshot schema when creating the database:
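Something along these lines; the exact script paths depend on where your osmosis distribution lives, and on newer PostgreSQL/PostGIS you can use CREATE EXTENSION as shown:

```shell
# Create the database and load the pgsnapshot schema plus the
# linestring extension shipped in the osmosis script/ directory
createdb bridges
psql -d bridges -c "CREATE EXTENSION postgis; CREATE EXTENSION hstore;"
psql -d bridges -f script/pgsnapshot_schema_0.6.sql
psql -d bridges -f script/pgsnapshot_schema_0.6_linestring.sql
```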

osmosis will detect this on import, there is no need to tell osmosis to create the line geometries.

Note that we are using the direct write task (--wp), which is fine for smaller datasets. If your dataset is much larger, you’re going to see real performance benefits from using the dump task (--wpd) and loading the dumps into the database using the load script provided with osmosis.

Now that we have the initial import done, we can start the incremental updates. This is where the real fun is!

7. Updating the planet file – This is where osmupdate really excels in flexibility over osmosis. I had not used this tool before and was amazed by how it Just Works. What osmupdate does is look at the input file for the timestamp, intelligently grab all the daily, hourly and minutely diff files from the OpenStreetMap server, and apply them to generate an up-to-date output file. It relies on the osmconvert program that we used before to do the actual patching of the data files, so osmconvert needs to be in your path for it to function. You can pass osmconvert options in, which allows us to apply the bounding polygon in one go:

./osmupdate us.o5m us-new.o5m -B=us.poly

8. Filter features of interest for the updated planet file – This is a repetition of step 4, but applied to the updated planet file:
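Assuming the same file naming as before, that repetition would be:

```shell
# Same filter as step 4, now applied to the updated US planet file
./osmfilter us-new.o5m --keep= --keep-ways="bridge=" -o=bridges-new.o5m
```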

9. Derive a diff file – We now have our original bridges data file, derived from the planet we downloaded, and the new bridges file derived from the updated planet file. What we need next is a diff file we can apply to our database. This file should be in the OSM Change file format, the same format used to publish the planet diffs that osmupdate consumed in step 7. This is another task at which osmconvert excels: it can derive a change file from two o5m input files really fast:
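Based on the arguments discussed below, the command would be roughly (file names are placeholders consistent with the earlier steps):

```shell
# Derive an OSM Change file from the old and new bridges extracts
./osmconvert bridges.o5m bridges-new.o5m --diff --fake-lonlat -o=changes.osc
```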

Again, there’s a little more going on than just deriving a diff file, isn’t there? What is that --fake-lonlat argument? As it turns out, osmconvert creates osc files that don’t have coordinate attributes for nodes that are to be deleted. That makes sense: you really only need a node ID to know which node to delete; there is no need to repeat the other attributes of the node. But some processing software, including osmosis, requires these attributes to be present, even if the node is in a <delete> block.

10. Update the database – With the osc file defining all the changes since the initial import available, we can instruct osmosis to update the database:
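Using osmosis’s change tasks, that would be (credentials are placeholders):

```shell
# Read the OSM Change file and apply it to the PostGIS database
osmosis --rxc changes.osc --wpc database=bridges user=osm password=osm
```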

...And we’re done. Almost. To keep the database up-to-date, we need to automate steps 7 through 10, and add some logic to move and delete a few files to create a consistent initial state for the next replication run. I ended up creating a shell script for this and adding a crontab entry to have it run every three hours. This interval seemed like a good trade-off between server load and data freshness. The incremental update script takes about 11 minutes to complete: about 6 minutes for updating the US planet file, 4 minutes for filtering the bridges, and less than a minute to derive the changes, patch the database and clean up. Here’s some log output from the script, which, by the way, I’d be happy to share with anyone interested in using or improving it:
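The skeleton of such a script might look like this (file names and credentials carried over from the steps above; the move/delete logic at the end is my guess at what “a consistent initial state” entails):

```shell
#!/bin/bash
# Incremental update cycle for the bridges database (steps 7-10)
set -e

./osmupdate us.o5m us-new.o5m -B=us.poly                 # 7. update US planet
./osmfilter us-new.o5m --keep= --keep-ways="bridge=" \
  -o=bridges-new.o5m                                     # 8. filter bridges
./osmconvert bridges.o5m bridges-new.o5m --diff \
  --fake-lonlat -o=changes.osc                           # 9. derive diff
osmosis --rxc changes.osc \
  --wpc database=bridges user=osm password=osm           # 10. patch database

# Promote the new files to be the baseline for the next run, clean up
mv us-new.o5m us.o5m
mv bridges-new.o5m bridges.o5m
rm changes.osc
```

Run every three hours from cron with an entry like `0 */3 * * * /path/to/update-bridges.sh`.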

I’ll spend another blog post on my purpose of having this self-updating bridges database sometime soon. It has something to do with comparing and conflating bridges between OpenStreetMap and the National Bridge Inventory. The truth is I am not quite sure how that should be done just yet. I already did some preliminary work on conflation queries in PostGIS and that looks quite promising, but not promising enough (by far) to automate the process of importing NBI data into OSM. Given that NBI is a point database, and bridges in OSM are typically linear features, this would be hard to do anyway.

I’d like to thank Markus Weber, the principal author of osmupdate / osmconvert / osmfilter, for his kind and patient help with refining the process, and for creating a great tool set!

Looks can be deceiving – we all know that. Did you know it also applies to maps? To OpenStreetMap? Let me give you an example.

Head over to osm.org and zoom in to an area outside the major metros in the United States. What you’re likely to see is an OK looking map. It may not be the most beautiful thing you’ve ever seen, but the basics are there: place names, the roads, railroads, lakes, rivers and streams, maybe some land use. Pretty good for a crowdsourced map!

What you’re actually likely looking at is a bunch of data that is imported – not crowdsourced – from a variety of sources ranging from the National Hydrography Dataset to TIGER. This data is at best a few years old and, in the case of TIGER, a topological mess with sometimes very little bearing on the actual ground truth.

The horrible alignment of TIGER ways, shown on top of an aerial imagery base layer. Click on the image for an animation of how this particular case was fixed in OSM. Image from the OSM Wiki.

For most users of OpenStreetMap (not the contributors), the only thing they will ever see is the rendered map. Even for those who are going to use the raw data, the first thing they’ll look at to get a sense of the quality is the rendered map on osm.org. The only thing that the rendered map really tells you about the data quality, however, is that it has good national coverage for the road network, hydrography and a handful of other feature classes.

To get a better idea of the data quality that underlies the rendered map, we have to look at the data itself. I have done this before in some detail for selected metropolitan areas, but not yet on a national level. This post marks the beginning of that endeavour.

I purposefully kept the first iteration of analyses simple, focusing on the quality of the road network, using the TIGER import as a baseline. I did opt for a fine geographical granularity, choosing counties (and equivalent) as the geographical unit. I designed the following analysis metrics:

Number of users involved in editing OSM ways – this metric tells us something about the amount of peer validation. If more people are involved in the local road network, there is a better chance that contributors are checking each other’s work. Note that this metric covers all linear features found, not only actual roads.

Average version increase over the TIGER imported roads – this metric provides insight into the amount of work done on improving TIGER roads. A value close to zero means that very little TIGER improvements were done for the study area, which means that all the alignment and topology problems are likely mostly still there.

Percentage of TIGER roads – this says something about contributor activity entering new roads (and paths). A lower value means more new roads added after the TIGER import. This is a sign that more committed mappers have been active in the area — entering new roads arguably requires more effort and knowledge than editing existing TIGER roads. A lower value here does not necessarily mean that the TIGER-imported road network has been supplemented with things like bike and footpaths – it can also be caused by mappers replacing TIGER roads with new features, for example as part of a remapping effort. That will typically not be a significant proportion, though.

Percentage of untouched TIGER roads – together with the average version increase, this metric shows us the effort that has gone into improving the TIGER import. A high percentage here means lots of untouched, original TIGER roads, which is almost always a bad thing.

Analysis Results

Below are map visualizations of the analysis results for these four metrics, on both the US State and County levels. I used the State and County (and equivalent) borders from the TIGER 2010 dataset for defining the study areas. These files contain 52 state features and 3221 county (and equivalent) features. Hawaii is not on the map, but the analysis was run on all 52 areas (the 50 states plus DC and Puerto Rico – although the planet file I used did not contain Puerto Rico data, so technically there’s valid results for 51 study areas on the state level).

I will let the maps mostly speak for themselves. Below the results visualisations, I will discuss ideas for further work building on this, as well as some technical background.

Further work

This initial stats run for the US motivates me to do more with the technical framework I built for it. With that in place, other metrics are relatively straightforward to add to the mix. I would love to hear your ideas, here are some of my own.

Breakdown by road type – It would be interesting to break the analysis down by way type: highways / interstates, primary roads, other roads. The latter category accounts for the majority of the road features, but does not necessarily see the most intensive maintenance by mappers. A breakdown of the analysis will shed some light on this.

Full history – For this analysis, I used a snapshot Planet file from February 2, 2012. A snapshot planet does not contain any historical information about the features – only the current feature version is represented. In a next iteration of this analysis, I would like to use the full history planets that have been available for a while now. Using full history enables me to see how many users have been involved in creating and maintaining ways through time, and how many of them have been active in the last month / year. It also offers an opportunity to identify periods in time when the local community has been particularly active.

Relate users to population / land area – The absolute number of users who contributed to OSM in an area is only mildly instructive. It’d be more interesting if that number were related to the population of that area, or to the land area. Or a combination. We might just find out how many mappers it takes to ‘cover’ an area (i.e. get and keep the other metrics above certain thresholds).

Routing specific metrics – One of the most promising applications of OSM data, and one of the most interesting commercially, is routing. Analyzing the quality of the road network is an essential part of assessing the ‘cost’ of using OpenStreetMap in lieu of other road network data that costs real money. A shallow analysis like I’ve done here is not going to cut it for that purpose though. We will need to know about topological consistency, correct and complete mapping of turn restrictions, grade separations, lanes, traffic lights, and other salient features. There is only so much of that we can do without resorting to comparative analysis, but we can at least devise some quantitative metrics for some.

Technical Background

I used the State and County (and equivalent) borders from the TIGER 2010 dataset to determine the study areas.

I collated all the results using some python and bash hacking. I used the PostgreSQL COPY function to import the results into a PostgreSQL table.

Using a PostgreSQL view, I combined the analysis result data with the geometry tables (which I previously imported into Postgis using shp2pgsql).

I exported the views as shapefiles using ogr2ogr, which also offers the option of simplifying the geometries in the same step. That’s useful, because the non-generalized counties shapefile is 230MB and takes a long time to load in a GIS.
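The ogr2ogr invocation would look roughly like this (the connection string, view name and tolerance are hypothetical):

```shell
# Export a PostGIS view to a shapefile, simplifying geometries on the way
ogr2ogr -f "ESRI Shapefile" counties_stats.shp \
  PG:"dbname=osm_stats user=osm" \
  -sql "SELECT * FROM county_stats_view" \
  -simplify 0.01
```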

I created the visualizations in Quantum GIS, using its excellent styling tool. I mostly used a quantiles distribution (for equal-sized bins) for the classes, which I tweaked to get prettier class breaks.

I’m planning to do an informal session on this process (focusing on the osmjs / osmium bit) at the upcoming OpenStreetMap hack weekend in DC. I hope to see you there!

OpenStreetMap represents a lot of data. If you want to import the entire planet into a PostGIS database using osmosis, you need at least 300GB of hard disk space and, depending on how much you spent on fast processors and (more importantly) memory, a lot of patience. Chances are that you are interested in only a tiny part of the world, either to generate a map or do some data analysis. There’s several ways to get bite-sized chunks of the planet – take a look at the various planet mirrors or the cool new Extract-o-tron tool – but sometimes you may want something custom. For the data temperature analysis I did for State of the Map, I wanted city-sized extracts using a small buffer around the city border. If you want to do something similar – or are just interested in how to do basic geoprocessing on a vector file – this tutorial may be of interest to you. Instead of city borders, which I created myself from the excellent Zillow neighborhood boundary dataset, I will show you how to create a suitably generalized OSM POLY file (the de facto standard for describing polygon extracts used by various OSM tools) that is appropriate for extracting a country from the OSM planet with a nice buffer around it.

Let’s get to work.

Preparing Quantum GIS

We will need to add a plugin that allows us to export any polygon from your QGIS desktop as an OSM POLY file. We can get that OSM POLY export plugin for Quantum GIS here.

Unzip the downloaded file and copy the resulting folder into the Python plugins folder. On Windows, if you used the OSGeo installer, that might be

Getting Country Borders

Geoprocessing step 1: Query

Open the Layer Query dialog by either right-clicking on the layer name or selecting Query… from the Layer menu with the TM_WORLD_BORDERS-0.3 layer selected (active).

Type "ISO2" = 'US' in the SQL where clause field and run the query by clicking OK.

Geoprocessing step 2: Buffering

The next step is to create a new polygon representing a buffer around an existing polygon. Because we already queried for the polygon(s) we want to buffer, there’s no need to select anything in the map view. Just make sure the TM_WORLD_BORDERS-0.3 layer is active and select Vector > Geoprocessing Tools > Buffer(s):

Make sure the input vector layer is TM_WORLD_BORDERS-0.3. Because the query is still active, only the matching features will be processed, so we’re operating on a single country and not the entire world.

For Buffer distance, type 1. This is in map units. Because our source borders file is in EPSG:4326, this corresponds to 1 degree, which is about 69 miles (for east-west distances, that figure is only valid at the equator and decreases towards the poles). This is a nice size buffer for a country; you may want something larger or smaller depending on the size of the country and what you want to accomplish, so play around with the figure and compare results. Of course, if your map projection is not EPSG:4326, your map units may not be degrees and you should probably be entering much bigger values.

Select a path and filename for the output shapefile. Do not select ‘Dissolve buffer results’. The rest can be left at the default values. Push OK to run the buffer calculation. This can take a little while and the progress bar won’t move. Then you see:

Click Yes. Now we have a buffer polygon based on the US national border:

Geoprocessing step 3: Generalizing

We’re almost done, but the buffer we generated contains a lot of points, which will make the process of cutting a planet file slow. So we’re going to simplify the polygon some. This is also a QGIS built-in function.

Select Vector > Geometry tools > Simplify geometries:

Make sure your buffer layer is selected as the input. Set 0.1 (again, in map units) as the Simplify tolerance. This defines by how much the input features will be simplified: the higher the number, the more simplification.

Select a destination for the simplified buffer to be saved. Also select Add result to canvas. Click OK:

This dialog may not seem very promising, but it has worked. Also, I have sometimes gotten an error message after this process completes. Ignore these if you get them.

Geoprocessing step 4: resolving multipolygons

Now, if your simplified country border consists of multiple polygons (as is the case with the US) we have a slight problem. The POLY export plugin does not support multipolygons, so we need to break the multipolygon into single polygons. And even then, we will need to do some manual work if we want OSM .poly files for all the polygons. This is because the plugin relies on unique string attribute values to create different POLY files, and we do not have those because the polygons we are using are all split from the same multipolygon. So we need to either create a new attribute field and manually enter unique string values in it, or select and export the parts to POLY files one by one and rename the files before they get overwritten.

Finale: Export as POLY

I am going to be lazy here and assume I will only need the contiguous US, so I select the corresponding polygon. After that I invoke the plugin by selecting Plugins > Export OSM Poly > Export to OSM Poly(s):

The plugin will show a list of all the fields that have string values. Select ISO2 and click Yes. Next you will need to select a destination folder for your exported POLY files. Pick or create one and push OK.

This is it! Your POLY files are finished and ready to be used in Osmosis, osmchange and other tools that use it for data processing.

By the way: you can’t load POLY files into JOSM directly, but there’s a perl script to convert POLY files to OSM files that I used in order to visualize the result.