Sunday, October 24, 2010

Update: I have a second post with two more state maps up here. More to come!

Update 2: Here are two more posts with nationwide maps.

If you are at all familiar with OpenStreetMap, especially in the US, then you will know that most of the roads here were imported from the US Census Bureau's TIGER data. The import was done in 2007 using the data that was current at the time. It is fairly complete, but the accuracy leaves a lot to be desired in some areas. For example, look at this screenshot comparing the TIGER data to reality (USGS aerial imagery in this case). Some roads are off by over 200 meters. And there are even worse spots; this was just one I was able to find quickly.

Example of bad TIGER data

While the data was pretty complete when it was made, it is now several years old and missing many new housing developments and does not reflect instances where roads were moved. The good news is that, this being a census year, there will be new TIGER data available from the Census Bureau starting at the end of the year. It will be more complete and probably more accurate considering that GPS technology has made some advances over the last 10 years.

Well that's great but the question is how do we use the new data to improve OSM? In the software world this would be an ideal situation to use a diff utility to determine what had changed between two versions and apply patches to update the old version. But I am not aware of any geo-spatial diff utilities.

So how about just deleting the old data and re-importing the new stuff? This would obviously be a disaster in areas with active mappers. All the work they have put into correcting existing TIGER ways would be destroyed and replaced with data that is probably better than the old TIGER import but also probably worse than what local mappers have done.

BUT...

I'm in Kansas. There aren't a lot of mappers here, especially in the western part of the state. Mostly because there just aren't a lot of people out there. Some counties have a population density of 6 people per square mile. Oh and the TIGER data was imported in county sized chunks. So how about handling this on a county-by-county basis? Counties that haven't been touched could just be blown away and refreshed with the new data. If only someone could determine how much TIGER data has changed since the initial import, on a per-county basis... Oh wait, I did!

I am taking a cartography class at Kansas State University this semester. After starting to contribute to OSM I decided to take advantage of the employee tuition assistance and learn more about the subject. For $150/semester, why not? 20% of the grade in the class is for a project where the instructor turned us loose to go get our own data and make some kind of interesting map. So I got my data by importing a Kansas extract from the OSM planet file and made this map:

A couple of notes about the map:

I let ArcMap classify the data using natural breaks. There are 105 counties in Kansas. 104 of them still have 78% of their TIGER data in its original state. One has had 75% of it modified. Yes, this is the county I live in. Yes, most of it was done by me. Yes, this inflates my ego.

The total count doesn't really do much for the map in my opinion but the class project called for a bivariate map, so a bivariate map I made! Obviously counties with bigger cities have more TIGER ways. This is your dose of non-surprise for the day.

My method of detecting changes by "local" mappers is by no means fool-proof. Basically I looked at the user who last changed a given way. If the user was one of 4 users I identified that were definitely not local mappers, I counted that road as unchanged. (technical details below)

Labels would add a lot of clutter to the map. If you want to see which counties are which, I suggest the Counties of Kansas wikipedia page.
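The change-detection rule from the notes above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code I actually ran: the four user IDs come from the query in the Technical Details section, while the function name, the input shape (pairs of county and last-editor ID) and the sample rows are assumptions made up for the example.

```python
from collections import Counter

# User IDs treated as "not a local mapper" (bots and import accounts).
# These four values are the ones used in the SQL query in Technical Details.
NON_LOCAL_USER_IDS = {147510, 7168, 20587, 293105}


def tally_by_county(ways):
    """Count untouched vs. locally edited ways per county.

    `ways` is an iterable of (county, last_user_id) pairs. That input
    shape is an assumption for this sketch.
    """
    untouched = Counter()
    edited = Counter()
    for county, last_user_id in ways:
        if last_user_id in NON_LOCAL_USER_IDS:
            # Last touched by a bot or import account: count as unchanged.
            untouched[county] += 1
        else:
            # Last touched by anyone else: count as a local edit.
            edited[county] += 1
    return untouched, edited


# Made-up sample rows, not real OSM data. 999999 stands in for a
# hypothetical local mapper.
sample = [
    ("Riley, KS", 7168),
    ("Riley, KS", 999999),
    ("Riley, KS", 999999),
    ("Greeley, KS", 147510),
]
untouched, edited = tally_by_county(sample)
```

Because only the last editor is checked, a way that a local mapper fixed and a bot later touched would still be counted as unchanged, which is one reason the method is not fool-proof.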

So I guess the question is how do we use this information? This will have to be a discussion amongst the OSM-US community. Due to time limitations I only did Kansas but this could certainly be done for other states (particularly the sparsely populated ones) to help local mappers decide what, if anything, to do with the new TIGER data. My suggestion would be to do a fresh import of any counties that have above a certain threshold of unchanged data. Say 95%? In Kansas that would be 59 counties. If a threshold were decided upon the map classifications could be altered to reflect that number.
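To make the threshold idea concrete, here is a rough Python sketch of how counties could be picked out once the per-county counts exist. It assumes the counts look like the bot_count and total_count columns from the query in the Technical Details section. The function name, the county names and the numbers are invented for illustration, and treating "above the threshold" as "greater than or equal to" is my own choice here.

```python
def counties_to_reimport(counts, threshold=0.95):
    """Return counties whose unchanged share meets the threshold.

    `counts` maps county name -> (bot_count, total_count), where bot_count
    is the number of ways last touched by a bot or import account.
    """
    selected = []
    for county, (bot_count, total_count) in sorted(counts.items()):
        if total_count == 0:
            # No TIGER ways at all: nothing to compare, so skip it.
            continue
        if bot_count / total_count >= threshold:
            selected.append(county)
    return selected


# Invented numbers, not results from the Kansas analysis.
sample_counts = {
    "County A": (980, 1000),  # 98% unchanged
    "County B": (250, 1000),  # 25% unchanged
    "County C": (960, 1000),  # 96% unchanged
}
candidates = counties_to_reimport(sample_counts)
print(candidates)
```

Changing the `threshold` argument is all it would take to see how many counties fall in or out at 90% or 99% instead.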

For counties that HAVE had local activity, perhaps some process could be set up using tiles from the TIGER edited map that MapQuest has provided as a background layer in JOSM or P2. Then pull up the new TIGER data on top of it and compare. Import missing roads or ones that are more accurate in the new data and haven't been touched by local mappers. That is just a thought that popped into my head as I was writing this.

Technical Details

I imported the October 20th planet file into an apidb using osmosis. I originally tried to do the whole planet and intended to do a larger analysis of bot activity on a worldwide basis but after 4 weeks and 350 GB it showed no signs of being anywhere close to finished so I fell back to only importing Kansas. Luckily Kansas is pretty much a big rectangle so I just used a simple bounding box. This finished in about 2 hours.

Here is the final SQL query I came up with to get me all the data I needed in one result set. Keep in mind that I only imported a bounding box around Kansas:

select v as county,
       sum(CASE WHEN user_id in (147510,7168,20587,293105)
                THEN 1 ELSE 0 END) as bot_count,
       sum(CASE WHEN user_id in (147510,7168,20587,293105)
                THEN 0 ELSE 1 END) as user_count,
       count(*) as total_count
from ways, way_tags, changesets
where ways.id = way_tags.id
  and ways.changeset_id = changesets.id
  and k = 'tiger:county'
  and v like '%KS%'
  and v not like '%;%'
group by v
order by v

In less SQLish terms: I am primarily looking at the tiger:county tag to group the query by county. It contains values like "Riley, KS" so basically I am looking for any way with a 'tiger:county=*KS*' tag. This excludes the few ways around the Kansas border that are from other states since my bounding box was just a little bigger than the state borders. However I exclude ways that have multiple values in the tag, separated by a semicolon. Typically this would come from two ways in adjacent counties being joined together. This is pretty rare so I ignored them. Once I find those ways I examine the user who last touched the way by joining through the changesets table. If the user ID is one of 4 values then I count the way as not having been modified by a local mapper. Those 4 user IDs belong to the following users:

147510 = woodpeck_fixbot (this is a bot that has performed various automated edits as documented on the OSM wiki)

7168 = DaveHansenTiger (this is the user who did the initial TIGER import)

293105 = NHD edits (NHD = National Hydrography Dataset. I'm assuming this user probably imported some rivers and ended up splitting some TIGER ways to make bridges or something)
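For anyone who does not read SQL, the two `like` filters on the tiger:county value boil down to the small Python check below. It is a sketch only: the function name and the sample values are made up for illustration, and it assumes the match is case-sensitive, which depends on the database being used.

```python
def is_single_kansas_county(value):
    """Mirror the query's filters: v like '%KS%' and v not like '%;%'."""
    # Keep values that mention KS and do not list several counties.
    return "KS" in value and ";" not in value


# Illustrative tag values, not pulled from the database.
examples = ["Riley, KS", "Riley, KS;Geary, KS", "Gage, NE"]
kept = [v for v in examples if is_single_kansas_county(v)]
```

Only the plain single-county Kansas value survives; the joined way and the out-of-state way are both dropped.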

Looking at the data now, I should have maybe also excluded NE2. I believe he has done a lot of work on national highway/interstate routes. I don't think there is any reason to re-import ways that are part of those systems since they have since been added to route relations and such. So those should probably be excluded from the data. Hindsight and all that.

I don't have a Colorado county shapefile on hand... although I guess I do have the nationwide one that I could filter out individual states from. So yes, I suppose I could. The longest part of the process is importing the data into the database.

As for contact, you have been in contact with me already! (I'm ToeBee on IRC) But yes, I have adjusted my profile a bit. I just set up this blog on Friday and am still tweaking things.

Welcome to the blogosphere, ToeBee! We like blogs about OpenStreetMap.

Don't get too hung up on waiting around for the government to supply OpenStreetMap with better data. I can understand that manual mapping tasks may be pretty uninspiring when you're looking across the vast area of Kansas, and it may seem logical to just wait for better TIGER data, but the way OpenStreetMap will build a truly great map is by building a community of active mappers.

For me the most exciting part of this blog post is the bit where you talk about 75% modified in the county you live in. That's the spirit!

But the work you've done on this is neat. In fact this kind of map can help motivate people to improve their area, if you frame it that way. Don't forget that rural areas have fewer people, but there's also fewer roads to fix up!

@Zeke Did you try pointing OSM Mapper at Vermont? You can make it show users in the area, and rank them by the amount they've contributed. Also try pointing OWL at Vermont: http://matt.dev.openstreetmap.org/owl_viewer/map?zoom=11&lat=44.6173&lon=-72.27167&layers=BF

Thanks for the encouragement. I am working on expanding this analysis. SteveC pointed me towards OpenHeatMap so I may try to do a nationwide analysis and post it there instead of doing individual state maps.

I may need some help in compiling a good list of known bots and import accounts that I can safely ignore. There probably aren't THAT many since I only care about changes that cause version bumps in TIGER ways.

@Tim: I don't think everyone knows what all the tiger:* tags are. I certainly didn't when I first started. The only reason I became aware of the function of the tiger:reviewed tag was when I noticed that JOSM was rendering existing ways differently from ways that I had added myself.

I think I am using a technique similar to the one the "TIGER edited map" from MapQuest uses, except I am grouping it into counties instead of displaying individual ways.

Yeah, I'm like Troy. Don't understand much of what is said here, but am an avid user of any GPS/Mapping system, so if we get more accurate data from high-power users like you, that's great! Thanks for the work.