Thursday, 23 July 2015

This material seems to be generating quite a bit of interest, and I'm starting to get questions asking about what it means in practice. We'll come to that, but first I'd like to consider a different aspect of data quality. So far most of the focus has been on coverage: what proportion of retail features have been added to the database.

Coverage of one common category of shop has not yet been considered, though. In around 1% of cases, the contributor's intent was clearly to indicate a shop that was not in use. The tags used include “shop=closed”, “shop=empty”, and, most commonly, “shop=vacant”.

High street vacancy rates across the UK are currently averaging around 10%. Out of around 50,000 vacant shops, we have data on just over 1,000 (2%). This is one of the lowest levels of coverage that we have identified. We can probably assume that contributors are most active in the most vibrant high streets (i.e. those with fewest vacancies), but this still suggests that vacant shops are badly under-recorded in OSM. It is difficult to say whether that means the missing vacant shops are completely un-recorded, or recorded in a way that is difficult to recognise. Either way they are not readily available to data users. However, that probably doesn't matter greatly. It's difficult to imagine many users who would value an application that can find the nearest vacant shop.

But data quality is not just about completeness. We must also question whether the recorded data accurately represents what is on the ground.

In my efforts to uncover as many retail premises as possible I've identified over 2,000 different tag combinations. Around 80 of those account for more than 95% of retail premises. The most common 26 account for 85%. Among the 2,000 are around 200 minor spelling mistakes. These represent 10% of the tagging variations, but a much smaller proportion of the data.

My estimate of the number of spelling mistakes is based on calculating the Levenshtein distance between different values of the shop tag. Where one tag differs from another by only one or two characters, my initial premise is that the less common variant is a spelling mistake for the more common alternative. However, this approach also picks up some correct values of the shop tag, which have to be eliminated manually from the sample (“shop=car” and “shop=card”, for example, differ by only one character but are not spelling variations of each other). The approach is bound to miss some more complex spelling mistakes, but hopefully not too many; I think it captures the great majority.
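The approach can be sketched in a few lines of Python. The tag values and frequencies below are made-up illustrations (not real OSM counts), and the distance threshold of 2 mirrors the "one or two differences" premise above:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical sample of shop-tag values with invented frequencies.
tag_counts = Counter({
    "supermarket": 5000, "supermarkt": 4,
    "hairdresser": 3000, "hairdressers": 40,
    "car": 800, "card": 120,
})

# Flag pairs within edit distance 2, treating the rarer value as a
# candidate misspelling of the commoner one. False positives such as
# car/card survive this step and still need manual elimination.
values = sorted(tag_counts, key=tag_counts.get, reverse=True)
candidates = []
for i, common in enumerate(values):
    for rare in values[i + 1:]:
        if levenshtein(common, rare) <= 2:
            candidates.append((rare, common))

print(candidates)
# → [('supermarkt', 'supermarket'), ('hairdressers', 'hairdresser'), ('card', 'car')]
```

Note that car/card is flagged alongside the genuine misspellings, which is exactly why the output is a shortlist for manual review rather than a list of confirmed errors.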

Variations in the use of plural and singular forms account for around 60% of these errors; differences in capitalisation for around 8%; and differences in hyphenation and underscores for around 6%. The remaining 25% of near matches are more diverse. Overall, this approach detected spelling mistakes in around 0.7% of shop tagging.
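The categories above can be assigned mechanically before falling back to manual review. This is an illustrative sketch (the function name, rules, and example pairs are my own, and the simple trailing-"s" test for plurals is deliberately crude):

```python
def classify_variant(rare: str, common: str) -> str:
    """Roughly classify a near-match pair into the categories in the text."""
    if rare.lower() == common.lower():
        return "capitalisation"
    if rare.replace("-", "_") == common.replace("-", "_"):
        return "hyphen/underscore"
    if rare.rstrip("s") == common.rstrip("s"):
        return "plural/singular"
    return "other"          # diverse remainder: genuine typos, odd values

# Invented example pairs, one per category.
pairs = [("Bakery", "bakery"), ("car_repair", "car-repair"),
         ("hairdressers", "hairdresser"), ("supermarkt", "supermarket")]
print([classify_variant(rare, common) for rare, common in pairs])
# → ['capitalisation', 'hyphen/underscore', 'plural/singular', 'other']
```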

Given the controversies over bulk editing, it may be worth noting that:

- the number of retail features in the database which contain a spelling mistake in the shop key is in the order of 1,000 (compared to 385,000 missing retail premises);
- around a third of the spelling mistakes in the shop tag are unique occurrences;
- many spelling mistakes are an unusual spelling of a value which is itself comparatively rare (or a non-standard use of the “shop” tag);
- data users are probably just going to ignore these: the volumes of lost data are too small to justify much effort on their part.

In other words, it looks as though very few of these cases are suitable for bulk editing: virtually all either need to be checked and fixed manually, or can be more easily fixed manually than with a bulk edit.

Examples of spelling mistakes which occur more than a couple of times include: