Wikidata, the crowdsourced database of structured knowledge run by the Wikimedia movement, has grown to over 24 million entries and now has structured information for every major settlement on earth. This includes extremely useful properties such as multilingual labels, statistics like population and GDP, and related information about the politics, history and media of the place (see London, New York City, Timbuktu).

Geolocated articles on Wikidata, with those added in the last year highlighted in pink. Source: Wikidata Map

Current state of multilingual tags

One of the great strengths of OSM is that its data can be used to create multilingual maps, which make the map accessible to far more readers than just the local population. Since the beginning of the project, the community has been adding various name:code tags for this purpose, which has resulted in map features with an ever-growing list of multilingual names, e.g. the node for London has 171 properties, of which 155 (90%) are name tags in various languages.

A more scalable approach would be to leverage the Wikidata entry for London, which has the translated name in 248 languages, and growing automatically with every Wikipedia page of the city that is created in a new language.
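As a rough illustration of this idea, the sketch below turns the labels of a Wikidata entity into OSM-style name:code tags. It assumes the entity record has a `labels` mapping keyed by language code, with each entry carrying a `value` field, which is the shape Wikidata's JSON output uses. The function name `labels_to_name_tags` and the trimmed sample record are hypothetical and only serve the example.

```python
def labels_to_name_tags(entity, languages=None):
    """Convert a Wikidata entity's labels into OSM-style name:<code> tags.

    `entity` is assumed to be a dict with a "labels" mapping of
    language code -> {"language": ..., "value": ...}.
    If `languages` is given, only those language codes are kept.
    """
    tags = {}
    for code, label in entity.get("labels", {}).items():
        if languages is not None and code not in languages:
            continue
        tags[f"name:{code}"] = label["value"]
    return tags


# Hand-written, heavily trimmed sample in the shape of a Wikidata entity.
sample_entity = {
    "id": "Q84",  # Wikidata item for London
    "labels": {
        "en": {"language": "en", "value": "London"},
        "fr": {"language": "fr", "value": "Londres"},
        "de": {"language": "de", "value": "London"},
    },
}

name_tags = labels_to_name_tags(sample_entity, languages={"en", "fr"})
print(name_tags)
```

A map renderer could apply such a lookup at build time instead of relying on every name:code tag being present on the OSM feature itself.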

Matching Wikidata items to OSM

Just like OSM, Wikidata items of places have tags describing the feature and coordinates that make it possible to automatically match a feature on OSM to the corresponding feature on Wikidata. Unfortunately, the geographical accuracy of Wikidata entries cannot be trusted, as many of the coordinates are derived from Wikipedia pages, which in turn are usually derived from Google Maps. Moreover, entries for lesser-known places may not be tagged correctly on Wikidata, which can result in ambiguous matches to an OSM feature. For this reason, manual confirmation of a match is necessary.

At the Mapbox data team, we have been experimenting with adding Wikidata tags to cities and towns on OSM based on an exact name and location match. The possible matches were loaded into a spreadsheet along with the match distance and the Wikidata description of the corresponding item. During manual review, it's easy to confirm a match with a very high degree of confidence based on the name, distance and description. With this approach we have found that an exact name and location match alone can give a 99% success rate for places.
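The candidate list described above could be produced along these lines. This is only a sketch, not the script the team used: the record fields (`name`, `label`, `lat`, `lon`, `description`), the function names, the item ids and the sample coordinates are all assumptions made for illustration. Distance is computed with the haversine formula on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_matches(osm_place, wikidata_items):
    """Return exact-name candidates for one OSM place, nearest first.

    Each row carries what a reviewer needs: id, description, distance.
    """
    rows = []
    for item in wikidata_items:
        if item["label"] != osm_place["name"]:
            continue  # exact name match only
        distance = haversine_km(osm_place["lat"], osm_place["lon"], item["lat"], item["lon"])
        rows.append({
            "wikidata_id": item["id"],
            "description": item["description"],
            "distance_km": round(distance, 1),
        })
    return sorted(rows, key=lambda row: row["distance_km"])


# Illustrative records; ids and coordinates are placeholders, not real data.
osm_place = {"name": "London", "lat": 51.5074, "lon": -0.1278}
wikidata_items = [
    {"id": "Q_A", "label": "London", "description": "city in Canada", "lat": 42.98, "lon": -81.25},
    {"id": "Q_B", "label": "London", "description": "capital of England", "lat": 51.5072, "lon": -0.1276},
    {"id": "Q_C", "label": "Paris", "description": "capital of France", "lat": 48.8566, "lon": 2.3522},
]

rows = candidate_matches(osm_place, wikidata_items)
for row in rows:
    print(row)
```

Each printed row corresponds to one spreadsheet line that a reviewer would confirm or reject.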

There are two cases when the name matching happens:
- Unique matches: One OSM feature matches to one Wikidata feature
- Duplicate matches: One OSM feature matches to multiple Wikidata features with the same name

Unique matches

In most cases, the matched feature on Wikidata is located less than a few kilometers from the OSM feature, and by confirming from the description that the feature is also a city or town, it's possible to confirm that the match is correct. It is important to check the feature description carefully, as in some cases Wikidata has ambiguous entries that represent multiple concepts as one object, such as a city and a province with the same name.

For unique matches with a large match distance (>10 km), it is likely that the match was to another place with the same name and is incorrect. In a few rare cases, the match was actually correct and it was the Wikidata location that was wrong.

Duplicate matches

When an OSM feature matches to multiple Wikidata entries with the same name, it is considered a duplicate match. In most cases a distance filter of around 10km enables a unique match, and a further look at the description can confirm the match is correct.
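The triage for unique and duplicate matches can be sketched as a small function. It assumes each candidate row already has a `distance_km` field. The 10 km threshold comes from the text, while the status names and the function name `triage` are hypothetical. Every outcome still goes to a human reviewer; the status only says what to look at.

```python
MAX_MATCH_KM = 10.0  # distance filter mentioned in the text; tune as needed


def triage(candidates, max_km=MAX_MATCH_KM):
    """Sort same-name candidates for one OSM place into a review status.

    Returns (status, candidate_or_None):
      "confirm"   - exactly one candidate within max_km; check its description
      "review"    - candidates exist but none within max_km; probably another
                    place with the same name, rarely a bad Wikidata location
      "ambiguous" - several candidates within max_km; needs a closer look
      "none"      - no same-name candidates at all
    """
    if not candidates:
        return "none", None
    nearby = [c for c in candidates if c["distance_km"] <= max_km]
    if len(nearby) == 1:
        return "confirm", nearby[0]
    if not nearby:
        return "review", None
    return "ambiguous", None


# Illustrative rows; the ids are placeholders.
duplicate = [
    {"wikidata_id": "Q_A", "distance_km": 0.4},
    {"wikidata_id": "Q_B", "distance_km": 5870.2},
]
far_unique = [{"wikidata_id": "Q_C", "distance_km": 42.0}]

print(triage(duplicate))
print(triage(far_unique))
```

Keeping the far unique matches in a separate "review" bucket rather than discarding them preserves the rare cases where the Wikidata coordinates, not the match, are wrong.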

In a few rare cases multiple OSM features with the same name and location match to a single Wikidata feature. These are places with duplicate nodes on OSM itself and need to be merged.

What next?

Large-scale map features like countries, cities, towns and water bodies are great candidates to start matching with Wikidata, as they are fairly well defined on both projects and can be matched without ambiguity. Doing this will allow us to better understand the value that Wikidata can add to OSM, and help pave the way for more interesting map services that can be built on open data.

The tiny weeny issue with this is naturally the underlying assumption that Wikidata is correct and that the data meets our quality criteria (as in actually being in use and not invented).

Since the matching is based on the name, location and description on two databases being coherent, the chances of having invented data being added is really low, unless of course the same invented data made it to both the databases, and we found this did happen with the GNIS place data in the US. Check out this discussion https://www.openstreetmap.org/changeset/43187605

Still figuring out what the scale of the issue is, since it looks like nobody really reviewed if all these towns were tagged correctly on the map in the last 9 years.

I like the idea of using Wikidata to link different platforms of information, but I miss information about the API and development tools needed to properly make use of it. The problem is obviously not on the OpenStreetMap side, but rather on the Wikidata side. A use example: a tool that gets border relations from OSM and collects the names of the city/state/country from name:* tags could also call a Wikidata API for the same purpose. It would then be able to get the names from a broader selection and be less prone to missing names due to limited tagging, e.g. several Chinese provinces have no latinised name tags, though these might exist in Wikidata.

Great initiative! I see one danger though. Usually a city or village is also connected to a surrounding region. Both the city and the region can be mapped in OSM (and in Czechia they always are), having the same or a similar name. One needs to be careful to attach the Wikidata label to the right area then. If needed I can easily find an example.