Geocoding with GeoPy and Nominatim

Geocoding refers to the conversion of addresses into coordinates and, vice versa, the conversion of coordinates into the corresponding address (reverse geocoding).

There are a number of freely available geocoding APIs that are suitable for smaller use cases, e.g. geocoding fewer than 10,000 points per day.

In most cases it is not necessary to call the APIs manually. geopy is an excellent Python library for (among other things) geocoding and reverse geocoding that supports many APIs. In this example we use the Nominatim API, which is based on OpenStreetMap (OSM) data. The OSM data is subject to the Open Database License (ODbL).

The Nominatim API does not require a full address consisting of street, house number, and city; it also recognizes many business names and points of interest. We use the API to create a list of the 114 largest football stadiums in Germany with their coordinates, which we present on a map in the next section.
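A minimal sketch of how such a lookup could be written with geopy is shown below. The helper names make_nominatim_geocoder and geocode_names, the user agent string, and the example query are assumptions made for illustration; they are not taken from the original analysis. The geopy import sits inside the helper so that the rest of the sketch can be tried without network access.

```python
def make_nominatim_geocoder(user_agent):
    """Return a rate-limited geocode function backed by Nominatim (hypothetical helper)."""
    # Imported here so the rest of the sketch works without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # The public Nominatim server asks for at most one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_names(names, geocode):
    """Map each name to (name, latitude, longitude); None where nothing was found."""
    rows = []
    for name in names:
        location = geocode(name)  # geopy returns None if there is no match
        if location is None:
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows


# Online usage (needs network access; the user agent and query are examples only):
# geocode = make_nominatim_geocoder("stadium-geocoding-example")
# print(geocode_names(["Signal Iduna Park, Dortmund"], geocode))
```

Passing the geocode function in as an argument keeps the loop independent of the chosen API, so another geopy geocoder could be swapped in without changing geocode_names.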

However, the column gps_height, which specifies the height of the measuring point, still contains missing values for about 1/3 of all measuring points.

df.gps_height.isnull().sum()
# 25637

We use a weighted k-nearest neighbors regression on the columns latitude and longitude to estimate the missing values. Note that the missing values for gps_height have a very regional distribution, which tends to have a negative effect on the result of our regression.
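To make the idea of a distance-weighted k-nearest neighbors estimate concrete, here is a simplified pure-Python sketch. It is not the implementation used for the results in this article; the function name knn_predict and the inverse-distance weighting scheme are assumptions chosen for illustration.

```python
import math


def knn_predict(train_points, train_values, query, k=15):
    """Distance-weighted k-nearest-neighbors estimate for one query point."""
    # Euclidean (L2) distance in raw degrees of latitude/longitude.
    distances = [math.dist(point, query) for point in train_points]
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]

    # An exact match would get infinite weight, so return its value directly.
    for i in nearest:
        if distances[i] == 0:
            return train_values[i]

    # Closer neighbors contribute more: weight = 1 / distance.
    weights = [1.0 / distances[i] for i in nearest]
    weighted_sum = sum(w * train_values[i] for w, i in zip(weights, nearest))
    return weighted_sum / sum(weights)
```

With train_points as (latitude, longitude) pairs of rows where gps_height is known, calling knn_predict for each row with a missing value yields an estimate that is dominated by the closest known measuring points.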

The algorithm calculates the distance between two points as the Euclidean (L2) distance in the dimensions longitude and latitude. However, we have not yet taken into account that

… one degree of longitude and one degree of latitude correspond to different actual distances (in km), and

… the conversion factor for longitude depends on the latitude.
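The two conversion factors can be computed directly. The sketch below assumes a spherical Earth with a mean radius of 6371 km; the function name km_per_degree is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def km_per_degree(latitude_deg):
    """Return (km per degree of latitude, km per degree of longitude) at a latitude."""
    # One degree of latitude covers the same distance everywhere on a sphere.
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180.0
    # Circles of constant latitude shrink with cos(latitude) toward the poles.
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(latitude_deg))
    return km_per_deg_lat, km_per_deg_lon
```

Under this approximation one degree of latitude is roughly 111 km everywhere, while one degree of longitude is roughly 111 km at the equator and only about half of that at 60 degrees latitude.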

This can mislead the algorithm when it selects the “nearest neighbors” and lead to worse results. The farther apart the points are, the greater the effect. In our example the variance in latitude is low, the points are quite close to each other, and our algorithm with k=15 is very robust against uncertainties in the distance. Nevertheless, we try to improve the regression by calculating the “real” distance and weighting by distance. This is particularly useful because of the regional distribution of points with a missing gps_height value mentioned above.
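One common way to obtain the “real” distance is the haversine formula for the great-circle distance. The following sketch assumes a spherical Earth with a mean radius of 6371 km and is not necessarily the exact distance function used for the results here; the name haversine_km is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

A function like this can replace the Euclidean distance in a nearest-neighbor search. If scikit-learn is used, its neighbor estimators also accept metric="haversine", which expects (latitude, longitude) pairs in radians and returns distances on the unit sphere.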

The test shows that for the “dense” points with a gps_height value present, it makes no significant difference whether the Euclidean distance on longitude and latitude or the actual distance in kilometres is used.

However, since the points whose gps_height value we want to estimate are sometimes very far from the points used to train the algorithm, we nevertheless use the actual distance to estimate the missing values.