Improve IP Geolocation Results

Description

There are a couple of problems with our current data set. They're not strictly related to each other, but I'm creating a single ticket because I suspect the solution to both will be the same: switching from our current data source to a better one.

Problem #2: No IPv6 support. Without it, the results will quickly lose effectiveness. As of May 2017, IPv6 adoption is at 18% globally and rising relatively fast. Some countries are as high as 32%.

It looks like our database was updated recently (March 2017), but I can't find any info on where the original data comes from. @dd32, @tellyworth, do either of you know? Do you have any thoughts on other potential sources?

Attachments (3)

Approach using Google Geocode to power user lookups instead of attempting to roll our own geocode service - Google does it better, and doesn't receive any personal user details with a server-side request

At the time I first encountered those files years ago, they were an outdated copy of the free version. I updated the data to the latest copy, but then we didn't use it for much, so I let it go. Looking at it now, it appears to have many more rows than the free version, so I suspect somebody updated it to the non-free version. If we can find out who updated it last, they may still have access and be able to get the big dataset.

1) The ip2location data on WP.org is very very old.
2) WP.org doesn't have a subscription/license to use the data that I am aware of.
3) I don't think there is any legitimate use case to pass an IP in the query string - this just opens up the API to abuse and most likely violates the terms/EULA of using this data.
4) The query performance is very poor and probably won't scale to the volume that we need. For ip2location specifically we have already solved this problem on WordPress.com, so I would suggest looking at the code there. I haven't looked at the geonames code much, but I suspect it suffers from similar issues.
5) We need a way to keep the information always current if we are going to use it. It changes enough on a regular basis that once it's > 60 days old it starts becoming noticeably wrong.
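The 60-day threshold in point #5 could be enforced with a simple freshness check in whatever update tooling gets built. A minimal sketch, with the function name and wiring being hypothetical:

```python
from datetime import date, timedelta

# Per the note above, geolocation data drifts noticeably once it is more
# than ~60 days old. A cron-style updater could use a check like this to
# decide when a refresh is overdue. Names here are illustrative only.
def dataset_is_stale(last_updated: date, today: date, max_age_days: int = 60) -> bool:
    return today - last_updated > timedelta(days=max_age_days)
```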

The first problem to solve is #2. I don't think we really need ipv6 support yet since the API should always use REMOTE_ADDR and the WordPress.org API currently doesn't have any AAAA records.

The IP is not the REMOTE_ADDR in most cases - it's the WordPress user's client IP, not the server IP. There's some "anonymisation" stuff in there - see maybe_anonymize_ip_address(). The exception is when the client IP provided is in a private range, in which case it uses REMOTE_ADDR.
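A rough sketch of that logic in Python (this is not the actual maybe_anonymize_ip_address() code; the function names and the last-octet truncation scheme are assumptions for illustration):

```python
import ipaddress

def resolve_client_ip(client_ip: str, remote_addr: str) -> str:
    """Use the supplied client IP unless it's invalid or in a private
    range, in which case fall back to REMOTE_ADDR, mirroring the
    exception described above."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return remote_addr
    return remote_addr if addr.is_private else client_ip

def anonymize_ipv4(ip: str) -> str:
    """One common anonymisation scheme (assumed here): zero the final octet."""
    octets = ip.split(".")
    octets[-1] = "0"
    return ".".join(octets)
```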

Would it be better if the client browser was making the request directly? (and not proxied through WordPress) - the detect-location call would need to be made on every request in that case, and not just on the first call / when user searches for a location change.

I see why it was implemented this way, but I think it's a clear violation of the EULA, and the endpoint will quickly be abused. I'm not too worried about the traffic once #4 above is fixed, but I guess there are some general performance and privacy concerns that will need to be addressed if the implementation is changed.

1) The ip2location data on WP.org is very very old.
2) WP.org doesn't have a subscription/license to use the data
3) I don't think there is any legitimate use case to pass an IP in the query string
4) The query performance is very poor and probably won't scale to the volume that we need
5) We need a way to keep the information always current if we are going to use it

#1, #2, and #5 are all easy to solve.

#3 (EULA)

On a technical level, it's easy to switch the client to request the data directly from api.w.org instead of proxying through their WP site.

For anyone who's concerned about privacy with that architecture, we can release a plugin that will disable the Events section of the Dashboard widget entirely. We could replace it with a simple link to the Events page on make/Community.

Does anyone think that would not be enough to mitigate privacy concerns?

#4 (performance)

I'm going to start digging into this and looking for ways to optimize. If you can say which queries you're most concerned about, that would be helpful.

Almost all of the queries are there to support extra functionality and edge cases. Worst case scenario, the queries in the following functions can all be disabled without preventing the majority of users from successfully retrieving events: guess_location_from_ip(), guess_location_from_geonames_fallback(), get_valid_country_codes(), get_country_from_name(), and get_city_from_coordinates()

That can be done at-will on the API side without breaking the client in Core. We could even do it automatically during traffic spikes if we have some kind of signal from monitoring agents to check.
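For illustration, that kind of at-will kill switch could look something like this (the tables, function name, and flag are hypothetical stand-ins, not the actual API code):

```python
# Cheap, indexed lookup table vs. the expensive edge-case fallback path.
# Both are tiny stand-ins for the real database queries.
CITY_TABLE = {"seattle": (47.61, -122.33)}
FALLBACK_TABLE = {"the big apple": (40.71, -74.01)}

def guess_location(search_term: str, fallbacks_enabled: bool = True):
    """Try the fast path first; only hit the slow fallback queries when
    allowed. Monitoring could flip fallbacks_enabled off during traffic
    spikes without breaking the Core client."""
    key = search_term.lower()
    location = CITY_TABLE.get(key)
    if location is None and fallbacks_enabled:
        location = FALLBACK_TABLE.get(key)
    return location
```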

I have been thinking about the EULA a bit. I think that if we remove the actual passed-IP geolocation data from the response and just return the events data, it will be fine. We would then use the client IP geolocation internally, in "business logic", to figure out the closest events and return them. We still need to work on the other issues, but they are technical and thus easier to fix.

That sounds like a good idea to me. I think the detailed steps look like this:

1) The initial request. The client doesn't have a location saved at this point.

- The client requests events near {ip address}.
- The API returns the events object like it does today, but no location object.
- The client parses the city name from the first event (e.g. "Seattle") and saves that as the location.

2) The second request. The client has a city name saved at this point.

- The client requests events near {city name}, just like it does today when a user manually types in a location.
- The API matches the city name to lat/long coordinates using the Creative Commons data from geonames.org.
- The API returns the events object, and a location object, just like it does today.
- The client saves the full location object as its location.

3) All future requests. The client has a saved lat/long at this point.

- These would be exactly the same as they are today.

So, the client would have to make 2 requests instead of 1, but it would eventually get the same behavior as we have today. The client could do step #2 asynchronously after the events received in step #1 have been displayed, in order to make things a little smoother.

All of the above only applies if the user hasn't manually entered a location. If they enter a location manually, then we skip step #1, and go straight to #2.
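The flow above could be sketched as a small state machine on the client side. The request/response shapes below are assumptions for illustration, not the actual api.wordpress.org contract:

```python
def next_request(saved_location, ip, request_events):
    """Advance one step of the flow. saved_location is None on the first
    request, a bare {"description": city} after step 1, and a full
    location object (with coordinates) from step 2 onward.
    request_events is a callable standing in for the HTTP round trip."""
    if saved_location is None:
        # Step 1: locate by IP; the response has events but no location object.
        events = request_events(ip=ip)["events"]
        state = {"description": events[0]["city"]} if events else None
        return events, state
    if "latitude" not in saved_location:
        # Step 2: look up by city name; the response now includes a full
        # location object, which the client saves for all later requests.
        resp = request_events(location=saved_location["description"])
        return resp["events"], resp["location"]
    # Step 3: later requests work exactly as today, using saved coordinates.
    resp = request_events(latitude=saved_location["latitude"],
                          longitude=saved_location["longitude"])
    return resp["events"], saved_location
```

Step 2 could run asynchronously after the step 1 events are displayed, as described above.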

@barry, does that sound right to you? If so, I'll get started on the necessary changes.

To clarify, are you thinking we'd remove the IP from the client's request and just use REMOTE_ADDR on the API side, or would the client still be able to specify an arbitrary IP?

Clients can specify an arbitrary IP. I was initially concerned about the privacy implications of the clients connecting directly to the .org API which would provide their full IP and not the truncated version that the current proxy approach uses. I thought this would be the only place that a .org site sends the client IP to WordPress.org (as opposed to the server IP which is sent in many places). However, I then remembered that we use the W.org CDN for serving emoji in core, so I am not sure if it's a huge problem.

I'm ok with any approach that satisfies the "business requirements" of this widget and doesn't violate the EULA for the data source(s) we use. The rest of the stuff like performance and scaling we can figure out pretty easily.

Oh, and I forgot: if we send arbitrary IPs, we might need IPv6 support, which I haven't looked at yet. If the privacy concerns aren't an issue, having the client connect directly to the API endpoint might be the cleaner implementation.

@dd32 has a really good idea to modify the process a bit. I think it will make things simpler on the client side, while still staying within the EULA.

When the client makes a request for events and provides the IP, we geolocate that IP from ip2location, and fetch the events that are close to it. We do not return the location data provided by ip2location to the client.

This way, we can still return a location to the client in the initial request, but we're not directly exposing the location details from ip2location at all. The previous approach was essentially the same thing, but the response.location lookup was being done in a subsequent request (see comment:17), rather than all at once.
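A minimal sketch of that response shape (the dataset and event list are tiny stand-ins, and the distance check is deliberately naive):

```python
# The licensed ip2location data is consulted internally but never echoed
# back to the client; only the matching events (or an error) are returned.
IP_COORDS = {"8.8.8.8": (37.40, -122.10)}  # stand-in for the ip2location table
EVENTS = [{"title": "WordCamp", "city": "Mountain View",
           "lat": 37.39, "lon": -122.08}]

def events_near_ip(ip: str, max_dist_sq: float = 1.0) -> dict:
    coords = IP_COORDS.get(ip)
    if coords is None:
        # Consistent error when the IP can't be located.
        return {"error": "no_location_available"}
    lat, lon = coords
    nearby = [e for e in EVENTS
              if (e["lat"] - lat) ** 2 + (e["lon"] - lon) ** 2 <= max_dist_sq]
    # Note: no "location" key derived from ip2location in the response.
    return {"events": nearby}
```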

r5491 introduced a new condition where the client might receive an unsuccessful location lookup, but it wasn't consistent with prior behavior. This commit applies the prior behavior to the new case, so that clients can expect a consistent no_location_available error if the location wasn't found.

2823.diff takes a pretty aggressive approach to some of the code in an attempt to improve the performance of location searches.

get_city_from_coordinates() is gone - it's just not performant enough, and I couldn't find an easy way to achieve decent performance for such an edge case

get_valid_country_codes() wasn't actually being used (though it was still querying)

guess_location_from_geonames_fallback(), and as a result get_ideographic_counties(), is gone - at first this seems like a step backwards; however, I think what it was attempting to achieve was being done the wrong way.

REPLACE( alternatenames, CONCAT( asciiname, ' - ' ), '' ) LIKE wasn't performant, and there wasn't much chance of it ever becoming so

using an alternatenames LIKE '%,Blah,%' query was working, but wasn't really fast

In order to speed things up, I created a new table based on the geonames data, geoname_summary, which expands (name, asciiname, alternatenames) out into a single name column and also adds variants (for example, the city name without dashes).
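For illustration, the expansion step might look something like this (a hypothetical sketch of the table-build logic, not the actual code):

```python
def expand_row(geoname_id: int, name: str, asciiname: str, alternatenames: str):
    """Flatten one geonames row into one (id, name) row per name variant,
    so a simple indexed equality match can replace the slow
    LIKE '%,Blah,%' scans. Also adds dash-less variants, e.g.
    "Wagga-Wagga" -> "Wagga Wagga"."""
    variants = {name, asciiname}
    variants.update(alt for alt in alternatenames.split(",") if alt)
    variants.update(v.replace("-", " ") for v in list(variants) if "-" in v)
    return [(geoname_id, v) for v in sorted(variants)]
```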

One major change is that if you type in, for example, The Big Apple, the location returned will now be The Big Apple instead of New York City; similarly, if you search for Wien you'll get Wien, not Vienna.
Unfortunately, typos will stay typos (for example, New Yorke or Lunden).

This might also make it easier to return multiple results in the future (Springfield, MO vs Springfield, MA for example, or allowing an en_AU locale to pull up Wellington, NZ instead of Wellington, AU).

All the test cases that passed before are still passing, albeit with the new return data; but instead of ~100 queries taking 4-6 seconds, it now takes 0.04s with slightly fewer queries.

The function was not performant, and it's not easy to make it that way. It's only needed for a few edge cases, and we can get the location from the events themselves (similar to r5491). So, removing this is a good trade-off.

Because we are able to rebuild the location, no changes are needed in Core, and a new version of this endpoint is not needed.

Firstly, I am nowhere near Romford (that's pretty far from where I am).

Secondly, as someone from the UK, seeing a US city in the box is a weird experience. Also, why only cities? In the UK, for example, towns have meetups, and a lot of people don't live in cities... I'm sure this happens in the US too :)

Secondly, as someone from the UK, seeing a US city in the box is a weird experience.

That string is internationalized, so you'll see a local city as soon as it's translated.

Also, why only cities?

The database we have is primarily city-centric. It has some region information, but it's not detailed or comprehensive enough to fully support region-based searches. The API does try to match inputs other than cities, but we only explicitly mention cities in the instructions because that's the only thing we can guarantee will work.

The database we have is primarily city-centric. It has some region information, but it's not detailed or comprehensive enough to fully support region-based searches. The API does try to match inputs other than cities, but we only explicitly mention cities in the instructions because that's the only thing we can guarantee will work.

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes towns with a population of >1k (or an administrative area of 1k, so a town X of 500 with 500 in the surrounding area, where X is the common point, would be included). For countrycode = GB there were 3.6k 'cities' with 14k different spellings and alternate names.
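The population cutoff described above amounts to a filter like this (the place names and population figures are illustrative stand-ins):

```python
PLACES = [
    # (name, town population, surrounding administrative-area population)
    ("Gardner, CO", 70, 400),
    ("Romford", 122000, 122000),
]

def importable(places, threshold=1000):
    """Keep only places where the town itself, or its surrounding
    administrative area, reaches the population threshold."""
    return [name for name, town_pop, area_pop in places
            if max(town_pop, area_pop) >= threshold]
```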

Approach using Google Geocode to power user lookups instead of attempting to roll our own geocode service - Google does it better, and doesn't receive any personal user details with a server-side request

Searching for "Köln" gives me a "We couldn’t locate Köln." error but a search for "Cologne" or "Koeln" returns the proper "Köln". "Münster" is working fine so I'm not sure if it's an issue with umlauts.

Another oddity: "Düsseldorf" returns "Düsseldorf-Pempelfort", which is a bit unexpected because "Pempelfort" is just a district of "Düsseldorf".

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it's showing events for KC, MO.

Searching for "Köln" gives me a "We couldn’t locate Köln." error but a search for "Cologne" or "Koeln" returns the proper "Köln". "Münster" is working fine so I'm not sure if it's an issue with umlauts.

Another oddity: "Düsseldorf" returns "Düsseldorf-Pempelfort", which is a bit unexpected because "Pempelfort" is just a district of "Düsseldorf".

I think most of the charset issues have been worked out, but there's still at least one that I know about (see the unit tests).

@dd32 is working on upgrading our geocoding from geonames.org to Google's Geocoding API, which should fix all of that. See 2823.2.diff.

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it's showing events for KC, MO.

I think that's an edge case just because there's not enough information to distinguish between the two Kansas Cities, since they're so close to each other geographically.

[wp40774] might be able to help there, though, since we could use the IP geolocation as a signal. We'd also like to have the API start returning a list of alternate locations, and have Core display those in a dropdown for the user to choose. That'll probably have to be 4.8.1, though.
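That future "alternate locations" idea could take a shape like this (the response format and data are hypothetical):

```python
# When a name is ambiguous, return every candidate so the client can show
# a dropdown; a single match behaves as today. Data is a tiny stand-in.
CANDIDATES = {
    "springfield": [("Springfield", "MO"), ("Springfield", "MA")],
    "seattle": [("Seattle", "WA")],
}

def lookup(name: str) -> dict:
    matches = CANDIDATES.get(name.lower(), [])
    if len(matches) == 1:
        return {"location": matches[0]}
    return {"alternatives": matches}
```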

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes towns with a population of >1k (or an administrative area of 1k, so a town X of 500 with 500 in the surrounding area, where X is the common point, would be included).

That probably explains the behavior I'm seeing...as Gardner, CO has a population of only 70. If the surrounding farms are included, what I call "Greater Metropolitan Gardner", the pop balloons to a whopping 400. Yes, I live in the sticks :-).

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes towns with a population of >1k (or an administrative area of 1k, so a town X of 500 with 500 in the surrounding area, where X is the common point, would be included).

That probably explains the behavior I'm seeing...as Gardner, CO has a population of only 70. If the surrounding farms are included, what I call "Greater Metropolitan Gardner", the pop balloons to a whopping 400. Yes, I live in the sticks :-).

No shame in that :)

I think the only reason we limited it to towns with >1k population is that, for an endpoint that will see as much traffic as this one, we don't have the server resources to run the database queries on larger data sets. Once we switch to Google's API, though, that should solve the problem.

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it's showing events for KC, MO.

I think that's an edge case just because there's not enough information to distinguish between the two Kansas Cities, since they're so close to each other geographically.

Sorry, "KC, MO" was just my abbreviation for "Kansas City, MO"...I should have been more clear.

r5543 merges the last bits of 2823.diff, so the performance issues should be sorted now.

The two things still left in this ticket are:

Switching to an external geocoding API for better accuracy. I'm talking to a Google rep today about ToS, pricing, etc. If that goes well, I'll work on merging 2823.2.diff. I'd like to keep the geonames stuff around as a fallback, though, at least for a few weeks. Depending on pricing, we may also want to keep it around permanently, using it for queries that we're confident will return good results from the geonames.org data, and only using the external API for requests that are likely to get better results from it.

Automating the ip2location updates. I think I have everything I need to finish the script now, so I'll be working on this today.