I’ve done a number of talks lately on Worldmaps and typically in side conversations/emails, people are curious about the databases and converting IP addresses to geographic locations. And, often when you dive into using the data, it seems there are a number of performance considerations and I thought I’d share my input on these topics.

First up, the data. Worldmaps uses two databases for IP resolution. The primary/production database is Ip2Location. I’ve found this database to be very accurate. For development/demo purposes, I use IPinfoDB. I haven’t had too much time to play with this database yet, but so far seems accurate also. The latter is free, whereas Ip2Location is not.

In either case, the schema is nearly identical:

The BeginIp and EndIp columns are a clustered primary key. In the case of IPinfoDB, there is no EndIp field (and it’s not really needed). When performing a resolution, a string IP address is converted into a 64 bit integer and then used in searching the table. That’s why having a clustered key on the BeginIp (and optionally EndIp) is crucial to performance.

But it doesn’t stop there. The examples posted in the database’s respective home pages are accurate and simple, but need to be refactored for performance. For example, to do a simple resolution on Ip2Location, according to their docs:

Both of these methods are perfectly fine, particularly for use as generic samples. The second one is on the right track, but it doesn’t work for joins so if you’re querying over a range, you’d need to refactor. And in the first example, using a BETWEEN operator forces a clustered index scan when joining, killing the performance.

If we run the first example across my minified Ip2LocationSmall table, we’ll see something like this (and this is running off of SQL Azure – the perf is pretty great compared to localhost!):

We can also look at the time:

Ouch! Now, it doesn’t seem too bad, but imagine doing thousands of these requests per minute, or doing large joins.

The goal then is to provide some hints that will optimize the query, particularly for joins. Our indexes are correct, so we can rework the query to get rid of the BETWEEN operator – we can sacrifice a little readability and do something like:

We went from 16,500 milliseconds to 260 – over 60x the performance! Mike @AngryPets would be proud. The reason for the perf gain is we were able to eliminate the nested loop, which is (in this case) scanning the entire clustered index for the matching rows.

The second benefit is the ability to switch schemas easily between IP2Location and IPinfoDB, and we can additionally lose the EndIp column which trims the size of the table.