
hi,
I am trying to achieve the same goal.
I currently use the free ip2country database imported into MySQL.
On a daily basis, I run about a million queries against this table.
Due to the format of the table (each IP block is identified by its lower and higher IP), one cannot query the table for multiple IPs at a time, nor use indexes effectively (because of the < and > comparisons in the WHERE clause). I think this also rules out in-memory HEAP tables.
This results in the process taking quite some time, even on a speedy machine.
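For reference, a single lookup looks something like this (a sketch; l and h are the integer block bounds and iso_code the country, with a made-up example IP):

-- one query per IP; the double range comparison is what hurts
SELECT iso_code
  FROM ip2country
 WHERE INET_ATON('212.43.217.21') >= l
   AND INET_ATON('212.43.217.21') <= h
 LIMIT 1;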

I'd be curious about any possible performance improvement as well. The way I see it, better performance could only be achieved by moving away from an SQL database and exploiting the known format of the table with good algorithms to avoid full table scans.

Actually, that's the way it is already. I was under the impression that MySQL did not use indexes when there is a range comparison in the WHERE clause; are you sure it does?
If so, what I don't get is why I see the following:
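Roughly this, with a made-up example number (an illustrative sketch, not my exact output):

EXPLAIN SELECT iso_code
          FROM ip2country
         WHERE 3544767765 >= l
           AND 3544767765 <= h;
-- possible_keys: l,h    key: l    rows: 16000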

The table has indexes on both l and h (the low and high values of the IP block), and the query only returns one result (which is normal), yet EXPLAIN says that 16000 rows (a full table scan) have to be looped through.
If it used the indexes correctly, shouldn't it be able to find the right row right away?

After running a few tests, it seems that the number of rows looked at depends on the number one compares against. There is probably some sort of optimization algorithm at work.

Anyway, it doesn't seem optimized enough, because I have written a small PHP script that searches through a text file using a simple dichotomy (binary search) algorithm, and I'm getting much better results.
On a 60000+ line database (from webhosting.info), an SQL query like the above takes (on my machine) between 0.0006s (when the number is such that only a few lines are looked through) and 0.02s (when it does a full table scan).
My dichotomic script consistently returns results in around 0.0003s, having performed 13 to 18 iterations.

Unless there's something I've missed about index usage, I'm probably going to switch my method of operation, hopefully at least halving the time taken by my batch.

Ren, haven't you found that changing the number you compare against produces different results?
IP tables have a somewhat strange distribution, which may be the reason for the disparities.
This is also why I don't get better results with algorithms other than dichotomy. I've tried the false position and tangent methods, which are supposed to converge faster than dichotomy, but the distribution of the IP blocks seems to produce weird results with them.

Anyway, I've modified my script a bit to load the whole file into memory and search PHP strings/arrays rather than the file, and I'm now getting results of around 0.0002s per query (vs 0.0003s searching the file and 0.0006s to 0.003s with SQL).
Getting closer. I might experiment with a binary version, and then I should be done.
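For the curious, the in-memory dichotomy is essentially this (a sketch, not my exact script; it assumes one "low,high,country" line per row, sorted ascending on the low bound):

<?php
// load the whole sorted file into an array once per batch
$lines = file('ip2country.csv');

// binary search for the block containing $ipnum
function ip_lookup($ipnum, $lines)
{
    $lo = 0;
    $hi = count($lines) - 1;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        list($l, $h, $cc) = explode(',', trim($lines[$mid]));
        if ($ipnum < (float) $l) {
            $hi = $mid - 1;      // block is further down
        } elseif ($ipnum > (float) $h) {
            $lo = $mid + 1;      // block is further up
        } else {
            return $cc;          // $ipnum falls inside this block
        }
    }
    return false;                // gap between blocks: no match
}

// %u avoids negative numbers for IPs above 127.255.255.255
$ipnum = (float) sprintf('%u', ip2long('212.43.217.21'));
echo ip_lookup($ipnum, $lines);
?>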

I've been using the MaxMind GeoIP, as that was the first IP-to-country database I found. Just wondering how the two data sets compare for accuracy etc.

the problem is that mysql reverts to a table scan when it thinks that approximately 30% of the table's rows will be returned. based on the statistics that mysql has about the table, it thinks that your query is going to return 16k rows, which is half of the 37k in the table. but by adding the iso_code column to the index, it also knows that it can complete the query using only the index, so it ignores the 30% rule.
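i.e. something along these lines (the index name is made up):

ALTER TABLE ip2country
  ADD INDEX l_h_code (l, h, iso_code);

-- the query can now be answered from the index alone
SELECT iso_code
  FROM ip2country
 WHERE 3544767765 BETWEEN l AND h;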

I've just tried that, but it produces exactly the same result (indexes not used, or only partially). The phenomenon you explain seems likely to be correct, but the solution doesn't work.

I'd be curious to compare as well. I've recently switched to the one provided by webhosting.info; however, it's not perfectly accurate (although better than the ip2country one I used before). An accurate one (including for the city) is the Geobytes one; it's the only one I've come across that correctly found the city I live in, but it isn't free, and I'm not sure I need the extra precision for what I'm doing.

But as I need to do lookups on batches of IPs, I'm guessing that the SQL version will perform better.

I do need to do lookups on batches of IPs as well, but I can't find a way to make SQL faster. Due to the conversion of IPs to numbers first, and most importantly due to the double comparison in the WHERE clause, I haven't been able to find a way to query more than one IP at once. If you have, I'm really curious how you're doing it.
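The obvious thing to try, OR-ing the ranges together, just repeats the double comparison once per IP instead of getting rid of it (a hypothetical sketch):

// one range condition per IP -- each is still a double
// comparison, so the index situation doesn't improve
$conds = array();
foreach ($ips as $ip) {
    $n = sprintf('%u', ip2long($ip));
    $conds[] = "($n BETWEEN l AND h)";
}
$sql = "SELECT l, h, iso_code FROM ip2country WHERE "
     . implode(' OR ', $conds);
// result rows then have to be matched back to their IPs
// via the l/h bounds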

I'd like to try out MaxMind and see how it compares to webhosting.info's, but it seems it's not free either. Does anyone know of a freely available database that includes IP-to-city correspondence?

MaxMind do a free version of their binary .dat file, "GeoLite".

Unfortunately I can't get to their website at the moment to find the download link.

I'm importing web log files line by line, and at the moment I'm doing the country lookup in PHP as each line gets imported. But I plan to remove it from the loop and do a single INSERT once the raw data has been imported.
Something along the lines of...
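(a sketch; the raw_log and hits table names and their columns are made up):

-- join the imported raw rows against the country blocks in one pass
INSERT INTO hits (stamp, ip, iso_code)
SELECT r.stamp,
       r.ip,
       c.iso_code
  FROM raw_log AS r
  LEFT JOIN ip2country AS c
    ON INET_ATON(r.ip) BETWEEN c.l AND c.h;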

Well, it says "Using index", but it still looks through 63370 rows (a full table scan) when only one row matches the query. My definition of correct index usage is that it only has to look at the records that end up in the result set.

the 63370 in the explain is NOT an exact number. that is an ESTIMATE of the number of rows that the index will return, based on the key distribution. it has nothing to do with the actual number of rows looked at, although sometimes it just happens to be the same as the number of rows returned from that table, but that is coincidence, not a mathematical or procedural certainty.
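those estimates come from the table's index statistics, which you can refresh if they have gone stale, and inspect directly:

ANALYZE TABLE ip2country;

-- shows what mysql thinks the index cardinalities are
SHOW INDEX FROM ip2country;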

Yes, but the performance is still poor even if it is using an index.

932 rows in set (32.99 sec)

vs

932 rows in set (9.16 sec)

Using the pair of queries posted above; the latter timing uses the extra table.

Still, if it ends up returning one row when the estimate was 63370, it means the index isn't working as it should. And if it estimates 63370 rows in the result set, it will do a full table scan (because it will think it has to), won't it?
What's weird is the query times: I now get 0.002s for the same query that takes 0.0006s without the index.

i'm stumped. if anyone else wants to take a crack at it, i've attached an SQL file that creates a table of random ip addresses. you can download the ip2country sql file from http://www.ip2country.net/download/cr.zip (look for a file called anp_ip2country.sql).