I am processing a large number of historical web server logs. I can run queries in parallel (e.g. 10 processes each issuing 1 query per second), but parallelising the log-processing code is a much bigger task than I can take on right now.

Maxmind supply a compiled lookup tool and their database in binary format. Lookups using that are lightning fast.

How can I create something similar for an arbitrary dataset? I'd like to load the 20M rows into some magic binary indexer and get the code to make API calls to some daemon instead of MySQL lookups. I can't use Maxmind - the data I have is customised heavily.

Theoretically, I could split the 20M rows up by netblock, store them as text files on disk, and have a lookup table in the code that tells it which folder to look in. But there must be a better way! I have Percona Server 5.5.28-29.2.
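The netblock/folder idea could probably be done more simply entirely in memory: keep the ranges sorted by start address and binary-search them. A rough sketch of what I mean (the `(start_ip_int, end_ip_int, payload)` layout and the sample values are assumptions, not my real schema):

```python
import bisect
import socket
import struct

def ip_to_int(ip):
    """Dotted-quad IP to integer -- the equivalent of MySQL's INET_ATON()."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

# Assumed export of the MySQL table: (start_ip_int, end_ip_int, payload),
# sorted by start_ip_int, with non-overlapping ranges.
ranges = [
    (ip_to_int("10.0.0.0"),    ip_to_int("10.0.0.255"),      "blockA"),
    (ip_to_int("192.168.0.0"), ip_to_int("192.168.255.255"), "blockB"),
]
starts = [r[0] for r in ranges]  # parallel list of start addresses for bisect

def lookup(ip):
    ip_int = ip_to_int(ip)
    # Find the last range whose start is <= ip_int...
    i = bisect.bisect_right(starts, ip_int) - 1
    if i < 0:
        return None
    start, end, payload = ranges[i]
    # ...and confirm the address actually falls inside it.
    return payload if ip_int <= end else None
```

Each lookup is then O(log n) over a sorted list, and 20M small tuples should fit in RAM on a reasonably provisioned server.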

Comment

Thanks for your response. You're correct, it was supposed to be start_ip_int >= inet_aton(). Good suggestion to leave end_ip_int out of the query, but I just tried it and there was no performance improvement, and it introduced some incorrect results (presumably I need to sort the results, which slows things down).
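If I understand the failure mode right, it looks like this in miniature (pure Python standing in for the SQL; the range values are made up): with only the start condition in the WHERE clause, an IP that sits in a gap between two ranges still matches an earlier row, so the candidate has to be the row with the greatest matching start, verified against its end afterwards.

```python
# Ranges sorted by start; note the gap between 100-199 and 300-399.
ranges = [(100, 199, "A"), (300, 399, "B")]

def lookup_start_only(ip_int):
    # Mimics: WHERE start_ip_int <= ? ORDER BY start_ip_int DESC LIMIT 1
    matches = [r for r in ranges if r[0] <= ip_int]
    return max(matches, key=lambda r: r[0]) if matches else None

def lookup_checked(ip_int):
    # Same candidate row, but verify end_ip_int before trusting it.
    row = lookup_start_only(ip_int)
    return row[2] if row and ip_int <= row[1] else None

# 250 falls in the gap: the start-only lookup wrongly returns range "A",
# while the checked lookup correctly returns nothing.
```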

If I replace the MySQL lookup with an API call to the Maxmind binary database I get a massive performance increase (around 300 lookups/second). I didn't think MySQL could compete with that for sequential lookups. As I'm trying to process a huge number of web logs (probably over a million unique IPs), I was hoping for some way to create my own version of whatever Maxmind do in their binary format. I can't use Maxmind because I have my own custom fields, but I'd really like something I can export the MySQL table into that magically speeds it up.

I've been experimenting with Sphinx, which is good for taking the hits off my busy MySQL server, but it didn't give me the performance boost I need either.