--------------000706010906020103020108
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Ruby Quiz wrote:
> This week's Ruby Quiz is to write a simple utility. Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in. You can find a database for the matching
> at:
>
> http://software77.net/cgi-bin/ip-country/geo-ip.pl
>
> To keep the problem interesting though, let's write our programs with a focus on
> speed and memory efficiency.
>
> $ time ruby ip_to_country.rb 68.97.89.187
> US
>
> real 0m0.314s
> user 0m0.259s
> sys 0m0.053s
This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.
Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.
First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline").
I use the script like this:
$ ./rand_ip.rb 100000 byline > ip-test-list
Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.
The packer doesn't accept any parameters, just call
$ ./packer.rb
Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.
The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.
Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip > ddr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.
I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:
$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null
For more addresses I use one of the following -- the last usage is to make successive runs with equal input:
$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null
My results for 100000 lookups are:
$ time ./packer.rb
real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list
real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null
real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null
real 0m7.131s
user 0m6.960s
sys 0m0.168s
btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null
real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
Cheers,
- Matthias
--------------000706010906020103020108
Content-Type: text/plain;
name and_ip.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename and_ip.rb"
#!/usr/bin/ruby
r ¨ÂÖÛ°Ý®ôïß
EOL RGV[1] ? "\n" : " "
r.times {
print "#{rand(256)}.#{rand(256)}.#{rand(256)}.#{rand(256)}#{EOL}"
}
--------------000706010906020103020108
Content-Type: text/plain;
name acker.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename acker.rb"
#!/usr/bin/ruby
# comment
last_start l
last_end l
last_country l
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line /^"/ )
s,e,d1,d2,co ne.delete!("\"").split(",")
s,e .to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country ,e,co
else
if s ast_end+1 and co ast_country
# squeeze if successive ranges have zero gap
last_end else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country ,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end
--------------000706010906020103020108
Content-Type: text/plain;
name earch_file.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename earch_file.rb"
#!/usr/bin/ruby
# take command line or stdin -- the latter has performance advantage for long lists
if ARGV[0]
arr ¨Âelse
arr tdin
end
# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_maxile.pos/10-1
arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr rgv.split(".").map {|x| x.to_i.chr}.join
# low/high water marks initialized
low,high record_max
while true
mid ow+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
strile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>