Purging referrer URLs concurrently

After writing a "pooling executor" that assigns
tasks to a number of handlers in parallel, filtering referrer URLs with
reasonable efficiency becomes much easier. Checking multiple referrers
concurrently keeps the bandwidth busy, which matters when you have to fetch
~8000 pages; done serially, the process could easily take a couple of hours
or more.

I just have to iterate over the referrer data files, check whether each
referring page actually exists and seems OK, and overwrite the data.
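
Each data file is just a Ruby hash literal mapping a referrer URL to its hit
count, which the script below loads with eval and writes back in the same
shape (these entries are made up):

{
  "http://somesite.example.com/some-page" => 42,
  "http://www.bloglines.com/myblogs" => 1312,
}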

Checking a referrer

Some URLs can be discarded right away: search engines, common RSS aggregators
(mostly Bloglines, which I get lots of hits from), Google Groups... I also
whitelist a few patterns to make the script a bit faster, and cache the
results so no page is downloaded twice.
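
In outline, the filtering looks something like this; the actual patterns, the
helper names and the $referer_cache global here are illustrative, not the
real code:

BAD_REFERER_PATTERNS = [
  %r{google\.com/search},       # search engines
  %r{bloglines\.com},           # RSS aggregators
  %r{groups\.google\.com},      # google groups
]
GOOD_REFERER_PATTERNS = [
  %r{^http://friendly-site\.example\.com},  # known good, no HTTP request needed
]

# referer URL => true/false; checking a URL once is enough
$referer_cache = {}

def known_bad?(referer)
  BAD_REFERER_PATTERNS.any? { |re| referer =~ re }
end

def known_good?(referer)
  GOOD_REFERER_PATTERNS.any? { |re| referer =~ re }
end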

I'm using Timeout and retrying up to 3 times for the lazy servers out there.
The "check" isn't the epitome of sophistication, but it's very effective.
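
A minimal sketch of that check, using the helpers above and matching the
signature the executor code below expects; the 10-second limit, the way
good_referrers is consulted, and the notion that a referrer "seems OK" when
its page can be fetched at all are my assumptions, not necessarily the
original logic:

require 'open-uri'
require 'timeout'

def check_referer(referer, count, ok_referer_hash, good_referrers)
  return false if known_bad?(referer)
  if known_good?(referer) || good_referrers[referer]
    return ok_referer_hash[referer] = count
  end
  # concurrent cache access is benign at worst (a page might be
  # fetched twice); a Mutex would make it strict
  unless $referer_cache.has_key?(referer)
    $referer_cache[referer] = fetch_ok?(referer)
  end
  ok_referer_hash[referer] = count if $referer_cache[referer]
end

# true if the page could be fetched within 10 seconds,
# retrying up to 3 times for the lazy servers
def fetch_ok?(url, attempts = 3)
  Timeout.timeout(10) { open(url) { |f| f.read } }
  true
rescue Timeout::Error, StandardError
  attempts -= 1
  retry if attempts > 0
  false
end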

Concurrent verifiers

The key to making this run faster is checking several URLs at a time. My
PoolingExecutor handles that easily, creating
the threads and throttling HTTP requests for me:

require 'fileutils'
require 'poolingexecutor'
require 'open-uri'
require 'timeout'

def purge_bad_referers(filename, good_referrers = {})
  referer_hash = eval(File.read(filename))
  ok_referer_hash = {}
  executor = PoolingExecutor.new do |handlers|
    15.times do
      # we could as well register a plain Object and use it as a "token",
      # with the actual code in the executor.run { ... } block
      handlers << lambda do |referer, count|
        check_referer(referer, count, ok_referer_hash, good_referrers)
      end
    end
  end
  referer_hash.each do |referer, count|
    executor.run do |handler|
      puts "Processing #{referer}"
      puts "Keeping #{referer}" if handler[referer, count]
    end
  end
  executor.wait_for_all
  File.open(filename, "w") do |f|
    f.puts "{"
    ok_referer_hash.keys.sort.each do |key|
      f.puts "#{key.inspect} => #{ok_referer_hash[key].inspect},"
    end
    f.print "}"
  end
end
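
For reference, here's a toy stand-in for PoolingExecutor, inferred from the
usage above; the real class from the earlier article also throttles the HTTP
requests, which this version doesn't attempt:

require 'thread'

class PoolingExecutor
  def initialize
    handlers = []
    yield handlers
    @pool = Queue.new                 # idle handlers
    handlers.each { |h| @pool << h }
    @threads = Queue.new              # live worker threads
  end

  # blocks until a handler is free, so no more jobs run at a time
  # than there are registered handlers
  def run
    handler = @pool.pop
    @threads << Thread.new do
      begin
        yield handler
      ensure
        @pool << handler              # hand the handler back
      end
    end
  end

  def wait_for_all
    @threads.pop.join until @threads.empty?
  end
end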