Abstract:
This paper shares our experience in designing a web crawler that can
download billions of pages using a single-server implementation, and it
models the crawler's performance. We show that with the quadratically increasing
complexity of verifying URL uniqueness, BFS crawl order, and fixed
per-host rate-limiting, current crawling algorithms cannot effectively
cope with the sheer volume of URLs generated in large crawls,
highly-branching spam, legitimate multi-million-page blog sites, and
infinite loops created by server-side scripts. We offer a set of
techniques for dealing with these issues and test their performance in an
implementation we call IRLbot. In our recent experiment that lasted $41$
days, IRLbot running on a single server successfully crawled $6.3$
billion valid HTML pages ($7.6$ billion connection requests) and
sustained an average download rate of $319$ Mb/s ($1,789$ pages/s).
Unlike our prior experiments with algorithms proposed in related work,
this version of IRLbot did not experience any bottlenecks and
successfully handled content from over $117$ million hosts, parsed out
$394$ billion links, and discovered a subset of the web graph with $41$
billion unique nodes.