Web crawling is by no means trivial. There are all sorts of things to
consider, such as:
. site throttling
. graph cycle detection
. endless graph detection (you've seen Fred's "roach motel" /
"Hotel California" for crawlers, or whatever he calls it)
. optimizing the crawling queue
. detecting redundant content (rather difficult)
. extracting links from JavaScript
. etc, etc
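Just to make the first two concrete: throttling and cycle detection alone
already force some bookkeeping into the crawl loop. Here's a toy sketch in
Python (the thread is about Perl tools, but the shape is the same); the
FAKE_WEB mapping stands in for actual HTTP fetching and link extraction,
which is where most of the real pain lives:

```python
import time
from collections import deque
from urllib.parse import urlparse

# Stubbed "web": URL -> outgoing links. A real crawler would GET the page
# and extract links; this mapping exists purely so the sketch is runnable.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/"],   # cycle back to the root
    "http://b.example/": ["http://a.example/p1"],
}

def crawl(seed, delay=0.0):
    """Breadth-first crawl with a visited set (cycle detection) and a
    naive per-host throttle: wait `delay` seconds between hits to a host."""
    visited = set()
    last_hit = {}             # host -> time of last request to that host
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:    # cycle / duplicate detection
            continue
        host = urlparse(url).netloc
        wait = last_hit.get(host, 0) + delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # site throttling, one host at a time
        last_hit[host] = time.monotonic()
        visited.add(url)
        order.append(url)
        for link in FAKE_WEB.get(url, []):
            if link not in visited:
                queue.append(link)
    return order

# crawl("http://a.example/") visits each page exactly once despite the cycle.
```

Even this leaves out robots.txt, retries, redirects, URL normalization, and
the endless-graph problem above, which is why the real projects are as big
as they are.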
Heritrix, as Trevor mentioned, is full-featured, very mature, and has a
bunch of users and an active mailing list. You don't need to be a Java
programmer to make use of it; it is very configurable.
Even if you don't use Heritrix, it is still well worth having a good look
at the project. It has a good architecture, and implements all of what
you'd need for most crawling. If you want to write plugins, it has a
well-documented API, and the people on the mailing list are helpful.
My trouble with Heritrix is that it is geared toward deep, archival
crawling (it's what crawls for the Wayback Machine), whereas I wanted
shallow, iterative crawling to find new content.
There's a project that implements a good crawler in Perl, but it is
abandoned:
http://search.cpan.org/~dmaki/Gungho-0.09008/
I played with Gungho a bit, and found it to be mostly what I wanted,
except that to customize it you have to write code instead of tweaking
parameters, as you can with Heritrix.
-Colin.
On Thu, Apr 22, 2010 at 02:08:29PM -0700, Michael R. Wolf wrote:
> I let myself get sucker punched! I wrote my own web crawler based on
> WWW::Mechanize because my preliminary research indicated that crawlers
> were simple.
>
> [Aside: If you're tempted to do that, let me save you some time.
> Don't. They are *conceptually* simple (GET page, push links on queue,
> iterate), but there are many levels of devils lurking in the details.]
>
> Having finished phase 1 (way behind schedule and way over budget), I'm
> looking for a better web crawler solution for phase 2.