Simple web-crawler for node.js

Simplecrawler is designed to provide the most basic possible API for crawling
websites, while being as flexible and robust as possible. I wrote simplecrawler
to archive, analyse, and search some very large websites. It has happily chewed
through 50,000 pages and written tens of gigabytes to disk without issue.

And of course, you probably want to ensure you don't take down your web
server. Decrease the concurrency from the default of five simultaneous requests,
and increase the interval between requests from the default 250ms, like this:

myCrawler.interval = 10000; // Ten seconds

myCrawler.maxConcurrency = 1;

For brevity, you may also specify the initial path and request interval when
creating the crawler:
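
Something like the following, assuming the constructor accepts the host, initial
path, port, and request interval in that order (check the constructor signature
for the version you're using):

var Crawler = require("simplecrawler").Crawler;

// Hypothetical example: crawl www.example.com, starting at /archive on port 80,
// with a 300ms interval between requests.
var myCrawler = new Crawler("www.example.com", "/archive", 80, 300);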

By default, simplecrawler does not download the response body when it encounters
an HTTP error status in the response. If you need this information, you can listen
for simplecrawler's error events, and via node's native data event
(response.on("data", function(chunk) {...})) save the information yourself.

If this is annoying, and you'd really like to retain error pages by default, let
me know. I didn't include it because I didn't need it - but if it's important to
people I might put it back in. :)

Sometimes, you might want simplecrawler to wait for you while you perform some
asynchronous tasks in an event listener, instead of having it race off and fire
the complete event, halting your crawl. For example, you might be doing your own
link discovery using an asynchronous library method.

Simplecrawler provides a wait method you can call at any time. It is available
via this from inside listeners, and on the crawler object itself. It returns
a callback function.

Once you've called this method, simplecrawler will not fire the complete event
until either you execute the callback it returns, or a timeout is reached
(configured in crawler.listenerTTL, by default 10000 msec.)
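
A sketch of how that fits together, assuming your asynchronous work happens in a
fetchcomplete listener (doMyOwnLinkDiscovery is a hypothetical helper, not part
of simplecrawler):

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    // Ask simplecrawler to hold off; it hands back a callback to resume with
    var resume = this.wait();

    doMyOwnLinkDiscovery(responseBuffer, function(err) {
        // Let simplecrawler know it may fire the complete event when ready
        resume();
    });
});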

crawler.domainWhitelist -
An array of domains the crawler is permitted to crawl from. If other settings
are more permissive, they will override this setting.

crawler.supportedMimeTypes -
An array of RegEx objects used to determine supported MIME types (types of
data simplecrawler will scan for links.) If you're not using simplecrawler's
resource discovery function, this won't have any effect.

crawler.allowedProtocols -
An array of RegEx objects used to determine whether a URL protocol is supported.
This is to deal with nonstandard protocol handlers that regular HTTP is
sometimes given, like feed:. It does not provide support for non-http
protocols (and why would it!?)

crawler.maxResourceSize -
The maximum resource size, in bytes, which will be downloaded. Defaults to 16MB.

crawler.downloadUnsupported -
Simplecrawler will download files it can't parse. Defaults to true, but if
you'd rather save the RAM and GC lag, switch it off.

crawler.needsAuth -
Flag to specify whether the domain you are hitting requires basic authentication.

crawler.authUser -
Username provided for the needsAuth flag.

crawler.authPass -
Password provided for the needsAuth flag.

crawler.customHeaders -
An object specifying a number of custom headers simplecrawler will add to
every request. These override the default headers simplecrawler sets, so
be careful with them. If you want to tamper with headers on a per-request basis,
see the fetchqueue event.

crawler.acceptCookies -
Flag to indicate whether the crawler should hold on to cookies.

crawler.urlEncoding -
Set this to iso8859 to trigger URIjs' re-encoding of iso8859 URLs to unicode.
Defaults to unicode.
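
For illustration, here's a configuration sketch touching a few of the properties
above (the values are placeholders, not recommendations):

crawler.domainWhitelist = ["example.com", "cdn.example.com"];
crawler.supportedMimeTypes.push(/^application\/json/i); // also scan JSON for links
crawler.maxResourceSize = 1024 * 1024 * 4;  // 4MB cap instead of the 16MB default
crawler.downloadUnsupported = false;        // save RAM on bodies we can't parse

crawler.needsAuth = true;
crawler.authUser = "archiver";
crawler.authPass = "s3cret";

crawler.customHeaders = {
    "Accept-Language": "en-GB"
};

crawler.acceptCookies = true;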

Simplecrawler has a mechanism you can use to prevent certain resources from being
fetched, based on the URL, called fetch conditions. A fetch condition is just
a function which, when given a parsed URL object, returns a truthy or falsy
value indicating whether a given resource should be downloaded.

You may add as many fetch conditions as you like, and remove them at runtime.
Simplecrawler will evaluate every single condition against every queued URL, and
should just one of them return a falsy value (this includes null and undefined,
so remember to always return a value!) then the resource in question will not be
fetched.

This example fetch condition prevents URLs ending in .pdf from downloading.
Adding a fetch condition assigns it an ID, which the addFetchCondition function
returns. You can use this ID to remove the condition later.

var conditionID = myCrawler.addFetchCondition(function(parsedURL) {
    return !parsedURL.path.match(/\.pdf$/i);
});

NOTE: simplecrawler uses slightly different terminology to URIjs. parsedURL.path
includes the query string too. If you want the path without the query string,
use parsedURL.uriPath.
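
To drop the condition again later, pass the ID back in (assuming the removal
method mirrors addFetchCondition):

myCrawler.removeFetchCondition(conditionID);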

Simplecrawler has a queue like any other web crawler. It can be directly accessed
at crawler.queue (assuming you called your Crawler() object crawler.) It
provides array access, so you can get to queue items just with array notation
and an index.

crawler.queue[5];

For compatibility with different backing stores, it now provides an alternate
interface which the crawler core makes use of.

First of all, the queue can provide some basic statistics about the network
performance of your crawl (so far.) This is done live, so don't check it thirty
times a second. You can test the following properties:

requestTime

requestLatency

downloadTime

contentLength

actualDataSize

And you can get the maximum, minimum, and average values for each with the
crawler.queue.max, crawler.queue.min, and crawler.queue.avg functions
respectively. Like so:

console.log("The minimum download time was %dms.",crawler.queue.min("downloadTime"));

console.log("The average resource size received is %d bytes.",crawler.queue.avg("actualDataSize"));

You'll probably often need to determine how many items in the queue have a given
status at any one time, and/or retrieve them. That's easy with
crawler.queue.countWithStatus and crawler.queue.getWithStatus.

crawler.queue.countWithStatus returns the number of queued items with a given
status, while crawler.queue.getWithStatus returns an array of the queue items
themselves.

var redirectCount = crawler.queue.countWithStatus("redirected");

crawler.queue.getWithStatus("failed").forEach(function(queueItem) {
    console.log("Whoah, the request for %s failed!", queueItem.url);

    // do something...
});

Then there are some even simpler convenience functions:

crawler.queue.complete - returns the number of queue items which have been
completed (marked as fetched)

crawler.queue.errors - returns the number of requests which have failed
(404s and other 400/500 errors, as well as client errors)
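
For example, assuming both are called as functions:

console.log("Fetched %d resources so far, with %d errors.",
    crawler.queue.complete(),
    crawler.queue.errors());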

You'll probably want to be able to save your progress and reload it later, if
your application fails or you need to abort the crawl for some reason. (Perhaps
you just want to finish off for the night and pick it up tomorrow!) The
crawler.queue.freeze and crawler.queue.defrost functions perform this task.

A word of warning though - they are not CPU friendly, as they rely on
JSON.parse and JSON.stringify. Use them only when you need to save the queue -
don't call them on every request, or your application's performance will be
incredibly poor - they block like crazy. That said, using them when your crawler
starts and stops is perfectly reasonable.

Note that the methods themselves are asynchronous, so if you are going to exit the
process after you do the freezing, make sure you wait for the callback - otherwise
you'll get an empty file.
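
A sketch of freezing before exit and defrosting on the next run (the filename is
just an example):

// Save the queue, and only exit once the file has actually been written
crawler.queue.freeze("mysavedqueue.json", function() {
    process.exit();
});

// Later, restore the saved queue before starting the crawler again
crawler.queue.defrost("mysavedqueue.json");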

Mike Moulton for
[fixing a bug in the URL discovery mechanism](https://github.com/cgiffard/node-simplecrawler/pull/3),
as well as
[adding the discoverycomplete event](https://github.com/cgiffard/node-simplecrawler/pull/10),

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.