A crawl script that runs under bash and has been tested with Nutch 1.0 can be found here: [[Crawl]]. The script can perform both an initial crawl and a recrawl. However, it has seen little real-world recrawl use, so it may need some tweaking if it does not suit your needs.

NOTE: the scripts listed here do not recrawl correctly. Each run adds additional depth (as specified by the user) to the crawl. To avoid this, pass the '-noAdditions' option to the 'updatedb' command. An annoying problem remains: if you have used the 'crawl' command, newly discovered URLs have already been added to the crawldb and will be fetched by the next 'fetch' command.

As a result, you will see more pages being crawled by this recrawl script, not just the pages you have already fetched.
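A single recrawl round that passes '-noAdditions' to 'updatedb' might look like the sketch below. It is not the script linked above. The crawl directory layout (crawldb and segments under one directory), the helper names 'run' and 'recrawl_round', the DRY_RUN switch, and the placeholder segment name SEGMENT are all assumptions made for illustration. It also assumes the fetcher is configured to parse while fetching; otherwise a separate parse step would be needed before 'updatedb'. It does not remove URLs that an earlier 'crawl' run already added to the crawldb.

```shell
#!/usr/bin/env bash
# Sketch of one recrawl round that does not add newly discovered URLs.
# Assumed layout: <crawl_dir>/crawldb and <crawl_dir>/segments.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher (assumption)
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: generate, fetch, then update the crawldb once.
recrawl_round() {
  local crawl_dir="$1"
  local crawldb="$crawl_dir/crawldb"
  local segments="$crawl_dir/segments"
  local segment

  # Create a new segment listing the URLs that are due for fetching.
  run "$NUTCH" generate "$crawldb" "$segments" || return 1

  if [ "$DRY_RUN" = "1" ]; then
    segment="$segments/SEGMENT"   # placeholder name used only for printing
  else
    # Segment names are timestamps, so the last one sorted is the newest.
    segment="$(ls -d "$segments"/* | sort | tail -n 1)"
  fi

  run "$NUTCH" fetch "$segment" || return 1

  # -noAdditions updates the status of known URLs only, so links found
  # in this round are not added to the crawldb.
  run "$NUTCH" updatedb "$crawldb" "$segment" -noAdditions
}

# Example: print the commands for a crawl stored in ./crawl
# DRY_RUN=1 recrawl_round crawl
```

Running the round in dry-run mode first makes it easy to check the paths before anything is fetched. Set DRY_RUN=0 to execute the commands.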
