"readdb" is an alias for "org.apache.nutch.crawl.CrawlDbReader"

Returns or exports information on the crawl database (crawldb).

Usage

<crawldb>: Path to the crawldb directory.

[-stats]: Prints the overall statistics to System.out.

[-dump <out_dir>]: Exports the crawldb to a file in <out_dir>.

[-url <url>]: Prints statistics on <url> to System.out.
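As a rough illustration of how the options above combine, the following Python sketch builds the argument list for one readdb invocation. The helper name `readdb_command`, the default launcher path `bin/nutch`, and the rule that exactly one option is passed per call are assumptions made for this example. Only the `readdb` alias and the three flags come from this page.

```python
from typing import List, Optional

# Flags taken from the usage list above; True means the flag takes an argument.
MODES = {"stats": False, "dump": True, "url": True}


def readdb_command(crawldb: str, mode: str, arg: Optional[str] = None,
                   nutch: str = "bin/nutch") -> List[str]:
    """Build an argv list for a readdb call (hypothetical helper).

    The launcher path in `nutch` is an assumption; adjust it to your install.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    needs_arg = MODES[mode]
    if needs_arg and not arg:
        raise ValueError(f"-{mode} requires an argument")
    if not needs_arg and arg is not None:
        raise ValueError(f"-{mode} takes no argument")
    cmd = [nutch, "readdb", crawldb, "-" + mode]
    if needs_arg:
        cmd.append(arg)
    return cmd


# Example: command line for printing overall statistics.
print(" ".join(readdb_command("crawl/crawldb", "stats")))
```

The resulting list can be handed to `subprocess.run` if the command should be executed rather than printed.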

Configuration Files

hadoop-default.xml

hadoop-site.xml

nutch-default.xml

nutch-site.xml

Other Files

None.

Caveats and Notes

-stats command

The -stats option is a quick way to get an overview of a completed crawl. The status counts in its output have the following meanings:

DB_unfetched counts pages that are linked to by fetched pages but have not been fetched yet, either because they do not pass the URL filters or because they are not among the TopN links that Nutch selects for its next fetch cycle.

DB_gone means that a 404 or some other presumably permanent error was encountered. This status prevents future attempts to fetch the URL.

DB_fetched is the number of documents that have been fetched and indexed. This is the count that matters most: if the output shows "status 2 (DB_fetched): 0", something went wrong.
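The check described above can be automated. The following is a minimal Python sketch that extracts the status counts from saved -stats output and warns when nothing was fetched. It assumes every status line has the shape "status N (NAME): COUNT", as in the example line quoted above. The sample text, its numbers, and the function name `parse_status_counts` are made up for illustration, and real output may carry extra log prefixes or different capitalization.

```python
import re
from typing import Dict

# Matches lines shaped like: status 2 (DB_fetched): 0
# Using a search-style pattern so that leading log prefixes are ignored.
STATUS_LINE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s*(\d+)")


def parse_status_counts(text: str) -> Dict[str, int]:
    """Return a mapping from status name to page count (hypothetical helper)."""
    return {m.group(2): int(m.group(3)) for m in STATUS_LINE.finditer(text)}


# Illustrative sample only; the counts are invented for this sketch.
SAMPLE = """\
status 1 (DB_unfetched): 120
status 2 (DB_fetched): 0
status 3 (DB_gone): 4
"""

counts = parse_status_counts(SAMPLE)
if counts.get("DB_fetched", 0) == 0:
    print("warning: no pages fetched, check the crawl")
```

Status names are kept exactly as printed, so a version that prints them in lower case would need the lookup key adjusted accordingly.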