Note: this document is EXTREMELY outdated. If you are able to contribute documentation here, please contact us on dev@nutch. In the meantime, please check out the other pages about Nutch2: FrontPage#Nutch_2.0

Overview

Reuse of existing Nutch codebase

While some things will change, this architecture is more of a refactor than a complete rewrite. Much of the existing codebase, including plugin functionality, should be reused.

Remove the plugin framework

After some experimenting, dependency injection (DI) using Spring or a similar framework presents problems. The good news is that we can achieve the same thing by using the Configuration objects from Hadoop and creating new instances with ReflectionUtils. This is more of a service locator than dependency injection, but it still gives us the same benefits.
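The service-locator idea can be sketched in plain Java. This is only an illustration under stated assumptions: java.util.Properties stands in for Hadoop's Configuration, a small reflective helper stands in for ReflectionUtils, and the names UrlFilter, LowercaseUrlFilter, ServiceLocator and the key "urlfilter.class" are hypothetical, not existing Nutch or Hadoop identifiers.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical extension point; the name is illustrative only.
interface UrlFilter {
    String filter(String url);
}

// One hypothetical implementation that the configuration can select by class name.
class LowercaseUrlFilter implements UrlFilter {
    @Override
    public String filter(String url) {
        return url.toLowerCase(Locale.ROOT);
    }
}

// Service locator: reads a class name from the configuration and instantiates
// it reflectively. Properties is a stand-in for Hadoop's Configuration here.
final class ServiceLocator {
    private final Properties conf;

    ServiceLocator(Properties conf) {
        this.conf = conf;
    }

    <T> T newInstance(String key, Class<T> type) {
        String className = conf.getProperty(key);
        if (className == null) {
            throw new IllegalStateException("No implementation configured for " + key);
        }
        try {
            // asSubclass rejects classes that do not implement the requested type.
            Class<? extends T> impl = Class.forName(className).asSubclass(type);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

A caller asks the locator for an implementation by key instead of having one injected, which is why this is closer to a service locator than to dependency injection. Swapping the implementation only requires changing the configured class name.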

Tools should be able to change their job configuration settings. This could be accomplished through some type of properties file on the classpath and would be useful for testing, for example to switch out an OutputFormat and see the output in text format.
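A minimal sketch of such an override mechanism follows. It assumes a hypothetical resource name (tool-overrides.properties) and a hypothetical key (output.format.class), and it uses java.util.Properties rather than a real Hadoop job configuration. The values are plain strings chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ToolSettings {
    // Hypothetical name of the optional override file on the classpath.
    static final String OVERRIDES_RESOURCE = "tool-overrides.properties";

    // Built-in defaults for a tool; the key and value are illustrative.
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("output.format.class", "SequenceFileOutputFormat");
        return p;
    }

    // Copies the defaults, then lets entries from the override stream win.
    // A null stream means no override file was found.
    static Properties withOverrides(Properties defaults, InputStream overrides) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        if (overrides != null) {
            try {
                merged.load(overrides);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }

    // Looks for the override file on the classpath and applies it if present.
    static Properties load() {
        try (InputStream in =
                ToolSettings.class.getClassLoader().getResourceAsStream(OVERRIDES_RESOURCE)) {
            return withOverrides(defaults(), in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a test, dropping a file containing output.format.class=TextOutputFormat on the classpath would switch the tool to text output without touching its code.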

MapWritable would contain Text → Writable or Writable[] mappings and would allow the parsing of all the different types of elements within the content (href, headers, etc.).
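The shape of that mapping can be sketched without Hadoop on the classpath. In this stand-in, String plays the role of Text and a List<String> plays the role of a Writable or Writable[] value. The class name ParsedContentSketch and the element type names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the proposed MapWritable layout:
// element type (e.g. "href", "headers") → one or more parsed values.
final class ParsedContentSketch {
    private final Map<String, List<String>> elements = new LinkedHashMap<>();

    // Appends a value under its element type, creating the entry on first use.
    void add(String elementType, String value) {
        elements.computeIfAbsent(elementType, k -> new ArrayList<>()).add(value);
    }

    // Returns all values for an element type, or an empty list if none were parsed.
    List<String> get(String elementType) {
        return elements.getOrDefault(elementType, Collections.emptyList());
    }
}
```

Because the keys are free-form, a parser can record new kinds of elements without any change to the container.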

Processing

Processing would take the ParsedContent and translate that into multiple specific data parts. These data parts aren't used by any part of the system except Scoring.

Processing would consist of specific functions, including updating the CrawlBase, performing analysis on ParsedContent, and integrating data from other sources.

Some processors would translate content into formats needed by scorers.

Processors are not constrained to specific data structures, which allows flexibility in analysis, updating, blocking or removal, and other forms of data processing. The only requirement is that scoring programs must be able to access the processing output data structures in a one-to-one relationship.

Scoring

Url → Field

Url → Float

Field is a name, value(s), and score, being Text, Text, and Float respectively.

These fields are the ones that get indexed, and their scores become the field boosts.

Scoring takes the specific data parts from analysis and outputs the above formats.

Field needs Lucene semantics.
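The two scoring outputs described above can be sketched as follows. This is an assumption-laden illustration: String and float stand in for Text and Float, the names ScoredField and ScoringOutput are hypothetical, a URL is assumed to map to possibly several fields, and the Url → Float output is read here as a per-document score. No Lucene types are used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A field as described above: a name, one or more values, and a score.
final class ScoredField {
    final String name;
    final List<String> values;
    final float score; // would become the field boost at indexing time

    ScoredField(String name, List<String> values, float score) {
        this.name = name;
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
        this.score = score;
    }
}

// Hypothetical container for the two scoring outputs, both keyed by URL.
final class ScoringOutput {
    private final Map<String, List<ScoredField>> fieldsByUrl = new LinkedHashMap<>();
    private final Map<String, Float> scoreByUrl = new LinkedHashMap<>();

    // Url → Field: a URL may accumulate several fields (an assumption).
    void addField(String url, ScoredField field) {
        fieldsByUrl.computeIfAbsent(url, k -> new ArrayList<>()).add(field);
    }

    // Url → Float: interpreted here as a score for the whole document.
    void setDocumentScore(String url, float score) {
        scoreByUrl.put(url, score);
    }

    List<ScoredField> fields(String url) {
        return fieldsByUrl.getOrDefault(url, Collections.emptyList());
    }

    // Returns null when no score has been recorded for the URL.
    Float documentScore(String url) {
        return scoreByUrl.get(url);
    }
}
```

An indexer consuming this output would only read the names, values, and scores. It would not compute or adjust them.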

Indexing

Indexing indexes Fields for a document according to the field values and boosts. Indexing does not determine either field values or boost values.