keep only the HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika, whatever the original format.
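
A minimal sketch of what such a format-independent filter API could look like. Everything here is hypothetical: `DomPostFilter`, `TitleFilter` and `DomPostFilterDemo` are illustrative names, not existing Nutch or Tika classes, and the DOM is built by hand as a stand-in for whatever the Tika-based parser would hand over.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical successor to HtmlParseFilter: sees a DOM, not the raw format. */
interface DomPostFilter {
    /** Inspects the DOM and records any extracted values in the metadata map. */
    void filter(String url, Document dom, Map<String, String> metadata);
}

/** Example filter: stores the text of the first title element, if there is one. */
final class TitleFilter implements DomPostFilter {
    @Override
    public void filter(String url, Document dom, Map<String, String> metadata) {
        NodeList titles = dom.getElementsByTagName("title");
        if (titles.getLength() > 0) {
            metadata.put("title", titles.item(0).getTextContent().trim());
        }
    }
}

final class DomPostFilterDemo {
    /** Runs every filter, in order, over the same DOM and collects the metadata. */
    static Map<String, String> apply(List<DomPostFilter> filters, String url, Document dom) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (DomPostFilter f : filters) {
            f.filter(url, dom, metadata);
        }
        return metadata;
    }

    /** Hand-built DOM (assumption: stands in for the parser's output). */
    static Document sampleDom() {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element head = dom.createElement("head");
            Element title = dom.createElement("title");
            title.setTextContent(" Hello ");
            head.appendChild(title);
            html.appendChild(head);
            dom.appendChild(html);
            return dom;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the filter only sees a DOM, the same `TitleFilter` would apply to a PDF, an Office document or an HTML page, provided the parser produces a title element for each.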

Modify the code so that the parser can generate multiple documents, which 1.x does but 2.0 does not.
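
One possible shape for this, sketched under assumptions: the parser returns a map from a derived sub-document URL to its parsed content instead of a single result. `ParsedDoc`, `MultiDocParser`, the `#n` URL suffix and the `---` separator are all invented for illustration and are not part of the Nutch API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parsed sub-document: just a title and its text. */
final class ParsedDoc {
    final String title;
    final String text;

    ParsedDoc(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

final class MultiDocParser {
    /**
     * Splits one fetched container into several documents.
     * Assumed toy format: entries are separated by a line containing "---",
     * and the first line of each entry is its title.
     */
    static Map<String, ParsedDoc> parse(String baseUrl, String content) {
        Map<String, ParsedDoc> docs = new LinkedHashMap<>();
        int index = 0;
        for (String entry : content.split("\n---\n")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty entries rather than emit blank documents
            }
            int newline = trimmed.indexOf('\n');
            String title = newline < 0 ? trimmed : trimmed.substring(0, newline);
            String text = newline < 0 ? "" : trimmed.substring(newline + 1).trim();
            // Each sub-document gets its own key derived from the container URL.
            docs.put(baseUrl + "#" + index++, new ParsedDoc(title, text));
        }
        return docs;
    }
}
```

Keying each sub-document by its own URL lets the rest of the pipeline treat it like any other page. How those keys would map onto webtable rows is left open here.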

Offload URL filtering, URL normalization, URL state management, and perhaps deduplication to [http://code.google.com/p/crawler-commons/]. We should coordinate our efforts and share code freely so that other projects (bixo, heritrix, droids) can contribute to this shared pool of functionality, much as Tika does for the common need of parsing complex formats.

Rewrite Solr deduplication: do everything using the webtable and avoid retrieving content from Solr.
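
A rough sketch of the idea, assuming each webtable row carries a content signature and a score. Duplicates are found by grouping rows on the signature, so the only thing sent to Solr would be the list of URLs to delete. `WebRow` and `WebTableDedup` are hypothetical names, and an in-memory collection stands in for the webtable; at crawl scale this would presumably run as a distributed job rather than a single loop.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical view of one webtable row, reduced to what deduplication needs. */
final class WebRow {
    final String url;
    final String signature; // content digest assumed to be computed at parse time
    final float score;

    WebRow(String url, String signature, float score) {
        this.url = url;
        this.signature = signature;
        this.score = score;
    }
}

final class WebTableDedup {
    /**
     * Returns the URLs to delete from the index: for every signature,
     * all rows except the best-scoring one. On equal scores the row
     * seen first is kept.
     */
    static List<String> duplicates(Collection<WebRow> rows) {
        Map<String, WebRow> bestBySignature = new HashMap<>();
        List<String> toDelete = new ArrayList<>();
        for (WebRow row : rows) {
            WebRow best = bestBySignature.get(row.signature);
            if (best == null) {
                bestBySignature.put(row.signature, row);
            } else if (row.score > best.score) {
                toDelete.add(best.url); // previous winner is now a duplicate
                bestBySignature.put(row.signature, row);
            } else {
                toDelete.add(row.url);
            }
        }
        Collections.sort(toDelete); // stable output order for callers
        return toDelete;
    }
}
```

Keeping the highest score is only one possible policy. Fetch time or URL length could be used as the tie-breaker instead.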

--we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.--