Julien Nioche
added a comment - 14/Feb/12 14:49

> But what about segments fetched with and without this new feature and the db.parsemeta.to.crawldb=Content-Type property?
> I assume I'd have to update the segments fetched before this change with the property enabled, and update the segments fetched with this feature without the db.parsemeta.to.crawldb property.

Yep.

Markus Jelsma
added a comment - 14/Feb/12 14:40

Splendid work my friend! The fetcher runs smoothly again! I'll check out your patch for NUTCH-1258 this week.

But what about segments fetched with and without this new feature and the db.parsemeta.to.crawldb=Content-Type property?

I assume I'd have to update the segments fetched before this change with the property enabled, and update the segments fetched with this feature without the db.parsemeta.to.crawldb property.

Markus Jelsma
added a comment - 14/Feb/12 13:56

Hey Julien, there's something wrong with this commit. We're seeing NPEs in the Fetcher without a stack trace now. The fetcher doesn't die, but the generated seed list is quickly terminated and only a few records get processed instead of millions. It looks like it's triggered when a fetch error occurs. You can reproduce it by injecting an unknown host, but it's likely to happen as well when socket timeouts and related errors are thrown.

fetch of http://idonotexist.openindex.io/ failed with: java.net.UnknownHostException: idonotexist.openindex.io
fetch of http://idonotexist.openindex.io/ failed with: java.lang.NullPointerException
fetcher caught: java.lang.NullPointerException

Can you look at it?
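
A minimal, hypothetical sketch of the kind of failure mode reported here (this is not the actual Nutch Fetcher code): an error-handling path that itself throws an NPE because the caught exception's getMessage() returns null.

```java
// Hypothetical sketch -- not the actual Nutch Fetcher code.
// Shows how handling a fetch failure can itself throw an NPE when the
// caught exception's getMessage() returns null.
public class NullMessageSketch {

    static String describeFailure(Throwable t) {
        // BUG: getMessage() may be null (e.g. for a bare NullPointerException),
        // so calling a method on it throws a new NullPointerException.
        return "failed with: " + t.getMessage().toLowerCase();
    }

    static String describeFailureSafely(Throwable t) {
        // Fall back to the exception's class name when no message is present.
        String msg = t.getMessage();
        return "failed with: " + (msg == null ? t.getClass().getName() : msg.toLowerCase());
    }

    public static void main(String[] args) {
        Throwable noMessage = new NullPointerException(); // message is null
        try {
            describeFailure(noMessage);
        } catch (NullPointerException e) {
            System.out.println("secondary NPE while handling the first failure");
        }
        System.out.println(describeFailureSafely(noMessage));
    }
}
```

The safe variant shows the usual fix for this pattern: never dereference an exception message without a null check.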

Markus Jelsma
added a comment - 10/Feb/12 16:58

NUTCH-1024 relies on the Content-Type being added to the crawldatum metadata via db.parsemeta.to.crawldb.

Anyway, I agree. Will you open another issue?

Have a nice weekend!

Julien Nioche
added a comment - 10/Feb/12 14:57

I haven't looked at NUTCH-1024. Does it take the detected value from Content, or the one from the parse metadata?

As for storing it in the CrawlDatum: that would require changing the object and its version, making sure it remains compatible, etc., so I'd rather store it in the crawldatum metadata for now. It means that it can indeed be overridden, but that is quite unlikely to happen unless you write a custom resource. Let's keep this option in mind for later, maybe.
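
As a toy illustration of the compatibility cost mentioned here (a generic sketch, not Nutch's actual CrawlDatum code): adding a field to a serialized object typically means writing a version marker so readers can still cope with records written before the change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Generic sketch of version-aware serialization -- not Nutch's CrawlDatum.
// Field names and the version scheme are purely illustrative.
public class VersionedRecordSketch {
    static final byte VERSION = 2; // bumped when the contentType field was added

    static byte[] write(int score, String contentType) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(VERSION);
        out.writeInt(score);
        out.writeUTF(contentType); // new field, present only from version 2 on
        return bos.toByteArray();
    }

    static String readContentType(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte version = in.readByte();
        in.readInt(); // score, present in all versions
        // Old records simply lack the field; fall back to null.
        return version >= 2 ? in.readUTF() : null;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readContentType(write(42, "text/html")));
    }
}
```

Storing the value in the metadata map instead, as proposed, avoids this whole versioning dance.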

Markus Jelsma
added a comment - 10/Feb/12 12:46

Sounds good! We already store the Content-Type in the CrawlDatum's metadata for NUTCH-1024 via db.parsemeta.to.crawldb. Wouldn't it be better to store it in the CrawlDatum object itself, just like the signature? Then someone cannot override it by accident.

Julien Nioche
added a comment - 09/Feb/12 16:19

Thanks for the example. Here is a summary of what is happening.

The correct MIME type guessed by Tika is stored in the Content object. This is what is then used during the parsing step to determine which parser implementation should be used. It is the value you can see displayed by the parser checker, e.g.:

fetching: http://kam.mff.cuni.cz/conferences/GraDR/
parsing: http://kam.mff.cuni.cz/conferences/GraDR/
contentType: text/html
signature: 575aecee981b1aa03a145e3dc5b4de72

This is different from the value displayed in the content metadata, which corresponds to what is returned in the protocol headers. It is also different from the value found in the parse metadata, which corresponds to what is declared in the content itself. Note that there is no guarantee that these two values can be found.

Now, the problem with https://issues.apache.org/jira/browse/NUTCH-1258 is that while the ParserFilters have access to the Content object, this is not the case for the IndexingFilters. One option would be a bespoke Parser implementation that stores the type Tika guessed as custom metadata in the Content object and then uses that in the indexing filter. That's unnecessarily messy.

I think a cleaner approach would be to store the guessed content-type in the crawldatum metadata. This way we:

- can access it from the indexing filters (the parsing filter would still get it from Content if necessary)
- do not override the value stored in the parse metadata
- can access it regardless of whether a document has been parsed or not
- have a mechanism which is independent of the actual parser used (html / tika / other)
- have the possibility of deciding differently which value should be used (guessed vs protocol vs content)
- keep a trace of why a given parser was used on a given document

This would be done in the output method of the Fetcher class.

What do you think?
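
Roughly, the proposal could look like the following sketch, where a plain Java map stands in for Nutch's CrawlDatum metadata and the key name is purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposal: keep the Tika-detected type in the
// crawldatum metadata under its own key, leaving parse metadata untouched.
// A Map stands in for Nutch's CrawlDatum metadata; the key is hypothetical.
public class DatumContentTypeSketch {
    static final String DETECTED_CONTENT_TYPE_KEY = "_detectedContentType_"; // hypothetical key

    // Would be called from something like Fetcher.output(): detectedType is
    // the value Tika guessed and stored in the Content object.
    static void recordDetectedType(Map<String, String> datumMeta, String detectedType) {
        datumMeta.put(DETECTED_CONTENT_TYPE_KEY, detectedType);
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new HashMap<>();
        recordDetectedType(datumMeta, "text/html");
        // An indexing filter can now read the guessed type without access to Content:
        System.out.println(datumMeta.get(DETECTED_CONTENT_TYPE_KEY));
    }
}
```

Because the value lives under its own key, the protocol-header type in the content metadata and any type in the parse metadata are left intact for comparison.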

Markus Jelsma
added a comment - 09/Feb/12 14:46

Hi,

Consider the following URL, which produces bad output. It is not the only one; we've seen countless examples that produce funky values in both content meta and parse meta, or no value at all.

http://kam.mff.cuni.cz/conferences/GraDR/

The current Nutch trunk shows us the following metadata for this URL, obtained via parsechecker with only parse-tika enabled:

Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 14:37:47 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip Content-Location=index.html.bak Content-Type=application/x-trash Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g
Parse Metadata: Content-Encoding=ISO-8859-1

According to the content meta it's application/x-trash, and no data is available in the parse meta. But it's just an ordinary HTML page, so that cannot be true; from an indexing point of view we would never know that this is an HTML page. With this patch enabled we get the following output:

Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 14:40:15 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip Content-Location=index.html.bak Content-Type=application/x-trash Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g
Parse Metadata: Content-Encoding=ISO-8859-1 Content-Type=text/html

For us, this solves all problems: we now rely only on Tika's MIME detector and store its result in the parse meta. The value in the content meta cannot be trusted. It's the same as with languages: when we do not use Tika to detect the language we get all sorts of crap.

Since the upgrade to Tika 1.0 and with NUTCH-1230 we obtain the detected MIME type, but it wasn't added to the parse meta. Now it is.

Do you have another suggestion?
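
The principle is easy to demonstrate with the JDK's own (much simpler) content sniffer standing in for Tika's detector: the type is derived from the bytes themselves rather than from the untrusted protocol header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Illustration of content-based detection vs. a bogus protocol header.
// The JDK's URLConnection.guessContentTypeFromStream is a crude stand-in
// for Tika's detector, used here only because it is in the standard library.
public class SniffVsHeaderSketch {

    static String sniff(byte[] body) throws IOException {
        // ByteArrayInputStream supports mark/reset, which the sniffer requires.
        return URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(body));
    }

    public static void main(String[] args) throws IOException {
        String headerType = "application/x-trash"; // what the server claimed
        byte[] body = "<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("header says:  " + headerType);
        System.out.println("content says: " + sniff(body));
    }
}
```

Tika's detector goes far beyond this (magic bytes, globs, declared types), but the idea is the same: the bytes, not the header, decide.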

Julien Nioche
added a comment - 09/Feb/12 11:48

> // DO NOT ADD Content-Type FROM HTTP_HEADERS, ONLY ADD THE DETECTED TYPE SEE https://issues.apache.org/jira/browse/NUTCH-1259

Hmmm, isn't that the content-type from the HTML headers instead?

Anyway, it's probably a good idea NOT to add it to the parse metadata, as it has already been detected from the content and stored in the content metadata; I can't think of a reason why we'd want to duplicate it in the parse metadata as well. The value in the content metadata is the one set by the detector and should be the correct one. Or am I missing something?

Lewis John McGibbney
added a comment - 07/Feb/12 18:35

Hey Markus. I'm literally up to my eyeballs with stuff at the moment, so sorry for not having the time to look through your work. The best I can do is have a look tomorrow; I'll give it my all then. Thanks.

Markus Jelsma
added a comment - 07/Feb/12 15:46

You're right. But since you're most of the time the only person reviewing, and this issue has your attention now, what is your opinion on this problem?

Julien Nioche
added a comment - 07/Feb/12 15:36

I'll commit this one tomorrow unless there are objections.

Markus, I understand that you may be frustrated with having your issues not reviewed as quickly as you'd wish, but it would be nice to have a bit more notice. There aren't many active committers in the project, and I can't follow the pace at which you submit patches.

Markus Jelsma
added a comment - 07/Feb/12 15:26

Here's a patch for 1.5. Comments? We have this running in production and it works very well. It completely solves the big problem of ending up with many thousands of crap content-types.

Markus Jelsma
added a comment - 07/Feb/12 15:23

I'll comment on it myself then: the code above fixes the issue and adds a proper content-type to the parse meta. Consider the following URL with a very bad content-type:

http://kam.mff.cuni.cz/conferences/GraDR/

I'll upload a patch in a minute that sets the detected content type in the metadata instead.

Markus Jelsma
added a comment - 25/Jan/12 13:37

A solution would be to prevent the type from being added, just like what is already done with the title field. Now a reliable Content-Type value is added to the parse metadata:
// populate Nutch metadata with Tika metadata
String[] tikaMDNames = tikamd.names();
for (String tikaMDName : tikaMDNames) {
  if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
    continue;
  // DO NOT ADD Content-Type FROM HTTP_HEADERS, ONLY ADD THE DETECTED TYPE, SEE https://issues.apache.org/jira/browse/NUTCH-1259
  if (tikaMDName.equalsIgnoreCase(Metadata.CONTENT_TYPE))
    continue;
  // TODO what if multivalued?
  nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
}
// only add the detected type
nutchMetadata.add("Content-Type", mimeType);
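
The same filtering logic can be exercised in isolation; here is a self-contained rendering of the loop with plain Java maps standing in for Tika's and Nutch's Metadata classes (constant names are illustrative, and a Map drops the multivalued behaviour of the real Metadata class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Self-contained rendering of the copy loop above. Plain Maps stand in for
// Tika's and Nutch's Metadata classes; constant names are illustrative.
public class MetadataCopySketch {
    static final String TITLE = "title";
    static final String CONTENT_TYPE = "Content-Type";

    static Map<String, String> copyTikaToNutch(Map<String, String> tikaMd, String detectedMimeType) {
        Map<String, String> nutchMd = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tikaMd.entrySet()) {
            String name = e.getKey();
            if (name.equalsIgnoreCase(TITLE))
                continue; // title is handled elsewhere
            if (name.equalsIgnoreCase(CONTENT_TYPE))
                continue; // skip the header-derived type (NUTCH-1259)
            nutchMd.put(name, e.getValue());
        }
        // only add the detected type
        nutchMd.put(CONTENT_TYPE, detectedMimeType);
        return nutchMd;
    }

    public static void main(String[] args) {
        Map<String, String> tikaMd = new LinkedHashMap<>();
        tikaMd.put("Content-Type", "application/x-trash"); // untrusted header value
        tikaMd.put("Content-Encoding", "ISO-8859-1");
        // The detected type replaces the header value; everything else is copied:
        System.out.println(copyTikaToNutch(tikaMd, "text/html"));
    }
}
```

This mirrors the behaviour of the patch: the header-declared application/x-trash never reaches the parse metadata, while the Tika-detected text/html does.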