HTML stripper is splitting tokens

Description

The Solr HTML stripper replaces any removed HTML with whitespace. This is done to keep offsets correct for highlighting.

However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially international).

The current code is actually deficient for BOTH highlighting and indexing, whereas the previous incarnation (which did not insert spaces) only had problems with highlighting.

The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most. If the client is required to transform entities before handing text to Solr, it might as well be required to strip tags too, and then the HTML stripper would not be needed at all.

Today, a better solution is available: offset correction. With it, we can avoid inserting extra whitespace and still get correct offsets. The attached patch implements just that.
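To make the failure mode concrete, here is a minimal sketch (in Python purely for illustration; the actual filter is Java, and the function name here is hypothetical). Decoding an entity and padding with spaces to preserve offsets splits the surrounding token:

```python
import html
import re

def strip_with_padding(text):
    """Sketch of the old behaviour: decode each entity, then pad with
    spaces so that output offsets line up with input offsets."""
    def repl(m):
        decoded = html.unescape(m.group())
        return decoded + " " * (len(m.group()) - len(decoded))
    return re.sub(r"&#?\w+;", repl, text)

s = strip_with_padding("h&eacute;llo")
print(repr(s))    # the 8-char entity became 'é' plus 7 spaces
print(s.split())  # ['hé', 'llo'] -- one word, two tokens
```

With offset correction, by contrast, the stripper can emit just "héllo" and translate token offsets back to the original input only when highlighting needs them.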

Yonik Seeley
added a comment - 16/Oct/09 23:25 Now you've got me in a better mood, so I can fix that error if you like?
Didn't see your message before the commit - yes, a patch would be great! Could possibly make it into 1.4 still if you're quick

Anders Melchiorsen
added a comment - 16/Oct/09 23:17 Thanks, that sounds great.
There is an existing off-by-one error in the numWhitespace calculation with hexadecimal numeric entities.
I noticed it while reworking the patch, but did not bother to report it here because I was annoyed at being ignored. Now you've got me in a better mood, so I can fix that error if you like?

Yonik Seeley
added a comment - 16/Oct/09 22:58 I've been testing this with a bunch of different HTML, and I don't see any places where this is worse, and it prevents splitting of tokens when it shouldn't.
Given that the splitting is clearly a bug, and that changes to this filter won't affect the rest of Solr, I plan on committing this shortly.
Things still aren't perfect as far as offsets and highlighting, but this patch makes it no worse.
I modified the solr.xml document to escape the '&' and then added the strip char filter to the text field.
The query was héllo OR hello OR unicode
Before this patch: Good <em>unicode</em> support: héllo <em>(hell</em>o with an accent over the e)
After this patch: Good <em>unicode</em> support: <em>héll</em>o <em>(hell</em>o with an accent over the e)


Jason Rutherglen
added a comment - 23/Sep/09 01:54 Anders,
We're seeing the error again, we're going to try this patch and
HTMLStripReader and we'll see what happens. Here's the latest
stacktrace, which is pretty much the same as the last one:
SEVERE: java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:170)
at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:727)
at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:741)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:174)
at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:50)
at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116)
at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401)
at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:80)
at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:316)
at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:224)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:772)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:755)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2611)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2583)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
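For context on this exception: java.io.BufferedReader.reset() throws "Mark invalid" when more characters have been read since mark(readAheadLimit) than the limit allowed, so the stripper's restoreState() fails whenever its look-ahead overruns the mark it set. A toy model of that contract (a Python sketch with hypothetical names, not the actual classes):

```python
class MarkedReader:
    """Toy model of java.io.BufferedReader's mark/reset contract:
    reset() fails once more than `limit` chars were read after mark()."""
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.mark_pos, self.limit = None, 0

    def mark(self, read_ahead_limit):
        self.mark_pos, self.limit = self.pos, read_ahead_limit

    def read(self):
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def reset(self):
        # Mark is invalid if we read further ahead than the limit allowed.
        if self.mark_pos is None or self.pos - self.mark_pos > self.limit:
            raise IOError("Mark invalid")
        self.pos = self.mark_pos

r = MarkedReader("<looks-like-a-tag but is very long................>")
r.mark(8)            # stripper marks before probing a possible tag
for _ in range(20):  # but then reads further ahead than the limit
    r.read()
try:
    r.reset()
except IOError as e:
    print(e)  # Mark invalid
```

This matches the symptom in the trace: long, tag-like input makes the stripper read past its own read-ahead limit before deciding to restore state.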

Anders Melchiorsen
added a comment - 19/Sep/09 19:39 To me (the author), things are much better with the patch, and I have seen no ill effects. The exception that Jason reported happens without the patch.
My problem is that I cannot really vouch for the patch, as I have no previous experience with the Solr code. So, it would be really nice if someone with the experience could take fifteen minutes to review the three tiny modifications.
When replacing read() with next() in one of the patches, I noted that I was not sure why it worked better. I later figured out that read() is the external interface, so it should (probably?) not be used internally by the stripper.
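The offset-correction idea the patch is built on can be sketched as follows (illustrative Python with hypothetical names, not the patch itself): record a cumulative delta at each point where characters were dropped, and map an output offset back to an input offset on demand:

```python
import bisect
import html
import re

def strip_entities_corrected(text):
    """Decode entities without padding, recording (output_pos, delta)
    pairs so output offsets can be mapped back to input offsets."""
    out, deltas, pos, delta = [], [(0, 0)], 0, 0
    for m in re.finditer(r"&#?\w+;", text):
        out.append(text[pos:m.start()])
        decoded = html.unescape(m.group())
        out.append(decoded)
        delta += len(m.group()) - len(decoded)
        deltas.append((sum(len(s) for s in out), delta))
        pos = m.end()
    out.append(text[pos:])
    return "".join(out), deltas

def correct(offset, deltas):
    """Map an offset in the stripped output to the original input."""
    i = bisect.bisect_right(deltas, (offset, float("inf"))) - 1
    return offset + deltas[i][1]

stripped, deltas = strip_entities_corrected("h&eacute;llo")
print(stripped)            # héllo  -- no extra whitespace, token intact
print(correct(5, deltas))  # 12 -- end of token maps to end of raw input
```

This is essentially what Lucene's char-filter offset correction does: the tokenizer sees clean text, and highlighting translates token offsets back through the recorded deltas.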

Yonik Seeley
added a comment - 19/Sep/09 17:31 What's clear is that the HTML stripper still has problems; less clear to me is whether this patch as it currently exists is better than what's in trunk. If people think it is, we could commit it quickly for 1.4

Anders Melchiorsen
added a comment - 15/Sep/09 09:22 Jason, did you get that exception with (and not without) my patch? You said that you had other problems as well, so I just want to make sure that it is a problem with the patch ...

Jason Rutherglen
added a comment - 11/Sep/09 01:13 Here's the exception:
Caused by: java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
at org.apache.lucene.analysis.CharReader.read(CharReader.java:51)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:451)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:637)
at org.apache.lucene.analysis.standard.StandardTokenizer.next(StandardTokenizer.java:198)
at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:108)
at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:245)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:162)

Anders Melchiorsen
added a comment - 29/Aug/09 15:31 Proof-of-concept fixes. I don't expect them to be perfect.
There is one test that still breaks. I think the test does not make sense with my changes, but I did not remove it.
Also, the new functionality (offset correction) is not tested; I don't know how to test it.