In that discussion, Grant asked the original poster to open a Jira issue, but I didn't see one, so I'm opening this one; please feel free to merge or close it if it's redundant.

My stack trace follows.

Jul 15, 2009 8:36:42 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={} status=500 QTime=3
Jul 15, 2009 8:36:42 AM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
at java.io.Reader.read(Reader.java:123)
at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:108)
at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:178)
at org.apache.lucene.analysis.standard.StandardFilter.next(StandardFilter.java:84)
at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:53)
at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:347)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:159)
at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:748)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2512)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2484)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:240)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:140)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1292)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

Activity


Steve Rowe
added a comment - 22/Jan/12 04:54

The below-listed exception, which appears to be the same as the one in other reports on this issue, is triggered when HTMLStripCharFilter processes the ClueWeb09 documents with TREC-IDs clueweb09-en0000-00-14171, clueweb09-en0000-00-14228, clueweb09-en0000-00-14235, clueweb09-en0000-00-14240, clueweb09-en0000-00-14248, and clueweb09-en0000-00-14265:

java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)

Once LUCENE-3690 has been committed, this will only affect the (deprecated) old implementation, which will be renamed to LegacyHTMLStripCharFilter.

Yonik Seeley
added a comment - 26/Jan/11 22:05

Since it looks like the primary use of numRead is in relation to mark() and reset() on the underlying stream, it does look like #1 is the correct interpretation (i.e. the patch looks correct).

Hoss Man
added a comment - 06/Jul/10 20:53

As I mentioned in IRC (prior to Grant's previously posted comments), the core issue is: what is the intended purpose of the "numRead" counter?

If it's supposed to count the number of times "input.read()" is called (i.e. "num read from the inner stream"), then "peek" has a bug by not incrementing it.

If it's supposed to count the number of times "next()" returns a char (i.e. "num read from the outer stream"), then as Grant mentioned, "next" has a bug by not incrementing it when using the stack.

The patch currently assumes the former and seems to fix the bug. I haven't tried the same test case with an approach based on the latter, but I suspect that may also work.
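The two interpretations above can be sketched in miniature. The following is a hypothetical illustration, not the actual HTMLStripReader (which wraps a Reader; a String stands in for it here), with interpretation #1 applied: peek() counts the character it pulls from the underlying input, while next() does not count characters replayed from the pushed-back stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical miniature of HTMLStripReader's read machinery (a String stands
// in for the wrapped Reader). Under interpretation #1, numRead counts every
// character pulled from the underlying input, so peek() increments it; chars
// replayed from the pushed-back stack are not counted again.
class StripSketch {
    private final String input;           // stands in for the wrapped Reader
    private int pos = 0;
    private final Deque<Integer> pushed = new ArrayDeque<>();
    int numRead = 0;                      // "num read from inner stream"

    StripSketch(String input) { this.input = input; }

    /** Returns the next char, preferring pushed-back chars; -1 at end. */
    int next() {
        if (!pushed.isEmpty()) return pushed.pop();  // replay: no new read
        if (pos >= input.length()) return -1;
        numRead++;                                   // counted: a real read
        return input.charAt(pos++);
    }

    /** Looks at the upcoming char without consuming it. It really does pull
     *  a char from the underlying input, so it must count it (the fix). */
    int peek() {
        if (!pushed.isEmpty()) return pushed.peek();
        if (pos >= input.length()) return -1;
        numRead++;                                   // the missing increment
        int ch = input.charAt(pos++);
        pushed.push(ch);
        return ch;
    }
}
```

With the increment in place, numRead tracks exactly what was consumed from the input: peeking at a character counts it once, and the subsequent next() replays it from the stack without counting it again, so any mark/reset bookkeeping based on numRead stays accurate.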

Grant Ingersoll
added a comment - 03/Jul/10 02:40

From IRC:
I wonder if the issue isn't that in next()
[21:35] gsingers: if it gets something off the stack (pushed) it doesn't increment numRead
[21:37] gsingers: but, I guess one could argue that numRead should track exactly what is read off the InputStream
[21:38] gsingers: and in that case, peek is still doing a read
[21:38] gsingers: so it should inc. it
[21:38] gsingers: I suppose the only harm in more aggressively incrementing it is that you don't hold as much in buffer as you could otherwise

Hoss Man
added a comment - 03/Jul/10 00:45

We have a patch that seems to work, so we should definitely try to get this into the next release ... I'm hoping someone more familiar with the code can sanity-check it soon.

Hoss Man
added a comment - 03/Jul/10 00:41

Updates the patch to trunk (where the charfilter code has been refactored into the new top-level "modules" directory).

I'm not familiar with the HTMLStripCharFilter code, so I can't say whether the "fix" is correct (no idea whether "peek" should be incrementing that counter – that's why even private methods should have javadocs), but the test certainly looks valid to me.

Julien Coloos
added a comment - 26/Jan/10 16:37

The issue is also happening in current trunk (revision 903234), with the class HTMLStripCharFilter (which seems to have replaced the deprecated HTMLStripReader).

Example of stacktrace:

26 janv. 2010 16:02:56 org.apache.solr.common.SolrException log
GRAVE: java.io.IOException: Mark invalid
at java.io.BufferedReader.reset(BufferedReader.java:485)
at org.apache.lucene.analysis.CharReader.reset(CharReader.java:63)
at org.apache.solr.analysis.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:172)
at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
at java.io.Reader.read(Reader.java:122)
at org.apache.lucene.analysis.CharTokenizer.incrementToken(CharTokenizer.java:77)
at org.apache.lucene.analysis.ISOLatin1AccentFilter.incrementToken(ISOLatin1AccentFilter.java:43)
at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:383)
at org.apache.lucene.analysis.ISOLatin1AccentFilter.next(ISOLatin1AccentFilter.java:64)
at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:379)
at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:318)
at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:38)
at org.apache.solr.analysis.SnowballPorterFilter.incrementToken(SnowballPorterFilterFactory.java:116)
at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:406)
at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:97)
at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:83)
at org.apache.lucene.analysis.TokenStream.incrementToken(TokenStream.java:321)
at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:138)
at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:244)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:781)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:764)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2630)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2602)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:241)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1317)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:723)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

After a quick code review, it seems this one is due to the peek function, which can read a character from the input stream without incrementing the numRead variable (as the next function does); the functions that check whether the read-ahead limit has been reached rely on numRead.
The exception can then be triggered when reading exceeds the read-ahead limit, for example with a big document containing a malformed processing instruction like

<?> ?????
... (anything except '?>')

Note: the issue is triggered here because readProcessingInstruction calls peek whenever the character '?' is found (to check whether it is followed by '>').

You will find attached a patch to fix the issue, as well as an updated JUnit test (which currently only checks for the malformed processing instruction; maybe you will find a more general test to run against the next/peek functions).
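The diagnosis above boils down to the BufferedReader mark()/reset() contract: once more characters are consumed past mark() than the read-ahead limit allows, the mark is invalidated and reset() fails with exactly this exception. A minimal, self-contained reproduction (the constants are shrunk for readability; the real DEFAULT_READ_AHEAD is 8192):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Minimal reproduction of the failure mode: reading past the mark()
// read-ahead limit makes BufferedReader invalidate the mark, so the next
// reset() throws IOException("Mark invalid"). This is what happens when
// peek() consumes input without numRead being incremented: the filter
// under-counts, reads past the limit, and then tries to reset().
public class MarkInvalidDemo {

    /** Marks the stream, reads numReads chars, then resets.
     *  Returns null on success, or the exception message on failure. */
    static String resetMessage(int readAheadLimit, int numReads) {
        try {
            BufferedReader in =
                new BufferedReader(new StringReader("<?> ??????????????????"), 4);
            in.mark(readAheadLimit);
            for (int i = 0; i < numReads; i++) {
                in.read();
            }
            in.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(resetMessage(8, 8));  // within the limit: null
        System.out.println(resetMessage(8, 9));  // one read too many: Mark invalid
    }
}
```

One extra read past the limit is all it takes, which is why an uncounted peek() eventually pushes the filter over the edge on large documents.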


David Bowen
added a comment - 27/Oct/09 17:29

It seems to me that the code should bail out and just assume that a "<" did not begin an HTML tag if it still isn't sure after reading DEFAULT_READ_AHEAD (8,192) characters. It looks like the code was intended to do that (see the checks against safeReadAheadLimit) but must be missing some case.
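The bail-out described above can be sketched roughly like this. The names echo HTMLStripReader's safeReadAheadLimit and DEFAULT_READ_AHEAD, but this is a hypothetical illustration rather than the actual code, and it assumes the leading '<' has already been consumed: cap the speculative reads below the mark limit so that reset() is always still legal when we give up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch of the bail-out (names echo HTMLStripReader's, but this
// is not the real implementation): while speculatively parsing a possible tag,
// count chars read since mark() and give up, restoring the marked position,
// before BufferedReader can invalidate the mark.
class ReadAheadGuard {
    static final int DEFAULT_READ_AHEAD = 8192;
    // Leave headroom so reset() is still legal when we bail out.
    static final int SAFE_READ_AHEAD_LIMIT = DEFAULT_READ_AHEAD - 3;

    /** Returns the tag body if "<...>" closes within the safe limit;
     *  otherwise resets to the mark and returns null, so the caller can
     *  emit '<' as a literal character instead. */
    static String readTag(BufferedReader in) {
        try {
            in.mark(DEFAULT_READ_AHEAD);
            StringBuilder sb = new StringBuilder();
            int consumed = 0;
            int ch;
            while ((ch = in.read()) >= 0) {
                if (++consumed >= SAFE_READ_AHEAD_LIMIT) break;  // bail out
                if (ch == '>') return sb.toString();
                sb.append((char) ch);
            }
            in.reset();  // safe: we never read past the limit
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The key property is that every path that gives up calls reset() strictly before the read-ahead limit is exceeded, so the "Mark invalid" exception can never be reached from here.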

solrize
added a comment - 23/Jul/09 23:14

I now have a workaround. The documents I'm indexing don't actually have HTML in them, but the schema was set up to use HTMLStripReader anyway. I switched to the standard analyzer and the problem went away, and indexing also seems to be running faster than before. I do still think the issue needs fixing, since I'm sure some people use Solr to index large web pages which do need HTML stripping. Anyway, thanks to Erik H. for advice about this.

solrize
added a comment - 22/Jul/09 18:33

Is the buffer size the parameter DEFAULT_READ_AHEAD (set to 8192) in HTMLStripReader.java? Should I set it to the same value as maxFieldLength from solrconfig.xml? That would let it hold the entire document. I currently have that config parameter set to 10000000 (10 MB).

Thanks

solrize
added a comment - 16/Jul/09 23:20

Right now I'm getting a ton of these errors. It doesn't seem strictly dependent on the doc size. If I can crank up the buffer size enough that the error happens only occasionally instead of frequently, that would be a big improvement over the present situation. Thanks!

Grant Ingersoll
added a comment - 16/Jul/09 21:10

We should make the buffer size configurable, I guess. However, there's always the potential to go past it, or to use up a lot of memory in the meantime (if one is expecting really large files).