Analysis.jsp and AnalysisRequestHandlerBase do not correctly clear attributes on caching tokens

Description

When caching tokens, the helper TokenStreams in analysis.jsp and AnalysisRequestHandlerBase do not clear all attributes.
The issue is subtle: tokens cached in early stages of the analysis chain do not contain every attribute, so copyTo() does not necessarily overwrite all attributes in "this". Calling clearAttributes() before copyTo() ensures that no stale attribute values are left behind.
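
To illustrate the mechanism, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: ToyAttributeSource, CachedTokenReplayDemo, and the attribute names "term" and "keyword" are all hypothetical stand-ins, and the real helper streams in analysis.jsp may be structured differently. It only assumes what is described above: a cached token carries fewer attributes than the stream it is copied into, and a later filter sets a flag without ever clearing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an attribute source. This is NOT Lucene's AttributeSource API.
final class ToyAttributeSource {
    private final Map<String, Object> defaults = new LinkedHashMap<>();
    private final Map<String, Object> values = new LinkedHashMap<>();

    // Registers an attribute together with its cleared (default) value.
    void addAttribute(String name, Object defaultValue) {
        defaults.putIfAbsent(name, defaultValue);
        values.putIfAbsent(name, defaultValue);
    }

    void set(String name, Object value) {
        if (!values.containsKey(name)) {
            throw new IllegalArgumentException("unknown attribute: " + name);
        }
        values.put(name, value);
    }

    Object get(String name) {
        return values.get(name);
    }

    // Resets every registered attribute to its default.
    void clearAttributes() {
        values.putAll(defaults);
    }

    // Copies only the attributes this source has; anything else in target keeps its old value.
    void copyTo(ToyAttributeSource target) {
        for (Map.Entry<String, Object> e : values.entrySet()) {
            target.set(e.getKey(), e.getValue());
        }
    }
}

final class CachedTokenReplayDemo {
    // A token cached at an early stage: it only knows about "term".
    static ToyAttributeSource cachedToken(String term) {
        ToyAttributeSource token = new ToyAttributeSource();
        token.addAttribute("term", "");
        token.set("term", term);
        return token;
    }

    // Replays two cached tokens into a stream that has since gained a "keyword"
    // attribute, and returns the keyword flag observed for each token.
    static List<Boolean> replay(boolean clearFirst) {
        ToyAttributeSource stream = new ToyAttributeSource();
        stream.addAttribute("term", "");
        stream.addAttribute("keyword", Boolean.FALSE);

        List<ToyAttributeSource> cache =
                Arrays.asList(cachedToken("dontstems"), cachedToken("bees"));
        List<Boolean> seen = new ArrayList<>();
        for (ToyAttributeSource token : cache) {
            if (clearFirst) {
                stream.clearAttributes(); // the missing call
            }
            token.copyTo(stream); // overwrites "term" only
            // Hypothetical downstream filter: sets the flag, never clears it.
            if ("dontstems".equals(stream.get("term"))) {
                stream.set("keyword", Boolean.TRUE);
            }
            seen.add((Boolean) stream.get("keyword"));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // [true, true]  : flag leaks onto "bees"
        System.out.println(replay(true));  // [true, false] : flag reset per token
    }
}
```

Without the clear, the second token inherits the keyword flag from the first, which matches the symptom seen in analysis.jsp.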

Was: LUCENE-2901 broke protected words by only setting and never clearing (that change should have been accompanied by offsetting code to clear the attribute somewhere).

The problem here was that this attribute was added later in the analysis chain, so cached tokens don't include it. Sorry, that was my fault when rewriting analysis.jsp together with Robert.

Uwe Schindler
added a comment - 20/Apr/11 19:47 Yonik, clearing should be done by the Tokenizer or other token producers (if a filter inserts Tokens, it also has to clear attributes). If the Tokenizer does not clear all attributes using clearAttributes(), it is broken. But this one is not.
Can you post the config of your Tokenizers and Filters, or say which Analyzer is affected?

Robert Muir
added a comment - 20/Apr/11 19:49 I see the problem with the example config: simply enter "dontstems foo".
But we need to figure out:
Is it only a bug in analysis.jsp?
If not, who isn't clearing attributes?

Robert Muir
added a comment - 20/Apr/11 19:54 This is just a bug in analysis.jsp: compare the query debug output for "dontstems bees" to the analysis.jsp output for "dontstems bees", and you will see what I mean.
There is nothing wrong with the Lucene filter here!

Yonik Seeley
added a comment - 20/Apr/11 19:55 "clearing should be done by the Tokenizer or other token-producers"
Whew... ok, I didn't realize that pre-dated LUCENE-2901.
So perhaps this is just an analysis.jsp bug, since a query of
http://localhost:8983/solr/select?q= "dontstems hellos"&debugQuery=true
seems to work fine and produce "dontstems hello"

Robert Muir
added a comment - 20/Apr/11 19:58 I created an issue for analysis.jsp: SOLR-2473.
The problem is likely to cause a lot of confusion, and as Uwe said, we should check the similar AnalysisRequestHandler too.

Uwe Schindler
added a comment - 20/Apr/11 20:05 I found the bug:
The problem is in analysis.jsp (3.x version), line 227: there should be a clearAttributes() call first.
The reason: in early stages, cached tokens don't have the KeywordAttribute, so the following copyTo() does not overwrite all attributes.
The same applies to AnalysisRequestHandlerBase. The deprecated AnalysisRequestHandler does not have this problem, as it does not debug all filters and caches no tokens.
Patch is coming...