Details

Description

Because of some backwards-compatibility problems (LUCENE-1906) we changed the CharStream/CharFilter API a little bit. Tokenizer now only has an input field of type java.io.Reader (as before the CharStream code). To correct offsets, it is now necessary to call the Tokenizer.correctOffset(int) method, which delegates to the CharStream (if input is a subclass of CharStream) and otherwise returns the offset uncorrected. Normally it is enough to change all occurrences of input.correctOffset() to this.correctOffset() in Tokenizers. It should also be checked whether custom Tokenizers in Solr correct their offsets.
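The delegation described here can be sketched in plain Java. CharStream and Tokenizer below are simplified stand-ins for the real Lucene classes, reduced to just the offset-correction logic; this is a sketch of the described behaviour, not the actual API:

```java
import java.io.Reader;

// Simplified stand-ins for Lucene's CharStream and Tokenizer, reduced to the
// offset-correction behaviour described above; a sketch, not the real API.
abstract class CharStream extends Reader {
    public abstract int correctOffset(int currentOff);
}

abstract class Tokenizer {
    protected Reader input;

    protected Tokenizer(Reader input) {
        this.input = input;
    }

    // Delegates to the CharStream if the input is one; otherwise the
    // offset is returned uncorrected.
    protected final int correctOffset(int currentOff) {
        return (input instanceof CharStream)
                ? ((CharStream) input).correctOffset(currentOff)
                : currentOff;
    }
}
```

A Tokenizer subclass then calls this.correctOffset(...) when setting token offsets, rather than calling input.correctOffset() directly.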


Koji Sekiguchi
added a comment - 14/Sep/09 02:25 I thought I would call tokenizer.correctOffset() in the newToken() method, but I couldn't because the method is protected. In this patch, I converted the anonymous Tokenizer class to PatternTokenizer, and PatternTokenizer has the following:
public int correct( int currentOffset ){
  return correctOffset( currentOffset );
}


Uwe Schindler
added a comment - 14/Sep/09 06:02 I have seen this PatternTokenizer, too. The method is protected, as correctOffset should only be called from inside the Tokenizer (e.g. in incrementToken), never from the outside. Why doesn't PatternTokenizer have the methods newToken and so on in its own class? (By the way, this should be updated to the new TokenStream API.)


Uwe Schindler
added a comment - 14/Sep/09 06:41 I searched for setOffset() in the Solr source code and found one additional occurrence of it without offset correction, in FieldType.java. This patch fixes that.
I will provide a completely streaming PatternTokenizer not using ArrayLists soon. It will move the while(matcher.find()) loop into the incrementToken() enumeration and will also use the new TokenStream API.
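The streaming approach described here can be sketched with java.util.regex alone. StreamingPatternTokenizer below is a hypothetical, stripped-down illustration of pulling one match per incrementToken() call rather than collecting all matches into an ArrayList first; it mirrors the discussion but is not the real Lucene/Solr class:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the streaming approach: each incrementToken() call
// pulls the next regex match instead of materializing all matches up front.
class StreamingPatternTokenizer {
    private final Matcher matcher;
    private String term;
    private int startOffset;
    private int endOffset;

    StreamingPatternTokenizer(Pattern pattern, String input) {
        this.matcher = pattern.matcher(input);
    }

    // Returns false when no tokens remain (like TokenStream.incrementToken()).
    boolean incrementToken() {
        if (!matcher.find()) {
            return false;
        }
        term = matcher.group();
        startOffset = matcher.start();
        endOffset = matcher.end();
        return true;
    }

    String term() { return term; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
}
```

Because the Matcher is advanced lazily, no per-document token list is ever built; memory use is constant in the number of tokens.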


Uwe Schindler
added a comment - 14/Sep/09 09:53 This is a complete, more efficient rewrite of the whole Tokenizer (I would like to put this into Lucene contrib, too!) using the new TokenStream API.
When going through the code, I realized the following: this Tokenizer can return empty tokens; it only filters empty tokens in split() mode. Is this expected? If empty tokens should be omitted, the if (matcher.find()) should be replaced by while (matcher.find()) with if (matcher.group().length() == 0) continue; after this change, the strange omit-empty-token-at-the-end logic gets very simple.
This patch removes the split()/group() methods from the factory entirely, as they are no longer needed. If this is a backwards break, replace them with unused dummies (e.g. initialize a Tokenizer and return the token's term text).
In my opinion, one should never index empty tokens...
A second thing: Lucene has a new BaseTokenStreamTestCase class for checking tokens without Token instances (which would no longer work when Lucene 3.0 switches to Attributes only). Maybe you should update these tests and use assertAnalyzesTo from the new base class instead.
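The suggested loop change can be sketched with plain java.util.regex; the tokens() helper below is purely illustrative, not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipEmptyMatches {
    // Illustrative helper: iterate over all matches (while, not if) and skip
    // zero-length ones, so empty tokens are never emitted.
    static List<String> tokens(String input, Pattern pattern, int group) {
        List<String> out = new ArrayList<String>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {                   // was: if (matcher.find())
            if (matcher.end(group) == matcher.start(group)) {
                continue;                          // drop the empty token
            }
            out.add(matcher.group(group));
        }
        return out;
    }
}
```

With the zero-length check inside the loop, no special end-of-input handling for trailing empty tokens is needed at all.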


Koji Sekiguchi
added a comment - 14/Sep/09 23:32 The patch is Uwe's one, with the split()/group() methods replaced.

Why doesn't PatternTokenizer have the methods newToken and so on in its own class?

Yeah, I'd realized it immediately after posting the patch, but I was going to be out. And thank you for adapting it to the new TokenStream API.

I searched for setOffset() in the Solr source code and found one additional occurrence of it without offset correction, in FieldType.java. This patch fixes that.

Good catch, Uwe! I had missed it.

I think the empty tokens are a bug and should be omitted in this patch.

A second thing: Lucene has a new BaseTokenStreamTestCase class for checking tokens without Token instances (which would no longer work when Lucene 3.0 switches to Attributes only). Maybe you should update these tests and use assertAnalyzesTo from the new base class instead.

Very nice! Can you open a separate ticket?


Uwe Schindler
added a comment - 15/Sep/09 07:57 I think the empty tokens are a bug and should be omitted in this patch.

The Javadocs say that it works like String.split(), which returns empty tokens but strips empty tokens at the end of the string. Solr provided this functionality before, and this patch preserves it.
The code would get simpler if the Tokenizer stripped empty tokens in general, but that is a backwards break. I would tend to just commit and then open another issue.

Very nice! Can you open a separate ticket?

Will open one about Lucene's BaseTokenStreamTestCase.
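The String.split() behaviour referred to here is easy to verify with the JDK alone: interior empty tokens survive, trailing ones are stripped:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // split() keeps the interior empty token but strips the trailing ones
        String[] parts = "a,,b,,".split(",");
        System.out.println(parts.length);  // 3 -> {"a", "", "b"}
        for (String p : parts) {
            System.out.println("[" + p + "]");
        }
    }
}
```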


Uwe Schindler
added a comment - 15/Sep/09 08:58 Some refactoring (I moved the PatternTokenizer to its own class, like PatternReplaceFilter). This patch is functionally identical to current trunk, but more efficient; it uses the new TokenStream API and implements end() (which sets the offset to the end of the string).
I will soon post a patch, which removes empty tokens.


Uwe Schindler
added a comment - 15/Sep/09 10:21 This is a patch that fixes the empty tokens:
This Tokenizer is not backwards compatible, as it only returns non-zero-length tokens. Maybe we should have a switch somewhere to change this behaviour. It is currently for discussion only.

Yonik Seeley
added a comment - 15/Sep/09 14:25 This Tokenizer is not backwards compatible, as it only returns non-zero-length tokens. Maybe we should have a switch somewhere to change this behaviour.

Passing through only non-zero-length tokens was probably always the intent; the old behavior is a bug and isn't useful, so I don't think we need a switch.


Uwe Schindler
added a comment - 15/Sep/09 16:41 Then you could use SOLR-1423-fix-empty-tokens.patch; it should work. The comparison with String.split() in one of the tests was commented out because it does not work with the tokenizer (as empty tokens are not returned).

I only wanted to check that the offsets are calculated correctly. The second test does this, but I want to be sure that they are correct for both group=-1 and group>=0.
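For the group >= 0 case, the offsets come directly from Matcher.start(group) and Matcher.end(group). A quick JDK-only check (the pattern and input are just illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupOffsets {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("'([^']+)'").matcher("say 'hi' now");
        if (m.find()) {
            // group 0 spans the whole match, group 1 only the capture
            System.out.println(m.start(0) + "-" + m.end(0)); // 4-8  ('hi')
            System.out.println(m.start(1) + "-" + m.end(1)); // 5-7  (hi)
        }
    }
}
```

A tokenizer emitting group 1 as the term should therefore use start(1)/end(1), corrected via correctOffset(), as the token's offsets.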


Uwe Schindler
added a comment - 15/Sep/09 17:04 Attached a new patch with the empty-token fix.
It has an additional test for the offsets when group != -1. It is also further optimized: it uses setTermBuffer(string, offset, len) to copy the chars into the term buffer, which is faster than allocating a new String with substring().
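The allocation being avoided can be illustrated with plain JDK calls: String.getChars() copies a region into a reusable char[] buffer, much as setTermBuffer(string, offset, len) does, whereas substring() allocates a new String per token:

```java
public class TermBufferCopy {
    public static void main(String[] args) {
        String input = "foo,bar";
        char[] termBuffer = new char[16];  // reusable buffer, as in the Token API
        int start = 4, len = 3;
        // copy "bar" into the buffer without allocating a substring
        input.getChars(start, start + len, termBuffer, 0);
        System.out.println(new String(termBuffer, 0, len)); // bar
    }
}
```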