Description

This patch implements additional functionality in the filter to "mark" reversed tokens with a special marker character (U+0001), so that reversed terms can be told apart from straight ones. This is useful when indexing both straight and reversed tokens (e.g. to implement efficient leading-wildcard search).
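
As a rough illustration of the idea (not the code from the patch), the sketch below reverses a token, prepends the marker, and rewrites a leading-wildcard pattern into a prefix pattern over the marked terms. The class and method names are hypothetical.

```java
/** Illustrative sketch only; names are hypothetical and this is not the patch itself. */
final class MarkedReverseSketch {
    /** The marker mentioned in the description: U+0001. */
    static final char MARKER = '\u0001';

    /**
     * Reverses the token and prepends the marker.
     * StringBuilder.reverse() is used for brevity; the filter's own
     * reverse() methods may treat supplementary characters differently.
     */
    static String reverseWithMarker(String token, char marker) {
        return marker + new StringBuilder(token).reverse().toString();
    }

    /**
     * Turns a leading-wildcard pattern such as "*cene" into a
     * trailing-wildcard (prefix) pattern over marked, reversed terms.
     */
    static String toPrefixPattern(String leadingWildcard, char marker) {
        if (!leadingWildcard.startsWith("*")) {
            throw new IllegalArgumentException("expected a leading '*': " + leadingWildcard);
        }
        String suffix = leadingWildcard.substring(1);
        return reverseWithMarker(suffix, marker) + "*";
    }

    public static void main(String[] args) {
        String indexed = reverseWithMarker("lucene", MARKER);
        String pattern = toPrefixPattern("*cene", MARKER);
        // The prefix (pattern minus the trailing '*') matches the indexed term.
        String prefix = pattern.substring(0, pattern.length() - 1);
        System.out.println(indexed.startsWith(prefix));
    }
}
```

Because every reversed term starts with the marker, a prefix query built this way only sees reversed terms and never collides with a straight term that happens to start with the same letters.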

Robert Muir
added a comment - 15/Aug/09 18:07 Andrzej, the reverse() methods are public; can you supply default implementations (withMark=false) in case someone is using them?
Alternatively, maybe the reverse() methods could stay the same, and the marking could happen in incrementToken()?

Andrzej Bialecki
added a comment - 15/Aug/09 19:08 Either way is fine with me. To preserve the public API I think it's better to move this marking logic to incrementToken(). I'll prepare an updated patch.

Paul Cowan
added a comment - 17/Aug/09 00:29 Very very minor thing, but does it make more sense to choose a more suitable character? U+0001 is an assigned character, with some semantic meaning ("Start of Heading", same as ASCII character 0x01) which isn't really relevant to this use. It mightn't be a bad idea to (a) choose a control character which makes sense in context, if there is one (I can't see one, myself), (b) use a character from the private-use area (U+E000 to U+F8FF) or (c) my preferred option, use the Unicode tag characters. The tag characters are designed for just such a purpose: embedding contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in Unicode at present (language tag), which isn't suitable.
That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag characters, say U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse'), and use that. This could lead to a de facto standard for Lucene fields, where different variations of the same token could use different leading tag characters. Rather than just everyone picking a character at random, this could lead to some sort of structure around similar situations (e.g. I could envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of the token, etc).
Sorry, I'm really anal about Unicode. Can't help it.

Robert Muir
added a comment - 17/Aug/09 00:44
"Sorry, I'm really anal about Unicode. Can't help it."
Me too. My problem with tag characters is that they are deprecated.
I will take your advice and look and see if there is something more suitable.

Robert Muir
added a comment - 17/Aug/09 00:51 Another issue, besides the fact they are deprecated, is that tag characters are outside of the BMP.
Currently, the reverse filter does not properly reverse characters outside of the BMP (it does not recognize them as one character).
This means characters such as tag characters will be 'reversed' into a trail surrogate followed by a lead surrogate (two unpaired surrogates).
But we cannot fix the above, as Lucene wildcard support does not recognize code points above U+FFFF as one 'character' either.
If we are gonna pick a character other than U+0001, it needs to be inside the BMP.
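
The surrogate problem described here can be reproduced with a small sketch. `SurrogateReverseSketch` is a hypothetical class, not the filter's code: it contrasts a reversal over UTF-16 code units (which splits pairs, as described above) with `StringBuilder.reverse()`, which the JDK documents as treating surrogate pairs as single characters.

```java
/** Hypothetical demo class; it does not reproduce the filter's actual code. */
final class SurrogateReverseSketch {
    /** Reverses UTF-16 code units one at a time, ignoring surrogate pairs. */
    static String reverseCodeUnits(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    /** Reverses while keeping each surrogate pair in lead/trail order. */
    static String reverseCodePoints(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // U+E0052 TAG LATIN CAPITAL LETTER R needs two chars in UTF-16.
        String tagR = new String(Character.toChars(0xE0052));
        String naive = reverseCodeUnits(tagR + "ab");
        // After the naive reversal the trail surrogate comes before the lead one.
        System.out.println(Character.isLowSurrogate(naive.charAt(2)));
        System.out.println(reverseCodePoints(tagR + "ab").equals("ba" + tagR));
    }
}
```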

Paul Cowan
added a comment - 17/Aug/09 01:04 - edited Yeah, ok, makes sense.
I'd suggest choosing a range of Private Use characters from the BMP block then, that's what they're for. Doesn't really matter which... we can pick a block of (say) 256 and use the first one for this, then others can be used for other purposes later if required. U+ECxx, maybe, because that's got 3 letters out of 'lucene' in it. So EC00 means 'reversed', and then people who need other similar filters can organise amongst themselves.

Robert Muir
added a comment - 17/Aug/09 01:10 I looked into this and I think using the private use area (U+E000 to U+F8FF) would also not be the best.
I do not think Lucene should use PUA characters internally; besides, I have at least a few docs with PUA characters, and I think others will as well.
We should leave PUA characters available to the end user.
So personally I have nothing against this U+0001, but I'll take any recommendations...

Robert Muir
added a comment - 17/Aug/09 01:18 What if we simply make it so there is no boolean option for a marker character, and instead offer ReverseFilter() and ReverseFilter(char marker)?
This way, Lucene does not define the character used for this operation, and someone can feel free to select whichever they want (such as U+0001).
When we are on Java 5 and can support supplementary characters properly (reversing, wildcards, etc.), then we can change this to ReverseFilter(int marker) and someone can use anything they want, including characters outside of the BMP.
If this is ok, I will supply a patch.

Paul Cowan
added a comment - 17/Aug/09 01:27 OK, cool. I'm taking an interest in this purely because I have some ideas for other token filters which would do something similar, and really like the idea of tagging them in the same way just with different 'headers'. It would be really beneficial, I think, to come up with something that can be reused and, more importantly, combined (so different filters don't 'clash' with their output). What about making it 2 characters, at least?
U+0001 START OF HEADING
U+xxxx whatever you like to indicate 'reversing' (e.g. an 'R', or just a 0-byte as this is the first purpose allocated, or whatever)
This adds 2 bytes to each term, not 1, but terms generally don't take up that much room in the scale of a whole index and I think it's worth the flexibility. Hell, if you're willing to use 3 (that IS starting to seem wasteful, I admit) then maybe
U+0001 START OF HEADING
U+xxxx whatever
U+0002 START OF TEXT
That's at least semantically meaningful. Other ideas, just looking at the ASCII control characters:
U+xxxx whatever
U+001F UNIT SEPARATOR
or
U+000E SHIFT OUT
U+xxxx whatever
U+000F SHIFT IN
I don't really mind, but it's always nice to plan ahead.
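
A minimal sketch of the three-character variant (U+0001, a type character, U+0002) might look as follows. The class, its method names, and the choice of 'R' as the type character are assumptions for illustration only; nothing like this is in the patch.

```java
/** Hypothetical helper for a "header + type + start of text" term prefix. */
final class TokenHeaderSketch {
    static final char START_OF_HEADING = '\u0001';
    static final char START_OF_TEXT = '\u0002';

    /** Prefixes the body with U+0001, the type character, and U+0002. */
    static String wrap(char type, String body) {
        return "" + START_OF_HEADING + type + START_OF_TEXT + body;
    }

    /** Returns the type character, or -1 when the term carries no header. */
    static int typeOf(String term) {
        boolean hasHeader = term.length() >= 3
                && term.charAt(0) == START_OF_HEADING
                && term.charAt(2) == START_OF_TEXT;
        return hasHeader ? term.charAt(1) : -1;
    }

    /** Strips the header if present; otherwise returns the term unchanged. */
    static String bodyOf(String term) {
        return typeOf(term) >= 0 ? term.substring(3) : term;
    }

    public static void main(String[] args) {
        String reversed = wrap('R', "enecul");   // 'R' for a reversed token
        String normalised = wrap('N', "lucene"); // 'N' for a normalised token
        System.out.println((char) typeOf(reversed) + " " + bodyOf(reversed));
        System.out.println((char) typeOf(normalised) + " " + bodyOf(normalised));
    }
}
```

With a fixed frame like this, two filters only need to agree on distinct type characters to keep their output from clashing, at the cost of three extra characters per term.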

Robert Muir
added a comment - 17/Aug/09 01:40 Updated patch so you can choose your own character for marking.
If one character is not enough let me know (I suppose we could make it a sequence), but I'd rather keep this simple.

Andrzej Bialecki
added a comment - 17/Aug/09 08:20 +1. One comment, perhaps stating the obvious: I picked char 0001 for two reasons. It's not likely to be used in regular text, and its UTF-8 encoding uses one byte. The use case for this filter means that it will create more or less as many tokens as there were in the original token stream, thus doubling the size of the term dictionary. One byte here, one byte there, and suddenly it matters whether we use 0001 or FFFF ...
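
The byte counts behind this argument can be checked with a short sketch. `MarkerSizeSketch` is a hypothetical helper that measures the standard UTF-8 encoded length of the candidate markers mentioned in this thread; how much of that cost ends up on disk depends on the index format, which this sketch does not model.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: UTF-8 encoded length of a single code point. */
final class MarkerSizeSketch {
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Candidate markers mentioned in the discussion.
        int[] candidates = {0x0001, 0x001F, 0x200E, 0xEC00, 0xFFFF, 0xE0052};
        for (int cp : candidates) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

U+0001 and U+001F encode in one byte, the other BMP candidates (U+200E, U+EC00, U+FFFF) in three, and the tag character U+E0052 in four.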

Ted Dunning
added a comment - 17/Aug/09 09:04
I understand the desire to use a mark that requires fewer bytes, but the Unicode bidi marks might be better for the purpose of marking writing direction: U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK.

Robert Muir
added a comment - 17/Aug/09 12:27 Ted, with the current patch you can do this: new ReverseStringFilter('\u200E'), or new ReverseStringFilter('\u200F'), or new ReverseStringFilter('\u0001'), or whatever.
Also, for anyone using this filter, it's my understanding that each term in Lucene's term dictionary is a "delta" versus the previous term, so the character you choose should not affect its size?
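
To make the "delta" argument concrete, here is a minimal sketch. It assumes that each term is stored as the length of the prefix it shares with the previous term in sorted order, plus the remaining suffix. `SharedPrefixSketch` is a hypothetical helper and not Lucene code.

```java
/** Hypothetical helper illustrating prefix sharing between adjacent terms. */
final class SharedPrefixSketch {
    /** Number of leading chars that two terms have in common. */
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // Two reversed, marked terms that would sort next to each other.
        String a = "\u200F" + "enecs";   // "scene" reversed
        String b = "\u200F" + "enecul";  // "lucene" reversed
        // The marker is part of the shared prefix, so under the assumed
        // scheme only the differing suffix of b is stored again.
        System.out.println(sharedPrefixLength(a, b));
    }
}
```

Under that assumption, the marker is stored in full only where a run of marked terms begins, so its encoded width matters much less than a per-term count would suggest.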

Grant Ingersoll
added a comment - 17/Aug/09 14:47 Perhaps it is useful to define a few constants for each of these suggested characters to make it super easy for people to use them? Just a thought. Otherwise, I like the idea of passing in your own marker.

DM Smith
added a comment - 17/Aug/09 15:58 I like the idea of a constant, presented as a default. I suggest that others be given in the JavaDoc.
I have some texts which are using PUAs until Unicode includes the code points (e.g. Myanmar text), so I'm glad that allowing a choice doesn't create a potential conflict there. I think PUA should be left to the text author.
As my texts are all derived from XML, I like the use of a character that is not allowed in XML. I think 0001 is just fine, even if not from a purity perspective.
Some of my texts have BIDI markers and while these will be stripped by filters, I don't think this use is analogous.

Robert Muir
added a comment - 17/Aug/09 21:27 Thanks for your comments guys, I like the idea of constants for some of these suggested characters.
I will update the patch later tonight if no one wants to tackle it and beat me to it first.
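
One possible shape for such constants is sketched below. The class name and the constant names are hypothetical and may not match what ends up in the patch; the characters are the ones suggested in this thread.

```java
/** Hypothetical holder for the marker characters suggested in this thread. */
final class ReverseMarkers {
    /** U+0001: one byte in UTF-8 and not allowed in XML 1.0 text. */
    static final char START_OF_HEADING = '\u0001';
    /** U+001F: another single-byte ASCII control character. */
    static final char UNIT_SEPARATOR = '\u001F';
    /** U+EC00: BMP private-use character; may clash with user PUA text. */
    static final char PUA_EC00 = '\uEC00';
    /** U+200E: bidi left-to-right mark. */
    static final char LEFT_TO_RIGHT_MARK = '\u200E';
    /** U+200F: bidi right-to-left mark. */
    static final char RIGHT_TO_LEFT_MARK = '\u200F';
    /** Assumed default, following the preference expressed above for U+0001. */
    static final char DEFAULT = START_OF_HEADING;

    private ReverseMarkers() {
        // constants only
    }

    public static void main(String[] args) {
        // Print each constant's general category as reported by the JDK.
        char[] all = {START_OF_HEADING, UNIT_SEPARATOR, PUA_EC00,
                      LEFT_TO_RIGHT_MARK, RIGHT_TO_LEFT_MARK};
        for (char c : all) {
            System.out.printf("U+%04X type=%d%n", (int) c, Character.getType(c));
        }
    }
}
```

A caller could then pass, for example, `ReverseMarkers.DEFAULT` wherever the filter accepts a `char` marker.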