Description

Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.

My use case here is speeding up phrase queries. The technique is as follows:

Activity

However I'm curious to why you don't query for unigrams unless the input is a single token? That means you always require a 0-slop between any two tokens of the input. I know nothing about your needs, but it could be dangerous. You can always boost the bigrams a bit more than the unigrams if they cause a problem. I think you should benchmark the cost. I'm sure it's rather small and that you'll get better quality results by doing that. Users tend to never enter a query the way I want them to.

Karl Wettin
added a comment - 31/Aug/08 14:18 It's an OK filter setting if you ask me.
However I'm curious to why you don't query for unigrams unless the input is a single token? That means you always require a 0-slop between any two tokens of the input. I know nothing about your needs, but it could be dangerous. You can always boost the bigrams a bit more than the unigrams if they cause a problem. I think you should benchmark the cost. I'm sure it's rather small and that you'll get better quality results by doing that. Users tend to never enter a query the way I want them to.

Karl, that's a good point about my setup being incompatible with non-0 slop. However, the performance gains I'm seeing with this patch on my data are substantial. When I last tested on the same index, same machine, same # of threads in my testing process, etc., and went from analyzing my queries with

outputUnigrams==true

to analyzing with

using outputUnigrams==false and outputUnigramIfNoNgrams==true

phrase queries ended up performing something like 50x as fast. Which is good, because the initial performance wasn't acceptable.

The performance gains from outputUnigramIfNoNgrams were greater than those from when I instead tried moving the index to a solid state drive. (It was a a fairly entry-level SSD drive, but still.) It would be interesting to compare to moving to a machine with an obscene amount of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of that is taken up by stored fields.)

Chris Harris
added a comment - 31/Aug/08 20:52 Karl, that's a good point about my setup being incompatible with non-0 slop. However, the performance gains I'm seeing with this patch on my data are substantial. When I last tested on the same index, same machine, same # of threads in my testing process, etc., and went from analyzing my queries with
outputUnigrams==true
to analyzing with
using outputUnigrams==false and outputUnigramIfNoNgrams==true
phrase queries ended up performing something like 50x as fast. Which is good, because the initial performance wasn't acceptable.
The performance gains from outputUnigramIfNoNgrams were greater than those from when I instead tried moving the index to a solid state drive. (It was a a fairly entry-level SSD drive, but still.) It would be interesting to compare to moving to a machine with an obscene amount of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of that is taken up by stored fields.)

Do you say it is 50x faster with shingle queries that only contains bigram compared to shingle queries that contains uni- and bigrams? Or is it 50x faster using shingles compared to phrase queries? (I've myself seen performance gains similar to the latter.)

Karl Wettin
added a comment - 31/Aug/08 21:32 Do you say it is 50x faster with shingle queries that only contains bigram compared to shingle queries that contains uni- and bigrams? Or is it 50x faster using shingles compared to phrase queries? (I've myself seen performance gains similar to the latter.)

: Do you say it is 50x faster with shingle queries that only contains bigram
: compared to shingle queries that contains uni- and bigrams? Or is it 50x
: faster using shingles compared to phrase queries? (I've myself seen
: performance gains similar to the latter.)

I'm not sure I totally understand you, so let me try rephrasing things. What I mean is that phrase queries that have only bigrams, like this:

If it clarifies things any further, let me say that I'm handling all quoted phrase queries with the normal PhraseQuery class; I'm not, for instance, turning quoted phrases into some kind of BooleanQuery. (Technically it's not me that's making the PhraseQuery object but Solr and its query parser. But Solr is indeed turning my quoted phrase queries into normal PhraseQuery objects.)

Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some random thoughts:

The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite part is how fillShingleBuffer() is now responsible for setting firstToken; that it should do so is not obvious from the method name.)

This patch's functionality could potentially be implemented without modifying ShingleFilter itself. For example, maybe instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper class, that would delegate to ShingleFilter, except when ShingleFilter failed to produce ngrams. I'm a little worried that this would require using a CachingTokenFilter, and that might not be ideal from an efficiency perspective.

It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams. With the current naming scheme, you end up setting some contradictory-sounding options, e.g. setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this might not be confusing, but it'd be nice if it were more straightforward without making you look at the code.

I gather that some people have an interest in making a minShingleSize option. (See http://www.nabble.com/shingle-filter-td25125367.html) I'm not sure how best to modify this patch should that get implemented. It might depend on the typical use cases for minShingleSize, and if there's any overlap with my use case here.

Chris Harris
added a comment - 09/Sep/09 19:19 Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some random thoughts:
The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite part is how fillShingleBuffer() is now responsible for setting firstToken; that it should do so is not obvious from the method name.)
This patch's functionality could potentially be implemented without modifying ShingleFilter itself. For example, maybe instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper class, that would delegate to ShingleFilter, except when ShingleFilter failed to produce ngrams. I'm a little worried that this would require using a CachingTokenFilter, and that might not be ideal from an efficiency perspective.
It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams. With the current naming scheme, you end up setting some contradictory-sounding options, e.g. setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this might not be confusing, but it'd be nice if it were more straightforward without making you look at the code.
I gather that some people have an interest in making a minShingleSize option. (See http://www.nabble.com/shingle-filter-td25125367.html ) I'm not sure how best to modify this patch should that get implemented. It might depend on the typical use cases for minShingleSize, and if there's any overlap with my use case here.

Robert Muir
added a comment - 10/Sep/09 13:54 Chris, just a small comment:
if (outputUnigramIfNoNgrams && firstToken == null ) {
firstToken = captureState();
}
shingleBuf.add(captureState());
here i think you could save a clone() by not calling captureState twice?
even though it doesnt have to recompute the state, captureState does have to clone it.
(I am thinking about trying out your patch where i have some very short fields, this is why i mentioned it)

even though it doesnt have to recompute the state, captureState does have to clone it.

The recomputation is only done once for the lifetime of the TokenStream (as far as there are no modifications in the number of attributes). The State linked list is just another representation of the instances LinkedHashMap for faster iteration during cloning (no new Iterators to create and so on).

Uwe Schindler
added a comment - 10/Sep/09 14:41 even though it doesnt have to recompute the state, captureState does have to clone it.
The recomputation is only done once for the lifetime of the TokenStream (as far as there are no modifications in the number of attributes). The State linked list is just another representation of the instances LinkedHashMap for faster iteration during cloning (no new Iterators to create and so on).

Robert Muir
added a comment - 10/Sep/09 14:49 Uwe, I see... i had not looked far enough yet to see how this was done, thanks.
but i think my comment still holds true, we can save a clone() here on the first token.
for long fields does not matter so much, but for short fields a small savings.

Chris Harris
added a comment - 10/Sep/09 20:53
here i think you could save a clone() by not calling captureState twice?
even though it doesnt have to recompute the state, captureState does have to clone it.
So is the idea to replace
if (getNextToken()) {
if (outputUnigramIfNoNgrams && firstToken == null ) {
firstToken = captureState();
}
shingleBuf.add(captureState());
with
if (getNextToken()) {
State curState = captureState();
if (outputUnigramIfNoNgrams && firstToken == null ) {
firstToken = curState;
}
shingleBuf.add(curState);
That seems fine, unless there's some hidden reason why you can't share State objects.
I'd guess you could optimize more than that, but I think you run into diminishing returns, making the code harder to read more than you're making it faster. For example:
if (getNextToken()) {
if (outputUnigramIfNoNgrams && firstToken == null ) {
firstToken = captureState();
shingleBuf.add(firstToken);
}
else {
shingleBuf.add(captureState());
}

At the risk of being annoying, is there any chance this patch (perhaps slightly refined, if you guys want) could make it into Lucene 3.0? I think I'm not the only person who wants to use the ShingleFilter in this slightly way. For example, I just noticed that the new Solr 1.4 Enterprise Search Server book (on p. 288) makes brief mention of SOLR-744, which depends on this patch. I'm fine keeping this out of the Lucene distribution if people think ShingleFilter.outputUnigrams is a silly hack and there's a better way to get the job done. But if this captures a legitimate use case, are there compelling reasons to keep it out?

Chris Harris
added a comment - 06/Nov/09 06:15 At the risk of being annoying, is there any chance this patch (perhaps slightly refined, if you guys want) could make it into Lucene 3.0? I think I'm not the only person who wants to use the ShingleFilter in this slightly way. For example, I just noticed that the new Solr 1.4 Enterprise Search Server book (on p. 288) makes brief mention of SOLR-744 , which depends on this patch. I'm fine keeping this out of the Lucene distribution if people think ShingleFilter.outputUnigrams is a silly hack and there's a better way to get the job done. But if this captures a legitimate use case, are there compelling reasons to keep it out?
Thanks.

Robert Muir
added a comment - 05/Oct/10 21:40 I think this patch is really important to get into 3.x and trunk...
its a dead simple way (analysis configuration only) for folks to get a massive performance improvement for phrasequeries.

The previous patch predates the rewrite ShingleFilter was subjected to as a result of LUCENE-2218, so it needed to be rejiggered somewhat.

Changes from the previous patch:

The new patch simply enables unigram output if the number of input tokens is less than minShingleSize. The existing code then handles the situation appropriately, and reset() restores the original unigram output option.

I renamed the option "outputUnigramIfNoNgrams" to "outputUnigramsIfNoShingles", because:

Unigram -> Unigrams: the output could result in more than one unigram if minShingleSize is greater than the default 2; and

Ngrams -> Shingles: for consistency with the class's name.

I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its (boolean) sense, because:

unigrams should be output only if no shingles can be output, rather than no tokens; and

reversing the sense allowed the test using it to avoid negation, and allowed the name to be shorter.

I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the effect that if outputUnigram == true, unigrams will always be output regardless of the setting of outputUnigramsIfNoShingles.

I added a test that makes sure that when minShingleSize > 2 and the number of input tokens is less than minShingleSize, (multiple) unigrams are output

Steve Rowe
added a comment - 06/Oct/10 01:21 The previous patch predates the rewrite ShingleFilter was subjected to as a result of LUCENE-2218 , so it needed to be rejiggered somewhat.
Changes from the previous patch:
The new patch simply enables unigram output if the number of input tokens is less than minShingleSize. The existing code then handles the situation appropriately, and reset() restores the original unigram output option.
I renamed the option "outputUnigramIfNoNgrams" to "outputUnigramsIfNoShingles", because:
Unigram -> Unigrams: the output could result in more than one unigram if minShingleSize is greater than the default 2; and
Ngrams -> Shingles: for consistency with the class's name.
I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its (boolean) sense, because:
unigrams should be output only if no shingles can be output, rather than no tokens ; and
reversing the sense allowed the test using it to avoid negation, and allowed the name to be shorter.
I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the effect that if outputUnigram == true, unigrams will always be output regardless of the setting of outputUnigramsIfNoShingles.
I added a test that makes sure that when minShingleSize > 2 and the number of input tokens is less than minShingleSize, (multiple) unigrams are output

Steve Rowe
added a comment - 06/Oct/10 01:46 - edited Added lucene/CHANGES.txt entry, and slightly modified reset() code switching back the unigram output option.
All tests pass.
If there are no objections, I'll commit this in a couple of days.
edit : I added an entry to *modules/analysis/*CHANGES.txt, not *lucene/*CHANGES.txt

Robert Muir
added a comment - 06/Oct/10 02:36 Looks good to me, I tested it and had no problems.
Maybe we should add the option to ShingleAnalyzerWrapper too?
This would be very convenient for Lucene users.

Steve Rowe
added a comment - 06/Oct/10 06:19 Looks good to me, I tested it and had no problems.
Thanks for testing, Robert.
Maybe we should add the option to ShingleAnalyzerWrapper too?
Yes, I missed that one - I've added support for it in this version of the patch, along with one test in ShingleAnalyzerWrapperTest (for the single-token case).