Description

During work on LUCENE-3319 I ran into issues with BooleanQuery compared to PhraseQuery in the exact case. If I disable scoring on PhraseQuery and bypass the position matching, essentially doing a conjunction match, ExactPhraseScorer beats the plain boolean scorer by 40%, which is a sizeable gain. I converted ConjunctionScorer to use DocsEnum directly but still didn't get the full 40% that PhraseQuery showed. Yet it turned out that with further optimizations this gets very close to PhraseQuery. The biggest gain came from converting the hand-crafted loop in ConjunctionScorer#doNext to a for loop, which seems to be less confusing to HotSpot. In this particular case I think code specialization makes a lot of sense, since a BooleanQuery over TermQuery clauses is by far one of the most common queries.
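The JIT effect described above can't be demonstrated outside a benchmark, but the general shape of the rewrite looks roughly like this. This is an illustrative sketch only, not the actual ConjunctionScorer#doNext change; the class and methods are made up:

```java
public class LoopShapes {
    // Hand-crafted loop: explicit index bookkeeping and a mid-loop exit.
    static int sumWhile(int[] a) {
        int i = 0, sum = 0;
        while (true) {
            if (i == a.length) break;
            sum += a[i];
            i++;
        }
        return sum;
    }

    // Equivalent plain for loop: same result, simpler control flow,
    // which the thread reports HotSpot handled better.
    static int sumFor(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }
}
```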

Activity

Uwe Schindler
added a comment - 24/Jul/11 11:12
Indeed, breaks in Java are not real gotos. The label must always be placed before the loop you want to break out of (anything else is a syntax error). A labeled break simply exits the loop at that level and continues execution after it. A simple break without a label is identical to a labeled break with an auto-generated label in front of the innermost loop.
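Uwe's explanation above can be seen in a small self-contained example (class and method names are made up for illustration):

```java
public class LabeledBreakDemo {
    // Returns an encoding of the first (i, j) pair that matches.
    static int firstMatch() {
        int found = -1;
        outer: // the label goes before the loop we want to break out of
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                if (i == 1 && j == 2) {
                    found = i * 10 + j;
                    break outer; // exits BOTH loops; not a goto: execution
                                 // resumes right after the labeled loop
                }
            }
        }
        // control arrives here after 'break outer'
        return found;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch()); // prints 12
    }
}
```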

Paul Elschot
added a comment - 24/Jul/11 10:54
this works and looks as expected.
Indeed so. I mistook the effect of a labeled break, thinking it jumps to the beginning of the loop instead of to the end.
(The last time I used a label was actually in another programming language...)

Simon Willnauer
added a comment - 22/Jul/11 21:38 - edited
I think advancing the lead in program code at the place of the break comment could fix this.
Paul, this works and looks as expected. Once we break to the advanceHead label we step out of the inner do {} while and advance the head. Maybe I don't understand your comment correctly?
There is certainly room for improvement here. For instance, the head could be advanced to the doc we break on, but the advance call there actually yields a perf hit. Yet we can play some tricks like if (DF / maxdoc > X) enum.advance( n ) else while(n > enum.nextDoc());, which I'll look into after vacation.
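Simon's density trick could be sketched as follows. DocIdIterator is a stand-in for Lucene's DocsEnum (the real API differs), the threshold direction mirrors his pseudocode above, and the threshold value itself is a made-up tuning knob:

```java
class DocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;
    DocIdIterator(int... docs) { this.docs = docs; }
    int nextDoc() { return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
    int advance(int target) { // linear stand-in for a skip-list based advance
        int d;
        do { d = nextDoc(); } while (d < target);
        return d;
    }
    int docFreq() { return docs.length; }
}

public class AdaptiveAdvance {
    /** Position 'it' on the first doc >= target, choosing the strategy
     *  by posting-list density as in the pseudocode above. */
    static int seek(DocIdIterator it, int target, int maxDoc, double threshold) {
        if ((double) it.docFreq() / maxDoc > threshold) {
            return it.advance(target);   // df/maxDoc > X: use advance()
        }
        int d;                           // otherwise: plain nextDoc() loop
        do { d = it.nextDoc(); } while (d < target);
        return d;
    }
}
```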

Paul Elschot
added a comment - 22/Jul/11 20:24
Sorry for not looking at this earlier, but in trunk now, in ConjunctionTermScorer, the lead TermScorer is not advanced in the doNext method when breaking to the advanceHead label, even though the comment at the break states so.
I would expect this to work correctly, but it is probably not as efficient as intended.
I think advancing the lead in program code at the place of the break comment could fix this.
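For reference, here is a self-contained sketch of the leapfrog pattern under discussion, with the lead explicitly advanced at the point of the labeled jump, as Paul suggests. All types and names are simplified stand-ins, not the actual ConjunctionTermScorer code:

```java
import java.util.Arrays;
import java.util.Comparator;

public class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    static class Iter {
        final int[] docs;
        int pos = -1, doc = -1;
        Iter(int... docs) { this.docs = docs; }
        int nextDoc() { return doc = (++pos < docs.length ? docs[pos] : NO_MORE_DOCS); }
        int advance(int target) { // linear stand-in for a skip-based advance
            int d;
            do { d = nextDoc(); } while (d < target);
            return d;
        }
    }

    final Iter lead;
    final Iter[] others;

    ConjunctionSketch(Iter... iters) {
        // sort by posting-list length so the rarest iterator leads
        Arrays.sort(iters, Comparator.comparingInt(it -> it.docs.length));
        lead = iters[0];
        others = Arrays.copyOfRange(iters, 1, iters.length);
    }

    /** Next doc contained in all iterators (NO_MORE_DOCS when exhausted). */
    int doNext() {
        int doc = lead.nextDoc();
        advanceHead:
        while (true) {
            for (Iter other : others) {
                if (other.doc < doc) {
                    int next = other.advance(doc);
                    if (next > doc) {
                        // mismatch: advance the lead here, at the point of
                        // the labeled jump, before retrying
                        doc = lead.advance(next);
                        continue advanceHead;
                    }
                }
            }
            return doc; // every iterator is positioned on 'doc'
        }
    }
}
```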

Simon Willnauer
added a comment - 22/Jul/11 11:02
2) after a hit was found
3) whenever a doc matched m out of n enums, 1 < m < n
Maybe I'm missing something here, but how can we know how many docs are left in an enum during 2) and 3)? We could use the doc an enum has advanced to in 3) to also advance the leader, but as I said in a comment above, I am still wary of the advance call since it can cause a serious perf hit if docs are close together.

Michael Busch
added a comment - 21/Jul/11 17:42
The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest-frequency term in the set. Is this what you mean here?
Cool, yeah, that's roughly what I meant. In general, it's best to always pick the lowest-df enum as leader:
1) after initialization
2) after a hit was found
3) whenever a doc matched m out of n enums, 1 < m < n
I think what you described covers situation 1); does it also cover 2) and 3)?

Simon Willnauer
added a comment - 21/Jul/11 07:08
I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term.
The ConjunctionTermScorer sorts the DocsEnums by their frequency in the ctor. The leader will always be the lowest-frequency term in the set. Is this what you mean here?
We could even optimize the doNext loop and advance the lead to the last document on which we stepped out of the inner loop, since this is guaranteed to be greater than the document the lead enum is on. I just wonder if at some point we run into the slowness of DocsEnum#advance(). It's very important to make #advance(doc+1) as fast as #nextDoc() in order to keep our algorithms clean!

Michael Busch
added a comment - 21/Jul/11 06:23
Nice improvements!
I'm wondering if you considered having ConjunctionTermScorer use the terms' IDF values to decide which iterator to advance when all are on the same docID? It should always be best to pick the rarest term.
We've talked about doing that in the past, but it's hard to support for arbitrary subclauses, because you'd have to add the ability to estimate the IDFs of possible subclauses.
But with this change it seems very feasible to try for BQs that only have TQ clauses.

Simon Willnauer
added a comment - 20/Jul/11 23:38
Next iteration. I added two util methods to get a TermsEnum and an ExactDocScorer from a TermWeight, which allowed me to make all the members private again. This looks a little cleaner now, and eventually, if we cut PhraseQuery over to use BQ + position iterators, it can simply get all it needs from the TermsEnum, like DocsAndPosEnum & totalTermFreq.
I also tried to optimize the doNext() loop even further to save one more comparison and hopefully help HotSpot even more. I need to benchmark this patch one more time, but overall this looks close.
I also added a CHANGES.TXT entry.

Robert Muir
added a comment - 20/Jul/11 17:02
Also, I am usually positive on backporting improvements to 3.x to get them to users quickly, but I think we should do this only in trunk.
The reason is: 3.x is going to be very hairy here, since the computation of scoring (including score caches and such) is conflated into the postings-list matching for TermQuery, since it has no TermState to pull from...

Simon Willnauer
added a comment - 20/Jul/11 16:16
I agree with you that in this case a scorer is more elegant. Yet, the rewrite process is a very powerful mechanism that can be used for query optimization etc. from the outside. What I am trying to say is that rewriting is not necessarily bound to "simplification". All your points are valid!

Robert Muir
added a comment - 20/Jul/11 15:24
I disagree here; if that were the case, it should be called simplify(Query). In general it's a rewrite method and should not be judged on whether it simplifies or not.
I think this is really important to hash out: if we want to optimize query execution, we should do it totally internally, at the lowest level possible.
If the optimization is to use a specialized scorer, then I think the right place to do this is inside the Weight.
I don't think we should create a bunch of queries that are really the same and rewrite to each other, because this is more exposed to end users, e.g. highlighting, caching, and who knows what people are doing in their custom code.
It also imposes a heavy maintenance burden of duplicated logic and testing for explain, hashCode, equals, etc.
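Robert's "decide inside the Weight" point might look roughly like this in outline. All names and the selection criterion here are hypothetical, not Lucene's actual code:

```java
public class WeightSpecialization {
    enum ClauseKind { TERM, PHRASE, WILDCARD, OTHER }

    /** Hypothetical fast-path check a BooleanWeight could make when
     *  constructing its scorer: all required clauses are plain term
     *  queries, so a specialized conjunction-of-terms scorer applies.
     *  The query itself never rewrites, so highlighters, caches, and
     *  user code still see an ordinary BooleanQuery. */
    static boolean useConjunctionTermScorer(ClauseKind[] requiredClauses) {
        if (requiredClauses.length < 2) {
            return false; // a single clause needs no conjunction
        }
        for (ClauseKind kind : requiredClauses) {
            if (kind != ClauseKind.TERM) {
                return false; // fall back to the general boolean scorer
            }
        }
        return true;
    }
}
```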

Robert Muir
added a comment - 20/Jul/11 12:35
Here is the same thing, only as a scorer that BooleanWeight picks. This is much simpler imo:
no state-keeping of 'rewriteToConjunctionTermsQuery' in the Query
no hassles for any highlighters or anything else doing instanceof checks
plays correctly with BooleanQuery's rewrite, e.g. hoisting of single clauses into TermQuerys
In general I think Query.rewrite should be reserved for simplifying queries; this is not a simpler query, just a faster scorer.