Explore performance of multi-PQ vs single-PQ sorting API

Details

Description

Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
where a simpler (non-segment-based) comparator API is proposed that
gathers results into multiple PQs (one per segment) and then merges
them in the end.
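The per-segment idea can be sketched roughly as follows. This is a toy illustration, not John's actual code: the class `MultiPQSketch` and its names are invented here, and it ranks plain ints rather than scored documents. Each segment fills its own bounded priority queue, and the per-segment survivors are merged at the end.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy sketch of the multi-PQ approach: one bounded queue per segment,
// merged into a final top-N at the end.
public class MultiPQSketch {
    public static List<Integer> topN(List<int[]> segments, int n) {
        List<PriorityQueue<Integer>> queues = new ArrayList<>();
        for (int[] segment : segments) {
            // Min-heap of size n keeps the n largest values of this segment.
            PriorityQueue<Integer> pq = new PriorityQueue<>();
            for (int v : segment) {
                pq.add(v);
                if (pq.size() > n) pq.poll(); // evict current smallest
            }
            queues.add(pq);
        }
        // Merge step: combine all per-segment survivors, keep the global top n.
        PriorityQueue<Integer> merged = new PriorityQueue<>();
        for (PriorityQueue<Integer> pq : queues) {
            for (int v : pq) {
                merged.add(v);
                if (merged.size() > n) merged.poll();
            }
        }
        List<Integer> result = new ArrayList<>(merged);
        result.sort(Comparator.reverseOrder());
        return result;
    }
}
```

The key property being benchmarked here is that each per-segment queue only ever compares values within one segment, so a comparator never has to switch readers mid-stream.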

I started from John's multi-PQ code and worked it into
contrib/benchmark so that we could run perf tests. Then I generified
the Python script I use for running search benchmarks (in
contrib/benchmark/sortBench.py).

The script first creates indexes with 1M docs (based on
SortableSingleDocSource, and based on wikipedia, if available). Then
it runs various combinations of tests.

The important constants are INDEX_DIR_BASE (where created indexes are
stored), WIKI_FILE (points to .tar.bz2 or .tar export of wikipedia; if
this file can't be found the script just skips the wikipedia tests).
You can also change INDEX_NUM_DOCS and INDEX_NUM_THREADS.

If you don't have the wiki export downloaded, that's fine... the
script should just run the tests based on the random index.

Michael McCandless
added a comment - 20/Oct/09 16:05 Attached patch.
Note that patch is based on 2.9.x branch, so first checkout 2.9.x,
apply the patch, then:
cd contrib/benchmark
ant compile
<edit constants @ top of sortBench.py>
python -u sortBench.py -run results
python -u sortBench.py -report results

It'd be great if others with more mainstream platforms (Linux,
Windows) could run this and post back.

Raw results (only ran on the log-sized segments):

Seg size | Query | Tot hits | Sort        | Top N | QPS old | QPS new | Pct change
-------- | ----- | -------- | ----------- | ----- | ------- | ------- | ----------
log      | 1     | 318481   | title       | 10    | 114.26  | 112.40  | -1.6%
log      | 1     | 318481   | title       | 25    | 117.59  | 110.08  | -6.4%
log      | 1     | 318481   | title       | 50    | 116.22  | 106.96  | -8.0%
log      | 1     | 318481   | title       | 100   | 114.48  | 100.07  | -12.6%
log      | 1     | 318481   | title       | 500   | 103.16  | 73.98   | -28.3%
log      | 1     | 318481   | title       | 1000  | 95.60   | 57.85   | -39.5%
log      | <all> | 1000000  | title       | 10    | 95.71   | 109.41  | 14.3%
log      | <all> | 1000000  | title       | 25    | 111.56  | 101.73  | -8.8%
log      | <all> | 1000000  | title       | 50    | 110.56  | 98.84   | -10.6%
log      | <all> | 1000000  | title       | 100   | 104.09  | 93.02   | -10.6%
log      | <all> | 1000000  | title       | 500   | 93.36   | 66.67   | -28.6%
log      | <all> | 1000000  | title       | 1000  | 97.07   | 50.03   | -48.5%
log      | <all> | 1000000  | rand string | 10    | 118.10  | 109.63  | -7.2%
log      | <all> | 1000000  | rand string | 25    | 107.68  | 102.33  | -5.0%
log      | <all> | 1000000  | rand string | 50    | 107.12  | 100.37  | -6.3%
log      | <all> | 1000000  | rand string | 100   | 110.63  | 95.17   | -14.0%
log      | <all> | 1000000  | rand string | 500   | 79.97   | 72.09   | -9.9%
log      | <all> | 1000000  | rand string | 1000  | 76.82   | 54.67   | -28.8%
log      | <all> | 1000000  | country     | 10    | 129.49  | 103.63  | -20.0%
log      | <all> | 1000000  | country     | 25    | 111.74  | 102.60  | -8.2%
log      | <all> | 1000000  | country     | 50    | 108.82  | 100.90  | -7.3%
log      | <all> | 1000000  | country     | 100   | 108.01  | 96.84   | -10.3%
log      | <all> | 1000000  | country     | 500   | 97.60   | 72.02   | -26.2%
log      | <all> | 1000000  | country     | 1000  | 85.19   | 54.56   | -36.0%
log      | <all> | 1000000  | rand int    | 10    | 151.75  | 110.37  | -27.3%
log      | <all> | 1000000  | rand int    | 25    | 138.06  | 109.15  | -20.9%
log      | <all> | 1000000  | rand int    | 50    | 135.40  | 106.49  | -21.4%
log      | <all> | 1000000  | rand int    | 100   | 108.30  | 101.86  | -5.9%
log      | <all> | 1000000  | rand int    | 500   | 94.45   | 73.42   | -22.3%
log      | <all> | 1000000  | rand int    | 1000  | 88.30   | 54.71   | -38.0%

Some observations:

- MultiPQ seems generally slower, though it is faster in one case: topN = 10, sorting by title. It's only faster with the <all> (MatchAllDocsQuery) query, not with the TermQuery for term=1, which is odd.
- MultiPQ slows down, relatively, as topN increases.
- Sorting by int acts differently: MultiPQ is quite a bit slower across the board, except for topN=100.

Michael McCandless
added a comment - 20/Oct/09 17:41 New patch attached:
- Turn off testing on the balanced index by default (set DO_BALANCED to True to change this)
- Minor formatting fixes in generating the report

Yonik Seeley
added a comment - 23/Oct/09 05:25 While Java5 numbers are still important, I'd say that Java6 (-server of course) should be weighted far heavier? That must be what a majority of people are running in production for new systems?

Jake Mannix
added a comment - 23/Oct/09 05:40 Java6 is standard in production servers, since when? What justified lucene staying java1.4 for so long if this is the case? In my own experience, my last job only moved to java1.5 a year ago, and at my current company, we're still on 1.5, and I've seen that be pretty common, and I'm in the Valley, where things update pretty quickly.

Jake Mannix
added a comment - 23/Oct/09 05:42 I would say that, of course, Linux and Solaris results should be weighted more highly than results on Macs, because while I love my mac, I've yet to see a production cluster running on MacBook Pros...

Yonik Seeley
added a comment - 23/Oct/09 05:53
> Java6 is standard in production servers, since when?
Maybe I'm wrong... it was just a guess. It's just what I've seen most customers deploying new projects on.
> What justified lucene staying java1.4 for so long if this is the case?
The decision of what JVM a business should use to deploy their new app is a very different one from what Lucene should require. A minority of users may be justification enough to avoid requiring a new JVM... unless the benefits are really that huge. Lucene does not target the JVM that most people will be deploying on - if that were the case, I have a feeling we'd be switching to Java6 instead of Java5.

Mark Miller
added a comment - 23/Oct/09 06:05
> Java6 is standard in production servers, since when?
> Maybe I'm wrong... it was just a guess. It's just what I've seen most customers deploying new projects on.
That's my impression too - Java 1.6 is mainly just a bug-fix and performance release and has been out for a while, so it's usually the choice I've seen. Sounds like Uwe thinks it's more buggy, though, so who knows if that's a good idea.

Yonik Seeley
added a comment - 23/Oct/09 06:12 There was a bad stretch in Java6... they plopped in a major JVM upgrade (not just bug fixes) and there were bugs. I think that's been behind us for a little while now though. If someone were starting a project today, I'd recommend the latest Java6 JVM.

Mark Miller
added a comment - 23/Oct/09 06:37
> There was a bad stretch in Java6...
But how can that be? Number 10 of the top 10 of what's new is the -lities!
10. The -lities: Quality, Compatibility, Stability

Uwe Schindler
added a comment - 23/Oct/09 07:03
> That's my impression too - Java 1.6 is mainly just a bug-fix and performance release and has been out for a while, so it's usually the choice I've seen. Sounds like Uwe thinks it's more buggy, though, so who knows if that's a good idea.
Because of this, for Lucene 3.0 we should say it's a Java 1.5 compatible release. As Mark said, Java 6 does not contain anything really new that is usable for Lucene, so we are fine with staying on 1.5. If somebody wants to use 1.5 or 1.6, it's his choice, but we should not force people to use 1.6. If at least one developer uses 1.5 for developing, we have no problem with maybe some added functions in core classes we accidentally use (like String.isEmpty() - a common problem, because it was added in 1.6 and many developers use it intuitively).
Even though 1.5 is EOLed by Sun, they recently added a new release, 1.5.0_21. I was also wondering about that, but it seems that Sun is still providing "support" for it.
About the stability: maybe it is better now, but I have seen so many crashed JVMs in the earlier versions <= _12, so I stayed on 1.5. But we are also thinking of switching here at some time.

John Wang
added a comment - 23/Oct/09 08:15 Wrote a small test and verified that a 64-bit VM's string compare is much faster than a 32-bit one's (kinda makes sense), and the above numbers now all make sense.

Uwe Schindler
added a comment - 23/Oct/09 08:53 - edited So it does not have something to do with Java 1.5/1.6, but more with 32/64 bit. As most servers are running 64 bit, I think the new 2.9 search API is fine?
I agree with you, the new API is cleaner after all; the old API could only be reimplemented with major refactorings, as it does not fit well in multi-segment search.
By the way, during the refactoring for Java5 I found some inconsistencies in MultiSearcher/ParallelMultiSearcher, which uses FieldDocSortedHitQueue (it's no longer used anywhere else): during sorting (when merging the queues of all Searchers) it uses some native compareTo operations, which may not work correctly with custom comparators. Is this correct? In my opinion this queue should also somehow use at least the FieldComparator's compare functions.
Mark, I do not understand it completely - how does this fit together? I added a warning because of very strange casts in the source code (unsafe casts) and a SuppressWarnings("unchecked") so it's easy to find in FieldDocSortedHitQueue. The temp variable is just there for the unchecked-warning suppress (but it really needs to be fixed).

Mark Miller
added a comment - 23/Oct/09 14:30 - edited
> but how does this fit together.
That's what Comparable FieldComparator#value is for - fillFields will grab all those and load up the FieldDoc fields - so the custom FieldComparator is tied into it: it creates Comparable objects that can be compared by the native compareTos. (The old API did the same thing.)

/**
 * Given a queue Entry, creates a corresponding FieldDoc
 * that contains the values used to sort the given document.
 * These values are not the raw values out of the index, but the internal
 * representation of them. This is so the given search hit can be collated by
 * a MultiSearcher with other search hits.
 *
 * @param entry The Entry used to create a FieldDoc
 * @return The newly created FieldDoc
 * @see Searchable#search(Weight,Filter,int,Sort)
 */
FieldDoc fillFields(final Entry entry) {
  final int n = comparators.length;
  final Comparable[] fields = new Comparable[n];
  for (int i = 0; i < n; ++i) {
    fields[i] = comparators[i].value(entry.slot);
  }
  // if (maxscore > 1.0f) doc.score /= maxscore; // normalize scores
  return new FieldDoc(entry.docID, entry.score, fields);
}

Mark Miller
added a comment - 23/Oct/09 14:33
> As most servers are running 64 bit,
Aren't we at the tipping point where even non-servers are 64-bit now? My consumer desktops/laptops have been 64-bit for years now.

Uwe Schindler
added a comment - 23/Oct/09 14:54
> it creates Comparable objects that can be compared by the native compareTos. (the old API did the same thing)
OK, understood. I will try to fix the generics somehow to be able to remove the SuppressWarnings.

Michael McCandless
added a comment - 23/Oct/09 17:06 New patch attached:
- Made some basic code-level optimizations, e.g. created an explicit DocIDPriorityQueue (that deals in int, not Object, to avoid casting) and subclassed that directly to a SortByStringQueue and a SortByIntQueue. It turns out that the if statement (when comparing int values) must stay, because the subtraction can overflow int.
- Added "sortBench.py -verify", which quickly runs each API across all tests and confirms results are identical - a proxy for real unit tests.
- Added "Source" (wiki or random) to the Jira table output.
- Print java/os version at start.
I'll re-run my test.
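The overflow point above is easy to demonstrate in isolation. This is a stand-alone illustration (the class and method names here are mine, not the patch's) of why the branchy compare has to stay:

```java
// Why an int comparator cannot be reduced to plain subtraction.
public class IntCompare {
    // Broken: a - b overflows when a and b are far apart with opposite signs.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Correct: the explicit if/else branch must stay.
    static int byBranch(int a, int b) {
        if (a < b) return -1;
        else if (a > b) return 1;
        else return 0;
    }

    public static void main(String[] args) {
        // Integer.MIN_VALUE - 1 wraps around to Integer.MAX_VALUE, so the
        // subtraction-based comparator claims MIN_VALUE is the larger value.
        System.out.println(bySubtraction(Integer.MIN_VALUE, 1) > 0); // true (wrong sign)
        System.out.println(byBranch(Integer.MIN_VALUE, 1) < 0);      // true (correct)
    }
}
```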

Jake Mannix
added a comment - 23/Oct/09 18:56 Mike, thanks for all the hard work on this - it's clearly far more work than anyone has spent yet on just doing the upgrade to the newer API, and that's appreciated.
Am I wrong in thinking that these results are pretty ambiguous? How often do people take the top 500 or top 1000 sorted hits? If you don't focus on that case (that of looking for pages 50 through 100 of normal 10-per-page search results), there's a bunch of green, a bunch of red, and both techniques are +/- 10-20% of each other?
Is that what everyone else sees in Mike's newest numbers here, or am I misreading them?

Mark Miller
added a comment - 25/Oct/09 19:04 Time for the reevaluations?
With the previous numbers, I would have said I'd -1 it. Now the numbers have changed; it's less clear.
However, I'm still leaning against. I don't like the 30-50% drops, even if top 500/1000 are not as common as top 10/100. It's a nasty hit for those that do it. It doesn't carry tons of weight, but I don't like it.
I also really don't like shifting back to this API right after rolling out the new one. It's very ugly, and it's not a good precedent to set for our users. And unless we make a change in our back-compat policy, we are stuck with both APIs till 4.0. Managing two APIs is something else I don't like.
Finally, creating a custom sort is an advanced operation. The vast majority of Lucene users will be fine with the built-in sorts. If you need a new custom one, you are into some serious stuff already; you can handle the new API. We have seen users handle it. Uwe had ideas for helping in that regard, and documentation can probably still be improved based on future user interactions.
I'm not as dead set against it as I was, but I still don't think I'm for the change myself.

Jake Mannix
added a comment - 25/Oct/09 20:31 Mark, you say with the previous numbers you'd say "-1", but if you look at the most common use case (top 10), the simpler API is faster in almost all cases, and in some cases it's 10-20% faster. Top 500 and top 1000 are not just "not as common" - they're probably at the 1% level, or less.
As far as shifting back, API-wise, that really shouldn't be a factor: 2.9 just came out, and what, we stick with a slightly slower API (for the most common use case across all Lucene users), which happens to be more complex, and more importantly, just very nonstandard? Comparable is very familiar to everyone, even if you have to have two forms, one for primitives and one for Objects - an API which doesn't have the whole slew of compare(), compareBottom(), copy(), setBottom(), value() and setNextReader() has a tremendous advantage over one which does.
It's "advanced" to implement a custom sort, but it will be easier if it's not complex, and then it doesn't need to be "advanced" (shouldn't we be striving to have fewer APIs listed as "advanced", and instead more features which can do complex things but are still listed as things "normal users" can do?).
I think it's a great precedent to set with users to say, "oops! we found that this new (just now, as of this version) API was unnecessarily clumsy, so we're shifting back to a simpler one which is just like the one you used to have". Sticking with a worse API because it performs better only in extreme scenarios, because "we already moved on to this new API, shouldn't go back now, don't want to admit we ever made a mistake!", is what is "ugly".
The main thing to remember is that the entire thinking around making this API different from the old one was that it seemed a simpler API would perform much worse than this one, and it does not appear that this is the case. If that original reasoning turns out to have been incorrect, then the answer is simple: go with the simpler API now, before users do get used to the new one.
If it turns out I'm wrong, and lots of users often sort based on field values for the top 1000 entries, or the most recent runs turn out to be flukes and not typical performance, only then would I change my opinion.
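The shape difference being debated here can be sketched side by side. The interface names below are illustrative stand-ins invented for this sketch, not Lucene's actual classes: the multi-PQ proposal needs one Comparable per document, while the 2.9-style comparator spreads its state across several coordinated callbacks.

```java
// Illustrative contrast of the two extension surfaces (names are invented).
public class SortApiSketch {

    // Simpler, Comparable-based surface: one method to implement.
    interface SimpleDocComparatorSource {
        Comparable<Integer> value(int doc);
    }

    // 2.9-style surface (signatures abbreviated): the collector drives
    // slot and "bottom" bookkeeping through all of these callbacks.
    interface FieldComparatorLike {
        int compare(int slot1, int slot2);
        int compareBottom(int doc);
        void copy(int slot, int doc);
        void setBottom(int slot);
        Comparable<Integer> value(int slot);
    }

    // A trivial implementation of the simple surface over an array of
    // per-document field values: the whole custom sort is one lambda.
    static SimpleDocComparatorSource byIntField(int[] fieldValues) {
        return doc -> Integer.valueOf(fieldValues[doc]);
    }
}
```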

Mark Miller
added a comment - 25/Oct/09 20:42 Given good enough reasons, I could see saying we made a mistake and switching back - as it is, for the reasons I've given, I don't find that to be the case. I don't feel the new API was a mistake yet.
Lots of other guys to weigh in, though. If everyone else feels like it's the right move, I'm not going to -1 it - just weighing in with how I feel.
I'm not seeing 10-20% faster across the board - on my system it doesn't even hit 10%, and I'm a Linux user and advocate. I'm all for performance, but < 10% here and there is not enough to sway me against 30-50% losses in the large-queue cases, combined with having to shift back. It's not a clear win either way, but I've said which way I lean.
Luckily, it's not just me you have to convince. Lots of smart people still to weigh in.

Michael McCandless
added a comment - 26/Oct/09 10:37 Looking more at this, I think there are a few problems with the current test:
- The "random" index is smallish - it has only 7 segments. I'd like to test across a wider range of segment counts.
- My "optimized" code for the multi-PQ (old) API is too optimized - I conflated the comparator directly into the PQ (e.g. I have SortByIntQueue, SortByStringQueue, that directly subclass DocIDPriorityQueue and override only compare). For source-code specialization this would be appropriate (it is "correct"), but in order to test the two extension APIs, it's not. I'll rework it to pull it back out into a separate comparator.
- We should test more realistic queries & index. Of all tested so far, query "1" on the wiki index is the most realistic, and we see the least gains there. The <all> query is unnatural, in part because, I think, the first N titles in the wiki index appear to be roughly alpha-sorted. Random strings & random ints are not very realistic.
- We need more results from diverse envs - e.g. Windows (different versions), different Linuxes, JREs, etc.
Also, I really do not like the large perf hits at highish topN sizes. Admittedly it's unusual (though I suspect not THAT rare) to use such a large topN, but I especially don't like that it doesn't degrade gracefully - it's surprising. Likewise, I'd like to see how things degrade with a highish number of segments.
I would also give more weight to the JRE 1.6, 64-bit results. Yes, we can debate what's the popular JRE in production today, but this API is going to last a looong time in Lucene, so I think we should bias towards what will be the future JRE. Maybe we should even test on the JDK 7 preview... though it's probably too early to make any decisions based on its performance.

Michael McCandless
added a comment - 27/Oct/09 09:34 New patch attached, that un-conflates the comparator & PQ. I think this patch more accurately separates the comparator from the queue, ie, better matches the approach we'd go with if we did this for "real".
Also, I turned on the BALANCED case in sortBench.py, which also generates 20-segment balanced (same segment size) index for both wiki & random, and runs the tests on those.
Finally, I added query "2" for testing, if the index is wikipedia.

Jake Mannix
added a comment - 27/Oct/09 17:31 Excellent, good to see that my big-O analysis is holding up on the 5M doc set: as the sub-leading terms drop off and become negligible, any improvement of singlePQ over multiPQ starts to go away entirely.
Still seems to be some statistical fluctuations here and there though (why would 1000 hits ever have better perf for multiPQ vs singlePQ compared to 500 hits?), but I guess that's entropy for you...

Michael McCandless
added a comment - 27/Oct/09 19:05 First, my reverseMul changes did not actually compile (I've now added "ant compile" to sortBench.py). Second, the field we are sorting on for int seems to be null / have too many zeros. I'm still digging on why...

Michael McCandless
added a comment - 27/Oct/09 19:46 OK new rev attached. In previous tests, ints were restricted to 0..20000. Now they span the full int range (randomly). I added "ant compile" to sortBench.py. I improved the merging at the end of multi-PQ a bit by getting the Comparable up front (rather than recomputing it on every comparison), and I use this to return the topN comparables, which I now print in the logs.
I'll return on my opensolaris box...

Michael McCandless
added a comment - 28/Oct/09 10:09 Yet another patch. This one incorporates thread safety fixes (from LUCENE-1994 ) that were messing up at least the sort-by-country cases by creating many docs with null country. Also, I noticed we were unfairly penalizing the single PQ test by passing in "false" for "docsScoredInOrder", whereas the multi-PQ case currently assumes docs are scored in order. So I changed that param to "true" to make the test fair.

Mark Miller
added a comment - 28/Oct/09 13:38 I think that's just a limitation of the parallel indexing task? I've seen it not hit the target number exactly due to how it divides the docs between threads.

Yonik Seeley
added a comment - 28/Oct/09 14:09 One thing that bothers me about multiPQ is the memory usage if you start paging deeper and have many segments. I've seen up to 100 segments in production systems. 100x the memory use isn't pretty.
So another thought is... 2 queues instead of N queues?
search segment 1 into queue A
search segment 2 into queue B (with possible short circuiting by the smallest value in queueA)
queue B will have larger values than queue A on average, so merge queue A into queue B (unless B is much smaller?)
search segment 3 into queue A, short circuit by smallest in B, then merge B into A, etc
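A rough sketch of that ping-pong scheme, collapsing the A/B alternation into an accumulator plus a scratch queue (same idea, simpler to show); this is a hypothetical simplification with plain int sort values and java.util.PriorityQueue standing in for Lucene's queue, not the actual patch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TwoQueueSketch {
    static final int TOP_N = 3;

    // Keep only the TOP_N largest values: min-heap, evict the smallest.
    static void insert(PriorityQueue<Integer> pq, int v) {
        if (pq.size() < TOP_N) {
            pq.add(v);
        } else if (v > pq.peek()) {
            pq.poll();
            pq.add(v);
        }
    }

    public static void main(String[] args) {
        int[][] segments = {{5, 1, 9}, {7, 2, 8}, {3, 10, 4}};
        PriorityQueue<Integer> a = new PriorityQueue<>(); // accumulator queue
        PriorityQueue<Integer> b = new PriorityQueue<>(); // per-segment scratch queue
        for (int[] segment : segments) {
            for (int v : segment) {
                // short circuit: once A is full, a hit that can't beat A's
                // bottom can never make the final top N
                if (a.size() == TOP_N && v <= a.peek()) {
                    continue;
                }
                insert(b, v);
            }
            // merge the scratch queue into the accumulator, then reuse it
            while (!b.isEmpty()) {
                insert(a, b.poll());
            }
        }
        List<Integer> top = new ArrayList<>(a);
        Collections.sort(top);
        System.out.println(top); // [8, 9, 10]
    }
}
```

Either way, memory stays at two queues regardless of segment count, which is the point of the proposal.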

Uwe Schindler
added a comment - 28/Oct/09 14:19 Good catch!
search segment 2 into queue B (with possible short circuiting by the smallest value in queueA)
But this needs the new Comparator API's compareBottom (with which I have no problem!). Nevertheless, I still think the new Collector API is much better (even if it is more complicated). Maybe the implementation behind it could be switchable and work differently. Even with the new API, we could have more than one PQ.

Michael McCandless
added a comment - 28/Oct/09 14:23 I think that's just a limitation of the parallel indexing task? I've seen it not hit the target number exactly due to how it divides the docs between threads.
Actually that's my bad – I divide the number by 4. So long as the number is 0 mod 4 it should work.

Jake Mannix
added a comment - 28/Oct/09 14:25 search segment 2 into queue B (with possible short circuiting by the smallest value in queueA)
Well, we're not doing the short circuit trick on multiPQ right now, are we? It would certainly speed things up, but requires the API have the convert() method available, which was the big savings on the API side to multiPQ. If it was available, I think multiPQ (either with N or 2 queues) would perform strictly better than singlePQ, but I didn't suggest this because it seems to negate the cleanliness of the API.
One thing John mentioned offhand is that perhaps the convert() method could be optional? If you don't implement it, you don't get to short-circuit using knowledge of previous segments, but if you do, you get maximum performance in the cases where multiPQ performs worse (mid-range hitCount, high numResultsToReturn, and in the numeric sorting case).
I think maybe combining this idea with 2 queues could be the best of all worlds, with best overall speed, only twice the memory of singlePQ, and the simplest API with the addition of one new optional method?

Yonik Seeley
added a comment - 28/Oct/09 14:31 I personally think this is a ways from being resolved one way or another... we shouldn't rush it, and we also shouldn't just necessarily "revert" to the previous API. If we so end up switching away from FieldComparator, we should consider it a new change, and make it the best we can.
the new Collector API is much better (even it is more complicated)
The power is nice... and it does allow certain optimizations that the old one did not - such as caching a value that's not a trivial lookup by docid. But I think that when singlePQ does lose, it's perhaps due to the extra indirection overhead of FieldComparator... how else can one explain multiPQ sometimes being faster with integers?

Mark Miller
added a comment - 28/Oct/09 14:33 Actually that's my bad - I divide the number by 4. So long as the number is 0 mod 4 it should work.
Ah right - the parallel indexing task multiplies up (eg you give the docs per thread, not the total docs), it doesn't divide down - so that doesn't make sense. I was confusing with a different reporting oddity I've seen when using the parallel task - would have to investigate again to remember it correctly though.

Michael McCandless
added a comment - 28/Oct/09 14:46 how else can one explain multiPQ sometimes being faster with integers?
Right, I think that's due to the higher constant overhead of single PQ. In my most recent run, multi PQ is a bit faster when sorting by int only when queue size is 10 or 25.

Yonik Seeley
added a comment - 28/Oct/09 16:27 So if we're considering new comparator APIs, and the indirection seems to be slowing things down... one thing to think about is how to eliminate that indirection.
Even thinking about the multiPQ case - why should one need more than a single PQ when dealing with primitives that don't depend on context (i.e. everything except ord). If the comparator API had a way to set (or return) a primitive value for a single docid, and then those were compared (either directly by the PQ or via a callback), there wouldn't be an issue with reader transitions (because you don't compare id vs id) and hence no need for multiple priority queues. Avoiding the creation of intermediate Comparable objects also seems desirable.
Perhaps do it how "score" is handled now... inlined into Entry? Should make heap rebalancing faster (fewer callbacks, fewer array lookups).
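As a rough sketch of that idea (the names Entry and LongDocValues are illustrative, not Lucene's API): the sort value is fetched once per docID and inlined into the queue entry, so every subsequent heap comparison is a primitive compare with no per-segment callback and no id-vs-id comparison:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class InlinedValuePQ {
    // Queue entry with the sort value inlined, the way "score" is today:
    // heap rebalancing compares two longs directly, so reader transitions
    // need no queue switch and no comparator callback.
    static class Entry {
        final int doc;
        final long value;
        Entry(int doc, long value) { this.doc = doc; this.value = value; }
    }

    // Assumed shape of the "primitive value per docID" API.
    interface LongDocValues { long value(int doc); }

    public static void main(String[] args) {
        long[] cache = {42, 7, 99, 13};  // stand-in for FieldCache values
        LongDocValues vals = doc -> cache[doc];
        PriorityQueue<Entry> pq =
            new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.value));
        for (int doc = 0; doc < cache.length; doc++) {
            pq.add(new Entry(doc, vals.value(doc))); // fetch the value once per doc
        }
        System.out.println(pq.peek().doc); // 1, the doc with the smallest value
    }
}
```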

Yonik Seeley
added a comment - 28/Oct/09 17:03 If we do go with a multi-queue approach (anywhere from 2 to N), perhaps there are ways to optimize queue merging too. It looks like the current code fully pops all the source queues?
If you're going to insert an element into a different queue, it's a waste to maintain heap order on the source queue, and it's less efficient to start with the smallest elements. Start with the leaves for the most effective short-circuiting. We could even optionally prune... if two children aren't competitive, neither will the parent be.

Jake Mannix
added a comment - 28/Oct/09 17:19 That should speed things up, but that's way subleading in complexity. This is an additive term O(numSegments * numDesiredResults) total operations when done "slowly" (as opposed to the best merge, which is O(numDesiredResults * log(numSegments)) ), in comparison to the primary subleading piece for multiPQ, which is O(numSegments * numDesiredResults * log(numDesiredResults) * log(numHitsPerSegment) ), so that's taking a piece of the CPU time which is smaller by a factor of 20-100 already than the total PQ insert time, and reducing it by a further factor of maybe 5-10.
If it's easy to code up, sure, why not. But these aren't really necessary "inner loop" optimizations anymore, I'd argue.
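For reference, the O(numDesiredResults * log(numSegments)) merge mentioned above is just a k-way merge over the heads of the per-segment result lists. A simplified sketch with int values, assuming each segment's results are already in descending order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class QueueMergeSketch {
    // k-way merge: a small heap over the per-segment "heads" pops the
    // global top N in O(topN * log(numSegments)) comparisons, instead
    // of fully draining every per-segment queue.
    static List<Integer> mergeTopN(int[][] perSegment, int topN) {
        // heap entry: {value, segmentIndex, position}; max-heap on value
        PriorityQueue<int[]> heads =
            new PriorityQueue<>((x, y) -> Integer.compare(y[0], x[0]));
        for (int s = 0; s < perSegment.length; s++) {
            if (perSegment[s].length > 0) {
                heads.add(new int[] {perSegment[s][0], s, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (out.size() < topN && !heads.isEmpty()) {
            int[] head = heads.poll();
            out.add(head[0]);
            int s = head[1], next = head[2] + 1;
            if (next < perSegment[s].length) {
                heads.add(new int[] {perSegment[s][next], s, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] segments = {{9, 5, 1}, {8, 7, 2}, {10, 4, 3}};
        System.out.println(mergeTopN(segments, 4)); // [10, 9, 8, 7]
    }
}
```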

Yonik Seeley
added a comment - 28/Oct/09 17:27 The more I think about it though, the more I'd like to not simply return to compare(int doca, int docb). Getting the values for the documents to compare is not always a fast operation for custom comparators, so the ability to set/cache that value is a big win in those cases.

Jake Mannix
added a comment - 28/Oct/09 17:44 But part of the point is that you don't have to get the values - you can have a fast in-memory structure which just encodes their sort-order, right? This is the whole point of using the ordinal - you pre-sort all of the possible values to get the ordinals, and now an arbitrarily complex comparator reduces to an int compare at sort time. In the custom comparators we use, for example, this allows even sorting by multivalued fields in a custom way, via the simple compare(int doca, int docb) approach.
This reminds me: Mike, you switched the compare for ord values from being "return ordA - ordB" to being "return ordA > ordB ? 1 : (ordA == ordB ? 0 : -1)", on the basis of int overflow at some point, right? This is only true if we're really sorting by integers, which could overflow - if they're ordinals, then these are both non-negative numbers, and their difference will always be greater than -MAX_INT, so the branching can be avoided in this innermost comparison in this case.

Yonik Seeley
added a comment - 28/Oct/09 18:24 But part of the point is that you don't have to get the values - you can have a fast in-memory structure which just encodes their sort-order, right?
For certain types of comparators... but some custom comparators do more work and more indirect lookups (still fast enough to be a comparator, but certainly slower than just a direct array access). It would be nice to avoid doing it more than once per id, and this is where FieldComparator is superior.

Jake Mannix
added a comment - 28/Oct/09 18:31 Can you tell me more about this? What kind of comparator can't pre-create a fixed ordinal list for all the possible values? I'm sure I've seen this too, but I can't bring one to mind right now.

Yonik Seeley
added a comment - 28/Oct/09 18:51 What kind of comparator can't pre-create a fixed ordinal list for all the possible values?
It's less about "can't" and more about there being too many disadvantages to pre-create in many cases.
Sparse representations would fit in this category... the number of documents with a value is small, so you use a hash (like Solr's query elevation component).
Solr's random sort comparator is another - it hashes directly from docid to get the sort value. Not slow per se, but it's certainly going to be faster to avoid recalculating it on every compare.

Michael McCandless
added a comment - 28/Oct/09 19:04
This reminds me: Mike, you switched the compare for ord values from being "return ordA - ordB" to being "return ordA > ordB ? 1 : (ordA == ordB ? 0 : -1)", on the basis of int overflow at some point, right? This is only true if we're really sorting by integers, which could overflow - if they're ordinals, then these are both non-negative numbers, and their difference will always be greater than -MAX_INT, so the branching can be avoided in this innermost comparison in this case.
Right, in the latest patch, for ords I just do the subtraction; for arbitrary ints, I do the if statement.
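The difference between the two variants is easy to demonstrate: plain subtraction is branch-free but only overflow-safe when both operands are non-negative, as ords are.

```java
// The two compare variants in question, side by side.
public class CompareSketch {
    static int compareOrds(int ordA, int ordB) {
        return ordA - ordB; // safe for ords: both >= 0, so no wraparound
    }

    static int compareInts(int a, int b) {
        return a > b ? 1 : (a == b ? 0 : -1); // branchy, but overflow-proof
    }

    public static void main(String[] args) {
        // subtraction wraps for arbitrary ints and reports the wrong order:
        System.out.println(compareOrds(Integer.MAX_VALUE, -2) > 0); // false (overflow!)
        System.out.println(compareInts(Integer.MAX_VALUE, -2) > 0); // true
    }
}
```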

Marvin Humphrey
added a comment - 28/Oct/09 23:38 > What kind of comparator can't pre-create a fixed ordinal list for all the
> possible values? I'm sure I've seen this too, but I can't bring one to mind
> right now.
I think the only time the ordinal list can't be created is when the source
array contains some value that can't be compared against another value – e.g.
some variant on NULL – or when the comparison function is broken, e.g. when
a < b and b < c but c > a.
For current KinoSearch and future Lucy, we pre-build the ord array at index
time and mmap it at search time. (Thanks to mmap, sort caches have virtually
no impact on IndexReader launch time.)
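A minimal illustration of such an ord array (a toy in-memory build; nothing to do with KinoSearch's mmap'd, index-time version): sort the values once, then each doc stores only the ordinal of its value, and sort-time comparison becomes a single int compare.

```java
import java.util.Arrays;

public class OrdArraySketch {
    // Build an ord array: ords[doc] is the rank of doc's value in sorted
    // order, so comparing two docs reduces to comparing two ints.
    static int[] buildOrds(String[] docValues) {
        String[] sorted = docValues.clone();
        Arrays.sort(sorted);
        int[] ords = new int[docValues.length];
        for (int doc = 0; doc < docValues.length; doc++) {
            // with duplicates, binarySearch returns the same index for the
            // same key, so relative order is still preserved
            ords[doc] = Arrays.binarySearch(sorted, docValues[doc]);
        }
        return ords;
    }

    public static void main(String[] args) {
        String[] countries = {"NL", "AU", "NL", "BR"};
        int[] ords = buildOrds(countries);
        // comparing docs is now just comparing ords:
        System.out.println(ords[1] < ords[3]); // true: "AU" sorts before "BR"
    }
}
```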

Earwin Burrfoot
added a comment - 29/Oct/09 08:10 One thing that bothers me about multiPQ is the memory usage if you start paging deeper and have many segments. I've seen up to 100 segments in production systems. 100x the memory use isn't pretty.
That's 100x the memory only for heaps, plus memory for Comparables - not nice.
What kind of comparator can't pre-create a fixed ordinal list for all the possible values?
Any comparator that has query-dependent ordering. Distance sort (of any kind, be it geo, or just any kind of value being close to your sample) for instance.
I think the only time the ordinal list can't be created is when the source array contains some value that can't be compared against another value - e.g. some variant on NULL - or when the comparison function is broken, e.g. when a < b and b < c but c > a.
With such comparison function you're busted anyway - the order of your hits is dependent on segment traversal order for instance. If you sharded your search - it depends on the order your shards responded to meta-search. Ugly.

Michael McCandless
added a comment - 29/Oct/09 10:30 If the comparator API had a way to set (or return) a primitive value for a single docid, and then those were compared (either directly by the PQ or via a callback), there wouldn't be an issue with reader transitions (because you don't compare id vs id) and hence no need for multiple priority queues.
I like this idea; I'll explore it. Also, your (Yonik's) results showed a very sizable gain for multi-PQ when sorting by int, which is surprising.

Yonik Seeley
added a comment - 29/Oct/09 11:54 Also, your (Yonik's) results showed a very sizable gain for multi-PQ when sorting by int, which is surprising.
Yeah... I can't help but wonder if we're measuring the edge (i.e. what won't be the bottleneck in typical searches). *:* is fast and doesn't access the index at all. Int sorting is also super fast in general, so we're down to mostly measuring the indirection time and overhead of the comparators... that's good when tweaking/optimizing, but you don't necessarily want to make bigger tradeoffs based on the fastest part of the system.

Michael McCandless
added a comment - 30/Oct/09 10:37
Added inlined single PQ approach, for sorting by int, so now
sortBench.py runs trunk as baseline and inlined single PQ as
"new". It now only does the int sort. The external API now looks
like function queries DocValues (I added a simple
IntDocValueSource/IntDocValues, that has one method to get the int
for a given docID).
Added random int field when building wiki index; you'll have to
remove existing wiki index. This is so we can test "more normal"
queries, sorting by int.
Dropped number of threads during indexing to 1, and seeded the
Random, so that we all produce the same index

This table compares trunk (= old) with the "inline int value directly into single PQ" approach (= new). So, a green result means the inline-single-PQ is faster; red means it's slower.

The results baffle me. I would have expected for the 5M hits, with shallow topN, that the diffs would be minor, since the sub-leading cost should be in the noise for either approach, and then as we go to fewer hits and deeper topN, that the inline-single-PQ approach would be faster. And we still see strangeness at topN=100, where current trunk is always substantially better. The only difference between these ought to be the constant in front of the net number of insertions. Strange!

Michael McCandless
added a comment - 01/Nov/09 11:08 The number of insertions into the queue is minuscule for these tests.
EG with topN=10, the query "1" against the 5M wikipedia index, causes
110 insertions.
Even at topN=1000 we see only 8053 insertions.
So, the time difference of these runs is really driven by the "compare
to bottom" check that's done for every hit.
What baffles me is even if I take the inline-single-PQ from the last
patch, and instead of invoking a separate class's
"IntDocValues.intValue(doc)" I look it up straight from the int[] I
get from FieldCache, I'm still seeing worse performance vs trunk.
I think at this point this test is chasing java ghosts, so, we really
can't conclude much.
Also, I think, if you are sorting by native value per doc, likely the
fastest way to take "bottom" into account is to push the check all the
way down into the bottom TermScorers that're pulling docIDs from the
posting lists. Ie, if your queue has converged, and you know a given
doc must have value < 7 (say) to get into the queue, you can check
each doc's value immediately on pulling it from the posting list and
skip it if it won't compete (and, if you don't require exact total
hits count).
For queries that are otherwise costly, this can save a lot of CPU.
This is what the source code specialization ( LUCENE-1594 ) does, and it
results in excellent gains (if I'm remembering right!).
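Sketched with made-up names (this is not Lucene's scorer API), the "push the bottom check down" idea looks roughly like this: once the queue has converged, docs whose cached value can't beat the current bottom are skipped as soon as their docID comes off the posting list, before any scoring work.

```java
public class BottomCheckSketch {
    // Count the docs that still compete, skipping non-competitive ones
    // at the cheapest possible point. Mike's example: a doc must have
    // value < bottom (say, 7) to get into the queue.
    static int collect(int[] postings, int[] valueCache, int bottom) {
        int collected = 0;
        for (int doc : postings) {
            if (valueCache[doc] >= bottom) {
                continue; // can't compete; note the total hit count is now inexact
            }
            collected++; // a real collector would insert into the queue here
        }
        return collected;
    }

    public static void main(String[] args) {
        int[] postings = {0, 1, 2, 3, 4};    // docIDs pulled from a posting list
        int[] valueCache = {5, 2, 9, 1, 8};  // stand-in for FieldCache values
        System.out.println(collect(postings, valueCache, 7)); // 3 docs still compete
    }
}
```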

John Wang
added a comment - 02/Nov/09 21:53 Hi Michael:
Any plans/decisions on moving forward with multiQ within Lucene? I am planning on making the change locally for my project, but I would rather not duplicate the work if you are planning on doing this within lucene.
Thanks
-John

Michael McCandless
added a comment - 02/Nov/09 22:54 Hi John - it seems unlikely to happen any time soon. We're still iterating here, and there are concerns (eg increased memory usage) with the multi PQ approach.

Jake Mannix
added a comment - 02/Nov/09 23:03 The current concern is to do with the memory? I'm more concerned with the weird "java ghosts" that are flying around, sometimes swaying results by 20-40%... the memory could only be an issue on a setup with hundreds of segments and sorting the top 1000 values (do we really try to optimize for this performance case?). In the normal case (no more than tens of segments, and the top 10 or 100 hits), we're talking about what, 100-1000 PQ entries?

John Wang
added a comment - 02/Nov/09 23:35 Hi Michael:
Thanks for the heads up. I will work on it locally then.
I am a bit confused here about memory: since most users don't go beyond page one, I can't see how memory is even a concern here compared to the amount of memory Lucene uses overall. Am I missing something?
-John

John Wang
added a comment - 02/Nov/09 23:53 I just looked at the most recent patch. Every entry in the PQ is an extra int, so even in the very very very rare and extreme case, the 100th page (assuming 10 entries per page) on a 100-segment index, we are looking at 400k.
Is this really a concern? I must be missing something...
-John
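For what it's worth, the arithmetic behind that 400k figure checks out: one extra 4-byte int per queue entry, 1000 entries per queue (100 pages of 10 hits), across 100 segment queues.

```java
public class MemoryEstimate {
    // Extra bytes across all per-segment queues for the inlined int.
    static long extraBytes(int pages, int hitsPerPage, int segments, int bytesPerEntry) {
        return (long) pages * hitsPerPage * segments * bytesPerEntry;
    }

    public static void main(String[] args) {
        System.out.println(extraBytes(100, 10, 100, 4)); // 400000 bytes, i.e. ~400k
    }
}
```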

Earwin Burrfoot
added a comment - 03/Nov/09 00:14 Right now DocComparator cheats and stores fields striped, keeping primitives(or in my case any value that can be represented with primitive) - primitive. If the same approach is taken with multiPQs, BOOM! - there goes your API simplicity.

John Wang
added a comment - 03/Nov/09 00:36

Mark:
100th page at the same time the index is at 100 segments? How many very's would you give it?

Earwin:
Field values are in the FieldCache, not in the PQ. It is the PQ's memory consumption that is in question here (if I am not misunderstanding). You only materialize after the merge, which even in the N*very case is only a page's worth, which is the same as the singlePQ approach.
-John

Mark Miller
added a comment - 03/Nov/09 00:38 - edited

100th page at the same time the index is at 100 segments? How many very's would you give it?

I'm not claiming the 100th page with many segments - I have no info on that, and I agree it would be more rare. But it has come to my attention that the 100th page is more common than I would have thought. (Sorry - I wasn't very clear on that in my last comment - I am just referring to the deep paging. I previously would have thought it's more rare than I do now - though even before, it's something I wouldn't want to see a huge perf drop on.)

In any case - no one is saying this change won't happen. Just that it's not likely to happen soon.

edit

Let me answer the question though - based on my experience with the merge factors people like to use, and the cost of optimizing, I would say 100 segments deserves no very. At best, it might be semi-rare. Mixed with the 100th-page requirement, I'd take it to rare. But that's just me guessing based on my Lucene/Solr experience - so it's not worth a whole ton.

John Wang
added a comment - 03/Nov/09 00:49

Mark:
The point of discussion is memory, unless a few hundred K of memory consumption implies a "huge perf drop". (I see you are being conservative and using only 1 "huge".)

Even with 100 segments, which I am guessing you agree is rare, it is 400K (in this discussion I am using it as an upper bound, perhaps I should state that more explicitly), and thus my inability to understand it being a memory concern.

BTW, I am interested in the percentage of "deep paging" you are seeing. You argue it is not rare, do you have some concrete numbers? In the stats I have seen from our production logs, and also from web search logs when I was working on that, the percentage is very, very, very, very, very (5 very's) low. (The sharp drop is usually at page 4, let alone page 100.)
-John

Mark Miller
added a comment - 03/Nov/09 00:55

The point of discussion is memory, unless a few hundred K of memory consumption implies a "huge perf drop". (I see you are being conservative and using only 1 "huge".)

I know, I was purposely avoiding getting into the memory argument and just focusing on how rare the situation is. And whether there is going to be a huge perf drop with queue sizes of 1000, I just don't know. The tests have been changing a lot - which is why I think it's a little early to come to final conclusions.

Even with 100 segments, which I am guessing you agree is rare, it is 400K.

Yes - I do agree it's rare.

BTW, I am interested in the percentage of "deep paging" you are seeing. You argue it is not rare, do you have some concrete numbers?

I don't have numbers I can share - but this isn't for situations with users paging through an interface (like a web search page) - it's users that are using Lucene for other tasks - and there are plenty of those. Lucene is used a lot for websites where users click through 10 results at a time - but it's also used in many, many other apps (and I do mean two manys!)

Mark Miller
added a comment - 03/Nov/09 01:03 - edited

Actually - while I cannot share any current info I have, I'll share an example from my last job. I worked on a system that librarians used to maintain a newspaper archive. The feed for the paper would come in daily and the librarians would "enhance" the data - adding keywords, breaking up stories, etc. Then reporters or end users could search this data. Librarians, who I learned are odd in their requirements by nature, insisted on bringing in thousands of results that they could scroll through at a time. This was demanded at paper after paper. So we regularly fed back up to 5,000 results at a time with our software (though they'd have preferred no limit - "what are you talking about! I want them all!" - we made them click more buttons for that). That's just one small example, but I know for a fact there are many, many more.

edit

We also actually ran into many situations where there were lots of segments in this scenario as well - before I knew better, I'd regularly build the indexes with a high merge factor for speed - and then be stuck, unable to optimize because it killed performance, and newspapers need to be up pretty much 24/7 - so I would be unable to optimize without bringing their server to a crawl (this was before you could optimize down to n segments and work slowly over time). Not the greatest example, but a situation I found myself in.

John Wang
added a comment - 03/Nov/09 04:53

Another observation: with the multiQ approach, it seems there would be no need for the set of OutOfOrder*Comparators, because we would know each Q corresponds to one segment, and it would not matter if docs come out of order.
Please correct me if I am misunderstanding here.
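A minimal sketch of the multi-PQ idea under discussion, assuming per-segment top-N queues merged at the end. The class and method names are hypothetical and this is not the patch's actual API; hits are modeled as plain score/docID pairs:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the multi-PQ approach: each segment fills its own
// small priority queue, so per-segment collection order does not matter, and
// the final top N comes from merging the per-segment queues at the end.
class MultiPqSketch {
    // each hit is {score, docId}; segmentHits holds one hit array per segment
    static List<int[]> topN(List<int[][]> segmentHits, int n) {
        List<int[]> candidates = new ArrayList<>();
        for (int[][] segment : segmentHits) {
            // min-heap on score keeps only this segment's n best hits
            PriorityQueue<int[]> pq =
                new PriorityQueue<>(Comparator.comparingInt((int[] h) -> h[0]));
            for (int[] hit : segment) {
                pq.offer(hit);
                if (pq.size() > n) pq.poll(); // evict the current worst
            }
            candidates.addAll(pq);
        }
        // merge step: best n across all segment queues, highest score first
        candidates.sort((a, b) -> Integer.compare(b[0], a[0]));
        return candidates.subList(0, Math.min(n, candidates.size()));
    }
}
```

The per-segment queues never see docIDs from another segment, which is exactly why out-of-order delivery within the merge is a non-issue in this sketch.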

Earwin Burrfoot
added a comment - 03/Nov/09 10:19

though they'd have preferred no limit - "what are you talking about! I want them all!"

Same here. People searching on a job site don't really care for the top 10 vacancies/resumes, they want eeeeeverything! that matched their requirements.

John: on the other hand, we here already have code to merge results coming from different shards, with stripes, primitives and whistles to appease the GC. Might as well reuse it as-is to merge between segments.

Michael McCandless
added a comment - 03/Nov/09 10:43

Another observation: with the multiQ approach, it seems there would be no need for the set of OutOfOrder*Comparators.

I think it'd still be beneficial to differentiate in- vs out-of-order collectors, because even within one segment, if you know the docIDs arrive in order then the compare-bottom is cheaper (it need not break ties).

Even with 100 segments, which I am guessing you agree is rare, it is 400K.

Don't forget that this is multiplied by however many queries are currently in flight.

Yonik also raised an important difference of the single PQ API (above: https://issues.apache.org/jira/browse/LUCENE-1997?focusedCommentId=12771008&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12771008 ), ie, the fact that it references "slots" instead of "docid" means you can cache something private in your "slots" based on previous compare/copy.

Since each approach has distinct advantages, why not offer both ("simple" and "expert") comparator extension APIs?
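To make the in- vs out-of-order point above concrete, here is a small sketch (hypothetical helper names, not Lucene's Collector/FieldComparator API) of why knowing docIDs arrive in order makes the bottom check cheaper:

```java
// Illustrates the in-order vs out-of-order distinction discussed above.
// `cmp` is the sort-value comparison of a candidate hit against the queue's
// bottom entry (positive = candidate sorts better). Hypothetical helpers.
class OrderSketch {
    // In-order collection: a hit that only TIES the bottom can be rejected
    // immediately, because the queued hit already has the smaller docID.
    static boolean competitiveInOrder(int cmp) {
        return cmp > 0;
    }

    // Out-of-order collection: ties must fall back to a docID comparison,
    // since the candidate might have the smaller (preferred) docID.
    static boolean competitiveOutOfOrder(int cmp, int doc, int bottomDoc) {
        return cmp > 0 || (cmp == 0 && doc < bottomDoc);
    }
}
```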

Jake Mannix
added a comment - 03/Nov/09 18:50

Since each approach has distinct advantages, why not offer both ("simple" and "expert") comparator extension APIs?

+1 from me on this one, as long as the simpler one is around. I'll bet we'll find that we regret keeping the "expert" one by 3.2 or so, though, but I'll take any compromise which gets the simpler API in there.

Don't forget that this is multiplied by however many queries are currently in flight.

Sure, so if you're running with 100 queries per second on a single shard (pretty fast!), with 100 segments, and you want to do sorting by value on the top 1000 values (how far down the long tail of extreme cases are we at now? Do librarians hit their search servers with 100 QPS and have indices poorly built with hundreds of segments and can't take downtime to ever optimize?), we're now talking about 40MB.

Forty megabytes. On a beefy machine which is supposed to be handling 100 QPS across an index big enough to need 100 segments. How much heap would such a machine already be allocating? 4GB? 6? More?

We're talking about less than 1% of the heap being used by the multiPQ approach in comparison to singlePQ.

Mark Miller
added a comment - 03/Nov/09 23:59

If this is about ease of use, it's pretty easy to return a Comparable from the FieldCache, add a Comparable FieldComparator, and let users that need it (haven't seen the clamoring yet though) just implement getComparable(String). It's not that hard to support that with a single queue either.

It's still my opinion that users that need this are advanced enough to look at the provided impls and figure it out pretty quick. It's not rocket science, and each method impl is generally a couple of simple lines of code.
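A rough sketch of the Comparable-driven option Mark describes, assuming a hypothetical getComparable-style hook feeds values into the comparator; the class and method names are illustrative only, not an existing Lucene API:

```java
// Sketch of a single-queue comparator backed by per-document Comparables,
// as in Mark's getComparable(String) suggestion. The array plays the role
// of the queue's "slots"; everything here is hypothetical, not Lucene code.
class ComparableDocComparator<T extends Comparable<T>> {
    private final Object[] slots; // one cached Comparable per queue slot

    ComparableDocComparator(int numHits) {
        slots = new Object[numHits];
    }

    // copy() would be fed by something like fieldCache.getComparable(field, doc)
    void copy(int slot, T value) {
        slots[slot] = value;
    }

    @SuppressWarnings("unchecked")
    int compare(int slot1, int slot2) {
        return ((T) slots[slot1]).compareTo((T) slots[slot2]);
    }
}
```

The single queue only ever compares slot against slot, so users supplying Comparables never need to know about segments at all.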