Rob Audenaerde
added a comment - 28/Feb/14 10:38 I'm currently experimenting with this. To increase the speed it seems logical to me the FacetsCollector needs to return less hits. I have a slightly modified version that I will attach.
It uses a sampling technique that divides the total hits into 'bins' of a given size and takes one sample from each bin. I have implemented it by keeping that one sample as a 'hit' of the search if it was a hit, and clearing all other bits. See the attached file.
By using this technique the distribution of the results should not be altered too much, while the performance gains can be significant.
A quick test revealed that for 1M results and binsize 500, the sampled version is twice as fast.
The problem is that the resulting {{FacetResult}}s are not correct, as the number of hits is reduced. This can be fixed afterwards for counting facets by multiplying by the binsize, but for other facets it will be more difficult or will require other approaches.
What do you think?
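The bin-sampling described above can be sketched roughly like this (a standalone illustration, using java.util.BitSet in place of Lucene's FixedBitSet; class and method names are mine, not the attached patch's):

```java
import java.util.BitSet;
import java.util.Random;

// Sketch of the bin-sampling idea: walk the doc id space in bins of binSize,
// keep one randomly chosen slot per bin, and clear every other bit.
// java.util.BitSet stands in for Lucene's FixedBitSet here.
public class BinSampler {
    public static void sample(BitSet bits, int maxDoc, int binSize, Random random) {
        for (int binStart = 0; binStart < maxDoc; binStart += binSize) {
            // The one slot in this bin that survives; if the last bin is
            // partial, the drawn slot may fall past maxDoc and the bin keeps nothing.
            int keep = binStart + random.nextInt(binSize);
            int binEnd = Math.min(binStart + binSize, maxDoc);
            for (int doc = binStart; doc < binEnd; doc++) {
                if (doc != keep) {
                    bits.clear(doc); // drop every hit except the sampled slot
                }
            }
        }
    }
}
```

Note that the surviving slot is drawn over the whole bin, not over the hits in it, so a bin with few hits may end up contributing nothing to the sample.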

Michael McCandless
added a comment - 28/Feb/14 12:43 This looks great!
To increase the speed it seems logical to me the FacetsCollector needs to return less hits.
fewer hits (Sorry, pet peeve).
Maybe you could add a utility method on SamplingFacetsCollector to "fix up" the FacetResult assuming it's a simple count (e.g., multiply the aggregate by the bin size)? The old sampling code ( https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_6/lucene/facet/src/java/org/apache/lucene/facet/sampling ) has something like this.
It might be good to allow passing the random seed, for repeatable results?
Another option, which would save the 2nd pass, would be to do the sampling during Docs.addDoc.
Also, instead of the bin-checking, you could just pull the next random double and check if it's < 1.0/binSize?
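Mike's last suggestion — accept each hit independently with probability 1/binSize instead of doing bin bookkeeping — could look like this (a sketch with invented names, not the patch code):

```java
import java.util.Random;

// Sketch of Bernoulli sampling: each hit is kept independently with
// probability 1/binSize, so no per-bin counters are needed.
public class BernoulliSampler {
    private final Random random;
    private final double acceptProbability;

    public BernoulliSampler(int binSize, long seed) {
        this.random = new Random(seed); // seeded for repeatable results
        this.acceptProbability = 1.0 / binSize;
    }

    // Returns true if this hit should be kept in the sample.
    public boolean accept() {
        return random.nextDouble() < acceptProbability;
    }
}
```

The trade-off versus bins is one random draw per hit instead of one per bin, and the sample size becomes a random variable rather than a fixed 1-in-binSize.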

Shai Erera
added a comment - 28/Feb/14 13:11 This looks like a great start! I have a few comments/suggestions, based on the previous sampling impl:
I think SamplingFC.createDocs should return a declared SampledDocs (see later) instead of an anonymous class
Currently SampledDocs.getDocIdSet() creates a new sample on every call. This is bad not only from a performance point of view, but more because if there are a few Facets* that call it, they will compute weights on different samples
Instead, I think that SampledDocs.getDocIdSet() should return the same sampled DIS, i.e. by caching it.
Also, I think it would be good if it exposes a getSampledDocIdSet which takes some parameters e.g. the sample size
I think the original FBS should not be modified. E.g. in the previous sampling impl, you could compute the sampled top-facets but then return their exact counts by traversing the original matching docs and counting only them.
But maybe we should leave that for later, it could be a different SFC impl. But still please don't compute the sample on every getDocIdSet() call.
The old implementation let you specify different parameters such as sample size, minimum number of documents to evaluate, maximum number of documents to evaluate etc. I think those are important since defining e.g. 10% sample size could still cause you to process 10M docs, which is slow.
I like that this impl samples per-segment as it allows to tune the sample on a per-segment basis. E.g. small segments (as in NRT) probably don't need to be sampled at all. If we allow passing different parameters such as sampleRatio, min/maxSampleSize, we could tune sampling per-segment.
I agree with Mike that we need to allow passing a seed for repeatability (I assume it will be important when testing). Maybe wrap all the parameters in a SamplingConfig? We can keep the sugar ctors on SFC to take only e.g. sampleRatio.
Also, one thing that we did in the old impl is not use Math.random() or Random at all. Rather some crazy formula was used to create faster samples. I don't say we need to do that now, but maybe drop a TODO. At any rate, I don't think we should use Math.random(), but rather init a Random once and call nextDouble(). This will also allow to reuse a seed.
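The caching Shai asks for amounts to computing the sample once and memoizing it, so every Facets* implementation aggregates over the same sampled set. A minimal sketch (names are illustrative, java.util.BitSet standing in for the Lucene doc id set types):

```java
// Sketch of caching the sampled doc id set: createSample runs at most once,
// and getDocIdSet() always returns the same instance afterwards.
public class CachingSampledDocs {
    private final java.util.BitSet originalBits;
    private java.util.BitSet sampledBits; // null until first request

    public CachingSampledDocs(java.util.BitSet originalBits) {
        this.originalBits = originalBits;
    }

    public java.util.BitSet getDocIdSet() {
        if (sampledBits == null) {
            sampledBits = createSample(originalBits); // sample exactly once
        }
        return sampledBits;
    }

    private java.util.BitSet createSample(java.util.BitSet bits) {
        // Placeholder: real code would apply the bin sampling here.
        return (java.util.BitSet) bits.clone();
    }
}
```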

Rob Audenaerde
added a comment - 28/Feb/14 13:49 Thanks guys for the feedback (also on my language skills, I need to improve my English)
It might be good to allow passing the random seed, for repeatable results?
Yes! This is very sensible for testing and for more 'stable' screen results, and I will add this.
Another option, which would save the 2nd pass, would be to do the sampling during Docs.addDoc.
I considered sampling on the 'addDocument' but I figured it would be more expensive as then for each hit we need to do a random() calculation.
I think SamplingFC.createDocs should return a declared SampledDocs (see later) instead of anonymous class
I also considered this. It is far better for clarity's sake, but it also costs a copy of the original. I will try some approaches and will make sure the sampling is only done once.
I like that this impl samples per-segment as it allows to tune the sample on a per-segment basis. E.g. small segments (as in NRT) probably don't need to be sampled at all. If we allow passing different parameters such as sampleRatio, min/maxSampleSize, we could tune sampling per-segment.
This was more or less by accident, but indeed seems useful. All segments need the same ratio of sampling though, else it would be really hard to correct the counts afterwards. (Or am I missing something here?)
Maybe wrap all the parameters in a SamplingConfig?
Yes. Very useful and makes it more stable.
The old implementation let you specify different parameters such as sample size, minimum number of documents to evaluate, maximum number of documents to evaluate etc
The old-style sampling indeed had a fixed sample size, which I found very useful. However, I have not yet found a way to implement this, as I do not know the total number of results when I start faceting, so I cannot determine the samplingRatio. I could of course first count all results, but that also impacts performance as I would need two passes. I will give it some more thought, but maybe you have an idea on how to accomplish this in a better way?

Robert Muir
added a comment - 01/Mar/14 14:01
Also, one thing that we did in the old impl is not use Math.random() or Random at all. Rather some crazy formula was used to create faster samples. I don't say we need to do that now, but maybe drop a TODO. At any rate, I don't think we should use Math.random(), but rather init a Random once and call nextDouble(). This will also allow to reuse a seed.
xor-shift might be a good choice here.
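A sketch of such an XOR-shift generator (the 21/35/4 shift constants are one common 64-bit variant; treat this as an illustration, not a vetted PRNG):

```java
// Sketch of a 64-bit xor-shift PRNG: three shifts and xors per value,
// no synchronization, and the seed makes samples repeatable.
public class XorShift64 {
    private long x;

    public XorShift64(long seed) {
        // State must never be zero or the generator gets stuck at zero;
        // the fallback constant is an arbitrary non-zero value.
        this.x = seed == 0 ? 0x9E3779B97F4A7C15L : seed;
    }

    public long nextLong() {
        x ^= (x << 21);
        x ^= (x >>> 35);
        x ^= (x << 4);
        return x;
    }

    // Roughly uniform int in [0, bound), good enough for picking a bin index.
    public int nextInt(int bound) {
        long r = nextLong() % bound;
        return (int) (r < 0 ? r + bound : r);
    }
}
```

Seeding it ties in with the earlier point about repeatable results: two instances with the same seed produce the same sample.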

Gilad Barkai
added a comment - 01/Mar/14 19:00 - edited Great effort!
I wish to throw in another part - the description of this issue is about sampling, but the implementation is about random sampling.
This is not always the case, nor is it very fast (indeed, calling Random.nextInt 1M times would be measurable by itself IMHO).
A different sampler could be (pseudo-code):
int acceptedModulo = (int) (1 / sampleRatio);
int next() {
  do {
    nextDoc = inner.next();
  } while (nextDoc != NO_MORE_DOCS && nextDoc % acceptedModulo != 0);
  return nextDoc;
}
This should be faster as a sampler, and perhaps saves us from creating a new DocIdSet.
One last thing - if I did the math right - the sample crafted by the code in the patch would be twice as large as the user may expect.
For a sample ratio of 0.1, random.nextInt() would be called with 10, so the avg. "jump" is actually 5 - and every 5th document in the original set (again, on avg) would be selected, not every 10th. I think random.nextInt should be called with twice the size it is called with now (e.g. 20, making the avg random selection 10).

Shai Erera
added a comment - 01/Mar/14 19:16 About the patch:
SampledDocs should be static class, to not carry over the FC reference. We've been bitten by such things in the past and I don't see a good reason to do it here.
The 'sampling' package may be an overkill? if we want to do it, need to add package.html. But I think you can wrap them all in a SFC now (SampledDocs and SamplingParams as static classes) as it's not a lot of code.
Could you take a look at the old tests and see if they can be ported to the new API?
About the sampling itself, Rob referred me to this link http://dmurphy747.wordpress.com/2011/03/23/xorshift-vs-random-performance-in-java/ where a simple PRNG implementation is used, following the XOR-shift method. Could you take a look and see if it can be applied here instead of using Java's Random? According to the link, it performs 7 times faster...

Michael McCandless
added a comment - 01/Mar/14 20:57 I don't think we need a sampling package (overkill), and I also don't think we need a separate "config" class (SamplingParams/Config) which will bring complexity/questions itself; I think it's a non-goal here to revive the previous sampling implementation with all of its complexity, nor to push all of this responsibility onto Rob, who's doing great work here.
Rather, I think we should start simple (what Rob needs, here), commit that, and iterate from there, bringing back features as users need them.

Shai Erera
added a comment - 02/Mar/14 06:57 I agree Mike. Rob wrote though in a previous comment: "The old style sampling indeed had a fixed sample size, which I found very useful" , so I assumed that's something he wants to push for in this issue as well. I'm OK w/ re-introducing sampling in baby steps, but if Rob's goal is to use sampling + fixed sampling, then we should help him do that in this issue - there's no reason to break this into two issues?
Rob, if you only want to introduce a sampling ratio, then I agree with Mike that SamplingParams is overkill for this issue. And anyway I think a separate sampling package is overkill as well. If the sampling code grows, we can always refactor and move it under its own package.

Shai Erera
added a comment - 02/Mar/14 13:23 - edited I reviewed createSample in the patch, and I think something's wrong with it. As I understand it, you set binsize to 1.0/sampleRatio, so if sR=0.1, binsize=10. This means that we should keep 1 out of every 10 matching documents. Then you iterate over the bitset, for every group of 10 documents you draw a representative index at random, and clear all other documents.
But what seems wrong is that the iteration advances through the "bin" irrespective of which documents were returned. So e.g. if the FBS has 100 docs total, and docs 5, 15, 25, 35, ... returned, and say random.nextInt(binsize) picks all indexes but 5, if I understand the code correctly, all bits will be cleared! I double-checked with the following short main:
public static void main(String[] args) throws Exception {
  Random random = new Random();
  FixedBitSet sampledBits = new FixedBitSet(100);
  for (int i = 5; i < 100; i += 10) {
    sampledBits.set(i);
  }
  int size = 100;
  int binsize = 10;
  int countInBin = 0;
  int randomIndex = random.nextInt(binsize);
  for (int i = 0; i < size; i++) {
    countInBin++;
    if (countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (sampledBits.get(i) && !(countInBin == randomIndex)) {
      sampledBits.clear(i);
    }
  }
  for (int i = 0; i < 100; i++) {
    if (sampledBits.get(i)) {
      System.out.print(i + " ");
    }
  }
  System.out.println();
}
And indeed in many iterations the main prints nothing, in some only 2-3 docs "survive" the sampling.
So first I think Gilad's pseudo-code is more correct in that it iterates over the matching documents, and I think you should do the same here. When you do that, you no longer need to check bits.get in order to decide whether to clear it.
I wonder if your benchmark results are correct (unless you run a MatchAllDocsQuery?) – can you confirm that you indeed count 10% of the documents?
Another point raised by Gilad is the method of sampling. While you implement random sampling, there are other methods like "take the 10th document" etc. which will be faster. I think one can experiment with these through an extension of SampledDocs.createSample (and SamplingFC.createDocs), but just in case you want to give such a simple sampling method a try, would be interesting to compare randomness with something simpler.

Rob Audenaerde
added a comment - 02/Mar/14 14:00 - edited Thanks all for the insight!
My implementation is indeed not correct, in that I run over all the bits in the returned FixedBitSet; it will work better when only looking at the set bits (the real results). The sampling I have implemented samples the total set of documents, which is correct for a MatchAllDocsQuery, but will behave worse and worse the fewer hits are returned, as the chance to sample a hit decreases fast.
The reason I chose to do random sampling in bins instead of just using modulo is to prevent accidental periodicity effects. For example, if a document is created each day and you sample with modulo 7, you only get the 'monday' results. This can cause weird biased results.
So my next tasks will be:
implement xorshift, as it seems a good improvement over Math.random() (although I switched to the faster Random.nextInt(n)).
only sample the hits
remove the package
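The periodicity trap Rob mentions is easy to demonstrate: if doc id i was created on weekday i % 7, a modulo-7 sampler only ever sees one weekday, while a modulo that is coprime to 7 cycles through all of them (a toy illustration, not patch code):

```java
// Toy illustration of modulo-sampling bias on weekly-periodic data.
public class ModuloBias {
    // Returns how many distinct weekdays (0..6) survive modulo-m sampling
    // of docs 0..numDocs-1, where doc id i was created on weekday i % 7.
    public static int weekdaysInSample(int numDocs, int m) {
        boolean[] seen = new boolean[7];
        for (int doc = 0; doc < numDocs; doc++) {
            if (doc % m == 0) {       // fixed-interval sampler: keep every m-th doc
                seen[doc % 7] = true; // record which weekday this doc falls on
            }
        }
        int count = 0;
        for (boolean s : seen) {
            if (s) count++;
        }
        return count;
    }
}
```

Sampling every 7th of 700 daily docs yields a single weekday; sampling every 10th yields all seven, because 10 and 7 share no factor. Drawing a random index per bin avoids having to reason about such alignments at all.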

Rob Audenaerde
added a comment - 03/Mar/14 09:25 Here is a patch.
I'm not totally satisfied with it though. There are two loose ends:
When a segment has many docids but very few hits, they might be skipped altogether. If this happens in many segments, the estimate gets more and more off
When a segment is small, there might be a few hits that are never touched as the randomIndex in the 'bin-o-hits' is greater than the index of the hit.
Some small performance indicators:
Using an in-memory index with 10M documents, retrieving 25% of them and using a samplingRatio of 0.001 (skipping the first call, averaging over the next three)
Exact: 184 ms.
Sampled: 67 ms.
Not the 10-fold increase I had hoped for, but significantly faster.
Btw. for my use case the two problems I described above are not a real issue. I do a global count anyway, and use this to determine if the facets need to be sampled or not. When they need to be sampled, I know for sure that there are enough hits to give a proper estimate. The price of counting and sampling is lower than the price of retrieving exact facets.

Rob Audenaerde
added a comment - 03/Mar/14 09:52 More performance data:
Exact: 185 ms.
Sampled with fixed interval: 51 ms.
The difference between random sampling with index in 'bins' and a fixed interval is not really that big. Although I probably need a larger index to get a better comparison.

Gilad Barkai
added a comment - 03/Mar/14 10:24 Nice patch,
I was surprised by the gain - only a 3.5x gain for 1/1000 of the documents... I think this is because the random generation takes too long?
Perhaps it is too heavy to call .get(docId) on the original/clone and .clear() for every bit.
I crafted some code for creating a new (cleared) FixedBitSet instead of a clone, and setting only the sampled bits. With it, instead of going over every document and calling .get(docId), I used the iterator, which may be more efficient when it comes to skipping cleared bits (deleted docs?).
Ran the code on a scenario as you described (bitset sized at 10M, randomly selected 2.5M of it to be set, and then sampling a thousandth of it, down to 2,500 documents).
The results:
none-cloned + iterator: ~20ms
cloned + for loop on each docId: ~50ms
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();
try {
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc);
    }
  }
} catch (IOException e) {
  // should not happen
  throw new RuntimeException(e);
}
Also attaching the java file with a main() which compares the two implementations (the original still named createSample() and the proposed one createSample2()).
Hopefully, if this is consistent, the end-to-end sampling should be closer to the 10x gain hoped for.

Michael McCandless
added a comment - 03/Mar/14 10:40 Patch looks good, and those are nice performance results!
But, can we please remove the SamplingParams class? There are only 2 params now; if we get up to 6 or 7 params then maybe we should switch to a separate config class. I think one ctor taking only sampleRatio, and then one other taking all params, is enough?
When a segment has many docids but very few hits, they might be skipped altogether. If this happens in many segments, the estimate gets more and more off
When a segment is small, there might be a few hits that are never touched as the randomIndex in the 'bin-o-hits' is greater than the index of the hit.
I think these limitations are acceptable? In a large index, the number of such "tiny" segments should be a tiny percentage of the doc space; if not, something serious is wrong.
Another option, which would save the 2nd pass, would be to do the sampling during Docs.addDoc.
I considered sampling on the 'addDocument' but I figured it would be more expensive as then for each hit we need to do a random() calculation.
Hmm, I don't think the single-pass option would require xorshift.nextInt() on every call? Couldn't you keep a countInBin field in your SampledDocs, increment it on each Docs.addDoc, and when it's == randomIndex, add to the sampled bits and then pick a new randomIndex? This should save a 2nd pass iteration?
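Mike's single-pass idea could look roughly like this (a sketch with invented names; Docs/addDoc stand in for the collector's internals, and only one random draw is needed per bin, not per hit):

```java
import java.util.BitSet;
import java.util.Random;

// Sketch of single-pass bin sampling inside addDoc: count hits into bins of
// binSize, keep the hit whose in-bin position matches a randomly drawn index,
// and re-draw that index whenever a bin completes.
public class SinglePassSampledDocs {
    private final BitSet sampledBits;
    private final int binSize;
    private final Random random;
    private int countInBin;
    private int randomIndex;

    public SinglePassSampledDocs(int maxDoc, int binSize, long seed) {
        this.sampledBits = new BitSet(maxDoc);
        this.binSize = binSize;
        this.random = new Random(seed);
        this.randomIndex = random.nextInt(binSize); // slot kept from the first bin
    }

    public void addDoc(int docId) {
        if (countInBin == randomIndex) {
            sampledBits.set(docId); // this hit is the bin's representative
        }
        if (++countInBin == binSize) {
            countInBin = 0;
            randomIndex = random.nextInt(binSize); // new slot for the next bin
        }
    }

    public BitSet getSampledBits() {
        return sampledBits;
    }
}
```

Because the bins are counted over the hits themselves rather than over the doc id space, every complete bin of binSize hits contributes exactly one sampled document, which also addresses the sparse-hits weakness of clearing bits after the fact.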

Rob Audenaerde
added a comment - 03/Mar/14 12:44 - edited Quick update:
I implemented the single pass on the 'addDoc' method using a fixed interval (which should be the fastest implementation I guess)
Exact: 180ms
Single Pass (no-clone) Fixed Interval: 38ms.
This is an almost 4.75x speed-up.
I will compare this later on with a single pass random indexed approach.
Hmm, I don't think the single-pass option would require xorshift.nextInt() on every call? Couldn't you keep a countInBin field in your SampledDocs, increment it on each Docs.addDoc, and when it's == randomIndex, add to the sampled bits and then pick a new randomIndex? This should save a 2nd pass iteration?
Yes, I thought of this too. I will update this. I think I prefer the single-pass option myself, as I do not need the original bitset and it is the faster option.

I'm OK if the original collected hits is lost, let's just make sure we document this on the collector. Perhaps in the future (separate issue) we can add another collector (or enhance this one) to allow you to get both the original docs and the sampled docs.

I wonder what the performance implications of sampling during collection vs. post-collection are. In the past, Mike and I saw that not interfering with collection improves performance, though what we measured was accessing the ordinals-DV during collection. Just wondering if in-collection performs better than post-collection. Especially if sampleRatio is low, it means we set far fewer hits on the FBS, just to clear most of them afterwards. On the other hand we add calls to random.nextInt/Double, so that's a tradeoff which would be good to measure. We don't even need random.nextDouble(); we can do what you/Mike suggested above – work in "bins": for each bin draw a random index and discard all hits given to addDoc unless it is the binIndex.

I also think that if we keep the original FBS (whether we clone-and-clear, create-new-sampled-one or whatever), we should iterate on the matching docs and not all the bits. I don't see the logic of why that's good at all, unless the bitset.cardinality() is very big (like maybe 75% of the bitset size)? Of course, if we move to sample in-collection, that's not an issue at all.

Rob, I want to make sure we are all on the same page as to what we want to achieve in this issue (helps scope it):

Add SamplingFacetsCollector which takes sampleRatio and optional seed and random-samples the matching docs so that facets are counted on fewer hits

We should note that as a result the weights computed for facets are approximation only, and the application needs to correct them if it wants (e.g. multiplying by the inverse of sampleRatio or exact count etc.). At any rate, this is something that we don't plan to tackle in this issue, right?

Maybe we should call it RandomSamplingFacetsCollector to denote it's a random sample (vs e.g the approaches mentioned by Gilad above)? Then the parameter seed is clear as to what it means.

Shai Erera
added a comment - 03/Mar/14 15:36 +1 for removing SamplingParams.

Rob Audenaerde
added a comment - 04/Mar/14 08:39 New patch.
My little benchmark results:
Exact: 176 ms
Sampled: 36 ms
I removed the SamplingParams.
I use random sampling in bins in the addDoc. This has almost no performance impact over fixed-interval sampling when running with a sampleRatio of 0.001.
Renamed the class to RandomSamplingFacetsCollector.
I decided to always add the first document of a segment. This counters the fact that the randomIndex in the last bin may be larger than the segment size. It gives slightly better results.
I do have some questions:
Code style-wise: I see this. is sometimes used, and sometimes not. What do you prefer?
It seems FixedBitSet contains a lot of empty space, as only 1 in 1000 hits is preserved. I'm not that familiar with all the DocIdSet implementations, but maybe there are more suitable subclasses?

My use case is that I would like to present an 'overview' of the search results using facets. Because the index might well be over 10M documents, I need the speed-up that comes from sampling the facets. The results do not need to be exact, as the user will be informed that the results are estimated.

I do want to provide an 'order of magnitude' comparison; that is why I need the correctCountFacetResults. It is easier for the user to see that it is about 10,000 than to just provide a percentage.
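The correction being described amounts to scaling each sampled facet count by the inverse of the sample ratio. A minimal illustrative helper (this is not the actual correctCountFacetResults implementation, just the arithmetic it performs):

```java
// Illustrative only: scale sampled facet counts back up to estimated
// real counts by multiplying with 1/sampleRatio.
class SampledCountCorrector {
  static int[] correct(int[] sampledCounts, double sampleRatio) {
    int[] corrected = new int[sampledCounts.length];
    for (int i = 0; i < sampledCounts.length; i++) {
      // a sampled count c under ratio r estimates roughly c / r real hits
      corrected[i] = (int) Math.round(sampledCounts[i] / sampleRatio);
    }
    return corrected;
  }
}
```

So with a 0.001 sample ratio, a sampled count of 10 reads as "about 10,000 hits", which is exactly the order-of-magnitude display described above.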

Rob Audenaerde
added a comment - 04/Mar/14 09:02 btw:

About constructors: can we have two ctors, one taking samplingRatio and another taking all the parameters?

And then I think that seed=0 could be used as "generate a seed for me", so you don't need to take a Long (you can take the primitive long)

I'd make the first ctor call this(false, ratio, 0) and fold in init() inside the all-params ctor.

It's a matter of personal preference, but I like that static classes are either at the top of the class (above all class members) or at the end, but not in the middle

correctCountFacetResults

I think this is redundant in the jdoc "Corrects FacetCounts if sampling is used." – you obviously call this method if you used sampling?

XORShift64Random: maybe we should move it under o.a.l.util (and lucene/core)? I think it's a useful utility class in general?

At any rate, it should be static class and let's not nest it inside SampledDocs

Now that the collector sub-samples during collection and does not record the original docs, I think SampledDocs can be declared private (static) class as I don't see any use for someone to care if it's SampledDocs or just Docs?

Code style-wise: I see this. is sometimes used, and sometimes not. What do you prefer?

This is a personal preference. I use this. only when the method has a parameter which conflicts w/ class members and I often use it explicitly in the ctor.

It seems FixedBitSet contains a lot of empty space

You can use WAH8DocIdSet which offers better compression, but it means that the collector would need to declare it doesn't support out-of-order collection. If keepScores=true it doesn't support that anyway, but I think you can explore with it? Can you benchmark with it and report back? If the results are good, maybe we can turn off out-of-order collection for sampling by default and let the app override this behavior if it cares about out-of-order?

Shai Erera
added a comment - 04/Mar/14 10:02 Thanks Rob. Few comments:
Typo in jdoc: "to low" – "too low"?

Rob Audenaerde
added a comment - 04/Mar/14 10:34 - edited I have taken these points into account in the new patch:
Separated XORShift64Random into o.a.l.util.
Moved SampledDocs and declared it private.
Fixed some typos and redundancy in the javadocs.
Refactored the c'tors and removed init().

Looks good Rob. I apologize for not mentioning this, but now that XORShift64Random is a public class, it has to have jdocs on all methods and ctors, otherwise documentation linting will fail. Can you please add some in your next patch?

Also, are you planning to write some unit tests? You can either start with one of the existing tests or look at the old tests; I think starting fresh will be easier. The key point is that in order to test sampling, we need to index many documents to make the samples count. So e.g. we want to make sure that if we give a 10% sample ratio, then a category's count is ~10% of the expected count.

In the old tests we had issues w/ false positives - tests that failed on these asserts just because the nature of sampling isn't deterministic. Would be good if we can craft the test such that on one hand it does test sampling, but on the other hand doesn't cause unwanted noise.

I do think we can optimize SampledDocs to not use FixedBitSet even in the case of out-of-order collection (no scores) by keeping an int[] or some other compressed array, especially when the sample ratio is so small. We can do that later though - we need tests first.
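The int[]-based alternative to FixedBitSet mentioned above could be as simple as a growable doc-ID list (a sketch under the assumption of in-order, increasing doc IDs; class name is hypothetical):

```java
// Sketch: hold sparse sampled doc IDs in a growable int[] instead of a
// FixedBitSet. For a 0.1% sample this stores ~totalHits/1000 ints,
// versus maxDoc/8 bytes of mostly-empty bits.
class SampledDocIdList {
  private int[] docs = new int[64];
  private int size = 0;

  void add(int docId) {   // assumes doc IDs arrive in increasing order
    if (size == docs.length) {
      docs = java.util.Arrays.copyOf(docs, docs.length * 2);
    }
    docs[size++] = docId;
  }

  int size() { return size; }
  int get(int i) { return docs[i]; }
}
```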

Shai Erera
added a comment - 04/Mar/14 10:56 - edited Looks good Rob. I apologize for not mentioning this, but now that XORShift64Random is a public class, it has to have jdocs on all methods and ctors, otherwise documentation linting will fail. Can you please add some in your next patch?
Also, are you planning to write some unit tests? You can either start with one of the existing tests or look at old tests. I think maybe start new will be easier. The key point is that in order to test sampling, we need to index many documents to make the samples count . So e.g. we want to make sure that if we give 10% sample ratio, then a category's count is ~10% of the expected count.
In the old tests we had issues w/ false positives - tests that failed on these asserts just because the nature of sampling isn't deterministic. Would be good if we can craft the test such that on one hand it does test sampling, but on the other hand doesn't cause unwanted noise.
I do think we can optimize SampledDocs to not use FixedBitSet even in the case of out-of-order collection (no scores) by keeping an int[] or some other compressed array, especially when the sample ratio is so small. We can do that later though - we need tests first.

And of course, the tests. I already have some code that does testing in my mini benchmark. I will reduce the document count, as the unit tests need to be as fast as possible. Probably about 100 docs will do fine. Providing a seed will make sure we can test exactly; but I will also try to add a test that uses a random seed and checks whether the result fits in a given range. Should be doable; I will take the other tests as an example.

Btw, I also tried the WAH8DocIdSet with setInterval(8). It is slightly slower (36 ms for FBS, 42 ms for WAH8), but that is too small a margin to draw real conclusions, I think. I agree it is better to have some tests first.
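For reference, the classic xorshift64 step (Marsaglia's shift triple) looks like the sketch below. It is an assumption that the patch's XORShift64Random uses these exact shifts; this is shown only to illustrate why such a generator is attractive here: it is very fast and non-cryptographic, which is fine for picking sample indices.

```java
// Marsaglia's xorshift64 PRNG step, shown for illustration; the
// patch's XORShift64Random may differ in details.
class XorShift64 {
  private long x;

  XorShift64(long seed) {
    // the state must never be zero, or the generator gets stuck at zero
    this.x = (seed == 0) ? 0x9E3779B97F4A7C15L : seed;
  }

  long nextLong() {
    x ^= x << 13;
    x ^= x >>> 7;
    x ^= x << 17;
    return x;
  }

  // int in [0, bound); simple modulo -- the slight bias is acceptable
  // for sampling purposes
  int nextInt(int bound) {
    long r = nextLong() % bound;
    return (int) (r < 0 ? r + bound : r);
  }
}
```

Same seed yields the same sequence, which is what makes seeded, deterministic unit tests of the sampler possible.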

Rob Audenaerde
added a comment - 04/Mar/14 11:51 I will add the documentation to the XORShift64Random.

-1 to adding this XORShiftRandom to lucene/core. There is absolutely no need for it to be there. We are an IR library, not a JDK. Just because it's "useful" is not a good enough reason to expose something as a public API.

Moreover, this class does not extend java.util.Random, has no statistical tests, etc. Doesn't belong as a public api in lucene core. please keep it package private in facets.

Robert Muir
added a comment - 04/Mar/14 12:58

Sorry if this was mentioned before and I missed it – how can sampling work correctly (correctness of the end result) if it's done 'on the fly'?
Beforehand, one cannot know the number of documents that will match the query; as such, the appropriate sampling ratio is unknown, given that we can afford faceted search over only N documents.
If the query yields 10K results and the sampling ratio is 0.001 – would 10 documents make a good sample?
Same if the query yields 100M results – is a 10K sample good enough? Is it perhaps too much?
I find it hard to figure out a pre-defined sampling ratio which would fit different cases.

Gilad Barkai
added a comment - 04/Mar/14 13:24

That's a good point, Gilad. I think once this gets into Lucene other people will use it, and we should offer a good sampling collector that works in more than one extreme case (always tons of results), even if it's well documented. One of the problems is that when you have a query Q, you don't know in advance how many documents it's going to match.

That's where the min/maxDocsToEvaluate came in handy in the previous solution – it made SamplingFC smart and adaptive. If the query matched very few documents, not only it didn't bother to sample and save CPU, it also didn't come up w/ a crappy sample (as Gilad says, 10 docs). The previous sampling worked on the entire query, the new collector can be used to use these threshold per-segment.

But I feel that this has to give a qualitative solution – the sample has to be meaningful in order to be considered representative at all, and we should let the app specify what "meaningful" is to it, in the form of minDocsToEvaluate(PerSegment).

And since sampling is about improving speed, we should also let the app specify a maxDocsToEvaluate(PerSegment), so a 1% sample still doesn't end up evaluating millions of documents.

Robert, I agree w/ your comment on XORShiftRandom - it was a mistake to suggest moving it under core.

Rob, I feel like I've thrown you back and forth with the patch. If you want, I can take a stab at making the changes to SFC.

Shai Erera
added a comment - 04/Mar/14 16:10

Actually, in my application, I always do a count before any other search/faceting. This means I can set a sampling ratio that is appropriate for the number of results (or even decide not to do sampling at all). So for me, there is enough functionality as-is in this issue.

But of course it would be nice if it could be implemented in one pass, because there are more people using Lucene.

I think it would not be that hard to implement a lower bound; if, for example, 15,000 were the minimum number of documents before sampling kicks in, the RandomSamplingFacetCollector could also collect docs in an int[15000], so that if there are fewer hits the int[] is used, and otherwise the sampled set. The exact-part collector can stop collecting after these 15K docs.

A bit harder is the upper bound, as you only know how many hits there are after collecting them all. To avoid collecting all the documents, you need a sampling rate at which to start skipping documents. If the result then contains too many documents, the sampled result can be re-sampled down to the desired size. I think this is more like the previous implementation of the SamplingCollector.

I have already implemented a small test (and fixed an off-by-one using it); I will create a patch today that contains this test and the xorshift merged back into the collector.

Is it a good idea to explore this upper-bound/lower-bound approach? Maybe you already thought of alternative approaches as well?
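The lower-bound idea above — collect the first N hits exactly in parallel with sampling, and fall back to the exact set when the query matched few docs — could look roughly like this sketch (class and method names are hypothetical, not from the patch):

```java
// Hypothetical sketch of the lower-bound fallback: keep the first
// 'threshold' hits exactly; if the query ends up matching <= threshold
// docs, use the exact array and discard the sampled set.
class ThresholdSampler {
  private final int threshold;
  private final int[] exactDocs;
  private int totalHits = 0;
  private final java.util.BitSet sampled = new java.util.BitSet();

  ThresholdSampler(int threshold) {
    this.threshold = threshold;
    this.exactDocs = new int[threshold];
  }

  void addDoc(int docId, boolean chosenBySampler) {
    if (totalHits < threshold) {
      exactDocs[totalHits] = docId;   // exact copy of the first hits
    }
    if (chosenBySampler) {
      sampled.set(docId);             // whatever the bin sampler picked
    }
    totalHits++;
  }

  boolean useExact() { return totalHits <= threshold; }
}
```

At the end of collection, useExact() decides which of the two structures feeds the facet counts; no up-front count of the hits is needed.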

Rob Audenaerde
added a comment - 05/Mar/14 08:17 - edited Hi all, good points.

Actually, in my application, I always do a count before any other search/facetting

Hmm, what do you mean? How do you count the number of hits before you execute the search?

The reason why the previous sampling solution did not do sampling per-segment is that in order to get to a good sample size and representative set, you need to know first how many documents the query matches and only then you can do a good sampling, taking min/maxSampleSize into account. Asking the app to define these boundaries per-segment is odd because app may not know how many segments an index has, or even the distribution of the segment sizes. For instance, if an index contains 10 segments and the app is willing to fully evaluate 200K docs in order to get a good sampled set, it would be wrong to specify that each segment needs to sample 20K docs, because the last two segments may be tiny and so in practice you'll end up w/ a sampled set of ~160K docs. On the other hand, if the search is evaluated entirely, such that you know the List<MatchingDocs> before sampling, you can now take a global decision about which documents to sample, given the min/maxSampleSize constraints.

At the beginning of this issue I thought that sampling could work like this:

FacetsCollector fc = new FacetsCollector(...);
searcher.search(q, fc);
Sampler sampler = new RandomSampler(...);
List<MatchingDocs> sampledDocs = sampler.sample(fc.getMatchingDocs());
facets.count(sampledDocs);

But the Facets impls all take FacetsCollector, so perhaps what we need is to implement RandomSamplingFacetsCollector and only override getMatchingDocs() to return the sampled set (and of course cache it). If we later want to return the original set, it's trivial to cache it aside (I don't think we should do it in this issue).

I realize it means we allocate bitsets unnecessarily, but that is a correct way to create a meaningful sample. Unless we can do it per-segment, but I think that's tricky since we never know a priori how many hits a segment will match. Perhaps we should focus here on getting a correct and meaningful sample, and improve performance only if it becomes a bottleneck? After all, setting a bit in a bitset is far faster than scoring the document.

Shai Erera
added a comment - 05/Mar/14 11:50

Well, I do a search of course, but collect the hits using a TotalHitCountCollector and do not retrieve any stored values. I did not find any other way to determine for sure whether I needed to do sampling or not. I know this takes time. When I first implemented it, however, it was faster to do a count and then determine whether to provide exact facets or to sample. Not optimal, but it worked. And because I could use sampling in the facets, the total time (one pass counting, one pass sampling facets) was still much less than the time it would take to do an exact facet count in one pass.

When only considering the samplingThreshold, faceting should still be doable without counting first. It can be done by storing the first samplingThreshold documents (in addDoc) in a separate array (in the collector) without sampling. This way the count is not needed to decide whether to sample or not, as sampling always happens; the sampled result is simply discarded if the total number of hits <= minSampleSize. I agree that this is not the nicest way to get a sample (but it can reduce the time to retrieve estimated facet results by a factor of 5 when using a sampling rate of 1 in 1000).

The alternative is to do it more like in your snippet (and more like the first approach); collect all documents and sample afterwards. This way you know the number of hits and adjusting the sample rate based on parameters is more straightforward.

Rob Audenaerde
added a comment - 05/Mar/14 13:18
Either way is faster than using exact facets, so both ways are a win.

Well, I do a search of course, but collect the hits using a TotalHitCountCollector and not retrieve any stored values

That means that you evaluate the query twice, which is expensive ... but also, this doesn't guarantee to provide a "correct" sample. So say you found out the query matches 10M documents and you decide that 100K docs are a good sample, you'll set the sampling ratio to 0.01 but then you apply this ratio per-segment (as in this patch), and could easily end up with less than 100K docs (e.g. if randomness didn't really pick 0.01 of documents in a certain segment).

I don't think we should store the minSampleSize in an int[] and move to a bitset if we collected more docs. First, the collector works per-segment and I think sampling should work on the entire result set. So the int[] wouldn't be part of MatchingDocs, it'd need to be held inside the collector and then you'll need to know where to "cut" it for each MatchingDocs instance (per-segment).

I really think a simple solution is what we should start with. RandomSamplingFC only overrides .getMatchingDocs() and it can determine if sampling is needed or not, given minSampleSize and the sum of totalHits from all MatchingDocs. Then you do sampling per-segment, but with the "global picture" in mind, and you're able to correct the sample ratio so that we come as close to minSampleSize as possible.

To me, if we can factor in a maxSampleSize in this issue is a bonus, but I can definitely see that happening in a separate issue, as that's a performance thing. We should focus on giving our users a collector which produces a good sample, otherwise it's not valuable sampling.
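The "global picture" correction being proposed — sum totalHits over all MatchingDocs first, then decide whether to sample and at what effective ratio so the sample lands at or above minSampleSize — reduces to a small calculation. A sketch with hypothetical names (this is not the committed API):

```java
// Sketch: given the global totalHits summed over all MatchingDocs,
// decide the effective per-segment sampling ratio so the overall
// sample size comes as close to minSampleSize as possible.
class SampleRatioPlanner {
  static double effectiveRatio(long totalHits, int minSampleSize, double sampleRatio) {
    if (totalHits <= minSampleSize) {
      return 1.0;                                    // too few hits: don't sample
    }
    if (totalHits * sampleRatio < minSampleSize) {
      return (double) minSampleSize / totalHits;     // raise ratio to the floor
    }
    return sampleRatio;
  }
}
```

E.g. with 1M hits, a 0.001 ratio, and minSampleSize=10,000, the ratio is bumped to 0.01 so the sample still reaches the requested floor.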

Shai Erera
added a comment - 05/Mar/14 14:41 Well, I do a search of course, but collect the hits using a TotalHitCountCollector and not retrieve any stored values
That means that you evaluate the query twice, which is expensive ... but also, this doesn't guarantee to provide a "correct" sample. So say you found out the query matches 10M documents and you decide that 100K docs are a good sample, you'll set the sampling ratio to 0.01 but then you apply this ratio per-segment (as in this patch), and could easily end up with less than 100K docs (e.g. if randomness didn't really pick 0.01 of documents in a certain segment).
I don't think we should store the minSampleSize in an int[] and move to a bitset if we collected more docs. First, the collector works per-segment and I think sampling should work on the entire result set. So the int[] wouldn't be part of MatchingDocs, it'd need to be held inside the collector and then you'll need to know where to "cut" it for each MatchingDocs instance (per-segment).
I really think a simple solution is what we should start with. RandomSamplingFC only overrides .getMatchingDocs() and it can determine if sampling is needed or not, given minSampleSize and the sum of totalHits from all MatchingDocs. Then you do sampling per-segment, but with the "global picture" in mind, and you're able to correct the sample ratio so that we come as close to minSampleSize as possible.
To me, if we can factor in a maxSampleSize in this issue is a bonus, but I can definitely see that happening in a separate issue, as that's a performance thing. We should focus on giving our users a collector which produces a good sample, otherwise it's not valuable sampling.
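For reference, the two-pass "global picture" idea could be sketched roughly like this (all names are illustrative, not the actual patch):

```java
// Sketch of the approach described above: a first pass sums totalHits over all
// per-segment MatchingDocs, then one global sampling ratio is derived and
// applied to every segment. Names here are illustrative, not the real patch.
public class GlobalRatioSketch {

  /** Returns 1.0 (no sampling) when the full result set is small enough. */
  static double sampleRatio(int[] perSegmentTotalHits, int minSampleSize) {
    long totalHits = 0;
    for (int hits : perSegmentTotalHits) {
      totalHits += hits; // the "first pass" over the segments
    }
    if (totalHits <= minSampleSize) {
      return 1.0;
    }
    return (double) minSampleSize / totalHits;
  }

  public static void main(String[] args) {
    // 1M hits spread unevenly over three segments, target sample of 100K:
    System.out.println(sampleRatio(new int[] {400_000, 500_000, 100_000}, 100_000)); // 0.1
  }
}
```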

Rob Audenaerde
added a comment - 06/Mar/14 08:46 I have a patch ready that implements the random sampling using an override of .getMatchingDocs(). It passes the test, so it should be OK.
It is slower however (only a 3x speedup), but maybe there is room for optimization?
Exact: 168 ms
Sampled: 55 ms

Shai Erera
added a comment - 06/Mar/14 09:55 Thanks Rob. Few comments:
I don't think that we need both totalHits and segmentHits. I know it may look not so expensive, but if you think about these ++ for millions of hits, they add up. Instead, I think we should stick w/ the original totalHits per-segment and in RandomSamplingFC do a first pass to sum up the global totalHits.
With that I think no changes are needed to FacetsCollector?
About RandomSamplingFacetsCollector:
I think we should fix the class jdoc's last "Note" as follows: "Note: if you use a counting
{\@link Facets}
implementation, you can fix the sampled counts by calling...".
Also, I think instead of correctFacetCounts we should call it amortizeFacetCounts or something like that. We do not implement here the exact facet counting method that was before.
I see that you remove sampleRatio, and now the ratio is computed as threshold/totalDocs but I think that's ... wrong? I.e. if threshold=10 and totalHits = 1000, I'll still get only 10 documents. But this is not what threshold means.
I think we should have minSampleSize, below which we don't sample at all (that's the threshold)
sampleRatio (e.g. 1%) is used only if totalHits > minSampleSize, and even then, we make sure that we sample at least minSampleSize
If we will have maxSampleSize as well, we can take that into account too, but it's OK if we don't do this in this issue
createSample seems to be memory-less – i.e. it doesn't carry over the bin residue to the next segment. Not sure if it's critical though, but it might affect the total sample size. If you feel like getting closer to the optimum, want to fix it? Otherwise, can you please drop a TODO?
Also, do you want to test using WAH8DocIdSet instead of FixedBitSet for the sampled docs? If not, could you please drop a TODO that we could use a more efficient bitset impl since it's a sparse vector?
About the test:
Could you remove the 's' from collectors in the test name?
Could you move to numDocs being a random number – something like atLeast(8000)?
I don't mean to nitpick but if you obtain an NRT reader, no need to commit()
Make the two collector instances take 100/10% of the numDocs when you fix it
Maybe use random.nextInt(10) for the facets instead of alternating sequentially?
I don't understand how you know that numChildren=5 when you ask for the 10 top children. Isn't it possible that w/ some random seed the number of children will be different?
In fact, I think that the random collectors should be initialized w/ a random seed that depends on the test? Currently they aren't and so always use 0xdeadbeef?
You have some sops left at the end of the test
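The minSampleSize/sampleRatio semantics proposed in this comment could look like this (hypothetical helper, not committed code):

```java
public class SampleSizeSketch {

  /**
   * Hypothetical helper for the semantics proposed above: below minSampleSize
   * we don't sample at all; above it we take sampleRatio of the hits, but
   * never fewer than minSampleSize documents.
   */
  static long targetSampleSize(long totalHits, long minSampleSize, double sampleRatio) {
    if (totalHits <= minSampleSize) {
      return totalHits; // below the threshold: no sampling
    }
    long bySampleRatio = (long) (totalHits * sampleRatio);
    return Math.max(minSampleSize, bySampleRatio);
  }

  public static void main(String[] args) {
    System.out.println(targetSampleSize(50_000, 100_000, 0.01));      // 50000 (no sampling)
    System.out.println(targetSampleSize(100_000_000, 100_000, 0.01)); // 1000000 (1% of hits)
    System.out.println(targetSampleSize(2_000_000, 100_000, 0.01));   // 100000 (floor applies)
  }
}
```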

Rob Audenaerde
added a comment - 06/Mar/14 10:47 Thanks Shai,
I have fixed the points you noted about the collector. I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater.
I have a general question about your remarks about the test, besides fixing the obvious (names, commit, sops). Is there a reason to add more randomness to one test? I normally try to test one aspect in a unit test. And if I also want to test some other aspect, like random document counts (to test the sampleratio for example), I add more tests.
Make the two collector instances take 100/10% of the numDocs when you fix it
Sorry, I don't get what you mean by this.
I don't understand how you know that numChildren=5 when you ask for the 10 top children. Isn't it possible that w/ some random seed the number of children will be different?
In fact, I think that the random collectors should be initialized w/ a random seed that depends on the test? Currently they aren't and so always use 0xdeadbeef?
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REAL small chance that one of the five values will be entirely missed when sampling. But that is 0.8 (chance not to take a value)^2000 * 5 (any can be missing) ~ 10^-193, so that is probably not going to happen.
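For reference, this estimate can be checked in log space (a union bound; pMiss = 0.8 is the per-sampled-document chance of not drawing a given value, as above):

```java
public class MissingValueOdds {

  /**
   * log10 of a union bound on the chance that at least one of `values` facet
   * values is missed entirely: values * pMiss^samples.
   */
  static double log10MissChance(double pMiss, int samples, int values) {
    return Math.log10(values) + samples * Math.log10(pMiss);
  }

  public static void main(String[] args) {
    // 5 values, 2000 sampled documents, per-document miss chance 0.8:
    System.out.println(log10MissChance(0.8, 2000, 5)); // ≈ -193.1
  }
}
```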

Gilad Barkai
added a comment - 06/Mar/14 11:59 Hi Rob, patch looks great.
A few comments:
Some imports are not used (o.a.l.u.Bits, o.a.l.s.Collector & o.a.l.s.DocIdSet)
Perhaps the parameters initialized in the RandomSamplingFacetsCollector c'tor could be made final
XORShift64Random.XORShift64Random() (default c'tor) is never used. Perhaps it was needed for usability when this was thought to be a core utility and was left by mistake? Should it be called somewhere?
getMatchingDocs()
when !sampleNeeded() there's a call to super.getMatchingDocs(); this may be a redundant method call, as we call it 5 lines above and the code always computes the totalHits first. Perhaps the original matching docs could be stored as a member? This would also help for some implementations of correcting the sampled facet results.
totalHits is redundantly computed again in line 147-152
needsSampling() could perhaps be protected, allowing other criteria for sampling to be added
createSample()
randomIndex is initialized to 0, effectively making the first document of every segment's first bin the one selected as the representative of that bin, neglecting the rest of the bin (regardless of the seed). So if a bin is the size of 1000 documents, then there are 999 documents that, regardless of the seed, would always be neglected. It may be better to initialize it as randomIndex = random.nextInt(binsize), as happens for the 2nd bin onward.
While creating a new MatchingDocs with the sampled set, the original totalHits and original scores are used. I'm not 100% sure the first is an issue, but any facet accumulation which relies on document scores would be hit by the second, as the scores (at least by the javadocs) are defined as non-sparse.

Rob Audenaerde
added a comment - 06/Mar/14 12:18 Thanks,
when !sampleNeeded() there's a call to super.getMatchingDocs(), this may be redundant method call as 5 lines above we call it, and the code always compute the totalHits first. Perhaps the original matching docs could be stored as a member? This would also help for some implementations of correcting the sampled facet results.
totalHits is redundantly computed again in line 147-152
How could I have missed this... Must take a break I think.
createSample
I always take the first document, as I did not implement carrying-over between segments. If I would pick a random index and this index were greater than the number of documents in the segment, the segment would not be sampled. This results in 'too few' sampled documents. Always taking the first might result in 'too many', but that gave a better overall distribution and average.
Given your argument about not-so-random documents, and the fact that carry-over should not be that hard, I should implement carry-over anyway.
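A sketch of bin sampling with carry-over, and a random index inside the first bin as well (illustrative only; class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of per-bin sampling with the two fixes discussed: a random index
// inside the *first* bin too, and carrying the bin position over to the next
// segment so segments not aligned to the bin size are still sampled fairly.
public class BinSamplerSketch {
  private final int binSize;
  private final Random random;
  private int posInBin = 0; // carried over across segment boundaries
  private int pickInBin;    // which offset of the current bin to keep

  BinSamplerSketch(int binSize, long seed) {
    this.binSize = binSize;
    this.random = new Random(seed);
    this.pickInBin = random.nextInt(binSize); // random even for the first bin
  }

  /** Returns the kept (sampled) hits out of one segment's hits. */
  List<Integer> sampleSegment(List<Integer> segmentHits) {
    List<Integer> kept = new ArrayList<>();
    for (int doc : segmentHits) {
      if (posInBin == pickInBin) {
        kept.add(doc);
      }
      if (++posInBin == binSize) { // bin exhausted: start a new one
        posInBin = 0;
        pickInBin = random.nextInt(binSize);
      }
    }
    return kept; // posInBin carries the residue into the next segment
  }

  public static void main(String[] args) {
    BinSamplerSketch sampler = new BinSamplerSketch(500, 42L);
    int kept = 0;
    for (int size : new int[] {1_300, 1_200, 2_500}) { // segments not bin-aligned
      List<Integer> hits = new ArrayList<>();
      for (int i = 0; i < size; i++) hits.add(i);
      kept += sampler.sampleSegment(hits).size();
    }
    System.out.println(kept); // 5000 hits / binSize 500 = exactly 10 samples
  }
}
```

With carry-over, every complete bin contributes exactly one sampled document regardless of how the hits are split across segments, so the sample size comes out at totalHits/binSize instead of drifting 'too few' or 'too many'.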

Shai Erera
added a comment - 06/Mar/14 18:54 - edited but any facet accumulation which would rely on document scores would be hit by the second as the scores
That's a great point Gilad. We need a test which covers that with random sampling collector.
Is there a reason to add more randomness to one test?
It depends. I have a problem with numDocs=10,000 and percents being 10% .. it creates too perfect numbers if you know what I mean. I prefer a random number of documents to add some spice to the test. Since we're testing a random sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we should stress the randomness done there. Given our test framework, randomness is not a big deal at all, since once we get a test failure, we can deterministically reproduce the failure (when there is no multi-threading). So I say YES, in this test I think we should have randomness.
But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores, I don't think you should add randomness – it's ok to test it once.
Also, in terms of test coverage, there are other cases which I think would be good if they were tested:
Docs + Scores (discussed above)
Multi-segment indexes (ensuring we work well there)
Different number of hits per-segment (to make sure our sampling on tiny segments works well too)
...
I wouldn't for example use RandomIndexWriter because we're only testing search (and so it just adds noise in this test). If we want many segments, we should commit/nrt-open every few docs, disable merge policy etc. These can be separate, real "unit-", tests.
Sorry, I don't get what you mean by this.
I meant that if you set numDocs = atLeast(8000), then the 10% sampler should not be hardcoded to 1,000, but to numDocs * 0.1.
the original totalHits .. is used
I think that's OK. In fact, if we don't record that, it would be hard to fix the counts no?
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REAL small chance that one of the five values will be entirely missed when sampling. But is that 0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so that is probable not going to happen
Ahh thanks, I missed that. I agree it's very improbable that one of the values is missing, but if we can avoid that at all it's better. First, it's not one of the values, we could be missing even 2 right – really depends on randomness. I find this assert just redundant – if we always expect 5, we shouldn't assert that we received 5. If we say that very infrequently we might get <5 and we're OK with it .. what's the point of asserting that at all?
I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater.
It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is that your intention?

Rob Audenaerde
added a comment - 07/Mar/14 16:39
...Given our test framework, randomness is not a big deal at all, since once we get a test failure, we can deterministically reproduce the failure (when there is no multi-threading)...
Ok, this makes sense to me.
It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is that your intention?
Correct, it is my intention. I actually prefer not to increase the sampleSize with more hits, as bigger samples are slower and 100K is a nice sample size anyway, and more hits means more time. I adjust the sampleRatio so that the resulting set of documents is (close to) the sampleSize.
I find this assert just redundant – if we always expect 5, we shouldn't assert that we received 5. If we say that very infrequently we might get <5 and we're OK with it .. what's the point of asserting that at all?
Agreed with the <5. Asserting seems redundant, but is that not the point of unit tests? The trick is that the assertion should still hold if you change the implementation.
I will add more next week.
Btw. is there an easy way to retrieve the total facet count for an ordinal? When correcting facet counts it would be a quick win to limit the number of estimated documents to the actual number of documents in the index that match that facet. (And maybe use the distribution as well, to make better estimates.)

Gilad Barkai
added a comment - 07/Mar/14 21:05
Btw. Is there an easy way to retrieve the total facet counts for a ordinal? When correcting facet counts it would a quick win to limit the number of estimated documents to the actual number of documents in the index that match that facet. (And maybe use the distribution as well, to make better estimates)
That's a great idea!
The docFreq of the category drill-down term is an upper bound - and could be used as a limit.
It's cheap, but might not be the exact number, as it also takes deleted documents into account.
The limit should also take into account the total number of hits for the query; otherwise the estimate and the multiplication with the sampling factor may yield a larger number than the actual results.
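In plain-Java terms the capping could look like this (names are hypothetical; in Lucene the bound would come from the docFreq of the drill-down term):

```java
public class AmortizeSketch {

  /**
   * Hypothetical correction: scale a sampled count back up by the sampling
   * ratio, capped by the drill-down term's docFreq, which is a cheap upper
   * bound (it may still include deleted documents).
   */
  static int amortize(int sampledCount, double samplingRatio, int docFreq) {
    long estimated = Math.round(sampledCount / samplingRatio);
    return (int) Math.min(estimated, docFreq);
  }

  public static void main(String[] args) {
    System.out.println(amortize(130, 0.1, 1_250)); // raw estimate 1300, capped at 1250
    System.out.println(amortize(90, 0.1, 1_250));  // raw estimate 900, below the cap
  }
}
```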

Gilad Barkai
added a comment - 09/Mar/14 09:29
The limit should also take under account the total number of hits for the query, otherwise the estimate and the multiplication with the sampling factor may yield a larger number than the actual results.
I understand this statement is confusing, I'll try to elaborate.
If the sample was exactly at the sampling ratio, this would not be a problem, but since the sample - being random as it is - may be a bit larger, adjusting according to the original sampling ratio (rather than the actual one) may yield larger counts than the actual results.
This could be solved by either limiting to the number of results, or adjusting the samplingRate to be the exact, post-sampling, ratio.
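The second option (adjusting to the exact, post-sampling ratio) could be sketched as follows (hypothetical helper):

```java
public class ActualRatioSketch {

  /**
   * Correct a sampled facet count using the ratio the sampler actually
   * achieved (sampledSetSize / totalHits) instead of the ratio it aimed for,
   * so the estimate cannot exceed the true number of results.
   */
  static long correct(long sampledCount, long sampledSetSize, long totalHits) {
    // integer form of: sampledCount / (sampledSetSize / (double) totalHits)
    return sampledCount * totalHits / sampledSetSize;
  }

  public static void main(String[] args) {
    // Aimed for 1% of 1M hits, but the random sample came out at 10,500 docs.
    // A facet seen 1,050 times in the sample is then estimated at 100,000
    // hits, not the 105,000 the nominal 1% ratio would suggest.
    System.out.println(correct(1_050, 10_500, 1_000_000)); // 100000
  }
}
```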

Shai Erera
added a comment - 09/Mar/14 09:38 Asserting seems redundant, but is that not the point in unit-tests?
The problem is when those false alarms cause noise. The previous sampling tests had a mechanism to reduce the noise as much as possible, but they didn't eliminate it. For example, the test was run few times, each time w/ increasing sample until it gave up and failed. At which point someone had to inspect the log and determine that this is a false positive. Since you expect at most 5 categories, and any number between 1-5 is fair game, I prefer not to assert on #categories at all. If you really want to assert something, then make sure 0 < #categories <= 5 ?

Rob Audenaerde
added a comment - 11/Mar/14 08:03 - edited Thanks again for the good points.
I currently have an amortizeFacetCounts that uses the IndexSearcher to retrieve a reader, and a FacetsConfig to determine the Term for the docFreq. I'm not really sure how this will work for hierarchies though.
Using totalHits as an upper bound is not really necessary I think; as the sampling rate is determined by the total number of hits and the sample size, reversing this can never yield numbers greater than totalHits.
I also removed the exact assert and switched to an atLeast. Will add a patch soon.

From the class javadocs: Note: the original set of hits will be available as documents...
I think we should just write "the original set of hits can be retrieved from getOriginal.." - I don't want anyone to be confused with the wording "will be available as documents".

Can you make NOT_CALCULATED static?

Typo: samplingRato

randomSeed: I think it should say "if 0..." not "if null".

getMatchingDocs – can you move the totalHits calculation to getTotalHits()? And then call it only if (sampledDocs==null)?

needsSampling – I know it was suggested to make it protected for inheritance purposes, but as it looks now, all members are private so I don't see how can one extend the class only to override this method (e.g. he won't have access to sampleSize even). Maybe we should keep it private and when someone asks to extend, we know better what needs to be protected? For instance, I think it's also important that we allow overriding createSampledDocList, but for now let's keep it simple.

I think that we need to document somewhere (maybe in class javadocs) that the returned sampled docs may include empty MatchingDocs instances (i.e. when no docs were "sampled" from a segment). Just so that we don't surprise anyone with empty instances. If people work w/ MatchingDocs as they should, by obtaining an iterator, it shouldn't be a problem, but better document it explicitly.

Perhaps we should also say something about the returned MatchingDocs.totalHits, which are the original totalHits and not the sampled set size?

About carryOver:

Doesn't it handle the TODO at the beginning of createSample?

Why does it need to always include the first document of the first segment? Rather you could initialize it to -1 and if carryOver == -1 set it to a random index within that bin? Seems more "random" to me.

amortizedFacetCounts:

It's a matter of style so I don't mind if you keep this like you wrote, but I would write it as if (!needsSampling() || res == null) and then the rest of the method isn't indented. Your call.

I think it's better to allocate childPath so that the first element is already the dimension. See what FacetsConfig.pathToString currently does. Currently it results in re-allocating the String[] for every label. Then you can call the pathToString variant which takes the array and the length.

Separately, would be good if FacetsConfig had an appendElement(Appendable, int idx) to append a specific element to the appendable. Then you could use a StringBuilder once, and skip the encoding done for the entire path except the last element.

Perhaps we should cap this (int) (res.value.doubleValue() / this.samplingRate) by e.g. the number of non-deleted documents?

About test:

This comment is wrong: //there will be 5000 hits.

Why do you initialize your own Random? It's impossible to debug the test like that. You should use random() instead.

This comment is wrong? //because a randomindexer is used, the results will vary each time. – it's not because the random indexer, but because of random sampling no?

Would you mind formatting the test code? I see empty lines, curly braces that are sometimes open in the next line etc. I can do it too before I commit it.

I see that createSample still has the scores array bug – it returns the original scores array, irrespective of the sampled docs. Before you fix it, can you please add a testcase which covers scores and fails?

Shai Erera
added a comment - 11/Mar/14 09:21
Javadocs:
From the class javadocs: Note: the original set of hits will be available as documents...
I think we should just write "the original set of hits can be retrieved from getOriginal.." - I don't want anyone to be confused with the wording "will be available as documents".
Can you make NOT_CALCULATED static?
Typo: samplingRato
randomSeed: I think it should say "if 0..." not "if null".
getMatchingDocs – can you move the totalHits calculation to getTotalHits()? And then call it only if (sampledDocs == null)?
needsSampling – I know it was suggested to make it protected for inheritance purposes, but as it looks now, all members are private, so I don't see how one can extend the class only to override this method (e.g. he won't even have access to sampleSize). Maybe we should keep it private, and when someone asks to extend it, we'll know better what needs to be protected? For instance, I think it's also important that we allow overriding createSampledDocList, but for now let's keep it simple.
I think that we need to document somewhere (maybe in class javadocs) that the returned sampled docs may include empty MatchingDocs instances (i.e. when no docs were "sampled" from a segment). Just so that we don't surprise anyone with empty instances. If people work w/ MatchingDocs as they should, by obtaining an iterator, it shouldn't be a problem, but better document it explicitly.
Perhaps we should also say something about the returned MatchingDocs.totalHits, which are the original totalHits and not the sampled set size?
About carryOver:
Doesn't it handle the TODO at the beginning of createSample?
Why does it need to always include the first document of the first segment? Rather you could initialize it to -1 and if carryOver == -1 set it to a random index within that bin? Seems more "random" to me.
amortizedFacetCounts:
It's a matter of style, so I don't mind if you keep this as you wrote it, but I would write it as if (!needsSampling() || res == null) and then the rest of the method isn't indented. Your call.
I think it's better to allocate childPath so that the first element is already the dimension. See what FacetsConfig.pathToString currently does. Currently it results in re-allocating the String[] for every label. Then you can call the pathToString variant which takes the array and the length.
Separately, would be good if FacetsConfig had an appendElement(Appendable, int idx) to append a specific element to the appendable. Then you could use a StringBuilder once, and skip the encoding done for the entire path except the last element.
Perhaps we should cap this (int) (res.value.doubleValue() / this.samplingRate) by e.g. the number of non-deleted documents?
About test:
This comment is wrong: //there will be 5000 hits.
Why do you initialize your own Random? It's impossible to debug the test like that. You should use random() instead.
This comment is wrong? //because a randomindexer is used, the results will vary each time. – it's not because of the random indexer, but because of the random sampling, no?
Would you mind formatting the test code? I see empty lines, curly braces that are sometimes opened on the next line, etc. I can do it too before I commit it.
I see that createSample still has the scores array bug – it returns the original scores array, irrespective of the sampled docs. Before you fix it, can you please add a testcase which covers scores and fails?
I think we're very close!
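
The capping Shai suggests for the amortized counts can be illustrated with a small stand-alone sketch (the class and method names here are hypothetical, not part of the patch): the sampled value is scaled back up by dividing by the sampling rate, and the result is capped by the number of non-deleted documents so it can never exceed what the index actually contains.

```java
// Hypothetical sketch of amortizing a sampled facet count.
class AmortizedCount {
  /**
   * Scale a sampled facet value back up by the sampling rate, capped by
   * the number of live (non-deleted) documents in the index.
   */
  public static int amortize(double sampledValue, double samplingRate, int numLiveDocs) {
    int scaled = (int) (sampledValue / samplingRate);
    // The scaled estimate can overshoot; never report more hits than live docs.
    return Math.min(scaled, numLiveDocs);
  }
}
```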

Rob Audenaerde
added a comment - 12/Mar/14 08:12
Hi all,
Making good progress, only I'm not really sure what to do with the scores. I could either keep only the scores of the sampled documents (creating a new scores[] in createSample), or leave scoring out completely for the sampler (passing keepScores = false, removing the c'tor param and setting {{scores}} to null)?

Shai Erera
added a comment - 12/Mar/14 08:42
If it's not too much work for you, I think you should just create a new float[]? You can separate the code such that you don't check whether to keep scores for every sampled document, at the cost of duplicating code. But otherwise I think it would be good if we kept that functionality for sampled docs too.

Gilad Barkai
added a comment - 17/Mar/14 09:16
About the scores (the only part I got to review thus far): the scores should be a non-sparse float array.
E.g., if there are 1M documents and the original set contains 1000 documents, the scores[] array would be of length 1000. If the sampled set only has 10 documents, the scores[] array should have length 10.
The relevant part:
if (getKeepScores()) {
  scores[doc] = docs.scores[doc];
}
should be changed, as the scores[] size and index should be relative to the sampled set and not the original results.
Also, the size of the scores[] array could be the number of bins?
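
Gilad's point can be sketched with a small stand-alone example (class and method names are hypothetical, not from the patch): the new scores array is allocated at the size of the sampled set and filled by position within that set, rather than indexed sparsely by doc.

```java
// Hypothetical sketch: build a non-sparse scores array for the sampled set.
class SampledScores {
  /**
   * scores is the original per-hit scores array; sampledIdx holds the
   * positions (into the original hit list) of the documents that were kept.
   * The returned array has one slot per sampled doc, indexed by position
   * in the sampled set.
   */
  public static float[] sampleScores(float[] scores, int[] sampledIdx) {
    float[] sampled = new float[sampledIdx.length];
    for (int i = 0; i < sampledIdx.length; i++) {
      sampled[i] = scores[sampledIdx[i]];
    }
    return sampled;
  }
}
```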

Shai Erera
added a comment - 17/Mar/14 11:10
Rob, I reviewed the patch and I agree with Gilad - the way you handle the scores array is wrong; it's not random access by doc. I believe if you added a test it would show up quickly. But perhaps we can keep scores out of this collector ... we can always add it later. So I don't mind if you want to wrap up w/o scores for now. Can you then fix the patch to always set keepScores=false? Also, I noticed a few sops (System.out.println calls) left in the test.

Shai Erera
added a comment - 19/Mar/14 16:07
I reviewed the patch more closely before I commit:
Modified a few javadocs
Removed needsSampling() since we don't offer an extension any access to e.g. totalHits. We can add it when there's demand.
Fixed a bug in how carryOver was implemented – replaced it with two members, leftoverBin and leftoverIndex. So now if leftoverBin != -1 we know to skip that many documents at the start of the next segment, and depending on leftoverIndex, whether we need to sample any of them. Before, we didn't really skip over the leftover docs in the bin, but started to count a new bin.
Added a CHANGES entry.
I reviewed the test - it would be good if we could write a unit test which specifically matches only a few documents in one segment compared to the rest. I will look into it, perhaps later.
I think it's ready, but if anyone wants to give createSample() another look, to make sure this time the leftover handling works well - I won't commit it before tomorrow anyway.
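
The leftover handling discussed above can be sketched roughly as follows (a simplified, hypothetical stand-in for createSample, not the actual patch code, with different member names): the sampler keeps its bin position and the randomly chosen in-bin sample index across segment boundaries, so a bin that straddles two segments is completed in the next segment rather than restarted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of per-bin sampling with leftover carried across segments.
class BinSampler {
  private final int binSize;
  private final Random random;
  private int posInBin = 0;   // how many hits of the current bin we've already seen
  private int sampleAt = -1;  // position within the current bin chosen to be sampled

  public BinSampler(int binSize, long seed) {
    this.binSize = binSize;
    this.random = new Random(seed);
  }

  /** Feed one segment's hits; returns the docIDs sampled from this segment. */
  public List<Integer> sampleSegment(int[] segmentHits) {
    List<Integer> sampled = new ArrayList<>();
    for (int doc : segmentHits) {
      if (posInBin == 0) {
        // Starting a new bin: pick the one hit to keep from it.
        sampleAt = random.nextInt(binSize);
      }
      if (posInBin == sampleAt) {
        sampled.add(doc);
      }
      if (++posInBin == binSize) {
        posInBin = 0; // bin complete; the next hit starts a new bin
      }
    }
    // posInBin/sampleAt are NOT reset here, so a partially filled bin
    // carries over into the next segment instead of being restarted.
    return sampled;
  }
}
```

Every complete bin yields exactly one sample, regardless of how the bin is split across segments; only a trailing partial bin may yield none.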