[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917407#comment-13917407
]
Shai Erera edited comment on LUCENE-5476 at 3/2/14 1:24 PM:
------------------------------------------------------------
I reviewed createSample in the patch, and I think something's wrong with it. As I understand
it, you set binsize to {{1.0/sampleRatio}}, so if sampleRatio=0.1, binsize=10. This means we
should keep 1 out of every 10 matching documents. You then iterate over the bitset, and for
every group of 10 documents you draw a representative index at random and clear all the others.
But what seems wrong is that the iteration advances through the "bin" irrespective of which
documents actually matched. So e.g. if the FBS has 100 docs total, and only docs 5, 15, 25, 35, ...
matched, and random.nextInt(binsize) happens to pick any index but 5 in every bin, then if I
understand the code correctly, all bits will be cleared! I double-checked with the following short main:
{code}
import java.util.Random;
import org.apache.lucene.util.FixedBitSet;

public static void main(String[] args) throws Exception {
  Random random = new Random();
  FixedBitSet sampledBits = new FixedBitSet(100);
  for (int i = 5; i < 100; i += 10) {
    sampledBits.set(i); // 10 "matching" docs: 5, 15, 25, ...
  }
  int size = 100;
  int binsize = 10;
  int countInBin = 0;
  int randomIndex = random.nextInt(binsize);
  for (int i = 0; i < size; i++) {
    countInBin++;
    if (countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    // clears the matching doc unless it happens to sit exactly at randomIndex
    if (sampledBits.get(i) && !(countInBin == randomIndex)) {
      sampledBits.clear(i);
    }
  }
  for (int i = 0; i < 100; i++) {
    if (sampledBits.get(i)) {
      System.out.print(i + " ");
    }
  }
  System.out.println();
}
{code}
And indeed, over many runs the main often prints nothing at all, and in some runs only 2-3 docs
"survive" the sampling.
So first, I think Gilad's pseudo-code is more correct in that it iterates over the matching
documents only, and I think you should do the same here. Once you do that, you no longer need
to check bits.get(i) in order to decide whether to clear a bit.
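To illustrate, here's a minimal sketch of the per-matching-doc binning I have in mind. It uses java.util.BitSet in place of FixedBitSet so it runs standalone, and the class/method names are just illustrative, not from the patch:
{code}
import java.util.BitSet;
import java.util.Random;

public class SampleMatchingDocs {
  // Bins are counted over *matching* documents only, so every full bin
  // of binsize matches keeps exactly one randomly chosen survivor.
  static BitSet createSample(BitSet matching, int binsize, Random random) {
    BitSet sampled = new BitSet(matching.length());
    int countInBin = 0;
    int keep = random.nextInt(binsize); // survivor's position within the current bin
    for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
      if (countInBin == keep) {
        sampled.set(doc); // only survivors are ever set -- nothing to clear
      }
      if (++countInBin == binsize) { // current bin of matches exhausted
        countInBin = 0;
        keep = random.nextInt(binsize);
      }
    }
    return sampled;
  }

  public static void main(String[] args) {
    BitSet matching = new BitSet(100);
    for (int i = 5; i < 100; i += 10) {
      matching.set(i); // the same 10 matching docs as in the main above
    }
    // 10 matches with binsize=10 form exactly one full bin, so exactly one doc survives
    System.out.println(createSample(matching, 10, new Random()).cardinality());
  }
}
{code}
With this loop, the 10 matching docs above always yield exactly one survivor, instead of sometimes none.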
I also wonder whether your benchmark results are correct (unless you ran a MatchAllDocsQuery?) -- can
you confirm that you indeed count 10% of the documents?
Another point Gilad raised is the method of sampling. While you implement random sampling,
there are other methods, e.g. "take every 10th document", which will be faster. I think one
can experiment with these through an extension of SampledDocs.createSample (and SamplingFC.createDocs),
but in case you want to give such a simple sampling method a try, it would be interesting
to compare random sampling against something simpler.
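For comparison, the "every Nth matching document" variant is trivial to express in the same shape (again with java.util.BitSet standing in for FixedBitSet, and names that are only illustrative):
{code}
import java.util.BitSet;

public class NthDocSampler {
  // Deterministic sampling: keep the 1st, (n+1)th, (2n+1)th, ... matching doc.
  // No Random is involved, so it avoids drawing a per-bin index entirely.
  static BitSet sampleEveryNth(BitSet matching, int n) {
    BitSet sampled = new BitSet(matching.length());
    int seen = 0;
    for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
      if (seen++ % n == 0) {
        sampled.set(doc);
      }
    }
    return sampled;
  }

  public static void main(String[] args) {
    BitSet matching = new BitSet(100);
    matching.set(0, 100); // all 100 docs match, as with a MatchAllDocsQuery
    System.out.println(sampleEveryNth(matching, 10).cardinality()); // prints 10
  }
}
{code}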
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets
> is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)