Michael McCandless
added a comment - 30/Jan/09 21:19
> We need more performance data before going ahead with tombstone deletions. Sounds like the next step.
Or, I think morphing the tombstones idea into the "incremental copy on write block BitVector" idea you mentioned is a good next step to explore (for realtime search).

Michael McCandless
added a comment - 30/Jan/09 21:12
> This seems like something we can port to Java and get into
> contrib/benchmark. Particularly automatically creating the indexes.
That would be great! Maybe it's not so much work... contrib/benchmark can already do the "rounds" to enumerate through the different combinations of query, index, etc.
Automatically creating the indices is trickier; you'd want to 1) create the first index, 2) copy it and do X% deletions, 3) repeat 2 for different values of X.
Also tricky is gathering a series of results and saving them somehow so that you can then make a baseline for other runs.
Or we could keep using Python (pickling is so easy). It's nice to have a "scripting" layer on top of contrib/benchmark.
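The copy-and-delete step (2) could be scripted along these lines. This is only a sketch in plain Java, not contrib/benchmark code; `pickDeletions` is a hypothetical helper, and the fixed seed is there so the trunk and patch areas get identical deleted-doc sets.

```java
import java.util.Random;
import java.util.TreeSet;

// Sketch: reproducibly pick X% of doc IDs to delete when building the
// benchmark indices. Hypothetical helper, not contrib/benchmark API.
public class DeletionPicker {
    public static TreeSet<Integer> pickDeletions(int numDocs, double pct, long seed) {
        int target = (int) (numDocs * pct / 100.0);
        Random rnd = new Random(seed);
        TreeSet<Integer> deleted = new TreeSet<>();
        while (deleted.size() < target) {
            deleted.add(rnd.nextInt(numDocs));   // dedupe until we hit the target count
        }
        return deleted;
    }

    public static void main(String[] args) {
        // 10% of a 15,000 doc index -> 1,500 distinct doc IDs to delete
        System.out.println(pickDeletions(15000, 10.0, 42L).size());
    }
}
```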

Michael McCandless
added a comment - 30/Jan/09 21:07
> Thanks for running all these benchmarks. (I'm going to have to port the
> benchmark suite sooner rather than later so that Lucy doesn't have to rely on
> having algos worked out under Lucene.)
No problem! And I don't mind working these things out / discussing
them in Lucene – many of these ideas turn out very well! E.g. LUCENE-1483.
Though... iteration may work better in C than it does in Java, so it's
hard to know what to conclude here for Lucy.
> I think it's important to note that the degradation at low deletion
> percentages is not as bad as the chart seems to imply. The worst performance
> was on the cheapest query, and the best performance was on the most expensive
> query.
True.
> Furthermore, a 50% deletion percentage is very high. Regardless of how
> deletions are implemented, the default merge policy ought to trigger
> absorption of segments with deletion percentages over a certain threshold.
> The actual threshold percentage is an arbitrary number because the ideal
> tradeoff between index-time sacrifices and search-time sacrifices varies, but
> my gut pick would be somewhere between 10% and 30% for bit vector deletions
> and between 5% and 20% for iterated deletions.
Right, we should have the default merge policy try to coalesce segments
with too many deletions (Lucene does not do this today – you have to
call expungeDeletes yourself, and even that's not perfect since it
eliminates every single delete).
> Ha, you're giving up a little early, amigo.
> I do think it's clear that maintaining an individual deletions iterator for
> each posting list isn't going to work well, particularly when it's implemented
> using an opaque DocIdSetIterator and virtual methods. That can't compete with
> what is essentially an array lookup (since BitVector.get() can be inlined) on
> a shared bit vector, even at low deletion rates.
Right.
> I also think we can conclude that high deletion rates cause an accelerating
> degradation with iterated deletions, and that if they are implemented, that
> problem probably needs to be addressed with more aggressive segment merging.
Agreed.
> However, I don't think the benchmark data we've seen so far demonstrates that
> the filtered deletions model won't work. Heck, with that model,
> deletions get pulled out of TermDocs so we lose the per-iter null-check,
> possibly yielding a small performance increase in the common case of zero
> deletions.
By "filtered deletions model", you mean treating deletions as a
"normal" filter and moving "up" when they are applied? Right, we have
not explored that yet, here.
Except, filters are accessed by iteration in Lucene now (they used to
be random access before LUCENE-584).
So... now I'm wondering if we should go back and scrutinize the
performance cost of switching from random access to iteration for
filters, especially in cases where under the hood the filter needed to
create a non-sparse representation anyway. It looks like there was a fair
amount of discussion about performance, but no hard conclusions, in
LUCENE-584, on first read. Hmmm.
I think to do a fair test, we need to move deleted docs "up higher",
but access them w/ a random access API?
Except... the boolean query QPS are not that different in the
no-deletions case vs the 1% deletions case. And it's only the non-simple
queries (boolean, span, phrase, etc.) that would see any difference
from moving deletions "up" as a filter? Ie their QPS is already so low
that the added cost of doing the deletes "at the bottom" appears to be
quite minor.
(Conceptually, I completely agree that deletions are really the same
thing as filters.)
Maybe we should close this issue with a won't-fix and start a new one for
filtered deletions?
Maybe discuss a bit more here or on java-dev first?
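For illustration, the "deletions moved up as a filter" model amounts to intersecting the query's matches with the live docs by leapfrogging two sorted streams. A minimal sketch with plain arrays standing in for Lucene's DocIdSetIterator (none of these names are Lucene API):

```java
// Sketch of applying deletions "up high" as a filter: leapfrog the match
// stream against a sorted deletions list instead of checking a shared bit
// vector inside TermDocs.next(). Illustrative only, not the actual API.
public class FilteredDeletions {
    // Return the doc IDs in matches that are not in the sorted deletions list.
    public static int[] applyDeletions(int[] matches, int[] sortedDeletes) {
        int[] out = new int[matches.length];
        int n = 0, d = 0;
        for (int doc : matches) {
            while (d < sortedDeletes.length && sortedDeletes[d] < doc) d++;   // leapfrog
            if (d < sortedDeletes.length && sortedDeletes[d] == doc) continue; // deleted
            out[n++] = doc;
        }
        return java.util.Arrays.copyOf(out, n);
    }
}
```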

Jason Rutherglen
added a comment - 30/Jan/09 18:34
> Just run sortBench2.py in contrib/benchmark of trunk & patch
> areas. Then run sortCollate2.py to make the Jira table (-jira) or
> print a human readable output (default). You'll have to make your own
> Wikipedia indices with the pctg deletes, then edit sortBench2.py &
> sortCollate2.py to fix the paths.
> All they do is write an alg file, run the test, and parse the output
> file to gather best of 5.
This seems like something we can port to Java and get into
contrib/benchmark. Particularly automatically creating the indexes.

Jason Rutherglen
added a comment - 30/Jan/09 17:56
> Maybe we should close this issue with a won't-fix and start a new one for filtered deletions?
We need more performance data before going ahead with tombstone deletions. Sounds like the next step.

Marvin Humphrey
added a comment - 30/Jan/09 17:38
> Actually I used your entire patch on its own.
Ah. Thanks for running all these benchmarks. (I'm going to have to port the
benchmark suite sooner rather than later so that Lucy doesn't have to rely on
having algos worked out under Lucene.)
I think it's important to note that the degradation at low deletion
percentages is not as bad as the chart seems to imply. The worst performance
was on the cheapest query, and the best performance was on the most expensive
query.
Furthermore, a 50% deletion percentage is very high. Regardless of how
deletions are implemented, the default merge policy ought to trigger
absorption of segments with deletion percentages over a certain threshold.
The actual threshold percentage is an arbitrary number because the ideal
tradeoff between index-time sacrifices and search-time sacrifices varies, but
my gut pick would be somewhere between 10% and 30% for bit vector deletions
and between 5% and 20% for iterated deletions.
> I'm now feeling like we're gonna have to keep random-access to deleted
> docs....
Ha, you're giving up a little early, amigo.
I do think it's clear that maintaining an individual deletions iterator for
each posting list isn't going to work well, particularly when it's implemented
using an opaque DocIdSetIterator and virtual methods. That can't compete with
what is essentially an array lookup (since BitVector.get() can be inlined) on
a shared bit vector, even at low deletion rates.
I also think we can conclude that high deletion rates cause an accelerating
degradation with iterated deletions, and that if they are implemented, that
problem probably needs to be addressed with more aggressive segment merging.
However, I don't think the benchmark data we've seen so far demonstrates that
the filtered deletions model won't work. Heck, with that model,
deletions get pulled out of TermDocs so we lose the per-iter null-check,
possibly yielding a small performance increase in the common case of zero
deletions.
Maybe we should close this issue with a won't-fix and start a new one for
filtered deletions?
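The threshold check Marvin describes can be sketched as a toy merge-policy hook: flag any segment whose deletion percentage exceeds a cutoff so the policy can absorb it. The 20% default below is an arbitrary pick from the range he suggests, and none of these names are real Lucene API.

```java
// Toy sketch of a deletion-percentage trigger for segment absorption.
// Hypothetical names; not Lucene's MergePolicy API.
public class DeletionThreshold {
    public static boolean shouldAbsorb(int docCount, int delCount, double maxDelPct) {
        if (docCount == 0) return false;
        return 100.0 * delCount / docCount > maxDelPct;   // e.g. 25% deleted > 20% cutoff
    }

    public static void main(String[] args) {
        System.out.println(shouldAbsorb(100000, 25000, 20.0));
    }
}
```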

Alas.... I had a bug in my original test (my SegmentTermDocs was
incorrectly returning some deleted docs). But even with that bug I
can't repro my original "it's faster at < 10% deletions". Here are my
results using a pre-computed array of deleted docIDs:

||%tg deletes||query||hits||qps||qpsnew||pctg||
|0%|147|4984|5560.1|5392.5|-3.0%|
|0%|text|97191|347.3|334.1|-3.8%|
|0%|1 AND 2|234634|22.9|22.8|-0.4%|
|0%|1|386435|88.4|86.0|-2.7%|
|0%|1 OR 2|535584|20.9|20.8|-0.5%|
|1%|147|4933|5082.0|3643.5|-28.3%|
|1%|text|96143|313.9|304.9|-2.9%|
|1%|1 AND 2|232250|22.1|22.3|0.9%|
|1%|1|382498|81.0|82.3|1.6%|
|1%|1 OR 2|530212|20.2|20.2|0.0%|
|2%|147|4883|5133.6|3299.6|-35.7%|
|2%|text|95190|315.8|289.7|-8.3%|
|2%|1 AND 2|229870|22.2|22.1|-0.5%|
|2%|1|378641|81.2|80.9|-0.4%|
|2%|1 OR 2|524873|20.3|20.2|-0.5%|
|5%|147|4729|5073.6|2405.2|-52.6%|
|5%|text|92293|315.2|259.0|-17.8%|
|5%|1 AND 2|222859|22.5|22.0|-2.2%|
|5%|1|367000|81.0|77.6|-4.2%|
|5%|1 OR 2|508632|20.4|19.7|-3.4%|
|10%|147|4475|5049.6|1738.8|-65.6%|
|10%|text|87504|314.8|232.6|-26.1%|
|10%|1 AND 2|210982|22.9|21.7|-5.2%|
|10%|1|347664|81.5|74.0|-9.2%|
|10%|1 OR 2|481792|21.2|20.2|-4.7%|
|20%|147|4012|5045.0|1117.6|-77.8%|
|20%|text|77980|317.2|208.9|-34.1%|
|20%|1 AND 2|187605|23.9|21.4|-10.5%|
|20%|1|309040|82.0|68.2|-16.8%|
|20%|1 OR 2|428232|22.3|20.2|-9.4%|
|50%|147|2463|5283.2|522.3|-90.1%|
|50%|text|48331|336.9|176.4|-47.6%|
|50%|1 AND 2|116887|28.4|23.0|-19.0%|
|50%|1|193154|86.4|63.5|-26.5%|
|50%|1 OR 2|267525|27.6|22.4|-18.8%|

I've attached my patch, but note that some tests fail because I don't update the list of deleted docs when deleteDocument is called.

I'm now feeling like we're gonna have to keep random-access to deleted docs....

Michael McCandless
added a comment - 30/Jan/09 09:53
> Presumably you spliced the improved nextSetBit into Jason's patch, correct?
Actually I used your entire patch on its own.
> Didn't you have a close-to-ideal patch using sorted ints that performed well up to 10% deletions? What did that look like?
I thought I did – but it was rather hacked up (I "fixed" SegmentReader to always do an up-front conversion into int[] deletedDocs). I'll re-test it to try to repro my initial rough results.
> I also suspect that when there are many deletions, the sheer number of method calls to perform the deletions iteration is a burden. The iterator has to compete with an inline-able method from a final class (BitVector).
Right, for a highish %tg deletion it seems likely that random-access will win.

Marvin Humphrey
added a comment - 30/Jan/09 04:24
> Numbers with Marvin's latest patch:
Presumably you spliced the improved nextSetBit into Jason's patch, correct? I wonder how my patch on its own would do, since there's less abstraction. Didn't you have a close-to-ideal patch using sorted ints that performed well up to 10% deletions? What did that look like?
> I think we should try list-of-sorted-ints?
That should help with the situation where deletes are sparse, particularly when the term is rare (such as those searches for "147"), since it will remove the cost of scanning through a bunch of empty bits.
I'm also curious what happens if we do without the null-check here:

    if (deletedDocsIt != null) {
      if (doc > nextDeletion) {
        if (deletedDocsIt.skipTo(doc))
          nextDeletion = deletedDocsIt.doc();
      }
      if (doc == nextDeletion)
        continue;
    }

When there are no deletions, nextDeletion is set and left at Integer.MAX_VALUE, so we'd get a comparison that's always false for the life of the TermDocs instead of an always-null null check. Possibly we'd slow down the no-deletions case while speeding up all others, but maybe the processor does a good job at predicting the result of the comparison.
I also suspect that when there are many deletions, the sheer number of method calls to perform the deletions iteration is a burden. The iterator has to compete with an inline-able method from a final class (BitVector).
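Marvin's sentinel variant can be sketched outside Lucene: with no deletions the sentinel stays at Integer.MAX_VALUE, so the hot loop pays one always-false compare instead of a null check. Here a sorted int[] stands in for the deletions iterator; the names are illustrative only.

```java
// Sketch of the Integer.MAX_VALUE sentinel replacing the per-iter null check.
// A sorted int[] plays the role of the deletions iterator's skipTo()/doc().
public class SentinelSkip {
    public static int countLiveDocs(int[] docs, int[] sortedDeletes) {
        int d = 0;
        int nextDeletion = sortedDeletes.length == 0 ? Integer.MAX_VALUE : sortedDeletes[0];
        int live = 0;
        for (int doc : docs) {
            if (doc > nextDeletion) {             // advance the "iterator" (skipTo)
                while (d < sortedDeletes.length && sortedDeletes[d] < doc) d++;
                nextDeletion = d < sortedDeletes.length ? sortedDeletes[d] : Integer.MAX_VALUE;
            }
            if (doc == nextDeletion) continue;    // deleted; skip it
            live++;
        }
        return live;
    }
}
```

With an empty deletions list, the `doc > nextDeletion` branch never fires, which is the case Marvin hopes the branch predictor makes nearly free.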

Michael McCandless
added a comment - 30/Jan/09 00:28
> The main problem we're trying to solve is potential allocation of a
> large del docs BV byte array for the copy on write of a cloned
> reader.
Right, as long as normal search performance does not get worse.
Actually, I was hoping "deletes as iterator" and "deletes higher up as
filter" might give us some gains in search performance.
> An option we haven't looked at is a MultiByteArray where
> multiple byte arrays make up a virtual byte array checked by BV.get.
> On deleteDocument, only the byte array chunks that are changed are
> replaced in the new version, while the previously copied chunks are
> kept. The overhead of the BV.get can be minimal, though in our tests
> with an int array version the performance can either be equal to or
> double based on factors we are not aware of.
I think that'd be a good approach (it amortizes the copy on write
cost), though it'd be a double deref per lookup with the
straightforward impl so I think it'll hurt normal search perf too.
And I don't think we should give up on iterator access just yet... I
think we should try list-of-sorted-ints?
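The MultiByteArray idea can be sketched as a paged copy-on-write bit vector: a clone shares every page with the original until a delete touches one, at which point only that page is copied, and `get` pays the double dereference Mike mentions. All names here are hypothetical, not Lucene code.

```java
// Sketch of a paged copy-on-write deleted-docs vector ("MultiByteArray").
// 1024-byte pages = 8192 bits per page; withBitSet copies only the hit page.
public class PagedBitVector {
    private static final int PAGE_BYTES = 1024;
    private final byte[][] pages;

    public PagedBitVector(int numBits) {
        pages = new byte[(numBits / 8 + PAGE_BYTES) / PAGE_BYTES][PAGE_BYTES];
    }

    private PagedBitVector(byte[][] shared) { pages = shared; }

    public boolean get(int bit) {   // the "double deref" cost Mike points out
        return (pages[bit >> 13][(bit >> 3) & (PAGE_BYTES - 1)] & (1 << (bit & 7))) != 0;
    }

    // Copy-on-write set: the returned vector shares every untouched page.
    public PagedBitVector withBitSet(int bit) {
        byte[][] copy = pages.clone();   // shallow clone: shares the page arrays
        int p = bit >> 13;
        copy[p] = pages[p].clone();      // deep-copy only the affected page
        copy[p][(bit >> 3) & (PAGE_BYTES - 1)] |= (1 << (bit & 7));
        return new PagedBitVector(copy);
    }
}
```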

Marvin Humphrey
added a comment - 29/Jan/09 22:26
> the nextSetBit impl in BitVector (in the patch) is extremely inefficient.
Here's an improved version that doesn't call get().
(quasi_iterator_deletions_r3.diff)
The private method firstBitInNonZeroByte() in this patch could potentially
be replaced with a 256-byte lookup table.
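The 256-entry table Marvin suggests might look like this. It's a sketch, not the patch's actual firstBitInNonZeroByte: scan whole bytes, then resolve the first set bit of the non-zero byte with one table lookup instead of a per-bit loop (OpenBitSet plays the same trick a word at a time).

```java
// Sketch of nextSetBit over a byte[] using a 256-entry first-set-bit table.
// Illustrative only; not the code in quasi_iterator_deletions_r3.diff.
public class NextSetBit {
    private static final int[] FIRST_BIT = new int[256];
    static {
        for (int b = 1; b < 256; b++) FIRST_BIT[b] = Integer.numberOfTrailingZeros(b);
    }

    public static int nextSetBit(byte[] bits, int fromBit) {
        for (int i = fromBit >> 3; i < bits.length; i++) {
            int b = bits[i] & 0xFF;
            if (i == fromBit >> 3) b &= ~((1 << (fromBit & 7)) - 1); // mask bits below fromBit
            if (b != 0) return (i << 3) + FIRST_BIT[b];              // one table lookup
        }
        return -1;  // no set bit at or after fromBit
    }
}
```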

Jason Rutherglen
added a comment - 29/Jan/09 22:08
> shut down all extraneous processes
It's a desktop machine, though, so it's going to have some stuff
running in the background, most of which I'm not aware of, being a Mac
newbie.
> Actually I meant a simple sorted list of ints, but even for that
> I'm worried about the skipTo cost (if we use a normal binary search)
Skipping is slower because it unnecessarily checks bits that are not
useful to the query. A higher level deletions Filter, implemented
perhaps in IndexSearcher, requires that deleted docs pass through
the SegmentTermDocs doc[] cache, which could add unnecessary overhead
from the vint decoding.
The main problem we're trying to solve is the potential allocation of a
large del docs BV byte array for the copy on write of a cloned
reader. An option we haven't looked at is a MultiByteArray where
multiple byte arrays make up a virtual byte array checked by BV.get.
On deleteDocument, only the byte array chunks that are changed are
replaced in the new version, while the previously copied chunks are
kept. The overhead of the BV.get can be minimal, though in our tests
with an int array version the performance can be anywhere from equal to
double, based on factors we are not aware of.

Michael McCandless
added a comment - 29/Jan/09 20:17
Actually I made one mistake running your standalone test – I had
allowed the "createIndex" to run more than once, and so I think I had
tested 30K docs with 1875 deletes (6.25%).
I just removed the index and recreated it, so I have 15K docs and 1875
deletes (12.5%). On the Mac Pro I now see the patch at 4.0% slower
(4672 ms to 4859 ms), and on a Debian Linux box (kernel 2.6.22.1, java
1.5.0_08-b03) I see it 0.8% slower (7298 ms to 7357 ms).
> The Mac can be somewhat unreliable for performance results
I've actually found it to be quite reliable. What I love most about
it is, as long as you shut down all extraneous processes, it gives
very repeatable results. I haven't found the same true (or, less so)
of various Linuxes & Windows.
> OpenBitSet didn't seem to make much of a difference
This is very hard to believe – the nextSetBit impl in BitVector (in
the patch) is extremely inefficient. OpenBitSet's impl ought to be
much faster.
> The other option is something like P4Delta which stores the doc
> ids in a compressed form solely for iterating.
I think that will be too costly here (but is a good fit for
postings).
> Is this what you mean by sparse representation?
Actually I meant a simple sorted list of ints, but even for that I'm
worried about the skipTo cost (if we use a normal binary search). I'm
not sure it can be made fast enough (ie faster than the random access
we have today).
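The list-of-sorted-ints representation Mike is weighing, sketched minimally: both isDeleted and skipTo become binary searches, which is exactly the per-probe cost being compared against today's O(1) bit-vector array access. Hypothetical names, not a patch.

```java
import java.util.Arrays;

// Sketch of a sorted-int[] deleted-docs representation with binary-search
// probes. Illustrative only; not Lucene code.
public class SortedIntDeletes {
    private final int[] deletes;   // sorted, unique doc IDs

    public SortedIntDeletes(int[] sortedDocIds) { deletes = sortedDocIds; }

    public boolean isDeleted(int doc) {
        return Arrays.binarySearch(deletes, doc) >= 0;   // O(log n) vs the BV's O(1)
    }

    // skipTo: smallest deleted doc >= target, or Integer.MAX_VALUE if none.
    public int skipTo(int target) {
        int i = Arrays.binarySearch(deletes, target);
        if (i < 0) i = -i - 1;                           // decode the insertion point
        return i < deletes.length ? deletes[i] : Integer.MAX_VALUE;
    }
}
```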

Just run sortBench2.py in contrib/benchmark of trunk & patch areas. Then run sortCollate2.py to make the Jira table (-jira) or print a human readable output (default). You'll have to make your own Wikipedia indices with the pctg deletes, then edit sortBench2.py & sortCollate2.py to fix the paths.

All they do is write an alg file, run the test, and parse the output file to gather best of 5.
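The "parse the output file to gather best of 5" step could look roughly like this in Java. The line format here is invented for illustration; the real sortCollate2.py parses contrib/benchmark's report output.

```java
// Sketch of the collate step: extract one qps figure per round and keep the
// best of the 5 runs. The "qps=NNN" line format is an assumption, not the
// actual contrib/benchmark report format.
public class BestOfRuns {
    public static double bestQps(String[] runLines) {
        double best = Double.NEGATIVE_INFINITY;
        for (String line : runLines) {
            // expects lines like "qps=313.9", one per round
            double qps = Double.parseDouble(line.substring(line.indexOf('=') + 1).trim());
            if (qps > best) best = qps;
        }
        return best;
    }
}
```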

Jason Rutherglen
added a comment - 29/Jan/09 19:11
Interesting results. Mike, can you post your test code? The Mac can
be somewhat unreliable for performance results, as today the
TestDeletesDocIdSet is about the same speed as trunk.
OpenBitSet didn't seem to make much of a difference; however, when
running your tests I can check it out. The other option is something
like P4Delta which stores the doc ids in a compressed form solely for
iterating. Is this what you mean by sparse representation?

Hmm... when I run the TestDeletesDocIdSet, I don't see as much
improvement: trunk gets 10.507 seconds, patch gets 10.158 (~3.3%
faster). I'm running on OS X 10.5.6, quad core Intel; java version is
"1.6.0_07-b06-153" and I run "java -server -Xbatch -Xmx1024M
-Xms1024M".

But that test is rather synthetic: you create 15,000 docs, then delete
1 in 8, then do a search (for "text") that matches all of the docs.

So I went back to contrib/benchmark... I created a single-segment
index, with 0%, 1%, 2%, 5%, 10%, 20% and 50% deletions, using first 2M
docs from Wikipedia, then ran 5 different queries and compared qps w/
patch vs trunk. Results:

||%tg deletes||query||hits||qps||qpsnew||%tg change||
|0%|147|4984|5560.1|5486.2|-1.3%|
|0%|text|97191|347.3|339.4|-2.3%|
|0%|1 AND 2|234634|22.9|22.8|-0.4%|
|0%|1|386435|88.4|87.2|-1.4%|
|0%|1 OR 2|535584|20.9|20.9|0.0%|
|1%|147|4933|5082.0|1419.2|-72.1%|
|1%|text|96143|313.9|142.0|-54.8%|
|1%|1 AND 2|232250|22.1|18.6|-15.8%|
|1%|1|382498|81.0|62.2|-23.2%|
|1%|1 OR 2|530212|20.2|17.5|-13.4%|
|2%|147|4883|5133.6|1959.0|-61.8%|
|2%|text|95190|315.8|149.2|-52.8%|
|2%|1 AND 2|229870|22.2|18.4|-17.1%|
|2%|1|378641|81.2|58.9|-27.5%|
|2%|1 OR 2|524873|20.3|17.1|-15.8%|
|5%|147|4729|5073.6|2600.8|-48.7%|
|5%|text|92293|315.2|166.9|-47.0%|
|5%|1 AND 2|222859|22.5|17.8|-20.9%|
|5%|1|367000|81.0|56.2|-30.6%|
|5%|1 OR 2|508632|20.4|16.3|-20.1%|
|10%|147|4475|5049.6|2953.7|-41.5%|
|10%|text|87504|314.8|180.9|-42.5%|
|10%|1 AND 2|210982|22.9|17.8|-22.3%|
|10%|1|347664|81.5|53.8|-34.0%|
|10%|1 OR 2|481792|21.2|16.5|-22.2%|
|20%|147|4012|5045.0|3455.5|-31.5%|
|20%|text|77980|317.2|204.7|-35.5%|
|20%|1 AND 2|187605|23.9|19.2|-19.7%|
|20%|1|309040|82.0|54.7|-33.3%|
|20%|1 OR 2|428232|22.3|17.5|-21.5%|
|50%|147|2463|5283.2|4731.4|-10.4%|
|50%|text|48331|336.9|290.2|-13.9%|
|50%|1 AND 2|116887|28.4|25.9|-8.8%|
|50%|1|193154|86.4|74.9|-13.3%|
|50%|1 OR 2|267525|27.6|24.9|-9.8%|

I think one major source of slowness is the BitVector.nextSetBit()
impl: it now checks one bit at a time, but it'd be much better to use
OpenBitSet's approach.

I also think, realistically, this approach won't perform very well
until we switch to a sparse representation for the bit set, so that
next() and skipTo() perform well.

Michael McCandless
added a comment - 29/Jan/09 16:01

I had some trouble w/ the latest patch (I'm trying to reproduce your
strange problem w/ contrib/benchmark & run perf tests):

MultiDocIdSet was missing (I just pulled it from the earlier patch).

Compilation errors because {Filter,Parallel}IndexReader failed to
mark getDeletedDocs public (I made them public).

contrib/instantiated failed to compile, because its IndexReader
impl failed to implement getDocIdSet... but we can't add an abstract
IndexReader.getDocIdSet (this breaks back compat). I switched to a
non-abstract default method that throws
UnsupportedOperationException instead, and added a nocommit comment
to remind us to fix all IndexReader subclasses in contrib.

I don't think you should change MemoryIndexReader from private to
public? Why was that done? (I reverted this.)

Replaced MemoryIndex.NullDocIdSet with the new
DocIdSet.EMPTY_DOCIDSET.

Somehow you lost recent fixes to MemoryIndex.java (committed as
part of LUCENE-1316). This was causing an NPE test failure (I
reverted this; the test passes now).

New patch attached. All tests pass... next I'll try to repro the
contrib/benchmark oddness.
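The back-compat workaround mentioned above (a non-abstract default that throws, instead of a new abstract method) looks roughly like this; the class and method names are stand-ins, not the actual patch:

```java
// Hypothetical stand-in for an abstract reader that gains a new method
// without breaking existing subclasses.
abstract class BaseReader {
    // Existing subclasses keep compiling; only subclasses that opt in override.
    public Object getDeletedDocsIterator() {
        throw new UnsupportedOperationException(
            getClass().getName() + " does not implement getDeletedDocsIterator()");
    }
}

// An older subclass, untouched: still compiles, but throws if called.
class LegacyReader extends BaseReader { }

// A newer subclass that opts in.
class ModernReader extends BaseReader {
    @Override
    public Object getDeletedDocsIterator() { return "iterator"; }
}
```

The trade-off is that a missing override becomes a runtime failure rather than a compile-time one, hence the nocommit reminder to audit all subclasses.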


Jason Rutherglen
added a comment - 28/Jan/09 22:54

I created basic test code TestDeletesDocIdSet.java because I thought
the previous results may have something to do with my
misunderstanding of the contrib/benchmark code.

The results show a 17% increase in performance for this patch's
deleted docs skipTo code. I think now that it's probably the fact
that ANT_OPTS isn't applying the -server option properly, as
contrib/benchmark's build.xml forks the process. -server on Mac OS X
seems to optimize method calls far better than -client.

Trunk:
warmup search duration: 6208
final search duration: 6371

LUCENE-1476, skipTo:
warmup search duration: 5926
final search duration: 5450

Michael McCandless
added a comment - 27/Jan/09 19:38

> Perhaps this should be an option in the benchmark output?

That's a great idea!

Something silly must be going on... a 99% performance drop can't be right.


Jason Rutherglen
added a comment - 27/Jan/09 16:24

The percentage performance decrease in the previous results is 99%.

> Jason can you format those results using a Jira table?

Perhaps this should be an option in the benchmark output?

M.M. LUCENE-1516 comment: "I think the larger number of
[harder-for-cpu-to-predict] if statements may be the cause of the
slowdown once %tg deletes gets high enough?"

I have been looking at the performance with YourKit and don't have
any conclusions yet. The main difference between using skipTo and
BV.get is the if statements and some added method calls, which even
if they are inlined I suspect will not make up the difference.

Next steps:
1. Deletes as a NOT boolean query, which probably should
be its own patch
2. Pluggable alternative representations such as
OpenBitSet and int array, part of this patch?

Michael McCandless
added a comment - 23/Jan/09 20:44

For the perf tests, I would use an optimized index with maybe 2M
wikipedia docs in it.

Then test with maybe 0, 1, 5, 10, 25, 50, 75 percent deletions, across
various kinds of queries (single term, OR, AND, phrase/span).

Baseline w/ trunk, and then test w/ this patch (keeps deletion access
low (@ SegmentTermDocs) but switches to the iterator API). I'd love to
also see numbers for deletion-applied-as-filter (high) eventually.

[Actually, if the deletion %tg is ever > 50%, we should presumably invert
the bit set and work with that instead. And same with filters.]

You might want to start with the Python scripts attached to
LUCENE-1483; with some small mods you could easily fix them to run
these tests.
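The "copy it and delete X%" step of this plan can be sketched as a partial Fisher-Yates shuffle that picks exactly the target fraction of doc IDs; this is an illustrative helper, not code from the issue — the caller would feed each returned ID to its reader's delete method:

```java
import java.util.Random;

// Illustrative helper: picks exactly round(maxDoc * pct / 100) distinct doc
// IDs to delete, deterministic for a given seed so a run can be reproduced.
final class DeletionPicker {
    static int[] pick(int maxDoc, double pct, long seed) {
        int n = (int) Math.round(maxDoc * pct / 100.0);
        int[] ids = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            ids[i] = i;
        }
        Random r = new Random(seed);
        // Partial Fisher-Yates shuffle: only the first n slots need shuffling.
        for (int i = 0; i < n; i++) {
            int j = i + r.nextInt(maxDoc - i); // swap a random remaining ID in
            int tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp;
        }
        int[] out = new int[n];
        System.arraycopy(ids, 0, out, 0, n);
        return out;
    }
}
```

Fixing the seed per deletion percentage keeps the trunk and patch runs deleting the same documents, so the hit counts stay comparable.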


Michael McCandless
added a comment - 23/Jan/09 20:17

> We probably need to support both as implementing top level deleted
> docs filtering may have unknown side effects. The user may decide
> based on their queries and other variables such as the number of
> deleted docs.

I agree... and then, if the performance difference is large enough, it
seems like we'll need some simple "search policy" for the interesting
(Boolean) query scorers to pick the best way to execute a query.

This could include which order to visit the segments in (we broached
this in LUCENE-1483, since depending on the query different orders may
perform better). And when (high vs low) & how (iterator vs random
access) to apply a filter would also be decided by the search policy.

> Deprecating isDeleted might be good.

I wonder how this method is used [externally] by applications,
today... I'll go ask on java-user. And, whether all such uses could
migrate to an iterator API instead without much cost.

> Would we need the read only readers?

Good question... I'm guessing there would still be a performance
benefit if the underlying data structures for deletions &
column-stride fields know they cannot change?


Jason Rutherglen
added a comment - 22/Jan/09 01:44

> Should we deprecate isDeleted? And also deprecate document()
> checking whether a document is deleted (ie switch to a new API that
> returns the document w/o checking deletions).

Deprecating isDeleted might be good. Would we need the read only
readers? We can still offer get method access to the deleted docs if
it's a bit set by creating an abstract class DocIdBitSet that
BitVector, java.util.BitSet, and OpenBitSet implement. This can
happen by casting IR.getDeletedDocs to DocIdBitSet.

> Or perhaps move isDeleted to a new API that makes it clear
> that there is a performance cost to random-access of deletions? (And
> the first time it's called, it materializes a full bit vector.)

Seems best to keep deleted docs access to the DocIdSet and DocIdBitSet
and not have IR.isDeleted.


Jason Rutherglen
added a comment - 21/Jan/09 23:01

Is there info on the wiki about how to use the contrib/benchmarks? (I didn't find any.)

M.M.: "1) Should we switch to iterator only API for accessing deleted docs
(vs random-access API we use today)?"

I think we can by default support both. We may want to test (later
on) the difference in speed of iterating over an OpenBitSet vs a
BitVector for the deleted docs.

M.M.: "2) Should we take deletes into account at a higher level
(BooleanScorer/2, as a top-level Filter) vs the lowest level (each
posting, in SegmentTermDocs that we do today)?"

We probably need to support both, as implementing top level deleted
docs filtering may have unknown side effects. The user may decide
based on their queries and other variables such as the number of
deleted docs.


Michael McCandless
added a comment - 21/Jan/09 21:08

> I am going to run the contrib benchmark tests on this patch vs trunk to see if there is a difference in performance

I think this is the most important next step before we go too much
further.

There are two separate questions we are exploring here:

1) Should we switch to an iterator-only API for accessing deleted docs
(vs the random-access API we use today)?

2) Should we take deletes into account at a higher level
(BooleanScorer/2, as a top-level Filter) vs the lowest level
(each posting, in SegmentTermDocs that we do today)?

We need to understand the performance impact with each combination of
1) and 2), across different queries. Likely the answer will be "it
depends on the query", but we should understand just how much. I
think performance cost/benefit is the driver here.

Michael McCandless
added a comment - 21/Jan/09 19:57

> Indeed, how should we solve isDeleted for the tombstones
> implementation?

Should we deprecate isDeleted? And also deprecate document() checking whether a document is deleted (ie switch to a new API that returns the document w/o checking deletions)?

Or perhaps move isDeleted to a new API that makes it clear that there is a performance cost to random-access of deletions? (And the first time it's called, it materializes a full bit vector.)

Jason Rutherglen
added a comment - 21/Jan/09 19:54

I am going to run the contrib benchmark tests on this patch vs trunk to see if there is a difference in performance. Do the benchmarks have deleted docs?


Jason Rutherglen
added a comment - 21/Jan/09 19:38

M.M.: "At a higher level, I'd imagine we'd want to store the deleted doc IDs as an
integer array rather than a BitVector if there aren't very many of them. But
I think that will mess with the random access required by
IndexReader.isDeleted()."

Indeed, how should we solve isDeleted for the tombstones
implementation? Or do we simply assume it will be slow (requiring a
linear scan)? Or perhaps borrow from Solr's HashDocSet (a minimal
primitive int based hash) to implement it?


Marvin Humphrey
added a comment - 21/Jan/09 19:24

> I implemented the SegmentTermsDocs using DocIdSetIterator.skipTo rather than
> nextSetBit.

Ah, I see. And you handled the -1 value properly. Well done.

> Do you think the skipTo can be optimized?

Provided that the compiler is clever enough to inline the various methods
within BitVectorDocIdSetIterator (it should be), I don't see a way to improve
on that part.

It's probably possible to optimize this loop from my patch in
BitVector.nextSetBit(), by retrieving the byte value and operating directly on
it rather than calling get(). (The compiler would have to be quite clever to
figure out that optimization on its own.)

+      // There's a non-zero bit in this byte.
+      final int base = pos * 8;
+      for (int i = 0; i < 8; i++) {
+        final int candidate = base + i;
+        if (candidate < size && candidate >= bit) {
+          if (get(candidate)) {
+            return candidate;
+          }
+        }
+      }

At a higher level, I'd imagine we'd want to store the deleted doc IDs as an
integer array rather than a BitVector if there aren't very many of them. But
I think that will mess with the random access required by
IndexReader.isDeleted().

Jason Rutherglen
added a comment - 21/Jan/09 18:27

> M.M.: I believe there's an inefficiency in my original patch.

Marvin, I implemented the SegmentTermsDocs using DocIdSetIterator.skipTo rather than nextSetBit. Do you think the skipTo can be optimized?


Marvin Humphrey
added a comment - 21/Jan/09 17:58

> For OR like queries, that use Scorer.next(), deletions might be treated as
> early as possible, since each hit will cause a matching doc and a
> corresponding score calculation.

It depends whether the OR-like query spawns a Scorer that pre-fetches.

At present, deletions are handled in Lucene at the SegmentTermDocs level.
Filtering deletions that early minimizes overhead from all Scorer.next()
implementations, including those that pre-fetch like the original
BooleanScorer. The cost is that we must check every single posting in every
single SegmentTermDocs to see if the doc has been deleted.

The Scorer_Collect() routine above, which skips past deletions, consolidates
the cost of checking every single posting into one check. However, the
original BooleanScorer can't skip, so it will incur overhead from scoring
documents which have been deleted. Do the savings and the additional costs
balance each other out? It's hard to say.

However, with OR-like Scorers which implement skipping and don't pre-fetch –
Lucene's DisjunctionSumScorer and its KS analogue ORScorer – you get the best
of both worlds. Not only do you avoid the per-posting-per-SegTermDocs
deletions check, you get to skip past deletions and avoid the extra overhead
in Scorer.next().

Of course the problem is that because ScorerDocQueue isn't very efficient,
DisjunctionSumScorer and ORScorer are often slower than the distributed sort
in the original BooleanScorer.

> For AND like queries, that use Scorer.skipTo(), deletions can be treated
> later, for example in skipTo() of the conjunction/and/scorer.

Yes, and it could be implemented by adding a Filter sub-clause to the
conjunction/and/scorer.


Paul Elschot
added a comment - 21/Jan/09 07:36

>>> If we move the deletions filtering up, then we'd increase traffic through
>>> that cache
>>
>> OK, right. So we may have some added cost because of this. I think it's only
>> TermScorer that uses the bulk API though.
>
> The original BooleanScorer also pre-fetches. (That doesn't affect KS because
> ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
> collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
> skipping.)

For OR like queries, that use Scorer.next(), deletions might be treated as early
as possible, since each hit will cause a matching doc and a corresponding score calculation.

For AND like queries, that use Scorer.skipTo(), deletions can be treated later,
for example in skipTo() of the conjunction/and/scorer.

In the same way, prefetching into a larger term documents buffer helps for OR queries,
but gets in the way for AND queries.

>> If we move the deletions filtering up, then we'd increase traffic through
>> that cache
>
> OK, right. So we may have some added cost because of this. I think it's only
> TermScorer that uses the bulk API though.

The original BooleanScorer also pre-fetches. (That doesn't affect KS because
ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
skipping.)

And I'm still not certain it's a good idea from an API standpoint: it's
strange to have the PostingList and Scorer iterators include deleted docs.

Nevertheless, I've now changed over KS to use the deletions-as-filter approach
in svn trunk. The tie-breaker was related to the ongoing modularization of
IndexReader: if PostingList doesn't have to handle deletions, then
PostingsReader doesn't have to know about DeletionsReader, and if
PostingsReader doesn't have to know about DeletionsReader, then all
IndexReader sub-components can be implemented independently.

The code to implement the deletion skipping turned out to be more verbose and
fiddly than anticipated. It's easy to make fencepost errors when dealing with
advancing two iterators in sync, especially when you can only invoke the
skipping iterator method once for a given doc num.
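The two-iterator sync being described can be sketched as follows. These are simplified stand-in interfaces, not KinoSearch's or Lucene's actual classes; the fencepost-prone invariant is that the deletions iterator's skip method is invoked at most once per candidate doc:

```java
// Simplified stand-ins for DocIdSetIterator-style iterators (illustrative).
interface IntIterator {
    int NO_MORE = Integer.MAX_VALUE;
    int next();              // advance to the next doc, or NO_MORE
    int advance(int target); // advance to the first doc >= target, or NO_MORE
}

// Array-backed iterator so the sketch is runnable.
class ArrayIterator implements IntIterator {
    private final int[] docs;
    private int i = -1;
    ArrayIterator(int... docs) { this.docs = docs; }
    public int next() { return ++i < docs.length ? docs[i] : NO_MORE; }
    public int advance(int target) {
        while (++i < docs.length) { if (docs[i] >= target) return docs[i]; }
        return NO_MORE;
    }
}

// Postings filtered against a deletions iterator.
class NotDeletedIterator {
    private final IntIterator postings;
    private final IntIterator deletions;
    private int nextDeletion = -1;

    NotDeletedIterator(IntIterator postings, IntIterator deletions) {
        this.postings = postings;
        this.deletions = deletions;
    }

    int next() {
        int doc = postings.next();
        while (doc != IntIterator.NO_MORE) {
            if (doc > nextDeletion) {
                nextDeletion = deletions.advance(doc); // at most once per doc
            }
            if (doc != nextDeletion) {
                return doc;        // live doc
            }
            doc = postings.next(); // deleted; keep scanning
        }
        return IntIterator.NO_MORE;
    }
}
```

Caching nextDeletion is what keeps advance() from being called twice for the same doc, which would be illegal for a forward-only skipping iterator.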


Marvin Humphrey
added a comment - 21/Jan/09 02:29

Jason,

> Incorporated Marvin's patch into SegmentTermDocs and BitVector.

I believe there's an inefficiency in my original patch. Code like this occurs in three places:

+      if (doc >= nextDeletion) {
+        if (doc > nextDeletion)
+          nextDeletion = deletedDocs.nextSetBit(doc);
+        if (doc == nextDeletion)
+          continue;
+      }

While technically correct, extra work is being done. When nextSetBit() can't
find any more bits, it returns -1 (just like java.util.BitSet.nextSetBit() does).
This causes the deletion loop to be checked on each iteration.

The solution is to test for -1, as in the new version of the patch (also
tested but not benchmarked):

+      if (doc >= nextDeletion) {
+        if (doc > nextDeletion) {
+          nextDeletion = deletedDocs.nextSetBit(doc);
+          if (nextDeletion == -1) {
+            nextDeletion = Integer.MAX_VALUE;
+          }
+        }
+        if (doc == nextDeletion) {
+          continue;
+        }
+      }

Theoretically, we could also change the behavior of nextSetBit() so that it
returns Integer.MAX_VALUE when it can't find any more set bits. However,
that's a little misleading (since it's a positive number and could thus
represent a true set bit), and also would break the intended API mimicry by
BitVector.nextSetBit() of java.util.BitSet.nextSetBit().

Jason Rutherglen
added a comment - 21/Jan/09 01:41

First cut. All tests pass.

Implemented IndexReader.getDeletedDocs in all readers.

Created MultiDocIdSet that iterates over multiple DocIdSets, which is
used by MultiSegmentReader and MultiReader.

Incorporated Marvin's patch into SegmentTermDocs and BitVector.
SegmentTermDocs iterates using the deleted docs DocIdSet instead of
calling BitVector.get.
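The MultiDocIdSet idea — stitching per-segment iterators into one doc-ID space by offsetting each segment with a doc base, the way MultiSegmentReader stitches segments — can be sketched like this (simplified, illustrative types rather than the patch's actual classes):

```java
// Iterates several per-segment sorted doc-ID lists in sequence, offsetting
// each segment's IDs by the cumulative maxDoc of the segments before it.
final class MultiIterator {
    private final int[][] segments; // per-segment sorted doc IDs
    private final int[] maxDocs;    // per-segment maxDoc (sets the doc base)
    private int seg = 0, pos = -1, base = 0;

    MultiIterator(int[][] segments, int[] maxDocs) {
        this.segments = segments;
        this.maxDocs = maxDocs;
    }

    /** Next doc in the combined ID space, or Integer.MAX_VALUE when exhausted. */
    int next() {
        while (seg < segments.length) {
            if (++pos < segments[seg].length) {
                return base + segments[seg][pos];
            }
            base += maxDocs[seg]; // segment exhausted: shift the doc base
            seg++;
            pos = -1;
        }
        return Integer.MAX_VALUE;
    }
}
```

Note the base advances by the segment's maxDoc, not by its deletion count, so combined IDs line up with how a multi-reader numbers documents.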


Jason Rutherglen
added a comment - 16/Jan/09 17:28

> If we moved to using only iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.

+1 Agreed. Good point about not needing to change the on-disk representation, as that would make the implementation a bit more complicated. Sounds like we need a tombstones patch as well that plays well with IndexReader.clone.

Exposing deleted docs as a DocIdSet allows possible future implementations that DO return deleted docs as discussed (via a flag to IndexReader) from TermDocs. The deleted docs DocIdSet can then be used at a higher level as a filter/query.


Michael McCandless
added a comment - 16/Jan/09 17:19

> Why? The returned iterator can traverse the multiple bitvectors.

Woops, sorry: I missed that it would return a DocIdSet (iterator only) vs the underlying (current) BitVector. So then MultiReader could return a DocIdSet.

> If the segment is large, tombstones can solve this.

Right; I was saying, as things are today (a single BitVector holds all deleted docs), one limitation of the realtime approach we are moving towards is the copy-on-write cost of the first delete on a freshly cloned reader for a large segment.

If we moved to using only an iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing the on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.


Jason Rutherglen
added a comment - 16/Jan/09 17:05

> it's fine to expose this API, mark it expert & subject to
> change, and state that it simply returns the current deleted docs
> BitVector and any synchronization issues must be handled by the app.
> Probably it should be package private or protected.

+1 Sounds good.

> Also, I think Multi*Reader would not implement this API? Meaning
> you must get the individual SegmentReader and ask it.

Why? The returned iterator can traverse the multiple bitvectors.

> I think one weak part of
> the design is the blocking copy-on-write required for the first
> deletion on a newly cloned reader. The time taken is in proportion to
> the size of the segment.

If the segment is large, tombstones can solve this.


Michael McCandless
added a comment - 15/Jan/09 10:42

I think it's fine to expose this API, mark it expert & subject to
change, and state that it simply returns the current deleted docs
BitVector and any synchronization issues must be handled by the app.
Probably it should be package private or protected.

Also, I think Multi*Reader would not implement this API? Meaning
you must get the individual SegmentReader and ask it.

Since we are discussing possible changes to deleted docs, eg switching
to iterator-only access, maybe applying deletions as a filter, maybe
using "tombstones" under-the-hood, etc., this API could very well
change.

For our current thinking on realtime search, I think one weak part of
the design is the blocking copy-on-write required for the first
deletion on a newly cloned reader. The time taken is in proportion to
the size of the segment. Switching to a design that can background
this copy-on-write will necessarily impact how we represent
and access deletions.
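The copy-on-write concern above (and the "incremental copy-on-write block BitVector" idea floated elsewhere in this thread) can be sketched as a block-granular bit set: a clone shares all blocks with its parent and copies only the block it first writes to, so the cost of the first delete is one block rather than the whole segment's bit vector. This is entirely illustrative, not Lucene code:

```java
// Block-granular copy-on-write bit set (illustrative sketch).
final class CowBits {
    private static final int BLOCK_BITS = 4096;          // bits per block
    private static final int BLOCK_LONGS = BLOCK_BITS / 64;
    private final long[][] blocks;
    private final boolean[] owned; // may this instance mutate block i in place?

    CowBits(int numBits) {
        int nBlocks = (numBits + BLOCK_BITS - 1) / BLOCK_BITS;
        blocks = new long[nBlocks][];
        owned = new boolean[nBlocks];
        for (int i = 0; i < nBlocks; i++) {
            blocks[i] = new long[BLOCK_LONGS];
            owned[i] = true;
        }
    }

    private CowBits(CowBits other) {
        blocks = other.blocks.clone();      // share all blocks...
        owned = new boolean[blocks.length]; // ...but own none of them yet
        // The parent also loses exclusive ownership of the now-shared blocks.
        java.util.Arrays.fill(other.owned, false);
    }

    CowBits cowClone() { return new CowBits(this); }

    void set(int bit) {
        int b = bit / BLOCK_BITS;
        if (!owned[b]) {                    // first write: copy just this block
            blocks[b] = blocks[b].clone();
            owned[b] = true;
        }
        int off = bit % BLOCK_BITS;
        blocks[b][off >>> 6] |= 1L << (off & 63);
    }

    boolean get(int bit) {
        int off = bit % BLOCK_BITS;
        return (blocks[bit / BLOCK_BITS][off >>> 6] & (1L << (off & 63))) != 0;
    }
}
```

The block size trades memory sharing against per-write copy cost; whatever the number, the first delete after a clone no longer pays for the whole segment.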


Jason Rutherglen
added a comment - 14/Jan/09 22:01

I think it should be up to the user. If the user concurrently
modifies, then they're responsible for the possibly spurious effects.
However if we want to be protective, a philosophy I don't think works
well in LUCENE-1516 either, we can offer IR.getDeletedDocs only from a
read-only index reader. This solves the issues brought up such as
"undesirable synchronization".


Yonik Seeley
added a comment - 14/Jan/09 18:36 IndexReader.getDeletedDocs that returns a DocIdSet
Seems like a good idea, but the difficulty lies in defining the semantics of such a call.
Are subsequent deletes reflected in the returned DocIdSet? Perhaps make this undefined - may or may not be.
Can the DocIdSet be used concurrently with IndexReader.deleteDocument()? This could work (w/o extra undesirable synchronization) if we're careful.
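The two candidate semantics can be sketched with a toy reader (a hypothetical class for illustration only; BitSet stands in for the deleted-docs BitVector):

```java
import java.util.BitSet;

// Sketch of the two possible semantics for a getDeletedDocs() call.
// "Live" hands out the backing structure, so later deletes are visible;
// "snapshot" clones it, so they are not, at the cost of a copy.
class DeletedDocsSemantics {
    private final BitSet deletedDocs = new BitSet();

    void deleteDocument(int doc) { deletedDocs.set(doc); }

    // Live semantics: subsequent deletes are reflected in the returned set.
    BitSet getDeletedDocsLive() { return deletedDocs; }

    // Snapshot semantics: subsequent deletes are not reflected, and no
    // synchronization question arises for the caller.
    BitSet getDeletedDocsSnapshot() { return (BitSet) deletedDocs.clone(); }
}
```

Making the behavior "undefined - may or may not be" leaves the implementation free to pick either.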

Jason Rutherglen
added a comment - 14/Jan/09 17:50 To simplify the patch, can we agree to add a method,
IndexReader.getDeletedDocs that returns a DocIdSet? This way
applications may have access to deleteDocs and not be encumbered by
the IR.isDeleted method.

Michael McCandless
added a comment - 09/Jan/09 13:28
> One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM.
I think that's a good approach. IndexWriter couldn't care less which dir
each segment resides in.
For starters let's flush all new segments into a single Directory (and
swap in RAMDir, single-file-Dir, etc., if/when necessary for better
performance).
Thinking out loud.... what if we added IndexWriter.getReader(), which
returns an IndexReader for the current index?
It would read segments that are not "visible" to a newly opened
IndexReader on that directory (ie, the SegmentInfos inside IndexWriter
was used to do the open, not the latest segments_N in the Directory).
That IndexReader is free to do deletions / set norms (shares write
lock under the hood w/ IndexWriter), and when reopen is called it
grabs the current SegmentInfos from IndexWriter and re-opens that.
We would un-deprecate flush(). Then the transaction layer would make
changes, call flush(), call reopen(), and return the reopened reader.

Michael McCandless
added a comment - 09/Jan/09 12:08
> Under the current system, I'm not certain that the deletions checks are that
> excessive.
I made a simple test that up-front converted the deleted docs
BitVector into an int[] containing the deleted docIDs, and then did
the same nextDeletedDoc change to SegmentTermDocs.
This was only faster if < 10% of docs were deleted, though I didn't do
very thorough testing.
I think the problem is with this change the CPU must predict a new
branch (docNum >= nextDeletion) vs doing a no-branching lookup of the
bit.
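The two checks being compared can be sketched like this (illustrative Java only, not the actual SegmentTermDocs code; BitSet stands in for Lucene's BitVector and the method names are made up):

```java
import java.util.BitSet;

// Illustrative comparison of the two deletion checks discussed above.
// The BitSet lookup is a plain load with no new data-dependent branch,
// while the sorted-int[] variant introduces the (doc >= nextDeletion)
// branch the CPU must predict.
class DeletionChecks {
    // Style 1: non-sparse bit lookup, as the deleted-docs BitVector does today.
    static int countLive(int[] docs, BitSet deleted) {
        int live = 0;
        for (int doc : docs) {
            if (!deleted.get(doc)) live++;  // cheap, branch-predictable lookup
        }
        return live;
    }

    // Style 2: sorted array of deleted docIDs plus a moving cursor.
    static int countLiveSparse(int[] docs, int[] deletedSorted) {
        int live = 0, cursor = 0;
        int nextDeletion =
            deletedSorted.length > 0 ? deletedSorted[0] : Integer.MAX_VALUE;
        for (int doc : docs) {  // assumes docs arrive in increasing order
            while (doc > nextDeletion) {  // the extra branch to predict
                cursor++;
                nextDeletion = cursor < deletedSorted.length
                    ? deletedSorted[cursor] : Integer.MAX_VALUE;
            }
            if (doc != nextDeletion) live++;
        }
        return live;
    }
}
```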

Michael McCandless
added a comment - 09/Jan/09 12:04
> We could potentially make that choice with Lucy/KS, thus allowing us to remove
> the deletions check in the PostingList iterator (as above) and getting a
> potential speedup. But even then I hesitate to push the deletions API upwards
> into a space where users of raw Scorer and TermDocs classes have to deal with
> it - especially since iterator-style deletions aren't very user-friendly.
Can't you just put sugar on top? Ie, add an API that matches today's
API and efficiently applies the "deleted docs filter" for you? Then
you have 100% back compat?

Michael McCandless
added a comment - 09/Jan/09 12:01
> We can hide the sparse representation and the internal state, having the
> object lazily build a non-sparse representation. That's what I had in
> mind with the code for TombstoneDelEnum.nextDeletion().
> TombstoneDelEnum.nextInternal() would be a private method used for building up
> the internal BitVector.
Got it, though for a low deletion rate presumably you'd want to store
the int docIDs directly so iterating through them doesn't require O(N)
scan for the next set bit.
I think what you'd want to lazily do is merge the N tombstone streams
for this one segment into a single data structure; whether that data
structure is sparse or unsparse is a separate decision.
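That lazy merge of per-segment tombstone streams could be sketched as follows (hypothetical names; each stream is assumed to be a sorted int[] of deleted docIDs, and the merged result is realized as a non-sparse BitSet for simplicity):

```java
import java.util.BitSet;
import java.util.PriorityQueue;

// Sketch: merge N sorted tombstone streams for one segment into a single
// data structure. Whether the result should stay sparse or not is, as
// noted above, a separate decision; a BitSet is used here for simplicity.
class TombstoneMerger {
    static BitSet merge(int[][] streams) {
        // Queue entries are {docID, streamIndex, positionInStream},
        // ordered by docID (a classic k-way merge).
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < streams.length; s++) {
            if (streams[s].length > 0) pq.add(new int[]{streams[s][0], s, 0});
        }
        BitSet merged = new BitSet();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            merged.set(top[0]);  // duplicates across streams collapse here
            int next = top[2] + 1;
            if (next < streams[top[1]].length) {
                pq.add(new int[]{streams[top[1]][next], top[1], next});
            }
        }
        return merged;
    }
}
```

The per-entry heap operations are what make doing this on every search expensive; doing it once and caching the result is the point of the lazy approach.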

Michael McCandless
added a comment - 09/Jan/09 11:58
> Mmm. I think I might have given IndexWriter.commit() slightly different
> semantics. Specifically, I might have given it a boolean "sync" argument
> which defaults to false.
It seems like there are various levels of increasing "durability" here:

Make available to a reader in the same JVM (eg flush new segment
to a RAMDir) – not exposed today.

Make available to a reader sharing the filesystem right now (flush
to Directory in "real" filesystem, but don't sync) – exposed
today (but deprecated as a public API) as flush.

Make available to readers even if OS/machine crashes (flush to
Directory in "real" filesystem, and sync) – exposed today as commit.

> Two comments. First, if you don't sync, but rather leave it up to the OS when
> it wants to actually perform the actual disk i/o, how expensive is flushing? Can
> we make it cheap enough to meet Jason's absolute change rate requirements?
Right I've been wondering the same thing. I think this should be our
first approach to realtime indexing, and then we swap in RAMDir if
performance is not good enough.
> Second, the multi-index model is very tricky when dealing with "updates". How
> do you guarantee that you always see the "current" version of a given
> document, and only that version? When do you expose new deletes in the
> RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
> manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
> new adds to the RAMDirectory that take place during slow merges...
>
> Building a single-index, two-writer model that could handle fast updates while
> performing background merging was one of the main drivers behind the tombstone
> design.
I'm not proposing multi-index model (at least I think I'm not!). A
single IW could flush new tiny segments into a RAMDir and later merge
them into a real Dir. But I agree: let's start w/ a single Dir and
move to RAMDir only if necessary.
> Building a single-index, two-writer model that could handle fast updates while
> performing background merging was one of the main drivers behind the tombstone
> design.
I think carrying the deletions in RAM (reopening the reader) is
probably fastest for Lucene. Lucene with the "reopened stream of
readers" approach can do this, but Lucy/KS (with mmap) must use
filesystem as the intermediary.

Jason Rutherglen
added a comment - 09/Jan/09 01:40 M.M.:" I think the transactions layer would also sit on top of this
"realtime" layer? EG this "realtime" layer would expose a commit()
method, and the transaction layer above it would maintain the
transaction log, periodically calling commit() and truncating the
transaction log?"
One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as document long UIDs rather than int doc ids. Then the commit method would be called on the transaction object, which returns a reader representing the latest version of the index, plus the changes created by the transaction. This system would be a part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in-RAM realtime indexes to disk. The realtime merge policy would flush based on the RAM usage or number of docs.
IndexWriter iw = new IndexWriter();
Transaction tr = iw.createTransaction();
tr.addDocument(new Document());
tr.addDocument(new Document());
tr.deleteDocument(1200L);
IndexReader ir = tr.flush(); // flushes transaction to the index (probably to a RAM index)
IndexReader latestReader = iw.getReader(); // same as ir
iw.commit(doWait); // commits the in-RAM realtime index to disk
When commit is called, the disk segment readers flush their deletes to disk, which is fast. The in-RAM realtime index is merged to disk. The process is described in more detail further down.
M.H.: "how about writing a single-file Directory implementation?"
I'm not sure we need this because an appending, rolling transaction log should work. Segments don't change, only things like norms and deletes, which can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and future realtime features could use it and be realtime.
M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version?"
Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model which would allow documents that were merged into other segments to be deleted during a transaction log replay.
M.H.: "When do you expose new deletes in the RAMDir, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDir to the FSDirectory, how do you manage new adds to the RAMDir that take place during slow merges..."
Queue deletes to the RAMDir, while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader with the newly merged FSDirectory and a new RAMDir. Writes that were occurring during this process would be happening to another new RAMDir.
One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM. Each transaction is an instance of some segments and their deletes (and we're not worried about the deletes being flushed or not so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current) and should not stop new updates from happening.
When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in its own FSDirectory (the segment size was 64MB), which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter.
M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when
it wants to actually perform the actual disk i/o, how expensive is flushing? Can
we make it cheap enough to meet Jason's absolute change rate requirements?"
When I tried out the transaction log a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS.
M.M.: "maintain & update deleted docs even though IndexWriter has the write lock"
In my previous realtime search implementation I got around this by having each segment in its own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the write lock of the IndexWriter.

> So, net/net it seems like "deletes-as-a-filter" approach is compelling?

In terms of CPU-cycles, maybe.

My gut tells me that it's all but mandatory if we use merged-on-the-fly
tombstone streams, but if Lucene goes that route it should cache a BitVector
and use a shared pseudo-iterator – in which case the costs will no longer be
significantly more than the current system.

Under the current system, I'm not certain that the deletions checks are that
excessive. Consider the deletions check inside the bulk-read loop of
TermDocs.read().

Again, the CPU is probably going to do a pretty good job of predicting the
results of the deletion check. And even then, we're accessing the same shared
BitVector across all TermDocs, and its bits are hopefully a cache hit.
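The shape of the check under discussion, sketched as a minimal stand-alone loop (this is illustrative only, not the real SegmentTermDocs.read() source; BitSet stands in for the shared deleted-docs BitVector):

```java
import java.util.BitSet;

// Sketch of a bulk read that filters deleted docs inline: every candidate
// doc pays one lookup against the shared deletions structure, and the
// branch is usually easy to predict because most docs are not deleted.
class BulkRead {
    // Copies up to docs.length live docIDs from 'postings' into 'docs',
    // returning how many slots were filled.
    static int read(int[] postings, BitSet deletedDocs, int[] docs) {
        int i = 0;
        for (int doc : postings) {
            if (i == docs.length) break;
            if (deletedDocs == null || !deletedDocs.get(doc)) {
                docs[i++] = doc;  // keep only live docs
            }
        }
        return i;
    }
}
```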

To really tighten this loop, you have to do what Nate and I want with Lucy/KS.

In terms of public API, I don't think it's reasonable to change Lucene's
Scorer and TermDocs classes so that their iterators start returning deleted
docs.

We could potentially make that choice with Lucy/KS, thus allowing us to remove
the deletions check in the PostingList iterator (as above) and getting a
potential speedup. But even then I hesitate to push the deletions API upwards
into a space where users of raw Scorer and TermDocs classes have to deal with
it – especially since iterator-style deletions aren't very user-friendly.

Marvin Humphrey
added a comment - 08/Jan/09 20:33 Mike McCandless:
> if it's sparse, you need an iterator (state) to remember where you are.
We can hide the sparse representation and the internal state, having the
object lazily build a non-sparse representation. That's what I had in
mind with the code for TombstoneDelEnum.nextDeletion().
TombstoneDelEnum.nextInternal() would be a private method used for building up
the internal BitVector.

Marvin Humphrey
added a comment - 08/Jan/09 20:28 Mike McCandless:
> Commit is for crash recovery, and for knowing when it's OK to delete
> prior commits. Simply writing the files (and not syncing them), and
> perhaps giving IndexReader.open the SegmentInfos to use directly (and
> not writing a segments_N via the filesystem) would allow us to search
> added docs without paying the cost of sync'ing all the files.
Mmm. I think I might have given IndexWriter.commit() slightly different
semantics. Specifically, I might have given it a boolean "sync" argument
which defaults to false.
> Also: brand new, tiny segments should be written into a RAMDirectory
> and then merged over time into the real Directory.
Two comments. First, if you don't sync, but rather leave it up to the OS when
it wants to actually perform the actual disk i/o, how expensive is flushing? Can
we make it cheap enough to meet Jason's absolute change rate requirements?
Second, the multi-index model is very tricky when dealing with "updates". How
do you guarantee that you always see the "current" version of a given
document, and only that version? When do you expose new deletes in the
RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
new adds to the RAMDirectory that take place during slow merges...
Building a single-index, two-writer model that could handle fast updates while
performing background merging was one of the main drivers behind the tombstone
design.

Michael McCandless
added a comment - 08/Jan/09 20:11
> How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?
That works if under-the-hood it's a non-sparse representation. But if it's sparse, you need an iterator (state) to remember where you are.
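The distinction can be sketched with two hypothetical deletion classes (BitSet models the non-sparse case; a sorted int[] plus a cursor models the sparse one):

```java
import java.util.BitSet;

// Non-sparse: a BitSet can answer nextSetBit(docNum) statelessly, so
// callers need no iterator object of their own.
class NonSparseDeletions {
    final BitSet bits;
    NonSparseDeletions(BitSet bits) { this.bits = bits; }
    int nextDeletion(int docNum) { return bits.nextSetBit(docNum); }  // no state kept
}

// Sparse: a sorted array needs a cursor to avoid re-searching from the
// start on every call, so each caller must hold its own iterator state
// (and call with non-decreasing docNum).
class SparseDeletionIterator {
    final int[] sorted;
    int cursor = 0;  // per-iterator state
    SparseDeletionIterator(int[] sorted) { this.sorted = sorted; }
    int nextDeletion(int docNum) {
        while (cursor < sorted.length && sorted[cursor] < docNum) cursor++;
        return cursor < sorted.length ? sorted[cursor] : -1;  // -1: none left
    }
}
```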

Michael McCandless
added a comment - 08/Jan/09 20:09
> If we move the deletions filtering up, then we'd increase traffic through that cache
OK, right. So we may have some added cost because of this. I think
it's only TermScorer that uses the bulk API though.
> If you were applying deletions filtering after Scorer.next(), then it seems
> likely that costs would go up because of extra hit processing. However, if
> you use Scorer.skipTo() to jump past deletions, as in the loop I provided
> above, then PhraseScorer etc. shouldn't incur any more costs themselves.
Ahhh, now I got it! Good, you're right.
> Under the skipTo() loop, I think the filter effectively does get applied
> earlier in the chain. Does that make sense?
Right. This is how Lucene works today. Excellent.
So, net/net it seems like "deletes-as-a-filter" approach is compelling?
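The skipTo() loop being agreed on here might look like the following sketch (a hypothetical Scorer-like interface for illustration, not Lucene's actual classes):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of "deletions as a filter" applied outside the scorer: instead
// of the scorer checking every candidate internally, the driving loop
// uses skipTo() to jump past deleted docs.
class DeletionFilterLoop {
    interface DocIterator {      // stand-in for a Scorer
        int next();              // next matching doc, or -1 when exhausted
        int skipTo(int target);  // next match >= target, or -1
    }

    static List<Integer> collectLive(DocIterator scorer, BitSet deleted) {
        List<Integer> hits = new ArrayList<>();
        int doc = scorer.next();
        while (doc != -1) {
            if (deleted.get(doc)) {
                doc = scorer.skipTo(doc + 1);  // jump past the deletion
            } else {
                hits.add(doc);
                doc = scorer.next();
            }
        }
        return hits;
    }

    // Simple array-backed iterator for demonstration.
    static DocIterator over(int[] docs) {
        return new DocIterator() {
            int pos = -1;
            public int next() { return ++pos < docs.length ? docs[pos] : -1; }
            public int skipTo(int target) {
                while (++pos < docs.length && docs[pos] < target) { }
                return pos < docs.length ? docs[pos] : -1;
            }
        };
    }
}
```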

Marvin Humphrey
added a comment - 08/Jan/09 20:05 Mike McCandless:
> I'm also curious what cost you see of doing the merge sort for every
> search; I think it could be uncomfortably high since it's so
> hard-for-cpu-to-predict-branch-intensive.
Probably true. You're going to get accelerating degradation as the number of
deletions increases. In a large index, you could end up merging 20, 30
streams. Based on how the priority queue in ORScorer tends to take up space
in profiling data, that might not be good.
It'd be manageable if you can keep your index reasonably in good shape, but you'll
be suckin' pondwater if it gets flabby.
> We could take the first search that doesn't use skipTo and save the result
> of the merge sort, essentially doing an in-RAM-only "merge" of those
> deletes, and let subsequent searches use that single merged stream.
That was what I had in mind when proposing the pseudo-iterator model.
class TombStoneDelEnum extends DelEnum {
    int nextDeletion(int docNum) {
        while (currentMax < docNum) { nextInternal(); }
        return bits.nextSetBit(docNum);
    }
    // ...
}
> (This is not MMAP friendly, though).
Yeah. Ironically, that use of tombstones is more compatible with the Lucene
model.
I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object
process RAM. That'd spoil the "cheap wrapper around system i/o cache"
IndexReader plan.
I can't see an answer yet. But the one thing I do know is that Lucy/KS needs
a pluggable deletions mechanism to make experimentation easier – so that's
what I'm working on today.

Michael McCandless
added a comment - 08/Jan/09 20:00
> It would be exposed as a combination reader writer that manages the transaction status of each update.
I think the transactions layer would also sit on top of this
"realtime" layer? EG this "realtime" layer would expose a commit()
method, and the transaction layer above it would maintain the
transaction log, periodically calling commit() and truncating the
transaction log?
This "realtime" layer, then, would internally maintain a single
IndexWriter and the readers. IndexWriter would flush (not commit) new
segments into a RAMDir and yield its in-RAM SegmentInfos to
IndexReader.reopen. MergePolicy periodically gets those into the real
Directory. When reopening a reader we have the freedom to use old
(already merged away) segments if the newly merged segment isn't warm
yet.
We "just" need to open some things up in IndexWriter:

IndexReader.reopen with the in-RAM SegmentInfos

Willingness to allow an IndexReader to maintain & update deleted
docs even though IndexWriter has the write lock

Access to segments that were already merged away (I think we could
make a DeletionPolicy that pays attention to when the newly merged
segment is not yet warmed and keeps the prior segments around).
I think this'd require allowing DeletionPolicy to see "flush
points" in addition to commit points (it doesn't today).

But I'm still hazy on the details on exactly how to open up
IndexWriter.
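The flush/reopen flow described above can be modeled as a toy (none of these types are Lucene's real API; `RealtimeSketch` and its nested `SegmentInfo` are illustrative stand-ins): the writer flushes new segments into RAM without committing, and reopen hands a reader an immutable snapshot of the in-RAM SegmentInfos, deferring the cost of sync'ing to the real Directory until an actual commit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene's API) of the idea above: the writer "flushes"
// new segments into RAM without committing, and reopen() hands a reader
// a snapshot of the in-RAM SegmentInfos. Only a real commit would pay
// the cost of fsync'ing files in the real Directory.
public class RealtimeSketch {
    static class SegmentInfo {
        final String name;
        final boolean inRam; // true until the merge policy moves it to the real Directory
        SegmentInfo(String name, boolean inRam) { this.name = name; this.inRam = inRam; }
    }

    private final List<SegmentInfo> segmentInfos = new ArrayList<>();
    private int flushCount = 0;

    // flush: make recent adds searchable without a (costly) commit/sync
    public void flush() {
        segmentInfos.add(new SegmentInfo("_ram" + flushCount++, true));
    }

    // reopen: a reader gets an immutable snapshot of the in-RAM SegmentInfos
    public List<SegmentInfo> reopen() {
        return List.copyOf(segmentInfos);
    }

    public static void main(String[] args) {
        RealtimeSketch writer = new RealtimeSketch();
        writer.flush();
        writer.flush();
        List<SegmentInfo> readerView = writer.reopen();
        writer.flush(); // the reader's snapshot is unaffected by later flushes
        System.out.println(readerView.size() + " segments visible"); // prints "2 segments visible"
    }
}
```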

Michael McCandless
added a comment - 08/Jan/09 19:48
> If Lucene crashed for some reason the transaction log would be replayed.

I think the transaction log is useful for some applications, but could
(should) be built as a separate (optional) layer entirely on top of
Lucene's core. Ie, neither IndexWriter nor IndexReader need to be
aware of the transaction log, which update belongs to which
transaction, etc?

Michael McCandless
added a comment - 08/Jan/09 19:43
> There's going to be a change rate that overwhelms the multi-file
> commit system, and it seems that you've determined you're up against
> it.

Well... IndexWriter need not "commit" in order to allow a reader to
see the files?

Commit is for crash recovery, and for knowing when it's OK to delete
prior commits. Simply writing the files (and not syncing them), and
perhaps giving IndexReader.open the SegmentInfos to use directly (and
not writing a segments_N via the filesystem) would allow us to search
added docs without paying the cost of sync'ing all the files.

Also: brand new, tiny segments should be written into a RAMDirectory
and then merged over time into the real Directory.

Marvin Humphrey
added a comment - 08/Jan/09 19:11 Jason Rutherglen:
> I found in making the realtime search write speed fast enough that writing
> to individual files per segment can become too costly (they accumulate fast,
> appending to a single file is faster than creating new files, deleting the
> files becomes costly).

I saw you mentioning i/o overhead on Windows in particular. I can't see a way
to mod Lucene so that it doesn't generate a bunch of files for each commit,
and FWIW Lucy/KS is going to generate even more files than Lucene.

Half-seriously... how about writing a single-file Directory implementation?

> For example, writing to small individual files per commit, if the number of
> segments is large and the delete spans multiple segments will generate many
> files.

There would be a maximum of two files per segment to hold the tombstones: one
to hold the tombstone rows, and one to map segment identifiers to tombstone
rows. (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
"segmeta" file, which stores human-readable metadata on behalf of multiple
components.)

Segments containing tombstones would be merged according to whatever merge
policy was in place. So there won't ever be an obscene number of tombstone
files unless you allow an obscene number of segments to accumulate.

> Many users may not want a transaction log as they may be storing the updates
> in a separate SQL database instance (this is the case where I work) and so a
> transaction log is redundant and should be optional.

I can see how this would be quite useful at the application level. However, I
think it might be challenging to generalize the transaction log concept at the
library level:

CustomAnalyzer analyzer = new CustomAnalyzer();
IndexWriter indexWriter = new IndexWriter(analyzer, "/path/to/index");
indexWriter.add(nextDoc());
analyzer.setFoo(2); // change of state not recorded by transaction log
indexWriter.add(nextDoc());

MySQL is more of a closed system than Lucene, which I think makes options
available that aren't available to us.

> The reader stack is drained based on whether a reader is too old to be
> useful anymore (i.e. no references to it, or it has N readers
> ahead of it).

Right, this is the kind of thing that Lucene has to do because of the
single-reader model, and that we're trying to get away from in Lucy/KS by
exploiting mmap and making IndexReaders cheap wrappers around the system i/o
cache.

I don't think I can offer any alternative design suggestions that meet your
needs. There's going to be a change rate that overwhelms the multi-file
commit system, and it seems that you've determined you're up against it.

What's killing us is something different: not absolute change rate, but poor
worst-case performance.

FWIW, we contemplated a multi-index system with an index on a RAM disk for
fast changes and a primary index on the main file system. It would have
worked fine for pure adds, but it was very tricky to manage state for
documents which were being "updated", i.e. deleted and re-added. How are you
handling all these small adds with your combo reader/writer? Do you not have
that problem?

Jason Rutherglen
added a comment - 08/Jan/09 16:45 Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed."
It does improve the write speed. I found, in making the realtime search write speed fast enough, that writing individual files per segment can become too costly: they accumulate fast, appending to a single file is faster than creating new files, and deleting the files becomes costly. For example, writing small individual files per commit, when the number of segments is large and a delete spans multiple segments, will generate many files. How much this matters varies with how often updates are expected to occur; I modeled this after the extreme case of the update frequency of a MySQL instance backing data for a web application.
The MySQL design, translated to Lucene, is a transaction log per index, where updates consisting of documents and deletes are written to the transaction log file. If Lucene crashed for some reason, the transaction log would be replayed. The in-memory indexes and newly deleted document bitvectors would be held in RAM (LUCENE-1314) until flushed, manually or based on memory usage. Many users may not want a transaction log, as they may be storing the updates in a separate SQL database instance (this is the case where I work), so a transaction log is redundant and should be optional. The first implementation of this will not have a transaction log.
Marvin: "I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?"
It would be exposed as a combination reader/writer that manages the transaction status of each update. The internal architecture is such that after each update a new reader representing the new documents and deletes for the transaction is generated and put onto a stack. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it).

Mark Miller
added a comment - 08/Jan/09 15:19 I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?
That's a fairly old version, I think (based on using MultiSearcher as a hack). Now we are using one queue and running it through each subreader of the MultiReader.
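The "one queue through each subreader" approach can be sketched in miniature (toy types, not Lucene's real Collector or IndexReader APIs): a single collector is driven across each sub-reader in turn, with a docBase offset mapping segment-local doc IDs into the composite reader's doc ID space.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of segment-centric collection: one collector is driven across
// each sub-reader in turn; docBase maps local doc IDs to global ones.
public class SegmentCentricSearch {
    interface Collector { void collect(int globalDoc); }

    // Each "segment" is just the array of segment-local doc IDs that match.
    static void search(int[][] segmentHits, int[] segmentSizes, Collector collector) {
        int docBase = 0;
        for (int seg = 0; seg < segmentHits.length; seg++) {
            for (int localDoc : segmentHits[seg]) {
                collector.collect(docBase + localDoc);
            }
            docBase += segmentSizes[seg]; // advance into the next segment's ID range
        }
    }

    public static void main(String[] args) {
        int[][] hits = { {0, 3}, {1} };  // matches per segment (local IDs)
        int[] sizes  = { 5, 4 };         // maxDoc per segment
        List<Integer> collected = new ArrayList<>();
        search(hits, sizes, collected::add);
        System.out.println(collected);   // prints [0, 3, 6]
    }
}
```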

Marvin Humphrey
added a comment - 08/Jan/09 15:18 Mike McCandless:
> For a TermQuery (one term) the cost of the two approaches should be
> the same.

It'll be close, but I don't think that's quite true. TermScorer pre-fetches
document numbers in batches from the TermDocs object. At present, only
non-deleted doc nums get cached. If we move the deletions filtering up, then
we'd increase traffic through that cache. However, filling it would be
slightly cheaper, because we wouldn't be performing the deletions check.

In theory. I'm not sure there's a way to streamline away that deletions check
in TermDocs and maintain backwards compatibility. And while this is a fun
brainstorm, I'm still far from convinced that having TermDocs.next() and
Scorer.next() return deleted docs by default is a good idea.

> For AND (and other) queries I'm not sure. In theory, having to
> process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
> may see much higher net cost.

If you were applying deletions filtering after Scorer.next(), then it seems
likely that costs would go up because of extra hit processing. However, if
you use Scorer.skipTo() to jump past deletions, as in the loop I provided
above, then PhraseScorer etc. shouldn't incur any more costs themselves.

> a costly per-docID search
> with a very restrictive filter could be far more efficient if you
> applied the Filter earlier in the chain.

Under the skipTo() loop, I think the filter effectively does get applied
earlier in the chain. Does that make sense?

I think the potential performance downside comes down to prefetching in
TermScorer, unless there are other classes that do similar prefetching.

Marvin Humphrey
added a comment - 08/Jan/09 14:48 How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?
You can do that now by implementing BitVector.nextSetBit(int tick) and using that in TermDocs to set a nextDeletion member var instead of checking every doc num with BitVector.get().
That way, the object that provides deletions can still be shared.
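The nextDeletion idea can be sketched with java.util.BitSet, whose real nextSetBit(int) has the semantics Marvin describes (standing in for the hypothetical BitVector.nextSetBit): instead of probing the bitset with get() on every doc num, cache the next deletion and re-probe only when a candidate doc catches up to it.

```java
import java.util.BitSet;

// Sketch of the nextDeletion idea, using java.util.BitSet.nextSetBit(int)
// as a stand-in for a hypothetical BitVector.nextSetBit. Rather than call
// get() on every doc num, we cache the next deletion and re-probe the
// bitset only when a candidate doc reaches it.
public class NextDeletionSketch {
    private final BitSet deletions;
    private int nextDeletion;

    public NextDeletionSketch(BitSet deletions) {
        this.deletions = deletions;
        this.nextDeletion = deletions.nextSetBit(0); // -1 if no deletions
    }

    // Returns true if doc survives (is not deleted); docs must arrive in order.
    public boolean accept(int doc) {
        if (nextDeletion != -1 && doc > nextDeletion) {
            nextDeletion = deletions.nextSetBit(doc);
        }
        return doc != nextDeletion;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(2);
        deleted.set(5);
        NextDeletionSketch it = new NextDeletionSketch(deleted);
        StringBuilder sb = new StringBuilder();
        for (int doc = 0; doc < 8; doc++) {
            if (it.accept(doc)) sb.append(doc).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints "0 1 3 4 6 7"
    }
}
```

Because the BitSet itself is touched only through nextSetBit, the shared object stays read-only; each TermDocs-like consumer keeps just one int of private state.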

Marvin Humphrey
added a comment - 08/Jan/09 14:32 Paul Elschot:
> How about a SegmentSearcher?
I like the idea of a SegmentSearcher in general. A little while back, I wondered whether exposing SegmentReaders was really the best way to handle segment-centric search. Upon reflection, I think it is. Segments are a good unit. They're pure inverted indexes (notwithstanding doc stores and tombstones); the larger composite only masquerades as one.
I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?
And then, to respond to the original suggestion, the SegmentSearcher level seems like a good place to handle application of a deletions quasi-filter. I think we could avoid having to deal with segment-start offsets that way.

Paul Elschot
added a comment - 08/Jan/09 11:35 To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level.
How about a SegmentSearcher?

Michael McCandless
added a comment - 08/Jan/09 10:57
> PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

This is a neat idea! Deletions are then applied just like a Filter.

For a TermQuery (one term) the cost of the two approaches should be
the same.

For OR'd Term queries, it actually seems like your proposed approach
may be lower cost? Ie rather than each TermEnum doing the "AND NOT
deleted" intersection, you only do it once at the top. There is added
cost in that each TermEnum is now returning more docIDs than before,
but the deleted ones are eliminated before scoring.

For AND (and other) queries I'm not sure. In theory, having to
process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
may see much higher net cost. We should test.

Conceivably, a future "search optimization phase" could pick & choose
the best point to inject the "AND NOT deleted" filter. In fact, it
could also pick when to inject a Filter... a costly per-docID search
with a very restrictive filter could be far more efficient if you
applied the Filter earlier in the chain.

I'm also curious what cost you see of doing the merge sort for every
search; I think it could be uncomfortably high since it's so
hard-for-cpu-to-predict-branch-intensive. We could take the first
search that doesn't use skipTo and save the result of the merge sort,
essentially doing an in-RAM-only "merge" of those deletes, and let
subsequent searches use that single merged stream. (This is not MMAP
friendly, though.)

In my initial rough testing, I switched to an iterator API for
SegmentTermEnum and found that if the pctg of deletes is < 10% the search
was a bit faster using an iterator vs random access, but above that was
slower. This was with an already "merged" list of in-order docIDs.

Switching to an iterator API for accessing field values for many docs
(LUCENE-831 – new FieldCache API, LUCENE-1231 – column stride
fields) shouldn't have this same problem since it's the "top level"
that's accessing the values (ie, one iterator per field X query).
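The "do the merge sort once and cache the merged stream" idea can be sketched with java.util.PriorityQueue (the names here are illustrative, not Lucene APIs): several sorted tombstone streams are interleaved into a single in-RAM docID list, which subsequent searches can iterate without repeating the branch-heavy merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the "merge once, reuse" idea: sorted tombstone streams are
// interleaved with a priority queue into a single in-RAM docID list that
// later searches can iterate without re-doing the merge sort.
public class TombstoneMergeSketch {
    static int[] merge(int[][] streams) {
        // queue entries: {docID, streamIndex, positionInStream}
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < streams.length; s++) {
            if (streams[s].length > 0) pq.add(new int[] { streams[s][0], s, 0 });
        }
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            // skip duplicates (the same doc deleted in two tombstone packets)
            if (merged.isEmpty() || merged.get(merged.size() - 1) != top[0]) {
                merged.add(top[0]);
            }
            int s = top[1], pos = top[2] + 1;
            if (pos < streams[s].length) pq.add(new int[] { streams[s][pos], s, pos });
        }
        return merged.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[][] packets = { { 3, 9, 40 }, { 7, 9 }, { 1 } };
        System.out.println(java.util.Arrays.toString(merge(packets))); // prints [1, 3, 7, 9, 40]
    }
}
```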

Paul Elschot
added a comment - 08/Jan/09 07:51 Jason,
This issue LUCENE-584 had an implementation of a "matcher" class that the scorers implemented, which I do not think made it into the committed patch.
The only functional difference between the DocIdSetIterator as committed and the Matcher class at 584 is the Matcher.explain() method, which did not make it into DocIdSetIterator.

Marvin Humphrey
added a comment - 08/Jan/09 02:38 Jason Rutherglen:
> For realtime search where a new transaction may only have a handful of
> deletes the tombstones may not be optimal

The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed.

When you make deletes with the BitSet model, you have to rewrite files that scale with segment size, regardless of how few deletions you make. Deletion of a single document in a large segment may necessitate writing out a substantial bit vector file.

In contrast, i/o throughput for writing out a tombstone file scales with the number of tombstones.

> because too many tombstones would accumulate (I believe).

Say that you make a string of commits that are nothing but deleting a single document – thus adding a new segment each time that contains nothing but a single tombstone. Those are going to be cheap to merge, so it seems unlikely that we'll end up with an unwieldy number of tombstone streams to interleave at search-time.

The more likely problem is the one McCandless articulated regarding a large segment accumulating a lot of tombstone streams against it. But I agree with him that it only gets truly serious if your merge policy neglects such segments and allows them to deteriorate for too long.

> For this scenario rolling bitsets may be better. Meaning pool bit sets and
> throw away unused readers.

I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?
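To put rough, illustrative numbers on the scaling argument (the ~2 bytes per delta-encoded tombstone is an assumption, not a measured figure): a bit vector file costs maxDoc/8 bytes no matter how few docs you delete, while a tombstone file costs a few bytes per tombstone.

```java
// Illustrative arithmetic for the scaling argument: a bit vector's file
// size depends only on segment size, while a tombstone file's size depends
// only on how many deletions it records (assuming, say, an average of
// ~2 bytes per delta-encoded tombstone — an assumption for illustration).
public class DeletionCostSketch {
    static long bitVectorBytes(long maxDoc) {
        return (maxDoc + 7) / 8; // one bit per doc, rounded up to whole bytes
    }

    static long tombstoneBytes(long numDeletes, long bytesPerTombstone) {
        return numDeletes * bytesPerTombstone;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        // Deleting a single doc in a 10M-doc segment:
        System.out.println("bit vector: " + bitVectorBytes(maxDoc) + " bytes"); // 1250000
        System.out.println("tombstones: " + tombstoneBytes(1, 2) + " bytes");   // 2
    }
}
```

So for one deletion in a 10M-doc segment, the BitSet model rewrites ~1.25 MB while the tombstone model writes a couple of bytes; the models only converge as the deletion percentage grows.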

Jason Rutherglen
added a comment - 08/Jan/09 01:44 I like this idea of tombstones and we should figure out a way to support it. This issue https://issues.apache.org/jira/browse/LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.
> I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right?
Yes that is true.
For realtime search, where a new transaction may only have a handful of deletes, the tombstones may not be optimal because too many tombstones would accumulate (I believe). For this scenario, rolling bitsets may be better: pool bit sets and throw away unused readers.
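One way to read the "rolling bitsets" suggestion (a sketch under my own interpretation, not Jason's actual implementation) is copy-on-write: each reader snapshot shares the current deletions BitSet, and a new delete clones it before mutating, so already-open readers keep a stable view at no extra cost.

```java
import java.util.BitSet;

// Sketch of one reading of the "rolling bitsets" idea (an interpretation,
// not Jason's actual implementation): reader snapshots share the current
// deletions BitSet; a delete clones it (copy-on-write) before mutating,
// so existing snapshots keep a stable view.
public class RollingBitSetSketch {
    private BitSet current = new BitSet();

    // Readers get the current BitSet without copying.
    public BitSet snapshot() {
        return current;
    }

    // A delete clones the shared BitSet so existing snapshots stay stable.
    public void delete(int doc) {
        BitSet copy = (BitSet) current.clone();
        copy.set(doc);
        current = copy;
    }

    public static void main(String[] args) {
        RollingBitSetSketch idx = new RollingBitSetSketch();
        idx.delete(4);
        BitSet readerA = idx.snapshot();
        idx.delete(7); // readerA's view must not change
        BitSet readerB = idx.snapshot();
        System.out.println(readerA.get(7) + " " + readerB.get(7)); // prints "false true"
    }
}
```

A pooled variant would recycle the superseded BitSet objects once no reader references them, rather than leaving them to the garbage collector.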

Marvin Humphrey
added a comment - 08/Jan/09 00:00 While converting over PostingList (the Lucy/KS analogue to TermDocs) to use a deletions iterator, it occurred to me that because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones. With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level. PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

i32_t
MatchAllScorer_next(MatchAllScorer *self)
{
    if (++self->doc_num > self->max_docs) {
        self->doc_num--;
        return 0;
    }
    return self->doc_num;
}

void
Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *del_enum)
{
    i32_t next_deletion = del_enum ? 0 : I32_MAX;
    i32_t doc_num       = 1;

    while (1) {
        /* Skip doc_num past any deletions. */
        while (doc_num >= next_deletion) {
            next_deletion = DelEnum_Skip_To(del_enum, doc_num);
            while (doc_num == next_deletion) {
                doc_num++;
                next_deletion = DelEnum_Skip_To(del_enum, doc_num);
            }
        }
        doc_num = Scorer_Skip_To(self, doc_num);
        if (doc_num) {
            HC_Collect(collector, doc_num, Scorer_Tally(self));
            doc_num++; /* advance past the doc just collected */
        }
        else {
            break;
        }
    }
}

The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right? As soon as we do away with random access, we have to keep state. Dunno if it's going to be noticeable, but it's conceptually annoying.

Jason Rutherglen
added a comment - 08/Dec/08 22:06 Wouldn't it be good to remove BitVector and replace it with OpenBitSet? OBS is faster, has the DocIdSetIterator already. It just needs to implement write to disk compression of the bitset (dgaps?). This would be a big win for almost all searches. We could also create an interface so that any bitset implementation could be used.
Such as:
public interface WriteableBitSet {
    public void write(IndexOutput output) throws IOException;
}

Marvin:
"I'm also bothered by the proliferation of small deletions files. Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index."

A transaction log might be better here if we want to go to 0ish millisecond realtime.
On Windows at least creating files rapidly and deleting them creates significant IO overhead.
UNIX is probably faster but I do not know.

Jason Rutherglen
added a comment - 08/Dec/08 21:58 Marvin:
"I'm also bothered by the proliferation of small deletions files. Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index."
A transaction log might be better here if we want to go to 0ish millisecond realtime.
On Windows at least creating files rapidly and deleting them creates significant IO overhead.
UNIX is probably faster but I do not know.

It's also incremental in cost (cost of flush/commit is in proportion
to how many deletes were done), but you are storing the "packet" of
incremental deletes with the segment you just flushed and not against
the N segments that had deletes. And you write only one file to hold
all the tombstones, which for commit() (file sync) is much less cost.

And it's great that we don't need a new merge policy to handle all the
delete files.

Though one possible downside is, for a very large segment in a very
large index you will likely be merging (at search time) quite a few
delete packets. But, with the cutover to
deletes-accessed-only-by-iterator, this cost is probably not high
until a large pctg of the segment's docs are deleted, at which point
you should really expungeDeletes() or optimize() or optimize(int)
anyway.

Michael McCandless
added a comment - 08/Dec/08 21:05
I like this approach!!
It's also incremental in cost (cost of flush/commit is in proportion
to how many deletes were done), but you are storing the "packet" of
incremental deletes with the segment you just flushed and not against
the N segments that had deletes. And you write only one file to hold
all the tombstones, which for commit() (file sync) is much less cost.
And it's great that we don't need a new merge policy to handle all the
delete files.
Though one possible downside is, for a very large segment in a very
large index you will likely be merging (at search time) quite a few
delete packets. But, with the cutover to
deletes-accessed-only-by-iterator, this cost is probably not high
until a large pctg of the segment's docs are deleted, at which point
you should really expungeDeletes() or optimize() or optimize(int)
anyway.
If only we could write code as quickly as we can dream...

That would improve the average performance of deleting a document, at the cost
of some added complexity. Worst-case performance – which you'd hit when you
consolidated those sub-segment deletions files – would actually degrade a
bit.

To manage consolidation, you'd need a deletions merge policy that operated
independently from the primary merge policy. Aside from the complexity penalty,
having two un-coordinated merge policies would be bad for real-time search,
because you want to be able to control exactly when you pay for a big merge.

I'm also bothered by the proliferation of small deletions files. Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index.

So... what if we wrote, merged, and removed deletions files on the same
schedule as ordinary segment files? Instead of going back and quasi-modifying
an existing segment by associating a next-generation .del file with it, we write
deletions to a NEW segment and have them reference older segments.

In other words, we add "tombstones" rather than "delete" documents.

Logically speaking, each tombstone segment file would consist of an array of
segment identifiers, each of which would point to a "tombstone row" array of
vbyte-encoded doc nums:

The thing that makes this possible is that the dead docs marked by tombstones
never get their doc nums shuffled during segment merging – they just
disappear. If deleted docs lived to be consolidated into new segments and
acquire new doc nums, tombstones wouldn't work. However, we can associate
tombstone rows with segment names, and they only need to remain valid as long
as the segments they reference survive.

Some tombstone rows will become obsolete once the segments they reference go
away, but we never arrive at a scenario where we are forced to discard valid
tombstones. Merging tombstone files simply involves dropping obsolete
tombstone rows and collating valid ones.

At search time, we'd use an iterator with an internal priority queue to
collate tombstone rows into a stream – so there's still no need to slurp the
files at IndexReader startup.

Marvin Humphrey
added a comment - 08/Dec/08 00:57 > One approach would be to use a "segmented" model.
That would improve the average performance of deleting a document, at the cost
of some added complexity. Worst-case performance – which you'd hit when you
consolidated those sub-segment deletions files – would actually degrade a
bit.
To manage consolidation, you'd need a deletions merge policy that operated
independently from the primary merge policy. Aside from the complexity penalty,
having two un-coordinated merge policies would be bad for real-time search,
because you want to be able to control exactly when you pay for a big merge.
I'm also bothered by the proliferation of small deletions files. Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index.
So... what if we wrote, merged, and removed deletions files on the same
schedule as ordinary segment files? Instead of going back and quasi-modifying
an existing segment by associating a next-generation .del file with it, we write
deletions to a NEW segment and have them reference older segments.
In other words, we add "tombstones" rather than "delete" documents.
Logically speaking, each tombstone segment file would consist of an array of
segment identifiers, each of which would point to a "tombstone row" array of
vbyte-encoded doc nums:
// _6.tombstone
_2: [3, 4, 25]
_3: [13]
// _7.tombstone
_2: [5]
// _8.tombstone
_1: [94]
_2: [7, 8]
_5: [54, 55]
The thing that makes this possible is that the dead docs marked by tombstones
never get their doc nums shuffled during segment merging – they just
disappear. If deleted docs lived to be consolidated into new segments and
acquire new doc nums, tombstones wouldn't work. However, we can associate
tombstone rows with segment names, and they only need to remain valid as long
as the segments they reference survive.
Some tombstone rows will become obsolete once the segments they reference go
away, but we never arrive at a scenario where we are forced to discard valid
tombstones. Merging tombstone files simply involves dropping obsolete
tombstone rows and collating valid ones.
At search time, we'd use an iterator with an internal priority queue to
collate tombstone rows into a stream – so there's still no need to slurp the
files at IndexReader startup.
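That collation step can be sketched in Java. Below, a hypothetical TombstoneCollator (the class and method names are illustrative, not from Lucene or KS) treats each tombstone row as a sorted array of doc nums targeting one segment, and uses a priority queue to merge the rows into a single ascending deletions stream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: collate several sorted tombstone rows (deleted doc
// nums targeting one segment, each row written by a different, later
// segment) into one ascending stream, as a search-time iterator would.
public class TombstoneCollator {
    // Each queue entry is {currentDocNum, rowId, positionInRow};
    // the queue orders entries by current doc num.
    private final PriorityQueue<int[]> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    private final List<int[]> rows = new ArrayList<>();

    public void addRow(int[] sortedDocNums) {
        if (sortedDocNums.length > 0) {
            queue.add(new int[] { sortedDocNums[0], rows.size(), 0 });
            rows.add(sortedDocNums);
        }
    }

    /** Next deleted doc num, or -1 once all rows are exhausted. */
    public int next() {
        if (queue.isEmpty()) return -1;
        int[] top = queue.poll();
        int doc = top[0], rowId = top[1], pos = top[2] + 1;
        int[] row = rows.get(rowId);
        if (pos < row.length) {
            queue.add(new int[] { row[pos], rowId, pos });
        }
        return doc;
    }
}
```

For instance, the three rows targeting segment _2 in the example above ([3, 4, 25], [5], and [7, 8]) would collate to the stream 3, 4, 5, 7, 8, 25.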

Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does - just leave that out.

Michael McCandless
added a comment - 06/Dec/08 10:18 Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does - just leave that out.
Agreed. I'll try to do something along these lines under LUCENE-1458 .

> It would be great if instead of relying on Lucene to manage the
> deletedDocs file, the API would be pluggable

In LUCENE-1478, "IndexComponent" was proposed, with potential subclasses including PostingsComponent, LexiconComponent/TermDictComponent, TermVectorsComponent, and so on. Since then, it has become apparent that SnapshotComponent and DeletionsComponent also belong at the top level.

Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does – just leave that out.

Marvin Humphrey
added a comment - 05/Dec/08 22:46 > It would be great if instead of relying on Lucene to manage the
> deletedDocs file, the API would be pluggable
In LUCENE-1478 , "IndexComponent" was proposed, with potential subclasses including PostingsComponent, LexiconComponent/TermDictComponent, TermVectorsComponent, and so on. Since then, it has become apparent that SnapshotComponent and DeletionsComponent also belong at the top level.
In Lucy/KS, these would all be specified within a Schema:
class MySchema extends Schema {
    DeletionsComponent deletionsComponent() {
        return new DocIdBitSetDeletionsComponent();
    }

    void initFields() {
        addField("title", "text");
        addField("content", "text");
    }

    Analyzer analyzer() {
        return new PolyAnalyzer("en");
    }
}
Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does – just leave that out.

Marvin Humphrey
added a comment - 05/Dec/08 21:55 > Does it really need to throw an exception?
Aside from back compat, I don't see why it would need to. I think the only rationale is to serve as a backstop protecting against invalid reads.

It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable enough such that a DocIdBitSet (DocIdSet with random access) could be set in a SegmentReader, and the file access (reading and writing) could be managed from outside. Of course this is difficult to do and still make things backwards compatible, however for 3.0 I would really like to be a part of efforts to create a completely generic and pluggable API that is cleanly separated from the underlying index format and files. This would mean that the analyzing, querying, scoring portions of Lucene could access an IndexReader like pluggable class where the underlying index files, when and how the index files are written to disk is completely separated.

One motivation for this patch is to allow custom queries access to the deletedDocs in a clean way (meaning not needing to be a part of the o.a.l.s. package)

I am wondering if it is good to try to get IndexReader.clone working again, or if there is some other better way related to this patch to externally manage the deletedDocs.

Improving the performance of deletedDocs would help for every query so it's worth looking at.

Jason Rutherglen
added a comment - 05/Dec/08 20:51 It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable enough such that a DocIdBitSet (DocIdSet with random access) could be set in a SegmentReader, and the file access (reading and writing) could be managed from outside. Of course this is difficult to do and still make things backwards compatible, however for 3.0 I would really like to be a part of efforts to create a completely generic and pluggable API that is cleanly separated from the underlying index format and files. This would mean that the analyzing, querying, scoring portions of Lucene could access an IndexReader like pluggable class where the underlying index files, when and how the index files are written to disk is completely separated.
One motivation for this patch is to allow custom queries access to the deletedDocs in a clean way (meaning not needing to be a part of the o.a.l.s. package)
I am wondering if it is good to try to get IndexReader.clone working again, or if there is some other better way related to this patch to externally manage the deletedDocs.
Improving the performance of deletedDocs would help for every query so it's worth looking at.

Michael McCandless
added a comment - 05/Dec/08 20:01 but IndexReader.document throws an exception if the document is deleted... so you still need random access
Does it really need to throw an exception? (Of course for back compat it does, but we could move away from that to a new method that doesn't check).

In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.

But isn't that an uncommon use case? It's dangerous to wait a long
time after getting a docID from a reader, before looking up the
document. Most apps pull the doc right away, send it to the user, and
the docID isn't kept (I think?).

But still I agree: we can't eliminate random access to isDeleted
entirely. We'd still have to offer it for such external cases.

I'm just saying the internal uses of isDeleted could all be switched
to iteration instead, and, we might get some performance gains from
it especially when the number of deletes on a segment is relatively low.

Michael McCandless
added a comment - 05/Dec/08 18:48
In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.
By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).
I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.
But isn't that an uncommon use case? It's dangerous to wait a long
time after getting a docID from a reader, before looking up the
document. Most apps pull the doc right away, send it to the user, and
the docID isn't kept (I think?).
But still I agree: we can't eliminate random access to isDeleted
entirely. We'd still have to offer it for such external cases.
I'm just saying the internal uses of isDeleted could all be switched
to iteration instead, and, we might get some performance gains from
it especially when the number of deletes on a segment is relatively low.

It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

Nice! This is what I had in mind.

I think we could [almost] do this across the board for Lucene.
SegmentTermDocs would similarly store nextDeleted and apply the same
"AND NOT" logic.

that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted

Hmm – that is very nicely enabling.

I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck

One approach would be to use a "segmented" model. IE, if a few
deletions are added, write that to a new "deletes segment", ie a
single "normal segment" would then have multiple deletion files
associated with it. These would have to be merged (iterator) when
used during searching, and, periodically coalesced.

if we only need iterator access, we can use vbyte encoding instead

Right: if there are relatively few deletes against a segment, encoding
the "on bits" directly (or deltas) should be a decent win since
iteration is much faster.

Michael McCandless
added a comment - 05/Dec/08 18:42
It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:
Nice! This is what I had in mind.
I think we could [almost] do this across the board for Lucene.
SegmentTermDocs would similarly store nextDeleted and apply the same
"AND NOT" logic.
that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted
Hmm – that is very nicely enabling.
I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck
One approach would be to use a "segmented" model. IE, if a few
deletions are added, write that to a new "deletes segment", ie a
single "normal segment" would then have multiple deletion files
associated with it. These would have to be merged (iterator) when
used during searching, and, periodically coalesced.
if we only need iterator access, we can use vbyte encoding instead
Right: if there are relatively few deletes against a segment, encoding
the "on bits" directly (or deltas) should be a decent win since
iteration is much faster.
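As a sketch of that tradeoff, here is roughly what vbyte-encoded deltas ("dgaps") over an ascending list of deleted doc nums might look like. The class is hypothetical; a real deletions file would also need a header, a count, checksums, etc.:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: encode an ascending list of deleted doc nums as
// vbyte-compressed deltas, and decode them back by iteration. No random
// access, but compact when deletions are sparse.
public class VByteDeletions {
    public static byte[] encode(int[] sortedDocNums) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocNums) {
            int delta = doc - prev;
            prev = doc;
            while ((delta & ~0x7F) != 0) {        // 7 payload bits per byte;
                out.write((delta & 0x7F) | 0x80); // high bit means "more"
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    public static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0, shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

With few deletions the gaps are mostly one or two bytes each, versus maxDoc/8 bytes for a full bit vector regardless of how many docs are actually deleted.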

> Marvin, in KS/Lucy are you using random-access or iterator to access
> deletedDocs & norms?

Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.

I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

(Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)

Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted – unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.

I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!

I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue. I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...

Marvin Humphrey
added a comment - 05/Dec/08 18:22 > Marvin, in KS/Lucy are you using random-access or iterator to access
> deletedDocs & norms?
Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.
I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:
i32_t
MatchAllScorer_next(MatchAllScorer* self)
{
do {
if (++self->doc_num > self->max_docs) {
self->doc_num--;
return 0;
}
if (self->doc_num > self->next_deletion) {
self->next_deletion
= DelEnum_Skip_To(self->del_enum, self->doc_num);
}
} while (self->doc_num == self->next_deletion);
return self->doc_num;
}
(Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)
Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted – unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.
I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!
I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue. I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...

robert engels
added a comment - 05/Dec/08 14:21 I don't think you can change this...
In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.
By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).
I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.

But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?

Although IndexReader.isDeleted exposes a random-access API to deleted docs, I think it may be overkill.

Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a simple iterator (DocIdSet) would in fact suffice.

EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending docID. It might be a performance gain (pure speculation) if we used an iterator API, because we could hold "nextDelDocID" and only advance that (skipTo) when the term's docID has moved past it. It's just like an "AND NOT X" clause.

Similarly, norms, which also now expose a random-access API, should be fine with an iterator type API as well.

This may also imply better VM behavior, since we don't actually require norms/deletions to be fully memory resident.

This would be a biggish change, and it's not clear whether/when we should explore it, but I wanted to get the idea out there.

Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?

Michael McCandless
added a comment - 05/Dec/08 11:29 But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?
Although IndexReader.isDeleted exposes a random-access API to deleted docs, I think it may be overkill.
Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a simple iterator (DocIdSet) would in fact suffice.
EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending docID. It might be a performance gain (pure speculation) if we used an iterator API, because we could hold "nextDelDocID" and only advance that (skipTo) when the term's docID has moved past it. It's just like an "AND NOT X" clause.
Similarly, norms, which also now expose a random-access API, should be fine with an iterator type API as well.
This may also imply better VM behavior, since we don't actually require norms/deletions to be fully memory resident.
This would be a biggish change, and it's not clear whether/when we should explore it, but I wanted to get the idea out there.
Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?
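Mike's "AND NOT X" idea can be sketched as a wrapper that holds the next deleted docID and advances the deletions cursor only when the current docID catches up to it. Plain java.util.Iterators stand in here for SegmentTermDocs and the deletions enum; all names are illustrative, not Lucene's:

```java
import java.util.Iterator;

// Hypothetical sketch of iterator-style deletion filtering: hold
// nextDelDocID and skip the deletions cursor lazily, like "AND NOT X".
public class NonDeletedDocs {
    private final Iterator<Integer> postings;   // ascending matching docIDs
    private final Iterator<Integer> deletions;  // ascending deleted docIDs
    private int nextDelDocID;

    public NonDeletedDocs(Iterator<Integer> postings,
                          Iterator<Integer> deletions) {
        this.postings = postings;
        this.deletions = deletions;
        this.nextDelDocID =
            deletions.hasNext() ? deletions.next() : Integer.MAX_VALUE;
    }

    /** Next non-deleted docID, or -1 when the postings are exhausted. */
    public int next() {
        while (postings.hasNext()) {
            int doc = postings.next();
            while (nextDelDocID < doc) {  // advance del cursor lazily
                nextDelDocID =
                    deletions.hasNext() ? deletions.next() : Integer.MAX_VALUE;
            }
            if (doc != nextDelDocID) {
                return doc;               // doc is live
            }
        }
        return -1;
    }
}
```

The deletions cursor is touched only when a matching doc passes it, so a segment with few deletions pays almost nothing per hit.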

BitVector does not implement the methods of java.util.BitSet. RABitSet could be implemented by OpenBitSet and BitVector. This way an OpenBitSet or another filter such as P4Delta could be used in place of BitVector in SegmentReader.

The IndexReader.flush type of methods would need to either automatically not save or throw an exception, and there needs to be a setting to choose between the two. This helps the synchronization issue in SegmentReader.isDeleted by allowing access to it.

Jason Rutherglen
added a comment - 04/Dec/08 00:16 BitVector does not implement the methods of java.util.BitSet. RABitSet could be implemented by OpenBitSet and BitVector. This way an OpenBitSet or another filter such as P4Delta could be used in place of BitVector in SegmentReader.
The IndexReader.flush type of methods would need to either automatically not save or throw an exception, and there needs to be a setting to choose between the two. This helps the synchronization issue in SegmentReader.isDeleted by allowing access to it.
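A minimal sketch of what such a random-access-plus-iterator interface might look like, with a toy long[]-backed implementation roughly in the spirit of OpenBitSet's word array. Both names (RABitSet, SimpleRABitSet) are hypothetical, not existing Lucene classes:

```java
// Hypothetical interface combining the random access SegmentReader.isDeleted
// needs for back compat with the building block for iterator-style access.
public interface RABitSet {
    boolean get(int index);     // random access
    int nextSetBit(int index);  // first set bit at or after index, or -1
}

// Toy long[]-backed implementation, one bit per doc, 64 bits per word.
class SimpleRABitSet implements RABitSet {
    private final long[] words;

    SimpleRABitSet(int numBits) {
        words = new long[(numBits + 63) >> 6];
    }

    public void set(int index) {
        words[index >> 6] |= 1L << (index & 63);
    }

    public boolean get(int index) {
        return (words[index >> 6] & (1L << (index & 63))) != 0;
    }

    public int nextSetBit(int index) {
        int w = index >> 6;
        if (w >= words.length) return -1;
        long word = words[w] >>> (index & 63);  // shift off lower bits
        if (word != 0) return index + Long.numberOfTrailingZeros(word);
        for (w++; w < words.length; w++) {
            if (words[w] != 0) {
                return (w << 6) + Long.numberOfTrailingZeros(words[w]);
            }
        }
        return -1;
    }
}
```

An OpenBitSet or a compressed implementation such as P4Delta could then sit behind the same interface inside SegmentReader.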