[ http://issues.apache.org/jira/browse/LUCENE-738?page=comments#action_12456216 ]
Doron Cohen commented on LUCENE-738:
------------------------------------
I tried two implementations:
(1) writing d-gaps for ids of deleted docs, and
(2) writing d-gaps for indexes of non zero bytes in the "byte bits []" array, plus the non-zero
byte itself.
I favor (2) because it needs less CPU and because its code is simpler.
I ran the following performance test (Win XP, ~2GHz, 2GB RAM):
- create an index with 2,000,000 short docs.
- delete 4,000 docs in steps of 8: 0, 8, 16, 24, ...
("steps of 8" is worst case for the new code, because this way a single written byte never
covers more than one deleted doc.)
Results for original code:
Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........232.1.......51.70
OpenReader.-..-..4000.-..-..-..-.1.-..-...150.0.-..-..26.67
DeleteDoc........4000............1......1,542.6........2.59
CloseReader.-....4000.-..-..-..-.1.-..-...180.0.-..-..22.23
Size of the .del file was 245KB.
Results for new code:
Operation......runCnt...recsPerRun........rec/s..elapsedSec
Deletions_4000......1........12000........263.8.......45.49
OpenReader.-..-..4000.-..-..-..-.1.-..-...173.3.-..-..23.08
DeleteDoc........4000............1......1,679.3........2.38
CloseReader.-....4000.-..-..-..-.1.-..-...201.4.-..-..19.86
Size of .del file was (at the end) 8KB.
So the improvement in this somewhat artificial scenario was 13%.
(I initially thought it would be higher).
I noticed when running repeatedly that the new code run time is more steady, while the original
code run time fluctuates - e.g. 58.8, 51.7, 56.0, 51.8 (reported above are the best times.)
We should ask: is this improvement sufficient to justify making BitVector code less simple?
Is a shorter .del file an advantage by itself?
I tend to "yes" for both questions, but would like to hear more opinions.
Change is backwards compatible - when opening a .del file (actually a bit-vector file) that
was written with older version, it identifies that and reads it correctly. Existing backwards
compatibility tests (of lock-less commits) have .del files that are opened correctly with
new code.
I added a test case to BitVectorTest - it tests the switch from writing d-gaps to writing
bits as the number of set bits increases or decreases.
This test also shows for what range of values d-gaps are written:
- size=100: never
- size=1,000: count < 6
- size=10,000: count < 42
- size=100,000: count < 417
- size=1,000,000: count < 3125
All tests pass.
> read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
> ----------------------------------------------------------------------------
>
> Key: LUCENE-738
> URL: http://issues.apache.org/jira/browse/LUCENE-738
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.1
> Reporter: Doron Cohen
> Assigned To: Doron Cohen
>
> .del file of a segment maintains info on deleted documents in that segment. The file
exists only for segments having deleted docs, so it does not exists for newly created segments
(e.g. resulted from merge). Each time closing an index reader that deleted any document, the
.del file is rewritten. In fact, since the lock-less commits change a new (generation of)
.del file is created in each such occasion.
> For small indexes there is no real problem with current situation. But for very large
indexes, each time such an index reader is closed, creating such new bit-vector seems like
unnecessary overhead in cases that the bit vector is sparse (just a few docs were deleted).
For instance, for an index with a segment of 1M docs, the sequence: {open reader; delete 1
doc from that segment; close reader;} would write a file of ~128KB. Repeat this sequence 8
times: 8 new files of total size of 1MB are written to disk.
> Whether this is a bottleneck or not depends on the application deletes pattern, but for
the case that deleted docs are sparse, writing just the d-gaps would save space and time.
> I have this (simple) change to BitVector running and currently trying some performance
tests to, yet, convince myself on the worthiness of this.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org