Description

To avoid keeping very many long-lived PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] that maps each term to a dense termID.

All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This avoids the need for object pooling and removes the overhead of object initialization and garbage collection. Garbage collection in particular should benefit significantly when the JVM runs low on memory, because in that situation GC mark times can get very long if a large number of long-lived objects is in memory.
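The shape of this change can be sketched as follows. This is an illustrative toy, not the actual Lucene indexer code; the class and field names (ParallelPostingsSketch, freqs, lastDocIDs, lastPositions, addOccurrence) are made up for the example.

```java
// Illustrative toy only - class and field names here are hypothetical,
// not the actual Lucene indexer classes.
class ParallelPostingsSketch {

    // Before: one long-lived object per unique term.
    static class PostingList {
        int freq;
        int lastDocID;
        int lastPosition;
    }

    // After: one slot per termID in parallel primitive arrays.
    // The terms hash maps a term's text to a dense termID, and that
    // termID indexes every array.
    int[] freqs = new int[16];
    int[] lastDocIDs = new int[16];
    int[] lastPositions = new int[16];

    void addOccurrence(int termID, int docID, int position) {
        freqs[termID]++;
        lastDocIDs[termID] = docID;
        lastPositions[termID] = position;
    }
}
```

The hash resolves a term's text to a dense termID once; every per-term field then becomes a primitive array access instead of a pointer chase through a PostingList object.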

Another benefit could be to build more efficient TermVectors. We could avoid having to store the term string per document in the TermVector; instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. We can make this improvement in a separate JIRA issue, though.

Activity

Michael McCandless
added a comment - 18/Mar/10 09:21

This would be great!

But, note that term vectors today do not store the term char[] again – they piggyback on the term char[] already stored for the postings. Though, I believe they store "int textStart" (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term), so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs...

> Slightly off-topic ... Having a facility to obtain termIDs per segment (or better yet per index) would greatly benefit Solr's UnInverted field creation, which currently needs to assign term IDs by linear scanning.

Michael McCandless
added a comment - 18/Mar/10 15:01

This issue is just about how IndexWriter's RAM buffer stores its terms... But, the flex API adds long ord() and seek(long ord) to the TermsEnum API.


Michael Busch
added a comment - 18/Mar/10 17:51

> This issue is just about how IndexWriter's RAM buffer stores its terms...

Actually, when I talked about the TermVectors I meant we should explore storing the termIDs on disk, rather than the strings. It would help things like similarity search and facet counting.

> But, note that term vectors today do not store the term char[] again - they piggyback on the term char[] already stored for the postings.

Yeah, I think I'm familiar with that part (secondary entry point in TermsHashPerField, hashes based on termStart). I haven't looked much into how the "rest" of the TermVector in-memory data structures work.

> Though, I believe they store "int textStart" (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term)

Actually, we wouldn't need a second hashtable for the secondary TermsHash anymore, right? Like the primary TermsHash, it would just have a parallel array with the things that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID, of course.

This would be a nice simplification, because no hash collisions, no hash table resizing based on load factor, etc. would be necessary for non-primary TermsHashes.

> so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs...

How does the read performance of packed ints compare to "normal" int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance? Oh man, will we need flexible indexing for the in-memory index too?

Michael McCandless
added a comment - 18/Mar/10 18:36

> Actually, when I talked about the TermVectors I meant we should explore to store the termIDs on disk, rather than the strings. It would help things like similarity search and facet counting.

Ahhhh, that would be great!

> Actually we wouldn't need a second hashtable for the secondary TermsHash anymore, right? It would just have like the primary TermsHash a parallel array with the things that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID of course.

Hmm, the challenge is that the tracking done for term vectors is just within a single doc. Ie, the hash used for term vectors only holds the terms for that one doc (so it's much smaller), vs the primary hash that holds terms for all docs in the current RAM buffer. So we'd be burning up much more RAM if we also keyed into the term vector's parallel arrays using the primary termID?

But I do think we should cut over to parallel arrays for TVTW, too.

> How does the read performance of packed ints compare to "normal" int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance?

It's definitely slower to read/write to/from packed ints, and I agree: indexing and searching speed trumps RAM efficiency.

> Oh man, will we need flexible indexing for the in-memory index too?

EG custom attrs appearing in the TokenStream? Yes, we will need to... but hopefully once we get serialization working cleanly for the attrs this'll be easy? With ByteSliceWriter/Reader you just .writeBytes and .readBytes...

I don't think we should allow Codecs to be used in the RAM buffer anytime soon, though...

Michael Busch
added a comment - 22/Mar/10 08:05

This patch changes the indexer to use parallel arrays instead of PostingList objects (for both FreqProx* and TermVectors*).

All core & contrib & bw tests pass. I haven't done performance tests yet.

I'm wondering how to manage the size of the parallel array. I started with an initial size equal to the size of the postingsHash array. When it's full, I allocate a new one with 1.5x the size. When shrinkHash() is called, I also shrink the parallel array to the same size as postingsHash. How does that sound?
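The sizing policy described here could look roughly like the following. This is a sketch with hypothetical names (ParallelArraySizing, ensureCapacity, shrink), not the patch itself: start at the postingsHash size, grow by 1.5x when full, and shrink back to the hash size on shrinkHash().

```java
import java.util.Arrays;

// Sketch of the proposed sizing policy; names are hypothetical.
class ParallelArraySizing {
    int[] freqs;   // stands in for all of the parallel arrays

    ParallelArraySizing(int postingsHashSize) {
        // initial size matches the postingsHash array
        freqs = new int[postingsHashSize];
    }

    // Grow by 1.5x when a termID falls past the end.
    void ensureCapacity(int termID) {
        if (termID >= freqs.length) {
            int newSize = Math.max(termID + 1, (int) (freqs.length * 1.5));
            freqs = Arrays.copyOf(freqs, newSize);
        }
    }

    // On shrinkHash(), shrink back down to the hash size.
    void shrink(int postingsHashSize) {
        if (freqs.length > postingsHashSize) {
            freqs = Arrays.copyOf(freqs, postingsHashSize);
        }
    }
}
```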

Michael Busch
added a comment - 22/Mar/10 08:28

One change I should make to the patch is how to track the memory consumption: should bytesAllocated() be called when the parallel array is allocated or grown, and bytesUsed() only when a new termID is added?

Michael McCandless
added a comment - 22/Mar/10 11:19

Looks great, Michael!

I think the *ParallelPostingsArray.reset methods do not need to zero-fill the arrays – the entries are overwritten when that termID is first used, right?

Michael Busch
added a comment - 22/Mar/10 15:21

> I think *ParallelPostingsArray.reset do not need to zero-fill the arrays - these are overwritten when that termID is first used, right?

Good point! I'll remove the reset() methods.

Michael Busch
added a comment - 23/Mar/10 00:48 - edited

I did some performance experiments: I indexed 1M Wikipedia documents using the cheap WhitespaceAnalyzer, with no cfs files, any merging disabled, RAM buffer size = 200MB, a single writer thread, and TermVectors enabled.

Test machine: MacBook Pro, 2.53 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, Mac OS X 10.5.8.

Results with -Xmx2000m:

                             Write performance   Gain     Number of segments
    trunk                    833 docs/sec                 41
    trunk + parallel arrays  869 docs/sec        +4.3%    32

Results with -Xmx256m:

                             Write performance   Gain     Number of segments
    trunk                    467 docs/sec                 41
    trunk + parallel arrays  871 docs/sec        +86.5%   32

So I think these results are interesting and roughly as expected. 4.3% is a nice small performance gain. But running the tests with a low heap shows how much cheaper the garbage collection becomes. Setting IW's RAM buffer to 200MB and the overall heap to 256MB forces the GC to run frequently. The mark times are much more costly if we have all the long-living PostingList objects in memory, compared to parallel arrays.

So this is probably not a huge deal for "normal" indexing. But once we can search on the RAM buffer it becomes much more attractive to fill up the RAM as much as you can, and exactly in that case we save a lot with this improvement.

Also note that the number of segments decreased by 22% (from 41 to 32). This shows that the parallel-array approach needs less RAM, thus flushes less often and will cause fewer segment merges in the long run. So a longer test with actual segment merges would show even bigger gains (with both big and small heaps).

So overall, I'm very happy with these results!

Michael McCandless
added a comment - 23/Mar/10 09:07

Sweet, this looks great, Michael! Less RAM used and faster indexing (much less GC load) – win/win.

It's a little surprising that the segment count dropped from 41 to 32. Ie, how much less RAM do the parallel arrays take? They save the object header per unique term, and 4 bytes on 64-bit JREs since the "pointer" is now an int and not a real pointer. But other things use RAM (the docIDs in the postings themselves, norms, etc.), so it's surprising the savings was so large that you get 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting?

Michael Busch
added a comment - 23/Mar/10 15:16

> so it's surprising the savings was so much that you get 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting?

Yeah, it seems a bit suspicious. I'll investigate. But keep in mind that TermVectors were enabled too. And the number of "unique terms" in the 2nd TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash in each round, that sum should be higher than numPostings from the first TermsHash.

Michael McCandless
added a comment - 23/Mar/10 15:54

> But, keep in mind that TermVectors were enabled too.

OK, but RAM used by TermVectors* shouldn't participate in the accounting... ie, it only holds RAM for one doc at a time.

> And the number of "unique terms" in the 2nd TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash in each round that sum should be higher than numPostings from the first TermsHash.

1st TermsHash = current trunk and 2nd TermsHash = this patch? Ie, it has more unique terms at flush time (because it's more RAM efficient)? If so, then yes, I agree. But 22% fewer still seems too good to be true...

Michael Busch
added a comment - 23/Mar/10 16:19

> OK, but, RAM used by TermVectors* shouldn't participate in the accounting... ie it only holds RAM for the one doc, at a time.

Man, my brain is lacking the TermVector synapses...

I checked how many bytes were allocated for postings when the first segment was flushed. Trunk flushed after 6400 docs and had 103MB allocated for PostingList objects. With the 2329 patch, the first flush happened after 8279 docs, with 94MB allocated for the parallel arrays, of which 74MB were actually used.

The first docs in the Wikipedia dataset seem pretty large, with many unique terms.

Michael McCandless
added a comment - 29/Mar/10 11:58

I think we need to fix how RAM is managed for this... right now, if you turn on IW's infoStream you'll see a zillion prints where IW tries to balance RAM (it "runs hot"), but nothing can be freed. We do this per-doc, after the parallel arrays resize themselves to net/net over our allowed RAM buffer.

A few ideas on how we can fix this:

- I think we have to change when we flush. It's now based on RAM used (not alloc'd), but I think we should switch it to use RAM alloc'd after we've freed all we can. Ie, if we free things up and we've still alloc'd over the limit, we flush. This'll fix the running hot we now see...

- TermsHash.freeRAM is now a no-op, right? We have to fix this to actually free something when it can, because you can imagine indexing docs that are postings heavy but then switching to docs that are byte[] block heavy. On that switch you have to balance the allocations (ie, shrink your postings). I think we should walk the threads/fields and use ArrayUtil.shrink to shrink down, but not shrink by much at a time (to avoid running hot) – IW will invoke this method again if more shrinkage is needed.

- Also, shouldn't we use ArrayUtil.grow to increase, instead of always a 50% growth? Because with such a large growth you can easily have horrible RAM efficiency... ie, that 50% growth can suddenly put you over the limit and then you flush, having effectively used only half of the allowed RAM buffer in the worst case.

Michael Busch
added a comment - 01/Apr/10 01:32

This patch:

- Changes DocumentsWriter to trigger the flush using bytesAllocated instead of bytesUsed, to improve the "running hot" issue Mike's seeing
- Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in freeRAM(), but that involves tricky thread-safety changes, because one thread could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across all thread states, while other threads keep indexing.

We decided to leave it the way it currently works: we discard the whole parallel array during flush and don't reuse it. This is not as optimal as it could be, but once LUCENE-2324 is done this won't be an issue anymore anyway.

Note that this new patch is against the flex branch: I thought we'd switch it over soon anyway? I can also create a patch for trunk if that's preferred.
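The growth policy being adopted here can be approximated as follows. This is a simplified sketch of ArrayUtil.oversize()-style growth, not the real method: the actual implementation also rounds the result based on element size and JVM pointer alignment, which this version omits.

```java
// Simplified sketch of ArrayUtil.oversize()-style growth: instead of a
// flat 50% growth factor, over-allocate by roughly 1/8 past the requested
// minimum, so a grow that happens near the RAM limit wastes far less of
// the buffer. (The real method also rounds for element size and JVM
// alignment; this keeps only the 1/8 rule.)
class OversizeSketch {
    static int oversize(int minTargetSize) {
        int extra = minTargetSize >> 3;   // grow by ~12.5%
        if (extra < 3) {
            extra = 3;                    // keep tiny arrays growing usefully
        }
        return minTargetSize + extra;
    }
}
```

Compared to 1.5x growth, a grow from 1024 elements yields 1152 slots rather than 1536, so an allocation just under the RAM budget overshoots by far less.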

Michael McCandless
added a comment - 01/Apr/10 18:58

Fixed a couple of issues in the last patch:

- We weren't notifying DW that we freed up RAM when setting postingsArray to null
- We can shrink postingsArray separately from the termsHash, and, instead of nulling it, we can simply downsize it

Tests pass, and indexing perf on the first 10M 1 KB sized Wikipedia docs is a bit faster, though probably in the noise (1296 sec on current flex head vs 1238 sec with this patch).

Michael McCandless
added a comment - 02/Apr/10 00:49

New patch attached:

- Does away with separate tracking of used vs alloc'd in IndexWriter. This distinction added much complexity and only saved a small number of free/allocs per flush cycle, especially now that postings realloc only in big chunks (parallel arrays).
- Fixed some over-counting of bytes used.

The indexing throughput is basically unchanged after this (on the first 10M 1KB Wikipedia docs), so I think this is a good simplification.
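The simplification described in this comment might be sketched like this, with a single allocated-bytes counter driving the flush decision. The names (RamAccounting, onAlloc, onFree, shouldFlush) are hypothetical, not the actual DocumentsWriter API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the simplified accounting: one allocated-bytes counter,
// no separate "used" tally. Names are hypothetical.
class RamAccounting {
    final AtomicLong bytesAllocated = new AtomicLong();
    final long ramBufferBytes;

    RamAccounting(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    void onAlloc(long bytes) { bytesAllocated.addAndGet(bytes); }
    void onFree(long bytes)  { bytesAllocated.addAndGet(-bytes); }

    // Flush once allocations meet or exceed the buffer limit,
    // after freeing whatever can be freed.
    boolean shouldFlush() {
        return bytesAllocated.get() >= ramBufferBytes;
    }
}
```

Because only allocations and frees move the counter, there is no per-doc balancing pass trying to reconcile "used" against "allocated", which is the complexity the patch removes.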

Michael McCandless
added a comment - 05/Apr/10 20:21

Reopening – this fix causes an intermittent deadlock in TestStressIndexing2.

It's actually a pre-existing issue, whereby if a flush happens only because of deletions (ie, no indexed docs), and you're using multiple threads, it's possible some idled threads would fail to be notified to wake up and continue indexing once the flush completes.

The fix here increased the chance of hitting that bug because the RAM accounting has a bug whereby it overly aggressively flushes because of deletions, ie, rather than freeing up RAM allocated but not used for indexing, it flushes.

I first fixed the deadlock case (we need to clear DW's flushPending when we only flush deletes).

Then I fixed the shared deletes/indexing RAM by:

- Not reusing the RAM for postings arrays – we now null this out for every field after flushing
- Calling balanceRAM when deletes have filled up RAM before deciding to flush, because this can free RAM up, making more space for deletes

I also further simplified things – no more separate call to doBalanceRAM, and added a fun unit test that randomly alternates between pure indexing and pure deleting, asserting that the flushing doesn't "run hot" on any of those transitions.