Understanding lucene indexes and disk I/O

Understanding lucene indexes and disk I/O

Hi all,

Please let me know if this should be posted instead to the Lucene java-dev list.

We have very large tis files (about 36 GB). I have not been too concerned as I assumed that due to the indexing of the tis file by the tii file, only a small portion of the file needed to be read. However, upon re-reading the Lucene Index File Formats document: http://lucene.apache.org/java/3_0_1/fileformats.html#Term%20DictionaryI am now wondering whether most of the tis file may need to be read if a term is near the end of the file.

I'm trying to understand whether lucene has enough information stored in the *tii and *tis files to determine what byte offset in the prx file to seek to in order to get the freq or positions list for a particular term and whether a sequential read of the entire *tis file up to the term being sought is needed in order to decode ProxDeltas and determint the byte offset.

What has me confused is that the Lucene Index File Formats document says:

"ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file. For fields with omitTf true, this will be 0 since prox information is not stored."

I assumed that Lucene implements a binary search of the tii file, then reads the appropriate 128 (IndexDivisor) entries from the tis file and does a binary search on that. (So that should be one disk seek when the searcher starts up to read in the entire tii file and then a second disk seek to load in the appropriate data for 128 terms from the tis file. However, once a term's entry in the tis file is found, if only the difference between this term and the previous term's position in the prx file is stored, it seems that in order to get the actual byte offset of a term in the prx file, all of the previous term's ProxDelta's starting at the first term in the file, would have to be read and added together. If that is true then we are talking a sequential read of the entire tis file up to the current term.

Is this correct? Can someone point me to the area of the code base where this is implemented ? Am I missing something here?

Re: Understanding lucene indexes and disk I/O

Hi Tom,

Fear not: we only scan up to 128 terms, to find the specific term.

First, the terms dict index (tii) is fully loaded into RAM, and then a
binary search is done on this (in-RAM) to find the nearest index term
just before the term you want. Then, we seek to that spot in the
main terms dict (tis) file, and scan (at most 128 entries) to find the
term.

On the frq/prx deltas: the tii holds absolute pointers. So, on
seeking to that first spot in the tis, we know the absolute frq/prx
(long) offsets, and then during scanning we just add the deltas we
see to those base absolutes.

BTW, this approach sucks up alot of extra RAM (holding 2 longs = 16
bytes, per indexed term)... in flex (now committed to trunk) we've
improved this so that the tii entry does not hold these longs, and,
instead, every indexed term inside the tis file holds the absolute
pointer not the delta. We also switched to parallel arrays, to avoid
the GC/object RAM overhead of one object per indexed term, and used
packed ints for some of these arrays.

Hmm though we also hold the terms characters as utf8 bytes, which may
net/net take more RAM for your case...

I would love to get ahold of your terms dict :) I'd have a field day
testing Lucene against it... I'm very curious how the flex
improvements affect your usage.

> Hi all,
>
> Please let me know if this should be posted instead to the Lucene java-dev list.
>
> We have very large tis files (about 36 GB). I have not been too concerned as I assumed that due to the indexing of the tis file by the tii file, only a small portion of the file needed to be read. However, upon re-reading the Lucene Index File Formats document: http://lucene.apache.org/java/3_0_1/fileformats.html#Term%20Dictionary> I am now wondering whether most of the tis file may need to be read if a term is near the end of the file.
>
> I'm trying to understand whether lucene has enough information stored in the *tii and *tis files to determine what byte offset in the prx file to seek to in order to get the freq or positions list for a particular term and whether a sequential read of the entire *tis file up to the term being sought is needed in order to decode ProxDeltas and determint the byte offset.
>
> What has me confused is that the Lucene Index File Formats document says:
>
> "ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file. For fields with omitTf true, this will be 0 since prox information is not stored."
>
> I assumed that Lucene implements a binary search of the tii file, then reads the appropriate 128 (IndexDivisor) entries from the tis file and does a binary search on that. (So that should be one disk seek when the searcher starts up to read in the entire tii file and then a second disk seek to load in the appropriate data for 128 terms from the tis file. However, once a term's entry in the tis file is found, if only the difference between this term and the previous term's position in the prx file is stored, it seems that in order to get the actual byte offset of a term in the prx file, all of the previous term's ProxDelta's starting at the first term in the file, would have to be read and added together. If that is true then we are talking a sequential read of the entire tis file up to the current term.
>
> Is this correct? Can someone point me to the area of the code base where this is implemented ? Am I missing something here?
>
>
> Tom Burton-West
>
>

RE: Understanding lucene indexes and disk I/O

Thanks Mike,

At some point maybe the File Formats Document could be updated to make it clear that the tii has an entry similar to the IntexInterval'th tis entry but instead of holding frq/prx deltas it holds absolute pointers. Is it worth entering a JIRA issue? I would be happy to update the doc myself, but I'm don't think I have enough of an in depth understanding.

As you probably have guessed, I'm trying to understand the impact of the over 2.4 billion unique terms in our indexes on performance (https://issues.apache.org/jira/browse/LUCENE-2257). We suspect that a very large percentage of these terms are due to dirty OCR, but have not yet found a good way to eliminate a significant amount of dirty OCR.

I assume that these cause a few extra steps in the binary search of the tii file in memory but we probably won't notice that performance impact since our bottleneck is disk I/O for reading long postings lists for frequently occurring terms.

Am I correct in assuming that even if we have a very large number of garbage terms in our prx file, the overall size of the file does not significantly affect the number of disk seeks or amount of data to be read since Lucene can seek to the beginning of the postings for any particular term?

>> I would love to get ahold of your terms dict :) I'd have a field day
>>testing Lucene against it... I'm very curious how the flex improvements affect your usage.

Sometime in the next month or so we will get our new test server and after I get the backup of testing jobs under control, I'd love to do some testing with flex and our data.

First, the terms dict index (tii) is fully loaded into RAM, and then a
binary search is done on this (in-RAM) to find the nearest index term
just before the term you want. Then, we seek to that spot in the
main terms dict (tis) file, and scan (at most 128 entries) to find the
term.

On the frq/prx deltas: the tii holds absolute pointers. So, on
seeking to that first spot in the tis, we know the absolute frq/prx
(long) offsets, and then during scanning we just add the deltas we
see to those base absolutes.

Re: Understanding lucene indexes and disk I/O

> At some point maybe the File Formats Document could be updated to make it clear that the tii has an entry similar to the IntexInterval'th tis entry but instead of holding frq/prx deltas it holds absolute pointers. Is it worth entering a JIRA issue? I would be happy to update the doc myself, but I'm don't think I have enough of an in depth understanding.

Well I need to redo this part of file format docs, due to the flex
changes (LUCENE-2371)... I'll add a comment to remember to
state that this means we scan at most 128 terms.

> As you probably have guessed, I'm trying to understand the impact of the over 2.4 billion unique terms in our indexes on performance (https://issues.apache.org/jira/browse/LUCENE-2257). We suspect that a very large percentage of these terms are due to dirty OCR, but have not yet found a good way to eliminate a significant amount of dirty OCR.

Fun problem ;)

> I assume that these cause a few extra steps in the binary search of the tii file in memory but we probably won't notice that performance impact since our bottleneck is disk I/O for reading long postings lists for frequently occurring terms.

Yeah, and added RAM usage holding the tii in memory.

In theory, one could make a flex codec that instead of indexing every
Nth (128 by default) term, indexes in a non-regular manner, eg at any
term > X in frequency (though nobody has built such a codec!). In
theory this'd be good for cases like yours... because you don't need
every 128 when many of them are very low count. Ie the added time to
scan to a rare term is usually a non-issue since iterating that rare
term's docs will be so fast. But added cost when scanning to a common
term "hurts" because it'll take so long to walk that common term's
docs.

> Am I correct in assuming that even if we have a very large number of garbage terms in our prx file, the overall size of the file does not significantly affect the number of disk seeks or amount of data to be read since Lucene can seek to the beginning of the postings for any particular term?

Right, frq and prx.

>>> I would love to get ahold of your terms dict :) I'd have a field day
>>>testing Lucene against it... I'm very curious how the flex improvements affect your usage.
>
> Sometime in the next month or so we will get our new test server and after I get the backup of testing jobs under control, I'd love to do some testing with flex and our data.

I'd love to see the results of this :) EG basic stats like
before/after size of the tis/tii files, RAM used on opening the
reader, startup up time, search performance...