Introducing Lucene Index Doc Values

From day one Apache Lucene provided a solid inverted index datastructure and the ability to store the text and binary chunks in stored field. In a typical usecase the inverted index is used to retrieve & score documents matching one or more terms. Once the matching documents have been scored stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index like term & document frequency, boosts and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk read, meaning the perform best if you load all their data while during document retrieval we need more fine grained data.

Lucene provides a RAM resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the fieldsince the inverted index is a value to document mapping and FieldCache is a document to value datastructure. For simplicity think of an array indexed by Lucene’s internal documents ID. When the FieldCache is loaded Lucene iterates all terms in a field, parses the terms values and fills the arrays slots based on the document IDs associated with the term. Figure 1. illustrats the process.

Figure 1. Univerting a field to FieldCache

FieldCache serves very well for its purpose since accessing a value is basically doing a constant time array look. However, there are special cases where other datastructures are used in FieldCache but those are out of scope in this post.

So if you need to score based on custom scoring factors or you need to access per-document values FieldCache provides very efficient access to them. Yet, there is no such thing as free lunch. Uninverting the field is very time consuming since we need to first walk a datastructure which is basically the invers of what we need and then parse each value which is typically a String (until Lucene 4) or a UTF-8 encoded byte array. If you are in a frequently changing environment this might turn into a serious bottleneck.

Nevertheless, with FieldCache you get all or nothing. You either have enough RAM to keep all your data withing Java Heapspace or you can’t use FieldCache at all.

IndexDocValues – move FieldCache to the index

A reasonably new feature in Lucene trunk (4.0) tries to overcome the limitations of FieldCache by providing a document to value mapping built at index time. IndexDocValues allows you to do all the work during document indexing with a lot more control over your data. Each ordinary Lucene Field accepts a typed value (long, double or byte array) which is stored in a column based fashion.

This doesn’t sound like a lot more control yet, right? Beside “what” is stored IndexDocValues also exposes “how” those values are stored in the index. The main reason for exposing internal datastructure was that users usually konw way more about their data so why hide it, Lucene is a low level library. I will only scratch the surface of all the variant so see the ValueType javadocs for details.

For integer types IndexDocValues provides byte aligned variant for 8, 16, 32 and 64 bits as well as compressed PackedInts. For floating point values we currently only provide float 32 and float 64. However, for byte array values IndexDocValues offers a lot of flexibility. You can specify if you values have fixed or variable length, if they should be stored straight or in a dereferenced fashion to get a good compression in the case the number of distinct values is low.

Cool, the first limitation of FieldCache is fixed! There is no need to neither un-invert nor parse the values from the inverted index. So loading should be pretty fast and it actually is. I ran several benchmarks for loading up IndexDocValues for float 32 variants vs. loading FieldCache from the same data and the results are compelling – IndexDocValues loads 80 to 100 X faster than building a FieldCache.

So lets look at the second limitation – all or nothing. FieldCache is entirely RAM resident which might not be possible or desired in certain scenarios but since we need to un-invert there is not much of a choice.

IndexDocValues provide the best of both worlds whatever the users choice is at runtime. The API provides a very simply interface called Source which can either be entirely RAM resident (a signle shared instance per field and segment) or Disk-Resident (insteance per thread). With both RAM resident and on disk the same Random-Access interface is used to retrieve a single value per field for a given document ID.

The performance conscious of you might ask if the lookup performance is comparable to FieldCache since now we need to do an additional method call per value lookup vs. a single array lookup. The answer is: “it depends”! For instance if you choose to use PackedInts to compress your integers you certainly pay a price but if you choose a 32-bit aligned variant you can actually access the underlaying array via Source#getArray() just like Java NIO ByteBuffers.

You want it sorted?

By default Lucene returns search results sorted by the individual document score. However, one of the most commonly used features (especially in the Enterprise sector) is sorting by individual fields. This is yet another usecase where FieldCache is used for performance reasons. FieldCache can load up the already sorted values from the terms dictionary, providing lookup by Key & Ordinal. IndexDocValues provides the same functinality for fixed and variable length byte [ ] variants through a SortedSource interface. Figure 2 illustrates obtaining a SortedSource instance.

Recent benchmarks using SortedSource have shown equal performance to FieldCache and even with on-disk versions the performance hit is between 30% and 50%. These properties can be extremely helpful especialy for users with a large number of fields they need to sort on.

Wrapping up…

This introduction is the first post in a serious of posts for IndexDocValues. In the upcoming weeks we gonna publish more detailed post on how to use IndexDocValues for Sorting and Result Grouping. Since Lucene 4 also provides Flexible Scoring, using IndexDocValues with Lucene’s new Similarity deserves yet another post.

For the folks more interested in the technical background and how to extend IndexDocValues, understand how the internal type promotion works or event write your own low level implementation I’m planning to publish a low level codec post too. So stay tuned.

1 Comment

3 Pings/Trackbacks

[…] Lucene's central scoring class Similarity has been extended to return dedicated Scorers like ExactDocScorer and SloppyDocScorer to calculate the actual score. This refactoring basically moved the actual score calculation out of the QueryScorer into a Similarity to allow implementing alternative scoring within a single method. Lucene 4.0 also comes with a new SimilarityProvider which lets you define a Similarity per field. Each field could use a slightly different similarity or incorporate additional scoring factors like IndexDocValues. […]

[…] values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. There are many applications that can benefit from using IndexDocValues for search […]

[…] data you are storing you also sometimes have massive runtime costs. In Lucene 4 you can use DocValues to store your boosts / categories per document in a nice & efficient datastructure; one of my […]

This information is extremely valuable, but please run a spell and grammar check on it. So many mistakes in a short writeup can undermine your credibility. Regardless, thank for the concise information.