Disk-Based Field Data a.k.a. Doc Values

Elasticsearch is not just about full-text search: many users do not use it for full-text search at all but for analytics through facets. This approach works well but, as you probably know, faceting or sorting on a field requires loading its values into in-memory data structures that we call field data. It is very common for field data to take several gigabytes, or even tens of gigabytes, of memory. Memory is rather cheap, so it is usually not a problem to get boxes with enough of it. However, this can raise issues at the JVM level: major garbage collections on a heap of several tens of gigabytes can easily take several seconds, during which your application will be unresponsive. Careful JVM tuning can help prevent this issue, but ideally field data should be stored off-heap.

Doc Values to the Rescue

Doc values is a feature that will be available in the forthcoming Elasticsearch 1.0 release. You can already check it out in our 1.0 beta release.

Doc values are a Lucene 4.x feature that allows field values to be stored on disk in a column-stride fashion, which is filesystem-cache friendly and suitable for custom scoring, sorting, or faceting, exactly like field data. It was only natural to build a new field data backend based on doc values, and this new implementation has several benefits compared to the traditional field data implementations, which Elasticsearch builds by uninverting the inverted index:

Memory is managed at the OS level through the filesystem cache instead of the JVM, just like the rest of the index.

Since doc values data structures are computed at indexing time, _refresh will be faster.

The way doc values are computed provides more efficient compression.

On the other hand, doc values make indices bigger (unless they let you skip indexing the field altogether, e.g. if the field is used solely for sorting), and field data intensive workloads such as faceting will be slightly slower.
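To make the difference between the two layouts concrete, here is a toy sketch of what "uninverting" means. It is only an illustration with made-up data and a hypothetical `uninvert` helper, not Lucene's actual data structures: an inverted index maps each term to the documents that contain it, while field data and doc values need the opposite view, from each document to its values.

```python
# Toy inverted index for a single field: term -> sorted list of doc IDs.
# The terms and doc IDs are invented for illustration.
inverted_index = {
    "blue": [0, 2],
    "green": [1],
    "red": [2, 3],
}


def uninvert(index, num_docs):
    """Build a doc -> values column by walking every term's postings list."""
    column = [[] for _ in range(num_docs)]
    for term in sorted(index):
        for doc_id in index[term]:
            column[doc_id].append(term)
    return column


# Traditional field data does this kind of work when the field is first
# loaded into memory; doc values write the column out at indexing time.
column = uninvert(inverted_index, num_docs=4)
for doc_id, values in enumerate(column):
    print(doc_id, values)
```

Sorting or faceting then only needs to read `column[doc_id]` for each matching document, which is why a column-oriented layout suits these operations.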

When Should I Use Doc Values?

Doc values can be used as a drop-in replacement for uninverted field data in most cases, but there are a few cases where they can be particularly helpful:

Constrained hardware: The in-memory field data data structures take up a lot of space, and sometimes you just can't add more and more RAM to your nodes in order to sustain the growth of your field data.

Daily jobs: Some users have queries which are run daily in order to get point-in-time aggregated views on their data. In those cases, the most important thing is not query latency or throughput but rather not taking all the resources that your cluster may need for other workloads. Doc values can help here since they won't fill up memory with data structures that won't be used again.

Memory management: Having the Elasticsearch data structure that takes the most memory moved to disk means that it is now possible to start an Elasticsearch process with only 1G of memory, and to let the OS handle memory through the filesystem cache with almost no risk of out-of-memory errors. Another nice aspect of this type of setup is that there are usually far fewer garbage collection issues on heaps of 1G than on heaps of tens of gigabytes.

How to Enable Doc Values

Doc values are an index-time decision, so they need to be enabled in the mappings before indexing the first document. Here is an example of a string field definition that can be used for sorting and faceting, but not searching:
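A minimal sketch of such a field definition, built as a Python dictionary and printed as JSON. It assumes the 1.0-era `fielddata` setting with `format` set to `doc_values`; the field name `my_field` is hypothetical.

```python
import json

# Hypothetical field name. "index": "no" keeps the field out of the
# inverted index, so it cannot be searched, while the fielddata format
# asks for the doc values backend (assumed 1.0-era syntax).
mapping = {
    "my_field": {
        "type": "string",
        "index": "no",
        "fielddata": {"format": "doc_values"},
    }
}

print(json.dumps(mapping, indent=2))
```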

As you can see, fields don't need to be indexed to enable doc values. And once doc values are enabled, all operations working on top of field data like sorting or faceting will transparently use doc values under the hood.
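For instance, a search request that sorts and facets on a doc-values field looks the same as one that relies on in-memory field data. The sketch below builds such a request body in Python; it assumes the 1.0-era `facets` API, and the field name `my_field` and facet name `my_field_counts` are hypothetical.

```python
import json

# Nothing in the request mentions doc values: the choice of field data
# backend is made in the mapping, not at query time.
search_body = {
    "query": {"match_all": {}},
    "sort": [{"my_field": {"order": "asc"}}],
    "facets": {
        "my_field_counts": {"terms": {"field": "my_field", "size": 10}}
    },
}

print(json.dumps(search_body, indent=2))
```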