Part 2.0: The true story behind Elasticsearch storage requirements

Several months ago we wrote about the true story behind Elasticsearch storage requirements. Today we’re refreshing those tests with the new Elasticsearch 2.0 beta-1, which includes several enhancements. If you’re looking for a quick answer: for users doing both search and visual analysis of their log data, these tests showed an overall compression ratio of 0.732 (index size on disk / raw log file size).

But of course, the answer to “How much hardware will I need?” is still “It depends!” Many factors need to be taken into account, and both this post and our previous one are meant to elaborate on these factors in detail so you can make informed choices about both performance and hardware purchases.

Changes since last time

When we wrote our last blog post, we ran our experiments using Elasticsearch 1.4. Since then, there have been hundreds of new features, enhancements and bug fixes made to Elasticsearch. With respect to disk storage requirements, the key changes are: 1) the addition of a best_compression option for stored fields and 2) doc_values enabled by default. In addition, we found a bug in our original experiments that had a significant impact on index size.

DEFLATE: no footballs have been tampered with in these tests

As described in this blog post by Adrien Grand, one of our engineers and core Lucene committers, Lucene 5.0 added a new compression option using the DEFLATE algorithm (the same algorithm behind zip, gzip, and png). Elasticsearch 2.0 is the first version of Elasticsearch to rely on Lucene 5.x and thus is able to leverage the new compression option. This is exposed via the index.codec configuration setting.
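As a minimal sketch of how the setting might be applied when an index is created, the Python below only builds the request body and does not contact a cluster. The helper name create_index_body is hypothetical; "default" (LZ4) and "best_compression" (DEFLATE) are the two codec values discussed in this post.

```python
import json


def create_index_body(codec="best_compression"):
    """Build a create-index request body that sets index.codec.

    The helper name is illustrative; only the index.codec setting
    itself comes from the post.
    """
    return {"settings": {"index": {"codec": codec}}}


if __name__ == "__main__":
    # Print the JSON that would be sent as the body of the create-index call.
    print(json.dumps(create_index_body(), indent=2))
```

The body would be sent with the index creation request for whichever index should use DEFLATE for its stored fields.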

In our tests, using best_compression reduced our index size by 15-25%, depending on the configuration. This is substantial, especially when looking at larger clusters. Many large Elasticsearch clusters run 50-100 nodes or more, where cluster size is primarily driven by sheer data volume. When you can cut down the amount of hardware by 15-25%, that’s a pretty significant change for the better.

So what’s the catch? Only the stored fields (the value of the _source field) are compressed, so the decompression penalty applies only when stored fields are returned in the query response. When the size=0 parameter is added to the request (as is recommended for pure aggregation queries and what Kibana does under the covers), there is no decompression penalty. There is also a small performance penalty at index time; in many cases, people will gladly give up the extra CPU required for compression in exchange for disk space, but you’ll have to consider this in the context of your requirements.
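To make the size=0 case concrete, here is a small sketch of an aggregation-only search body. The field name "response" and the aggregation name "by_value" are placeholders chosen for illustration, not names from our test data.

```python
import json


def aggregation_only_request(field, agg_name="by_value"):
    """Build a search body that returns aggregation results but no hits.

    With size set to 0, no documents are returned, so their stored
    fields do not need to be decompressed for the response.
    """
    return {
        "size": 0,
        "aggs": {agg_name: {"terms": {"field": field}}},
    }


if __name__ == "__main__":
    # "response" is a hypothetical field name.
    print(json.dumps(aggregation_only_request("response"), indent=2))
```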

Alternatively, we expect many users will want to utilize a “hot/warm” architecture using the shard allocation filtering feature. In this scenario, time-based indexes on hot nodes can be created with the default LZ4 compression; when they’re migrated to the warm nodes (configured with index.codec: best_compression in elasticsearch.yml), each index can be recompressed by optimizing it. It may be preferable to pay the CPU penalty of compression during this optimize process (which we often recommend executing when the cluster is known to be less utilized) rather than at the time of initial indexing.
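The migration could be sketched as two requests: one that updates the index's allocation filter so its shards move to warm nodes, and one that optimizes the index afterwards. The code below only assembles the requests. The node attribute name box_type, its value "warm", the index name, and the choice of max_num_segments=1 are all assumptions for illustration; the optimize endpoint is written as it is named in the Elasticsearch 2.0 line.

```python
def warm_migration_steps(index_name, attr="box_type", warm_value="warm"):
    """Return (method, path, body) tuples for moving an index to warm nodes.

    attr and warm_value stand in for whatever custom node attribute a
    cluster uses to tag its warm nodes; they are not fixed names.
    """
    relocate = (
        "PUT",
        f"/{index_name}/_settings",
        {f"index.routing.allocation.require.{attr}": warm_value},
    )
    # Optimizing rewrites segments, which is when the CPU cost of
    # recompression would be paid on the warm nodes.
    optimize = ("POST", f"/{index_name}/_optimize?max_num_segments=1", None)
    return [relocate, optimize]


if __name__ == "__main__":
    # "logstash-2015.09.01" is a hypothetical time-based index name.
    for method, path, body in warm_migration_steps("logstash-2015.09.01"):
        print(method, path, body)
```

In practice the optimize step would be issued only after the shards have finished relocating, ideally during a quiet period for the cluster.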

What's up doc_values?

Doc values were introduced with Elasticsearch 1.0 as an alternative way of storing data: they use a columnar representation that is better suited to analytics workloads and can be accessed off the JVM heap. Initially, using doc values was only recommended in specific scenarios. Over time, though, we’ve seen our users wanting to run analytics on continuously growing volumes of data that simply could not be executed under the constraints of the JVM heap. Fortunately, we’ve made significant enhancements to eliminate most of the performance gap between using field_data and doc_values, to the point where we felt comfortable enabling doc_values by default for fields of all types except analyzed string fields.

There are two ways doc values can impact a cluster’s hardware requirements. First, enabling doc_values requires more disk but this should be offset by the gains from enabling best_compression as described above. Second, doc_values can reduce overall hardware footprint by not letting the limits of the JVM heap define the hardware profile of the nodes or the horizontal scaling point. For example, without doc_values enabled, a node might have been sized with a smaller indexed data volume to ensure that JVM heap utilization is kept under control to avoid frequent garbage collection. This becomes less of a concern when the data can be accessed off-heap.
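As an illustration of which fields get doc values, the sketch below builds a mapping in which every string field is not_analyzed (and so eligible for doc values) except for a chosen set of analyzed fields. The field names are hypothetical, and doc_values is spelled out explicitly only for clarity, since it is the default for these field types in 2.0.

```python
def string_field(analyzed=False):
    """Return a string field definition in the 2.0-era mapping style."""
    if analyzed:
        # Analyzed strings are the exception: doc values are not enabled for them.
        return {"type": "string", "index": "analyzed"}
    return {"type": "string", "index": "not_analyzed", "doc_values": True}


def build_mapping(field_names, analyzed_fields=()):
    """Build a properties block, analyzing only the fields listed."""
    return {
        "properties": {
            name: string_field(name in analyzed_fields) for name in field_names
        }
    }


if __name__ == "__main__":
    # Hypothetical access-log fields; only "agent" is analyzed.
    print(build_mapping(["clientip", "verb", "agent"], analyzed_fields=("agent",)))
```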

Dates and strings

In our original tests, our Logstash configuration converted the string representation of the date/time into a date-typed value. We forgot to delete the original string representation, which is safe to remove once the date/time parsing has succeeded. This oversight is significant: we discovered the date/time string values made up about 20% of the overall index size in our test data set. Elasticsearch, as with most search engines, builds an inverted index to provide fast lookups; when a field’s values are almost all unique (such as a timestamp), this can greatly increase the size of that index.
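The fix amounts to "parse the date, and drop the original string only if parsing worked". Our actual change was made in the Logstash configuration; the Python below is just a sketch of the same logic, assuming an Apache-style timestamp format and hypothetical field names timestamp and @timestamp.

```python
from datetime import datetime

# Assumed log timestamp layout, e.g. "10/Oct/2015:13:55:36 +0000".
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def convert_timestamp(event, source_field="timestamp", target_field="@timestamp"):
    """Return a copy of event with the date string replaced by a parsed date.

    The original string is removed only when parsing succeeds; on
    failure the event is returned unchanged so no data is lost.
    """
    result = dict(event)
    raw = result.get(source_field)
    if raw is None:
        return result
    try:
        parsed = datetime.strptime(raw, LOG_TIME_FORMAT)
    except ValueError:
        return result
    result[target_field] = parsed.isoformat()
    del result[source_field]
    return result


if __name__ == "__main__":
    print(convert_timestamp({"timestamp": "10/Oct/2015:13:55:36 +0000", "verb": "GET"}))
```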

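Test | String fields | _all | Index size, LZ4 (bytes) | Index size, DEFLATE (bytes) | Ratio to raw, LZ4 | Ratio to raw, DEFLATE | Change
 | not_analyzed, except for 'message' field which is retained and analyzed | disabled | 65382872 | 49532858 | 0.966 | 0.732 | -0.242
4 | not_analyzed, except for 'agent' field which is analyzed | disabled | 43083702 | 32063602 | 0.636 | 0.474 | -0.255

Semi-structured data file. Original file size: 75037027

Test | String fields | _all | Index size, LZ4 (bytes) | Index size, DEFLATE (bytes) | Ratio to raw, LZ4 | Ratio to raw, DEFLATE | Change
1 | analyzed and not_analyzed | enabled | 100478376 | 82132782 | 1.339 | 1.094 | -0.182
2 | analyzed and not_analyzed | disabled | 75238480 | 56911638 | 1.002 | 0.758 | -0.243
3 | not_analyzed | disabled | 71866672 | 53553561 | 0.957 | 0.713 | -0.254
3b | not_analyzed, except for 'message' field which is retained and analyzed | disabled | 104638750 | 83824398 | 1.394 | 1.117 | -0.198
4 | not_analyzed, except for 'agent' field which is analyzed | disabled | 72925624 | 54603882 | 0.971 | 0.727 | -0.251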

With the standard LZ4-based compression, the indexed data size to raw data size ratio ranged from 0.575 to 1.394. After enabling DEFLATE-based compression using the best_compression index.codec option, the indexed data size to raw data size ratio range came down to 0.429 to 1.117. Enabling the best_compression option resulted in a 15.7% to 25.6% reduction in indexed data size depending on the test parameters.
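For anyone repeating these measurements, the two derived figures are simple to compute: the ratio is index size divided by raw file size, and the reduction is the relative change from the LZ4 index size to the DEFLATE index size. The sketch below uses the semi-structured test 4 sizes reported above (raw file 75037027, LZ4 index 72925624, DEFLATE index 54603882); the function names are illustrative.

```python
def size_ratio(index_bytes, raw_bytes):
    """Indexed data size divided by raw data size."""
    return index_bytes / raw_bytes


def relative_change(before_bytes, after_bytes):
    """Fractional change in size; negative means the index got smaller."""
    return (after_bytes - before_bytes) / before_bytes


if __name__ == "__main__":
    raw = 75037027      # semi-structured raw log file size
    lz4 = 72925624      # test 4 index size with default LZ4
    deflate = 54603882  # test 4 index size with best_compression

    print(f"LZ4 ratio:     {size_ratio(lz4, raw):.4f}")
    print(f"DEFLATE ratio: {size_ratio(deflate, raw):.4f}")
    print(f"Change:        {relative_change(lz4, deflate):.4f}")
```

Swapping in the on-disk index sizes and raw file size from your own test gives the expansion or compression factor for your data set.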

As you can see, the ratio of index size to raw data size can vary greatly based on your mapping configuration, what fields you decide to create/retain, and the characteristics of the data set itself. We encourage you to run similar tests yourself to determine what the data compression/expansion factor is for your data set and application requirements.

Conclusion

There were many amazing features added to Elasticsearch 2.0 worth considering. As we’ve discussed, two of these new features in particular can reduce the hardware footprint required for an Elasticsearch cluster by 15-25% or more: 1) the addition of a best_compression option and 2) enabling doc_values by default. This allows us to get to compression ratios between 0.429 and 1.117.

Being an open source company means a lot more to us than just the licensing model of our core products. It means being open about the capabilities (and limitations) of our products. Our hope is that these experiments encourage you to discover the truth about Elasticsearch’s capabilities with regard to storage requirements yourself and not just take our word (or anyone else’s) for it. We’d love to hear about the results of your own experiments using your own data and the specific configuration or mapping settings that make sense for your application. You can submit those config and data files or file issues at this GitHub repo: https://github.com/elastic/elk-index-size-tests.