Sizing Elasticsearch

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Scaling up and out

We're often asked "How big a cluster do I need?", and it's usually hard to be more specific than "Well, it depends!". There are many variables involved: knowledge of your application's specific workload and your performance expectations is just as important as the number of documents and their average size. In this article we won't offer a specific answer or a formula; instead, we'll equip you with a set of questions to ask yourself, and some tips on finding their answers.

Resource estimation is no exact science. Elasticsearch can be used for so many different purposes, each with its own challenges and demands. Some workloads require everything to be in memory to provide responses in milliseconds, while others can get by with indexes whose on-disk size is many orders of magnitude bigger than available memory. It can even be exactly the same workload, where one deployment does mission-critical real-time reporting and the other holds archived data whose searchers are patient.

Then there's growth planning, for the long-term and the short-term. There's expected growth, and the need to handle sudden unexpected growth.

We'll start by looking at different approaches to indexing and sharding that each solve a certain problem. This is important in the long run: if you haven't planned ahead and, for example, sharded appropriately, you cannot necessarily just add more hardware to your cluster to meet your growth needs.

Knowing more about how to prepare for the future, we'll look at how to reason about resource usage on the underlying nodes. This enables us to understand what needs attention when testing.

Lastly, we'll look at things to keep in mind when devising tests to give you confidence you can handle required growth while also meeting performance expectations.

The atomic scaling unit for an Elasticsearch index is the shard. A shard is actually a complete Lucene index. If you are unfamiliar with how Elasticsearch interacts with Lucene on the shard level, Elasticsearch from the Bottom Up is worth a read. Since the nomenclature can be a bit ambiguous, we'll make it clear whether we are discussing a Lucene or an Elasticsearch index.

In the abstraction layer cake, you need to consider everything below The Shard as a single indivisible unit for scaling purposes. Shards can be moved around, but they cannot be divided further. Consequently, the shard must be small enough that the hardware handling it will cope. While there is no technical upper limit on the size of a shard/Lucene index, there is a limit to how big a shard can be with respect to your hardware, your use case and performance requirements. Unfortunately, that limit is unknown and hard to estimate exactly. Approaches to finding the limits are discussed in the section on testing.

If the shard grows too big, you have two options: upgrading the hardware to scale up vertically, or rebuilding the entire Elasticsearch index with more shards, to scale out horizontally to more machines of the same kind.

Scaling out requires appropriate sharding. Shards cannot be split.

An Elasticsearch index with two shards is conceptually exactly the same as two Elasticsearch indexes with one shard each. The difference is largely the convenience Elasticsearch provides via its routing feature, which we will get back to in the next section.

This insight is important for several reasons. First, it makes clear that sharding comes with a cost. Storing the same amount of data in two Lucene indexes is more than twice as expensive as storing the same data in a single index. This is because Lucene index internals like term dictionaries will have to be duplicated. Also, there's a cost associated with having more files to maintain and more metadata to spend memory on.

Second, searching more shards takes more time than searching fewer. There's more data to process, and, depending on your search type, possibly several round trips over all the shards as well.

So while it can be necessary to over-shard and have more shards than nodes when starting out, you cannot simply make a huge number of shards and forget about the problem.

As emphasized in the previous section, there's no single solution that will solve all of your scaling issues. You have to make an educated choice. Thus, it's useful to look into different strategies for partitioning data in different situations.

For time oriented data, such as logs, a common strategy is to partition data into indexes that hold data for a certain time range. For example, the index logstash-2014.01.01 holds data for events that happened on 2014-01-01, i.e. a time range of a day. You can of course choose bigger or smaller time ranges as well, depending on your needs. Using index templates, you can easily manage settings and mappings for any index created with a name starting with e.g. logstash-*.
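As a minimal sketch of this naming scheme (the `logstash-` prefix follows the Logstash convention mentioned above):

```python
from datetime import datetime, timezone

def daily_index(event_time, prefix="logstash"):
    """Return the name of the daily index an event belongs to, Logstash-style."""
    return "{}-{:%Y.%m.%d}".format(prefix, event_time)

ts = datetime(2014, 1, 1, 13, 37, tzinfo=timezone.utc)
print(daily_index(ts))  # logstash-2014.01.01
```

Weekly or monthly ranges just change the format string; an index template matching `logstash-*` covers any of them.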

When the day is over, nothing new will be written to its corresponding index. Such indexes can be fully optimized to be as compact as possible, and possibly moved somewhere for archiving purposes. When the data becomes too old to be of interest, the data can easily be deleted by deleting the entire index for the obsolete time ranges, which is a lot cheaper than deleting a lot of data from an existing index.

Searches can be run on just the relevant indexes for a selected time span. If you are searching for something that happened on 2014-01-01, there's no point in searching any other index than that for 2014-01-01.
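A client can compute the relevant index names for a time span before searching; here is a sketch, assuming the daily `logstash-YYYY.MM.DD` naming scheme described above:

```python
from datetime import date, timedelta

def indices_for_range(start, end, prefix="logstash"):
    """Daily index names covering the inclusive date range [start, end]."""
    return ["{}-{:%Y.%m.%d}".format(prefix, start + timedelta(days=d))
            for d in range((end - start).days + 1)]

# Search just these, e.g. GET /logstash-2014.01.01,logstash-2014.01.02/_search
print(",".join(indices_for_range(date(2014, 1, 1), date(2014, 1, 2))))
```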

Using this technique, you still have to decide on a number of shards. However, the number of shards will just have to handle data for the desired timespan. Expected future growth can be handled by changing the sharding strategy for future indexes.

If a user only ever searches his or her own data, it can make sense to create one index per user. However, the extra cost of having a large number of indexes can outweigh the benefits if your average user has a small amount of data. With appropriate filters, Lucene is so fast there's typically no problem searching an index containing all its users' data. With fewer indexes, more internal index structures can be re-used.

An example where it makes sense to create user specific indexes is when you have users with substantially more data than the average. For example, if you are providing search for blog comments, it can make sense to create one index per blog for those few blogs that have millions of comments. However, the blogs with just a few comments per day can easily share the same index. One approach some people follow is to make filtered index aliases for users. This can make the applications oblivious to whether a user has its own index or resides in an index with many users. Note that this approach can be problematic with a big number of index aliases, as the aliases are part of the cluster state.
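A sketch of such a filtered alias, expressed as a body for the aliases API. The index, alias, and field names (`shared_comments`, `blog_id`) are made up for illustration:

```python
import json

# Hypothetical: small blogs share one index; each gets a filtered alias so
# the application can always search the same alias name regardless of
# whether the blog has a dedicated index or lives in the shared one.
action = {
    "actions": [
        {"add": {
            "index": "shared_comments",
            "alias": "smallblog_comments",
            "filter": {"term": {"blog_id": "smallblog"}},
        }}
    ]
}

print(json.dumps(action, indent=2))  # body for a POST to /_aliases
```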

We mentioned earlier that the only real difference between using multiple indexes and multiple shards is the convenience provided by Elasticsearch in the form of routing.

When a document is indexed, it is routed into a specific shard. By default, the routing is based on a hash of the document's ID, which distributes documents fairly evenly across the shards.

This is the default, and to search over data that is partitioned this way, Elasticsearch searches all the shards to get all the results.

If, however, you specify a routing parameter, Elasticsearch will only search the specific shard the routing parameter hashes to. This makes it possible to have something between a single big index and one index per user. By routing on user_id, for instance, you can make sure that all the documents for a user end up in the same shard. This way, you don't have to search over all the shards for every single search request, only the single shard the user_id hashes to.
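The principle can be sketched as follows. Note that this is not Elasticsearch's actual hash function, just an illustration of how a routing value maps to a shard:

```python
def djb2(s):
    """A simple deterministic string hash (a stand-in for the real one)."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def shard_for(routing_value, num_primary_shards=5):
    """The shard a routing value (by default the document ID) maps to."""
    return djb2(routing_value) % num_primary_shards

# Routing on user_id: every document for the user lands in the same shard,
# so a search routed on "user_42" only needs to visit that one shard.
print(shard_for("user_42") == shard_for("user_42"))  # True
```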

Since documents are no longer evenly distributed across all shards, this can lead to a skewed distribution of data, where some shards hold a lot more data than others. Again, if some users have orders of magnitude more documents than the average, it is possible to create custom indexes for them. You can combine these techniques.

So far, we have looked at how various partitioning strategies can let you deal with growth, from a fairly high level abstraction wise. It is also important to understand how different use cases have different demands on the underlying hardware running the nodes.

Elasticsearch in Production covers some ground in terms of the importance of having enough memory. Instead of repeating the advice you find there, we'll focus on how to get a better understanding of your workload's memory profile.

While having an in-depth understanding of the memory needs of all your different requests is (luckily) not required, it is important to have a rough idea of what has high memory, CPU, and/or I/O demands. This enables you to at least know what you need to test, and to some extent how.

Regular searches need to look up the relevant terms and their postings in the index. For returned results, the stored fields (typically _source) must be fetched as well. As much as possible of this data should be in the operating system's page cache, so you need not hit disk.

Often, search patterns follow a Zipfian distribution. Simplified, this means that you can possibly answer, say, 80% of your searches using only 20% of your index. In other words, simple searching is not necessarily very demanding on memory. You can possibly get by with having a small fraction in memory. When the necessary index pages are not found in memory, you'll want storage that can serve random reads efficiently, i.e. SSDs.

For search heavy workloads, you'll want page cache and I/O able to serve random reads. Unless custom scoring and sorting is used, heap space usage is fairly limited. Similarly to when you aggregate on a field, sorting and scripting/scoring on fields require rapid access to documents' values given their IDs.

The world is quickly discovering that Elasticsearch is great for analytics. Analytics type searches have a memory profile that is very different from regular searches. With a regular search, we want to find the top-n results, for what's probably a small n. When we analyze, we aggregate over possibly billions of records.

To do this, Elasticsearch needs to have tons of data in memory. Much of Elasticsearch's analytical prowess stems from its ability to juggle various caches effectively, in a manner that lets it bring in new changes without having to throw out older data, for near realtime analytics. These field data caches can become very big, however, and problematic to keep entirely in memory.

The inverted index cannot give you the value of a field given a document ID; it's good for finding documents given a value. Therefore, as shown in the figure below, all the documents' values for a field are loaded entirely into memory the first time you try to use it for aggregations, sorting or scripting.

Uninverting a field into a field cache

Using doc_values as the fielddata format, the heap can be relieved of this memory pressure. Instead of uninverting and loading everything into memory when the field is first used, files with the field stored in a column-stride format are maintained at indexing time. Thus, instead of keeping all the data in heap space, it becomes a question of whether the needed data is in the page cache, or can be provided quickly by the underlying storage.
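In the Elasticsearch 1.x mapping syntax, enabling doc_values for a field looks roughly like this. The index type and field names are illustrative:

```python
import json

# Mapping sketch: store the field's values on disk in column-stride form
# (doc_values) instead of building an in-heap field data cache on first use.
mapping = {
    "mappings": {
        "event": {
            "properties": {
                "timestamp": {
                    "type": "date",
                    "fielddata": {"format": "doc_values"},
                }
            }
        }
    }
}

print(json.dumps(mapping, indent=2))  # body for index creation
```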

While storing fields like this results in bigger on-disk indexes and slightly more overhead when searching, the big win is that less heap space is spent on field caches. This is particularly nice if you only ever use a small fraction of the values. For example, if your queries and filters typically work with a small sub-set of your entire index, then the remaining unused and possible majority of data does not cost you any memory.

Having said that, if your workload uses almost all the data all the time, using doc_values will not necessarily help you. You will still need a lot of memory. Nevertheless, having the data off the heap can massively reduce garbage collection pressure. As noted in Elasticsearch in Production, garbage collection can become a problem with excessively big heaps. You cannot scale a single node's heap to infinity, but conversely, you cannot have too much page cache.

To get real estimates, it is important that you are testing as realistically as possible. This means that both the data you index and the searches you use must closely resemble what you are actually going to use. If the text you are indexing is auto-generated "Lorem ipsum" and the metadata you generate is randomized in a fashion that is far from real data, you might be getting size and performance estimates that aren't worth much. As mentioned, it is important to get an idea of how much can be answered with data cached in memory, with the occasional cache misses that will inevitably occur in real life.

Existing search logs can be of great value here, as you can easily replay them. Again, you will probably find that your searches have a Zipf distribution. This is something you will want to consider also while testing, so you don't end up with overly pessimistic estimates. (Although, if you can get the budget approved, over-provisioning due to pessimistic testing is arguably better than being overly optimistic. :-)
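If you don't have logs to replay, you can at least skew your synthetic queries the same way. A sketch of Zipf-like sampling, with weight proportional to 1/rank (the query strings are placeholders):

```python
import random

queries = ["top query", "second query", "third query", "fourth query"]
weights = [1.0 / rank for rank in range(1, len(queries) + 1)]

random.seed(1)  # deterministic for the example
sample = random.choices(queries, weights=weights, k=1000)

# The head of the distribution dominates, as in real traffic:
print(sample.count("top query") > sample.count("fourth query"))  # True
```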

Elasticsearch has many endpoints that let you inspect resource usage. You can get stats about the cluster, nodes, indexes, shards, and segments. Elasticsearch Inc. also recently released Marvel, which lets you track these statistics over time and explore them using Kibana.

When inspecting resource usage, it is important not to just look at the total heap space used, but to also check memory usage of things like field caches, filter caches, ID caches, completion suggesters, etc. Also, you want to pay attention to garbage collection statistics. If your nodes spend a lot of time garbage collecting, it's a sign you need more memory and/or more nodes.
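A sketch of pulling such numbers out of a node stats response. The field paths follow the shape of the `_nodes/stats` API, but the payload here is a made-up excerpt and the thresholds are arbitrary examples:

```python
# Hypothetical excerpt of a GET /_nodes/stats response (numbers invented).
stats = {
    "nodes": {
        "node1": {
            "jvm": {
                "mem": {"heap_used_percent": 87},
                "gc": {"collectors": {"old": {
                    "collection_count": 142,
                    "collection_time_in_millis": 531200,
                }}},
            }
        }
    }
}

for name, node in stats["nodes"].items():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    old_gc_s = node["jvm"]["gc"]["collectors"]["old"]["collection_time_in_millis"] / 1000.0
    # Arbitrary example thresholds; tune them for your own monitoring.
    if heap > 75 or old_gc_s > 300:
        print("{}: heap {}%, old-gen GC {:.0f}s total".format(name, heap, old_gc_s))
```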

Also, it's important to follow how memory usage grows over time, and not just look at isolated snapshots. Due to the way the garbage collector works, you may see a sawtooth pattern, as memory is freed periodically when the collector does its thing. Usually, this is perfectly fine, as long as sufficient memory can actually be reclaimed and the collector isn't frequently spending a lot of time running. However, if the trend is like in the figure below, it's a clear warning that you are on the verge of a memory problem.

Thorough testing is time consuming. Thus, you want to quickly home in on getting valuable estimates. Doing several iterations of "Doh! Too small again!" wastes valuable developer time. With services such as AWS or Found's hosted Elasticsearch, paying for a big cluster for some hours or days is probably cheaper than repeatedly configuring your own cluster from scratch.

If your estimate is way too high, you already have a rough idea of how much resources you actually need and can scale down accordingly in order to do more accurate testing. If it's too low, it is harder to predict what the next best guess is.

The goal of this article was to shed some light on possible unknowns, and highlight important questions that you should be asking.

Knowing a little bit more about various partitioning patterns people successfully use, limitations and costs related to sharding, identifying what your use case's pain points are, and how you can reason about and test resource usage, you should hopefully be able to home in on an appropriate cluster size, as well as a partitioning strategy that will let you keep up with growth. And if not, at least you will know when more science is needed!