Six Ways to Crash Elasticsearch

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

As much as we love Elasticsearch, at Found we've seen customers crash their clusters in numerous ways. Mostly due to simple misunderstandings and usually the fixes are fairly straightforward. In our quest to enlighten new adopters and entertain the experienced users of Elasticsearch, we'd like to share this list of pitfalls.

Any datastore can be brought to its knees, especially if it contains a substantial amount of data compared to the available hardware or if the traffic is simply too high. This article will however focus on ways of taking down an Elasticsearch cluster that are not related to insufficient hardware. Some are quite common and real, while others less so. At any rate, it’s good to be aware of them. These also emphasize why we only provide dedicated clusters at Found: you can shoot yourself in the foot, but not your neighbors’.

One quick way of making Elasticsearch run out of memory and struggle to recover is by ignoring to consider the difference between keys and values when indexing documents. This has been explained in great detail in the article Troubleshooting Elasticsearch searches, for Beginners, but the essence is that the keys in the documents change, with the result that the mappings for the index is evergrowing. On a small scale, this will require a disproportionate large amount of heap space and for larger clusters it will typically cause problems with distributing the cluster state.

Let’s make this more clear with an example. We create a script that indexes documents with a single field, the key is a growing integer and the value is the square of the key. The first documents look like this:

{"1": 1}
{"2": 4}
{"3": 9}
{"4": 16}
{"5": 25}
{"6": 36}

You can find the entire script in this gist. On a single node cluster with 128MB the script managed to index about 30 000 documents before the node went down, but the index on disk is less than 7 MB. Depending on the use case it is not uncommon to have an index size between one and sixteen times the heap size in a normal system.

I terminated the script after the first out-of-memory exception and our system restarted the instance. It managed to recover, but even at idle status it is doing heavy garbage collection. If you imagine what the mappings will look like for this index, with more than 30000 different fields, it’s not difficult to understand that the cluster is struggling. Especially considering that this mapping is included in the cluster state that is broadcast to all nodes on several occasions. Put differently, we have not reached a limit on documents, but a limit on the metadata for an index.

The proper way of indexing this data would have been something like this:

Too many shards or the gazillion shards problem, as some of the Elasticsearch developers like to refer to it, comes from the fact that there is a base cost to every shard and index even if it does not contain any documents.

In this rather extreme example we create an index with 1073741824 shards.

A few seconds after issuing this request the connection was closed and a few seconds after that I got an email saying the instance ran out of memory. This without indexing a single document. Now, the next time you set up Logstash with its default index template of one index per day, you might want to have a look at the size of those indexes after a few days. Why? Well, each index implies at least one shard and if you don’t have that many log entries, you will probably do better with one index per week or per month. For Elasticsearch it is the same thing (in terms of performance) if you are having one index with two shards or two indexes with one shard. Both cases have two Lucene indexes.

The classic example: a developer knows that a search will not return that many hits, and he wants all of them without paging, so he simply sets the size parameter to an arbitrarily large size like Integer.MAX_VALUE (2147483647).

When optimizing code like Elasticsearch developers do, there is no getting around making some assumptions. One of the assumptions is that a query usually has more hits than the number of elements to be returned. This implies that they will choose to prepare internal data structures for the number of documents specified in the size. In other words, if the size parameter is ridiculously large, then the Elasticsearch will create a ridiculously large internal data structure. And the one who pays the price is you, waiting forever - or so it seems! - for an insignificant result.

In earlier versions of Elasticsearch this was perhaps the quickest trick to wreak havoc on somebody’s cluster. Now, in recent versions of Elasticsearch this is not as big a problem as it used to be, but it’s still a good reason to use the scan and scroll API instead. Even if you have tested it before and you know that there is lots of free heap on the instance, it can blow up simply because one of those structures needs contiguous memory on the heap, and your instance is not able to provide that without doing a major garbage collection.

In academia it is a known problem that one cannot determine if a running process will terminate or not, and since Elasticsearch does not have a timeout for long running scripts the following script will never halt. This results in the script keeping one search thread that will never be released back to the pool. Subsequent executions of such scripts will eventually consume all search threads and put all future searches in the queue until the queue reaches its maximum and all future searches are halted. This is also the case for the new sandboxed scripts.

This script does however have one benevolent purpose: testing how your client behaves when the search queue fills up. Neeless to say, don’t do this in production, use a staging or test cluster instead. By starting this script repeatedly until all the search threads are taken, the normal searches issued by your client will be put in the queue. This will simulate the effect of an overloaded cluster where searches take longer time to process than the rate at which they are received. Another entry into this pit is if your client mistakes an expensive and thus slow query for a failed query and executes it again. This is particularly bad if the query for some reason is not capable of utilizing caches, like queries calculating date ranges based on the current time.

Even without a single document this way of nesting aggregations resulted in the instance running out of memory. I have tested this for instances with heaps as large as 8GB.

This example is actually based on a real world problem where a developer most likely had misunderstood the aggregations API and created nested aggregations instead of sibling aggregates, but it still shows that the temporary structures created inside Elasticsearch when processing nested aggregations can be quite large.

Even if this contrived example without a single document in the index has been fixed in newer versions of Elasticsearch it can be good idea to just snapshot the indexes you need over to a separate cluster when doing ad hoc analytics. Personally, I often do that when analyzing logs, our Logstash cluster is simply not tuned for heavy analytics. The safety of steering clear of ad hoc queries on the production cluster is not the only benefit. It also allows me to use a cluster with more memory, since it will only be running for the few hours that I need it.

At Found we have experienced both customers and ourselves pushing clusters to their limits. Failing to take action when a cluster reaches its limits can be dangerous to the integrity of your data.

To demonstrate this I added a second node to the previous cluster, making it redundant (2 x 128MB heap). I recreated the index with one shard and one replica and started the same script used for demonstrating the bad mappings, but instead of stopping on the first out-of-memory error I simply restarted the script and kept pushing more data.

The first thing I noticed was a dramatic slowdown in indexing speed in the script. I then looked up the cluster in our metrics console and as expected it was collecting garbage continously. Elasticsearch configures the JVM to use the ConcurrentMarkSweep collector and since it is concurrent it avoids stop-the-world pauses as much as possible, but it cannot avoid them entirely, and with small instances like these the effect of the collector consuming CPU is also very much noticeable.

Looking at the cluster in Kopf I also noticed that the replicas where out of sync, at least that’s what the document counts reported. Elasticsearch is usually pretty quick to discard a replica if it does not match the master, but in this case the nodes were slowed down to a point where they had not been able to do so yet, or they were too slow to respond with an updated statistic to Kopf.

In the end, both nodes went out of memory at the same time and I stopped the script. This allowed the nodes to come back online. The replica was then discarded and the master shard recovered from the local gateway of that node. This time the recovery mechanism worked and having a high availaility setup did make the cluster handle the load better, but the high load made the entire cluster unavailable, and for a while the redundancy was lost. In other words, if one does not respond to a situation like this, then it’s just a matter of time before another error comes along and data loss is a fact.

Despite the oddities and pitfalls described in this article, our opinion is that Elasticsearch is a really great product. Most of these issues can easily be avoided or will be addressed in future versions as similar issues have been adressed previously, but at the moment their consequences can be destructive.

Personally, I am careful with who I allow access to my cluster and I would never use a shared cluster environment, at least not until there have been major changes to how Elasticsearch balances resources between different tasks. Even if the circuit breakers are constantly improving and considering more and more resources, can you be sure they will always be able to keep up with the new features?