Elasticsearch in Production

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high demands on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy. These articles cover some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters.

Update August 10, 2015: Elastic provides Shield, a product which offers comprehensive security for Elasticsearch, including encrypted communications, role-based access control, AD/LDAP integration and auditing. The following article was authored before Shield was available.

This article is introductory. Its goal is to give an overview of important aspects of running and maintaining Elasticsearch clusters (or other distributed search engines), and to motivate learning more about them. It also aims to explain the importance of having enough memory and how to achieve high availability. Hopefully, this article will help you set reasonable expectations in terms of what Elasticsearch can (and cannot) do for you.

In the future, we'll add more thorough articles about each of the covered topics.

These are the topics we will cover in this article:

Memory

Search engines are designed to deliver answers fast. Really fast. To do this, most of the data structures they use must reside in memory. To a large extent, they assume you provide them with enough memory to do so. This can lead to problems when that is not the case – not just with performance, but also with your cluster's reliability.

It is no coincidence that Found's pricing is largely based on the amount of memory reserved for your cluster.

Security

Elasticsearch does not consider authentication or authorization to be its job (which is perfectly fine!), so it has no features for it. Thus, there are several things developers must be aware of, to avoid disclosing data that should be private, being denied service due to prohibitively expensive queries, or letting users run arbitrary code with access to anything Elasticsearch has access to.

Networking

Elasticsearch works brilliantly on a single machine, and easily lets you scale out to multiple machines when your data size requires it. It is impressively easy to use for a distributed system, but distributed systems are complicated – and can fail in many ways.

Client-side considerations

Assuming you have a reliable cluster, there are still some things your clients and applications need to get right in order to be reliable and performant.

While a response time of one second is often touted as sufficient to keep the user's interest, users nowadays expect applications to respond "instantly". That gives a response time budget of around 0.1 seconds, including network latencies – which also may vary a lot, as users are increasingly mobile.

This does not leave much time for the search engine to do its job, so it certainly cannot twiddle its thumbs waiting for I/O. To address this challenge, we need to keep as much as possible in memory.

Thus, a lot of effort is spent on building and continuously maintaining various caches:

Hot parts of the index structures, such as term dictionaries, posting lists, etc., are assumed to (mostly) reside in the operating system's page cache. This implies that we cannot simply allocate everything to Elasticsearch.

Field values that are used for faceting, sorting or scripting are loaded into the field cache.

Filters can be cached as bitmaps in the filter cache, making further uses of the filter blazingly fast.

Furthermore, building the index structures requires a lot of memory. Inverted indexes are typically built in segments (more on this in a future article), by indexing as much as possible until reaching a threshold (e.g. memory/time/size), and flushing the segment to storage. A written segment is immutable, i.e. never changed.
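As a toy illustration of this buffer-then-flush pattern (not Lucene's actual implementation), consider:

```python
class SegmentWriter:
    """Toy sketch of the pattern described above: buffer documents in
    memory until a threshold is reached, then flush them as a segment
    that is never modified afterwards. Lucene's real segment writing is
    far more involved; this only mirrors the control flow."""

    def __init__(self, max_buffered=1000):
        self.max_buffered = max_buffered
        self.buffer = []       # in-memory, mutable, costs heap space
        self.segments = []     # flushed segments, each one immutable

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.max_buffered:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # freeze the segment
            self.buffer = []
```

Note how memory pressure comes from two directions at once: the buffer grows with indexing throughput, while every flushed segment adds to what must later be merged.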

As indexing progresses, these segments are in turn merged, in accordance with a given merge policy. I recommend reading Lucene hacker McCandless' excellent coverage of how segment merging works. While not the topic of this article, maximizing indexing throughput requires finding a sweet spot where the storage the segments are written to is kept saturated as well.

To better understand how the various caches are used in a search, let's consider a search request with a query, some filters, and some facets. In a future article, we'll cover this in depth. For now, we'll keep it fairly simple and only consider a single shard on a single node.

First, the corresponding dictionaries and posting lists to satisfy the query are consulted to find candidate documents that can possibly match. When reading these, we anticipate that the relevant pages reside in the OS page cache, so we don't need to wait for the slow storage.

Before scoring, the candidate documents are first filtered. If the filter is already in the filter cache, this step is fast: cached filters are represented as compact bitmaps. If not, the filter must first be constructed. Typically, we want filters to be reusable, so they must consider the entire "universe" of documents, not just the subset matched by the current query. The resulting filter bitmap is then cached and applied to the query.
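As a sketch of what such a request could look like in the query DSL of that era (the filtered query was replaced in Elasticsearch 2.x, and the field names here are hypothetical):

```python
import json

def filtered_search(query_text, status):
    """Build a search body whose filter is independent of the query, so
    the cached bitmap can be reused by any later request that filters on
    the same status value."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": query_text}},
                "filter": {
                    # _cache hints that the resulting bitmap should be
                    # kept in the filter cache.
                    "term": {"status": status, "_cache": True},
                },
            }
        }
    }

print(json.dumps(filtered_search("segment merging", "published"), indent=2))
```

Because the filter only mentions the status field, every search that filters on the same status can reuse the cached bitmap, regardless of the query text.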

If we were simply after the top n documents, we would be almost done at this point. The search engine calculates the relevance for the matched documents, while trying not to waste time scoring documents that won't make it into the top n. With the documents ranked, the requested stored fields can be fetched – and with some luck, they're in the page cache.

Since we also wanted facets, however, we need to know about all the documents in the result set. And for all these documents, we need the values for the fields we're faceting on. These fields must reside in the field cache. If the fields are not there already, we load all the values for the field into the cache - not just the ones relevant for the current query. Recent versions of Elasticsearch let you filter what values go into the cache, though, so as not to waste memory on useless values – but it's usually still a lot of memory.
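For illustration, a terms facet in the pre-2.x syntax (field names hypothetical) might look like this - and evaluating it requires every value of the faceted field, across the whole shard, to be in the field cache:

```python
import json

def terms_facet(field, size=10):
    """A terms facet counts the most frequent values of `field` over all
    matching documents - which forces all of that field's values into
    the field cache, not just the ones in this result set."""
    return {"terms": {"field": field, "size": size}}

body = {
    "query": {"match_all": {}},
    "facets": {"top_tags": terms_facet("tags")},
}
print(json.dumps(body, indent=2))
```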

With Elasticsearch, this is all done per shard, and in turn per Lucene index segment per shard.

On "sunny" days, you get searches where the necessary fields are in memory, the popular index-pages party in the page cache, and the influx of new documents is manageable. Responses are fast, users are happy, and the operations staff can sleep at night.

Running out of memory can manifest itself in many ways:

A search request can result in an attempt to load way too large fields into memory, e.g. because they are faceted, sorted or scripted on. Note: be careful before you unleash queries with an unknown memory profile on your production cluster!

Too many fields are kept in memory at the same time. Note: if you offer a ton of faceting opportunities, make sure you have enough memory to cope with all of them at once.

Waiting too long before flushing an in-progress index segment can cause it to outgrow the heap space.

New documents can cause previously cached fields to become too large.

Indexes grow so large that not enough can fit in the page cache …

… or the page cache is continuously "poisoned" or invalidated due to other things happening on the system.

If you outgrow the page cache on your system, you will experience a gradual slowdown as more and more search requests will need to fetch the index pages from slower storage. Things will continue to work, but slower.

When you outgrow the amount of heap space dedicated to your node, however, things can crash suddenly. You can be serving searches at great speed until something causes an OutOfMemory error, and everything comes crashing down.

In comparison, systems like Postgres are extremely careful when it comes to allocating resources like memory. For this reason, there are many configuration options developers/DBAs need to get right. For example, the default out-of-the-box values are very defensive, and thus not very performant. This makes a lot of sense for Postgres, a transaction processing system with intense focus on resiliency and correctness. (At Found, we <3 Postgres too!) If Postgres does not have enough memory to do a task entirely in memory, it will change its execution plans to something that involves flushing to disk.

Elasticsearch does not look before it leaps. It assumes you have provided it with enough memory. While efforts are in progress to make Elasticsearch more resilient, the current status is that OutOfMemory errors can result in many things going awry, on a scale from simply failing the request, to corrupting cluster state, to bringing the entire cluster down.

There are many things happening concurrently in an Elasticsearch node. For example, nodes can be updating cluster state, and other components in Elasticsearch, like Netty (the networking stack), can be updating their internal state. Most of these processes are not prepared for OutOfMemory errors raised while doing seemingly innocuous things.

While Elasticsearch is likely to be a more resilient speed demon in the future, best practice is to not put your production cluster in those situations in the first place.

To prevent memory related problems, it is important to both understand the memory profile of your requests, and to continuously monitor the resource consumption of your cluster as more data is added.

Understanding the true memory needs of your cluster is difficult without subjecting it to the actual workload you expect. Thus, you need to use realistically sized data sets when developing new searches, tweaking existing ones, etc. Be careful when running experimental queries on your production cluster.

As the data in your indexes grows, so will the memory needs. Elasticsearch provides endpoints that give insight into its resource usage, such as the node stats (_nodes/stats) and indices stats (_stats) APIs.

We often get questions like "How much memory do I need for my index with n documents, which are about X KB on average?". With that information alone, we can only give a fairly weak lower bound: it depends on several factors, such as the types and diversity of searches, the growth and update rates, and so on. For example, fairly simple searches without any facets or scripts loading lots of fields can have low heap space usage, but high demands on the page cache. On the other hand, analytics applications that are heavy on faceting, but do not display the full results, can get by with a tiny page cache while requiring tons of heap space.

The approach we usually recommend is to start out with more memory than you need, and scale down to find the sweet spot. With cloud services letting you rent capacity by the hour, it's cheap to start out large, load your data and run your tests. Then examine metrics like heap space usage, field cache sizes, and so on.
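As a sketch, here is how a monitoring script might pull a couple of those numbers out of a node stats response. The response shape below is heavily abbreviated, and the exact keys are an assumption to verify against your Elasticsearch version:

```python
def memory_summary(nodes_stats):
    """Extract heap usage and field cache size per node from a
    _nodes/stats-style response (shape abbreviated and assumed)."""
    summary = {}
    for node_id, node in nodes_stats["nodes"].items():
        summary[node_id] = {
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "fielddata_bytes": node["indices"]["fielddata"]["memory_size_in_bytes"],
        }
    return summary

# A trimmed-down sample of what such a response could contain.
sample = {
    "nodes": {
        "node-1": {
            "jvm": {"mem": {"heap_used_percent": 61}},
            "indices": {"fielddata": {"memory_size_in_bytes": 104857600}},
        }
    }
}
print(memory_summary(sample))
```

Tracking these two numbers over time is a cheap way to see whether the heap or the field cache is the thing you are about to outgrow.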

We will expand on this point in a future article.

Note, however, that you can in fact have too much memory on a single Elasticsearch node. First, when you have 32GB or more of heap space on a single JVM, a technique called "pointer compression" can no longer be used, so you actually need to bump the heap to around 48GB before you effectively gain more memory. At that point, garbage collection can become prohibitively expensive, causing "stop the world" pauses that can cascade into other problems: for instance, other nodes in the cluster may assume that the node is dead. Elasticsearch is built for scaling out on commodity hardware, not up on single massive machines. So it might make sense to run multiple Elasticsearch nodes on a single physical machine.

Having covered the importance of having enough memory and caches that aren't invalidated by other customers, it should be pretty clear why we are obsessed with memory. Whether you're a direct Found customer or using a PaaS like Heroku, you have probably noticed that the amount of memory is key when configuring your cluster and determining its price. We do not enforce arbitrary limits on things like the number of indexes or documents, because ultimately, what matters is the amount of memory we have to reserve for a customer.

Much like Heroku, we use Linux's "control groups" and "containers" for process and resource isolation. This ensures that you get the memory you are paying for without having to "compete" for it with other customers. CPU is handled similarly, with the exception that we do allow using more CPU when it would otherwise be idle.

We have also made it super easy to adjust the amount of memory of the nodes in your cluster, both up and down. This is very useful when determining initial memory needs (as described above), as well as in day-to-day operations where continuous growth warrants larger clusters.

We've covered the many reasons why having enough memory is important - for performance and stability - and what can happen if that's not the case. We have also looked at how to address the problems caused by insufficient memory.

There is still a lot more to cover, though, like …

How various searches are actually evaluated, and how they use the various caches in more detail.

Approaches for making searches more memory efficient, and thus more performant.

Elasticsearch is a multi-tenant search engine. You can have many indexes on a single cluster, with different purposes and supporting different applications and/or customers.

However, Elasticsearch has no features for authentication or authorization. If you are used to systems like PostgreSQL or MySQL, where you limit access to databases, tables, functions, etc. with high granularity, you might be trying to find a way to limit access to certain operations and/or certain indexes. At the moment, Elasticsearch does not consider that to be its job, which is fair enough.

Thus, you need to configure authentication and authorization in your own application layer, and regard Elasticsearch as a system that treats every user as a trustworthy super user.

Just as you would not expose a database directly to the Internet and let users send arbitrary SQL, you should not expose Elasticsearch to the world of untrusted users without sanitizing the input. Specifically, these are the problems we want to prevent:

Exposing data that should be private.

Denial of service through prohibitively expensive requests.

Arbitrary code execution through dynamic scripts.

We start with the last point, as it is quite important to be aware of, and it's an important reason for not simply limiting access to, for example, POST-ing to certain _search endpoints and passing arbitrary requests through.

Elasticsearch has very powerful scripting capabilities. These are used for partial updates, scripted fields, fancy facets, relevance models, etc. Depending on which language plugins you enable, you can write these in MVEL, Python, JavaScript or Groovy.

It is important to remember that these scripts are not sandboxed. They have access to do everything your Elasticsearch process has access to. So if you were to limit access to just posting to foo/_search, a nefarious user could post a request with a dynamic script that bypasses that restriction. From a security perspective, with dynamic scripts enabled, you should assume that any user that can send arbitrary requests to an endpoint with scripts (e.g. _search or _update) has the equivalent of shell access.
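If you do not need dynamic scripting, the safest option is to disable it entirely. In the 1.x line of Elasticsearch this was a single setting in elasticsearch.yml (scripts stored on disk remain usable):

```yaml
# elasticsearch.yml
# Reject scripts embedded in requests; pre-installed on-disk scripts
# can still be referenced by name. (Setting name from the 1.x era.)
script.disable_dynamic: true
```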

Another problem is that when a buggy script gets stuck in an infinite loop, there is no process management internally in Elasticsearch that will detect that and kill it. A spinning script will consume a thread from the corresponding thread pool (search or index), while burning CPU cycles. This will cause slowdowns or hangs if all threads are consumed.

Dynamic scripts are not the only concern when allowing arbitrary _search requests. In the section on OutOfMemory-caused crashes above, we described many reasons why a production cluster should never experience OutOfMemory errors.

An adversary with the intention of denying service to a node or cluster can quite easily do so if arbitrary _search requests are allowed, for example by causing an excessively large field to be loaded.

Elasticsearch has many features for doing operations on multiple indexes. In most places, you can use an index pattern instead of an index name to define which indexes to operate on. Keep this in mind if you are doing things like concatenating a user prefix with an index name: otherwise, a user could specify ",*" as the index name and gain access to all the other indexes.
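A minimal sketch of such sanitizing, with a hypothetical whitelist pattern - adjust it to whatever naming scheme your application actually uses:

```python
import re

# Only fully literal names: no "*", "," or other pattern characters
# that could expand the request to indexes the user should not see.
SAFE_INDEX = re.compile(r"^[a-z0-9][a-z0-9_-]*$")

def user_index(user_id):
    """Map a user to their index, refusing anything non-literal."""
    name = "user-%s" % user_id
    if not SAFE_INDEX.match(name):
        raise ValueError("illegal index name: %r" % name)
    return name

print(user_index("42"))    # user-42
# user_index("42,*") would raise ValueError instead of quietly
# matching every index on the cluster.
```

Whitelisting known-good characters, rather than blacklisting "," and "*", is the safer choice: it also catches pattern features you did not think of.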

Intuitively, no machine on the Internet can connect to something listening on localhost, or to something protected by a company firewall. Your web browser, however, can access your localhost.

Thus, any website you visit can send requests to your local Elasticsearch node. This is sometimes used in demos of _site plugins.

It can also be an attack vector. As detailed in the section on dynamic scripts above, you should assume that every external user could potentially have shell access - and you certainly shouldn't trust every web page with that.

We recommend running Elasticsearch in a virtual machine while developing - or at least without any interesting data in your indexes - and with dynamic scripts off.

Bearing in mind these notes on security, in addition to what we have covered on Memory, it will come as no surprise that we provide dedicated clusters, and not indexes on a shared cluster.

It is highly important to completely isolate customers. All the Elasticsearch processes of our customers run in separate Linux containers (LXC), the same technology Heroku uses to isolate processes. This ensures that customers can't see the other processes, much less interfere with them.

We also run a few custom Elasticsearch plugins, which add authentication to the connection establishment. This ensures that even if customers compromise their own containers, they cannot easily connect to other clusters.

This also applies to other services the Elasticsearch cluster needs to connect to, like ZooKeeper for cluster management and S3 for continuous backups. Having asserted that you should assume users with full access to your Elasticsearch cluster have shell access, that is also the assumption we have designed our service around. Customer clusters get their own S3 bucket, for example.

While Elasticsearch works brilliantly on a single node, it is built for scaling out to many nodes, running a cluster on commodity hardware.

As a distributed search engine, Elasticsearch is impressively easy to use. Spin up a few nodes, and Elasticsearch will do most of the work itself. On a "sunny" day, at least. Distributed systems are complicated. Really complicated. By being distributed, a whole universe of things that can go wrong presents itself. As such, different database systems focus on different strengths: some strive for strong guarantees, others on always being available, even if it means being inaccurate some, or even most, of the time. Furthermore, what a database system claims to achieve when problems occur is rarely what it can actually cope with, as Kyle Kingsbury explores in his excellent series on the perils of network partitions. There is also an article describing the vast possibilities of what can cause network partitions, which has a few mentions of Elasticsearch as well.

It is important to be aware of which failure modes apply to your underlying infrastructure. For example, if you run in a virtualized environment, can you be sure two different nodes are not on the same physical machine, rack or power supply? If you run on AWS EC2, the only way to be sure of that is to run in different availability zones.

If you have a large number of instances, you may be experiencing single-node failures all the time. If you are on a smaller scale, however, experiencing the failure of an entire availability zone might actually have comparable probability.

When large scale failures happen, for instance when an entire availability zone experiences an outage, problems tend to cascade. As everyone else is also scrambling to recover their systems, an increase in load and demand can make it difficult to recover. Thus, any system with high availability needs must have replicas ready and running when problems appear. You cannot risk expecting a quick recovery.

A situation any distributed system must prevent is what's called a "split brain": multiple autonomous sub-clusters forming, with more than one believing it is the "master". This can cause irreconcilable changes and data loss.

This is why minimum master nodes is an important setting to get right. It's the minimum number of master eligible nodes that must be present in order for the cluster to elect a master and accept requests. To prevent both sides of a partition from being eligible, a majority is required, i.e. \(\lfloor\frac{n}{2}\rfloor + 1\), where \(n\) is the number of master eligible nodes. Note that the minimum master nodes for a split brain proof 2-node cluster is the same as for a 3-node cluster: 2 nodes must be available.
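The majority rule is simple enough to state in a few lines of code:

```python
def minimum_master_nodes(master_eligible):
    """Smallest majority of master eligible nodes: floor(n/2) + 1."""
    return master_eligible // 2 + 1

# A 2-node cluster needs both nodes, just like a 3-node cluster needs 2:
# with only 2 nodes, losing either one makes the cluster unavailable,
# which is why a cheap third "tie breaker" node is so useful.
assert minimum_master_nodes(2) == 2
assert minimum_master_nodes(3) == 2
assert minimum_master_nodes(5) == 3
```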

The \(+1\) node does not have to be a full-fledged and expensive node, however. It can be an inexpensive "tie breaker" node, which does not hold data.

This also applies to different data centers, or availability zones, as they are called on EC2. If you run your cluster in two availability zones, you can't necessarily afford to lose one of them. As with nodes, you can have a tie breaker data center as well. For example, you could run 8 nodes in each of zones A and B (for a total of 16 nodes), and 1 node in zone C. This setup would allow you to lose any one of the three zones.

Nodes can come and go with little configuration. There is no need to also run a ZooKeeper cluster, as with SolrCloud.

Elasticsearch can be configured with a strict minimum of master nodes, which is important when data is continuously changing.

It can also be lenient about minimum master nodes, which is fine for read-only workloads.

Index operations can specify the required write consistency, i.e. whether one, a quorum, or all replicas must confirm the changes before returning.

Being flexible is one of Elasticsearch's virtues. However, it's impossible for Elasticsearch to provide defaults that work for all types of uses here. For example, it does not know how many nodes the cluster should have, so it cannot know the right minimum number of master nodes.

This flexibility and the resulting lack of strict defaults, combined with being very easy to get started with, has caused problems for many users. It is important for the reliability and integrity of your cluster to get the configuration right - especially the minimum number of master nodes. If you are in doubt, it's better to be safe than sorry: we recommend configuring things to be excessively strict.

Here at Found, the stability of your cluster is very important to us. We enable you to effortlessly and comfortably run your cluster in a high-availability configuration, across multiple data centers - and add tie-breaker nodes and zones if and when needed.

We are meticulous about making sure your cluster has the right number of master nodes, with an additional layer of protection in our proxy layer, which steps in if it sees nodes with different opinions on which node is the master. This has so far protected our users against a few bugs in Elasticsearch.

Assuming you have a rock solid cluster, there are still a few things the clients must do right as well, to keep things resilient and performant. Clients make up an important part of the distributed system.

Whenever you make a request with side effects, like indexing a document, it is important that the request is idempotent. That is, the result of doing the request many times is exactly the same as doing it once, which is what makes it safe to retry.

For example, if you send a document to Elasticsearch for indexing and you do not get any response, you cannot know whether the document was indexed; nor can you be certain that it was not. If you specified an ID for the document, you can safely retry the request: if Elasticsearch did index it, it will simply reindex it, and the net result is that the document is indexed. On the other hand, if you leave it to Elasticsearch to assign the IDs, retrying the request can lead to duplicates.
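A minimal sketch of such a retry loop - the transport function and response shape here are stand-ins for illustration, not a real client API:

```python
def index_with_retry(send, index, doc_id, doc, retries=3):
    """Retry an index operation. Because the client chose doc_id, the
    operation is idempotent: a retry after a lost response re-indexes
    the same document rather than creating a duplicate."""
    last_error = None
    for _ in range(retries):
        try:
            return send("PUT", "/%s/doc/%s" % (index, doc_id), doc)
        except IOError as exc:
            last_error = exc
    raise last_error

# A fake transport that "loses" the first response, as a flaky network might.
class FlakySend:
    def __init__(self):
        self.calls = 0

    def __call__(self, method, path, body):
        self.calls += 1
        if self.calls == 1:
            raise IOError("response lost")
        return {"_id": path.rsplit("/", 1)[-1], "created": True}

send = FlakySend()
result = index_with_retry(send, "tweets", "42", {"msg": "hi"})
print(result["_id"])  # 42
```

Had the same document been sent with POST and no ID, the retried attempt would have minted a second ID and left a duplicate behind.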

If you use Elasticsearch's HTTP interface, as most do, make sure you (or the client library you are using) use a connection pool. If not, you may be establishing a new connection for every single request, which adds a lot of latency. This is especially true if you are using HTTPS, as the initial key exchange takes a lot of time.

Elasticsearch has a _bulk endpoint that lets you do many index operations with a single request. This is very useful when you are indexing lots of documents over a short period of time, as less chatter back and forth is needed.

Make sure you inspect the responses of the bulk request, however. Elasticsearch will not roll back the changes caused by a bulk request if a sub-operation fails.
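A sketch of both halves of that workflow, with the document type and field names as placeholder values (older Elasticsearch versions also require a _type in the action line, included here):

```python
import json

def bulk_body(index, doc_type, docs):
    """Serialize index operations into the newline-delimited body the
    _bulk endpoint expects: an action line, then a source line, per
    document, plus a trailing newline."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

def failed_items(response):
    """_bulk is not transactional: each sub-operation succeeds or fails
    on its own, so every item in the response must be inspected."""
    return [op for item in response.get("items", [])
            for op in item.values() if "error" in op]

body = bulk_body("tweets", "tweet", [("1", {"msg": "hi"}), ("2", {"msg": "yo"})])
print(body)
```

A client that only checks the HTTP status of the bulk request as a whole will silently drop the documents behind any failed sub-operations; failed_items is the kind of check that catches them.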

There is a similar interface for searches, _msearch, which is useful if you need to do multiple search requests to meet your end users' information needs.

Elasticsearch lets you make amazing things quite easily. It provides great features at great speeds and scale. But like anything fast and fun, some care must be taken to stay out of trouble:

Elasticsearch must be provided with all the memory it needs. A production cluster should never experience OutOfMemory errors.

Users connecting to your cluster have implicit super user rights, with access to anything the Elasticsearch process has access to. Sanitize search requests properly, and be careful with dynamic scripts.

The network should be reliable, and when it is not, a majority of nodes must be available in order for the cluster to function.

Even though the cluster is solid, the clients must be prepared for intermittent network errors - and be able to retry requests.