Elasticsearch from the Top Down

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

The previous article in this series, Elasticsearch from the Bottom Up, covered essential data structures within a single shard. In this article, we will look at the distributed nature of Elasticsearch.

In the previous article, Elasticsearch from the Bottom Up, we started with fairly low level data structures, and ascended up the abstraction layer to put them into context. We focused on what happens within a single shard, however.

To look at how all of this fits together in a distributed setting, it is easier to start at the other end: from the “top”, as observed from a user’s perspective.

We will look at what happens when a user sends an index and a search request to Elasticsearch, and how the requests ripple through the network, until we reach the lower level structures covered earlier. First, we’ll look at the node accepting the request and its role as a coordinator. We’ll look at how it routes requests to the shards’ primaries for indexing requests, and load balances across replicas for search requests.

After having been routed to the right shards, we’ll look at what happens at the shard level — such as what constitutes a “successfull” index operation, and transformations necessary to convert documents and searches to their Lucene equivalents. For search request, we’ll also look at the scatter/gather process, which can happen in multiple rounds.

To describe how Elasticsearch works, we need some sort of cluster topology. The cluster in the figure describes the kind of cluster we assume. We have one Elasticsearch index with two shards, replicated across nodes in two different “zones” for availability — with master only nodes running across three zones. Also, we have two client nodes, which we will be sending requests through.

Sample Cluster Topology

Different nodes in a cluster can have different roles (data and/or master – or none as a client) as well as properties (such as zone). We have data nodes, master nodes and client nodes - in different zones, i.e. with different node.zone properties. We focus on the data nodes in this article. To learn more about the other kinds of nodes, we recommend reading Elasticsearch in Production, and Java Clients for Elasticsearch.

When you send a request to a node in Elasticsearch, that node becomes the coordinator of that request. It will decide which nodes and shards to route a request to, how to merge different nodes’ responses, as well as decide when the request is “done”. While Elasticsearch handles this transparently for you, advanced partitioning schemes require knowledge about internal processing and routing. Given the cluster described above, the client node will act as the coordinator.

To be able to act as a coordinator, the node needs to know the cluster’s state. The cluster state is replicated to every node in the cluster. It has things like the shard routing table (which nodes host which indexes and shards), metadata about every node (such as where it runs and what attributes the node has), and index mappings (which can contain important routing configuration) and templates. The cluster state being replicated to all nodes is an important reason why the mappings need to be reasonably sized

For a search request, coordinating means picking replicas of shards to send the request to for further processing, possibly in multiple rounds, which we’ll look more into later. Picking a replica is done either at random, or influenced by the request’s preference.

An index request (i.e. one where you create, update or delete a document) is slightly different. In this case, the request must be routed to the primaries of the shards, and also to a certain amount of replicas - depending on the index request’s write_consistency. Write consistency can be either one, quorum (the default, though equivalent to one unless you have ≥2 replicas), or all — i.e. requiring one, most or every replica to have acknowledged.

The term “index” is used in a lot of contexts, with different meanings. You index documents, to an Elasticsearch index. The Elasticsearch index has shards, which are Lucene indexes. And those have inverted indexes. This can get confusing.

An important insight is that, conceptually, an Elasticsearch index with two shards is exactly the same as two Elasticsearch indexes with one shard each. Ultimately, they are two Lucene indexes. The difference is largely the convenience Elasticsearch provides via its routing feature. It is possible to achieve the same “manually” having just single shard indexes. (Not that you should!)

The important part is realizing that an Elasticsearch index is an abstraction on top of a collection of Lucene indexes, through the concept of shards. This helps when you need to wrap your head around more advanced partitioning strategies.

An index request is one that creates or changes documents in an index. Whether it’s the creation of a new document, a deletion or an update is not very important, so we’ll refer to them all as an “index request”.

There are several questions the node that processes this request needs to consider:

What is the routing configuration for the tweets and the logs? This can be determined by looking at the mappings in the cluster state. For the tweets, it’s possible that user based routing is in place, so the tweet’s user attribute should determine which shard to route the documents to. We look at some examples in the next section.

After having determined the shards for the two tweets, when can we consider the document to actually be indexed? Since nothing is specified, this defaults to the node’s default action.write_consistency, which defaults to quorum.

The log message does not have an ID associated with it. All documents must have one, so the coordinator will have to assign one to it. The document ID is what the coordinator will use to route the request, if the action or mapping does not override it. (It’s a best practice to assign IDs and have idempotent requests, however)

The log action specifies a strict _consistency-setting, so the request cannot be acknowledged until every single replica has returned success.

Routing specifies which documents go where, and is therefore an important part of how requests flow internally in Elasticsearch, both for search and index requests. It’s integral to designing proper data flow and index partitioning. Here, we focus on the mechanics, since index and shard design is a big topic.

Documents are routed based on a routing key, and are placed on shard number \(\mathrm{hash}\left(key\right) \bmod n\), where \(n\) is the number of shards in the index. (For the curious, the hash function used is djb2)

Shard design is largely about deciding on what that key is. If we use the document IDs for the above tweets as the routing key, and assume the tweets index has two shards, we would get \(\mathrm{hash}\left(502878691740090369\right) \bmod 2 = 1\) and \(\mathrm{hash}\left(475944863175671808\right) \bmod 2 = 0\) respectively. The two tweets would be stored in different shards. However, if the mapping would rather specify routing based on the user key, we would find that \(\mathrm{hash}\left(mikepluta\right) \bmod 2 = \mathrm{hash}\left(viktorklang\right) \bmod 2 = 1\). The documents would thus be stored in the same shard. That can be helpful if we only want to search a particular user’s documents.

When an index operation has been routed, it is forwarded to the “primary” for that shard. A shard has exactly one “primary”, and zero or more replicas. In this regard, the primary can be considered the “master” for the shard, and the replicas as “slaves”.

The primary will act as a coordinator for index operations for that specific shard. It will send the index operations to the relevant replicas, and wait until the required number has acknowlegded before indicating success.

Sequence of an Index Operation

When a sufficient number of replicas have acknowledged, the primary will report success back to the originating request coordinator, in our case the client node. For the default write consistency of “quorum”, two out of three operations is sufficient, so it is not necessary to wait for the third operation before passing on success.

The coordinator/client node sends out operations to separate shards’ primaries in parallel. When all operations have returned, the originating bulk request can also finally be returned.

The definition of a “successful” operation is worth looking into.

In the previous article, we looked at how indexes are built, and how there is a tradeoff between search speed, index compactness, indexing speed, and the time it takes for operations to become visible. Lucene internals like immutable segments were covered, and we emphasized that building these in batches and delaying costly disk synchronization operations were important.

All of that seems to run counter to what we want from an index operation: safe and durable, yet quickly acknowledged. To achieve that, Elasticsearch has a “transaction log” (“translog” in the documentation), or a “write-ahead log”, like almost every database system. Being written to the append-only translog is what defines success for a shard, not whether the document is actually part of a live index through a searchable segment.

During normal operation, Elasticsearch does not replicate already built pieces of indexes (i.e. segments), but the operations necessary to reapply the same change. (During recovery or shard migration, on the other hand, segments are replicated.) Therefore, all the replicas are essentially doing the same amount of work, for the same CPU cost. A replica cannot “reuse” the work the primary has already done. This is an important reason why adding replicas decreases the overall indexing throughput — you need to wait for even more nodes to acknowledge the operations. Also, you cannot expect to have “write only” nodes and “read only” nodes.

While putting an operation in the shard’s transaction log is a great start, ultimately the effect of the operation is supposed to end up in the index structures. In the previous article, we covered how the inverted index works, and how it’s built. We said nothing about mappings, but focused on the fact that a shard is a Lucene index.

Lucene does not have a concept of a mapping, nor does a Lucene index or document have types. In Lucene, a document is typeless and has an arbitrary number of fields. These fields have certain types (string/numeric) and properties (stored/indexed/…). The Mapping is an excellent abstraction to express how to transform a source document to a Lucene document with a bunch of fields. Mappings have concepts like types, dynamic properties, multi-fields, scripted transforms and so on. Lucene, however, knows nothing of these. Lucene is happy as long as Elasticsearch produces a document in the way Lucene expects. There is nothing special with fields like _all, _source, or even _type as far as Lucene is concerned.

Elasticsearch transforms a source document to a Lucene document

Many compare an Elasticsearch index to a database, and a type to a table. Personally, I think this comparison can cause more confusion than clarity. With two different tables, you can reasonably assume that the tables have nothing in common in how or where they are stored. Conversely, you can construct mappings that will cause Elasticsearch to produce fields with the same name, but with disparate types. Since everything is a bunch of terms in the same inverted index, this can cause problems. If you have a filter that is numeric for some documents and a string for others, you will eventually get errors or possibly unexpected results. The same goes for fields with the same type (e.g. string), but with different analyzers. Dynamic mapping is great for jumpstarting development, but probably not something you want to rely entirely on.

The node accepting the request will be the coordinator. It consults the mappings to determine which shard to send the request to.

The request is sent to the primary of that shard.

The primary writes the operation to its translog, and relays the request to the replicas.

When a sufficient number of replicas have acknowledged, the primary returns success.

The coordinator returns success when all the sub-operations (in e.g. a bulk request) have succeeded.

While this happens, each shard will continuously process its queue of documents, transforming the input documents to Lucene documents. These are then added to the index buffer, eventually flushed in a new segment, and even later completely committed. When this happens is up to the node hosting the shard. The flushing is not synchronized across nodes, so it is possible for searchers to briefly see separate “timelines” as refreshes propagate.

We’ve seen the amount of back and forth required to turn a request with an index operation into an actual index change. The process for search requests is similar in some ways, and different in many others.

Like with index requests, the search requests must be routed. A search will either hit all distinct shards, if no routing is specified — or a specific shard, if routing is specified.

When the relevant shards have been identified, the coordinator will choose amongst the available replicas of that shard, trying to balance load.

Let’s assume we have the following search. We have a multi_match-query on the fields title and description, searching for “Holy Grail”. We want the top ten authors, and the top ten books, preferring a match in the title.

We have not specified the search’s type, so the default will be used – query_then_fetch. This “search type” is not to be confused with the types discussed above! Naming is difficult.

Essentially, “query then fetch” means that there will be two rounds of searching. We will compare this search type with others later.

First, the shards will each find the top 10 hits and send back their IDs. These IDs (up to \(10 * n\) of them) will then be merged by the coordinator to find the true top 10, after which it will request the actual documents for the winners. For finding the global top 10 hits, it is sufficient to ask for the top 10 from every shard. We’ll see why this is different for aggregations later.

But first, let’s look closer at the first round, and see what happens at the shard level for a search.

Before a search request can be executed on a shard, the search needs to be rewritten and adapted to Lucene query graphs. Elasticsearch is often criticized for having a deeply nested and verbose search DSL, which to some extent assumes Lucene familiarity. Personally, I find the search DSL to be awesome — for precisely the same reasons:

First, the nested nature makes it easier to work with programatically. It is certainly verbose for simple things, but real search needs with heavily customized scoring and matching are not simple.

Secondly, the queries and the filters in the DSL are quite close to their Lucene counterparts. This is important to understand what is really going on.

There are a few exceptions, however. As mentioned above, the mapping concept is foreign to Lucene. Yet, some queries utilize the mappings. For example, the match family of queries have no Lucene counterpart. They will process the query text according to the fields’ mappings, and compose a Lucene query. Our example will be rewritten to something like the following:

Note that the query text “Holy Grail” has been tokenized and lowercased, according to the analyzers configured for the respective fields. A common source of frustration when not getting the results you want is to have incompatible index- and query time text processing. For example, a terms query would not have transformed Holy Grail into holy, grail, and therefore would not match.

The section on inverted indexes and index terms in the bottom up article describes how the search is executed per segment, so we will skip repeating that here. However, it is worth repeating that a search happens across multiple independent segments and merged - much like how the sharding works. There are several caches that can boost the performance of the search, all scoped by segment:

If the tag:python filter is cached, it can immediately be reused.

The author_id field must be in the field cache, if document values are not enabled. If not, all author_ids for all documents must be loaded into memory.

If document values are used for author_id, it’s possible that the disk pages holding the needed values is in the page cache. If so, great!

Queries are not cached, however. So for these, and any uncached filters and fields, we will need to hit the inverted index. For anything that needs to use the underlying index, it would be great if the relevant pages are present in the operating system’s page cache. (If you assign all the memory to the Elasticsearch process, there is nothing left for the operating system.)

Ultimately, with filters and aggregations, you are essentially manipulating bitmaps to summarize values you already have cached. This is why Elasticsearch can be so mind-boggingly fast.

After every shard has provided its contribution to the results, the coordinator will merge them. Specifically, there are two things it needs to find out:

The true top 10 hits. The shards will provide the IDs of up to 10 documents, and their scores.

An approximation of the top 10 authors. The shards have provided the counts for up to 100 authors each. If we did not do this, and only requested the top 10 per shard, what would happen if one of the true top 10 authors were actually the 11th in one of the shards? It would not have been submitted as a candidate, and caused that particular outer to have fallen out. This is still possible, if one of the true top 10 authors is actually the 101st on one of the shards, but it should be less likely. To have complete accuracy, Elasticsearch needs to gather the counts for all authors for every shard. This can be prohibitively expensive to do, so trading accuracy for speed is common in this case.

The merge process will determine the true top 10 hits, then reach out to the shards that host the documents and ask for the entire document. Whether this extra step is helpful or an optimization that ends up adding to the overall latency depends on your use case. If you have a big number of shards and rather big documents, it’s probably worthwhile to do it in two rounds. If you have tiny documents and few shards, you can consider the query_and_fetch search type. As always, test and verify.

The goals of this article was to start out with an index and a search request, as sent from a client, and reach the bits covered in the first part of this series. We’ve looked at shard routing, with the help of cluster state, mappings and possibly customized routing. With shards routed, we’ve seen how the shards’ primaries coordinate changes to its replicas, and how the transaction log balances durability and timely responses.