How we get high availability with Elasticsearch and Ruby on Rails

On e-Manifest, C2, and Open Data Maker, we pair a full-text search tool
with a storage tool, a common technique for information management
systems. For example, we keep the canonical data in a database and also
push it to a search tool; all queries are directed first at the
much-faster search tool.

The technique is so common that modern
databases
often include an integrated
full-text search component, which can be
very useful for simple data models where you want to, for example,
search blog posts or a few large text columns in a table or two.
Sometimes, though, your data model is complex enough or the user
experience demands features that the built-in search component does not
offer. Then it’s time to consider a standalone search
tool, like Solr or Elasticsearch.

A database gives us safety: redundancy, validations, referential
integrity checks, and transactions. With safety comes tradeoffs: it can
be difficult and expensive to match a single word from multiple columns
of many words, and all those data integrity features take time and CPU
cycles.

A full-text search tool gives us speed; an inverted
index makes finding a
needle amidst a haystack of data a very fast operation. As with the
storage tool, there are tradeoffs: the data pushed into the search tool
must be de-normalized, and then kept in sync with the canonical storage
tool.

The experience for the user can be lovely: near-instantaneous search
results, even for really large datasets. The experience for the
developer can be … less than lovely. Keeping multiple tools in sync,
especially for large datasets, can require a lot of both time and
effort. For anything beyond the simplest of models, the stored data is
likely normalized across multiple tables and columns, so re-assembling it
into its de-normalized, searchable form can be a costly exercise. In a
typical use case, the time spent follows the Pareto principle: roughly 80
percent goes to marshaling the data from storage and sending it to the
search tool, and about 20 percent to the search tool parsing and indexing
it.

Example

Suppose we have a database with these tables:

reports

attachments

users

groups

user_groups

Each report may have zero or more associated attachments (one-to-many)
and zero or more associated users (one-to-many). Each user may have zero
or more associated groups (many-to-many).

Say we want to find all the reports that belong to user Jane and that
have attachments that contain the word “apple”. Searching the inverted
index can be a query as simple as:

user:janedoe@example.com attachments:apple

But to make that simple query possible, we must de-normalize a report
and its associated objects into a single “document” object, which might
look like this:
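
(A rough sketch of such a document, expressed here as a Ruby hash; the
field names and values are illustrative.)

{
  id: 123,
  user: "janedoe@example.com",
  groups: ["research", "admin"],
  attachments: "the extracted text of every attachment, which here would include the word apple"
}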

Building that de-normalized object from the database requires multiple
SQL queries:

SELECT * FROM reports WHERE id=123
SELECT * FROM attachments WHERE report_id=123
SELECT * FROM users WHERE id=456
SELECT name FROM groups WHERE groups.id IN (SELECT group_id FROM user_groups WHERE user_id=456)

That’s a lot of work to make the user query so simple.

Problem: Wall-clock time

Performed once, those SQL queries probably take, cumulatively, less than
a second. But if you have a million reports to index in your search
tool, that adds up to many millions of SQL queries and a lot of
wall-clock time. And
if your search index configuration needs to change, you probably need to
re-index everything again. During development, especially, you may
re-index large data sets many, many times.
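
To put rough numbers on it: at four queries per report, a million reports
means roughly four million queries, and even at a few milliseconds apiece
that is hours of database time before any indexing happens.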

Solution: Divide and conquer

Fortunately for the impatient developer, even though the marshaling phase
of the synchronization process takes up the majority of the wall-clock
time, the problem lends itself to the
MapReduce pattern: you can
split it into smaller pieces and run the pieces in parallel. Modern
computers, even the laptop on which I’m composing this blog post, have
multiple processor cores, each of which can run multiple threads. A lot
of the time, one or more cores are sitting idle. Using all the resources
in the machine can decrease the total wait time significantly.
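
In Rails terms, the core idea looks something like the following sketch
(this is just an illustration, not the extension described below; it
assumes a Report model that includes Elasticsearch::Model):

# Split the report IDs into one slice per worker process.
ids = Report.ids
slices = ids.each_slice((ids.size / 2.0).ceil).to_a
# Fork a child process per slice; each child marshals and indexes only its share.
pids = slices.map do |slice|
  Process.fork do
    ActiveRecord::Base.establish_connection # forked children need their own DB connection
    Report.import(query: -> { where(id: slice) }, batch_size: 200)
  end
end
pids.each { |pid| Process.waitpid(pid) }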

Parallel play

Here’s a real example. I work on
multiple Ruby on
Rails projects that depend upon PostgreSQL
and Elasticsearch. My laptop has four
cores, so during the data
synchronization process, one core will be dedicated to Elasticsearch,
and one core to PostgreSQL, leaving two cores free to run Ruby code that
will read from PostgreSQL (marshaling) and write to Elasticsearch
(indexing). I wanted to simulate a production dataset, so I created
100,000 dummy reports. With the Elasticsearch Rails
gem, it was very easy
to index my reports:
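
(Roughly, using the import Rake task the gem ships with; exact task and
variable names can vary between gem versions, and Report.import(force: true)
from a Rails console does the same thing.)

bundle exec rake environment elasticsearch:import:model CLASS='Report' FORCE=y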

That task took about an hour to complete (or about 27 reports per
second). If we believe the 80/20 rule of thumb, that meant that about 48
minutes were spent marshaling data from PostgreSQL, and Elasticsearch
took 12 minutes to parse and index the reports.

But wait! Only one of the two available cores was busy during that time.
That’s because the Rake task runs as a single process. It grabs a batch
of Reports (1,000 at a time by default), marshals them all into JSON
“documents” and sends them to Elasticsearch using the bulk
API.
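
Under the hood, each batch becomes a single bulk request. With the
elasticsearch-ruby client, that looks roughly like this (a sketch; the
index name and the reports variable are illustrative):

# Index one batch of marshaled documents in a single HTTP round trip.
reports = Report.first(1000)
client = Elasticsearch::Client.new
client.bulk(
  index: 'reports',
  body: reports.map { |report| { index: { _id: report.id, data: report.as_indexed_json } } }
)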

That lonely, idle processor core sits twiddling its thumbs while its
neighbor works so hard. Well, that core can now share in the fun
too:
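
(The invocation was something like the following; the task name and the
NPROCS variable here are illustrative stand-ins for the extension's
interface, and BATCH is the batch-size setting discussed below.)

bundle exec rake environment elasticsearch:ha:import CLASS='Report' NPROCS=2 BATCH=200 FORCE=y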

We wrote some extensions to the basic Elasticsearch Rails
gem that can take
advantage of multiple cores. Now my 100,000 Reports can finish in 36
minutes (48/2 + 12) instead of 60. That increased speed is not without a
cost, though. Since we fork new processes instead of using threads, it
takes twice as much memory to run twice as fast. That’s why I explicitly
set the BATCH size to be smaller than the default 1,000, so that the
memory used by each process is smaller.

We call these extensions “high availability” because this approach means
that re-indexing a production system can happen much faster, reducing
downtime for our users. We’re limited less by wall-clock time and more
by available cores and memory.

Zero downtime

We also included a “staging”
feature, so that
when we re-index an entire system, we create the new index alongside the
existing index, and then “promote” the new index when it’s finished.
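
One common way to implement that promotion is an atomic alias swap. Here
is a minimal sketch using the elasticsearch-ruby client (the index and
alias names are illustrative, and this is not necessarily exactly how our
extension does it):

# Re-point the alias the application queries from the old index to the new
# one in a single atomic operation.
client = Elasticsearch::Client.new
client.indices.update_aliases(
  body: {
    actions: [
      { remove: { index: 'reports_old', alias: 'reports' } },
      { add:    { index: 'reports_new', alias: 'reports' } }
    ]
  }
)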

That means zero downtime for our users, and gives us the ability to
roll back the changes if we need to. In a production environment, where
there are already separate, dedicated servers for the database and
Elasticsearch, a single 8-core machine can churn through a re-indexing
task very quickly. That means, as a developer, I can spend more time
developing and less time waiting for the machine to do its work. Lovely.
