Friday, October 29, 2010

We need full-text search capabilities in our frontend WWW interface to
allow users to search through logs sent by embedded devices in the
field. Quite recently, we changed the search backend from Xapian to
ElasticSearch.

Background

Xapian is an open source, GPL-licensed C++ library that implements a
rich set of features for indexing documents of any type, and for
searching and ranking them. An application that uses Xapian embeds
it by linking to the C++ library. There's no server involved
whatsoever, unless your application itself is a server.

ElasticSearch is an open source, Apache-licensed Java application
that implements a server for indexing and searching JSON documents.
It's built on top of Lucene, a popular Java library used by many
higher-level search engines. ElasticSearch has an HTTP REST API as
well as a higher-performance Thrift API, and its query DSL provides
rich searching capabilities.

From day one, our log indexing service has indexed JSON documents.
Log parsers output JSON documents, and mappings can be used to
convert specific fields to forms understood by Xapian. I think the
system (not invented by me!) is quite clever in how it uses Xapian
to index JSON documents.
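To give a rough idea of what such a mapping could look like, here is a
minimal sketch in plain Python. It is not the actual implementation:
the field names, the MAPPING table and the document_to_terms helper
are all made up for illustration. The only thing borrowed from Xapian
is the convention of marking a field's terms with a prefix (custom
prefixes conventionally start with "X").

```python
# Hypothetical mapping: JSON field name -> (term prefix, kind).
# "exact" fields become one prefixed term; "text" fields are split into words.
MAPPING = {
    "device_id": ("XDEVICE", "exact"),
    "level": ("XLEVEL", "exact"),
    "message": ("", "text"),
}


def document_to_terms(doc, mapping=MAPPING):
    """Turn a parsed log entry (a dict) into a sorted list of index terms."""
    terms = set()
    for field, value in doc.items():
        if field not in mapping:
            continue  # unmapped fields produce no search terms
        prefix, kind = mapping[field]
        text = str(value).lower()
        if kind == "exact":
            terms.add(prefix + text)
        else:
            for word in text.split():
                terms.add(prefix + word)
    return sorted(terms)


# A made-up log entry; "raw" is not in the mapping and is ignored.
entry = {"device_id": "A17", "level": "ERROR", "message": "Disk full", "raw": "..."}
terms = document_to_terms(entry)
print(terms)
```

In a real indexer, each of these terms would then be added to the
Xapian document before it is written to the database.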

Problems with Xapian

As more and more logs started to flow, we started facing problems
with Xapian.

First, we had problems scaling indexing. Xapian's database is a
bunch of files, and only one process is allowed to write to the
database at a time. We wanted good durability, so the database was
flushed often to avoid losing any data. Due to how the communication
between the client (that sends logs) and the server (that submits
them to indexing) works, we couldn't index a large batch of
documents and then flush the database. So as the amount of incoming
logs started to grow, indexing fell behind at times.

The next problem was search performance. As our log database hit
about 10 million entries, searches on a single device's logs were
taking many seconds to complete. Searching through the logs of all
devices took minutes, even when the results were limited to a few
lines. The situation was worsened by the fact that flushing a Xapian
database invalidates all ongoing search operations, which then have
to be restarted. And our indexer flushed often.

Our setup had a single node and a single database. I believe that
with some refactoring, splitting databases, adding nodes, etc. we
could have done better with Xapian. It would just have been too
much trouble, as we would have had to build clustering and scaling
all by ourselves. At about 19 million log entries we decided to do
something about it.

Meet ElasticSearch

We started to look for alternatives and found ElasticSearch. It was
amazing how perfectly it seemed to fit our needs. It uses JSON as
the native document format, and its mapping capabilities and
JSON-based search language were built in the same spirit as in our
Xapian-based system.

So I started playing with it.

ElasticSearch was ridiculously simple to get running. Just download
the binaries and start one shell script. It was up and running in 5
minutes, with zero configuration. Once I got a grip on the mapping
system, it was easy to make the same fields searchable in the same
way as we had done with Xapian. What needed the most work was
changing from building Xapian-type queries to ElasticSearch ones.
But in the end, this wasn't such a big deal either, as our own query
language was also based on JSON.
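As an illustration of that kind of translation, here is a minimal
sketch. The input format ("device", "text", "limit") and the field
names device_id, message and timestamp are hypothetical, not our real
query language or schema. The output uses term, query_string and bool
from the ElasticSearch query DSL as I understand them; check the
documentation for your version before relying on the exact shape.

```python
def to_elasticsearch_query(query):
    """Translate a hypothetical in-house JSON query into an ES search body."""
    must = []
    if "device" in query:
        # Assumes device_id is indexed as a single, unanalyzed value.
        must.append({"term": {"device_id": query["device"]}})
    if "text" in query:
        must.append(
            {"query_string": {"default_field": "message", "query": query["text"]}}
        )

    if must:
        es_query = {"bool": {"must": must}}
    else:
        es_query = {"match_all": {}}  # no constraints: match everything

    return {
        "query": es_query,
        "sort": [{"timestamp": {"order": "desc"}}],  # newest entries first
        "size": query.get("limit", 1000),
    }


body = to_elasticsearch_query({"device": "A17", "text": "disk full", "limit": 50})
print(body)
```

Because both sides are plain JSON structures, the translation is
mostly a matter of rearranging dictionaries.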

With these issues solved came the fun part: moving our logs
from Xapian to ElasticSearch. I wrote a small Python script that
iterated through all the documents in our 16GB Xapian database, made
minor modifications to them and used the ElasticSearch bulk API to
index a few thousand in each request. The process took a few hours
to complete, and after it was done, it was time to see what had
happened to the performance.
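The batching part of such a script can be sketched roughly as follows.
This is not the original script: the index name "logs", the type name
"entry" and the sample documents are made up, and reading from Xapian
is replaced by a plain list. The bulk API expects newline-delimited
JSON, one action line followed by one source line per document. The
_type field was part of the API at the time; newer ElasticSearch
versions have dropped it, so it is optional here.

```python
import json


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def bulk_body(docs, index, doc_type=None):
    """Build a bulk request body from (doc_id, source_dict) pairs."""
    lines = []
    for doc_id, source in docs:
        meta = {"_index": index, "_id": doc_id}
        if doc_type is not None:
            meta["_type"] = doc_type
        lines.append(json.dumps({"index": meta}))  # action line
        lines.append(json.dumps(source))           # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Stand-in for the documents read from the old database.
docs = [(str(i), {"device_id": "A17", "message": "line %d" % i}) for i in range(5)]
batches = list(chunked(docs, 2))
first_body = bulk_body(batches[0], "logs", "entry")
print(first_body)
```

Each body would then be sent with an HTTP POST to the server's _bulk
endpoint. With a few thousand documents per batch, the per-request
overhead is spread over many documents.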

ElasticSearch is fast

Our first ElasticSearch node had one CPU and 2GB of memory, of which
1GB was dedicated to ElasticSearch. And searching was blazingly
fast. After getting used to waiting 10-15 seconds for the 1000 most
recent log entries of a single client with Xapian, ElasticSearch
returned the results in 5 seconds. When I pressed the search button
again, I got the results (with a few new lines) in less than a
second. This was amazing.

The log indexer performed a lot better too. Before, catching up on
2000 pending indexing jobs took an hour to complete. Now it took 2
minutes.

ElasticSearch is bonsai cool

All the worries about scaling are gone. If speed becomes an issue,
we can start a new node or three, and let ElasticSearch work out
load balancing behind the scenes.

But we're nowhere near requiring more performance. Currently, in our
testing environment, we still have a single ElasticSearch node, but I
reduced the memory limit of ElasticSearch to 512 MB. Nothing changed
in terms of speed even though the available memory was cut in half.

We're really happy with ElasticSearch and would never change back
to our old system. Because of the speed, we're now able to enhance
the log searching user experience. We plan to implement polling for
new entries in the log browser, fetching more lines dynamically
when the user scrolls the window, and more.