Spotlight on: Elasticsearch

We talk to Shay Banon, creator and co-founder of the distributed RESTful search engine.

In the June issue of JAX Magazine, we talked to
the creator and co-founder of the distributed RESTful search engine
making enterprise waves.

JAX Magazine: Can you give some background to
the company and how it came about?

Shay Banon: Elasticsearch was
open sourced about three years ago, and I quit my day job to do it
full-time. The project was getting more and more successful,
gaining broad adoption and being used as a mission critical
component in applications. Companies using it were actively looking
for professional support around it.

I already knew Uri Boness and Simon Willnauer
through my history around Lucene and search in general, and Uri
connected me with Steven Schuurman, one of SpringSource’s
co-founders. I flew over to Amsterdam, we all hit it off, and we
decided to form the Elasticsearch company as it is
today.

JAXmag: Apache Lucene – what is so good about
that project in your view and what is possible to achieve with
it?

Banon: Apache Lucene is a
wonderful Information Retrieval library. It has some of the best
minds behind it, and is the de facto standard when it comes to low
level IR work, purely on merit. It is a library, though. In order to
use it, one has to program in Java (or another JVM language),
effectively embedding Lucene, knowing its internal APIs, and
understanding its design intimately in order to make the best use of
it.

JAXmag: Why is it the foundation of
Elasticsearch?

Banon: [It] forms a strong
foundation for Elasticsearch to build on. In Elasticsearch, we
managed to utilize Lucene for what it's best at, making sure it's
used as it should be, and extended it in order to achieve some of
the broader goals of Elasticsearch as a distributed search and
analytics engine. We also managed to strike a nice balance for
people who are already familiar with Lucene, yet prefer to use it
through an “over the wire” API where Elasticsearch takes care of
the distributed API execution, data distribution, and so on. For
example, our search API provides an easy-to-use Query DSL that maps
very nicely to how queries in Lucene are actually
represented.
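As an illustrative sketch (the field name and query text here are hypothetical), a Query DSL search request is just a JSON body sent to the `_search` endpoint, and a `match` query like this corresponds closely to the analyzed term and Boolean queries Lucene executes underneath:

```json
{
  "query": {
    "match": {
      "title": "distributed search"
    }
  }
}
```

POSTed to an index's `/_search` endpoint, this returns matching documents ranked by Lucene's relevance scoring.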

We did extend Lucene, for example, by adding
additional geo query capabilities and a stronger consistency and
atomicity model for the data stored in Elasticsearch. Other
features we have built from the ground up, like our analytics/
aggregation engine providing real-time aggregations over billions
of documents.

Another example is the fact that we take care of
distributed data and the search requests automatically,
potentially across tens or hundreds of machines. This includes other
important aspects that are outside the scope of a library and
belong more to a distributed runtime environment. For example, we
have a sophisticated evented networking layer that allows you to
execute a distributed search request across hundreds of machines,
and receive the responses in milliseconds.

JAXmag: Can you outline some key goals from a
design perspective with Elasticsearch – how is it
distributed/highly available for example?

Banon: Building a distributed
system is not easy. Careful thought needs to go into all the
elements that form it, starting with how networking is done for
cross-node communication, through how nodes discover each other,
and ending with how data is balanced across all available
machines.

We do all the “regular” things one would expect
from a high-end distributed system. We allow the user to easily
partition the data, and make sure that the data has multiple
replicas or copies for high availability. Elasticsearch stands on
the more proactive side of distributed systems, taking automatic
action in moving data around to make use of more machines if they
are added to the cluster, or reallocating data in case of machine
failure.

One thing we take to heart is the notion
that not all data is the same, and we strive hard in
Elasticsearch to let the user easily convey that to the system.
For example, in a logging scenario, where Elasticsearch is being
used more and more, old logs are probably not as important as more
recent ones. New logs keep being generated and take the lion's
share of the searches. With Elasticsearch, it's easy to build a
system that takes that into account and prioritizes resources for
recent or new data over older data.
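One common sketch of this pattern (the index names here are hypothetical) is to create one index per day, so that searches can target only the recent indices and old data can be closed or deleted wholesale:

```
PUT /logs-2013.06.17/log/1
{ "timestamp": "2013-06-17T09:00:00", "message": "old entry" }

PUT /logs-2013.06.18/log/1
{ "timestamp": "2013-06-18T09:00:00", "message": "recent entry" }

DELETE /logs-2013.06.17
```

A search against `/logs-2013.06.18/_search` then touches only the hot, recent data, while dropping a whole day of old logs is a single index deletion rather than a document-by-document purge.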

JAXmag: Why JSON over HTTP?

Banon: JSON has become the
de facto standard for representing data in the past few years, and HTTP
has become the de facto transport for it. By standardizing on JSON
and HTTP, we make sure that Elasticsearch can be easily integrated
into any environment, regardless of the programming language,
framework, or technology stack used.

I would add that how a system uses JSON is as
important as the fact that it uses JSON. As with any data format,
JSON can be heavily abused, and we take extra care to design our
API in the most easy-to-use and consumable manner. For example, our
histogram aggregation component returns the data in a format that
can easily be used to drive almost any charting library out
there.
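As a hedged sketch (the field name is hypothetical), a histogram aggregation request looks roughly like this:

```json
{
  "aggs": {
    "prices": {
      "histogram": { "field": "price", "interval": 50 }
    }
  }
}
```

The response comes back as a flat list of buckets, each with a `key` and a `doc_count`, which maps almost directly to the x/y pairs most charting libraries expect.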

The same applies to HTTP, by the way. It’s not
enough to just say we support HTTP; we take it to heart. For
example, when a document is indexed in Elasticsearch and there is a
conflict due to optimistic concurrency control logic, we return the
HTTP status code 409 (CONFLICT).
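A sketch of that flow (the index and document names here are hypothetical): each write returns a `_version`, and a conditional write that carries a stale version is rejected:

```
PUT /blog/post/1
{ "title": "hello" }          response: "_version": 1

PUT /blog/post/1?version=1
{ "title": "updated" }        response: "_version": 2

PUT /blog/post/1?version=1
{ "title": "too late" }       response: HTTP 409 (Conflict)
```

The client that receives the 409 can re-read the document, reapply its change, and retry with the current version, which is the essence of optimistic concurrency control.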

JAXmag: How important is the schemaless
approach?

Banon: Elasticsearch is
semi-schemaless, if that’s a definition. The dataset a user pushes
into Elasticsearch ends up auto-defining the schema that will be
used. Though that dynamic definition is possible, users have all
the power to explicitly define a schema for their JSON
documents.

We feel it’s important because it helps users
get started with Elasticsearch, simply taking data, formatting it
in JSON (which has some notion of types), and storing it in
Elasticsearch. It also serves an important aspect in real systems,
where data ingested might be undefined, and Elasticsearch allows
for an evolvable schema over time.
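A sketch of what an explicit schema can look like (the type and field names here are hypothetical), supplied through the put-mapping API instead of relying on dynamic detection:

```json
{
  "tweet": {
    "properties": {
      "user":    { "type": "string", "index": "not_analyzed" },
      "posted":  { "type": "date" },
      "message": { "type": "string" }
    }
  }
}
```

Fields left out of such a mapping can still be picked up dynamically as documents arrive, which is what makes the approach "semi-schemaless" in practice.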

JAXmag: What makes Elasticsearch different to
Solr?

Banon: To be honest, I know
that a lot of people like to compare Elasticsearch to Solr, but I
personally think that the systems are quite different in the
breadth of problems they try to solve. Solr was born in the single
server, Enterprise Search era, and has proved to be a great
solution for that. Elasticsearch's inception and implementation
started with a deep understanding of the fact that systems today
require advanced distributed capabilities, yet need to be easy
to use and digestible within today's advanced technology stacks.
REST is a good example. Also, we at Elasticsearch have a broad
vision for it, a prime example being our real-time analytics
capabilities, which drew a lot on my personal experience of
building distributed in-memory data grids.

JAXmag: You have some very impressive clients
using the project such as Foursquare, Soundcloud and Github, all of
whom have varying use cases. Does that show the malleability of
Elasticsearch?

Banon: By combining those three areas in a single
product, users find themselves empowered with what they can do with
their data.

For that reason, the data content itself, though
interesting, becomes irrelevant to a degree. If implemented
properly, there isn’t a big difference between a location on
Foursquare, music on SoundCloud, code/issues on GitHub, trades in a
bank, audit logs, for example, in financial institutions, and so
on. And we see all of those use cases and more use Elasticsearch to
help make sense of their data.

JAXmag: Moving forward – what’s planned for
the company?

Banon: We are focused on
continuing to improve the product itself, and not just the “core”
Elasticsearch, but also Lucene, and the ecosystem that has
developed around Elasticsearch. Language clients are a great example
[of that]. We have a lot of work left around improving other
aspects of the product, like our documentation, which we are
actively working on.

Aside from the product, we will continue to
invest heavily in providing our different services around
Elasticsearch, namely our production support, development support
and training. We are also starting to take a more
active role in conferences and meetups, to help educate people
about Elasticsearch. Though, I must admit, it is starting to be
challenging to find people who haven't heard of it.