Elastic Search: Distributed, Lucene-based Search Engine

Here at Sematext we are constantly looking to expand our horizons and acquire new knowledge (as well as our search team – see Sematext jobs page – we are always on the lookout for people with passion for search and, yes, we are heavy users of ElasticSearch!), especially in and around our main domains of expertize – search and analytics. That is why today we are talking with Shay Banon about his latest project: ElasticSearch. If his name sounds familiar to you, that’s likely because Shay is also known for his work on GigaSpaces and Compass.

What is the history of Elastic Search in terms of:

When you got the idea?

How long you had that brewing in your head?

When did you first start cutting code for it?

I have been thinking about something along the lines of what elasticsearch has turned out to be for a few years now. As you may know, I am the creator of Compass (http://www.compass-project.org), which I started more 7 years ago, and the aim of Compass was to ease the integration of search into any Java application.

The last important aspect of Compass is its integration with different mechanisms to handle content and make it searchable. For example, it has very nice integration with JPA (Hibernate, TopLink, OpenJPA), which means any change you do the database through JPA is automatically indexed. Another integration point was with data grids such as GigaSpaces and Coherence: any change done to them gets applied to the index.

But still, Compass is a library that gets embedded in your Java app, and its solutions for distributed search are far from ideal. So, I started to play around with the idea of creating a Compass Search server that would use its mapping support (JSON) and expose itself through a RESTful API.

Also, I really wanted to try and tackle the distributed search problem. I wanted to create a distributed search solution that is inspired, in its distributed model, from how current state of the art data grids work.

So, about 7 months ago I wrote the first line of elasticsearch and have been hacking on it ever since.

The inevitable: How is Elastic Search different from Solr?

To be honest, I never used Solr. When I was looking around for current distributed search solutions, I took a brief look at Solr distributed model, and was shocked that this is what people need to deal with in order to build a scalable search solution (that was 7 months ago, so maybe things have changed). While looking at Solr distributed model I also noticed the very problematic “REST” API it exposes. I am a strong believer in having the product talk the domain model, and not the other way around. ElasticSearch is very much a domain driven search engine, and I explain it more here: http://www.elasticsearch.com/blog/2010/02/12/yourdatayoursearch.html. You will find this attitude throughout elasticsearch APIs.

Is there a feature-for-feature comparison to Solr that would make it easier for developers of new search applications to understand the differences and choose the right tool for the job?

There isn’t one, and frankly, I am not that expert with Solr to create such a list. What I hope is that people who work with both will create such a list, hopefully with input from both projects.

When would one want (or need) to use Elastic Search instead of Solr and vice versa?

As far as I am concerned, elasticsearch is being built to be a complete distributed, RESTful, search solution, with all the features anyone would ever want from a search server. So, I really don’t see a reason why someone would choose Solr over ElasticSearch. To be honest, with today data scale, and the slow move to the cloud (or “cloud architectures”) you *need* a search engine that you can scale, and I really don’t understand how one would work with Solr distributed model, but that might just be me and I am spoiled by what I expect from distributed solutions because of my data grid background.

What was the reason for simply not working with the Solr community and enhancing Solr? (discussion, patches…) Are some of Elastic Search’s features simply not implementable in Solr?

First, the challenge. Writing a data grid level distributed search engine is quite challenging to say the least (note, I am not talking about data grid features, such as transactions and so on, just data grids distributed model).

Second, building something on top of existing codebase will never be as good as building something from scratch. For example, elasticsearch has a highly optimized, asynchronous, transport layer to communicate between nodes (which the native Java client uses), a highly modular core where almost anything is pluggable. These are things that are very hard to introduce or change with existing codebase, and existing developers. Its much simpler to write it from scratch.

We see more and more projects using Github. What is your reason for choosing Git for SCM and Github for Elastic Search’s home?

Well, there is no doubt that Git is much nicer to work with than SVN thanks to its distributed nature (and I like things that are distributed 🙂 ). As for GitHub, I think that its currently the best project hosting service out there. You really feel like people out there know developers and what developers need. As a side note, I am a sucker for eye candy, and it simply looks good.

We see the community already created a Perl client. Are there other client libraries in the works?

Yeah, so there is an excellent Perl client (http://github.com/clintongormley/ElasticSearch.pm) which Clinton Gormley has been developing (he has also been an invaluable source of suggestions/bugs to the development of elasticsearch). There are several more including erlang, ruby, python and PHP (all listed here http://www.elasticsearch.com/products/). Note, thanks to the fact that elasticsearch has a rich, domain driven, JSON API, writing clients to it is very simple since most times there is no need to perform any mappings, especially with dynamic languages.

We realize it is super early for Elastic Search, but what is the biggest known deployment to date?

Yes, it is quite early. But, I can tell you that some not that small sites (10s of millions of documents) are already playing with elasticsearch successfully. Not sure if I can disclose any names, but once they go live, I will try and get them to post something about it.

What are Elastic Search future plans, is there a roadmap?

The first thing to get into ElasticSearch are more search engine features. The features are derived from the features that are already exposed in Compass, including some new aspects such a geo/local search.

Another interesting aspect is making elasticsearch more cloud provider friendly. For example, elasticsearch persistent store is designed in a write behind fashion, and I would love to get one that persist the index to Amazon S3 or Rackspace CloudFiles (See more information on how persistency works with elasticsearch, see here: http://www.elasticsearch.com/blog/2010/02/16/searchengine_time_machine.html).

NoSQL is also an avenue that I would love to explore. In similar concept to how Compass works with JPA / Data Grids, I would like the integration of search with NoSQL solutions more simple. It should be as simple as you do something against the NoSQL solution, and it automatically gets applied to elasticsearch as well. Thanks to the fact that elasticsearch model is very much domain driven, and very similar to what NoSQL uses, the integration should be simple. As an example, TerraStore already comes with an integration module that applies to elasticsearch any changes done to TerraStore. I blogged my thoughts about search and NoSQL here http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html.

If you have additional questions for Shay about Elastic Search, please feel free to leave them in comments, and we will ask Shay to use comments to answer them.

I have seen several features in ES regarding faceting (statistical, range, histogram, etc.), but did not find any helping with the scope of text processing. On basic lucene, I have seen some text extensions like synonym analyzer, and of course carrot2 is an important step towards text clustering.

Are there any such features in ES (if not now, maybe in future)?

Also, any suggestions on how to add such text processing engines as a plugin or script in ES? (I understand that map reduce on text processing might be a bit more tricky than that on statistical calculation, still …. )

Sujot – ES is relatively young, so features are getting added to it rapidly. The best way to go about getting new features in is to 1) ask on the ES mailing list, 2) check ES Github repo and issues people opened, 3) fork, patch and request a pull from your fork into ES master. I hope this helps.

Always good to hear about new Open Source! I am a Solr enthusiast, and I do REALLY like it. But I must admit distributed search to be its Achilles heels. You have to configure multicores, masters, slaves, replication and so on. I think it will improve a lot with Zookeeper. But it is always good to have another option.

In Solr we have the concept of Search Components, that let specify a way to do my query and fine tune it for specific cases. So you plug your Search Component at a Request Handler and tweak how the queries meant to that Handler are going to handled. You can even plug more than one Component at the same Handler, or the same Component at different Handlers.

How do we achieve different search needs for the same application with Ellastic Search? Does my question makes any sense at all?

Of course. Solr has more options when it comes to pure search features, I guess. From what I know, it has more faceting features than elasticsearch for example, and someone mentioned on the mailing list function scoring. I do plan to add these features, some of them are features that I already implemented in Compass way back (of course, on top of Lucene), and some are new. I do have several ideas on exciting new features for search (which I don’t think exists currently). Well, you know better than myself, these features are a moving target, and I certainly plan to move with it.

Cloud Support is nice (though this doesn’t sound like something that could not be added to Solr. After all, there is Solr Cloud: http://search-lucene.com/?q=solr+cloud although what you describe in that post does sound nice and simple – I’m guessing the complex stuff is hidden in Zen?).

Questions re cloud support in ES:
* Discovery – so the idea is that a single new ES instance can find other, already existing ES instances on the same cloud and, at the same time, the existing ES instances can find out about the new ES instance that just joined, correct? Does ES have a notion of masters and slaves? If so, is the idea to use Zen to let them discover each other? And how does discovery come into plan in scenarios where one has multiple query slaves behind a load balancer (LB). When a new slave joins the cluster, how does the LB find out about it and, importantly, how does it know when the query slave is fully ready (e.g. when it has the index and all other data and is fully ready to serve queries)?

* Gateway – with S3 (and maybe other blobstores), how does ES deal with size limitations, such as S3 bucket size being 5 GB max? Also, can you describe the process of keeping data both local and in a blobstore? Is the local data simply replicated into the blogstore periodically? Or in near-real-time? Is there a chance of things getting out of sync? (e.g. a doc is added to ES, it goes to local store, but the machine/VM dies and S3 never gets the data)

Good questions, which actually belog to the mailing list since its hard to explain them in a blog comment.

Regarding discovery, there is no concept of query slaves in elasticsearch. You have nodes that can hold data, and nodes that do not. There is no need for a load balancer in elasticsearch, each node can receive a request, and if it can’t handle it, it will automatically delegate it to the appropriate node(s). If you want to scale out search, you can simply have more shard replicas per shard.

Zen is used for both discovery and master election. A master in elasticsearch is responsible for handling nodes coming and going and allocation of shards. Note, the master is not a single point of failure, if it fails, then another node will be elected as master.

Also note, that nodes do not need to communicate with the master on each request, so its not a single point of bottleneck.

The readiness of nodes is done using the shard allocation algorithm. A shard allocated to a node is considered “ready” to receive requests only once it has fully initialized.

In short, elasticsearch provides BASE support. Each document you index is there once the index operation is done. No need to commit or something similar to get everything persisted. A shard can have 1 or more replicas for HA. Gateway persistency is done in the background in an async manner. This can survive node failure since that node has a replica which will continue with the async persistance to the gateway.

As for S3 limited file size, elasticsearch automatically breaks large index files into S3 files if they are over the limit.

This is all basic data grid (and to a degree, nosql) based distributed concepts. All we need now is to make people who use search engines and try to go distributed to understand them, and then, questions like how elasticsearch is different than Solr will become mute, even with the upcoming Solr Cloud…, which, based on the links you sent, is, ahem…, a bit of a hack…

I think I explained in the interview why I decided against integrating with Solr. Its a completely different architecture, changing Solr to support that would have been really difficult (from what I saw). I think that you get a good product with ElasticSearch, and choice is always good. Another reason is the fact that I want to set the pace for the project, an existing project, with its history, comes with a baggage that I simply did not want to carry with the development.

In any case, choosing between two solutions which are free and open source is better than writing your own, no?

Elastic Search sounds like a really interesting project.
Unfortunately the author of Elastic Search seems see the things a little bit too easy, I think.

You criticised the way Solr deals with distributed search. What exactly is the difference between your project and the Solr architecture that make it so difficult that contribution seems to be too expensive (I am no expert on that field, sorry)?

According to the interviewer’s question: Why not merge your idea of processing distributed search and the already given Solr-Server? Of course it would cost a little bit more effort to change the distributed search architecture in Solr than creating a whole new product. However, if you did so, Solr would get even better than it is today.

I ask again, because as a developer of search application, I have to decide whether I use Elastic Search or Solr. It would cost a lot of customization to get the best of both.

Services

About

Contact

Apache Lucene, Apache Solr and their respective logos are trademarks of the Apache Software Foundation.
Elasticsearch, Kibana, Logstash, and Beats are trademarks of Elasticsearch BV, registered in the U.S.
and in other countries. Sematext Group, Inc. is not affiliated with Elasticsearch BV.