Q&A with Shay Banon: 10 “Bonsai Cool” Things About elasticsearch

We are very fortunate to post an interview with Shay Banon, the founder of elasticsearch. Elasticsearch is a technology that is very popular among some of the coolest companies on the web today, including SoundCloud, StumbleUpon, Mozilla, and Klout. These companies use elasticsearch to deploy powerful search capabilities in their applications that are easy to set up, scalable, and built for the cloud. In this interview, we get to learn all kinds of cool things:

Q1. So, how did you end up getting into search?
About 10 years ago, I moved from Israel to London because my wife was going to study to be a chef at the Cordon Bleu. I was unemployed, in a new country, so I started to get into the latest cool new technologies.

To do this, I came up with a little project. I decided to build my wife an application to store all her chef recipes, notes on the history of food, and so on. I wanted her to have a single place to find whatever was relevant to anything she typed about cooking. This little project is what got me into search. Through it, I got into Lucene and built my first open-source project, Compass, a library around Lucene. Then, I moved from Compass to elasticsearch.

Unfortunately, my wife is still waiting for the application I promised to build :- ).

Q2. How did you come up with the idea for elasticsearch?
Well, I had been doing a lot of work with open source search for quite some time, mainly on top of Lucene. I was doing that for ages. Then I started working at GigaSpaces on in-memory data grids. I felt the need to build a next-gen search engine with all the features you expect today: distributed, cloud-ready, easy to adopt, scalable, and designed around standard APIs using REST and JSON. I sat down to write it a few years ago and open-sourced it after a few months. That's it. It has taken off, and it has been amazing. I quit my job, started doing this full time, and started a company around it.

Q3. Why is elasticsearch different from other open source search engines?
In a nutshell, elasticsearch makes data exploration very easy.

First of all, there is a strong architectural principle that things should be easy and simple. When people set up, deploy, and use elasticsearch, it is easy to run 2, 4, 6, 10—as many nodes as you want—and easy to set up a cluster. When you start developing, it's easy to model your data as JSON documents, and the API makes it easy to work from different languages like Java, Ruby, Perl, Python, and more. At runtime, elasticsearch manages distribution: adding a node is quite easy, and data is redistributed automatically.
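As a rough illustration of how simple the document model is, here is a sketch of what indexing a JSON document over elasticsearch's REST API looks like. The index name, type, id, and field names below are illustrative assumptions, not taken from the interview.

```python
import json

# An illustrative document: elasticsearch documents are just JSON objects.
doc = {
    "user": "shay",
    "message": "elasticsearch makes search easy",
    "post_date": "2012-06-01T12:00:00",
}

def index_request(index, doc_type, doc_id, document):
    """Build the HTTP method, path, and JSON body for an index request.

    elasticsearch indexes a document with PUT /{index}/{type}/{id}.
    """
    path = "/%s/%s/%s" % (index, doc_type, doc_id)
    return ("PUT", path, json.dumps(document))

method, path, body = index_request("blog", "post", "1", doc)
# which maps to something like:
#   curl -XPUT 'http://localhost:9200/blog/post/1' -d '<body>'
```

Because the payload is plain JSON and the transport is plain HTTP, the same request can be issued from any language with an HTTP client.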

If you think about search as a process, elasticsearch goes beyond free-text. Users and developers want valuable information from their data regardless of its form. While elasticsearch does free-text search very well, you also want structured search, analytics, aggregations, facets over the data, and more. We tie all of these together nicely. For example, let's say you are indexing a social stream from Twitter, like a lot of our customers do. You can easily set it up and ask questions like, "find all the tweets about the president." This is a free-text search over a bunch of tweets. Then you can say, "find all the tweets about the president tweeted from Idaho in the past month." We sprinkle a bit more structure here by adding the location and time period. Then, we can ask it to do the same thing, but break it down into the number of tweets per day to see a trend over time. So, we end up with a metric and several dimensions that show something of value. After the initial query is set up, we can easily change the name of the president to any other person, and elasticsearch reflects the new result set in real-time. Then, we can change from a person to a topic like "flu epidemic," and the result set is reflected in real-time again.
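The tweet example above could be sketched in elasticsearch's JSON query DSL roughly as follows. The field names ("text", "location", "timestamp"), the exact filter structure, and the date math are illustrative assumptions about the mapping, not a definitive query.

```python
import json

# Free-text search plus structured filters plus a per-day breakdown,
# combined in one request body (field names are assumed for illustration).
query = {
    "query": {
        "filtered": {
            # free-text part: "tweets about the president"
            "query": {"match": {"text": "president"}},
            # structured part: location and time period
            "filter": {
                "and": [
                    {"term": {"location": "idaho"}},
                    {"range": {"timestamp": {"gte": "now-1M"}}},
                ]
            },
        }
    },
    # break the matches down into number of tweets per day to see the trend
    "facets": {
        "per_day": {"date_histogram": {"field": "timestamp", "interval": "day"}}
    },
}

body = json.dumps(query)  # ready to POST to /tweets/_search
```

Swapping "president" for another person or topic means changing one string in the query and re-running it; the result set updates in real-time.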

For developers, it's very powerful how easy data exploration becomes with elasticsearch.

Q4. What are some of the coolest implementations of elasticsearch you have seen?
As far as metrics go, I think that one of the coolest implementations is an email archiving company. They put a petabyte of data into elasticsearch and do unstructured search on a sizable multi-cluster, multi-node environment. This is interesting because it handles such a large, distributed, big data problem.

There is another company that effectively stores and indexes all the download events of their online property. They have billions of these events. They set up elasticsearch as a version of Google Analytics for their users. Their users can see download sources, usage patterns, histograms, trends, money made, geo-based filtering, and more.

Path uses us to index moments, which is a lot of data. This is social graph search—search within my friends' moments. SoundCloud is another one: when you search there, you can explore music through a variety of filters.

We are always excited to hear stories about our customers.

Q5. How does the design of the elasticsearch architecture deal with big data?
When we designed it, we spent a lot of time thinking about how data looks. We wanted users to easily express what they want. Let me explain.

Not all data is created equal. Data flows into systems in different ways, some data is more important than other data, and some data needs to be highly available while other data just needs to be archived. What we tried to do is let a user easily express these notions. For example, one popular use case is logging. We are used as an open-source logging database—people can easily search their logs, get statistics, and so on. Logging is a great example where data is not equal: current data is very important compared to old data. With elasticsearch, you can express the fact that today's data is much more important. We want the capacity and scale to be right for new data, so it should sit on strong servers with SSDs. As data ages, we can say that it is less relevant, so it can run on cheaper boxes, and instead of 4 copies we just need 2. Now, we can manage our costs.
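One common way to express this for logging is an index per day, where today's index gets more replicas than older ones. The sketch below assumes that pattern; the index names and replica counts are illustrative, not a recommendation from the interview.

```python
# A sketch of the "not all data is equal" idea for logging:
# one index per day, with settings that depend on the data's age.
def index_settings(age_days):
    """Return elasticsearch index settings for a daily log index.

    Today's index gets more copies (it serves most queries); older
    indices get fewer copies and can live on cheaper boxes.
    """
    replicas = 3 if age_days == 0 else 1  # illustrative values
    return {"settings": {"index": {"number_of_replicas": replicas}}}

today = index_settings(0)      # e.g. logs-2012.06.01, hot data
last_week = index_settings(7)  # e.g. logs-2012.05.25, aging data
```

Because replica counts can be updated on a live index, the policy can be applied as each day's index ages out of the hot set.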

With big data, we have to manage data differently depending on use. We think about these things as much as possible and try to make it really easy for users to express these types of policies.

Q6. You said it was strategic to make elasticsearch work with JSON over HTTP. Could you elaborate?
Sure, this is simple. JSON over HTTP has effectively become the lingua franca of services in a system. Even though we are a Java solution, JSON over HTTP makes it really easy for people to develop in Ruby, Perl, and other languages. HTTP is the wire protocol and JSON is the payload, easily consumable by any language. In addition, you can return a response in JSON and stream it directly to the browser.

By supporting these standards, our usage becomes easier.

In addition, we also try to make our RESTful APIs easily consumable. When you give elasticsearch a search request, you can also ask for a histogram like the number of tweets per day in the past year. We return it in a structure that is ready to be thrown into any charting library. It’s not just HTTP and JSON, it’s the data or object structure—we make it easy to consume.
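To illustrate how consumable the responses are, here is a sketch of pulling a date-histogram facet out of a search response and into chart points. The response shape follows elasticsearch's facet format; the counts and timestamps are made up for illustration.

```python
import json

# An assumed example of the facets portion of a search response
# containing a date_histogram named "per_day" (numbers are invented).
response = json.loads("""{
  "facets": {
    "per_day": {
      "_type": "date_histogram",
      "entries": [
        {"time": 1338508800000, "count": 12},
        {"time": 1338595200000, "count": 30}
      ]
    }
  }
}""")

# The entries are already (x, y) pairs, ready to throw into any
# charting library: time on the x axis, count on the y axis.
points = [(e["time"], e["count"])
          for e in response["facets"]["per_day"]["entries"]]
```

The point is that no reshaping layer is needed between the search engine and the chart; the response structure is the chart data.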

Q7. How does RabbitMQ work with elasticsearch?
Conceptually, you can easily move data from RabbitMQ into elasticsearch.

As I mentioned earlier, one way to push data into elasticsearch is by calling our index API—a lot of people use it this way. But, many times people need an asynchronous approach and are using RabbitMQ. So, we made it easy for developers using RabbitMQ to push messages into elasticsearch.

Instead of having to write something to poll a queue and move the messages into elasticsearch, elasticsearch can do it for you. Basically, elasticsearch starts a listener on a queue or several queues. The data coming down the queue is pulled and automatically inserted into elasticsearch. Developers don't have to worry about doing more than pushing data into RabbitMQ to have it indexed and searchable. Of course, you will need to ensure formats are compatible, but the idea is that elasticsearch acts as a plug-in for a loosely coupled architecture running RabbitMQ.
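Registering that listener is itself just a JSON document, in the style of elasticsearch's RabbitMQ river plugin. The sketch below shows a plausible minimal configuration; the host, credentials, and queue and river names are illustrative defaults, not values from the interview.

```python
import json

# A sketch of a RabbitMQ river configuration: elasticsearch consumes
# messages from the named queue and indexes them automatically.
# All connection values here are assumed local defaults.
river = {
    "type": "rabbitmq",
    "rabbitmq": {
        "host": "localhost",
        "port": 5672,
        "user": "guest",
        "pass": "guest",
        "exchange": "elasticsearch",
        "queue": "elasticsearch",
    },
}

body = json.dumps(river)
# registered with something like:
#   curl -XPUT 'http://localhost:9200/_river/my_rabbit/_meta' -d '<body>'
```

After that, producers only ever talk to RabbitMQ; elasticsearch pulls from the queue on its own.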

Q8. How might developers connect elasticsearch with Spring and Cloud Foundry?
Oh, it’s very possible and fairly straightforward to connect elasticsearch with the whole Spring portfolio. I just saw someone connect Spring Data with elasticsearch on Twitter the other day—it’s on GitHub.

Q9. What about connecting with GemFire at the data layer?
We certainly have Java in common. Not too long ago, I spoke with a GemFire architect. We both have solid Java-based APIs.

In addition, Spring Data is enabling data access for so much—relational databases, Hadoop, HTTP/REST, Redis, MongoDB, Neo4j, HBase, Grails, GemFire, and more soon, I am sure. As mentioned above, there are now connections between Spring Data and elasticsearch. From a loose-coupling perspective, there also is connectivity through RabbitMQ. So, there are many possibilities. I also heard you can run Hadoop style distributed workloads like a distributed stored procedure written in Java.

Q10. How does elasticsearch perform on virtualized infrastructure?
In fact, most of our customers run in the cloud on VMs. Many people run Tomcat or cost-effective platforms like tc Server. We require Java 6 to run, and we typically run on most variants of Linux.

We were architected for elastic scale in hybrid cloud situations—just add a VM, and the service will automatically join the cluster and redistribute the data and the workload. You can also stick elasticsearch into various cloud deployment services like vFabric Application Director for automation and cloud bursting.

About Adam Bloom

Adam Bloom has worked for 15+ years in the tech industry and has been a key contributor to the VMware vFabric Blog for the past year. He first started working on cloud-based apps in 1998 when he led the development and launch of WebMD 1.0's B2C and B2B apps. He then spent several years in product marketing for a J2EE-based PaaS/SaaS start-up. Afterwards, he worked for Siebel as a consultant on large CRM engagements, then launched their online community and ran marketing operations. At Oracle, he led the worldwide implementation of Siebel CRM before spending some time at a YouTube competitor in Silicon Valley and working as a product marketer for Unica's SaaS-based marketing automation suite. He graduated from Georgia Tech with high honors and an undergraduate thesis in human computer interaction.