The Blur Project: Marrying Hadoop with Lucene

Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

About three and a half years ago, I had an experience on a project that showed me just how powerful the fault tolerance characteristics of Hadoop are. This is what made me start to think about the core design behind Blur.

It was around that time that my project began using Hadoop for data processing. We were having network stability issues that would randomly drop Hadoop nodes off the network. Over one weekend, we steadily lost network connections to 47 of the 90 data nodes in the cluster. When we came in on Monday morning, I noticed that some of the MapReduce jobs were a little sluggish but still working. When I checked HDFS, I saw that our capacity had dropped by about 50%. After running an fsck on the cluster, I expected to find a catastrophic failure, but to my amazement we still had a healthy file system.


This experience left a lasting impression on me. It was at that point that I got the idea to somehow leverage the redundancy and fault tolerance of HDFS for the next version of a search system that I was just beginning to (re)write.

At the time, I had already written a custom Lucene server that had been in production for a couple of years. Lucene performed really well and met all of our requirements for search. The issue we faced was that it ran on a big-iron type of server that was not redundant and could not be easily expanded. After seeing the resilient characteristics of Hadoop firsthand, I decided to look into marrying the already mature feature set of Lucene with the built-in redundancy and scalability of the Hadoop platform. From this experiment, Blur was created.

Blur was initially released on GitHub as an Apache-licensed project and was then accepted into the Apache Incubator in July 2012, with Patrick Hunt as its champion. Since then, Blur as a software project has matured and become much more stable. One of the major milestones over the past year has been the upgrade to Lucene 4, which has brought many new features and massive performance gains.

Recently there has been some interest in folding some of Blur's code (HDFSDirectory and BlockCache) back into the Lucene project for others to utilize. This is an exciting development that legitimizes some of the approaches we have taken to date. We are in conversations with some members of the Lucene community, such as Mark Miller, to figure out how we can best work together to benefit both the fledgling Blur project and the much larger, better-known Lucene project.
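The idea behind a block cache in this setting is to keep recently read, fixed-size blocks of index files (fetched from HDFS) in memory, so that hot regions of a Lucene index are not re-read over the network. A minimal sketch of that caching idea, assuming a simple LRU eviction policy; all class and method names here are hypothetical illustrations, not Blur's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache of index-file blocks, keyed by (file name, block id).
// On a miss, the caller would read the block from HDFS and put it here.
class SimpleBlockCache {
    private final int maxBlocks;
    private final Map<String, byte[]> blocks;

    SimpleBlockCache(final int maxBlocks) {
        this.maxBlocks = maxBlocks;
        // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    private static String key(String file, long blockId) {
        return file + "#" + blockId;
    }

    // Returns the cached block, or null on a cache miss.
    synchronized byte[] get(String file, long blockId) {
        return blocks.get(key(file, blockId));
    }

    synchronized void put(String file, long blockId, byte[] data) {
        blocks.put(key(file, blockId), data);
    }

    synchronized int size() {
        return blocks.size();
    }

    public static void main(String[] args) {
        SimpleBlockCache cache = new SimpleBlockCache(2);
        cache.put("_0.frq", 0, new byte[]{1});
        cache.put("_0.frq", 1, new byte[]{2});
        cache.get("_0.frq", 0);                 // touch block 0 so it is most recent
        cache.put("_0.frq", 2, new byte[]{3});  // evicts block 1, the least recent
        System.out.println(cache.get("_0.frq", 1) == null); // prints true
        System.out.println(cache.get("_0.frq", 0) != null); // prints true
    }
}
```

A production cache would also bound memory off-heap and coordinate with the Lucene Directory doing the HDFS reads, but the LRU-over-blocks structure is the core of the approach.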

Blur’s community is small but growing. Our project goals are to continue growing that community and to graduate from the Incubator. Our technical goals are to continue adding features that perform well at scale while maintaining the fault tolerance required of any modern distributed system.