Open Sourcers Build ‘Google Search for Big Data’

Mike Olson, the CEO of Cloudera. Photo: Wired.com/Jon Snyder

On the surface, Google’s search engine is a simple thing. Type what you’re looking for into that Google search box, and you get a list of relevant webpages and documents. But behind Google search is an extremely complex network of machines. Instead of buying supercomputers to manage the massive amounts of data that play into our web searches, the company has erected computer clusters made up of tens of thousands of commodity servers that all work in unison.

Google doesn’t make its tools available for other companies to use, but it has published white papers about how they work, and that has spawned an entire industry of open source clones, most notably Hadoop, a collection of tools for working with big data across large clusters of servers.

Businesses have long relied on relational databases and data warehouses from companies like Oracle and Microsoft for their data storage needs. But these tools weren’t built to handle the massive amounts of data that face the modern business. As data collection accelerates thanks to e-commerce, social media, mobile computing, and other factors, many companies are starting to use tools like Hadoop. Cloudera is now offering a Google-style search engine for Hadoop. It’s called Cloudera Search.

Founded by ex-Oracle man Mike Olson and various Hadoop gurus from Yahoo, Facebook, and Google, Cloudera wants customers to store all their data in Hadoop — even before it starts to get “big.” The idea is that they will eventually “grow into” Hadoop. But for many users, Hadoop isn’t always a convenient place to store data, because interacting with it traditionally means writing MapReduce jobs in Java.
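To give a sense of the programming model that makes Hadoop intimidating for non-developers, here is a minimal sketch of the MapReduce pattern in plain Python. This is an illustration of the idea only, not Hadoop’s actual Java API: on a real cluster you would implement Mapper and Reducer classes and submit them as a job.

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each distinct key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["Hadoop stores big data", "big data needs big clusters"]
result = reduce_phase(map_phase(lines))
print(result["big"])  # 3
```

On a cluster, the map step runs in parallel on the machines that hold each chunk of the data, and the framework shuffles the pairs so that all counts for one key land on the same reducer — the part that is simple to sketch but tedious to write and operate in Java, which is exactly the friction Cloudera is trying to remove.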

“There’s all sorts of data that never fit readily in a row or column. You could always store that data in Hadoop, but getting it out was exceptionally technically difficult,” Cloudera product manager Charles Zedlewski said Tuesday at The Economist Information Forum in San Francisco.

There are already several ways to make Hadoop easier to use. For example, most Hadoop distributions include something called Pig, a high-level scripting language whose programs compile down to MapReduce jobs. And there are many connectors that integrate Hadoop with other database servers and data warehouse systems, such as Oracle and HP Vertica, so that users can stick with tools they already know. But Cloudera is trying to go one step further by building a search engine for Hadoop.

“Tens of thousands of people know how to write MapReduce, millions of people can do SQL queries, but billions of people know how to use a search engine,” Zedlewski said.

Cloudera Search can integrate with the Hadoop Distributed File System or with HBase — a NoSQL database also based on a Google white paper. Users can type what they’re looking for and get a list of results — just as they would with a Google search. The tool is based on Apache Solr, an open source search engine. Solr has been around since 2004, but underwent a major update last year that added features for using the tool across large computer clusters. Solr is based on Lucene, an open source search library created by Doug Cutting, who also created Hadoop.
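Because Cloudera Search builds on Solr, querying it looks like querying any Solr deployment: a free-text query goes into the `q` parameter of an HTTP request to a collection’s `/select` handler, and ranked results come back. Here is a hedged sketch of how a client might construct such a request; the host, port, and collection name below are made-up placeholders, not part of any real deployment.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, collection, user_input, rows=10):
    """Build a Solr-style /select URL for a free-text search.

    Solr's select handler takes the query in the `q` parameter and
    returns ranked matches, much like a web search engine.
    """
    params = urlencode({"q": user_input, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"

# Hypothetical deployment: host and collection name are illustrative.
url = build_solr_query("http://localhost:8983/solr", "hdfs_docs",
                       "quarterly sales report")
print(url)
```

The point of the product, per Zedlewski’s pitch, is that end users never see any of this plumbing — they just type into a search box — but the Solr foundation means existing Solr clients and tooling work against data stored in HDFS or HBase.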

A search results page from Cloudera Search.

“Every additional route to data hosted in Hadoop is a good thing for the platform,” RedMonk analyst Stephen O’Grady told us via e-mail. “From traditional MapReduce jobs to SQL-like layers such as Hive or Pig to search, each is one more avenue through which people can become productive with the data.”

Cloudera isn’t alone in this approach. Cloudera competitor MapR has a Hadoop search solution as well: it integrates LucidWorks Search, which is also based on Solr. Meanwhile, the open source Lily Project provides integration between Solr and HBase.

Although Cloudera does sell some proprietary Hadoop management tools, Cloudera Search is open source, and it will be included in the company’s free Cloudera Distribution Including Hadoop.

This is a step forward for Hadoop usability, but the big question is whether customers really need to put all their data in Hadoop. Earlier this year, Microsoft Research published a paper arguing that most companies don’t have data problems large enough to justify big clusters of servers. Even Yahoo and Facebook, two of the companies most associated with big data, are using clusters to solve problems that could actually be handled on a single server, the paper says.

But many companies’ data sets are constantly growing, and starting with Hadoop can be a good way to prepare for that growth. RedMonk, for example, has long used Hadoop for its “medium data” needs, running it on a single server with tools like BigSheets, an Excel-style interface for Hadoop. It’s not an unreasonable approach — the Microsoft Research paper offers some tips for running Hadoop on a single machine in a “scale-up” environment, as opposed to a large scale-out setup.

RedMonk has started to shift away from Hadoop because its data hasn’t grown quite the way its analysts expected two years ago. “Most of our datasets these days are smaller in nature,” O’Grady says. He says RedMonk is now using other tools like Google’s BigQuery. But he still thinks Hadoop is good for those with growing data sets.

“If we could get more data more easily, however, we certainly would use Hadoop.”
