Is Hadoop just the flavor of the day?

Hadoop is becoming a popular choice for large organizations that need to store and process large volumes of unstructured data. But will they abandon it if something better comes along?

According to an article on the website Computing, eBay Senior Director of E-commerce Darren Bruntz told an audience at this week’s Teradata conference that he wants to see “more focus and energy” from Hadoop’s open-source development community or else eBay might abandon its use of the big data platform in the coming years. That’s one of the strongest rebukes of the Hadoop community that I’ve seen, but the community appears poised to deliver on Bruntz’s challenge.

Bruntz described eBay’s three-platform big data environment that includes two separate Teradata data warehouse systems, as well as a Hadoop cluster. Although he expects this system to be in place for a few more years, Bruntz said that in the future, “we could perhaps move to a single platform” if something were to emerge that meets eBay’s needs.

That’s a meaningful statement because it comes from a huge Hadoop user (eBay’s Hadoop cluster is currently well over a dozen petabytes), but it’s not an entirely new sentiment. Forrester’s James Kobielus has predicted that Hadoop will be “the nucleus” of next-generation enterprise data warehouses, which presumably is the type of single-platform system that Bruntz wants, but Kobielus thinks that’s still three to five years out. Database analyst Curt Monash isn’t so sure about Hadoop as the foundation of data warehouses, but he did offer this assessment of Hadoop’s future:

Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, perhaps unless it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.

The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.

Teradata, eBay’s data warehouse vendor, recently acquired Aster Data Systems, which brings with it a MapReduce engine that’s not associated with Hadoop. To whatever degree it’s technically feasible, Teradata could conceivably deliver a single big data platform at some point. Alternatively, there is a new big data entrant, HPCC Systems, which is pushing a Hadoop alternative that already does more than Hadoop, and it could develop its technology further to address even more use cases.

However, if you read Cloudera CEO Mike Olson’s great blog post earlier this week, you know that the Hadoop community is already working hard on evolving Hadoop beyond its roots. Buried in an assessment of whether Yahoo’s (and Hortonworks’) claims to Hadoop dominance are justified, Olson pointed out the following:

In the early days, if you wanted to use Hadoop, you loaded data into the system by hand and coded up Java routines to run in the MapReduce framework. The broad community recognized these problems and invented new projects to address them — Apache Hive and Apache Pig for queries, Apache Flume and Apache Sqoop (both incubating) for data loading, Apache HBase for high-performance record storage and more. … That ecosystem has exploded in recent years, and most of the innovation around Hadoop is now happening in new projects. That’s not surprising — as Hadoop has matured, the core platform has stabilized, and the community has concentrated on easing adoption and simplifying use.

But even if the Apache Hadoop community can’t sufficiently address the needs of companies like eBay, the greater Hadoop ecosystem likely will. There are companies such as Hadapt actually trying to make Hadoop the core of a data warehouse, and vendors such as Oracle, EMC and IBM are all working to align Hadoop more tightly with their data warehouse and analytic database products. There’s also MapR, which is pushing a Hadoop distribution (the basis of EMC’s Enterprise Edition) that it says is more technologically advanced and better suited for business users. And, of course, there is Cloudera.

As different as their focuses might be, the one commonality is that all of these vendors rely on Apache Hadoop for technology and/or customer base. They will either drive the Apache project to address their needs and incorporate their innovations or they will integrate their own tweaked versions of Apache Hadoop into their products. Other approaches will surely exist, and some probably will thrive, but it doesn’t look like Hadoop is going anywhere.