

Mike Driscoll, CEO of Metamarkets

It is quite natural for a groundbreaking technology to arrive, be celebrated, enable new types of applications, and then see expectations grow beyond its ability to deliver. The crest of this cycle is now occurring in the world of Hadoop, which could be headed for a hangover.

For those of us interested in understanding how applications will use big data to do new things, it is not enough to simply declare that claims for Hadoop’s power are exaggerated. We must understand what Hadoop does well, what it cannot do, and why. It is equally important to understand how Hadoop is evolving, so that we can anticipate what it will be able to do. The complete picture of Hadoop that I want to draw will emerge over a series of articles.

This is the first article in the series; it focuses on the views of Mike Driscoll, CEO of Metamarkets, a company that specializes in real-time digital marketing analytics. Driscoll, who was CTO before taking the CEO role, has a deep understanding of most technology related to big data and strong views about the strengths and weaknesses of Hadoop.

Driscoll’s Take on Hadoop

Driscoll’s critique is that Hadoop’s data-crunching power provides only a fraction of a complete application. “It excels,” he says, “at batch processing of large-scale, unstructured data, such as web server logs. Hadoop is a foundational technology, but it is not a database, it is not an analytics environment, and it is not a visualization tool. By itself, it is not a solution for helping businesses make better decisions.”

Hadoop is based on a computational model called MapReduce, in which data is distributed across many servers rather than shipped to a central computing hub, which would be highly inefficient. Hadoop maps computations across the distributed data and crunches (“reduces”) it in place rather than moving it. Its file system and plumbing allow a data processing algorithm to run in parallel against hundreds or thousands of shards of the data and then aggregate all the partial answers into a complete answer. Related projects have been developed to augment Hadoop, including HBase, a distributed database; Hive, a data warehouse structure with SQL-like capabilities; and Pig, which orchestrates sequences of Hadoop jobs.
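To make the map and reduce phases concrete, here is a minimal sketch of the canonical word count job written against Hadoop’s Java MapReduce API. The class names and the input and output paths are illustrative, not drawn from any system described in this article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel against each shard of the input,
  // on the node where that shard is stored, emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: aggregates the partial counts for each word
  // into a single complete answer.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. web logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // counts per word
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note what the framework does for you here: it splits the input, schedules the mappers next to the data, shuffles the intermediate pairs, and runs the reducers. Note also what it does not do, which is precisely Driscoll’s point: nothing in this job answers an ad hoc query, serves an interactive user, or draws a chart.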

Expectations for what Hadoop can do have soared in the wake of discussions about big data. The “hangover” Driscoll predicts will occur because what Hadoop can actually do is far narrower in scope than what the market expects. There is a lot of misunderstanding around Hadoop.

There is also some tension between Hadoop’s original purpose and the needs of the broader world of big data. “Hadoop is the first successful big data technology,” Driscoll says, “but we are witnessing the emergence of other tools that complement it.”

He frames the argument around Hadoop’s position within the stack. At the bottom of the stack, where Hadoop resides, infrastructure has seen successive waves of commoditization, from hard drives to operating systems.

“The top of the stack is where business users interface with technology -- whether as desktop applications or web services. Hadoop drives tremendous value behind the scenes -- just as Linux, Apache, and MySQL do -- but it is a back-end technology, not a front-end solution. To get value from Hadoop, organizations must build applications on top of it.” And building such applications is not easy.

Driscoll notes the following reasons for the Hadoop hangover:

Hadoop is not a database. Users who want to run fast ad hoc queries for real-time decision making are disappointed. For that kind of work it is slower than alternatives such as SAP HANA, not to mention high-end databases like Oracle Exadata, IBM Netezza, or HP Vertica.

Hadoop is hard to set up, use, and maintain. Grid computing is difficult in and of itself, and Hadoop doesn’t make it any easier. Hadoop is still maturing from a developer’s standpoint, let alone from the standpoint of a business user. Because only savvy Silicon Valley engineers can derive value from Hadoop today, it will not make inroads into larger organizations without a lot of handholding and professional services.

Hadoop is neither real time nor interactive. If a company is doing continuous processing of a stream of tweets, check-ins, advertising impressions, or point-of-sale purchases, all of which must be handled in real time, Hadoop isn’t the best solution, especially when there are alternatives like Kafka, an open source project that came out of LinkedIn and handles distributed stream processing. Metamarkets uses a two-tiered model: Kafka for real-time processing and Hadoop for batch processing (see the sketch after this list).

Hadoop has no front-end visualization tool. Without a visualization tool, users can’t perform analytics directly on top of Hadoop -- and thus can’t realize its value -- unless other pieces are integrated on top of the stack.
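To illustrate the streaming tier of a two-tiered model like the one Metamarkets describes, here is a minimal sketch of a consumer reading an event stream with Kafka’s Java client. The broker address, group id, and topic name are hypothetical, and the consumer API shown is the modern one, which postdates the period discussed here; this is a sketch of the pattern, not Metamarkets’ actual pipeline.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ImpressionStreamConsumer {
  public static void main(String[] args) {
    // Hypothetical broker address, group id, and topic name.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "realtime-analytics");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("ad-impressions"));
      while (true) {
        // Poll for new events and process each one as it arrives,
        // rather than waiting for a periodic batch job to run.
        ConsumerRecords<String, String> records =
            consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          // A real pipeline would update live aggregates here;
          // this sketch simply prints the event.
          System.out.printf("impression: %s%n", record.value());
        }
      }
    }
  }
}

In a two-tiered design, the same events would also land in Hadoop’s file system, where batch jobs compute the deeper historical aggregates: the streaming tier answers what is happening right now, while the batch tier answers what happened over weeks or months.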

Hadoop is not going away anytime soon. Despite its limitations, it adds real value behind the scenes at the bottom of the stack. What’s Driscoll’s advice on how to avoid a Hadoop hangover? “Treat it as a strong wine, an essential complement, but not a meal by itself.”

Dan Woods is CTO and editor of CITO Research, a publication that helps CIOs and CTOs optimize the present and build the future. He consults for many of the companies he writes about. For more stories like this one visit www.CITOResearch.com.