Opinion
Hadoop: What is it Good For? Absolutely Something

Enterprises have options. One of the questions I asked the Hadoop case-study participants for the upcoming Forrester report was whether they had considered using the tried-and-true approach of a petabyte-scale enterprise data warehouse (EDW).

It’s not a stretch, unless you are a Hadoop bigot and have willfully ignored the commercial platforms that already offer shared-nothing massively parallel processing for in-database advanced analytics and high-performance data management. If you need to brush up, check out my recent Forrester Wave for EDW Platforms.

Many of the case studies did in fact consider an EDW such as those from Teradata and Oracle. But they chose to build out their “Big Data” initiatives on Hadoop for many good reasons. Most of those are the same reasons any user adopts any open-source platform. By using Apache Hadoop, they could avoid paying expensive software licenses, could give themselves the flexibility to modify source code to meet their evolving needs and could avail themselves of leading-edge innovations coming from the worldwide Hadoop community.

But the basic fact is that Hadoop is not a radically new approach to processing extremely scalable data analytics. You can use a high-end EDW to do most of what you can do with Hadoop; the core features that characterize most real-world Hadoop deployments – petabyte scale-out, in-database analytics, mixed workload support, cloud-based deployment, and complex data sources – are all available there. And the open-source Apache Hadoop codebase, by its devotees’ own admission, still lacks such critical features as real-time integration and the robust high availability that you find in EDWs everywhere.

So – apart from being an open-source community with broad industry momentum – what is Hadoop good for that you can’t get elsewhere? The answer to that is a mouthful, but a powerful one.

Essentially, Hadoop is vendor-agnostic, in-database analytics in the cloud: an open, comprehensive, extensible framework for building complex advanced analytics and data management functions and deploying them into cloud computing architectures. At the heart of that framework is MapReduce, a programming model for developing statistical analysis, predictive modeling, data mining, natural language processing, sentiment analysis, machine learning, and other advanced analytics at scale. Another linchpin of Hadoop, Pig, is a versatile language for building data integration processing logic.
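To make the MapReduce model concrete, here is a minimal sketch of the map → shuffle → reduce flow, simulated in plain Python rather than on a Hadoop cluster; the word-count job is the conventional teaching example, and the function names are mine, not Hadoop's API.

```python
# Simulate the MapReduce programming model in plain Python.
# On Hadoop, the map and reduce phases run in parallel across nodes;
# here they run serially to show the data flow.
from collections import defaultdict

def map_phase(record):
    # Mapper: emit (key, value) pairs -- here, a 1 for each word seen.
    for word in record.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    # Shuffle/sort: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: aggregate all values for one key.
    return key, sum(values)

def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_job(["big data big analytics", "big hadoop"])
# counts["big"] == 3, counts["hadoop"] == 1
```

The power of the abstraction is that the same three-phase pattern scales from this toy to petabytes: the developer writes only the map and reduce logic, and the framework handles distribution, shuffling, and fault tolerance.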

The bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade. The roles that Hadoop is likely to assume in your EDW strategy are the dominant applications that it’s being used for here and now:

Hadoop as petabyte-scalable staging layer. The most frequent application I hear for Hadoop is in support of the “T” in ETL, but where the data being extracted and transformed come in myriad unstructured, semi-structured, and structured formats, and where the loading may be to a traditional EDW or, more commonly, to terabyte-scale analytical data marts where predictive modelers and other data scientists work their magic.
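A hedged sketch of that “T” step on mixed-format input: a mapper-style transform parses whatever arrives – JSON here, key=value records there – into uniform structured rows ready to load into an EDW or data mart. The field names and formats are invented for illustration.

```python
# Normalize semi-structured source records into structured rows,
# the transformation step a Hadoop staging layer typically performs
# before loading into a warehouse or data mart.
import json

def transform(raw_line):
    # Accept either JSON or simple key=value records; normalize both
    # into one row shape, quarantining anything unparseable rather
    # than failing the whole job.
    try:
        if raw_line.lstrip().startswith("{"):
            rec = json.loads(raw_line)
        else:
            rec = dict(kv.split("=", 1) for kv in raw_line.split())
        return {"user": rec["user"], "event": rec["event"]}
    except (ValueError, KeyError):
        return None

rows = [r for r in map(transform, [
    '{"user": "u1", "event": "click"}',
    'user=u2 event=view',
    'garbage line',
]) if r is not None]
# Two clean rows survive; the malformed line is filtered out.
```

Because each record is transformed independently, this step parallelizes trivially across mappers, which is why staging is such a natural first Hadoop workload.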

Hadoop as petabyte-scalable event analytics layer. Another key use for Hadoop is petabyte-scale log processing of event data for call detail record analysis, behavioral analysis, social network analysis, IT optimization, clickstream sessionization, anti-fraud, signal intelligence, and security incident and event management. There are more traditional EDW architectures that support these applications, but the industry push is toward Hadoop and other “NoSQL” approaches, such as graph databases, to handle high-volume event processing both in batch and in real time.
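Clickstream sessionization, one workload on that list, illustrates the pattern well: group each user's events, sort by timestamp, and cut a new session whenever the gap between events exceeds a timeout. The sketch below uses a 30-minute window, a common industry convention rather than anything Hadoop mandates.

```python
# Illustrative sessionization of clickstream events: per-user grouping
# (the "shuffle by user" a MapReduce job would do) followed by a
# timestamp sort and gap-based session splitting.
from collections import defaultdict

SESSION_GAP = 30 * 60  # seconds; a conventional 30-minute timeout

def sessionize(events):
    # events: iterable of (user_id, unix_timestamp) pairs
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > SESSION_GAP:
                # Gap too long: close the current session, start a new one.
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions

demo = sessionize([("u1", 0), ("u1", 100), ("u1", 10000), ("u2", 50)])
# u1's long gap splits its events into two sessions; u2 has one.
```

At petabyte scale the grouping step is exactly what MapReduce's shuffle provides for free, which is why log-heavy event analytics gravitated to Hadoop early.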

Hadoop as petabyte-scalable content analytics layer. This is an area where Hadoop’s comprehensive MapReduce modeling layer is its key competitive differentiator. Next best action, customer experience optimization, and social media analytics are every CRM professional’s hottest hot-button projects. They absolutely depend on plowing through streaming petabytes of customer intelligence from Twitter, Facebook, and the like and integrating it all with segmentation, churn, and other traditional data-mining models. MapReduce provides the abstraction layer for integrating content analytics with these more traditional forms of advanced analytics, and Hadoop is the platform for MapReduce modeling.
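A toy sketch of what that integration looks like in the map/reduce pattern: the mapper scores unstructured social text and keys the result by a customer segment taken from a traditional churn or segmentation model, so content analytics and structured modeling meet in one pipeline. The word lists, segments, and scores below are all invented for illustration.

```python
# Combine content analytics (crude lexicon-based sentiment scoring of
# social posts) with a traditional structured model (customer segments)
# using the same map -> group-by-key -> reduce pattern.
from collections import defaultdict

POSITIVE = {"love", "great"}          # hypothetical sentiment lexicon
NEGATIVE = {"hate", "awful"}
SEGMENT = {"u1": "high_value", "u2": "at_risk"}  # e.g. from a churn model

def map_post(user, text):
    # Mapper: score the unstructured text, key by structured segment.
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in text.lower().split())
    yield SEGMENT.get(user, "unknown"), score

def sentiment_by_segment(posts):
    # Reducer: total sentiment per customer segment.
    totals = defaultdict(int)
    for user, text in posts:
        for segment, score in map_post(user, text):
            totals[segment] += score
    return dict(totals)

sentiment = sentiment_by_segment([
    ("u1", "love this product"),
    ("u2", "awful service hate it"),
])
# e.g. {"high_value": 1, "at_risk": -2}
```

Real deployments would use far richer NLP than a word list, but the shape is the point: unstructured and structured inputs flow through one MapReduce abstraction.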

Hadoop is not a religion, any more than traditional EDWs are a religion (though some people seem to hold the latter in similar regard, aligning themselves with this or that EDW architectural school). In order to promote itself into enterprise primetime, the Hadoop industry needs to focus on what its up-and-coming approach does better than EDWs, or does best within the context of a traditional EDW architecture.