Hadoop 2.0: New Big Data Possibilities

Hadoop 2.0 will move beyond batch processing to support interactive, online and streaming applications. But don't let warnings about YARN tie you up in knots.

Hadoop 2.0 will be announced within a matter of days, and a new YARN framework component at its core promises to "take Hadoop beyond MapReduce," according to Arun Murthy, chairman of the Apache committee overseeing the release. Moving beyond slow, iterative MapReduce processing is obviously a good thing, but just what are the new possibilities?

Better SQL querying, graph analysis and stream processing are all on the short list, according to Murthy, a Yahoo veteran who co-founded Hortonworks. He describes YARN (short for Yet Another Resource Negotiator) as a kind of large-scale, distributed operating system for big data applications. As is typical with operating systems, there's some question about what will and what won't work with YARN, but more on that later.

Most Hadoop adopters are treating the platform as a data lake or ocean for all company information, says Murthy, but they want to be able to use the information in multiple ways roughly falling into four categories: batch, interactive, online and streaming.

"As you look through the entire life cycle of that data, and as data is coming in, you want to process it quickly and efficiently and tackle whatever application you have in mind," Murthy says.

SQL is where human-interactive queries come in, and that could be through Hive. HBase, the Hadoop NoSQL database, is an online processing option. Storm (developed by Twitter) is a stream-processing option. Apache Giraph is an option for graph analysis. Spark is an option for high-speed, in-memory analytics on top of Hadoop. MPI (Message Passing Interface) is a parallel-computing framework used for modeling workloads such as assessing risk, optimizing pricing and other advanced analytic applications.

And then there are what Murthy calls the "great big honking batch jobs across six, nine or 12 months of data where you're processing hundreds of terabytes or even petabytes of data." That's where MapReduce comes in.

"All of these things have been refactored to work on top of YARN," says Murthy.

Of course Hive, HBase and other options have been available alongside MapReduce for some time, but before Hadoop 2.0, the system was designed to be a single-application system, setting up competition for resources. Run a complex Hive query or one of those great big honking MapReduce jobs and you're likely to lock up resources and prevent any other application from running with anything like predictable performance.

YARN's job is to allocate resources across all the applications running on top of Hadoop to enable them to run simultaneously and with consistent levels of service to end users. This extends to supporting internal or external service-level agreements, quality-of-service standards and administrative control, according to Murthy.
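
As a rough illustration of what allocating resources across applications means, the sketch below divides a cluster's memory among the four application categories by percentage share. It is a simplified model, not YARN code: the function name, the 1,000 GB cluster size and the share values are all assumptions made up for this example.

```python
# Illustrative only: these names and numbers are hypothetical
# and are not part of any YARN API or configuration.

def split_capacity(total_memory_gb, shares):
    """Divide cluster memory among application classes by percentage share."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    # Integer arithmetic keeps the per-class figures exact for whole percentages.
    return {name: total_memory_gb * pct // 100 for name, pct in shares.items()}


# Assumed example: a 1,000 GB cluster split across the four categories
# of use that Murthy lists (batch, interactive, online, streaming).
shares = {"batch": 40, "interactive": 30, "online": 20, "streaming": 10}
allocation = split_capacity(1000, shares)
print(allocation)
```

Under these assumed shares, a large batch job can use only its 40 percent slice, so interactive queries and streaming work keep a predictable amount of capacity. A real scheduler is considerably more involved than this fixed split.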

"Instead of having [simplistic] queues for each of your classes of applications, you can decide how much resource you want to give to which class of application," he explains.

The only caveat with YARN is that it's part of the Apache Hadoop framework and is, therefore, designed to allocate resources to Apache Hadoop components. Where does that leave Cloudera Impala, Pivotal HAWQ and the many other SQL-on-Hadoop developments that may or may not become part of Hadoop? In the case of Impala, for example, the core query engine is shipped under an Apache license, but Cloudera's Enterprise Real-Time Query (RTQ) management console for Impala is a commercial, subscription-based tool.

"It's absolutely conceivable that something like Impala or Pivotal HAWQ could come into the YARN resource-management framework," Murthy says promisingly, but then he adds the caveat.

Speaking as an executive of Hortonworks -- a company that adheres strictly to open-source code and that competes with Cloudera, Pivotal and others adding commercial components to Hadoop -- Murthy warns that "with a bolt-on system like an Impala or a HAWQ, you reinvent everything built into YARN."

With YARN inside Hadoop and a separate management system outside of the platform, the question becomes which system will control the resources, and whether the two will duplicate each other's services.

I had questions about the timing of Hadoop 2.0 and YARN, but the response from Arun Murthy came in too late for publication. It's the beta version that will be announced within a matter of days. When will it reach GA? The short answer is the second half of 2013 and into 2014, but here's Murthy's statement with more detail:

"Apache Hadoop 2.0 and YARN have been under development for 2.5 to 3 years, will be reaching final Beta shortly, with a push to final stable release within the Apache community a matter of weeks after that. At that point, MapReduce (batch data processing) and Apache Tez (interactive data processing) will be two application types that are fully tested to run on YARN. Community projects such as S4, Storm, Giraph, OpenMPI and other open source projects have been doing work to be first-class YARN applications as well, so they will now have a stable platform release to test against and finish their efforts. Commercial vendors and startups have also been doing work around YARN. For example, Continuuity is a startup that created an open source framework called Weave that makes it easy to create YARN applications.

"Bottom-line: the next wave of innovation on top of YARN has been underway for a while. How long will it take for the market to adopt Hadoop 2.0? ... Initial uptake of Hadoop 2.0 based solutions with YARN will start in the second half of 2013 with broader market adoption happening throughout 2014."