Hadoop Evolution: What You Need to Know

Monday May 23rd 2016 by Loraine Lawson

Hadoop has been hamstrung by complexity, skills shortages and a lack of standardization, but new approaches to using Hadoop are emerging.

It's been a decade since Hadoop became an Apache software project and released version 0.1.0. The open source project helped launch the Big Data era, created a foundation for most of the big cloud platform providers and changed how enterprises think about data.

Despite Hadoop's rocket evolution from a side project inspired by Google's research papers to a technology stack with major distributions and cloud providers, many enterprises still find Hadoop difficult, experts say. Rather than becoming simpler and easier, Hadoop spawned an entire ecosystem of open source tools and technologies, including Mesos, Spark, Hive, Kafka, ZooKeeper, Phoenix, Oozie, HBase -- all tied directly or indirectly to Hadoop.

In this article, we discuss:

Who is driving Hadoop adoption in the enterprise

How early design decisions hampered Hadoop

How its open source licensing model affects Hadoop

Why companies are using the cloud and platform-as-a-service (PaaS) with Hadoop

How and why companies are moving away from huge Hadoop data clusters

Making Sense of Hadoop

How can enterprise executives make sense of this sprawling Hadoop ecosystem?

"It's a struggle," acknowledged Nick Heudecker, who researches data management for Gartner's IT Leaders (ITL) Data and Analytics group. "Hadoop doesn't typically get better by improving the things that are already there; it gets better by adding new stuff on top of it, and that consequently makes things much more complicated."

Hadoop Adoption: Backed by the Business

Even trying to assess Hadoop adoption is more complicated than it should be. Last year, Gartner surveyed 284 of its Gartner Research Circle members and found enterprise Hadoop adoption was falling short of expectations, especially given its hype. Fifty-four percent of survey respondents had no plans to invest in Hadoop, and just 18 percent had plans to invest over the next two years. What's more, Heudecker noted, early adopters didn't appear to be championing further Hadoop usage.

A TDWI survey of 247 IT professionals published at about the same time pointed to a conflicting conclusion: Many enterprises (46 percent) were already using Hadoop to complement or extend a traditional data warehouse, and 39 percent were shifting their data staging or data landing workloads to Hadoop. Other surveys, like one from AtScale, reported similar adoption levels.

Philip Russom, research director for data management with TDWI Research, consulted with Gartner's Merv Adrian about the discrepancy and discovered something surprising. Gartner had primarily talked to CIOs and other C-level executives while TDWI primarily consulted with data management professionals.

"Long story short, Hadoop is not being adopted as a shared resource, owned and operated by central IT," Russom said via email. "However, it is being adopted briskly 'down org chart' as a Big Data platform and analytics processing platform for specific applications in data warehousing, data integration and analytics. And those applications are sponsored, funded and used by departments and business units - not central IT."

Heudecker said that still matches what Gartner's seeing. It may also help explain why enterprises seem to be so iffy about Hadoop: Despite Hadoop's technical learning curve, business units seem to be dabbling in it more than central IT.

"It's very rare to see enterprisewide deployments that are run as a Hadoop center of excellence, for instance," Heudecker said. "It's hard to really pin down one reason why that's happening."

One reason may simply be that business units control a growing portion of the technology spend, he said. Business users want self-service data, which can mean everything from self-service data preparation to self-service integration and analytics. It's also creating a demand for accessing Hadoop through existing business intelligence or analytic tools, but those tools still need to improve, he cautioned.

Hadoop's Persistent Problem

Hadoop has been limited by its own design as well as recent changes in the technology world.

Hadoop and its first processing engine, MapReduce, were developed as tools for an elite tier of technical specialists, and not much changed on the way to wider distribution. In many ways, the open source technology stack has been its own worst enemy, from MapReduce's disk-centric approach and demand for specialist programming skills down to Hadoop's batch-oriented approach.

"That's the big limitation with Hadoop; it's a batch-oriented data layer and, as companies start to get more serious about Hadoop, they're moving into 'how do I get real-time, how do I start impacting the business,'" said Jack Norris, senior vice president of Data & Applications at MapR, a Hadoop-derived startup. "To do that with Hadoop at the center, you've got to do a lot of things to try to make up for the fact that it's got a weak underlying data layer."

MapR avoided the problem by rewriting that data layer rather than using the Hadoop Distributed File System (HDFS), Norris added.
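The batch model Norris describes can be illustrated with a toy map/reduce word count. This is a sketch of the programming model only, not real Hadoop code: a real job would be written against Hadoop's Java API and submitted to a cluster, with intermediate results spilled to disk between phases.

```python
from collections import defaultdict

# Toy illustration of the MapReduce model -- not actual Hadoop code.
# The three phases mirror what the framework runs across a cluster.

def map_phase(lines):
    """Map: emit (word, 1) pairs for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big jobs"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Every job, however small, pays the full map-shuffle-reduce round trip, which is why the model is batch-oriented by nature.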

MapReduce and Hadoop were also originally designed to run clusters on commodity hardware back when memory was very expensive, Heudecker pointed out. That need has diminished as memory prices have fallen and in-memory processing has become cheaper.

That's where Spark shines, since it uses in-memory processing, which is faster than a disk-centric approach. Spark is getting love from companies ranging from IBM, which has opened a Spark technology center and introduced a number of Spark-centric solutions, to Cloudera, which made Spark a focal point of its latest release. Proprietary appliances that leverage in-memory processing have also come to market, further skewing the market for Hadoop.
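The disk-versus-memory distinction can be sketched in plain Python. This is an illustration of the idea, not Spark's actual API: the "disk-centric" pipeline materializes every intermediate stage to a file and reads it back, as classic MapReduce does between jobs, while the "in-memory" pipeline keeps intermediate results in RAM.

```python
import json
import os
import tempfile

data = list(range(1000))

def disk_pipeline(values):
    """Disk-centric style: stage 1 writes its output to disk,
    stage 2 reads it back -- the classic MapReduce pattern."""
    path = os.path.join(tempfile.mkdtemp(), "stage1.json")
    with open(path, "w") as f:
        json.dump([v * 2 for v in values], f)   # intermediate result on disk
    with open(path) as f:
        doubled = json.load(f)                  # next stage reads it back
    return sum(v for v in doubled if v % 3 == 0)

def memory_pipeline(values):
    """In-memory style: the intermediate result stays in RAM,
    which is the core of Spark's speed advantage."""
    doubled = [v * 2 for v in values]           # cached in memory
    return sum(v for v in doubled if v % 3 == 0)

assert disk_pipeline(data) == memory_pipeline(data)
```

Both produce the same answer; the difference is that the disk round trip, multiplied across many stages and terabytes, dominates job runtime.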

But no matter how you mix up the ecosystem, these open source tools still aren't easy. That is Hadoop's most persistent problem: It requires skills that even large enterprises struggle to hire for.

The Open Source Conundrum

Hadoop's open source licensing model also played an unintentional role in driving complexity, Heudecker said.

"Open source has been effectively weaponized by these vendors so everyone has a vested interest in X project versus Y project, depending on where you have allocated your committers that work for your company," he said. "Open source is phenomenal; it really is. It has completely changed the game for how enterprises look at acquiring software, but it's not this altruistic effort any more. There's big money in open source software. So you'll see some companies supporting project X over project Y because that's what they ship."

The open source community may also be more focused on developing the technology over supporting data management best practices. Many Hadoop data lakes either don't support or offer inadequate support for audit trails, data integrity, data quality, encryption or data governance, Russom said.

"It's not all rainbows and unicorns," Russom wrote. "I don't see the open source community caring much about these issues in a Hadoop environment."

Hadoop, Amazon and New Tools

That may be why more companies are looking to the cloud to handle Hadoop. Gartner estimates that Amazon has more than twice as many users of Elastic MapReduce (EMR), its Hadoop service, as all of the startup Hadoop distributors combined. The cloud allows companies to separate compute from storage, so they can spin up more clusters as needed, then tear them down rather than maintaining them simply to store the data.
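The compute/storage separation can be modeled in a few lines. This is a toy illustration, not EMR code: durable object storage (think S3) outlives any cluster, and a throwaway cluster is created per job, reads from shared storage, and is torn down without losing any data.

```python
# Stand-in for durable object storage such as S3 -- it outlives clusters.
durable_storage = {"events.csv": ["a,1", "b,2", "a,3"]}

class EphemeralCluster:
    """Toy model of a throwaway compute cluster that reads from
    shared durable storage rather than holding data itself."""

    def __init__(self, storage):
        self.storage = storage
        self.running = True

    def run_job(self, key):
        # Compute happens here; nothing persists on the cluster afterwards.
        rows = (line.split(",") for line in self.storage[key])
        return sum(int(value) for _, value in rows)

    def tear_down(self):
        # Shutting down compute does not touch the durable data.
        self.running = False

cluster = EphemeralCluster(durable_storage)
total = cluster.run_job("events.csv")
cluster.tear_down()
```

Because the data survives the cluster, there is no need to keep an idle cluster running just to store it, which is the cost case the article describes.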

Other vendors are also introducing new tools to help close the data capabilities gap, Russom pointed out.

"Enterprises are faced with a lot of complex choices: What's the right technology option for this use case, which vendors are going to be the most viable, and will I have the skills to actually run this stuff at scale," Heudecker said. "In the short term, you're going to see a lot of companies kicking the tires on the cloud and they may be looking at platform-as-a-service vendors to bridge the skills gap."

Enterprises also should be aware that these tools can sometimes come with major limitations, according to Pentaho CTO and founder James Dixon.

"It's not all at the same level of maturity, sophistication, completeness," Dixon said during a recent interview. "Some of these capabilities weren't built into the design of the software in the first place, so you may find there are major limitations with these new features because you're taking a technology that just wasn't designed to do that. So I would say be very cautious of using the new whiz-bang features that suddenly arise. Be very cautious of those because they're not mature, they weren't designed in, so there may be architectural design flaws or just major limitations that you're not aware of."

Not Just a Big Cluster

There's also a key shift in how vendors and analysts see Hadoop's role in the enterprise. It's no longer about dumping all the data into one huge data lake -- although data lakes do have a role as archives and sandboxes, experts say. Instead, it's about "connections, not collections," Heudecker said.

"The trend has been to collect a bunch of data together and then analyze it. That's expensive and it's hard to do," he said. "I think it's much easier to leave the data where it is and do your consolidation logically with metadata. So you're leaving data in its legacy store, and you're saying, 'Alright, let me bring in what I need, I'll build out my analysis and then do push-down processing to the relevant platform.' It's much more advanced, and very early, but I think that's a more viable strategy than saying let's consolidate everything into this big cluster."

Emerging Approaches to Hadoop

Enterprises can expect to see a similar message from vendors as new offerings come to market. Pentaho's recent Business Analytics 6.1 release supports Heudecker's and Russom's observations. Pentaho is a data integration and data analytics company, but the new release's headline feature is metadata injection for Big Data in both Hadoop and traditional environments.

When discussing MapR's new Converged Data Platform, Norris pointed out it's not just about pooling data, but reaching data where it lives and incorporating it into business decisions.

"It is very different than what's possible with Apache Hadoop alone without that kind of converged data platform," Norris said. "The companies that are really getting the biggest payoff from their investments in data are the ones incorporating it into the business flow; so things like performing billions of transactions or billions of events a day."

Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.