Cloud In, Hadoop Out as Hot Repository for Big Data

Alex Woodie

If you’re in charge of architecting a big data strategy for your organization, you have a lot of tough decisions to make. One of the easiest, however, may be to use cloud repositories to store and process the biggest sets of data instead of Hadoop — and the Hadoop ecosystem vendors are going along with it.

Public clouds are quickly becoming the primary store for big data analytics for all but the most heavily regulated industries. While the financial services industry can’t readily adopt the cloud, the cost advantage of storing and processing data in public clouds like Amazon Web Services, Microsoft Azure, and Google Cloud Platform is quickly enabling the cloud to gain a foothold in nearly every other industry.

Analysts and vendors agree that we’re currently seeing a massive migration of dollars and data into clouds. Gartner, for instance, says worldwide public cloud services will grow 18% this year to become a $247 billion business, and that cloud will account for the majority of analytics purchases by 2020. Forrester, meanwhile, pegs cloud growth at 19%, and says it will be a $162 billion business in three years.

This projected growth appears to come, in part, at the expense of on-premise Hadoop deployments. While the Hadoop hype bubble has clearly popped, that doesn’t mean Hadoop won’t continue to help customers for years. In fact, anecdotal evidence suggests that companies will continue to build and run Hadoop clusters to manage and process big data – ostensibly using Apache Spark, which continues to gather steam.

But the smart money says the installed base of HDFS and YARN clusters in 2020 will not be as big as the industry widely believed during the peak Hadoop years of 2012 to 2014. Meanwhile, some companies are most definitely moving their data off Hadoop and into cloud repositories.

The day-to-day management overhead associated with on-prem computing is one factor driving data and workloads to the cloud (Dario Lo Presti/Shutterstock)

“We’ve seen really big companies who have big Hadoop clusters move their data to the cloud because they’re much more agile,” says Lloyd Tabb, the founder and CTO of data platform vendor Looker. “We keep seeing this over and over. Big time companies who I can’t name…spent all this money building these big Hadoop infrastructures, and then they say, we can spend that engineering on figuring out answers to things instead of managing the cluster.”

Looker’s Web-based software works equally well in on-premise and cloud scenarios. But according to Tabb, cloud-based analytic warehouses like AWS’ Athena and Redshift, Snowflake’s analytical service, and Google’s BigQuery have become hot commodities these days.

The cost advantages of using pre-built, cloud-based data warehouses are too big to ignore, Tabb says. “It’s too cumbersome to try to build a team to manage your own data center, basically your own clusters, so managed services in data is really where it’s at,” Tabb tells Datanami. “If you’re spending all your time managing the lake, you don’t have time to do the analysis.”

Those time and cost advantages of the cloud are shifting the debate. Pivotal Labs recently unveiled a new release of its Greenplum MPP data warehouse that is 100% open source and supported on all the major cloud platforms. The idea is to give customers more freedom than they get by using Redshift or BigQuery, says Elisabeth Hendrickson, VP of data and R&D for Pivotal Labs. “As far as I know we’re the only data warehouse that you can run anywhere you want to on any infrastructure that you want to and it’s entirely based on open source,” she says.

A similar dynamic is unfolding in the eyes of Ash Munshi, the CEO of Pepperdata, which develops software to help tune Hadoop and Spark workloads sitting on prem or in the cloud. According to Munshi, the big decisions are not whether to use Spark or Hadoop, but which cloud to utilize. “They’re not making technology bets. I think they’re making strategic bets. How much do I do on prem versus how much do I do on cloud, and how much do I do on hybrid?” he says.

AtScale initially targeted Hadoop with its BI and analytics enablement software, which meant ensuring that the software plays nicely with HDFS and YARN and the rest of that ecosystem. But these days, AtScale’s focus is on supporting all the public clouds.

The company’s capability to analyze data from multiple sources with a single query has endeared it to customers like Home Depot that have invested in on-premise and cloud data repositories. “The store managers are using Excel as their BI tool, and they don’t know the difference,” says AtScale CEO Dave Mariani. “To them, it’s all a single data source, even though on the backend the data is coming from two different places, on-prem and in the cloud.”

As Hadoop’s influence wanes, you might think that the Hadoop distributors would be bucking this trend and fighting to maintain the importance of the yellow elephant in the data center. But that’s not the game plan that Cloudera and Hortonworks have created for themselves. In fact, both companies are embracing the notion that their data management tools and data processing engines can work equally well with on-prem Hadoop clusters and cloud-based object stores like Amazon S3 and Microsoft ADLS.

Specifically, at Strata, Cloudera unveiled its Shared Data Experience (SDX), a suite of tools to unify the governance and security models associated with the data stored and processed in cloud and on-premise systems. The idea is to help clients build on a common foundation, says Cloudera product manager Alex Gutow.

Like all overhyped technologies, Hadoop’s expectations have come back to reality

“As folks are moving to the cloud, if you’re just getting all these compute resources without having that shared data context, then it becomes a lot more of these silos, and all of a sudden you have to redefine all the security policies,” she says. “You have to rebuild the metastore. You have to start from scratch with every cluster. This is where we get the same unification even as we start to take advantage of these isolated resources.”

Hortonworks, meanwhile, sees its newly announced Data Plane Service (DPS) as the third leg in a triumvirate of tools for managing data at rest and in motion. Similar to Cloudera’s SDX, Hortonworks’ DPS offering allows customers to extend the security and governance controls they’ve defined in Apache Ranger and Apache Atlas out to any data store, including on-premise Hadoop, cloud-based object stores, MPP data warehouses, and any combination.

“Think about this as phase two of the company,” Hortonworks co-founder and chief product officer Arun Murthy tells Datanami. “As they’ve gotten comfortable with cloud model, they want this notion of a service. They prefer the notion of a service because now it’s something they can just use, rather than have to operate and manage.”

The cloud has clearly emerged as a major repository for enterprise data. Getting value out of the data still remains a challenge,