Can you run Hadoop in the cloud? And is it the right choice?

As the Director of Big Data and Data Science at Pythian, I often get questions from clients about the many solutions available to them to address their big data needs. Between Hadoop, cloud-based, and hybrid solutions, finding the best option for their unique needs can be a daunting challenge.

Sometimes Hadoop is the answer, and sometimes a cloud solution is a better fit. Determining the right direction requires us to consider multiple factors, including the specific need, the use cases, budget, available resources within the organization, and data volume.

With this post, I’m going to try to answer some of the most common questions we hear from clients: At what point is big data big enough to require a system like Hadoop? What are its limitations? How does it compare to the cloud? And can you run Hadoop in the cloud?

The cost and benefits of Hadoop

Hadoop is ideal for batch processing at terabyte to petabyte scale. Because Hadoop can store and process any type of data, from plain text files to binary files like images, or different versions of data collected over time, it’s ideal if your use case requires you to store large volumes of unstructured data.
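Hadoop’s batch model is easiest to see in the classic word-count job. Here is a minimal sketch in the style of a Hadoop Streaming mapper and reducer, written in plain Python; the function names are illustrative, and in practice you would wire the two stages together with the Hadoop Streaming jar rather than call them directly:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: Hadoop delivers mapper output grouped and sorted by key,
    # which we simulate here with sorted() + groupby; sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Local illustration of the full map-shuffle-reduce pipeline.
counts = dict(reducer(mapper(["big data big", "data"])))
```

On a real cluster, the same mapper logic runs in parallel across thousands of input splits, and the framework handles the shuffle and sort between the two phases.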

On the surface, it appears to be a cost-effective way to handle these big data workloads: it runs on clusters of commodity hardware and can scale from a single node to thousands of nodes. And the software itself is open source, so it’s free.

But on the other hand, Hadoop is not a single solution; it’s a framework that requires you to build your data warehouse from the ground up. That means your solution can take a lot of time to deliver and require specialized engineering resources. Once it’s built, though, you can continue to scale it to suit your needs.

How Hadoop has helped evolve big data in the cloud

Many Hadoop ecosystem projects are easily reusable in the cloud and have been widely adopted in cloud architectures. Most notably, Spark and Kafka were developed as part of the Hadoop ecosystem, but they have adapted even better to the cloud than to the environment for which they were originally built.

So how do you handle large data sets that include existing Hadoop environments?

Spark and Kafka work very well in the cloud alongside other scalable message bus systems like Google Cloud Pub/Sub, Amazon Kinesis, and Azure Event Hubs.

There are also some good distributed SQL engines built for the cloud, including Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse. For SQL workloads, these are better alternatives than Hadoop.
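The draw of these engines is that they run standard analytical SQL over very large tables without any cluster to manage. As a toy illustration of the kind of aggregation query they excel at, the sketch below runs the SQL against an in-memory SQLite database purely for demonstration; the table and column names are made up, and in practice the same query shape would run in BigQuery, Redshift, or Azure SQL Data Warehouse over billions of rows:

```python
import sqlite3

# Stand-in database: in a cloud warehouse this table would be
# columnar, distributed, and far too large for a single machine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", "click", 120), ("a", "view", 300), ("b", "click", 80)],
)

# A typical analytical aggregation: total traffic per user.
rows = conn.execute(
    "SELECT user_id, SUM(bytes) AS total_bytes "
    "FROM events GROUP BY user_id ORDER BY total_bytes DESC"
).fetchall()
```

The point is that the interface is just SQL: no job packaging, no cluster sizing, and the engine parallelizes the scan and aggregation for you.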

Maintaining a large permanent cluster in the cloud is expensive, especially at scale. So instead of running a permanent HDFS cluster, you might rely on cloud object storage and one of the SQL engine alternatives mentioned earlier.

Another challenge is that there is really no universal solution for security. While you might have a preferred cloud platform from one vendor, your Hadoop vendor may use another. Using a mix of solutions can result in a steep learning curve, and possibly switching costs. And when it comes to ephemeral clusters in the cloud, governance and security become even more challenging. There is not yet an ideal solution to these problems.

So yes, it is possible, but the costs and barriers that are created when you run Hadoop in the cloud generally make alternatives more appropriate.

How does Hadoop perform in the cloud?

If you compare Hadoop to cloud-native solutions machine for machine, on clusters of the same size and configuration, you can see that cloud providers have made significant advances.

You can likely achieve similar, or perhaps slightly lower, performance in the cloud than you would on bare metal. But in reality there is more to cloud economics than raw performance.

Rather than adopting a Hadoop framework in the cloud, you might consider some of the alternative architectures that provide scalability features (like cloud SQL engines) that are not available with Hadoop ecosystems.

While you can achieve similar performance, there are simply better alternatives for meeting your large-scale data processing needs than implementing Hadoop clusters in the cloud.

The bottom line: Going straight to the cloud for big data analytics

While Hadoop on its own is a reliable, scalable solution, the engineering costs and the inefficiencies of batch-processing queries make it inappropriate for most use cases.

And as for implementing Hadoop in the cloud, you may end up needlessly building something from scratch that was already available as a service from one of the major cloud platform providers.

This is something we experienced first-hand during a recent client project, where we migrated a large on-premises Hadoop cluster with petabytes of data to Google Cloud Platform. Once the migration was complete, not much was left of the original Hadoop architecture: there were some basic Spark jobs that we could run on elastic clusters in the cloud, but no need to maintain a permanent cluster for the long term.

Companies are increasingly looking to adopt big data solutions, but many of them are skipping Hadoop entirely and jumping straight to a cloud solution for these reasons. So when you start planning for your next big data project, it is important that you consider cloud-based solutions as an alternative to Hadoop—you may just find that it meets your needs faster and at a lower cost.

Download this webinar to discover the benefits many businesses are currently reaping by taking their data and analytics to the cloud.
