“BigData” is a term that has been buzzing around for the last few years, and wherever you hear that buzz, you will hear “Hadoop” as well. In the last 2-3 years, many big players in the industry, such as Intel, Microsoft, IBM, and EMC, have come up with their own distributions of Apache Hadoop. Some startups focused solely on Hadoop, such as Cloudera and Hortonworks, have also become big players in this area.

Each Hadoop distributor claims that its distribution is the best one out there. Each distribution has some unique features that may be really useful for one set of users and of little use to another.

With so many distributors, choosing the one that matches your requirements can be non-trivial, especially when you are spending money on a distribution and support.

Update: The free white paper comparing the Hadoop distributions is ready for download! Click here or check the resources section on the sidebar to download it.

User Bases:

There are multiple user bases that may need to deploy Hadoop. Some of them are listed below:

1. Senior management at a company that wants to move to BigData solutions using Hadoop.
2. A developer building a tool in the Hadoop ecosystem.
3. A newbie learning Hadoop and looking for a temporary/non-serious Hadoop deployment.

Keeping these user bases in mind, we have completed a thorough study of the following distribution sources, which will be covered in a 6-part series.

Through this series, we’ll share our experience with each of these distributors and provide subjective as well as objective results of the feature/performance comparisons we did. This will help you shortlist the distributors, based on your requirements.

Study Requirements:

AWS EC2 instances (5-node cluster) - We installed each of these distributions on the instances and studied them for the feature comparisons.

Intel’s HiBench benchmarking utility - For the performance comparisons.

Intel HiBench

HiBench is a benchmarking suite for Hadoop deployments, developed and open-sourced by Intel. [You can read more about Intel HiBench and each of its benchmark tests here.]

We ran the following benchmarks from the HiBench suite:

Sort

This workload sorts its input data, which is generated using the Apache Hadoop RandomTextWriter example. It is representative of real-world MapReduce jobs that transform data from one format to another.
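
To make the idea of a distributed sort concrete, here is a minimal single-process sketch in Python. It is not HiBench's or Hadoop's code. It only assumes the common pattern of range-partitioning records by key, sorting each partition independently (as separate reducers would), and concatenating the partitions in order. The function names and the split points are hypothetical and chosen for illustration.

```python
from bisect import bisect_right


def range_partition(records, split_points):
    """Place each record in a partition based on sorted split points.

    Every key in partition i is smaller than every key in partition i + 1,
    so the partitions can be sorted independently of each other.
    """
    partitions = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        partitions[bisect_right(split_points, record)].append(record)
    return partitions


def mapreduce_style_sort(records, split_points):
    """Sort by partitioning first, then sorting each partition on its own."""
    partitions = range_partition(records, split_points)
    # Each partition stands in for the input of one reducer.
    sorted_partitions = [sorted(partition) for partition in partitions]
    # Concatenating the partitions in order yields a globally sorted list.
    return [record for partition in sorted_partitions for record in partition]


words = ["pear", "apple", "zebra", "mango", "kiwi", "banana"]
result = mapreduce_style_sort(words, split_points=["h", "p"])
print(result)
```

On a real cluster the partitions live on different nodes, which is why this kind of job stresses disk and network I/O more than CPU.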

WordCount

This workload counts the occurrences of each word in the input data, which is generated using Apache Hadoop RandomTextWriter. It is representative of real-world MapReduce jobs that extract a small amount of interesting data from a large data set.
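
The map, shuffle, and reduce steps behind a word count can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the Hadoop or HiBench implementation; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The output holds one entry per distinct word, which is typically far smaller than the input. That is the "small amount of interesting data" this workload is meant to represent.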

PageRank

This workload is an open-source implementation of the PageRank algorithm, a link-analysis algorithm widely used in web search engines.
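
For readers unfamiliar with PageRank, the following Python sketch shows the basic iterative computation on a tiny graph. It is an illustration only, not the implementation HiBench runs. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the example graph are all assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [linked pages]}.

    Every linked page must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

Each iteration passes rank along every link. On a cluster, one iteration corresponds to one or more MapReduce jobs, so the workload exercises repeated passes over the same data.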

Mahout Bayesian Classification

This workload runs the Bayesian classification implementation from Mahout. It represents a typical application area of MapReduce: large-scale data mining and machine learning (for example, in Google and Facebook platforms).