That could be what’s on an admin’s mind during their first attempt at deploying Hadoop. It’s not necessarily hard to install, but understanding how to scale it and work with it takes a proper investment of time.

How about we try to make Hadoop easier for everyone to understand and use? That’s what the team in the Open Innovations Lab at EMC thought, and they’ve now released a full whitepaper called “EMC Hadoop Starter Kit – EMC Isilon and VMware Big Data Extensions for Hadoop”. Now you might wonder what Isilon and VMware have to do with Hadoop, and I’ll come to that in just a bit.

Hadoop + Serengeti + Isilon = AWESOME

First, let’s look at what type of Hadoop distribution we’re talking about deploying here. There are different distributions (or versions) of the lovely elephant Hadoop out in the wild. The most notable ones are Pivotal HD, Hortonworks, Cloudera and of course the original open source Apache one. For the purpose of this whitepaper, the Open Innovations Lab team has decided to start with the Apache Hadoop distribution.

Now what about VMware and Hadoop? We’re talking about virtualizing Hadoop here, something that’s usually a big “heck no” in Hadoop circles. But most companies with an existing VMware virtualization environment are sure to have a lot of resources just sitting there idle and ready to use. Why not use them for Hadoop and help your organization get some good, real information out of all that data you’re already storing? Other benefits of virtualizing Hadoop are:

Rapid provisioning – quickly creating a new cluster or node when needed

High availability – Protect single points of failure like the NameNode with the help of VMware HA

Elasticity – Scale your Hadoop cluster to the size you want it to be with resources still shared with other applications in your virtualized environment

Multi-tenancy – Run multiple Hadoop clusters in the same environment, dividing up data but centralizing management

Portability – Use and mix any of the popular Hadoop distributions (Apache, Pivotal HD, Cloudera, Hortonworks) with no data migration

Some of you might now wonder how we can achieve zero data migration, as the data is usually tied to a Hadoop cluster through HDFS. Well, that’s been taken care of as well, thanks to the inclusion of EMC Isilon in the whitepaper. Isilon is the only scale-out NAS platform with HDFS natively integrated, meaning we can create HDFS filesystems and mount them on any new cluster or node that’s created.
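
To make the “no data migration” idea a bit more concrete, here is a minimal sketch in Python of how several compute clusters could share one HDFS endpoint. It is not taken from the whitepaper: the hostname `isilon.example.com`, the port 8020 and the helper `core_site_xml` are assumptions for illustration. The only real Hadoop piece is the `fs.defaultFS` property in `core-site.xml` (older Hadoop releases call it `fs.default.name`).

```python
import xml.etree.ElementTree as ET


def core_site_xml(hdfs_host: str, port: int = 8020) -> str:
    """Build a minimal core-site.xml that points a cluster at a shared HDFS endpoint.

    hdfs_host and port are illustrative assumptions, not values from the whitepaper.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    # fs.defaultFS tells Hadoop clients which filesystem to use by default.
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{hdfs_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical shared storage endpoint (for example, an Isilon hostname).
SHARED_ENDPOINT = "isilon.example.com"

# Two independent compute clusters get the same filesystem setting,
# so neither one owns the data and nothing has to be copied between them.
cluster_a_config = core_site_xml(SHARED_ENDPOINT)
cluster_b_config = core_site_xml(SHARED_ENDPOINT)

print(cluster_a_config)
```

Because both configurations point at the same endpoint, tearing down one cluster or spinning up a new one with a different distribution leaves the data where it is. Your actual setup steps will differ, so follow the whitepaper for the real thing.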

By separating compute and data, we achieve elasticity in both. Want more compute? Scale up your VMs. Need more data? Scale up your storage. This lets you start your Journey to Big Data in a more cost-effective and efficient manner. So, how do we piece it all together? With the vSphere Big Data Extensions, powered by something called Project Serengeti (the Serengeti is a large area in Africa, home to large animals like elephants, get the reference? :)). They give you as an administrator an easy-to-use interface to create, manage, scale and decommission Hadoop clusters in your environment.

For the full whitepaper including all the step-by-step instructions on how to get your own Hadoop Starter Kit going, have a look here: