Get Started with Apache Hadoop on Rackspace Cloud

Disclaimer: This document details a process intended for
educational purposes only. This will not deploy a production
environment.

What is Apache Hadoop?

Hadoop is an open source project that provides a platform for storing and
processing massive amounts of data. Hadoop uses the MapReduce paradigm to
split large jobs into many smaller tasks and execute them in parallel.
Each task is executed close to its data in the Hadoop Distributed File
System (HDFS).

Hadoop Use Cases

In a very short time, Hadoop has been adopted across almost every
business sector. Actual use cases involving Hadoop include:

Objective

This document is for educational purposes only and will provide you with
an example of how to get started with Apache Hadoop in the Cloud. You
will learn how to launch a Hadoop cluster starting with 2 nodes and
growing it to 64 nodes. During this process you will learn how to:

This will install the Chef server and the knife-rackspace plugin, upload
the Chef hdp-cookbooks, and configure Chef to talk to the Rackspace
Cloud using your account. You can now use the knife client to interact
with the Rackspace Cloud and configure your Hadoop cluster.
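Before creating any servers, it is worth confirming that knife can reach both the Chef server and the Rackspace Cloud API. A minimal sanity check, assuming your credentials are already in knife.rb, might look like:

```shell
# Quick sanity check of the knife setup.
knife client list            # talks to the Chef server
knife rackspace server list  # talks to the Rackspace Cloud API
```

If both commands return without errors, the toolchain is ready for the steps below.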

Choosing the Image

You need a CentOS 6.2 image as the base image for the server to install
Hadoop.
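If you do not already know the image ID for your region, one way to find it is to list the available images with knife-rackspace and filter for CentOS 6.2 (image IDs vary by region, so the exact output will differ):

```shell
# List Rackspace images and keep only the CentOS 6.2 entries.
knife rackspace image list | grep -i 'centos 6.2'
```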

Choosing the Flavor

Creating your Environment

To avoid conflicts with other Hadoop clusters in the same account, we
will create a Chef environment called YourName in which to create our
Hadoop cluster. We will save this name in an environment variable so we
can reference it later.
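A sketch of the two steps above, assuming YourName is replaced with your own identifier (the EDITOR=true trick is one common way to create the environment non-interactively):

```shell
# Save the environment name for later reference.
export CHEF_ENV="YourName"   # replace with your own name

# Create the Chef environment without opening an interactive editor.
EDITOR=true knife environment create "$CHEF_ENV" \
  --description "Hadoop cluster environment for $CHEF_ENV"
```

Subsequent knife commands can then target this environment via $CHEF_ENV.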

This script will download all of Shakespeare's works from Project
Gutenberg, upload them to HDFS, and run a MapReduce word count against
the text.
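A hedged sketch of the workflow such a script performs; the Gutenberg URL, HDFS paths, and example-jar location are illustrative and vary by Hadoop distribution:

```shell
# Download the complete works of Shakespeare (illustrative URL).
curl -sL https://www.gutenberg.org/cache/epub/100/pg100.txt -o shakespeare.txt

# Upload the text to HDFS.
hadoop fs -mkdir -p /user/$USER/shakespeare/input
hadoop fs -put shakespeare.txt /user/$USER/shakespeare/input/

# Run the word count example that ships with Hadoop (jar path varies).
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount \
  /user/$USER/shakespeare/input /user/$USER/shakespeare/output

# Show the most frequent words.
hadoop fs -cat /user/$USER/shakespeare/output/part-* | sort -k2 -nr | head
```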

Adding More Nodes

So far, you have created only hadoopworker1. Keep adding more worker
nodes following the same process, incrementing the hadoopworker number
each time. Run and benchmark your application to see how it performs as
the cluster grows.
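An illustrative knife-rackspace invocation for the next worker; the role name, flavor, and image ID are assumptions to be replaced with the values from your setup:

```shell
# Bootstrap the next worker node into the same Chef environment.
# Image ID, flavor, and role name below are placeholders.
knife rackspace server create \
  --image <centos-6.2-image-id> \
  --flavor 4 \
  --environment "$CHEF_ENV" \
  --run-list 'role[hadoop-worker]' \
  --node-name hadoopworker2
```

Repeat with hadoopworker3, hadoopworker4, and so on, up to the cluster size you want to test.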

Once you feel comfortable, you can also experiment with different flavor
sizes to see what works best for your application.

Deleting the Cluster

If you are done with your computation, you may want to delete the
cluster and free up its resources. To do this, you need the server ID of
each server you want to delete.
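One way to find the IDs and remove a server, sketched with knife-rackspace (the server ID is a placeholder):

```shell
# List servers to find the ID of the node you want to remove.
knife rackspace server list

# Delete the cloud server; --purge also removes its node and client
# objects from the Chef server.
knife rackspace server delete <server-id> --purge
```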

Repeat the process for all the servers in the cluster by replacing
$HADOOP_W1_IP with the IP for the appropriate worker number.

Summary

In this document, we showed you how to interact with the cloud using
tools and scripts. We also demonstrated how to get started with Apache
Hadoop on a couple of cloud servers and scale the cluster as your needs
grow.