Hadoop on Amazon's Cloud

There are lots of ways to run Hadoop, but what if you want to start working with it right away, without the distraction of building a cluster yourself? Your best bet is probably a cloud-based Hadoop cluster, and the Elastic MapReduce (EMR) service on Amazon Web Services (AWS) can get you there pretty speedily.

To get an EMR cluster up and running, you'll need to create an AWS account at http://aws.amazon.com, and you'll want to create a security key pair too. There are several other steps of course, and we'll cover them, one by one, in this gallery.

Published: January 7, 2013 -- 14:30 GMT (06:30 PST)

Caption by: Andrew Brust

Pick a distro

Amazon refers to the process of standing up an EMR cluster as creating a "job flow." You can do this from the command line, using a technique we'll detail later, but you can also do it from your browser. Just navigate to the EMR home page in the AWS console at https://console.aws.amazon.com/elasticmapreduce, and click the Create New Job Flow button at the top left. Doing so will bring up the Create a New Job Flow dialog box (a wizard, essentially), the first screen of which is shown here.

An EMR cluster can use Amazon's own distribution of Hadoop, or MapR's M3 or M5 distribution instead. M5 carries a premium billing rate, as it is not MapR's open source distro.

Sample applications

Those just experimenting with Amazon's Elastic MapReduce can get started immediately by running a sample application, rather than running their own code on their own data. Amazon offers WordCount (the ubiquitous Hadoop sample application), as well as a Hive-based contextual advertising sample, Java- and Pig-based log analysis samples, and a Java-based CloudBurst sample that analyzes DNA sequencing data.

Run your own app

If you need to do production work, or just want to conduct a more free-form Hadoop experiment, you'll want to select the option to run your own application. Picking HBase and clicking Continue is best, as this lets you add Hive and Pig as well.

Specify Parameters

The Specify Parameters screen allows you to configure backup options for your HBase cluster, and/or to create the new cluster by restoring from an existing backup.

If you just want to play in the sandbox, you can disregard the backup options. Do make sure, however, to select the Hive and Pig checkboxes in the Install Additional Packages section at the bottom of the screen, then click Continue.

Configure EC2 instances

In the Configure EC2 Instances screen, you'll need to select an Instance Type for your Master and Core Instance groups. Amazon's "m1.large" instance type is the minimum required for an EMR cluster. If you're creating a cluster just for learning purposes, this will be your least expensive and therefore most sensible option. Select it for both the Master Instance Group and Core Instance Group.

Instance counts

With your instance types selected, you now need to set the number of instances in your Core and Task Instance groups. Again, if you're just putting up a cluster for learning purposes, you will want to minimize the resources you're using, so change the Core Instance Group's Instance Count from the default setting of 2 to just 1. Leave the Task Instance Group's Instance Count at its default of 0, and click Continue.

Advanced options

When you provisioned your AWS account, you should have created at least one EC2 key pair. Pick one for your EMR cluster. Without it, you won't be able to establish a secure terminal session and work interactively with Hadoop. Once you've selected a key pair, click Continue.

Bootstrap actions

You needn't worry about bootstrap actions, so just click Continue through this screen.

Review

In the Review screen, confirm that your instance types, instance counts and key pair configuration are all correct. If not, click the Back link and amend your settings as appropriate. Once everything is correctly configured, click Create Job Flow.

Job flow created

If all goes well, you should see this screen confirming that your EMR job flow has been created. Click Close so that you can monitor the status of your cluster as it's stood up.

Job flows

The EMR Job Flows screen should display the job flow you just designed. Confirm the state of the job flow is "STARTING." An animated orange spinner should appear in the job flow's row, in the leftmost column in the grid.

The command line

Would you rather do all the previous steps in one fell swoop? You can, although a number of preparatory steps are required. The Amazon Web Services Elastic MapReduce Command Line Interface (AWS EMR CLI) makes all the previous interactive selections completely scriptable. Amazon provides complete instructions for downloading the CLI and completing all prerequisite steps, including creating an AWS account, configuring credentials and setting up a Simple Storage Service (S3) "bucket" for your log files.

If you're running on Windows, download and install Ruby 1.8.7 (which the EMR CLI relies upon), then download and install the EMR CLI itself. From a Command window (a.k.a. DOS prompt), you'll be able to navigate to the EMR CLI's installation folder and enter a command like the one shown here, which creates an EMR job flow with Hive, Pig and HBase, based on an m1.large EC2 instance.
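The screenshot's command isn't reproduced here, but a job-flow creation command along these lines would do the job. This is a sketch: the flag names reflect the Ruby-based EMR CLI of this era and may vary with your CLI version, and the job flow name is a placeholder.

```shell
REM Create an interactive (--alive) job flow with HBase, Hive and Pig
REM on m1.large instances; one master plus one core node.
ruby elastic-mapreduce --create --alive --name "Hadoop sandbox" ^
  --hbase --hive-interactive --pig-interactive ^
  --instance-type m1.large --num-instances 2
```

The `--alive` flag keeps the cluster running after any initial steps complete, which is what you want for interactive use over SSH.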

If you're clever, you can embed all of this in a Windows batch (.BAT) file and create a shortcut to it on your desktop. From there, your Hadoop cluster is only a double-click away.

Once the job flow is created, proceed to the EC2 Instances screen as you would have were the job flow created interactively...

Go to EC2 instances screen

Watching the job flow's progress is useful, but you'll need some details about the particular EC2 instance serving as the head node in your cluster. Therefore, click the triangle to the right of the Services menu option, then click on the EC2 option in the resulting drop-down panel.

Select instances

There's one more step required to get to a status screen for your running EC2 instances: in the EC2 dashboard, click the Instances link along the left nav bar.

Instances screen

In the instances screen, select your instance from the top grid. As soon as you do, details about your instance appear below. One such detail is the instance's Internet host name, which you can select and copy. Once the status of your instance is "running," you're ready to connect to the cluster and start using it.

Enter host name

If you're on Windows, you'll want to download, install and then run PuTTY, the de facto SSH (Secure SHell) client for that OS. Once it's running, paste your instance's host name into the Host Name field in the PuTTY Configuration screen.

Private key file

Remember the key pair you selected when you configured your job flow? Now you need to select its private key file in PuTTY's SSH authentication screen. Select Connection\SSH\Auth from the Category tree view control on the left, then click the Browse button and navigate to the file.

The file will need to be in PPK format, conversion to which can be performed by the PuTTYgen utility that accompanies PuTTY, as described in the EMR CLI instructions. After you've selected the file, click Open to begin your SSH session with your EMR cluster.
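If you're on a Mac or Linux machine, OpenSSH works in place of PuTTY, and it accepts the original .pem file with no PPK conversion. The key file name and host name below are placeholders; substitute your own.

```shell
# Restrict permissions on the private key, or ssh will refuse to use it.
chmod 400 MyKeyPair.pem

# Connect to the cluster's head node as the "hadoop" user.
ssh -i MyKeyPair.pem hadoop@ec2-00-00-000-00.compute-1.amazonaws.com
```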

Add host key

Right after you click the Open button, you'll probably see a scary-looking dialog like this one. But have no fear, as it's actually harmless. If you click Yes to add the server's host key to PuTTY's cache, you won't see this message again for this particular job flow.

Log in!

You're almost there! When you see the "login as:" prompt in PuTTY's terminal window, enter "hadoop" (without the quotes) and tap Enter. That should log you in.

Welcome to your cluster

Upon successful login, you should see a welcome screen and be presented with a command prompt. The message telling you how to gain access to the Hadoop "UI" should be taken with a grain of salt, however, as that user interface is presented in Lynx, a text-based Web browser.

The bin folder

Switch to the bin folder (using the "cd bin" command) and list its contents (using the "ls" command). You will see that Hadoop, HBase, Hive and Pig are all neatly installed for you.

They're ready to run, too. To check this out, enter the "hive" command, and you'll be placed at Hive's command line prompt.
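Once you're at the Hive prompt, HiveQL statements run much as SQL would. The session below is purely illustrative; the table name is a placeholder and nothing like it exists on a fresh cluster.

```shell
$ hive
hive> CREATE TABLE test_tab (id INT, msg STRING);
hive> SHOW TABLES;
hive> quit;
```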

HBase prompt and "grunt"

Use the "hbase shell" command to get to the HBase prompt or use the "pig" command to get to the Pig prompt (called "grunt").

Although not shown here, you can also use the "hadoop fs" command to perform Hadoop Distributed File System (HDFS) operations and, of course, the "hadoop jar" command to run a Hadoop MapReduce job.
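A few illustrative commands of that sort follow. The JAR name, class name and paths are placeholders for whatever your own MapReduce job uses, not files that exist on the cluster.

```shell
hadoop fs -ls /                               # list the HDFS root
hadoop fs -mkdir /user/hadoop/input           # create an input directory
hadoop fs -put data.txt /user/hadoop/input/   # copy a local file into HDFS

# Run a MapReduce job packaged in a JAR, reading from and writing to HDFS.
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
```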

Change termination protection

When you're all done, don't forget to terminate the instances in your cluster, otherwise you will continue to be billed for them! To terminate the instances, you'll first need to select the Change Termination Protection option in the Actions menu, shown here.

Disable termination protection

Now click the Yes, Disable button.

Terminate your instance

Now you're ready to terminate the instance. Select the Terminate option from the Actions menu.
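If you created the job flow from the command line, you can tear it down the same way, once termination protection has been disabled. The job flow ID below is a placeholder; yours appears in the Job Flows grid, and the exact flag names may vary with your CLI version.

```shell
ruby elastic-mapreduce --terminate j-XXXXXXXXXXXXX
```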
