
Build a Hadoop Cluster on Windows Azure

Getting started with Hadoop can be tough. There’s a steep learning curve involved with getting even vaguely familiar with all the ins and outs.

If you want to get hands-on experience, the first step is finding a platform to play with. Fortunately there are a lot of good options available. The Hortonworks quick start VM, Amazon Elastic MapReduce and Microsoft’s HDInsight are all easy ways to hit the ground running.

If those preconfigured solutions seem a little bit too magical and/or if you’re a do-it-yourself-er then building your own cluster can be a fun (and educational) exercise. And if you’re one of those unlucky few who don’t have a spare rack of servers lying around, you’re still in luck! All you need to do is turn to the cloud.

This post will tackle building your own Hadoop cluster in the cloud. Specifically, I’ll be using Windows Azure and the Hortonworks Data Platform (HDP) 2.1.

0 – Sign up for Windows Azure

If you haven’t already, sign up for Windows Azure. You can get a free trial for $220 worth of cloud services for 28 days. You can only use a maximum of 20 CPU cores during the free trial, but that should be enough for our purposes.

1 – Name The Components

The first step is to write down some basic information about your cluster. When you start adding the bits in Azure, it’s going to ask you to name things, and I find it easier to write down all the names first.

The domain name you plan to use. I chose albertlockett.ca

A naming scheme for the nodes in your cluster. I like to name my Namenodes and Datanodes differently and number them, so I called them agl-datanode<#> and agl-namenode<#>

A hostname for your DNS Server. I called mine agl-dns1

Name of the network you plan to use. I called mine agl-network

You should also choose where you want everything to be located. Azure has a number of data centres all over the world, so choose one close to you. I’m using East US because I live in Halifax and that one is closest.

With that information, here is what we’re going to build: 3 datanodes, 1 namenode and 1 DNS server.
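To keep the names straight, it can help to print the whole plan once. This is only a sketch: it uses my naming scheme and domain from above, it assumes the DNS server gets a name under the same domain, and the function name planned_hosts is just something made up for illustration.

```shell
#!/bin/sh
# Naming plan from step 1; swap in your own prefix and domain.
DOMAIN="albertlockett.ca"
PREFIX="agl"

# Print one fully qualified name per planned machine.
# $1 = number of datanodes (numbered from 0, e.g. agl-datanode0).
planned_hosts() (
    count="$1"
    # Assumption: the DNS server is named under the same domain.
    echo "${PREFIX}-dns1.${DOMAIN}"
    echo "${PREFIX}-namenode.${DOMAIN}"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "${PREFIX}-datanode${i}.${DOMAIN}"
        i=$((i + 1))
    done
)

planned_hosts 3
```

Running it lists five names: the DNS server, the namenode and three datanodes.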

2 – Setup the Network

I like to keep my Namenodes, Datanodes and DNS on separate subnets, so we’re going to set up a network with 3 subnets:

Subnet 1 10.124.1.0 /24 for the DNS

Subnet 2 10.124.2.0 /24 for the Datanodes

Subnet 3 10.124.3.0 /24 for the Namenodes
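If you want a quick way to check which subnet an address belongs to, something like the following works. It’s a rough sketch that leans on every subnet being a /24, so the first three octets are enough to tell them apart. The function name subnet_for_ip is my own invention.

```shell
#!/bin/sh
# Map an address to the subnet plan above. This shortcut only works
# because every subnet is a /24: the first three octets identify it.
subnet_for_ip() (
    case "$1" in
        10.124.1.*) echo "Subnet 1 (DNS)" ;;
        10.124.2.*) echo "Subnet 2 (Datanodes)" ;;
        10.124.3.*) echo "Subnet 3 (Namenodes)" ;;
        *)          echo "outside the plan" ;;
    esac
)

# The internal DNS server address used later in this post.
subnet_for_ip 10.124.1.4
```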

Go to your Azure management portal and choose new

+NEW -> NETWORK SERVICES -> VIRTUAL NETWORK -> CUSTOM CREATE

Configure the network like this:

Virtual Network Details

Name – The name of your network. I used agl-network

Location – The location you plan to use. I used East-US

DNS Servers and VPN Connectivity

I’m going to add 2 DNS Servers, an internal DNS and a public one.

In the first line add your internal DNS information: Name – agl-dns1, IP – 10.124.1.4

In the next line add a public DNS. I used Google’s public DNS: Name – google-public-dns-a.google.com, IP – 8.8.8.8

Virtual Network Address Spaces

Add 3 subnets with the following information:

Name – Subnet 1, Starting Address – 10.124.1.0, CIDR – /24

Name – Subnet 2, Starting Address – 10.124.2.0, CIDR – /24

Name – Subnet 3, Starting Address – 10.124.3.0, CIDR – /24

3 – Build the DNS

We’re going to use Ubuntu Server 12.04 with Bind9 to do the DNS. The first step is to create the virtual server.

Create DNS Virtual Server

In the Azure management portal choose

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

Pick Ubuntu from the list on the left

Pick Ubuntu Server 12.04 LTS from the list of Ubuntu Images

Virtual machine Configuration

Version Release Date – Just choose the newest

Virtual Machine Name – Choose the name of your DNS Server. I chose agl-dns1

Now our DNS Server is configured. Next we can create an image to use as a template for our Datanodes and Namenode

3 – Create Node Image

In this step we’re going to create a base image that we can copy for each node in the cluster. The machine we create this image from will also serve as our first datanode, so we’ll configure it as such. We’ll still have to do some configuration for each node individually, but creating this image will save us from repeating a few steps. We’re going to use CentOS 6 as the base operating system for our cluster, but Azure calls it OpenLogic 6.5.

First create/configure the new image from the gallery:

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

Pick Centos-Based from the list on the left

Pick OpenLogic 6.5 LTS from the list of Centos-Based images

Virtual machine Configuration

Virtual Machine Name – Choose the name of your first datanode. I chose agl-datanode0

Configure the Image

Now log into the new node and we’ll do some base configurations

ssh albertlockett@agl-datanode0.cloudapp.net

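Disable SELinux. This command switches it to permissive mode for the current session:

sudo setenforce 0

And make sure it’s permanently disabled by editing the SELinux config file:

sudo vim /etc/selinux/config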

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
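If you’d rather script that edit than open vim on every machine, here is one way it could look. Treat it as a sketch: disable_selinux_in is a helper name I made up, and the demo runs against a scratch file rather than the real /etc/selinux/config, which you would need root to change.

```shell
#!/bin/sh
# Set SELINUX=disabled in an SELinux config file.
# $1 = path to the file (on a node: /etc/selinux/config, as root).
disable_selinux_in() (
    file="$1"
    tmp="$(mktemp)"
    # Only the line that starts with SELINUX= is rewritten; comment
    # lines and SELINUXTYPE= do not match the anchored pattern.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
    # Show the resulting setting.
    grep '^SELINUX=' "$file"
)

# Try it on a scratch copy first.
demo="$(mktemp)"
printf '%s\n' '# disabled - No SELinux policy is loaded.' \
    'SELINUX=enforcing' 'SELINUXTYPE=targeted' > "$demo"
disable_selinux_in "$demo"
rm -f "$demo"
```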

Capture the Image

Log into the Windows Azure manager. Select the virtual machine and click the Capture button

And name the image whatever you like. I called it hadoop_img

4 – Create Namenode and Datanodes

In this section we’ll create the name node and the other two datanodes. We’re going to create them from the image we produced in the previous section and we’ll configure them in the next section. Repeat the following procedure three times.

Create/configure the new virtual machine from the gallery:

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

Pick My Images from the list on the left

Pick the image you saved in the previous section from the list of images. I called mine hadoop_img

Virtual machine Configuration

Virtual Machine Name – Choose the name of your new machine. I called my datanodes agl-datanode1 & agl-datanode2 and I called my namenode agl-namenode

Tier – Basic

Size – I used A3 (4 cores, 7GB Memory) for the datanodes and A4 (8 cores, 14GB Memory) for the namenode. Note: if you’re on the Azure free trial, you should use an A3 machine for the namenode so you don’t exceed the trial’s limit of 20 CPU cores.

Virtual Network Subnets – For the datanodes use Subnet 2 (10.124.2.0 /24) and for the namenode use Subnet 3 (10.124.3.0 /24)

Availability Set – (None)

Endpoints – Leave the default (SSH/TCP/22/22)

Virtual machine Configuration (3)

Install the VM Agent
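The core limit mentioned in the Size note is easy to check with a little arithmetic. Here’s a rough sketch with a made-up helper, check_core_budget. The size of the DNS machine isn’t something I’ve specified in this post, so the 1-core figure in the last call is purely a placeholder.

```shell
#!/bin/sh
# Sum the core counts given as arguments and compare with a limit.
# $1 = core limit, remaining arguments = cores per VM.
check_core_budget() (
    limit="$1"
    shift
    total=0
    for cores in "$@"; do
        total=$((total + cores))
    done
    if [ "$total" -le "$limit" ]; then
        echo "$total of $limit cores: ok"
    else
        echo "$total of $limit cores: over budget"
    fi
)

# Three A3 datanodes (4 cores each) plus an A4 namenode (8 cores).
check_core_budget 20 4 4 4 8
# The same cluster with an A3 namenode, as suggested for the free trial.
check_core_budget 20 4 4 4 4
# A4 namenode plus a hypothetical 1-core DNS machine.
check_core_budget 20 4 4 4 8 1
```

The first layout uses all 20 cores on its own, so any DNS machine at all pushes it over the trial limit. That’s why the A3 namenode is the safer choice on the free trial.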

Add Storage

We’ll attach storage to each node (3 datanodes and the namenode). Select the node in the list of virtual machines and click ‘Attach’ then choose ‘Add empty disk’.

In the next window, add some storage. I added 100GB to each node.

Configure Each Node

Log into each node and do the following configurations.

First set the root password

sudo passwd root

Then type in your login password, and then set the new root password to whatever you like.

Next, set the hostname on each machine. I’ll use the name node as an example, but be sure to repeat this procedure for each datanode (replacing agl-namenode with agl-datanode<#>, or your own node names).

sudo hostname agl-namenode.albertlockett.ca

sudo vim /etc/sysconfig/network

HOSTNAME=agl-namenode.albertlockett.ca
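Since this has to be repeated on every node, you might prefer to script the file edit. The following is a sketch only: set_hostname_in is a name I made up, and the demo works on a scratch file instead of the real /etc/sysconfig/network, which needs root to change.

```shell
#!/bin/sh
# Write HOSTNAME=<fqdn> into a sysconfig-style network file.
# $1 = file path, $2 = fully qualified hostname.
set_hostname_in() (
    file="$1"
    fqdn="$2"
    if grep -q '^HOSTNAME=' "$file"; then
        # Replace the existing HOSTNAME line in place.
        tmp="$(mktemp)"
        sed "s/^HOSTNAME=.*/HOSTNAME=${fqdn}/" "$file" > "$tmp"
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # No HOSTNAME line yet, so append one.
        echo "HOSTNAME=${fqdn}" >> "$file"
    fi
    grep '^HOSTNAME=' "$file"
)

# Try it on a scratch copy first.
demo="$(mktemp)"
printf '%s\n' 'NETWORKING=yes' 'HOSTNAME=localhost.localdomain' > "$demo"
set_hostname_in "$demo" agl-namenode.albertlockett.ca
rm -f "$demo"
```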

Disable the firewall as well

sudo service iptables stop
sudo chkconfig iptables off

Attach Storage Disks

Next we’ll set up a place to mount the attached storage disks and make sure they mount automatically when our nodes are booted. Perform this procedure on each node.
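Automatic mounting at boot generally comes down to a line in /etc/fstab. As a rough illustration of what such a line looks like, here is a small sketch. The device /dev/sdc1, the mount point /data and the ext4 filesystem are all assumptions for the example, so check what your own nodes report before using any of them. The helper name fstab_line is made up too.

```shell
#!/bin/sh
# Build an /etc/fstab line for a data disk. It only prints the line;
# nothing on the system is changed.
# $1 = device (or UUID=... spec), $2 = mount point, $3 = filesystem type.
fstab_line() (
    # Fields: device, mount point, type, options, dump flag, fsck order.
    printf '%s\t%s\t%s\tdefaults\t0\t2\n' "$1" "$2" "$3"
)

# Hypothetical values; replace them with what your node actually has.
fstab_line /dev/sdc1 /data ext4
```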

Choose whatever version of the stack you like. I chose the newest, 2.1. Then click Next.

If you get an error, make sure the namenode can connect to the internet and can perform DNS lookups.

Next type the fully qualified domain name of each node in the text box. Include the namenode too.

Put the ssh private key into the second text box. The key is on the namenode in the file /root/.ssh/id_rsa. Copy and paste the whole contents of that file into the second text box, or you can download the file to your laptop and point to it using the Choose File button.
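A truncated paste is an easy mistake to make here, so it can be worth checking that the key file is whole before you use it. This is a sketch under the assumption that the key is in the usual PEM layout, with a BEGIN line first and an END line last. The helper name looks_like_private_key is made up, and the demo uses a dummy file rather than a real key.

```shell
#!/bin/sh
# Check that a file looks like a complete PEM private key: the first
# line is a BEGIN marker and the last line is an END marker.
looks_like_private_key() (
    first="$(head -n 1 "$1")"
    last="$(tail -n 1 "$1")"
    begin_ok=no
    end_ok=no
    case "$first" in
        "-----BEGIN "*"PRIVATE KEY-----") begin_ok=yes ;;
    esac
    case "$last" in
        "-----END "*"PRIVATE KEY-----") end_ok=yes ;;
    esac
    if [ "$begin_ok" = yes ] && [ "$end_ok" = yes ]; then
        echo "looks complete"
    else
        echo "looks truncated"
        false
    fi
)

# Demo with a dummy file (not a real key).
demo="$(mktemp)"
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' 'placeholder' \
    '-----END RSA PRIVATE KEY-----' > "$demo"
looks_like_private_key "$demo"
rm -f "$demo"
```

On the namenode you would point it at /root/.ssh/id_rsa, which requires root to read.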

Click Register and Confirm

Ambari will try to connect to each host and run some registration scripts.

This step can be a little flaky. If a node fails, check the logs by clicking Failed in the status column, then search Google for the error. Sometimes failures can be resolved by simply re-running this step a few times.

I usually ignore the warnings, but you can deal with them if you like.

Check off the boxes of the services you want to install. I like to install them all. You can stop the ones you don’t want later.

Assign the masters in this next step. I like to assign all the masters to the namenode but install ZooKeeper servers on every node.

Click Next

Assign Clients and Slaves to each datanode and click Next

In this step you can customize the configuration of each service. Fill in the boxes that are highlighted red, and accept the defaults if you like.

The only default I like to change is the YARN remote log location, which I set to /tmp/logs, but this is entirely optional.

Click Next

Ambari will start installing all the services. This step can take a while so sit back and relax.

Sometimes this step will fail too, so try re-running it if it doesn’t work.

Click Next when it’s finished

Click through the final confirmation screens and you’ll see your Ambari dashboard

What’s next?

Congratulations, hopefully you now have a working Hadoop Cluster on Azure.

You can write some MapReduce jobs or Pig Scripts, set up a Hive Database and process some streaming data with Storm.

Stay tuned for my next post where I’ll cover what it takes to write your own YARN application.