Still, you may want to install a custom distribution, run custom components which are not available in the HDInsight distribution, or have Hadoop running on Linux instead of Windows. This post shows how to install a Linux distribution on Windows Azure virtual machines.

While it is possible to install several distributions such as MapR, CDH (Cloudera), or HDP (Hortonworks) on different Linux OSs such as CentOS, SUSE, or Ubuntu, this post takes HDP 1.2 on CentOS as an example. The documentation I follow here to install the cluster is Hortonworks' guide for installing HDP 1.2.2 with Ambari on CentOS.

In this post, I assume you already have a Windows Azure account; the windowsazure.com web site provides the information you need to get one. I also assume readers are fairly advanced users, so I don't repeat details that are already documented elsewhere.

Scope

This blog post shows one way to install a Linux Hadoop cluster. It may not follow all the best practices, particularly in terms of security; the goal is to show how this kind of environment can be hosted in Windows Azure. I chose to use Windows (DNS server, web browser, scripting environment, …) because that is what is simplest for me, but it is possible to install the cluster without using Windows at all, and I give some hints on how to do that. Also, I use a mix of the portal and PowerShell because I find it easier to understand that way, but I'm pretty confident everything could be done with scripting alone.

Choosing the Windows Azure environment for the Linux cluster

There are several ways to have a local network in Windows Azure. One of them is to create a virtual network.
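A virtual network with its subnets can be described in a network configuration (netcfg) XML file and imported into Windows Azure. The sketch below shows the general shape of such a file; the virtual network name, affinity group, and address prefixes are assumptions for illustration, not values taken from this walkthrough:

```xml
<!-- Sketch of a netcfg file; n124vnet, n124affinity and the address
     prefixes are illustrative assumptions. -->
<NetworkConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2011/07/NetworkConfiguration">
  <VirtualNetworkConfiguration>
    <VirtualNetworkSites>
      <VirtualNetworkSite name="n124vnet" AffinityGroup="n124affinity">
        <AddressSpace>
          <AddressPrefix>10.0.0.0/16</AddressPrefix>
        </AddressSpace>
        <Subnets>
          <Subnet name="subnet1">
            <AddressPrefix>10.0.1.0/24</AddressPrefix>
          </Subnet>
          <Subnet name="subnet2">
            <AddressPrefix>10.0.2.0/24</AddressPrefix>
          </Subnet>
          <Subnet name="subnet3">
            <AddressPrefix>10.0.3.0/24</AddressPrefix>
          </Subnet>
        </Subnets>
      </VirtualNetworkSite>
    </VirtualNetworkSites>
  </VirtualNetworkConfiguration>
</NetworkConfiguration>
```

The three subnets match the ones used in the table of machines below.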

I also want to be able to use a browser against the cluster and have a DNS server I can easily set up and manage. In this scenario, I install a Windows Server machine that plays both roles. This is because I'm more of a Windows guy! Note that you may prefer to install a Linux-based DNS server in the virtual network and browse the cluster through an SSH tunnel.
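For the Linux-only alternative, a dynamic SSH tunnel can serve as a SOCKS proxy for the browser. A minimal sketch, assuming the cluster's cloud service answers SSH at n124.cloudapp.net:22:

```shell
# Open a dynamic (SOCKS) tunnel on local port 1080 through the cluster's
# public SSH endpoint, then point the browser's SOCKS proxy at localhost:1080.
ssh -D 1080 -p 22 benjguin@n124.cloudapp.net
```

With the browser configured to use that proxy, it resolves and reaches the cluster's internal addresses through the tunnel.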

The following table lists the machines I want to instantiate and their roles:

Server name        Server role                     Subnet
n124dns            DNS, browser (Windows Server)   subnet1
n124m              master node                     subnet2
n124w1 to n124w3   worker nodes                    subnet3

In this sample, all machines are in the DNS domain n124.benjguin.com.
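Since all nodes share that domain, the /etc/hosts entries on each machine follow a regular pattern. Here is a small sketch that generates them; the 10.0.x.y addresses are hypothetical examples, not the internal IPs Windows Azure will actually assign to your VMs:

```shell
# Generate /etc/hosts lines for the cluster. Host names and the domain
# come from the table above; the IP addresses are made-up examples.
DOMAIN=n124.benjguin.com
for pair in "10.0.1.4 n124dns" "10.0.2.4 n124m" \
            "10.0.3.4 n124w1" "10.0.3.5 n124w2" "10.0.3.6 n124w3"; do
  set -- $pair                      # $1 = IP address, $2 = short host name
  printf '%s %s.%s %s\n' "$1" "$2" "$DOMAIN" "$2"
done
# first output line: 10.0.1.4 n124dns.n124.benjguin.com n124dns
```

Appending these lines to /etc/hosts on every node gives each machine both the fully qualified and the short name of every other node.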

At the end, I’ll have these machines:

n124dns will be a Windows Server machine, though it could be a Linux machine.

Install the DNS Server

I install the DNS server on a Windows Server 2012 VM. It could also be a Linux machine, but I find it simpler to use Windows Server, which will also serve as a desktop environment for administrative tasks such as browsing Ambari or the Hadoop dashboards from a machine that has local network access to the whole cluster.
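On Windows Server 2012, the DNS role and the zone can also be set up from PowerShell rather than the GUI. A sketch, using the zone name from this post (check the DnsServer module reference for the exact parameters; the IP address below is a hypothetical example):

```powershell
# Install the DNS Server role, then create the forward lookup zone
# for the cluster. Run in an elevated PowerShell session on n124dns.
Install-WindowsFeature -Name DNS -IncludeManagementTools
Add-DnsServerPrimaryZone -Name "n124.benjguin.com" -ZoneFile "n124.benjguin.com.dns"
# Register one A record per node; repeat for each machine in the cluster.
Add-DnsServerResourceRecordA -ZoneName "n124.benjguin.com" -Name "n124m" -IPv4Address "10.0.2.4"
```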

I’ll follow the Hortonworks’ documentation for a CentOS 6 OS. In particular, chapter 1.5 explains how to prepare the environment. I’ll give URLs in the documentation so that you can have the context, as well as the main steps I follow in my environment.

Now, I have my own Linux image. I’ll use it to create the different Linux VMs I need in my cluster.

Instantiate Linux virtual machines

sudo -s
vi /etc/hosts

sudo -s
fdisk -l
grep SCSI /var/log/messages
fdisk /dev/sdc
mkfs.ext3 /dev/sdc1
mkdir /mnt/datadrive
mount /dev/sdc1 /mnt/datadrive
vi /etc/fstab

add the following line at the end of the file:

/dev/sdc1 /mnt/datadrive ext3 defaults 1 2

The other machines in the cluster can be created through the portal too, but this can also be done with a script. There are two main flavors of automation scripting in Windows Azure: the Windows Azure PowerShell module, which can be used from Windows machines, and the Command Line Interface (CLI for short, more information), which is based on Node.js and can be used from Windows, Mac, and Linux. Both can be downloaded from http://www.windowsazure.com/en-us/downloads/.

As I’m a Windows guy, I will use PowerShell here. The details on how to get started with the Windows Azure PowerShell cmdlets are available here.
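As a sketch, creating one worker node from the custom image with the service management cmdlets might look like the following. The image, cloud service, and virtual network names are assumptions based on this walkthrough, not values the post defines; check the cmdlet reference before relying on the exact parameters:

```powershell
# Create a Linux VM from the custom image, place it in subnet3 of the
# virtual network, and expose SSH on a public port. The image name
# (n124-centos-image), service name (n124) and vnet name (n124vnet)
# are illustrative assumptions.
New-AzureVMConfig -Name "n124w1" -InstanceSize "Large" -ImageName "n124-centos-image" |
    Add-AzureProvisioningConfig -Linux -LinuxUser "benjguin" -Password "********" |
    Set-AzureSubnet -SubnetNames "subnet3" |
    Add-AzureEndpoint -Name "SSH" -Protocol tcp -LocalPort 22 -PublicPort 56001 |
    New-AzureVM -ServiceName "n124" -VNetName "n124vnet"
```

Running the same pipeline with the names n124w2 and n124w3 (and distinct public SSH ports) creates the remaining worker nodes.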

NB: in order to connect to n124dns, one simple way is to select this VM in the Windows Azure management portal and click Connect:

In a remote desktop on the n124dns machine, I do the following:

I connect with admin/admin

The next screen asks for the .ssh/id_rsa private key. The simplest way is to copy it from an SSH session (benjguin@n124.cloudapp.net:22 in my case) and paste it into the browser.

In the Customize Services web page, I have the following:

I choose to remove the /mnt/resource folders because they correspond to a disk that lives with the VM and is not persisted in Windows Azure storage: that disk is destroyed when the VM is destroyed. HDFS would support it, but I want to be able to stop my whole cluster without losing HDFS data. So I change to the following:

I do the same in the MapReduce tab.

I also enter the required passwords and click Next.

The details are the following:

Admin Name : admin

Cluster Name : n124hdp

Total Hosts : 4 (4 new)

Local Repository : No

Services

HDFS

NameNode : n124m.n124.benjguin.com

SecondaryNameNode : n124w1.n124.benjguin.com

DataNodes : 3 hosts

MapReduce

JobTracker : n124m.n124.benjguin.com

TaskTrackers : 3 hosts

Nagios

Server : n124m.n124.benjguin.com

Administrator : nagiosadmin / (web@benjguin.com)

Ganglia

Server : n124m.n124.benjguin.com

Hive + HCatalog

Hive Metastore : n124m.n124.benjguin.com

Database : MySQL (New Database)

HBase

Master : n124m.n124.benjguin.com

Region Servers : 3 hosts

Oozie

Server : n124m.n124.benjguin.com

ZooKeeper

Servers : 3 hosts

…

Run

Let’s now test HDFS, Pig, and Hive ourselves in this cluster.

I open a new SSH connection to the master node (n124m, available at n124.cloudapp.net:22).
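Once connected, a minimal smoke test might look like the sketch below; the file path and table name are made up for the example:

```shell
# HDFS: write a file and read it back.
hadoop fs -mkdir /user/benjguin/smoketest
hadoop fs -put /etc/hosts /user/benjguin/smoketest/hosts.txt
hadoop fs -cat /user/benjguin/smoketest/hosts.txt

# Pig: load the file and dump it, which runs a MapReduce job.
pig -e "a = LOAD '/user/benjguin/smoketest/hosts.txt' AS (line:chararray); DUMP a;"

# Hive: create a table over the same data and query it.
hive -e "CREATE TABLE smoketest (line STRING);"
hive -e "LOAD DATA INPATH '/user/benjguin/smoketest/hosts.txt' INTO TABLE smoketest;"
hive -e "SELECT COUNT(*) FROM smoketest;"
```

Note that LOAD DATA INPATH moves the file into the Hive warehouse, so it is no longer at its original HDFS path afterwards.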