In this post, we will discuss Hadoop installation on cloud infrastructure. Although a number of posts on this topic are available across the internet, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, along with our practical observations and some tips and hints to help avoid common issues. This post also gives a basic introduction to using Amazon AWS cloud services.

Creation of Amazon EC2 Instances:

First, we need to create the necessary EC2 instances on Amazon AWS with an appropriate AMI (Amazon Machine Image). In this post, we are using Ubuntu 14.04 as the AMI for Cloudera Manager 5 installation along with the CDH 5.2 release. We are going to set up a 4-node cluster with 1 Namenode and 3 Datanodes, which is the minimum requirement for a Cloudera Hadoop cluster setup without any error messages or warnings.

Private and Public IP Addresses:

When EC2 instances are created in Amazon AWS, each instance is assigned one private IP address (used within AWS for the machines to reach each other) and one public IP address (used to access the machines from outside AWS, i.e. over the internet).

Pricing Mode:

EC2 instances are billed on an hourly basis. Whenever we are not using an instance, we can stop it and later start it again with the same AMI configuration to save on usage costs. In this case, we are billed only for the hours during which the instances were running.

The only disadvantage of stopping and starting instances is that every time we start an instance, it is assigned a new, dynamically created private and public IP address pair, and we can no longer reach the instance at its previous addresses.

Hint:

If we install Hadoop on EC2 instances directly, then either we need to keep all the instances running forever so that their private and public IP addresses never change after the Hadoop installation, or we need to terminate the instances, re-create them, and re-install Hadoop after every stop/start. Neither option is ideal for maintaining a Hadoop cluster.

So, in order to keep the cluster cost effective (so that we can stop and start the instances whenever we need to), we can make use of the Amazon VPC (Virtual Private Cloud) service and Elastic IP addresses. With these two AWS services, we can achieve static private and public IP addresses for the EC2 instances. Keep in mind that these two additional services come at extra cost, but they provide the flexibility to stop the instances whenever we do not need them running, which saves money.

In this post, we will make use of Amazon VPC, Elastic IP and EC2 Instance AWS cloud services to setup a private cloud network and maintain static IP addresses.

After logging into the AWS console, first select the VPC service; this opens the VPC dashboard as shown below.

Click on Start VPC Wizard and select VPC with a single public subnet as shown below. Provide a VPC name and leave the remaining properties at their default values. Create the VPC as HDP-VPC.

Now select Services –> EC2 to open the EC2 dashboard and launch instances into the VPC.

Namenode Instance Configuration:

Now choose the instance type for the Namenode as m3.2xlarge.

Click on Launch Instance, choose Ubuntu 14.04 as the AMI, and follow the steps shown in the screens below, in the same order.

Select 1 instance for the Namenode, select HDP-VPC (the VPC created above) as the Network, and leave the remaining properties at their default values.

Now add at least 80 GB of storage to install Cloudera Manager.

Give the instance the name CL_NN in the Tag Instance step, and create a new security group as shown below.

Creation of Security Group:

Add inbound rules as shown above for TCP ports 7180, 7182, 7183 and 7432 and SSH port 22; the other rules shown in the above screen are best left in place. In order to access this EC2 instance from any machine outside, we need to select Anywhere in the Source field.

After this page, click on the Launch button. We will be asked to create a private key pair and download it. This is the only opportunity to save the private key pair; without it, we cannot connect to these EC2 instances from outside. Give the key pair the name HDPCluster1.

Now we can see the instance running under Instances tab.

Creation of Elastic IP Address:

Go to Elastic IPs –> Allocate New Address. After the new IP address is allocated, open Associate Address and select the instance just created.

This will associate a static Private and Public IP address pair to Namenode Instance.

Create DataNode EC2 Instances:

Similar to the Namenode EC2 instance creation shown above, create 3 instances under HDP-VPC, each with 100 GB of storage, all allocated to the same security group created above. This time, choose Ubuntu 14.04 as the AMI and m3.xlarge as the instance type, and select the instance configuration as shown below.

Review the configuration and launch the instances. Then allocate three new Elastic IP addresses and associate them with the Datanode instances. Below is the list of four instances:

Install Cloudera Manager on NameNode Instance:

Now connect to the Namenode instance from a terminal on our local Ubuntu machine through SSH port 22. The commands needed to connect to an EC2 instance are shown in the screen below.

After changing the permissions on the HDPCluster1.pem file, we can use the ssh command below to connect to the EC2 instance.
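As a minimal sketch of these two steps (the key file name comes from this post; the IP address is a documentation placeholder, not a real cluster address):

```shell
# HDPCluster1.pem is the key pair saved while launching the instances;
# 203.0.113.10 is a placeholder for the Namenode's Elastic public IP.
KEY=HDPCluster1.pem
NN_IP=203.0.113.10

touch "$KEY"       # stand-in file so this sketch runs end-to-end
chmod 400 "$KEY"   # ssh rejects private keys with open permissions

# The login command itself (run it with your real Elastic IP):
echo "ssh -i $KEY ubuntu@$NN_IP"
```

Note that the default login user for the Ubuntu AMI is ubuntu, not root.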

Follow the directions shown by the installer. After successful completion, the screen will instruct us to log in on port 7180 of the Namenode hostname to reach the Cloudera Manager Admin page for the further CDH 5.2 installation.

In the SSH Login Credentials screen below, we need to select ubuntu as the username rather than root; root should not be selected here. We have to assign HDPCluster1.pem as the private key file and select the option for all hosts to accept the same private key.

If we don’t get any error messages, the installation is successful, as shown below.

Or if we get any error messages as shown below,

In this case, provide the private IP addresses and private DNS names in the /etc/hosts file on all nodes of the cluster being installed.

In the next steps, save the PostgreSQL login username and password somewhere, so that we can log into PostgreSQL manually in case of any issues in creating the metastore tables.

Next, select Continue with the default settings for the cluster configuration and follow the wizard through the first run of all the requested services. On successful start of all services, the cluster shows good health for every service, as shown below.

As all the services are showing green status, the Hadoop cluster has been successfully installed and configured, and all the services are running without any warning messages.

Sometimes the installation succeeds without any error messages if we provide public IP addresses on the page that specifies hosts for the Cloudera installation. If this fails, try providing the public hostnames instead of the IP addresses.

If that also fails, try giving the private IP addresses/hostnames. If you still get errors during installation, then you need to change the /etc/hosts file on each node of the cluster being installed.

You need to copy the IP addresses of all the nodes into the /etc/hosts file of each node in the format below.
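As an illustration only (these private IPs and hostnames are placeholders, not the actual values from this cluster), the entries for a 4-node cluster could look like the following, and the same lines must be present on every node:

```shell
# Placeholder private IPs and hostnames for a 4-node cluster; substitute
# the values from your own instances. Append the same block on every node.
HOSTS_ENTRIES='10.0.0.10  nn1.example.com  nn1
10.0.0.11  dn1.example.com  dn1
10.0.0.12  dn2.example.com  dn2
10.0.0.13  dn3.example.com  dn3'

echo "$HOSTS_ENTRIES"
# On each node, as root:  echo "$HOSTS_ENTRIES" >> /etc/hosts
```

Keeping these entries identical across all the nodes lets every host resolve every other host consistently during installation.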

Installation Failed. Failed to Receive Heartbeat from Agent
Ensure that the host’s hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).

could u pls help?

in the hosts file it looks like below: mine is a single-node cluster with CDH5 on a local system with Ubuntu 12.04 LTS.

this /etc/hosts file is very critical in this installation.
I have finished all the other steps, but always get an error about the hostname not being properly configured.
According to the Cloudera doc, /etc/hosts should look like this:
127.0.0.1 localhost.localdomain localhost
192.168.1.1 cluster01.example.com cluster01
….

but it’s not working. So I guess we’d like to see the author’s /etc/hosts file content, to see how it is set up to work.

Hi Robin, sorry for the delayed response. I have shut down my AWS cluster for now, since it is chargeable, and am maintaining an offline cluster instead, but to answer your question:

First, try listing your public IP addresses and DNS names in /etc/hosts and check whether this works.
If it is not working, then try the private IPs and DNS names in the /etc/hosts file.
Make sure that these entries are the same across all the nodes in your cluster.

Suppose you have a 4-node cluster:

then there should be entries for these 4 machines’ private IPs and DNS names in the /etc/hosts file on every one of the 4 machines.

Siva Sr.
Thank you so much for taking the time to answer my question. Sorry to take up your precious time.
Your tutorial is the simplest one for a CDH 5 install on AWS/EC2. I have learned so much from it.
Thank you for doing this.

I always get an error: “Ensure that the host’s hostname is configured properly….” and
no matter how I modify my /etc/hosts file, this error stays with me like a cancer cell….

I am pretty new to the whole concept of Big Data and Cloudera in general. I have recently registered with Amazon and got one year of free usage. So will it charge me if I try this Cloudera setup on an instance for learning purposes?