Cloudera Hadoop is easy to install, 100% open source, and includes features and fixes backported from later Hadoop versions. It also bundles related projects such as Hive™, used for data summarization and ad hoc querying, and Pig™, used for parallel computation.

So what is Hadoop? Apache™ Hadoop® is a framework that uses simple programming models to process large data sets across clusters of computers. Each machine provides local computation and storage, failures are detected and handled at the application layer, and the framework is scalable and highly available.

Skytap Cloud offers two public templates for CDH (Cloudera’s Distribution Including Apache Hadoop). The Cloudera CDH4 Hadoop Cluster contains two Hadoop node VMs and one management VM. The Cloudera CDH4 Hadoop Host contains a single host VM, which can be added to the Hadoop cluster as an additional node. You can therefore scale your Hadoop cluster to fit your needs.

Let’s move on to setting up Cloudera Hadoop. We’ll start with setting up a 2-node cluster, move on to connecting to the Hadoop manager, and end with adding nodes as needed.

Setting Up A 2-Node Cloudera Hadoop Cluster

1. Navigate to the Templates page.

2. In the Public Templates search box, type ‘hadoop’.

3. Select the Cloudera CDH4 Hadoop Cluster and create a new configuration from it.

4. Run the new configuration.

Once all the VMs start (which should take about 90 seconds) you’ll have a working Hadoop cluster with all the normal services (hdfs, hbase, hue, mapreduce, oozie, and zookeeper). This cluster is a 2-node cluster (host-1 and host-2) with a management server (manager).

Connecting to Hadoop Manager

To manage your Hadoop cluster, you’ll need to access the manager VM in this configuration. The manager cannot be accessed via SmartClient, so you’ll need to use one of the following methods to access it:

Create a VPN to connect your local network to this configuration. This will likely be the best way to connect if you’re using your Hadoop Cluster in a production capacity.

However you choose to access the VM, you’ll need to log in with the username ‘admin’ and the password provided in the Credentials tab on the VM Settings page.
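Before opening a browser, it can help to confirm that the manager VM is reachable over the connection you set up. The following is a minimal Python sketch, not part of the Skytap template. The function name `can_connect` is illustrative, and the host and port you pass in are assumptions to adjust for your own configuration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_connect("manager.hadoop.local", 7180)` would test the manager VM, assuming it answers to that hostname (a guess based on the host naming used in this configuration) and that Cloudera Manager is listening on 7180, its usual web port. A `False` result usually points to the network connection rather than to Hadoop itself.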

Adding Additional Nodes to the Cluster

You can add additional nodes to your Hadoop Cluster by merging one or more copies of the Cloudera CDH4 Hadoop Host into your cluster, then configuring those hosts as additional nodes. This process isn’t complicated, but does take a number of steps, which I’ll detail below.

Adding a Host Template

1. Navigate to the Hadoop Cluster configuration.

2. Click Add VMs.

3. In the search box, type ‘hadoop’.

4. Select the Cloudera CDH4 Hadoop Host.

5. Click the Add button.

Repeat steps 2-5 for each node you want to add to the configuration.

Notice that although the titles for all of these new nodes are shown as ‘host-n,’ their network names have been automatically incremented for your convenience. If you want to make your configuration easier to view, you can rename all of your host VMs by replacing “n” with a number (host-3, host-4, etc.).

Installing Your Nodes

Now that you have all the new hosts added, click the Run button to spin them up.

After about 90 seconds, everything will start up. Now you’ll need to go back into Cloudera Manager to finish setting up your nodes.

1. Go back into Cloudera Manager (log in again if you need to).

2. Click Hosts at the top of the web page.

3. Click the Add Hosts button.

4. Click Continue.

5. In the search form, type ‘host-[3-Y].hadoop.local’, where Y is the total number of nodes in the cluster (for example, if you’ve added 3 nodes to the original 2, the range is 3-5). This searches DNS for all of your new host nodes.

6. Leave all hosts selected and click Install CDH On Selected Hosts.

7. Keep all of the defaults on the next page and click Continue.

8. On the next page, leave the radio buttons on their default selections, and enter the root password found in the Credentials tab of any of the host-n VMs (they all have the same password).

9. Click Start Installation.

10. Wait for all of the nodes to finish installing.

If your web page times out, or something doesn’t seem right, you can repeat steps 2-9 and the wizard will validate that all the software was installed properly. Note that the installation time depends on the number of nodes you’re installing; an additional 7 nodes could take 10-15 minutes to install.

11. Once the installations are done, click Continue.

12. The UI will now inspect all of your hosts.

You should get a series of green messages indicating that everything is functioning correctly. It is OK if you have one yellow message about mismatched versions.

If all looks good, click Continue.

If not, try running steps 2-11 again (this should fix any issues that popped up).

13. Click Continue again to finish the wizard. It should forward you to the Hosts page where all your hosts are listed.
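The host search pattern used in the Add Hosts form follows directly from how many nodes you added. As a rough illustration, here is a small Python sketch that builds the pattern and the matching host names, and can optionally check which names resolve in DNS from your machine. The function names are made up for this example. The only inputs taken from this walkthrough are the ‘hadoop.local’ domain and the two original nodes.

```python
import socket

DOMAIN = "hadoop.local"  # domain used in the search pattern
ORIGINAL_NODES = 2       # host-1 and host-2 come with the cluster template


def _new_range(added_nodes, original_nodes):
    """Return (first, last) host numbers for the newly added nodes."""
    if added_nodes < 1:
        raise ValueError("added_nodes must be at least 1")
    return original_nodes + 1, original_nodes + added_nodes


def search_pattern(added_nodes, original_nodes=ORIGINAL_NODES, domain=DOMAIN):
    """Build the 'host-[3-Y].hadoop.local' style string for the search form."""
    first, last = _new_range(added_nodes, original_nodes)
    return f"host-[{first}-{last}].{domain}"


def new_host_names(added_nodes, original_nodes=ORIGINAL_NODES, domain=DOMAIN):
    """List the fully qualified names of the newly added hosts."""
    first, last = _new_range(added_nodes, original_nodes)
    return [f"host-{i}.{domain}" for i in range(first, last + 1)]


def unresolved(names):
    """Return the names that do not resolve in DNS from this machine."""
    missing = []
    for name in names:
        try:
            socket.gethostbyname(name)
        except OSError:
            missing.append(name)
    return missing
```

With 3 added nodes, `search_pattern(3)` gives ‘host-[3-5].hadoop.local’. Running `unresolved(new_host_names(3))` from a machine connected to the configuration’s network is one way to spot a host that has not finished starting before you run the search.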

Making Each Node Identical

At this point, all your Hadoop nodes should be functional. However, you’ll likely want to make each of these new nodes work just like nodes 1 and 2. To do that, follow these steps:

1. Click the ‘Cloudera Manager (Free Edition)’ text at the top left of the web UI. This will bring you back to the services page.

2. Click the upside-down triangle next to the top service (hbase1), then click Instances.

3. Click the Add button.

4. In the Add Role Instances view, make the checkbox selections identical for each node. Take note of the settings for nodes 1 and 2 and copy them to the new nodes.

5. Click Continue.

6. Click Accept.

7. Wait for the commands to complete.

Repeat steps 1-6 for each service.

Note that some services may not utilize nodes 1 and 2, in which case you can safely leave out other nodes. For example, the Hue service is only hosted on the manager VM and there are no settings for nodes 1 and 2.

Once this process is completed, all of your nodes will function identically.
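The rule applied in the Add Role Instances view can be stated as: for each service, give every new node whatever roles node 1 already has. The Python sketch below expresses that rule on hand-entered data. It does not talk to Cloudera Manager, and the role names are placeholders for illustration rather than values read from the UI.

```python
def roles_to_add(assignments, reference_host, new_hosts):
    """For one service, list the roles each new host is missing.

    assignments maps a host name to the set of role names currently
    checked for that host in the Add Role Instances view.
    """
    reference = assignments.get(reference_host, set())
    return {
        host: sorted(reference - assignments.get(host, set()))
        for host in new_hosts
    }


# Illustrative data only: these role names are placeholders.
hbase = {"host-1": {"regionserver"}, "host-2": {"regionserver"}, "host-3": set()}
hue = {"manager": {"hue_server"}}  # nothing on host-1, so nothing to copy

print(roles_to_add(hbase, "host-1", ["host-3", "host-4"]))
print(roles_to_add(hue, "host-1", ["host-3"]))
```

A service with no roles on node 1, like Hue here, yields an empty list for every new node, which matches the note above about leaving those nodes out.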

Template Location

The Cloudera CDH4 Hadoop Cluster template is located here. Note that you’ll need to log in before viewing the template.

Template Contents

Cloudera CDH4 Hadoop Cluster

2 Cloudera Host Node VMs, labeled as ‘host-1’ and ‘host-2’

1 Cloudera Manager VM, labeled ‘manager’

Cloudera CDH4 Hadoop Host

1 Cloudera Host VM, labeled ‘host-n’

Support

Cloudera Enterprise Free is provided for trial purposes only, and as such is not eligible for direct support from Cloudera.

Licensing

These templates use Cloudera Enterprise Free, which is freely licensed for use on up to 50 nodes. For more information, see Cloudera’s product details page.