Running Hadoop Components

One of the advantages of Bigtop is the ease of installing the different Hadoop components without having to hunt for a specific component distribution and match it to a specific Hadoop version.

Running Pig

Create a tab-delimited file using a text editor and import it into HDFS under your user directory, /user/$USER. By default Pig will look there for your file. Start the Pig shell and verify that a load and a dump work. Make sure you have a space on both sides of the = sign. The USING PigStorage('\t') clause tells Pig the columns in the text file are delimited by tabs.
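A minimal session might look like the following. The file name data.tsv and the column names are placeholders; adjust them to match your own file.

hadoop fs -put data.tsv /user/$USER/data.tsv
pig
grunt> A = LOAD 'data.tsv' USING PigStorage('\t') AS (name:chararray, count:int);
grunt> DUMP A;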

Running HBase

You should see confirmation from HBase that the table t2 exists; the table name t2 should appear in the output of the list command.
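For reference, a minimal check in the HBase shell looks like this. The column family name cf is a placeholder for whatever was used when the table was created in the earlier step.

hbase shell
# create the table first if it does not already exist; 'cf' is a placeholder column family
create 't2', 'cf'
# t2 should appear in the output of list
list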

Running Hive

This applies to bigtop-0.2.0, where hadoop-hive, hadoop-hive-server, and hadoop-hive-metastore are installed automatically because the Hive package names start with the word hadoop. In bigtop-0.3.0, the sudo apt-get install hadoop* command will not install the Hive components because the Hive package names changed in Bigtop. For bigtop-0.3.0 you will have to run:

sudo apt-get install hive hive-server hive-metastore

Create the HDFS directories Hive needs
The Hive post-install scripts should create the /tmp and /user/hive/warehouse directories. If they don't exist, create them in HDFS. The post-install script cannot create them itself because HDFS is not up and running during the deb installation: JAVA_HOME is buried in hadoop-env.sh, so HDFS cannot be started to allow the directories to be created.
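If you do need to create them by hand, the usual commands are shown below; the g+w group permission on both directories is the standard recommendation from the Hive documentation.

hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse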

Running Mahout

To run classify-20newsgroups.sh, first change ../bin/mahout to /usr/lib/mahout/bin/mahout. Do a find and replace with your favorite editor; there are several occurrences of ../bin/mahout that need to be replaced with /usr/lib/mahout/bin/mahout.
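Alternatively, a single sed command handles all the occurrences; run it in the directory containing the script.

sed -i 's|\.\./bin/mahout|/usr/lib/mahout/bin/mahout|g' classify-20newsgroups.sh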

Run the rest of the examples under this directory, except for the Netflix example, since that data set is no longer officially available.

Running Whirr

Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your .bashrc to the values from your AWS account. Verify with echo $AWS_ACCESS_KEY_ID that the value is set correctly before proceeding.
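For example, the .bashrc entries look like this; the values shown are placeholders for your own credentials.

export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key

source ~/.bashrc
echo $AWS_ACCESS_KEY_ID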

If the cluster fails to launch, you may see an error like the following:

Unable to start the cluster. Terminating all nodes.
org.apache.whirr.net.DnsException: java.net.ConnectException: Connection refused
at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:83)
at org.apache.whirr.net.FastDnsResolver.apply(FastDnsResolver.java:40)
at org.apache.whirr.Cluster$Instance.getPublicHostName(Cluster.java:112)
at org.apache.whirr.Cluster$Instance.getPublicAddress(Cluster.java:94)
at org.apache.whirr.service.hadoop.HadoopNameNodeClusterActionHandler.doBeforeConfigure(HadoopNameNodeClusterActionHandler.java:58)
at org.apache.whirr.service.hadoop.HadoopClusterActionHandler.beforeConfigure(HadoopClusterActionHandler.java:87)
at org.apache.whirr.service.ClusterActionHandlerSupport.beforeAction(ClusterActionHandlerSupport.java:53)
at org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:100)
at org.apache.whirr.ClusterController.launchCluster(ClusterController.java:109)
at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:63)
at org.apache.whirr.cli.Main.run(Main.java:64)
at org.apache.whirr.cli.Main.main(Main.java:97)

When Whirr has finished launching the cluster, an entry for it appears under ~/.whirr; use this to verify the cluster is running.

Cat the hadoop-proxy.sh script to find the EC2 instance address, or cat the instances file. Both give you the Hadoop namenode address, even though you started the mahout service using Whirr.
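Assuming the cluster definition was named mahout, as in the ssh example below, the commands are:

cat ~/.whirr/mahout/hadoop-proxy.sh
cat ~/.whirr/mahout/instances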

ssh into the instance to verify you can log in. Note that this login is different from a normal EC2 instance login: the ssh key is id_rsa and no user name is given before the instance address. From ~/.whirr/mahout:

ssh -i ~/.ssh/id_rsa ec2-50-16-85-59.compute-1.amazonaws.com
Once logged in, verify you can access the HDFS file system from the instance.
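For example, a simple listing of the HDFS root confirms access:

hadoop fs -ls /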