Running Hadoop on Windows is not trivial; running Apache Spark on Windows, however, proved not too difficult. I came across a couple of blogs and a Stack Overflow discussion that made this possible. Below are my notes, distilled from that reference material.

In a new command window, run hadoop-env.cmd followed by {HADOOP_INSTALL_DIR}/bin/hadoop classpath.
The output of this command is used to initialize SPARK_DIST_CLASSPATH in spark-env.cmd (you may need to create this file).
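If you would rather not copy and paste the classpath by hand, the output can be captured into the variable directly. This is a sketch meant for a batch script, using the standard for /f idiom; the HADOOP_HOME path is an assumption, and the location of hadoop-env.cmd can vary between distributions (conf\ vs. etc\hadoop\).

[code language="shell"]
:: Sketch for a batch script (use %i instead of %%i at an interactive prompt).
:: HADOOP_HOME below is an assumed location -- adjust to your install.
SET HADOOP_HOME=C:\amit\hadoop\hadoop-2.6.0

:: hadoop-env.cmd may live under conf\ or etc\hadoop\ depending on your layout.
call %HADOOP_HOME%\conf\hadoop-env.cmd

:: Capture the single-line output of "hadoop classpath" into the variable.
for /f "delims=" %%i in ('%HADOOP_HOME%\bin\hadoop classpath') do SET SPARK_DIST_CLASSPATH=%%i
echo %SPARK_DIST_CLASSPATH%
[/code]

The resulting value is what you would otherwise paste into spark-env.cmd manually.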

Create spark-env.cmd in {SPARK_INSTALL_DIR}/conf:

[code language="shell"]
#spark-env.cmd content
SET HADOOP_HOME=C:\amit\hadoop\hadoop-2.6.0
SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf
SET SPARK_DIST_CLASSPATH=<Output of hadoop classpath>
SET JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80
[/code]

Now run the examples or the Spark shell from the {SPARK_INSTALL_DIR}/bin directory. Note that you may have to run spark-env.cmd explicitly prior to running the examples or spark-shell.
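Putting it together, a session to launch the shell might look like the following sketch; {SPARK_INSTALL_DIR} is the placeholder used above, so substitute your own path.

[code language="shell"]
:: Change into the Spark bin directory (substitute your actual install path).
cd {SPARK_INSTALL_DIR}\bin

:: Run the environment script explicitly, since it may not be picked up automatically.
call ..\conf\spark-env.cmd

:: Start the interactive shell...
spark-shell

:: ...or run one of the bundled examples instead:
run-example SparkPi
[/code]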

This challenge was about using IBM Bluemix's "Analytics For Hadoop" service to process a data set that is at least 500 MB in size.

This was a wonderful opportunity to get some hands-on experience with IBM Bluemix (IBM gives extended trial access to participants). Apart from this, I was also keen to build a data visualization app on my own.

I selected CitiBike data for one year (2013-2014). Initially I did not have a clue about what insights I could gather from the dataset, but as soon as I ran some Apache Pig scripts and started looking at the output, I could see more and more use cases around it. I could not address all the use cases I had thought of, as I soon hit deadline pressure: I still had to finish the video demonstration and write a short write-up about the project.

Overall it was a very enriching experience as I did so many things for the very first time.

Listing some of them below:

IBM Bigsheets and BigSQL

Using Chart.js library

Using the Google Maps JavaScript APIs – remarkably simpler than I had expected. Much appreciation to Google for these APIs.

Creating a custom map icon – I never realized it would be this difficult.

HTML5/CSS challenges when putting up the UI

Last but not least, GitHub's easy way to publish your work online.

Now that the challenge is in its public voting and judging phase, I would appreciate it if you could take a look at my submission.

Anyone new to the Hadoop world often ends up frantically searching for the debug log statements they might have added in a mapper or reducer function. At least that was the case with me when I started working on Hadoop, so I thought it would be a good idea to write this entry.

The mapper and reducer are not executed locally, so you cannot find their logs on the local file system. They run on the Hadoop cluster, and the cluster is where you should look for them. To do that, you need to know the "JobTracker" URL for your cluster.

Access the JobTracker UI. The default URL for the JobTracker is "http://{hostname}:50030/jobtracker.jsp".

This simple UI lists the Map-Reduce jobs and their states (namely "running", "completed", "failed", and "retired").

You need to locate the Map-Reduce job that you started (please refer to the attached screenshot).

There are various ways to identify your Map-reduce job.

If you ran a Pig script, the Pig client logs the job id, which you can search for in the JobTracker UI. The screenshot shows the Map-Reduce job corresponding to a Pig script.

If you are running a plain Map-Reduce application, then once you submit the job you can search for it either by the user id used to submit it or by the application name.

Clicking on the job number reveals the job details as shown below –

The following screenshot shows how to navigate to the log statements. Note that the screenshot comprises three steps.