Apache + Yarn + Spark: Play with Twitter data!

In this tutorial I describe how to use Apache Spark on Ubuntu machines to develop big data analysis applications.

First, a quick introduction to the Hadoop + Spark environment. Hadoop makes it possible to work with many computers as a cluster. The work can be storing files in the cluster (HDFS, the Hadoop Distributed File System), storing a database in the cluster (Apache HBase), or running software in the cluster (MapReduce, Spark).

In Hadoop, a master node controls the other nodes (clients) in the cluster and partitions the data among them.

I will not cover how to configure Hadoop + Spark on Linux. I followed a nice tutorial to configure two machines (1 master + 1 slave). Here is the link. Thanks to Sumit Chawla.

After that I created my slave01 node as a clone of the master node. There is a good tutorial on how to clone virtual machines using VirtualBox in this link. I then changed some configuration on the slave, including the user name and display name. Here is a screenshot of slave01:

After that, it is time to start HDFS + YARN + Spark on the master node. The commands are:

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh

And finally start Spark:

$SPARK_HOME/sbin/start-all.sh

Then check on both master and slave01 that the daemons are running, using jps:

Now it is time to program. The central concept in MapReduce or Spark programming is the (key, value) pair. The main job is to read data from a file (or HDFS) line by line. Map processes each line and creates zero, one, or many (key, value) pairs. Hadoop collects and sorts the pairs, groups those with the same key, and gives them to the Reducer. The reducer then decides what to do with the values. In the popular word count example, the reducer sums all the 1s to produce the count for each word.
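To make the (key, value) flow concrete, here is a minimal sketch of word count in plain Java. It only simulates the map, sort/group, and reduce steps in memory on one machine. It does not use the Hadoop or Spark APIs, and the class and method names (WordCountSketch, map, shuffle, reduce, wordCount) are my own assumptions for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> sort/group -> reduce flow.
// All names here are hypothetical; none belong to Hadoop or Spark.
class WordCountSketch {

    // Map step: one input line yields zero, one, or many (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Sort/group step: collect the values of equal keys together.
    // A TreeMap keeps the keys sorted, as the framework would.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: for word count, sum all the 1s for a single key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in order over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // The empty line shows that a map call may emit zero pairs.
        List<String> lines = Arrays.asList("spark on yarn", "yarn on hadoop", "");
        System.out.println(wordCount(lines)); // prints {hadoop=1, on=2, spark=1, yarn=2}
    }
}
```

In a real cluster the map and reduce calls run on different nodes and the framework does the grouping, but the data that passes between the steps has the same shape as in this sketch.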

I developed in Java. I installed NetBeans on the master node and added the libraries for Spark programming from the $SPARK_HOME/jars directory.

Mir Saman

I'm currently an IT PhD candidate at Urmia University. I'm interested in Social Network Analysis, Big Data Mining, and NLP in my academic field, as well as Guitar, Nature, and Android!