Creating Hadoop Streaming Job with Spring Data Apache Hadoop

A Hadoop streaming job is a MapReduce job that uses standard Unix streams as the interface between Apache Hadoop and our program. This means that we can write MapReduce jobs in any programming language that can read data from standard input and write data to standard output.

This tutorial describes how we can create a Hadoop streaming job by using Spring Data Apache Hadoop and Python. As an example we will analyze a novel called The Adventures of Sherlock Holmes and find out how many times the last name of Sherlock’s loyal sidekick Dr. Watson is mentioned in the novel.

These steps are described in more detail in the following sections. We will also learn how to run our Hadoop streaming job.

Getting the Required Dependencies with Maven

We can download the required dependencies with Maven by following these steps:

Add the Spring milestone repository to the list of repositories.

Configure the required dependencies.

Since we are using the 1.0.0.RC2 version of Spring Data Apache Hadoop, we must add the Spring milestone repository to our pom.xml file. In other words, we have to add the following repository declaration to our POM file:
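A declaration along these lines should work; the repository URL and the artifact version should be verified against the Spring Data Apache Hadoop documentation for the release you are using:

```xml
<!-- Milestone releases such as 1.0.0.RC2 are served from the
     Spring milestone repository, not from Maven Central. -->
<repositories>
    <repository>
        <id>spring-milestone</id>
        <name>Spring Milestone Repository</name>
        <url>https://repo.spring.io/milestone</url>
    </repository>
</repositories>
```

The required dependency can then be declared as follows:

```xml
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop</artifactId>
    <version>1.0.0.RC2</version>
</dependency>
```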

Creating the Mapper Script

We can implement our mapper script by following these steps:

Read the input from standard input and process it one line at a time.

Decode the processed line from UTF-8 into a Unicode string.

Split the converted line into words.

Remove special characters from each word.

Encode the output to UTF-8 and write the key-value pair as a tab-delimited line to standard output.

The source code of the mapper.py file is given in the following:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import unicodedata

# Removes punctuation characters from the string
def strip_punctuation(word):
    return ''.join(x for x in word if unicodedata.category(x) != 'Po')

# Process the input one line at a time
for line in sys.stdin:
    # Converts the line to Unicode
    line = unicode(line, "utf-8")
    # Splits the line into individual words
    words = line.split()
    # Processes each word one by one
    for word in words:
        # Removes punctuation characters
        word = strip_punctuation(word)
        # Prints the output
        print ("%s\t%s" % (word, 1)).encode("utf-8")

Creating the Reducer Script

The implementation of the reducer script must follow these guidelines:

The reducer script receives its input from standard input as tab-delimited key-value pairs.

The reducer script writes its output to standard output.

We can implement our reducer script by following these steps:

Read key-value pairs from standard input and process them one by one.

Decode the processed line from UTF-8 into a Unicode string.

Obtain the key and value by splitting the line at the tab character.

Count how many times the string “Watson” is given as a key to our reducer script.

Encode the output to UTF-8 and write it to standard output.

The source code of the reducer.py file is given in the following:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

wordCount = 0

# Process the input one line at a time
for line in sys.stdin:
    # Converts the line to Unicode
    line = unicode(line, "utf-8")
    # Gets the key and value from the current line
    (key, value) = line.split("\t")
    if key == "Watson":
        # Increases the word count by one
        wordCount = wordCount + 1
# Prints the output
print ("Watson\t%s" % wordCount).encode("utf-8")
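Before submitting the job, it is useful to sanity-check the map and reduce logic locally. The sketch below re-implements the same logic as plain Python 3 functions (the function names and the sample input are illustrative, not part of the actual scripts):

```python
import unicodedata

def strip_punctuation(word):
    # Same rule as mapper.py: drop characters in the Unicode category 'Po'
    return ''.join(c for c in word if unicodedata.category(c) != 'Po')

def mapper(lines):
    # Emits a (word, 1) pair for every cleaned word, like mapper.py does
    for line in lines:
        for word in line.split():
            yield (strip_punctuation(word), 1)

def reducer(pairs):
    # Counts the pairs whose key is "Watson", like reducer.py does
    count = sum(value for key, value in pairs if key == "Watson")
    return ("Watson", count)

sample = ['"Watson," said Holmes.', 'Watson nodded.']
print(reducer(mapper(sample)))  # ('Watson', 2)
```

In a real run, Hadoop sorts the mapper output by key before it reaches the reducer; for this simple counting reducer the sort order does not affect the result.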

Configuring the Application Context

We can configure the application context of our application by following these steps:

Create a properties file that contains the values of configuration parameters.

Configure the property placeholder that is used to fetch the values of configuration parameters from the created properties file.

Configure Apache Hadoop.

Configure the Hadoop streaming job.

Configure the job runner.

Creating the Properties File

We can create the properties file by following these steps:

Configure the default file system of Apache Hadoop.

Configure the path that contains the input files.

Configure the path in which the output files are written.

Configure the path of our mapper script.

Configure the path of our reducer script.

The contents of the application.properties file are given in the following:

#Configures the default file system of Apache Hadoop
fs.default.name=hdfs://localhost:9000
#The path to the directory that contains our input files
input.path=/input/
#The path to the directory in which the output is written
output.path=/output/
#Configure the path of the mapper script
mapper.script.path=mapper.py
#Configure the path of the reducer script
reducer.script.path=reducer.py

Configuring the Property Placeholder

We can configure the property placeholder by adding the following element to our application context configuration file:
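Assuming the standard context namespace is declared in the configuration file, the element could look like this:

```xml
<!-- Resolves ${...} placeholders from application.properties -->
<context:property-placeholder location="classpath:application.properties"/>
```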

Configuring Apache Hadoop

We can use the configuration namespace element to provide configuration parameters to Apache Hadoop. In order to execute our job by using our Apache Hadoop instance, we have to configure the default file system. We can configure the default file system by adding the following element to the applicationContext.xml file:
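Assuming the hdp namespace prefix is mapped to the Spring for Apache Hadoop schema, the element could look along these lines:

```xml
<!-- Sets the default file system from the application.properties file -->
<hdp:configuration>
    fs.default.name=${fs.default.name}
</hdp:configuration>
```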

Configuring the Job Runner

A job runner is a component that executes the Hadoop streaming job when the application context is loaded. We can configure it by using the job-runner namespace element. This process has the following steps:

Configure the job runner bean.

Configure the executed jobs.

Configure the job runner to run the configured jobs when it is started.
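Under the same hdp namespace assumption, the streaming job and its runner could be configured along these lines (the attribute names should be checked against the schema version in use):

```xml
<!-- Declares the streaming job and wires in the paths and scripts
     that are read from the application.properties file -->
<hdp:streaming id="streamingJob"
               input-path="${input.path}"
               output-path="${output.path}"
               mapper="${mapper.script.path}"
               reducer="${reducer.script.path}"/>

<!-- Runs the configured job when the application context is loaded -->
<hdp:job-runner id="jobRunner" job-ref="streamingJob" run-at-startup="true"/>
```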

Loading the Application Context When the Application Starts

We have now created a Hadoop streaming job with Spring Data Apache Hadoop. This job is executed when the application context is loaded. We can load the application context during startup by creating a new ClassPathXmlApplicationContext object and providing the name of our application context configuration file as a constructor parameter. The source code of the Main class is given in the following:
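A minimal launcher class along these lines should work (the configuration file name must match the one used in your project):

```java
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {

    public static void main(String[] args) {
        // Loading the application context triggers the job runner,
        // which executes the configured Hadoop streaming job
        ClassPathXmlApplicationContext context =
                new ClassPathXmlApplicationContext("applicationContext.xml");
        context.registerShutdownHook();
    }
}
```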

Running the MapReduce Job

We have now learned how to create a streaming MapReduce job by using Spring Data Apache Hadoop and Python. Our last step is to run the created job. Before we can run our job, we must download The Adventures of Sherlock Holmes. We must download the plain text version of this novel manually because Project Gutenberg blocks download utilities such as wget.

After we have downloaded the input file, we are ready to execute our MapReduce job. We can run our job by starting our Apache Hadoop instance in a pseudo-distributed mode and following these steps:

Upload the input file to HDFS.

Run the MapReduce job.

Uploading the Input File to HDFS

We can upload our input file to HDFS by running the following command at the command prompt:

hadoop dfs -put pg1661.txt /input/pg1661.txt

We can verify that the upload was successful by running the following command at the command prompt:

hadoop dfs -ls /input

If the file was uploaded successfully, the resulting directory listing should include the pg1661.txt file.

Running the MapReduce Job

We have two alternative methods for running our MapReduce job:

We can execute the main() method of the Main class from our IDE.

We can build a binary distribution of our example project by running the command mvn assembly:assembly at the command prompt. This creates a zip package in the target directory. We can run the created MapReduce job by unzipping this package and using the provided startup scripts.