
Python: yet another way to implement map/reduce

In this blog, I will discuss the word count problem as implemented in Python. It is often used to show how MapReduce works, and in most examples it is developed in the context of a Java programme. The idea is that the programme is split into two stages. In the first stage, the mapper stage, calculations are made on a part of the total data set and the results are stored as name, value pairs. In the subsequent stage, the reduce stage, the name, value pairs are taken together and summarised into one final outcome.
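Before turning to the streaming scripts, the two stages can be sketched in plain, in-memory Python. This is an illustration of the idea only (the sample lines are my own); Hadoop is not involved here:

```python
# In-memory sketch of the two MapReduce stages for word count.
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog"]

# Mapper stage: every word is emitted as a (name, value) pair, here (word, 1).
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce stage: pairs with the same name are taken together and summed
# into the final counts.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))
```

On a real cluster the mapper and reducer run as separate processes, possibly on different servers, which is why the streaming scripts below communicate via stdin and stdout instead of an in-memory list.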
In Python, the mapper programme looks like:

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

The outcomes are emitted as word, 1 pairs. This is the name, value pair that the mapper sends on as an intermediate outcome. The programme can also be run on its own, outside Hadoop, by piping text into it on the command line.
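For example, assuming the mapper is saved as mapper.py (the filename is my choice; the heredoc simply recreates a condensed copy of the script above so the snippet is self-contained):

```shell
# Recreate the mapper as mapper.py, then pipe some text through it
# and inspect the tab-separated (word, 1) pairs it emits.
cat > mapper.py <<'EOF'
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
EOF

echo "foo foo bar" | python3 mapper.py
# foo	1
# foo	1
# bar	1
```

Piping the result through `sort` mimics Hadoop's shuffle phase, which delivers the pairs to the reducer grouped by key.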

The intermediate results are sent as a stream from the mapper to the reducer. On the Hadoop platform, this is handled by the streaming jar. Roughly stated, the system works as follows: stream input > mapper > reducer > outcome. Thanks to the Hadoop streaming jar, we do not have to indicate that intermediate outcomes are generated on one server and sent to another server where the reduce part happens.
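The post shows the mapper but not the reducer. A minimal sketch of a matching reducer.py, my own and not from the original post, could look as follows; it relies on the shuffle phase having sorted the mapper output by key, so identical words arrive on consecutive lines:

```python
#!/usr/bin/env python
# Reducer sketch for the streaming word count: sums the counts of
# consecutive identical words. Assumes input lines of the form "word\tcount",
# sorted by word, as Hadoop's shuffle phase guarantees.
import sys


def reduce_counts(lines):
    """Yield (word, total) for each run of consecutive identical words."""
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split('\t')
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, int(count)
    # flush the last word, if any input was seen
    if current_word is not None:
        yield current_word, current_count


if __name__ == '__main__':
    for word, total in reduce_counts(sys.stdin):
        print('%s\t%s' % (word, total))
```

On the cluster, the two scripts are then wired together with an invocation along the lines of `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>`; the exact jar path and options depend on the installation.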