What is Hadoop?

Hadoop is a sub-project of Lucene (a collection of
industrial-strength search tools), under the umbrella of the
Apache Software Foundation. Hadoop parallelizes data
processing across
many nodes (computers) in a compute cluster,
speeding up large computations and hiding I/O latency through
increased concurrency. Hadoop is especially well-suited to large
data processing tasks (like searching and indexing) because it
can leverage its distributed file system to cheaply and reliably
replicate chunks of data to nodes in the cluster, making data
available locally on the machine that is processing it.

Hadoop is written in Java. Hadoop programs can be written using
a small API in Java or Python. Hadoop can also run binaries and
shell scripts on nodes in the cluster provided that they conform
to a particular convention for string input/output.
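
For a concrete sketch of that convention, here is a mapper
written as a standalone Python script; this assumes the
line-oriented, tab-separated format used by Hadoop's
streaming facility, and the script name is purely
illustrative:

#!/usr/bin/env python
# wordcount_mapper.py -- illustrative streaming-style mapper.
# Reads raw text lines on standard input and writes one
# "word<TAB>1" pair per word on standard output.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")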

Hadoop provides the application programmer with the
abstractions of map and reduce (which may be familiar to
those with functional programming experience). Map and
reduce are available in many programming languages, such as
Lisp and Python.

Data Structures

Hadoop operates on <key, value> pairs, or two-tuples. The
value can be of any data type, but it must be serializable,
since Hadoop transmits data over the wire as strings. All
data that your Hadoop programs produce must be formatted as
pairs. For example, the output of a word count is a set of
pairs like these:

a 37
the 20
and 16
zygote 1
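
In Python terms, such pairs are just two-element tuples, and
one simple way to serialize them as strings (a sketch, not a
requirement of Hadoop) is one tab-separated line per pair:

# <key, value> pairs represented as Python tuples.
pairs = [("a", 37), ("the", 20), ("and", 16), ("zygote", 1)]

# Serialize each pair as a tab-separated line of text.
for key, value in pairs:
    print(f"{key}\t{value}")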

Map and Reduce

The MapReduce paradigm takes inspiration from
the map
and reduce programming constructs
common in many programming languages. We introduce map and
reduce using Python, then draw out differences between
map/reduce in most programming languages and the
map/reduce phases of Hadoop.

Map

Map applies a function to each element in a list, returning
a list of the results. For example, here is Python code that
uses map to triple each element in a list:

def triple(n):
    return n * 3

# list() materializes map's result so it prints as a list
print(list(map(triple, [1, 2, 3, 4])))

This code prints the following list:

[3, 6, 9, 12]

Reduce

Reduce also applies a function over a list, but instead of
applying it to each element separately, it applies the
function to two arguments: the "current result" and the next
element in the list. The current result is initialized by
applying the function to the first two elements of the
list. This lets you build a single result (which can be
another list, but is often a scalar value) from an entire
list. This is best illustrated by another simple Python
example:

from functools import reduce

def add(n1, n2):  # named add to avoid shadowing built-in sum
    return n1 + n2

print(reduce(add, [1, 2, 3, 4]))

Can you guess what this will print? Try it if you are
unsure. You can think of this call as expanding into three
nested function calls, like this:

add(add(add(1, 2), 3), 4)

Parallelizing Map and Reduce with Hadoop

In the MapReduce programming abstraction,
the map function takes a single
<key, value> pair, and produces zero or
more <key, value> pairs as a result. This differs slightly
from the functional programming construct map, which always
produces exactly one result per element. The MapReduce style
of map allows you to produce many intermediate pairs, which
can then be further analyzed with reduce.
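
A small sketch in plain Python makes the difference concrete
(the function below is illustrative and is not Hadoop's
actual API):

def map_wordcount(key, value):
    """Take one <document_name, document_text> pair and emit
    zero or more <word, 1> pairs, one per word in the text."""
    return [(word, 1) for word in value.split()]

print(map_wordcount("doc1", ""))  # empty document: zero pairs
print(map_wordcount("doc2", "a rose is a rose"))
# [('a', 1), ('rose', 1), ('is', 1), ('a', 1), ('rose', 1)]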

In MapReduce, the reducer is also more flexible than its
functional programming
counterpart. While the MapReduce reduce is similar in spirit
to the reduce described in the Python example above, it is
not limited to processing the pairs two at a time. Instead,
it is given an iterator over all of the pairs that share the
same key, and that sequence can be walked in any way the
programmer chooses. Like the MapReduce map, the MapReduce
reduce can emit an arbitrary number of pairs, although
applications often want to reduce down to a single output
pair.
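
Continuing the sketch from the map section (again
illustrative, not Hadoop's actual API), a MapReduce-style
reducer receives one key together with an iterator over
every value emitted for that key:

def reduce_wordcount(key, values):
    """Receive one key and an iterator over all of its values,
    and emit any number of pairs; here, a single pair holding
    the word and its total count."""
    return [(key, sum(values))]

print(reduce_wordcount("a", iter([37, 10, 3])))  # [('a', 50)]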

A Quick Illustration of Reduce

The reduce phase in MapReduce joins
together in some way those pairs which have the same key. For
example, say that you do a word count on N documents, and each
document is on a separate
node. Your map function will process
each document separately, producing many pairs of the following
form: <word, count>. The documents will most likely have
many words in common, so those counts will need to be
combined. The MapReduce algorithm automatically starts a reduce
process for each set of pairs with the same key (in this case,
counts for the same word), and the reducer can simply sum those
counts together, producing a single pair of the form <word,
total_count>. For example, reduce might transform one set of
pairs into another like this:

a 37
the 20
a 10
a 3
and 16      -- reduce -->   a 50
zygote 1                    and 16
the 12                      the 32
                            zygote 1

A word count example written in Java is distributed with
Hadoop in
src/examples/org/apache/hadoop/examples/WordCount.java.

By separating the map task (where the computation on each
input element is independent of the others) from the reduce
task (where pairs with the same key must be processed
together on the same node), MapReduce exposes parallelism:
many map tasks, and later many reduce tasks, can run
concurrently across the cluster. This is why MapReduce
splits a computation into two distinct phases.
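
To see both phases end to end, here is a single-process
sketch that simulates the whole pipeline; the grouping step
stands in for the shuffle Hadoop performs between the two
phases, and all of the names are illustrative:

from collections import defaultdict

def map_wordcount(key, value):
    # Map phase: each document is independent, so each of
    # these calls could run on a different node.
    return [(word, 1) for word in value.split()]

def reduce_wordcount(key, values):
    # Reduce phase: all counts for one word arrive together.
    return [(key, sum(values))]

documents = {"doc1": "the cat sat", "doc2": "the cat ran"}

# Run the map phase over every document.
intermediate = []
for name, text in documents.items():
    intermediate.extend(map_wordcount(name, text))

# Shuffle: group values by key (Hadoop does this for you).
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Run the reduce phase over every group of values.
for word in sorted(grouped):
    print(reduce_wordcount(word, grouped[word]))
# [('cat', 2)]
# [('ran', 1)]
# [('sat', 1)]
# [('the', 2)]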