MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004
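The map/reduce model described above can be illustrated with a minimal single-process word-count sketch in Java. The class name, method signatures, and in-memory "shuffle" below are illustrative assumptions for exposition only; they are not the actual MapReduce (or Hadoop) API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the MapReduce model: map emits intermediate
// key/value pairs, the framework groups them by key (the "shuffle"),
// and reduce merges all values that share one key.
public class WordCountSketch {

    // map: for each record (a line of text), emit one ("word", 1)
    // pair per word in the record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: merge all intermediate values associated with one key.
    // For word count this is just a sum of the emitted counts.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": run map over every record, group the emitted
    // pairs by key, then invoke reduce once per distinct key.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        System.out.println(run(input).get("the")); // prints 2
    }
}
```

Because map is applied independently to each record and reduce is applied independently to each key group, a real implementation can partition both phases across many machines and re-execute any failed piece, which is exactly the property the paper exploits.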

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
* implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;