16.
Interlude: Solving problems with Map and Reduce
Google Tutorial View

1. The MapReduce library shards the input files and starts up many copies of the program on a cluster.
2. The master assigns work to workers. There are map tasks and reduce tasks.
3. A worker assigned a map task reads the contents of its input shard, parses key-value pairs, and passes each pair to the map function. Intermediate key-value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into regions. The locations of these pairs on the local disk are passed to the master.
5. When a reduce worker has read all of its intermediate data, it sorts the data by intermediate key, so that all occurrences of a key are grouped together.
6. The reduce worker passes each key and the corresponding set of intermediate values to the reduce function.
7. The output of the reduce function is appended to a final output file for that reduce partition.

Scott Hendrickson (Hadoop Meetup) EMR-Hadoop 8 July 2010 16 / 43
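
The steps above can be sketched in a single Python process. This is only an illustration under stated assumptions, not Hadoop or Google code: word count is chosen as the example problem, and the names `map_fn`, `reduce_fn`, `partition`, and `run_mapreduce` are hypothetical. In-memory lists stand in for input shards, local-disk regions, and output files, and there is no master or cluster.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one line of text."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def partition(key, num_reducers):
    """Pick a reduce partition; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(shards, num_reducers=2):
    # Steps 3-4: each "map worker" handles one shard and places its
    # intermediate pairs into one region per reduce partition.
    regions = defaultdict(list)
    for shard in shards:
        for offset, line in enumerate(shard):
            for k, v in map_fn(offset, line):
                regions[partition(k, num_reducers)].append((k, v))

    # Steps 5-7: each "reduce worker" sorts its region by key, groups the
    # values for each key, and appends the reduce output to its own list.
    outputs = {}
    for r in range(num_reducers):
        pairs = sorted(regions[r], key=itemgetter(0))
        outputs[r] = [
            reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))
        ]
    return outputs


if __name__ == "__main__":
    # Two hypothetical input shards, each a list of text lines.
    shards = [["the quick brown fox", "the lazy dog"], ["The dog barks"]]
    for r, result in run_mapreduce(shards).items():
        print(r, result)
```

Because the partition is a function of the key alone, every occurrence of a word lands in the same region, which is what lets each reduce worker produce a complete count without consulting the others.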