Labels

Friday, January 26, 2018

Map Reduce Data Flow

MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all the nodes.

Running a MapReduce program involves running these mapping tasks across all the nodes in our cluster.

Each of these mapping tasks are equivalent (No mappers have particular identity associated with them ). Therefore any mapper can process any input file.

Each mapper loads the set of file local to that machine and process them.

Intermediate data from the mappers and shuffling

When mapping phase is completed. the intermediate (key,value) pairs must be exchanged between machines to send all values with the same key to a single reducer

Reducing process

The reduce task are spread across the same nodes across the same nodes across the clusters.This is the only task in MapReduce.

Individual map task do no exchange information with one another, nor they are aware of one anoothers'existence

The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability.

Note:

If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.

Word Count

In the example below we can see the data flow for word count

Note: Combiner is a semi-reducer in mapreduce. This is an optional class which can be specified in mapreduce driver class to process the output of map tasks before submitting it to reducer tasks.