Apr 23, 2013

Understanding the MapReduce concept

Step 1 is mapping: take one input and process it into many outputs. For example, take some wooden blocks and split them into smaller blocks. The outputs are expressed as key/value pairs.

Step 2 is reducing: process those many outputs and combine them into one output. For example, paint each small block a color, then glue the blocks of the same color together into one block.
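The two steps can be sketched in a few lines of Python. This is only an illustration of the block analogy, under some assumptions: each block is represented as a hypothetical (color, size) tuple, so the color is already known at map time, and the names map_block, reduce_pieces, and map_reduce are made up for this sketch rather than taken from any MapReduce framework.

```python
from collections import defaultdict

def map_block(block):
    """Map: split one block into unit-sized pieces, emitting (color, 1) pairs."""
    color, size = block
    return [(color, 1) for _ in range(size)]

def reduce_pieces(color, pieces):
    """Reduce: glue all pieces of one color back into a single block."""
    return (color, sum(pieces))

def map_reduce(blocks):
    # Group the mapped key/value pairs by key (the color).
    grouped = defaultdict(list)
    for block in blocks:
        for key, value in map_block(block):
            grouped[key].append(value)
    # Reduce each group to one output, in sorted key order.
    return [reduce_pieces(color, pieces) for color, pieces in sorted(grouped.items())]

blocks = [("red", 3), ("blue", 2), ("red", 1)]
print(map_reduce(blocks))  # [('blue', 2), ('red', 4)]
```

The grouping in the middle is what lets the reduce step see all pieces that share a key together.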

Why do we need to map and then reduce? Because when a data set is very large, the work can be split into tasks that different machines process in parallel, which is more efficient. The data is split across server nodes. This is the MapReduce way of processing large data. Note, however, that not every kind of data can be easily split and then combined.
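As a rough sketch of the split-and-combine idea, the Python below sums a list by giving each worker its own chunk and then adding up the partial results. The assumptions are loud ones: a local thread pool stands in for the separate machines, and the helper names split and parallel_sum are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Split data into at most n_chunks slices of roughly equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, n_workers=4):
    chunks = split(data, n_workers)
    # Threads stand in for server nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one partial sum per chunk
    return sum(partial_sums)  # combine the partial results

data = list(range(1, 101))
print(parallel_sum(data))  # 5050
```

A sum works well because partial sums can simply be added together. A median is an example of the harder case: the median of the per-chunk medians is generally not the median of the whole data set.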

Example: Input = "big brown fox jumped over a brown fence". The requirement is to count the number of occurrences of a given word, for example "brown", which appears twice.
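Here is a minimal Python sketch of how this example could be worked through in the MapReduce style. It counts every word in the input and then looks up a single one; "brown" is used for the lookup because it occurs twice in the sentence. The function names map_words, shuffle, and reduce_counts are hypothetical, and everything runs in one process rather than across server nodes.

```python
from collections import defaultdict

def map_words(text):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group the values by key so each word's 1s end up together."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped

def reduce_counts(grouped):
    """Reduce: sum the 1s for each word to get its count."""
    return {word: sum(ones) for word, ones in grouped.items()}

text = "big brown fox jumped over a brown fence"
counts = reduce_counts(shuffle(map_words(text)))
print(counts["brown"])  # 2
print(counts["fox"])    # 1
```

A word that never appears in the input has no entry at all, so counts.get(word, 0) is the safer lookup when the word might be missing.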