When you need it !

mapreduce

So, you want to know what is Hadoop and how can Python be used to perform a simple task using Hadoop Streaming?

HADOOP is a software framework that mainly is used to leverage distributed systems in an intelligent manner and also, perform efficiently operations on big datasets without letting the user worry about nodal failures (Failure in one among the ‘n’ machines performing your task). Hadoop has various components :

Hadoop Common – contains libraries and utilities needed by other Hadoop modules

Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications

I will walk you through the Hadoop MapReduce component. For further information on Map Reduce you can either Google. For now I will present to you a brief introduction of it.

What is MapReduce

To know it truly you have to understand your problem first. So, what kind of problems can Map Reduce solve ?

Counting occurrences of digits in a list of numbers

Counting prime numbers in a list

Counting the number of sentences in a text

Counting the Average of 10 million numbers in a database

Counting the name of all people belonging to a particular sales region

Do you think that these are trivial problems ? Yes, they appear as if they are but what if you have millions of records and time for processing the results is very important for you? Thinking again ? You’ll get your answer.
Not just time but multiple dimensions of a task are there and map reduce if implemented efficiently, can help you overcome the risks associated with processing that much data.

Okay, enough of what and why! Now ask me how !!!

A MapReduce ‘system’ consists of two major kinds of functions, namely the Map Function and the Reduce function (not necessarily with the same names but more often with the pre-decided intentions of acting as the Map and Reduce functions). So, how do we solve the simple problem of counting a million numbers from a list quickly and display their sum ? This is, let me tell you, a very long operation though simple (For a complex program in MapReduce using not Hadoop but mincemeat module please go throughthis.

In this particular example the Map Function(s) will go through the list of numbers and create a list(s) of key-value pairs of the format {m,1} for every number m that occur during the scan. The reduce function takes the list from the Map function(s) and forms a list of key-value pairs as {m,list(1+)}. 1+ means 1 or more occurrences of 1.

The complicated expression above is nothing but just the number m encountered in the scan(s) by the Map Function(s) and the 1’s in the value in the Reduce task appear as many times as the number was encountered in the Map Function(s). So, that basically means {m, number of times m was encountered in the Map Phase}.

The next step is to aggregate the 1’s in the value for every m. This means {m,sum(1s)}. The task is almost done now. All we got to do is just display the number and the corresponding sum of the 1s as the count of the number. But wait, still you don’t understand as why this is a big deal right? Anybody can do this. But hey! The Map Functions aren’t just there to take all your load and process alone all of it. Nope! there are in fact many instances of your Map Function working in parallel in different machines in your cluster (if it exists, else just multithread your program to create multiple instances of Map (but why should you when you have distributed systems). Every Map Function running simultaneously work on different chunks of your big list, hence, sharing the task and reducing processing time. See! What if you have a big cluster and many machines running multiple instances your Map Functions to process your list? It’s simple; your work gets done in no time!!! Similarly, the Reduce functions can also run in multiple machines but generally after sorting ( where your mincemeat or hadoop programs will first sort the say,m‘s and distribute distinct such m‘s to different reduce functions in different machines). So, even aggregation task gets quicker and you are ready with your output to impress your boss!

A brief outline of what happened to the list of numbers is as follows:

Map functions counted every occurrence of every number m

Map functions stored every number m in the form {m,1} - as many pairs for any number m

Reduce functions collected all such {m,1} pairs

Reduce functions converted all such pairs as {m,sum(1's)} - Only 1 pair for a number m

Reduce functions finally displayed the pairs or passed it to the main function to display or process

In the part two of the tutorial I will explain how to install Hadoop and do the same program in python using Hadoop Framework.

For a similar program in mincemeat please go through :
import mincemeat import sys