A Beginner's Guide to the Cloud, Apache Hadoop and the MapReduce Paradigm

For those of you with patience, you can watch an hour-long video by Jacob at LinkedIn:

Video source: [VS1]

For the others, I present here a brief summary. In the late 1990s and early 2000s, Google faced the challenge of indexing the gargantuan amounts of data referenced by its search engine. Until the early 90s, the amount of data on the web was comparatively small, and relational databases could meet the requirements of the day. But by the decade starting in 2000, things were completely different: the number of users on the internet was large, and so was the amount of data being added to the web every day. Building an index for the entire web had become tremendously complicated.

What Google did to address this problem was to distribute the computation and storage across several nodes. For those of us with some insight into CS topics like distributed computing, this might seem trivial; however, you will learn to appreciate the beauty of their solution as this blog post unfolds. Google developed a technique called the MapReduce paradigm, which took on the responsibility of distributing the computation across several computers and combining the results obtained from each of them. That is just a superficial headline to keep you reading.

In the MapReduce paradigm, two programs are written for two kinds of computing nodes: mappers and reducers. The mapper program keeps dividing the large task into smaller and smaller sub-tasks, and these smaller tasks are computed independently, in parallel, by different nodes. The results of these smaller tasks are then consolidated and aggregated to obtain the result of the large, original task.

Image source: [IS1]

The reducer is what performs that consolidation and aggregation. So hang on: are the mappers and reducers software or hardware? The answer is both. Some nodes perform the mapping function; these nodes, called mappers, are loaded with the mapper program. Another bunch of nodes performs the aggregation of results; these nodes are the reducers, and the reducer program is loaded onto them.

Here is an example:

Suppose you want to count the number of characters in the sentence: I go to school. A first mapper identifies the different words within the sentence. These words, {I, go, to, school}, are then handled independently by four more mappers, each counting the characters only in the word assigned to it: mapper 2 counts the characters in I, mapper 3 counts the characters in go, mapper 4 takes care of to, and mapper 5 handles the word school. Once mapper 1 has distributed the words, mappers 2, 3, 4 and 5 work in parallel, which provides a speed-up over the conventional method of serial computation on a single node. After mappers 2, 3, 4 and 5 write their results, a reducer combines them, i.e. adds them together, to obtain the final count of the number of characters.
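The example above can be sketched in a few lines of Python. This is only a toy simulation of the paradigm, not Hadoop itself: `split_mapper`, `count_mapper` and `reducer` are illustrative names, and a thread pool stands in for the parallel mapper nodes.

```python
# Toy simulation of the character-count example: mapper 1 splits the
# sentence, mappers 2-5 each count characters in one word (run here in
# parallel threads to mirror the parallelism described), and a single
# reducer sums the partial counts.
from concurrent.futures import ThreadPoolExecutor

def split_mapper(sentence):
    """Mapper 1: divide the big task into per-word sub-tasks."""
    return sentence.rstrip(".").split()

def count_mapper(word):
    """Mappers 2-5: count the characters in the word assigned to them."""
    return len(word)

def reducer(partial_counts):
    """Reducer: aggregate the partial results into the final answer."""
    return sum(partial_counts)

words = split_mapper("I go to school.")     # ['I', 'go', 'to', 'school']
with ThreadPoolExecutor() as pool:          # the mappers run in parallel
    partial_counts = list(pool.map(count_mapper, words))
total = reducer(partial_counts)
print(total)                                # 1 + 2 + 2 + 6 = 11
```

The shape is the important part: a split step, independent per-item work, and a final aggregation. Real jobs just replace "one sentence" with terabytes of input.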

Now, I do understand that some of you might be asking yourselves: so what? If that's the case, I wouldn't be very surprised. Just keep reading and you will come to appreciate the crux of the discussion here.

Where exactly does Hadoop fit in? I haven't spoken about it in a while; I was just talking about nodes and parallel computation. Hadoop provides an effective way to organize the nodes tasked with computation and to distribute the data across those nodes (this is pretty much the kind of support that MapReduce needs to accomplish its goals). The MapReduce paradigm as first implemented by Google was documented by two scientists at its research labs, Jeffrey Dean and Sanjay Ghemawat, whose 2004 paper describes the node deployment and topology, in essence the infrastructure supporting the MapReduce technique. That implementation remained Google's proprietary system. The documentation Google published was used by the Apache Software Foundation to develop Hadoop, an open source implementation of Google's MapReduce system. Later, engineers at Yahoo did a significant amount of work to give Hadoop the shape it has attained today. That's just an intro to Hadoop, to tell you where it came from and who is to be credited with its development.

The Hadoop system comprises two key components serving two different purposes: HDFS (the Hadoop Distributed File System), which takes care of the storage of data, and the MapReduce engine, which handles the processing and computation in the manner described above.
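To make the MapReduce half of that concrete, here is a plain-Python sketch of the classic word-count job, with the framework's split, shuffle/sort and reduce phases simulated in ordinary code. Hadoop's real API is Java (though its Streaming mode does let you write mappers and reducers as plain scripts); the function names below are illustrative, not part of any Hadoop API.

```python
# Word count in the MapReduce style: the mapper emits (word, 1) pairs,
# the framework groups pairs by key, and the reducer sums each group.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return (word, sum(counts))

def run_job(lines):
    # Map: every line could be processed on a different node.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: group the pairs by key, as Hadoop does between phases.
    pairs.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct word.
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(run_job(["to be or not to be", "to go"]))
```

Note that the mapper never sees the whole input and the reducer never sees raw lines; that separation is what lets each phase scale out across nodes.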

In HDFS, the data to be accessed is stored beforehand in a way that supports the MapReduce paradigm. A file is fragmented into blocks, typically of size 128 MB, 256 MB, 512 MB or 1 GB. Now, the nodes I talked about above have their own disk storage, so when they perform computations they do not have to fetch data remotely from another source or sources; they only need to access the data resident on their own disks. Typically, each block of data is replicated three times (at three different nodes) to make the system fault tolerant. A directory mapping each block to its locations (the nodes where the block can be found) is held by a manager node called the NameNode.

Now, when you search with the keyword 'apple' in the Google search engine, every page or link on the web containing an 'apple' substring is brought together and presented to you as the search result. What is actually happening in the background is that each of these links containing 'apple' (which were distributed and stored beforehand across thousands of nodes) is aggregated by a bunch of reducer nodes and displayed to you. So what are the mapper nodes doing here? They store the links or pages, undertake some minor processing such as locating the searched keyword in the page, and forward their results to the reducers.
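The block-and-directory idea can be sketched as follows. This is a toy model, not HDFS code: the tiny block size, the node names and the round-robin placement are all illustrative assumptions (real HDFS defaults to 128 MB blocks, three replicas, and a rack-aware placement policy).

```python
# Toy model of HDFS-style storage: fragment a file into fixed-size
# blocks, replicate each block on three nodes, and keep a NameNode-style
# directory mapping each block to the nodes holding its replicas.
from itertools import cycle

BLOCK_SIZE = 8          # bytes here, so the example stays small
REPLICATION = 3         # each block is stored on three different nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data, size=BLOCK_SIZE):
    """Fragment a file into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Build the block -> replica-locations directory held by the NameNode."""
    directory = {}
    node_ring = cycle(nodes)            # hand out nodes round-robin
    for block_id, _ in enumerate(blocks):
        directory[block_id] = [next(node_ring) for _ in range(replication)]
    return directory

blocks = split_into_blocks(b"I go to school. " * 2)
for block_id, replicas in place_blocks(blocks).items():
    print(block_id, replicas)
```

Because every block lives on several nodes, losing one disk loses no data, and the scheduler can run a mapper on whichever replica-holding node is least busy, which is the data locality described above.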

What next? I just want to answer one final question: why is there all this hype about Hadoop and MapReduce? It is because they can handle amounts of data on the order of terabytes to petabytes, which is otherwise practically impossible. Hadoop is scalable, i.e. as the amount of data keeps increasing, you just keep increasing the number of nodes, and there you are, with complete access to the data that accrues every day. One more thing: if not for cloud computing platforms like Hadoop, many businesses would not exist today. Twitter, Facebook and LinkedIn all use Hadoop. Small businesses generally do not have the capital to invest in thousands of computers with disks and processors. Cloud computing services enable these small businesses to pay a monthly rental to avail themselves of the service, thereby encouraging many promising and enterprising small businesses. (Twitter and Facebook wouldn't be here today if not for these cloud services.)

So who has invested in the cloud infrastructure? Firms like Nebula and Rackspace rent commodity hardware to customers for a monthly fee. In the times ahead, as most articles and blogs suggest, IT solutions are going to be largely cloud based, and whoever undertakes research in cloud computing will be heavily rewarded.