Sunday, December 29, 2013

Apache Hadoop 2 / YARN / MR2 Installation for Beginners

Background: Big Data spans three dimensions: Volume, Velocity and Variety. (IBM defines a fourth dimension or property of Big Data, Veracity.) Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets (Big Data) across clusters of commodity machines (low-cost servers). It is designed to scale up to thousands of machines with a high degree of fault tolerance: the software itself has the intelligence to detect and handle failures at the application layer.

Apache Hadoop YARN is the next-generation Hadoop framework, designed to take Hadoop beyond MapReduce for data processing. The result is better cluster utilization, which permits Hadoop to scale to accommodate more and larger jobs.

The first test with Hadoop is to run an existing Hadoop program: launch the program, monitor its progress, and get/put files on HDFS. This program calculates the value of pi in parallel, i.e. using 2 maps with 10 samples each.
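To get a feel for what "2 maps with 10 samples" means, here is a small Python sketch of the general idea. It is not the code of the bundled Hadoop example: the function names map_samples and reduce_estimate are made up for illustration, and plain pseudo-random sampling is assumed here for simplicity. Each "map" throws points into the unit square and counts how many land inside the quarter circle; the "reduce" step combines the counts into one estimate.

```python
import random


def map_samples(num_samples, seed):
    """One hypothetical 'map' task: count random points inside the quarter circle."""
    rng = random.Random(seed)  # a separate seed per map keeps each run repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, num_samples


def reduce_estimate(results):
    """Hypothetical 'reduce' step: combine the per-map counts into one estimate of pi."""
    inside = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    return 4.0 * inside / total


# 2 maps with 10 samples each, mirroring the numbers in the text above
results = [map_samples(10, seed) for seed in range(2)]
print("Estimated value of pi:", reduce_estimate(results))
```

With only 20 samples in total the estimate is very rough; more maps or more samples per map bring it closer to the true value.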

WordCount Example:
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and the sum.
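The mapper and reducer logic described above can be sketched in a few lines of plain Python. This is only an illustration that runs on one machine, not the Java code shipped with Hadoop; the function names map_line, reduce_counts, word_count and format_output are made up for this sketch, and words are assumed to be separated by whitespace.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: break one line into words and emit a (word, 1) pair for each."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run every line through the mapper, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line))
    return reduce_counts(pairs)


def format_output(counts):
    """Format each result as 'word<TAB>count', sorted by word."""
    return [f"{word}\t{count}" for word, count in sorted(counts.items())]


lines = ["hello hadoop", "hello yarn"]
for row in format_output(word_count(lines)):
    print(row)
```

On a real cluster the mappers run in parallel on different parts of the input and the framework groups the pairs by word before they reach the reducers; here everything happens in a single process to keep the idea visible.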

To run the example, the command syntax is:

bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>

All of the files in the input directory (called in-dir in the command line above) are read, and the counts of words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS. If your input is not already in HDFS but is in a local file system somewhere, you need to copy the data into HDFS as shown in steps 29-31 above.
NOTE: Similarly, you could process bigger data files (weather data, healthcare data, machine log data, etc.).