Tuesday, 26 February 2013

The Apache Hadoop framework forms the kernel of an operating system for big data, permitting users to share resources and managing permissions and allocations.

MapReduce Layer:

The Task Tracker on each node spawns off a separate Java Virtual Machine process to prevent the Task Tracker itself from failing if the running job crashes the JVM.

The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible.
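To make that submission path concrete, below is a minimal sketch of a driver that hands a job to the cluster through the standard org.apache.hadoop.mapreduce API; the Job Tracker then schedules the map tasks on Task Trackers near the input blocks. The TokenMapper and SumReducer classes it wires in are hypothetical word-count classes, sketched in the next section, and the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures a job and submits it to the cluster, where the
// Job Tracker assigns map tasks to Task Trackers close to the data.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);     // hypothetical Mapper, sketched below
        job.setReducerClass(SumReducer.class);     // hypothetical Reducer, sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until the job finishes
    }
}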

Crux of MapReduce Architecture:

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.
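A minimal sketch of such a map task, assuming a hypothetical word-count style job: the input types (LongWritable byte offset, Text line) differ from the intermediate types (Text word, IntWritable count), and a single input line may emit zero or many pairs.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits each input line into words and emits (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // one input line may emit zero or many pairs
            }
        }
    }
}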

Reducer reduces a set of intermediate values which share a key to a smaller set of values.
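Continuing the same hypothetical word-count job, the reducer below collapses all the counts that share a word into a single total:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums every count emitted for the same word, shrinking many
// intermediate values per key down to one output value.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}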

HDFS Layer:

The Namenode is the single point for storage and management of metadata; this can become a bottleneck when supporting a huge number of files, especially a large number of small files.

Data Nodes talk to each other to rebalance data, move copies around, and keep the replication of data high.
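A small sketch of how a client touches both roles, using the standard FileSystem API: the Namenode is consulted for the file's block locations (metadata), and the bytes are then streamed from the Data Nodes holding the replicas. The path passed on the command line is purely illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads a file from HDFS: metadata lookup goes to the Namenode,
// the actual data is streamed from the Data Nodes.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);             // e.g. an HDFS path supplied by the user
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}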

Hadoop is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.

Hadoop distributes a task or piece of a job across a cluster of machines, each of which works on data stored on its own local disks through HDFS rather than on a shared file system hosted by a SAN.