What is Hadoop?

Hadoop is an open source framework for storing and processing large-scale data (data sets typically gigabytes, terabytes, or petabytes in size), in either structured or unstructured form. Data at this scale is called Big Data, and it usually cannot be handled by legacy data storage mechanisms.

Hadoop is written in Java and developed under the Apache Software Foundation. It can handle multiple terabytes of data reliably and in a fault-tolerant manner.

Hadoop parallelizes data processing across thousands of computers (nodes) in a cluster. The framework runs on ordinary commodity hardware and stores the data distributed across the nodes of the cluster.
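As a rough illustration of what "storing distributed data across nodes" can mean, the toy sketch below cuts a byte string into fixed-size blocks and assigns each block to more than one node. The function names, the node names, the tiny block size, and the round-robin policy are all hypothetical and chosen for readability. HDFS does store files as replicated blocks, but its real placement logic is more involved than this.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only, not the HDFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for index in range(len(blocks)):
        placement[index] = [
            nodes[(index + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"abcdefghij"  # 10 bytes standing in for a very large file
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(blocks, nodes=["node1", "node2", "node3"], replication=2)

for index, holders in placement.items():
    print(index, blocks[index], holders)
```

Because every block lives on two nodes in this sketch, losing any single node still leaves one copy of each block, which is the basic idea behind fault tolerance on commodity hardware.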

This site provides a detailed walkthrough of the Hadoop framework and all of its sub components.

Hadoop Ecosystem Core Components:

Hadoop Common : Common utilities supporting the other Hadoop components

HDFS : Hadoop Distributed File System

YARN : Framework for job scheduling and cluster resource management

MapReduce : Parallel processing mechanism for distributed data

The sub components are:

HBase : Column-oriented database for processing billions of records

Hive : Data warehouse on top of the distributed file system (HDFS)

Pig : High-level programming platform (with the Pig Latin language) for distributed computations

Sqoop : Data transfer tool between RDBMSs and HDFS, HBase, or Hive, in either direction

Flume : Data collection mechanism for log and event data

Oozie : Workflow management service

ZooKeeper : Configuration management and coordination service

Avro : Serialization framework

Tez : Execution framework positioned as a successor to MapReduce

HCatalog : Common interface for Hive, Pig, and HBase

Azkaban : Workflow management tool; an alternative to Oozie

Refer to the corresponding categories on this blog for further details on each sub component of the Hadoop ecosystem.

Why Hadoop?

Nowadays, electronic data is growing rapidly all over the world, on the scale of terabytes (1 TB = 1000 GB) or petabytes (1 PB = 1000 TB). Most of this data is stored in databases distributed across the globe, and the rate of growth is accelerating. Some of the data is structured and some is unstructured, such as flat data sets. Sources that generate huge volumes of data include social networking sites, blogs, databases, and many other kinds of web sites. Organizations and industries analyze current data statistics to forecast business trends in the near future.

But extracting and analyzing vast amounts of structured or unstructured data requires a lot of computational power, beyond what legacy databases and processing techniques can provide. The explosion of data over the years led many organizations to replace their data servers with more powerful servers, which could not solve the problem beyond a certain point of data growth.

That is where Hadoop came in, built on a scale-out approach to storing big data on large clusters of commodity hardware. Because Hadoop is designed to add more commodity machines (scale-out) rather than move to larger servers (scale-up), data storage and maintenance became cheap and cost-effective compared to other storage mechanisms.

To process this big data, distributed across clusters of commodity hardware, the MapReduce technique was introduced. It parallelizes the extraction and processing of structured and unstructured data across the many nodes and hard drives in a cluster.
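The idea behind MapReduce can be sketched on a single machine with a word count, the usual introductory example. This is not Hadoop code and does not use the Hadoop API: the function names are hypothetical, each string in `splits` stands in for a block of data held on a different node, and a thread pool stands in for the cluster that runs the map tasks in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Shuffle: group the emitted values by key across all map outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits, workers=4):
    # The pool is only a stand-in for a cluster: each split is mapped
    # independently, so the map tasks can run at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, splits))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data on hadoop", "hadoop stores big data", "data data data"]
counts = word_count(splits)
print(counts)
```

The map step never needs to see more than its own split, which is what lets the work spread over many nodes; only the shuffle and reduce steps bring together values that share a key.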

Hadoop was created by Doug Cutting, who is also the creator of Apache Lucene, a text search library. Hadoop is written in Java and has its origins in Apache Nutch, an open source web search engine. Because it is developed under the Apache Software Foundation, it is often called Apache Hadoop. It is an open source framework, available as a free download from the Apache Hadoop distributions.