About Rahul Patodi

Hadoop: A Soft Introduction

Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. It incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop as a whole, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Who uses Hadoop:

Hadoop is mainly used by companies that deal with large amounts of data and need to process it, analyze it, or generate reports from it. Organizations using Hadoop include Facebook, Yahoo, Amazon, IBM, Joost, PowerSet, the New York Times, and Veoh. For more information, see the PoweredBy Hadoop page.

Why Hadoop:

MapReduce is Google’s secret weapon: a way of breaking complicated problems apart and spreading them across many computers. Hadoop is an open source implementation of MapReduce, together with its own file system, HDFS (Hadoop Distributed File System).
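To make "breaking a problem apart" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the problem. It deliberately does not use the Hadoop API: the class name `MapReduceSketch` and its methods are hypothetical, and everything runs in memory on one machine, whereas a real cluster would spread the map and reduce calls across many nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Hadoop MapReduce API.
public class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group every emitted value under its key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values of one key into a single count.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop stores data", "hadoop processes data");
        // prints {data=2, hadoop=2, processes=1, stores=1}
        System.out.println(wordCount(lines));
    }
}
```

Because each `map` call looks at only one line and each `reduce` call at only one key, the calls are independent of each other, which is what lets a framework like Hadoop run them on different machines.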

Hadoop has defeated supercomputers in the terabyte sort:

Hadoop sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark. The sort benchmark, created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This was the first time that either a Java or an open source program had won.
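The benchmark figures can be checked with a little arithmetic. The sketch below confirms that 10 billion 100-byte records add up to 1 terabyte and derives the aggregate sort rate implied by the 209-second run. The class name `TeraSortArithmetic` is hypothetical, and decimal units (1 GB = 10^9 bytes) are an assumption.

```java
import java.util.Locale;

// Hypothetical helper for checking the benchmark numbers quoted in the text.
public class TeraSortArithmetic {

    // Total input size in bytes.
    public static long totalBytes(long records, long bytesPerRecord) {
        return records * bytesPerRecord;
    }

    // Aggregate rate in decimal gigabytes per second (assumes 1 GB = 1e9 bytes).
    public static double gigabytesPerSecond(long bytes, double seconds) {
        return bytes / 1e9 / seconds;
    }

    public static void main(String[] args) {
        long bytes = totalBytes(10_000_000_000L, 100L);
        System.out.println(bytes); // prints 1000000000000, i.e. 1 terabyte

        // prints 4.78 GB/s
        System.out.println(String.format(Locale.ROOT, "%.2f GB/s", gigabytesPerSecond(bytes, 209)));

        // prints 1.42x faster than the previous record
        System.out.println(String.format(Locale.ROOT, "%.2fx faster than the previous record", 297.0 / 209.0));
    }
}
```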

Europe’s Largest Ad Targeting Platform Uses Hadoop:

Europe’s largest ad company receives over 100 GB of data daily. With a classical solution such as an RDBMS, it needed 5 days to analyze the data and generate reports, so it was always running about a week behind. After a lot of research, the company started using Hadoop. The interesting fact is that it is now able to process the data and generate reports within 1 hour. That is the beauty of Hadoop.

ZooKeeper: A high-performance coordination service for distributed applications.

2. Cloudera Hadoop:

Cloudera describes its Distribution for Apache Hadoop (CDH) as setting a new standard for Hadoop-based data management platforms, as the most comprehensive platform available, and as significantly accelerating the deployment of Apache Hadoop in an organization. CDH is based on the most recent stable version of Apache Hadoop. It includes some useful patches backported from future releases, as well as improvements Cloudera has developed for its customers.

Cloudera Hadoop Offers:

HDFS – Self healing distributed file system

MapReduce – Powerful, parallel data processing framework

Hadoop Common – A set of utilities that support the Hadoop subprojects

HBase – Hadoop database for random read/write access

Hive – SQL-like queries and tables on large datasets

Pig – Dataflow language and compiler

Oozie – Workflow for interdependent Hadoop jobs

Sqoop – Integrate databases and data warehouses with Hadoop

Flume – Highly reliable, configurable streaming data collection

ZooKeeper – Coordination service for distributed applications

Hue – User interface framework and SDK for visual Hadoop applications

Architecture of Hadoop:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

Name Node:

The NameNode manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster. The NameNode acts as the master and the DataNodes act as slaves. The NameNode holds all the information about the data (i.e. the metadata), not the data itself.

Data Node:

DataNode holds the actual file system data. Each data node manages its own locally-attached storage (i.e. the node’s hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.
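The division of labour between the two node types can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not how HDFS is implemented: the class `HdfsToyModel`, its methods, the tiny block size, and the round-robin replica placement are all hypothetical, and there is no networking or failure handling.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; names and placement policy are assumptions, not HDFS internals.
public class HdfsToyModel {

    /** Slave: stores block contents on its own local storage. */
    public static class DataNode {
        final Map<String, byte[]> blocks = new HashMap<>();
    }

    /** Master: stores only metadata, never file contents. */
    public static class NameNode {
        final Map<String, List<String>> fileToBlocks = new HashMap<>();
        final Map<String, List<DataNode>> blockToNodes = new HashMap<>();
    }

    private final NameNode nameNode = new NameNode(); // exactly one per cluster
    private final List<DataNode> dataNodes = new ArrayList<>();
    private final int blockSize;
    private final int replication;

    public HdfsToyModel(int dataNodeCount, int blockSize, int replication) {
        if (dataNodeCount < 1 || blockSize < 1 || replication < 1) {
            throw new IllegalArgumentException("all arguments must be at least 1");
        }
        for (int i = 0; i < dataNodeCount; i++) {
            dataNodes.add(new DataNode());
        }
        this.blockSize = blockSize;
        this.replication = Math.min(replication, dataNodeCount);
    }

    // Split the file into blocks, copy each block to several DataNodes,
    // and record the locations in the NameNode's metadata.
    public void write(String path, byte[] data) {
        List<String> blockIds = new ArrayList<>();
        for (int offset = 0, n = 0; offset < data.length; offset += blockSize, n++) {
            String blockId = path + "#" + n;
            byte[] block = Arrays.copyOfRange(data, offset, Math.min(offset + blockSize, data.length));
            List<DataNode> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Round-robin placement (an assumption made for simplicity).
                DataNode node = dataNodes.get((n + r) % dataNodes.size());
                node.blocks.put(blockId, block);
                holders.add(node);
            }
            nameNode.blockToNodes.put(blockId, holders);
            blockIds.add(blockId);
        }
        nameNode.fileToBlocks.put(path, blockIds);
    }

    // Ask the NameNode where the blocks are, then fetch them from DataNodes.
    public byte[] read(String path) {
        List<String> blockIds = nameNode.fileToBlocks.get(path);
        if (blockIds == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String blockId : blockIds) {
            byte[] block = nameNode.blockToNodes.get(blockId).get(0).blocks.get(blockId);
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    // Number of blocks the NameNode has recorded for a file.
    public int blockCount(String path) {
        return nameNode.fileToBlocks.get(path).size();
    }

    public static void main(String[] args) {
        HdfsToyModel fs = new HdfsToyModel(3, 4, 2);
        fs.write("/demo.txt", "hello hadoop".getBytes());
        System.out.println(fs.blockCount("/demo.txt"));        // prints 3
        System.out.println(new String(fs.read("/demo.txt")));  // prints hello hadoop
    }
}
```

Note that `read` never gets file contents from the NameNode; it only looks up block locations there, which mirrors the metadata-versus-data split described above.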

Install / Deploy Hadoop:

Hadoop can be installed in 3 modes:

1. Standalone mode: To deploy Hadoop in standalone mode, we just need to set the JAVA_HOME path. In this mode there is no need to start the daemons and no need to format the NameNode, because data is saved on the local disk.

2. Pseudo Distributed mode: In this mode all the daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) run on a single machine.

3. Fully Distributed mode: In this mode, the NameNode, JobTracker and (optionally) SecondaryNameNode daemons run on the master (NameNode) machine, and the DataNode and TaskTracker daemons run on the slave (DataNode) machines.

Stay tuned for an article on the three Hadoop modes/configurations.
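As a recap of the three modes, the daemon placement described above can be written down as a small lookup table. This is a hypothetical summary type, not part of Hadoop: the `HadoopModes` class, the `Mode` enum, and its method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical summary of the text above; not a Hadoop class.
public class HadoopModes {

    public enum Mode {
        // No daemons are started; data is saved on the local disk.
        STANDALONE(
            Collections.<String>emptyList(),
            Collections.<String>emptyList()),
        // All five daemons share one machine, so there are no separate slaves.
        PSEUDO_DISTRIBUTED(
            Arrays.asList("NameNode", "DataNode", "SecondaryNameNode", "JobTracker", "TaskTracker"),
            Collections.<String>emptyList()),
        // Master daemons and slave daemons run on different machines.
        FULLY_DISTRIBUTED(
            Arrays.asList("NameNode", "JobTracker", "SecondaryNameNode"),
            Arrays.asList("DataNode", "TaskTracker"));

        private final List<String> masterDaemons; // on the single or master machine
        private final List<String> slaveDaemons;  // on each slave machine

        Mode(List<String> masterDaemons, List<String> slaveDaemons) {
            this.masterDaemons = masterDaemons;
            this.slaveDaemons = slaveDaemons;
        }

        public List<String> masterDaemons() {
            return masterDaemons;
        }

        public List<String> slaveDaemons() {
            return slaveDaemons;
        }

        // Only standalone mode skips formatting the NameNode.
        public boolean needsNameNodeFormat() {
            return this != STANDALONE;
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": master=" + mode.masterDaemons()
                + ", slaves=" + mode.slaveDaemons()
                + ", format=" + mode.needsNameNodeFormat());
        }
    }
}
```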
