What is Hadoop?

Apache™ Hadoop® is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.

Hadoop for the Enterprise from IBM

IBM BigInsights brings the power of Hadoop to the enterprise, enhancing it with administration, discovery, development, provisioning, and security features, along with enterprise-grade support and best-in-class analytical capabilities. IBM BigInsights for Apache Hadoop is a complete solution for large-scale analytics.

Why use Hadoop?

Hadoop changes the economics and the dynamics of large-scale computing.
Its impact can be boiled down to four salient characteristics.

Hadoop enables a computing solution that is:

Scalable

A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.

Cost effective

Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

Flexible

Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

Fault tolerant

When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.

Hadoop architecture

Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce and YARN.

Hadoop Common

A module containing the utilities that support the other Hadoop components.

Hadoop Distributed File System (HDFS)

A distributed file system that stores data in large blocks across the nodes of a cluster, replicating each block so that data survives the failure of individual machines and can be processed where it is stored.
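
To make these pieces concrete, here is a minimal Java sketch that uses the Hadoop Common Configuration class and the FileSystem API to write and inspect a file on HDFS. It assumes a reachable cluster configured through core-site.xml; the path used here is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Configuration is part of Hadoop Common; it reads core-site.xml
        // to discover the cluster's default file system (fs.defaultFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small text file into HDFS (hypothetical path).
        Path path = new Path("/user/example/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS");
        }

        // Confirm the file exists and report the block size HDFS chose for it.
        System.out.println("exists: " + fs.exists(path));
        System.out.println("block size: " + fs.getFileStatus(path).getBlockSize());
        fs.close();
    }
}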

MapReduce

A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
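
The canonical MapReduce example is word counting: map emits the pair (word, 1) for every word in its input split, and reduce sums the counts for each word. The sketch below follows the classic WordCount program from the Apache Hadoop documentation, taking input and output paths as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}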

YARN

The next-generation MapReduce framework, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
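
As a small illustration of YARN's role, the following Java sketch uses the YarnClient API to ask the ResourceManager which worker nodes are currently running and what CPU and memory each one offers; it assumes a reachable cluster configured through yarn-site.xml.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // YarnClient talks to the ResourceManager, which arbitrates
        // CPU and memory among all applications on the cluster.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // List the worker nodes currently offering resources.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s: %s available%n",
                    node.getNodeId(), node.getCapability());
        }
        yarn.stop();
    }
}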

Hadoop is supplemented by an ecosystem of Apache open-source projects that extend the value of Hadoop and improve its usability.

Data access projects

Hive

A Hadoop runtime component that allows those fluent with SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements. These are broken down into MapReduce jobs and executed across the cluster.
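
As an illustrative sketch, the Java program below submits an HQL statement to Hive over its standard JDBC interface. It assumes a running HiveServer2 on its default port, the hive-jdbc driver on the classpath, and a hypothetical weblogs table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Connect to a HiveServer2 instance; host, port, database and
        // table name here are hypothetical.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // An HQL statement: Hive compiles this into MapReduce jobs
            // that run across the cluster.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}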

Search projects

Solr

An enterprise search tool from the Apache Lucene project that offers powerful search capabilities, including hit highlighting, as well as indexing capabilities, reliability and scalability, a central configuration system, and failover and recovery.
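
Below is a minimal sketch of querying Solr from Java with the SolrJ client, with hit highlighting switched on; the Solr URL, collection name and field names are assumptions for illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws Exception {
        // URL and collection name are hypothetical.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {

            // Search the "body" field with hit highlighting enabled.
            SolrQuery query = new SolrQuery("body:hadoop");
            query.setHighlight(true);
            query.addHighlightField("body");

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}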

Administration and security projects

Kerberos

A network authentication protocol that works on the basis of “tickets” to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
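
Hadoop clients authenticate to a Kerberos-secured cluster through the UserGroupInformation class. The sketch below logs in from a keytab before any cluster calls are made; the principal and keytab path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        // Tell the Hadoop client that the cluster requires Kerberos.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are hypothetical; the keytab holds
        // the long-lived key used to obtain Kerberos tickets.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser().getUserName());
    }
}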

ZooKeeper

A centralized infrastructure and set of services that enable synchronization across a cluster.
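
A common use of this synchronization is group membership: each worker registers an ephemeral znode that ZooKeeper deletes automatically when the worker's session ends, so the rest of the cluster can see who is alive. Here is a minimal Java sketch, assuming a ZooKeeper server on localhost:2181 and an existing /workers parent znode.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Block until the session with the ensemble is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral znode lives only as long as this session, which is
        // how cluster members signal liveness to one another.
        String path = zk.create("/workers/worker-", "ready".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);
        zk.close();
    }
}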