Apache Hadoop is an open source framework for distributed processing of large data sets (structured and unstructured). Files are split into blocks and distributed across the nodes of a cluster that can scale from one to thousands of computers.

Most of an average company's data is unstructured (e.g., emails and social media posts) and doesn't fit neatly into rows and columns. Hadoop handles data in any file or format.

The framework consists of:

Hadoop Common - utilities that support the other modules

HDFS (Hadoop Distributed File System)

YARN - job scheduling and cluster resource management

MapReduce - YARN-based system for parallel processing of large data sets

The "map" job converts input data into intermediate tuples (key-value pairs)

The "reduce" job combines those tuples into a smaller set of tuples

Example: multiple files contain high temperatures for the same cities on different days. The "map" job finds the highest temperature in each file, as a distinct task per file. Those individual results are fed into the "reduce" phase, which combines all the high temperatures to arrive at a single high temperature for each city (see the sketch below).
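
A minimal pure-Python sketch of that flow (the city readings are made up, and the per-file map tasks that a real Hadoop job would run in parallel across the cluster are simulated with plain function calls):

from collections import defaultdict

# Hypothetical input: each inner list stands for one file of (city, high) readings.
files = [
    [("Toronto", 20), ("New York", 22), ("Rome", 32)],
    [("Toronto", 18), ("New York", 27), ("Rome", 33)],
    [("Toronto", 25), ("New York", 21), ("Rome", 30)],
]

def map_phase(records):
    # One map task per file: emit the highest reading per city in that file.
    highs = {}
    for city, temp in records:
        highs[city] = max(temp, highs.get(city, temp))
    return list(highs.items())

def reduce_phase(mapped):
    # Combine the per-file tuples into a single (city, overall high) per city.
    combined = defaultdict(list)
    for pairs in mapped:
        for city, temp in pairs:
            combined[city].append(temp)
    return {city: max(temps) for city, temps in combined.items()}

print(reduce_phase([map_phase(f) for f in files]))
# {'Toronto': 25, 'New York': 27, 'Rome': 33}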

Hadoop also includes a larger collection of services and tools, such as:

Spark

General-purpose cluster computing system

Write applications in Python, Java, R, or Scala

Combine libraries for machine learning, SQL, streaming, and analytics in a single application

Run SQL queries from within a Spark program

Uniform access to different data sources, such as JSON, JDBC, and Hive

Based on the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster

An RDD typically starts from a file in Hadoop and is built up through transformations of that data

Running an action on the transformed dataset returns a result to the driver program
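
A minimal PySpark sketch tying these pieces together, assuming a local Spark installation (e.g., pip install pyspark); the HDFS paths and the "city,temp" file layout are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HighTemps").getOrCreate()
sc = spark.sparkContext

# Transformations (lazy): build an RDD from a file, parse "city,temp" lines
# into key-value pairs, and keep the highest temperature per city.
highs = (sc.textFile("hdfs:///data/temperatures.csv")  # hypothetical path
           .map(lambda line: line.split(","))
           .map(lambda parts: (parts[0], int(parts[1])))
           .reduceByKey(max))

# Action: collect() triggers the job and returns the results to the driver.
for city, high in highs.collect():
    print(city, high)

# Uniform data access: read JSON into a DataFrame and query it with SQL
# from within the same program (path and schema hypothetical).
readings = spark.read.json("hdfs:///data/readings.json")
readings.createOrReplaceTempView("readings")
spark.sql("SELECT city, MAX(temp) AS high FROM readings GROUP BY city").show()

spark.stop()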

Pig

Platform for analyzing large data sets that reduces the time spent writing map/reduce programs

Handles any type of data

Applications are written in the Pig Latin language and executed in Pig's runtime environment, which compiles them into map/reduce jobs