Hadoop was developed to enable applications to work with thousands of computationally independent computers and petabytes of data. Hadoop is a popular open source project that not only incorporates an implementation of the MapReduce programming model but also includes other subprojects supporting reliable and scalable distributed computing, such as HDFS (a distributed file system) and Pig (a high-level data flow language for parallel computing), along with others. See http://hadoop.apache.org.

The Hadoop stack includes more than a dozen components, or subprojects, that are complex to deploy and manage. Installation, configuration, and production deployment at scale are difficult.

• Transformations and enhancements, such as auto-tagging social media content, ETL processing, and data standardization.

MapReduce

MapReduce is a programming model introduced and described by researchers at Google for parallel computation involving large data sets that are distributed across clusters of many processors. In contrast to the explicitly parallel programming models typically used with imperative languages such as Java and C++, the MapReduce programming model is reminiscent of functional languages such as Lisp and APL in its reliance on two basic operational steps:

• Map, which describes the computation or analysis to be applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, and

• Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.

Conceptually, the computations applied during the Map phase to each input key/value pair are inherently independent, which means that both the data and the computations can be distributed across multiple storage and processing units and automatically parallelized.
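The shape of these two steps can be sketched as a pair of generic interfaces. This is a minimal illustration only; the type and method names below are invented for exposition and are not Hadoop's actual API.

public class MapReduceShape {

    // A simple key/value pair used by both steps.
    record Pair<K, V>(K key, V value) { }

    // Map: applied independently to each input key/value pair,
    // producing a set of intermediate key/value pairs.
    interface Mapper<K1, V1, K2, V2> {
        Iterable<Pair<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce: applied to each intermediate key together with all of the
    // values associated with it, producing the final result for that key.
    interface Reducer<K2, V2, R> {
        R reduce(K2 key, Iterable<V2> values);
    }
}

Because each call to map depends only on its own input pair, the runtime is free to schedule those calls on different nodes; reduce only needs all of the values for a given intermediate key to be routed to the same node.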

A Common Example

The ability to scale based on automatic parallelization can be demonstrated using a common MapReduce example that counts the number of occurrences of each word in a collection of many documents. Looking at the problem provides a hierarchical view:

• The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document;

• The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph;

• The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.

This apparent recursion provides the context for both our Map function, which instructs each processing node to map each word to its count, and the Reduce function, which collects the word count pairs and sums together the counts for each particular word. The runtime system is responsible for distributing the input to the processing nodes, initiating the Map phase, coordinating the communication of the intermediate results, initiating the Reduce phase, and then collecting the final results.
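In Hadoop, this is essentially the well-known WordCount example from the MapReduce tutorial. The sketch below, written against the org.apache.hadoop.mapreduce API, shows one possible Mapper and Reducer; the class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}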

While we can speculate on the level of granularity for computation (document vs. paragraph vs. sentence), ultimately we can leave it up to the runtime system to determine the best distribution of data and allocation of computation to reduce the execution time. In fact, the value of a programming model such as MapReduce is that its simplicity essentially allows the programmer to describe the expected results of each of the computational phases while relying on the compiler and runtime systems for optimal parallelization and fault tolerance.
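This division of labor shows up in the job driver. The sketch below (assuming the TokenizerMapper and IntSumReducer classes from the previous listing) declares only the phases and their output types; the Hadoop runtime takes care of input splitting, task scheduling, shuffling the intermediate pairs, and recovering from failed tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Declare the phases; the framework decides where and how they run.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer as a combiner is an optional optimization: it pre-aggregates counts on each map node before the shuffle, which works here because summation is associative and commutative.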

• Data volumes – As more data is created and made available, the analysis platform must be able to absorb and handle larger volumes.

• Performance – Two considerations drive the need for increased performance: massive data volumes (which may include unstructured content, massive web transaction logs, and daily terabyte feeds of call detail records), and the desire to distill critical bits of actionable knowledge from that data. Improving performance depends on scalability: the application’s runtime should improve in proportion to the computational, network bandwidth, and storage resources added to the cluster.

• Data integration – Not only has the volume of data increased, but so has its variety. Analytical applications increasingly depend on the combination of both structured and unstructured data, either managed internally or fed by external sources, from static or persisted sources or from continuous streams.

• Fault tolerance – Increased analytical complexity usually goes hand in hand with longer-running computations, exposing the environment to the risk of system failures. It is desirable to enable recovery from a failure without having to restart the entire process.

At a higher level, some systems provide a level of fault tolerance that allows queries or processes to be restarted at specific checkpoint locations when a component fails.

• Heterogeneity – With emerging approaches to virtual clusters and cloud computing relying on a variety of computing resources, the ability to scale the analysis framework (to hundreds, or even thousands, of machines) using homogeneous or heterogeneous systems provides the potential for even more flexibility in resource allocation and usage.

• Latency – The time from when data is recorded to when questions are answered is a critical component when reacting in a dynamic business environment. Low latency means fast data ingest, dynamically scalable processing, and fast query response times, all operating concurrently.

Keeping these characteristics in mind, we can review two different approaches to scalable, high-performance data analysis that are growing in popularity: the MapReduce programming model and the use of high-performance analytical databases.