Apache Hadoop: Big data's big player

By Patrick Marshall

Feb 07, 2012

If any single technology enabled the analysis of big data, it is Apache Hadoop.

Hadoop is software that allows for the distributed processing of large datasets across clusters of computers. It is designed to scale from a single server to thousands of machines, with each machine storing and processing its own piece of the dataset locally.

The framework was originally developed by Doug Cutting (then at Yahoo and now chairman of the board of the Apache Software Foundation), who named it after his son’s toy elephant. It is based on Google's MapReduce programming model, which splits a computation into parallel "map" tasks that process pieces of the data and "reduce" tasks that combine the results.

The Hadoop framework includes several modules: Hadoop Common, the common utilities that support the other Hadoop projects; the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data; and Hadoop MapReduce, a software framework for distributed processing of large datasets on compute clusters.
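Although production Hadoop jobs are typically written in Java against the framework's own APIs, the map/shuffle/reduce model itself is simple enough to sketch in a few lines of plain Python. The following is a conceptual illustration only (it uses no Hadoop code), showing MapReduce's canonical example, a word count:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big player", "big elephant"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # → {'big': 3, 'data': 1, 'player': 1, 'elephant': 1}
```

In a real cluster, Hadoop runs many map and reduce tasks in parallel on the machines that hold each slice of the data, which is what lets the same simple model scale to very large datasets.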

While the public sector is just beginning to set up projects based on Hadoop, more than 100 major private-sector big data applications have been built on the Apache Hadoop framework. Some of the more notable adopters include Facebook, eBay, Twitter, Yahoo and the New York Times.