Saturday, March 21, 2015

The three open source projects that transformed Hadoop

Hadoop, an open source software framework with the funny-sounding name, has
been a game-changer for organizations by allowing them to store, manage,
and analyze massive amounts of data for actionable insights and
competitive advantage.

But this wasn't always the case.
Initially, Hadoop implementation required skilled teams of engineers
and data scientists, making Hadoop too costly and cumbersome for many
organizations. Now, thanks to a number of open source projects, big data
analytics with Hadoop has become much more affordable and mainstream.
Here's a look at how three open source projects—Hive, Spark, and Presto—have transformed the Hadoop ecosystem.

Hive

An early problem with Hadoop was that while it was great for storing
and managing massively large data volumes, analyzing that data for
insights was difficult. Only engineers skilled in writing complex
MapReduce jobs in Java could unleash Hadoop's analytics
capabilities. To solve that problem, two engineers at
Facebook, Ashish Thusoo and Joydeep Sen Sarma, who later went on to
found the cloud-based Hadoop big data analytics service Qubole,
created Hive, which Facebook open sourced in 2008 and which is now an Apache project.
Capitalizing on the ease of use of Structured Query Language (SQL), a
language that requires relatively little training and is widely used by
data engineers, Hive accepts queries written in a SQL-like language called
HiveQL and automatically translates them into MapReduce jobs executed on Hadoop.
Because SQL is the preferred data language taught in schools and used in
the industry, Hive, which put SQL on top of Hadoop, transformed Hadoop
by making its formidable analytics power more readily available to
people and organizations, not just developers. Hive is best used for
summarizing, querying, and analyzing large sets of structured data where
time is not of the essence.
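
To make the translation step concrete, here is a minimal sketch in plain Python of the kind of work a query such as SELECT page, COUNT(*) FROM views GROUP BY page turns into under MapReduce. It is an illustration only, not the plan Hive actually generates; the table name views, the column page, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a "views" table: (user, page).
VIEWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/about"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; summing the 1s implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(rows):
    """Chain the three phases for the hypothetical GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(VIEWS))
```

The point of the sketch is that one line of SQL-like text replaces three hand-written phases, which is the convenience Hive brought to Hadoop users.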

Spark

While Hive on MapReduce is very effective for summarizing, querying,
and analyzing large sets of structured data, MapReduce computations
themselves are slow and limited, which is where Spark comes
in. Developed at UC Berkeley's AMPLab in 2009 and open sourced in 2010,
Apache Spark is a data processing engine for Hadoop designed to
handle both batch and streaming workloads quickly. In fact, the Spark
project reports that Spark can run programs up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.
The advantage for users is that Spark not only supports operations
such as SQL queries, streaming data, and complex analytics such as
machine learning and graph algorithms, but also allows these
capabilities to be combined in a single workflow. In
addition, Spark can read from Hadoop's Distributed File System
(HDFS), HBase, and other Hadoop-supported storage systems, which means that
an organization's existing Hadoop data is immediately usable in Spark. And
Spark's ability to unify big data analytics reduces the need for
organizations to build separate processing systems to take care of their
various computational needs.
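
One commonly cited reason for the in-memory speedup is that a working dataset can be loaded once and reused across several passes instead of being reread from disk each time. The toy sketch below illustrates that idea in plain Python. It does not use Spark, and the DiskDataset class and its methods are hypothetical names invented for the example.

```python
import os
import tempfile


class DiskDataset:
    """Hypothetical dataset that counts how often it is read from disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0
        self._cache = None

    def _load(self):
        # Every call here stands in for an expensive trip to disk.
        self.disk_reads += 1
        with open(self.path) as handle:
            return [int(line) for line in handle]

    def rows(self, use_cache):
        if not use_cache:
            return self._load()
        if self._cache is None:
            self._cache = self._load()
        return self._cache


def run_passes(dataset, use_cache, passes=3):
    """Run several passes over the data and return the sum from each pass."""
    return [sum(dataset.rows(use_cache)) for _ in range(passes)]


def demo():
    """Return (disk reads without caching, disk reads with caching)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "numbers.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(str(n) for n in range(1, 101)))
        uncached = DiskDataset(path)
        cached = DiskDataset(path)
        run_passes(uncached, use_cache=False)
        run_passes(cached, use_cache=True)
        return uncached.disk_reads, cached.disk_reads


if __name__ == "__main__":
    print(demo())
```

Both variants compute the same sums; the only difference is how many times the file is read, which is the cost that grows with multi-pass workloads such as iterative machine learning.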

Presto

Faced with the task of performing fast interactive analysis on a
massive data warehouse of over 250 petabytes and counting, engineers at
Facebook developed their own query engine, called Presto. Unlike
Spark, which runs programs both in memory and on disk, Presto runs in
memory only. This design allows Presto to run simple queries on
Hadoop in just a few hundred milliseconds, with more complex queries
taking only a few minutes. In contrast, scanning over an entire dataset
using Hive, which relies on MapReduce, can take anywhere from several
minutes to several hours. Presto has also been shown to be up to seven
times more CPU-efficient than Hive. In addition, Presto can combine data
from multiple sources in a single query, allowing for analytics across
an entire organization.
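
As a rough analogy for querying several sources at once, the sketch below uses Python's built-in sqlite3 module to join tables that live in two separate in-memory databases with one SQL statement. This is not Presto and does not use its connectors; the database alias crm, the table names, and the sample rows are all hypothetical.

```python
import sqlite3


def build_connection():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database under the alias "crm".
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, 20.0), (2, 5.0), (1, 12.5)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    return conn


def revenue_by_customer(conn):
    """Join across both databases in a single query."""
    query = """
        SELECT c.name, SUM(o.amount)
        FROM main.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(revenue_by_customer(build_connection()))
```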
Today Presto is available as an open source distributed SQL query
solution that organizations can use to run interactive analytic queries
on data sources ranging from gigabytes to petabytes. With the ability to
scale to the size of organizations as big as Facebook, Presto is a
powerful query engine that has transformed the Hadoop ecosystem and
could be transformative for organizations and entire industries as well.

Big data is getting bigger every day. As organizations look for new
and better ways to leverage valuable data, they will rely less on
raw Hadoop MapReduce for batch processing and more on open source tools such as
Hive, Spark, and Presto to meet the big data demands of the future.