Thursday, November 28, 2013

Distributed SQL Query Engine for Big Data

According to Facebook, Presto is 10 times faster than Hive for most queries.

Technologically, Hive and Presto are very different, chiefly because
the former relies on MapReduce to carry out its processing and the
latter does not. That difference is largely what makes Presto
suitable for low-latency queries, while the MapReduce-based Hive can take
a long time (especially over Facebook’s many petabytes of data)
because it must scan everything in the cluster and requires lots of disk
writes. Presto also works with a variety of data sources other than the
Hadoop Distributed File System, and it uses ANSI SQL rather than
Hive’s SQL-like language.
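
To make the disk-write point concrete, here is a toy Python sketch. It is not Hive's or Presto's actual code, only an illustration of the general idea under simplifying assumptions. It runs the same filter-and-count query two ways: a staged version that writes every intermediate row to disk before the next stage reads it back, and a pipelined version that streams rows through memory. The function names, the sample rows, and the `status` and `country` fields are all made up for this example.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical sample data; the field names are invented for this sketch.
SAMPLE_ROWS = [
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "ok"},
    {"country": "DE", "status": "ok"},
    {"country": "DE", "status": "error"},
    {"country": "FR", "status": "error"},
]


def staged_count(rows, workdir):
    """Staged toy: the filter stage writes its full output to disk,
    then the aggregation stage reads that file back."""
    disk_writes = 0
    stage_file = os.path.join(workdir, "stage1.jsonl")

    # Stage 1: filter, materializing every surviving row on disk.
    with open(stage_file, "w") as f:
        for row in rows:
            if row["status"] == "ok":
                f.write(json.dumps(row) + "\n")
                disk_writes += 1

    # Stage 2: read the intermediate file and count rows per country.
    counts = Counter()
    with open(stage_file) as f:
        for line in f:
            counts[json.loads(line)["country"]] += 1
    return dict(counts), disk_writes


def pipelined_count(rows):
    """Pipelined toy: rows flow from the filter straight into the
    aggregation through a generator, with no intermediate file."""
    filtered = (row for row in rows if row["status"] == "ok")
    counts = Counter(row["country"] for row in filtered)
    return dict(counts), 0  # no intermediate disk writes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(staged_count(SAMPLE_ROWS, workdir))
    print(pipelined_count(SAMPLE_ROWS))
```

Both functions return the same counts; the only difference is how many intermediate rows hit the disk along the way. Real engines are far more involved, but this is roughly the trade-off that the latency gap comes down to.
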

Presto is currently running in numerous Facebook data centers, and the
company has scaled a single cluster up to 1,000 nodes. More than 1,000
employees run queries on Presto, and together they run more than 30,000
queries per day over a petabyte of data. Facebook engineer Martin
Traverso’s blog post gives many more details about how Presto works and
how Facebook plans to improve its speed and functionality in the near
term.

A Presto screenshot

However, I think the most interesting part about Presto might be less
technological and more about its effects on the Hadoop industry, which
is projected to be worth tens of billions of dollars in the next few
years. The mere fact that Facebook chose to create a website for the
project says something about how seriously the company takes it. And
although Facebook has technically open sourced quite a few Hadoop
improvements over the years, this is the first since Hive where I’ve
noticed such fast (if any) uptake from external companies.

It will be interesting to watch how, if at all, Presto affects adoption
of Cloudera’s Impala, Hortonworks’ Stinger project, Pivotal’s HAWQ or
any of the myriad other SQL-on-Hadoop engines currently fighting for
mindshare. The fact that Presto is open source and ready to use
certainly has to be a big draw for some users, and could help it
establish a solid user base while other technologies are still
maturing.

Facebook isn’t looking to compete with other projects and doesn’t
have a horse in the race from a business perspective. It will likely go
along using and improving Presto at its own pace regardless of what
happens. But serious uptake could inspire the Hadoop vendors to change
their strategies when it comes to the SQL engines they support. Much of
the early innovation in Hadoop came from power users (including Yahoo
and Facebook) rather than software companies, and it’s possible we
haven’t seen the end of that trend.