SQL on Hadoop: A state of the art

Since the mainstream adoption of Hadoop, many open-source and enterprise tools have emerged that aim to solve the problem of “querying Hadoop”. Indeed, Hadoop started as a simple (yet powerful) Big Data framework consisting of (1) a storage layer (HDFS, inspired by GFS) and (2) a processing layer (an implementation of MapReduce). Only after a couple of years did the first releases of HBase (inspired by BigTable) start to make data on Hadoop more actionable. After that, several NoSQL stores and query engines started to work on Hadoop connectors (SOLR-1301 is one of the oldest). However, the problem of “querying Hadoop” remained unsolved in two main respects:

Flexibility: Query engines outside Hadoop had differing feature sets and query languages. Many NoSQL stores were by nature limited to key/value serving. Choosing a NoSQL store always involved a tradeoff, and choosing more than one (the so-called “Polyglot Persistence”) made things harder.

Integration: Making data queryable by a query engine was a challenge in itself (How does this database integrate with Hadoop? What does the connector look like? How mature is it? How efficient is it? Can I update my whole data view atomically, in one ‘big transaction’?).

After some years of “NoSQL hype”, voices from the past – saying SQL, SQL, … – emerged as a solution for Hadoop’s querying flexibility. It is true that Sqoop (as old as HADOOP-5815) already made it possible to connect Hadoop with RDBMSs, but this did not really solve the integration part of the problem. It is also true that “Batch SQL” has been available since the first releases of Hive (late 2008), but latency-wise this never really made Hadoop queryable – rather, it made MapReduce avoidable.

Even though SQL is a 40-year-old language, it has many traits that make it convenient as a querying language for Hadoop:

It provides the querying flexibility needed to compute arbitrary aggregates without pre-calculating all of them beforehand.

It is familiar to many people and companies, making it possible to reuse existing knowledge.

It supports joins, thus not forcing the underlying data model to be completely denormalized.

It makes it easy to integrate existing BI and visualization tools through JDBC or ODBC.
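To make the first and third points above concrete, here is a small sketch using Python’s sqlite3 module as a stand-in engine; on Hadoop the same statements would be submitted to Hive, Impala, Tajo, etc., typically over a JDBC/ODBC connection. The table and column names are hypothetical.

```python
import sqlite3

# Stand-in for a SQL-on-Hadoop engine; tables and data are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE pageviews (user_id INTEGER, url TEXT, ms INTEGER);
    CREATE TABLE users (user_id INTEGER, country TEXT);
    INSERT INTO pageviews VALUES (1, '/home', 120), (1, '/buy', 340),
                                 (2, '/home', 95);
    INSERT INTO users VALUES (1, 'DE'), (2, 'FR');
""")

# An arbitrary ad-hoc aggregate over a join: no pre-computed cube is
# needed, and the data model does not have to be denormalized.
cur.execute("""
    SELECT u.country, COUNT(*) AS views, AVG(p.ms) AS avg_ms
    FROM pageviews p JOIN users u ON p.user_id = u.user_id
    GROUP BY u.country ORDER BY u.country
""")
print(cur.fetchall())  # → [('DE', 2, 230.0), ('FR', 1, 95.0)]
```

The same GROUP BY can be swapped for any other dimension at query time, which is exactly the flexibility that key/value NoSQL stores lack.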

The rise of “SQL on Hadoop” solutions in recent years can also be explained by Hadoop’s maturity. As more and more companies adopt Hadoop, they have pushed the market to offer SQL on top of it. As an example, we see more and more mixed Data Warehouse architectures (see this good paper for reference on the matter).

In this post we will review the state of the art of Hadoop’s SQL querying engines. We will provide a table with the use cases each engine is oriented towards, namely: (1) Batch SQL (long-lived processes that read and write arbitrarily big datasets), (2) Interactive querying (MPP-like response times for ad-hoc queries, on the order of seconds or minutes), (3) Point querying (sub-second reads, for web and mobile applications) and (4) Operational SQL (reads+writes, ACID transactions). Sometimes an engine spans several use cases, which is expected. We will also try to include lesser-known solutions, as current “SQL on Hadoop” reports are slightly biased towards the mainstream ones. We will focus mainly on open-source solutions.

There are many details and intricacies in all of these engines, and it is difficult to compare them (although MapR did a very good job in this post). We recommend analyzing each of them in detail along the following dimensions: community, maturity, SQL completeness, JDBC/ODBC integration, in-memory vs. disk-based execution, data format support, …

LINGUAL: Cascading’s SQL interface, akin to Hive, but unlike Hive it aims to support pure ANSI SQL. If Cascading runs on Tez in the future, LINGUAL will as well.

Apache Tajo: An Apache Top-Level Project since 2014-03-24, designed for both batch and interactive SQL. It supports standard SQL, has a cost-based join optimizer and extensible rewrite rule engines, and uses its own DAG-style distributed processing engine.

Impala: Cloudera’s popular SQL on Hadoop engine. At Berlin’s Big Data Beers we had the opportunity to hear an update on its roadmap, which seems to include disk-based joins and aggregations in the future.

BigSQL: Interactive querying through PostgreSQL. This is quite new and it took me a while to verify, but I finally found confirmation. It has nothing to do with IBM’s interactive SQL querying of the same name, a proprietary offering inside IBM BigInsights.

6 Comments on "SQL on Hadoop: A state of the art"

You might have missed Apache Tajo, which is one of the most promising SQL-on-Hadoop systems. Apache Tajo is a top-level project in the Apache Software Foundation. As you can see, it is actively developed in the ASF and has real use cases.

Also, I’d like to help you fill in your summary table. Tajo is designed for both interactive and batch querying. Currently, it does not support point querying and is not operational. In addition, Tajo supports standard SQL and has a cost-based join optimizer and extensible rewrite rule engines. Tajo uses HDFS as its primary storage and uses its own DAG-style distributed processing engine.

RainStor has an MPP SQL engine that runs natively on HDFS, as well as on NAS, Cloud and CAS. We’ve got production deployments on the major Hadoop distros. We were also recently certified on Cloudera 5.

We fall under the ‘interactive SQL’ category, mainly used for low-latency analytics in support of structured data archiving, although we do support SQL DELETE, so we’re not strictly read-only. We support SQL-92, and have extended our grammar to include SQL features from Oracle, SQL Server and Sybase.

We support schema versioning, data replication, encryption from top to bottom, data masking, retention rules, audit trails, and very aggressive data compression.

Our compressed file format is similar to (but more compressed than) ORC and Parquet. In addition to accessing RainStor data on HDFS through our MPP SQL engine, you can also access the same data through Hive, Pig and MapReduce.

A couple of clarifications regarding JethroData:
– Point querying is one of the main objectives of JethroData, and a key benefit of our full-indexing design
– We expect a beta version to be available for download shortly
– While we do not currently plan to open-source our code, we will provide a free community version as well as a paid-for enterprise version