Stinger.next: Enterprise SQL at Hadoop Scale with Apache Hive

In April of this year, Hortonworks, along with the broad Hadoop community, delivered the final phase of the Stinger Initiative on schedule, completing the work to bring interactive SQL query to Apache Hive. The original goal of Stinger was to advance SQL capabilities at petabyte scale in pure open source. Over 13 months, 145 developers from 44 companies delivered exactly that, contributing over 390,000 lines of code to the Hive project alone.

While this community collaboration has had a tremendously positive impact for data workers, business analysts and the many data center tools around Hadoop that rely on Hive for SQL in Hadoop, it was just the beginning.

Apache Hive, Stinger.next and Enterprise SQL at Hadoop Scale

The Stinger Initiative enabled Hive to support an even broader range of use cases at truly big data scale, moving it beyond its batch roots to support interactive queries – all with a common SQL access layer.

Stinger.next is a continuation of this initiative, focused on further enhancing the speed, scale, and breadth of SQL support to enable truly real-time access in Hive while also adding transactional capabilities. And just as the original Stinger Initiative did, it will be addressed through a familiar three-phase delivery schedule and developed completely in the open Apache Hive community.

Stinger.next Project Goals

Speed

Deliver sub-second query response times.

Scale

The only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes to petabytes.

SQL

Enable transactions and SQL:2011 Analytics for Hive.

Hive has always been the de facto standard for SQL in Hadoop, and these advances will accelerate the production deployment of Hive across a much wider array of scenarios. Some of the key deliverables that will enable these new business applications of Hive include:

Transactions with ACID semantics allow users to easily modify data with inserts, updates, and deletes. They extend Hive from a traditional write-once, read-often system to one that supports analytics over changing data. This enables reporting with occasional corrections and modifications, and allows operational reporting with periodic bulk updates from an operational database.

Sub-second queries will allow users to deploy Hive for interactive dashboards and exploratory analytics with more demanding response-time requirements.

SQL:2011 Analytics allows rich reporting to be deployed on Hive faster, more simply, and more reliably using standard SQL. A powerful cost-based optimizer ensures complex queries and tool-generated queries run fast. Hive now provides the full expressive power that enterprise SQL users have enjoyed, but at Hadoop scale.

Transactions with ACID semantics in Hive

Hive has been used as a write-once, read-often system, where users add partitions of data and query that data often. ACID support is a major paradigm shift, adding SQL transactions that allow users to insert, update, and delete existing data. This opens up a much wider set of use cases that require periodic modifications to existing data. ACID will include BEGIN, COMMIT, and ROLLBACK for multi-statement transactions in upcoming releases.
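To see what multi-statement transactions buy users in practice, here is a minimal sketch of the BEGIN/COMMIT/ROLLBACK pattern. It runs against SQLite via Python's standard library rather than Hive (Hive's ACID syntax and semantics will differ in detail), and the table and values are made up for illustration:

```python
import sqlite3

# Illustrative multi-statement ACID transaction using SQLite as a stand-in
# for Hive; a failure mid-transaction rolls back every statement in it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE id = 2")
        raise ValueError("transfer rejected")  # simulate a mid-transaction failure
except ValueError:
    pass  # both UPDATEs were rolled back together

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 50} -- unchanged after rollback
```

The point is atomicity: either both updates land or neither does, which is exactly what single-statement-only systems cannot guarantee.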

Sub-Second Queries with Hive LLAP

Sub-second queries require fast query execution and low setup cost. The challenge for Hive is to achieve this without giving up on the scale and flexibility that users depend on. This requires a new approach using a hybrid engine that leverages Tez and something new called LLAP (Live Long and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes that provides the following:

Multi-threaded execution including reads with predicate pushdown and hash joins

High-throughput IO using an async IO elevator with a dedicated thread and core per disk

Granular column level security across applications

YARN will provide workload management in LLAP through delegation: queries will carry their authorized resource allocations from YARN to LLAP, and LLAP processes will then allocate additional resources to serve the query as instructed by YARN.

The hybrid engine approach provides fast response times through efficient in-memory data caching and low-latency processing in node-resident processes. However, by limiting LLAP to the initial phases of query processing, Hive sidesteps the limitations around coordination, workload management, and failure isolation that are introduced by running an entire query within such a process, as other databases do.

Comprehensive SQL:2011 Analytics

Hive will support a subset of SQL:2011 Analytics, with new features added over multiple iterations, driven by customer demand. Hive is already much further along than other SQL options for Hadoop, with strong SQL support including:

Window Functions

Common Table Expressions

Common sub-queries – correlated and uncorrelated

Advanced UDFs

Rollup, Cube, and Standard Aggregates

Inner, outer, semi, and cross joins
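As a concrete taste of two items from this list, the sketch below combines a common table expression with a window function. It is run against SQLite through Python's standard library purely as an illustration of the SQL shapes involved (the `sales` table and its values are invented); HiveQL accepts the same style of query:

```python
import sqlite3

# A CTE feeding a window function, using SQLite as a stand-in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 30), ("west", 20), ("west", 5)])

rows = conn.execute("""
    WITH regional AS (                                -- common table expression
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total,
           RANK() OVER (ORDER BY total DESC) AS rnk   -- window function
    FROM regional
    ORDER BY rnk
""").fetchall()
print(rows)  # [('east', 40, 1), ('west', 25, 2)]
```

The CTE names an intermediate aggregation, and the window function ranks over it without a second GROUP BY, which is the kind of tool-generated reporting query this SQL coverage exists for.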

Stinger.next will extend this lead to cover most of the frequently used SQL constructs:

Non-equi joins

Set operations – UNION, EXCEPT, and INTERSECT

Interval types

Most sub-queries, nested and otherwise

Fixes to syntactic differences from the SQL:2011 spec, such as ROLLUP
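Two of the constructs above, the set operations and non-equi joins, can be sketched quickly. The example uses SQLite via Python's standard library as a stand-in for Hive, with invented two-table data, just to show the query shapes Stinger.next targets:

```python
import sqlite3

# Set operations and a non-equi join, illustrated in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,), (4,)])

union     = [r[0] for r in conn.execute("SELECT x FROM a UNION SELECT x FROM b ORDER BY x")]
except_   = [r[0] for r in conn.execute("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY x")]
intersect = [r[0] for r in conn.execute("SELECT x FROM a INTERSECT SELECT x FROM b ORDER BY x")]
print(union, except_, intersect)  # [1, 2, 3, 4] [1] [2, 3]

# Non-equi join: the ON condition is an inequality, not an equality,
# pairing each row of a with every strictly larger row of b.
pairs = conn.execute(
    "SELECT a.x, b.x FROM a JOIN b ON b.x > a.x ORDER BY a.x, b.x").fetchall()
print(pairs)  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Non-equi joins matter for range lookups such as joining events to time windows, which equality-only join support cannot express directly.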

Integration with Machine Learning Frameworks

Hive-Spark machine learning integration will also allow Hive users to run machine learning models via Hive. Users want to run both predictive and descriptive analytics in Hive on the same dataset.

Hive on Spark?

There is a lot of talk about Spark as a powerful engine running on YARN, and we at Hortonworks share that excitement and are working actively to make it enterprise-ready for Spark users. In fact, to integrate with Spark, the broad Hive community is reusing several of the infrastructure components already added to Hive as part of the Tez integration delivered in Hive 0.13.

Some Additional Advances

In addition to these primary use cases, some additional enhancements include:

Hive Cross-Geo Query allows users to query and report on datasets distributed across geographies due to legal or efficiency constraints. Today, users are unable to do this and must write their own application code to stitch together multiple results.

Materialized views store multiple precomputed views of the same data, enabling faster analysis. The views can be held speculatively in memory and discarded when memory is needed.

Usability improvements will help users work more simply with Hive.

Simplified deployment will focus on providing near plug and play deployment solutions for the most common use cases.
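The materialized-view idea above can be sketched as a hand-rolled summary table: a precomputed result that trades storage for query speed, which is the pattern Hive's materialized views would automate. SQLite (used here via Python's standard library) has no materialized-view statement, so the refresh function and all names below are illustrative only:

```python
import sqlite3

# A manual "materialized view": a summary table rebuilt from base data,
# approximating what Hive's materialized views would manage automatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("mon", 3), ("mon", 4), ("tue", 5)])

def refresh_daily_clicks(conn):
    # Recompute the summary from the base table (a full refresh).
    conn.execute("DROP TABLE IF EXISTS daily_clicks")
    conn.execute("""CREATE TABLE daily_clicks AS
                    SELECT day, SUM(clicks) AS clicks
                    FROM events
                    GROUP BY day""")

refresh_daily_clicks(conn)
rows = conn.execute("SELECT day, clicks FROM daily_clicks ORDER BY day").fetchall()
print(rows)  # [('mon', 7), ('tue', 5)]
```

Dashboards then read the small summary table instead of scanning the full base table, the same speed-for-storage trade the post describes.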

Delivery

Stinger.next will be delivered at a rapid pace over the next 18 months. Transactions will ship in late 2014; sub-second queries are coming in the first half of 2015, with a preview in the next few months. An initial outline of the delivery is below. We expect this work to be completed as the initial work was: in scope and on schedule.

Enthusiasm abounds

It is not just Hortonworks that is enthusiastic about this next phase in the delivery of Enterprise SQL at Hadoop Scale. Some of our key partners have weighed in on their excitement as well. Watch this space over the next few days as Microsoft, Informatica, Microstrategy and Tableau all weigh in on this important initiative.

And as always, we are excited to continue our work within the Hive community to extend Hive, the leading SQL on Hadoop solution, further in terms of speed, scale, and SQL semantics.

Hive delivers a message of simplicity. It already provides a single tool for all SQL across batch and interactive workloads, and with Stinger.next it extends to near real-time. We’re enthusiastic about the upcoming Stinger.next journey as Hive adds exciting new features toward this goal. Watch this blog for future posts from Apache Hive committers and contributors from around the world as they share enhancement ideas with the community.