Apache Hive Updated with SQL-on-Hadoop Features

Hortonworks Inc. yesterday announced a new version of Apache Hive, the open source data warehouse software running on top of Hadoop, with new SQL query features and performance improvements.

Hive, formerly a Hadoop subproject that graduated to top-level status of its own, provides the infrastructure to conduct Big Data analytics with a SQL-based query language called HiveQL.

Hive 0.13 was described as a "significant release" by Harish Butani in a blog post announcing the new release. He said more than 70 members of the project conducted a major effort to implement more than 1,000 features or fixes.

In addition to new SQL features and myriad other improvements, Hive 0.13 was released in coordination with Apache Tez 0.4, an interactive alternative to the oft-maligned batch-oriented MapReduce programming model used to extract business insights from Big Data stores.

Improvements in Hive 0.13 (source: Hortonworks blog post)

"With the delivery of Hive on Tez, users have the option of executing queries on Tez," Butani said. "Tez's dataflow model on a DAG of nodes facilitates simpler, more efficient query plans, which translates to significant performance improvements and interactive query on Hive/Hadoop."

One of the improvements to Tez 0.4 is better Windows support. "The community fixed bugs and made changes to Tez so that it runs as smoothly on Windows as it does on Linux," said Bikas Saha in announcing the new release. "We hope this will encourage adoption of Tez on Windows-based systems."

Also announced in conjunction with the releases of Hive 0.13 and Tez 0.4 was the completion of the Stinger Initiative, a community effort to improve Hive with up to 100x performance boosts at petabyte scale while using SQL semantics familiar to developers and users. Butani said, "These improvements extend Hive beyond its traditional roots and brings true interactive SQL query to Hadoop."

Hive 0.13 introduces SQL standard-based authorization, and the SQL language was extended to support grant and revoke on entities. Other authorization-related improvements include support for show roles, user privileges and active privileges. Gaps in authorization checks have been plugged with a reworked, pluggable authorization API.

Other SQL-related improvements include support features such as DECIMAL and CHAR types, permanent functions, common table expressions and many more.

Hive integrates with other tools via the popular JDBC interface, and version 0.13 improves on JDBC operations with support for job cancel and async execution.

"All of these Hive improvements mean that Hive 0.13 accepts a very large percentage of TPC-DS benchmark queries without rewrites," Butani said. TPC-DS is a decision support benchmark standard.

The development team improved its own efficiency by moving builds to the Apache Maven project management tool and creating a new parallel testing framework, a new project wiki and support for the Parquet file format.

Many other improvements were listed, and Butani said the huge revamp took a tremendous development effort. "Ultimately, over 145 developers representing 44 companies, from across the Apache Hive community, contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive codebase," he said.