BUSINESS IMPACT

Feb 11, 2015

5 Reasons Apache Hive 1.0 Matters to Your Big Data Strategy

On Friday, February 6th, 2015, Hortonworks announced that the Apache Hive project had reached a major milestone. After years of development since Facebook first released it as an open source project, Hive has reached its stable 1.0 release!

What makes Hive 1.0 such an integral piece of the puzzle?

Hive provides the single most comprehensive SQL on Hadoop experience. The vast amount of open source development has resulted in a platform that provides nearly every feature that real-world enterprises need to implement Data Lake scenarios on Hadoop. We've completed engagements in which clients successfully used Hive to create structured data marts on top of unstructured data in Hadoop.

When thinking about your big data strategy, there are five reasons why Apache Hive 1.0 matters:

Reason #1: Security

Hive 1.0 supports enterprise-level security, including Kerberos authentication and object-level permissions. This gives data managers the controls they need to keep data accessible only to the team members who should have access. In recent releases of Hive, individual objects such as tables and views, as well as access to the system in general, can be secured. This type of support was previously available mainly in traditional RDBMSs with large licensing costs. Hive 1.0’s support of enterprise-grade security is an important feature of the Data Lake.

Reason #2: Performance

The stable release of Hive supports performance far beyond that of slow MapReduce jobs. Recent enhancements, including the Tez engine, query vectorization, improved index support, and Cost-Based Optimization (CBO), help ensure that analysts’ queries are returned before they get back to their desks with that cup of coffee.

The Tez processing engine improves upon MapReduce by moving expensive shuffle and sort operations from slow disk structures to fast memory wherever it can. At BlueGranite, we recommend that queries exposed to end users and data analysts be configured to use the Tez engine whenever possible.

Vectorization and the ORC (Optimized Row Columnar) format allow fast in-memory operations to be performed on batches of rows at a time. Compared to MapReduce's row-by-row processing, vectorization performs far better on large HDFS-based data sets.

In a traditional RDBMS, advanced indexing support is one of the features that lets results come back quickly when an analyst queries for a specific piece of data. Hive’s improved index support helps ensure that a request for a very specific piece of data is answered quickly and accurately.

Apache Hive 1.0 is one of the first SQL on Hadoop projects to support Cost-Based Optimization, which uses statistics about the data to create an execution plan tailored to the actual query being executed. This helps Hive choose an efficient plan for each query rather than a one-size-fits-all one.
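
As a rough illustration of how these performance features are switched on, the sketch below uses Python only to assemble the HiveQL statements an administrator might run at the start of a session. It does not connect to Hive. The property names are standard Hive configuration settings, but the helper function, the `sales` table, and its columns are hypothetical, and the right values for a given cluster are an assumption to validate against your own distribution.

```python
# Sketch only: renders HiveQL text; it does not connect to a Hive cluster.
# The property names are standard Hive settings; the table and column
# names passed in by the caller are hypothetical.

PERFORMANCE_SETTINGS = {
    "hive.execution.engine": "tez",               # run on Tez instead of MapReduce
    "hive.vectorized.execution.enabled": "true",  # process rows in batches
    "hive.cbo.enable": "true",                    # turn on cost-based optimization
    "hive.stats.fetch.column.stats": "true",      # let the planner read column statistics
}


def session_preamble(table, columns):
    """Return the statements to run before an analyst's query."""
    statements = [f"SET {key}={value};" for key, value in PERFORMANCE_SETTINGS.items()]
    # The optimizer can only cost a plan if statistics have been gathered.
    statements.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS;")
    statements.append(
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {', '.join(columns)};"
    )
    return statements


if __name__ == "__main__":
    for statement in session_preamble("sales", ["region", "amount"]):
        print(statement)
```

Keeping the settings in one place like this makes it easy to review which engine and optimizer options a reporting session relies on. Note that vectorized execution applies to ORC-backed tables, so the hypothetical `sales` table is assumed to be stored as ORC.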

Reason #3: Data Integrity

One of the newest features added to Apache Hive 1.0 is ACID transaction support, including row-level INSERT, UPDATE, and DELETE. This is an incredibly important feature for many enterprise customers. Knowing that data stored in a database is always in a consistent, non-corrupted state is a requirement that many cannot live without.
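
To make the ACID feature more concrete, here is a minimal sketch, again using Python only to build HiveQL text rather than to talk to a cluster. In this release line, a table that accepts UPDATE and DELETE is expected to be stored as ORC, bucketed, and flagged as transactional. The `customers` table, its columns, the bucket count, and the helper function are all hypothetical, and the metastore-side compactor settings that transactions also need are not shown.

```python
# Sketch only: renders HiveQL text; it does not connect to a Hive cluster.
# Table name, columns, and bucket count are hypothetical.

ACID_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lock.DbTxnManager",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
}


def transactional_table_ddl(table, columns, bucket_column, buckets):
    """Build a CREATE TABLE statement for a table that accepts UPDATE and DELETE."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f"CLUSTERED BY ({bucket_column}) INTO {buckets} BUCKETS "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true');"
    )


if __name__ == "__main__":
    for key, value in ACID_SETTINGS.items():
        print(f"SET {key}={value};")
    print(transactional_table_ddl("customers", [("id", "INT"), ("email", "STRING")], "id", 4))
    # A row-level change of the kind ACID support makes possible.
    print("UPDATE customers SET email = 'new@example.com' WHERE id = 42;")
```

Each statement commits on its own in this model, so the sketch does not show explicit BEGIN or COMMIT statements.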

Apache Hive 1.0 also supports commands to verify data and to gather and store statistics about individual objects such as tables, partitions, and columns. These statistics feed the query optimizer, and together with the verification commands they help keep data healthy and accessible when it’s needed.

Reason #4: Developer Support

Hive 1.0 offers the most ANSI SQL-compliant dialect of all of the SQL on Hadoop projects. Hive supports many intermediate and advanced SQL features, such as:

Subqueries and joins

Aggregation functions using GROUP BY and HAVING

Windowing functions, also known as OLAP functions

Advanced SQL datatype support including dates and timestamps, plus XPath functions for working with XML

Because many enterprise analysts and developers are already well versed in SQL, Hive provides a platform that requires very little retraining.
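
For readers less familiar with the SQL features listed above, the sketch below runs an aggregation with GROUP BY and HAVING and a windowing function against a tiny made-up `sales` table. It uses Python's built-in SQLite module (SQLite 3.25 or newer) purely as a stand-in engine so the example is runnable anywhere. It is not Hive, and while the two queries are written in the common SQL form that Hive's aggregation and windowing support is documented to accept, treat that portability as an assumption to verify on your own cluster.

```python
import sqlite3

# Stand-in engine: SQLite (3.25 or newer for window functions), not Hive.
# The sales table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "2015-01-01", 100),
        ("east", "2015-01-02", 250),
        ("west", "2015-01-01", 40),
        ("west", "2015-01-02", 30),
    ],
)

# Aggregation with GROUP BY and HAVING: regions whose total exceeds 100.
big_regions = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region HAVING SUM(amount) > 100"
).fetchall()

# Windowing function: running total per region, ordered by date.
running_totals = conn.execute(
    "SELECT region, sale_date, "
    "SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total "
    "FROM sales ORDER BY region, sale_date"
).fetchall()

print(big_regions)
print(running_totals)
```

The point for an analyst is that neither query needs Hadoop-specific syntax. The same SELECT statements they already write against a relational database carry over with little change.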

Reason #5: Data Lakes and the Modern Data Platform

In Chris Campbell’s recent post, Top 5 Differences Between Data Lakes and Data Warehouses, he describes how distributed Hadoop systems augment and enhance the traditional enterprise data warehouse (EDW) to create a Modern Data Platform. Hadoop can be used to enhance how data is transformed, and it can be used to create an online archive that enables access to more data than ever before.

At BlueGranite, we couldn’t do this work in the Data Lake without Apache Hive. One could argue that beyond the Hadoop Distributed File System (HDFS), Hive is the most important tool in the Data Lake toolbox. We aren’t the only company that feels that way. In recent months, both Hortonworks and Microsoft have invested an incredible amount of resources into enhancing the Hive project. With the 1.0 stable release, Hive has become the de facto standard for SQL on Hadoop.

If you're interested in learning more, we can help!

At BlueGranite, we are very excited about the stable release of Apache Hive 1.0. It is our belief that Hive will continue to be the industry leader among SQL on Hadoop offerings. This latest release of Hive only reinforces that belief.

Want to discover how Hive can fit into your big data strategy? Contact us today to learn more about our offerings including custom Strategy, Envisioning and Architecture Design workshops.

About The Author

Josh is a Principal Architect at BlueGranite. Josh is passionate about enabling information workers to become data leaders. His passions in the data space include: Modern Data Warehousing, unstructured analytics, distributed computing, and NoSQL database solutions.