Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop

The Stinger Initiative is Hortonworks’ community-facing roadmap for the investments we are making to improve Hive performance 100x and evolve Hive toward SQL compliance, simplifying the migration of SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

SQL for Big Data

We also believe this should be done in the open community, and that Apache Hive is the right choice for evolving SQL in Hadoop. First, Hive is the only SQL engine proven at multi-petabyte scale, so it is a good baseline. Second, it has the most comprehensive SQL semantics of any SQL solution for Hadoop, so it is “closest to the pin”. Third, Hive is built on MapReduce, so it effortlessly handles hundreds or thousands of simultaneous queries, even at petabyte scale. Finally, there is a large, vibrant community of practitioners and developers using and working on Hive. Why not rally the community to innovate in the open?

So, where are we?

The Stinger Initiative outlines three phases. The first phase of delivery included:

A preview of the vectorized query engine, which speeds all types of queries, adding another 5x-10x improvement;

Simplified SQL interoperability through the new VARCHAR and DATE datatypes; and

A new query optimizer that speeds complex queries by several factors.

While this represents great progress, we still have one remaining piece of Stinger Phase 2: Hive on Apache Tez. This will be released separately in beta form soon and will deliver order-of-magnitude improvements in query latency and push several types of queries past the 100x barrier.

We’re encouraged by the extremely positive reaction we’ve seen from both Hive users and the Hive ecosystem. Let’s take a deeper dive into some of the new HDP 2 features.

Performance

Increasing Hive performance 100x remains the primary goal of the Stinger Initiative. The HDP 2.0 Beta introduces several major new performance features that benefit both small reporting queries and deep analytical queries, some of which are described in the following table:

We’ll take a deeper look at each of these major new improvements in subsequent performance-oriented blog posts. They represent important steps on the path to improving Hive’s performance 100x. Here are a few examples of what they add up to.

We looked at TPC-DS Query 27, a fairly simple reporting query, back in February and showed that some improvements to the Hive query planner led to massive performance benefits. HDP 2 brings incremental progress by introducing vectorized query, which makes the map stages far more efficient. The next big boost in Query 27 will come when we introduce the upcoming Tez Beta which unlocks true low latency on Hadoop. We believe Apache Tez will push this query and others like it past the 100x improvement mark.
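For readers trying the HDP 2.0 Beta, vectorized execution is toggled with a session setting. A minimal sketch follows; the config name comes from the Hive vectorization work, the `sales` table is hypothetical, and note that vectorization currently applies to tables stored as ORC:

```sql
-- Enable the vectorized query engine for this session
-- (applies to ORC-backed tables in the current preview).
SET hive.vectorized.execution.enabled = true;

-- A scan-heavy aggregation like this then processes rows in batches
-- rather than one at a time, making the map stages far more efficient.
SELECT COUNT(*), AVG(price)
FROM sales
WHERE year = 2013;
```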

Hive is not just for simple queries but can also handle quite complex queries. TPC-DS Query 95 is an extremely complex query including a 3-way fact table join. This query benefitted from our Hive 11 improvements but not to the extent that simple star schema join queries like Query 27 did. HDP 2 and the upcoming Hive 12 introduce a query optimizer that benefits complex queries by generating more efficient map/reduce plans. Even in the fastest time, Query 95 runs in 6 distinct MapReduce jobs, so the introduction of Tez will prove to be a massive boost here as well.

We’ll discuss these performance benefits in more depth in later blogs.

SQL Compliance

Our goal with SQL support is simple: make Apache Hive a comprehensive, compliant SQL engine that meets enterprise-class needs. This round of Hive development introduces two critical new datatypes: VARCHAR, a very commonly used SQL type, and DATE, which is also very common and a natural choice for partitioning.

These new datatypes are extremely important for interoperability with existing databases. Historically, loading data from an external RDBMS into Hive (or systems claiming to be Hive compatible) required mapping VARCHAR to strings. This mapping process is no longer needed. These interoperability improvements further solidify Hive’s position as the most comprehensive SQL system for Hadoop, and greatly simplify the process of migrating SQL workloads to Hive.
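As an illustration of how this simplifies migration, an RDBMS schema can now be declared with its original types rather than mapped to STRING. The table and column names below are hypothetical:

```sql
-- VARCHAR columns carry over directly from an RDBMS schema;
-- DATE is a natural partition key.
CREATE TABLE customers (
  customer_id INT,
  name        VARCHAR(100),  -- previously had to be mapped to STRING
  city        VARCHAR(50)
)
PARTITIONED BY (signup_date DATE);

-- Queries can then prune partitions by date:
SELECT name FROM customers WHERE signup_date = '2013-09-01';
```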

The Hive Ecosystem: Stronger Than Ever

Our philosophy is to innovate SQL in Hadoop with 100% vendor-neutral Apache Foundation software that welcomes contributions from anyone. Although Hive 12 is not officially released yet, a lot of work has been flowing in from more than 60 developers, both within Hortonworks and from many other contributors in the community.

UPDATE: Thanks to Ed Capriolo, who pointed out that at least one large patch must have been missing, given an unexpectedly small number in the Cloudera section. The analysis has been re-run; the updated numbers and corrections appear in the comments below.

Hortonworks customers value our 100% open source, 0% proprietary, 0% lock-in approach. Our partners have noticed this preference among our mutual customers and we’re excited to see them respond by deepening their investments in Hive. Here are a few noteworthy examples:

MicroStrategy focuses on deep analytics, and MicroStrategy Server 9.4 has expanded deep analytics on Hadoop by embracing the SQL windowing capabilities introduced in Hive 11. Find out more about the 9.4 release on the MicroStrategy website.

Karmasphere was an early adopter of Hive 11, introducing support in July, when Hortonworks HDP was the only distribution that included Hive 11. Karmasphere pushes the boundaries of deep analytics in other ways as well, partnering with Zementis to bring machine learning and predictive analytics to Hive and Hadoop. Learn more on the Karmasphere website.

MongoDB is a very popular NoSQL database thanks in large part to its simplicity in deployment and use. Mongo doesn’t focus on deep analytical capabilities, so Hive is a natural complement. MongoDB has released a new connector that allows you to create tables within Hive and run SQL queries directly on data stored in MongoDB. Learn more on the MongoDB website.
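A sketch of what that looks like, assuming the mongo-hadoop connector’s storage handler class and a hypothetical `orders` collection:

```sql
-- Hypothetical example using the mongo-hadoop connector's
-- Hive storage handler to expose a MongoDB collection as a table.
CREATE EXTERNAL TABLE mongo_orders (
  order_id INT,
  total    DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
TBLPROPERTIES ('mongo.uri' = 'mongodb://localhost:27017/shop.orders');

-- Standard HiveQL then runs directly against the MongoDB data:
SELECT order_id, total FROM mongo_orders WHERE total > 100;
```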

In March, ESRI released “GIS Tools for Hadoop”, a powerful spatial analytics toolkit for Hive and Hadoop that makes spatial analytics simple. Learn more on their project page. You can also read about a sample application we built that has working code for using the ESRI toolkit.

Tableau recently released version 8.0.4 which adds support for HiveServer2 providing additional security and the ability to support more concurrent users. Find more on working with Tableau and Hadoop here.

Try Hive Today

HDP 2.0 Beta is available for download today and includes the fastest and most comprehensive Hive to date. Whether you’re looking for faster, more scalable SQL processing or you’re an existing Hive user wanting to test-drive the new performance enhancements, we encourage you to download it and give us your feedback.

Regarding your statement “Hive is built on MapReduce so it effortlessly handles hundreds or thousands of simultaneous queries”: How many sites are running this many simultaneous Hive queries against a single cluster? Can you name any of them?

Also, what are the business partners listed above doing with Hadoop 2.0?

@Jim
I started off with a clean Hive git repo.
Next, I identified all JIRAs marked as fixVersion = Hive 0.12.0 and status Resolved. There are several hundred of these.
The third step is to trace each of these back to a git commit ID.
Hive developers are pretty good about including the JIRA number in the commit log so for the most part you can just parse “git log” and match JIRAs to commits. Just to be extra sure I also extracted comments made by Hudson within the JIRAs themselves to cross-validate. I found the Hive commit log complete enough that you don’t really need to do this to get essentially the same results I got.
This maps JIRAs to a set of commits, sometimes more than one, since patches sometimes get made to multiple branches.
With this mapping I use git to identify the specific lines that have changed and count them up.
For JIRAs that have more than one associated commit I take the max value, to represent the effort of making the patch.
At this point I have Assignee, JIRA and number of lines. The last step is to map assignee back to company. This step is purely manual and is based on my best info. A few of the people who wound up in “other” might have belonged elsewhere.
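The log-parsing step above can be sketched roughly as follows; this is a simplified illustration, assuming `git log --oneline`-style input rather than the full analysis:

```python
import re
from collections import defaultdict

JIRA_RE = re.compile(r"HIVE-\d+")

def jiras_to_commits(git_log_text):
    """Map each JIRA id to the set of commit hashes whose subject mentions it.

    Expects 'git log --oneline' style output: '<sha> <subject>' per line.
    A JIRA can map to more than one commit, since patches sometimes
    land on multiple branches.
    """
    mapping = defaultdict(set)
    for line in git_log_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue
        sha, subject = parts
        for jira in JIRA_RE.findall(subject):
            mapping[jira].add(sha)
    return mapping

# Hypothetical sample input:
sample = """abc123 HIVE-4160: vectorized query execution
def456 HIVE-4160: follow-up fix on branch-0.12
789aaa unrelated commit
bbb000 HIVE-5000 add DATE type"""

m = jiras_to_commits(sample)
```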
I attributed a few large patches by Namit Jain to Facebook even though I want to say he had actually moved over to Nutanix by the time they went in. Namit can correct me if need be.
There were a few other Facebook contributions, not to the extent of Namit. I think it’s premature to say they’ve stopped, we certainly hope to see more from them.
The single largest patch was the Hive Correlation Optimizer, work that had been developed by Yin Huai for many years and committed while he was interning at Hortonworks this summer. It was about 20% of the whole count.

@Dan
Facebook published some impressive numbers on their Hive usage, more than 60,000 Hive queries per day with 100+PB data under management in their largest cluster. Source = http://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920

The problem with counting lines of code is that a one-line patch can cause cascading changes in hundreds of unit tests and produce an artificially high lines-of-code count. This fits with the observation that 20% of the code was the optimizer.

Without crunching the numbers I have a hard time accepting that. Like they say, whatever you count, you get more of 🙂

Thanks to Ed Capriolo, who pointed out that at least one large patch must have been missing due to an unexpectedly small number in the Cloudera section. Analysis has been re-run, resulting in updated numbers and the following corrections.
Cloudera contribution count was incorrectly reported as 4244, corrected to 13754
Hortonworks contribution count was incorrectly reported as 83742, corrected to 90681
NexR contribution count was incorrectly reported as 12103, corrected to 15236
“Other” contribution count was incorrectly reported as 8920, corrected to 10228

Parquet imparts columnar properties to any Hadoop framework. Does Hortonworks’ Stinger Initiative include a column-store architecture? I do see improvements from data compression, and from going beyond slow MapReduce.

