Impala vs. Hive Performance Benchmark

Yahoo! JAPAN needed a data platform that could scale to generate 100,000 reports per day while also processing large amounts of data. It needed to keep the last 13 months' worth of data, approximately 500 billion rows, organized and easily accessible. Relational Database Management Systems (RDBMS) cannot scale to these levels from a cost and processing-power perspective. Yahoo! JAPAN explored Hadoop to achieve this and evaluated two platforms against these requirements: Hive and Tez on YARN from Hortonworks, and Cloudera Impala. Hive and Tez on YARN scaled beyond 15,000 queries per hour, while Impala hovered at about 2,500 queries per hour.

BACKGROUND

Our business report systems generate 15,000 reports per hour from about 500 billion lines of time-series data covering the last 13 months, based on conditions predefined by users. These report systems had the following problems because of file-based processing:

While data for the last 13 months is stored, only three months of it can actually be used.

Only small variations of grouping, such as by region and gender, are available.

There are no filters to select values above specific thresholds.

Putting the data in an RDBMS and using SQL queries would make more processing variations and filter functions possible. However, migrating to an RDBMS was not feasible due to the large amount of data and the large number of jobs.

We reviewed Hadoop-based SQL engines from Cloudera and Hortonworks, and solved these challenges by adopting Hortonworks HDP 2.2 with Apache Hive 0.14 and Apache Tez.

Use Case

When a user specifies query conditions, the system asynchronously creates report data for later download. Conditions include:

Period (dates: FROM-TO)

Grouping by gender, region, etc.

Timing of report generation (daily, weekly, etc.)

The report generation timing option was by far the most challenging. Many users specify periodic reporting, such as daily or weekly reports, so a huge number of report-creation processes execute immediately after the daily data update, which spikes the number of queries per unit of time.
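To make the conditions concrete, a generated report query of this kind might look like the following sketch. The table and column names are hypothetical; only the condition types (period, grouping, threshold filter) come from the description above.

```sql
-- Hypothetical report query; table and column names are illustrative.
SELECT region, gender, COUNT(*) AS events, SUM(value) AS total
FROM report_data
WHERE dt BETWEEN '2015-06-01' AND '2015-06-07'   -- period (FROM-TO)
GROUP BY region, gender                          -- grouping conditions
HAVING SUM(value) > 10000;                       -- threshold filter the file-based system lacked
```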

Data

This report system handles gzip-compressed text files with the following characteristics:

Format: text (gzip)

File size per day: 10GB

Lines per day: 1.3 billion

Data retention period: 13 months

Total data: 6,000GB

Total lines: 450 billion

Approach

MapReduce was originally designed for batch processing. We thought it would be practical to use it in the report system if we could control the latency of each query and ensure parallel execution performance.

Considering Impala

We first tried Impala, which uses a different execution engine from MapReduce.

Impala was promising because it executes a query in a relatively short time, typically several seconds to several tens of seconds.

If the response time for one query is 15 seconds, about 240 queries can be executed per hour.

To increase the processing amount per unit of time, we considered raising the number of queries executed in parallel.

However, we encountered a problem: Impala's processing time increases linearly as the degree of query parallelism increases.

It could not handle all of the batch queries that are automatically executed every day after data updates. Therefore, we had to give up on the idea of employing Cloudera Impala.

Considering Hive and Tez on YARN

With the advent of Hadoop 2.x and YARN, more fine-grained control of parallel execution became possible, and the Tez engine significantly reduced the latency of MapReduce-style jobs.

YARN and Tez add significant processing power, scaling data processing beyond 15,000 jobs per hour. This is a dramatic leap from the 100,000-jobs-per-day capability of our previous Hadoop 1.0-based cluster.

TEST

Test Data and SQL

For testing purposes, a little fewer than 2,000 SQL queries were sampled from those actually used by the real report generation processing.

Most queries return result sets of fewer than 1,000 rows, but some return large sets of more than 100,000 rows.

Hive on Tez vs Impala

First, we compared Hive on Tez with Impala, which we had been planning to deploy. The roughly 2,000 queries were run at a parallelism of 32, and fig 2 graphs the breakdown of all SQL processing times.

Hive processed most queries within 20 seconds, and even the queries with large result sets, which take longer, completed within 70 seconds.

With Impala, on the other hand, many queries took 30 to 90 seconds, and some (namely those with large result sets) took more than 10 minutes; performance degradation under parallel execution was remarkable.

Impala can respond in milliseconds under low load, and it is a valid choice when no parallel SQL execution is required.

However, our use case includes batch execution, and parallel SQL execution was a mandatory requirement; that is why we could not choose Impala.

Concurrency of Hive

Next, we tested how many parallel queries a single HiveServer2 instance can execute, and which degree of parallelism it processes most effectively.

We raised the number of parallel SQL executions in steps of 16 and measured the processing times of all the sample queries.

Throughput rose rapidly until parallelism reached 64, at which point it started declining. In this environment, we concluded that our needs would be best met by setting the parallelism to 64.

Conclusion

Single-query execution time is around 15 seconds for Hive on Tez, and it scales by raising the degree of parallelism. (The parallelism limit depends on the size and performance of the cluster.)
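As a rough sanity check, the numbers are mutually consistent: with a roughly 15-second single-query latency and the parallelism of 64 found in the concurrency test,

```latex
\text{throughput} \approx 64 \times \frac{3600\ \text{s/hour}}{15\ \text{s/query}} \approx 15{,}360\ \text{queries/hour}
```

which matches the roughly 15,000 queries per hour the service requires. This assumes per-query latency stays flat at that concurrency, which the concurrency test supports up to 64.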

Impala returns very fast responses when the cluster is lightly loaded, but it is not suitable for workloads that run SQL in parallel.

We decided to adopt Hive on Tez because it can process the 15,000 queries per hour that our service requires.


Let me explain about the Impala cluster. The Impala cluster was tuned by Yahoo! JAPAN engineers, not Cloudera's engineers or any Impala experts. We only used the information obtained through Q&A within our Cloudera subscription contract.

This is Yifeng from Hortonworks. I worked closely with the Yahoo! JAPAN team on this project. Let me share more details on the Hive/Tez optimization.

The goal is full utilization of the cluster and reuse of Tez containers.

1) Partitioning
We basically maximize the number of partitions while keeping the total under a couple of thousand. Hive and HCatalog handle this well at query time.

2) Loading Data
We had to convert the data into ORC files. There are three main parameters defining how ORC stores the data: block size, stripe size, and compression. In general, the Hive defaults of 64MB stripes and 256MB blocks work well. We chose ZLIB compression, which is highly optimized in recent Hive versions.

We need to make sure that ORC files are big enough; the general best practice is 1 to 10 HDFS blocks per file if possible. One mapper/reducer generates one file, so when creating ORC files we make sure we don't have lots of reducers each writing to all partitions. This can be done by enabling the dynamic partition sort optimization or by using the DISTRIBUTE BY clause. We chose DISTRIBUTE BY for fine-grained control.
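As a sketch of the loading step described above (the table name, columns, and daily partition key are assumptions, not the actual schema), the ORC conversion might look like:

```sql
-- Hypothetical schema; 'orc.compress' and 'orc.stripe.size' are standard ORC table properties.
CREATE TABLE report_data (
  user_id BIGINT,
  region  STRING,
  gender  STRING,
  value   BIGINT
)
PARTITIONED BY (dt STRING)          -- daily partitions: ~13 months stays well under a couple thousand
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'    = 'ZLIB',
  'orc.stripe.size' = '67108864'    -- 64MB stripes, the default mentioned above
);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- DISTRIBUTE BY routes each day's rows to the same reducer, so each partition
-- receives a few large ORC files instead of one small file per reducer.
INSERT OVERWRITE TABLE report_data PARTITION (dt)
SELECT user_id, region, gender, value, dt
FROM raw_text_logs
DISTRIBUTE BY dt;
```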

3) Tez Sessions and Containers
Reinitializing a Tez session takes 5+ seconds:
– Tez sessions are reused and can be pre-initialized with pre-warm (with some drawbacks).
– With pre-warm, full speed is practically instantaneous.

Re-creating containers takes about 3 seconds:
– Enable container reuse and keep containers alive for a short period of time.
– The key is to reach 100% utilization without wasting resources.
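The session and container reuse described above maps to a handful of settings. The values below are illustrative placeholders, not the ones Yahoo! JAPAN actually used:

```sql
-- Illustrative values only; tune to your cluster.
SET tez.am.container.reuse.enabled=true;          -- reuse containers across tasks
SET tez.session.am.dag.submit.timeout.secs=300;   -- keep an idle session's AM alive
SET hive.prewarm.enabled=true;                    -- pre-warm containers when the session starts
SET hive.prewarm.numcontainers=10;                -- hypothetical pre-warm count
```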

4) Parallelization
In most query-tuning scenarios this is ignored at first. However, it is perhaps the most important point for many real-world scenarios. Oftentimes a single query benefits from additional resources, but this can reduce overall throughput.
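On the HiveServer2 side, the achievable parallelism is bounded by the pool of Tez sessions. These are server-side settings configured in hive-site.xml before HiveServer2 starts (shown in SET syntax only for readability); the value 64 echoes the concurrency test above and is not a verified production value:

```sql
-- hive-site.xml settings; restart HiveServer2 after changing them.
SET hive.server2.tez.initialize.default.sessions=true;  -- pre-initialize pooled Tez sessions
SET hive.server2.tez.sessions.per.default.queue=64;     -- sessions (and thus parallel queries) per queue
```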

Yifeng, can you be more specific on the memory optimizations for Tez? Specifically, what parameters were tuned, and to what values? In conjunction with what other variables were they tuned? (Or, upon which variables did the memory tuning depend? Block size? Data size?)

Are there any details on hardware, cluster size and the underlying Hadoop configurations?

Of the testing that was done, I only see a direct comparison vs. Impala in the second diagram (“Number of queries in time range”). It’s not clear to me what this test is all about.

Did you submit the same query simultaneously 2000 times? Or are these 2000 distinct queries? What was the query? Was Impala used through JDBC/HiveServer2, or were the queries issued directly to the daemons? Did you load balance between the Impala daemons? (i.e., Did you remove the client/proxy as the bottleneck, as was done for Hive/Tez?)

For me this is an obvious result based on the architectures of Hive and Impala.

Impala is an MPP-like engine, so each query you execute starts an executor on every node of your cluster. This delivers the best performance for a single query running on the cluster, but total throughput degrades heavily under high concurrency. In such systems you should limit the number of parallel queries to a fairly low value of around 10.

Hive is the opposite: it runs on top of the MapReduce/Tez framework, which requests resources based on the amount of data to process. For large clusters this gives much better concurrency for the “small” queries you tested, as each query requests only a small amount of execution resources, allowing more queries to run in parallel.

Still, Impala should be faster for more complex queries, such as those combining a set of operations over different window functions. These require many separate execution steps, so Impala would take advantage of its better pipelining of intermediate data (provided, of course, that it fits in memory).

I think you are spot-on about the advantages of Hive/Tez for large numbers of concurrent queries. Since each session runs in its own completely separate Tez instance, we could scale the number of queries linearly until we ran into CPU limitations. I don’t think the answer is that simple for queries over large amounts of data, however. Tez is, after all, a DAG framework and can reuse steps as well, and it can keep and reuse containers on all nodes. Impala had some advantage at very short execution times, but with LLAP and other performance enhancements in future versions we hope to close that gap as well. There are obviously specific queries that work better on either system, but I doubt it can be generalized quite so easily.

