In this post, we will discuss integrating Hive with the Tez framework, that is, enabling Tez for Hive queries. We will also run sample Hive queries on both the MapReduce and Tez frameworks and compare their performance.

Tez Advantages:

Tez offers a customizable execution architecture that allows us to express complex computations as data flow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.

Tez is designed to scale from gigabytes to petabytes of data and from tens to thousands of nodes, and it typically processes data faster than the MapReduce framework.

The Apache Tez library allows developers to create Hadoop applications that integrate with YARN and perform well with Hadoop clusters.

Benefits of Integrating Hive with Tez:

Tez can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability across a wide range of use cases and data set sizes.

Tez helps Hive move from batch mode toward interactive queries.

Up to the Hive 0.12 release, MapReduce was the only framework available in Hive for converting Hive queries into execution jobs on Hadoop clusters. Starting with the Hive 0.13 line (this post uses hive-0.13.1), the Tez execution engine is also supported, to improve the performance of complex Hive queries.

Hive on Tez:

By default, the execution engine in Hive is MapReduce (mr), so we don’t need to specify it explicitly to submit MapReduce jobs from our Hive queries. To set up Hive on Tez, we need at least the components below.

Prerequisite:

Running Hadoop 2 cluster with YARN framework

Hive-0.13.1 installed on hadoop cluster

Tez installed and configured on Hadoop successfully.

For Tez installation and configuration on Hadoop 2, refer to our previous post on the Tez framework.

We assume all the above three installations are done already and running fine.
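Before going further, it can be worth confirming that the prerequisites are reachable from the shell. The following is only a minimal sketch: the function name missing_cmds is made up for this illustration, and it only checks that the hadoop and hive launchers are on the PATH, not that the cluster or Tez is configured correctly.

```shell
# Hypothetical helper (not part of Hadoop, Hive or Tez):
# print the name of every given command that is NOT found on the PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Report which of the two launchers used in this post are missing.
# Prints nothing when both are available.
missing_cmds hadoop hive
```

An empty result only means the commands were found; running a sample Tez DAG job, as described in the previous post, remains the real test.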

Hive setup for Tez:

As Tez is already installed and we are able to run sample Tez DAG jobs successfully on the Hadoop cluster, we can now easily set up Hive for the Tez engine.

We need to perform the activities below in the same order.

As of the Hive 0.13.1 release, Hive embeds Tez. We need to copy the hive-exec-0.13.1.jar file from the $HIVE_HOME/lib directory into the HDFS directory specified by the tez.lib.uris property in the tez-site.xml file in ${TEZ_CONF_DIR}. In this post, that HDFS directory is /apps/tez-0.4.1. Use the command below to copy the jar.

$ hadoop fs -put $HIVE_HOME/lib/hive-exec-0.13.1.jar /apps/tez-0.4.1/
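The copy step can also be wrapped in a small script that first shows what tez.lib.uris is set to, so the destination can be compared against it. This is a sketch under assumptions: both function names are hypothetical, and the XML lookup is a simple text match that expects the value element on the same line as the name element or on the line right after it.

```shell
# Hypothetical helpers for illustration; not part of Hadoop, Hive or Tez.

# Print the <value> configured for tez.lib.uris in a tez-site.xml file.
# Simplification: assumes <value> is on the same line as, or the line
# right after, <name>tez.lib.uris</name>.
tez_lib_uris() {
  grep -F -A1 '<name>tez.lib.uris</name>' "$1" |
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Copy the hive-exec jar into the given HDFS directory, then list the
# directory so the upload can be confirmed by eye.
# $1 = HDFS directory (this post uses /apps/tez-0.4.1)
upload_hive_exec_jar() {
  hadoop fs -put "$HIVE_HOME/lib/hive-exec-0.13.1.jar" "$1/" &&
    hadoop fs -ls "$1"
}

# Example usage (commented out because it needs a live cluster):
# tez_lib_uris "$TEZ_CONF_DIR/tez-site.xml"
# upload_hive_exec_jar /apps/tez-0.4.1
```

Listing the directory after the put is just a convenience; if the jar is missing from the listing, Hive queries on Tez are unlikely to start.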

To run a query on the Tez engine, we need either to set hive.execution.engine=tez; in each Hive session or to change this value permanently in hive-site.xml. In this post, we will simply set this Hive variable per session so that we can compare the results of the mr and tez frameworks.

hive> set hive.execution.engine=tez;
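A third option, not used in this post, is to pass the property on the command line when running a query non-interactively. The sketch below assumes the hive launcher is on the PATH; the wrapper name run_on_engine and the table name sample_users are made up for illustration.

```shell
# Hypothetical wrapper (the function name is made up for this post):
# run one HiveQL statement non-interactively on the chosen engine.
# $1 = engine, either "mr" or "tez"; $2 = the query text
run_on_engine() {
  case "$1" in
    mr|tez) ;;
    *) echo "unknown engine: $1" >&2; return 2 ;;
  esac
  hive --hiveconf hive.execution.engine="$1" -e "$2"
}

# Example usage (commented out because it needs a working Hive install;
# sample_users is a placeholder table name):
# run_on_engine tez 'SELECT COUNT(*) FROM sample_users;'
```

Each call starts a fresh Hive CLI, so timings taken this way include start-up cost and are not directly comparable to queries run inside an already open session.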

That’s it; we are done with the Hive setup for Tez.

Sample run of Hive Queries on Tez:

To test the performance improvement of Tez over MapReduce, let’s create a sample Hive table and run some basic queries on it. The sample data used for the examples in this section is available at SampleUserData.

Log in to the Hive shell, set hive.execution.engine=mr to run the queries as MapReduce jobs, and note down the execution time for each query.

Query 1:

So it took around 13.5 seconds.

Query 2:

This second query took 18.2 seconds.

Tez Framework:

Now we will run the same two queries on the Tez framework after setting:

hive> set hive.execution.engine=tez;

The first query run after setting the above property takes some extra time compared to subsequent queries. This is because, during the first query execution, Tez allocates the containers required for the Tez session. Once the Tez session is established, subsequent queries do not take as long.

That’s why the first run of the same query took around 13 seconds, whereas the second run took just 6 seconds.

Query 1:

So this query took just 6 seconds on Tez, compared to around 13.5 seconds on the mr engine, which makes Tez a little more than twice as fast (about 2.25x) for this query.

Query 2:

This query also ran more than twice as fast on Tez as on the mr engine (around 18 seconds).

So, for these sample queries, Tez is roughly twice as fast as the MR framework once the Tez session is established.
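The speedup for Query 1 can be checked with a little arithmetic. The helper names below are made up for this illustration, and only the two timings reported above for Query 1 (13.5 seconds on mr, 6 seconds on Tez) are used. Note that "times as fast" and "percent faster" are different measures of the same ratio.

```shell
# Hypothetical helpers for checking the arithmetic; awk does the
# floating-point math.

# How many times as fast Tez is: mr_time / tez_time.
speedup() {
  awk -v mr="$1" -v tez="$2" 'BEGIN { printf "%.2f\n", mr / tez }'
}

# Percentage by which Tez is faster: (mr_time / tez_time - 1) * 100.
percent_faster() {
  awk -v mr="$1" -v tez="$2" 'BEGIN { printf "%.0f\n", (mr / tez - 1) * 100 }'
}

speedup 13.5 6          # prints 2.25
percent_faster 13.5 6   # prints 125
```

So for Query 1, Tez is 2.25 times as fast as mr, which is the same as being 125% faster.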
