Apache Hive Query Optimzation

26 October 2017

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis.This blog explains some of the optimization techniques that can be applied in hive query.

Partitions in Hive

Partitions are fundamentally horizontal slices of data which allow large sets of data to be segmented into more manageable chunks. When you define the table with partitions, the underlying structure is changed such that sub-directories are created for each slicer or partitioning column.

The partitions are a physical segmenting of the data - where the partition is maintained in the directory system, and the queries use the metadata to determine where the partition is located. so if you can make the directory structure match the query, it should find the data you want.

Partition columns are virtual columns, they are not part of the data itself but are derived on load.Each Table can have one or more partition Keys which determines how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria

Setting Maximum number of Partitions in Hive

To set a limit on Maximum number of Partition in Hive
use below properties.

hive.exec.max.dynamic.partitions

Dynamic Partitioning in Hive

A partition is called dynamic partition in hive when there is partition in both the file system and metastore when data is inserted in hive table.

There are two components to a partition:

its directory on the filesystem

an entry in Hive’s metastore.

This entry is essentially just the pair (partition values, partition location).

In Hive, as data is written to disk, each partition of data will be automatically split out into different folders.For Dynamic Partition When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. Dynamic partitions are more useful when there is large amount of data and you don’t know how many unique values exist for that column.

Hive properties than enable Dynamic Partition

Bucketing

Bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance.
Unlike partitioning where each value for the slicer or partition key gets its own space, in clustering a hash is taken for the field value and then distributed across buckets.

Buckets allow the system to prune large quantities of data during query processing, resulting in faster query execution.It also helps in doing efficient map-side joins.

Indexes in Hive

In Hive ,indexes is used to improve the speed of query for certain columns of table .The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index.

Hive Execution Engine

There are 3 available options for hive.execution.engine.

mr (Map reduce ,default)

tez (Tez execution ,for Hadoop 2 only)

spark (Spark execution ,for Hive 1.10 onward)

Difference between SELECT * vs SELECT Query in Hive

select * from <table-name>
This is just a read operation for hive engine. So data is simple read from your file and dumped on to the screen. No computations are involved in this so the default execution engine, i.e., map-reduce jobs won’t start.

It’s an optimization technique. Hive, fetch, task conversion property can minimize the latency of map-reduce overhead. When queried like SELECT, FILTER LIMIT queries, this property skips map reduce and using FETCH task. As a result, Hive can execute queries without running MapReduce task.

select <col1 ,col2> from <table-name>

This operation requires projection, i.e., some sort of computational operation needs to be performed before showing you the result. So the map-reduce jobs start and its takes a while before you get the result.

Purpose of using Map join in Hive

If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.

Delimiter in Hive

While you create a table in Hive, you specify the delimiter to let Hive know the format of the data you have in the input file.

Hive Skewed Table

A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and rest of the values go to some other file.
Create Hive Skewed Table Syntax

createtableT(c1string,c2string)skewedby(c1)on(‘x1’)

The skewed table is again one of the important Hive query optimization techniques.