Dynamic Partition Pruning for Hive Map Joins

You can enable dynamic partition pruning for map joins when you are running Hive on Spark (HoS), it is not available for Hive on MapReduce. Dynamic partition pruning (DPP) is a database
optimization that can significantly decrease the amount of data that a query scans, thereby executing your workloads faster. DPP achieves this by dynamically determining and eliminating the number of
partitions that a query must read from a partitioned table.

Map joins also optimize how Hive executes queries. They cause a small table to be scanned and loaded in memory as a hash table so that a fast join can be performed entirely within a
mapper without having to use another reduce step. If you have queries that join small tables, map joins can make them execute much faster. Map joins are enabled by default in CDH with the
Enable MapJoin Optimization setting for HiveServer2 in Cloudera Manager. Hive automatically uses map joins for join queries that involve a set of tables where:

There is one large table and there is no limit on the size of that large table.

All other tables involved in the join must have an aggregate size under the value set for Hive Auto Convert Join Noconditional Size for HiveServer2,
which is set to 20MB by default in Cloudera Manager.

Enabling Dynamic Partition Pruning for Map Joins in Hive on Spark

Dynamic partition pruning (DPP) is disabled by default. Use Cloudera Manager to set the following properties.

Important: Cloudera does not support nor recommend setting the property hive.spark.dynamic.partition.pruning to
true in production environments. This property enables DPP for all joins, both map joins and common joins. The property hive.spark.dynamic.partition.pruning.map.only, which enables DPP for map joins only in Hive on Spark is the only supported implementation of DPP for Hive on Spark in CDH.

Property Name

Description

Default Setting

hive.spark.dynamic.partition.pruning.map.join.only

Enables dynamic partition pruning for queries where the join on the partitioned column is a map join. This property only applies to the Spark execution engine.

Set this property to true to use dynamic partition pruning for queries where the join on the partitioned column is a map join.

false (turned off)

hive.spark.dynamic.partition.pruning

Important: Setting this property to true is not supported in CDH.

Enables dynamic partition pruning for all joins, including shuffle joins and map joins.

false (turned off)

Enabling DPP on a Per-Query Basis with the Hive SET Command

To enable DPP at the session level, use the Hive SET command:

SET hive.spark.dynamic.partition.pruning.map.join.only=true;

Enabling DPP as a Service-Wide Default with Cloudera Manager

Use Cloudera Manager to enable DPP as a service-wide default:

In the Cloudera Manager Admin Console, go to the Hive service.

In the Hive service page, click the Configuration tab.

On the Configuration page, click the HiveServer2 scope and click the Performance category.

Search for Hive on Spark Dynamic Partition Pruning for MapJoins, and select the check box.

Click Save Changes.

Verifying Your Query Uses Dynamic Partition Pruning in Hive on Spark

Use EXPLAIN to generate a query plan, which you can use to verify that DPP is being triggered for your query.

Example of Verifying that Dynamic Partition Pruning Is Triggered For Your Query

In this example, TPC-DS benchmark data is used with the query generated from query3.tpl in their downloadable package. It demonstrates how you can use the
EXPLAIN command to verify that DPP is being triggered. For more information about the TPC-DS benchmark data and queries, see www.tpc.org/tpcds/.

First, set the following properties which instruct Hive to use Spark as its execution engine and turns on DPP for map joins:

SET hive.execution.engine=spark;
SET hive.spark.dynamic.partition.pruning.map.join.only=true;

Then run the following commands, which tell Hive to use the testing_example_db database and to show (EXPLAIN) the query plan
for the query that follows:

The EXPLAIN command returns the query plan for the TPC-DS query. An excerpt from that query plan is included below. Look for the Spark HashTable Sink Operator and the Spark Partition Pruning Sink Operator, which are in bold font in the following output. Presence of these sink
operators in the query plan indicate that DPP is being triggered for the query.

Note: There are a few map join patterns that are not supported by DPP. For DPP to be triggered, the Spark Partition Pruning Sink Operator must have a target Map Work in a child stage. For example, in the above query plan, the Spark
Partition Pruning Sink Operator resides in Stage-2 and has a target work: Map
2. So for DPP to be triggered, Map 2 must reside in either Stage
1 or Stage 0 because both are dependent on Stage 2, thus they
are both children of Stage 2. See the STAGE DEPENDENCIES at the top of the
query plan to see the stage hierarchy. If Map 2 resides in Stage 2, DPP is
not triggered because Stage 2 is the root stage and therefore cannot be a child stage.

Queries That Trigger and Benefit from Dynamic Partition Pruning in Hive on Spark

When tables are created in Hive, it is common practice to partition them. Partitioning breaks large tables into horizontal slices of data. Each partition typically corresponds to a
separate folder on HDFS. Tables can be partitioned when the data has a "natural" partitioning column, such as a date column. Hive queries that read from partitioned
tables typically filter on the partition column in order to avoid reading all partitions from the table. For example, if you have a partitioned table called date_partitioned_table that is partitioned on the datetime column, the following query only reads partitions that are created after January 1,
2017:

SELECT *
FROM date_partitioned_table
WHERE datetime > '2017-01-01';

If the date_partitioned_table table has partitions for dates that extend to 2010, this WHERE clause filter can significantly
decrease the amount of data that needs to be read by the query. This query is easy for Hive to optimize. When it is compiled, only partitions where datetime is greater
than 2017-01-01 need to be read. This form of partition pruning is known as static partition pruning.

However, when queries become more complex, the filter on the partitioned column cannot be evaluated at runtime. For example, this query:

SELECT *
FROM date_partitioned_table
WHERE datetime IN (SELECT * FROM non_partitioned_table);

With this type of query, it is difficult for the Hive compiler to optimize its execution because the rows that are returned by the sub query SELECT * FROM
non_partitioned_table are unknown. In this situation, dynamic partition pruning (DPP) optimizes the query. Hive can dynamically prune partitions from the scan of non_partitioned_table by eliminating partitions while the query is running. Queries that use this pattern can see performance improvements when DPP is enabled. Note that this query
contains an IN clause which triggers a join between the date_partitioned_table and the non_partitioned_table. DPP is only triggered when there is a join on a partitioned column.

DPP might provide performance benefits for Hive data warehouses that use the star or snowflake schema. Performance improvements are possible for Hive queries that join a partitioned fact
table on the partitioned column of a dimension table if DPP is enabled. The TPC-DS benchmark is a good example where many of its queries benefit from DPP. The query example from the TPC-DS benchmark
listed in the above section with EXPLAIN, triggers DPP:

This query performs a join between the partitioned store_sales table and the non-partitioned date_dim table. The join is
performed against the partition column for store_sales, which is what triggers DPP. The join must be against a partitioned column for DPP to be triggered.

DPP is only supported for map joins. It is not supported for common joins, those that require a shuffle phase. A single query may have multiple joins, some of which are map joins and
some of which are common joins. Only the join on the partitioned column must be a map join for DPP to be triggered.

Debugging Dynamic Partition Pruning in Hive on Spark

Debug DPP for Hive on Spark by viewing the query plan produced with the EXPLAIN command or by viewing two types of log files. Both options are discussed in
the following sections.

Debugging with Query Plans Produced with EXPLAIN

A simple way to check whether DPP is triggered for a query is to use the EXPLAIN command as shown above in Verifying Your DPP Configuration in Hive on Spark. If the query plan contains a Spark Partition Pruning Sink Operator, DPP
will be triggered for the query. If it does not contain this operator, DPP will not be triggered for the query.

Debugging with Logs

Use the HiveServer2 logs to debug the compile time phase of DPP and use the Hive on Spark Remote Driver logs to debug the runtime phase of DPP:

HiveServer2 Logs

The HiveServer2 logs print debugging information from the Java class DynamicPartitionPruningOptimization. This class looks at the query and checks if it
can benefit from DPP. If the query can benefit from DPP, the class modifies the query plan to include DPP-specific operators, such as the Spark Partition Pruning Sink
Operator. When the class runs, it prints out information related to whether or not it is enabling DPP for a particular clause in the query.

For example, if the following message appears in the HiveServer2 log, it means that DPP will be triggered and that partitions will be dynamically pruned from the partitioned_table table, which is in bold text in the following example:

The Hive on Spark (HoS) Remote Driver logs print debugging information from the Java class SparkDynamicPartitionPruner. This class does the actual pruning
of the partitioned table. Because pruning happens at runtime, the logs for this class are located in the HoS Remote Driver logs instead of the HiveServer2 logs. These logs print which partitions are
pruned from the partitioned table, which can be very useful for troubleshooting.

For example, if the following message appears in the HoS Remote Driver log, it means that the partition partition_column=1 is being pruned from the table
partitioned_table, both of which are in bold text in the following example:

Using Hive UDFs with Hive on Spark

When the execution engine is set to Spark, use Hive UDFs the same way that you use them when the execution engine is set to MapReduce. To apply a custom UDF on the column of a Hive
table, use the following syntax:

SELECT <custom_UDF_name>(<column_name>) FROM <table_name>;

For example, to apply the custom UDF addfunc10 to the salary column of the sample_07 table in
the default database that ships with CDH, use the following syntax:

SELECT addfunc10(salary) FROM sample_07 LIMIT 10;

The above HiveQL statement returns only 10 rows from the sample_07 table.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.