Tuning SQL Queries

A newer version of this documentation is available. Click here to view the most up-to-date release of the Greenplum 5.x documentation.

Tuning SQL Queries

The Greenplum Database cost-based optimizer evaluates many strategies for executing a
query and chooses the least costly method.

Like other RDBMS optimizers, the Greenplum optimizer takes into account factors such as the
number of rows in tables to be joined, availability of indexes, and cardinality of column data
when calculating the costs of alternative execution plans. The optimizer also accounts for the
location of the data, preferring to perform as much of the work as possible on the segments
and to minimize the amount of data that must be transmitted between segments to complete the
query.

When a query runs slower than you expect, you can view the plan the optimizer selected as
well as the cost it calculated for each step of the plan. This will help you determine which
steps are consuming the most resources and then modify the query or the schema to provide the
optimizer with more efficient alternatives. You use the SQL EXPLAIN statement
to view the plan for a query.

The optimizer produces plans based on statistics generated for tables. It is important to
have accurate statistics to produce the best plan. See Updating Statistics with ANALYZE in
this guide for information about updating statistics.

How to Generate Explain Plans

The EXPLAIN and EXPLAIN ANALYZE statements are useful
tools to identify opportunities to improve query performance. EXPLAIN
displays the query plan and estimated costs for a query, but does not execute the query.
EXPLAIN ANALYZE executes the query in addition to displaying the query
plan. EXPLAIN ANALYZE discards any output from the SELECT
statement; however, other operations in the statement are performed (for example,
INSERT, UPDATE, or DELETE). To use
EXPLAIN ANALYZE on a DML statement without letting the command affect the
data, explicitly use EXPLAIN ANALYZE in a transaction (BEGIN;
EXPLAIN ANALYZE ...; ROLLBACK;).

EXPLAIN ANALYZE runs the statement in addition to displaying the plan with
additional information as follows:

Total elapsed time (in milliseconds) to run the query

Number of workers (segments) involved in a plan node operation

Maximum number of rows returned by the segment (and its segment ID) that produced the
most rows for an operation

The memory used by the operation

Time (in milliseconds) it took to retrieve the first row from the segment that produced
the most rows, and the total time taken to retrieve all rows from that segment.

How to Read Explain Plans

An explain plan is a report detailing the steps the Greenplum Database optimizer has
determined it will follow to execute a query. The plan is a tree of nodes, read from bottom
to top, with each node passing its result to the node directly above. Each node represents a
step in the plan, and one line for each node identifies the operation performed in that
step—for example, a scan, join, aggregation, or sort operation. The node also
identifies the method used to perform the operation. The method for a scan operation, for
example, may be a sequential scan or an index scan. A join operation may perform a hash join
or nested loop join.

Following is an explain plan for a simple query. This query finds the number of rows in the
contributions table stored at each segment.

The cost is a two-part estimate. A cost of 1.0 is equal to one sequential disk page read.
The first part of the estimate is the start-up cost, which is the cost of getting the first
row. The second estimate is the total cost, the cost of getting all of the rows.

The rows estimate is the number of rows output by the plan node. The number may be lower
than the actual number of rows processed or scanned by the plan node, reflecting the
estimated selectivity of WHERE clause conditions. The total cost assumes
that all rows will be retrieved, which may not always be the case (for example, if you use a
LIMIT clause).

The width estimate is the total width, in bytes, of all the columns output by the plan
node.

The cost estimates in a node include the costs of all its child nodes, so the top-most node
of the plan, usually a Gather Motion, has the estimated total execution costs for the plan.
This is this number that the query planner seeks to minimize.

Scan operators scan through rows in a table to find a set of rows. There are different scan
operators for different types of storage. They include the following:

Index Scan — traverses a B-tree index to fetch the rows from the table.

Bitmap Append-only Row-oriented Scan — gathers pointers to rows in an append-only
table from an index and sorts by location on disk.

Dynamic Table Scan — chooses partitions to scan using a partition selection
function. The Function Scan node contains the name of the partition selection function,
which can be one of the following:

gp_partition_expansion — selects all partitions in the table.
No partitions are eliminated.

gp_partition_selection — chooses a partition based on an
equality expression.

gp_partition_inversion — chooses partitions based on a range
expression.

The Function Scan node passes the dynamically selected list of partitions to the
Result node which is passed to the Sequence node.

Join operators include the following:

Hash Join – builds a hash table from the smaller table with the join column(s) as
hash key. Then scans the larger table, calculating the hash key for the join column(s) and
probing the hash table to find the rows with the same hash key. Hash joins are typically
the fastest joins in Greenplum Database. The Hash Cond in the explain plan identifies the
columns that are joined.

Nested Loop – iterates through rows in the larger dataset, scanning the rows in
the smaller dataset on each iteration. The Nested Loop join requires the broadcast of one
of the tables so that all rows in one table can be compared to all rows in the other
table. It performs well for small tables or tables that are limited by using an index. It
is also used for Cartesian joins and range joins. There are performance implications when
using a Nested Loop join with large tables. For plan nodes that contain a Nested Loop join
operator, validate the SQL and ensure that the results are what is intended. Set the
enable_nestloop server configuration parameter to OFF (default) to
favor Hash Join.

Merge Join – sorts both datasets and merges them together. A merge join is fast
for pre-ordered data, but is very rare in the real world. To favor Merge Joins over Hash
Joins, set the enable_mergejoin system configuration parameter to ON.

Some query plan nodes specify motion operations. Motion operations move rows between
segments when required to process the query. The node identifies the method used to perform
the motion operation. Motion operators include the following:

Broadcast motion – each segment sends its own, individual rows to all other
segments so that every segment instance has a complete local copy of the table. A
Broadcast motion may not be as optimal as a Redistribute motion, so the optimizer
typically only selects a Broadcast motion for small tables. A Broadcast motion is not
acceptable for large tables. In the case where data was not distributed on the join key, a
dynamic redistribution of the needed rows from one of the tables to another segment is
performed.

Redistribute motion – each segment rehashes the data and sends the rows to the
appropriate segments according to hash key.

Gather motion – result data from all segments is assembled into a single stream.
This is the final operation for most query plans.

Other operators that occur in query plans include the following:

Materialize – the planner materializes a subselect once so it does not have to
repeat the work for each top-level row.

InitPlan – a pre-query, used in dynamic partition elimination, performed when
the values the planner needs to identify partitions to scan are unknown until execution
time.

Sort – sort rows in preparation for another operation requiring ordered rows,
such as an Aggregation or Merge Join.

Optimizing Greenplum Queries

This topic describes Greenplum Database features and programming practices that can be used
to enhance system performance in some situations.

To analyze query plans, first identify the plan nodes where the estimated cost to perform
the operation is very high. Determine if the estimated number of rows and cost seems
reasonable relative to the number of rows for the operation performed.

If using partitioning, validate that partition elimination is achieved. To achieve
partition elimination the query predicate (WHERE clause) must be the same
as the partitioning criteria. Also, the WHERE clause must not contain an
explicit value and cannot contain a subquery.

Review the execution order of the query plan tree. Review the estimated number of rows. You
want the execution order to build on the smaller tables or hash join result and probe with
larger tables. Optimally, the largest table is used for the final join or probe to reduce
the number of rows being passed up the tree to the topmost plan nodes. If the analysis
reveals that the order of execution builds and/or probes is not optimal ensure that database
statistics are up to date. Running ANALYZE will likely address this and
produce an optimal query plan.

Look for evidence of computational skew. Computational skew occurs during query execution
when execution of operators such as Hash Aggregate and Hash Join cause uneven execution on
the segments. More CPU and memory are used on some segments than others, resulting in less
than optimal execution. The cause could be joins, sorts, or aggregations on columns that
have low cardinality or non-uniform distributions. You can detect computational skew in the
output of the EXPLAIN ANALYZE statement for a query. Each node includes a
count of the maximum rows processed by any one segment and the average rows processed by all
segments. If the maximum row count is much higher than the average, at least one segment has
performed much more work than the others and computational skew should be suspected for that
operator.

Identify plan nodes where a Sort or Aggregate operation is performed. Hidden inside an
Aggregate operation is a Sort. If the Sort or Aggregate operation involves a large number of
rows, there is an opportunity to improve query performance. A HashAggregate operation is
preferred over Sort and Aggregate operations when a large number of rows are required to be
sorted. Usually a Sort operation is chosen by the optimizer due to the SQL construct; that
is, due to the way the SQL is written. Most Sort operations can be replaced with a
HashAggregate if the query is rewritten. To favor a HashAggregate operation over a Sort and
Aggregate operation ensure that the enable_groupagg server configuration
parameter is set to ON.

When an explain plan shows a broadcast motion with a large number of rows, you should
attempt to eliminate the broadcast motion. One way to do this is to use the
gp_segments_for_planner server configuration parameter to increase the
cost estimate of the motion so that alternatives are favored. The
gp_segments_for_planner variable tells the query planner how many primary
segments to use in its calculations. The default value is zero, which tells the planner to
use the actual number of primary segments in estimates. Increasing the number of primary
segments increases the cost of the motion, thereby favoring a redistribute motion over a
broadcast motion. For example, setting gp_segments_for_planner = 100000
tells the planner that there are 100,000 segments. Conversely, to influence the optimizer to
broadcast a table and not redistribute it, set gp_segments_for_planner to a
low number, for example 2.

Greenplum Grouping Extensions

Greenplum Database aggregation extensions to the GROUP BY clause can
perform some common calculations in the database more efficiently than in application or
procedure code:

GROUP BY ROLLUP(col1, col2, col3)

GROUP BY CUBE(col1, col2, col3)

GROUP BY GROUPING SETS((col1, col2), (col1,
col3))

A ROLLUP grouping creates aggregate subtotals that roll up from the most
detailed level to a grand total, following a list of grouping columns (or expressions).
ROLLUP takes an ordered list of grouping columns, calculates the
standard aggregate values specified in the GROUP BY clause, then creates
progressively higher-level subtotals, moving from right to left through the list. Finally,
it creates a grand total.

A CUBE grouping creates subtotals for all of the possible combinations
of the given list of grouping columns (or expressions). In multidimensional analysis
terms, CUBE generates all the subtotals that could be calculated for a
data cube with the specified dimensions.

You can selectively specify the set of groups that you want to create using a
GROUPING SETS expression. This allows precise specification across
multiple dimensions without computing a whole ROLLUP or
CUBE.

Refer to the Greenplum Database Reference Guide for details of these clauses.

Window Functions

Window functions apply an aggregation or ranking function over partitions of the result
set—for example, sum(population) over (partition by city). Window
functions are powerful and, because they do all of the work in the database, they have
performance advantages over front-end tools that produce similar results by retrieving
detail rows from the database and reprocessing them.

The row_number() window function produces row numbers for the rows in
a partition, for example, row_number() over (order by id).

When a query plan indicates that a table is scanned in more than one operation, you
may be able to use window functions to reduce the number of scans.

It is often possible to eliminate self joins by using window functions.