default_statistics_target - Explained

The below documents explains how parameter default_statistics_target works with an example.

Description

Definition:

As per the documentation, the definition of "default_statistics_target" is described as, " Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates ".

Default value/Range:

By default, Greenplum set the value to 25, it can accept values from 1 to 1000. The parameter can be set at the session level and doesn't require a restart of the database.

Usage:

From the description/definition mentioned above , it kind of confusing to understand what its used for, in short, and in basic term, this parameter controls the way the stats are collected, with value 1 being the least estimated/accurate statistics and the value 1000 is the most accurate statistics, obviously with the expense of time/resources (CPU, memory etc ) / space . Normally the default value is sufficient to get an accurate plan, but if you have a complex data distribution / or a column is referenced in the query quite often, then setting a higher value might help in getting a better statistics on the table and hence a better plan for the optimizer to execute.

similarly increasing the value of default_statistics_target yield better and accurate stats of the table and its data distribution. If you wish to collect stats on a specific column with specific/constant value and don't want to alter the stats collection irrespective of what value default_statistics_target is defined, you can define stats method using.

Pros- Better statistics of the distributed data. The better plan as now the planner has more accurate statistics of the table. Faster and quicker response from the query, since now it has a good plan

Cons- Increasing the value of the parameter, leading to more time needed for analyze to complete, from my quick test of three values of the parameter with 3 columns yield the below time.

Serial #

Value

Time

1

25

300ms

2

100

2 sec

3

1000

10 sec

Now to do the extra work of collect statistics it will consume additional memory / CPU to achieve this. pg_statistics will consume more space, since now it has to store more information about the stats of the table in its histogram column.