Featured Database Articles

Hybrid Histograms in Oracle 12c

By David Fitzjarrell

A useful new feature in Oracle 12c (12.1.0.2) is the Hybrid histogram, a new type of histogram that uses a reduced number of buckets compared to the number of distinct values in the table data and utilizes the repeat frequency of the endpoint values. This type of histogram can adjust the bucket size based on endpoint values due to the criteria Oracle sets when it generates such a histogram. Unlike other histograms Oracle can generate, the Hybrid histogram is based on four 'rules', which are:

a value should not be found in more than one bucket

the bucket size is allowed to be extended in order to contain all instances of the same distinct value

adjusted bucket size cannot be less than the original size (not applicable at either end of the data set)

the original number of buckets should not be reduced

Let's look at how Oracle decides on, and creates, a Hybrid histogram. Additionally we'll look at the instability of a Hybrid histogram as values are added to the data set. We take the table and data from an example by Jonathan Lewis:

That decision was based on the data, and the fact that this data set doesn't qualify for a TOP=Frequency histogram because the count for the top 20 values (20 being the number of buckets in the histogram) is less than the threshold calculated for that number of buckets:

The histogram creation isn't finished, as Oracle now orders the column values in order to find values that cross buckets. Once such values are found new bucket sizes are computed for all but the boundary buckets, in this case bucket 0 and bucket 19:

Notice that for buckets 1-12 the bucket size increased; for the remaining non-boundary buckets the bucket size should decrease. But according to Rule 3 no bucket can be adjusted to a size smaller than the original bucket size so those buckets remain unadjusted. The bucket size computation in the query results below is based on the sample size divided by the number of buckets and, thus, displays a constant value for all buckets:

For a Hybrid histogram three types of column value exist to base cardinality estimates upon: popular values, non-popular values that are endpoint and non-popular values that are not an endpoint. Let's look at how Oracle estimates the cardinality for each case, starting with popular values:

From the formula found in the above comments our calculated cardinality matches with that provided by the optimizer, so our understanding of the Hybrid histogram seems sound. To continue that verification we consider the case where we have a non-popular value that is an endpoint. The formula for that cardinality calculation is: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size). We calculate the NewDensity with a query based on the work of Alberto Dell'Era. First we return the non-popular values that are endpoints:

We now see if our calculation returned a valid value by 'plugging' it into the following equation: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size). The cardinality estimate we compute is:

Oracle returned 3 as the estimated cardinality and so did our calculation, more proof that our concept of the Hybrid histogram is correct. One final test will prove how well we understand this histogram, considering non-popular values that are not endpoints. Using our calculated value of NewDensity in this equation: E-Rows = num_rows * NewDensity gives us the following result: E-Rows = 100 * .0209375 = 2.09375 ~ 2. Time to see if our calculated value matches that which Oracle has estimated:

Our calculated value, as evidenced by the Rows value in the execution plan, matches Oracle's estimate, giving further evidence that our contcspt of the Hybrid histogram is sound.

`

As mentioned earlier in this article Hybrid histograms can exhibit instability in the form of endpoint changes based on the changing frequency of values as inserts are executed againt the table. To illustrate this let's add another 16 to the data and see what happens:

Hybrid histograms combine the accuracy of frequency histograms along with the space savings provided by height-balanced histograms by the mechanism of consolidating all like values into a single bucket. This generates more values that are considered 'popular' by the optimizer, which in turn offers the optimizer a better opportunity to generate an optimal execution path. Even with the demonstrated instability of endpoint values the Optimizer is more likely to produce a 'good' execution plan using the Hybrid histogram. It's a histogram option worth serious consideration.