Hybrid Fake

Oracle 12c introduced the “Hybrid” histogram – a nice addition to the available options and one that (ignoring the bug for which a patch has been created) supplies the optimizer with better information about the data than the equivalent height-balanced histogram. There is still a problem, though, in the trade-off between accuracy and speed: just as it does with height-balanced histograms when using auto_sample_size Oracle samples (typically) about 5,500 rows to create a hybrid histogram, and the SQL it uses to generate the necessary summary is essentially an aggregation of the sample, so either you have a small sample with the risk of lower accuracy or a large sample with an increase in workload. This being the case it’s worth knowing how to create a hybrid histogram using the dbms_stats.set_column_stats() API.

It’s fairly easy to identify the cases where a hybrid histogram could be helpful. You have a large volume of data spread over a large number (more than 2048) of distinct values, but a few values (typically less than 250) which are responsible for a significant fraction of the data. You would like to tell Oracle about the special “extreme” cases so that the optimizer can take defensive if you query for one of those values, but at the same time you would like to give Oracle a picture of the way the rest of the data is distributed. This is similar in some respects to the Top-N (a.k.a. Top-Frequency) histogram which says to Oracle “We have a small number of popular values, and some odds and ends on the side that are pretty ignorable”, the critical difference is that you need the hybrid histogram when it’s not safe to “ignore” the odds and ends.

Here’s an example of creating some data and then generating a completely artificial hybrid histogram. The code demonstrates 3 points – the principle feature of creating hybrid histograms and a couple of generic details about Oracle’s histograms:

The main point is that Oracle 12c introduces a new numeric array in the dbms_stats.statrec structure. This allows each row (bucket) in a histogram to hold a second statistic about the bucket so we can now store a frequency figure for the bucket as a whole, and a “repeat-count” figure for the highest value in the bucket. (Warning – there is a counter-intuitive conflict between the name of the new structure and the way it is used for hybrid histograms).

As side-point I’ve included a code variation that shows you the remarkable similarity between generating a Frequency histogram and a Hybrid histogram.

As a second side-point I have also highlighted the effect you see in the dba_tab_histograms view when your popular values are “too similar” to each other – i.e. when they match on the first 6 characters.

We start by creating a table as a copy of the view all_objects – then we’re going to create a hybrid histogram on the object_type column that looks nothing like the data. The histogram will say:

for every 15,000 rows (where the column is not null)

5,000 will have values less than or equal to ‘C’, of which 3,000 will have the value ‘C’

The next 2,000 (i.e. running total 7,000) will have values greater than ‘C’ and up to ‘PPPPPP1’, but ‘PPPPPP1’ itself is not a popular value

The next 2,000 (i.e. running total 9,000) will have values greater than ‘PPPPPP1’ and up to ‘PPPPPP2’, but ‘PPPPPP2’ itself is not a popular value

The next 2,000 (i.e. running total 11,000) will have values greater than ‘PPPPPP2’ and up to ‘PPPPPP3’, but ‘PPPPPP3’ itself is not a popular value

The last 4,000 (i.e. running total 15,000) will have values greater than ‘PPPPPP3’ and up to ‘X’ of which 3,000 will have the value ‘X’

Note particularly that the “how many rows hold the endpoint value” are stored in the statrec.bkvals array – just as they would be for a frequency histogram – and the cumulative count of rows is stored in the statrec.rpcnts structure. All we have to do to create a frequency histogram instead of a hybrid histogram is to store zeros in the statrec.rpcnts structure, or leave it uninitialized.

You’ll notice that since I’m creating a histogram on a character column I’ve used an array of type dbms_stats.chararray to hold the list of values (in ascending order) that I want the histogram to describe.

That’s it – it’s remarkably simple. To show the effect of running this code I can report the content of user_tab_histograms for the column. I’ve actually run the code and queried the results twice; first for the case where I created the hybrid histogram and then after modifying the PL/SQL block to set the rpcnts array to zeros to create a frequency histogram.

I made a comment earlier on that the naming and use of the rpcntsstructure was somewhat counter-intuitive. As you can see in the results above, when I created the hybrid histogram the values I stored in the rpcnts structure are not the values reported as the “repeat count”, the numbers reported as the “repeat count” are from the bkvals(bucket values). As far as I’m concerned this means I have to go back to my basic examples every time I want to fake a histogram because I’m never too sure which arrays I should populate with what values – and whether I should use absolute or cumulative values.

One last minor point: you’ll see that the endpoint_actual_value has been populated in this example. This is because (with Oracle’s interesting transformation from character to numeric) the three ‘PPPPPPx’ character values turn into the same number – so Oracle stores the first 64 bytes (or 32 for versions of Oracle prior to 12c) of the actual value.