I was going to leave it for a few days, but there have already been so many comments and discussions on all this that I thought I’d better write something up. In case anyone was wondering, yes, I probably am driving my colleagues at work mad with my “Question of the Day” !!

Unfortunately, some might be disappointed at both Oracle and myself🙂

Yes, I did kinda set things up to trick the unwary and yes, perhaps the answer isn’t what many are expecting.

The answer to my previous question of which column is going to have a histogram when using the METHOD_OPT SIZE AUTO option is in fact Column 2. Well done to everyone who got it right.

Why ?

The simplest answer is because it’s the only column of the three that has 254 or fewer distinct values.

Here’s the key point. When using METHOD_OPT SIZE AUTO, every column with 254 or fewer distinct values that has been referenced within a predicate will have a Frequency-Based histogram. Each and every one of them, regardless of whether the data is actually skewed or not. So Column 2, with only 254 distinct values AND having previously been referenced in a predicate, was guaranteed to have a histogram.
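For illustration, this is the sort of statistics call being discussed (a minimal sketch; the table name BOWIE and the 100% sample are my assumptions, not necessarily the exact call from the quiz):

BEGIN
   dbms_stats.gather_table_stats(
      ownname          => NULL,
      tabname          => 'BOWIE',
      estimate_percent => 100,
      method_opt       => 'FOR ALL COLUMNS SIZE AUTO');
END;
/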

If a column has more than 254 distinct values, whether it then has a Height-Based histogram depends on how the data is skewed. If the data is perfectly evenly distributed, then it won’t have a histogram. Column 1, having sequence-based unique values, will not meet the criteria and so will not have a histogram.

Column 3 is interesting. Having had the outlier value inserted, it now has 255 distinct values and so no longer qualifies for an automatic Frequency-Based histogram. However, if all a column’s values are evenly distributed, it won’t qualify for a Height-Based histogram either, and Column 3 has just the one outlier value, with all its other values evenly distributed. Unfortunately, Oracle doesn’t pick up on rare outlier values (even if you collect 100% statistics and the outlier is one of the low/high points of the column) and so will not generate a Height-Based histogram.

Notice I’m using a MOD function to generate a perfectly even distribution of data. I’ve noticed a few examples (such as that by Charles Hooper in the comments of the quiz posting) in which the DBMS_RANDOM function is used instead. Note that on a 1M row table with 254 random values, this will almost certainly generate data with enough natural skewness that, when the outlier 255th value is introduced, the column will qualify for a Height-Based histogram. It’s very easy to test and find out: simply generate the 1M rows with 255 random values and I suggest a Height-Based histogram will be created regardless.
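For anyone wanting to play along, here’s a minimal sketch of the kind of setup being described (table and column names are my assumptions): ID1 gets sequence-based unique values, while ID2 and ID3 each get exactly 254 MOD-generated, perfectly evenly distributed values.

CREATE TABLE bowie AS
SELECT rownum               AS id1,   -- 1,000,000 distinct sequence-based values
       mod(rownum, 254) + 1 AS id2,   -- 254 distinct values, perfectly even
       mod(rownum, 254) + 1 AS id3    -- ditto, outlier to be added later
FROM   dual
CONNECT BY level <= 1000000;

Swapping the MOD expression for something like trunc(dbms_random.value(1, 255)) gives 254 random values instead, with the natural skewness discussed above.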

OK, I’ll run some SQL to generate sufficient workload to qualify the columns for automatic histograms:
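Something along these lines would do (a sketch; it’s the predicates that matter, as column usage is recorded in SYS.COL_USAGE$ and flags the columns as histogram candidates at the next statistics gathering):

SELECT count(*) FROM bowie WHERE id1 = 42;
SELECT count(*) FROM bowie WHERE id2 = 42;
SELECT count(*) FROM bowie WHERE id3 = 42;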

BTW, the difference between the SIZE AUTO and SIZE SKEWONLY options is that AUTO requires previous workload to suggest a histogram might be relevant, whereas SKEWONLY does not.
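The two options side by side (again just a sketch):

BEGIN
   -- AUTO: a histogram is only considered for columns with recorded workload
   dbms_stats.gather_table_stats(NULL, 'BOWIE',
      method_opt => 'FOR ALL COLUMNS SIZE AUTO');
   -- SKEWONLY: decided purely on the column data, no prior workload required
   dbms_stats.gather_table_stats(NULL, 'BOWIE',
      method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY');
END;
/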

If we were to collect statistics at this stage, we would notice that the second and third columns both have a Frequency-Based histogram as both columns only have 254 distinct values and so automatically qualify:
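A quick check along these lines shows the state of play (a sketch, assuming the BOWIE table above; the output shown is indicative only):

SELECT column_name, num_distinct, histogram
FROM   user_tab_cols
WHERE  table_name = 'BOWIE'
ORDER  BY column_name;

-- COLUMN_NAME  NUM_DISTINCT  HISTOGRAM
-- ID1               1000000  NONE
-- ID2                   254  FREQUENCY
-- ID3                   254  FREQUENCY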

Note that, after the outlier value is inserted and statistics re-gathered (as sketched below), the third column now has 255 distinct values and so no longer qualifies for the automatic Frequency-Based histogram. As most of its data is perfectly evenly distributed, with just the one outlier value, the column doesn’t qualify for a Height-Based histogram either and so now has no histogram at all.
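The step being described looks something like this (a sketch; the specific outlier value of 1,000,000 is my assumption):

INSERT INTO bowie VALUES (1000001, 42, 1000000);
COMMIT;

BEGIN
   dbms_stats.gather_table_stats(NULL, 'BOWIE',
      estimate_percent => 100,
      method_opt       => 'FOR ALL COLUMNS SIZE AUTO');
END;
/

-- ID3 now reports 255 distinct values and HISTOGRAM = NONE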

Note that, as I collected 100% accurate statistics, Oracle is definitely aware of this outlier value:
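The outlier shows up clearly in the column-level statistics (a sketch; HIGH_VALUE is stored as a RAW but decodes to the outlier value):

SELECT column_name, num_distinct, low_value, high_value
FROM   user_tab_cols
WHERE  table_name = 'BOWIE'
AND    column_name = 'ID3';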

Whereas previously the cardinality estimates were spot on, now they’re terrible (expecting not 1 row but 999,000 rows !!) because, without a histogram, Oracle simply assumes an even distribution of data between the column’s low and high point values, and the outlier has stretched that range enormously.
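A sketch of how one might observe this (the exact predicate is my assumption; the point is that a range stretching across the inflated low/high span gets costed as if it matched almost the entire table):

EXPLAIN PLAN FOR
SELECT * FROM bowie WHERE id3 BETWEEN 255 AND 1000000;

SELECT * FROM table(dbms_xplan.display);

-- Only the single outlier row actually qualifies, but linear interpolation
-- between LOW_VALUE and HIGH_VALUE produces a Rows estimate around 999,000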

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options😉

Hope you’re enjoying these little quizzes, I’ll have another one for you all soon.


Nice. That’s a fun problem to think through. I guess the only caveat we should mention in this discussion is that, because we are talking “AUTO”, Oracle could change this behavior in any future release.

You can see the MOD-based versions only have a histogram on ID3. The random-value column with the outlier value, ID5, has indeed been converted to a Height-Balanced histogram by the luck of the random skewness in my data.

I only use SIZE 1 and then collect histograms just in those cases where it’s proven necessary. It comes down to knowing your data, knowing your application, and perhaps having hit issues previously where a histogram has proven to be beneficial.

The vast majority of columns, I find, don’t need histograms, so I just collect them on those that do.
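In practice that approach looks something like this (a sketch; the column name is illustrative):

BEGIN
   -- SIZE 1 as the default, with a histogram explicitly requested
   -- only on the column(s) known to benefit
   dbms_stats.gather_table_stats(NULL, 'BOWIE',
      method_opt => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS id3 SIZE 254');
END;
/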

Now that the questions raised regarding the generation of histograms by the SIZE AUTO option have been addressed, I think it is worth pointing out that these “excess” frequency histograms that get generated for columns with a workload, a uniform distribution and a low number of distinct values can have (more or less) subtle side effects one should be aware of:

– In general these histograms will take up space in the corresponding SYS cluster (C_OBJ#_INTCOL#); for larger databases with many (sub)partitions this can become significant, as the histograms might be collected for every (sub)partition (see the sketch after this list for a quick way to gauge this)

– These histograms potentially allocate space in the dictionary cache

– For every histogram an additional, separate pass is required while gathering statistics, so this can have a significant impact on the time it takes to gather statistics

– Join cardinality estimates will use a completely different code path in the optimizer if both sides of the join have histograms on the expressions/columns used, and this can lead to different cardinality estimates

– For versions >= 9.2 and <= 10.2.0.3:
For equal comparisons with values existing in the frequency histogram you'll end up with the famous “mystery of halving”, so the cardinality estimates will be off by 50%. Not a big problem on its own, but combine several of them and you might have a real cardinality estimate problem

– For versions >= 10.2.0.4:
These versions address the above “mystery of halving” problem, but they introduce another potential issue: values that don’t exist in the frequency histogram are no longer estimated with a cardinality of 1 but with half the cardinality of the least popular value in the histogram (for example, if the least popular value covers 100 rows, a missing value is now estimated at 50 rows rather than 1). This can again lead to interesting scenarios, like this one.
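As a quick way to gauge the first point above, something like this counts the histogram buckets that have accumulated per table (a sketch; DBA_PART_HISTOGRAMS and DBA_SUBPART_HISTOGRAMS cover the (sub)partition level):

SELECT owner, table_name, count(*) AS buckets
FROM   dba_tab_histograms
GROUP  BY owner, table_name
ORDER  BY buckets DESC;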

So the basic idea behind this strategy seems to be good, since with a low number of distinct column values a frequency histogram gives the optimizer a perfect picture of the data distribution, but there are implementation details and side effects that suggest using SIZE 1 and only generating histograms where appropriate and beneficial can be the better approach, if one knows one’s data and application.

Thanks so very much for your comments, much appreciated. I couldn’t agree more.

In one of our large production systems we accidentally left the automatic stats job running when we first migrated from 9i to 10g. In one fell swoop, we went from about 10 histograms, which we collected manually, to several tens of thousands. The results were disastrous, resulting in row cache latching hell, CPU continuously at 100% with parsing overheads, and execution plans all over the place.