Column 1 is effectively unique, so it has 1M distinct values. Let’s say 1 – 1000000, not that it really matters.

Column 2 has 254 distinct values, all evenly distributed so it has approximately the same number of rows for each value. Let’s say the values are 0-253 again it doesn’t really matter.

Column 3 is identical to Column 2, it also has 254 distinct values, all evenly distributed as well such that it also has approximately the same number of rows for each value. Let’s say the range of values are the same, 0-253, again it doesn’t really matter.

You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42).

You then insert just one row that has the following values based on our example: VALUES (1000001, 42, 99999999).

The key points here is that for Column1, it’s just another unique value, just 1 greater than the previous maximum value. Nothing special.

For Column2, it’s just another of the existing 254 values that doesn’t really change the even distribution of the data. Nothing special.

However, for Column 3, it’s not only a new value that didn’t previously exist (and so there’s just the one row with this value in the data, whereas all the other values correspond to roughly 1/254 of the rows) but it’s also a value that is way way way outside the normal range of existing values (nothing comes close to having a value of 99999999).

OK, we have the scenario, hopefully you can see where I going with this.

You decide to collect fresh statistics with DBMS_STATS, you want them to be completely accurate so you use a 100% sample size (or compute with estimate_percent=>null). But because you want to get with the AUTO program and make life easier for yourself, you decide to let Oracle work out which columns might require and benefit from a histogram by using METHOD_OPT=>’ FOR ALL COLUMNS SIZE AUTO’.

Now finally comes the question. Of the three columns, only one column will have a histogram. Which one and why is it so ?

If you understand how Oracle collects statistics, the answer will hopefully be obvious😉

I thought that http://orawin.info/blog/2011/08/31/histograms/ was a little long for a comment! I had a little difficulty interpreting “You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42)” because the example doesn’t match my understanding of the words; so I tested both.

It appears to me that in 11.2.0.2 the 3rd column switches from a frequency to a height balanced histogram (sort of the answer I expected, but not quite, and not what you implied by the question), whereas in 10.2.0.4 it didn’t.

Of course I’ve probably misunderstood the question (hence the use of a test case so people can quibble with it and play around and so on). If running on 9i then the purge will need to go from drop table and if you have a table called afp then read the script carefully!!

Charles Hooper pointed out that I’d missed the requirement to calculate statistics (estimate_percent => null). When I repeat my test with that condition catered, I am surprised by the results – but it’s beer o’clock now so further thoughts may be delayed and even more incoherent.

If I remember correctly, a histogram on the third column would have to be a height balanced histogram, because there can be a maximum of 254 buckets into which the column values could be associated. A histogram on the second column could be a frequency based histogram (a bit more useful to the optimizer) because there are still 254 or fewer distinct values.

What is interesting is that I see three different results:
If no select statements are performed against the table, there are no histograms generated:

I am confused why the DENISITY value for column C2 changed from 0.003937008 to 0.0000005 when queries have been executed against the table? This was the case for the table populated with MOD and the table populated with DBMS_RANDOM.VALUE.

Because oracle uses column usage (columns in where clauses) when deciding which columns are candidates for generating histograms on. This means that if you gathered stats last week, this week someone ran a new query, and you have method_opt set as above you’ll quite possibly get new stats (and cardinality estimates) for your queries. If you are coming to UKOUG I cover some of this in the graveyard slot on Wed pm (oracle calculates an expected cardinality for this talk as 1🙂 ).

strange results – and I got them too on 11.2.0.1. Maybe DBMS_RANDOM brings a little bit more skew into the data than MOD?

Reading in Wolfgang Breitling’s Paper “Histograms – Myths and Facts” (http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.pdf) I see that “size auto” is the same as “size skewonly” for columns used in predicates. There’s also an explanation for the strange density .0000005 (= 1/(num_rows * 2) for frequency histograms).

In my tests with MOD there are histograms on both columns when I create the table, issue a query and gather statistics. After adding the additional row with VALUES (1000001, 42, 99999999) there remains only one histogram for the second column. With Event 10046 I see queries like:

And it seems that the decision to create the histogram is somehow based on BKTNDV and UNQREP that are 2 and 1 for the last ntile-row (with the maxvalue: Typ=2 Len=5: c4,64,64,64,64 aka 99999999). But I have no idea what that means…

Now, since id3 has 255 distinct values, if oracle creates histogram, it ll be a HBH, therefore, it ll not know, how many values ‘d be there in last bucket..(as it ll divide values from 0-1000001 into 254 buckets.) That is why it doesnt create a histogram.

IMHO, Column2 will be the only column having a histogram. Although you said there was no trick, I believe that there’s a trick here.

Maximum number of histogram buckets is 254. When we insert an extreme number to Column3, Oracle will need to put this new distinct value to a new bucket, but it won’t be able to add 255th bucket. So it will try to recalculate the buckets by redistributing all distinct values. This extreme number will force (mislead) Oracle to distribute all numbers in one bucket.

If you have less than 254 distinct values (or less than 254 buckets), adding this extreme number will not prevent creating histogram for Column3.

I seem to have different results to the above comments. My initial thought was that only column 2 would contain a histogram as all that had happened was the data distribution had changed slightly by the insert.
However, I then wondered why column 3 would not just change to a height balanced histogram.

So I decided to test the theory. I am using Oracle 11G R2 on a Windows 32 bit XP laptop. The results are pasted below:

The big surprise was an initial height balanced histogram on ID1 and the second surprise was a conversion of a frequency histogram on ID3 to a height balanced histogram which seems to go against the flow from previous posts on this question.

I’m not sure I’d necessarily describe it as an error, just an illustration that the behaviour of DBMS_STATS is *very* dependent upon the exact parameters passed (or set in the environment) as well as underlying data distribution.

I now agree with the previous posts, There are initially two frequency histograms, one on ID2 and one on ID3. After the insert the histogram on ID3 is lost. However, I cannot explain why the frequency histogram on ID3 is not replaced by a height balanced histogram. Especially considering my flawed previous test did show a conversion from frequency to height balanced.

Yes, the key bit you missed in your example is having to collect 100% accurate stats. Without that, Oracle may miss the outlier value, not realise it has more than 254 distinct values and create a histogram regardless.

Note BTW that you’re using a random number to generate your values which explains why you’re getting a height based histogram after the 255th value as it likely is skewed enough still for stats collection to warrant a histogram.

How did you populate the other data ? I think you find you have natural skewness in your existing data that has resulted in the histogram and so has nothing to do with adding the additional row. Adding a new row with data that is perfectly evenly distributed shouldn’t result in a height-based histogram.

I’ve tried to simulate your example (in various 10/11g databases) and could only do so with data that is already somewhat skewed (such as using dbms_random to generate your data).

IMHO, I don’t think we really can say your data in skewed. I don’t believe the data is really skewed in the examples that use the dbms_random function either. For most practical purposes, the average density would be close enough and sufficient.

Unfortunately, that’s not the way stats collection sees it and it will generally collect way way way more histograms than is actually necessary (and for the possible issues with this, see Randolf’s excellent comments). And yet still miss cases where it might be useful such as my outlier example.

So I just wanted to point out that is was the existing data and not the new 260 value that caused your histogram.

The reason you’re not getting the expected endpoint_number values is because you have only collected sample stats, not full compute stats. Add the estimate_percent parameter set to null and the numbers should be more accurate.

The endpoint_values are numeric representation of the actual values. Rather than going through it all, the following blog post by Martin Widlake does an excellent job of explaining it all: