September 20, 2010

Frequency Histograms 2

I find it convenient occasionally to “translate” a frequency histogram into a report of the underlying data (sometimes to see how closely the histogram matches the real data). To demonstrate the type of query I use I’ve created a data set with a small number of distinct values and generated a frequency histogram on the data set. This is what the data and histogram look like:

If you look at the histogram data you’ll see that the “endpoint_value” column holds a list of the values that appear in column n1, and the “endpoint_number” is the ‘cumulative frequency’ for the appearances of the endpoint_value – the number 10 appears 4 times, the number 20 appears 7 – 4 = 3 time, the value 30 appears 13 – 7 = 6 times, and so on.

Once we’ve recognised this pattern we can write some SQL to convert the histogram back to the list of values it represents by using the analytic lag() function. I’ll demonstrate this in two steps – first generating an intermediate result to show the effect of the lag() function.

This is the code for numeric columns; for data columns you simply need to select to_date(to_char(endpoint_value,’J’)) rather than just endpoint_value. (Edit: this works for date-only values, but if you stored date and time then see the correction in comments 3 and 4 and below). Character strings, though are more of a problem in the general case – and I may talk about those in another blog.

Note, by the way, that the mechanism that Oracle uses to generate the histogram data means that the results appear in the same order whether you sort by endpoint_value or endpoint_number. (Unless – as can happen with character columns, a point I initially forget – the same endpoint_value appears in two rows: so I’ve edited the original code where I had used endpoint_value to use endpoint_number)

Related

Mr. John
End of day, I hope you are insisting on this, with the above comparison between real and histogram if we find any vast difference, then we should perform collecting histograms post collecting the regular stats.

Rather more importantly – if this code allows you to see that the histogram doesn’t represent real life then you have a clue that you may need to write a program to generate an “artificial” histogram and stop Oracle from collecting one.

This was my error (and a common error in the Oracle world) – I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated ‘date-only’ values, but the column (and histogram) could hold ‘date and time’ values.

In the case of ‘date and time’, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75

This is why your format model ran into error 1481 – it didn’t match the input value. The following expression should work:

This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a ‘.’ in the middle, and then uses the ‘Julian’ and ‘seconds’ conversion format.

Strange – are you sure you’re not seeing the effects of the default format for dates on your system ? Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.

According to the documentation:
“All number format models cause the number to be rounded to the specified number of significant digits.”

I wonder what settings effect this rounding or is it an outright bug. I checked Metalink and could find nothing relevant. For what it’s worth, when I do a “show” of my NLS parameters, the only ones set are:
nls_language AMERICAN
nls_length_semantics BYTE
nls_nchar_conv_excp FALSE
nls_territory AMERICA

I don’t know if it’s okay to ask a question that’s rather “peripherally related” here, but this being one of the more “basic” ones among your posts related to frequency histograms (and as I don’t find any related documentation on the net) I’ll just try :-;
Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, in the literature (also in Oracle’s Performing Tuning Guide) it always seems to say that the distinctive feature of a frequency histogram is that there’s one bucket per distinct value…
Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps being too infrequent?)

It’s a good question, as it’s in the right technical area for this blog, suitably general in content, and one that’s commonly asked and addressed on the web.

You didn’t give me the details of a specific example – but I’d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.

A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values. From time to time the stats collection would report that there were only two distinct values in the table – and it rarely managed to report every single value. It’s cases like this that you might want to write a program to create and fix some representative stats.

” … and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values … ”
“… the stats collection would report that there were only two distinct values in the table …” –

what puzzles me here was that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked “on purpose” to me…).

Right now I’ve had the possibility to get the “real counts” at the same time as performing the statistics gathering on a clone of the production database, and picking a column with a substantial difference between num_buckets and num_distinct I get e.g.:

sqlplus> select mycol, count(1) from myschema.mytable group by mycol order by mycol;

sqlplus> select count(*) from myschema.mytable where mycol is not null;

COUNT(*)
———-
83182

I unfortunately don’t have the time to pursue this further with the other columns now, but I wonder might there be some algorithm like “if the count of a value is less than e.g. total_count / sample_size (or some proportion of this), don’t build a bucket for it” – here for example, the “most frequent” value that got no bucket has count 28, and the ratio total_count / sample_size is 27.68… (of course this is not really much data for a guess yet :-;)