Decoding high_value and low_value January 3, 2010

The table DBA_TAB_COLUMNS holds the LOW_VALUE and HIGH_VALUE for columns. This information is potentially very useful to us, especially as when the CBO looks at a WHERE clause on a column for a given value, the High and Low values for that column are considered. If the value in the WHERE clause is outside the known range, the expected number of values to be found is decreased in proportion to the distance outside the range.

What is not so helpful is that Oracle stores, and displays, the information in an internal raw format. Which is utterly unhelpful to us of course. Why can’t they at least expose the information in a view in a human-readable format? Tsch.

So here is some code on how to decode low_value and high_value. I pulled most of this together a few years back but always struggled to decode the low and high values for dates, until I found this post by Gary Myers.

Most of the translation is done via the utl_raw.cast_to_datatype functions but no such function is provided for dates, which is where most of us come unstuck. Gary recognised that the data was being stored in hex format, with an offset for the year.

I hope the script helps some of you.
{There are also functions under DBMS_STATS, DBMS_STATS.CONVERT_RAW_VALUES, that can also be called to translate many datatypes}

If anyone knows how to decode timestamps, I’d love to know as it would save me trying to work it out/find it on the Net somewhere. I’ll add it to the post so that there is one place to find all translatations.

Thanks Greg.
The linked page also includes creating a PL/SQL function to act as a front end to the DBMS_STATS functions, to make using them “cleaner”, so anyone planning to create such a function would do well to check it out.

Martin, i think, your formula for date datatype doesn’t reflect how it is stored, it yields correct results only per incident for some dates. The raw representation of date is:
1st byte: century + 100
2nd byte: year + 100
So, for year 2010, the first two bytes are 78,6e (hex) or 120,110 (decimal), that means 20 (120 – 100) century and 10 (110 – 100) year, or 2010, according your formula – 1780 + 120 + 110 = 2010 ( the same result). But for 1990 raw representation is 119,190 and your formula would calculate 1780 + 119 + 190 = 2089.
By timestamp first 7 bytes are the same as for date and the rest 4 bytes represent fractional seconds (it is covered in Tom Kyte’s Expert Oracle Database Architecture).

Gary, Martin, sorry, my bad – i missed that in hurry ( as usual) and put more attention on the code, than on the text, but thank you for correction anyway.
Btw., you can get the rid of all that ltrim’s, if you use to_char(…,’fm00′) – in that case the blank reserved for sign (by positive numbers) will be stripped anyway.

SQL> @get_column_low_hi
Enter value for tab_own: TEST
old 40: where owner like upper(‘&tab_own’)
new 40: where owner like upper(‘TEST’)
Enter value for tab_name: T1
old 41: and table_name like upper(nvl(‘&tab_name’,’WHOOPS’)||’%’)
new 41: and table_name like upper(nvl(‘T1′,’WHOOPS’)||’%’)

SQL>
SQL> /
Enter value for tab_own: TEST
old 40: where owner like upper(‘&tab_own’)
new 40: where owner like upper(‘TEST’)
Enter value for tab_name: X
old 41: and table_name like upper(nvl(‘&tab_name’,’WHOOPS’)||’%’)
new 41: and table_name like upper(nvl(‘X’,’WHOOPS’)||’%’)

It looks to me as though it is working – at least, the raw values are correctly being interpreted back into numbers and characters.

Try adding in the columns for the LOW_VALUE and HIGH_VALUE without modifiying them to the script. I think you will see that numerics are represented by a hex sting starting “C” (eg C102 = 1) whereas varchar2 values are represented by a hex number with twice the digits of the string it represents (unless you have multi-byte character sets).

I think what you are seeing is the impact of the sampling done when the table is analyzed. It will not necessarily get the real high and low values, depending on sample size, and the high and low values could well have changed since the last time the table was analyzed. This is the main reason why I want to be able to see the true High and Low values, so I can get a feel for if the values are significantly different to reality.

One of the main failings of doing a small sample size gather is that you can end up with inaccurate HIGH and LOW values and also, possibly more significantly, lower NUM_DISTINCT value. I have no proof for this, but I suspect something changed from V9 /V10.1 and V10.2 which makes the NUM_DISTINCT more vulnerable to being incorrect with small sample sizes as I have had recent issues on 10.2.0.3 with 0.5% sample sizes on very, very large tables (hundreds of millions of rows) getting very low NUM_DISTINCT value that I cannot remember being an issue with even 0.1% sample sizes against similar sized table on V9 and V10.1

Why is it a problem? Well, I think I will put up a whole post on this but in short the CBO guesses how many rows will match a value outside the min-max range by how far out the range it is. As an example, if the CBO sees a LOW_VALUE of 100 and a HIGH_VALUE of 1100, there are 1000 NUM_DISTINCT values and there are 5000 rows, it will reckon a value in the range will match 0.1% of the num rows, ie 5 rows. If it sees a value outside the range, let us say 50% outside the range at 1600, it will estimate 50% of that value, ie 3 values (rounded up – and I might be missing a tweak of 1 row as well). I know Mr Lewis covers this well in his book and I am sure there are a few good blog entries out there about it too. I’d want to do some tests and proofs to validate this before I do a proper blog though.