June 14, 2013

Webinar questions

Following the webinars about 11g stats that I presented on Monday, John Goodhue emailed me a few questions that had come through the chat line while I was speaking but hadn’t appeared on my screen. Here are the questions and my answers.

1. I’d like to know what parameter to use for faster results on dbms_stats.gather_dictionary_stats

I don’t think I’ve come across any complaints about gathering dictionary stats being particularly slow, and you haven’t mentioned a specific version so I can’t comment on possible bugs. If you can, you first need to find out where most of the time is going, so tracing the call would be a good way to start. If you can’t trace it you could look at the “standard” types of problem: AWR history tables getting very large; optimizer stats history tables (particularly the histogram one) getting very large; synopsis tables getting very large. Histograms can be very expensive to collect, especially on partitioned tables, and many of the 11g dictionary tables are partitioned – so you might want to change the method_opt for the collection to “for all columns size 1”. If the issue is elapsed time rather than total workload, it is also possible to get Oracle to dispatch multiple jobs to collect stats concurrently – check the CONCURRENT parameter of dbms_stats.set_global_prefs().

2. Do incremental statistics also work with sub-partitions?

No. Even when you have a composite partitioned table, Oracle only uses the approximate NDV method to create synopses at the partition level. Given that a synopsis is up to 16,384 rows for each column for each partition, the volume of synopsis information that could be generated by extending the mechanism down to the subpartition level could be catastrophic.
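To put a (purely hypothetical) number on that, assume a table of 100 partitions with 64 subpartitions each and 50 columns, and the worst-case synopsis of 16,384 rows per column per segment – the table shape is invented, only the 16,384 figure comes from the talk:

```python
# Worst-case synopsis volume for a hypothetical composite-partitioned
# table: a synopsis holds up to 16,384 hash values per column per segment.
MAX_SYNOPSIS_ROWS = 16_384

def synopsis_rows(segments, columns):
    """Worst-case synopsis rows for `segments` stats-holding segments."""
    return segments * columns * MAX_SYNOPSIS_ROWS

# 100 partitions, 50 columns, synopses kept at the partition level only:
partition_level = synopsis_rows(100, 50)          # 81,920,000 rows
# The same table with 64 subpartitions per partition, if synopses were
# kept at the subpartition level instead:
subpartition_level = synopsis_rows(100 * 64, 50)  # 5,242,880,000 rows
print(partition_level, subpartition_level)
```

A factor of 64 on a number that is already in the tens of millions is the sort of arithmetic that makes subpartition-level synopses unattractive.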

3. I have always benefited from block_sample=>TRUE, but I do not see you using it – is there any reason not to use it (while bearing in mind, of course, that with heap tables the accuracy may suffer a bit as the price of faster stats gathering)?

The only good reason not to use it is the reason you’ve given – there are patterns of skewed data that could make a block sample even less accurate on the number of distinct values than a row sample. The particular benefit of the block sample (for “safe” distributions) is that even a small percentage row sample may require Oracle to read every block of the table as it picks pseudo-random rows – perhaps even doing “db file scattered read” requests to scan the whole table and discarding lots of blocks (one of the reasons behind the statistic “prefetched blocks aged out before use”) because it didn’t need rows from them. With the 11g approximate NDV, of course, the question is moot since the new mechanism examines every row in the table.
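The NDV risk with clustered data is easy to demonstrate with a toy simulation – this is just an illustration of the sampling arithmetic, nothing to do with Oracle internals, and the table shape is invented:

```python
import random

random.seed(42)

BLOCKS, ROWS_PER_BLOCK, SAMPLE_PCT = 1000, 100, 0.01

# Perfectly clustered data: each block holds 100 copies of one value,
# so the table has 1,000 distinct values in 100,000 rows.
table = [[b] * ROWS_PER_BLOCK for b in range(BLOCKS)]

def row_sample_ndv(table, pct):
    """Distinct values seen by a pseudo-random sample of individual rows."""
    rows = [r for blk in table for r in blk]
    return len(set(random.sample(rows, int(len(rows) * pct))))

def block_sample_ndv(table, pct):
    """Distinct values seen by a pseudo-random sample of whole blocks."""
    sampled = random.sample(table, int(len(table) * pct))
    return len({r for blk in sampled for r in blk})

# A 1% row sample picks ~1,000 rows scattered across most of the blocks
# and typically sees several hundred of the 1,000 values; a 1% block
# sample reads only 10 blocks, so it can never see more than 10 values.
print(row_sample_ndv(table, SAMPLE_PCT), block_sample_ndv(table, SAMPLE_PCT))
```

The row sample pays for its accuracy by touching most of the blocks, which is exactly the I/O cost that block sampling avoids.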

4. Is this #rows * #columns sort for count distinct also happening for any distinct? (since SELECT DISTINCT is essentially a group by over every selected column)

You’re referring to my comment that “select count(distinct n1), count(distinct n2), count(distinct n3) from table;” seems to turn each row into N rows (where N = 3 in this case) of (column id, column value) and then sort the entire set in one operation rather than N concurrent operations. I don’t think you can generalise this; you’d have to check each SQL pattern separately – initially just by checking the execution plan. For example, something like “select distinct n1, n2, n3 from table” is finding distinct combinations, and it will do a hash unique on the number of (non-null) rows in the table irrespective of the number of columns selected.
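A toy model may make the difference between the two shapes clearer – this is just Python standing in for the conceptual rowsources, not a claim about Oracle’s implementation, and the data is invented:

```python
# Toy rowsource: four rows of three columns (n1, n2, n3) - invented data.
rows = [(1, 10, 'a'), (1, 20, 'a'), (2, 30, 'b'), (2, 20, 'a')]

# Shape 1: count(distinct n1), count(distinct n2), count(distinct n3).
# Each row fans out into one (column_id, value) pair per column, and a
# single sort/de-duplication of the fanned-out set answers all three
# counts at once - hence the #rows * #columns sort volume.
fanned = sorted((col_id, row[col_id]) for row in rows for col_id in range(3))
counts = [len({v for c, v in fanned if c == col_id}) for col_id in range(3)]
# counts == [2, 3, 2]

# Shape 2: select distinct n1, n2, n3.
# One de-duplication of whole rows (a "hash unique"), so the work is
# governed by the number of rows, not rows * columns.
combinations = set(rows)
# len(combinations) == 4
```

Same table, two very different volumes of data through the sort/hash operation – which is why the execution plan has to be checked pattern by pattern.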

5. What is your view on using table preferences to explicitly specify required histograms?

I made a brief statement about collecting histograms only for your explicitly chosen columns, following up with the suggestion that your default action should be “for all columns size 1” followed by explicit code to generate histograms for specific tables. I would be perfectly happy to set the global prefs to “for all columns size 1” and then set table prefs to something specific for the columns I really wanted; the only reason I don’t put this forward as standard practice is that it conceals what’s going on, and it’s easy to forget that some tables have preferences set. If your organisation is really good about documenting its systems then it makes perfect sense. Having said that, though, don’t forget that I also pointed out that sometimes you may still need to create “fake” histograms to get the best possible results.

6. (paraphrased) I think Jonathan should have acknowledged that his presentation was based on the work of Amit Poddar.

I agree. This was an error of omission that I shouldn’t have committed. It has been several years since Amit first described how Oracle uses a hash table for the approximate NDV and keeps halving it when the number of hash values reaches 16,384. When he closed his website he allowed me to publish his original presentation and white paper (64 pages) on my blog, so I really should have remembered that the algorithm is still so little known that his work merits explicit attribution. As far as I am aware it is still the most thorough and detailed description of how the algorithm works, although Oracle has extended the mechanism in 12c (as explained in the presentation I did about histograms for OOW 2012).
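For anyone who hasn’t read Amit’s paper, the core of the algorithm can be sketched in a few lines. This is my own toy reconstruction, not Oracle’s code – sha-256 merely stands in for whatever internal hash function Oracle uses, and 16,384 is the table size mentioned above:

```python
import hashlib

def approximate_ndv(values, max_slots=16_384):
    """One-pass estimate of the number of distinct values (toy sketch)."""
    table = set()
    level = 0  # number of times the table has been halved
    for v in values:
        # 64-bit hash of the value (sha-256 is a stand-in, not Oracle's hash)
        h = int.from_bytes(hashlib.sha256(repr(v).encode()).digest()[:8], 'big')
        # at level L only hashes whose low L bits are all zero are kept
        if h & ((1 << level) - 1):
            continue
        table.add(h)
        if len(table) >= max_slots:
            # table is full: raise the level and discard hashes that
            # fail the new, stricter filter - i.e. halve the table
            level += 1
            table = {x for x in table if not (x & ((1 << level) - 1))}
    # each surviving hash represents roughly 2**level distinct values
    return len(table) << level
```

If the number of distinct values never fills the table the count is (hash collisions aside) exact; once halving starts, each surviving hash stands for 2^level values and the scaled count becomes a statistical estimate – which is why the accuracy stays good however large the table gets, with a fixed memory budget and no sort.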

Hi,
I do not have access to Oracle 12c, and I also do not have access to your OpenWorld presentation. Is the new histogram collection enhancement you are talking about the incremental maintenance of histograms without rescanning the entire table?

As it suggests (though there was more information in the voice-over) for frequency and the new “top-N” histograms the hash table seems to be holding a count and a rowid for a representative value – which means the histogram can be generated in one pass, with access by rowid back to the database afterwards to get the actual values.

You will probably appreciate that there are a couple of interesting issues relating to handling hash collisions, and top-N histograms when there are more than 16,384 distinct values in the table.
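Ignoring the collision handling just mentioned, the one-pass idea can be sketched as follows – this is my reconstruction from the slides, and the function and data are invented for illustration:

```python
def one_pass_frequency(scan):
    """scan: list of (rowid, value) pairs, as if produced by a full scan."""
    buckets = {}  # hash of value -> [running count, representative rowid]
    for rowid, value in scan:
        h = hash(value)          # stand-in for the internal hash function
        if h in buckets:
            buckets[h][0] += 1
        else:
            buckets[h] = [1, rowid]
    # After the scan: fetch the real value for each representative rowid
    # (dict(scan) stands in for "access by rowid back to the table").
    by_rowid = dict(scan)
    return {by_rowid[rowid]: count for count, rowid in buckets.values()}

data = [(1, 'red'), (2, 'blue'), (3, 'red'), (4, 'red'), (5, 'green')]
# one_pass_frequency(data) == {'red': 3, 'blue': 1, 'green': 1}
```

The point of keeping a rowid rather than the value itself is that the hash table stays small and fixed-width during the scan; the handful of lookups by rowid afterwards is cheap compared with a second scan of the table.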

“As it suggests (though there was more information in the voice-over) for frequency and the new “top-N” histograms the hash table seems to be holding a count and a rowid for a representative value – which means the histogram can be generated in one pass, with access by rowid back to the database afterwards to get the actual values.”

Jonathan,

That does not look like a new idea; this idea, and a more elaborate scheme of keeping a sample of records for each hash collision, was mentioned in the original paper by Gibbons (Bell Labs). Gibbons also discussed using the algorithm to answer many DSS-type questions where an exact NDV was not required. Please see section 3.2 in

For the top-N frequency histogram, for the values that are put into one bucket, does Oracle also keep track of how many distinct values went into that bucket and divide the bucket’s row count by this number to estimate the cardinality of each value? It’s not evident in the presentation.