Statistics on Partitioned Tables - Part 2


In the last part, I asked you to trust me that true Global Stats are a good thing, so in this post I hope to show you why they are, and to make sure you don't kid yourself that you can avoid them. (Updated later - this is all on 10.2.0.4)

Why would you even want to avoid them? Global stats can take a lot of work to gather if you're working with very large objects, because Oracle has to visit all partitions. As an alternative, Oracle can aggregate lower-level statistics to generate simulated global statistics at the higher levels of the same object. In our case, as we INSERT data into new subpartitions or use partition exchange operations, we gather statistics at the SUBPARTITION level and allow the statistics to aggregate up to the PARTITION and TABLE levels. Here's how it looks ....

I'll delete the existing table stats and regather at the SUBPARTITION level.
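In sketch form (assuming the TEST_TAB1 table from the earlier parts; the parameter values shown are illustrative rather than my exact test script), that's:

-- Remove the existing statistics for the table at all levels
BEGIN
  DBMS_STATS.DELETE_TABLE_STATS(ownname => USER, tabname => 'TEST_TAB1');

  -- Regather at SUBPARTITION granularity only - no TABLE or PARTITION
  -- stats are gathered directly, so Oracle aggregates them instead
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => USER,
    tabname     => 'TEST_TAB1',
    granularity => 'SUBPARTITION');
END;
/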

Note that, because I haven't specified a subpartition name, all of the subpartitions will need to be visited to gather these statistics, which will result in the following Table, Partition and Subpartition stats.

I can see that the table statistics are not global stats, but the number of rows looks right. These stats are actually aggregated statistics that Oracle has populated, based on the data found in the subpartitions. Let's look at the partitions.
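For anyone following along, a query like this against USER_TAB_STATISTICS (my own illustrative version, not the exact script I used) shows the stats at all three levels, and GLOBAL_STATS = 'NO' is the tell-tale sign that the table and partition figures were aggregated rather than gathered:

-- Show stats at TABLE, PARTITION and SUBPARTITION level in one pass;
-- GLOBAL_STATS = 'NO' means the figures were aggregated, not gathered
SELECT object_type, partition_name, subpartition_name,
       num_rows, global_stats
FROM   user_tab_statistics
WHERE  table_name = 'TEST_TAB1'
ORDER  BY object_type, partition_name, subpartition_name;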

So this looks pretty good, doesn't it? We've gathered 'Global' statistics at the subpartition level and yet the stats at the table and partition levels look accurate too. Why would we want to use this approach? Well, to use my current system as an example, it's a near-real-time data warehouse which creates tens of thousands of subpartitions per day, most of them over a period of a few hours. If we were to re-gather global statistics at the table and partition levels, there would be a substantial associated stats-gathering workload, and the system is under enough strain as it is. Therefore, if we can just gather stats at the subpartition level for the new subpartitions and have Oracle aggregate them to generate derived Table and Partition stats at the same time, so much the better. To simulate that, I'll insert some more data and see if the stats still look accurate after regathering at the SUBPARTITION level.
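Something like this, in sketch form (the INSERT itself is omitted, and SP_NEW is a hypothetical name for the subpartition the new rows land in):

-- After loading the new rows, gather stats for just the one new
-- subpartition; Oracle will re-aggregate the higher levels
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => USER,
    tabname     => 'TEST_TAB1',
    partname    => 'SP_NEW',          -- hypothetical subpartition name
    granularity => 'SUBPARTITION');
END;
/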

Everything's still looking very good in this case, so it looks like a great strategy - low collection overhead and accurate statistics. That is, until you start drilling down to the column-level statistics using Greg Rahn's query and identify some horrible problems. These are the statistics at the table level.
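Greg Rahn's query is the one to use for the full picture, but even a cut-down look at the dictionary (this is my own simplified version, not his query) is enough to show the numbers I'm about to describe:

-- Simplified column-level statistics check; note that LOW_VALUE and
-- HIGH_VALUE are stored in Oracle's internal RAW format
SELECT column_name, num_distinct, low_value, high_value, num_nulls
FROM   user_tab_col_statistics
WHERE  table_name = 'TEST_TAB1'
ORDER  BY column_name;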

At first glance they look pretty good, too. To give you a specific example, the STATUS column does have the correct High Value of 'U', which has just appeared in the last set of rows that were inserted. Based on what I've seen to date, Oracle does accurately update the High/Low column values and row counts when generating aggregated stats, but there's a problem here. According to the column statistics, there are 4 distinct STATUSes in the table, but that's not true - there are only 2.

SQL> select distinct STATUS from test_tab1;
S
-
U
P
SQL>

If there's a problem with such a small number of rows and only two distinct values, the chances of the Number of Distinct Values calculated during stats aggregation being accurate look pretty slim and, when you consider what a key input to cost-based calculations those values are ....

Why aren't the values accurate? Well let's compare High/Low values to Number of Distinct Values (NDV).

When we gathered statistics on the new subpartition, we had access to the previous High/Low values at the table level. Here are the column statistics before stats were gathered on the new subpartition.

So at this stage, there is one distinct value of status, which is P. When we gathered stats on the new subpartition, Oracle could see all of the STATUS values for the rows in that subpartition and noticed STATUS='U' on one of the rows and could work out very easily that it's higher than 'P', so updated the High Value accordingly as per the example shown earlier.

Now, what to do about the NDV? Remember, Oracle can't look at any of the data in Partitions or Subpartitions other than the one we're gathering stats on (that's the point - to reduce the overhead). So it has to decide what the new NDV should be based on several inputs:

1) The actual values in STATUS for the rows in the subpartition we can look at.
2) The previously-gathered and stored NDV for the other subpartitions.
3) The previously-aggregated NDV stored at the table and partition levels.

The problem is that Oracle knows the number of distinct values in the other subpartitions (we looked at the data previously to calculate them) but not the values themselves and, without that information, how can it say whether the 2 distinct values (P and U) in this subpartition are distinct from the values in the other subpartitions? Actually, in this case, we might expect Oracle to do something clever and realise that, as there was only one distinct value of P prior to our new subpartition and the only values in the new subpartition are P and U, there can only be 2 distinct values. Sadly, it just doesn't work that way!
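Given only the stored per-subpartition NDVs, anything between the largest individual NDV and the sum of all of them is possible, and Oracle has no way of telling where the truth lies. You can see exactly what it has to work with by comparing the levels - a sketch using the standard dictionary views:

-- Compare the per-subpartition NDVs (the raw inputs) with the
-- aggregated table-level NDV for the STATUS column
SELECT 'SUBPARTITION' AS stats_level,
       subpartition_name,
       num_distinct
FROM   user_subpart_col_statistics
WHERE  table_name  = 'TEST_TAB1'
AND    column_name = 'STATUS'
UNION ALL
SELECT 'TABLE',
       CAST(NULL AS VARCHAR2(30)),
       num_distinct
FROM   user_tab_col_statistics
WHERE  table_name  = 'TEST_TAB1'
AND    column_name = 'STATUS';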

Although relying on lower-level statistics being aggregated up to higher levels might initially seem like a neat trick, it's going to lead to some pretty strange statistics, at least in 10g, which is why Oracle recommend you gather Global Stats at the TABLE and PARTITION levels.

I wish I could say that was the only problem with this aggregation process, but there's more to come in the next post ...


Comments

That's an excellent question, but one I'm not going to answer at this point, for a number of reasons:

1) At the start of the series, I very deliberately said:

"I'll steer clear of any remotely advanced angle"

2) Although I think you're absolutely right to say that the NDV would make more sense as 3 instead of 4, I would argue it makes even more sense for it to be 2. Oracle does have enough information to derive 2 and yet it still doesn't, at least not in 10.2.0.4.

3) I might write a blog post specifically about how it derives the NDVs at a later point, or let someone else leave a comment to answer your question here, but I personally think it's better to know that the NDV aggregation process doesn't work properly and should be avoided.

If Oracle doesn't get it right with such simple examples, imagine the potential horrors when you scale it up to more realistic situations.

Thanks Doug,
Excellent answer.
I agree with you that it's a risk to completely believe in the NDV aggregation process.
And I'm sad to say that I've tested this process in 11gR1 and 11gR2, and it's still the same issue.
I noticed that the 11gR2 documentation mentions a new value for GRANULARITY, 'APPROX_GLOBAL AND PARTITION', which looks like some improvement to the NDV aggregation, but I haven't had time to test it yet.
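For anyone who wants to test it, the call would look something like this (a sketch only - as the commenter says, it's untested here, and P_NEW is a hypothetical partition name):

-- 11g sketch: gather real stats for one partition and have Oracle
-- derive approximate global stats (including NDV) from the
-- partition-level stats, without re-scanning the whole table
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => USER,
    tabname     => 'TEST_TAB1',
    partname    => 'P_NEW',
    granularity => 'APPROX_GLOBAL AND PARTITION');
END;
/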

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.