Sunday, January 13, 2013

HAVING Cardinality

When performing aggregate GROUP BY operations, an additional filter on the aggregates can be applied using the HAVING clause.
Usually aggregates are one of the last steps executed before the final result set is returned to the client.
However, there are various reasons why a GROUP BY operation might end up somewhere in the middle of the execution plan: for example, it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), or in more recent releases (11g+) the optimizer might have decided to use the GROUP BY PLACEMENT transformation, which can deliberately move the GROUP BY operation to a different execution step of the plan.
In such cases, when the GROUP BY operation serves as input to some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as they will influence related decisions like join orders and join methods, or simply the choice between an index-based access and a full table scan.
While the optimizer, based on the statistics, can come up with a reasonable estimate of the cardinality of the GROUP BY expression (the emphasis here is on *can*; it might also be wrong), it is important to understand that an additional filter on the aggregates using the HAVING clause is in principle treated like an "unknown" expression, and therefore the estimates are based on built-in default selectivities that might not have much to do with the actual selectivity of that HAVING expression.
Here is a simple example to demonstrate the point:

There is a table of 100K rows with two date columns that each have 100 distinct values (we can ignore here that the generated values are actually correlated); the ID column is unique.
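The setup script isn't included here; a minimal sketch matching that description might look like this (table and column names are my own assumptions):

```sql
-- Hypothetical test table: 100K rows, unique ID, two date columns
-- with 100 distinct values each (the two dates end up correlated)
create table t
as
select
        rownum                               as id
      , date '2013-01-01' + mod(rownum, 100) as start_date
      , date '2013-01-01' + mod(rownum, 100) as end_date
from
        dual
connect by
        level <= 100000;

exec dbms_stats.gather_table_stats(null, 't')
```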
If I ask for an estimate for queries similar to the following on this table:
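The original statements aren't shown; along the lines of the description they could be something like the following (assuming a test table T(ID, START_DATE, END_DATE) with the properties described above):

```sql
-- 1. Plain aggregate: ID is unique, so the GROUP BY doesn't reduce
--    the cardinality and an estimate of 100K is spot on
select
        id
      , min(start_date) as min_date
      , max(end_date)   as max_date
from
        t
group by
        id;

-- 2. Same aggregate with a HAVING clause spanning the whole available
--    date range: it actually filters nothing, but the optimizer assumes
--    5% per unknown range predicate => 100K * 0.05 * 0.05 = 250
select
        id
      , min(start_date) as min_date
      , max(end_date)   as max_date
from
        t
group by
        id
having
        min(start_date) >= date '2013-01-01'
and     max(end_date)   <= date '2013-12-31';
```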

Note how the optimizer gets the cardinality right for the first statement. The aggregation doesn't reduce the cardinality due to the uniqueness of the ID column.
Notice how the filter predicates on the aggregates in the second case actually do not filter anything at all, as they cover the whole available date range. The optimizer, however, simply assumes 5% selectivity per unknown range predicate, resulting in 0.05 * 0.05 = 0.0025 selectivity.
I don't think that it is a bug - it simply cannot assume anything regarding the possible MIN and MAX values given any potential arbitrary other filters.
If I now wrap the aggregate query as a view into a more complex statement, it becomes obvious that the HAVING clause can also implicitly be derived by applying corresponding filters to the view:
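The statement itself is missing here; a sketch of what it might have looked like, with the aggregate as an in-line view joined to a second, hypothetical table T2 and all filters applied from the outside (names are my own assumptions):

```sql
select
        /* the filters applied outside the view get pushed inside,
           implicitly generating the HAVING clause on the aggregates */
        v.*
from
        (
          select
                  id
                , min(start_date) as min_date
                , max(end_date)   as max_date
          from
                  t
          group by
                  id
        ) v
      , t2
where
        v.id > 50000
and     v.min_date >= date '2013-01-01'
and     v.max_date <= date '2013-12-31'
and     v.id = t2.id;
```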

Notice how the incorrect cardinality estimate leads to a NESTED LOOPS join, which is very likely a bad idea in such cases where the actual number of loop iterations is much higher than estimated.
In general such bad cardinality estimates can echo through the whole execution plan with potentially devastating results.
For my particular case here one potential workaround besides using undocumented CARDINALITY or OPT_ESTIMATE hints is to prevent the optimizer from pushing the filter into the view (and thereby avoiding the implicit generation of the HAVING clause) by wrapping the view with another view on top that includes a reference to ROWNUM:
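A sketch of that workaround, again assuming a test table T(ID, START_DATE, END_DATE) and a hypothetical join table T2: the additional view layer with a ROWNUM reference acts as a barrier to filter pushdown:

```sql
select
        v.*
from
        (
          select
                  inner_v.*
                , rownum as rn   -- ROWNUM reference prevents filter pushdown
          from
                  (
                    select
                            id
                          , min(start_date) as min_date
                          , max(end_date)   as max_date
                    from
                            t
                    group by
                            id
                  ) inner_v
        ) v
      , t2
where
        v.id > 50000
and     v.min_date >= date '2013-01-01'
and     v.max_date <= date '2013-12-31'
and     v.id = t2.id;
```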

This way the cardinality estimate is much better for my particular case here and a more suitable HASH JOIN gets used instead.
There are, however, a couple of noticeable drawbacks to that workaround:
- The optimizer is still clueless: If for example the filter on the aggregates actually filtered any data, it would still assume NO filtering at all (still 100K in this case here), which might be just as bad, but in the opposite direction (over-estimate instead of under-estimate)
- Any other potentially useful filters (the "ID > 50000" in my case here) will also be prevented from being pushed into the view, so in my case the GROUP BY has to operate on a much larger data set than actually necessary (100K rows instead of 50K rows) - not good
- The ROWNUM evaluation causes overhead and will be a problem when trying to run this statement using Parallel Execution, as it will cause the plan to be split into multiple DFOs (you'll find multiple PX COORDINATOR operations in such plans which can have nasty side effects, for more info read my OTN mini-series on Parallel Execution) with a serialized operation in between to determine the row number

Summary

Be careful whenever the cardinality estimates for HAVING clauses become relevant to your execution plan, as the optimizer simply applies default selectivities that can be way off.
If the aggregation is one of the final steps of execution, the cardinality estimate is probably not that relevant, but there are other cases possible where it matters a lot.

Footnote

From 11.2.0.3 on there is a new undocumented parameter "_optimizer_filter_pushdown" that defaults to "true".
When it is set to "false", the "Filter Pushdown" (FPD) transformation used above (and prevented via ROWNUM) will be disabled, however at the global statement / session level, not just for a particular view / query block as with the ROWNUM workaround.
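So as an alternative to the ROWNUM trick, the transformation can be switched off session-wide; since the parameter is undocumented, this is only something to experiment with, not for production use:

```sql
-- Disables Filter Pushdown for every query block of the session,
-- not just for a particular view as with the ROWNUM workaround
alter session set "_optimizer_filter_pushdown" = false;
```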

The only difference to your plan with HAVING is the cardinality of 0.05 (2501) instead of 0.0025 - it seems the CBO ignores that there are two unknown range predicates. The reason for the test was that I wanted to see if a MATERIALIZE hint would result in more adequate cardinalities - and it seems to do, but at the cost of materialization ...

> The reason for the test was that I wanted to see if a MATERIALIZE hint would result in more adequate cardinalities - and it seems to do, but at the cost of materialization

But there is no materialization in your example, and in fact you can arrive at the same cardinality by using your "basedata" as a simple in-line view without using a CTE.
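For illustration, something along these lines should show the same estimates as the CTE version without materializing anything (a sketch only, assuming a test table T(ID, START_DATE, END_DATE) as in my original example):

```sql
select
        *
from
        (
          select
                  id
                , min(start_date) as min_date
                , max(end_date)   as max_date
          from
                  t
          group by
                  id
        ) basedata
where
        min_date >= date '2013-01-01'
and     max_date <= date '2013-12-31';
```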

Note that your example differs from mine in that you've applied the filter on ID inside the view whereas my examples apply the filter outside, but that doesn't seem to explain the different filter ratios.

I haven't investigated further why sometimes Oracle only applies the 5% default selectivity once instead of twice.

For the sake of completeness: a CBO trace of your initial query with the HAVING clause and my CTE version with a following filter in the main query shows that both queries are transformed into the same "Final query after transformations":

I don't see other major differences between the two traces (aside from the transformation steps), so it seems to me the CBO is not very consistent in its costing of the unknown range predicates: in both cases the selectivity should be 0.002500.

Thanks for taking the time and offering another interesting alternative regarding the cardinality estimates.

Now your previous comment makes a lot more sense to me, you're right.

Your idea might have one advantage over the ROWNUM variant when using Parallel Execution: although it generates a second DFO, too, it doesn't have to go through a "serial" execution part as the ROWNUM plan does.

Of course, as you pointed out, it comes at the expense of an unnecessary physical materialization.

It's definitely interesting to see again that it only takes a rather simple query to come up with surprising differences in optimizer estimates.