Flip Flop Hooray, Ho

An investigation into a flip-flopping SQL statement – the downside of dynamic sampling and why it can be a problem for fast-changing tables and frequently executed SQL.

Yesterday morning, a SQL statement which processes outstanding messages was having performance problems.

There were two child cursors being executed by the four concurrent sessions, depending on their optimizer settings.

I’m not going to detail the optimizer environment differences because they’re mostly irrelevant here, other than that, because of the two child cursors, there was a hard parse at time T0 and another hard parse at time T1, and then these cursors were executed repeatedly and regularly over a number of hours.

I mentioned that this code was processing outstanding messages.
These different processes aren’t balanced and aren’t expected to be.

But plan hash 3791039348 is good for everyone and in general plan hash 1081606725 suits no-one.
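
For reference, the two children, their plan hashes and execution counts can be seen in V$SQL – a minimal query, with &sql_id standing in for the real value:

SELECT sql_id
,      child_number
,      plan_hash_value
,      executions
FROM   v$sql
WHERE  sql_id = '&sql_id';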

Here’s the sanitised SQL – remember, this isn’t so much a SQL tuning/rewriting exercise as a root-cause-of-slowness exercise (where the root cause isn’t initially meant to be “because the SQL is not brilliant”).
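
The real statement can’t be shown verbatim, so here is a minimal sketch of its shape instead – CMSG, SETI, the CMSG_IL alias, EVENT_TYPE and UNACKNOWLEDGED are the genuine names, while SETI_ID, MSG_COUNT and the bind are hypothetical stand-ins:

SELECT s.*
,      v.msg_count
FROM   seti s
,      (SELECT cmsg_il.unacknowledged
        ,      COUNT(*) AS msg_count          -- hypothetical aggregate
        FROM   cmsg cmsg_il
        WHERE  cmsg_il.event_type = :event_type
        GROUP  BY cmsg_il.unacknowledged) v   -- the non-merged inline view
WHERE  s.seti_id = v.unacknowledged (+);      -- outer-joined on the join column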

It’s always nice to simplify a situation and, because the problem seems to be in this non-merged view, we can actually eliminate half the SQL and then see if we can reproduce the problem just by adjusting the estimates.

Which we can, and via trial and error we see that as we approach an estimate of 14000 rows from CMSG, the plan changes to the FULL TABLE SCAN of SETI.
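
For illustration, a sketch of that trial and error using the (undocumented but widely used) CARDINALITY hint to feed the optimizer a fixed row estimate for CMSG – this harness is an assumption, not the original test:

EXPLAIN PLAN FOR
SELECT s.*
FROM   seti s
,      (SELECT /*+ cardinality(cmsg_il 14000) */
               cmsg_il.unacknowledged
        FROM   cmsg cmsg_il
        WHERE  cmsg_il.event_type = :event_type) v
WHERE  s.seti_id = v.unacknowledged;

SELECT * FROM TABLE(dbms_xplan.display);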

So, optimizer environment differences and bind variable peeking are pretty much ruled out as influencing factors.

What remains is that the bad plan has a higher estimate of rows than the good plan and the bad plan has dynamic sampling.

The SQL text of both child cursors is the same.

Dynamic sampling level is the same for both (4 set at system level).
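
Both the system setting and what each child actually parsed with are easy to check – the latter via V$SQL_OPTIMIZER_ENV (&sql_id again a placeholder):

SELECT value
FROM   v$parameter
WHERE  name = 'optimizer_dynamic_sampling';

SELECT child_number
,      name
,      value
FROM   v$sql_optimizer_env
WHERE  sql_id = '&sql_id'
AND    name   = 'optimizer_dynamic_sampling';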

Dynamic sampling should be applied because of the parameter setting and because there are two single-table access predicates
– one from the predicate on CMSG_IL.EVENT_TYPE and one implicit IS NOT NULL predicate on the join column UNACKNOWLEDGED.

So, for the good plan at least, the results of dynamic sampling are discarded as unreliable and the estimates fall back on the stored object stats.
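
One way to confirm whether the optimizer used or discarded the sample is a 10053 trace of a fresh hard parse – a minimal sketch, using the simplified statement from earlier:

ALTER SESSION SET tracefile_identifier = 'ds_flipflop';
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

EXPLAIN PLAN FOR
SELECT s.*
FROM   seti s
,      (SELECT cmsg_il.unacknowledged
        FROM   cmsg cmsg_il
        WHERE  cmsg_il.event_type = :event_type) v
WHERE  s.seti_id = v.unacknowledged;

ALTER SESSION SET EVENTS '10053 trace name context off';

The dynamic sampling section of the resulting trace file shows the sampled cardinalities and whether they were actually used.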

But at some point (at one of the hard parses) the sample must have found some relevant data and been considered reliable, which led to the “bad” plan being executed repeatedly in inappropriate circumstances.

Two footnotes to this investigation.

Firstly, when you have tables with fast-changing data and row counts, there are usually two choices of approach for statistics:
– either find a set of stats that gives the best (balance of) plans and lock them (a sketch of this follows below)
– or use dynamic sampling
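
A sketch of the first option – seed representative stats and lock them (the owner, table and numbers here are all assumptions):

BEGIN
  -- set stats that give the known good plan
  DBMS_STATS.SET_TABLE_STATS(ownname => 'MSG_OWNER',
                             tabname => 'CMSG',
                             numrows => 1000,
                             numblks => 100);
  -- and lock them so the regular gather jobs leave them alone
  DBMS_STATS.LOCK_TABLE_STATS(ownname => 'MSG_OWNER',
                              tabname => 'CMSG');
END;
/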

The scenario above shows the downside of the dynamic sampling approach.
It may well be that plan hash 1081606725 was good for the initial execution(s) of the statement, when there were perhaps a large number of outstanding messages, but it very quickly became unsuitable given the frequent executions.

Secondly, I mentioned that this was not a SQL rewriting exercise but, if it were, I’m not a great fan of the outer-joined inline view. It’s not merged, but it seems sensible that the join predicate be pushed into the view.
Pushing the predicate, whether happening by default or hinted explicitly, would drive the access into the view via the join column for each driving row, rather than building the whole view result first.
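
A sketch of the explicit version against the same assumed shape as before, using the PUSH_PRED hint (noting that, depending on version, join predicate pushdown may not apply to a GROUP BY view):

SELECT /*+ push_pred(v) */
       s.*
,      v.msg_count
FROM   seti s
,      (SELECT cmsg_il.unacknowledged
        ,      COUNT(*) AS msg_count
        FROM   cmsg cmsg_il
        WHERE  cmsg_il.event_type = :event_type
        GROUP  BY cmsg_il.unacknowledged) v
WHERE  s.seti_id = v.unacknowledged (+);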