What AWR isn’t telling you

It is well-known that AWR, and Statspack before, take snapshots of V$ views (or rather the underlying objects) to produce the data in AWR.

It is also well-known that, when considering sql and its statistics in the shared pool, if something big hitting happens but the big-hitter is no longer in the shared pool by the time of the snapshot, then it can’t be recorded in your AWR picture of activity.

But like many things supposedly well-known, it can still come back and remind you that you’ve forgotten or overlooked this behaviour.

Here is a little illustration which happened earlier this week.

This week I’ve been looking closely at activity which happens on a particular DB between 16:00 and 17:00 because the IO subsystem is showing signs of stress and the timings for ‘Average Synchronous Single-Block Read Latency’ (V$SYSMETRIC_SUMMARY / DBA_HIST_SYSMETRIC_SUMMARY) have pushed above our amber warning levels.

Don’t get me started but our amber level for a single block read is 20ms!
Way too high for my liking for such sustained average but that is apparently the SLA from the SAN.

Why do we have such an alert?
The purpose is two-fold.

It can sometimes be a warning that some of our IO-sensitive batch work might show some performance degradations, threatening SLAs

In the past, it has been an effective warning that a sql statement executed concurrently in multiple threads from the applications has switched to a "bad" execution plan whose impact is system-wide (normally heavy concurrent direct path reads).

So… I was hovering around this system during the relevant timeframe looking at what was going on.

And I observed some heavy hitting queries running in parallel originating from another system (i.e. not my application).

There were two executions of one sql id (0cu2sm062jutc) executing concurrently with one single execution of a another sql statement.

The top-level RTSM execution metrics for these three executions were all very similar to this:

Executing for about 10 minutes, in parallel doing a lot of physical direct path reads (38GB) and accumulating over 1 hour in parallel processing time.