October 23, 2007

Analysing Statspack (7)

Here’s an interesting extract from an Oracle 9i statspack report taken from Oracle Database 10g Performance Tuning Tips and Techniques*** . The book doesn’t give the snapshot interval, number of CPUs, or any idea about the average number of sessions active during the period, so all the comments I make about the sample are highly speculative.

But, as a pure “thought experiment”, I want to give you a feasible explanation of the figures shown just to give you a flavour of how counter-intuitive a statspack report can be.

The question is – which of these top five figures will have the most impact on the end-users’ perception of system performance?

Let me repeat that this is purely conjectural and proposed for the purposes of enlightenment only, but it’s possible that the performance issue that’s most visible to the end users is the one that’s reporting the shortest time on the list above.

I have no idea how long a time period this snapshot lasted, nor how many sessions were actually involved. The best I can say is that if this was running with one CPU the snapshot must have been at least 267 hours (CPU time / 3,600) and if there were 16 CPUs it was at least 17 hours. Allowing for the fact that people rarely run at 100% CPU, and 17 hours is an odd snapshot time, then possibly it’s 16 CPUs for 24 hours, or 64 CPUs for 6 hours – let’s assume the latter.

When you have your mind operating at the right scale the rather large counts on single block reads and buffer busy waits start to look slightly less dramatic.

Of course when you do extreme amounts of reads you tend to get a lot of buffer busy waits as one session waits for another session to finish reading a block that they both happen to want; and the CPU due to latching of the cache buffers chains and LRU chains can get a bit expensive – so the activity does look a bit undesirable and should be reviewed. (10g conveniently renames this particular read-related wait a “read by other session”)

But take a close look at the times: the average read time is 6.41 milliseconds (total time divided by total reads) and the average buffer busy wait is 4.42 milliseconds – but the average wait for the log file switch checkpoint completion is 0.994 seconds.

There are two important points to note here. First, it’s the end-user sessions that wait on log file switch (checkpoint incomplete), and secondly the timeout on that event is one second (which means this might really be a small number of very long waits cut into one second slices since the average is so close to one second).

Now we don’t know how many end-users hit the checkpoint problem, or how long each end-user waited (ticking the seconds away one at a time), but log file switches shouldn’t be happening all the time and, quite possibly, there’s a lot of query based activity going on that doesn’t generate redo log and therefore doesn’t hit this problem. So let’s play numbers and see what happens.

If, in that hypothetical 6 hour time period, we actually have 188,000 periods of session activity that all wait for one second, and share out the other waits between them, how much extra wait time gets associated with each log switch wait? Just under 2.9 seconds of buffer busy wait, and 13.6 seconds of single block read time – for a total of 16.5 seconds. (That’s just dividing the other times by the 188,000 assumed sessions).

This means that if you have just 15.5 sessions running for every session that hits a log switch wait on DML, the log switch wait becomes the most visible contributor to the jammed sessions’ wait time. And our assumption was that this was a big system, so a 15.5 : 1 split between query and DML activity could easily be that sort of size.

Now I’ve done a lot of hand-waving and playing with numbers – but the argument is viable: it could really happen. This is a big system (or a small system viewed over a very long period of time) that manages to get through a lot of work. There are clear signs of resource wastage and contention from the buffer busy waits – but it is possible that the only performance problem the end-users notice is at the log file switch when some of them go into the checkpoint not complete.

The smallest statspack wait could be the most visitble end-user issue – averages hide the detail and only the sessions can tell.

Footnote:log file switch (checkpoint not complete) happens to be a wait time that can (usually) be eliminated very easily – having more, or bigger, log files is often all that it takes to avoid the checkpoint completion issue. Sometimes, of course, the problem can be reduced by ensuring that the database writer(s) have less to write, or are able to write faster (which typically means changing the fast_start_mttr_target or increasing the number of database writers, respectively). The most appropriate action will only become clear when you examine the log files sizes, log file switch timings, settings for the two log_checkpoint parameters and the rest of the statistics.

Related

The PL/SQL lock timer suggests 4,077 one-minute ‘sleeps’ which might be associated with a batch job timed to pick up from a queue. It also suggests 68 hours if there was one job doing all the waiting, and nothing else, and it was running all the time. That is a multiple of the 17 hours you get as a minimum time for 16 CPUs. Maybe there were only 4 CPUs and it was a 68 hour interval, with CPUs maxed at a constant 100%. As you say, pure conjecture.

To me it looks like a not so badly tuned db, I compared your figures to similar figures I got from a 255 users Siebel database before I started to tune it. So speculatively:
Indexes need to be rebuilt possibly to remove buffer busy waits;
sequential read contentions in the cache, lots of concurrency;
they might not be using ASSM;
check point incomplete needs to fixed evidently;
if there are 16 cpu’s there could be a lack of RAM.
P.

This is pretty much what Millsap and Holt write in their “Optimizing Oracle Performance”: you cannot extract detail from a aggregated value. Which is perfectly logical but reminding people is probably a good idea since often only these aggregated values are looked at. (Speculation: maybe it’s because they are easier available.)