From an AWR report I am seeing CPU taking most of the call time %. Can anyone explain what this means? I have looked on Metalink and there are a number of documents regarding top events in general.

So what could we guess from this minimal slice of an AWR report?

Something has to be top of the list. In this case it’s CPU – but if this is a typical AWR install (which means hourly snapshots), it’s reported 1,167 seconds of CPU in one hour: which is about 32% of the CPU available if it’s running on a single CPU system. (But we haven’t been told the length of the snapshot, the number of CPUs, or whether this is supposed to represent a fully loaded production system or a lightly loaded development system.)
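
As a back-of-envelope check (the one-hour snapshot and the CPU counts are assumptions, since neither appears in the extract):

-- 1,167 CPU seconds in a 3,600 second snapshot on a single CPU
select round(100 * 1167 / (3600 * 1), 1) pct_of_cpu from dual;   -- 32.4
-- the same load on an (assumed) 8-CPU machine is only ~4% of capacity
select round(100 * 1167 / (3600 * 8), 1) pct_of_cpu from dual;   --  4.1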

Second is the single block read time – which is averaging 2 milliseconds per read – so it’s not (mostly) coming from disc; it’s coming out of memory (possibly a SAN cache). That might be pushing up CPU usage outside Oracle, and could be using memory that would be better used inside Oracle; but it’s not giving any obvious threat indications.

Log file sync – and its driver “log file parallel write” – are also averaging 2 ms, so we aren’t overloading the discs on log file writes, and (most) users probably don’t even notice the pause on commit. But we can’t tell whether the counts are appropriate for the work being done – so we can’t say if this is good or bad.
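
For anyone wondering where those averages come from, a minimal sketch against v$system_event (assuming 10g or later, where time_waited_micro is available) – an AWR report does the same arithmetic on the deltas between two snapshots:

select  event,
        total_waits,
        round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) avg_ms
from    v$system_event
where   event in (
            'db file sequential read',
            'log file sync',
            'log file parallel write'
        )
order by event;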

The multiblock reads are blindingly fast – so they must be coming from the local file-system cache. But perhaps this is supposed to be an OLTP system, in which case those reads could represent a lot of inefficient data access collecting a very small amount of data.

All in all, we might say that the hardware is not under stress – but it is possible that the application is so badly written that the work done per transaction is far higher than it should be. These stats say you might (for example) be doing a 128-block tablescan with one subsequent row lookup and update with commit as you work your way through a batch job. The hardware seems to be coping, but that doesn’t mean the application is healthy.

Without a context, a small collection of numbers is not helpful.

But if you want a couple of guesses –

it’s an OLTP system (77,000 log file syncs) and there are some SQL statements that need tuning to do less work for their result (79,000 db file scattered reads).

Fixing a few SQL statements will probably eliminate a lot of db file scattered reads and a fair bit of CPU.
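
A crude sketch of a starting point for finding those statements – bearing in mind that v$sql figures are cumulative since each cursor was loaded, not per AWR interval, so dba_hist_sqlstat would be a better match for a snapshot range:

select  *
from    (
        select  sql_id,
                executions,
                disk_reads,
                round(disk_reads / nullif(executions, 0)) reads_per_exec
        from    v$sql
        order by disk_reads desc
        )
where   rownum <= 10;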

At the same time, it looks like the machine is surviving on the back of a large file system cache – put some more memory into the buffer cache and the CPU will drop further as Oracle does fewer ‘physical reads’ to the file system.
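
One way to test that hypothesis before buying the memory is the buffer cache advisory – a sketch, assuming an 8KB block size and db_cache_advice left on (the default with statistics_level = typical):

select  size_for_estimate  cache_mb,
        size_factor,
        estd_physical_read_factor,
        estd_physical_reads
from    v$db_cache_advice
where   name       = 'DEFAULT'
and     block_size = 8192
order by size_for_estimate;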

Two simple questions you can always ask about performance figures are these: “Can I see any resource usage that seems unreasonable for the result achieved?” and “Where are the gaps in the information that might make me jump to the wrong conclusion?”

Jonathan, this reminds me very much of a situation I saw just last week (of course, I had access to more information). In my case, excessive parsing seemed to be an additional factor in the high CPU; something else to look for here?

Also, even if the buffer cache is increased there will still be “physical reads” if full scans are made, unless the tables are in “cache” mode.
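
For reference, the “cache” mode mentioned is just a table attribute (t1 here is a hypothetical table); it asks Oracle to keep blocks read by full scans at the hot end of the LRU list rather than letting them age out quickly:

alter table t1 cache;
-- and to reverse it
alter table t1 nocache;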

I think it’s always a good idea, when looking at the headline figures, to start by creating a list of possible causes and then checking for evidence that dismisses or confirms your ideas.

CPU could be a side effect of heavy parsing (both hard – i.e. optimisation – and soft). We even see that we must have executed at least 77,000 data-modifying statements (or we wouldn’t be able to get 77,000 log file syncs).

Of course, with the full report, we could just cross check a couple of relevant figures (parse time CPU, library cache latching, parse count (hard), session cursor cache hits etc.), but even when you do have the whole report it’s quite educational (and entertaining) to look at just the first page (load profile and waits) and try to guess what you’re going to see elsewhere.
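
The same cross check can be run at any time against the cumulative figures in v$sysstat (AWR stores the per-snapshot history of the same statistics in dba_hist_sysstat) – a minimal sketch:

select  name, value
from    v$sysstat
where   name in (
            'parse time cpu',
            'parse count (total)',
            'parse count (hard)',
            'session cursor cache hits'
        )
order by name;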

I was under the impression that log file syncs were also triggered by DBWn flushing dirty blocks. I haven’t checked on that behaviour since 8i, but it stands to reason: no block on disk should be allowed to contain changes protected by redo that is not yet flushed (i.e., “get ahead of” the redo stream on disk).

So – if this is still correct – we may have executed a single data-modifying statement which, together with subsequent demand for free blocks, may have caused 77,000 blocks to be modified and then written to disk, triggering 77,000 log file syncs (in the background).

I’d look at the background wait events before concluding that 77,000 log file syncs = 77,000 commits. But of course we don’t see that information either.
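
As a sketch of that check: from 11g onwards v$system_event carries separate foreground columns, so the background share of an event can be derived directly (on earlier versions you would have to rely on the Background Wait Events section of the report):

select  event,
        total_waits,
        total_waits_fg,
        total_waits - total_waits_fg  background_waits
from    v$system_event
where   event = 'log file sync';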

Good point – both the suggestion for possible cause and strategy for follow-up.

The behaviour is still largely the same – the database writer can’t write a block if the log buffer still describes some changes that have been applied to it.

The dbwr algorithm did change some time in 8, though – possibly 8.0 – so that the database writer didn’t have to wait for the log write to finish before carrying on writing other blocks.

I haven’t checked recently, but I think that dbwr posts lgwr, without a sync, sets aside the problem block, then comes back to it when it’s finished dealing with the rest of the write list to see if lgwr has completed the write.

I’ve copied your correction back into the first post and deleted the second post to minimise the risk for someone reading the incorrect version and not spotting the correction. Let me know if you want it changed.

you said: “At the same time, it looks like the machine is surviving on the back of a large file system cache – put some more memory into the buffer cache and the CPU will drop further as Oracle does fewer ‘physical reads’ to the file system.” In my experience %CPU will go up if more requests can be satisfied from buffer cache because the time for physical IO drops and Oracle spends relatively more time getting blocks from buffer cache etc. I am probably missing something here (for example “CPU” means “overall CPU” and not “Oracle CPU”). Can you clarify this? Thanks a lot!

As you point out – if my session has to keep stopping to read from the file system, then my CPU usage is constrained. If my session can keep its data in the buffer cache, it doesn’t have to stop, so it can (probably) use more CPU in a given time interval. (See the final paragraphs on Statspack 9 for an example of what this can do).

On the other hand, if my session has to read blocks from the file system before doing its “logical I/Os”, then some code has to kick a block off the LRU (possibly promoting a few before doing so), move a buffer to the LRU mid-point, detach the buffer from one hash chain and re-attach it to another, and move data from the file system to Oracle’s memory space. Getting rid of that reduces the total CPU requirement for the job.

This is not the appropriate forum for posting Statspack/AWR reports. I have deleted most of the content, but left the part to which Flado has responded. His observation about 30 seconds being too short is relevant – if you have a problem with this granularity you should probably be looking at specific sessions.

You could try posting your report on the OTN Oracle database forum (see the “Oracle DB Forum” link at the right of the page). If you do, remember to use the {code} tags when posting your report so that it appears in a readable form using a fixed font.

With this I can extract average wait times for a specific and “interesting” period (&dad = from-date, &ada = to-date).
If I set &evt to “%read%” or “%write%” I can check the average disk access time in milliseconds and see whether there are slow disk accesses (for example, whether the average exceeds a 5-10 ms threshold). Is that right? Can you comment on this, please?
Can you suggest an optimisation of this statement?

Thanks for the contribution – I’ve edited it to try and make it more readable within the limits of WordPress comments.

I think you could make the code both simpler and more efficient by making more use of subquery factoring (the “with subquery” approach you have used on the dba_hist_snapshot view). I note that you’ve accessed the dba_hist_system_event view four times to get a relatively simple report from it – you could make use of analytic functions rather than self-joins to do this; see, for example: http://www.jlcomp.demon.co.uk/statspack_trend.html
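
For what it’s worth, a minimal sketch of that approach (assumptions: 10g or later, the date format mask, and the &fromdate, &todate and &evt substitution variables from the original script; a RAC system would also need dbid and instance_number in the join and the partitioning):

with snaps as (
        select  snap_id, begin_interval_time
        from    dba_hist_snapshot
        where   begin_interval_time >= to_date('&fromdate','dd-mon-yyyy hh24:mi')
        and     end_interval_time   <= to_date('&todate'  ,'dd-mon-yyyy hh24:mi')
)
select  begin_interval_time,
        event_name,
        delta_waits,
        round(delta_micro / 1000 / nullif(delta_waits, 0), 2) avg_ms
from    (
        select  sn.begin_interval_time,
                ev.event_name,
                ev.total_waits
                  - lag(ev.total_waits) over (
                        partition by ev.event_name order by ev.snap_id
                    ) delta_waits,
                ev.time_waited_micro
                  - lag(ev.time_waited_micro) over (
                        partition by ev.event_name order by ev.snap_id
                    ) delta_micro
        from    dba_hist_system_event ev,
                snaps sn
        where   sn.snap_id = ev.snap_id
        and     ev.event_name like '&evt'
        )
where   delta_waits is not null   -- first snapshot of each event has no prior row
order by event_name, begin_interval_time;

Remember that the counters reset on instance restart, so any negative deltas around a bounce should be discarded rather than believed.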

But I think that my results are incorrect. I want to extract the average wait between the first snap (from_snap) and the last snap (to_snap) of my input date range (&fromdate and &todate), by event, from a partial event name. If I set &evt to “%read%” or “%write%” I can check the average disk access time in milliseconds and see whether there are slow disk accesses (for example, whether the average exceeds a 5-10 ms threshold).