January 7, 2013

Analysing Statspack 13

A recent (Jan 2013) post on the OTN database forum reported a performance problem on Oracle 9.2.0.6 (so no AWR), and posted a complete statspack report to one of the public file-sharing sites. It’s been some time since I did a quick run through the highlights of trouble-shooting with statspack, so I’ve picked out a few points from this one to comment on.

As usual, although this specific report is from Statspack, the same analysis would have gone into looking at a more modern AWR report. I will also make a couple of comments at the end about the extra material that would have been available by default with an AWR report, and that would have helped us help the OP.

First, here are the Load Profile, and “Top 5” parts of the report. To make sense of them, you also need to know that report covered a two-hour interval, and the machine had 12 CPUs; there’s also a little clue in the fact that this is 32-bit Oracle with a 600MB SGA (with the sga_max_size set to 1.6GB) running (alone) on a machine with 12GB of memory.

Working from the top down, the first thing we notice is that 2M logical I/Os per second is quite busy, and that 43,000 blocks read per second is pretty amazing – especially since the number of user calls and executes isn’t particularly large.

A cross check to the Top 5 then shows that a huge fraction of the work done by the database is CPU. At 2 hours and 12 CPUs we have 12 * 2 * 3600 = 86,400 CPU seconds available, and we’ve used 74,700 of them – and that’s into Cary Millsap’s “knee of the curve” for queueing problems. Since this isn’t the AWR, of course, we don’t have the OS Stats available, so we can’t see if the rest of the machine’s CPU is being used up outside the database – but looking at the db file read figures I’ll bet it is. If you have extreme CPU usage in an Oracle system, then latch contention problems are very likely to show up – and that’s our next highest figure – AWR would probably tell us which latch but, again based on the db file read figures, I’ll guess it’s the cache buffers chains latches.
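The utilisation arithmetic is worth making explicit. A quick sketch in Python – the 74,700 CPU seconds and the 2-hour, 12-CPU interval are the figures from the report:

```python
# CPU capacity over the Statspack interval: 12 CPUs busy for 2 hours,
# measured in CPU seconds.
cpus = 12
interval_seconds = 2 * 3600
capacity = cpus * interval_seconds          # 86,400 CPU seconds

# CPU time the database reported consuming over the same interval.
db_cpu_used = 74_700

utilisation = db_cpu_used / capacity
print(f"capacity   : {capacity:,} CPU seconds")
print(f"used       : {db_cpu_used:,} CPU seconds")
print(f"utilisation: {utilisation:.1%}")    # 86.5% - well past the "knee"
```

At 85%+ utilisation the queueing theory says response times degrade sharply, which is why the “knee of the curve” comment matters here.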

So we look at the db file read figures and do the arithmetic that the AWR would show us, to work out that we’re seeing multiblock read times of 0.06 milliseconds and single block read times of 0.08 milliseconds. These have to be coming from the file system cache to be this fast. (Even then I think we may also be seeing a timing problem (perhaps a measurement bug) that’s managing to lose wait time.) First thoughts, then, are that there is some code so inefficient that it’s doing far more physical I/O than it should; on the plus side the excess memory (12GB) means we’re being protected from the worst effects of disk I/O, but the caching effect means we’re able to burn up CPU at an astonishing rate to do the work.
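The arithmetic itself is nothing more than total time waited divided by the number of waits. A Python sketch of the method – the wait counts and times below are made up purely for illustration (the real values come from the wait-event lines of the report):

```python
def avg_wait_ms(total_wait_seconds, waits):
    """Average wait per event in milliseconds: total time / event count."""
    return total_wait_seconds / waits * 1000

# Hypothetical figures, chosen only to reproduce the averages quoted
# in the text; substitute the "db file scattered read" and
# "db file sequential read" lines from the actual report.
multiblock = avg_wait_ms(300, 5_000_000)    # e.g. 300s across 5M waits
single     = avg_wait_ms(200, 2_500_000)    # e.g. 200s across 2.5M waits

print(f"multiblock read  : {multiblock:.2f} ms")   # 0.06 ms
print(f"single block read: {single:.2f} ms")       # 0.08 ms
# A genuine disk seek is measured in milliseconds; anything this far
# below 1 ms must be coming from the file system cache.
```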

So let’s look at the SQL to see where the CPU and disk activity is likely to be going. Here’s an extract from the SQL ordered by Gets and the SQL ordered by Reads. (We don’t have SQL ordered by Time in Statspack, of course):

The first statement reported by Gets is responsible for 5 billion buffer gets in the interval (and even then that’s 40% of the total – so there may be a couple more big ones that have fallen out of the library cache in the interim; perhaps an hourly report on the previous hour would catch them). Look at the table that the SQL is inserting into – RA_INTERFACT_ERRORS. We might hope that an error table wasn’t going to receive many errors – so perhaps this is just an example of a statement with an insanely bad execution plan; clearly one to look at. The trouble, though, is that unless you set statspack to run at level 6 you don’t capture the execution plans – unlike the AWR, which captures them by default. The batch took a couple of days to complete, though, so it would have been possible either to change the snapshot level or to pull the execution plans from memory.

Looking at the SQL ordered by Reads, we see another example of the top statement being much more expensive than anything following it – but there’s an odd little pattern that caught my eye as I scanned the list. The next three statements look very similar to each other, and have similar statistics, and if you add up their stats the answer is very close to the figures for the first statement. Moreover, the report says that the top statement is responsible for 90% of the work – I think the top statement is actually calling the next three, so we need to look at those statements to see how to make them efficient in order to fix the first statement.
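The pattern-check here is just addition and a ratio: sum the reads for the three similar statements and compare with the top statement. A Python sketch with hypothetical figures (the real numbers come from the SQL ordered by Reads section of the report):

```python
# Hypothetical disk-read figures standing in for the report's
# "SQL ordered by Reads" section.
top_statement_reads = 280_000_000
child_reads = [95_000_000, 93_000_000, 90_000_000]   # the three similar statements

total_children = sum(child_reads)
ratio = total_children / top_statement_reads
print(f"children total: {total_children:,}")
print(f"ratio to top  : {ratio:.2f}")
# A ratio close to 1.0 is consistent with the top statement being a
# wrapper (e.g. driving PL/SQL calls) whose physical work is actually
# done by the three child statements - so fix the children to fix it.
```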

You might note, by the way, the capitals and the bind variables (:B1) in the next three statements – I suspect that there is a PL/SQL function in the first statement that is being called too many times. (20 executions of the first statement, 8,500 of the other three). Perhaps a change in the execution plan of the main query has produced a change in table order which has resulted in some predicate – one with a pl/sql function in it – being tested much more frequently than usual.

A huge fraction of the physical reads seem to be for prefetched blocks – that’s a good indication that they are for tablescans (or index fast full scans), and when we compare rowid access with tablescan access we see that there are a lot of “long” tablescans (maybe the three expensive queries are all doing tablescans of a long table, which would account for 25,500 of those reported), with 360M blocks scanned. Apart from anything else, all those block scans will be using a big chunk of CPU. The AWR would show us which segments were being scanned, of course, as would statspack if we were running it at level 7.
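As a sanity check on the scale of that scanning, spread 360M blocks over the two-hour interval (Python sketch; the 360M figure is from the report’s instance activity):

```python
# 360M blocks scanned by "long" tablescans across the 2-hour snapshot.
blocks_scanned = 360_000_000
interval_seconds = 2 * 3600

blocks_per_second = blocks_scanned / interval_seconds
print(f"{blocks_per_second:,.0f} blocks/sec")    # 50,000 blocks/sec
# Compare with the ~43,000 blocks read per second in the Load Profile:
# the long tablescans account for essentially all of the physical read
# activity, and examining 50,000 blocks a second is a lot of CPU on its own.
```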

I’ve included the table fetch by rowid and table fetch continued row figures. Normally we should be concerned by “continued fetches” – but in this case I’m prepared to overlook them temporarily: it’s possible that after the critical SQL is corrected we may still see most of the fetches by rowid turning into continued fetches, and at that point it will be worth worrying; but at present, even though they’re probably responsible for a serious fraction of the CPU usage, those continued fetches may be for rows that we won’t be fetching once the SQL is doing the right thing.

The last little detail, just for confirmation, comes from the latch activity part of the report; I’ve just selected the four busiest latches:

As you can see, and as we had guessed, they’re all about the buffer cache – doing multiblock reads and loading/dumping blocks from the cache; and almost all of the sleeps are on the cache buffers chains.

Next Steps

As a quick and dirty fix which might help a bit – put more of the sga_max_size into the buffer cache – it doesn’t happen automatically in 9.2 – and that may reduce the interaction between the O/S and Oracle and save a bit of latching and CPU.

Address the three statements doing all the physical I/O, and check the path for the statement that causes them to be executed. Fix the statement that has done 40% of the buffer gets. In all cases the problem may simply be an execution plan change – but perhaps the machine has been hiding an inherent performance problem, and on this occasion a small change in data (a few more errors, perhaps) resulted in the excess work generating a massive increase in workload and contention.

We do know, however, that there is a lot of work done on buffer gets that didn’t show up in this report – so run off the report for the previous hour and see if we can find a couple more big hitters. Once we’ve sorted out the worst SQL, we might then want to see if the volume of “continued rows” is high – and find out why they’re appearing. It’s possible that it’s due to a table with more than 254 columns, of course, or a long column, but we’ll worry about that only when we need to.

One final thought – it’s easy to fix the wrong thing when you’re looking at statspack reports. Although we can see some SQL that is “obviously” overloading the machine, maybe the business critical task that has to complete promptly is simply hidden because it does (relatively) little work. Unless we can fix the big problems extremely quickly, we ought to take a little time to check with the business and the developers that what we’re looking at really is the code that’s running the critical business task.

Longer term, of course, upgrade to 9.2.0.8 at least and then start running Statspack at level 7 for better information. Then upgrade some more.


These things must attract each other! Just finished a tuning exercise on a third party j2ee app here.

A single SQL was in its 3rd day of uninterrupted execution. Finally got permission from the business to shoot it and fix the blessed thing.

Typical j2ee “framework” disaster: auto-generated, a 3.8M-row table joined to a 4M-row GTT, with multiple subqueries on both the predicates AND the select column list coming from equally monstrous joins. None of them with the slightest chance of ever using an index.
What I call a “legacy query”: our grandchildren will see its results.

The execution plan in EM was a very long, sorry list of nested over nested over nested loops of million row FTS…
But because the buffer cache was quite large, the whole lot fit in cache. Hence, a LOT of CPU being used and negligible reads.
And of course: lots of buffer cache waits, and even more CPU waits!

Which prompted the supplier’s “Oracle guru” to *categorically* claim “you are starved of CPU – add more of it”.
Of course, that got shot in the bud.
After a little judicious re-coding and trimming of totally unnecessary nested subqueries, the whole lot finished in around 10 minutes.

In many ways this boils down to the question: “Who knows what the system is supposed to do?”

Intuitively we feel that a query that takes 10,000,000 buffer gets and runs hundreds of times per day in an OLTP system has to be wrong – either it should only be running rarely, or it should be taking thousands of LIOs rather than millions. There are special cases, of course, and there’s a very large grey area where the resource usage looks suspicious but not necessarily outrageous. When in doubt, we need to be able to get a good answer to the question: “does this seem reasonable?”, and I often find that the answer to that question is that nobody knows, and you have to go through a lot of work trying to reverse engineer the requirement and the intent of the programmer.

Very true. In this case, it was quite clear: it’s a daily aggregation batch job that starts executing at 11pm and is supposed to be finished by 8am.
Used to work well within that window. Until the Xmas load from our centres made the “demo data” in the tables involved grow from 100 rows to 3.8 million.
And of course: archiving older data is unheard of – apparently, adding CPU to enable the thing to “scale” was the recommended way to address this.
Conveniently ignoring that the process does not use parallel query, and that of the 4 cores available only one was flat out.
Given I got a say in this one, I preferred to tune the SQL: now down to 10 minutes, from never finishing after 72 hours. I reckon that was a better improvement than “adding CPU”.
Although of course the Oracle sales rep hates that! :)

Thanks for the comment – it’s too easy to lose track of which bits appear in which version (though this was a 9i system, of course). The only bit I don’t forget is that Statspack 10 reported the event histograms that didn’t get into the AWR until 11 ;)

Could you please explain further your thoughts behind the following sentence:
“and even then that’s 40% of the total – so there may be a couple more big ones that have fallen out of the library cache in the interim; perhaps an hourly report on the previous hour would catch them”

Do you think 5 billion LIOs should represent more than 40% of the total?

The point of that comment was that the instance activity reports something like 14.4 billion LIOs, so somewhere we’ve “lost” 8 or 9 billion. This means that solving the 5B problem might be nice (and it might even be easy), but there’s still a huge amount of work hiding somewhere that we’re still going to have to address.
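The arithmetic behind the “lost” figure, as a Python sketch (14.4 billion from the instance activity, 5 billion attributed to the biggest visible statement):

```python
# Total logical I/Os reported by the instance activity for the interval,
# versus the LIOs attributed to the biggest statement we can see.
total_lios = 14_400_000_000
visible_big_statement = 5_000_000_000

unaccounted = total_lios - visible_big_statement
print(f"unaccounted: {unaccounted / 1e9:.1f} billion LIOs")   # 9.4 billion
# "8 or 9 billion" once you allow for the smaller statements that are
# also visible in the report - either way, most of the work is hiding
# in SQL that didn't make it into this snapshot pair.
```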

Maybe the big one that we can see all happened in the second hour of the report, and there’s a similarly sized query that took place in the first hour. It’s always worth thinking about the things that we can’t see, and wondering whether there’s a way in which we can find them.

Where did you get this number? Original thread doesn’t show available cores count.

Working from the top down, the first thing we notice is that 2M logical I/Os per second is quite busy

Just translating it into bytes makes sense: that’s around 16GiB/s, which could tell you more if you happen to know the chipset – maybe that’s close to its limit.
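The translation the commenter has in mind, assuming the common 8KB database block size (an assumption – the report extract shown doesn’t state the block size):

```python
# 2M logical I/Os per second; each LIO visits one buffer.
lio_per_second = 2_000_000
block_size = 8192                      # bytes; assumed, not stated in the report

bytes_per_second = lio_per_second * block_size
gib_per_second = bytes_per_second / 2**30
print(f"{gib_per_second:.1f} GiB/s")   # 15.3 GiB/s
# In the same ballpark as the commenter's round "16GiB/s" figure - a
# block-visit rate to compare against what the memory subsystem and
# chipset can actually sustain.
```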

Since this isn’t the AWR, of course, we don’t have the OS Stats available, so we can’t see if the rest of the machine’s CPU is being used up outside the database – but looking at the db file read figures I’ll bet it is

Could you please clarify what it is in the db file read figures that makes you think something else is eating up CPU outside the DB? BTW Statspack in 11g provides OS stats (10g doesn’t, AFAIR), so it’s not only the tool that is the issue.

If you have extreme CPU usage in an Oracle system, then latch contention problems are very likely to show up – and that’s our next highest figure – AWR would probably tell us which latch but, again based on the db file read figures, I’ll guess it’s the cache buffers chains latches.

1. It’s not the AWR that would tell us the latch name, but the modern Oracle kernel, which has specific names for particular latch free events – so Statspack on 10g or 11g would show the CBC latch in the Top 5 as well.
2. It’s hard to decide which is the cause and which the consequence – is latching causing CPU saturation, or is extreme CPU usage causing latch waits to appear in the Top 5? In this case I think it is the first scenario, i.e. too many block gets cause too much CPU usage through too many latch acquisitions.

so there may be a couple more big ones that have fallen out of the library cache in the interim; perhaps an hourly report on the previous hour would catch them

That is a very likely situation, especially with a relatively small shared pool. Another possibility is that something is still being executed – and in this case the Oracle kernel in 9i could not help Statspack identify a query that is still running at the ending snapshot.

Trouble is, though, that unless you set statspack to run at level 6 you don’t capture the execution plans – unlike the AWR, which captures them by default

… but it doesn’t help in some situations as it doesn’t capture predicates (Bug 7493519: ACCESS_PREDICATES AND FILTER_PREDICATES MISSING IN DBA_SQL_PLAN_HIST, fixed in 12c). I don’t remember if Statspack gathers the predicates correctly under level 6 – if it does, then Statspack is the preferable tool for this task.

As a quick and dirty fix which might help a bit – put more of the sga_max_size into the buffer cache – it doesn’t happen automatically in 9.2 – and that may reduce the interaction between the O/S and Oracle and save a bit of latching and CPU.

If it is indeed 32-bit binaries then I’m not sure it’s possible to allocate more than 2.7G to SGA. If it is 64-bit then SGA increase should also be combined with the switching to direct IO and possibly to the huge pages if it’s Linux (hostname suggests so). But without fixing SQL it’s possible to make things worse: removing IO completely, the app will (try to) use more CPU, resulting in potentially worse latch contention and more problems.

What looks a bit suspicious to me in the report are the Parallel operations: there were 3/3 DDL/DML statements parallelized, 54 DFO trees parallelized, 48 queries parallelized. I’m not sure if these numbers are consistent – but I think that even 9 PX slaves on a 12 core system could utilize 75% of the CPU if they catch the wrong plan. It could be related to a DDL on an index, perhaps, changing its degree, although this is just a guess. Anyway, I think it’s worth checking whether Parallel Execution is acceptable on this system or not.

The original question came up in (at least) two places on OTN – this is the one I saw and responded to.

Your comments about me saying “AWR” when really I should have been saying “9i” are correct, of course. There is some OS information in 10g, though the presentation is different; in fact if you enable it you can get some OS stats appearing in the 8i and 9i Instance Activity.

My inference from the db file read figures is that since they’re so fast they must be coming from the file system cache – and if that’s the case then there’s going to be a big chunk of CPU being recorded against the OS time for file system manipulation. (I’ve assumed, of course, that the huge excess of memory, and the low version of Oracle, are good indicators of a simple set up.)

I think I made the point about confusing cause and effect in the OTN thread, but at this point I think it ought to be fairly clear which is which. If anyone doesn’t get your point, though, it’s worth mentioning that if you have a bit of a latch sleep problem and everyone starts chewing up all the CPU (without necessarily doing anything related to latches), then the latch sleeps get worse: a sleeper that has woken up takes longer to get to the top of the run queue because of all the processes ahead of it using up their time slices.

Good alternative explanation for how we could lose about 8 billion LIOs that took place in this time interval. The OP might benefit from running off the next two or three (or more) hours’ reports as well.

Predicate capture (statspack doesn’t do it): at this point, with such extreme problems, I’m only really interested in the big picture – are today’s (or this interval’s) execution plans the ones we’ve been seeing for weeks, or is today a very special day. I’ll worry about WHY there’s been a change after I’ve determined that a (massive) change has occurred.

Regarding SGA size: it may not be in the thread you saw, but the sga_max_size is set to 1.7GB, and the db_cache_size and shared_pool_size are set (and reported) as 300MB – it looks as if someone has allowed for manual adjustment up to the legal 32-bit max for the platform, but has not actually allocated it. So there seems to be about 1GB of memory doing nothing that ought to have been in the db cache to reduce the constant copying from file system to Oracle cache. But it is possible that an increase might make the wrong thing go faster – with the side effect (see “top of run queue” comment) that the thing we’re worried about might slow down even more.

In fact there’s a follow-up post from the same person comparing two consecutive days. I’m trying to find time to write some comments about it as “Statspack 13a”. It’s very educational.