A More Complex Statspack Example - Part 3

Doug's Oracle Blog

The more I compare these two reports, the more the workload looks different. In particular, as Gary pointed out in his comment to the last blog, our environment is committing a lot more transactions than the vendor's. (Remember those log file sync waits? The extra redo per second?)

Let's take a look at the various SQL statements that are running during this batch job to see if the differences are apparent. (This is a long section of the report, so I'll take it a few key statements at a time, focussing on the top 5 SQL statements ordered by Gets first.)

Most of the statements are duplicated in the two environments, but the bold statement in the vendor's environment doesn't appear in ours. As the top SQL by number of gets, that's a little worrying. What's even more confusing is that I expected to see more workload in *our* environment. Well, I've only looked at the top 5 SQL statements, so I'll keep looking further down, but to save us all from a 12-part blog series I'll narrow it down a bit.

All the evidence so far is that our environment is experiencing a higher transaction throughput, so I'll focus on INSERT, UPDATE and DELETE statements to try to work out where the additional activity is coming from. Note - the output below is limited to those statements.
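Something along these lines will pull out just the DML, ordered by gets - a sketch against V$SQL rather than the Statspack repository tables the report actually reads, so treat it as illustrative rather than the report's own mechanism:

```sql
-- Rough sketch: isolate INSERT/UPDATE/DELETE activity, worst first.
-- (Illustrative only - the Statspack report reads snapshot deltas
--  from the STATS$ tables, not V$SQL directly.)
SELECT sql_text,
       executions,
       buffer_gets,
       ROUND(buffer_gets / GREATEST(executions, 1)) AS gets_per_exec
FROM   v$sql
WHERE  UPPER(sql_text) LIKE 'INSERT%'
   OR  UPPER(sql_text) LIKE 'UPDATE%'
   OR  UPPER(sql_text) LIKE 'DELETE%'
ORDER  BY buffer_gets DESC;
```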

Okay, so at this stage it looks like our environment is handling activity that the vendor's system isn't. I'd already told the vendor that our workload looked higher, so we set up a conference call to discuss what those additional statements were doing. It turned out that when the vendor performs the test, the database works in isolation on this job (which is what we want), but when we run the job, our test environment has 10 application servers connected to it, performing normal maintenance activities. In other words, we're handling the normal 'background noise' of this application, whereas the vendor's test system is simplified.

I could wade through the rest of the report, but I think I'm going to leave it there for now. Problem solved - although this one took at least half a day of comparing the reports and sending an analysis to the vendor, followed by a meeting. Now we await their fixes to make our system perform better, but those fixes won't be to the main batch job - they'll be to the background processes.

50 million gets taking 1016s of CPU time is excessive. Do some SQL tuning on the statement to see if there are better index options to be found (a bit tricky, since there is an OR and a NULL in the SQL). Note that even 3,000+ gets per exec is not that bad. Re-sort your data based on gets per exec.
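For instance, a sketch of that re-sort against V$SQL (the SQL_ID column assumes 10g or later; against Statspack you'd compute the same ratio from the STATS$SQL_SUMMARY snapshot deltas instead):

```sql
-- Sketch: rank statements by gets per execution rather than total gets.
SELECT sql_id,
       executions,
       buffer_gets,
       ROUND(buffer_gets / GREATEST(executions, 1)) AS gets_per_exec
FROM   v$sql
WHERE  executions > 0
ORDER  BY gets_per_exec DESC;
```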

Without knowing the application, you cannot say that "50M gets taking 1016s of CPU" is excessive. For a start, you don't know how many rows were acquired, sorted and returned per execution. 3,000 gets per execution may be perfectly reasonable if the result set was a few hundred rows each time.

As it is, based on the apparent column names, the query probably has to do a tablescan - and tablescans can soak up a lot of CPU, especially when they are a few thousand blocks long.

One possible reason for the large difference in resource usage of that particular statement is that the vendor is testing on a database populated from an export file we sent them. It's likely that this table and its indexes, which suffer large-scale inserts and deletes, might not be as efficiently organised in our database as in the vendor's. I suspect that's the cause of some of the more subtle variations in the statistics.

It's only one possibility, though - one I'd planned to mention at some point - but the difference is quite dramatic, so it could be something else entirely.

The problem is that, until we can be sure we're testing the same workloads, the situation is too confusing, so I've shelved this for now - but that statement is definitely worth investigating further - cheers.

Another possibility, also relating to the export/import cycle and possibly affected by the extra workload on the production box, is the cost of both read-consistency and delayed block cleanout. A rough indication of this effect would come from cross-checking the three statistics "%undo records applied%".
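That cross-check is a one-liner against V$SYSSTAT (which holds system-wide cumulative values, so a Statspack report would show the per-interval deltas instead):

```sql
-- The three statistics matching the pattern:
--   data blocks consistent reads - undo records applied
--   transaction tables consistent reads - undo records applied
--   rollback changes - undo records applied
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE '%undo records applied%';
```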

I'm at work at the moment, so I need to go and run some release scripts.

Just at a glance, I think the big difference in 'consistent gets - examination' reflects the fact that our environment has a constant stream of background activity on objects completely unrelated to the batch job we're tuning.

Doug, given the very small number of application of undo records, compared to the huge volume of examinations, I suspect that most of the examinations are actually related to index root block accesses - possibly for foreign key referential integrity checks (unless you have some perfect single table hash clusters).

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.