Buggy overhead of SQL Plan Baseline Capture

There’s at least one bug related to the performance overhead of SQL Plan Baseline Capture, and ASH is less than clear when it comes to recursive SQL.

First up – apologies, I’ve not got a reproducible test case, because I flushed the shared pool between tests and the whole performance degradation scenario disappeared. One of the Metalink notes talks about the problem disappearing when the instance is bounced, so I should have known better than to touch the shared pool. Now I can’t reproduce it. I’m going to keep trying, but for now let me tell you what I’ve got (I am trying to be more precise in my blogging, but yet again it’s a choice between something woolly or nothing).

With that out of the way: we’re in the initial stages of an upgrade from 9.2.0.8 on 32-bit Solaris to 11.2.0.2 on 64-bit Linux – a one-step, all-or-nothing, change-everything-at-once upgrade.

The plan was to run with optimizer_features_enable set to 9.2.0 and various other parameters set to current production settings, and to capture the 9i plans into a baseline.
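
For illustration only – these are not the exact commands we ran, and the various other production-aligned parameters are omitted – the intended setup was along these lines:

-- Illustrative sketch of the intended settings (not the real change script)
ALTER SYSTEM SET optimizer_features_enable = '9.2.0';
ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = TRUE;
-- ...plus other parameters set to match current production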

Initial runs of “The Batch” started, and all processes were reported as slow. The major initial contributor turned out to be recursive SQL related to SQL Plan Baseline Capture (see bug #11600319, also bug #11719151 and Doc ID 1295054.1 – thanks to Timur Akhmadeev for the pointers).

I haven’t yet raised this with Oracle Support because a) I can’t face the wait for a response and b) local red tape means that patches need to be approved by the DB Engineering team and approval for non-production environments is difficult (I’m not going there, not now).

Bug #11719151 – “RECURSIVE MERGE WHEN optimizer_capture_sql_plan_baseline=true DOING FTS” – is certainly relevant.
The MERGE statement into SQLOBJ$AUXDATA does use a FULL TABLE SCAN. But if I were to show you the AWR reports, you’d see that the MERGE statement into SQLOBJ$DATA was the bigger problem and that the MERGE statement into SQLOBJ$ was also not insignificant.
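
If you want to go hunting for these recursive MERGE statements in your own AWR data, something along the following lines should find them (a sketch only – restrict by snapshot range and DBID as appropriate):

-- Sketch: locate the recursive SPM MERGE statements captured in AWR
SELECT st.sql_id,
       SUM(ss.elapsed_time_delta) / 1e6 AS elapsed_secs,
       SUM(ss.executions_delta)         AS executions
FROM   dba_hist_sqlstat ss
       JOIN dba_hist_sqltext st
         ON st.sql_id = ss.sql_id
        AND st.dbid   = ss.dbid
WHERE  UPPER(st.sql_text) LIKE 'MERGE%SQLOBJ$%'
GROUP  BY st.sql_id
ORDER  BY elapsed_secs DESC;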

Pick a metric, any metric: these recursive MERGE statements were 3 of the top 10 by Elapsed Time, 2 of the top 5, and 1 of the top 1.

If I move over to the ASH data and give an example of what it was showing, then we can go and look at those SQLOBJ$ tables for the SQL concerned.

The ASH data – probably my favourite feature of the last few years – I’m only displaying a little bit of, in terms of both height and width. But it illustrates the second point I wanted to highlight: recursive SQL is not properly reported. Whilst the SQL_ID column differs from the TOP_LEVEL_SQL_ID (which I’ve omitted to keep the output slim), it is still the SQL Id of a nested lookup, not the SQL Id of the recursive MERGE which you might have expected – I can see pros and cons of both, I suppose.
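
The sort of query I was running against ASH was roughly as follows (a sketch – I trimmed the column list heavily to keep the output narrow):

-- Sketch: recent ASH samples, showing SQL_ID alongside TOP_LEVEL_SQL_ID
SELECT sample_time,
       session_id,
       sql_id,
       top_level_sql_id,
       event
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 1/24
ORDER  BY sample_time;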

By the way, I was also a little surprised to see direct path read/write operations in the ASH data, but these SQLOBJ$ tables that we’re constantly MERGING into recursively have LOBs, and further investigation revealed those events to be associated with the LOB segments.
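
A quick way to confirm the LOB columns on those tables is something like this (the SQLOBJ$% tables are undocumented SYS internals, so treat this as exploratory only):

-- Which of the SQLOBJ$ tables have LOB segments?
SELECT table_name, column_name, segment_name
FROM   dba_lobs
WHERE  owner = 'SYS'
AND    table_name IN ('SQLOBJ$', 'SQLOBJ$DATA', 'SQLOBJ$AUXDATA');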

As you can see from the above, there are lots of MERGE operations which have nothing directly to do with SQL_ID d99anrdgbb4jf, which looks like the statement below. By the time I got round to pulling some of this data together for this post, it had gone from V$SQL so I’ve fetched it from AWR.

Baselines are stored according to signature. SQL statements have hash values, SQL ids and also exact/force matching signatures. By the way, the latter can be derived from the text of the SQL statement using DBMS_SQLTUNE.SQLTEXT_TO_SIGNATURE, with the FORCE_MATCH parameter determining whether you get the exact or the force matching signature.
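
A minimal sketch of that translation – the statement text here is just a placeholder:

-- Derive the exact and force matching signatures from a piece of SQL text
-- (set serveroutput on to see the result)
DECLARE
  l_sql   CLOB := 'select * from t where n = 1';  -- placeholder statement
  l_exact NUMBER;
  l_force NUMBER;
BEGIN
  l_exact := dbms_sqltune.sqltext_to_signature(l_sql, FALSE);  -- exact matching
  l_force := dbms_sqltune.sqltext_to_signature(l_sql, TRUE);   -- force matching
  dbms_output.put_line('exact: ' || l_exact || '  force: ' || l_force);
END;
/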

In AWR, we’ve only got the FORCE_MATCHING_SIGNATURE of this SQL statement (sql_id ‘d99anrdgbb4jf’), but take it from me that it was the same as the EXACT_MATCHING_SIGNATURE.

I used this signature to see what was happening with the baseline data.
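
Something along these lines, with the signature as a bind (the SYS.SQLOBJ$% tables are undocumented internals, so the dictionary view is the safer starting point):

-- Baseline metadata for a given signature
SELECT signature, sql_handle, plan_name, enabled, accepted,
       created, last_modified, last_executed
FROM   dba_sql_plan_baselines
WHERE  signature = :sig;

-- And the underlying SYS tables being MERGEd into by the capture code
SELECT * FROM sys.sqlobj$         WHERE signature = :sig;
SELECT * FROM sys.sqlobj$data     WHERE signature = :sig;
SELECT * FROM sys.sqlobj$auxdata  WHERE signature = :sig;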

There’s not much point in me pasting the outputs of those tables. I could see that SYS.SQLOBJ$.LAST_EXECUTED, SYS.SQLOBJ$AUXDATA.CREATED (!?!) and SYS.SQLOBJ$AUXDATA.LAST_MODIFIED were being updated, although not the additional metrics like BUFFER_GETS, EXECUTIONS, etc. in SYS.SQLOBJ$DATA.

But then I flushed the cache and like that … the performance degradation was gone and not reproducible… for now.

That’s all there is really.

Oh, by the way, I did trace one session while it was reproducing. Below are some of the relevant bits.
I was using a simple slow-by-slow lookup because I was then going to move on to concurrent sessions, which initially were causing me problems with SMO row cache lock contention – quite possibly related again to the baselines, but I never got that far into the investigation before we flashed it all back and turned off capture.
Also, looping lookups were relevant to one of the bug notes above and are very much representative of some of the legacy Pro*C codebase.
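
For what it’s worth, the test driver was nothing more sophisticated than a lookup in a loop. The sketch below is hypothetical – the table, column and iteration counts are made up, and the real code is legacy Pro*C rather than PL/SQL – but it’s the shape of thing that, with capture enabled, was generating all that recursive MERGE activity:

-- Hypothetical slow-by-slow lookup loop (names and counts invented for illustration)
DECLARE
  l_val NUMBER;
BEGIN
  FOR i IN 1 .. 10000 LOOP
    BEGIN
      SELECT some_column
      INTO   l_val
      FROM   my_lookup_table
      WHERE  id = MOD(i, 100);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN
        NULL;  -- ignore misses, just keep looping
    END;
  END LOOP;
END;
/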

Last November we upgraded our Oracle Apps 11.5.10.2 system from RDBMS 9.2.0.8 to 11.2.0.2. It has been something of a nightmare. Granted, the fault was ours. We didn’t test the upgrade enough. Several other emergencies slammed us at the same time and database testing fell through the cracks. My bad. But here is a quick summary of what happened:

In a nutshell – the performance of the entire system was terrible. Everything was sluggish. It seemed to be systemic rather than isolated to any single process or set of processes.

The first problem we noticed involved reading the data dictionary. When we started up development tools like SQL Navigator or Toad, they tried to pre-load parts of the schema, and it took minutes just to open the tools. Searching for objects within the database also seemed to take forever. We found a bug in the ALL_SYNONYMS view. We ran the fix in the note Selects Against ALL_SYNONYMS Very Slow in 11.2.0.1 and 11.2.0.2 [ID 1242844.1] and this seemed to help.

But the system was still slow.

Next, and having nothing to do with overall performance, we found an optimizer bug that was causing certain WHERE clauses with ORs and repeated column names to return incorrect results (generally nothing). This was breaking a couple of forms that couldn’t query. We fixed this via bug 9776940 and a hidden parameter, “_projection_pushdown” = FALSE.
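
The workaround looked something like the following – hidden parameters should only be set on Oracle Support’s advice, and depending on the parameter it may need to go into the spfile and take effect at the next restart:

-- Hidden-parameter workaround for the wrong-results bug (per Oracle Support guidance)
ALTER SYSTEM SET "_projection_pushdown" = FALSE SCOPE = SPFILE;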

But the system was still slow.

Next, we noticed that the server was doing an incredible number of virtual memory page-ins – at a rate of 4,000 to 6,000 pages per second. There was no apparent reason for this, since the servers seemed to have plenty of free RAM and swap space. After much looking by several people, we discovered that the ASO_ORDER_FEEDBACK_T queue table had grown very large and the new 11g database was doing full table scans on it at an alarming pace – basically continuous full table scans, again and again and again. Since they were full table scans, the table was being read in from disk every time, and as it was read into the SGA it was also being read into swap. We cleaned up the table with How to Purge the ASO_ORDER_FEEDBACK_T Table In Minimum Time Using Asoqueue Script? (Doc ID 740305.1) and the page-in rate dropped to a very small number.

This greatly improved the situation. But the system was still slow.

Then we started investigating one process – Journal Import. On our old 9i database a particular test ran in about 2 minutes. On the new 11g database it took 50 minutes. Oracle Support said that nothing was wrong. We disagreed. After lots of back and forth, I was looking at the trace file and noticed three different MERGE INTO statements on SQLOBJ$DATA and SQLOBJ$AUXDATA. I didn’t know what these tables were, but a quick Google search found that they were part of the SQL Plan Management capture process. When I added up the times for these three MERGE INTOs it came to 46 minutes; adding that to the 2 minutes the process should have taken gave almost exactly the actual run time of 50 minutes. Once I found this, it was trivial to find that SQL plan capture has a bug in it – Bug 11719151: MERGE STATEMENT USED WHEN SPM CAPTURE CAUSING PERFORMANCE ISSUE. We turned off capture by setting optimizer_capture_sql_plan_baselines=FALSE and performance improved dramatically. We are working on applying the patch to fix this properly, but that will have to wait until after quarter-end processing.
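
For reference, turning capture off is a one-liner, and optimizer_capture_sql_plan_baselines is a dynamic parameter so no restart is needed:

-- Disable automatic SQL plan baseline capture system-wide
ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = FALSE;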

The moral of the story – there is a good reason why they say to test your upgrades before going into production 🙂