Problem Description
—————————————————
Everyone knows that the FILTER_PREDICATES and ACCESS_PREDICATES columns in DBA_HIST_SQL_PLAN are not populated.
If I remember correctly, this was originally a bug in 9i and statspack and has been carried forward into AWR.
As a result, any execution plan fetched using DBMS_XPLAN.DISPLAY_AWR has no predicate information
The statement which populates the underlying data has hardcoded NULLS.
Why do these continue to be null through all versions?
Is there any plan to address this?

“Can we see if a particular SQL execution in AWR used a baselined plan?”

Initial thoughts might be:

Q: Does DBMS_XPLAN.DISPLAY_AWR tell us this?
A: Apparently not. See below. This question could also be rephrased as two other possible questions:

Q:Isn’t there a column on DBA_HIST_SQLSTAT which tell us this?
A: No. You’d think there should be. There is a SQL_PROFILE column. There isn’t a SQL_PLAN_BASELINE column.
There also isn’t an EXACT_MATCHING_SIGNATURE although there is a FORCE_MATCHING_SIGNATURE.

Q: Is it in DBA_HIST_SQL_PLAN.OTHER_XML?
A. No although this is where DBMS_XPLAN.DISPLAY_AWR gets it’s notes about cardinality feedback and dynamic sampling from.

First of all, Let’s see that it’s not shown in DBMS_XPLAN.DISPLAY_AWR.
Let’s get a sql statement baselined, in AWR and not in memory.

From querying DBA_HIST_SQLSTAT I happened to notice that over a period of 65 days a particular query had 63 different plans (PLAN_HASH_VALUE).

I found that the differences were down done to the different internal temporary table names used for a materialized subquery, i.e. the plans were essentially the same but for the name of SYS_TEMP_0FD9D6786_B6EF87D2, SYS_TEMP_0FD9D6825_BE8671F, SYS_TEMP_0FD9D6684_EAD00827, etc.

Exactly what bug 10162430 addresses is unclear because it is listed as fixed in 11.2.0.3 and my different plan hashes come from an 11.2.0.3 database.

But the key information in the bug description highlighted that PLAN_HASH_2 is not affected by these differing temp table names.

Exactly what plan_hash_2 is and how it differs from plan_hash_value is undocumented but:

it is found in V$SQL_PLAN.OTHER_XML;

also in the AWR repository DBA_HIST_SQL_PLAN.OTHER_XML;

and is used as the enforcing plan hash in sql plan baselines – not exposed in DBA_SQL_PLAN_BASELINES but internally represented in column PLAN_ID in SYS.SQLOBJ$, SYS.SQLOBJ$DATA, SYS.SQLOBJ$AUXDATA.

So, if you regularly look at DBA_HIST_SQLSTAT to look at historic execution statistics of a SQL statement and your queries use materialized subqueries then I recommend adding PLAN_HASH_2 to the output, something like this:

If you have a problem with the automatic snapshots:
– Check whether you can manually snap using DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT.
– Check MMON and alert log / trace files for related messages.
– See Metalink note 1301503.1: “Troubleshooting: AWR Snapshot Collection issues”

This came up in a question on the OTN forums where the OP said that AWR was not snapping automatically but the GATHER_STATS_JOB seemed to be ok. And the obvious first question would be where did you get that idea from?
Sometimes you can’t win on the forums. If you ask a question, it’s not uncommon to be pointed to a google search. But if you google something, you can’t trust everything that’s out there – it gets outdated and sometimes it never was accurate.

This is a follow-up to a database performance problem which happened a couple of weeks back, first briefly mentioned in this previous article of mine.

You’ve probably heard it a thousand times before and there’s no significant variance in this case nor revelation in resolution:

The Business have reported a performance problem.

Some reports which normally take a couple of minutes have been taking closer to one hour.

Some DBAs / managers / developers know that when this problem has happened previously, if they flushed the shared pool, the problem mysteriously went away – so that’s what they’ve already done.

The first area of comment is the flushing of the shared pool.

Ok, so it’s a bit of a case of using a sledgehammer to crack a nut but the business have deadlines. With production performance problems and business deadlines, sometimes you’ve got no choice.

Then again, nobody knew it would work – fortunately, from some perspectives, it did work. But, until the root cause is identified and fixed, it’s always going to be uncertain. And “flush the shared pool” is always going to be the battle cry of the panicked, like “reboot your PC”.

Clearly, this is a system that can cope with a flushing of the shared pool. Not true of many.

So, the shared pool has been flushed and we’ve thrown the baby – the evidence & some of the causes – out with the bath water.

However, if you’re appropriately licensed on 10g onwards, you’ve got AWR and ASH to fall back on.

35 and 42 days are oft-mentioned sensible suggestions, mostly so that a) you’ve got a decent time buffer to react and b) you’ve got significant cover to allow for most business cycles of activity.

It’s set to 7 here which is a bit of pain.
Fortunately, I got access to the data within a couple of hours of requesting an account. But even so, I was working on other stuff at the same time and would occasionally think of things to double check but it was too late – the AWR and ASH data was long gone (also I’m writing this article about two weeks after I originally intended to and I wanted to post the queries and data from the actual problem but what with the disruption of the snow and Christmas, the data to match the queries has gone).

Anyway, you’ve been told something has been slow, the problem’s over now, the pool’s flushed, but people want an explanation – there’s an invASHtigation to be done.

In general, when I get access to production, I get data dictionary access. No access to underlying objects, no access to GUIs.

This is part of what I call a blind investigation.
I’m new to the system so I don’t know much about it, I’m not familiar with the SQL or objects, etc.
But I might have looked at some Statspack or AWR reports to get a general picture of activity.

DBA_HIST_ACTIVE_SESS_HISTORY is nearly always my starting place, or V$ACTIVE_SESSION_HISTORY if the data has not been flushed yet, to look for non-standard waiting.

First up though, you need to have asked some basic questions:

Is/Was the problem system-wide or limited to a few sessions?

If limited, limited to who? Connecting as which user? Connecting via which application?

What was the timeframe of the problem

If the problem is system-wide, identifying it from looking at the ASH data for particular sessions might be tricky.

If the database is not the source of the problem, then using AWR and ASH might not give any clues.

It’s not terribly scientific, but paging quickly through and aggregating the ASH data can often give a good “feel” for whether the database was experiencing any non-standard extra time in any particular timeframe (although comparing system-level pictures of activity like statspack and AWR reports might be better).

In this case, I opened up with a query a bit like this, targetting a small timeframe when the problem was almost certainly to have been occurring for the end-user:

Then adding a predicate to run a similar query targetted on the particular user/s experiencing the problem:

and user_id = <user_id>

Followed by zooming in on a particular session if one stands out in the data above:

and session_id = <session_id>
and session_serial# = <serial#>

With this particular performance problem, it was quickly obvious that this was a case of good plan gone bad with the affected sessions not blocked or waiting but mostly ON CPU running the same SQL_ID and the same SQL_PLAN_HASH_VALUE.

By changing the query to look at counts of a particular hash, it might be possible to find if it’s a particular plan hash that is causing a problem and when it came along:

One big, big downside of pulling the plan from AWR and not from v$sql_plan is that the access_predicates and filter_predicates columns are not populated. The columns are there but the information has not been copied across. A big oversight in the implementation surely.

Having got a bad plan and some not-so-bad plans (same sql id, different plan hash values), then we can start to look at the differences and what might have caused them.

I’m having trouble with uploading and/or presenting these plans in a way that works. WordPress won’t let me upload text files, the plans are 3000-odd lines long and the various options that I’ve tried so far do not give a satisfactory result.

Actual plans aside, suffice to say that there are significant differences between good and bad plans with the bad plans generally employing full table scans against a big table rather than index lookups but there can be an awful lot of distractions in a big plan.

So … so far… good plan, bad plan, same sql, obvious differences. Why? Where else can we look?

There’s one more place which might help – DBA_HIST_SQLBIND.
This should have captured the peeked binds used when our bad plan was parsed along with any binds captured at regular intervals. Unfortunately, we can only tie in with the SQL_ID as there’s no sql plan hash value here so we’re really looking at LAST_CAPTURED in conjunction with the dates/times from the other views.

So. In conclusion…

The details are a bit vague and I apologise for this. I took too long before starting this article and ended up with a good illustration of why 7 days is too short for your AWR retention. However, AWR & ASH can be a very good way of identifying the causes of plan changes.

In this case, it seems highly likely that a particular bit of sql got a bad plan.
It might not have been bad for the particular combination of binds which were peeked at at parse time, but at least bad for the majority of uses that day.

There is, sadly, still just an element of theory about the causes of this particular performance problem.
I can see the bad plan. I can see the plans which work better in that the performance is acceptable for everyone (better does not necessarily mean good). I can see the binds which produced the better plans. I can see the binds which produced the band plan. But, as yet, running the query with the binds which produced the bad plan is not producing the bad plan :(.

And that’s the outstanding problem – if you can’t reproduce something at will, how do you know if you’ve fixed something? Can you prove you’ve understood the whole problem? You can’t know. Not for sure.

I hope to figure out the best way to upload these plans in another article.

This particular problem is almost certainly a case of bind variable peeking – arguably where bind variables shouldn’t be used. On a reporting system where the reports typically take some minutes, the overhead of hard parsing shouldn’t really be a consideration – this seems to be a system where the mantra “you must use bind variables” has been applied inappropriately. The reports take in date parameters which might be historic date ranges or might just be the latest month – a case of skew, a case of one plan does not suit all, a case for literals.

Post 11g upgrade, one of the first things I noticed was in the AWR reports.

In the Instance Efficiency Percentages, the “Parse CPU to Parse Elapsd %” was a less than impressive “0.00”. I’ve become accustomed to this being low in reports here, but not that low ;-) .

This has been confirmed as a new bug but the support analyst was not hopeful of it being fixed at any point soon (periods of not months but years and 12g was mooted).

It’s not a big deal. The forumula for this ratio is an unsurprising 100*(parse time cpu / parse time elapsed), figures which can be found elsewhere in AWR (and statspack of course) in the Instance Activity Stats.

So, the figure that should have reported was 85% – still less than impressive.