Plans gone AWRy

This is a follow-up to a database performance problem which happened a couple of weeks back, first briefly mentioned in this previous article of mine.

You’ve probably heard it a thousand times before and there’s no significant variance in this case nor revelation in resolution:

The Business have reported a performance problem.

Some reports which normally take a couple of minutes have been taking closer to one hour.

Some DBAs / managers / developers know that when this problem has happened previously, if they flushed the shared pool, the problem mysteriously went away – so that’s what they’ve already done.

The first area of comment is the flushing of the shared pool.

Ok, so it’s a bit of a case of using a sledgehammer to crack a nut but the business have deadlines. With production performance problems and business deadlines, sometimes you’ve got no choice.

Then again, nobody knew it would work – fortunately, from some perspectives, it did work. But, until the root cause is identified and fixed, it’s always going to be uncertain. And “flush the shared pool” is always going to be the battle cry of the panicked, like “reboot your PC”.

Clearly, this is a system that can cope with a flushing of the shared pool. Not true of many.

So, the shared pool has been flushed and we’ve thrown the baby – the evidence & some of the causes – out with the bath water.

However, if you’re appropriately licensed on 10g onwards, you’ve got AWR and ASH to fall back on.

35 and 42 days are oft-mentioned sensible suggestions, mostly so that a) you’ve got a decent time buffer to react and b) you’ve got significant cover to allow for most business cycles of activity.

It’s set to 7 here which is a bit of pain.
Fortunately, I got access to the data within a couple of hours of requesting an account. But even so, I was working on other stuff at the same time and would occasionally think of things to double check but it was too late – the AWR and ASH data was long gone (also I’m writing this article about two weeks after I originally intended to and I wanted to post the queries and data from the actual problem but what with the disruption of the snow and Christmas, the data to match the queries has gone).

Anyway, you’ve been told something has been slow, the problem’s over now, the pool’s flushed, but people want an explanation – there’s an invASHtigation to be done.

In general, when I get access to production, I get data dictionary access. No access to underlying objects, no access to GUIs.

This is part of what I call a blind investigation.
I’m new to the system so I don’t know much about it, I’m not familiar with the SQL or objects, etc.
But I might have looked at some Statspack or AWR reports to get a general picture of activity.

DBA_HIST_ACTIVE_SESS_HISTORY is nearly always my starting place, or V$ACTIVE_SESSION_HISTORY if the data has not been flushed yet, to look for non-standard waiting.

First up though, you need to have asked some basic questions:

Is/Was the problem system-wide or limited to a few sessions?

If limited, limited to who? Connecting as which user? Connecting via which application?

What was the timeframe of the problem

If the problem is system-wide, identifying it from looking at the ASH data for particular sessions might be tricky.

If the database is not the source of the problem, then using AWR and ASH might not give any clues.

It’s not terribly scientific, but paging quickly through and aggregating the ASH data can often give a good “feel” for whether the database was experiencing any non-standard extra time in any particular timeframe (although comparing system-level pictures of activity like statspack and AWR reports might be better).

In this case, I opened up with a query a bit like this, targetting a small timeframe when the problem was almost certainly to have been occurring for the end-user:

Then adding a predicate to run a similar query targetted on the particular user/s experiencing the problem:

and user_id = <user_id>

Followed by zooming in on a particular session if one stands out in the data above:

and session_id = <session_id>
and session_serial# = <serial#>

With this particular performance problem, it was quickly obvious that this was a case of good plan gone bad with the affected sessions not blocked or waiting but mostly ON CPU running the same SQL_ID and the same SQL_PLAN_HASH_VALUE.

By changing the query to look at counts of a particular hash, it might be possible to find if it’s a particular plan hash that is causing a problem and when it came along:

One big, big downside of pulling the plan from AWR and not from v$sql_plan is that the access_predicates and filter_predicates columns are not populated. The columns are there but the information has not been copied across. A big oversight in the implementation surely.

Having got a bad plan and some not-so-bad plans (same sql id, different plan hash values), then we can start to look at the differences and what might have caused them.

I’m having trouble with uploading and/or presenting these plans in a way that works. WordPress won’t let me upload text files, the plans are 3000-odd lines long and the various options that I’ve tried so far do not give a satisfactory result.

Actual plans aside, suffice to say that there are significant differences between good and bad plans with the bad plans generally employing full table scans against a big table rather than index lookups but there can be an awful lot of distractions in a big plan.

So … so far… good plan, bad plan, same sql, obvious differences. Why? Where else can we look?

There’s one more place which might help – DBA_HIST_SQLBIND.
This should have captured the peeked binds used when our bad plan was parsed along with any binds captured at regular intervals. Unfortunately, we can only tie in with the SQL_ID as there’s no sql plan hash value here so we’re really looking at LAST_CAPTURED in conjunction with the dates/times from the other views.

So. In conclusion…

The details are a bit vague and I apologise for this. I took too long before starting this article and ended up with a good illustration of why 7 days is too short for your AWR retention. However, AWR & ASH can be a very good way of identifying the causes of plan changes.

In this case, it seems highly likely that a particular bit of sql got a bad plan.
It might not have been bad for the particular combination of binds which were peeked at at parse time, but at least bad for the majority of uses that day.

There is, sadly, still just an element of theory about the causes of this particular performance problem.
I can see the bad plan. I can see the plans which work better in that the performance is acceptable for everyone (better does not necessarily mean good). I can see the binds which produced the better plans. I can see the binds which produced the band plan. But, as yet, running the query with the binds which produced the bad plan is not producing the bad plan😦.

And that’s the outstanding problem – if you can’t reproduce something at will, how do you know if you’ve fixed something? Can you prove you’ve understood the whole problem? You can’t know. Not for sure.

I hope to figure out the best way to upload these plans in another article.

This particular problem is almost certainly a case of bind variable peeking – arguably where bind variables shouldn’t be used. On a reporting system where the reports typically take some minutes, the overhead of hard parsing shouldn’t really be a consideration – this seems to be a system where the mantra “you must use bind variables” has been applied inappropriately. The reports take in date parameters which might be historic date ranges or might just be the latest month – a case of skew, a case of one plan does not suit all, a case for literals.

Related

6 Responses to Plans gone AWRy

Just for curiosity was there a statistic gathering between good and bad ?

If there was I think using old stats with the bad plan binds might work

Did you also try awrsqrpt.sql instead of dbms_xplan.display_awr for plan over 3000 lines. I don’t have a huge plan so I can’t test it but if you have maybe you can tell if calling the report is working for those kind of plans ?