UKOUG 2011 - Ash Outliers

Doug's Oracle Blog

My final UKOUG 2011 post is about another of my favourite presentations - "ASH Outliers: Detecting Unusual Events in Active Session History" by John Beresniewicz. (JB for short, but Marco Gralike made a fairly good stab at pronouncing his surname correctly during the introduction.)

I'd been looking forward to this presentation because I'd already been aware of his work in this area for a while and was supposed to be helping out (but I'll come back to that later ...). I'd also expected to see it in 2010 but JB withdrew the abstract later. Although I knew a lot of the content already I enjoyed it because of the subject and because JB's always likely to make me see something old from a new angle. Anyway, what was it all about?

One of the most important limitations of ASH data is that, because the data is sampled, it doesn't include every individual timed event that sessions waited for and is inherently biased towards longer duration (and more common) events. As long as you understand this design decision then you can understand sensible ways to use the data and also less sensible ways!

For example, it is not sensible to write queries against ASH data that SUM(time_waited) or AVG(time_waited) because the results look sensible but are fairly meaningless. If you sum all the wait times in ASH but ASH doesn't include all activity then what does the total value represent? Likewise, if you calculate the average 'log file parallel write' time in ASH for a specific period, it's virtually guaranteed that the result will be higher than the true average because ASH will tend to capture more long duration events than short ones.

Generally it's more sensible to use simple COUNT functions and make the assumption that a sample represents a second of time. It's obviously an approximation, but one that works surprisingly well as reasonably long experience of the OEM performance pages and ASH queries has shown me.

However, although the Oracle employees I've discussed this with encourage people to use the TIME_WAITED column in ASH with caution, I've personally found it useful on many occasions when ASH is the only session-level data available because I'm trying to diagnose a problem after the event. In fact, ASH can be extremely useful in that situation because, as well as having session-specific information, it contains it for all the sessions. So I confess that when querying ASH data using SQL, I've often found the TIME_WAITED of individual rows to be a useful indication of what's been going on if unusual or particularly long waits appear immediately prior to a serious problem. By looking at the series of events that led up to a problem, I can see the interactions between those sessions via blocking session information and at least build up an approximate view of what happened.

JB's presentation was essentially about a single query that he's been working on that is designed to look at the TIME_WAITED values in ASH data to identify outliers - events with unusually long wait times relative to the usual wait times for that event. The thinking being that the appearance of long wait times for certain events could indicate the root cause of serious system-wide performance problems. (Note that AWR data would be next to useless for this because the problems might only occur for a few seconds in the lead-up to the problem and that level of detail would be aggregated to insignificance in the wider AWR time scope.)

Having hopefully said enough technical things in this post now to keep grumpy people happy, allow me a humorous interlude. From the start of the presentation, JB kept alluding to a couple of things

a) The singular lack of response he'd had from people he'd shared the query with. Fortunately no names were mentioned but he was clearly griping about the limited response he'd had from me after sending me the query a while back and how long it had taken. What can I say? I'm *busy*!

b) Why the feedback is so important to him - because people in Oracle Development appear very interested in what customers are doing on their real Production systems. Even though Oracle do have their own internal Production systems, they're only a small fraction of what customers are doing so it's difficult to know what will work. In any case, I guess Oracle don't let developers play around on their production systems, like most companies.

So here's the deal. As JB mentioned, he is probably the least on-line guy there is. He might do DBA 2.0, but he doesn't do Web 2.0! So he suggested that I could get the script out there to other people to try on their own interesting systems and see whether they found it useful.

Note that this version references the V$ rather than GV$ version of ASH data so only uses officially supported features as JB alluded to during the presentation. i.e. a RAC version is possible but this is single-instance for now, just to let people give it a try.

To save a veritable deluge of mail (yeah, JB, you *wish*! LOL) it might be best to post any feedback on the script as comments here or I could forward it on.

Oh, and something I heard later that a few people had struggled with during the presentation was the concept of Significance Levels that are central to the script and the presentation. Fortunately, I already host a short document on the subject, again courtesy of JB, here. However, as that's more focussed on Adaptive Thresholds and time grouping in particular, you might find the discussion on this AskTom thread more generic and relevant to the ASH Outliers script.

All in all, a terrific presentation, as I expected and something that people might want to try out. It’s certainly an interesting concept and an attempt to automate something I’ve been doing manually for a few years. Thanks, JB!

Apologies for a late addition to this post. I attended John's presentation and was intending to run his script. Business has intervened and I have just got round to running it. I have results for four confidence levels, 0.99, 0.999, 0.9999 and 0.99999. I have enclosed the latter here and can supply all four in a zip file if required. I would be interested to find out how the data is used and the results of any analysis.

Hi Tony -
Thanks for posting your output from the script. As far as how the data is used and what conclusions might be drawn, it would probably be good to recall the motivating use case for developing the script in the first place:

RAC performance experts were noting that the EM database Top Activity page is not particularly useful for diagnosing performance problems where unusually long waits are implicated in cascade-effect performance issues. Top Activity and other EM pages show ASH data rolled up into one or more aggregation buckets, individual events are lost in the aggregation. So the query was designed to provide a way to select out events from ASH satisfying an event-independent measure of "unusually long" based on using v$system _event as representing the statistical distribution of event wait times, and defining "unusual" in terms of statistical rarity. Since the motivating use case assumed RAC, it was important to get the "gv$" version working for the UKOUG presentation.

So for the motivating use case, we would be hoping to pick up some unusually long events in one or more nodes that seem to be time-proximate to known performance events, perhaps also (but not necessarily) signaled by other unusually long events.

Now it seems we might do better than just the motivating use case, and perhaps that is what you are interested in. For instance, your output shows a lot of outliers in "db file scattered read" but the next question is "these may be unusual, but are they unexpected?" If we are looking for a 1-in-a-million event, but there have been 10 million events since we began looking, then we might well expect to see some of these sampled into ASH. So this is a bit more difficult perhaps but more generally useful. We could try to write a query that answers the question:

Which events/classes are showing larger than expected counts of highly unusual events sampled into ASH?

From your data there is not a lot of hard conclusions to draw, but it is interesting that the two very long (~5 second) events were both for "SQL*Net more data to dblink" and same SQL_ID both times. So perhaps this is expected or perhaps it is showing some issue in the application. Other than that there is some clustering of unusual events in time (same sample_id) but not in the extreme, usually only 2 or 3 sessions sampled together.

Thank you John for the explanation and analysis of the supplied data. I had thought the outliers were infrequent events that could be disguised within the granularity of the AWR snapshot period. It is interesting to learn the reasons behind the development of the query.
The results of the ASH outliers query has reinforced the AWR analysis I have performed on our system for user perceived performance issues. The same SQL ID have appeared in both reports. I shall be visiting the client site next week to perform further analysis and measure performance gain by making changes to some execution plans which these queries have highlighted as problem areas. The specific 'SQL*Net more data to client' event is due to our use of Multi Master Replication technology and represents the call to the deferred push job. This call can take a while when columns with a large data content are pushed to the secondary database server.
I have now added the ASH outliers query to my standard tool set. It is fast to run and the data produced is concise enough to pinpoint potential problem areas.
Thanks again for the blog post (Doug) and the query/analysis (John).

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.