Adaptive Thresholds in 10g - Part 1 (Metric Ba ...

Doug's Oracle Blog

[I really didn't want to get into another multi-part blog post, but this has grown longer than I hoped, so I'll split it up ...]

One of the more interesting components of the 10gR2 instrumentation improvements is the use of Metric Baselines and Adaptive Thresholds by OEM DB/Grid Control to generate alerts. (There are further improvements in 11g but I’ll address those in a later post.)

Although these features have been designed to be extremely simple to implement without understanding the mechanics, I suspect that most Oracle DBAs still struggle with the concept of implementing something without having some idea how it works (at least I hope so). There are some good technical documents floating around out there (see References below), but I feel they're at one end of the technical spectrum or the other, so this and the next post are intended to bridge the gap as well as showing the features in use. (Harder than you might think, but I'll come back to that.)

How can we use Metric Baselines and Adaptive Thresholds? I’ll try to put it as simply as possible :-

• Generate a baseline of how a system looks during ‘normal’ operation.• Generate alerts when specific metrics exceed threshold values relative to that baseline.

There’s quite a bit more to it than that, as I hope will become apparent.

But first, a warning. This is not about system optimisation – that happened during the design, development, test and initial implementation phases, right? ;-) – but about monitoring and capturing unexpected variance in performance indicators. Because even with an optimised application and hardware infrastructure that generally meets requirements ...

• … a new Execution Plan could be produced for a SQL statement because of an overnight change in object statistics.• … a failed cache battery somewhere in the storage infrastructure could cause a sudden increase in I/O times.• … you could encounter a larger number of concurrent users than the system has handled before or was designed for.

I've seen all of these and more and I’m sure you’ve experienced plenty of your own. So there's no denying that it’s important to implement an optimised system first and foremost but, even when you’ve done so, something will go wrong one day. I’m probably making up my own phrases for things again, but I think these tools cover the areas of Performance Monitoring and Troubleshooting, rather than Performance Tuning (or Optimisation).

For existing Production systems, I’d like a facility that detects performance anomalies in the single database instance that has a problem right now (out of the hundreds I’m managing) and notifies me so that I can focus my attention where it’s required.

You could argue that such a facility already exists - called Users. They will generally detect serious performance anomalies and they will notify us using the tried and trusted telephone. If this happens too frequently, a further indicator would be the endless crisis meetings with white boards and wildly differing theories. However, it would be nice if I knew there was a problem before someone had to tell me. Users also don’t tend to be very good at reporting overnight batch throughput problems until the next day!

Metric BaselinesI'll dig into Metric Baselines first because

1) Defining a Baseline is the first thing you have to do.2) Whilst configuring the Baseline might initially seem the simplest step, it's the foundation on which the threshold alerts depend. (That's why Oracle has tried to simplify it for busy DBAs.)

There are really two types of Baselines, which differ in the way the time period is defined; the way statistics are calculated and, consequently, their suitability for different uses.

A Moving Window Baseline uses recent data from the AWR repository over 7, 21, 35 or 91 days. As each day passes, the window on the data progresses forward by one day. The statistics are re-calculated on a regular basis (possibly as frequently as every hour depending on the Time Grouping). The effect of this is that the statistics change to reflect the recent workload and performance characteristics of the system.

A Static Baseline uses AWR data from a user-defined period which must be at least 7 days long. The statistics are calculated once, when you define the baseline, and are used forever until you switch to a new baseline. So Static Baselines don't make any allowance for the change in a system's work profile over time. If you have a steadily increasing number of concurrent users, you will eventually reach a stage where the system is alerting regularly because the work profile is so much greater than the Baseline.

Which is best for you depends on the characteristics of the system you're managing and what you're trying to achieve. I'd suggest that you need a Moving Window Baseline in most cases, unless you have very strict performance requirements and a very stable system that you don't expect to change over time.

One situation that almost demands the use of Static Baselines is playing around with this on your own setup at home. (In fact, this is just the first of a few difficulties I've faced playing around with this stuff because a single-user laptop is not the design target!) Think about it. For a Moving Window Baseline to make any sense, your system has to have been processing a 'normal' workload for at least the past 7 days, which is pretty unlikely on a laptop I switch off each night The design expects systems to be active on a more or less continuous basis, as most business systems are. So, in order to give me Metric Baseline statistics that I could re-use in future without needing ongoing continuous activity, I created a Static Metric Baseline covering last week. Why last week? Well, that comes to the next difficulty I faced. The Baseline period must have included enough activity on which to base the statistical computations used (see next post). Most weeks there probably wouldn't have been sufficient data on which to base the computations, but my laptop was more active during a week when I was teaching the course for two days and preparing in the evenings.

The best way to show you what I mean is to create a new Static Metric Baseline. First click on the Metric Baselines link at the bottom of various pages (e.g. Database Home page, Performance Page). If you don't have Baselines enabled, you'll be prompted to confirm that you want to enable them. When you click yes, DB/Grid Control will set the value of the _awr_flush_threshold_metrics hidden parameter to TRUE. (Unfortunately, this is thrown as a compliance error in DB Control 10.2.0.3 – see this Metalink Forum Thread relating to bug number 4749372)

There are 135 metrics in Oracle 10.2.0.4 (including some old friends like the Buffer Cache Hit Ratio!), which you can see by querying the V$SYSMETRIC_HISTORY view. e.g.

but there are only 15 used by Metric Baselines and for which you can set Adaptive Thresholds, which are persisted in the AWR repository when _awr_flush_threshold_metrics=TRUE

SQL> select distinct metric_name from DBA_HIST_SYSMETRIC_HISTORY
2 order by metric_name;
METRIC_NAME
----------------------------------------------------------------
Current Logons Count
DB Block Changes Per Txn
Database Time Per Sec
Enqueue Requests Per Txn
Executions Per Sec
Logical Reads Per Txn
Network Traffic Volume Per Sec
Physical Reads Per Sec
Physical Writes Per Sec
Redo Generated Per Sec
Response Time Per Txn
SQL Service Response Time
Total Parse Count Per Txn
User Calls Per Sec
User Transaction Per Sec

15 rows selected.

Actually, there are a further 7 metrics persisted in that view if you also have _awr_flush_workload_metrics=TRUE, although enabling Baselines doesn't do this by default, you won't be able to set adaptive thresholds for these metrics and you shouldn't be changing the value of hidden parameters anyway ... but worth mentioning should you ever see 22 rows returned by the previous query!

you need to click the 'Manage Static Metric Baselines' link at the bottom of the screen. (No, you won't find me disagreeing that the OEM web interface can be a frustrating search for the right screen/link/button/breadcrumb, but I've sort of got used to it because of some of the cool stuff it can do.) Click the, erm, Create button to create a new baseline and you'll see this screen.

Give it a name, pick a period of at least 7 days and then, if you want, you can just click OK and your Static Metric Baseline is created. The statistics will be computed and you'll be able to see your new baseline in the dictionary.

But that simple creation of a Static Metric Baseline ignored the 'Time Grouping' and 'Statistics Preview' sections that I, and I suspect others, have found a little confusing at first. So they deserve a post of their own - the next one ...

References

Although the volume of information out there isn't great, the quality of some of it is, so here's some further reading if you're interested in the subject.

Log Buffer #146: a Carnival of the Vanities for DBAs
Hello and welcome to the 146th edition of Log Buffer, the weekly review of database blogs. I have to make this a quick one, but I hope (as always) that the links give you the highlights of this week’s blogs. Oracle Let’s start with Jonatha...

Recurring Conversations: AWR Intervals (Part 1)
I've seen plenty of blog posts and discussions over the years about the need to increase the default AWR retention period beyond the default value of 8 days. Experienced Oracle folk understand how useful it is to have a longer history of performance metri

Quick question - in order to enable the moving window baseline, do you need to use EM, or is there a DBMS_ procedure that can also do this? I ask because we don't have EM on our Failsafe clustered production environment, but I'm keen to use a non-static baseline, and then further investigate administering the thresholds, again through the command line (so assuming the latter can also be done via DBMS_). Cheers much

Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.