CPU Capacity planning from OEM metrics

The CPU used by your Oracle Database is expensive because it is the metric used for licensing. The better you control and understand what you actually need, the more freedom you have to optimize costs. With instance caging, available in all editions, you can set a soft limit on CPU usage. This means that:

you run on a limited number of threads, and after a while this gives a good idea of what you really need, so you can forecast the capacity for a future consolidation.

you monitor the ‘resmgr: cpu quantum’ wait event, and if this throttling is high you can decide to scale up immediately, throttle some services, or do some query or design tuning.

In order to set instance caging, you need to define a value for CPU_COUNT according to past activity. This post is the detail behind the following tweet:
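As a reminder, instance caging takes effect once a Resource Manager plan is active and CPU_COUNT is set. A minimal sketch (the value 8 is just a placeholder, to be replaced by the result of the analysis below):

```sql
-- Instance caging requires an active Resource Manager plan.
-- DEFAULT_PLAN ships with the database and is enough to enable caging.
alter system set resource_manager_plan = 'DEFAULT_PLAN';

-- CPU_COUNT is dynamic: the instance is then caged to this number of threads.
-- 8 is a placeholder; derive the real value from the metric history.
alter system set cpu_count = 8;
```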

Oracle Enterprise Manager Cloud Control collects many metrics from the target databases and maintains a history of them with daily minimum, maximum, and average values. The Oracle documentation mentions the view MGMT$METRIC_DAILY, but I prefer MGMT_METRICS_1DAY because the ‘underscore’ views are not mentioned in the Licensing Information documentation, whereas the ‘dollar’ metric views require the Diagnostic Pack. In any case, these scripts are meant to be used with databases where the Diagnostic Pack is enabled.
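To see what is available, a quick look at the daily history can be sketched like this, in the SYSMAN repository. The metric name and metric column for the database CPU usage are assumptions to verify against MGMT_METRICS in your OEM version:

```sql
-- Daily metric history from the OEM repository (connect as SYSMAN or a user
-- granted access to the repository objects)
select t.target_name,
       d.rollup_timestamp,
       d.value_average,
       d.value_maximum
from   sysman.mgmt_metrics_1day d
join   sysman.mgmt_metrics      m on m.metric_guid = d.metric_guid
join   sysman.mgmt_targets      t on t.target_guid = d.target_guid
where  t.target_type   = 'oracle_database'
-- assumption: this is the "CPU usage per second" metric; check
-- MGMT_METRICS for the exact metric_name/metric_column in your release
and    m.metric_name   = 'instance_efficiency'
and    m.metric_column = 'cpuusage_ps'
order by t.target_name, d.rollup_timestamp;
```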

Average, Maximum or Percentile?

The goal is to define a CPU_COUNT which accepts the peak activity, so I’ll take the maximum value from the daily metric: the maximum number of sessions on CPU observed during each day. However, we may have experienced an issue where some queries take a lot of resources. If this is caused by an application bug or a bad execution plan, we need to fix it; we don’t want to pay for CPU to cope with these issues. So, rather than taking the maximum, I’ll look at a percentile. Here is a graph from 4 instances during 2 years:

Do you want to pay for 120 threads just because of one peak? Probably not. Taking the 99th percentile gives the value that copes with 99% of the days.
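The percentile itself is straightforward with PERCENTILE_DISC. A sketch over a hypothetical per-day CPU load source (daily_cpu is an assumed table or CTE holding one row per instance and day with the daily maximum):

```sql
-- daily_cpu(target_name, day, max_cpu_load) is a hypothetical source
-- holding the daily maximum number of sessions on CPU per instance
select target_name,
       max(max_cpu_load)                                          absolute_max,
       percentile_disc(0.99) within group (order by max_cpu_load) pctl_99,
       percentile_disc(0.90) within group (order by max_cpu_load) pctl_90
from   daily_cpu
group by target_name;
```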

My query for these instances returns the following: 125 threads is the maximum, but 74 threads are sufficient for 99% of the days observed, and 50 threads are OK for 90% of the days:

You see the same number for multiple instances because they are nodes for the same database.

RAC instances

Here is another example with two nodes of the same database. In addition to the peaks, where we can accept throttling for a short time, there is a service that has been relocated several times:

Then which CPU_COUNT do you want to set on each instance? The safest would be to accept, on each node, the possibility of running the load observed on the whole cluster. This can happen in the exceptional case where only one node remains, but it costs a lot. Rather than setting the sum of the cluster load, we can set the maximum observed, so that each node can accept a peak observed on one of the nodes. Or, if we don’t expect service relocation, or if we accept some response-time degradation in those cases, we can just set what has been observed on the node.
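These three options can be expressed as analytic functions over the instances of the same database for each day. A sketch, where cpu_load_per_day is a hypothetical CTE with one row per instance and day:

```sql
select database_name, instance_name, day,
       -- what the node itself has run: the cheapest option
       cpu_load                                              own_load,
       -- the peak seen on any node: each node can absorb a relocated peak
       max(cpu_load) over (partition by database_name, day)  max_among_cluster,
       -- the whole-cluster load: safe even if a single node remains
       sum(cpu_load) over (partition by database_name, day)  sum_among_cluster
from   cpu_load_per_day;
```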

The following query joins the daily metrics with some target information: “Line of Business” and “Department” to group the databases from a business point of view, and “LifeCycle Status” to differentiate production, test, and development.

The metric selected is the per-instance CPU seconds per second, which is the average number of sessions on CPU.

“CPU load/instance/day” does the join and calculates the value, plus the max and the sum among the cluster nodes. With a single instance, you don’t have to worry about this. With RAC, you may decide, in “Target CPU load/instance/day”, which one to use. Then “Proposed instance caging” calculates the percentiles. The main query, at the end, does a GROUP BY ROLLUP to add some aggregates per business and environment. Of course, you will customize it for your needs. My goal is to show the useful sources of information and how to mine them.
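Here is a sketch of that query. The metric name and column, the way the database name is derived from the instance name, and the repository object names are all assumptions to adapt to your OEM repository; the business-property join and the final GROUP BY ROLLUP are left out for brevity:

```sql
with
"CPU load/instance/day" as (
  select t.target_name                                instance_name,
         -- naive assumption: RAC instance names end with the node number
         regexp_replace(t.target_name, '[0-9]+$', '') database_name,
         d.rollup_timestamp                           day,
         d.value_maximum                              cpu_load,
         max(d.value_maximum) over (partition by
           regexp_replace(t.target_name,'[0-9]+$',''), d.rollup_timestamp)
                                                      max_among_cluster,
         sum(d.value_maximum) over (partition by
           regexp_replace(t.target_name,'[0-9]+$',''), d.rollup_timestamp)
                                                      sum_among_cluster
  from   sysman.mgmt_metrics_1day d
  join   sysman.mgmt_metrics      m on m.metric_guid = d.metric_guid
  join   sysman.mgmt_targets      t on t.target_guid = d.target_guid
  where  t.target_type   = 'oracle_database'
  and    m.metric_name   = 'instance_efficiency'  -- assumption: verify in MGMT_METRICS
  and    m.metric_column = 'cpuusage_ps'          -- assumption: CPU usage per second
),
"Target CPU load/instance/day" as (
  -- choose here which aggregation to cage on: cpu_load, max_among_cluster,
  -- or sum_among_cluster (see the RAC discussion above)
  select instance_name, database_name, day, max_among_cluster cpu_load
  from   "CPU load/instance/day"
),
"Proposed instance caging" as (
  select instance_name,
         max(cpu_load)                                          max_cpu,
         percentile_disc(0.99) within group (order by cpu_load) pctl_99,
         percentile_disc(0.90) within group (order by cpu_load) pctl_90
  from   "Target CPU load/instance/day"
  group by instance_name
)
select * from "Proposed instance caging"
order by instance_name;
```

Note that the metric value may be expressed in centiseconds per second depending on the release; check the unit and divide by 100 if needed before comparing with a number of threads.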

You may or may not like my way of writing SQL queries. I use Common Table Expressions to define the sources of data, name each column according to its role in the final result, and use natural joins because the column names clearly define the join columns. I find this very easy to develop and test step by step.

The “CPU load/instance/day” result can easily be exported to an Excel pivot chart to look at the whole picture, as above, before deciding which percentile and which cluster aggregation to use. You may even have to look at your logs to see whether the peaks are related to business activity (where you want to scale up) or to a problem (which you want to cage). And remember that the goal is to set a baseline for instance caging, which can easily be adapted later. Once you have run for a while with a controlled number of threads, you can consider consolidation and licensing optimization. And don’t forget to check whether you can reduce CPU_COUNT with some tuning. Please don’t hesitate to comment, here or on Twitter, with remarks or improvements.