CPU Usage Monitoring – What are the Statistics and when are the Statistics Updated?

22022010

February 22, 2010 (Updated March 7, 2010, August 10, 2010)

As mentioned in Chapter 8 of the “Expert Oracle Practices” book, there are several ways to measure the amount of CPU time used, whether at the operating system level, at the instance-wide level, at the session level, or at the SQL statement level. In a previous blog article I described why driving the database server’s CPUs to 100% utilization was a bad idea and provided quotes from several articles that disagreed as well as several that agreed with my statements (Cary Millsap’s follow-up comment should be read carefully).

I mentioned in the previous blog article that some CPUs will slow their core frequences if thermal thresholds are reached, which might be possible by pegging the CPU at 100% utilization for long periods of time. Intel offers the Intel Processor Identification Utility for the Windows platform (similar utilities for other platforms and non-Intel processors probably also exist) that shows the rated processor speed and the current CPU speed for each CPU/core in the server. For example, the following is the output from the Intel program when run on my three year old laptop:

Here is the same laptop, captured in a different time period. This particular slow down was not caused by a thermal threshold problem, rather it was created by setting the Windows power plan to Power Saver:

In the above, notice that the reported core speed for both CPU cores is 50% of the expected CPU speed, and even the system bus speed dropped slightly (I have seen lower CPU speeds and system bus speeds when the computer wakes from hibernation mode). (Edit: See the pictures at the end of the article for examples of slower speeds 8+ hours after coming out of hibernation mode.)

—

Let’s jump into Oracle now (we will use Oracle Database 11.2.0.1 and 11.1.0.6). There are several statistics in the database views that report CPU utilization. One such statistic is CPU used by this session. This is a potentially challenging statistic to use. From the documentation:

CPU used by this session: Amount of CPU time (in 10s of milliseconds) used by a session from the time a user call starts until it ends. If a user call completes within 10 milliseconds, the start and end user-call time are the same for purposes of this statistics, and 0 milliseconds are added. A similar problem can exist in the reporting by the operating system, especially on systems that suffer from many context switches.

CPU used when call started: The CPU time used when the call is started

If we try to understand the above, the CPU used by this session statistic is the time, in centiseconds, of CPU utilization while a user call is running on the server’s CPUs. The second statistic, CPU used when call started, indicates the value of the CPU used by this session statistic when the user call was started. The documentation leaves a couple of questions unanswered:

When are these statistics updated? If the CPU used by this session statistic is updated in real-time (if you are wondering, it is Not), the CPU used when call started statistic could be used to determine how long the session has been on the CPU (and not in a wait event and not in one of the other CPU run states), but if the statistics are not updated in real time, what is the point of having two separate statistics?

What is considered a user call – is it the full execution of a user’s SQL statement, or is it a fetch of a block from disk, or is it the completion of a fetch of some of the result rows, or is a user call something entirely different?

We need a test case to see what is happening with these statistics, and several of the other statistics that measure CPU utilization (warning, this test table may take an hour or longer to create).

Once the test table is created, disconnect from the database and then reconnect to the database to reset the session-level statistic values. Note that this test is performed with the FILESYSTEMIO_OPTIONS parameter set to SETALL to enable direct, asychronous I/O. Next let’s determine the SID for the session:

SPOOL CPU_STATS.TXT
SET PAGESIZE 2000
SET ARRAYSIZE 1000
SELECT
STAT_NAME,
VALUE
FROM
V$OSSTAT
WHERE
STAT_NAME IN ('SYS_TIME','USER_TIME','BUSY_TIME','IDLE_TIME','IOWAIT_TIME','NICE_TIME','RSRC_MGR_CPU_WAIT_TIME')
ORDER BY
STAT_NAME;
SELECT
STAT_NAME,
VALUE
FROM
V$SESS_TIME_MODEL
WHERE
STAT_NAME='DB CPU'
AND SID=552
ORDER BY
STAT_NAME;
SELECT
SN.NAME,
SS.VALUE
FROM
V$SESSTAT SS,
V$STATNAME SN
WHERE
SS.SID=552
AND SS.STATISTIC#=SN.STATISTIC#
AND (SN.NAME LIKE 'CPU%'
OR SN.NAME LIKE '%cpu%'
OR SN.NAME LIKE '%consistent%')
ORDER BY
SN.NAME;
host sleep 60
SELECT
STAT_NAME,
VALUE
FROM
V$OSSTAT
WHERE
STAT_NAME IN ('SYS_TIME','USER_TIME','BUSY_TIME','IDLE_TIME','IOWAIT_TIME','NICE_TIME','RSRC_MGR_CPU_WAIT_TIME')
ORDER BY
STAT_NAME;
SELECT
STAT_NAME,
VALUE
FROM
V$SESS_TIME_MODEL
WHERE
STAT_NAME='DB CPU'
AND SID=552
ORDER BY
STAT_NAME;
SELECT
SN.NAME,
SS.VALUE
FROM
V$SESSTAT SS,
V$STATNAME SN
WHERE
SS.SID=552
AND SS.STATISTIC#=SN.STATISTIC#
AND (SN.NAME LIKE 'CPU%'
OR SN.NAME LIKE '%cpu%'
OR SN.NAME LIKE '%consistent%')
ORDER BY
SN.NAME;
host sleep 60
...
SPOOL OFF

In the above script, repeat the three SQL statements (replacing 552 with the SID number returned by your query) between sets of host sleep 60 calls as frequently as required, possibly 30 times, so that the full run of the SQL statement that will be executed by the first session is captured (see page 213 of the “Expert Oracle Practices” book for a Windows equivalent of the Unix/Linux sleep command) .

The index hint is added in the above just to make certain that the IND_T1 index is used when we select 1.27% of the rows from the test table. Start the monitoring script in the second session and immediately start the above script in the first session.

When the first session completes, it should output something like the following (from the 11.2.0.1 test):

If we take the spool file output that was generated by the second session and calculate the delta values for the statistics, we find the following statistic deltas were captured during each 60 second time period for the 11.2.0.1 test run:

As evidenced by the above, the SQL statement completed in just less than 10 minutes in Oracle 11.2.0.1, while the SQL statement in 11.1.0.6 completed in just less than 4 minutes. Why the significant time difference? Maybe the above stats provide a clue?

The IDLE_TIME and BUSY_TIME statistics are server-wide with each of the server’s CPUs accumulating one second for every second of elapsed time, with the one second split between these two statistics. USER_TIME is a server-wide statistic that indicates how many CPU centiseconds were spent executing application/Oracle level activity. SYS_TIME is a server-wide statistic that indicates how many CPU centiseconds were spent executing operating system code (in kernel mode), typically on behalf of another program that usually accumulates CPU consumption time in the USER_TIME statistic. In the above, DB CPU is accumulated at the session level and must be divided by 1,000,000 to convert the values to seconds. The CPU used when call started and CPU used by this session statistics are at the session level and are reported in centiseconds. If we look at the first time period for the Oracle 11.1.0.6 execution, we are able to derive the following statistics:

(24,296 / 24,985) * 100 = 97.24% of the consistent gets were performed by fastpath (the statistic definition is apparently not documented – I suspect that it might mean that the blocks were read by direct path reads, thus bypassing the buffer cache – but I suppose that it might mean instead that multiple single-block read requests were combined into a single parallel read request)

The DB CPU statistic in the V$SESS_TIME_MODEL view and the various statistics in the V$OSSTAT view are updated in near real-time – the monitored session only consumed about 0.6 seconds of CPU time for every 60 seconds of execution time. You will notice that for both Oracle Database 11.2.0.1 and 11.1.0.6 the delta values for the CPU used when call started statistic and the CPU used by this session statistic appear as something other than the number 0 throughout the execution time, so the statistics must be updated at least once a minute in this test run, but that appears to contradict the results shown in Table 8-1 on page 201 of the “Expert Oracle Practices” book where those statistics were only updated 5 minutes and 35 seconds after a SQL statement was submitted (in the book, the CPU used by this session statistic was updated to show only the parse CPU time, until 5 minutes and 35 seconds elapsed). Is the book wrong, or are the above statistic deltas wrong, or maybe there is another explanation?

Fortunately, while the SQL statement executed, I also captured several statistic delta values from V$SQL for the SQL statement using my Toy Project for Performance Tuning, but a script like that found on page 216 of the “Expert Oracle Practices” book will also collect similar statistics. The captured deltas for the SQL statement on Oracle 11.2.0.1 are as follows (captured roughly every 60 seconds):

There were roughly 13 fetches in the first minute, roughly 15 fetches in the second minute, 14 fetches in the third minute, etc. The statistics in V$SQL are updated at least once every 6 seconds, so the displayed CPU time values (shown in seconds) are reasonably close to actual CPU consumption times. You might recognize that the above CPU Time delta values are very close to the delta values shown for the CPU used by this session statistic, so does that mean that the CPU used by this session statistic is updated every 6 seconds or less? That brings us back to the question, what is considered by the documentation as a “user call” in regard to these two statistics? Given the above evidence, a “user call” in this context actually includes fetch calls.

Because of the inconsistency with which the CPU used by this session statistic is updated (it might not be updated for more than an hour while executing certain SQL statements), it is a better idea to use the DB CPU statistic found in the V$SESS_TIME_MODEL view (available starting with the Oracle 10.1 release) when trying to determine how much CPU time a particular session is consuming. Of course operating system tools such as ps and top or my VBS script for Windows may also be used to determine which sessions are consuming CPU time. The V$SESSION and V$SESSION_WAIT views are also able to help you determine which sessions are currently consuming CPU time and roughly how much CPU time has been consumed since the last wait event ended.

But a parallel question remains, why is a particular session consuming CPU time (the above test case is not necessarily the best example for showing uncontrolled CPU usage, try this test case with the OPTIMIZER_FEATURES_ENABLE parameter set to 10.2.0.4 or lower at the session level)? The V$SESSION_LONGOPS and V$SQL_MONITOR (introduced in Oracle 11.1, requires a Tuning Pack license to access) views might provide enough information to determine why a session is consuming CPU time. A 10046 extended SQL trace could also potentially show why a session is consuming an excessive amount of CPU time, unless, of course, nothing is written to the trace files because the session is performing a CPU intensive operation completely in memory. What then? Multiple ORADEBUG SHORT_STACK traces (there is a small chance that creating one of these dumps will crash the session) for the session with the help of Metalink Doc ID 175982.1 could provide a clue of what is happening in the session. If that does not help find the cause of the CPU usage, you could just declare that the Oracle Wait Interface is Useless (Sometimes) and move along to something more productive (I suggest reviewing that blog series).

Added March 7, 2010:
I thought that the laptop was running a bit sluggish today. It has been in and out of hibernation on a daily basis for the last two weeks. I thought that I would check the CPU speed just for fun – this is what I received on 2 checks of the performance with the Power Plan set to High Performance and with the laptop plugged into the electrical power:

I guess that explains why this darn laptop feels so slow – as I write this update to the blog article I see that the core frequency has again dropped:

Is Intel trying to strong arm me into buying one of those Core i7 920QX laptops, or is Microsoft trying to get me to upgrade to Windows 7 on this laptop?

—————-

Update August 10, 2010:

The CPU speed has dipped to 0.210GHz several times, possibly caused by overheating:

Hints for Posting Code Sections in Comments

********************
When the spacing of text in a comment section is important for readability (execution plans, PL/SQL blocks, SQL, SQL*Plus output, etc.) please use a <pre> tag before the code section and a </pre> tag after the code section:
<pre>

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;
TODAY
---------
01-MAR-12

</pre>
********************
When posting test case samples, it is much easier for people to reproduce the test case when the SQL*Plus line prefixes are not included - if possible, please remove those line prefixes. This:

SELECT
SYSDATE TODAY
FROM
DUAL;

Is easier to execute in a test case script than this:

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;

********************
Greater than and Less than signs in code sections are often interpretted as HTML formatting commands. Please replace these characters in the code sections with the HTML equivalents for these characters: