Supporting mission-critical production databases, an Oracle database
administrator constantly faces the challenge of being questioned when
applications run slowly in an otherwise well-optimized and stable
environment. Myriad factors can cause this degradation, including CPU
spikes on a database server. Above an acceptable range, such spikes impact
the response time and throughput of every application that the database
server serves. Drawing on years of experience, and on the assumption that
the database is guilty until proven innocent, our team has developed a
systematic approach. We utilize data from the Automatic Workload Repository
(AWR) and Active Session History (ASH) to identify the SQL statements that
suffer as a result of these spikes, and to determine which of them could be
the root cause. This article discusses this approach and illuminates
several discoveries using two real-life case studies.

Introduction

We closely monitor our production database systems, and CPU spikes are always
at the top of the watch list. Although CPU spikes vary in intensity and
duration, their impact is obvious: they point directly to contention for, and
starvation of, system resources, resulting in reduced database performance.
On a database server that is part of a RAC architecture, prolonged CPU spikes
can make the cluster database unresponsive, which may result in instance or
node eviction.

We have a very systematic and well-documented approach for building the
underlying infrastructure for a database. One of the key rules is to build and
deploy the hardware stack such that the peak baseline CPU utilization is never
more than 50 percent with no degradation in performance for the application
workload defined. Over the years we have found that this has helped us
deliver systems that meet or beat the high expectations of our management.
From a rudimentary capacity-planning standpoint this makes sense, but it
assumes that the application workload is well-defined and the application is
optimized. However, problems arise even in well-engineered systems. Several
times, with absolutely no code or workload change, application performance
degradation turned out to be the result of sudden CPU spikes.

With this in mind, as a DBA team we have developed a systematic approach to
troubleshoot database performance. This approach also provides compelling
evidence of whether anomalies were occurring inside the database during the
CPU spike. With Oracle Database 10g's introduction of the AWR and ASH
technologies, we formalized a series of steps to analyze their data with the
aim of nailing down the problematic SQL statements that could cause CPU
spikes. (By now, we should all be aware that AWR is a tool to collect
instance-level performance data, while ASH collects session-level performance
data!) By combining these two powerful tools, we have been successful in
unfolding CPU spike mysteries. The approach outlined can be used on both
10g and 11g databases.

Assumptions

In this analysis, the first assumption is that all external causes outside
the database have been eliminated, and that SQL statements are causing the
CPU spike on the database server. Several known processes running on a
database server may contribute to CPU spikes: the database backup process,
the OEM agent process, or other OS daemon processes. This can easily be
confirmed with system monitoring tools such as Oracle's OS Watcher, which
tracks the top CPU-consuming processes at the OS level during the CPU spike.

Secondly, our approach is built on top of performance data maintained
automatically by AWR and ASH. The CPU spike may have happened in the past,
or maybe it is an ongoing issue. Our approach works consistently for both
situations, assuming that there is enough AWR and ASH data available to
cover the CPU spike time, past or present. Both AWR and ASH are part of the
Oracle Diagnostics Pack, which must be licensed as a separate option.
Oracle's default retention policy keeps AWR data for seven days, but you can
adjust it to your needs.
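
As a minimal sketch, the retention window can be widened with the standard
DBMS_WORKLOAD_REPOSITORY package (the 30-day value here is just an example
we chose):

BEGIN
  -- Both values are in minutes: keep 30 days of AWR data and
  -- leave the snapshot interval at the default 60 minutes.
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
    retention => 43200,
    interval  => 60);
END;
/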

Methodology

Our team believes that the performance data in AWR and ASH at the CPU spike
time is skewed compared to the data at non-spike time. Our approach is a
symptom-driven, bottom-up technique that focuses on understanding the
outstanding variations. The challenge of using AWR data, however, always
lies in the fact that it sums up information at the instance level, where
symptoms and problems can both be involved. The first step in tackling any
CPU spike issue starts with one symptom: the SQL statement with the highest
CPU time (defined as the top CPU SQL). Be aware that the identified top CPU
SQL may or may not be the real problem. After you identify the top
CPU-consuming SQL, review other relevant AWR data looking for problem SQL
statements. Any finding from AWR data is then confirmed with the
corresponding ASH data: identify the sessions that ran the top CPU SQL, and
examine all aspects of their session-level information, such as wait events
and blocking sessions.

When using this approach, follow two basic guidelines:

Both AWR and ASH data are time-based data. If the top CPU SQL is not new
to the database, one should be able to compare and confirm findings from
AWR or ASH data across time.

Both AWR and ASH data are dimension-based data. Figure 1 and Figure 2
show the major dimensions in the AWR and ASH data of Oracle 10g, which
hold true for 11g as well. When the AWR and ASH data are examined together,
supporting evidence should show up from multiple dimensions in the AWR data
as well as in the ASH data.

To expedite analysis, our team created a series of SQL scripts to extract
data directly from the AWR and ASH data dictionary views, instead of
generating full, comprehensive AWR or ASH reports.

To summarize the steps involved:

For the CPU spike under investigation, first identify the begin snapshot id
and end snapshot id as the range of AWR and ASH data covering the CPU spike
time. The script awr_snapshots.sql takes the instance_number as an input
parameter and outputs the list of snapshot ids from which to choose.
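
A minimal sketch of a query in the spirit of awr_snapshots.sql (our
reconstruction, assuming the standard DBA_HIST_SNAPSHOT view; the actual
script may differ):

-- List AWR snapshots for one instance so that the snapshot ids
-- bracketing the CPU spike can be picked out by interval time.
SELECT snap_id,
       begin_interval_time,
       end_interval_time
  FROM dba_hist_snapshot
 WHERE instance_number = &instance_number
 ORDER BY snap_id;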

For the AWR data identified above, extract the top CPU SQL using the script
awr_top_cpu_sql.sql. It takes the instance_number, begin_snapshot_id and
end_snapshot_id as input parameters and displays the 10 SQL statements with
the highest CPU time in the view DBA_HIST_SQLSTAT. To get a rolling window
that shows the movement of SQL statements over time, the script works at the
base AWR snapshot interval and produces the list for every snapshot between
begin_snapshot_id and end_snapshot_id. The goal is to check whether a SQL
statement stands consistently at the top of the list during the CPU spike
but does not show up as a top CPU-consuming SQL during periods with no CPU
spikes.
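
A sketch of what such a per-snapshot ranking might look like (our assumption
of the script's shape, not the script itself):

-- For each snapshot in the range, rank SQL statements by CPU time;
-- a statement staying at the top across snapshots stands out.
SELECT snap_id, sql_id, cpu_time_delta
  FROM (SELECT snap_id, sql_id, cpu_time_delta,
               RANK() OVER (PARTITION BY snap_id
                            ORDER BY cpu_time_delta DESC) AS cpu_rank
          FROM dba_hist_sqlstat
         WHERE instance_number = &instance_number
           AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id)
 WHERE cpu_rank <= 10
 ORDER BY snap_id, cpu_rank;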

Once the top CPU SQL is identified, retrieve its run-time information from
the full AWR data. The script awr_sql_run_history.sql takes the
instance_number and sql_id as input parameters and outputs all records found
in the view DBA_HIST_SQLSTAT. Consider the impact of a SQL statement on the
database, as demonstrated by the values in the columns CPU_TIME_DELTA,
ROWS_PROCESSED_DELTA, BUFFER_GETS_DELTA, DISK_READS_DELTA, EXECUTIONS_DELTA
and PARSE_CALLS_DELTA. By comparing the values at CPU spike time with those
at non-spike time, you should be able to see whether the changes are
pronounced enough to make this SQL statement the problem. Otherwise,
consider it only a symptom.
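
A sketch of such a run-history query, assuming only the documented
DBA_HIST_SQLSTAT delta columns:

-- Full run history of one SQL_ID so spike snapshots can be
-- compared side by side with quiet ones.
SELECT snap_id,
       executions_delta,
       cpu_time_delta,
       buffer_gets_delta,
       disk_reads_delta,
       rows_processed_delta,
       parse_calls_delta
  FROM dba_hist_sqlstat
 WHERE instance_number = &instance_number
   AND sql_id = '&sql_id'
 ORDER BY snap_id;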

Check other dimensions of the AWR data to confirm the findings drawn from
the top CPU SQL identified above. The scripts awr_top_events.sql and
awr_load_profile.sql take the instance_number, begin_snapshot_id and
end_snapshot_id as input parameters and display the instance-level top five
timed events and load profile for review.
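
As a rough, simplified sketch of a top-events query (it ignores DB CPU from
the time model and assumes no instance restart within the range):

-- Wait times in DBA_HIST_SYSTEM_EVENT are cumulative, so take
-- deltas between consecutive snapshots, sum them over the range,
-- and keep the five longest-waited events.
SELECT *
  FROM (SELECT event_name,
               SUM(delta_micro) / 1e6 AS waited_secs
          FROM (SELECT snap_id, event_name,
                       time_waited_micro
                         - LAG(time_waited_micro)
                             OVER (PARTITION BY event_name
                                   ORDER BY snap_id) AS delta_micro
                  FROM dba_hist_system_event
                 WHERE instance_number = &instance_number
                   AND snap_id BETWEEN &begin_snapshot_id
                                   AND &end_snapshot_id)
         GROUP BY event_name
         ORDER BY waited_secs DESC)
 WHERE ROWNUM <= 5;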

Next, dig out the database sessions that ran the top CPU SQL at the CPU
spike time and check their session-level information from multiple
dimensions of the ASH data. The goal is to use the Oracle Wait Interface to
see whether those sessions were experiencing contention and, if so, where
that contention was coming from. The script ash_session_for_sqlid.sql takes
the instance_number, begin_snapshot_id, end_snapshot_id and sql_id as input
parameters and displays the matching sessions in the view
DBA_HIST_ACTIVE_SESS_HISTORY. You can then use the scripts
ash_session_for_sid.sql and ash_session_details.sql to drill down from the
session dimension to the wait event, blocking session and application
dimensions.
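
A sketch of the session lookup, assuming the documented columns of
DBA_HIST_ACTIVE_SESS_HISTORY:

-- Sessions sampled while running the top CPU SQL during the spike
-- window, with their state, wait event and blocker.
SELECT session_id, session_serial#, sample_time,
       session_state, event, blocking_session
  FROM dba_hist_active_sess_history
 WHERE instance_number = &instance_number
   AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
   AND sql_id = '&sql_id'
 ORDER BY sample_time;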

Finally, sum up the ASH data for all sessions at the CPU spike time to
confirm the findings about the top CPU SQL sessions. The script
ash_blocking_session.sql takes the instance_number, begin_snapshot_id and
end_snapshot_id as input parameters and rolls up the session count by the
columns EVENT and BLOCKING_SESSION in the view
DBA_HIST_ACTIVE_SESS_HISTORY, making it easy to detect whether a particular
session, wait event, or combination of both is a hot spot. The script
ash_top_events.sql displays the top 10 wait events based on the session
counts in the ASH data, which we then reconcile with the top five timed
events in the AWR data.
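
A sketch of the rollup (our reconstruction of the idea behind
ash_blocking_session.sql):

-- Count ASH samples by wait event and blocking session; a large
-- count for one (event, blocker) pair marks a hot spot.
SELECT event, blocking_session, COUNT(*) AS sample_count
  FROM dba_hist_active_sess_history
 WHERE instance_number = &instance_number
   AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
 GROUP BY event, blocking_session
 ORDER BY sample_count DESC;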

Having described the basics, we will now elaborate on our analytic technique
using the following two case studies.

Case Study 1

The monitoring tool alerted us that a particular database server was
experiencing CPU spikes above 90 percent. We notified the on-call DBA, who
started with the UNIX top command at the OS level to check what was
consuming significant CPU resources. Despite the difficulty of having more
than 10 oracle processes always staying at the top of the list, with their
process ids and positions shuffling up and down, the DBA finally traced them
down to certain database sessions and the SQL statements they were currently
running.
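
A common way to make that mapping (our sketch, not necessarily the exact
query the on-call DBA used) is to join V$PROCESS to V$SESSION on the process
address:

-- Map a hot OS process id reported by top to its database session
-- and the SQL it is currently executing.
SELECT s.sid, s.serial#, s.username, s.sql_id
  FROM v$session s, v$process p
 WHERE p.addr = s.paddr
   AND p.spid = '&os_pid';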

Then the real questions started. The SQL statements identified were all
legitimately issued from an application server. According to the application
support team, these SQLs were not new, and the database runs them on an
ongoing basis during the day. So why did the CPU spikes show up all of a
sudden at this particular time? The on-call DBA also verified that there was
no change in the SQL execution plans, and the wait event of those database
sessions was db file sequential read most of the time, indicating the
sessions were doing active I/O. Because the database appeared to be doing
what it was supposed to do, we used our top CPU SQL approach to solve this
puzzle before asking other teams to help diagnose the actual underlying
issue.

As shown in Listing 1, our team identified that the CPU spike started at
1:30 p.m., with begin snapshot id 176856 in AWR. As the issue was still
ongoing, we used DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT to create a new
snapshot, 176860, to record the current performance data into AWR.
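
Forcing an on-demand snapshot is a one-liner; a minimal sketch:

-- Create an immediate AWR snapshot so the ongoing spike window
-- gets a closing snapshot id (176860 in this case).
EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;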

Listing 1: Identify the begin and end snapshot id for the CPU spike time

As shown in Listing 2, the team took SQL vu7ytmpa3s5ek as the top CPU SQL
because it stayed consistently at the top of the list. We also decided, as
always, to work on the biggest problem first, while bearing in mind to come
back and check the SQL statements in second or third place, because they
could also have contributed to the CPU spike.

As shown in Listing 3, we pulled all the run-time information for SQL
vu7ytmpa3s5ek out of the AWR data, and the last few lines really popped out.
We saw a big surge in the number of Executions of this SQL around the time
the CPU spike started, and a huge jump in Buffer Gets thereafter. Reviewing
how this SQL had run in the past (Executions < 5 and Buffer Gets below nine
digits per snapshot), we suspected it was the problem. We confirmed this by
checking the other SQL statements, none of which revealed similar issues.

As shown in Listing 4, we also cross-checked the top five timed events in
AWR. The events read by other session, db file sequential read and
db file scattered read were in line with the heavy Buffer Gets observed.

Listing 4: Review the top 5 timed events at the CPU spike time in AWR data

To confirm, we further validated the finding with ASH data. Listing 5
displays the list of sessions found running SQL vu7ytmpa3s5ek at the CPU
spike time. Listing 6 shows one top CPU SQL session that we traced down. By
looking at the wait event and blocking session information, we observed
another problem when multiple sessions ran this SQL simultaneously:
contention. The events latch: cache buffers chains and gc buffer busy were
taken as signals of contention, again in line with the heavy Buffer Gets.
Listings 7 and 8 show the summarized wait event and blocking session
information for all sessions at the CPU spike time, which also pointed us in
the same direction.

After we identified the top CPU SQL as the problem, we identified which part
of the application it came from, as shown in Listing 9, and its SQL text and
execution plan, as shown in Listing 10. With all this detailed data in hand
to support our findings, we were confident moving forward to talk to the
application team.

Listing 9: Check the application-level information for the top CPU SQL session 885

Real-time troubleshooting tools are very helpful in revealing what is
currently happening. However, they do not provide enough information on what
happened in the past, which can blur the understanding of the results. By
combining AWR data and ASH data, this approach yields both historical and
real-time answers for troubleshooting issues.

Case Study 2

Another database server was found to have a CPU spike between 1 a.m. and
2 a.m. every morning. There was no monitoring alert because the spike stayed
just below the warning threshold. However, it produced a noticeable pattern
in the CPU graph. To take preventive action, our team was assigned to find
the root cause of this daily occurrence.

We selected one day's CPU spike time and identified SQL 37sgt41qpy0p8 as the
top CPU SQL. As shown in Listing 11, its run-time information actually told
a different story. The CPU time of this SQL was stable at 10 seconds at
non-spike time, while jumping up to 5,000 seconds at CPU spike time. While
the Buffer Gets and Disk Reads did show some increase at the CPU spike time,
their numbers were not significant compared to other SQL statements running
at the same time. The number of Executions, on the contrary, dropped
substantially at the CPU spike time. Considering that this SQL ran more than
2,000 times per AWR snapshot window at non-spike time, and that it processed
one row per execution (the value of Rows was always equal to that of
Executions), we concluded that this SQL was just a symptom, showing that it
could not run fast enough at the CPU spike time.

To get a hint of what else could be happening, we checked the top five timed
events in AWR, shown in Listing 12. From the events row cache lock and
library cache lock, we observed that there was contention as well as an
object lock issue.

Listing 12: Review the top 5 timed events at the CPU spike time in AWR data

We kept digging into the ASH data and identified one session (SID = 833)
that was running SQL ID 37sgt41qpy0p8 at the time of the CPU spike. As shown
in Listing 13, we confirmed it was running into row lock and library cache
contention, as indicated by its wait events. The blocking session
(SID = 837) was also identified, and we traced it down as shown in
Listing 14. We could see that this blocking session was causing the problem:
it blocked other sessions while being blocked itself. This was confirmed by
Listings 15 and 16, in which the wait event and blocking session information
for all sessions in the ASH data pointed to session 837 as the big blocker.

Listing 13: Trace down the top CPU SQL session 833 at the CPU spike time

Checking the application-level information for this session, as shown in
Listing 17, revealed that it came from a DBMS_SCHEDULER job. After reviewing
this job's schedule and its run log in the views DBA_SCHEDULER_JOBS and
DBA_SCHEDULER_JOB_RUN_DETAILS, we found a perfect daily match between the
run time of this job and the CPU spike time. As this was a daily object
statistics collection job, it explained both the CPU spike and the row
cache/library cache contention: the statistics collection invalidated
cursors that then had to be reparsed and reloaded into the SQL cache.

Listing 17: Check the application-level information for the top CPU SQL session 837
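
A hypothetical check (ours, not one of the original scripts; the job name is
a placeholder) to line up the job's runs against the nightly spike window:

-- Recent runs of the suspect scheduler job; compare
-- ACTUAL_START_DATE and RUN_DURATION with the 1 a.m.-2 a.m. spikes.
SELECT job_name, status, actual_start_date, run_duration
  FROM dba_scheduler_job_run_details
 WHERE job_name = '&job_name'
 ORDER BY actual_start_date DESC;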

Our approach has also proven efficient for CPU spikes that happened in the
past, when there are far fewer options to explore than when a CPU spike
issue is actually occurring.

Conclusion

AWR and ASH are great tools that Oracle provides for troubleshooting the
database and for determining and solving performance issues. Combined with
ASH data, AWR data becomes far more meaningful for diagnosing issues at both
the instance-wide and the session level. There are many ways to explore the
use of AWR and ASH data; our top CPU SQL approach to untangling CPU spike
issues is just one good example. Our team continues to learn about
performance issues, refining this method as new situations are encountered.