Find Objects with the Greatest Wait Time – What is Wrong with this Quote?

28062011

June 28, 2011

While reading the alpha edition of the “Oracle Database 11g Performance Tuning Recipes” book, I noticed a couple of interesting SQL statements in Recipe 5-17 that just did not look right. One of those SQL statements, slightly reformatted, appears below:

The above SQL statement is supposed to retrieve a list of object names that experienced the most wait time in the last 15 minutes. What is wrong with the query? Consider that a slight variation of the above query is found in the following locations:

Page 160 of the book “Oracle Tuning Power Scripts: With 100+ High Performance SQL Scripts”

So, what is wrong with the query? OK, the table aliases are inconsistent, but that is too easy. Let’s refer to the Oracle Database 11.2 documentation for the V$ACTIVE_SESSION_HISTORY view.

“V$ACTIVE_SESSION_HISTORY displays sampled session activity in the database. It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn’t belong to the Idle wait class. Refer to the V$EVENT_NAME view for more information on wait classes.

This view contains one row for each active session per sample and returns the latest session sample rows first. A majority of the columns describing the session in the active session history are present in the V$SESSION view.”

Still not seeing the problem? If we believe the documentation, assume that we are only interested in a 10 second time interval, and one of the sessions started waiting on a wait event at the start of the time interval, and was still waiting in the same wait event at the end of the 10 second time interval. SUM(A.WAIT_TIME + A.TIME_WAITED) for the session (assuming that rounding does not happen due to view update frequency) would show 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55 seconds of total wait time for the result of the SUM function. Not too bad, right? Now, assume that the session had already waited 240 seconds and continued to wait for an additional 10 seconds – the SUM(A.WAIT_TIME + A.TIME_WAITED) for the session would show that the session waited 240 + 241 + 242 + 243 + 244 + 245 + 246 + 247 + 248 + 249 + 250 = 2695 seconds (44.92 minutes) in that 10 second time interval.

In Session 2, attempt to insert the same row into the table (a primary key violation if the first session commits):

INSERT INTO
T1
VALUES (
1);

Session 2 is now hung. Back in Session 1, wait 20 minutes and check AWR ASH using a slightly fixed version of the query (mainly the table aliases are corrected, and adjusted to examine only the last 10 seconds of entries in V$ACTIVE_SESSION_HISTORY for the two sessions of interest):

Well, that throws off the results of the query found in the book a bit. The CURRENT_OBJ# is -1 and the session has not waited any time (SUM(A.WAIT_TIME + A.TIME_WAITED) = 0 for SID 4, but would be 116,972,415 for SID 189 for the same CURRENT_OBJ# = -1).

Maybe the query found in the book just does not work for potential primary key violations? What about a case where one session updates a row and another session attempts to update the same row? Issue a COMMIT in Session 1 and Session 2. In Session 1:

COMMIT;
UPDATE
T1
SET
C1=C1;

In Session 2, attempt to update the same row:

UPDATE
T1
SET
C1=C1;

Session 2 is hung while waiting for Session 1 to COMMIT or ROLLBACK. Now wait for a while. While you are waiting, let’s take a quick look at the documentation for the V$ACTIVE_SESSION_HISTORY view from the 11.2 Oracle Documentation library:

“TIME_WAITED: If SESSION_STATE = WAITING, then the time that the session actually spent waiting for that event (in hundredths of a second). This column is set for waits that were in progress at the time the sample was taken.

If a wait event lasted for more than a second and was caught waiting in more than one session sample row, then the actual time spent waiting for that wait event will be populated in the last of those session sample rows.”

OK, must check the SESSION_STATE column, only the last of those sample rows in a string of rows for the same wait event, and the time is in 1/100th (0.01) of a second. Let’s re-run the query in Excel to see the progress:

That 2,471,728,088 number in the TIME_WAITED column appears to be in conflict with the description in the Oracle Documentation library – the time appears to be reported in 1/1,000,000 (0.000001) of a second, not 1/100 of a second.

—

Back to the original topic of this article. I have yet to determine why the query subtracts 30/2880 from the SYSDATE to retrieve the activity in the last 15 minutes – wouldn’t it be more clear to subtract 15 / (24 * 60) or 15 / 1440? Considering that the SAMPLE_TIME column in V$ACTIVE_SESSION_HISTORY is a TIMESTAMP datatype, would it make more sense to use SYSTIMESTAMP rather than SYSDATE?

Finally, What is Wrong with the Quote Found in the Book?

The point of blog articles like this one is not to insult authors who have spent thousands of hours carefully constructing an accurate and helpful book, but instead to suggest that readers investigate when something stated does not exactly match what one believes to be true. It could be that the author “took a long walk down a short pier”, or that the author is revealing accurate information which simply cannot be found through other resources (and may in the process be directly contradicting information sources you have used in the past). If you do not investigate in such cases, you may lose an important opportunity to learn something that could prove to be extremely valuable.

16 responses

After seeing your response I thought that I made a mistake in the title of this article, so I double-checked. The book states just prior to the query: “Execute the following query to find out the objects with the highest waits”

By the way, excellent point – even if I was looking for a bit of a different answer. This chapter of the book needs a complete re-write – I have close to 8 pages of review notes (accuracy adjustments, so to speak) just for chapter 5.

Charles hit upon the main point, which can be easily missed, that the event fields are only relevant for rows that indicate waiting. ASH doesn’t always clear out the values in other fields, which leads to another confusing issue. The object fields

CURRENT_OBJ#
CURRENT_FILE#
CURRENT_BLOCK#

are only valid for certain waits, which are AFAIK

IO waits
buffer busy waits
enqueue tx waits

and even for these waits the values may be missing, -1, or 0 and in those cases the only way to get the objects is from the file and block #s, using p1 and p2 for most of the user I/O waits:

Using sum for v$active_session_history is a bad thing – you can’t summarize samples it has to be counted instead – that’s first issue.
TIME_WAITED column has different meaning depend on session_state (ON_CPU or WAITING) and if one event (wait) is longer than sample period (1s)
waited time of one event will be added more than once.

Very interesting document that you linked to – if I remember correctly, the author of that document was also heavily involved in the development of ASH at Oracle Corp. I probably learned a few things from reading that document. A couple of items that stood out, which are related to this blog article:
Slide: “ASH: Top IO SQL”
Slide: “ASH data gotcha’s”: Beware of Obj#, File#, Block# (not cleared); Wait time vs Time waited
Slide: “ASH: Bad SQL”: select sum(a.time_waited) total_time
Slide: “ASH: WAIT_TIME vs TIME_WAITED”

The “ASH: WAIT_TIME vs TIME_WAITED” slide certainly explains the Excel screen captures that appear in the blog article (except for the unit of measure inconsistency).

Using COUNT rather than SUM… I should have thought of that, but then that is part of the reason why reader comments are requested for my blog articles :-)

ASH is generating sample every second so if your session was locked for let say 10 s with any event event you will see 10 rows in ASH with same session_id and sql_id – of course we can’t be sure if all samples are related to same execution of SQL_ID but from statistic perspective we can ignore it

This query will show you most session utilization profile for last minute – wait time (cnt column) are not exact but if you will sample it for quite long period measure error related to sampling frequency will be decreasing over time.

IMO Macin hit the nail on the head. ASH is good for counting and statistical approximation.
I basically stay away from timings in ASH, though they can be of occasional use.
I especially stay away from wait_time, which, as I understand it , should have been left out of ASH and is basically the basis for “On CPU” or “Waiting”.
TIME_WAITED can be used but keep in mind, wait times will be skewed towards the longer waits as shorter waits will be caught less frequently.

If one is concerned with the top objects then look at the counts of waits on those objects. Waits on objects are only going to be instrumented for 3 types of waits AFAIK

You also have to target not only the wait events, but actual waits as opposed to ASH rows that were for CPU but still have left over data in other columns from previous waits.
For IO, you could do something like

Kyle,
Wow, great answers and examples. In the second of the above SQL statements you have a column tcnt in the ORDER BY clause – is it OK to replace that column with aas?

Martin,
Thanks for putting together the summary. For some reason WordPress required me to authorize Kyle’s comments and your comment, thus delaying the posting of the comments by an hour, so that explains why your summary does not include any of the extra points made by Kyle.

—

No comments yet from anyone about SAMPLE_TIME being a TIMESTAMP column and any potential problems with comparing it to SYSDATE in the SQL statement at the start of this article? Are there potential issue… what about a remote DBA in another time zone?

>>is it OK to replace that column with aas?
yep, error in script , thanks for the catch. I think tcnt (total count) is a vestige from an original version before I shifted it to AAS
>>SAMPLE_TIME being a TIMESTAMP column and any potential problems with comparing it to SYSDATE
One trick that has help me in queries is
cast(sample_time as date )
not that it makes dealing with timezones any easier but it makes timestamp calculations easier

Charles,
I’ll ry a little summary so far:
* you showed how a _single_ wait can be SUMmed multiple times. Do you think the max WAIT_TIME or TIME WAITED in a direct series of ( equal EVENT _AND_ [SEQ# constant increasing] ) can solve that particular issue? (or at least reduce the risk)
* we know now, “ON CPU” is not an object
* the document refuses any information about current_obj#, it the wait is not part of “application, cluster, concurrency, and user I/O wait events”
* the documentation seems to be wrong in TIME WAITED as “hundredths of seconds”
* Niall is right, “objects don’t experience waits” – but let’s keept it similar to “hot blocks”?
* Marcin is right (even not stated directly): “the most wait time” can never be provided ba a sampling mechanism as ASH. Even unlikely, there could be a crazy high number of other waits on a single object just between the samples.

Hi there
nice post to say the least …
I do have a website of mine on http://www.zion-dba.com with some charts done using AWR data
best to all
patrick boulay

(Edit C. Hooper August 3, 2011: Please note that comments on this blog MUST refrain from advertising commercial products. It does not appear that the intention of the above link is to advertise a commecial product, but instead to share a data analysis concept that could be implemented in Microsoft Excel. If it is determined that this comment is in fact a commercial advertisement, this comment will be treated as spam, and will disappear.)

Hints for Posting Code Sections in Comments

********************
When the spacing of text in a comment section is important for readability (execution plans, PL/SQL blocks, SQL, SQL*Plus output, etc.) please use a <pre> tag before the code section and a </pre> tag after the code section:

<pre>

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;
TODAY
---------
01-MAR-12

</pre>
********************
When posting test case samples, it is much easier for people to reproduce the test case when the SQL*Plus line prefixes are not included - if possible, please remove those line prefixes. This:

SELECT
SYSDATE TODAY
FROM
DUAL;

Is easier to execute in a test case script than this:

SQL> SELECT
2 SYSDATE TODAY
3 FROM
4 DUAL;

********************
Greater than and Less than signs in code sections are often interpretted as HTML formatting commands. Please replace these characters in the code sections with the HTML equivalents for these characters: