Modelling a “simple” ITL problem

Last night an important client process missed its SLA, with the application reporting ORA-02049: timeout: distributed transaction waiting for lock.

This post isn’t about the distributed lock timing out.

But consider the evidence that further investigation into these distributed locks threw up.

The parameter distributed_lock_timeout is set to the default of 60 (seconds) – this means that any distributed transaction that waits for an enqueue for 60 seconds will be picked off by a sniper, similar to the standard deadlock detection after 3 seconds.

These same ASH samples show one particular sql id as being involved in the ITL waits.

This particular sql id is a DELETE on a single table, no triggers, no on delete cascade referential integrity.

Sounds easy, right?

But…

V$SEGMENT_STATISTICS and DBA_HIST_SEG_STAT show 0 ITL waits for the segment/s related to the DELETE sql id above.

In the ASH data, for the event “enq: TX – allocate ITL entry”, the CURRENT_OBJ# column indicates a variety of objects that might show ITL waits under concurrent DML but, again, all unrelated to the sql id identified by ASH.

So…

Time for some theories:

ASH could be misleading us about the sql ids and/or

ASH could be misleading about current_obj# and/or

If our enqueue times out because of the distributed lock timeout, then perhaps the ITL wait is not recorded – i.e. segment stats are misleading.

Theories about ASH being misleading under certain circumstances are incredibly convoluted to test but, as far as I am concerned, even up to the latest 11gR2 this is a known known. There are certain values that are not cleared out or not recorded properly or disappear under certain conditions.

Now I’m going to delete some rows across three sessions.
The first two sessions will do simple uncommitted deletes. The third will do a select over a db link in order to start the distributed transaction, followed by the same simple delete.

First session – delete some rows and leave uncommitted so the ITL cannot be cleared:

Because the statement was sniped by the distributed transaction timeout, the segment ITL statistics are not incremented, so they (and the AWR snapshots of them) understate the significance of the ITL waits.
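That conclusion can be restated as a toy model. The Python sketch below is not Oracle code: `Segment`, `wait_for_itl` and, above all, the rule that the counter is only incremented when a wait ends with a slot being acquired are assumptions made for illustration, inferred from the behaviour described here rather than from any documented internals.

```python
from dataclasses import dataclass

DISTRIBUTED_LOCK_TIMEOUT = 60  # seconds, the default quoted in this post


@dataclass
class Segment:
    """Hypothetical stand-in for a segment and its 'ITL waits' statistic."""
    itl_waits: int = 0


def wait_for_itl(segment, slot_free_after, distributed):
    """Model one session waiting for an ITL slot.

    slot_free_after: seconds until a blocker commits and frees a slot
    distributed:     True if the waiter is in a distributed transaction
    Returns 'acquired' or 'ORA-02049'.
    """
    if distributed and slot_free_after >= DISTRIBUTED_LOCK_TIMEOUT:
        # Assumption: a sniped wait never reaches the point where the
        # segment statistic would have been incremented.
        return "ORA-02049"
    # Assumption: the statistic is bumped only when the wait completes.
    segment.itl_waits += 1
    return "acquired"


seg = Segment()
print(wait_for_itl(seg, slot_free_after=10, distributed=False), seg.itl_waits)
print(wait_for_itl(seg, slot_free_after=300, distributed=True), seg.itl_waits)
```

In this model the second call fails with the timeout yet leaves `itl_waits` unchanged, which is the pattern that makes the segment statistics look clean while the application is failing.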

A good point made by Coskan is that I did not include the blocking session information in my ASH output.

This is simply because it made the output too wide.

One of the sessions is reported as the blocker but in reality there may be many blockers – as many as there are ITL slots.

Also, depending on how many sessions are waiting, you may find that your blocked statement is unlucky and gets superseded by other sessions that get the ITL slot purely through good timing of when they requested the enqueue. This is what was happening in my particular situation: an unlucky session was blocked for 1 minute, up to the distributed lock timeout, by a variety of blockers that, at other times in that minute, were busy doing other work and not all waiting on this enqueue.
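The "unlucky session" effect is easy to reproduce in a toy simulation. The sketch below is a deliberately crude model and not a description of Oracle internals: the function `simulate`, the fixed `retry_every` interval at which the waiter re-checks for a free slot, and the rule that competitors arriving in the same second win are all assumptions chosen to illustrate the timing argument.

```python
DISTRIBUTED_LOCK_TIMEOUT = 60  # seconds, the default quoted in this post


def simulate(slot_frees, competitor_tries, retry_every=5):
    """Return the second at which the waiter gets an ITL slot, or None if it
    is still waiting when the distributed lock timeout expires.

    slot_frees:       seconds at which some blocker commits and frees a slot
    competitor_tries: seconds at which another session asks for a slot
    retry_every:      hypothetical interval at which the waiter re-checks
    """
    free = 0
    for t in range(DISTRIBUTED_LOCK_TIMEOUT):
        free += slot_frees.count(t)
        # Competitors arriving this second take any free slot first and keep it.
        for _ in range(competitor_tries.count(t)):
            if free > 0:
                free -= 1
        if t % retry_every == 0 and free > 0:
            return t
    return None


# Three slots free up within the minute, but each is taken a moment later.
print(simulate([7, 22, 41], [8, 23, 44]))  # None: sniped at the timeout
# Same commits, no competition: the waiter gets in at its next re-check.
print(simulate([7, 22, 41], []))           # 10
```

In the first run no single blocker holds a slot for the whole minute, yet the waiter never finds one free, which matches the picture of a variety of short-lived blockers adding up to a 60-second wait.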