Database hang and Row Cache Lock concurrency troubleshooting

Issue background

This post will help to analyze Oracle database instance slowdown that can happen due to considerable row cache lock (enqueue) wait events. It’s is based on a real case of a database hang that I worked on recently. I must admit this type of situation does not appear often but it’s very dangerous since it can considerably slow down a database instance or even freeze it for a short period of time. In most cases SQL against ASH view and Systemstate dumps can help to nail down the problem unless this is an Oracle bug.

Usually it occurs suddenly and disappears quickly. See an example ASH graph below with brown peak that represents this type of concurrency: row cache lock wait events.

ASH graph – Row Cache Lock concurrency

What is a Row Cache Enqueue Lock?

The Row Cache or Data Dictionary Cache is a memory area in the shared pool that holds data dictionary information. Row cache holds data as rows instead of buffers. A Row cache enqueue lock is a lock on the data dictionary rows. It is used primarily to serialize changes to the data dictionary and to wait for a lock on a data dictionary cache. The enqueue will be on a specific data dictionary object. This is called the enqueue type and can be found in the v$rowcache view. Waits on this event usually indicate some form of DDL occurring, or possibly recursive operations such as storage management, sequence numbers incrementing frequently, etc. Diagnosing the cause of the contention

Diagnosing the cause of the contention

Check database alert.log

If the Row cache enqueue cannot be obtained within a certain predetermined time period, a systemstate dump will be generated in the user_dump_dest or background_dump_dest depending on whether a user or background process created the trace file. The alert log is usually updated accordingly with the warning and the location of the trace file. The message in the alert.log and the trace file generated will tend to contain the message:

AWR, ADDM and ASH Reports

Run AWR and ASH reports for the time when the problem is reported as well as a report leading up to the problem as these can sometimes help build a picture of when a problem actually started.

ASH Report – Row Cach Lock top wait event

SGA Shrink/Resize Operations

When the SGA is dynamically resized, various latches need to be held to prevent changes from being made while the operation completes. If the resize takes a while or occurs frequently you can see Row Cache Lock waits occurring. The key identifiers for this is high waits for ‘SGA: allocation forcing component growth’ or similar waits at the top of waits in AWR. You can also use following SQL for the same check:

Issues by Row Cache Lock Enqueue Type

For each enqueue type, there are a limited number of operations that require each enqueue. The enqueue type therefore may give an indication as the type of operation that may be causing the issue. As such some common reasons are outlined below along with SQL that helps to find the qnqueue type:

DC_OBJECTS
Look for any object compilation activity which might require an exclusive lock, blocking other activities. If object compiles are occurring this can require an exclusive lock which will block other activity. Tune by examining invalid objects and dependencies with following SQL:

select * from dba_objects order by last_ddl_time desc;
select * from dba_objects where status = 'INVALID';

Can be a bug like the following ones: Bug 11070004 – High row cache objects latch contention w/ oracle text queries Bug 11693365 – Concurrent Drop table and Select on Reference constraint table hangs(deadlock) – fixed in 12.1 DC_SEGMENTS This is most likely due to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.

DC_USERS
– This may occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.
– Excessive calls to dc_users can be a symptom of “set role XXXX”
– You can check the presents of massive login attempts, even the failed ones by analyzing listener.log (use OEM 12c-> All Metrics or by checking database AUDIT if available or using own tools).
– Bug 7715339 – Logon failures causes “row cache lock” waits – Allow disable of logon delay

DC_TABLESPACES
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents.

DC_USED_EXTENTS and DC_FREE_EXTENTS
This row cache lock wait may occur similar during space management operations where tablespaces are fragmented or have inadequate extent sizes. Tune by checking whether tablespaces are fragmented, extent sizes are too small, or tablespaces are managed manually.

DC_AWR_CONTROL
This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these.

Possible blockers – ASH & Systemstate dump

Often and in my case the wait for a Row Cache Lock is the culmination of a chain of events and the lock being held is a symptom of another issue where a process holding the requested row cache enqueue is being blocked by other processes. If you see a lot of different sessions doing different things are blocked and waiting on Row Cache Lock, it is often a symptom, not the cause. What can be a real cause then? Some examples are below:
– LGWR process can cause massive Row Cache Lock contention while waiting for a long redo log switch
– Messy application system triggers on user LOGON
– Massive session kill – … How to find a blocker:

Find and analyze trace file generated in UDUMP directory or ask for MyOracleSupport help.
The challenge here, as mentioned above, to catch the issue since it’s usually disappears quickly.
There are a few ways though how you can accomplish that. I’ll share them in the next posts.

Notes:
– I saw more often Row Cache Lock issues on Oracle 11.1 -> Upgrade to at least 11.2.0.3
– Be careful doing systemdumps on production system. They might cause additional system instability.