August 5, 2015

Flash back causing library cache: mutex X

Recently one of our applications suffered from a severe performance issue. It is an application running on a database(11.2.0.4.0) used to validate a pre-production release. This performance issue has delayed the campaign test and the validation process for more than 3 days. The ASH data taken during the altered performance period shows this:

Mutexes are similar to locks except that they lock object in shared memory rather than rows in tables and indexes. Whenever a session wants to read or write into the library cache shared memory it needs to pin that object (cursor generally) and acquire a mutex on it. If another session wants simultaneously to read the same piece of memory it will try to acquire a mutex on it. This session might then wait on one of those library or cursor mutex wait event since another session has already preceded it and has still not released the latch (the mutex).

So, extended to my actual case what has been exaggerated so that a library cache: mutex X has made the database unusable?

The above piece of SQL code is generated by Oracle behind the scene when flashing backward a content of a given table. And this is exactly what this client was doing. At the end of their pre-production test campaign they flash back a certain number of tables to the data they have at the beginning of the test. And since the generated code uses parallel run with default degree it has produce such a kind of monitored execution plan

For every flashed back table Oracle started 96 parallel servers (96 sessions) in order to do a simple insert statement causing the observed library cache mutex X wait event. The DOP 96 is the maximum DOP which represents in fact the default DOP determined by the following simplified formula:

Having no possibility to hint the internal flash back Oracle code so that it will not execute in parallel, all what I have been left with is to pre-empt Oracle from starting a huge number of parallel process by limiting the parallel_max_servers parameter to 8 and, and as such, the maximum DOP will be limited to 8 whatever the cpu_count is.

Once this done I observed the following new situation for one flashed back sql_id (a5u912v53t11t)

The flashed back treatment ceases completely from being run in parallel and the campaign test started again to perform quickly.

This is not an invitation to go with drastic and brutal workaround to reduce the effect of many sessions waked up due to a very high degree of parallelism itself due to the default maximum DOP. It represents a demonstration on

how a high degree of parallelism can affect the locking in the library cache

how the parallel_max_servers parameter can bound the DOP of your query

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.