Blogroll

Posts Tagged ‘kqlfxp’

We had an interesting latch contention issue with a client database worth sharing. Client complained that they are seeing increased library cache latch contention in production with no logical explanation (yet). So, client wanted us to find the root cause. This performance issue seems to have started recently and occurs intermittently and also resolves by itself. Client is planning to upgrade to 10g in 4 weeks time frame and this issue need to be resolved to avoid any delays in 10g upgrade.

From statspack analysis and various statistics analysis, it is visible that there is an increase in library cache latch contention, but nothing jumps up as a problem. Of course, this is a very active application [ 5 node RAC application] with very high number of literal SQLs but that is nothing new, it is always been this way. No new dynamic SQLs, no DDLs invalidating cursors or anything of that sort. I started watching production instance, in real time, waiting for the re-occurrence of this issue. It occurred!

latchprofx for rescue

At this very juncture, instance was suffering from latch contention and It was visible that sessions were starting to queue up for latches.

I was ready with necessary tools. Tanel Poder has written a great tool latchprofx to sample v$latch_holder with high frequency [ Actually, it uses x$ tables but you get the idea ]. In this case, high frequency sampling is a must since this performance problem disappears in few minutes or so. Executed latcprofx script for library cache latches and 10000 samples as printed below.

Yikes. We can see that one session holding library cache latch in 95.5% of samples. That seems bit odd.

Of course, there are many latch children for library cache latches. Is that session repeatedly acquiring just one latch children? Or is it acquiring many different latch children? More importantly what is that session executing ?

Latchprofx to the rescue again. Latchprofx can also print latch children address. Re-executing latcprofx script with modified second argument to the call. Only difference between these two calls is that first argument to latchprofx includes laddr now.

Whoa! This session was acquring nearly all latch children. Notice that latch addresses are different in every row output. Meaning, various samples of latchprofx saw that this session is holding a different latch children. In a nutshell, this session was acquiring and releasing ALL library cache latches ( aka children) in quasi sequential fashion and causing massive performance issue. There are 53 latch children for library cache latch (in this database). Column held% is approximately 2%, indicating all library cache latch children were acquired and released in a quasi-uniform distributed(2% times 50), quasi-cyclical fashion! Same pattern repeates if I reduce or increase samples.

Latch greedy session

Initially, I thought, this is a bug. I have not seen a case where all library cache latch children were needed to execute a SQL or a process causing massive latch contention. [ Okay, I confess, I have simulated that kind of issue in my test database holding library cache latch children using oradebug kslgetl stuff, but that is a different topic altogether]. Session 140 was executing a SQL printed below. This SQL is querying execution plan from shared_pool for a specific hash_value in this 9i database( Equivalent to dbms_xplan.display_cursor stuff in 10g.]

55 seconds to execute this SQL? That seems bit excessive. This SQL is the only time-consuming recursive SQL generated from the select statement. So, somehow, this must be causing latch contention. I will print only relevent lines from the plan output.

Line #6 is the problem. This table x$kqlfxpl is accessed using Full table scan. Looks like, x$kqlfxpl can NOT be accessed without holding library cache latches. Since there is a FTS on this fixed table, almost every library cache object need to be touched which potentially means all library cache objects must be inspected. Further, accessing this x$ table means library cache buckets (_kgl_bucket_count controlled area) need to be walked, latches acquired and released before that memory area is inspected. Makes sense!

Repeated calls to kqlfxp, kglic, kglic0 also visible in the stack output. These calls suggests that library cache latches were acquired and released. Systemstate dumps confirms that too.

I was also able to reproduce this issue querying x$kqlfxpl in a cloned database.

Running latchprofx script from another session concurrently shows that accessing this x$ table will result in excessive library cache latching activity. Row estimates column for this table shows 100 which is a CBO default cardinality estimate if there are no statistics on a table.

Now, a little bit background is necessary. What started all this? As a part of 10g upgrade planning, Client DBAs had setup a job to capture costly SQLs and store execution plans from shared_pool using dynamic_plan_table script, borrowed from Tom kyte dynamic_plan_table discussion. Excellent idea and wonderfully written sql, Tom.

If you don’t know already, Tom’s method creates following view. Client script passes the hash_value of costly SQLs from v$sql to this view.

Unfortunately, such an awesome method has side effects in version 9.2.0.8, as explained above.

At this point, I am almost sure that we need to collect statistics on fixed tables to resolve this issue. There is a fixed index on x$kqlfxpl (KQLFXPL_HADD, KQLFXPL_PHAD,KQLFXPL_HASH) and if that index is used, effect of this problem should disappear. Since there is a plan to upgrade to 10g soon, client didn’t want to collect fixed statistics.

So, I worked around the issue. Not an elegant solution, but it avoids problem by recreating the view with hardcoded hash value for every loop.