Another LatchProfX use case

Especially if the latch contention problem is ongoing, looking into system-wide stats (like the v$latch sleep columns) is not the best idea on busy systems. These counters aggregate activity across all sessions since instance startup, so they may sometimes lead you to fixing the wrong problem.
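
For illustration, this is the kind of system-wide view I mean (a minimal sketch; because v$latch is cumulative over the whole instance, one session's problem can drown in the background noise of a busy system):

    -- System-wide latch statistics: cumulative over all sessions since
    -- startup, which is exactly why they can point at the wrong problem.
    SELECT name, gets, misses, sleeps
    FROM   v$latch
    ORDER  BY sleeps DESC;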

This is because sometimes the latch contention is not caused by some system-wide inefficiency, but rather by one or a few sessions.

The next question then is: why is that session holding the latch so much? (Query the v$sesstat counters for that session, or run LatchProfX.)

Riyaj used this approach and successfully identified the troublemaker causing heavy library cache latch contention. He used my LatchProfX tool for part of the diagnosis (and I’m very happy to see that my advanced Oracle troubleshooting tools find real-world use in the hands of other troubleshooters/tuners too!).

In the LatchProfX output from a little test case I put together, you can see which session is holding which library cache latch, how much, in which mode (shared/exclusive), and also the actual reason (the kernel function name) why the latch is held.
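
LatchProfX works, as far as I recall, by sampling the latch holder fixed table (X$KSUPRLAT) at high frequency and aggregating the samples. A rough one-off equivalent of a single sample, just to show the raw ingredient (a sketch only; the real tool samples thousands of times and groups by session, latch and function):

    -- One-off snapshot of who is holding which latch right now.
    SELECT sid, laddr, name
    FROM   v$latchholder;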

kglic stands for Kernel Generic Library cache Iterate Chain (AFAIK); it’s the function executed when you access most X$KGL tables. And this is where Riyaj got the evidence that the latch contention came from inefficient scanning of the library cache, caused by a session running queries against V$SQL_PLAN views with a bad plan.
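
To make that concrete, here is a hypothetical example of the kind of statement that can drive kglic work: V$SQL_PLAN is built on X$KGL* fixed tables, so an unselective query against it has to iterate through the library cache chains (under the library cache latches) for every row it examines:

    -- Hypothetical illustration only: without a selective hash_value or
    -- address predicate, this has to walk the whole library cache.
    SELECT operation, options, object_name
    FROM   v$sql_plan
    WHERE  object_name = 'SOME_TABLE';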

How did you determine there were long waits for that latch? Did you look into v$session_wait.event, perhaps *without* realizing that the STATE column did not say WAITING? Unfortunately Oracle doesn’t clear the last wait event name when the session goes back onto CPU, so that often misleads people (including me; that’s why my sw.sql script uses a CASE statement to remove the wait event name when the session is not waiting anymore).
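
The idea behind that CASE statement is roughly this (a minimal sketch, not the actual sw.sql, which does more):

    -- Show the event name only while actually waiting; otherwise the
    -- EVENT column is just the LAST wait and the session is on CPU.
    SELECT sid,
           state,
           CASE
             WHEN state = 'WAITING' THEN event
             ELSE 'ON CPU (last event: ' || event || ')'
           END AS activity
    FROM   v$session_wait
    WHERE  sid = &sid;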

Your tkprof output shows CPU time about as big as elapsed time, so there should be no significant waits going on (did you enable sql_trace with waits => true? In that case tkprof should also report the total time spent waiting on different events).
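
For reference, two common ways to get wait events into the trace file (the SID 123 / serial# 45 below are placeholders):

    -- For your own session: 10046 event at level 8 (waits;
    -- level 12 adds bind values too).
    ALTER SESSION SET EVENTS '10046 trace name context forever, level 8';

    -- For another session, on 10g and later:
    BEGIN
      DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 45,
                                        waits => TRUE, binds => FALSE);
    END;
    /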

Today I had this issue of latch free waits for one of the sessions. I have a systemstate dump of the database, taken with oradebug dump systemstate. I believe there is a way to locate the latch-blocking session by looking at the dump file. Can you please guide me on how to do this?
I appreciate your help.

From the systemstate dump you can search for “latch:” or “latch free” (depending on the DB version and latch type) to find the waiters and the latch address, and then you can look up the latch holder by that latch address.
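
For completeness, taking such a dump from SQL*Plus looks roughly like this (connected as SYSDBA; the last command prints where the dump was written, and the exact trace file contents vary by version):

    SQL> oradebug setmypid
    SQL> oradebug dump systemstate 10
    SQL> oradebug tracefile_name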

But before going there, tell me exactly how (which view, which columns) you concluded that there was a latch free problem in the system.

Yes, it is a systemstate dump. I checked v$session_wait for that particular session and it showed the latch free event. The latches were ‘cache buffers chains’ and ‘mostly latch-free SCN’. This session had been running for almost 16 hrs. Again, at the system level the db was normal. My db version is 9.2.0.8 on SunOS 64-bit. I think I had seen an example on your site where you were showing how to track back to the latch-blocking session.

This is what I have in my dump file for this latch free waiting session... I did a grep on address 380010668 in the entire dump file and I don’t see any reference to this address. So I am puzzled... did I use the wrong address?

That’s the reason why I asked: I suspected that the session was not actually stuck waiting for the event, but that you had misinterpreted v$session_wait instead. If STATE != WAITING in v$session_wait, then ignore whatever the EVENT column says, as you’re not waiting anymore but burning CPU! The EVENT column just shows the LAST wait waited for; it’s not relevant anymore once the wait is over.

That’s why I use my sw.sql script which masks the last event name and replaces it with “on cpu” when not waiting.

And the systemstate dump confirms that as well: if the session were waiting, you’d see “waiting for …” in there instead of “last wait …”.

I would look into the v$sesstat counters of that session (with snapper, for example) and see which counters are incrementing (especially whether the execute count is going up), and then look into the execution plan of the prevalent SQL_ID/sql_hash_value in v$session for that session. This may be caused by just a bad execution plan, but snapper would confirm it.
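
If you’d rather do it by hand than with snapper, the raw ingredients look something like this (run the first query twice and compare the values; substitute your own SID and hash value):

    -- Key v$sesstat counters for one session; sample twice and diff.
    SELECT sn.name, ss.value
    FROM   v$sesstat ss, v$statname sn
    WHERE  ss.statistic# = sn.statistic#
    AND    ss.sid = &sid
    AND    sn.name IN ('execute count', 'parse count (total)',
                       'session logical reads');

    -- Then pull the plan of whatever the session is executing
    -- (on 9i, use sql_hash_value from v$session).
    SELECT id, operation, options, object_name
    FROM   v$sql_plan
    WHERE  hash_value = &hash_value
    ORDER  BY child_number, id;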

I think this explanation answers the question of why I don’t see the same address in the dump file.

I do see in the dump that the session info shows as “last wait”. However, I was checking the v$session STATUS, which was ACTIVE. From my observation, if the session is waiting on anything in the database, the v$session status shows as active (I might be wrong).

So is it possible that, while the dump was being taken and it was scanning this session, the session was doing some work?

Now, about the next time this happens: if I use sw.sql and v$sesstat, I think I should get more info on this issue.
However, do you foresee any other info to be collected, just based on what you have seen?
Your insight is greatly appreciated. Thanks,

v$session.status = ACTIVE when the session is servicing a user request. Whenever a new request comes in (through the TTC layer), one of the first things Oracle does is update the session’s idle bit to say the session is active. The session stays active regardless of CPU usage or waits for events. The session is marked inactive only when the response is sent back to the client and the db call ends.

What you probably meant was v$session_wait.state (statE, not statUS), which shows the wait state, i.e. whether you’re waiting for something or not. The STATE column is also in v$session starting from 10g, but the two columns mean different things: STATE shows whether you’re waiting or working on CPU, while STATUS shows whether the session is servicing a user request at all.
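
On 10g and later you can see both side by side in v$session (SID placeholder below):

    -- STATUS: is the session inside a user call at all (ACTIVE/INACTIVE)?
    -- STATE : within that call, is it waiting or on CPU right now?
    SELECT sid, status, state, event
    FROM   v$session
    WHERE  sid = &sid;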

Regarding systemstate dumps, yes, it’s possible to get “unlucky” so that when the dump is taken the session just happens to be doing something other than the problem operation… this is the case when you don’t have a hang but just bad performance. That’s why you should take a few systemstate dumps to see whether anything changes over time. Note that a systemstate dump is quite expensive; better to use @sw / v$session_wait if just one session is slow, and hanganalyze at level 3 if there seems to be a hang affecting multiple sessions.
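
Hanganalyze is driven through oradebug as well; for example, at level 3:

    SQL> oradebug setmypid
    SQL> oradebug hanganalyze 3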

I will post a little script on my blog for information gathering, hopefully today, but what I usually gather first for a problem session is v$session_wait, v$sesstat, and execution plan info from v$sql_plan.