Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

In my last AOT post I published my LatchProf script, which samples detailed latch holder data from V$LATCHHOLDER.

LatchProf allows you to drill down into your latching problems at session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can’t do). It gives you valuable details about the individual sessions that hold a latch the most and are therefore likely contributing the most to the latch contention problem.
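To illustrate the general sampling idea, here is a deliberately simplified sketch (this is not the actual LatchProf implementation, just the concept): poll the latch holder view in a tight loop and count in how many samples each (SID, latch name) pair shows up. The 1000-iteration count is an arbitrary choice for illustration:

set serveroutput on

declare
  type t_counts is table of pls_integer index by varchar2(100);
  l_counts t_counts;
  l_key    varchar2(100);
begin
  for i in 1 .. 1000 loop                          -- sampling loop, no sleeping
    for r in (select sid, name from v$latchholder) loop
      l_key := lpad(r.sid, 6) || '  ' || r.name;   -- group by SID + latch name
      if l_counts.exists(l_key) then
        l_counts(l_key) := l_counts(l_key) + 1;
      else
        l_counts(l_key) := 1;
      end if;
    end loop;
  end loop;
  l_key := l_counts.first;                         -- report hold counts per holder
  while l_key is not null loop
    dbms_output.put_line(rpad(l_key, 60) || l_counts(l_key) || ' samples');
    l_key := l_counts.next(l_key);
  end loop;
end;
/

The sessions showing up in the most samples are the ones holding the latch the most.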

However, after you have discovered the troublemaking session, what next? One way forward is to look into the V$SESSTAT counters using the Snapper tool. Depending on which latch is the problematic one, you would look for different statistics: various buffer get statistics for cache buffers chains latches, or parsing/executing statistics when looking into library cache latches. However, if those statistics look “normal”, is there any other way to drill down further?
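For example, here is a simple starting point for eyeballing those counters for a suspect session (SID 123 is just a placeholder; the statistic names are standard V$STATNAME entries):

select sn.name, ss.value
from   v$sesstat ss, v$statname sn
where  ss.statistic# = sn.statistic#
and    ss.sid = 123                       -- placeholder SID of the suspect session
and    sn.name in ('db block gets', 'consistent gets',
                   'session logical reads',          -- cache buffers chains suspects
                   'parse count (total)', 'parse count (hard)',
                   'execute count')                  -- library cache suspects
order by sn.name;

Note that these are cumulative values since session logon; Snapper's value is that it samples the deltas over a short interval instead.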

Yeah, there is, so let’s look into it!

A side comment: Nowadays I try to avoid using X$ tables in favor of V$ views (because in my experience in many client environments you don’t always have easy SYS access or X$ table “proxy views” created). Also, if you can get your job done using simple V$ views, then why bother with more complex X$ tables? (yeah, it’s much cooler, I know ;)

Anyway, here is an exception: the LatchProfX script, which stands for Latch holder Profiler eXtended, because it provides some very interesting extended information about latch holders: it shows why the currently held latch was taken in the first place. It uses X$KSUPRLAT instead of V$LATCHHOLDER, as the latter doesn’t externalize all the required information. I’m talking about latch Where statistics.

By “Where statistics” I mean that the Oracle kernel’s latching module coders have instrumented their code well enough that whenever they get a latch, they mark down the code location ID of the latch getter within the latch structure itself.
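By the way, an instance-level aggregate of these “where” statistics is exposed even through a plain V$ view, V$LATCH_MISSES, which maps the location IDs to kernel function names. Note that it only covers locations where latch gets missed (slept), not all latch gets, so it is only a rough instance-wide approximation of what LatchProfX samples at session level:

select parent_name, location, sleep_count, wtr_slp_count
from   v$latch_misses
where  sleep_count > 0
order by sleep_count desc;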

Let’s not keep you waiting any longer and look at some examples. The output is quite wide, so you may need to size your browser window accordingly.

Normally you would run LatchProfX with specific enough parameters to monitor only the one or few latches which you know are the troublemakers. However, I do the first example run across all instance latches (the third parameter is %) to better illustrate the usefulness of the LatchProfX result data.

LatchProfX syntax is similar to the original LatchProf script, with the addition of the ksllwnam and ksllwlbl field names you can sample. These are exactly the fields (from X$KSLLW) which give us the descriptions of the “where in the kernel was the latch acquired from” opcodes we get from X$KSUPRLAT (see the script itself for syntax and internals).
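For example, a run across all latches could look like this (a hypothetical invocation; I am assuming the same parameter order as LatchProf: the field list to group by, a SID filter, a latch name filter and the number of samples; check the script header for the exact syntax):

@latchprofx name,ksllwnam,ksllwlbl % % 100000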

What do we see? While LatchProf would just have provided latch holder info about which (child) latch was held and by which SID, LatchProfX additionally shows you why this latch was taken.

From the above example you see that there were various reasons why the shared pool latch was taken, but the reason which caused the latch to be held the most time was the code location kghalo, which means something like Kernel Generic Heap Allocate. This is essentially the function used for searching and allocating free space in the shared pool. This kind of makes sense: operations like freeing a chunk or updating heap global statistics are simple, as Oracle already knows the memory location it needs to update, but searching for free chunks for allocation can require scanning through a lot of chunks if there are no suitable ones in the freelists and an LRU scan is needed.

Next, there’s the kghupr1 location, which caused the shared pool latches to be held for 1.62% of total sampling time. This location, in function kghupr (un-pin recreatable), needs to take the shared pool latch to put the unpinned chunk back onto the heap LRU list. An Oracle kernel coder has also specified a label (action description), “Chunk Header”, which is seen in the KSLLWLBL column. It’s there to state that this kernel location takes the shared pool latch because it needs to update a chunk header: putting a chunk back onto the LRU list means updating the chunk header (and its neighbors’ headers).

There are also the kgh_heap_sizes and kghfre locations, which update global heap statistics (the stuff you see in v$sgastat, for example) and free a chunk in the heap, respectively.

Once you have established which latch is the troublemaker (under too much contention) and you know the “why” reason for it, you probably also want to know which sessions are contributing to the problem. LatchProfX can show that too:
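A hypothetical invocation for that (same assumed parameter order as above, now with SID added to the field list and the latch name filter narrowed down):

@latchprofx sid,name,ksllwnam % "shared pool" 100000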

From the above you see there are 2 main sessions contributing to shared pool latch contention, and both are doing lots of kghalo calls, thus allocating memory. If V$SESSTAT for those sessions shows a high “parse count (hard)” statistic, it would be obvious that the heavy shared pool activity is due to hard parsing (which is kind of common knowledge, as excessive hard parsing is very often the reason for shared pool latch contention). However, check out another sample with the same parameters:
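A quick way to check that would be something like this (SIDs 123 and 456 are placeholders for the two sessions LatchProfX reported):

select ss.sid, ss.value as hard_parses
from   v$sesstat ss, v$statname sn
where  ss.statistic# = sn.statistic#
and    sn.name = 'parse count (hard)'
and    ss.sid in (123, 456);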

What’s this? Why is there a session with SID=0 which has called the kghalo function and held the latch much longer on average than any other session? LatchProfX has detected only 5 latch gets by that session from the kghalo location, yet these have taken the most time in total. And the average hold time for that latch is estimated at over 42 milliseconds, compared to most others which are in microseconds (even though a 3.6 millisecond average hold time for a latch is quite long too).

One of the reasons for such long hold times could be CPU preemption: a latch holder process was switched off the CPU and it took a while before it got back on to complete its work and release the latch again. However, on my 4 CPU test machine there was plenty of CPU time available, so such preemption should not have been the problem. The SID 0 gives a clue. When a new session is started in Oracle, it is apparently reported as SID 0 during the session initialization sequence in some places, which includes the memory allocation for things like the v$sesstat and v$session_event arrays, and also the session parameter values and set events in some cases. Considering that the total size of these structures is quite large in a normal shared pool usage context (tens of kB), if the freelists are empty and an LRU scan must happen, this inevitably takes a longer time, and the shared pool latch is held all that time. (Phew, I must write shorter sentences.)

Anyway, I recommend playing around with this script in a test environment: run different workloads like hard parses, soft parses, physical IOs and buffer gets, and you should get quite a good overview of some Oracle internal workings that way. This hopefully also answers how I know some of the Oracle function names and their purpose: Oracle itself exposes quite a lot of this information in the X$ tables I’m using in my script. You just need to experiment, observe and try to figure out what the real names behind function names like kghalo are. Fun :)

A few more notes:

If you want to drill down to latch child level, then use LADDR as another field (among SID, NAME etc.) in the first parameter to LatchProfX (I’ve described this in AOT guide part 7), as in the example below.
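For example (again a hypothetical invocation with the assumed parameter order):

@latchprofx sid,name,laddr,ksllwnam % "cache buffers chains" 100000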

All those scripts never sleep during their sampling but spin on CPU, so they can make CPU starvation even worse.

More importantly, if all your CPUs are running at 100% utilization, the latch contention you see is probably just a symptom of the CPU starvation itself! The same goes for wait events as well: when your CPU runqueues are long, some of the CPU runqueue waiting time can be attributed to wait events such as IO instead (for example, when the IO completes but the process can’t get onto the CPU to “end” the IO wait).

I have updated both LatchProf and LatchProfX to detect whether they’re running on Oracle 9i or 10g+ and dynamically run different code based on the version (on 9i there is no GETS column in V$LATCHHOLDER).
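In case you wonder how a SQL*Plus script can branch on version at all, here is a minimal sketch of one way to do it (not necessarily how LatchProf does it): read the major version into a substitution variable and use it to pick which code to run.

column version_prefix new_value ver noprint
select substr(version, 1, instr(version, '.') - 1) as version_prefix
from   v$instance;
-- the substitution variable &ver now contains e.g. 9 or 10 and can be
-- used to choose which sub-script or column list to run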


11 Responses to Advanced Oracle Troubleshooting Guide, Part 8: Even more detailed latch troubleshooting using LatchProfX

Hello Tanel,
a colleague pointed me to your blog, and it’s an impressively deep look into the Oracle kernel; all those things I used to close my eyes to due to their complexity.
Will read your blog in the future.

Even though your scripts are meant for single instance use, they can very well be used for RAC instances as well as for parallel queries.
In fact, I regularly use your scripts for troubleshooting. This is what I do:

1. Store all the relevant scripts like snapper, latchprof, waitprof etc. in a directory on one node [say node 1] of the cluster.
2. Create tnsnames entries for all the other nodes on the node [node 1] where all the scripts are stored.
3. Log into the instance on node 1 and find the issue in any node by using views like gv$session, gv$session_wait or gv$session_event.
For example, this query gives an overview of any wait event in any of the instances:
select inst_id inst, sid, module, action, osuser, username, logon_time, event, seconds_in_wait sec from gv$session where username is not null and event not like 'SQL*Net%' and seconds_in_wait > 10 order by seconds_in_wait;

4. Create a remote SQL*Plus session to the affected instance, using the tnsnames entry created in step 2, from the node 1 directory where all the scripts were stored in step 1.
Then use the relevant script to get to the root of the issue.

In my opinion all your scripts are very much relevant for all environments with just a bit of tweaking here and there.

I calculate the “Avg hold ms” by checking how fast the GETS column increases during the sampling loop. If each iteration of the V$LATCHHOLDER/X$KSUPRLAT scan takes 607 microseconds, then the maximum precision of the “Avg hold ms” I can report is 607 microseconds. So in reality the Avg hold ms may be even smaller. The only way to get better precision would be to sample faster, but my script is already sampling at maximum speed (the bigger the V$PROCESS array is, the longer each sample takes).
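In other words, the arithmetic goes roughly like this (the numbers are invented for illustration): if a latch was seen held in 50 samples, each sampling iteration took about 0.607 ms, and GETS increased by 10 during the run, then:

  Avg hold ms = (held samples * iteration time) / GETS delta
              = (50 * 0.607 ms) / 10
              = ~3.0 ms per latch get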

However, this is not a problem: the main column to look at is how busy a latch is in total (the Held %), and the other columns are just extra supporting information.

Hi Tanel
we have PL/SQL code calling Java which is stored in the database.
I see the PL/SQL code many times waiting on latches such as library cache pin or ‘latch: row cache objects’.
I tried LatchProf but did not get information related to the query or database object in contention.
Also, is there any way to find what Java code or Java stored procedure a session is executing?
When I look at ADDM or AWR reports, Java executions always show at the top,
but I can’t drill down to the exact Java stored procedure in question.

AWR data does not contain the required level of detail that LatchProfX gives, namely the link between the latch holding session, the SQL active during the hold and the kernel function responsible for taking the latch. The AWR latch misses sections provide some detail, but it’s instance level, so not always useful for drilling down into a single user/application. ASH links together the latch waiter session, the SQL and the latch name, so it’s a start (and the blocking session column often shows the latch holder).
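For example, a starting point against ASH could look like this (V$ACTIVE_SESSION_HISTORY requires the Diagnostics Pack license; the WHERE clause is just an illustration):

select sample_time, session_id, sql_id, event, blocking_session
from   v$active_session_history
where  event like 'latch%'
order by sample_time;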