Tanel Poder

2008/07/08

I have been too busy since getting back from vacation, thus no posts for a while. But I hope the waiting was worthwhile, as I present to you LatchProf, a tool for digging into latch contention problems – using plain SQL and sqlplus!

As I’m still busy, I’ll keep it short.

LatchProf is a script similar to WaitProf, only it samples latch holder statistics from V$LATCHHOLDER. As V$LATCHHOLDER contains a SID column (the session ID of a latch holder), it becomes possible to find out who is hitting a latch the most (a way to prove that crappy monitoring tools which constantly scan through V$SQL DO cause library cache latch contention themselves).

Some day I will write about this at greater length (and check over the script itself), but here’s the syntax:

@latchprof <what> <sid|%> <latch|%|childaddr> <#samples>

Here are the parameter definitions:

1) The columns to select; these map directly to V$LATCHHOLDER columns, so you can use SID,NAME,LADDR there, for example.

2) SID is either a session ID as a number, or % to see latches held by all SIDs.

3) latch_name can be % for all latches, a LIKE pattern such as %library% to see all latches with “library” in their name, or an actual latch address! The last option allows us to monitor only the specific child latches we are interested in.

4) The number of samples to take. I normally use 100,000 samples as a starting point. An example invocation is shown below the list.
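
For example, to take 100,000 samples of latches with “library” in their name, held by any session, broken down by latch name and holder SID, a call could look like this (the values here just illustrate the syntax above):

@latchprof name,sid % %library% 100000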

As latchprof (and waitprof, for that matter) never sleep, they burn as much CPU as they can. This is good for sampling accuracy (well, at least on multiprocessor systems with enough spare CPU capacity), but it will consume one of your CPUs. So you don’t want to run this for too long at a time (you could implement a short wait in there with a nice trick by Jonathan Lewis).

100k samples usually complete in around a second on my test kit (giving a 100+ kHz sampling rate :) This will vary with your hardware, your Oracle version, and your PROCESSES parameter (as V$LATCHHOLDER iterates through process state objects, the more processes you’ve got, the more time it takes).
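
For the curious, here’s a minimal sketch of the sampling idea itself. This is not the actual latchprof.sql, just an illustration of how V$LATCHHOLDER can be sampled many times from a single SQL statement by joining it to a row generator (the hints attempt to force V$LATCHHOLDER to be re-visited for every generated row):

-- a minimal sketch of the sampling idea, not the real latchprof.sql:
-- join a 100,000-row generator to V$LATCHHOLDER so that the fixed view
-- is visited once per generated row, then aggregate the holder counts
SELECT sid, name, COUNT(*) held
FROM (
    SELECT /*+ ORDERED USE_NL(lh) */ lh.sid, lh.name
    FROM   (SELECT level r FROM dual CONNECT BY level <= 100000) gen,
           v$latchholder lh
)
GROUP BY sid, name
ORDER BY held DESC;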

Anyway, let’s say we have a performance issue and, using the wait interface, we have determined that for a particular session the waits for cache buffers chains latches are unreasonably high.

Now, one way forward would be to identify the database blocks involved in this problem, by looking at which blocks are protected by the latch (using the X$BH.HLADDR = V$LATCH_CHILDREN.ADDR mapping) and trying to identify the hottest block by its X$BH.TCH touch count. But where would that lead us? We would find out which segment that block belongs to, but it wouldn’t really help us understand who the troublemaker causing this issue is.
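
For reference, a minimal sketch of that block-level approach (assuming SYS access, as X$BH is involved) could look something like this:

-- a sketch of the hot-block approach mentioned above (requires SYS access):
-- list the most-touched buffers protected by CBC child latches
SELECT *
FROM (
    SELECT bh.hladdr, bh.obj, bh.file#, bh.dbablk, bh.tch
    FROM   x$bh bh, v$latch_children lc
    WHERE  bh.hladdr = lc.addr
    AND    lc.name = 'cache buffers chains'
    ORDER BY bh.tch DESC
)
WHERE rownum <= 10;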

Therefore we use another approach – sampling V$LATCHHOLDER to see which sessions hold the latches the most:

Just for demo purposes, I start by listing general latch holder stats (not broken down by holding SID):
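
The command for that is something along these lines (just the latch name in the column list, no filters):

@latchprof name % % 100000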

Yes, it looks like cache buffers chains latches are held the most. The Held column here shows during how many samples a latch was held. The Gets column shows roughly how many times that latch was acquired during the sampling time.

As we took 100,000 samples, and during 21,248 of them a cache buffers chains latch was held, we can conclude that this latch was busy (Held) 21.25% of the time. And as the sampling lasted about 1.3 seconds, we can extrapolate that out of those 1.3 seconds, cache buffers chains latches were in the Held state for roughly 267 milliseconds. And as we know the number of gets, we can compute the average latch hold time in milliseconds (it is simply the total held time divided by the number of gets). This could help us detect very long latch holds, such as what can happen when someone is scanning through X$KSMSP or when the server is CPU starved. The current estimate of 15 microseconds per latch hold is not that bad (note that with very few samples, like 10,000 or less, the average hold time estimates become unreliable).
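
Next, let’s break the same numbers down by holding session as well, simply by adding SID to the column list, with something like:

@latchprof sid,name % % 100000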

We see that sessions 81 and 85 are hitting CBC latches the most. Let’s say (for demo purposes again) that SID 85 was the one which experienced bad performance and led us to investigate this issue.

As there are tens of thousands (or even more) of CBC latches in a decent Oracle database, looking only at aggregated latch-name-level stats is of course not enough. Do the sessions compete for a single child latch? Or do they just use lots of different CBC latches over time and never collide at all? This is the question where LatchProf can help.

Let’s include the latch address (child latch address) in the picture and query only “cache” related latches. This returns lots of rows, so I normally press CTRL+C after the first pageful of results, as they are the most important (the results are ordered with the busiest first):
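
Following the syntax above, the command would be something like:

@latchprof sid,name,laddr % %cache% 100000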

And voilà, we see that SID 81 is hitting one single child latch in maniac fashion, while SID 85 is behaving more normally, just hitting lots of different CBC latches (you can see that its number of gets on any single child during the 100k samples doesn’t exceed 34). Interesting.

So now that we know the child latch address (LADDR), we can refine our query to monitor only that child latch, to see whether any other sessions could be affected:
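
The child latch address goes in as the third parameter; with a made-up address for illustration, the call would look like:

@latchprof sid,name,laddr % 6F4F7C64 100000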

This is Jonathan Lewis’s kill_cpu script in action, which hammers the same data block in a tight loop using CONNECT BY (you need to set _old_connect_by_enabled = true on 9i and later for this test to work :)
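
From memory, the essence of that test is a tiny index-organized table driven through an exponential CONNECT BY, roughly like the sketch below (my approximation, not necessarily Jonathan’s exact script):

-- an approximation of the kill_cpu idea, not necessarily the exact script:
-- a tiny index-organized table, then an exponential CONNECT BY over it,
-- which hammers the same block(s) in a tight CPU-bound loop
create table kill_cpu (n, primary key (n))
organization index
as
select rownum n from dual connect by level <= 23;

select count(*) from kill_cpu
connect by n > prior n
start with n = 1;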

I will blog about some more use cases and an extended version of the LatchProf script in the (hopefully) near future.

Note once more that this script only detects latch holders. The latch held the most is not necessarily the latch waited for the most. You still need to start your diagnosis from the wait interface, using Snapper or WaitProf for example; only if these tools show significant latch waits should you drill down further into latches. And as Anjo Kolk recently (very well) said, latch contention is a symptom, not the problem. The root cause is somewhere else (likely in the application); once you fix that, the latch contention will go away by itself.

Update: I have published a more detailed latch profiling tool (LatchProfX) which uses X$ tables for getting extended data and can give even more detailed info about the latch holders. Read that article, plus the links in it: