Latching

ALL,
I am trying to get a handle on an issue with SQL Server 7.0 SP3. I have been monitoring with Perfmon, looking at latching. On my big strong 8-CPU server we are experiencing an average latch wait time of 2040 ms. I double-checked my math and the documentation, and it seems we wait for a latch for an average of 2.04 seconds??? It is pretty consistent throughout the business day. I believe that all my latch-related metrics are high.

I am trying to get a handle on what latch is causing this latch contention. My theory is that the latch contention is for the most part internal to SQL Server. There may be latch contention related to various indexes, as we have some unnecessary and unused indexes (I am working on getting those dropped).

I am trying to get some information out of SQL Server (or really MS) as to how I would prove or disprove my hypothesis.

Here are my observations:
1) Almost 100 % Buffer cache hit ratio
2) Almost no lock contention (very low Lock related waits)
3) Many tables in various applications are used as queues, i.e. we have tables with a status field(s) that are repeatedly queried for a certain status (the applications run as NT Services).

So we have tables and indexes in the buffer cache and we are querying the heck out of them. The buffer cache is protected by a latch. (Of course Microsoft doesn't come out and say that there is a latch - but as Sybase and Oracle have latches that protect the hash table that maps the buffer cache, and latches that protect the linked list that hangs off the hash table, I am guessing that Microsoft has them also. If this is incorrect please correct me and please provide a reference.)

So my questions are:

1) How many latches (spinlocks) does SQL Server acquire/configure in a multi-CPU environment? If I have multiple CPUs, how many latches are allocated for the buffer cache? As an aside: is the reason that MS came up with the Federated Servers architecture that it was simpler to do that than what Sybase and Oracle do, i.e. allow you to partition the buffer cache and configure the number of latches?

2) As an MVP, do you have access to any information about DBCC PERFMON(lruwaits) and/or DBCC PERFMON(spinlockwaits) - which seem to produce the same output? I can't find any meaningful information about the output. Is there any method to determine the cause of latch contention?

You have a great question, and I wish I had a great answer. Unfortunately, there is very little documentation on the subject of latches (I looked), so I don't have a definitive answer, but here's some feedback that may help.

First of all, for those who are not familiar with latches, a latch is in essence a "lightweight lock". From a technical perspective, a latch is a lightweight, short-term synchronization object (for those who like technical jargon). A latch acts like a lock, in that its purpose is to prevent data from changing unexpectedly. For example, when a row of data is being moved from the buffer to the SQL Server storage engine, a latch is used by SQL Server during this move (which is very quick indeed) to prevent the data in the row from being changed during this very short time period. This applies not only to rows of data, but to index information as well, as it is retrieved by SQL Server.

Just like a lock, a latch can prevent SQL Server from accessing rows in a database, which can hurt performance. Because of this, you want to minimize latch time.

SQL Server provides three different ways to measure latch activity. They include:

Average Latch Wait Time (ms) -- The wait time (in milliseconds) for latch requests that had to wait. Note that this is a measurement only for those latch requests that had to wait; in many cases, there is no wait at all. So keep in mind that this figure applies only to latches that had to wait, not to all latches.

Latch Waits/sec -- The number of latch requests that could not be granted immediately. In other words, this is the number of latches, in a one-second period, that had to wait. These are the latches measured by Average Latch Wait Time (ms).

Total Latch Wait Time (ms) -- The total latch wait time (in milliseconds) for latch requests in the last second. In essence, this is roughly the two numbers above multiplied together (Latch Waits/sec x Average Latch Wait Time) for the most recent second.

Note: When reading these figures, be sure you have read the scale on Performance Monitor correctly. The scale can change from counter to counter, and this can be confusing if you don't compare apples to apples.
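As an aside, you can also sample these same counters from T-SQL rather than Performance Monitor, since the data is exposed through master..sysperfinfo. A minimal sketch, assuming SQL Server 7.0/2000 (the existence of a separate "Base" row for the average counter is my assumption from how perfmon ratio counters are stored, so verify against your server's output):

```sql
-- Read the latch counters from T-SQL; sysperfinfo exposes the same
-- data that Performance Monitor shows for the SQLServer:Latches object.
SELECT object_name, counter_name, cntr_value
FROM master..sysperfinfo
WHERE object_name LIKE '%Latches%'
-- Note: the "Average Latch Wait Time (ms)" row here is a raw numerator;
-- divide it by its companion base row to get the true average that
-- Performance Monitor displays.
```

This is handy if you want to log the values into a table on a schedule instead of watching the Performance Monitor graph.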

Now let's get to your question. In your post, you commented that your Average Latch Wait Time was 2040 ms, but you didn't supply the other two counters, which are also important: while this figure does seem high, it is not too relevant if the Latch Waits/sec is not also high. So if you have not done so, could you provide this number, along with the Total Latch Wait Time? This would help me get a bigger view of how your server is working.

Generally, if there is a high Average Latch Wait Time or a high rate of Latch Waits/sec, this indicates a low buffer cache hit ratio, and more RAM is needed for the server. You have indicated that your cache hit ratio is near 100%, so that is probably not the issue.

Also, for the fun of it, here are these three figures for a 4-way server I have that runs Windows 2000 and SQL Server 2000:

Brad,
Thanks for the reply. This morning on my 4-way machine I have an Average Latch Wait of 2046 ms, a Total Latch Wait Time of 6000 ms and Latch Waits/sec of 221. Not bad - it has been much worse. It has been improving as I wade through the various problems/issues.

I am not even sure the general user community understands that there are latches that protect the internal data structures of SQL Server. I am really concerned about these latches.
There are only two paragraphs in the Inside SQL Server 2000 book (page 92) on the subject of access to in-memory pages, which describe a hash table and a linked list of pages that hangs off the hash table. I am sure there are latches that serialize access to these data structures. These structures are accessed every time a logical I/O is performed, but we do not have any information to figure out if our queries are waiting on these latches.

Intuitively, I know the solution is to reduce the logical I/O load, make sure we are reusing as much SQL as possible, and reduce the number of unused indexes - but it would be nice if I had information about all the different types of latches so I could prioritize my efforts. I guess this is the art of performance tuning...

I think the reason we (the SQL Server user community) don't have any detailed information about latching is that a) the SQL Server kernel is not instrumented to collect this information and/or b) there are few or no features (configuration parameters) we can configure to alleviate the problems. In this regard SQL Server is so far behind Oracle and Sybase it's incredible.

What I think you need to do is understand where the latching is occurring. I can think of two ways to find that out.

1. Select from the sysprocesses table. For example:
SELECT * FROM master..sysprocesses WHERE waittime > 0 AND spid > 50
will return the SPIDs that are waiting, along with a waittype, waittime, lastwaittype and waitresource. The lastwaittype will tell you what your latch type is, and the waitresource will tell you what object you are waiting on.

2. Run DBCC SQLPERF(waitstats).
I would run DBCC SQLPERF(waitstats, clear) to clear the stats, then run DBCC SQLPERF(waitstats) every 30 seconds or so for 5 to 10 minutes when you're seeing the high latch times. The output will give you a wait type and a wait time.
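That sampling routine might look roughly like this - a sketch only; the table definition, column names and types are my assumptions about the waitstats result set on 7.0/2000, so check them against your server's actual output before running it:

```sql
-- Sample DBCC SQLPERF(waitstats) into a table every 30 seconds for
-- 10 minutes, so the deltas can be compared afterwards.
-- The table name, column types and interval are only examples.
CREATE TABLE WaitSamples (
    sample_time datetime DEFAULT GETDATE(),
    wait_type   varchar(80),
    requests    numeric(18, 1),
    wait_time   numeric(18, 1),
    signal_wait numeric(18, 1)
)

DBCC SQLPERF (waitstats, clear)   -- reset the counters first

DECLARE @i int
SET @i = 0
WHILE @i < 20   -- 20 samples x 30 seconds = 10 minutes
BEGIN
    INSERT INTO WaitSamples (wait_type, requests, wait_time, signal_wait)
        EXEC ('DBCC SQLPERF (waitstats)')
    SET @i = @i + 1
    WAITFOR DELAY '00:00:30'
END
```

Afterwards you can query WaitSamples and compare the wait-time deltas between samples to see which wait types are growing during the bad periods.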

This doesn't give you any more theoretical information on Latches, but it should help you solve your problems.

My question is,
1. What tools are you guys using? SQL Profiler or NetIQ's SQL Check?
Real-time or not?
2. I have a high value of Average Latch Wait Time (>2000 ms) after I reboot
my SQL Server (Win NT 4.0 / SQL 7.0; 2-way CPUs).
As time passes, this figure goes down to 800-900 ms at peak time.

The reason might be an I/O bottleneck or a memory leak.

The solution is either to improve I/O or to add memory.
I already added memory up to 2.5 GB.
Improving network bandwidth does not seem so easy in this case.

My experience of latching was seeing a lastwaittype of PAGEIOLATCH_%% or PAGELATCH_%% when monitoring blocking, at the same time as disk queueing and a very high disk time. After the array controller was upgraded, the disk queueing and disk time dropped and the latching virtually disappeared.

Referring to ufobox's post above, I use SQL Server Profiler for my traces. I have not used NetIQ's software, although I would like to try it if I ever find the time. Generally, I will capture a trace, then look at it after the fact, although in some rare cases I will look at it in real time if I am troubleshooting.

I think that the post by HarryArchibald above makes a very good point. One of the reasons that latches can take more time is that the data first has to be moved from disk to the buffer, then from the buffer to the data engine. When this happens, I/O obviously plays a big part.

In regard to RORTLOFF's post - thanks for the feedback. I appreciate you taking the time to post.

In regard to repeatedly querying sysprocesses - I've tried that with some success. I usually do that in a loop, inserting the values into a table for later analysis, like so:

WHILE 1 = 1
BEGIN
    INSERT INTO Waits
    SELECT waittype, waittime, lastwaittype, waitresource
    FROM master..sysprocesses
    WHERE spid > 50 AND waittime > 0
    WAITFOR DELAY '00:00:05'
END

The issue I have is the granularity of the loop. If I run the query constantly, am I going to affect performance? If I don't run the query at the correct time, or often enough, am I going to miss something? The second issue I have is: as each thread scans the buffer cache there must be a latch on the hash table and on the linked list that hangs off it - does a thread never wait for these latches? I have never seen anything other than page and parallel-query related waits.

The other issue I have is that this method is kind of labor intensive. I have some code that loops over the information, parses the wait resource, converts that to an object or page number, and then uses DBCC PAGE and OBJECT_NAME() to determine the hot tables and indexes.

So the top waits that I see in sysprocesses are PAGEIOLATCH_SH and PAGEIOLATCH_EX.

I have also used DBCC SQLPERF(waitstats), and of course the most time-consuming wait resource is "MISCELLANEOUS" (is that the same as undefined??) followed by:

PAGELATCH_EX, PAGELATCH_SH, PAGELATCH_UP, PAGEIOLATCH_EX, PAGEIOLATCH_SH, PAGEIOLATCH_UP

The issue I have, going back to my very first post on this subject, is that all we have is one crummy Q article that lists these latch-related waits - Q244455 - no details, no definitions or descriptions, just a name. The article does not define the waits; it just says these are the page latch related waits.

So my theory is that I have several very hot tables and indexes that are already in the data cache, and my applications are thrashing the hell out of them by repeatedly querying the same data. I believe that these latches are a result of logical I/O, not physical I/O (my buffer cache hit ratio is consistently > 99.7). I have identified the objects from the output of sysprocesses.waitresource...

Any holes in my theory??

Once again thanks for all the comments...

Peace,

Joe E O
P.S. How can we get Microsoft to document in detail the output of DBCC PERFMON(spinlockwaits)? I suppose mentioning that Oracle documents its wait interface in great detail would help.
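For reference, the page-decoding step I described looks roughly like this. DBCC PAGE is undocumented, so treat the syntax and all the sample numbers here as illustrative only:

```sql
-- A PAGEIOLATCH waitresource looks like 'dbid:fileid:pageid',
-- e.g. 5:1:24316 (made-up numbers). DBCC PAGE (undocumented) dumps
-- the page header, which includes m_objId; OBJECT_NAME() then maps
-- that to the table or index owner.
DBCC TRACEON (3604)          -- send DBCC output to the client
DBCC PAGE (5, 1, 24316, 0)   -- dbid, fileid, pageid, print option
-- read m_objId from the header output, then, in the right database:
SELECT OBJECT_NAME(1977058079)   -- the m_objId value is illustrative
DBCC TRACEOFF (3604)
```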

With regard to selecting from sysprocesses, you can use the "with (nolock)" hint to ease some of the overhead. This may damage the integrity of your perf data, but if you collect enough samples it should even out.

When I do performance analysis the first tool I use is PerfMon. Seeing that there is lots of page latching, have you looked at the Physical Disk counters in PerfMon? I normally look at sec/Read and sec/Write along with the queuing counters. My first guess is that you have a disk bottleneck.

Sysprocesses gives you a file id with the pagelatch waittype. I believe the layout of the waitresource is DatabaseID:FileID:SlotID. You can execute SELECT FILE_NAME(file_id) in the correct db context within Query Analyzer and it will tell you what the hot file is. If you created your tables or indexes on separate filegroups, you'll know what the hot object is. The other thing you can do is write your loop query with a DBCC INPUTBUFFER call nested in it. While this is not always 100% correct, it will give you the statement being executed by the SPID that is waiting on a pagelatch.
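A rough sketch of that nested DBCC INPUTBUFFER loop - just one way to write it; the PAGE%LATCH% filter is my assumption about how the latch wait type names are spelled on your build:

```sql
-- For every SPID currently waiting on a page latch, print the
-- statement it submitted. Run this while the latch waits are high.
DECLARE @spid smallint
DECLARE waiters CURSOR FOR
    SELECT spid FROM master..sysprocesses
    WHERE waittime > 0
      AND lastwaittype LIKE 'PAGE%LATCH%'   -- PAGELATCH_* / PAGEIOLATCH_*
OPEN waiters
FETCH NEXT FROM waiters INTO @spid
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT 'SPID ' + CONVERT(varchar(6), @spid)
    DBCC INPUTBUFFER (@spid)
    FETCH NEXT FROM waiters INTO @spid
END
CLOSE waiters
DEALLOCATE waiters
```

Wrap the whole thing in your existing WAITFOR DELAY loop to sample it repeatedly.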

Sorry, I had fumble fingers - RORTLOFF - again, explain how I can have a disk bottleneck with a 99.8% buffer cache hit ratio? I also verified disk I/O via PerfMon - no queuing. FYI, we have a pretty high-end EMC storage array and our apps are pretty read intensive.

Also, I think someone mentioned the avg latch wait time - that metric is pretty static on my server. Wouldn't it fluctuate depending on whether you were waiting to perform physical versus logical I/O???

Joe E O

(I did verify through PerfMon - I normally have very little (almost no) disk queuing.)

Good point, it is tough to explain. You could be caching reads very well but having problems with writes. What does the DBCC INPUTBUFFER command tell you when you trap a pagelatch?
I would still look at the disk counters, though. Or have you already looked at them and completely ruled I/O out of the problem domain?

RORTLOFF,
Pretty much ruled out physical I/O - I am not seeing any queuing, so I am trying to understand how I could be waiting to do physical I/O...

I'm not sure if you caught my first post, but we have many tables that act as queues, and many apps that more or less run as services and query the table(s) for a change in status. (I lovingly refer to one of these apps as the "load generator".)

The SQL Server does an index seek on the clustered index on (cola, colb). The query is really not too bad, but it is executed constantly (120/minute). Here are the stats from a 107-minute sample
(Duration, CPU and Reads are average ms/exec):

In 107 minutes we spend 27 minutes servicing this one stupid query... I got this information from Profiler - saved and summarized in a database table. (I can't easily change the app (politics), so I am going to archive data from the table as a short-term fix.)
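Something like the following works for summarizing a trace saved to a table (the table name is made up, and the column names assume Profiler's default save-to-table layout, where TextData is ntext and so needs a CONVERT before it can be grouped):

```sql
-- Aggregate a Profiler trace that was saved to a table, to find the
-- queries that consume the most total duration and logical reads.
SELECT TOP 10
    CONVERT(varchar(100), TextData) AS query_text,
    COUNT(*)      AS executions,
    AVG(Duration) AS avg_duration_ms,
    SUM(Duration) AS total_duration_ms,
    AVG(Reads)    AS avg_reads
FROM MyTraceTable                    -- hypothetical trace table name
GROUP BY CONVERT(varchar(100), TextData)
ORDER BY SUM(Duration) DESC
```

Truncating TextData to 100 characters collapses most ad hoc variations of the same statement into one row, which is usually what you want for ranking.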

My point is that I have to believe that each time this query, and others like it (lots of logical I/O, few rows returned), is executed, the buffer cache hash table and the linked lists of pages that hang off it are latched. I say this not so much based on anything I have ever read about SQL Server (the documentation is lacking!!!) but based upon my knowledge of Sybase and Oracle.
I would want to quantify this contention and be able to partition the data buffer cache to segregate this bad table and query... What I am going to end up doing is creating another instance (once we upgrade the cluster to 2000) and sticking it there...

I think you are on the right track about latches giving you a performance hit when the same data is being queried very frequently. We had a very similar latch problem, and that was the cause of our troubles.

I work at a hosting company, and one of the sites we had was an ecommerce web site built in ASP, load balanced among a number of web servers. The cause of our latch problem was that in a navigation.asp page there was a SQL query that did something like "select * from product_categories". This was a site with a lot of visitors (and still is), and with all web servers pounding the SQL server with this query for every visitor and every page (navigation.asp was included on every page), the SQL server couldn't take it, and the latch wait time and latch requests per second skyrocketed. The cache hit ratio was still at 99%. This was on a 4-way active/active cluster running Windows NT4, SQL 7 SP3 and using a Compaq RA8000 disk rack.

In our case the solution was to get the developers to cache the product category data instead of querying it all the time for every single page and visitor. This was OK since the data is quite static.

Some temporary solutions that we had to use before the developers could finish the caching solution were:
1. Change the SQL code from ad hoc to using a stored procedure.
2. Put the database in read-only mode. Bought us 20-30% more performance in terms of the number of queries that could be handled at the same time.
3. Copied the read-only database to the second node in the cluster and pointed 7 web servers to one database and 7 to the other. This was not very good, of course, in case of a failure (one node would get all the load again), but there wasn't money at the time to buy and install new servers just to host this database.

/Argyle
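For solution 1 above, the stored procedure wrapper would look roughly like this; the procedure and column names are illustrative, not the real site's:

```sql
-- Replace the ad hoc "select * from product_categories" sent by every
-- page with a stored procedure, so SQL Server reuses one cached plan
-- instead of recompiling the ad hoc text for each call.
CREATE PROCEDURE usp_GetProductCategories
AS
    SET NOCOUNT ON
    SELECT category_id, category_name   -- illustrative column list
    FROM product_categories
GO
-- The ASP pages then call:  EXEC usp_GetProductCategories
```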