I was building a new SQL Server 2008 R2 failover cluster recently and encountered a problem that I hadn’t seen before (which is rare as I’ve seen A LOT of cluster setup problems in my time!). This time it was strange as it was an error before setup actually ran, it was when I was going through the dialogue boxes to configure setup.

The scenario was this:

1. Cluster was fully built and validated at a windows level, all resources were up and OK 2. I was about to run SQL Setup when I noticed the network binding order was wrong 3. I changed this and then decided to reboot both nodes as I always do this before a cluster setup 4. The nodes came back online OK and all resources came up as well 5. I ran setup but when I got to the cluster network configuration dialog box, there were no networks to select from, so you couldn’t go forward.

My first thought was that I must have done something dumb when changing the network binding order but checks on the network adapters showed that they were all up. I then went back through a few other things and noticed that the cause of the error was actually that the cluster service was having issues with connecting to one of the networks. There were 2 types of error / warning in the cluster logs and the system event logs:

Error

Cluster network ‘xxxxx’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Warning

Cluster network interface ‘xxxxx – xxxxx’ for cluster node ‘xxxxx’ on network ‘xxxxx’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

I had to engage the help of some network specialists as I couldn’t get to the bottom of this on my own. The networks actually appeared up and we could connect to them and use them independently outside of the cluster, but the cluster was convinced that they were partitioned. To cut a long story short, after checking many things we realised that the problem was down to the fact that one of the networks was actually a teamed network implemented using BASP virtual adapters, and this network team was not coming up fast enough after the node rebooted, before the cluster service tried to bind it in as a resource.

The fix was simple, in that we set the cluster service to delayed start and then everything was fine. We didn’t need to make any configuration changes beyond this. Once the cluster service was happy that the network was OK, SQL Server setup was able to continue just fine.

We make extensive use of Hyper-V within Basefarm and I recently encountered a strange problem when doing maintenance on a 5 node windows cluster running the hyper-v role. For reasons I won’t bore you with here (pre-production testing basically), I had been evicting and adding nodes to my host level windows cluster, but when trying to add a node back I encountered errors in the validation tests. This was strange as the node had actually previously been in the cluster and nothing had been changed on it whatsoever, so I already knew that it had previously passed the validation test and been successfully running as a cluster node!

The cluster validation reported an error of this format in the storage tests named

Downloading and installing this on all the nodes and potential new nodes resolves the error.

This goes to show the value of pre-production testing as the aim of this cluster is to provide a dynamically expanding virtualisation service for one of our largest customers. If it came to a production situation where I needed to add a node quickly, this is not the type of scenario I would be wanting to troubleshoot live!

In Part 1, I looked at how one could identify that a large blocking chain was causing resource exhaustion in SQL Server leading to a cluster failover. Being as we were working retrospectively, we were lacking data to analyse what actual condition was causing the blocking chain to begin with. This post address that area of the troubleshooting.

The problem had occurred a number of times in this environment, and no-one had ever managed to log into the machine to analyse it before it failed over, so I wanted to put something in place proactively to ensure we caught the condition as it happened. For reference had I been able to log into the machine I would have simply run the SQL 2008 Perfstats scripts as a starting point, which I tend to keep handy on my critical production machines, just in case I ever need it in a hurry. I just run the script for a couple of minute thru SSMS and you often get a good look at the problems. You can download the scripts from codeplex, as they are held as part of the SQLNexus project.

However in this case I have the opportunity to capture exactly what I want as I’m planning to setup everything in advance. This means that it’s a perfect opportunity to use SQLDIAG to collect all the data I want in one simple place. If you don’t know what SQLDIAG is, then read these first:

The good thing nowadays (since the above were written) is that there are a couple of automated ways to get things going, as opposed to manually editing XML files. Personally I do still sometimes do manual edits depending upon how many customizations of my own I want to put in, but we also have the options of using the BAT files supplied in the SQLNEXUS project or I’ve recently noticed that a SQLDIAG configuration tool has been released.

The bottom line here is that I configured a SQLDIAG to collect the following:

1. Perfstats scripts

2. Perfmon of common SQL and OS counters

3. DBCC MEMORYSTATUS (I always like to include this as it has such a low overhead and can be incredibly useful if you get unexpected internal problems like compilation bottlenecks)

4. General base information (defaults from SQLDIAG)

5. Profiler – restricted to batch level events, as this is a heavily used system.

***IMPORTANT NOTE***

You need to show extreme care with running profiler nowadays on some of the more high end SQL machines. In the past 2 year I’ve seen a couple of enterprise sharepoint systems crashed through the use of excessive SQL Profiler traces. If you’ve not run profiler before on your system and you have very high throughput, make sure you do a test run of just a server side profiler first at a very high level capturing very few events first, to check that your server and disks can handle the throughput.

***IMPORTANT NOTE***

Since I didn’t know when the problem would occur next, I then needed to implement a routine to manage the SQLDIAG start and stop times and file output. The BOL documentation above lists the switches and syntax you need to do this, with examples.

So I put all this in place and then waited for the problem to occur again, which is promptly did within a few days.

ANALYSIS:

Anyone who analyses performance problems within any frequency will probably tell you that there’s no right or wrong way to go about this, but in this case since I’m pretty certain I’m looking for a blocking problem, my first port of call is the perfstats scripts output. I’m suspecting that I’ll only need the other data to backup my findings, but I’ve got it there just in case.

Perfstats output is normally my first port of call in most performance cases anyway, as my first question is:

Is there a user batch running for a long time?

The answer to this question often removes whole areas of troubleshooting possibilities, so I like to answer it first. If the answer is yes, you can look at said batches and try to work out why, and if the answer is no, then you need to be looking at different areas such as internals. In this case, very clearly from perfstats, there was a long running batch. I’m limited to how much output I can show here for security reasons, but here’s some sample output from the [headblockersummary] section of perfstats.

(This wide output doesn’t display well on our fixed width blog, so if you want to read it, cut and paste it into a decent text editor like notepad ++ )

If you note the highlighted sections above, we can see that SPID 70 starts causing some blocking at 14:36:25, but it’s only blocking 1 session and there is other blocking anyway, so it appears transitory. This code is a trigger firing and it needs to do a large update which is going to take a few minutes (I can’t show the code or why this is, that’s just the way this application works)

20 seconds later SPID 70 is still blocking, but only directly blocking 2 sessions. Now there are about 80 other SPIDs blocked, and then are effectively behind SPID 70 deeper down the blocking chain. However these then clear and so the blocking still seems transitory, if we look at the web server logs over this period, the web servers are still up and the majority of all OLTP throughput is being served within normal limits.

But then 4 minutes later we see the same SPID 70 but this time its directly blocking 449 other sessions, and it’s at this point that we start to see alerts from our monitoring of the application servers. Over the next minute this blocking chain grows larger and larger until we hit the deadlocked schedulers and then shortly after that the cluster fails over.

So at first glance, we have a simple problem here. SPID 70 seems to have a long running piece of code which blocks all users on the system. But upon further investigation this trigger fires regularly without blocking the entire system (we saw leading up to the failure that it was running for several minutes with transitory blocking) so why did it suddenly block everyone.

The answer here is in what is being blocked (the blockee) rather then the head blocker. Upon looking further into the details of the blocked SPIDs it turns out that they are all running the exact same code from the exact same machine, which is a BizTalk host instance in this environment. It turns out that this host instance has no control over error handling or throttling, and when one session gets blocked, it just opens another, and another and so on. Because it happened to clash with the trigger running under SPID 70 it continues to open more and more connections until eventually we reach complete resource consumption and there are no sessions left for any users whatsoever. Then we get out failover.

So the solution here was 2 fold, firstly we could attempt to optimize the work of the trigger which was running for a long time, to reduce the potential for concurrency issues, but secondly we could attempt to control the BizTalk host more effectively to stop it attempting to open so many connections concurrently. Without BizTalk being involved here, although we would have experienced some minor blocking, the main application OLTP throughput would still have functioned, but BizTalk stole all the server connections, and then we had our crash.

The moral of this story is that in complex production situations, you may need to make multiple data collections before you can come to the root of your problem (however deep your skills and knowledge are), and as in many situations it’s about knowing what diagnostics and tool to utilise, as opposed to knowing the answers off the top of your head.

Here at Basefarm we offer a number of different levels of SLA and support, which can be tailored to customers’ individual needs. One of the highest levels that we offer is called Basefarm Application Management. At this level we take full responsibility for the operations of the infrastructure but also the customer developed applications and code. The following post is an example from some troubleshooting I did for one such customer recently. It’s a typical SQL Server problem in high end production environments and one that you need to be able to handle if you want to be able to keep your SQL machines running 24×7. (it also has some good geek value as it allows us to employ several advanced diagnostic routines including a windows debugger) The analysis here is quite long so I’ve split it into 2 posts.

The symptoms were as follows: (they all appear within a couple of minutes of each other)

All in all the performance problem lasts for a few very minutes (less than 10), and the actual complete web site outage is probably less than 2 minutes due to the resilience of the windows cluster and SQL Server, however if you’re aiming for a very high uptime, small outages like this will really hurt you. (and I take it as a personal affront if machines I’m responsible for run slowly and failover!)

My initial plan when dealing with this, was to log into the SQL machine (as I suspected this was where the problem was out of gut feeling) and run some diagnostics to prove or dis-prove this hypothesis. As it was the cluster proved this hypothesis for me by failing over before I could connect. So this is the evidence I was left to work with:

The 18056 error is the first evidence we see in the error log of anything being wrong in SQL. I don’t find this error particularly helpful actually in terms of its description, but I do know that I recognise it as being something that often occurs as machines start to lock up in the SQL engine and / or begin to shutdown / failover. In this case it’s more of what we in England would call “A red herring” as there actually aren’t any issues here with connection pooling.

The we see the key information in that there are deadlocked schedulers and we’ve produced a memory dump. This is the real root problem here. It means that every SQL scheduler is unable to process any work, which means that the server is effectively dead at that point in time.

At this point the windows cluster will detect that the SQL Server is unavailable (as it can’t log in to run it’s health routines) and it therefore fails the service over to another node (as defined in our cluster resource failover settings).

So this is good as we know somewhat which type of problem we are encountering and where it lies (within the SQL engine) but we have no idea why. I did actually check in the SQL default trace file to ensure that no nasty DBCC events or file growths has occurred here. They hadn’t which meant that I could form a hypothesis that it was probably a standard user SQL batch event causing the issue.

At this point it’s time to open the memory dump files and see if we can find anything of use. As usual with SQL, we have 3 files, the dump itself, a textual representation of the dump and an extract of the error log. The latter 2 here only really give us what’s shown above, so we need to open the dump itself to get anything good. At this stage if you don’t like debugging you might like to skip to part 2 of this post (which is collecting more information and diagnosing the root problem, but the dump is interesting in (to me at least).

There’s no time here to explain how to configure a windows debugger, but it’s well documented on MSDN. The setup I used is this:

A default installation of debugging tools for windows downloaded from here:

It’s at this point when debugging can become a little frustrating. As an ex-Microsoft engineer I’m used to having source access and private symbols, but now that I’m out in the “real” world, I only have public symbols like everyone else (not for example that I have no symbol for the bottom 2 rows in the stack – these are just the windows thread start functions so not a problem here, but for some reason I can’t get this customers’ windows symbols from Microsoft). Now this may well be fine for CLR debugging but our options are very limited for SQL. That said we can simply run ~* kL to check the stack of all of the process threads. Being as SQL source is written in a beautifully structured way (most of the time!) the method and function names are usually very indicative of what is happening and with a little common sense you may be able to get a clue to what’s going on.

The output of every thread is obviously too much to show here, but here’s a simple summary.

There are a total of 674 threads in this dump of which the first 70 or so are what I would classify as system threads. This isn’t an official piece of terminology, it’s just that whenever you look at a SQL dump the first threads in numerical order are always common functions within all SQL processes such as deadlock checker (thread 4 in our case) , resource monitor, service brokers, the list goes on, but they can basically be ignored in this case. What’s interesting is that after thread 75 there are basically 600 threads with an identical stack which looks like this

This is a rare scenario, as if you took a dump of a working SQL Server (you can do this with SQLDumper and it’s a great technique employed in the famous Ken Henderson book) you’d see all the different threads running different classes and methods as each one is processing it’s own workload.

The highlighted functions near the top of the stack should be self explanatory. This thread is processing a worker which is trying to acquire a lock. The lock can’t be acquired at this time, so it goes into sleep mode. This is a very common example of how standard lock blocking looks in memory dumps. Since nearly the entire dumps’ threads are in this function, what we appear to have here is a huge lock blocking chain, which eventually leads to the actual failover event, which is deadlocked schedulers, i.e. no schedulers at all are available to process work, as they are all stuck waiting for this block to clear. The other function I’ve highlighted is one to note whenever you are debugging SQL Server (well versions 2000, 2005, 2008 and 2008 R2, as this could change in future), and it’s the [sqlservr!process_request] which is the function that all standard TSQL batch executions go through. In simple terms, if you see this on the thread stack then your thread is processing TSQL statements.

So in summary at this point we now know that out server failed over because a huge blocking chain ground the server to a complete halt. Eventually the blocking chain was so long that the cluster resource monitor could not log in. At this point it deemed the server was down (it was actually up but only in a theoretical sense as it couldn’t process new work) and the cluster failed it over. This also probably explains why it came back up without a problem, as the head blocker would have been killed by the failover and then other work resumed normally.

The question at this point obviously, is what was causing the blocking. Now if we had private symbols, we could simply explore through the locals on that stack and dig out the memory sections which describe the actual TSQL being executed. Since only Microsoft engineers themselves can do this, we have 2 options, either send the dump to MS and pay for them to look at it, or prepare advanced diagnostics to attempt to the capture the problem again using public tools, the next time it happens. Since I’d prefer not to pay Microsoft just to sort out blocking problems, we’ll do the latter, and part 2 of this post describes what we did.