Database Disconnect Issues with SharePoint

I have spent a good deal of time lately working with one of my MOSS 2007 customers on a database disconnect issue that has plagued them since SPS 2003. This blog entry walks through the steps I went through to track the issue down. The journey would not have been successful without the help of some of my co-workers, and I also need to mention that the customer made an incredible contribution of time and effort as well. In many cases customers would have bailed and resigned themselves to the issue never being resolved, but this one was different. They stuck it out, and in the end we nailed the issue.

The diagnostic tools used during the investigation included:

ULS Logs

Netmon

ADO.Net Tracing

Netstat

DbgView

TDIMon

NT Event Log

WinDBG (Debugger)

LogParser

netdiag

This issue occurred in every single environment they had (Production, QA, Dev, and POC) and was pervasive, occurring several dozen times a day on each server. It presented as multiple entries in the Application Event Log, specifically many event 7888 and 6483 entries littered about. Looking at the ULS logs, we could confirm the same error being logged there.
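As an aside, LogParser (already in our toolbox) is handy for gauging just how often events like these are firing. A quick sketch of the kind of query I mean, using the EVT input format against the Application log:

LogParser.exe -i:EVT "SELECT EventID, COUNT(*) AS Hits FROM Application WHERE EventID IN (7888; 6483) GROUP BY EventID"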

What was most interesting about these events was that each was logged as a timeout-expired operation, and the operation being performed was an initial connection. You can tell this by looking at the method names on the stack: TdsParser.Connect() (on an already existing connection we would be running TdsParser.Parse()), AttemptOneLogin(), CreateConnection(), etc. all indicate we were not using a pooled connection. The timeout message really bothered me too; I knew the connection strings in use were managed by SharePoint, and typically the TIMEOUT parameter is not set explicitly in the connection string, so this was probably not the result of a low TIMEOUT value from a rogue configuration setting.

Looking down at the bottom of the stack, I could see that the method that started this entire mess was one that runs within the OWSTimer. This was important because it told us the error was not the result of a user making a connection to the database while performing some SharePoint operation on the Web Front End (WFE). That meant the error was not impacting the user experience, and when a job in the OWSTimer fails, the operation is retried at the next interval for that job.

As is typical with issues that present as networking problems, we captured several network sniffs at the time of the events. We could see that the server was responding fine and there appeared to be no errors on the network; responses were very fast basically all the time. In fact, during the period in which the events were being thrown, every attempted SQL connection in the 30 seconds leading up to each event succeeded.
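As a side note, filtering the capture down to SQL traffic makes sniffs like these much easier to review. With Netmon 3 you can do this from the command line with nmcap; a sketch, assuming the default SQL Server port of 1433:

nmcap /network * /capture "tcp.port == 1433" /file sqltraffic.cap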

Next I started looking into the ADO.NET source code, and soon I developed a theory: we are inside the ADO.NET ReadSync() method and a very small timeout value (possibly zero) is being passed in, which causes us not to do a proper WaitForSingleObject() on the socket; passing 0 for the TIMEOUT does not allow enough time for the response from the server. A TIMEOUT of zero basically causes the thread to release the remainder of its quantum and then immediately go back into a ready state to be scheduled back on the processor. So what would cause such a condition? ADO.NET takes the timeout passed in from the user code and calculates a time in the future at which to expire the connection. As the code progresses, it compares the current time with this expiry time to get the number of milliseconds remaining and then uses that as the basis for the TIMEOUT on the WaitFor… methods. The problem with this approach is that it is susceptible to clock skews and corrections. Assume a 15-second timeout and a connection attempt that starts at 12:00:00; the expiry time would be 12:00:15. Now the time service decides the system clock is a bit slow and corrects it to 12:00:20, and we just blew our timeout period.

At this point you are probably saying to yourself this guy must be on crack if he really thinks this is what is going on. Well, I am not on crack, and you are correct, this is not what was going on. If you think about it, there is a really small window of opportunity for this to occur, and for it to occur at the frequency we were seeing was very unlikely. What made it even more unlikely was that we saw the issue not just on one server but on dozens, in various isolated environments.

The next step was to enable ADO.NET tracing to get a peek into what ADO.NET is really doing and try to ascertain why we were getting this timeout. Setting up ADO.NET tracing is trivial once you know how to do it. These are the files I used:

Setup.cmd -- is what you run to install the necessary settings to perform the tracing.

Setup_trace.reg -- is the registry file that is imported.

ADO_NET.mof -- is the mof file that is compiled to enable the tracing.

_startTrace.cmd -- is the file you run when you are ready to start the tracing.

Ctrl.guid.todd -- is the file that contains the IDs of the tracing handlers.

_stopTrace.cmd -- is the file you run when you want to stop the tracing.

For example, _stopTrace.cmd contains just one line:

@Logman stop MyTrace -ets
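_startTrace.cmd is not reproduced here, but a minimal sketch of what it contains, assuming the session name MyTrace from the stop script and the Ctrl.guid.todd provider file from the list above, would be:

@Logman start MyTrace -pf Ctrl.guid.todd -ct perf -o Output.etl -ets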

Once the trace is complete you end up with a potentially huge Output.etl file. Once I had the output open in my favorite text editor (UltraEdit), I needed to find the error, so I began hunting around for the string “{WINERR}” with the actual error number preceding it. The only errors I could find were 258 (WAIT_TIMEOUT) and 10065 (WSAEHOSTUNREACH). Both seemed relevant, and both occurred on the same thread. Using LogParser I was able to break the huge file out into something more manageable. One of the columns you can use to do this is ThreadID. My errors occurred on thread 3136, so I used this command to break out the file:

LogParser.exe "SELECT EventNumber,Timestamp,EventDescription,ThreadID,msgStr INTO bid_3136.csv FROM Output.etl WHERE ThreadID=3136" -oTsFormat:"yyyy-MM-dd:hh:mm:ss.llnn" -fMode Full
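You can also let LogParser hunt for the errors directly; this variation filters on the {WINERR} marker itself (the LIKE clause is my own addition, not part of the original command):

LogParser.exe "SELECT EventNumber,Timestamp,ThreadID,msgStr INTO bid_errors.csv FROM Output.etl WHERE msgStr LIKE '%{WINERR}%'" -oTsFormat:"yyyy-MM-dd:hh:mm:ss.llnn" -fMode Full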

I narrowed my search even further, and finally I was able to figure out what was going on. Using the EventNumbers, we can see that at 62463 we attempted to connect a socket, and at 62515 (the next trace item on this thread) we got our destination unreachable error, 21 seconds later. We then retry the connection, but remember, our connection timeout of 15 seconds has already passed, so in 62647 we have a timeout of 0. We send on the socket and then attempt to wait on it, and in 62651 we get the WAIT_TIMEOUT error, which is the result of calling WaitForSingleObject on the socket handle with a TIMEOUT of 0. This is the error that is logged in the event log; however, it is not the real reason we failed to connect. Rather, the destination unreachable error was the first error, and the one we needed to focus on.

We then went through a battery of config changes to see if we could knock this one out. We checked the IP settings and routing tables to confirm everything was correct. We checked GPO IPSec policies, and none were being applied. We disabled or removed additional NICs that were not being used. We removed load balancers, AV software, etc., but nothing helped.

We also performed a TDIMon trace, and it confirmed exactly what we found in the ADO.NET logs.

Next I attempted to use a debugger; however, the logistics of making this a reality were very difficult, and once attached I had limited time to try to reproduce the issue. We seemed to run clean with the debugger, which typically means some sort of timing issue or race condition (not the case here, however).

Next I attempted to use iDNA, a really cool technology developed by Microsoft Research that can trace every operation the machine performs. You can then load the log into the debugger and step forward or back in time, aka time traveling. Unfortunately there are some limitations: it does not work on x64, and it does not work within a VMware VM. Our test environment was both.

The customer was on SP1 of Windows 2003, and we upgraded to SP2; while we did see some improvement, it was only marginal.

Knowing this was something within the customer’s environment that applied to all machines, I obtained a copy of their build documents and attempted to build an identical machine using their process; however, this did not prove helpful.

Finally we decided to use a special build of tcp.sys that dumped a message to a DbgView window when a Destination Unreachable error occurred. Fortunately there are only a few places within tcp.sys where this error is returned. Once we received the log we could see where the message was being returned, and looking at the code, we saw that the code path required to reach that place meant IPSec was enabled. But we checked IPSec, right? Wrong. We did not check to see what policies were being applied at the local registry. To determine whether IPSec policies are being applied from either the local registry or through a Group Policy object (GPO), we followed these steps:

Installed Netdiag.exe from the Windows Server 2003 CD by running Suptools.msi from the Support\Tools folder.

Opened a command prompt and changed directories to %ProgramFiles%\Support Tools (assuming the default install path).

Ran the following command to verify whether an existing IPSec policy is assigned to the computer: netdiag /test:ipsec

Now, if there were no policy, we would have expected to see a message similar to “IP Security test…….:Passed IPSec policy service is active, but no policy is assigned.” What we found instead was that, although no Group Policy IPSec policy was being applied, someone (namely the security team) had gone in and applied an IPSec policy directly to the machine. In my experience this is typically not done, simply because the practice is an administrative headache and is riddled with problems. What happens when you update your policy? What if you miss applying a policy to one machine? What if you apply an outdated policy to a machine? You get the point…
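As an alternative to netdiag, Windows Server 2003 can also show locally assigned and GPO-assigned IPSec policies through netsh; a quick sketch (the exact output format varies by build):

REM Show all IPSec policies defined in the local store
netsh ipsec static show policy all

REM Show any policy assigned through Group Policy
netsh ipsec static show gpoassignedpolicy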

Here is one of the entries from the IPSec log; note that we are blocking TCP source port 2283.

So what was it about this IPSec policy that was a deal breaker? Several of the blocked ports fell within the ephemeral port range, and whenever TCP attempted to use one of those ports as the source port for an outbound connection, the connection would fail.

So how do you fix this issue? According to all the documentation, the fix is to set the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ReservedPorts to exclude a range of ports so that TCP does not attempt to use them. This makes sense: knowing the ports are blocked, remove them from the ephemeral pool of possible ports. The KB article that documents this key does not mention, however, that multiple ranges can be included. By adding ranges such as 1080-1080 and 2282-2283, one can exclude these ports from the ephemeral pool. There is, however, an open question as to whether this actually works. The customer chose another workaround, namely the built-in firewall in Windows 2003 Server.
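For what it's worth, ReservedPorts is a REG_MULTI_SZ with one range per string, so setting the ranges mentioned above from the command line would look something like this sketch (a reboot is generally required before TCP honors the change):

REM \0 is the REG_MULTI_SZ string separator for reg add; each range is a separate string
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v ReservedPorts /t REG_MULTI_SZ /d "1080-1080\02282-2283" /f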

So why were these ports being blocked? It seems the security folks connect to each server and make IPSec changes to block ports with known vulnerabilities; however, they failed to tell anyone that they do this, which is likely why it never made it into the build documents.