We have just completed a migration from SQL 2000 to SQL 2008 R2 and have started to intermittently receive SqlExceptions with the following two error messages:

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)

We have 3 web servers connecting to this SQL Server running around 100 applications (all accessing the same 8 databases on the SQL Server).

Because these exceptions were not occurring on the 2000 server, we feel it is unlikely to be an application issue (however, we are not ruling it out). Traffic on the web sites is typical, ruling out a high traffic issue. The old SQL 2000 box had 4 CPUs and 8 GB RAM, while the new one has 16 CPUs and 24 GB RAM (and is underutilized, both currently and during the issue).

These errors occurred for a period of about 5 minutes several hours ago and have not as yet reoccurred.

The sys.dm_os_ring_buffers system view does not show entries for these disconnects, and there are no corresponding event log entries on either the server or the client.

Some googling has found a few similar reports, however nothing seems definitive (see links below). Has anyone seen errors like this after migrating from SQL 2000 to SQL 2008 R2?

Which operating system are you running on? If there were absolutely no entries in the server event logs or SQL Server errorlogs, it would lead me to believe that the connection is never even making it to the instance...
– Pam Lahoud, Oct 13 '10 at 4:23

3 Answers

We have tracked down and fixed this issue in our environment. The description as I understand it is below. Please excuse potential inaccuracies: this is the way I (as a software developer) understand the descriptions given to me by our Network Administrator (who was also working with our hosting company).

The cause was eventually tracked down to a network configuration issue involving the Load Balancer. We had expected that the Load Balancer was sitting between the internet and our web servers, and that all of our servers were communicating freely with each other. Unfortunately the network was set up in such a way that all network traffic (including traffic between the SQL Servers and Web Servers) was passing through the Load Balancer. The Load Balancer was configured to limit bandwidth passing through it, and when the limit was exceeded it simply dropped packets. The limit was often exceeded when large file transfers were occurring between the servers (e.g., when database backups were copied off of the database server). This was hard for us to see as we didn't have access to the Load Balancer (only our hosting provider could access it), and as far as we could tell we were far from saturating our network interfaces. Additionally, these issues were extremely sporadic (on the order of a handful of minutes every 3-5 months).

The fix was to rearrange the environment so our internal network traffic did not go through the LB; I believe the network was rearranged to fit a One-armed Load Balancing Architecture. Since making this change we have not experienced the intermittent connectivity issues.
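To make the failure mode above a bit more concrete, here is a rough Python sketch of a device that drops (rather than queues) anything over a bandwidth limit. It assumes the Load Balancer behaved roughly like a token-bucket policer; I don't know how ours was actually implemented. The `Policer` class, the 1000 bytes/ms limit, the burst size and the packet sizes are all made up for illustration.

```python
class Policer:
    """Hypothetical token-bucket policer: traffic over the limit is dropped, not queued."""

    def __init__(self, rate_bytes_per_ms: int, burst_bytes: int) -> None:
        self.rate = rate_bytes_per_ms   # assumed sustained limit
        self.burst = burst_bytes        # assumed bucket capacity
        self.tokens = burst_bytes       # start with a full bucket
        self.last_ms = 0

    def allow(self, now_ms: int, size: int) -> bool:
        """Return True if the packet passes, False if it is dropped."""
        elapsed = now_ms - self.last_ms
        self.last_ms = now_ms
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


def simulate(backup_running: bool) -> int:
    """Count dropped SQL packets over one simulated second."""
    policer = Policer(rate_bytes_per_ms=1000, burst_bytes=100_000)
    dropped_sql = 0
    for now_ms in range(1000):
        if backup_running:
            # A bulk file copy offers 4 x 1000 bytes per ms, well over the limit.
            for _ in range(4):
                policer.allow(now_ms, 1000)
        # One small SQL result packet per millisecond.
        if not policer.allow(now_ms, 500):
            dropped_sql += 1
    return dropped_sql


if __name__ == "__main__":
    print("SQL packets dropped, no backup copy:", simulate(False))
    print("SQL packets dropped, backup copy running:", simulate(True))
```

In this toy model the SQL traffic alone never comes near the limit, but once the bulk copy drains the bucket the small SQL packets start getting dropped too, which matches what we saw: errors only while large transfers were in flight, with the network interfaces themselves nowhere near saturated.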

If I'm understanding correctly you've not only changed your software but also your hardware - so there are plenty of changes that could be causing this connection error. I've seen plenty of recommendations to double check your NIC drivers and motherboard firmware (!!) to fix this. Yikes!

Anyway - you should be able to see this error in your server application log. From there you may be able to get an idea of the date/time the exception occurred, so you can compare it to the individual client/application events to narrow down what's happening when this exception pops up.
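If you end up with a list of exception timestamps, a small script can do the comparison for you. This is only a sketch: the `events_near_errors` helper, the event names and every timestamp below are placeholders I made up, not data from the question, and the one-minute slack is an arbitrary assumption to allow for clock differences between machines.

```python
from datetime import datetime, timedelta


def events_near_errors(error_times, events, slack=timedelta(minutes=1)):
    """Return names of events whose time window (padded by slack) contains an error.

    error_times: list of datetime values taken from the exception logs.
    events: list of (name, start, end) tuples from other logs or job schedules.
    """
    hits = []
    for name, start, end in events:
        if any(start - slack <= t <= end + slack for t in error_times):
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Placeholder values for illustration only.
    errors = [datetime(2000, 1, 1, 2, 3), datetime(2000, 1, 1, 2, 6)]
    candidates = [
        ("backup copy", datetime(2000, 1, 1, 2, 0), datetime(2000, 1, 1, 2, 10)),
        ("index rebuild", datetime(2000, 1, 1, 5, 0), datetime(2000, 1, 1, 5, 30)),
    ]
    print(events_near_errors(errors, candidates))
```

Anything that shows up repeatedly next to the errors is worth a closer look first.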

You can also use Netmon to trace the connections from the clients to the server. You'll want to give it a couple of days to reproduce the error. This should narrow it down a bit and at least give you an idea of what is failing.

There were no entries in the event log, and the errors occurred in several of the applications for the duration of the event. I will look into Netmon to see if it can help.
– Chris Shaffer, Oct 12 '10 at 22:35

The last time I saw "The semaphore timeout period has expired" was when I tried to copy files from one hard drive to another on Windows Server 2008. It appeared to be caused by a heavily fragmented hard drive with bad clusters. It was a Western Digital 2TB Caviar Green, by the way, not in RAID.