TCP Offloading again?!

I have spent probably hundreds of hours on cases involving TCP Offloading and I know most of the signs (intermittent dropped connections, missing traffic in network traces). However, I have to admit I got burned by it the other day and spent several more hours working an issue than I should have.

I was working on a server-down case for a financial trading company (in other words, large dollars involved every minute they were down) where the customer was experiencing slow connections to SQL Server. The customer reported only some Linux ODBC clients were impacted. Based on that description, we started looking at the client side. However, we soon discovered that, while there was no detectable correlation between the clients, the problem was only visible going to a specific SQL Server instance. The affected clients had no problem communicating with other instances of SQL Server. Based on this, we started focusing on the SQL Server machine itself.

From the client application’s perspective, every query was taking roughly five seconds longer than expected. Therefore, we collected a PSSDiag and looked at the performance of the SQL Server machine as a whole. The Profiler traces showed that there was no delay inside SQL Server:

So, where were the five seconds coming from?

The next step was to look at a network trace:

Check out the two sets of timestamps circled. Both of them had a five second delta! Now we had physical proof of the problem, but we still didn't have a reason…

Then, I noticed something that turned out to be the key: the five second delay was always between the data sent from the client and the server's response to that data. That clinched it. I couldn't explain yet why only some clients were impacted, but this was 100% a server-side issue. The other interesting thing to notice above is that the delay is even visible on the login! This was completely surprising because this customer was using SQL Authentication. That is a highly optimized query which should never have performance issues. This, combined with the fact that the subsequent query wasn't showing up inside SQL Server as being delayed, caused me to start thinking about things outside of SQL Server.
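The check described above (measuring the gap between the client's data and the server's reply) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` type, the `response_delays` helper, and the sample timestamps are all hypothetical, and the trace is assumed to have been exported with bare ACKs already filtered out so that only data-bearing packets remain.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    time: float   # seconds since the start of the capture (hypothetical export format)
    sender: str   # "client" or "server"


def response_delays(packets):
    """Return the gap between the last client packet of each request
    and the first server packet that follows it."""
    delays = []
    last_client_time = None
    for pkt in packets:
        if pkt.sender == "client":
            # Keep moving the marker forward until the server answers.
            last_client_time = pkt.time
        elif last_client_time is not None:
            delays.append(pkt.time - last_client_time)
            last_client_time = None  # ignore the rest of this server burst
    return delays


# Made-up timestamps that mimic the pattern described in the post.
trace = [
    Packet(0.000, "client"),   # login request
    Packet(5.002, "server"),   # login response
    Packet(5.010, "client"),   # query
    Packet(10.015, "server"),  # first packet of the result
    Packet(10.016, "server"),  # rest of the result
]
delays = response_delays(trace)
for d in delays:
    print(f"server took {d:.3f}s to respond")
```

If every gap lands near the same value, and the gaps between consecutive server packets are small, the delay sits between the server receiving the data and sending its reply rather than on the client.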

The next thing to check was for filter drivers that might have inserted themselves in the TCP stack – antivirus, firewall, NIC teaming, etc. Unfortunately, nothing like this was installed so there were no clues there. We also reconfirmed that TCP Chimney was turned off at the OS level. And then it hit me…NIC level TCP Offloading!!!

We disabled all of the Offloading settings, clicked OK and performance was back to normal. Connections were fast and query results were returning right after SQL Server generated them. I should point out that we didn’t take a stepwise approach here because this customer was losing large amounts of money every minute this system was down. In a less critical issue, it would be worth doing each setting one at a time and testing in between. In addition, I would also recommend that you go back after the fact and test enabling each setting to see if there is a negative impact. There are some non-trivial performance benefits to be gained from these settings if everything is working properly.
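For a less critical issue, it helps to list which offload features are still enabled before working through them one at a time. The sketch below is a rough illustration, not a definitive tool. It assumes the adapter's advanced settings have already been collected into a dictionary keyed by the NDIS standardized keyword names (for example `*LsoV2IPv4`), where `0` means disabled. The keyword list, the `enabled_offloads` helper, and the sample values are assumptions; a given vendor's driver may expose different or additional names.

```python
# Standardized NDIS keyword names for common offload features.
# Assumption: the driver in question uses these names; many add their own.
OFFLOAD_KEYWORDS = [
    "*LsoV2IPv4",
    "*LsoV2IPv6",
    "*TCPChecksumOffloadIPv4",
    "*TCPChecksumOffloadIPv6",
    "*UDPChecksumOffloadIPv4",
    "*UDPChecksumOffloadIPv6",
    "*IPChecksumOffloadIPv4",
    "*RSS",
]


def is_enabled(value):
    """Treat any non-zero value as enabled; flag unparseable values for review."""
    try:
        return int(str(value).strip()) != 0
    except ValueError:
        return True


def enabled_offloads(settings):
    """Return the offload keywords that are present and not set to 0."""
    return [
        keyword
        for keyword in OFFLOAD_KEYWORDS
        if keyword in settings and is_enabled(settings[keyword])
    ]


# Hypothetical settings for one adapter.
example = {"*LsoV2IPv4": "1", "*TCPChecksumOffloadIPv4": "3", "*RSS": "0"}
print(enabled_offloads(example))
```

The returned list doubles as a checklist for the stepwise approach: disable one entry, retest, and move on to the next.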

We never did figure out why only some clients were impacted since all of the clients were using the same driver. Nor were we able to figure out why only this SQL Server instance was impacted when several other SQL Server machines were configured the same way at the driver level.

The moral of the story? I need to update my standard steps for capturing network traces to include NIC level TCP Offload settings!

As of this morning, my first four steps for capturing a network trace now look like this:

1b. Confirm that TCP Chimney is turned off if any of the machines are Windows 2008 (see http://support.microsoft.com/default.aspx/kb/951037 for more details)
    a) Bring up a command prompt and execute the following: netsh int tcp show global
    b) If it turns out TCP Chimney is on, disable it: netsh int tcp set global chimney=disabled
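When the check in step 1b has to be repeated on several machines, the output of the netsh command can be parsed rather than read by eye. The following Python sketch assumes the output contains a line of the form `Chimney Offload State : disabled`, as on an English-language Windows 2008 system. The sample text, `chimney_state`, and `read_netsh_output` are illustrative names, not part of any Microsoft tool.

```python
import re
import subprocess


def chimney_state(netsh_output):
    """Extract the Chimney Offload State value, or None if the line is missing."""
    match = re.search(
        r"^\s*Chimney Offload State\s*:\s*(\S+)",
        netsh_output,
        re.MULTILINE | re.IGNORECASE,
    )
    return match.group(1).lower() if match else None


def read_netsh_output():
    """Run the command from step 1b (a) and return its text output (Windows only)."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Assumed sample output; the exact layout may differ between Windows versions.
sample = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : disabled
Receive Window Auto-Tuning Level    : normal
"""
print(chimney_state(sample))
```

On a real server, `chimney_state(read_netsh_output())` would replace the sample string. Any value other than `disabled` means step 1b (b) still applies.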

Maybe we should start a Facebook fan page of "TCP Offloading Sucks, So Disable it by Default." How can we get some pressure on Microsoft and the NIC vendors to stop screwing their customers, other software vendors, and even themselves with millions in support costs? I work for a software co. that this causes problems for and it's just a waste of human resources dealing with this crap.

From a Microsoft perspective, we have certainly learned from our mistakes in this area. Windows 2008 ships with the feature off by default and Windows 2008 R2 sets the default behavior based on the NIC speed (http://technet.microsoft.com/en-us/library/dd883262(WS.10).aspx#BKMK_chimney).

I cannot comment on the driver vendors, but I would hope they are listening to feedback both from customers and Microsoft.

My general recommendation is to leave TCP Offloading off unless you find yourself stressing your server to the extent that the potential increase in networking performance is worth enabling it.

Could you confirm that "Turn off TCP Offloading/Receive Side-Scaling/TCP Large Send Offload at the NIC driver level" must be done at the card, and/or does "netsh int tcp show global" show the status of this?

The network guys are saying this is disabled but if I open the nic settings it shows the IPv4 Checksum and IPv4 Large Send Offloads as both being enabled. Sorry, configuring TCPIP isn't one of my more usual skill sets as a DBA !

IL

24 Feb 2010 1:10 AM

Is there any good reason to disable IPv4 Checksum Offload? This parameter is not mentioned in the article.

Dean

25 Feb 2010 11:04 PM

Why is it soooo hard to get this feature working correctly after all these years ? And if it can't be made to work then why doesn't every vendor just drop the idea ?

I have over 4000 servers to check for these properties being on. Does anybody know how to query WMI to check for these TCP Offload settings?

An automated way to change the settings?

skeptic

9 Mar 2010 10:13 AM

... We never did figure out why only some clients were impacted since all of the clients were using the same driver. Nor were we able to figure out why only this SQL Server instance was impacted when several other SQL Server machines were configured the same way at the driver level ...

In my opinion you didn't solve the problem. I used to hear a lot from PSS: don't use /3GB with SQL Server 2000, as we see a lot of stability issues with the customers who use it. I NEVER saw a problem with /3GB so I did recommend it to my customers.

Looks like the line of advice that's coming from SQL PSS is to switch off all advanced tuning parameters and pray that this will solve the problem. Better off to turn the server that runs SQL off and that way you are not going to see any issues.

On a serious note, test, test, test all possible scenarios within your environment, don't just follow silly advice not to use some features because someone can have problems with it.

We would hate to see you alter settings on over 4000 servers. In general I would say that if you are not seeing an issue with your servers, then you shouldn't need to alter anything. I think the point of this blog was if you do notice a Performance issue, it may be a result of the above.

It should be looked at on a case by case basis and not a blanket change to your environment.

Thanks,

Adam W. Saxton

Jeff Jordan

28 Sep 2010 7:41 AM

Can you give more details on what to disable on the NIC? As another poster had asked, should IPv4 Checksum Offload also be disabled? In my case I've disabled the TCP Chimney in the OS and also disabled Large Send Offload v2 (IPv4), Large Send Offload v2 (IPv6), Receive Side Scaling, TCP Checksum Offload (IPv4), and TCP Checksum Offload (IPv6). Also what about UDP Checksum Offload? Should that be disabled as well?

Take another look at the failing clients for the applications they are running. While my experience is on a much smaller business case, the application which was actually failing on a TCP connection was Eudora mail client, which is heavily multi-threaded - it's the combination of TCP Offloading and multi-threading which causes the NIC driver/hardware to fall on its face. I don't pretend to understand the underlying mechanisms in detail but it appeared that switching threads requires the driver to switch its active TCP Offload connection in the NIC hardware and that switching process is prone to failure or excessive delay.

The NIC in question here is an nVidia nForce4 chipset on-board NIC and I did do a stepwise check of the offloading settings: in one case it was Checksum Offload which had to be disabled and in another it was Segmentation Offload. I turned both off on all the nForce4 equipped systems.

Note that specific multi-threaded switching situations are nigh impossible to reproduce so it's difficult to be sure that any given system is tested and validated.

I am curious. Besides the previously mentioned Offload options, we have a TCP Connection Offload option for the integrated Broadcom NICs on our HP servers and I was wondering if this option should be disabled (Tested) with the previously recommended disabled Offload options?