an insider's perspective, technical tips n' tricks in the era of the IT Revolution

March 09, 2011

Performance = art not science (with DelayedAck 10GbE example)

I’ve always found that resolving performance issues is more about the HOW than the WHAT. In other words, it’s not like there is a list of “secret switches” that simply make things scream (e.g.: “just enable jumbo frames and it will solve all your problems” or “do/don’t use thin, it will be good/bad”).

I call these hyper-specific pieces of guidance the “WHAT” – they imply that there is some specific set of configurations that will always be good.

There’s simply no way for any human to keep track of all the infinite variations, all of which are constantly changing.

So, here are my guiding principles on HOW I try to approach performance – in other words, best practices that are “process things that get you to a good place”, rather than “knobs to turn to get you to a good place”.

First Personal Rule: KISS is more than a cute acronym, it’s a way of life.

Second Personal Rule: If you find yourself needing to tweak a lot of defaults, you’re probably on the wrong track.

Third Personal Rule: You’ll inevitably run into problems – so sticking to a good “how to diagnose” process is very important.

I’m going to use ONE example to explore these ideas (and this is a real customer example, with real customer data).

If you’re interested, read on…

So – here’s the scoop – there’s a GREAT VMware and EMC customer that IMO is generally quite advanced. Back in 2010, they started deploying 10GbE on VMware (using Intel X520 interfaces), used EMC’s midrange storage platforms (a CX4 – they’re happy with it, and are evaluating VNX), and added 10GbE iSCSI UltraFlex SLICs.

They found that they were not getting the kind of bandwidth from the interfaces that they expected. Here were the ESXTOP results from one of the hosts.

First of all – remember – it’s not “real world” to drive a workload to saturate links (though it’s possible), so don’t auger in on maximum throughput. The load generator here was doing sequential IO within a single VM. Also, they were finding latencies that were materially higher than they expected – here’s a vSCSIstats dump:

As you can see – the distribution had WAY too many IOs that landed in the 100ms bucket. Long and short – they felt they could and should get more.
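Since the vSCSIstats screenshot doesn’t reproduce here, a small sketch may help picture it. This is purely illustrative – the bucket boundaries, service times, and stall behavior below are all assumed, not the customer’s data – but it shows how a periodic ~100ms stall (the delayed-ACK signature) piles histogram mass into a high-latency bucket even when the array itself is fast:

```python
import bisect
import random

# Assumed latency buckets in ms, loosely modeled on a vscsiStats-style histogram.
BUCKETS_MS = [1, 5, 15, 30, 50, 100, 500]

def histogram(latencies_ms):
    """Count how many samples land at or below each bucket's upper bound."""
    counts = [0] * (len(BUCKETS_MS) + 1)
    for lat in latencies_ms:
        counts[bisect.bisect_left(BUCKETS_MS, lat)] += 1
    return counts

# Simulate a single-threaded sequential workload: ~1ms service time,
# but every fifth I/O stalls on a ~95ms delayed-ACK timer (illustrative).
random.seed(42)
samples = []
for i in range(1000):
    lat = random.uniform(0.5, 2.0)   # normal service time
    if i % 5 == 0:
        lat += 95.0                  # delayed-ACK stall
    samples.append(lat)

labels = [f"<= {b} ms" for b in BUCKETS_MS] + ["> 500 ms"]
for label, count in zip(labels, histogram(samples)):
    print(f"{label}: {count}")
```

With these assumed numbers, a fifth of all IOs land in the 100ms bucket – the array never got slower; the transport stalled.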

At VMworld 2010, I presented a series of topics on general storage best practices together with Vaughn. While we intentionally worked to make the content apply across the storage landscape, I put a couple things in there that were germane. Here was one:

There were two main points here: 1) to optimize your bandwidth, make sure you’re current on the software, and make sure you enable flow control end-to-end; 2) don’t obsess over line rate (and set expectations on what’s reasonable to expect on a given platform). BTW – I did it cognizant of the fact that competitors could use it for silly counters, but frankly, we’re the leader, and they will always needle us. I believe we serve our customers best when we’re as transparent as possible – and I worry less about our competition and what they’ll do.
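On point 2, a quick back-of-envelope helps set expectations. The overhead figure below is an assumption for illustration, not a measured number for any particular platform:

```python
# Rough line-rate expectations for a 10GbE iSCSI link (illustrative figures).
link_gbps = 10
raw_mb_s = link_gbps * 1000 / 8      # 1250 MB/s of raw signaling rate

# Assumed ~10% loss to Ethernet/IP/TCP/iSCSI headers and protocol behavior -
# the real figure varies by MTU, offloads, and workload.
overhead = 0.10
practical_mb_s = raw_mb_s * (1 - overhead)

print(f"raw: {raw_mb_s:.0f} MB/s")
print(f"practical ceiling: ~{practical_mb_s:.0f} MB/s")
```

The point isn’t the exact ceiling – it’s that chasing the last few percent toward raw line rate is rarely worth it.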

The answer is simple – disabling DelayedAck doesn’t ALWAYS have the desired, positive effect, and we need to do more testing before changing it broadly.

Well – I DO think we could/should update our core documents – you can see the one that applies to this topic here:

But – this is an optimization that is, in effect, a reflection of a transient thing. I know that transient things can cause a LOT of pain (hence knowing about the workaround is important). An analogous (and related) issue was one a lot of VMware customers struggled with – Broadcom NICs with VMware – which was fixed in vSphere 4.1 Update 1.

That type of solution – transient, manual, complex – is, in general, a bad solution to a performance problem. I’m not a purist, and I know you need to make compromises – so for this particular customer (and maybe for others), the workaround is a good thing.

So – let’s take a look against the “guiding principles”

First Personal Rule: KISS is more than a cute acronym, it’s a way of life.

“Defaults are usually best” – only change a default if you really have to. And if a default should be different, then the vendor should change the default.

Here, the KISS principle says that if the fix applies generally, it’s important to know about the workaround – but it’s most important for the fix to be applied in the platforms themselves, so you don’t need to change an advanced setting. We’re working on it. In the meantime, I want awareness to be high.

Please follow the general principles discussed in the recording we did at the Atlanta VMUG (see it here), and in this webcast here.

Second Personal Rule: If you find yourself needing to tweak a lot of defaults, you’re probably on the wrong track.

Stick, as much as possible, to the core recommendations in the core docs.

That is why disabling delayed ack is a workaround, not a fix – since changing it requires detailed changes on a host-by-host basis, it’s “fragile”.

Third Personal Rule: You’ll inevitably run into problems – so sticking to a good “how to diagnose” process is very important.

This is perhaps the most instructive thing in this example – the customer started HIGHER in the stack (at the VMware layer, using ESXtop and vSCSIstats), and then worked LOWER (the array stats). This is a VERY important performance troubleshooting technique.

Personally, IMO #3 is one of the most important principles of performance troubleshooting. Here’s my “infrastructure troubleshooting” sequence:

If it’s serious, take the little bit of time to open official cases with the vendors involved.

The most powerful resolutions to problems are open and community-based. Use Google.

When troubleshooting – start with the connectivity from client to the application, then…

This sequence has served me well many times. Note – there’s no correlation with problems occurring more/less at various parts of the troubleshooting process, or having more/less impact at given parts of this troubleshooting process.

The point is that it gets you to approach this stuff methodically.

I REALLY want to highlight a couple of PowerShell scripts the vSpecialists have been using for step 4 and step 5.

Great post Chad. I have used the diagnosis process in your third rule for years. However, I have always had to "start" somewhere I have seen the symptom. For example, if you see a network issue, you start at that step and go up or down the chain based on the information you have discovered. Also, if you can't find a resolution to your problem, you need to re-review all of your steps yourself or with a second pair of eyes. I have had cases where I go through the whole process, don't find the problem, and by re-reviewing everything with another pair of eyes I found I missed something...we are all human :)

Delayed ack (interacting with Nagle's algorithm) is something which affects a lot of environments, but sometimes doesn't cause obvious enough issues to warrant a deep investigation. People tend to live with it and accept "well, that's the best I can get".
Which is a pity.

While it is true in VMware and on the array target side, the same applies to Windows hosts (VMware with in-VM iSCSI initiators, Hyper-V servers with iSCSI).

Standard IP network traffic, for the most part, does not fall into the same realm in terms of payload as storage-based block IP (its payload is much bigger – typically over 1492 bytes).

Quite often, SCSI commands are very small in terms of payload size (could be just 10 bytes) for slow-path CDB, inquiry, metadata and control commands. These are exactly the commands you do not want to be slow!
Storage vendors such as EMC, and OS/hypervisor vendors, use small SCSI commands to control and inquire about storage. Typically the code written here is sequential, single-threaded code to enable correct timing and arbitration of devices – and this is when delayed ack really stings. Like your customer above, with his sequential workload.
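[Editorial aside: a back-of-envelope makes the commenter's point concrete. The service time and timer value below are assumptions for illustration, not measurements from any specific stack.]

```python
# Assumed numbers, purely illustrative: a single-threaded command stream
# issues one small SCSI command at a time, so a delayed-ACK stall adds
# directly to every affected round trip - there is no parallelism to hide it.
service_ms = 0.5     # assumed array service time for a small command
stall_ms = 100.0     # assumed delayed-ACK timer worst case

cmds_per_sec_clean = 1000.0 / service_ms
cmds_per_sec_stalled = 1000.0 / (service_ms + stall_ms)

print(f"without stalls: {cmds_per_sec_clean:.0f} commands/s")
print(f"with stalls:    {cmds_per_sec_stalled:.1f} commands/s")
```

Two orders of magnitude vanish, and no amount of link bandwidth buys them back.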

The default for iSCSI initiators in Windows 2008 onwards is to enable Nagle's algorithm, which I believe is the wrong thing to do. iSCSI, by its nature, assumes a low-hop, point-to-point network between host and storage, and as such link saturation due to aggregate traffic on trunks is less likely to occur.
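[Editorial aside: the sender-side half of this interaction is visible with plain sockets. This is the generic TCP_NODELAY socket option, not VMware's or Microsoft's initiator mechanism – just a sketch of the knob the commenter is describing.]

```python
import socket

# A plain TCP socket has Nagle's algorithm enabled by default: small writes
# are held back until the previous segment is ACKed - and that ACK is exactly
# what a delayed-ACK peer may be withholding until its timer fires.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("Nagle disabled?", bool(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)))

# Turning TCP_NODELAY on sends small payloads (like SCSI CDBs) immediately,
# sidestepping the Nagle/delayed-ACK interaction from the sender's side.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print("Nagle disabled?", bool(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)))
s.close()
```

The vendor knobs (the ESX DelayedAck advanced setting, the Windows KB below) address the same pathology at the initiator/target level.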

I tried to convince Microsoft to either change the default for an iSCSI adapter or at least enable a radio button in the iSCSI Initiator UI in Windows 2008. Well, I got to write them a KB article instead :-) http://support.microsoft.com/kb/2020559

So, more times than not, it can be better to have delayed ack disabled by default for iSCSI adapters.


Disclaimer

The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog; it is not a Dell Technologies blog.