A very interesting Microsoft cluster failure

It’s been a long time since my last post. I was offline due to a health issue; I’ve just recovered and am back to normal work. I’ll publish articles in English first and translate them to Chinese later, since I have much less spare time after my baby was born (but more fun). Hopefully this doesn’t hurt Google search. 🙂

There was an interesting problem on a Microsoft cluster when I came back from the hospital. Our DBA team complained that the Microsoft Cluster Service failed intermittently on a virtual machine. This situation kept happening for a week.

At the beginning of the troubleshooting, the team noticed the quorum disk failed with the following Windows event:

Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

So we focused on disk performance. vobd.log also showed some performance degradation, but the timestamps didn’t match the failures. Microsoft was engaged after that; they said the cluster failure was actually caused by a network connectivity issue, based on the following Windows event:

Cluster node ‘xxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

This is where it got interesting: the virtual machines share physical network links, so if there really were a network connectivity issue, it couldn’t affect only a single virtual machine. We also noticed some abnormal Windows events around the times the failures happened.

After involving multiple vendors from the OS, virtualization, network, and storage teams, everybody said it wasn’t their problem. You see this kind of problem in large datacenters: as more and more systems are installed, it becomes hard to find out which piece of the system caused the issue. You have to be familiar with every field of the datacenter.

Eventually we figured out the issue was related to storage workload. So why couldn’t the vendors find it? First of all, the Windows OS disk runs on shared storage, and Windows stops responding when the latency of the VMFS5 datastore hosting the OS disk is high. From the Windows perspective, it doesn’t know what happened on the backend storage; it only knows the OS was very slow for a few seconds, as if the system paused. That leads to network packets being dropped, and there is no Windows event for it because the OS resumes quickly enough that Windows treats it as normal behavior. The cluster actually failed at exactly this moment. Second, the particular LUNs hosting the virtual machine were not busy, but those LUNs shared a storage pool with other LUNs. Any LUN with a high workload impacts the rest of the LUNs in the same storage pool.
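To see why a pause of only a few seconds is enough to break the cluster, here is a minimal sketch of the heartbeat math. Windows failover clustering evicts a node after a number of consecutive missed heartbeats; the interval and threshold values below (1 second between heartbeats, eviction after 5 misses) are assumed defaults for illustration, as the real values depend on the Windows version and cluster settings such as SameSubnetDelay and SameSubnetThreshold.

```python
# Sketch: why a short storage stall can evict a cluster node.
# The interval/threshold values are assumptions, not verified defaults.

HEARTBEAT_INTERVAL_S = 1   # seconds between heartbeats (assumed)
MISS_THRESHOLD = 5         # consecutive misses before eviction (assumed)

def node_evicted(stall_seconds: float) -> bool:
    """A node frozen for `stall_seconds` misses one heartbeat per interval;
    it is removed from cluster membership once the misses reach the threshold."""
    missed = int(stall_seconds // HEARTBEAT_INTERVAL_S)
    return missed >= MISS_THRESHOLD

print(node_evicted(3))   # a brief I/O hiccup: the node survives -> False
print(node_evicted(7))   # a 7-second storage stall: node evicted -> True
```

With these assumed values, a storage stall of just a few seconds, invisible in the Windows event log, is already long enough to cross the eviction threshold.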

Once we understood these points, we found that a lot of virtual machines saw high latency or high IO around 7 PM every day, and most of the cluster failures happened at that time. Since it impacted a large number of virtual machines, it had to be caused by some component common to all of them. After capturing network packets, we eventually figured out it was the McAfee DAT update: all the virtual machines downloaded the same update at the same time, which created a high workload on the shared storage and made the clustered virtual machines unresponsive for a few seconds. The issue was fixed after changing the McAfee DAT update schedule to a random interval.
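The randomized-schedule fix can be sketched as follows. This is only an illustration of the idea, not McAfee’s actual mechanism: instead of every VM starting its update at exactly 19:00, each VM adds its own random offset, so the downloads spread across a window rather than piling up on the shared storage. The jitter window size and the per-VM seed are hypothetical.

```python
# Sketch of jittered scheduling to avoid a "thundering herd" of updates.
# BASE_UPDATE_HOUR matches the original 7 PM schedule; the 120-minute
# jitter window is an assumption for illustration.
import random

BASE_UPDATE_HOUR = 19
MAX_JITTER_MINUTES = 120

def update_time_for_vm(rng: random.Random) -> str:
    """Return an HH:MM start time with a per-VM random offset."""
    offset = rng.randint(0, MAX_JITTER_MINUTES)
    hour = BASE_UPDATE_HOUR + offset // 60
    minute = offset % 60
    return f"{hour:02d}:{minute:02d}"

# Each VM seeds its own generator (here from a hypothetical VM name),
# so the fleet's update times spread across the window.
rng = random.Random("vm-001")
print(update_time_for_vm(rng))
```

The same idea applies to any fleet-wide agent: a fixed schedule synchronizes the load, while per-host jitter flattens it.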

There are always common activities going on in a datacenter; each may be a small resource consumer on its own, but together they can become a significant monster in a virtualized datacenter. Examples include backup, monitoring, anti-virus, and system management agents. Any of them can impact shared storage or network links.

Yeah, I agree. At this point I do suspect it’s something to do with the storage side of things, and I’m glad my storage team is giving me SAS disks, which I can turn into dedicated datastores on the VMware side and use only for my SQL virtual machines. I haven’t called Microsoft at this point and I hope I don’t have to. I will update this post after my RDM migration from HP 3PAR to NetApp has been completed.

I’m not sure whether 3PAR can assign dedicated disks. Usually the backend is shared storage even when you see a dedicated LUN presented to ESXi, which means a significant workload on disks serving other services can still impact your ESXi LUN. The 3PAR cache should also be checked to make sure it isn’t a bottleneck.