Monday, January 30, 2012

Being able to vMotion MSCS cluster nodes is quite difficult to achieve and requires some fine-tuning of the Windows cluster parameters. The modifications I am going to show you are meant to increase the cluster heartbeat timeout and to make the cluster less sensitive to brief network interruptions.

By default, MSCS sends a heartbeat every second, and it will fail your node and initiate a failover if five heartbeats (pings) are lost, that is, after about 5 seconds.

Unfortunately, 5 seconds is sometimes not enough time for VMware to complete the vMotion process: while the contents of the guest's memory are copied from one physical host to another, the guest is quiesced for a few seconds to allow the changed blocks of memory to be synchronized. Typically you may lose up to 3 pings.

You then need to raise the heartbeat values to their maximums by issuing the commands below on just one of your cluster nodes:

cluster /prop SameSubnetThreshold=10:DWORD

cluster /prop SameSubnetDelay=2000:DWORD

Here's the explanation of these parameters:

SameSubnetDelay: The value in milliseconds of the cluster heartbeat frequency. By default, this value is 1,000 milliseconds. The maximum possible value is 2000.

SameSubnetThreshold: The number of missed heartbeats that will be tolerated before a failover event occurs. The default value is 5. The maximum value is 10.

Setting these values to 2000 and 10 means that the cluster service will wait for 20 seconds (2000 milliseconds x 10 missed heartbeats) before initiating a failover.
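
To make the arithmetic explicit, here is a small Python sketch that computes the wait time from the two parameters. The function name failover_timeout_seconds is my own invention for illustration and is not part of any cluster tool. It simply assumes, as described above, that the wait time is the heartbeat interval multiplied by the number of tolerated missed heartbeats.

```python
def failover_timeout_seconds(delay_ms: int, threshold: int) -> float:
    """Seconds the cluster waits before failing a node.

    Assumes the wait is simply SameSubnetDelay (in milliseconds)
    multiplied by SameSubnetThreshold (missed heartbeats).
    """
    return delay_ms * threshold / 1000.0


# Values taken from the post: Windows defaults and the tuned maximums.
DEFAULTS = {"SameSubnetDelay": 1000, "SameSubnetThreshold": 5}
TUNED = {"SameSubnetDelay": 2000, "SameSubnetThreshold": 10}

for label, props in (("default", DEFAULTS), ("tuned", TUNED)):
    timeout = failover_timeout_seconds(
        props["SameSubnetDelay"], props["SameSubnetThreshold"]
    )
    print(f"{label}: {timeout:.0f} s")
# default: 5 s
# tuned: 20 s
```

With the defaults the window is 5 seconds, while the tuned values give 20 seconds, which should leave more room for the few seconds during which the guest is quiesced.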

The commands above are a new feature of the cluster service under Windows Server 2008. There have in fact been many improvements to the Windows Server 2008 failover clustering service, and one of these improvements concerns precisely the cluster heartbeat mechanism.

If these modifications don't resolve your failover problem, then you might also experiment with a few more parameters: