
Common issues with NFS.LockDisable=1

After seeing a mention on Scott Lowe’s blog (blog.scottlowe.org) and on the Storage Monkeys blog (blogs.storagemonkeys.com), I’ve decided to discuss the issues I’ve come across when NFS locking is disabled with the NFS.LockDisable=1 setting.

Although the problem can arise from many different circumstances, most of the feedback I’m receiving points to a VMware HA failover (either intentional or unintentional) as the trigger. So I would like to discuss VMware HA and how it works (based on my experience and knowledge).

But before that, let me mention that the end result of having NFS.LockDisable set to 1 is that virtual machines can become corrupt (Windows VMs blue screen and give NTFS errors; Linux guests are more resilient and could potentially be fixed by an fsck, but you should always have a good backup regardless). This happens because, with locking disabled, multiple ESX hosts can start the same VMX at the same time. OK, let’s continue…

From what I can see, when you configure VMware HA the first (4) nodes configured are marked as primary, and every host after the fourth is considered a backup node. In the event of an HA failover, the primary nodes will all attempt to start the VMs that were running on the failed node. It appears they rely on VM locking to determine whether the VM is actually down. What this means is that, with locking disabled, the VM can actually be powered on multiple times regardless of Isolation Response. In fact, the couple of times this has happened to me, I had the same VM running on up to (3) hosts at once.
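To make the reasoning above concrete, here is a toy model in Python. It is only a sketch of the behavior as I have described it, not VMware's actual HA logic; the function name `simulate_failover` and the host names are made up for illustration.

```python
def simulate_failover(primary_hosts, locking_enabled):
    """Toy model: return the hosts that end up running the same VM.

    Assumption (from the observation above, not from VMware documentation):
    every primary host attempts the power-on, and only the file lock stops
    a second host from succeeding.
    """
    running_on = []
    for host in primary_hosts:
        if locking_enabled and running_on:
            # Another host already holds the lock, so this attempt is refused.
            continue
        running_on.append(host)
    return running_on


# Hypothetical host names, used only to exercise the model.
primaries = ["esx1", "esx2", "esx3"]
with_locking = simulate_failover(primaries, locking_enabled=True)
without_locking = simulate_failover(primaries, locking_enabled=False)
```

In this model `with_locking` contains a single host, while `without_locking` contains all three, which matches the "same VM on up to (3) hosts" situation I ran into.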

You can also see some strange behavior in VirtualCenter, such as the number of virtual machines registered on each host jumping up and down within seconds. I would look at the summary of one of my hosts and see the virtual machine count go from 20 to 35 to 28 to 40 and so on.

The only true way to clear this up is to do the following from the service console:

Run a vmware-cmd -l on each of your hosts within the cluster.

Output this data to a file so you can sort it later (e.g. vmware-cmd -l > host1).

Now this is the tricky part. If you have tons of hosts within a cluster, it will take some time to find where the VMs are really located, but you do know which ones are registered multiple times. Knowing the list of multi-registered VMX files, you could potentially create a script that connects over ssh to each of your ESX hosts, runs vmware-cmd -l, greps for the VMX file, and returns a code telling you whether it’s there or not. Since I only had (4) nodes in the cluster that failed, this wasn’t necessary for me.
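If you would rather not compare the saved listings by eye, a minimal Python sketch along these lines could do the sorting for you. It assumes you have already saved one vmware-cmd -l listing per host as described above (one VMX path per line); the function name `find_multi_registered` is my own, and nothing here talks to the hosts.

```python
from collections import defaultdict


def find_multi_registered(listings):
    """Map each VMX path listed by more than one host to those hosts.

    `listings` maps a host name to the lines of its saved
    `vmware-cmd -l` output (assumed to be one VMX path per line).
    """
    hosts_by_vmx = defaultdict(set)
    for host, lines in listings.items():
        for line in lines:
            path = line.strip()
            if path:  # ignore blank lines in the saved file
                hosts_by_vmx[path].add(host)
    # Keep only the VMX files that show up on two or more hosts.
    return {
        vmx: sorted(hosts)
        for vmx, hosts in hosts_by_vmx.items()
        if len(hosts) > 1
    }


def load_listing(filename):
    """Read one saved listing file (e.g. the `host1` file) into lines."""
    with open(filename) as handle:
        return handle.read().splitlines()
```

To use it, build the dictionary from your saved files, for example `{name: load_listing(name) for name in ["host1", "host2"]}`, and pass it to `find_multi_registered`. The result tells you which VMX files need attention and on which hosts to look.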

You can run a ps aux | grep VMX-FILE on the hosts where they are registered to determine the PID.

Use kill -9 PID on the invalid hosts to remove the running VM. Magically it will become unregistered on those hosts.
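The PID lookup in the last two steps can also be sketched in Python. This is only an illustration: it assumes the usual 11-column ps aux layout (USER, PID, ... COMMAND), which you should check against your own service console, and the function name `pids_for_vmx` is hypothetical. It only reports PIDs and deliberately does not kill anything.

```python
def pids_for_vmx(ps_output, vmx_path):
    """Return the PIDs in `ps aux` output whose command mentions vmx_path.

    Assumes the common 11-column `ps aux` layout, where the second
    column is the PID and the last column is the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # keep the command line in one piece
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or a line in an unexpected format
        command = fields[10]
        # Skip the grep process itself if the text came from `ps aux | grep ...`.
        if vmx_path in command and not command.startswith("grep "):
            pids.append(int(fields[1]))
    return pids
```

Feed it the text captured from ps aux on a host, review the PIDs it returns, and only then run kill -9 by hand on the hosts that should not be running the VM.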

OK, so in closing, I do not want to put all the blame on VMware HA. It is actually the combination of NFS.LockDisable=1 and what happens because of it that causes the potential corruption. The same result can occur by manually registering and starting the same VMX on multiple hosts (disabling locking removes that added layer of security).

It is extremely important that you enable NFS Locking by changing NFS.LockDisable back to the default setting of 0. You should also install VMware Patch ESX350-200808401-BG. I discuss the fix of this issue in another posting, which can be found here.

The best practice and default is to Power Off Virtual Machines on the isolated host.

Long story short, it really depends on how comfortable you feel with your environment. HA depends on your network to test if a host is actually offline or not.

In the event of a network outage (and not a host failure), changing Isolation Response to Power On will make sure your VMs are not shut down. But this may not be good if the network issue is isolated to that one host: since the VMDK lock is not removed, another host cannot start the VM.

On the flip side, if you do have a host outage and Isolation Response is set to Power On, it really wouldn’t matter: the VM is killed anyway and the lock is removed, so another host will be able to start it. (It is also possible to have a semi-system failure where the machine hasn’t fully crashed but is still in a state where VMDK locks are in place.)

Because the default and best practice is to Power Off/Shutdown, I would have to recommend that. If you’re concerned about false-positive network issues, you should really worry about fixing those problems before changing Isolation Response to Power On.

I currently run with the Shutdown setting, which attempts a graceful shutdown first, then kills the VMX if the guest hasn’t responded to the shutdown request.

Good Morning Rick – Good discussion, and thank you for your time. The default setting for Isolation Response changed with Update 2 to Leave VMs Powered On, so I’m not sure I consider powering off a best practice anymore. The response from our customers has overwhelmingly been to leave them powered on. I have chosen to present it to my customers with the pros and cons and let them decide. This is more information for me, so that is great. Thank you!!

So, another question for you. If you have applied the NFS patch for ESX and you have locking set properly, is there a technical NEED to change the Isolation Response to Power Off? I haven’t seen one and I wanted to get your thoughts.

Is it really worth that much to purchase Virtual Center? Would it be worth purchasing Virtual Center
if I don’t upgrade my VI3 from Standard to Enterprise? The post above cheered me up a lot about buying it, but it did not
mention what it will and won’t give in the Standard edition.

Andrew, your comment and that link confuse me a little. The blog you referenced talks about the advantages of using VirtualCenter with a VMware Server deployment, and you’re discussing an upgrade from VI3 Standard to Enterprise… You should already have VirtualCenter for your deployment. Can you please elaborate a little more on this?

shankyrhodes 1:32 am on January 2nd, 2009

Hi,

We are considering buying VMware Virtual Center.
We have two servers running VMware Standard edition.
Do you believe it will be worth it? Or do we have to
upgrade our VMware licenses to Enterprise before upgrading
Virtual Center to make it worth it? I had just read the
following article: VMware virtual center real value