Thursday, 26 April 2012

Large VM crashes during snapshot commit

Snapshots can be your friend, but they can most certainly also make your
life miserable. The other day we had a rather large VM (20 GB memory, 8
vCPUs and 28 TB storage spread across 22 .vmdk files) that crashed during a
snapshot commit. The error stated: "Performing disk cleanup. Cannot
power off." The snapshot had been taken while the VM was powered off, and
only a few changes had been made to the VM before the snapshot was
committed.

After the crash, the VM would not power on. The error stated: "Reason:
Cannot allocate memory", and the error description (see screen dump
below) indicated a disk lock or a disk error. Fortunately,
the VM could be started from the service console (ESX 4.1 classic) with
'vmware-cmd'.

After boot, vCenter stated that there were no snapshots on the VM. However, 22 delta files on a single LUN told a different story.
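
The delta files can be counted from the service console with something like the sketch below. It assumes the usual '-delta.vmdk' suffix for snapshot extents; the function name and the path in the usage comment are placeholders, not something from the original session.

```shell
# Count snapshot delta extents in a VM's folder.
# Usage (placeholder path): count_delta_files /vmfs/volumes/<datastore>/<vm-folder>
count_delta_files() {
    dir="$1"
    # find prints one line per matching file in that folder; wc -l counts the lines
    find "$dir" -maxdepth 1 -type f -name '*-delta.vmdk' | wc -l | tr -d ' '
}
```

A non-zero count while vCenter reports no snapshots is the mismatch described above.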

The normal cleanup procedure is to power off the VM and clone it. However, with 28 TB of storage in the VM, this was not an option.

Instead, the following did the trick: log on to the service console,
change directory to the folder where the .vmx file for the VM resides,
take a new snapshot and then remove all snapshots (see this KB article for more info). This removes the new snapshot as well as the 'defective' snapshot.

To see if any snapshots exist (in this situation, probably none will be reported):

vmware-cmd vmname.vmx hassnapshot

To take a new snapshot (with no quiesce and no memory, see this KB article for details):

vmware-cmd vmname.vmx createsnapshot snapshot-name description 0 0

As you can see in the screen dump below, at first I tried to run the command
without the two boolean arguments that relate to QuiesceFilesystem and
IncludeMemory.

To remove all snapshots:

vmware-cmd vmname.vmx removesnapshots

In the screen dump above, the removesnapshots command returns the value
'1'. Despite looking like an error code, this means that all is well and
the snapshots are gone.
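
For reference, the three steps can be strung together roughly as in the sketch below. It is an illustration, not what was run during the incident: the function name, the default snapshot name and the VMWARE_CMD override are made up here, and it assumes that vmware-cmd exits with a non-zero status when a step fails.

```shell
# Sketch of the cleanup sequence; run it from the folder holding the .vmx file.
# VMWARE_CMD is a hypothetical override (handy for a dry run); default is vmware-cmd.
VMWARE_CMD="${VMWARE_CMD:-vmware-cmd}"

cleanup_snapshots() {
    vmx="$1"                     # e.g. vmname.vmx
    snap_name="${2:-cleanup}"    # placeholder snapshot name

    # Ask whether the VM reports a snapshot (probably not, in this situation)
    "$VMWARE_CMD" "$vmx" hassnapshot || return 1

    # Take a new snapshot: quiesce = 0, memory = 0
    "$VMWARE_CMD" "$vmx" createsnapshot "$snap_name" "temporary snapshot for cleanup" 0 0 || return 1

    # Remove all snapshots, which also commits the orphaned delta files
    # (assumption: a failed step above exits non-zero and stops the sequence)
    "$VMWARE_CMD" "$vmx" removesnapshots || return 1
}
```

It would be called as: cleanup_snapshots vmname.vmx. With 28 TB behind the VM, the commit can take a long time, so running the steps by hand and watching the output of each one is probably the safer route.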