Configure VMware Fault Tolerance for single and multi-vCPU virtual machines

Configure a HA cluster to meet resource and availability requirements

An HA cluster provides high availability for VMs: hosts are monitored and, in the event of a failure, their VMs are restarted on another host in the cluster. When an HA cluster is created a master host is elected. The master communicates with vCenter and monitors the state of all protected VMs and of the other hosts in the cluster. When a host is added to the cluster an agent is uploaded to it and configured to communicate with the other agents in the cluster. Every host in the cluster is either the master host or a slave host.

The master host is responsible for detecting the failure of slave hosts. Depending on the type of failure, VMs may need to be restarted on other hosts. Three types of host failure are detected: failure (the host stops functioning), isolation (the host becomes network isolated) and partition (the host loses network connectivity with the master host).
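To make the difference between failure, isolation and partition concrete, here is a minimal Python sketch of the decision. It is a simplified model for study purposes, not VMware's implementation: the function name, the parameters and the `HostState` enum are all made up for illustration, and the real agent considers more signals than these four.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    FAILED = "failed"
    ISOLATED = "isolated"
    PARTITIONED = "partitioned"


def classify_host(network_heartbeat: bool,
                  datastore_heartbeat: bool,
                  responds_to_ping: bool,
                  reaches_isolation_address: bool) -> HostState:
    """Simplified, hypothetical model of how a slave host could be classified."""
    if network_heartbeat:
        # The master still hears from the agent, nothing to do.
        return HostState.LIVE
    if not datastore_heartbeat and not responds_to_ping:
        # No sign of life on the network or on storage.
        return HostState.FAILED
    if not reaches_isolation_address:
        # The host is running but cut off from the management network.
        return HostState.ISOLATED
    # The host is running and has some connectivity, but not to the master.
    return HostState.PARTITIONED


if __name__ == "__main__":
    print(classify_host(False, True, False, False).value)
```

Only the failed state leads straight to VMs being restarted elsewhere. For an isolated host the configured isolation response decides what happens to its VMs.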

To create a new HA cluster in the Web Client go to Hosts and Clusters – Datacentre – Actions – New Cluster. Give the cluster a name and tick the box for HA.

By default most of the features are disabled. The settings that are enabled by default are Host Monitoring, with VM restart priority set to Medium.

Admission Control is also enabled and by default reserves failover capacity for 1 host failure. If the cluster does not have enough spare capacity to tolerate that failure, VMs will be prevented from powering on. Admission Control can be disabled, or changed to reserve a percentage of cluster resources instead.
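The percentage option is easier to reason about with numbers. The sketch below is my own illustration of the arithmetic as I understand it (free capacity divided by total capacity, compared against the configured percentage). The function names are hypothetical, and real admission control evaluates CPU and memory separately using VM reservations.

```python
def failover_capacity_pct(total: float, reserved: float) -> float:
    """Share of a cluster resource that is still free, as a percentage."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return (total - reserved) / total * 100.0


def can_power_on(total: float, reserved: float,
                 vm_reservation: float, configured_pct: float) -> bool:
    """True if powering on the VM keeps free capacity at or above the configured percentage."""
    remaining = failover_capacity_pct(total, reserved + vm_reservation)
    return remaining >= configured_pct


if __name__ == "__main__":
    # Assumed example: 24 GHz of CPU in the cluster, 7 GHz already reserved.
    print(round(failover_capacity_pct(24, 7), 1))
    print(can_power_on(24, 7, vm_reservation=2, configured_pct=25))
```

With 25% reserved for failover, a power-on is refused as soon as it would push free capacity below that 25% mark.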

Configure custom isolation response settings

When a host becomes isolated you can configure the response: leave the VMs running on the isolated host, power them off and restart them on a non-isolated host, or shut them down and restart them on a non-isolated host. VMware Tools must be installed in the guest for a VM to be shut down cleanly.

Configure VM Component Protection (VMCP)

VM Component Protection (VMCP) can be used to prevent a split-brain condition that can occur when a host becomes isolated or partitioned from the master host and the master host cannot communicate with it using heartbeat datastores. When you enable VMCP with the aggressive setting, it monitors the datastore accessibility of powered-on virtual machines, and shuts down those that lose access to their datastores. VMCP is a vSphere 6 feature.

When a host can no longer access the storage path for a specific datastore, a datastore accessibility failure is detected. You can configure the response that HA will make to such a failure, ranging from the creation of event alarms to virtual machine restarts on other hosts. Two types of failure can be detected:

Permanent Device Loss (PDL) – an unrecoverable loss of accessibility that occurs when a storage device reports the datastore is no longer accessible by the hosts. This condition cannot be reversed without powering off the VMs.

All Paths Down (APD) – represents a transient or unknown accessibility loss or any other unidentified delay in I/O. This issue is recoverable.

To enable VMCP, edit the cluster settings and select Protect against Storage Connectivity Loss.

Once enabled, use the Response for Datastore with Permanent Device Loss (PDL) drop-down box to configure the response. The options are Disabled, Issue events, or Power off and restart VMs.

Use the Response for Datastore with All Paths Down (APD) drop-down box to configure the APD response. The options are the same as for PDL, but Power off and restart VMs can also be set to either conservative or aggressive.

Conservative – does not terminate the VM if the success of the failover is unknown, for example in a network partition.

Aggressive – terminates the VM even if the success of the failover is unknown.

The timeout value can be adjusted. This is the time between the storage failure being detected and the point at which HA starts the APD recovery response.
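The PDL and APD options above boil down to a small decision table, sketched here in Python. This is a study aid under my own assumptions, not how the HA agent is written: the enum members, the function name and the returned strings are all hypothetical.

```python
from enum import Enum


class Failure(Enum):
    PDL = "PDL"
    APD = "APD"


class Response(Enum):
    DISABLED = "disabled"
    ISSUE_EVENTS = "issue events"
    RESTART = "power off and restart VMs"                          # PDL only
    RESTART_CONSERVATIVE = "power off and restart (conservative)"  # APD only
    RESTART_AGGRESSIVE = "power off and restart (aggressive)"      # APD only


def vmcp_action(failure: Failure, response: Response,
                failover_success_known: bool,
                apd_timeout_elapsed: bool = True) -> str:
    """Hypothetical model of what happens to a VM that lost its datastore."""
    if response is Response.DISABLED:
        return "ignore"
    if response is Response.ISSUE_EVENTS:
        return "raise event"
    if failure is Failure.PDL:
        if response is not Response.RESTART:
            raise ValueError("PDL has no conservative or aggressive variant")
        return "terminate and restart"
    # Everything below handles APD.
    if response is Response.RESTART:
        raise ValueError("APD restart must be conservative or aggressive")
    if not apd_timeout_elapsed:
        # The failure may still clear on its own, so nothing happens yet.
        return "wait"
    if response is Response.RESTART_AGGRESSIVE or failover_success_known:
        return "terminate and restart"
    # Conservative: do not terminate when the failover outcome is unknown.
    return "leave running"


if __name__ == "__main__":
    print(vmcp_action(Failure.APD, Response.RESTART_CONSERVATIVE, False))
```

The only case where conservative and aggressive differ is an APD where HA cannot tell whether the restart would succeed.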

Configure HA redundancy settings

Network redundancy should be configured for the management network on each host to help prevent a network partition. Network partitions affect the protection of VMs and the HA status of a host.

Datastore heartbeating can be configured to help determine whether an unresponsive host is in a network partition, is network isolated or has failed. If a host has also stopped datastore heartbeating it is considered to have failed and its running VMs will be restarted elsewhere. vCenter selects the datastores used for heartbeating, and this selection can be changed. vSphere HA creates a directory called .vSphere-HA at the root of each selected datastore, which is used both for datastore heartbeating and for persisting the set of protected VMs.

A VSAN datastore can NOT be used for datastore heartbeating. If no other shared storage is available, datastore heartbeating cannot be used in the cluster.
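As a rough illustration of the selection rule, the sketch below filters out VSAN datastores and prefers the ones mounted by the most hosts. The ranking rule and the default of two datastores are my assumptions about how vCenter chooses, and the `Datastore` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    name: str
    kind: str          # for example "VMFS", "NFS" or "VSAN"
    hosts: frozenset   # names of the hosts that mount this datastore


def pick_heartbeat_datastores(datastores, count=2):
    """Return up to `count` eligible datastore names, most widely shared first."""
    # VSAN datastores are never eligible for heartbeating.
    eligible = [d for d in datastores if d.kind.upper() != "VSAN"]
    # Prefer datastores seen by more hosts; break ties by name for stable output.
    eligible.sort(key=lambda d: (-len(d.hosts), d.name))
    return [d.name for d in eligible[:count]]


if __name__ == "__main__":
    lab = [
        Datastore("vsan01", "VSAN", frozenset({"esx1", "esx2", "esx3"})),
        Datastore("nfs01", "NFS", frozenset({"esx1", "esx2"})),
        Datastore("vmfs01", "VMFS", frozenset({"esx1", "esx2", "esx3"})),
    ]
    print(pick_heartbeat_datastores(lab))
```

A cluster whose only shared storage is VSAN ends up with an empty list, which matches the point above that datastore heartbeating is then unavailable.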

Configure HA related alarms and analyze a HA cluster

Alarms can be configured to monitor the status of the HA cluster. Below is a screen grab taken from the vSphere Availability document, which can be found here.

Configure VMware Fault Tolerance for single and multi-vCPU virtual machines

Fault Tolerance (FT) is much improved in vSphere 6. One of the improvements is that multi-vCPU (SMP) VMs can now be protected. FT is for the most mission-critical VMs and provides continuous availability by creating and maintaining a second, identical VM. In the event of a failover the copy is made live. The protected VM is called the Primary and the replicated one is called the Secondary.

To support SMP-FT it is recommended to have a dedicated low-latency 10Gb network. The vSphere license is also required to be Enterprise Plus. vSphere features such as snapshots, storage vMotion, VMCP, vVOLs and storage-based policies cannot interoperate with FT.

Before I can enable FT I must first create a VMkernel interface for Fault Tolerance Logging and have a vMotion network configured. I also need to have HA enabled. I create a new dedicated VMkernel interface for FT – Web Client – Hosts and Clusters – host – Manage – Networking – VMkernel Adapters – Add Host Networking. I then enable the Fault Tolerance Logging service and give it an IP address.
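The prerequisites above can be written down as a simple pre-flight checklist. The Python below is only a sketch of that checklist using made-up names (`FtCandidate`, `ft_blockers`). It does not talk to vCenter and it only covers the requirements mentioned in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FtCandidate:
    name: str
    ha_enabled: bool           # vSphere HA turned on for the cluster
    has_vmotion_vmk: bool      # a VMkernel adapter with vMotion enabled
    has_ft_logging_vmk: bool   # a VMkernel adapter with FT Logging enabled
    has_snapshots: bool = False


def ft_blockers(vm: FtCandidate) -> List[str]:
    """Return the reasons FT cannot be turned on; an empty list means ready."""
    blockers = []
    if not vm.ha_enabled:
        blockers.append("vSphere HA is not enabled on the cluster")
    if not vm.has_vmotion_vmk:
        blockers.append("no vMotion network is configured")
    if not vm.has_ft_logging_vmk:
        blockers.append("no VMkernel adapter has Fault Tolerance Logging enabled")
    if vm.has_snapshots:
        blockers.append("the VM has snapshots, which FT does not support")
    return blockers


if __name__ == "__main__":
    vm = FtCandidate("app01", ha_enabled=True, has_vmotion_vmk=True,
                     has_ft_logging_vmk=False)
    for reason in ft_blockers(vm):
        print(reason)
```

Returning the full list of blockers rather than stopping at the first one makes it easier to fix everything in a single pass before trying Turn On Fault Tolerance.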

Once the interface has been added on each host I can enable FT on the required VM. From the Actions menu on the VM I go to Fault Tolerance – Turn On Fault Tolerance.

I then select which datastore will hold the Configuration File, Tie Breaker File and the replicated vDisk.

I then need to select the host that will hold the secondary VM.

Once enabled, the icon in the Web Client changes for that VM and the FT information can be viewed. In my lab I'm limited in what I can do with FT, but once you have it enabled you will be able to initiate failovers, test failovers, suspend / resume replication and delete replication.