Posts Tagged ‘failure’

Thanks to a spate of upgrades to vSphere 5.1, I recently (re)discovered the following inconvenient result when applying an update to a DRS cluster from Update Manager (5.1.0.13071, using vCenter Server Appliance 5.1.0 build 947673):

Immediately I thought: “Great! I left a host-only ISO connected to these VMs.” However, that assumption was as flawed as Update Manager’s assumption that the workloads cannot be vMotion’d without disconnecting the removable media. In fact, the removable media indicated was connected to a shared ISO repository available to all hosts in the cluster. However, I was to blame and not Update Manager, as I had not remembered that Update Manager’s default response to removable media is to abort the process. Since cluster remediation is a powerful feature made possible by Distributed Resource Scheduler (DRS) in Enterprise (and above) vSphere editions that may be new to the feature to many (especially uplifted “Advanced AK” users), it seemed like something worth reviewing and blogging about.

Why is this a big deal?

More the the point, why does this seem to run contrary to “a common sense” response?

First, the manual for remediation of a host in a DRS cluster would include:

Applying “Maintenance Mode” to the host,

Selecting the appropriate action for “powered-off and suspended” workloads, and

Allowing DRS to choose placement and finally vMotion those workloads to an alternate host.

In the case of VMs with removable media attached, this set of actions will result in the workloads being vMotion’d (without warning or hesitation) so long as the other hosts in the cluster have access to the removable media source (i.e. shared storage, not “Host Device.”) However, in the case of Update Manger remediation, the following are documented road blocks to a successful remediation (without administrative override):

A CD/DVD drive is attached (any method),

A floppy drive is attached (any method),

HA admission control prevents migration of the virtual machine,

DPM is enabled on the cluster,

EVC is disabled on the cluster,

DRS is disabled on the cluster (preventing migration),

Fault Tolerance (FT) is enabled for a VM on the host in the cluter.

Therefore it is “by design” that a scheduled remediation would have failed – even if the removable media would be eligible for vMotion. To assist in the evaluation of “obstacles to successful deferred remediation” a cluster remediation report is available (see below).

In fact, the report will list all possible road blocks to remediation whether or not matching overrides are selected (potentially misleading, certainly not useful for predicting the outcome of the remediation attempt). While this too is counter intuitive, it serves as a reminder of the show-stoppers to successful remediation. For the offending “removable media” override, the appropriate check-box can be found on the options page just prior to the remediation report:

Disabling removable media during Update Manager driven remediation.

The inclusion of this override allows Update Manager to slog through the remediation without respect to the attached status of removable media. Likewise, the other remediation overrides will enable successful completion of the remediation process; these overrides are:

Maintenance Mode Settings:

VM Power State prior to remediation: Do not change, Power off, Suspend

Temporarily disable any removable media devices;

Retry maintenance mode in case of failure (delay and attempts);

Cluster Settings:

Temporarily Disable Distributed Power Management (forces “sleeping” hosts to power-on prior to next steps in remediation);

These settings are available at the time of remediation scheduling and as host/cluster defaults (Update Manager Admin View.)

SOLORI’s Take: So while it follows that the remediation process is NOT as similar to the manual process as one might think, it still can be made to function accordingly (almost.) There IS a big difference between disabling removable media and making vMotion-aware decisions about hosts. Perhaps VMware could take a few cycles to determine whether or not a host is bound to a removable media device (either through Host Device or local storage resource) and make a more intelligent decision about removable media.

vSphere already has the ability to identify point-resource dependencies, it would be nice to see this information more intelligently correlated where cluster management is concerned. Currently, instead of “asking” DRS for a dependency list, it just seems to just ask the hosts “do you have removable media plugged-into any VM’s” – and if the answer is “yes” it stops right there… Still, not very intuitive for a feature (DRS) that’s been around since Virtual Infrastructure 3 and vCenter 2.

Popular Posts

In Medio Stat Veritas

SOLORI's Take and Quick Take posts express my personal opinion unless explicitly attributed to other sources. Where possible, supporting facts are presented to properly frame and ground these opinions, however they are presented "AS-IS" without regard to warranty or promise: expressed or implied.

Comments are open to all registered users and may be edited for decorum. Spam is deleted with prejudice.