Posts Tagged ‘Storage’

Last week I was made aware of an issue a customer in the field was having with a data protection strategy built on array-based snapshots, which in turn leveraged VMware vSphere snapshots with VSS quiescing of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web Client), which I believe is the version shipped in ESXi 6.0 Update 1b (reported as version 6.0.0, build 3380124 in the vSphere Web Client).

The issue is that creating a VMware virtual machine snapshot with VSS integration fails. The virtual machine disk configuration is simply two .vmdks on a VMFS-5 datastore but I doubt the symptoms are limited only to that configuration.

The failure message shown in the vSphere Web Client is “Cannot quiesce this virtual machine because VMware Tools is not currently available.” The vmware.log file for the virtual machine also shows the following:

Attempts to take a quiesced snapshot in a Windows Guest OS fails
Attempts to take a quiesced snapshot after booting a Windows Guest OS fails

After downloading VMware Tools version 10.0.9 build-3917699 (reported as version 10249 in the vSphere Web Client) and upgrading to it, the customer’s problem was resolved. Since the faulty version of VMware Tools was embedded in the templates the customer used to deploy virtual machines throughout the datacenter, a number of VMs needed their VMware Tools upgraded, as did the templates themselves.
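When auditing templates and VMs for the faulty Tools build, it helps to translate the number the vSphere Web Client reports back into a dotted version. The two pairs quoted in this post (10.0.0 shown as 10240 and 10.0.9 shown as 10249) are consistent with a packed encoding of major * 1024 + minor * 32 + patch. That encoding is my assumption inferred from those numbers, not something VMware states in this post, and the function names below are made up for illustration. A minimal sketch:

```python
def encode_tools_version(major: int, minor: int, patch: int) -> int:
    """Pack a dotted VMware Tools version into the integer the client reports.

    Assumption: reported = major * 1024 + minor * 32 + patch.
    """
    if not (0 <= minor < 32 and 0 <= patch < 32):
        raise ValueError("minor and patch must each be in the range 0-31")
    return major * 1024 + minor * 32 + patch


def decode_tools_version(reported: int) -> str:
    """Unpack the reported integer back into 'major.minor.patch'."""
    major, remainder = divmod(reported, 1024)
    minor, patch = divmod(remainder, 32)
    return f"{major}.{minor}.{patch}"


if __name__ == "__main__":
    # The two values mentioned in this post.
    for reported in (10240, 10249):
        print(reported, "->", decode_tools_version(reported))
```

Feeding it the two values from this post gives back 10.0.0 and 10.0.9 respectively, which is handy when an inventory export lists only the integer form.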

Dell storage customers who have been watching the evolution of Enterprise Manager may be interested in the latest release, which was just made available. It adds support for the brand new SCv2000 Series Storage Centers and bundles Java Platform SE 7 Update 67 with the installation of both the Data Collector on Windows and the Client on Windows or Linux, so a prerequisite Java installation is no longer required. It also introduces a Linux client for the first time, which runs on several Linux operating systems. The Linux client is Java based and has the same look and feel as the Windows based client. Some of the details about this release are below.

Although the Enterprise Manager Client for Linux can be installed without a graphical environment, launching and using the client requires one. As an example, neither RHEL 6 nor RHEL 7 installs a graphical environment by default. Overall, installing a graphical environment for RHEL 6 and RHEL 7 is similar in that both require a yum repository. However, the procedure is slightly different for each version. There are several resources available on the internet which walk through the process. I’ll highlight a few below.

Log in with root access.

To install a graphical environment for RHEL 6, create a yum repository and install GNOME or KDE by following the procedure here.

To install a graphical environment for RHEL 7, create a yum repository by following this procedure and install GNOME by following the procedure here.

Installing the Enterprise Manager Client is pretty straightforward. Copy the RPM to a temporary directory on the Linux host and use rpm -U to install:

rpm -U dell-emclient-15.1.2-45.x86_64.rpm

Alternatively, download the client from the Enterprise Manager Data Collector using the following syntax as an example:

Once installed, launch the Enterprise Manager Client from the /var/lib/dell/bin/ directory:

cd /var/lib/dell/bin/

./Client

or

/var/lib/dell/bin/Client

We’re rewarded with the Enterprise Manager 2015 R1 Client splash screen. New features are found here to immediately manage SCv2000 Series Storage Centers (the SCv2000 Series is the first Storage Center for which the web based management console has been retired).

Once logged in, it’s business as usual in a familiar UI.

Dell, and before it Compellent, has long offered a variety of options and integrations to manage Storage Center as well as popular platforms and applications. The new Enterprise Manager Client for Linux extends the list of available management methods.

There are a few spots where I could improve but what you see here is what you get – a quick video I threw together outlining a simple VMware vRealize Operations Manager 6.0.1 appliance deployment, including:

vCenter adapter configuration

Active Directory role integration

Dell Storage Solutions Pack installation and configuration

Dashboard sharing

Obviously I trimmed some of the “wait” intervals, but the goal here was to cover the quick and easy steps to get vR Ops 6.x up and running, from OVF download to collecting data, in a very short amount of time.

Updates cover all major areas of the product including installation, migration, configuration, licensing, alerting, dashboards, reports, and policies. To take advantage of the following significant enhancements, upgrade to version 6.0.1.

Improved scaling numbers

The number of objects that a single large node supports has been increased to 12,000. Also, in multi-node configurations, a four large-node configuration can manage up to 40,000 objects and an eight large-node configuration can manage up to 75,000 objects. For details on scaling numbers and a link to a Sizing Guideline Worksheet, see KB 2093783.

vSphere v6.0 interoperability support

With this release, vSphere v6.0 can function both as a platform for vRealize Operations Manager installation, and as an environment to which vRealize Operations Manager can connect for operational assurance.

User interface improvements

Corrections in the Views and Reports content for vSphere Hosts and Clusters.

Addition of Hierarchical View in the Topology widget.

Enhancement to the Geo widget displays objects on a world map.

Licensing improvements

New functionality provides a way to use the REST API to add a license key.

Metrics switched to Collection OFF to improve performance

Extraneous metrics are switched to Collection OFF in the default Policy. An option to enable Collection is available. However, maintaining metrics in the OFF state saves disk space, improves CPU performance, and has no negative impact on the vRealize Operations Manager functionality to collect and analyze data. For a list of metrics with Collection switched to OFF, see KB 2109869.

Improved alert definitions for hosts and virtual machines in the vSphere 5.5 Hardening Guide, to identify and report more non-compliance issues.

Additional alert definitions to detect duplicate object names in vCenter and vSphere Storage Management Service errors. Note: To identify duplicate object names in the vCenter Server system, the name-based identification feature must be enabled for the vSphere adapter.

I spent a fair amount of time with vC Ops 5.x and I’ll be the first in line to say vR Ops 6.x has a much more polished look and feel, which generally makes this datacenter management tool more of a pleasure to work with in terms of installation, configuration, and daily use. But don’t take my word for it, see for yourself:

If you manage Dell Compellent storage, you may or may not be aware that Windows PowerShell cmdlets are available to ease management pain by way of automation and consistency. While I am able to recognize when scripting is the right tool for the job, I do not author PowerShell scripts on a regular basis. For that reason, I’m not as deeply familiar with all of the cmdlets available within the Dell Compellent Storage Center Command Set Shell as I would like to be.

So how do I get started – what are the cmdlets? There are a few different ways to retrieve a list of cmdlets made available by a PowerShell snapin or module.

VMware vSphere PowerCLI simplifies the process by providing a cmdlet called Get-VICommand. When executed, it returns a list of all the cmdlets provided by the VMware.VimAutomation.Core snapin used to manage a vSphere environment via PowerShell. As of this writing in the 5.5.x generation of vSphere, there are a few other vSphere specific snapins installed with PowerCLI but the cmdlets provided by those aren’t returned by Get-VICommand. Those snapins are:

However, not all PowerShell snapins ship with a native shortcut to retrieve a list of their respective cmdlets. In these cases, use Get-Command. Now Get-Command by itself returns cmdlets for all snapins. For a snapin specific list, either of the following will work:

Those who don’t use PowerShell on a regular basis may find the above difficult to recall from memory. I had a discussion with Justin Braun (author of The Braun Blog – check out his Dell Compellent articles here) and Mike Matthews (a peer in my office who specializes in Microsoft SQL Server and PowerShell, and is an all-around good guy). Is there an easier and persistent method to retrieve cmdlets from a given snapin? What resulted was a function that can be added to a PowerShell profile which performs just like VMware’s Get-VICommand (I’ll be original and call this one Get-SCCommand to get the list of Storage Center cmdlets).

Now open any PowerShell environment and use Get-SCCommand which shows a list of 105 Dell Compellent cmdlets (There are 49 additional cmdlets in the compellent.replaymanager.scripting snapin for Replay Manager):

It works with PowerShell ISE as well when the Microsoft.PowerShellISE_profile.ps1 profile is modified:

Of course the shortcut function provided in the example above is specific to the Dell Compellent snapin but it should work for any PowerShell snapin including the list of VMware snapins not included in Get-VICommand discussed at the top of the article.

Several years ago, one of the first blog posts that I tackled was working in the lab with N_Port ID Virtualization, often referred to as NPIV for short. The blog post was titled N_Port ID Virtualization (NPIV) and VMware Virtual Infrastructure. At the time it was one of the few blog posts available on the subject because it was a relatively new feature offered by VMware. Over the years that followed, I haven’t heard much in terms of trending adoption rates by customers. Likewise, VMware hasn’t put much effort into improving NPIV support in vSphere or promoting its use. One might wonder which is the cause and which is the effect. I feel it’s a mutual agreement between both parties that NPIV in its current state isn’t exciting enough to deploy and the benefits fall into a very narrow band of interest (VMware: Give us in guest virtual Fibre Channel – that would be interesting).

Despite its market penetration challenges, from time to time I do receive an email from someone referring to my original NPIV blog post looking for some help in deploying or troubleshooting NPIV. The nature of the request is common and it typically falls into one of two categories:

How can I set up NPIV with a fibre channel tape library?

Help – I can’t get NPIV working.

I received such a request a few weeks ago from the field asking for general assistance in setting up NPIV with Dell Compellent storage. The correct steps were followed to the best of their knowledge but the virtual WWPNs that were initialized at VM power on would not stay lit after the VM began to POST. In Dell Enterprise Manager, the path to the virtual machine’s assigned WWPN was down. Although the RDM storage presentation was functioning, it was only working through the vSphere host HBAs and not the NPIV WWPN. This effectively means that NPIV is not working:

In addition, the NPIV initialization failure is reflected in the vmkernel.log:

Storage presentation to the vSphere host HBAs as well as the virtual machine’s assigned NPIV WWPN(s)

If any of the above requirements are not met (plus a handful of others and we’ll get to one of them shortly), vSphere’s NPIV feature will likely not function.

In this particular case, general NPIV requirements were met. However, it was discovered a best practice had been missed in configuring the QLogic HBA BIOS (the QLogic BIOS is accessed at host reboot by pressing CTRL + Q or ALT + Q when prompted). Connection Options remained at its factory default value of 2 or Loop preferred, otherwise point to point.

Dell Compellent storage with vSphere best practices call for this value to be hard coded to 1 or Point to point only. When the HBA has multiple ports, this configuration needs to be made across all ports that are used for Dell Compellent storage connectivity. It goes without saying this also applies across all of the fabric attached hosts in the vSphere cluster.

Once the HBA ports were configured for Point to point connectivity on the fabric, the problem was resolved.

Despite the various error messages returned as vSphere probes for possible combinations between the vSphere assigned virtual WWPN and the host WWPNs, NPIV success looks something like this in the vmkernel.log (you’ll notice subtle differences showing success compared to the failure log messages above):

One last item I’ll note here for posterity is that in this particular case, the problem did not present itself uniformly across all storage platforms. This was an element that prolonged troubleshooting to a degree because the vSphere cluster was successful in establishing NPIV fabric connectivity to two other types of storage using the same vSphere hosts, hardware, and fabric switches. Because of this, in the beginning it seemed logical to rule out any configuration issues within the vSphere hosts.

To summarize, there are many technical requirements outlined in VMware documentation to correctly configure NPIV. If you’ve followed VMware’s steps correctly but problems with NPIV remain, refer to storage, fabric, and hardware documentation and verify best practices are being met in the deployment.

Dell Compellent Storage Center customers who use the legacy vSphere Client plug-in to manage their storage may have noticed that the upgrade to PowerCLI 5.5 R2, which was released with vSphere 5.5 Update 1, essentially “broke” the plug-in. This forced customers to choose: stay on PowerCLI 5.5 in order to keep using the legacy vSphere Client plug-in, or reap the benefits of the PowerCLI 5.5 R2 upgrade and abandon the legacy vSphere Client plug-in.

For those that are unaware, there is a 3rd option and that is to leverage vSphere’s next generation web client along with the web client plug-in released by Dell Compellent last year (I talked about it at VMworld 2013 which you can take a quick look at below).

Although VMware strongly encourages customers to migrate to the next generation web client long term, I’m here to tell you that in the interim Dell has revved the legacy client plug-in to version 1.7, which is now compatible with PowerCLI 5.5 R2. Both the legacy and web client plug-ins are free and quite beneficial from an operations standpoint so I encourage customers to get familiar with the tools and use them.

Other bug fixes in this 1.7 release include:

Datastore name validation not handled properly

Create Datastore, map existing volume – Server Mapping will be removed from SC whether or not it was created by VSP

Add Raw Device wizard is not allowing to uncheck a host once selected

Remove Raw Device wizard shows wrong volume size

Update to use new code signing certificate

Prevent Datastores & RDMs with underlying Live Volumes from being expanded or deleted

Add support for additional Flash Optimized Storage Profiles that were added in SC 6.4.2

If you ended up here searching for information on PDL or APD, your evening or weekend plans may be cancelled at this point and I’m sorry for you if that is the case. There are probably 101 or more online resources which discuss the interrelated vSphere storage topics of All Paths Down (known as APD), Permanent Device Loss (known as PDL), and vSphere High Availability (known as HA, and before dinosaurs roamed the Earth – DAS). To put it in perspective, I’ve quickly pulled together a short list of resources below using Google. I’ve read most of them:

vSphere HA in my opinion is a great feature. It has saved my back side more than once both in the office and at home. Several books have been more or less dedicated to the topic and yet it is so easy to use that an entire cluster and all of its running virtual machines can be protected with default parameters (common garden variety) with just two mouse clicks.

VMware’s roots began with compute virtualization so when HA was originally released in VMware Virtual Infrastructure 3 (one major revision before it became the vSphere platform known today), the bits licensed and borrowed from Legato Automated Availability Manager (AAM) were designed to protect against marginal but historically documented amounts of x86 hardware failure thereby reducing unplanned downtime and loss of virtualization capacity to a minimum. Basically if an ESX host yields to issues relating to CPU, memory, or network, VMs restart somewhere else in the cluster.

It wasn’t really until vSphere 5.0 that VMware began building in high availability for storage, aside from legacy design components such as redundant fabrics, host bus adapters (HBAs), multipath I/O (MPIO), failback policies, and, with vSphere 4.0, the pluggable storage architecture (PSA). This is not to say that any of these design items are irrelevant today – quite the opposite. vSphere 5.0 introduced Permanent Device Loss (PDL), which does a better job of handling unexpected loss of individual storage devices than APD alone did. Subsequent vSphere 5.x revisions made further PDL improvements, such as improving support for single LUN:single target arrays in 5.1.

In short, the new vSphere HA re-write (Legato served its purpose and is gone now) covers much of the storage gap such that in the event of certain storage related failures, HA will restart virtual machines, vApps, services, and applications somewhere else – again to minimize unplanned downtime. Fundamentally, this works just like HA when a vSphere host tips over, but instead the storage tips over and HA is called to action. Note that HA can’t do much about an entire unfederated array failing – this is more about individual storage/host connectivity. Aside from gross negligence on the part of administrators, I believe the failure scenarios are more likely to resonate with non-uniform stretched or metro cluster designs. However, PDL can also occur in small intra-datacenter designs.

I won’t go into much more detail about the story that has unfolded with APD and the new features in vSphere 5.x because it has already been documented many times over in some of the links above. Let’s just say the folks starting out new with vSphere 5.1 and 5.5 had it better than myself and many others did dealing with APD and hostd going dark. However, the trade off for them is they are going to have to deal with Software Defined * a lot longer than I will.

Although I mentioned earlier that vSphere HA is extremely simple to configure, I did also mention that was with default options which cover a large majority of the host related failures. Configuring HA to restart VMs automatically and with no user intervention in the event of a PDL condition in theory is just one configuration change for each host in the cluster. Where to configure depends on the version of vSphere host.

One thing about this configuration that had me chasing sense codes in vmkernel logs recently was lack of clarity on the required host reboot. That’s mainly what prompted this article – I normally don’t cover something that has already been covered well by other writers unless there is something I can add, something was missed, or it has caused me personal pain (my blog + SEO = helps ensure I don’t suffer from the same problems twice). In all of the online articles I had read about these configurations, none mentioned a host reboot requirement, and it’s not apparent that a host reboot is required until PDL actually happens and automatic VM restart via HA actually does not. The vSphere 5.5 documentation calls it out. Go figure. I’ll admit that sometimes I will refer to a reputable vMcBlog before the product documentation.

So let the search engine results show: when configuring VMkernel.Boot.terminateVMOnPDL a host reboot or restart is required. VMware KB 1038578 also calls out that as of vSphere 5.5 you must reboot the host for VMkernel.boot configuration changes to take effect. I’m not a big fan of HA or any configuration being written into VMkernel.boot requiring host or VSAN node performance/capacity outages when a change is made but that is VMware Engineering’s decision and I’m sure there is a relevant reason for it aside from wanting more operational parity with the Windows operating system.

I’ll also reiterate Duncan Epping’s recommendation that if you’re already licensed for HA and have made the design and operational decision to allow HA to restart VMs in the event of a host failure, then the above configuration should be made on all vSphere clustered hosts, whether they are part of a stretched cluster or not to protect against storage related failures. A PDL can be broken down to one host losing all available paths to a LUN. By not making the HA configuration change above, a storage related failure results in user intervention required to recover all of the virtual machines on the host tied to the failed device.

Lastly, it is mentioned in some of the links above but if this is your first reading on the subject, please allow me to point out that the configuration setting above is for Permanent Device Loss (PDL) conditions only. It is not meant to handle an APD event. The reason behind this is that the storage array is required to send a proper sense code to the vSphere host indicating a PDL condition. If the entire array fails or is powered off ungracefully taking down all available paths to storage, it has no chance to send PDL sense codes to vSphere. This would constitute an indefinite All Paths Down or APD condition where vSphere knows storage is unavailable, but is unsure about its return. PDL was designed to answer that question for vSphere, rather than let vSphere go on wondering about it for a long period of time, thus squandering any opportunities to proactively do something about it.

In reality there are a few other configuration settings (again documented well in the links above) which fine tune HA more precisely. You’ll almost always want to add these as well.

vSphere 5.0u1+: das.maskCleanShutdownEnabled = True (Cluster advanced options) – this is an accompanying configuration that helps vSphere HA distinguish between VMs that were powered on and should be restarted and VMs that were already powered off when a PDL occurred. The latter don’t need to be, and more importantly probably should not be, restarted.

vSphere 5.5+: Disk.AutoremoveOnPDL = 0 (advanced setting on each host) – This is a configuration I first read about on Duncan’s blog where he recommends that the value be changed from the default of enabled to disabled so that a device is not automatically removed if it enters a PDL state. Aside from LUN number limits a vSphere host can handle (255), VMware refers to a few cases where the stock configuration of automatically removing a PDL device may be desired although VMware doesn’t really specifically call out each circumstance aside from problems arising from hosts attempting to send I/O to a dead device. There may be more to come on this in the future but for now preventing the removal may save in fabric rescan time down the road if you can afford the LUN number expended. It will also serve as a good visual indicator in the vSphere Client that there is a problematic datastore that needs to be dealt with in case the PDL automation restarts VMs with nobody noticing the event has occurred. If there are templates or powered off VMs that were not evacuated by HA, the broken datastore will visually persist anyway.
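With three settings spread across host and cluster scope, it’s easy for one host to drift. The following is a minimal sketch of a drift check, assuming you have already collected the current values into a plain dictionary by whatever means you prefer. The setting names and recommended values come from this article; the function name and the dictionary-based approach are hypothetical, and nothing here talks to vCenter or a host.

```python
# Recommended values as discussed in this article (stored lowercase for comparison).
RECOMMENDED = {
    "VMkernel.Boot.terminateVMOnPDL": "yes",  # per host, host reboot required
    "das.maskCleanShutdownEnabled": "true",   # cluster advanced option
    "Disk.AutoremoveOnPDL": "0",              # per host, vSphere 5.5+
}


def audit_pdl_settings(current: dict) -> dict:
    """Return {setting: (found, wanted)} for each setting that is missing or different."""
    drift = {}
    for name, wanted in RECOMMENDED.items():
        found = current.get(name)
        # Normalize so that True/"True" and 0/"0" compare equal to the recommendation.
        normalized = None if found is None else str(found).strip().lower()
        if normalized != wanted:
            drift[name] = (found, wanted)
    return drift


if __name__ == "__main__":
    # Hypothetical values gathered from one host and its cluster.
    collected = {
        "VMkernel.Boot.terminateVMOnPDL": "no",
        "das.maskCleanShutdownEnabled": True,
        "Disk.AutoremoveOnPDL": 1,
    }
    for setting, (found, wanted) in audit_pdl_settings(collected).items():
        print(f"{setting}: found {found!r}, recommended {wanted!r}")
```

An empty result means the collected values match the recommendations; it says nothing about whether the host was rebooted after the boot option was changed.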

That’s the short list of configuration changes to make for HA VM restart. There are actually a few more here. For instance, fine grained HA handling can be coordinated on a per-VM basis by modifying the advanced virtual machine option disk.terminateVMOnPDLDefault for each VM, or scsi#:#.terminateVMOnPDL to fine tune HA on a per virtual disk basis for each VM. I’m definitely not recommending touching these if the situation does not call for it.

In a stock vSphere configuration with VMkernel.Boot.terminateVMOnPDL = no configured (or unintentionally misconfigured I suppose), the following events occur for an impacted virtual machine:

PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.

Stop. Nothing else happens until each of the questions above is answered with administrator intervention. Answering Retry without the PDL datastore coming back online or without hot removing the impacted virtual disk (in most cases the .vmx will be impacted anyway and hot removing disks is next to pointless) sends the VM to hell pretty much. Answering Cancel allows HA to proceed with powering off the VM and restarting it on another host with access to the device which went PDL on the original host.

In a modified vSphere configuration with VMkernel.Boot.terminateVMOnPDL = yes configured, the following events occur for an impacted virtual machine:

PDL event occurs, sense codes are received and vSphere correctly identifies the PDL condition on the supporting datastore. A question is raised by vSphere for each impacted virtual machine to Retry I/O or Cancel I/O.

Due to VMkernel.Boot.terminateVMOnPDL = yes vSphere HA automatically and effectively answers Cancel for each impacted VM with a pending question. Again, if the hosts aren’t rebooted after the VMkernel.Boot.terminateVMOnPDL = yes configuration change, this step will mimic the previous scenario essentially resulting in failure to automatically carry out the desired tasks.

Each VM is powered off.

Each VM is powered on.

I’ll note in the VM Event examples above, leveraging the power of Snagit I’ve cut out some of the noise about alarms triggering gray and green, resource allocations changing, etc.
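The two sequences described above can be boiled down to a small decision model. This is only a sketch of the behavior as I’ve described it in this article, not anything that runs against vSphere; the function name, parameters, and event strings are all made up for illustration.

```python
from typing import List, Optional


def pdl_outcome(terminate_vm_on_pdl: bool,
                host_rebooted_since_change: bool = True,
                admin_answer: Optional[str] = None) -> List[str]:
    """Model the event sequence for one VM on a datastore that goes PDL."""
    events = ["PDL detected; question raised: Retry I/O or Cancel I/O"]

    # The boot option only takes effect once the host has been rebooted.
    automatic = terminate_vm_on_pdl and host_rebooted_since_change
    if automatic:
        events.append("HA answers Cancel automatically")
        answer = "cancel"
    elif admin_answer is None:
        events.append("Waiting for an administrator to answer the question")
        return events
    else:
        answer = admin_answer.strip().lower()
        if answer not in ("retry", "cancel"):
            raise ValueError("admin_answer must be 'Retry' or 'Cancel'")
        events.append(f"Administrator answers {answer.capitalize()}")

    if answer == "cancel":
        events.append("VM powered off")
        events.append("VM powered on by HA on a host with access to the device")
    else:
        events.append("VM keeps retrying I/O against the dead device")
    return events


if __name__ == "__main__":
    for step in pdl_outcome(terminate_vm_on_pdl=True):
        print(step)
```

Calling it with terminate_vm_on_pdl=True but host_rebooted_since_change=False returns the same stalled sequence as the stock configuration, which is exactly the trap the missing reboot creates.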

For completeness, following is a list of the PDL sense codes vSphere is looking for from the supported storage array:

SCSI sense code | Description
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 | LOGICAL UNIT NOT SUPPORTED
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x4c 0x0 | LOGICAL UNIT FAILED SELF-CONFIGURATION
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x3 | LOGICAL UNIT FAILED SELF-TEST
H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x3e 0x1 | LOGICAL UNIT FAILURE
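If you’d rather not eyeball vmkernel logs for these, the sense codes listed above are easy to match programmatically. Below is a minimal sketch that looks for the "H:… D:… P:… Valid sense data: …" pattern in a line of text and returns the matching description. The lookup values are copied from the list above; the function name and the regular expression are my own, and I’m assuming the sense data appears in a log line in the same textual form as shown in the list.

```python
import re
from typing import Optional

# (sense key, ASC, ASCQ) -> description, copied from the list above.
PDL_SENSE_CODES = {
    (0x5, 0x25, 0x0): "LOGICAL UNIT NOT SUPPORTED",
    (0x4, 0x4C, 0x0): "LOGICAL UNIT FAILED SELF-CONFIGURATION",
    (0x4, 0x3E, 0x3): "LOGICAL UNIT FAILED SELF-TEST",
    (0x4, 0x3E, 0x1): "LOGICAL UNIT FAILURE",
}

HEX = r"(0x[0-9a-fA-F]+)"
SENSE_RE = re.compile(
    rf"H:{HEX}\s+D:{HEX}\s+P:{HEX}\s+Valid sense data:\s+{HEX}\s+{HEX}\s+{HEX}"
)


def classify_pdl(line: str) -> Optional[str]:
    """Return the PDL description if the line carries one of the listed sense codes."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    # Every entry in the list shares the same H:0x0 D:0x2 P:0x0 status triple.
    if (host, device, plugin) != (0x0, 0x2, 0x0):
        return None
    return PDL_SENSE_CODES.get((key, asc, ascq))


if __name__ == "__main__":
    sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0"
    print(classify_pdl(sample))
```

Anything that doesn’t match one of the four listed codes returns None, so other sense data is ignored rather than misreported as PDL.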

Two isolated examples of PDL taking place seen in /var/log/vmkernel.log: