
Tag Archives: vaai

A new white paper entitled VMware vSphere Storage APIs – Array Integration (VAAI) is now available. You can download the paper from the VMware Technical Resources web site here. The paper describes all of the VAAI block, NAS and Thin Provisioning primitives. It also looks at how VAAI is implemented in our Pluggable Storage Architecture (PSA) model, as well as some tools available to you to troubleshoot and monitor VAAI primitive usage.

Hope you find it useful.

Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage

Now this is an interesting one, and something that I had not noticed before. Nor have many other people I suspect. One of our customers was observing high KAVG (kernel latency) during Storage vMotion operations (which were leveraging the VAAI XCOPY primitive). The KAVG latency was showing an expected value when VAAI was disabled. I’d previously seen high KAVG values when queue full conditions were also being observed. This was not the case here. Below are two esxtop outputs showing the symptoms. The first of these screenshots is with VAAI enabled. Notice the high KAVG/cmd value. Click on the image for a larger view.

Now we have the same esxtop counters with VAAI disabled. In this case, SCSI Commands, Reads and Writes are very much higher since the I/O is not being offloaded to the array, but we do see an expected small value for KAVG/cmd. Again, you can click on the image for a larger view.

So what’s the cause? Well, eventually the explanation was found in the following KB article – Abnormal DAVG and KAVG values observed during VAAI operations. (Nice catch Henrik!) Reproducing verbatim the contents of the KB article, “when VAAI commands are issued via VAAI Filter, there are actually 2 commands sent. These are top-layer commands which are issued and are never sent to the actual device (they stay within the ESX kernel). These commands are intercepted by the VAAI filter and the VAAI plugin, and are replaced by the vendor-specific commands, which are issued to the device.

This is why esxtop shows device statistics for the top-level commands only, and as a result the values for DAVG and KAVG seem unusual when compared to results obtained when VAAI is not enabled. In this instance (and only for this instance), the DAVG and KAVG observed in esxtop should not be interpreted as a performance issue, absent of any other symptoms.”

So there you go. There is always something to learn in this job.


vSphere 5.1 is upon us. The following is a list of the major storage enhancements introduced with the vSphere 5.1 release.

VMFS File Sharing Limits

In previous versions of vSphere, the maximum number of hosts which could share a read-only file on a VMFS volume was 8. The primary use case for multiple hosts sharing read-only files is of course linked clones, where linked clones located on separate hosts all shared the same base disk image. In vSphere 5.1, with the introduction of a new locking mechanism, the number of hosts which can share a read-only file on a VMFS volume has been increased to 32. This makes VMFS as scalable as NFS for VDI deployments & vCloud Director deployments which use linked clones.

Space Efficient Sparse Virtual Disks

A new Space Efficient Sparse Virtual Disk aims to address certain limitations with virtual disks. The first of these is the ability to reclaim stale or stranded data in the Guest OS filesystem/database; SE Sparse disks introduce an automated mechanism for reclaiming this stranded space. The other feature is a dynamic block allocation unit size: SE Sparse disks have a new configurable block allocation size which can be tuned to the recommendations of the storage array vendor, or indeed the applications running inside the Guest OS. VMware View is the only product that will use the new SE Sparse Disk in vSphere 5.1.

The Data Mover (DM) aims to keep a continuous queue of outstanding I/O requests to achieve maximum throughput. Incoming I/O requests to the Data Mover are divided up into smaller chunks. Asynchronous I/Os are then simultaneously issued for each chunk until the DM queue depth is filled. When a request completes, the next request is issued; this could be for writing the data that was just read, or for handling the next chunk.

Take the example of cloning a 64GB VMDK (Virtual Machine Disk file). The DM is asked to move the data in 32MB transfers. Each 32MB transfer is divided up by the DM into much smaller I/Os of 64KB, issued in parallel using 32 threads at a time. To transfer this 32MB, a total of 512 I/Os of size 64KB is issued by the DM.

By comparison, a similar 32MB transfer via VAAI issues a total of 8 I/Os of size 4MB (XCOPY uses 4MB transfer sizes). The advantage of VAAI in terms of ESXi resources is immediately apparent.
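The arithmetic in the two paragraphs above can be sketched as a toy calculation (this is just the numbers from the text, not ESXi code):

```python
KB = 1024
MB = 1024 * KB

transfer = 32 * MB                      # one Data Mover transfer request

# Software Data Mover: the 32MB is split into 64KB asynchronous I/Os
dm_io_size = 64 * KB
dm_ios = transfer // dm_io_size         # 512 I/Os per 32MB transfer

# VAAI XCOPY: the array moves the same 32MB in 4MB transfers
xcopy_io_size = 4 * MB
xcopy_ios = transfer // xcopy_io_size   # 8 I/Os per 32MB transfer

# A full 64GB VMDK clone is 2048 such 32MB transfers
vmdk = 64 * 1024 * MB
windows = vmdk // transfer
print(dm_ios, xcopy_ios, windows * dm_ios, windows * xcopy_ios)
```

For the whole 64GB clone, that is over a million 64KB host I/Os via the software DM versus 16,384 offloaded XCOPY commands.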

The decision to transfer using the DM or to offload to the array with VAAI is taken upfront by looking at the storage array's Hardware Acceleration state. If we decide to transfer using VAAI and then encounter a failure with the offload, the VMkernel will try to complete the transfer using the VMkernel DM. It should be noted that the operation is not restarted; rather, it picks up from where the previous transfer left off, as we do not want to abandon what could be many GB worth of copied data because of a single transient transfer error.

If the error is transient, we want the VMkernel to check if it is ok to start offloading once again. In vSphere 4.1, the frequency at which an ESXi host checks to see if Hardware Acceleration is supported on the storage array is defined via the following parameter:

DataMover.HardwareAcceleratedMoveFrequency (default: 16384)

This parameter dictates how often we will retry an offload primitive once a failure is encountered. This can be read as 16384 * 32MB I/Os, so basically we will check once every 512GB of data move requests. This means that if at initial deployment an array does not support the offload primitives, but at a later date the firmware on the array gets upgraded and the offload primitives are now supported, nothing needs to be done on the ESXi side – it will automatically start to use the offload primitive.
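As a sanity check on that figure, here is the calculation spelled out (assuming the default value of 16384):

```python
MB = 1024 * 1024
GB = 1024 * MB

retry_interval_moves = 16384    # HardwareAcceleratedMoveFrequency default (vSphere 4.1)
transfer = 32 * MB              # one Data Mover transfer request

gb_between_retries = retry_interval_moves * transfer / GB
print(gb_between_retries)       # offload retried once every 512GB of move requests
```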

HardwareAcceleratedMoveFrequency only exists in vSphere 4.1. In vSphere 5.0 and later, we replaced it with the periodic VAAI state evaluation every 5 minutes.


We’re getting a lot of queries lately around how exactly VAAI behaves at the lower level. One assumes more and more VMware customers are seeing the benefit of offloading certain storage intensive tasks to the array. Recently the questions I have been getting are even more in-depth. I’ve been back over my VAAI notes gathered since 4.1, and have put together the following article. Hope you find it useful.

For those of you who don't know about Nutanix, these are the folks who won a gold award for desktop virtualization at VMworld 2011. Their complete cluster product is both a hardware and software solution that delivers compute and storage to run virtual machines and store your data. Their stance is that you no longer need a SAN to achieve enterprise-class performance, scalability and data management.

Nutanix achieves redundancy and availability in their platform by replicating between paired Nutanix NFS controllers on adjacent ESXi hosts. They also monitor VM data placement, moving associated data to new destinations based on operator-invoked or automated live migrations (using vSphere features like vMotion and DRS); this keeps data local to the ESXi hosts and VMs. Nutanix has multiple tiers of storage, including SATA-attached SSD, SATA-attached HDD, and Fusion-io PCIe SSD. Nutanix's Heat-Optimized Tiering (HOT) automatically moves the most frequently accessed data to the highest performing storage tier.

Nutanix recently announced a number of additional vSphere integration features in Nutanix Complete Cluster version 2.5. I had a discussion with Lukas Lundell, my technical marketing counterpart at Nutanix, and these are the vSphere integration features that he highlighted:

Support for NFSv3

Nutanix are releasing their own distributed NFS implementation, but are still keeping a localized data path between the VM and the ESXi host and storage controller. More about this implementation should be appearing soon.

NFS VAAI

Nutanix have implemented the VAAI NFS Full Copy & File Clone primitives in this release. Nutanix/VMware VDI customers will be able to leverage VMware's upcoming VMware View Composer Array Integration (VCAI) feature which is in tech preview for VMware View 5.1.

Datastore Creation Wizard

Nutanix now provides seamless integration with VMware's datastore creation workflow. You can create VMware NFS datastores directly from the Nutanix web console in just a few steps.

Avoidance of VMware iSCSI "All Paths Down" condition

Added support for VMware's PDL (Permanent Device Loss) SCSI sense codes, so that the "All Paths Down" condition can be avoided when a device is removed from an ESXi host in an uncontrolled manner.

High Availability for Nutanix Storage Controllers

Nutanix will now automatically reroute storage requests to an active Nutanix storage controller in the event of one controller failing. Interestingly, this feature can also be used to support rolling upgrades for VMware's ESXi and minor releases of the Nutanix cluster software.

Other features include additional management enhancements, a new proactive alert system for hardware failures and performance, and tuning of the cache configurations and algorithms for real-world VDI workloads.

Looks like the guys at Nutanix have a really nice integrated solution to offer. Find out more by visiting the Nutanix web site.


A week or so ago I published an article about new View 5.1 storage features. I followed this up with a short video post explaining how you would go about using View Storage Accelerator. In this article, I want to demonstrate the other very cool feature in View 5.1: VCAI (View Composer API for Array Integration). Although this feature is still in Tech Preview for View 5.1, it is a very cool enhancement which could have many benefits when it is eventually fully supported.

Another way of describing this feature is Native NFS Snapshots. Essentially, what the feature allows you to do is to offload the creation of the linked clones which back your View desktops to the storage array, and let the storage array handle this task. In order to do this, the NAS storage array on which the snapshots are being deployed must have the NAS Native Snapshot VAAI (vSphere API for Array Integration) feature, which was first introduced in vSphere 5.0. A special VIB/plugin (provided by the 3rd party storage array vendor) must also be installed on the ESXi host to allow us to use this offload mechanism.

The main advantage of VCAI is an improvement in performance and a reduction in the time taken to provision desktops based on linked clone pools. This task can now be offloaded to the array, which can then provision these linked clones natively rather than have the ESXi host do it.

What follows is a short video (approx. three and a half minutes) on setting up the View 5.1 VCAI feature, showing an installed VCAI VIB from NetApp on the ESXi host, and then how to use native NFS snapshots when creating desktop pools based on linked clones. Again, my thanks to Graham Daly of VMware KBTV fame for his considerable help with this.

Further detail about the View Composer for Array Integration (VCAI) can be found on the EUC blog here.


In one of my very first blog posts last year, around the new improvements made to VMFS-5 in vSphere 5.0, one of the enhancements I called out was related to the VAAI primitive ATS (Atomic Test & Set). In the post, I stated that the 'Hardware Acceleration primitive, Atomic Test & Set (ATS), is now used throughout VMFS-5 for file locking.' This recently led to an obvious, but really good question (thank you Cody) – what are the operations that ATS now does in vSphere 5.0/VMFS-5 that it didn't do in vSphere 4.1/VMFS-3?

Well, before we delve into that, I thought it might be a good idea to provide a general overview of VMFS locking.

Heartbeats

VMFS is a distributed journaling filesystem. All distributed file systems need to synchronize operations between multiple hosts as well as indicate the liveness of each host. In VMFS, this is handled through the use of an 'on-disk heartbeat structure'. The heartbeat structure maintains the lock states as well as the pointer to the journal information.

In order to deal with possible crashes of hosts, the distributed locks are implemented as lease-based. A host that holds a lock must renew a lease on the lock (by changing a "pulse field" in the on-disk lock data structure) to indicate that it still holds the lock and has not crashed. Another host can break the lock if the lock has not been renewed by the current holder for a certain period of time.

When another ESXi host wants to access a locked file, it will check to see whether the pulse field has been updated. If it has not, the host can take over ownership of the file by removing the stale lock, placing its own lock on the file, and generating a new timestamp.
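The lease-based lock and its "pulse field" can be modelled in a few lines. This is a hypothetical sketch of the behaviour described above, not the real VMFS on-disk format (the `LEASE` value and host names are made up for illustration):

```python
import time

LEASE = 3.0  # seconds the holder has to renew before the lock becomes breakable

class OnDiskLock:
    """Toy model of a VMFS lease-based distributed lock."""
    def __init__(self):
        self.owner = None
        self.pulse = 0            # heartbeat "pulse field"
        self.renewed_at = 0.0

    def renew(self, host):
        # The current holder bumps the pulse field to prove it is still alive.
        assert self.owner == host
        self.pulse += 1
        self.renewed_at = time.monotonic()

    def acquire(self, host):
        now = time.monotonic()
        if self.owner is None or now - self.renewed_at > LEASE:
            # Free, or the holder stopped renewing: break the stale lock.
            self.owner = host
            self.pulse += 1
            self.renewed_at = now
            return True
        return False

lock = OnDiskLock()
assert lock.acquire("esxi-01")
lock.renew("esxi-01")               # holder proves liveness within the lease
assert not lock.acquire("esxi-02")  # lease still fresh, cannot break the lock
lock.renewed_at -= LEASE + 1        # simulate a crashed holder that stops renewing
assert lock.acquire("esxi-02")      # stale lock broken, ownership taken over
```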

On-disk Metadata Locks

VMFS has on-disk metadata locks which are used to synchronize metadata updates. Metadata updates are required when creating, deleting or changing file metadata, in particular, file length. Earlier versions of VMFS (built on LUNs from non-VAAI arrays) use SCSI reservations to acquire the on-disk metadata locks. It is important to note that the SCSI reservations are not in place to do the actual metadata update. They are used to get the on-disk lock only, and once the lock has been obtained, the SCSI reservation is released. Therefore, to address a common misconception, the LUN is not locked with a SCSI reservation for the duration of metadata updates. They are only reserved to get the on-disk lock.

Acquiring on-disk locks is a very short operation compared to the latency of metadata updates. However, you should not infer from this that metadata updates themselves are long; it is simply that metadata updates typically take longer than the time taken to acquire the lock.

Optimistic Locking

The VMFS version released with ESX 3.5 introduced an updated distributed locking mechanism called 'optimistic locking'. Basically, the actual acquisition of on-disk locks (involving SCSI reservations) is postponed as late as possible in the life cycle of a VMFS metadata transaction. Optimistic locking allows the number and duration of SCSI reservations to be reduced, which in turn reduces the impact of SCSI reservations on virtual machine I/O and other VMFS metadata I/O originating from other ESX hosts that share the volume.

When locks are acquired in non-optimistic mode, one SCSI reservation is used for each lock that we want to acquire.

In optimistic mode, we use one SCSI reservation for each set of locks required by a particular journal transaction – that is, one SCSI reservation per transaction as opposed to one SCSI reservation per lock.

We don't commit the transaction to the on-disk journal unless we are able to upgrade all optimistic locks to physical locks (using a single SCSI reservation). If we are unable to do so – you may see the message 'Optimistic Lock Acquired By Another Host', meaning that a lock which was held optimistically (not yet acquired on-disk) during a transaction was found to have been acquired on-disk by a different host – we roll back the transaction's in-memory changes (no on-disk changes will have been made) and simply retry the transaction. But, as the name suggests, we are optimistic that this won't occur very often, and that in the vast majority of cases optimistic locks will be upgraded to physical locks without any contention.
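The transaction life cycle just described boils down to a simple retry loop. Here is a hypothetical sketch (not VMFS source; the function and lock names are made up) showing the upgrade-or-rollback-and-retry pattern:

```python
def run_transaction(locks_needed, acquire_physical, apply_changes):
    """Toy optimistic-locking loop: locks are held optimistically during
    the transaction, then upgraded to physical on-disk locks with a single
    reservation; on contention, roll back in-memory state and retry."""
    while True:
        optimistic = list(locks_needed)       # noted in memory, no SCSI reservation yet
        if acquire_physical(optimistic):      # one reservation upgrades the whole set
            apply_changes()                   # safe to commit to the on-disk journal
            return
        # "Optimistic Lock Acquired By Another Host": another host took one of
        # our locks on-disk. Nothing was written, so just retry the transaction.

attempts = {"n": 0}
def acquire(locks):
    attempts["n"] += 1
    return attempts["n"] > 1                  # simulate one mid-air collision, then success

committed = []
run_transaction(["heartbeat", "file-lock"], acquire, lambda: committed.append(True))
print(attempts["n"], committed)               # second attempt succeeds and commits
```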

VAAI ATS (Atomic Test & Set)

In VMFS, as we have seen above, many operations need to establish a critical section on the volume when updating a resource, in particular an on-disk lock or a heartbeat. The operations that require this critical section can be listed as follows:

This critical section can either be established using a SCSI reservation or using ATS on a VAAI-enabled array. In vSphere 4.0, VMFS-3 used SCSI reservations for establishing this critical section, as there was no VAAI support in that release. In vSphere 4.1, on a VAAI-enabled array, VMFS-3 used ATS only for operations (1) and (2) above, and ONLY when disk lock acquisitions were uncontended. VMFS-3 fell back to using SCSI reservations if there was a mid-air collision when acquiring an on-disk lock using ATS.

For VMFS-5 datastores formatted on a VAAI-enabled array (i.e. as ATS-only), all the critical section functionality from (1) to (8) is done using ATS. We should no longer see any SCSI Reservations on VAAI-enabled VMFS-5. Even if there is contention, ATS continues to be used.
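Why does ATS remove the need for reservations even under contention? Because it is a compare-and-swap on a single disk sector: the array writes the new lock image only if the existing contents match what the host expects. A toy model (hypothetical, not the actual SCSI command or on-disk layout):

```python
class ATSDisk:
    """Toy model of Atomic Test & Set: compare the on-disk sector against
    an expected image and write the new image only if they match, all as
    one atomic operation on the array (no LUN-wide SCSI reservation)."""
    def __init__(self):
        self.sectors = {}

    def ats(self, lba, expected, new):
        if self.sectors.get(lba) == expected:
            self.sectors[lba] = new
            return True        # lock acquired atomically
        return False           # mid-air collision: another host got there first

disk = ATSDisk()
disk.sectors[42] = b"lock:free"

# Two hosts race for the same on-disk lock sector:
host_a = disk.ats(42, b"lock:free", b"lock:esxi-01")
host_b = disk.ats(42, b"lock:free", b"lock:esxi-02")
print(host_a, host_b)   # exactly one host wins; the loser simply retries with ATS
```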

On non-VAAI arrays, SCSI reservations continue to be used for establishing critical sections in VMFS-5.

I hope this gives some clarity to the original statement about VMFS-5 in vSphere 5.0 now being fully ATS aware, and also gives you some idea of the types of locking used in various versions of VMFS.


VMware's flagship VDI product, VMware View, has a new release coming out. I don't normally blog about EUC (End User Computing) or VDI as it is not my area of expertise. However VMware View 5.1 has a number of really neat new storage related features which are making use of enhancements which were first introduced in vSphere 5.0.

View Storage Accelerator

This first feature was originally called CBRC (Content Based Read Cache) and was initially introduced in vSphere 5.0. Although it is a vSphere feature, it is designed specifically for VMware View. With the release of View 5.1, the View Storage Accelerator feature can now be used to dramatically improve the read throughput for View desktops. This will be particularly useful during a boot storm or anti-virus storm, where many virtual machines could be reading the same data from the same base disk at the same time. The accelerator is implemented by setting aside an area of host memory for cache and creating 'digest' files for each virtual machine disk. The feature will be most useful for shared disks that are read frequently, such as View Composer OS disks. It will be available 'out of the box' with View 5.1; no additional components need to be installed. This feature will significantly improve performance. More here.
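The idea of a content-based read cache can be sketched as follows. This is a hypothetical simplification (in the real feature, digests are precomputed per-VMDK into digest files; here they are just hashes used as cache keys) to show why many clones reading the same base-disk block produce one cache entry and many hits:

```python
import hashlib

class ContentBasedReadCache:
    """Toy CBRC: cache blocks by content digest, so identical blocks read
    by many desktops occupy a single cache entry in host memory."""
    def __init__(self):
        self.cache = {}                 # digest -> block data
        self.hits = self.misses = 0

    def read(self, digest, fetch_from_disk):
        if digest in self.cache:
            self.hits += 1              # served from host memory
            return self.cache[digest]
        self.misses += 1                # one trip to the base disk
        block = fetch_from_disk()
        self.cache[digest] = block
        return block

def digest(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

cbrc = ContentBasedReadCache()
base_block = b"common OS data" * 256
d = digest(base_block)

# 100 desktops booting and reading the same base-disk block:
for _ in range(100):
    cbrc.read(d, lambda: base_block)
print(cbrc.hits, cbrc.misses)   # only the first read hits the disk
```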

32 ESXi nodes sharing NFS datastores

This storage feature is also quite significant. While vSphere has supported the creation of 32-node clusters for some time, VMware View would only allow a base disk on an NFS datastore to be shared between 8 ESXi hosts for the purposes of linked clone deployments. View 5.1 lifts this restriction, and 32 ESXi hosts can now host linked clones deployed from the same base disk on a shared NFS datastore. This feature will significantly improve scalability.

Native NFS Snapshots

Although this feature is a Technology Preview in View 5.1, it is another cool storage feature of the release. View desktops deployed on VMware's linked clone technology consume CPU on the ESXi hosts, and network bandwidth when they are deployed on NFS datastores. With this new Native NFS Snapshot feature via VAAI (vSphere Storage APIs for Array Integration), customers can offload the cloning operation to the storage array, minimizing CPU usage and network bandwidth consumption. Once again, this enhanced VAAI functionality was introduced in vSphere 5.0 specifically for VMware View. The feature requires a VAAI NAS plugin from the storage array vendor. Once installed and configured, customers will be able to use a storage array vendor's own native snapshot feature for deploying View desktops. Selecting this new desktop deployment method can be done via standard workflows in View Composer. More here.

I'm sure you will agree that these are very exciting features. By providing a read caching mechanism, offloading snapshots/clones to the storage array and supporting up to 32 hosts sharing a single base disk, VMware View 5.1 now has greater performance and scalability than ever before. Of course, there are many other enhancements, including a vCenter Operations Manager (vCOps) extension specifically for View, so please check out the View 5.1 news release on VMware.com. For those of you using VMware View, this is definitely a release worth checking out.

Over the next couple of weeks, I hope to look at these features in even greater detail.


I have done a number of blog posts in the recent past related to our newest VAAI primitive, UNMAP. For those who do not know, VAAI UNMAP was introduced in vSphere 5.0 to allow the ESXi host to inform the storage array that files or VMs had been moved or deleted from a Thin Provisioned VMFS datastore, allowing the array to reclaim the freed blocks. We had no way of doing this previously, so many customers ended up with a considerable amount of stranded space on their Thin Provisioned VMFS datastores.
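The "stranded space" problem can be illustrated with a toy thin-provisioned LUN (a hypothetical sketch, not a real array's allocation logic): deleting a VM frees space at the VMFS layer, but the array keeps the blocks allocated until the host issues UNMAP for the freed range.

```python
class ThinLUN:
    """Toy thin-provisioned LUN: blocks stay allocated on the array after
    file deletion unless the host issues UNMAP for the freed range."""
    def __init__(self):
        self.allocated = set()

    def write(self, lba):
        self.allocated.add(lba)      # thin provisioning: allocate on first write

    def unmap(self, lbas):
        # VAAI UNMAP: the host tells the array these blocks are free again.
        self.allocated -= set(lbas)

lun = ThinLUN()
vm_blocks = range(1000)
for lba in vm_blocks:
    lun.write(lba)

# Deleting the VM frees the space in VMFS, but without UNMAP the array
# still considers every block allocated ("stranded" space):
assert len(lun.allocated) == 1000

lun.unmap(vm_blocks)                 # dead space reclamation
assert len(lun.allocated) == 0       # the array has its capacity back
```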

Now, there were some issues with using this primitive which meant we had to disable it for a while. Fortunately, 5.0 U1 brought some enhancements which allow us to use this feature once again.

Over the past couple of days, my good friend Paudie O'Riordan from GSS has been doing some testing with the VAAI UNMAP primitive against our NetApp array. He kindly shared the results with me, so that I can share them with you. The posting is rather long, but the information contained will be quite useful if you are considering implementing dead space reclamation.