Virtualization Best Practices: A House of Cards?

One way IT organizations try to ensure an acceptable level of application performance in complex environments is to follow vendor-established guidelines, often called virtualization “best practices”. These are usually developed by application and platform vendors and reflect the application’s design, expected resource consumption, and many other performance considerations. Vendors spend long hours in the lab thoroughly testing and refining these specifications. Virtualization, however, can undermine such practices in ways that lead to anything but “best” results.

Virtualization Best Practices, Questionable Results

To better understand the problem virtualization creates with best practices, let’s look at a typical application, Exchange, which is commonly virtualized in IT organizations. Below is an example of a typical virtual machine sizing for an Exchange deployment:

Let’s examine one aspect more carefully: the number of virtual CPUs. The number of required CPUs (physical or virtual) reflects the degree of parallelism implemented in an application. If an application can process several work streams at the same time, it will benefit from multiple processing cores. In the physical world this is fairly easy to accomplish: just run the application on a dedicated physical box with the required number of cores.
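As a rough illustration of why adding cores only helps to the degree the workload is actually parallel, here is a minimal sketch of Amdahl’s law. The function name and the 90%-parallel figure are illustrative assumptions, not taken from any vendor’s sizing guide:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup for a workload where `parallel_fraction` of the
    work can run concurrently (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# A workload that is 90% parallel gains far less from extra cores
# than a fully parallel one:
for cores in (1, 2, 4, 8):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
# 1 -> 1.0, 2 -> 1.82, 4 -> 3.08, 8 -> 4.71
```

Even in this idealized model, the eighth core buys much less than the second, which is why vendors put real lab effort into picking a specific CPU count rather than simply recommending “more”.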

Once the application is virtualized, these practices are usually carried over to the virtual world, as shown above, by specifying the number of virtual CPUs. However, many virtualization platforms exhibit a well-known effect in which virtual machines have to wait in a ready queue until enough physical cores become available to map to their virtual CPUs. The time a VM spends in that queue is called “ready time”: the application is effectively ready to run but has to wait. End users often notice this effect as visible application delays, e.g. higher-than-usual response times.
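To make the effect concrete, here is a toy co-scheduling model, a deliberately simplified sketch rather than any real hypervisor’s scheduler: in each scheduling tick, a VM runs only if all of its vCPUs can be mapped to free physical cores at once; otherwise it accumulates ready time.

```python
import random

def simulate_ready_time(vm_vcpus, physical_cores, ticks=10_000, seed=1):
    """Toy co-scheduling model: each tick, a VM is scheduled only if
    all of its vCPUs fit onto free physical cores simultaneously.
    Returns the fraction of ticks each VM spent waiting ('ready')."""
    rng = random.Random(seed)
    ready = [0] * len(vm_vcpus)
    for _ in range(ticks):
        free = physical_cores
        # Randomize arrival order so no VM is permanently favored.
        order = list(range(len(vm_vcpus)))
        rng.shuffle(order)
        for i in order:
            if vm_vcpus[i] <= free:
                free -= vm_vcpus[i]     # scheduled this tick
            else:
                ready[i] += 1           # waits in the ready queue
    return [r / ticks for r in ready]

# Four 4-vCPU VMs on an 8-core host: only two fit per tick,
# so each VM spends roughly half its time in the ready queue.
print(simulate_ready_time([4, 4, 4, 4], physical_cores=8))
```

Even though total vCPUs only double the physical cores here, each VM in this model is runnable but waiting about half the time, which is exactly the latency end users perceive.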

More Best Practices, Similar Results

Some hypervisor vendors are well aware of this fact and add their own virtualization “best practices” on top of the application vendors’ best practices, asking admins to be conservative with the number of virtual CPUs and to make sure it matches the number of underlying physical cores. Now we have two sets of practices, developed by two independent vendors who may have vastly different expectations about production conditions, that IT admins need to take into account.

Let’s assume all that work has been done flawlessly: the best practices have been reviewed and implemented, and the correct number of virtual CPUs configured. The issue is not limited to the implementation of best practices for Exchange and the VM in which it runs. The Exchange virtual machine still shares the underlying physical host with dozens of other virtual machines, and these other VMs (and the application workloads running inside them) have their own best practices, implemented based on an entirely different set of factors.

In the morning the number of email requests peaks, and more virtual CPUs are needed to process them. At any given moment the number of concurrent requests for physical cores is hard to predict, which may result in longer-than-usual wait times and latencies. Sometimes, even with a high degree of parallelism, the application won’t benefit from an increase in virtual CPUs and can actually perform better if the number of vCPUs is reduced. However, this is not the only remedy, and it is not always the appropriate one. If the workloads running on the same host peak at the same time, then even if the number of vCPUs is reduced, there are still too many such VMs in the queue. So the solution could be to spread the peaking VMs apart, onto different hosts. And what if those hosts have a different number of physical cores than initially planned?
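The “spread peaking VMs apart” idea can be sketched as a toy placement routine. This is an illustrative assumption about one possible policy, not a real scheduler or DRS rule; the VM names and peak hours are made up:

```python
def spread_by_peak(vms, hosts):
    """Toy placement policy: VMs that peak in the same window are dealt
    round-robin across hosts so their demand does not pile up on one box.
    `vms` is a list of (name, peak_hour); returns {host: [names]}."""
    placement = {h: [] for h in hosts}
    # Group VMs by when they peak.
    by_peak = {}
    for name, peak in vms:
        by_peak.setdefault(peak, []).append(name)
    # Deal each peak group across hosts like a deck of cards.
    for peak in sorted(by_peak):
        for i, name in enumerate(by_peak[peak]):
            placement[hosts[i % len(hosts)]].append(name)
    return placement

vms = [("exch1", 9), ("exch2", 9), ("batch1", 2), ("exch3", 9)]
# The three VMs that peak together at 9am never all land on one host.
print(spread_by_peak(vms, ["hostA", "hostB"]))
```

Note what even this trivial sketch leaves out: it assumes peak times are known and stable, and it ignores differing core counts per host, which is precisely why doing this continuously in real life is so hard.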

As a result, the fluctuating load of any one of these VMs could impact the Exchange server at exactly the moment when email requests peak and more virtual CPUs are needed to process them. Now IT teams must implement their own best practices, based on their experience handling similar events.

So what we have just illustrated is that on top of two “best practices” from two different vendors, we added another best practice: how to distribute or reconfigure virtual machines depending on load fluctuations or interference. However, unlike the other best practices, this one is much harder to accomplish in real life: in theory, someone needs to constantly watch workload patterns, peaks, and interference, and perform the appropriate action in real time.

The Problem with Best Practices in Virtualized Environments

The truth is that many such best practices produce very questionable results, because they don’t take into account load fluctuation and interference, and they actually create a rather precarious house of cards. This is not limited to vendor best practices: IT administrators add their own wisdom about how these workloads should be managed, often derived from internal “tribal knowledge”, keeping highly sensitive, mission-critical workloads dedicated to their own virtual headquarters (e.g. “production clusters”, “Exchange clusters”, “VDI clusters”). Two major issues arise with this approach:

1. The efficiency of the virtualized infrastructure is seriously impacted: instead of sharing the computing hardware (the primary motivation to virtualize in the first place), specific virtual infrastructure is dedicated to a specific application type, just as in physical environments.

2. The load can still interfere even if it is of the same type, which causes teams to reduce the density in such environments (i.e. over-provision), impacting efficiency even more.


The end product is a virtualized environment bound by policies defined for a physical environment, along with (sometimes conflicting) third-party suggestions implemented through internal practices. Such a structure is inherently brittle, and such fragile layers tend to produce either a very low quality of service or an underutilized platform, sometimes even both.

We’ve considered here one specific example of load fluctuation and interference, related to CPU, in the software-defined world. In the next several articles we will consider other aspects, such as memory and I/O, to further explore the diminishing value of “best practices” in the software-defined world and the need for IT to find a better solution.