Tackling the VM I/O Blender Problem

By Eric Burgener
02/08/11 5:00 AM PT

Server virtualization technology offers lots of benefits for administrators looking to consolidate to make more efficient use of resources, improve management, and save on power, cooling and floorspace. However, deploying server virtualization technology may lead to a few surprises. Let's walk through a very typical scenario on the storage side.

After reviewing resource utilization and performance data on the relevant physical servers, you're planning on consolidating between eight and 10 physical servers onto each virtual host. You're moving to shared storage to be able to take advantage of high-availability features that will help you meet higher uptime requirements, minimize the impact of maintenance, and leverage workload balancing to better handle peak processing requirements.

The first thing you notice once you've created your test environment is that storage performance is lagging, possibly by as much as 60 percent to 80 percent. How can that be? You're using the same storage configuration you had before, and performance was fine on the physical servers. As you look at the problem, you consider several options: Buy more disk spindles? Increase the size of the storage controller caches? Add some solid-state storage on the front end? Move to a higher-performance storage array?

All of those are likely to ease your performance problem at least somewhat, but at a significant cost -- and if you've purchased an enterprise-class storage array as part of your virtualization project, that additional cost may be higher still.

The Culprit Is the Randomness

What you've run into is sometimes called the "virtual machine I/O blender" problem. A physical server with its own dedicated storage running a single primary application can make pretty efficient use of underlying disks, because it can optimize against a single I/O stream. When you combine the workloads of multiple servers, each with its own I/O stream that is completely unrelated, onto a single virtual host, you end up with a very random I/O pattern.
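To see the blender effect in miniature, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the eight VMs, the 100 requests each, the region spacing, and the round-robin interleaving are hypothetical, and head travel measured in blocks is only a crude stand-in for real seek time.

```python
def total_seek_distance(block_addresses):
    """Sum of head movements (in blocks) between consecutive requests."""
    return sum(abs(b - a) for a, b in zip(block_addresses, block_addresses[1:]))


def interleave(streams):
    """Round-robin merge: take one request from each stream in turn."""
    return [addr for group in zip(*streams) for addr in group]


# Hypothetical layout: 8 VMs, each issuing 100 perfectly sequential
# requests inside its own region of the shared disk.
NUM_VMS = 8
REQUESTS_PER_VM = 100
REGION_SIZE = 1_000_000  # blocks between the start of one VM's region and the next

streams = [
    [vm * REGION_SIZE + i for i in range(REQUESTS_PER_VM)]
    for vm in range(NUM_VMS)
]

# One VM on its own: the head only ever moves to the adjacent block.
per_vm_seek = total_seek_distance(streams[0])

# All VMs blended onto one host: the head jumps between regions on every request.
blended_seek = total_seek_distance(interleave(streams))

print("single stream head travel:", per_vm_seek)
print("blended stream head travel:", blended_seek)
```

Each stream is sequential when viewed alone, yet the merged stream forces the head across the disk on nearly every request, which is the pattern the shared storage actually sees.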

Disk seek times and rotational latencies start to account for a much higher percentage of the I/O transfer times, significantly slowing down the I/O. The culprit is the randomness of the I/O, and short of throwing more hardware at the problem, there's not much you can do about it through performance tuning.

Surprise.

If you're the CIO, this means that original ROI calculation you did will likely need to be adjusted a little to compensate for the additional storage spend. If you're the technical director, you may not have the budget or space available to accommodate the additional storage just yet.

Of course, you could always implement the project with lower performance, but then you may run into a little trouble with the end-users. It's not likely that an explanation about the performance limitations of existing disk technology is going to adequately address their complaints.

Actually though, there are some new virtual storage architectures on the market that were specifically designed to provide high performance in environments with extremely random I/O.

The "VM I/O blender" problem is really a function of the write performance that can be sustained against this extremely random I/O load, and VM environments tend to be write-intensive. The good news is that there is a way to address this problem in software. Because of the way it works, this approach not only requires no additional storage hardware, but can actually provide better performance than you had before on your physical servers, using fewer physical spindles.

Perform Writes to Disk Serially

The breakthrough idea is in creating a dedicated disk-based log space for each virtual host that allows all writes to disk to occur serially, and combining that with a set of optimized de-staging algorithms to then asynchronously move that logged data to primary storage.

The initial log writes are effectively insulated from any disk-based slowdown due to seek times or rotational latencies, because there is no randomness in the writes -- they are all performed serially. This means that they occur at the highest possible speed of which a given disk technology (SCSI, SATA, etc.) is capable, allowing that disk to perform as much as 80 percent faster than it would against an extremely random I/O pattern.

The de-staging optimizations are critical as well -- just throwing more conventional write cache at the problem doesn't address it nearly as well, and it is a lot more expensive.
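The two halves of the idea, serial log writes and asynchronous de-staging, can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the `WriteLog` class and its methods are hypothetical names, the in-memory list and dictionary stand in for the log disk and primary storage, and the two de-staging optimizations shown (dropping overwritten data and writing in ascending block order) are assumptions about what such algorithms might plausibly do.

```python
class WriteLog:
    """Toy model of a per-host write log sitting in front of primary storage."""

    def __init__(self):
        self.log = []      # append-only list of (target_block, data) entries
        self.primary = {}  # stand-in for primary storage: block -> data

    def write(self, target_block, data):
        """Append the write to the end of the log and return its log offset.

        Every write lands at the next log position no matter which
        block it targets, so the log disk never has to seek.
        """
        self.log.append((target_block, data))
        return len(self.log) - 1

    def read(self, target_block):
        """Return the newest data for a block, checking the log first."""
        for block, data in reversed(self.log):
            if block == target_block:
                return data
        return self.primary.get(target_block)

    def destage(self):
        """Move logged data to primary storage and empty the log.

        Assumed optimizations: keep only the latest write per block,
        then write in ascending block order (one sweep across the disk).
        Returns the block order used.
        """
        latest = {}
        for block, data in self.log:  # later entries replace earlier ones
            latest[block] = data
        order = sorted(latest)
        for block in order:
            self.primary[block] = latest[block]
        self.log.clear()
        return order


# Four random writes arrive; block 12 is written twice.
log = WriteLog()
offsets = [log.write(b, d) for b, d in [(9000, "a"), (12, "b"), (512, "c"), (12, "d")]]
newest = log.read(12)
destaged = log.destage()
```

The application only waits for `write`, which is always a sequential append. The reordering work in `destage` happens later and in the background, which is why it can afford to be careful about the order in which it touches primary storage.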

Like many interesting solutions, this one hearkens back to a legacy approach that has been effectively deployed with enterprise databases for 20 years or more. It is proven, and is in widespread use in information technology (IT) organizations today. Just look at the logging architectures of enterprise databases like Oracle, DB2, etc. Only now, several independent software vendors have made this kind of logging available for use against any block-based storage -- not just storage managed by Oracle or DB2.

What you'll see if you deploy one of these software solutions is some truly amazing results. You will likely end up needing fewer spindles than you had before to meet your performance requirements -- as many as 30 percent fewer. You may even be able to meet your performance requirements with mid-range rather than enterprise-class storage, a factor that can have a significant impact on the $/GB you'll need to pay for storage.

You'll be able to support more VMs per virtual host with your existing storage, increasing your VM density. And because normal I/O operations to primary storage no longer affect application performance, you'll be able to use thin-provisioning technologies to use even less storage while still meeting performance requirements.

All of this boils down to a huge ROI benefit associated with employing these types of virtual storage architectures in VM environments. In fact, these benefits are so compelling that it's my contention that use of these technologies will become part of the VM "standard operating environment" over time.

If you've already run into these storage performance problems, and if you think that a storage architecture that is truly designed for the extremely random I/O patterns that are common in VM environments might be of interest, you may want to see what the ROI benefit in your shop would be.