The Virtual Machine Backup-and-Recovery Conundrum

By Mark Davis
May 12, 2010 5:00 AM PT

Server virtualization has crossed the proverbial chasm. Not long ago, when deploying new servers, one had to justify making them virtual, rather than physical. Now, the opposite is becoming true. IT managers in many organizations have decreed "virtual first" policies, requiring new server deployments to be virtual unless there is specific justification for a physical server.

Alas, for all its goodness, virtualization creates a number of new storage and data management issues. One big problem area is backup. While backup techniques from the physical server world can be used in a virtual machine (VM) environment, they don't work very well.

Why? VM backup presents four new technical considerations:

1. Reduced headroom due to increased server utilization. Backup applications are resource hogs. They consume big chunks of server I/O, CPU and memory bandwidth, and we usually want them to do their work in relatively quick jobs to stay within backup window constraints. In the days of physical servers running at, say, 10 percent utilization, this was not a problem. Backup could suck up the idle resources without seriously impacting the application being backed up.

With virtualization, we run servers much closer to capacity. This is, in fact, a cardinal goal of virtualization. This is great, except that servers no longer have copious amounts of idle bandwidth in which to run backups. The running backup job and the app it is backing up can now compete heavily for server and storage hardware resources.

That's bad enough -- but since the server box is now running not one application but several (or better yet many), now backing up one virtual server negatively impacts not just that VM, but all the VMs that are sharing the same hardware. Ouch.

2. Existing backup clients run inside the guest VMs, and unless special precautions are taken, their scheduled runs are likely to overlap. Given the lost server headroom problem, server administrators have to be extra careful. If having one backup job running on a box is bad, having multiple going can be an application responsiveness disaster. So VM backup schedules have to be even more carefully constructed than for physical machines.

The problem of additional backup load is multiplied by the number of guest VMs involved. Further, the complexity is increased by the dynamic nature of VM workloads, where live migration of VMs (sometimes not initiated by a server administrator) can make a hash of handcrafted backup schedules.
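To make the scheduling problem concrete, here is a minimal sketch in Python of staggering backup jobs so that no two run at once on the same physical host. The function name, the placement map and the duration estimates are all hypothetical; a real scheduler would pull placement from the hypervisor's management layer and would have to cope with many more constraints.

```python
from collections import defaultdict


def stagger_backups(vm_hosts, durations_min, window_start_min=0):
    """Assign each VM a backup slot so jobs on the same host never overlap.

    vm_hosts:       dict of VM name -> host name (hypothetical placement data)
    durations_min:  dict of VM name -> estimated backup duration in minutes
    Returns a dict of VM name -> (start, end), in minutes after window start.
    """
    # Earliest minute at which each host is free to start another backup.
    next_free = defaultdict(lambda: window_start_min)
    schedule = {}
    for vm in sorted(vm_hosts):  # sorted for a deterministic result
        host = vm_hosts[vm]
        start = next_free[host]
        end = start + durations_min[vm]
        schedule[vm] = (start, end)
        next_free[host] = end
    return schedule


durations = {"vm-a": 30, "vm-b": 45, "vm-c": 20}

# vm-c lives on its own host, so it can run in parallel with the others.
before = stagger_backups(
    {"vm-a": "host1", "vm-b": "host1", "vm-c": "host2"}, durations
)

# After a live migration moves vm-c onto host1, the old slot would collide
# with vm-a, so the schedule has to be recomputed from current placement.
after = stagger_backups(
    {"vm-a": "host1", "vm-b": "host1", "vm-c": "host1"}, durations
)
```

The point of the sketch is the second call: a schedule is valid only for the placement it was computed from, which is why handcrafted schedules break down once VMs start moving between hosts.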

3. The solution to these problems is to run a coordinated backup job at the hypervisor level, preferably off-host. Coordinating backup jobs within the virtualization layer, where the resource contention crops up, can minimize problems of starving live applications for bandwidth. Implemented properly, backup operations at the hypervisor level can be the most resource-efficient technique.

The ideal scenario is to run backups on separate and potentially dedicated hardware, so that the act of backing up a server in no way impacts the live applications.

4. Existing architectures like Microsoft VSS that are designed to run "hot backups" do not have sufficient architectural flexibility to map the virtual disks as seen at the hypervisor layer to their representations inside guest VMs. Microsoft's Volume Shadow Copy Service, introduced many years before the adoption of server virtualization, is widely used. But VSS operates on volumes, not on VMs. In the olden days, when the relationship between a disk volume and an application was usually static and 1:1, this was fine. However, in virtual servers, best practice is to place the virtual disks (in VMware, VMDKs; in Hyper-V, VHDs) for many VMs on a single volume.

Invoking VSS creates a shadow copy of the whole volume, even if one wants to back up only a single VM. This makes running the current VSS architecture at the hypervisor level sub-optimal. The "solution" is to run backups at the guest VM level, but that leads back to problems 1 and 2 above.

Given these issues, what would a proper VM backup solution entail? The virtualization backup industry needs to deliver:

Ability to create fast, space-efficient and high performance snapshots of virtual disks attached to guest VMs. It should be possible to snapshot all of the virtual disks associated with a VM in sync, so that a crash-consistent backup image can be created. The virtual disks should be managed in groups according to the customer backup schedules and schema.

Integration of the creation and management of these snapshots with an application-aware backup scheme. Crash-consistent backup is good; application-consistent is better.

Ability to access snapshots on a server other than the one running the live VM. This is crucial for segregating the backup load and its attendant resource drain from live production applications. As discussed above, off-host backup is an especially pressing need in a virtual server environment.

Snapshots that are available online long term. Having snapshots online for long periods is ideal. The space-efficiency features of most snapshots mean that data blocks common among base images and their snapshot children are stored only once. This can make it inexpensive to keep lots of snapshots around for a long time, making it easy to rapidly restore data from previous snapshots.

Unfortunately, many snapshotting products are designed for short periods of persistence. The longer snapshots linger online, the more I/O performance degrades -- not only for the snapshots, but also for their base images -- rendering it impractical to keep snapshots online long term.
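The space-efficiency claim, that blocks shared between a base image and its snapshots are stored only once, can be illustrated with a small Python sketch. It uses content hashing to detect shared blocks, which is an assumption made for brevity; snapshot products more commonly use copy-on-write or redirect-on-write block maps. The `BlockStore` class and its methods are hypothetical names.

```python
import hashlib


class BlockStore:
    """Toy store in which identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block contents

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Store the block only if this content has not been seen before.
        self.blocks.setdefault(digest, data)
        return digest

    def snapshot(self, disk_blocks):
        """Record a snapshot as a list of block digests."""
        return [self.put(block) for block in disk_blocks]

    def restore(self, snap):
        """Rebuild the disk contents a snapshot refers to."""
        return [self.blocks[digest] for digest in snap]


store = BlockStore()

base = [b"A" * 4096, b"B" * 4096, b"C" * 4096]
snap0 = store.snapshot(base)

# One block changes between snapshots; the other two are shared.
changed = [base[0], b"X" * 4096, base[2]]
snap1 = store.snapshot(changed)

stored_blocks = len(store.blocks)  # 4 blocks held, not 6
```

Two three-block snapshots occupy four blocks of storage here, and either one can still be restored in full. The sketch says nothing about the I/O degradation described above, which depends on how a given product chains snapshots to their base images.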

Information for incremental backups. VM image files are large, often tens of gigabytes. Backup software can be much more efficient if it knows that certain parts of the image are unchanged since the last backup. The virtual disk abstraction layer should provide to backup software lists of changed data at the image and object levels.
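As a rough illustration of what such a changed-data list makes possible, the following Python sketch has a virtual disk record which blocks were written and hand that list to the backup job. The `TrackedDisk` class, its methods and `incremental_backup` are hypothetical; they do not correspond to any particular hypervisor's API.

```python
class TrackedDisk:
    """Toy virtual disk that remembers which blocks were written."""

    def __init__(self, num_blocks, block_size=4096):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.dirty = set()  # indices written since the last backup

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.dirty.add(index)

    def take_changed_blocks(self):
        """Return the sorted changed-block list and reset the tracking."""
        changed = sorted(self.dirty)
        self.dirty.clear()
        return changed


def incremental_backup(disk, backup_copy):
    """Copy only the blocks changed since the previous backup."""
    changed = disk.take_changed_blocks()
    for index in changed:
        backup_copy[index] = disk.blocks[index]
    return changed


disk = TrackedDisk(num_blocks=8, block_size=4)
backup = list(disk.blocks)  # initial full backup

disk.write(2, b"abcd")
disk.write(5, b"wxyz")
disk.write(2, b"efgh")  # rewriting a block still counts once

copied = incremental_backup(disk, backup)   # copies blocks 2 and 5 only
copied_again = incremental_backup(disk, backup)  # nothing left to copy
```

Only two of the eight blocks are read and copied, and the backup job never has to scan the image to find out which ones they are.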

No excessive penalties on live VM performance. While this might be an obvious requirement, today's state of the art inflicts substantial performance consequences.

Independence from storage hardware. Proprietary lock-ins are never good for IT organizations. With proper software design there is no reason why a VM backup solution should necessitate proprietary features of particular storage subsystems.

Virtualization is the best thing to happen to data centers (internal and cloud) in a long time. The promise of virtualization is enormous, and today much of the promise is readily attainable. However, until backup solutions optimized for the unique requirements of virtualization are brought to market, broader deployment of virtualization will be impeded.