VMotion – Seamless OS load-balancing

One of VMware’s key, and extremely cool, features is VMotion. VMotion allows an individual virtual machine (let’s say a Windows Server 2003 guest) to be moved dynamically to another VMware server without impact to users. Quoting directly from VMware’s VMotion page:

VMware VMotion technology, unique to VMware,
leverages the complete virtualization of servers,
storage and networking to move an entire running
virtual machine instantaneously from one server
to another. The entire state of a virtual machine
is encapsulated by a set of files stored on
shared storage, and VMware’s VMFS cluster file
system allows both the source and the target ESX
Server to access these virtual machine files
concurrently. The active memory and precise
execution state of a virtual machine can then
be rapidly transmitted over a high speed
network. Since the network is also virtualized
by ESX Server, the virtual machine retains its
network identity and connections, ensuring a
seamless migration process.

Put more simply, the Windows VM is copied to another physical machine, the switch CAM tables are updated with a gratuitous ARP, traffic flows to the new VMware server, and the old instance is shut down. All of this is completed in about 2 seconds, which is short enough to prevent any TCP session failures. Users don’t have a clue.

VMotion can be started manually (for, let’s say, a change window) or dynamically, based on resource usage, by VMware’s Distributed Resource Scheduler (DRS). If a single VMware server’s CPU usage gets above a certain limit, DRS moves a few VMs to another server, balancing the load. Users have no clue.
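To make the DRS idea concrete, here is a rough sketch of threshold-based balancing. This is not VMware’s actual DRS algorithm, which is not described here; the function name `rebalance`, the 80% default limit, and the "move the smallest VM to the least-loaded host" rule are all assumptions chosen only to illustrate the behavior.

```python
def rebalance(hosts, cpu_limit=80):
    """Illustrative only: NOT the real DRS algorithm.

    hosts maps a host name to {vm_name: cpu_percent}. cpu_limit is a
    hypothetical per-host threshold in percent. Returns the list of
    (vm, source, target) moves and updates hosts in place.
    """
    def load(host):
        return sum(hosts[host].values())

    moves = []
    for src in list(hosts):
        while hosts[src] and load(src) > cpu_limit:
            dst = min(hosts, key=load)                 # least-loaded host
            vm = min(hosts[src], key=hosts[src].get)   # smallest VM on the hot host
            # Give up if there is nowhere better to go, or if the move
            # would just push the target over the limit instead.
            if dst == src or load(dst) + hosts[src][vm] > cpu_limit:
                break
            hosts[dst][vm] = hosts[src].pop(vm)        # the "VMotion" step
            moves.append((vm, src, dst))
    return moves
```

Calling `rebalance` with one host at 100% and another at 10% moves VMs off the busy host until it is back at or under the limit. A real scheduler would weigh memory, affinity rules, and migration cost as well.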

At the network level, there are two major requirements for VMotion. First, a dedicated gigabit Ethernet VLAN is needed for VMotion traffic. This dedicated VLAN and bandwidth ensure VMs can be moved without impact to users. Second, and most importantly, the group of servers that VMs can be balanced between must be in the same Layer-2 domain. When a VM moves, it cannot change any attributes, such as its IP address. Thus, all target servers must have connections to the same VLANs as the source VMware server.
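The same-VLAN rule is easy to express as a set check. The sketch below assumes you keep your own inventory of which VLANs are trunked to each server; the function name `eligible_targets` and the data layout are hypothetical and not part of any VMware tool.

```python
def eligible_targets(vm_vlans, host_vlans, source):
    """Return the hosts (other than source) that carry every VLAN the VM uses.

    vm_vlans:   iterable of VLAN IDs the VM's virtual NICs sit in
    host_vlans: dict mapping host name -> iterable of VLAN IDs trunked to it
    Both structures are a hypothetical inventory format for illustration.
    """
    needed = set(vm_vlans)
    return sorted(
        host
        for host, vlans in host_vlans.items()
        if host != source and needed <= set(vlans)  # subset test
    )
```

For example, a VM in VLANs 10 and 20 can only land on servers whose trunks include both. A server carrying only VLAN 10 is not a valid target, because the VM would lose its VLAN 20 connectivity the moment it arrived.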

This requirement will definitely lead to an expansion of Layer-2 domains, and thus of spanning-tree. VMotion is just too cool and useful not to use. Server teams are going to demand large “VMware Farms” to balance and move VMs around. Each of these servers requires several physical connections, so switch ports will be at a premium. This means more switches and, since all VMware servers must be in the same VLANs, larger spanning-tree fault domains. Over the last few years, network design best practices have minimized, or at least significantly controlled, spanning-tree and Layer-2 domains. That is likely to begin changing as VMware becomes more mainstream. New technologies like Cisco’s VSS will become very important to limit spanning-tree’s impact (when it goes wrong…and it will go wrong…or at least one of your engineers will go wrong with it).