3
Software aging of VMMs
- Software aging of a VMM is critical
- Software aging is the phenomenon that software state degrades with time
  - E.g. exhaustion of system resources
- Software aging of a VMM affects all VMs on it
  - E.g. performance degradation
[Figure: several VMs running on one VMM]

4
Software rejuvenation of VMMs
- Preventive maintenance
  - Performed before software aging of a VMM affects its VMs
  - Occasionally stops a VMM, cleans its internal state, and restarts it
- Typical example: rebooting a VMM
  - Cleans the internal state automatically and completely
  - The easiest way

5
Drawbacks (1/2): Increasing service downtime
- The VMM reboot needs:
  - Rebooting all OSes running on the VMs
    - The time tends to be long: a larger number of VMs, a longer startup time of services
  - A hardware reset
    - The BIOS power-on self test is time-consuming
[Figure: reboot timeline for the OSes and the VMM: OS shutdown, VMM shutdown, hardware reset, VMM boot, OS boot]

6
Drawbacks (2/2): Performance degradation
- The file cache is lost by the OS reboot
  - OSes cannot restore performance until the file cache is re-filled
  - They strongly rely on the file cache to speed up file accesses
- The time tends to be long
  - The file cache size is increasing
  - A large amount of memory is given to a VM
  - Free memory is used as the file cache
[Figure: a process accessing the disk through the OS file cache]

7
Warm-VM reboot
- Fast rejuvenation technique
  - Efficiently reboots only a VMM
  - The VMM reboot causes no OS reboot
- Basic idea
  - Suspend all VMs before the VMM reboot
  - Resume them after the reboot
- Challenge
  - How does a VMM efficiently deal with the large memory images of VMs?
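The basic idea above (suspend all VMs, reboot only the VMM, resume the VMs) can be sketched as a toy simulation. This is a minimal illustration, not the actual implementation: the `VM` and `Host` classes, the `preserved` table, and `warm_vm_reboot` are all hypothetical names, and a `bytearray` merely stands in for a VM's memory image.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    memory: bytearray          # stand-in for the VM's memory image (incl. file cache)
    state: str = "running"


@dataclass
class Host:
    vms: list
    vmm_generation: int = 0                                  # counts VMM boots
    vmm_internal_state: dict = field(default_factory=dict)   # the state that "ages"
    preserved: dict = field(default_factory=dict)            # VM name -> image kept in RAM


def warm_vm_reboot(host: Host) -> Host:
    """Hypothetical sketch: reboot only the VMM while VM images stay in memory."""
    # 1. Suspend every VM; its image is kept in place (a reference, not a copy).
    for vm in host.vms:
        vm.state = "suspended"
        host.preserved[vm.name] = vm.memory

    # 2. Reboot only the VMM: its aged internal state is discarded.
    host.vmm_internal_state = {}
    host.vmm_generation += 1

    # 3. Resume every VM from the image that was preserved in memory.
    for vm in host.vms:
        vm.memory = host.preserved.pop(vm.name)
        vm.state = "running"
    return host


if __name__ == "__main__":
    image = bytearray(b"file cache and process memory")
    host = Host(vms=[VM("vm1", image)], vmm_internal_state={"leaked_handles": 123})
    warm_vm_reboot(host)
    print(host.vmm_generation, host.vms[0].state, host.vms[0].memory is image)
```

The point of the sketch is that step 1 and step 3 never copy the image, so their cost does not grow with the memory size of the VMs.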

8
On-memory suspend of VMs
- Freezes the memory images of VMs on the main memory
  - That memory area is just reserved
  - The time does not depend on the memory size
  - Saving them into a slow disk is inefficient
- ACPI S3 state for VMs
  - Suspend To RAM
  - Traditional suspend is the ACPI S4 state
[Figure: the VM is frozen in main memory rather than saved to disk]

9
On-memory resume of VMs
- Unfreezes the memory images preserved on the main memory
  - They are reused directly as the memory of VMs
  - No need to read them from a slow disk
- The file cache of OSes is also restored
  - No performance degradation
[Figure: the VM is unfrozen from main memory rather than read from disk]

10
Quick reload of VMMs
- Directly boots a new VMM without a hardware reset
- The memory images of VMs are preserved through the VMM reboot
  - Software can keep track of them
  - A hardware reset does not guarantee this
- A VMM is rebooted quickly
  - No overhead due to a hardware reset
[Figure: old VMM, preloaded new VMM, and VMs in main memory]

11
Comparison with other methods
- Cold-VM reboot
  - Needs the OS reboot
- Saved-VM reboot
  - A naive implementation of the warm-VM reboot
  - VMs are saved into a disk

Reboot method                 Cold-VM   Saved-VM   Warm-VM
Depend on # of VMs            Yes       No         No
Depend on services            Yes       No         No
Depend on mem size of VMs     No        Yes        No
Performance degradation       Yes       No         No

12
Model for availability
- Must consider the software rejuvenation of both a VMM and OSes
- Warm-VM reboot
  - The OS rejuvenation is independent
- Cold-VM reboot
  - The OS rejuvenation is affected by the VMM rejuvenation
  - The number of OS rejuvenations increases
[Figure: timelines of OS rejuvenation and VMM rejuvenation for the two methods]

13
RootHammer
- We have implemented the warm-VM reboot in Xen 3.0.0
- On-memory suspend/resume
  - Based on Xen's suspend/resume
  - Manages the mapping from the VM memory to the physical memory
- Quick reload
  - Based on the kexec mechanism in Linux
  - Kexec for a VMM is included in the latest Xen, but it is not for reusing the memory images
[Figure: mapping from VM memory to physical memory]

19
Performance degradation
- The throughput of the Apache web server before and after the VMM reboot
- Warm-VM reboot
  - No degradation
- Cold-VM reboot
  - Degraded by 69%

20
Software rejuvenation in a cluster environment
- Clustering achieves zero downtime
  - Multiple hosts can provide the same service
- Let us consider the total throughput of all hosts in a cluster
  - m: # of hosts, p: throughput of one host
  - Warm-VM reboot: (m-1)p
  - Cold-VM reboot: (m-1)p, then (m-0.69)p for a while after the reboot
[Figure: total throughput over time t, with levels mp and (m-1)p and durations of 42 sec and 241 sec]
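The throughput levels on this slide can be written out as a small function. This is only a sketch of the arithmetic: the function name, the phase labels, and the reading that the cluster runs at mp outside of rejuvenation are assumptions made for illustration; the 0.69 factor is the Apache degradation reported for the cold-VM reboot.

```python
COLD_REBOOT_DEGRADATION = 0.69  # fraction of one host's throughput lost after a cold-VM reboot


def total_throughput(m: int, p: float, method: str, phase: str) -> float:
    """Total cluster throughput for m hosts of throughput p each.

    method: "warm" or "cold" (the reboot method used on the one host being rejuvenated)
    phase:  "normal", "rebooting", or "after_reboot" (hypothetical labels)
    """
    if method not in ("warm", "cold"):
        raise ValueError(f"unknown method: {method}")
    if phase == "normal":
        return m * p                      # all hosts serve requests
    if phase == "rebooting":
        return (m - 1) * p                # one host is down, for either method
    if phase == "after_reboot":
        if method == "warm":
            return m * p                  # file cache preserved, no degradation
        return (m - COLD_REBOOT_DEGRADATION) * p  # file cache must be re-filled
    raise ValueError(f"unknown phase: {phase}")


if __name__ == "__main__":
    for method in ("warm", "cold"):
        for phase in ("normal", "rebooting", "after_reboot"):
            print(method, phase, total_throughput(4, 100.0, method, phase))
```

The two methods differ in two ways: how long the "rebooting" phase lasts, and whether an "after_reboot" phase with reduced throughput exists at all.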

21
Comparison with VM migration in a cluster environment
- VM migration achieves nearly zero downtime
  - VMs are moved to another host
  - Xen's live migration, VMware's VMotion
- Total throughput
  - Normal run: (m-1)p
    - One host is reserved for migration
  - Live migration: (m-1.12)p
[Figure: total throughput over time t, with levels mp and (m-1)p and durations of 42 sec and 17 min]
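To compare the two approaches over one rejuvenation cycle, the throughput levels above can be turned into "lost work" relative to the ideal mp. This is a rough sketch under stated assumptions: the function names and the `interval_sec` parameter are hypothetical, and using 42 sec for the warm-VM reboot and 17 min for the live migration is a reading of the figure labels, so both durations are passed in as parameters rather than fixed.

```python
def lost_work_warm_reboot(p: float, reboot_sec: float) -> float:
    """Work lost relative to m*p when one host is down for reboot_sec.

    During the reboot the cluster runs at (m-1)p, so p is lost per second.
    """
    return p * reboot_sec


def lost_work_migration(p: float, interval_sec: float, migration_sec: float,
                        migration_penalty: float = 0.12) -> float:
    """Work lost relative to m*p over one interval when a spare host is reserved.

    Normal run is (m-1)p, so p is lost for the whole interval; while migrating,
    throughput is (m-1.12)p, so an extra 0.12p is lost during migration_sec.
    """
    if migration_sec > interval_sec:
        raise ValueError("migration cannot last longer than the interval")
    return p * interval_sec + migration_penalty * p * migration_sec


if __name__ == "__main__":
    p = 100.0                      # hypothetical per-host throughput (requests/sec)
    interval = 24 * 3600           # hypothetical rejuvenation interval: one day
    print(lost_work_warm_reboot(p, reboot_sec=42))
    print(lost_work_migration(p, interval, migration_sec=17 * 60))
```

Note that the number of hosts m cancels out: the loss depends only on p and on how long each reduced-throughput phase lasts.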

22
Related work
- Microreboot [Candea et al. '04]
  - Reboots only a part of subcomponents
  - The warm-VM reboot enables rebooting only a parent component (the VMM for VMs)
- Checkpointing/restart [Randell '75]
  - Saves/restores OS processes
  - Similar to suspend/resume of VMs
- Optimizations of suspend/resume
  - Incremental suspend, compression of memory images