Short notes and essays about stuff that interests me (mostly technical stuff).

Sunday, February 24, 2013

Getting started with virtualization

System virtualization is one of the most powerful, yet most complicated, topics in computing.

If, like me, you're trying to deepen your knowledge about the basics of virtualization, may I suggest this short syllabus as an introduction to the concepts and principles?

System virtualization is actually one of the oldest sub-fields in computing; it's been intensively studied for 50 years. One of the best early papers about system virtualization is Creasy's The Origin of the VM/370 Time-Sharing System. I actually spent some time on a VM/370 system in the mid-1980s, when I was working in the mainframe DBMS market (this was before DB2). Creasy notes that the critical innovation of VM/370 was its recognition of computer resource management as a separate layer in the operating system software:

A key concept of the CP/CMS design was the bifurcation of computer resource management and user support. In effect, the integrated design was split into CP and CMS. CP solved the problem of multiple use by providing separate computing environments at the machine instruction level for each user. CMS then provided single user service unencumbered by the problems of sharing, allocation, and protection.

There were many other early studies of computer resource virtualization; Goldberg's article is a great place to look for more historical background: A Survey of Virtual Machine Research.

The modern revival of virtualization began in the late 1990s with Stanford's Disco project, which applied those old ideas to commodity multiprocessors: Disco: Running Commodity Operating Systems on Scalable Multiprocessors

Rather than attempting to modify existing operating systems to run on scalable shared-memory multiprocessors, we insert an additional layer of software between the hardware and the operating system. This layer of software, called a virtual machine monitor, virtualizes all the resources of the machine, exporting a more conventional hardware interface to the operating system. The monitor manages all the resources so that multiple virtual machines can coexist on the same multiprocessor.

Another major project in system virtualization was happening at the University of Cambridge in the UK. Originally called the "XenoServer" project, it soon became known as Xen. There are many important resources for learning about the Xen project, but the natural starting point is the primary paper: Xen and the Art of Virtualization

We avoid the drawbacks of full virtualization by presenting a virtual machine abstraction that is similar but not identical to the underlying hardware, an approach which has been dubbed paravirtualization. This promises improved performance, although it does require modifications to the guest operating system. It is important to note, however, that we do not require changes to the application binary interface (ABI), and hence no modifications are required to guest applications.

As you will learn from the above, the primary techniques for system virtualization around the turn of the century required either that the guest software be modified, or that the Virtual Machine Monitor perform dynamic modification of the binary machine code of the guest software. Both approaches are extremely complex, so there was intense interest in improving the situation. To understand the issues, look to this paper from Adams and Agesen: A Comparison of Software and Hardware Techniques for x86 Virtualization

Ignoring the legacy “real” and “virtual 8086” modes of x86, even the more recently architected 32- and 64-bit protected modes are not classically virtualizable:

Visibility of privileged state. The guest can observe that it has been deprivileged when it reads its code segment selector (%cs) since the current privilege level (CPL) is stored in the low two bits of %cs.

Lack of traps when privileged instructions run at user-level. For example, in privileged code popf (“pop flags”) may change both ALU flags (e.g., ZF) and system flags (e.g., IF, which controls interrupt delivery). For a deprivileged guest, we need kernel mode popf to trap so that the VMM can emulate it against the virtual IF. Unfortunately, a deprivileged popf, like any user-mode popf, simply suppresses attempts to modify IF; no trap happens.

Another active area of work involved the virtualization of hardware devices. Rosenblum and Waldspurger's article in ACM Queue is a great place to start: I/O Virtualization

When an application running within a VM issues an I/O request, typically by making a system call, it is initially processed by the I/O stack in the guest operating system, which is also running within the VM. A device driver in the guest issues the request to a virtual I/O device, which the hypervisor then intercepts. The hypervisor schedules requests from multiple VMs onto an underlying physical I/O device, usually via another device driver managed by the hypervisor or a privileged VM with direct access to physical hardware.

The low-level mechanics of this interception are described in Sugerman, Venkitachalam, and Lim's paper on VMware's hosted architecture: Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor

Whenever the guest performs an I/O operation, the VMM will intercept it and switch to the host world rather than accessing the native hardware directly. Once in the host world, the VMApp will perform the I/O on behalf of the virtual machine through appropriate system calls. For example, an attempt by the guest to fetch sectors from its disk will become a read() issued to the host for the corresponding data. The VMM also yields control to the host OS upon receiving a hardware interrupt. The hardware interrupt is reasserted in the host world so that the host OS will process the interrupt as if it came directly from hardware.

The goal of the PCI-SIG SR-IOV specification is to standardize on a way of bypassing the VMM’s involvement in data movement by providing independent memory space, interrupts, and DMA streams for each virtual machine. SR-IOV architecture is designed to allow a device to support multiple Virtual Functions (VFs) and much attention was placed on minimizing the hardware cost of each additional function.

One of the hottest parts of the virtualization world recently is the area of Network Virtualization. Entire conferences are devoted to this topic nowadays, and you won't find a shortage of things to read. For a taste of what's going on, consider this recent paper by the Nicira team: Fabric: A Retrospective on Evolving SDN

We then describe how we might create a better form of SDN by retrospectively leveraging the insights underlying MPLS. While OpenFlow has been used to build MPLS LSRs [12], we propose drawing architectural lessons from MPLS that apply to SDN more broadly. This modified approach to SDN revolves around the idea of network fabrics, which introduces a new modularity in networking that we feel is necessary.

The paper essentially proposes a refinement to both OpenFlow and the SDN architectural model. We might call it SDN 2.0, though that might seem a little glib and presumptuous (at least on my part). Regardless of what we call it, it's evident that the vanguard of the SDN community continues to work hard toward a new kind of cloud-era networking: software-based services running over a brawny but relatively simple network infrastructure.

After five decades, it's quite clear that system virtualization remains one of the most important and most complicated areas of computing; it's unlikely that will change soon.

Read any great works on system virtualization? I'm always looking for good ideas to add to my reading list.
Let me know!
