Tuesday, May 5, 2015

VFIO GPU How To series, part 1 - The hardware

This is an attempt to make a definitive howto guide for GPU assignment with QEMU/KVM and VFIO. It should also be relevant for general PCI device assignment with VFIO. For part 1 I'll simply cover the hardware that I use, its features and drawbacks for this application, and what I might do differently in designing a system specifically for GPU assignment. In later parts we'll get into installing VMs and configuring GPU assignment.

The system I'm using is nothing particularly new or exciting; it's simply a desktop based on an Asus P8H67-M PRO/CSM:

I'm using a Xeon E3-1245 v2, an Ivy Bridge processor. I wouldn't necessarily recommend this particular setup (it's probably only available on ebay anymore anyway), but I'll point out a few interesting things about it to help you pick your own system. First, the motherboard uses an H67 chipset, which is covered by the Intel PCH ACS quirk. This means that devices connected via the PCH root ports will be isolated from each other. That includes anything plugged into the black PCIe 2.0 x16 (electrically x4) slot shown on the top of the picture above, as well as built-in devices hanging off the PCH root ports internally. The blue x16 slot is the higher performance slot driven from the processor root ports (PCIe 3.0 with my processor). Motherboard manuals sometimes include block diagrams indicating which ports are derived from which component, but the one for this board doesn't.

A convenient "feature" of this board is that there's only a single processor-based root port slot. That's not exactly a feature, but it sidesteps a limitation: the processors for this socket (1155) that support Intel VT-d are the Core i5, Core i7, and Xeon E3-1200 series, and none of them support PCIe ACS between the root ports (see here for why that's important). This means multiple root ports would not be isolated from each other. If I had more than one processor-based root port and made use of them, I might need the ACS override patch to fake isolation that may or may not exist.

There are also a couple of conventional PCI slots on this board that are largely useless. Not only because conventional PCI is not a good choice for device assignment, and not only because they're blocked by graphics card heatsinks, but because they're driven by an ASMedia ASM1083 (rev 1) PCIe-to-PCI bridge, which has all sorts of interrupt issues, even on bare metal. This spawns my personal dislike and distrust for anything made by ASMedia.

The onboard NIC is a Realtek RTL8111, which is not particularly interesting either. Realtek NICs are also a poor choice for doing direct device assignment to a guest; they do strange and non-standard things. On my system, I use it with a software bridge and virtio network devices for the VMs. This provides plenty of performance for my use cases (including Steam in-home streaming) as well as local connectivity for Synergy.

A final note on the base platform: I'm using the processor-integrated Intel HD Graphics P4600 for the host graphics. This particular motherboard only allows BIOS selection between IGD, PCIe, and PCI for the primary graphics device. There is no way to specify a particular PCIe slot for primary graphics, as other vendors, like Gigabyte, tend to provide. This motherboard is therefore not a good choice for discrete host graphics, since we only have one fixed configuration for selecting the primary graphics device between plug-in cards.

Note that even though I only have a single PCH root port slot, there are multiple internal root ports connecting the built-in I/O, including the USB 3.0 controller and Ethernet. Without native ACS support or the ACS quirk for these PCH root ports, all of the 1c.* root ports and devices behind them would be grouped together.
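To see how devices actually end up grouped on a given machine, the kernel exposes IOMMU groups under /sys/kernel/iommu_groups. The following is a minimal sketch, not a definitive tool: the function name iommu_groups is my own, and it assumes the IOMMU is enabled in both firmware and kernel so that the sysfs directory exists.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of device addresses} read from sysfs.

    An empty dict means the directory is missing, which usually indicates
    that the IOMMU is disabled or unsupported.
    """
    base = Path(root)
    groups = {}
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        # Each group is a numbered directory holding one entry per device.
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        print(f"group {number}: {' '.join(devices)}")
```

Devices that share a group cannot be isolated from each other, so a GPU that shares its group with unrelated devices is a sign that the slot it sits in lacks ACS (native or quirked).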

Let's move on to the graphics cards. This is an always-on desktop system, so noise and power (and having representative devices) are more important to me than ultimate performance. The cards I'm using are therefore an EVGA GTX 750 Superclocked and an AMD Radeon HD 8570 OEM card.

The GTX 750 is based on Maxwell, giving it an excellent power-performance ratio, and the 8570 is based on Oland, making it one of the newer generation of GCN chips from AMD. The 8570 is by no means a performance card, but I don't have room for a double-wide graphics card in my PCH root port slot and it's only running x4 electrically anyway. OEM cards seem to be a good way to find cheap cards on ebay, but their cooling solutions leave something to be desired. I actually replaced the heatsink fan on the 8570 with a Cooler Master CoolViva Z1. I'll also mention that before upgrading to the GTX 750 I successfully ran a GT 635 OEM, which is fairly comparable in specs and price to the 8570.

This system was not designed or purchased with this particular use case in mind. In fact, it only gained VT-d capabilities after upgrading from the Core i3 processor that I originally had installed. So what would an ideal system be for this purpose? First, IOMMU support via Intel VT-d or AMD-Vi is required. This is not negotiable. If we stay with Intel Core i5/i7 (no VT-d support in i3) or Xeon E3-12xx series processors, then we need to be well aware of the lack of ACS support on processor root ports. In a setup like the one above, I'm more limited by physical slots, so this is not a problem. If I wanted more processor-based root port slots, the trade-off in using these processors is the lack of isolation. The ACS override patch will not go upstream and is not recommended for downstreams or users due to the potential risk in assuming isolation where none may exist. The alternative is to set our sights on Xeon E5 or higher processors. This is potentially a higher price point, but I see plenty of users spending many hundreds of dollars on high-end graphics cards, yet skimping on the processor and complaining about needing to patch their kernel. Personally I'd rather put more towards the platform to avoid that hassle.

There are also users that prefer AMD platforms.
Personally I don't find them that compelling. The latest non-APU chipset is several years old and the processors are too power hungry for my taste. A-series processors aren't much better and their chipsets are largely unknown with respect to both isolation and IOMMU support. Besides the processor and chipset technologies, Intel has a huge advantage with http://ark.intel.com/ in being able to quickly and easily research the capabilities of a given product.

But what about the graphics cards? If you're looking for a solution supported by the graphics card vendor, you're limited to Nvidia Quadro K-series, model 2000 or better (or GRID or Tesla, but those are not terribly relevant in this context). Nvidia supports running these cards in QEMU/KVM virtual environments using VFIO in a secondary display configuration. In other words, pre-boot (BIOS), OS boot, and initial installation and maintenance are done using an emulated graphics device, and the GPU is only activated once the proprietary graphics drivers are enabled in the guest. This mode of operation does not depend on the graphics ROM (legacy vs UEFI) and works with current Windows and Linux guests.

When choosing between GeForce and Radeon, there's no clear advantage of one versus the other that's necessarily sufficient to trump personal preference. AMD cards are known to experience occasional blue screens for Windows guests and a couple of the more recent GPUs have known reset issues. On the other hand, evidence suggests that Nvidia may be actively trying to subvert VM usage of GeForce graphics cards. As noted in the FAQ, we need to both hide KVM as the hypervisor and disable KVM's support for Microsoft Hyper-V extensions in order for Nvidia's driver to work in the VM. The statement from Nvidia has always been that they are not intentionally trying to block this use, but that it's not supported and won't be fixed.
Personally, that's a hard explanation to swallow.

My observation is that AMD appears more interested in supporting VM use cases, but they're not doing anything in particular to enable it or make it work better. Nvidia generally works better, but each new driver upgrade is an opportunity for Nvidia to introduce new barriers or shut down this usage model entirely. If Nvidia were to make a gesture of support by fixing the current "bugs" in hypervisor detection, the choice would be clear IMHO.

Users also often like to assign additional hardware to VMs to make them feel more like separate systems. In my use case, I can run multiple guests simultaneously and have a monitor for each graphics card, but they are only used by a single user, me. I'm therefore perfectly happy using Synergy to share my mouse and keyboard, and virtio disk and network for the VMs work well. For an actual multi-seat use case, being able to connect an individual mouse and keyboard per seat is obviously useful. USB "passthrough", which is different from "assignment", works on individual USB endpoints and is one solution to this problem, but generally doesn't work well with hotplug (in my experience). Using multiple USB host controllers, with a host controller assigned per VM, is another option for providing a more native solution. This however means that we start increasing our slot requirements and therefore our concerns about isolation between those slots.

Assigning physical NICs to a system is also an option, though for a desktop setup it's generally unnecessary. In the 1Gbps realm, virtio can easily handle the bandwidth, so the advantage of assignment is lower latency and more aggregate throughput among VMs and host. If 10Gbps is a concern, assignment becomes more practical. If you do decide to assign NICs, I find Intel NICs to be a good choice for assignment.

I generally discourage users from assigning disk controllers directly to a VM.
Often the built-in controllers store their boot ROM in the system firmware, making it difficult to actually boot the guest from an assigned HBA. Beyond that, the additional latency imposed by a paravirtual disk controller is typically substantially less than the latency of the disk itself, so the return on investment and additional complication of an assigned storage controller is generally not worthwhile for average configurations.

Audio is also sometimes an issue for users. In my configuration I use HDMI audio from the graphics card for both VMs. This works well, so long as we make sure the guest is using MSI interrupts for the audio devices. Other users prefer USB audio devices, often in a passthrough configuration. Connecting separate VMs to the host user's PulseAudio session is generally difficult, but not impossible. A contributing problem is that assigned devices, such as the graphics card, need to be used in libvirt's "system" mode, while connecting to the user's PulseAudio daemon would probably be easier in a libvirt user session.

Finally, the number of processor cores and total memory size are important factors for the guests you'll eventually be running. When using PCI device assignment, VM memory cannot be over-committed. The assigned device is capable of DMA through the IOMMU, which means that all of guest memory needs to be not only allocated in advance, but pinned into memory to make the guest physical to host physical mappings static for the VM. Total memory therefore needs to accommodate all assigned device VMs that might be run simultaneously, plus memory for whatever other applications the host is running. vCPUs can be over-committed, regardless of an assigned device, but doing so breaks down some of the performance isolation we gain by device assignment.
If a "native" feel is desired, we'll want enough cores that we don't need to share between VMs and a few left over for applications and overhead on the host.

Hopefully that helps give you an idea of where to start with choosing hardware. In the next segment we'll cover what to expect in a GPU assignment VM, as users often have misconceptions around where the display is output and how to interact with the VM.
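As a rough illustration of the memory and core sizing rules described above, here is a small sketch of the arithmetic. The function names and all of the numbers are hypothetical; the only thing taken from the text is the rule itself: guest memory for assigned-device VMs is fully pinned, so the sizes simply add up, and cores should not be shared if a native feel is the goal.

```python
def memory_budget(host_total_mib, host_reserve_mib, vm_sizes_mib):
    """Return (fits, spare_mib) for VMs whose memory must be fully pinned.

    With device assignment there is no over-commit, so every VM that may
    run at the same time counts at its full configured size.
    """
    needed = host_reserve_mib + sum(vm_sizes_mib)
    spare = host_total_mib - needed
    return spare >= 0, spare


def spare_cores(host_cores, vm_vcpus):
    """Cores left for the host if no core is shared between VMs.

    A negative result means vCPUs would have to be over-committed.
    """
    return host_cores - sum(vm_vcpus)


if __name__ == "__main__":
    # Hypothetical plan: 32 GiB host, 4 GiB kept for the host, two 8 GiB guests.
    print(memory_budget(32 * 1024, 4 * 1024, [8 * 1024, 8 * 1024]))
    # Hypothetical plan: 8 host cores, two guests with 3 vCPUs each.
    print(spare_cores(8, [3, 3]))
```

If the memory check fails, the options are fewer simultaneous guests, smaller guests, or more RAM; a negative core count only costs performance isolation rather than preventing the VMs from running.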

5 comments:

Thanks for the article. This is very helpful. One question remains: Do I only have to look at the isolation properties if I don't want to use the ACS patch? In other words, regardless of what the IOMMU groups look like without the ACS patch, can I always get proper isolation by using it? Of course there are risks, as you wrote, but I already use it and I haven't experienced any downsides so far.

Not to be overly dramatic, but that's sort of like playing Russian Roulette and saying you haven't been shot yet. You may never see a problem. You may also update QEMU some day, or install another card, or update some other software or hardware component and that lack of isolation will start to bite you. You might not even notice. Personally I'm not going to risk it, nor do I have any interest in patching my kernel forever, and if you know how IOMMU groups work and choose your hardware configuration appropriately, it's avoidable.

"..choose your hardware configuration approriately..." call me pessimistic, but maybe that's not too easy to achieve, I guess. In the best case someone else with the same goal had a piece of hardware already tested to provide proper information. I guess I will see how good this works when it comes to new/recent hardware

It's easy to achieve: use a Xeon E5 or newer with an X79 or X99 chipset and you should be able to do whatever you want. If you choose to use an i5/i7 or Xeon E3-1200, then there are limitations, but often they can be worked around. See recent posts in the archlinux thread for further suggestions.

Personally I use udev rules to attach devices automatically to the desired machine. This may not solve all problems, but it seems the most convenient way to solve mine. Example:

    # Attach the device to the "infmc" guest when it is plugged in
    ACTION=="add", \
        SUBSYSTEM=="usb", \
        ENV{ID_VENDOR_ID}=="046d", \
        ENV{ID_MODEL_ID}=="c21f", \
        RUN+="/usr/bin/virsh attach-device infmc /etc/libvirt/qemu/infmc-devices/logitech-f710.xml"

    # Detach it from the guest again when it is unplugged
    ACTION=="remove", \
        SUBSYSTEM=="usb", \
        ENV{ID_VENDOR_ID}=="046d", \
        ENV{ID_MODEL_ID}=="c21f", \
        RUN+="/usr/bin/virsh detach-device infmc /etc/libvirt/qemu/infmc-devices/logitech-f710.xml"

The only issue I've met is that when you shut down the machine (I have to do it before suspending the host system, for example) the device is detached, and you may need to replug it.