Tuesday, May 5, 2015

VFIO GPU How To series, part 3 - Host configuration

For my setup I'm using a Fedora 21 system with the virt-preview yum repos to get the latest QEMU and libvirt support, along with Gerd Hoffmann's firmware repo for the latest EDK2 OVMF builds. I hope, though, that the setup throughout this howto series is mostly distribution agnostic; just make sure you're running a newer distribution with current kernels and tools. Feel free to add comments for other distributions if something is markedly different.

The first thing we need to do on the host is enable the IOMMU. To do this, verify that IOMMU support is enabled in the host BIOS. How to do this will be specific to your hardware/BIOS vendor. If you can't find an option, don't fret, it may be tied to processor virtualization support. If you're using an Intel processor, check http://ark.intel.com to verify that your processor supports VT-d before going any further.

Next we need to modify the kernel commandline to allow the kernel to enable IOMMU support. This will be similar between distributions, but not identical. On Fedora we need to edit /etc/sysconfig/grub. Find the GRUB_CMDLINE_LINUX line and within the quotes add either intel_iommu=on or amd_iommu=on, depending on whether your platform is Intel or AMD. You may also want to add the option iommu=pt, which sets the IOMMU into passthrough mode for host devices. This reduces the overhead of the IOMMU for host-owned devices, but also removes any protection the IOMMU may have provided against errant DMA from devices. If you weren't using the IOMMU before, there's nothing lost. Regardless of passthrough mode, the IOMMU will provide the same degree of isolation for assigned devices.
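For example, on an Intel system the edited line in /etc/sysconfig/grub might end up looking like the following (a sketch; keep whatever options your file already contains, and substitute amd_iommu=on on an AMD platform):

```
GRUB_CMDLINE_LINUX="rhgb quiet intel_iommu=on iommu=pt"
```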

Save the system grub configuration file and use your distribution provided update scripts to apply this configuration to the boot-time grub config file. On Fedora, the command is:

# grub2-mkconfig -o /etc/grub2.cfg

If your host system boots via UEFI, the correct target file is /etc/grub2-efi.cfg.

With these changes, reboot the system and verify that the IOMMU is enabled. To do this, first verify that the kernel booted with the desired updates to the commandline. We can check this using:

# cat /proc/cmdline

If the changes are not there, verify that you've booted the correct kernel or double check instructions specific to your distribution. If they are there, then we next need to check that the IOMMU is actually functional. The easiest way to do this is to check for IOMMU groups, which are setup by the IOMMU and will be used by VFIO for assignment. To do this, run the following:
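The command listing did not survive in this copy of the post; the standard check is find /sys/kernel/iommu_groups/ -type l (the same command quoted in a comment further down), wrapped here in a small sketch that also handles the case where the directory is absent:

```shell
#!/bin/sh
# Print one line per device symlink under the IOMMU groups directory.
# Empty output (or a missing directory) means the IOMMU is not working.
list_iommu_groups() {
    dir=${1:-/sys/kernel/iommu_groups}   # argument exists only to make testing easy
    if [ -d "$dir" ]; then
        find "$dir" -type l
    fi
}

list_iommu_groups
```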

If you get output like the above, then the IOMMU is working. If you do not get a list of devices, then something is wrong with the IOMMU configuration on your system, either not properly enabled or not supported by the hardware, and you'll need to figure out the problem before moving forward.

This is also a good time to verify that we have the desired isolation via the IOMMU groups. In the above example, there's a separate group per device except for the following groups: 1, 9, 11, and 12. Group 1 includes:

This includes the processor root port and my GeForce card. This is a case where the processor root port does not provide isolation and is therefore included in the IOMMU group. The host driver for the root port should remain in place, with only the two endpoint devices, the GPU itself and its companion audio function, bound to vfio-pci.

Group 9 has a similar constraint, though in this case device 0000:00:1c.7 is not a root port, but a PCI bridge. Since this is conventional PCI, the bridge and all of the devices behind it are grouped together. Device 0000:05:00.0 is another bridge, so there's nothing assignable in the IOMMU group anyway.

Group 11 is composed of internal components: an ISA bridge, SATA controller, and SMBus device. These are grouped because there's no ACS between the devices and therefore no isolation. I don't plan to assign any of these devices anyway, so it's not an issue.

Group 12 includes only the functions of my second graphics card, so the grouping here is also reasonable and perfectly usable for our purposes.

If your grouping is not reasonable, or usable, you may be able to "fix" this by using the ACS override patch, but carefully consider the implications of doing this. There is a potential for putting your data at risk. Read my IOMMU groups article again to make sure you understand the issue.

Next we need to handle the problem that we only intend to use the discrete GPUs for guests; we do not want host drivers attaching to them. This avoids issues with the host driver unbinding and re-binding to the device. Generally this is only necessary for graphics cards, though I also throw in the companion audio function to keep the host desktop from getting confused about which audio device to use. We have a couple of options for doing this. The most common option is to use the pci-stub driver to claim these devices before native host drivers have the opportunity. Fedora builds the pci-stub driver statically into the kernel, giving it loading priority over any loadable modules, simplifying this even further. If your distro doesn't, keep reading; we'll cover a similar scenario with vfio-pci.

The first step is to determine the PCI vendor and device IDs we need to bind to pci-stub. For this we use lspci:

The Vendor:Device IDs for my GPUs and audio functions are therefore 10de:1381, 10de:0fbc, 1002:6611, and 1002:aab0. From this, we can craft a new option to add to our kernel commandline using the same procedure as above for the IOMMU. In this case the commandline addition looks like this:

pci-stub.ids=10de:1381,10de:0fbc,1002:6611,1002:aab0

After adding this to our grub configuration, using grub2-mkconfig, and rebooting, lspci -nnk for these devices should list pci-stub for the kernel driver in use.

A further trick we can use is to craft an ids list using the advanced parsing of PCI vendor and class attributes to create an option list that will claim any Nvidia or AMD GPU or audio device:
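The list itself is missing from this copy. The ids option accepts extended matches of the form vendor:device[:subvendor[:subdevice[:class[:class_mask]]]], with ffffffff acting as a wildcard, so a list claiming any NVIDIA (10de) or AMD (1002) display-class function (class 03xx00) or HD audio function (class 040300) can be written along these lines:

```
pci-stub.ids=10de:ffffffff:ffffffff:ffffffff:00030000:ffff00ff,10de:ffffffff:ffffffff:ffffffff:00040300:ffffffff,1002:ffffffff:ffffffff:ffffffff:00030000:ffff00ff,1002:ffffffff:ffffffff:ffffffff:00040300:ffffffff
```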

If you're using kernel v4.1 or newer, the vfio-pci driver supports the same ids option, so you can directly attach devices to vfio-pci and skip pci-stub. vfio-pci is not generally built statically into the kernel, so we need to force it to be loaded early. To do this on Fedora we need to set up the module options we want to use with modprobe.d. I typically use a file named /etc/modprobe.d/local.conf for local, i.e. system-specific, configuration. In this case, that file would include:
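The file contents were lost from this copy; given the IDs found with lspci above, the file would contain a single options line like:

```
options vfio-pci ids=10de:1381,10de:0fbc,1002:6611,1002:aab0
```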

Next we need to ensure that dracut includes the necessary modules to load vfio-pci. I therefore create /etc/dracut.conf.d/local.conf with the following:

add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"

(Note, the vfio_virqfd module only exists in kernel v4.1+.)

Finally, we need to tell dracut to load vfio-pci first. This is done by once again editing our grub config file and adding the option rd.driver.pre=vfio-pci. Note that in this case we no longer use a pci-stub.ids option from grub, since we're replacing it with vfio-pci. Regenerate the dracut initramfs with dracut -f --kver `uname -r` and reboot to see the effect (the --regenerate-all dracut option is also sometimes useful).

Another issue that users encounter when sequestering devices is what to do when there are multiple devices with the same vendor:device ID and some are intended to be used for the host. Some users have found the xen-pciback module to be a suitable stand-in for pci-stub, with the additional feature that the "hide" option for this module takes device addresses rather than device IDs. I can't load this module on Fedora, so here's my solution, which I like a bit better.

Create a small script; I've named mine /sbin/vfio-pci-override-vga.sh. It contains:
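The listing did not survive in this copy; the following is a sketch consistent with the description below (non-boot VGA detection via the boot_vga attribute, driver_override, and a function-1 audio sibling). The function parameter is not in the original, it only makes the sketch easy to exercise:

```shell
#!/bin/sh
# Make vfio-pci the exclusive driver for every non-boot VGA device and its
# companion audio function, then load vfio-pci.
override_nonboot_vga() {
    sysroot=${1:-/sys}    # parameter exists only so the function is testable
    for i in $(find "$sysroot"/devices/pci* -name boot_vga 2>/dev/null); do
        if [ "$(cat "$i")" -eq 0 ]; then           # 0 = not the boot VGA device
            gpu=$(dirname "$i")
            audio=$(echo "$gpu" | sed -e 's/0$/1/') # function .0 -> .1
            echo vfio-pci > "$gpu/driver_override"
            if [ -d "$audio" ]; then
                echo vfio-pci > "$audio/driver_override"
            fi
        fi
    done
}

override_nonboot_vga /sys
modprobe -i vfio-pci || true  # -i ignores the install rule, preventing a loop
```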

This script will find every non-boot VGA device in the system and use the driver_override feature introduced in kernel v3.16 to make vfio-pci the exclusive driver for that device. If there's a companion audio device at function 1, it also gets a driver override. We then modprobe the vfio-pci module, which will automatically bind to the devices we've specified. Don't forget to make the script executable with chmod 755. Now, in place of the options line in our modprobe.d file, we use the following:

install vfio-pci /sbin/vfio-pci-override-vga.sh

So we specify that to install the vfio-pci module, we run the script we just wrote, which sets up our driver overrides and then loads the module with modprobe -i, ignoring the install rule to prevent a loop. Finally, we need to tell dracut to include this script in the initramfs, so in addition to the add_drivers+= that we added above, add the following to /etc/dracut.conf.d/local.conf:
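The added line was lost from this copy; given the note further down about find and dirname being required in the initramfs, it was along the lines of:

```
install_items+="/sbin/vfio-pci-override-vga.sh /usr/bin/find /usr/bin/dirname"
```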

Note that the additional utilities required were found using lsinitrd and iteratively added to make the script work. Regenerate the initramfs with dracut again and you should now have all the non-boot VGA devices and their companion audio functions bound to vfio-pci after reboot. The primary graphics should load with the native host driver normally. This method should work for any kernel version, and I think I'm going to switch my setup to use it since I wrote it up here.

Obviously a simpler script can be used to pick specific devices. Here's an example that achieves the same result on my system:
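The example listing was lost here; the following is a sketch of the same idea with device addresses hard-coded. The addresses below are placeholders, substitute the ones from your own lspci output:

```shell
#!/bin/sh
# Write driver_override directly for known device addresses, then load
# vfio-pci. No find or dirname needed, hence the note below about the initramfs.
bind_override() {
    echo vfio-pci > "$1/driver_override"
}

for dev in 0000:01:00.0 0000:01:00.1; do       # placeholder GPU + audio addresses
    path="/sys/bus/pci/devices/$dev"
    [ -d "$path" ] && bind_override "$path"
done
modprobe -i vfio-pci || true  # -i ignores the install rule, preventing a loop
```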

(In this case the find and dirname binaries don't need to be included in the initramfs)

A couple of other bonuses for v4.1 and newer kernels: by binding devices statically to vfio-pci, they will be placed into a low power state when not in use. Before you get your hopes too high, this generally only saves a few watts and does not stop the fan. v4.1 users with exclusively OVMF guests can also add an "options vfio-pci disable_vga=1" line to their modprobe.d, which will cause vfio-pci to opt devices out of VGA arbitration if possible. This prevents VGA arbitration from interfering with host devices, even in configurations like mine with multiple assigned GPUs.

If you're in the unfortunate situation of needing to use legacy VGA BIOS support for your assigned graphics cards and you have Intel host graphics using the i915 driver, this is also the point where you need to patch your host kernel for the i915 VGA arbitration fix. Don't forget that to enable this patch you also need to pass the enable_hd_vgaarb=1 option to the i915 driver. This is typically done via a modprobe.d options entry as discussed above.

At this point your system should be ready to use. The IOMMU is enabled, the IOMMU groups have been verified, the VGA and audio functions for assignment have been bound to either vfio-pci or pci-stub for later use by libvirt, and we've enabled proper VGA arbitration support in the i915 driver if needed. In the next part we'll actually install a VM, and maybe even attach a GPU to it. Stay tuned.
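Pulled together, the two modprobe.d entries mentioned in this closing section would look like this in /etc/modprobe.d/local.conf (the first only on v4.1+ with exclusively OVMF guests, the second only with the patched i915):

```
options vfio-pci disable_vga=1
options i915 enable_hd_vgaarb=1
```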

47 comments:

Very nice how-to's. I've read them all so far and I'm now trying to actually make it happen :) So far so good, but I'm a bit lost after the pci-stub part. I added pci-stub.ids to the grub config and the cards are nicely bound to the pci-stub driver. After that you mention stuff about kernel 4.1, which I currently don't have. But I can't figure out from your story if the pci-stub part is enough for kernel <4.1 or if something still needs to be done in order to work in the next part of your how-to.

Yeah, I sort of meandered around there. The point of any of the early binding is to prevent native host drivers from claiming the devices, generally because graphics drivers don't have good support yet for dynamically unbinding when you want to start up the VM. pci-stub is a sufficient workaround for this. libvirt will unbind the device from pci-stub and bind it to vfio-pci when you start up the VM. If you don't have 4.1+, the difference between binding to pci-stub or vfio-pci is largely academic. Once you do have 4.1+, vfio-pci will put unused devices into a low power state, which may save you a couple watts, but the functionality is largely the same.

First of all thanks for taking the time to write this all up. Many of the similar articles I've seen are... incomplete at best.

However, I'm a little unclear on which parts not to do if not using kernel 4.1+. I do not wish to bind more than the one specific card to vfio or stub, so I left the specific addresses for the card I wish to pass through on pci-stub.ids=. Since you said vfio-pci supports the ids option in 4.1+, I left it on pci-stub.

Does one still need to add rd.driver.pre=vfio-pci to grub and create one of the scripts in this case? Or should I be doing something else? As is, I cannot launch the VM due to errors about the operation not being permitted, not being able to get the group, and device initialization failing.

I assume this is a result of not having given the device to vfio-pci instead of stub, but I'm not exactly sure how to.

If the devices you want to assign have unique IDs and your distro builds pci-stub directly into the kernel, then the pci-stub.ids= option on the kernel commandline is just fine. Permission denied errors can be a result of using <qemu:arg> options in your libvirt xml, lack of support for interrupt remapping, or platform breakage with reserved memory regions. We'd need to see the error and dmesg to know which it is.

I have not manually changed anything in the xml file; I created it much as described in your part 4, with the exception of having left it BIOS instead of UEFI. If there are specific logs or files I should add information from, please let me know; I've only recently started experimenting with virtualization and most of my linux experience is quite dated.

dmesg likely provides the solution, does it say something about using allow_unsafe_interrupts? If so, try "options vfio_iommu_type1 allow_unsafe_interrupts=1" in modprobe.d. You'll need to at least manually unload the module or reboot to have it loaded with the correct option. This would also mean that your hardware isn't protecting you from possible MSI attacks from the guest. If you trust the guest, not an issue.

No, the only thing 4.1 brings is the "ids" option to vfio-pci, but just like pci-stub, that doesn't help when you have devices with the same IDs, split between host and guest. The driver_override support is in any reasonably new kernel (3.16+).

I have installed the edk2.git-aarch64 and edk2.git-ovmf-x64 packages from the https://www.kraxel.org/repos/firmware.repo repository. I have also installed the Virtualization Preview repository.

The virt-manager on my machine does not appear the same as the version in Alex's examples:

1. After pressing the "Finish" button on the "Create Virtual Machine" window I do not get a window showing the overview of the installation. Instead the overview window begins the install process.
2. I am unable to change the firmware setting.
3. In the processor configuration I do not have the option of host-passthrough.

My system is running Fedora 22, virt-manager is 1.2.1, libvirt 1.2.17, and qemu is 2.4. My system is up to date with the latest versions in the Virtualization Preview repository. What version are you running?

As noted in part 4, you must select the customize before install box to get to the advanced configuration. There it should be possible to change the firmware. Also as noted in part 4, host-passthrough can be typed into the selection window, it is not a pre-defined selection.

I just want to report that I have successfully got nearly the same configuration working on a Debian GNU/Linux 8.0 system. I've used an ASUS P8Z77-V Deluxe mainboard, Core i7 3770T, ASUS Radeon R5 230 (marked to support UEFI on the official site), and a Zotac nVidia GTX 760 4Gb (not marked to support UEFI, but it looks like it supports it). After any guest starts, Gnome Shell (which runs on IGD) loses all its effects and animations. I suppose that it is an issue with the VGA arbiter.

I did not use any side repositories. The only thing I had to do was pull in the testing and unstable repositories for a few newer packages.

If anybody is interested, here are the software versions I've used:

1. Linux kernel 4.0.0 from the testing repository
2. OVMF from the unstable repository
3. libvirt 1.2.9 from the stable repository
4. qemu-kvm 2.1 from the stable repository

I had to manually write the OVMF arguments for qemu since virt-manager does not support it (did not check whether libvirt does).

Additionally I noticed that several games on a Linux guest (Borderlands 2, Fahrenheit) have issues with sound. I tried to enable MSI for the audio device, but it did not help. I am wondering why MSI for the sound card is not enabled by default; it looks like there are no issues with it.

I used the script you've provided in your post (vfio-pci-override-vga.sh); it works, but only if I'm using the open source radeon drivers. If I use the drivers from AMD's site then kvm says the resource is busy. I ran "lspci -vnn" and it says the card is claimed by "vfio", yet kvm says it's busy. I also ran "dmesg" and it says something along the lines of "vfio@4.00.0 vs fglrx@4.00.0". I don't want to use the open source drivers because they don't fully support my cards, and also I can't run a 144hz monitor with them. So what can I do to resolve this?

I don't understand what you're doing, do you have two AMD cards, one of which you want to assign to the guest and the other to be used by the host? And fglrx is claiming some resources of the guest card even when claimed by vfio-pci, preventing use in KVM? That sounds like an fglrx problem, complain to AMD.

Thanks for putting this out there. I'm trying to assign a DVB-T card to a VM, but am really struggling with it. I thought it would work when I read your post and added this to my grub command line: iommu=pt intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 pci-stub.ids=14f1:8802,14f1:8800

I understand the problem is that I have multiple PCI devices in group 11:

# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/2/devices/0000:00:16.0
/sys/kernel/iommu_groups/2/devices/0000:00:16.3
/sys/kernel/iommu_groups/3/devices/0000:00:19.0
/sys/kernel/iommu_groups/4/devices/0000:00:1a.0
/sys/kernel/iommu_groups/5/devices/0000:00:1b.0
/sys/kernel/iommu_groups/6/devices/0000:00:1c.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.4
/sys/kernel/iommu_groups/8/devices/0000:00:1c.6
/sys/kernel/iommu_groups/9/devices/0000:00:1c.7
/sys/kernel/iommu_groups/10/devices/0000:00:1d.0
/sys/kernel/iommu_groups/11/devices/0000:00:1e.0
/sys/kernel/iommu_groups/11/devices/0000:05:00.0
/sys/kernel/iommu_groups/11/devices/0000:05:00.1
/sys/kernel/iommu_groups/11/devices/0000:05:02.0
/sys/kernel/iommu_groups/11/devices/0000:05:02.2
/sys/kernel/iommu_groups/12/devices/0000:00:1f.0
/sys/kernel/iommu_groups/12/devices/0000:00:1f.2
/sys/kernel/iommu_groups/12/devices/0000:00:1f.3

My DVB-T card is 0000:05:02.0 and 0000:05:02.2.

Unfortunately I still can't get the VM to start and get this error when booting it, so obviously what I added to the kernel command line didn't work:

I have set my system up as per the vfio-pci-override-vga.sh script method and it pretty much works flawlessly on a clearos (centos) 7 system. I compiled qemu, libvirt and virt-manager from source as the clearos packages are too old for the setup to work this way. The only issue I have is if I try to run a second gpu for another windows machine (two separate machines running with two separate cards). The machine with the card in the second pcie slot will boot with a white screen instead of the efi boot screen, function for a little bit (usually with white lines and artifacts), and then lock up the entire host machine.

The cards and slots are fine as I can take either one out and run them from either slot and everything performs as expected. I have also installed windows on the host and was able to run both cards as a dual monitor setup.

Hi, I hope someone is still reading the comments ^^ I had a setup like this running on Fedora 21 but I switched to Debian 8 lately and nothing works. From what I figured out, pci-stub is configured as a module and loaded way too late when I look at my boot log (after modules like e1000e and xhci_hcd already did their thing). I blacklisted radeon so my GPU can be claimed by pci-stub, but the pcie-usb controller, ethernet and gpu hdmi audio can't be claimed at this point. I think I could blacklist e1000e too, but not xhci_hcd because there are multiple devices.

My mouse and keyboard are connected to a kvm switch which is connected to onboard usb and some pcie x1 card I assigned to my virtual machine, so I can easily switch mouse and keyboard between host and virtual system.

Any idea what I could do to solve this without changing my system? I think recompiling the kernel would solve it, but I'd like to keep this system as simple as possible.

Having a similar issue with module load order/priority and devices being grabbed by the wrong module. Haven't figured it out completely, but I added 'vfio-pci' to the top of the list in /etc/initramfs-tools/modules and it solved 99% of my problems. I think the issue is that the kernel parameter Alex notes (rd.driver.pre=vfio-pci) does not work in Debian based distros... I use Ubuntu, so I'm guessing you've tried this also and failed?

I keep receiving this error whenever I try to boot two guests at once. If A is started first, then it pops up an "already claimed" error for the GPU belonging to guest domain A when I attempt to start B. Likewise in reverse for A on guest domain B's GPU. These are two separate discrete GPUs.

Error starting domain: Requested operation is not valid: PCI device 0000:01:00.0 is in use by driver QEMU, domain A

Thanks for the tutorial. I am trying to follow multiple sources to set up a GPU passthrough on Fedora 23 with kernel 4.3.5, where the host uses the Intel IGP (Z170+i6-6600) and the guest, Windows 10, uses an AMD 7950.

However, after everything is set (IOMMU OK, vfio-bind done as checked using lspci -nnk) and UEFI installed for QEMU, whenever I attached the two PCI devices (graphics + audio), the KVM simply froze, the CPU usage was constant, and the screen was blank (no video output). I have spent a few days on it but really can't figure it out; much appreciated if you wouldn't mind pointing me to some possible fixes.

One point I think may be the problem is the IOMMU group: I saw it to be group 1, which included 3 devices: graphics, audio, and the PCI bridge. Is there anything I should do for the bridge? Or should I actually change the card from slot 1 to slot 2?

I just built a nice X99 system for this very purpose, but I made the mistake of getting two identical GPUs. You said, "This script will find every non-boot VGA device in the system, use the driver_override feature introduced in kernel v3.16, and make vfio-pci the exclusive driver for that device."

The problem is that when I check:

find /sys/devices/pci* -name boot_vga

it indicates that both GPUs are marked as boot_vga. I've scoured the internet and can't find a way to set the 2nd gpu to NOT be boot_vga. Do I set the contents of /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/boot_vga to 1 (it currently has the single character of zero in it)?

Wait - the primary GPU has a 1 in the boot_vga file, so I think that the 2nd GPU is actually NOT set to boot_vga. I'm going to push forward with your instructions and just see if there actually isn't a problem.

I was able to get up to the pci-stub piece. I have verified that my HDMI audio and video are the only items in an IOMMU group. When I attempt to move them from pci-stub to vfio, the HDMI audio will bind, but the HDMI video will not.

I want to add that to get the fix for multiple devices with same IDs working on Ubuntu, do everything as described except for the part where you modify the dracut config to copy the /sbin/vfio-pci-override-vga.sh script. Ubuntu uses initramfs-tools. Specifically, add a file to /etc/initramfs-tools/hooks/vfio-pci-override-vga with the following contents:
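The hook contents were not preserved in this copy. A sketch of a typical initramfs-tools hook that copies the override script and the binaries it calls into the initramfs (copy_exec comes from initramfs-tools' hook-functions; the DESTDIR guard is only so the sketch is inert outside an actual initramfs build):

```shell
#!/bin/sh
# initramfs-tools hook: include vfio-pci-override-vga.sh and its dependencies.
PREREQ=""
prereqs() { echo "$PREREQ"; }

if [ "${1:-}" = prereqs ]; then
    prereqs
elif [ -n "${DESTDIR:-}" ] && [ -r /usr/share/initramfs-tools/hook-functions ]; then
    # DESTDIR and hook-functions are provided by mkinitramfs at build time.
    . /usr/share/initramfs-tools/hook-functions
    copy_exec /sbin/vfio-pci-override-vga.sh /sbin
    copy_exec /usr/bin/find /usr/bin
    copy_exec /usr/bin/dirname /usr/bin
fi
```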

>Another couple other bonuses for v4.1 and newer kernels is that by binding devices statically to vfio-pci, they will be placed into a low power state when not in use. Before you get your hopes too high, this generally only saves a few watts and does not stop the fan.

Maybe this is common knowledge by now, but using libvirt 2.0.0 on Linux 4.6 with a VFIO configured graphics card, the fan may turn off completely if the graphics card has a "Zero RPM" mode. In my case I'm using an EVGA GTX 970 SSC, and the RPM control seems to work as if native on Windows.