Arbitration and Translation, Part 3

This post is the third in a series. You can see the others here, Part 1 and Part 2.

What is an Arbiter?

In the NT PnP subsystem, an arbiter is an interface that a bus driver can expose which is able to intelligently assign PnP resources of a single specific type (memory, I/O ports, DMA channels, interrupts, bus numbers) to its children. In general, an arbiter cannot assign resources that it has not claimed from its parent.

The PnP manager itself exposes five arbiters, one for each type listed above. These arbiters are relatively dumb. They give out ranges of numbers, with the only criteria being these:

· Is this range free? If so, you can have it.

· If the range is already claimed, but with the shareable flag, and your claim is marked shareable, you can have it too.

· If any part of the range is already claimed as exclusive, you can’t have it.

These arbiters aren’t bus-specific, but they don’t have to be. They’re enough to get started.
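The three rules above reduce to a simple overlap test. Here's a sketch in C; this is a simplified user-mode model, not the real code (which is built on the kernel's RTL_RANGE_LIST routines), and all the names are mine:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a root arbiter's range test. Illustrative only. */
typedef struct {
    uint64_t start;
    uint64_t end;        /* inclusive */
    bool     shareable;
} RANGE;

/* Can a new claim [start, end] be granted against the existing claims? */
bool can_claim(const RANGE *claimed, size_t count,
               uint64_t start, uint64_t end, bool shareable)
{
    for (size_t i = 0; i < count; i++) {
        /* Ranges that don't overlap the request don't matter. */
        if (claimed[i].end < start || claimed[i].start > end)
            continue;
        /* Any overlap involving an exclusive claim (on either side)
         * is a conflict. Shareable-on-shareable is fine. */
        if (!claimed[i].shareable || !shareable)
            return false;
    }
    return true;   /* range is free, or every overlap is shareable */
}
```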

Yesterday, I covered the translator interface and what it does. The arbiter interface is similar. Both are about manipulating the resources for child devices and putting them in less domain-specific terms. The difference between a translator and an arbiter is simply that translator interfaces are sufficient when you cannot really change the resources available to a child device and arbiters are necessary when you can. Translators are, as you might expect, much simpler.

HALMPS

To illustrate the difference, I want to talk about HALMPS. This was a HAL that was shipped as part of Windows NT 3.5 through Windows Server 2003. It might have even shipped in Server 2008. I don’t remember when it got pulled from the tree.

It ran on machines that conformed to the Intel Multiprocessor Specification, versions 1.1 through 1.4. If you’re curious, you can find it here. That spec has been entirely obsoleted by ACPI. MPS was simple where ACPI is very complex. But MPS can’t describe a machine that changes configuration dynamically at run time while ACPI can. As it turns out, this adds a whole lot of complexity.

MPS describes a system in terms of, among other things, the number of local APICs (which deliver interrupts to processors) and I/O APICs (which collect interrupts from devices.) It says which pins on which I/O APICs each PCI device is connected to. This is actually encoded as device-function-IntPin, and HALMPS represents a device’s “IRQ” thusly. You can see this in Device Manager on a machine running HALMPS. The assigned IRQ is just these values all run together. This was very confusing to many people, as they might see two devices with the same IRQ in Device Manager, but that just meant that those two devices occupied the same slot on two different buses. They might have been sharing interrupts, or they might not.
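A sketch of that “run together” packing in C. The actual field widths HALMPS used aren’t given here, so the shifts below are illustrative assumptions, not the real layout:

```c
#include <stdint.h>

/* Hypothetical packing of device-function-IntPin into the single "IRQ"
 * number Device Manager displayed under HALMPS. Field widths here are
 * assumptions for illustration only. */
uint32_t make_display_irq(uint32_t device, uint32_t function, uint32_t int_pin)
{
    return (device << 8) | (function << 4) | int_pin;
}
```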

The important part of the story here is that the BIOS picked all interrupt-related routing and it was fixed forever at boot. There aren’t any decisions to make, except one. The OS gets to pick which of the processors get targeted by a specific I/O APIC input.

When we were gluing PnP onto the side of the NT driver model, during the development of Windows 2000, the existing scheme for choosing a target processor set for a device’s interrupts involved the driver calling HalGetInterruptVector. The target IDT entries, the processor set mask and the IRQL for the device all had to be chosen there. Furthermore, if two devices shared interrupts, they had to get the same answer, even if one driver was PnP-aware and one made this obsolete call. So I left the IRQ-to-IDT mapping code in the HAL.

If a PnP driver made a resource claim for an interrupt, then that claim would make its way toward the root of the PnP tree (see yesterday’s post: link) and it would reach an interrupt translator at the HAL device node. The HAL would see the device’s claim, do the math on how the interrupt was routed, including which I/O APIC and which pin on that I/O APIC, and then make an internal call to HalGetInterruptVector, which would choose a target processor set, an IRQL and a vector. The target processor set (actually the APIC cluster ID) was then encoded in the upper 24 bits of the “Vector” that the device was assigned and the IDT entry was encoded in the lower 8 bits. This was then presented to the root interrupt arbiter within the PnP manager, where it was claimed.
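The vector encoding described above, with the APIC cluster ID in the upper 24 bits and the IDT entry in the lower 8, can be sketched like this (the helper names are mine, not the HAL’s):

```c
#include <stdint.h>

/* HALMPS packed the assignment into one "vector": target APIC cluster
 * ID in the upper 24 bits, IDT entry in the lower 8. */
uint32_t pack_vector(uint32_t apic_cluster_id, uint8_t idt_entry)
{
    return (apic_cluster_id << 8) | idt_entry;
}

uint8_t  idt_entry_of(uint32_t vector) { return (uint8_t)(vector & 0xFF); }
uint32_t cluster_of(uint32_t vector)   { return vector >> 8; }
```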

Just for fun, I fired up a VM running HALMPS and dumped this out in the debugger. You can see the relevant parts here:

Now let’s contrast that with HALMACPI. This is the HAL that runs (to this day) on any machine that conforms to the ACPI spec and has more than one processor, which is nearly anything you can go out and buy.

The ACPI spec says a few things about interrupts:

· There is a discrete number of I/O APICs, and their base addresses are listed in ACPI tables.

· ISAPnP- or ACPI-enumerated devices are attached to I/O APIC inputs and those attachments are described in the ACPI namespace under each device. A device can be moved from one input to another by invoking the _SRS method under the device.

· PCI devices are either directly attached to I/O APIC inputs or they are attached to IRQ steering “link nodes” which themselves can be attached to one of a set of I/O APIC inputs. The set of possible attachments is described under the link node (which is itself sort of a device) in the ACPI namespace. The exact pin that they are attached to can be changed by invoking the link node’s _SRS method.

This is entirely different from HALMPS. Now we have a choice about how devices are routed, if the motherboard designer designs the board that way and if the BIOS guy exposes the functionality. If we want to move one or a group of PCI devices from one IRQ to another, we can. I put an interrupt arbiter in the ACPI driver, as that was where it was possible, or at least easy, to interact with all the various parts of the ACPI namespace.

An arbiter gets requests like: “Here’s a set of four devices, each of which has a fairly complex set of possible interrupt assignments. Please find the optimal configuration which satisfies all the requirements.” When a device needs I/O ports, memory ranges and interrupts, these requests get made by the PnP manager to each type of arbiter simultaneously. If a fit can be found, all the devices eventually get IRP_MN_START_DEVICE with a resource set that meets their needs.

Note that this problem is NP-complete. So we don’t look at every possible solution. There are a bunch of heuristics about which parts of the solution space to look at first and how long to spend looking.

In truth, the NT PnP team came to a fairly painful conclusion after a couple of years of tweaking these algorithms. (It was painful mostly because it took so long to fully understand the situation.) The first major truth is that you can no longer add a truly new bus architecture to a PC because Windows 95 (and now many other OSes) only understood PCI. At the point that a largely-deployed OS that did PnP natively existed, every machine had to expose the interfaces that that OS understood. Thus we have HyperTransport, PCI Express and lots of internal bus architectures that never got widely published, all of which pretend to be PCI at a PnP level so that they work with old OSes which do PnP natively.

The kicker is that all of those, particularly the chipset-internal ones, have deviations from the PCI spec. I’ve sat in meetings with chipset designers who said that their devices didn’t have to be PCI-compliant because they were inside of a chipset. From a hardware guy’s perspective, this makes perfect sense. It doesn’t have PCI pins, it doesn’t have any PCI logic, so it isn’t PCI. But, for various reasons, it does have a PCI configuration space. When I point out to them that there’s no way for the OS to differentiate between these “non-PCI PCI devices” and real PCI devices, they shrug and say that’s not their problem, since the BIOS sets it all up right anyhow.

And that’s the second major truth. The BIOS sets most or all of it up anyhow.

So the arbiter interface and NT PnP, in general, have a way of asking about how a device was configured by the BIOS. When a device is first discovered, the PnP manager sends IRP_MN_QUERY_RESOURCES. This IRP asks the question “what resources is this device using, right now?” The PCI driver will look at a device’s Base Address Registers and its Interrupt Line register and send that claim back in response. The PnP manager then calls into the relevant arbiters with the device’s PDO (or a proxy PDO if the driver is an NT4-style non-PnP driver) and claims those ranges unconditionally for the device, with a flag saying that this is a “boot reservation.” The ‘B’ in some lines of the debugger dump above marks these boot claims.

When the device stack for the device is being built, the PnP manager sends IRP_MN_QUERY_RESOURCE_REQUIREMENTS to ask “what is the set of all possible sets of resources this device could use?” And once the FDO and filters have been loaded, it sends IRP_MN_FILTER_RESOURCE_REQUIREMENTS to ask “what modifications would you like to make to this claim that the bus driver has generated on your behalf?”

The resulting claim set is sent to the arbiters. Now those arbiters know what resources the device booted with, if the device was present in the machine at boot time. So they, for the most part, just choose what the BIOS chose. This is what makes slippery chipsets work just fine. The BIOS is the expert and NT leaves that alone.

Some resource types don’t work this way. Most notably, there’s no notion of which IRQ a device was connected to at boot time if your machine is running with the APIC enabled. The BIOS only configures the IRQ routing for the PIC (not APIC) interrupt controller, in preparation for running Windows 98, which never supported APICs. So the ACPI IRQ arbiter, when running on an APIC system, throws away the boot claims.

Note that the boot claim system has some interesting properties. There may be conflicts, and sometimes that’s okay. BIOSes tend to make claims for ACPI-enumerated dummy devices like “Motherboard Resources” when there is a device which must claim some I/O ports but which mostly doesn’t ever get a driver loaded. The most famous example of this tends to be an SMBus controller. Most machines don’t run a driver on it, but the BIOS needs to access it in System Management Mode. So it will claim the ports. Sometimes, people write drivers for them, and then those drivers show up as a conflict with a boot claim. This is mostly benign.

Message-Signaled Interrupts

Interrupt arbitration tends to be the most complicated part of the system. Or, at least, it seems that way to me, since I’m still messing around with it almost fifteen years after I first began. Most of the other arbiters haven’t changed much in years beginning with a ‘2’.

Devices which can generate Message-Signaled Interrupts don’t need to use an I/O APIC input. But they can, usually, also use one, particularly if the OS in question doesn’t understand MSI. With MSI, the interrupt is sent by doing a short busmaster burst involving 32 bits of data to a special address. The device need not understand the address nor the data. It just gets told, when you want to trigger this interrupt, send this blob here.

The PCI Spec has taken two passes at defining how this should be configured in a device, both of which have proved insufficient for representing the problem at hand. “MSI” was introduced in PCI 2.2 and it involved writing a single address into the device, and a single data value too. If the device wanted to send more than one interrupt, it could vary N low-order bits of the data value, at the OS’s discretion. This meant that the data values were constrained to a naturally aligned range of values, and that range was a power of 2 in length. (See the PCI Spec for the scary details.)
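That constraint on PCI 2.2 MSI data values, a power-of-two-length and naturally aligned block of vectors, reduces to a simple check. A sketch, with hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Under PCI 2.2 MSI, the device varies the N low-order bits of a single
 * data value, so the OS must hand it a vector block that is a power of
 * two in length and naturally aligned to that length. */
bool valid_msi_block(uint32_t base, uint32_t count)
{
    bool power_of_two = count != 0 && (count & (count - 1)) == 0;
    bool aligned      = (base & (count - 1)) == 0;  /* natural alignment */
    return power_of_two && aligned;
}
```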

Given the way Intel defined the special address/data format in the Software Developer’s Manual, Volume 3, Chapter 8, Section 11 (http://www.intel.com/products/processor/manuals/), the address determines the target processor set. This means that MSI (as defined in PCI 2.2) can only work if every interrupt targets the same processor or set of processors. You can’t choose to send one interrupt message to one processor and one to another.
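As a rough sketch of that Intel-defined format: the message address carries the destination APIC ID and the message data carries the vector. This assumes fixed delivery mode and edge triggering; see the SDM for the full layout:

```c
#include <stdint.h>

/* Minimal sketch of the Intel MSI message layout (SDM Vol. 3): the
 * address selects the destination local APIC, the data carries the
 * vector. Fixed delivery mode, edge-triggered. */
#define MSI_ADDRESS_BASE 0xFEE00000u

uint32_t msi_address(uint8_t dest_apic_id)
{
    return MSI_ADDRESS_BASE | ((uint32_t)dest_apic_id << 12);
}

uint16_t msi_data(uint8_t vector)
{
    return vector;   /* delivery mode 000 (fixed), edge trigger */
}
```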

Thus MSI-X was defined in PCI 3.0. Both still exist, and they’ve been carried into PCI-X and PCI Express. MSI-X allows each interrupt message to have separate address and data values. It also allows as many as 2048 messages per PCI function.

Given that the processor-set-to-address mapping was fixed by Intel, virtualization and large numbers of cores are forcing another level of indirection through I/O MMUs, called “VT-d” by Intel and “IOMMU” by AMD.

The fundamental problem here is that the PCI spec never should have tried to define message-signaled interrupts at all. They just don’t have anything to do with the PCI bus. Every interesting thing about them is external to the PCI bus. (Full disclosure: I didn’t always understand this, and I sat on the committee that defined MSI-X.) The only thing the PCI spec allows you to do is to have a defined mechanism for telling the device to target a busmaster transaction to a specific address with specific data when the device needs attention.

There’s no standard mechanism for telling a PCI NIC to send your network data to a specific address, as that’s just part of the definition of the device behavior. You don’t want to standardize that because it removes degrees of freedom when you want to do it differently in the future. There shouldn’t be one for interrupts, either, on exactly the same grounds. I’ll quit ranting now.

What you really need is a way to say, for example, “my device needs to trigger 36 interrupts, two-per core in this 16-core machine, plus four more for various housekeeping tasks.” That’s not really expressible in the PCI capability structs which define MSI and MSI-X, but it is expressible inside of Windows.

Once the PnP manager has assigned IDT vectors, IRQLs, target processors and the lot, you need a way of programming these into the device. This is expressible in the PCI spec, though it’s redundant in my mind. Whether the bus driver does it or the function driver does it doesn’t matter much.

Mechanically, it works like this:

1. The PnP manager sends IRP_MN_QUERY_RESOURCE_REQUIREMENTS. The PCI driver reads the various capability structs and some registry keys that were set during INF processing (since, as we saw above, the PCI spec can’t express everything necessary) and responds to this IRP with some interrupt claims. Typically, there will be three possibilities expressed in the resultant IO Resource Requirements List: lots of message-signaled interrupts, one message-signaled interrupt and, lastly, one line-based interrupt.

2. The PnP manager builds the rest of the device stack and sends IRP_MN_FILTER_RESOURCE_REQUIREMENTS. If the device is trying really hard to squeeze out performance by targeting specific interrupts at specific processors, the FDO (usually NDIS or storport, along with the miniport) will “filter” that claim to affinitize certain interrupts to certain cores, and possibly to cut down the total number of messages in the first claim to some multiple of the number of cores actually installed.

3. The PnP manager passes these sets of claims to the interrupt arbiter in the ACPI driver, which looks at them and tries to satisfy them in the order that they’re listed. If there are enough free IDT entries (and the underlying processor and chipset support MSI at all) then the first claim gets satisfied. If not, it goes for the single message claim. If that can’t be satisfied, it will back off to the line-based interrupt, which is usually shared with something else and will almost certainly succeed.

4. The PnP manager translates these resources down to the bus terms. (See yesterday’s post.) This involves changing these vector and target processor sets into addresses and data again. These values end up in your interrupt resources in your raw resource list.

5. The PnP manager translates these “up” into processor-relative terms. This populates the translated resource list with Vector, Level and Affinity for each interrupt message.

6. The PnP manager sends IRP_MN_START_DEVICE with both lists. The PCI driver sees the IRP first (since the FDO handles start on the way up, remember) and programs the MSI or MSI-X capability structures, if they exist. The FDO sees the IRP next, and stores the information for calling IoConnectInterruptEx. It may use the raw resources to derive address and data values if it likes.
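The fallback behavior in steps 1 and 3 can be modeled with a toy arbiter. Everything here is a simplification with made-up names; the real arbiter weighs far more state:

```c
#include <stddef.h>

/* The claim list has alternatives in preference order: many MSI
 * messages, one MSI message, one line-based interrupt. The arbiter
 * tries each in turn. */
typedef enum { CLAIM_MSI_MANY, CLAIM_MSI_ONE, CLAIM_LINE } CLAIM_KIND;

typedef struct {
    CLAIM_KIND kind;
    unsigned   messages;    /* vectors requested, for MSI claims */
} CLAIM;

typedef struct {
    unsigned free_vectors;  /* free IDT entries */
    int      msi_capable;   /* processor + chipset support MSI */
} ARBITER;

/* Returns the index of the first alternative that can be satisfied, or
 * -1 if none can. A line-based claim is treated as always satisfiable
 * here, since it can be shared with something else. */
int arbitrate(const ARBITER *a, const CLAIM *alts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (alts[i].kind == CLAIM_LINE)
            return (int)i;
        if (a->msi_capable && alts[i].messages <= a->free_vectors)
            return (int)i;
    }
    return -1;
}
```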

ACPI IRQ Arbiter Dumps

The ACPI IRQ arbiter handles all this by considering a list of things simultaneously:

· Free IDT entries on all the potential cores.

· Free I/O APIC inputs for devices which have some flexibility.

· Whether MSI is available in the processor and the chipset.

· Whether the device has an MSI request.

Since that arbiter is looking across a couple of dimensions simultaneously, dumping it is a little more complicated. The default debugger command “!arbiter” will show you the IRQ claims. “!acpiirqarb” will show you the other state. I’ll walk through these dumps below.

This first dump is of the default arbiter in the PnP manager. It says that lots of vectors are reserved for internal use and lots of vectors are assigned to ACPI (across every core) for redistribution to other devices.

The large numbers for IRQs are placeholders for MSI assignments, which in this machine are all PCI Express root ports.

!acpiirqarb tells us about the other internal arbiter state, including IDT assignments on every core and state of the ACPI link nodes, which exist but aren’t used in APIC mode in this machine. It also details all the I/O APICs in the machine, including the metadata on all the inputs.

The “not on bus” claims are interesting. They’re the inverse of the IDT entries that got claimed above in the root arbiter. It means, essentially, that ACPI can’t give them out because it doesn’t own them.

In conclusion, arbitration is complicated and we keep adjusting it. Windows 7 actually added a little bit of knowledge about VT-d to interrupt arbitration so that we could easily go beyond 64 cores.

People have been asking us for years to document the interfaces so that non-Microsoft-employed driver writers could write their own arbiters. This would be most useful for “converged NICs” where a single PCI function exposes a bus driver which in turn exposes a NIC, an RDMA device, an iSCSI initiator and/or an FCoE HBA. These bus drivers jump through many hoops to do second-level interrupt dispatch for their children, which they wouldn’t have to do if they could write an interrupt arbiter.

It’s particularly difficult, though, to do interrupt arbitration in a distributed manner. I/O port or memory arbitration can be done locally on the bus related to the device. But interrupts are often run as side-band signals straight from one part of the motherboard to another. It’s difficult to prove that you can make this code work if it’s decentralized.

We wrote a simple bus driver that claims resources and doles them out for children. It’s called “MF.sys” and it works so long as the resources you need for one child are completely disjoint from the resources you need for another child. This tends not to be the case with converged NICs. Some register or some interrupt gets used for some shared purpose.

For now though, the best answer I can give is that all this information is mostly useful for debugging.