Computers, often from a low-level systems perspective. Note that I speak for myself, not my employer.

Thursday, October 13, 2005

Linux NMIs on Intel 64-bit Hardware

Why are NMIs cool?

If you are running x86_64 Linux 2.6.x, grep for "NMI" in /proc/interrupts. This line exports a running tally of "non-maskable interrupts" on each CPU since system boot. Just what are these NMI thingies? What is Linux doing with them?

In the x86, "non-maskable interrupts" differ from regular old IRQs not so much in their maskability (they're pretty much maskable, just not by the same methods typically used for IRQs), but in their source (they are signalled to the CPU via a different line than IRQs) and semantics.

The architectural purpose for NMIs is to serve as a sort of "meta-interrupt": they're interrupts that can interrupt interrupt handlers. This may sound ridiculous initially, but for a kernel developer, judicious use of NMIs makes it possible to port some of the luxuries of user-level development to the kernel. Consider, e.g., profiling. User-level apps typically use SIGPROF, which in turn is driven by the kernel's timer interrupt handler. But what if you're a kernel developer concerned with the performance of the timer interrupt handler itself?

NMIs provide one solution; by setting up periodic NMIs, and gathering execution samples in the NMI handler, you can peer into the performance of kernel critical sections that run with disabled interrupts. We've used this technique to good effect to study the performance of VMware's virtual machine monitor. The oprofile system-wide profiler on Linux leverages the same technique.

Another important application for NMIs is best-effort deadlock detection; an NMI "watchdog" that runs periodically and looks for signs of forward progress (e.g., those counts of interrupts in /proc/interrupts rolling forward) has a decent chance of detecting most "hard" kernel hangs. Nine times out of ten, an NMI handler that detects a wedged system can't do much of use for the user. The system will crash, and often do so just as hard as if there were no NMI handler present; however, perhaps it will dump some sort of kernel core file that can be recovered after the inevitable reboot to aid kernel engineers in diagnosing the problem post-mortem. Even something as simple as pretty-printing a register dump and stack trace to the system console provides a world of improvement in debuggability over a mute, locked-up box.
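The forward-progress check described above can be sketched as a toy model. This is not the kernel's code; the class name `ProgressWatchdog`, the `tick` method, and the `max_stalled_ticks` threshold are all made up for illustration. The idea is just that each watchdog NMI compares a CPU's interrupt count against the value seen last time, and complains only after the count has sat still for several consecutive checks.

```python
class ProgressWatchdog:
    """Toy model of a forward-progress check run from a periodic NMI.

    Hypothetical illustration only; names and threshold are assumptions.
    """

    def __init__(self, num_cpus, max_stalled_ticks=5):
        self.last_counts = [0] * num_cpus      # interrupt count seen at the last check
        self.stalled_ticks = [0] * num_cpus    # consecutive checks with no progress
        self.max_stalled_ticks = max_stalled_ticks

    def tick(self, cpu, irq_count):
        """Run one check for `cpu`; return True if the CPU looks hung."""
        if irq_count != self.last_counts[cpu]:
            # The count moved, so the CPU is still servicing interrupts.
            self.last_counts[cpu] = irq_count
            self.stalled_ticks[cpu] = 0
            return False
        # No progress since the last check: count the stall.
        self.stalled_ticks[cpu] += 1
        return self.stalled_ticks[cpu] >= self.max_stalled_ticks
```

Requiring several stalled checks in a row, rather than one, keeps a briefly quiet CPU from being mistaken for a dead one. That is the "best-effort" part: the check can only say that nothing has visibly moved for a while.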

It's this last application that gets Linux excited. On x86_64, the Linux kernel defaults to building with an NMI watchdog enabled. If you cat /proc/interrupts on a 32-bit x86 system, you'll see the NMI line with a total of zero (unless you've compiled your own kernel with NMIs enabled). So, if NMIs are so nifty, why do we use them for x86_64, and not plain old i386? Good question. I'm not sure why the two architectures are treated differently. Perhaps because x86_64 is a bit younger, and the Linux kernel folks are more concerned with being able to debug hangs? Or perhaps there are architecture-specific differences in other parts of the kernel that make the watchdog less appealing for i386. I don't know.

Too much of a good thing?

So, let's get back to that NMI line in your /proc/interrupts file. If you tap your fingers for a few seconds between inspections of this file, you'll notice the NMI total increasing. However, the rate at which it increases will be dependent on your underlying hardware. If you're running linux-x86_64 on AMD hardware, you'll notice those NMIs ticking up at about 1Hz. This is convenient for the intended purpose; once a second is plenty frequent to check for something as (hopefully) rare as a hard system lock-up.
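If you'd rather not tap your fingers, the experiment can be automated. The following is a minimal sketch, assuming the NMI line in /proc/interrupts starts with the label "NMI:" followed by one count per CPU; the function names are mine, not anything the kernel provides.

```python
import time


def parse_nmi_counts(text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at the first non-numeric column; some kernels
                # append a textual description after the counts.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    raise ValueError("no NMI line found")


def nmi_rates(before, after, seconds):
    """Per-CPU NMI rate in Hz between two snapshots of the counts."""
    return [(b - a) / seconds for a, b in zip(before, after)]


def sample_nmi_rates(interval=5.0, path="/proc/interrupts"):
    """Read the NMI counts twice, `interval` seconds apart, and return rates."""
    with open(path) as f:
        before = parse_nmi_counts(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = parse_nmi_counts(f.read())
    return nmi_rates(before, after, time.monotonic() - start)
```

Calling `sample_nmi_rates()` on a Linux box returns one rate per CPU, which you can compare against the timer interrupt rate on the same machine.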

Now, try the same experiment with an Intel EM64T machine. You'll notice that the NMIs are coming in much, much faster. If you do the math, you'll find they're coming in at 1000Hz, exactly the same rate as the timer interrupts. What gives? And why does Linux want 1000 times more of them on EM64T hardware than on AMD64 hardware?

The answers are buried in nmi.c:nmi_watchdog_default; for AMD64, the kernel uses on-chip performance counters as a source of NMIs, while for all other CPUs (namely, EM64T parts), it uses the timer interrupt. After an initial calibration phase, Linux throttles back the AMD NMIs to a rate of 1Hz. On Intel hardware, however, some unusual jiggery-pokery takes place in the legacy PIC and local APICs, so that the very same timer interrupt signal trickles into the kernel via two different routes: once as a normal interrupt, via the IOAPIC and the local APIC's intr pin, and again via the LINT0 line into each local APIC as an NMI. Since the signal generating the NMI is the timer signal, there's little Linux can do but run the NMI interrupt at the same frequency as the timer interrupt.

This arrangement presents a couple of problems. From the point of view of NMI consumers like the aforementioned oprofile, this partially subverts the purpose of NMIs in the first place; by heavily correlating the NMI handler with the running of a particular chunk of kernel code (namely, the plain-jane timer interrupt handler), the distribution of kernel samples can be skewed. This could badly impact the effectiveness of profiling applications (the profile samples would tend to hit near the same place).
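To make the skew concrete, here is a toy simulation. Everything in it is an assumption chosen for illustration: the 1000-unit tick, the three made-up code regions and their sizes, and the 997-unit period for the "independent" sampler. It models a sampler that fires at a fixed phase of every timer tick against one whose period is unrelated to the tick.

```python
from collections import Counter

PERIOD = 1000  # length of one timer tick, in arbitrary time units


def code_at(t):
    """Which (made-up) code region is running at time t within each tick."""
    phase = t % PERIOD
    if phase < 50:
        return "timer_handler"
    if phase < 400:
        return "irqs_off_section"
    return "everything_else"


def profile(sample_period, num_samples, offset=0):
    """Histogram of the regions hit by a sampler firing every sample_period units."""
    return Counter(
        code_at(offset + k * sample_period) for k in range(num_samples)
    )


# Sampler locked to the timer tick: always lands at the same phase.
locked = profile(sample_period=PERIOD, num_samples=1000, offset=10)

# Sampler with a period coprime to the tick: sweeps across every phase.
independent = profile(sample_period=997, num_samples=1000)
```

In this model the locked sampler lands in the same region on every single sample, while the independent one hits each region in proportion to the time spent there. Real hardware is messier, but the failure mode has the same shape.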

There are also performance consequences to this use of the hardware. 64-bit Linux on Intel hardware performs worse than it has to. How much worse? Let's assume a typical P4 needs 1000 cycles at minimum to take an NMI, and execute an IRET instruction to return from it. Then, of course, the software presumably has some work to do, taking at least another 2000 cycles. (Yes, I'm pulling these figures from thin air, but I consider them lower bounds, given that the data and code for the NMI handler are most likely cool in the cache.) So, we've used up 3000 cycles 1000 times every second; on your 3GHz modern processor, that's about 0.1% of the processor's performance dedicated to checking for deadlocks. That figure might not sound damning, but when you consider the blood that kernel folks sweat trying to wring fractional percentages out of a single path, 0.1% shaved right off the top, independent of Amdahl's Law, for just the price of a recompile, is an absolute dream.
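The back-of-the-envelope arithmetic, spelled out as a sketch. The cycle counts are the same thin-air figures as above, and the function name `nmi_overhead` is just for illustration.

```python
def nmi_overhead(cycles_per_nmi, nmi_hz, cpu_hz):
    """Fraction of one CPU's cycles consumed by NMI handling."""
    return cycles_per_nmi * nmi_hz / cpu_hz


# Guessed lower bounds from the text: 1000 cycles to take the NMI and
# IRET back out of it, plus 2000 cycles of handler work.
CYCLES_PER_NMI = 1000 + 2000
CPU_HZ = 3_000_000_000  # a 3GHz processor

# NMIs driven by the 1000Hz timer signal, as on EM64T.
timer_driven = nmi_overhead(CYCLES_PER_NMI, 1000, CPU_HZ)

# NMIs throttled back to 1Hz, as on AMD64.
throttled = nmi_overhead(CYCLES_PER_NMI, 1, CPU_HZ)

print(f"timer-driven: {timer_driven:.4%}, throttled: {throttled:.4%}")
```

With these inputs the timer-driven case works out to 3,000,000 of every 3,000,000,000 cycles, i.e. 0.1%, and the 1Hz case is a thousand times smaller.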

Where do I come in? Well, this NMI overhead is even more pronounced when running atop VMware's virtual machine monitor. The probe effect of NMIs is magnified inside a virtual machine, since we typically must emulate the vectoring of the NMI through the virtual IDT in software. But, what's worse, in SMP VMs, the hardware path Linux is using to deliver NMIs introduces bottlenecks. The PIC and LINT0 line, which Linux uses to deliver NMIs on Intel hardware, are (and are constrained by the architecture to be) system-wide global entities, shared by all virtual CPUs in the VM; manipulating them 1000 times a second induces lots of synchronization-related overheads. (And no, armchair lock granularity second-guessers, there's not just a big "LINT0 lock" we're taking over and over again; it's a cute little lock-free algorithm, but at the end of the day, you can only get so cute before you do more harm than good.)

Update: Linux 2.6.12 fixed the NMI-storm-on-EM64T misbehavior chronicled here. Unfortunately, few distributions have picked up such a shiny, new kernel, so the weirdness documented above still affects the majority of users.