Bugfixing the in-kernel megaraid_sas driver, from crash to patch

Today we bring you a technical writeup for a bug that one of our sysadmins, Michael Chapman, found a little while ago. This was causing KVM hosts to mysteriously keel over and die, obviously causing an outage for all VM guests running on the system. The bug was eventually traced to the megaraid_sas driver and the patch has made it to the kernel as of version 3.3.

As you can imagine, not losing a big stack of customer VMs at a time, possibly at any hour of the day, is a pretty exciting prospect. This will be a very tech-heavy post but if you’ve ever gone digging into kernelspace (as a coder, or someone on the ops side of the fence) we hope it’ll pique your interest. We’ll talk about the diagnostic process and introduce some of the new tools that made this possible.

Here we see Michael analysing a rare and dangerous kernel bug

Difficulties faced

The exact circumstances of what we saw weren’t terribly interesting; suffice it to say that you’d lose a whole machine and not have much useful logging to work with. Our VM infrastructure is mostly RHEL 5 and 6, running on Dell R510 hardware.

A handful of KVM hosts were affected while others were rock-solid. Differential diagnosis was frustrated by the fact that you can’t play around with live machines, that the fault couldn’t be triggered at-will, and that we don’t have a huge install base of hosts in which to find patterns. In addition, pretty much every dimension for analysis that we could think of had working and non-working cases.

As an example, we looked at the chassis, drives, RAID cards and kernel version, amongst other things. The problem might manifest on servers A+B+C, all with the same kernel version, but not on server D which has an older kernel. “Aha!” you might think, but then you’d find that server E is also crashing and has the same kernel as server D.

Just for the record, we’re running Red Hat Enterprise Linux 5 and 6 systems here. The tools we’re using are generic, but the usage details may be specific to RHEL.

Having a peek at the crashed system using kdump

Figuring out even the general circumstances of the crashing was not easy. All our Dell servers have out-of-band management cards (DRACs) that let you view the console, but these aren’t always helpful. Often the screen will be blank (default console-blanking in Linux), but even if you plan ahead and disable that, there’s only so much you can do with a crashed system from the console. You need something more.

Enter kdump. kdump is similar to other utilities like diskdump and netdump that let you grab a core from a system and push it to disk or across the network. They work, but they have some prerequisites relating to your storage and networking drivers that can be problematic. kdump dodges these issues by taking another approach. (If you’d prefer the canonical lowdown on kdump, head over to the kernel’s git repo and check out their docs.)

One feature that’s been around for a little while now is kexec. We’ll gloss over the details, but kexec is an interesting hack that lets you reboot into a new kernel by jumping straight into execution of the new binary. It’s a bit of a cheat and there are many ways you could come adrift in the process (who knows what state your devices are in?), but it’s the perfect answer for simpler tasks. kdump leverages kexec to execute some code that will grab a copy of the system state and leave it somewhere safe.

kdump starts by setting up what’s called a “crash kernel”. This is a special, really skinny kernel and initrd with just enough smarts to dump a core for you, then reboot. You boot the system as normal, passing the crashkernel parameter in the kernel line from GRUB. This sets aside a chunk of memory (about 256MiB) that nothing else can touch, and then loads the crash kernel into that space.

When the system crashes (panics), instead of turning into a blazing fireball it uses kexec to jump into the crash kernel. The crash kernel runs makedumpfile to capture a pristine copy of the panicked system and dump it to disk – we set aside an LVM volume for this purpose. Once it’s done, the system reboots as normal, and all your VMs come back online.

The first kernel has already guaranteed that the crash kernel won’t be touched by blocking out that memory, so it’s good to go at the drop of a hat. In turn, because we used kexec we can jump into the crash kernel knowing that all the crashed state is perfectly intact. The crash kernel and associated initrd have enough smarts to load the drivers necessary to mount a filesystem on LVM and write the dump, but they can also push the dump to a remote networked system if you’d prefer.

Our systems are big. Like, “128GiB of RAM” big. It takes a long time to write that much data even to fast RAID arrays, so we pass arguments to makedumpfile (check the manpage) to tell it to not bother with zero pages, cache pages, etc. Remember that any time spent running kdump is time that the VMs are down.
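To get a feel for why the filtering matters, here is a rough sketch. The bit values for the page types are the ones we understand the makedumpfile manpage to list for its -d (dump level) option; the 200 MiB/s write speed and the "10% of pages worth keeping" figure are made-up numbers for illustration, not measurements from our hosts.

```python
# Page types that makedumpfile's -d option can exclude, combined as a bitmask
# (values as we understand them from the makedumpfile manpage).
ZERO_PAGES = 1
CACHE_PAGES = 2
CACHE_PRIVATE = 4
USER_DATA = 8
FREE_PAGES = 16


def dump_level(*excluded):
    """Combine the page types to exclude into a single -d value."""
    level = 0
    for flag in excluded:
        level |= flag
    return level


def dump_seconds(ram_gib, kept_fraction, write_mib_per_s):
    """Rough time to write the dump, ignoring compression and filtering cost."""
    return ram_gib * 1024 * kept_fraction / write_mib_per_s


level = dump_level(ZERO_PAGES, CACHE_PAGES, CACHE_PRIVATE, USER_DATA, FREE_PAGES)
print("makedumpfile -d", level)
# ASSUMPTION: 200 MiB/s sustained writes, 10% of pages kept after filtering.
print("unfiltered seconds:", dump_seconds(128, 1.0, 200))
print("filtered seconds:  ", dump_seconds(128, 0.1, 200))
```

Whatever the real numbers are on your hardware, the shape of the result is the same: the less you keep, the sooner the host reboots and the VMs come back.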

Inspecting the crash site

Brilliant, now you’ve actually got something to look at. This sure beats rebooting the system and hoping it doesn’t happen again. But now where to?

This is where you fire up crash (an aptly but inconveniently named tool). crash takes the debugging power of gdb and adds smarts to make it kernel-aware and more suited to getting this sort of work done. It has its limitations and quirks (seriously, guys, readline isn’t that hard), but it does the job.

That’s well and good, but we don’t know what we’re looking for yet. Using a debug-build kernel we’d managed to extract the eventual cause of the panic: corruption in some kernel data structures. The kernel’s slab allocator manages a series of caches, all of which are visible if you inspect /proc/slabinfo on your system. We were seeing corruption specifically in the buffer_head cache.

Buffer heads are 112-byte structures (the “objsize” heading in slabinfo) carrying metadata pertaining to I/O blocks. The slab allocator packs 36 buffer heads into each slab (“objperslab”), and all the slabs of buffer heads together form the “buffer head cache”. Structurally, a cache is really just a doubly-linked list of slabs.

Using crash we inspected the dumps of crashed systems. What we saw indicated that the panic was occurring due to corruption in the linkage pointers between slabs. The corruption was apparently contained to the buffer_head cache, but in the slab structures themselves, not the buffer heads contained within the slabs.

Cross-referencing the kernel source, we found that this sort of corruption is sanity-checked for, but only when the linked list is actually traversed. The corruption could occur at any time but wouldn’t bring down the system until something walked the buffer_head cache. For extra fun, if the corruption were very precise, the panic could be delayed until something traversed the list backwards.
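The delayed-detection behaviour is easy to model. The sketch below is a toy in Python, not the kernel’s code: the Slab class, make_ring and walk are hypothetical names, and the "sanity check" is reduced to noticing that a pointer no longer refers to a slab. It shows how a stray write over a single back-pointer goes unnoticed by a forward walk and only blows up on a backward one.

```python
class Slab:
    """Toy stand-in for a slab: only the list linkage matters here."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.prev = None


def make_ring(names):
    """Build a circular doubly-linked list of slabs."""
    slabs = [Slab(n) for n in names]
    for i, s in enumerate(slabs):
        s.next = slabs[(i + 1) % len(slabs)]
        s.prev = slabs[i - 1]
    return slabs


def walk(start, forwards=True):
    """Follow next (or prev) pointers once around the ring.

    Raises ValueError at the first pointer that is not a Slab.
    """
    seen = [start.name]
    node = start
    while True:
        nxt = node.next if forwards else node.prev
        if not isinstance(nxt, Slab):
            raise ValueError(f"corrupt link at {node.name}")
        if nxt is start:
            return seen
        seen.append(nxt.name)
        node = nxt


ring = make_ring(["a", "b", "c", "d"])
ring[2].prev = 0x14  # simulate a stray write over one back-pointer

print(walk(ring[0]))  # the forward walk never reads .prev, so it completes
try:
    walk(ring[0], forwards=False)
except ValueError as err:
    print("backward walk:", err)
```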

This is a good starting point. Using crash we could see the exact nature of the corruption and where in memory it was occurring. Assuming that perhaps a pointer was being misused, we searched the memory space for anything else pointing at the corrupted area but found nothing. Remembering that there could be an arbitrary amount of time between corruption and panic, this isn’t too surprising.

Caught in the act

Without any fingerprints from the culprit we had to resort to using a live system. Luckily for us, we have SystemTap. SystemTap is a relatively new tool with some similarities to Solaris’ well-regarded DTrace, letting you hook into a running system for analysis and diagnosis. SystemTap scripts resemble C code, and are compiled into dynamically-loaded kernel modules, sidestepping the need to prepare a system ahead of time for debugging (like recompiling your kernel with debugging flags enabled).

Now knowing what we were looking for we began setting up SystemTap hooks on chunks of kernelspace. This included hooks on the slab allocator dealing with the buffer head cache, checksumming functions to attempt to catch corruption when it happened, and a function to effectively perform a fsck on the buffer head cache itself.
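The checksumming idea, stripped of all the kernel detail, looks something like the sketch below. It is a Python model rather than a SystemTap script: memory is a plain bytearray, the watched region and the two "events" are invented for the example, and snapshot and first_corrupting_event are hypothetical helper names. The point is only the technique: checksum the regions that should not change, re-check after every interesting event, and the first mismatch narrows down when the corruption happened.

```python
import zlib


def snapshot(memory, regions):
    """CRC32 of each (start, length) region we expect to stay unchanged."""
    return {r: zlib.crc32(memory[r[0]:r[0] + r[1]]) for r in regions}


def first_corrupting_event(memory, regions, events):
    """Run events in order and return the name of the first one after which
    a watched region's checksum changed, or None if nothing changed."""
    baseline = snapshot(memory, regions)
    for name, action in events:
        action(memory)
        if snapshot(memory, regions) != baseline:
            return name
    return None


def harmless_write(mem):
    mem[0:8] = b"\xff" * 8  # touches memory outside the watched region


def stray_write(mem):
    mem[1024:1044] = bytes(20)  # 20 zero bytes land on the watched region


memory = bytearray(4096)
memory[1024:1040] = b"slab-list-links!"  # pretend this is slab linkage
regions = [(1024, 16)]
events = [("normal I/O", harmless_write), ("simulated STP ioctl", stray_write)]

print(first_corrupting_event(memory, regions, events))
```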

Make no mistake, this is slow work. It’d taken about a full week’s worth of work hours to get to this point, and the cause of the corruption still wasn’t clear. Through a lot of tracing and source-diving we were starting to point the finger at the RAID driver. This isn’t an uncommon line of reasoning – crashes and other “bad events” often line up with “lots of stuff happening” on the system, but the correlation just wasn’t so clear-cut this time.

This did give us an interesting angle to look at. Going back to the differential diagnosis, we realised there was more of a pattern to the symptoms than we’d first thought. The conditions were, however, very specific.

Summarising the findings:

The crash is due to corruption in the kernel slab allocator

The corruption is only evident when there is a large number of buffer_head objects in memory, such as during high I/O (though the corruption may be occurring regardless)

The corruption appears to be triggered by MegaCli, the megaraid_sas driver, or the PERC device itself, when MegaCli invokes an STP (SATA Tunnelling Protocol) command on a SATA device in the chassis

A SATA device? But we only use SAS drives in these servers. Well, that’s not entirely true…

Newer servers are using CacheCade, a new performance feature that we wrote about several months ago. The SSDs used for CacheCade are SATA devices – this was a “lightbulb moment”. 🙂

So now we’ve identified MegaCli as the culprit, but why is this happening? It turns out that monitoring is to blame. We poll every system with an LSI Megaraid card once an hour to check that the RAID arrays are healthy, which uses the MegaCli tool to query the card for its status. When that happens, MegaCli performs an STP command – that’s our trigger for corruption!

Falling down the MegaCli rabbit-hole

We immediately killed the Megaraid monitoring on all servers. We might miss an array failure, but we can limp through those; another panic-crash would be unacceptable now. We needed to figure out what was going on, and why.

MegaCli isn’t distributed as a source package; you just get a binary blob. We can’t trace exactly what’s happening when the STP command is sent, but we can spot the ioctl calls with strace and pull them apart using SystemTap.

By the end of the day we’d managed to figure out a few more things:

The problem occurs on a variety of RHEL kernels: 2.6.32-131.12.1.el6, 2.6.32-131.4.1.el6, 2.6.18-274.7.1.el5

The problem occurs with a variety of megaraid_sas drivers: 5.34-rc1, 5.38-rh1, 5.40

Only MegaCli 8.01.06 has been shown to invoke this command. MegaCli 8.02.16 does not appear to do so

This last point is the key – finally we’d established the exact conditions under which corruption could be triggered, but it needed testing to be certain.

A note on the diagnosis: even if the CacheCade difference had occurred to us earlier, it would’ve been masked by the way system updates are handled. Due to their sensitive nature, KVM hosts receive a more conservative update schedule. Hosts that happened to have received the updated MegaCli package would be protected from the issue, but for no obvious reason – userspace components aren’t generally expected to cause issues like this.

Before we continue we’ll have a quick word from our sponsor, the SATA Tunnelling Protocol (STP). STP is used to support SATA devices attached to a SAS fabric. It’s basically an encapsulation layer that makes a SATA device look and behave like a SAS device; it’s necessary because SATA devices don’t support all the features that SAS takes for granted.

Reproducing the corruption

A fix is no good unless you can be confident that it works. Science FTW! To reproduce the problem we devised a method to give a high probability of corruption and a crash occurring, based on the knowledge we had.

Buffer heads seem to be the victim, so let’s have lots of them. We did this by performing a simple dd transfer from /dev/zero to a spare LVM volume, creating millions and millions of buffer_head slabs in the cache.
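A quick back-of-the-envelope sketch shows how quickly the numbers add up. The 112-byte object size and 36 objects per slab come from the slabinfo figures mentioned earlier; the one-buffer-head-per-4KiB-block ratio and the 16GiB transfer size are assumptions for illustration, and the result is an upper bound since the kernel is free to reclaim buffer heads under memory pressure.

```python
OBJ_SIZE = 112       # bytes per buffer_head ("objsize" in slabinfo)
OBJS_PER_SLAB = 36   # "objperslab" in slabinfo
BLOCK_SIZE = 4096    # ASSUMPTION: one buffer_head per 4 KiB block written


def buffer_head_footprint(bytes_written):
    """Return (buffer heads, slabs, bytes of buffer_head objects) for a transfer,
    assuming every block written keeps its buffer_head in memory."""
    heads = bytes_written // BLOCK_SIZE
    slabs = -(-heads // OBJS_PER_SLAB)  # integer ceiling division
    return heads, slabs, heads * OBJ_SIZE


# ASSUMPTION: a 16 GiB dd from /dev/zero; pick whatever your spare volume allows.
heads, slabs, mem = buffer_head_footprint(16 * 2**30)
print(heads, "buffer heads in", slabs, "slabs,", mem // 2**20, "MiB of objects")
```

More slabs in the cache means a bigger target for a stray write, and a longer list for the traversal to trip over.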

Then to round it all off we forced a traversal of the buffer_head cache – boom! Just as planned.

Fast-forward a few hours, we verified that the newer version of MegaCli doesn’t cause the corruption, then proceeded to build v8.02.16 packages for all our systems so we could safely get the monitoring back online.

Source-diving

The solution is so close, you can taste it! Poring over the driver code, we did manage to nail down precisely what was happening. What follows is mostly a copy of our internal notes; they get down and dirty with the driver and explain exactly what was happening.

The commands sent through from MegaCli contain, amongst other things, a “frame” and an IO vector (a scatter/gather list). The “frame” can have one of a number of different formats (they all have the same header, though); one of the formats is for STP commands.

The IO vector tells the megaraid_sas driver where in the userspace address space data should be read from (to send to the device) or written to (when received from the device). The driver allocates corresponding DMA memory chunks for each entry in the IO vector, and determines what kernel addresses correspond to those chunks. It also handles the copy_from_user/copy_to_user stuff to get the data from userspace into these kernelspace chunks and back again. All good so far.

The DMA addresses for these chunks are all 32-bit. The “command” sent to the device itself consists of the “frame” taken from userspace, along with a *new* IO vector with the DMA addresses.

One complication, however, is that the IO vector in the command can have one of three formats: 32-bit addresses, 64-bit addresses, and “IEEE” (as far as I can tell, that’s IEEE 1212.1), which is a variant of 64-bit addresses.

The driver knows how to interpret the command’s IO vector through some flags in the frame. These flags are sent through from userspace without any interpretation or adjustment by the driver.

Here is where things break down: for the STP commands only, MegaCli appears to turn on the IEEE flag. The driver, however, always fills out the command’s IO vector with 32-bit addresses, i.e. an array of:

NB: “Skinny” appears to be a codename for a particular megaraid model. I don’t know what the difference between “__attribute__ ((packed))” and “__packed” is.

For the first entry in the STP IO vector, length == 20 == 0x14. The device therefore sees a 64-bit DMA address of 0x1400000000 + phys_addr instead, and clobbers the wrong memory.

To give an example: say the kernel allocated the DMA address 0x91dfc000. The corresponding kernel virtual address is 0xffff880091dfc000 (the kernel simply maps physical memory one-to-one from 0xffff880000000000). The device ends up writing to DMA address 0x1491dfc000, which has the kernel virtual address 0xffff881491dfc000. The first 20 bytes of whatever was at that page have just been erroneously overwritten.
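The arithmetic above can be checked with a few lines of Python. This is a sketch under assumptions: the entry layout (a 32-bit address followed immediately by a 32-bit length, little-endian) is inferred from the numbers in the example rather than copied from the driver, and sge32 and address_seen_as_64bit are hypothetical helper names.

```python
import struct

# Start of the kernel's one-to-one mapping of physical memory, per the example.
PAGE_OFFSET = 0xFFFF880000000000


def sge32(phys_addr, length):
    """Pack one 32-bit scatter/gather entry: 32-bit address, then 32-bit length
    (ASSUMED layout, little-endian as on x86-64)."""
    return struct.pack("<II", phys_addr, length)


def address_seen_as_64bit(entry):
    """What a reader expecting a 64-bit address at offset 0 would see."""
    (addr,) = struct.unpack("<Q", entry)
    return addr


entry = sge32(0x91DFC000, 20)         # what the driver wrote
bogus = address_seen_as_64bit(entry)  # what the device read

print(hex(bogus))                 # 0x1491dfc000
print(hex(PAGE_OFFSET + bogus))   # 0xffff881491dfc000
```

Note that 0x1400000000 is exactly 80GiB, so on a host with 128GiB of RAM the bogus address still refers to real memory.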

The really lazy sysadmin’s version:

MegaCli prepares some memory to receive results from the card

MegaCli says to the card “Please do something and then write the results back to THIS memory address that I just set up and zeroed-out for you”

MegaCli specifies the address in one format, but sets a flag indicating that it’s in another

The card does as it’s told, reads the flag to interpret the address format, then writes the results to the wrong location in memory. There’s no protection against this because we’re in kernelspace

MegaCli doesn’t notice/care because it was probably expecting zeroes in the result anyway

Assuming the memory address was in use, some poor sucker just got zeroes splattered over their slab headers

Making the patch

So, what’s the long-term solution? For now we’re using a MegaCli binary that appears not to invoke any STP commands, but it’d be even better if those STP commands weren’t corrupting memory.

A simple patch to the megaraid_sas driver can ensure that the correct flags are sent to the device:

MegaCli is still the ultimate cause, passing along data structures with the wrong type-flags set, but the driver shouldn’t be passing opaque structures along blindly either. At least one of these issues is actually easily fixable.
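The general idea of not trusting those flags can be sketched as follows. This is a toy model in Python and not the patch itself: the flag names and bit values are placeholders chosen for illustration rather than taken from the driver’s header, and sanitise_flags is a hypothetical helper. It clears the bits that describe the IO vector format, so the flags handed to the device always match the 32-bit entries the driver actually wrote, while leaving unrelated bits alone.

```python
# Illustrative bit values only; NOT taken from the driver's header.
FLAG_SGL64 = 0x02     # "IO vector uses 64-bit addresses"
FLAG_IEEE = 0x20      # "IO vector uses the IEEE format"
FLAG_DIR_READ = 0x10  # an unrelated flag that must be preserved

SGL_FORMAT_BITS = FLAG_SGL64 | FLAG_IEEE


def sanitise_flags(user_flags):
    """Clear the format bits so they describe 32-bit entries, keep the rest."""
    return user_flags & ~SGL_FORMAT_BITS


flags_from_userspace = FLAG_DIR_READ | FLAG_IEEE
print(hex(sanitise_flags(flags_from_userspace)))  # 0x10
```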

Other consequences of the bug

We suspect, but can’t confirm, that this caused filesystem corruption in one of the VM guests. It’s particularly insidious because it’s very hard to draw a correlation between issues on the host and issues in a guest.

It’s worth pointing out that while this was only observed corrupting the buffer_head cache, the bug really has the potential to cause a multitude of other problems. Specifically, various offsets provided from userspace are used by the driver without any checks. If these offsets are maliciously chosen, the driver can be induced to write to arbitrary kernel memory. Not an easy attack if you want something more precise than a DoS, but you have to work for your supper.

Conclusion

That was a really nasty one and a half weeks that we had to deal with KVM hosts crashing at random. Michael’s diagnosis and solution were a monumental piece of work, comprehensively tearing everything apart to definitively identify the root cause. It’s safe to say that this wouldn’t have been solvable in any reasonable way without tools like kdump, crash and SystemTap.

We hope you’ve enjoyed the write-up; there’s a lot to digest. If you’ve got any questions or something doesn’t make sense, feel free to leave a comment or drop us a mail and we’ll do our best to elucidate. Likewise, any general feedback is also appreciated.

2 Comments

We use the stock 5.40 megaraid_sas driver + MegaCli 8.02.16, but we also found one mysterious hang without any evidence in the logs + one partition with filesystem corruption on one of our 12 identical boxes. All servers have an uptime of ~100 days, without any problems in the logs.

This problem occurred while no MegaCli command was running. Because it seems that you are some sort of specialists in megaraid_sas + RH, I would like to ask you if you have found any other problem which might cause such a server hang?

Not really specialists with Megaraid as such, it’s more about following your nose to find possible leads and investigate them until there’s nowhere left to go. Our focus for the article was mostly on tools that are available to do this sort of work. In your case you’d use everything you’ve got to try and get some logging (things like sysstat, dmesg, etc.) to find hints as to where the problem may lie, then dig in that direction.

It’s very much a scientific approach: use evidence or your instincts to form a hypothesis as to where the problem lies, develop ways to test that, then go ahead and test. Based on the results, you refine the hypothesis and re-test, or try something else.

Unfortunately we haven’t had any experiences like the one you’ve described. Random hangs can be difficult to diagnose especially if you’re suspecting it’s something close to hardware, in which case you may find it useful to change various components and see if differential diagnosis can rule out possible causes. Good luck!