Tracking down a kernel bug with git bisect

After a recent upgrade of my Fedora 20 system to kernel 3.15.mumble, I
started running into a problem (BZ 1121345) with my Docker
containers. Operations such as su or runuser would fail with the
singularly unhelpful System error message:

Downgrading the kernel to 3.14 immediately resolved the problem,
suggesting that this was at least partly a kernel issue. This seemed
like a great opportunity to play with the git bisect command,
which uses a binary search to find which commit introduced a
particular problem.

Unfortunately, between the version I knew to work correctly (3.14) and
the version I knew to have a problem (3.15) there were close to 15,000
commits, which seemed like a large space to search by hand.

Fortunately, git bisect can be easily automated via git bisect run
subcommand, which after checking out a commit will run a script to
determine if the current commit is “good” or “bad”. So all I have to
do is write a script…that’s not so bad!

It actually ended up being somewhat tricky.

Testing kernels is hard

In order to test for this problem, I would need to use arbitrary
kernels generated during the git bisect operation to boot a system
functional enough to run docker, and then run docker and somehow
communicate the result of that test back to the build environment.

I started with the Fedora 20 cloud image, which is nice and
small but still the same platform as my laptop on which I was
experiencing the problem. I would need to correct a few things before
moving forward:

The Fedora cloud images (a) do not support password authentication and
(b) expect a datasource to be available to cloud-init (without
which you get errors on the console and potentially a delay waiting
for the login: prompt), so prior to using the image in this test I
made some changes by mounting it locally:

For this test I would be using the qemu-system-x86_64 command
directly, rather than working through libvirt (qemu has options
for convenient debugging with gdb, and is also able to access the
filesystem as the calling uid whereas libvirt is typically running
as another user).

I would need to perform an initial docker pull in the image, which
meant I was going to need a functioning network, so first I had to set
up a network environment for qemu.

Network configuration

I created a bridge interface named qemu0 to be used by qemu. I added
to /etc/sysconfig/network-scripts/ifcfg-qemu0 the following:

The what now? /dev/ssd/root is the root device for my host system,
but wasn’t anywhere in the kernel command line I used when booting
qemu. It turns out that this was embedded in the initrd image in
/etc/cmdline.d/90lvm.conf. After removing that file from the
image…

I bet you thought we were done!

Modern systems are heavily modular. Without access to a module tree
matching the kernel, I would be unable to successfully boot the
system, let alone use Docker. Looking at which modules were loaded
when I ran docker with the above image, I set up a custom kernel
configuration that would permit me to boot and run docker without
requiring any loadable modules. This would allow me to use the same
image for each kernel without needing to re-populate it with modules
each time I built a kernel.

Testing docker

The last step in this process is putting together something that tests
docker and exposes the result of that test to the build environment.
I added the following script to the image as /root/docker-test:

That mounts the build directory as a 9p filesystem on /mnt. This
allows us to write out test results to, e.g.,
/mnt/test_result/stdout and have that appear in the test_result
directory inside the kernel source.

This script is run at the end of the boot process via an entry in
/etc/rc.d/rc.local:

#!/bin/sh
sh /root/docker-test

Running the boot-kernel script without additional configuration will
cause the image to boot up, run the docker test, and then exit.

Running git-bisect

At this point we have just about everything we need to start running
git bisect. For the initial run, I’m going to use git tag v3.14
as the “known good” commit and v3.15 as the “known bad” commit, so
we start git bisect like this:

$ git bisect start v3.15 v3.14

Then we run git bisect run sh bisect-test, where bisect-test is
the following shell script:

As you can see (from the Merge: header), this is a merge commit, in
which an entire set of changes was joined into the master branch.
So while this commit is technically the first commit in which this
problem appears in the master branch…it is not actually the commit
that introduced the problem.

I was in luck, though, because looking at the history for the left
side of this branch (starting with f3411cb) showed a series of
patches to the audit subsystem:

The commit found by git bisect seems like a reasonable candidate;
it’s a patch against the audit subsystem and has something to do with
namespaces, which are central to Docker’s proper operation.

Debugging the problem

We can boot the kernel built from 33faba7 with the boot-kernel
script, adding the -s argument to qemu to start a gdbserver on
port 1234:

sh BOOT_ARGS=NO_DOCKER_TEST QEMU_ARGS="-s" boot-kernel

A caveat about attaching to qemu with gdb: qemu has a -S option
that will cause the virtual machine to halt at startup, such that
you can attach before it starts booting and – in theory – set
breakpoints in the early boot process. In practice this doesn’t
work well at all (possibly because the vm switches from 32- to
64-bit operation during the boot process, which makes gdb unhappy).
You’re better off attaching after the kernel has booted.

I know we’re getting the EPERM in response to sending audit
messages. Looking through the code in kernel/audit.c, the
audit_receive_msg seems like a reasonable place to start poking
about. At the beginning of audit_receive_msg, I see the following
code:

err = audit_netlink_ok(skb, msg_type);
if (err)
return err;

So let’s set a breakpoint there if audit_netlink_ok() returns an
error:

That it is, in fact, -EPERM, which suggests we’re on the right
track. Taking a closer look at audit_netlink_ok(), it’s obvious
that there are only three places where it can return -EPERM. I
tried setting some breakpoint in this function but they weren’t
working correctly, probably due to to optimizations performed when
compiling the kernel. So instead of gdb, in this step we just add a
bunch of pr_err() statements to print out debugging information on
the console:

It looks like we’ve found out where it’s failing! Of course, we’re
checking code right now that is several commits behind v3.15, so let’s
take a look the same function in the 3.15 release:

$ git checkout v3.15

Looking at audit_netlink_ok in kernel/audit.c, it looks as if that
initial check has changed:

/* Only support initial user namespace for now. */
/*
* We return ECONNREFUSED because it tricks userspace into thinking
* that audit was not configured into the kernel. Lots of users
* configure their PAM stack (because that's what the distro does)
* to reject login if unable to send messages to audit. If we return
* ECONNREFUSED the PAM stack thinks the kernel does not have audit
* configured in and will let login proceed. If we return EPERM
* userspace will reject all logins. This should be removed when we
* support non init namespaces!!
*/
if (current_user_ns() != &init_user_ns)
return -ECONNREFUSED;

So let’s insert our print statements into this version of the code and
see if we get the same behavior:

So it looks like the intial failure in audit_netlink_ok() was fixed,
but we’re stilling failing the CAP_AUDIT_WRITE check.

Summary

What’s going on here?

Prior to 33faba7, audit messages were only accepted in the main
network namespace. Inside other network namespaces, processes sending
audit messages would simply receive ECONNREFUSED. For example, this
is the result of using strace on that docker run command in a
pre-33faba7 kernel:

With 33faba7, audit messages are now accepted inside network
namespaces. This means that instead of simply getting ECONNREFUSED,
messages must pass the kernel capability check. I spoke with some of
the audit subsystem maintainers (including Richard Guy Briggs, the
author of this patch series), and the general consensus is that “if
you want to write audit messages you need CAP_AUDIT_WRITE”.

So while this patch did change the behavior of the kernel from the
perspective of container tools such as Docker, the fix needs to be in
the tool creating the namespaces.