The Kernel Column – The development of Linux Kernel 3.9

Jon Masters summarises the goings-on in the Linux kernel community as the 3.9 kernel was being prepared for release. Ongoing development brings with it security headaches, and kernel testing is improved by the Trinity ‘Fuzzer’

The Linux kernel is a very mature codebase with many millions of hours’ worth of developer time invested. There are several popular kernel test suites, including the LTP (Linux Test Project), as well as the proprietary tests run by various commercial Linux interests. Most test suites are written with the premise that they will test real-world scenarios, and so they are formed from small test cases that are run in sequence. Each test case (or unit test) will perform some sample workload and compare expected to actual results as a measure of success. What these test cases don’t typically cover well, however, are malicious or illegal sequences of system calls (operations). This is where ‘fuzzers’ like Trinity come into play.

Trinity was written by Dave Jones and has been under active development for several years. It is a ‘system call fuzzer’, meaning it will call random kernel system calls (the standard interface by which applications communicate with the kernel) according to a few simple rules. For example, those system calls expecting to receive a file descriptor will be given one (at random, pointing to almost anywhere), and those expecting to be given a length (for example, the number of bytes to read or write into a file) will be passed a range of interesting values intended to trigger ‘off by one’ bugs and the like wherein the kernel behaviour violates the intentions of the developers. Trinity is multithreaded and typically is left to run for many hours at a time. It frequently produces exciting bug reports on the kernel mailing list (often, but not always, from Dave himself) and has measurably improved the quality of the kernel code overall.
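To make the idea of ‘interesting values’ concrete, here is a minimal Python sketch of how a fuzzer might choose a length argument. It is not Trinity’s code: the function names, the choice of anchor values (zero, a 4096-byte page, and the 32-bit signed and unsigned limits) and the 75 per cent bias towards boundary values are all assumptions made for illustration.

```python
import random


def interesting_lengths(page_size=4096, word_bits=32):
    """Return lengths sitting on, just below and just above common boundaries.

    The anchors are illustrative assumptions, not values taken from Trinity.
    """
    anchors = [0, page_size, 2 ** (word_bits - 1), 2 ** word_bits]
    values = set()
    for anchor in anchors:
        # Off-by-one bugs tend to show up one step either side of a limit.
        for delta in (-1, 0, 1):
            values.add(anchor + delta)
    return sorted(values)


def pick_length(rng, bias=0.75):
    """Usually pick a boundary value, occasionally a uniformly random one."""
    if rng.random() < bias:
        return rng.choice(interesting_lengths())
    return rng.randrange(0, 2 ** 32)


if __name__ == "__main__":
    rng = random.Random(1234)  # fixed seed so a run can be repeated
    print([pick_length(rng) for _ in range(5)])
```

Seeding the random number generator matters in practice: a fuzzer that cannot replay the sequence that triggered a crash produces bug reports that are much harder to act upon.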

One of the more exciting things Trinity picked up on was a VFS deadlock caused by several dentries (directory entries) sharing the same directory inode under /proc/$PID/net/stat. Every process within a single network namespace will see the exact same entry for ‘stat’, right down to the inode number (visible with ‘ls -lid’), which makes it a directory hardlink. This directory hardlink under the /proc file system violates long-standing UNIX (and Linux) policy that directory hardlinks are forbidden (because they can result in cyclic directory tree structures). For those situations wherein directory links are required, soft or symbolic links are normally created. This is indeed the longer-term fix that has been proposed, though in the interim this particular problem is to be worked around by preventing multiple locks being held on the same directory inode. Trinity found the problem, Dave Jones diligently reported it, Al Viro tracked down the actual problem and longer-term fix, and Linus Torvalds implemented a workaround patch in time for 3.9.
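The ‘ls -lid’ comparison can also be done programmatically. The following Python sketch checks whether two paths are names for the same inode on the same device, which is what a hardlink amounts to. The helper name same_inode is hypothetical, and the demonstration uses ordinary files in a temporary directory because directory hardlinks cannot be created from user space.

```python
import os
import tempfile


def same_inode(path_a, path_b):
    """Return True if both paths name the same inode on the same device.

    os.lstat is used so that a symbolic link is judged by its own inode
    rather than by the inode of whatever it points to.
    """
    a = os.lstat(path_a)
    b = os.lstat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        hard = os.path.join(tmp, "hard")
        soft = os.path.join(tmp, "soft")
        with open(original, "w") as handle:
            handle.write("data")
        os.link(original, hard)      # second name for the same inode
        os.symlink(original, soft)   # separate inode that points at the first
        print(same_inode(original, hard))  # hardlink: same inode
        print(same_inode(original, soft))  # symlink: different inode
```

Passing the net/stat paths of two processes in the same network namespace to a helper like this would be one way to observe the sharing described above, though the result will depend on the kernel version in use.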

If you want to learn more about Trinity, visit its website or sign up to the new ‘trinity’ mailing list on vger.kernel.org.

Security exploits

A nasty security vulnerability was introduced in a recent kernel release, thanks to newly added support for new namespace creation by unprivileged users. Namespaces are a mechanism provided by the kernel wherein various resources such as a particular view of a file system or active network configuration can be shared among a group of processes (tasks). The namespace code is traditionally used for the implementation of the ‘chroot’ command; for example, allowing for a new program to be launched with a limited view of the file system in which its ‘/’ directory is actually a subdirectory of the real root. Traditional namespaces required special privileges to set up and use, with the flags passed to ‘clone’ (the internal system call used by the system C library when using the special ‘fork’ library function to create a new process) being used to control what was passed on to newly created subprocesses (children).

In the newly relaxed set of rules, it is possible for an unprivileged user to pass two mutually incompatible flags at new process creation: CLONE_NEWUSER and CLONE_FS. The former creates a new user namespace, while the latter specifies that the newly created process should share the special in-kernel file system tracking structure with its parent – that is, effectively, sharing the same file system. This somewhat obviates the point of creating a new namespace but allows a carefully crafted attack to be performed against the kernel. The exploit relies upon being able to set up a carefully crafted chroot in which the dynamic linker (used during early setup of almost every program) is replaced with a malicious binary inheriting the permissions of any program it will run. It is then used to Trojan the dynamic linker of the real system outside of the chroot by virtue of the fact it still has access to the file-system namespace of its parent. The fix is to prevent both CLONE_NEWUSER and CLONE_FS being specified together.
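The shape of the fix can be modelled in a few lines of Python. This is a sketch of the check only, not the kernel’s C source: the function name check_clone_flags is hypothetical, and the two flag values are the ones defined in the Linux headers. The point is that each flag remains valid on its own and only the combination is refused.

```python
import errno

# Flag values as defined in the Linux kernel headers.
CLONE_FS = 0x00000200
CLONE_NEWUSER = 0x10000000


def check_clone_flags(flags):
    """Model of the check: refuse CLONE_NEWUSER combined with CLONE_FS.

    Returns 0 if the flags are acceptable, or a negative errno value in
    the style of a kernel function if the forbidden pair is present.
    """
    forbidden = CLONE_NEWUSER | CLONE_FS
    if (flags & forbidden) == forbidden:
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    print(check_clone_flags(CLONE_NEWUSER))             # allowed
    print(check_clone_flags(CLONE_FS))                  # allowed
    print(check_clone_flags(CLONE_NEWUSER | CLONE_FS))  # rejected
```

Testing against the combined mask, rather than each bit separately, keeps the check to a single comparison and leaves any other flags in the argument unaffected.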