Linux 2.6.26

Summary: 2.6.26 adds support for read-only bind mounts, x86 PAT (Page Attribute Tables), PCI Express ASPM (Active State Power Management), ports of KVM to IA64, S390 and PPC, other KVM improvements including basic paravirtualization support, preliminary support for the future 802.11s wireless mesh standard, much improved webcam support thanks to a driver for UVC devices, a built-in memory tester, a kernel debugger, BDI statistics and parameters exposed in /sys/class/bdi, a new /proc/PID/mountinfo file for more accurate information about mounts, per-process securebits, a device whitelist for container users, support for the OLPC, some new drivers and many small improvements.

1. Important features (AKA: the cool stuff)

1.1. Read-only bind mounts

Linux has supported bind mounts since 2.4.0. A bind mount works somewhat like a directory symlink: it makes the contents of a directory available at a second path. For example, "mount --bind /foo /bar" makes the contents of /foo visible at /bar as well. In other words, /foo and /bar have the same content, and any modification made in one directory is visible in the other. This has been useful for things like chroots or FTP and web servers, but until now, if /foo was writable, there was no way to stop /bar from being writable too.

In Linux 2.6.26, you can make those bind mounts read-only. If the bind mount in the previous example were made read-only, the contents of /foo would show up in /bar, but an application trying to modify a file in /bar would not be able to do it (/foo could continue being writable, of course). This has a number of uses. It allows chroots to have parts of filesystems writable. It's useful for containers, because users may have root inside a container but should not be allowed to write to some filesystems. It enhances security by making sure that parts of your filesystem are read-only (such as when you don't trust your FTP server). It also helps when you don't want to have entire new filesystems mounted, or when you want atime selectively updated.

(The current implementation does not allow creating a read-only bind mount in one step: you need to make the bind mount first with "mount --bind /foo /bar" and then remount it read-only with "mount -o remount,ro /bar".)
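The two-step procedure can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the second command uses the "remount,bind,ro" form documented in current mount(8) man pages rather than the exact command quoted above, because newer mount tools need the "bind" option to restrict the remount to the bind mount itself. Actually running it requires root.

```python
import subprocess


def readonly_bind_commands(source, target):
    """Return the two mount(8) invocations for a read-only bind mount.

    Step 1 creates an ordinary (writable) bind mount.
    Step 2 remounts only that bind mount read-only.
    """
    return [
        ["mount", "--bind", source, target],
        # Form taken from the mount(8) man page, not from the text above.
        ["mount", "-o", "remount,bind,ro", source, target],
    ]


def make_readonly_bind(source, target, run=subprocess.run):
    """Run both steps in order; stop at the first failing command."""
    for command in readonly_bind_commands(source, target):
        run(command, check=True)
```

Calling make_readonly_bind("/foo", "/bar") as root should leave /foo writable while writes through /bar fail with a read-only filesystem error.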

1.5. x86 PAT support

PAT (Page Attribute Table) is a feature of x86 processors that allows memory attributes to be set at page-level granularity. PAT is complementary to the MTRR settings, which allow memory types to be set over physical address ranges. However, PAT is more flexible than MTRR because it sets attributes at page level and because there is no hardware limit on the number of such attribute settings. It's not a very new feature, and Linux support for it has been in the works for a long time: the current patches evolved from ones started in 2006, and there are traces of preliminary patches from 2001. It probably took this long because it's not a critical feature and MTRRs did the job.

1.6. Per-process securebits

Filesystem capability support makes it possible to do away with (set)uid-0 based privilege and use capabilities instead. That is, with filesystem support for capabilities but without this feature, it is (conceptually) possible to manage a system with capabilities alone and never need to obtain privilege via (set)uid-0. Of course, "conceptually possible" isn't quite the same as "currently possible", since few user applications, certainly not enough to run a viable system, are currently prepared to leverage capabilities to exercise privilege. Further, many applications exist that may never get upgraded in this way, and the kernel will continue to want to support their setuid-0 based privilege needs. Where pure-capability applications evolve and replace setuid-0 binaries, it is desirable that there be a mechanism by which they can contain their privilege. In addition to leveraging the per-process bounding and inheritable sets, this should include suppressing the privilege of the uid-0 superuser from the process's tree of children. The feature added in 2.6.26 can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits, which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel.
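As a rough illustration of what "per-process" means here, the sketch below reads the calling process's securebits through prctl() and decodes them. The constants (PR_GET_SECUREBITS = 27 and the bit positions) are taken from the kernel headers linux/prctl.h and linux/securebits.h, not from the text above, and the helper names are hypothetical. Reading the bits needs no privilege; changing them (not shown) is what requires CAP_SETPCAP.

```python
import ctypes

# Values from <linux/prctl.h> and <linux/securebits.h> (assumed, not from the text).
PR_GET_SECUREBITS = 27

SECUREBIT_NAMES = [
    ("SECURE_NOROOT", 0),                 # uid 0 gets no implicit capabilities
    ("SECURE_NOROOT_LOCKED", 1),
    ("SECURE_NO_SETUID_FIXUP", 2),        # setuid() calls do not adjust capabilities
    ("SECURE_NO_SETUID_FIXUP_LOCKED", 3),
    ("SECURE_KEEP_CAPS", 4),
    ("SECURE_KEEP_CAPS_LOCKED", 5),
]


def decode_securebits(value):
    """Return the names of the securebits flags set in an integer mask."""
    return [name for name, bit in SECUREBIT_NAMES if value & (1 << bit)]


def current_securebits():
    """Ask the kernel for the calling process's securebits (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    value = libc.prctl(PR_GET_SECUREBITS, 0, 0, 0, 0)
    if value < 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_SECUREBITS) failed")
    return value
```

On a typical system decode_securebits(current_securebits()) returns an empty list, meaning uid 0 still carries its traditional privilege in that process.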

1.7. KGDB

For many years Linux did not include a kernel debugger. Linus Torvalds vetoed them for years, for reasons that he explained quite well in a well-known email: "When things crash and you fsck and you didn't even get a clue about what went wrong, you get frustrated. Tough. There are two kinds of reactions to that: you start being careful, or you start whining about a kernel debugger [...] I happen to believe that not having a kernel debugger forces people to think about their problem on a different level than with a debugger. I think that without a debugger, you don't get into that mindset where you know how it behaves, and then you fix it from there. Without a debugger, you tend to think about problems another way. You want to understand things on a different _level_."

Despite those objections, many people wanted a debugger, and KGDB is finally going in. It's a remote debugger, so it needs two machines. x86 and sparc machines are supported.

1.8. Device whitelist on cgroups

This feature implements functionality wanted by some virtualization users: the ability to control access to devices on a per-container basis. A cgroup is used to track and enforce open and mknod restrictions on device files. More details can be found in the commit link.
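A whitelist entry is a short text rule written into a control file of the cgroup. The sketch below assumes the rule format described in the kernel's device-cgroup documentation (a type of "a", "b" or "c", then major:minor with "*" as a wildcard, then some of the access letters r, w and m for read, write and mknod) and a control file named devices.allow. Neither detail is stated in the text above, and the helper names and the cgroup path are hypothetical.

```python
import os


def device_rule(dev_type, major, minor, access):
    """Build one whitelist rule, e.g. 'c 1:3 mr'.

    dev_type: 'a' (all), 'b' (block) or 'c' (character)
    major, minor: device numbers, or '*' as a wildcard
    access: any combination of 'r' (read), 'w' (write), 'm' (mknod)
    """
    if dev_type not in ("a", "b", "c"):
        raise ValueError("device type must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a non-empty subset of 'rwm'")
    return f"{dev_type} {major}:{minor} {access}"


def allow_device(cgroup_dir, rule):
    """Write a rule to the cgroup's devices.allow file (assumed file name)."""
    with open(os.path.join(cgroup_dir, "devices.allow"), "w") as control:
        control.write(rule + "\n")
```

For instance, allow_device("/cgroups/1", device_rule("c", 1, 3, "mr")) would let tasks in that (hypothetical) cgroup read and mknod /dev/null, which is character device 1:3.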

1.9. Memtest

Memtest is a commonly used tool for checking your memory. In 2.6.26, Linux includes its own in-kernel memory tester. The goal is not to replace memtest (in fact, this tester is much simpler and less capable than memtest), but it's handy to have a built-in memory tester in every kernel. It's easily enabled with the "memtest" boot parameter.

1.10. Export BDI attributes in sysfs

Linux 2.6.24 merged per-device dirty thresholds: the limits that the kernel puts on the amount of memory that a process can "dirty" changed from being global to being per-device. 2.6.26 exposes an interface in /sys/class/bdi that allows several parameters to be set. There's another set of read-only parameters that are exposed in debugfs (debug/bdi/<bdi>/stats).
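A minimal sketch of reading those parameters follows. It assumes the attribute names read_ahead_kb, min_ratio and max_ratio from the kernel's sysfs-class-bdi ABI documentation (they are not listed in the text above) and one integer per file. The function name is hypothetical.

```python
from pathlib import Path

# Attribute names assumed from the kernel's sysfs-class-bdi documentation.
BDI_ATTRIBUTES = ("read_ahead_kb", "min_ratio", "max_ratio")


def read_bdi_parameters(bdi_root="/sys/class/bdi"):
    """Return {device: {attribute: int}} for every BDI under bdi_root."""
    result = {}
    for device_dir in sorted(Path(bdi_root).iterdir()):
        if not device_dir.is_dir():
            continue
        values = {}
        for name in BDI_ATTRIBUTES:
            attribute = device_dir / name
            if attribute.is_file():
                # Each attribute file holds a single decimal number.
                values[name] = int(attribute.read_text().strip())
        result[device_dir.name] = values
    return result
```

Setting a parameter is the reverse operation: writing a number to the same file as root, for example a larger read_ahead_kb for one device.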

1.11. /proc/pid/mountinfo

The work being done these days in the VFS, like per-process namespaces, is making some things obsolete, such as /proc/mounts (which is always a link to /proc/self/mounts). In its current form it lacks important information and suffers from some problems (see the code link). 2.6.26 introduces /proc/PID/mountinfo, which addresses these deficiencies. The information that can be found in these new files is explained in the commit links.
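To give an idea of what the new file contains, here is a small parser sketch. It assumes the line layout given in the kernel's proc filesystem documentation: mount ID, parent ID, major:minor, root, mount point, mount options, zero or more optional fields, a "-" separator, then filesystem type, mount source and superblock options. The function name and dictionary keys are hypothetical.

```python
def parse_mountinfo_line(line):
    """Split one /proc/PID/mountinfo line into named fields."""
    # The lone "-" separates per-mount fields from per-superblock fields.
    left, _, right = line.rstrip("\n").partition(" - ")
    head = left.split(" ")
    tail = right.split(" ", 2)
    major, minor = head[2].split(":")
    return {
        "mount_id": int(head[0]),
        "parent_id": int(head[1]),
        "device": (int(major), int(minor)),
        "root": head[3],
        "mount_point": head[4],
        "mount_options": head[5].split(","),
        "optional_fields": head[6:],  # e.g. ["master:1"], may be empty
        "fs_type": tail[0],
        "source": tail[1] if len(tail) > 1 else "",
        "super_options": tail[2].split(",") if len(tail) > 2 else [],
    }


def read_mountinfo(path="/proc/self/mountinfo"):
    """Parse every mount visible to the process behind the given file."""
    with open(path) as mountinfo:
        return [parse_mountinfo_line(line) for line in mountinfo if line.strip()]
```

Unlike /proc/mounts, each entry carries its own ID and its parent's ID, so the mount tree can be reconstructed, and the root field shows which subdirectory of a filesystem a bind mount exposes.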

1.12. Generic semaphores

Since the introduction of mutexes, semaphores are no longer performance-critical, so the architecture-specific (and often hand-coded in assembly) implementation, which was needed when semaphores were really important for performance, has been replaced by a generic one written in C for maintainability, debuggability and extensibility. It removes 7365 LoC.

fuse: support writable mmap (commit), implement perform_write. With fusexmp (a passthrough filesystem), large (1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times (256MB/s vs 71MB/s). But it's disabled by default. (commit), (commit)

3. Architecture-specific changes

x86

Lazy allocation of FPU struct: Only allocate the FPU area when the application actually uses the FPU, i.e., in the first lazy FPU trap. This saves memory for apps that don't use the FPU. For example, on a test system after boot there are around 300 processes, with only 17 using the FPU. (commit), (commit)

Hypervisor-assisted Dump: The goal of hypervisor-assisted dump is to enable the dump of a crashed system, to do so from a fully-reset system, and to minimize the total elapsed time until the system is back in production use. Compared to kdump or other strategies, hypervisor-assisted dump offers several strong, practical advantages; see more details in the links (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit), (commit)