Linux 2.6.25

Summary: 2.6.25 includes support for a new architecture (MN10300/AM33) and the widely used Orion SoCs, a new interface for more accurate measurement of process memory usage, a 'memory resource controller' for controlling the memory usage of groups of processes, realtime group scheduling, a tool for measuring high latencies called latencytop, ACPI thermal regulation, timer event notifications through file descriptors, an alternative MAC security framework called SMACK, an ext4 update, BRK and PIE-executable address space randomization, RCU preemption support, FIFO spinlocks in x86, EFI support in x86-64, a new network protocol called CAN, initial ATI r500 DRI/DRM support, the beginning of the end for tasks stuck in D state, improved device support and many other small improvements.

1. Important features (AKA: the cool stuff)

1.1. Memory Resource Controller

The memory resource controller is a cgroups-based feature. Cgroups, aka "Control Groups", is a feature that was merged in 2.6.24. Its purpose is to be a generic framework into which several "resource controllers" can plug to manage different resources of the system, such as process scheduling or memory allocation. It also offers a unified user interface, based on a virtual filesystem, where administrators can assign arbitrary resource constraints to a chosen group of tasks. For example, two resource controllers were merged in 2.6.24: Cpusets and Group Scheduling. The first binds CPUs and memory nodes to an arbitrarily chosen group of tasks (a cgroup), and the second binds a CPU bandwidth policy to the cgroup.

The memory resource controller isolates the memory behavior of a group of tasks -cgroup- from the rest of the system. It can be used to:

Isolate an application or a group of applications. Memory hungry applications can be isolated and limited to a smaller amount of memory.

Create a cgroup with a limited amount of memory. This can be used as a good alternative to booting with mem=XXXX.

Virtualization solutions can control the amount of memory they want to assign to a virtual machine instance.

A CD/DVD burner could control the amount of memory used by the rest of the system to ensure that burning does not fail due to lack of available memory.

The configuration interface, like that of all cgroup controllers, works by mounting the cgroup filesystem with the "-o memory" option, creating an arbitrarily named directory (the cgroup), adding tasks to the cgroup by writing their PIDs to the 'tasks' file inside the cgroup directory, and reading or writing the following files: 'memory.limit_in_bytes', 'memory.usage_in_bytes' (memory statistic for the cgroup), 'memory.stat' (more statistics: RSS, caches, inactive/active pages), 'memory.failcnt' (number of times that the cgroup exceeded the limit), and 'mem_control_type'. OOM conditions are also handled per cgroup: when the tasks in the cgroup surpass the limit, the OOM killer is invoked to kill one of the tasks belonging to that specific cgroup.
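As a rough sketch of the sequence just described, the Python helper below performs the directory creation and the two writes. The group name "burner" and the 64 MB limit are illustrative assumptions, not values from the release. On a real system the hierarchy would first be mounted as root (for example with something like `mount -t cgroup -o memory none /cgroups`, where /cgroups is an assumed mount point), and the PID is written to the file named 'tasks'. The dry run at the end uses a scratch directory, so it only exercises the file operations and does not involve the kernel.

```python
import os
import tempfile


def create_memory_cgroup(root, name, limit_bytes, pid):
    """Create cgroup `name` under `root`, cap its memory, and move `pid` into it.

    `root` is assumed to be a cgroup filesystem mounted with "-o memory".
    """
    group = os.path.join(root, name)
    # On a real cgroup mount, mkdir creates the group and its control files.
    os.makedirs(group, exist_ok=True)
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # Writing a PID to "tasks" moves that task into the group.
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))
    return group


def read_control_file(group, filename):
    """Return the stripped contents of one control file, e.g. memory.usage_in_bytes."""
    with open(os.path.join(group, filename)) as f:
        return f.read().strip()


# Dry run against a scratch directory (plain files, no kernel involvement).
with tempfile.TemporaryDirectory() as scratch:
    group = create_memory_cgroup(scratch, "burner", 64 * 1024 * 1024, os.getpid())
    print(read_control_file(group, "memory.limit_in_bytes"))  # prints 67108864
```

Pointing `root` at a real memory cgroup mount would make the same calls take effect, and `read_control_file(group, "memory.usage_in_bytes")` would then return live usage.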

1.2. Real Time Group scheduling

Group scheduling is a feature introduced in 2.6.24. It allows CPU time to be divided according to criteria other than nice levels. For example, given two users on a system, you may want to assign 50% of the CPU time to each one, regardless of how many processes each one is running (traditionally, if one user is running, for example, 10 CPU-bound processes and the other user only 1, the latter would be starved of CPU time). This is what the "group tasks by user id" configuration option of Group Scheduling does. You may also want to create arbitrary groups of tasks and give them CPU time privileges. This is what the "group tasks by Control Groups" option does, basing its configuration interface on cgroups (a feature introduced in 2.6.24 and described in the "Memory Resource Controller" section).

Those are the two working modes of Group Scheduling. Additionally, there are several types of tasks. What 2.6.25 adds to Group Scheduling is the ability to also handle real-time (SCHED_FIFO and SCHED_RR) processes. This makes it much easier to handle RT tasks and give them scheduling guarantees.

There's serious interest in running RT tasks on enterprise-class hardware, so a large number of enhancements to the RT scheduling class and load-balancer have been merged to provide optimum behaviour for RT tasks.
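The two-user example from the start of this section can be worked through numerically. The sketch below is only an arithmetic model, not scheduler code: it assumes every process is CPU-bound, that all processes and users have equal weight, and it uses the hypothetical user names "alice" and "bob".

```python
from fractions import Fraction


def per_process_share(procs_per_user, group_by_user):
    """Return {user: fraction of CPU each of that user's processes receives}."""
    if group_by_user:
        # CPU time is split evenly between users first, then between
        # each user's own processes.
        users = len(procs_per_user)
        return {u: Fraction(1, users) / n for u, n in procs_per_user.items()}
    # Without grouping, every process competes equally with every other one.
    total = sum(procs_per_user.values())
    return {u: Fraction(1, total) for u in procs_per_user}


def per_user_share(procs_per_user, group_by_user):
    """Return {user: total fraction of CPU that user receives}."""
    shares = per_process_share(procs_per_user, group_by_user)
    return {u: shares[u] * n for u, n in procs_per_user.items()}


load = {"alice": 10, "bob": 1}  # hypothetical users and process counts
print(per_user_share(load, group_by_user=False))  # alice 10/11, bob 1/11
print(per_user_share(load, group_by_user=True))   # alice 1/2, bob 1/2
```

Under this model, grouping by user lifts the single-process user from 1/11 of the CPU to 1/2, while each of the other user's ten processes drops from 1/11 to 1/20.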

1.3. RCU Preemption support

RCU is a very powerful locking scheme used in Linux to scale to very large numbers of CPUs on a single system. However, it wasn't well suited for low-latency, RT-ish workloads, and some parts could cause high latency. In 2.6.25, RCU can be preempted, eliminating that source of latencies and making Linux a bit more RT-ish.

1.4. FIFO ticket spinlocks in x86

In certain workloads, spinlocks can be unfair, i.e. a process spinning on a spinlock can be starved up to 1,000,000 times. Starvation in spinlocks was usually not considered a problem: it was thought that such a spinlock would become a performance problem before any starvation was noticed, but testing has shown the contrary. It's also always possible to find an obscure corner case that generates a lot of contention on some lock, and which processor grabs the lock next is essentially random.

With the new spinlocks, the processes grab the spinlock in FIFO order, ensuring fairness (and more importantly, guaranteeing to some point the

Spinlocks configured to run on machines with more than 255 CPUs will use a 32-bit value, and a 16-bit value when the number of CPUs is smaller. As a bonus, the maximum theoretical number of CPUs that spinlocks can support is raised to 65536.
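The FIFO behaviour can be illustrated with a small model of a ticket lock. This is a single-threaded sketch, not the kernel's x86 implementation: it assumes the lock word is split into two equally sized wrapping counters (so `bits=8` would correspond to the 16-bit case), and the names `next_ticket` and `now_serving` are illustrative.

```python
class TicketLock:
    """Single-threaded model of a FIFO ticket lock (no real atomics or spinning)."""

    def __init__(self, bits=8):
        self.mask = (1 << bits) - 1  # both counters wrap at 2**bits
        self.next_ticket = 0         # ticket handed to the next arriving CPU
        self.now_serving = 0         # ticket currently allowed to hold the lock

    def take_ticket(self):
        """Arrive at the lock and take a ticket (an atomic fetch-and-add in real code)."""
        ticket = self.next_ticket
        self.next_ticket = (self.next_ticket + 1) & self.mask
        return ticket

    def may_enter(self, ticket):
        """A waiting CPU would spin until this returns True for its ticket."""
        return ticket == self.now_serving

    def unlock(self):
        """Release the lock by handing it to the next ticket in line."""
        self.now_serving = (self.now_serving + 1) & self.mask


def service_order(arrivals, bits=8):
    """Return the order in which CPUs acquire the lock, given their arrival order."""
    lock = TicketLock(bits)
    waiting = {cpu: lock.take_ticket() for cpu in arrivals}
    order = []
    while waiting:
        # Exactly one waiter holds the ticket that is being served.
        cpu = next(c for c, t in waiting.items() if lock.may_enter(t))
        order.append(cpu)
        del waiting[cpu]
        lock.unlock()
    return order


print(service_order(["cpu3", "cpu0", "cpu2", "cpu1"]))
# prints ['cpu3', 'cpu0', 'cpu2', 'cpu1']
```

The model only stays correct while the number of simultaneous waiters does not exceed the number of distinct tickets, which is why wider counters are needed on machines with more CPUs.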

1.5. Better process memory usage measurement

Measuring how much memory processes are using is more difficult than it looks, especially when processes share memory. Features like /proc/$PID/smaps (added in 2.6.14) help, but they have not been enough. 2.6.25 adds new statistics to make this task easier. A new /proc/$PID/pagemap file is added for each process. In this file the kernel exports (in binary format) the physical location of each page used by the process. Comparing this file with those of other processes shows which pages they share. Two more files, /proc/kpagecount and /proc/kpageflags, expose further statistics about the physical pages of the system. The author of the patch, Matt Mackall, proposes two new metrics: "proportional set size" (PSS), in which each shared page is divided by the number of processes sharing it, and "unique set size" (USS), which counts only pages that are not shared. The first statistic, PSS, has also been added to each mapping listed in /proc/$PID/smaps. This HG repository contains sample command-line and graphical tools that exploit all those statistics.
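The two metrics are easy to state in code. The sketch below is not taken from the sample tools mentioned above: it assumes the set of physical page frame numbers mapped by each process has already been collected, and the process names and frame numbers are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def pss_and_uss(pages_by_process):
    """Return {process: (PSS, USS)}, both measured in pages.

    `pages_by_process` maps a process name to the set of physical page
    frame numbers that the process has mapped.
    """
    # Count how many processes map each physical page.
    sharers = Counter()
    for pages in pages_by_process.values():
        sharers.update(pages)

    result = {}
    for proc, pages in pages_by_process.items():
        # PSS: a page shared by n processes contributes 1/n to each of them.
        pss = sum(Fraction(1, sharers[p]) for p in pages)
        # USS: only pages mapped by this process alone are counted.
        uss = sum(1 for p in pages if sharers[p] == 1)
        result[proc] = (pss, uss)
    return result


mapped = {"editor": {100, 101, 102}, "shell": {102, 103}}  # made-up data
for proc, (pss, uss) in pss_and_uss(mapped).items():
    print(proc, float(pss), uss)
# prints: editor 2.5 2
#         shell 1.5 1
```

Summing PSS over all processes gives the number of distinct pages in use (here 2.5 + 1.5 = 4), which is what makes it more useful than RSS for accounting shared memory.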

1.6. timerfd() syscall

timerfd() is a feature that got merged in 2.6.22 but was disabled due to late complaints about the syscall interface. Its purpose is to deliver timer event notifications through something other than signals, because handling such events with signals is hard. poll()/epoll() only cover file descriptors, so the options were a BSD-style kevent-like subsystem or delivering timer notifications via a file descriptor, so that poll/epoll could handle them.

There were implementations for both approaches, but the cleaner and more "unixy" design of the file descriptor approach won. In 2.6.25, a revised API has finally been introduced. The API is described in this LWN article.
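To show what "timer notifications via a file descriptor" looks like in practice, here is a minimal sketch that arms a one-shot timer and waits for it with poll(). It calls the C library wrappers timerfd_create() and timerfd_settime() through ctypes, and it rests on several assumptions: a 64-bit Linux system whose C library exports those wrappers, `time_t` and `long` having the same size, and `CLOCK_MONOTONIC` having the value 1. The helper name `wait_for_timer` is made up for this example.

```python
import ctypes
import os
import select
import struct

CLOCK_MONOTONIC = 1  # value from <time.h> on Linux (assumption stated above)


class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


class Itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", Timespec), ("it_value", Timespec)]


libc = ctypes.CDLL(None, use_errno=True)


def wait_for_timer(delay_ns):
    """Arm a one-shot timer, block in poll() until it fires, return the expiry count."""
    fd = libc.timerfd_create(CLOCK_MONOTONIC, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "timerfd_create failed")
    try:
        # A zero it_interval makes the timer one-shot; it_value is the first expiry.
        spec = Itimerspec(Timespec(0, 0),
                          Timespec(delay_ns // 10**9, delay_ns % 10**9))
        if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
            raise OSError(ctypes.get_errno(), "timerfd_settime failed")
        # The descriptor becomes readable when the timer expires, so it can
        # be multiplexed with sockets and pipes in the same poll() call.
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        poller.poll()
        # Reading returns an unsigned 64-bit count of expirations.
        (expirations,) = struct.unpack("=Q", os.read(fd, 8))
        return expirations
    finally:
        os.close(fd)


print(wait_for_timer(10_000_000))  # 10 ms one-shot timer
```

Because the timer is an ordinary file descriptor, the same event loop that already waits on network or pipe descriptors can wait on it without any signal handling.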

1.7. SMACK, Simplified Mandatory Access Control

The most used MAC solution in Linux is SELinux, a very powerful security framework. SMACK is an alternative MAC framework, not as powerful as SELinux but simpler to use and configure. Linux is all about flexibility, and in the same way that it has several filesystems, this alternative security framework does not aim to replace SELinux. It is just an alternative for those who find it better suited to their needs.

From the LWN article: Like SELinux, Smack implements Mandatory Access Control (MAC), but it purposely leaves out the role based access control and type enforcement that are major parts of SELinux. Smack is geared towards solving smaller security problems than SELinux, requiring much less configuration and very little application support.

1.8. Latencytop

Slow servers, skipping audio, jerky video: everyone knows the symptoms of latency. But knowing what's really going on in the system, what's causing the latency, and how to fix it are difficult questions without good answers right now. LatencyTOP is a Linux tool for software developers (both kernel and userspace), aimed at identifying where system latency occurs and what kind of operation or action is causing it. By identifying this, developers can then change the code to avoid the worst latency hiccups.

There are many types and causes of latency, and LatencyTOP focuses on the type that causes audio skipping and desktop stutters. Specifically, LatencyTOP focuses on the cases where an application wants to run and execute useful code, but some resource is not currently available (and the kernel then blocks the process). This is done both at the system level and at the per-process level, so that you can see what's happening to the system and which process is suffering and/or causing the delays.

You can find the latencytop userspace tool, including screenshots, at latencytop.org.

1.9. BRK and PIE executable randomization

Exec-shield is a project started in 2003 by Red Hat to implement several security protections, and it is mainly used in Red Hat and Fedora. Many of its features were merged long ago, but not all of them. In 2.6.25 two more are being merged: brk() randomization and PIE executable randomization. Those two features should make the address space randomization on i386 and x86_64 complete.

1.10. Controller area network (CAN) protocol support

From the "Controller Area Network" Wikipedia article: Controller Area Network (CAN or CAN-bus) is a computer network protocol and bus standard designed to allow microcontrollers and devices to communicate with each other without a host computer. This implementation has been contributed by Volkswagen.
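As an illustration of what applications exchange over this protocol family, the sketch below packs and unpacks a classic CAN frame in the layout used by `struct can_frame` in <linux/can.h>: a 32-bit identifier, a one-byte payload length, three padding bytes, and up to eight data bytes. The helper names, the identifier 0x123 and the interface name "vcan0" are assumptions made for the example, and nothing is actually sent.

```python
import struct

# struct can_frame: 32-bit ID, 1-byte length, 3 padding bytes, 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)  # 16 bytes


def build_can_frame(can_id, data):
    """Pack an identifier and up to 8 payload bytes into a raw CAN frame."""
    if len(data) > 8:
        raise ValueError("a classic CAN frame carries at most 8 data bytes")
    # struct pads a short payload with zero bytes up to the full 8.
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data)


def parse_can_frame(frame):
    """Unpack a raw CAN frame into (identifier, payload)."""
    can_id, length, data = struct.unpack(CAN_FRAME_FMT, frame)
    return can_id, data[:length]


frame = build_can_frame(0x123, b"\x11\x22\x33")
print(len(frame), parse_can_frame(frame))

# With a CAN interface available (e.g. a hypothetical "vcan0"), the frame
# could be sent through a raw CAN socket:
#   s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
#   s.bind(("vcan0",))
#   s.send(frame)
```

Every frame has the same fixed size regardless of payload length, which keeps reads and writes on a raw CAN socket simple.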

1.12. EXT4 update

The EXT4 mainline snapshot gets an update with a bunch of features: multi-block allocation, large blocksize up to PAGE_SIZE, journal checksumming, large file support, large filesystem support, inode versioning, and support for in-inode extended attributes on the root inode. These features should be the last ones that require on-disk format changes. Other features that don't affect the disk format, like delayed allocation, have yet to be merged.

1.13. MN10300/AM33 architecture support

The MN10300/AM33 architecture is now supported under the "mn10300" subdirectory. 2.6.25 adds support for MN10300/AM33 CPUs produced by MEI. It also adds board support for the ASB2303 with the ASB2308 daughter board, and the ASB2305. The only processor supported is the MN103E010, which is an AM33v2 core plus on-chip devices.

1.14. TASK_KILLABLE

Most Unix systems have two states when sleeping: interruptible and uninterruptible. 2.6.25 adds a third state: killable. While interruptible sleeps can be interrupted by any signal, killable sleeps can only be interrupted by fatal signals. The practical implication of this feature is that NFS has been converted to use it, and as a result you can now kill -9 a task that is waiting for an NFS server that isn't contactable.

Further uses include allowing the OOM killer to make better decisions (it can't kill a task that's sleeping uninterruptibly) and changing more parts of the kernel to use the killable state. If you have a task stuck in uninterruptible sleep with the 2.6.25 kernel, please contact Matthew Wilcox with the output from:

$ ps -eo pid,stat,wchan:40,comm |grep D
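The `grep D` filter in that command also matches the header line and any command name containing a "D". As a hedged alternative for scripted checks, the sketch below reads the state field directly from /proc/<pid>/stat, whose contents start with the PID, the command name in parentheses, and then a one-letter state. The function names are made up for this example.

```python
import os


def task_state(stat_line):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split on the last closing parenthesis.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for every task currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                pid, comm, state = task_state(f.read())
        except OSError:
            continue  # the task exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


# Parsing a sample line (made-up values) rather than the live system:
print(task_state("4242 (nfs client) D 1 4242 4242 0 -1"))
# prints (4242, 'nfs client', 'D')
```

On a Linux system, calling `uninterruptible_tasks()` lists the tasks that are in D state at that instant. It does not report the wait channel, so the `ps` output above is still what should be sent along with a report.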

Code: Commits 1-11 are prep-work. Patches 15 and 21 accomplish the major user-visible features, but depend on all the commits which have gone before them.

RealView: clocksource support for the RealView platforms (commit), clockevents support for the RealView platforms (commit), add broadcasting clockevents support for ARM11MPCore (commit), add clockevents support for the local timers (commit), add core-tile detection (commit)

3.11. FireWire

A whole boatload of bug fixes for firewire-core, firewire-ohci and firewire-sbp2. Together they bring huge improvements in the stability and functionality of these drivers over Linux 2.6.24. See the linux1394-user changelog for a list of fixes.