Linux 2.6.32

Linux 2.6.32 was released on December 3rd, 2009.

Summary: This version adds virtualization memory de-duplication, a rewrite of the writeback code which provides noticeable performance speedups, many important Btrfs improvements and speedups, ATI R600/R700 3D and KMS support and other graphic improvements, a CFQ low latency mode, tracing improvements including a "perf timechart" tool that tries to be a better bootchart, soft limits in the memory controller, support for the S+Core architecture, support for Intel Moorestown and its new firmware interface, run time power management support, and many other improvements and new drivers.

1. Prominent features (the cool stuff)

1.1. Per-backing-device based writeback

"Writeback" in the context of the Linux kernel can be defined as the process of writing "dirty" memory from the page cache to the disk. The amount of data that needs to be written can be huge - hundreds of MB, or even GB - and the work is done by the well-known "pdflush" kernel threads when the amount of dirty memory surpasses the limits set in /proc/sys/vm. The pdflush design has deficiencies, especially on systems with multiple storage devices that need to write large chunks of data to disk, where it causes poor performance and seekiness in some situations. A new flushing system has been designed by Jens Axboe (Oracle), built around the idea of having a dedicated kernel thread for flushing the dirty memory of each storage device. The "pdflush" threads are gone and have been replaced with others named "flush-MAJOR" (the threads are created when there's flushing work that needs to be done and will disappear after a while if there's nothing to do).

The new system has much better performance in several workloads. In a benchmark with two processes doing streaming writes to a 32 GB file on 5 SATA drives combined into an LVM stripe set, XFS was 40% faster and Btrfs 26% faster. A sample ffsb workload that does random writes to files was found to be about 8% faster on a simple SATA drive during the benchmark phase. File layout is much smoother on the vmstat stats. An SSD-based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1 GB/s, where pdflush only manages 750 MB/s and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as do random mmap'ed writes. A streaming vs random writer benchmark went from a few MB/s to ~120 MB/s. In short, performance improves in many important workloads.

1.2. Btrfs improvements

-ENOSPC: Btrfs has not had serious -ENOSPC ("no space") handling until now; the COW-oriented design makes handling such situations more difficult than in filesystems that just rewrite the blocks. In this release Josef Bacik (Red Hat) has added the necessary infrastructure to fix that problem. Note: The filesystem may run out of space and still show some free space. That space comes from a data/metadata chunk that can't get filled because there's no space left to create its metadata/data counterpart chunk. This is unrelated to the -ENOSPC handling and will be fixed in the future. Code: (commit)

Proper snapshot and subvolume deletion: The latest btrfs-progs version has options that allow deleting snapshots and subvolumes without having to use rm. This is much faster because it does the deletion via btree walking. It's also now possible to rename snapshots and subvolumes. Work done by Yan Zheng (Oracle). Code: (commit 1, 2)

Performance improvements: Streaming writes on very fast hardware used to become CPU bound at around 400 MB/s. Chris Mason (Oracle) has improved the code so that it can now push over 1 GB/s while using the same CPU as XFS (factoring out checksums). There are also improvements for writing large portions of extents, and for other workloads. Multidevice setups are also much faster due to the per-BDI writeback changes. fsync() performance has been improved greatly as well (which fixes a severe slowdown while using yum in Fedora 11).

1.3. Kernel Samepage Merging (memory de-duplication)

Modern operating systems already use memory sharing extensively: for example, a forked process initially shares all its memory with its parent, there are shared libraries, etc. Virtualization, however, can't easily benefit from memory sharing. Even when all the VMs are running the same OS with the same kernel and libraries, the host kernel can't know that a lot of those pages are identical and can be shared. Kernel Samepage Merging (KSM) makes it possible to share those pages. The KSM kernel daemon, ksmd, periodically scans areas of user memory, looking for pages of identical content which can be replaced by a single write-protected page (which is automatically COW'ed if a process wants to update it). Not all the memory is scanned: the areas to look for merge candidates are specified by userspace apps using madvise(2): madvise(addr, length, MADV_MERGEABLE).
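As a rough illustration of the madvise(2) call above, here is a minimal Python sketch that maps private anonymous memory and marks it as a merge candidate. The helper name map_mergeable is hypothetical, and the sketch assumes Python 3.8 or later on Linux; it degrades to a no-op where MADV_MERGEABLE is not available or the kernel rejects the hint.

```python
import mmap


def map_mergeable(num_pages):
    """Map private anonymous memory and mark it as a KSM merge candidate.

    Returns (region, advised). `advised` is False when the hint could not
    be applied (hypothetical helper, not part of any KSM tooling).
    """
    length = num_pages * mmap.PAGESIZE
    # Request private anonymous pages explicitly for the mapping.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_MERGEABLE", None)  # missing on non-Linux builds
    advised = False
    if advice is not None:
        try:
            # Equivalent to madvise(addr, length, MADV_MERGEABLE) on the whole map.
            region.madvise(advice)
            advised = True
        except OSError:
            # For example EINVAL when the kernel was built without KSM support.
            pass
    return region, advised


region, advised = map_mergeable(16)
# Fill every page with the same content so the pages are identical.
region.write(b"\x00" * len(region))
```

Whether the pages actually get merged depends on ksmd running and on its scan settings; the call only registers the area as a candidate.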

The result is a dramatic decrease in memory usage in virtualization environments. On a virtualization server, Red Hat found that thanks to KSM, KVM can run as many as 52 Windows XP VMs with 1 GB of RAM each on a server with just 16 GB of RAM. Because KSM works transparently to userspace apps, it can be adopted very easily, and provides huge memory savings for free to current production systems. It was originally developed for use with KVM, but it can also be used with any other virtualization system - or even in non-virtualization workloads, for example applications that for some reason have several processes using lots of memory that could be shared.

The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/; documentation can be found in Documentation/vm/ksm.txt.
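A small sketch of how those sysfs files could be read to estimate the memory saved. The counter name pages_sharing comes from the kernel's ksm.txt documentation rather than from this text, and the function names and the ksm_dir parameter are illustrative assumptions.

```python
import mmap
import os

KSM_DIR = "/sys/kernel/mm/ksm"


def read_ksm_counter(name, ksm_dir=KSM_DIR):
    """Read one integer counter file from the KSM sysfs directory."""
    with open(os.path.join(ksm_dir, name)) as f:
        return int(f.read().strip())


def ksm_saved_bytes(ksm_dir=KSM_DIR, page_size=mmap.PAGESIZE):
    """Estimate saved memory: pages_sharing counts the additional mappings
    that reuse an already shared page, so each one is roughly a page saved."""
    return read_ksm_counter("pages_sharing", ksm_dir) * page_size
```

On a machine with KSM enabled, ksm_saved_bytes() with no arguments reads the live counters; passing another directory is only useful for testing.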

1.4. Improvements in the graphic stack

The landing of GEM and KMS in past releases is driving a much needed renovation in the Linux graphic stack. This release adds several improvements to the graphic drivers that show the steady progress of this kernel subsystem:

Radeon driver

r600/r700 3D + KMS support, based on the hardware specs that AMD has published (thanks for the open source support!). 3D performance is not great and there are still a lot of things to improve, but it works well enough to be used by composited desktops, although it may cause problems in games. This driver provides support for the fastest graphics cards available with open source drivers. Only Nvidia is still left with the requirement of a binary driver (hopefully soon to be fixed by the Nouveau driver) (commit)

Intel driver

Add dynamic clock frequency control: When the graphics are idle, the LVDS refresh rate and the GPU clock are reduced, and memory self refresh is enabled to go into a lower power state. All of these things are re-enabled between frames when GPU activity is triggered (commit)

Improve behaviour under memory pressure. Now when the system is running out of memory, the driver can free used buffers (commit), (commit), (commit)

8xx works again, after the regression caused by GEM's introduction back in 2.6.27

VGA arbitration: For many reasons, there has not been a proper way to arbitrate concurrent access by multiple independent processes (for example, multiple X servers in a multi-head setup) to the VGA resources of multiple cards. The VGA arbiter solves this problem in a generic way. Code: (commit), (commit), (commit)

KMS: Use the video= command line option to control the KMS output setup at boot time. It is used to override the autodetection (commit)

1.5. CFQ low latency mode

In this release, the CFQ IO scheduler (the one used by default) gets a new feature that greatly helps to reduce the impact that a writer can have on system interactivity. The end result is that the desktop experience should be less impacted by background IO activity, but it can cause noticeable throughput loss, so people who only care about throughput (i.e., servers) can try to turn it off by echoing 0 to /sys/class/block/<device name>/queue/iosched/low_latency. It's worth mentioning that the 'low_latency' setting defaults to on.
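The same toggle can be scripted. The sketch below only builds the sysfs path given above and writes 0 or 1 to it; the function names and the sysfs_root parameter are illustrative assumptions, and on a real system the write needs root privileges and a device that uses CFQ.

```python
import os


def low_latency_path(device, sysfs_root="/sys"):
    """Build /sys/class/block/<device>/queue/iosched/low_latency."""
    return os.path.join(
        sysfs_root, "class", "block", device, "queue", "iosched", "low_latency"
    )


def set_low_latency(device, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the tunable, like `echo 0 > .../low_latency`."""
    with open(low_latency_path(device, sysfs_root), "w") as f:
        f.write("1" if enabled else "0")
```

For example, set_low_latency("sda", False) would favour throughput on a device named sda (the device name is just an example).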

1.6. Tracing improvements: perf tracepoints, perf timechart and perf sched

The perf tool is getting a lot of attention and patches. In the past few months the perfcounters subsystem has outgrown its initial role of counting hardware events, and has become (and is still becoming) a much broader generic event enumeration, reporting, logging, monitoring and analysis facility, so it has been renamed from "Performance Counters" to "Performance Events".

Tracepoints: This release includes support for probing the static tracepoints which have been added by maintainers to many kernel subsystems (they can be probed from systemtap as well). It's now possible to analyze workloads easily. When debugfs is mounted, perf list will show a section with all the tracepoints available in the system, so you can do things like "perf record -e i915:i915_gem_request_wait_begin -c 1 -g" to record the stack trace for every Intel GPU stall during a run, or "perf stat -a -e ext4:* --repeat 10 ./somebenchmark" for an average stat of all the ext4 tracepoints during 10 runs of a benchmark. You can also probe syscalls, for example with 'perf stat -e syscalls:sys_enter_blah', which works like a strace-like sort of program, but more powerful in some ways. Read Documentation/trace/tracepoint-analysis.txt for a howto on doing performance analysis with tracepoints.
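To get a feel for where those subsystem:event names come from, the following sketch walks the tracing events directory exposed through debugfs and prints names in the same form that perf accepts. It assumes debugfs is mounted at /sys/kernel/debug and that the directory is readable (usually root only); the function name is hypothetical and this is not how perf itself is implemented.

```python
import os


def list_tracepoints(events_dir="/sys/kernel/debug/tracing/events"):
    """Return tracepoint names as 'subsystem:event', sorted."""
    names = []
    for subsystem in sorted(os.listdir(events_dir)):
        sub_path = os.path.join(events_dir, subsystem)
        if not os.path.isdir(sub_path):
            continue  # skip plain control files at the top level
        for event in sorted(os.listdir(sub_path)):
            # Each event is a directory; plain files here are per-subsystem controls.
            if os.path.isdir(os.path.join(sub_path, event)):
                names.append(f"{subsystem}:{event}")
    return names
```

Any name returned this way can be passed to the -e option shown in the perf commands above.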

Timechart: "perf timechart" is a perf-based tool designed to be a better bootchart. It generates a big SVG file which, unlike bootchart, is zoomable (Inkscape is recommended), i.e., if you want to know more details about some point of the graph, you can zoom in and see them. See this blog entry from the author of the feature to learn more about this tool.

perf sched: a new perf sched tool has been added to trace/measure scheduler properties.

New tracepoints: Tracepoints in this release have been added for syscalls, module loading/unloading, skb allocation and consumption, KVM (old KVMTRACE code is removed), the page allocator, timers, hrtimers, itimers, i915 driver, some JBD2 and Ext4 statistics which were missing, and perf support in the SPARC architecture.

1.7. Soft limits in the memory controller

Control groups are a sort of virtual "containers" that are created as directories inside a special virtual filesystem (usually, with tools). An arbitrary set of processes can be added to a control group, and you can configure the control group to have a set of CPU scheduling or memory limits for the processes inside the group.

This release adds soft memory limits - the processes can surpass the soft limit as long as there is no memory contention (and they do not exceed their hard limit), but if the system needs to free memory, it will reclaim it from the groups that exceed their soft limit.
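The policy can be pictured with a toy model. The sketch below is not kernel code: the Group class and function names are made up for illustration, and ordering the groups by how far they exceed their soft limit is a modelling assumption, since the text only says that memory is reclaimed from groups above their soft limit.

```python
from dataclasses import dataclass

MB = 2 ** 20


@dataclass
class Group:
    name: str
    usage: int       # bytes currently charged to the group
    soft_limit: int  # may be exceeded while memory is plentiful
    hard_limit: int  # may never be exceeded


def can_charge(group, nbytes):
    """A new charge is refused only when it would cross the hard limit."""
    return group.usage + nbytes <= group.hard_limit


def reclaim_order(groups):
    """Under memory pressure, pick groups above their soft limit,
    largest excess first (assumed ordering)."""
    over = [g for g in groups if g.usage > g.soft_limit]
    over.sort(key=lambda g: g.usage - g.soft_limit, reverse=True)
    return [g.name for g in over]


groups = [
    Group("web", usage=300 * MB, soft_limit=200 * MB, hard_limit=512 * MB),
    Group("db", usage=150 * MB, soft_limit=200 * MB, hard_limit=256 * MB),
    Group("batch", usage=450 * MB, soft_limit=100 * MB, hard_limit=512 * MB),
]
print(reclaim_order(groups))  # prints ['batch', 'web']
```

The "db" group stays untouched because it is below its soft limit, while "web" and "batch" were allowed to grow past theirs only because memory was available.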

1.8. Easy local kernel configuration

Most people use the kernel shipped by distros - and that's good. But some people like to compile their own kernels from kernel.org, or maybe they like following the Linux development and want to try it. Configuring your own kernel, however, has become a very difficult and tedious task - there are too many options, and sometimes userspace software will stop working if you don't enable some key option. You can use a standard distro .config file, but it takes too much time to compile all the options it enables.

To make the process of configuration easier, a new build target has been added: make localmodconfig. It runs "lsmod" to find all the modules loaded on the currently running system. It reads all the Makefiles to map which CONFIG option enables each module. It reads the Kconfig files to find the dependencies and selects that may be needed to support a CONFIG option. Finally, it reads the .config file and removes any module ("=m") that is not needed to enable the currently loaded modules. With this tool, you can strip a distro .config of all the drivers that are not needed on your machine, and it will take much less time to build the kernel. There's an additional "make localyesconfig" target, in case you don't want to use modules and/or initrds.
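The steps above can be sketched in a simplified form. This is not the script behind the make target: the module-to-option mapping is written by hand here instead of being derived from the Makefiles, Kconfig dependencies are ignored, and all function names are illustrative.

```python
def loaded_modules(lsmod_output):
    """Module names from `lsmod` text (first column, header line skipped)."""
    lines = lsmod_output.strip().splitlines()[1:]
    return {line.split()[0] for line in lines if line.strip()}


def strip_unused_modules(config_text, module_to_config, loaded):
    """Turn off every '=m' option that no loaded module needs."""
    needed = {module_to_config[m] for m in loaded if m in module_to_config}
    out = []
    for line in config_text.splitlines():
        option, sep, value = line.partition("=")
        if sep and value == "m" and option not in needed:
            out.append(f"# {option} is not set")
        else:
            out.append(line)  # built-in options, comments and needed modules stay
    return "\n".join(out)


lsmod_text = """Module                  Size  Used by
e1000e                135168  0
ext4                  323584  1
"""
# Hand-written mapping for illustration only.
mapping = {
    "e1000e": "CONFIG_E1000E",
    "ext4": "CONFIG_EXT4_FS",
    "floppy": "CONFIG_BLK_DEV_FD",
}
config = "CONFIG_E1000E=m\nCONFIG_BLK_DEV_FD=m\nCONFIG_EXT4_FS=m\nCONFIG_SMP=y"

print(strip_unused_modules(config, mapping, loaded_modules(lsmod_text)))
```

In this example only the floppy driver option is switched off, because the floppy module does not appear in the lsmod output; the built-in CONFIG_SMP=y line is left alone.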

1.9. Virtualization improvements

This version adds a few notable improvements to the Linux virtualization subsystem, KVM:

ioeventfd is a new and faster IO mechanism for KVM. Instead of the blocking round-trip used today, which can cause a VMX/SVM synchronous "heavy-weight" exit back to userspace, ioeventfd allows host userspace to register PIO/MMIO regions that trigger an eventfd(2) signal - which is asynchronous - when written to by a guest, improving performance and latency (commit)

irqfd: KVM provides support for injecting virtual interrupts, but all must be injected to the guest via the KVM infrastructure. irqfd adds a new mechanism to inject a specific interrupt to a guest using a decoupled eventfd mechanism: Any legal signal on the irqfd (using eventfd semantics from either userspace or kernel) will translate into an injected interrupt in the guest at the next available interrupt window (commit)

1.10. Run-time Power Management

This feature enables functionality allowing I/O devices to be put into energy-saving (low power) states at run time (or autosuspended) after a specified period of inactivity and woken up in response to a hardware-generated wake-up event or a driver's request. Hardware support is generally required for this functionality to work and the bus type drivers of the buses the devices are on are responsible for the actual handling of the autosuspend requests and wake-up events.

1.11. Intel Moorestown, SFI (Simple Firmware Interface) and ACPI 4.0 support

The Simple Firmware Interface (SFI) is a method for platform firmware to export static tables to the operating system (OS) - something analogous to ACPI, used in the MID devices based on the 2nd generation Intel Atom processor platform, code-named Moorestown.

SFI is used instead of ACPI on those platforms because it's simpler and more lightweight. It's not intended to replace ACPI. For more information, see the web site

At the same time, this release adds support for Moorestown, Intel's Low Power Intel Architecture (LPIA) based Mobile Internet Device (MID) platform. Moorestown consists of two chips: Lincroft (CPU core, graphics, and memory controller) and Langwell IOH. Unlike standard x86 PCs, Moorestown does not have many legacy devices nor standard legacy replacement devices/features. For example, Moorestown does not contain i8259, i8254, HPET, legacy BIOS, or most of the IO ports.

There are also several patches that implement ACPI 4.0 support - Linux is in fact the first platform to support it.

6. MD/DM

7. Filesystems

Ext4

Work around problems in the writeback code to force out writebacks in larger chunks than just 4 MB, which is just too small. This also works around limitations in the ext4 block allocator, which can't allocate more than 2048 blocks at a time (commit)

Check compiler version and EABI support when adding ARM unwind support. (commit)

x86

HWPOISON: A high level memory handler that poisons pages that got corrupted by hardware (typically by a two-bit flip in a DIMM or a cache) on the Linux level. The VM marks the page as hwpoisoned and takes the appropriate action based on the type of page it is (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)

Add support for SPARC-LEON, a synthesizable VHDL model of the SPARC-v8 standard. LEON is part of the GRLIB collection of IP cores that are distributed under the GPL. GRLIB and a sparc-linux cross-compilation toolchain can both be downloaded from www.gaisler.com (commit), (commit), (commit), (commit), (commit)

omapfb: add support for MIPI-DCS compatible LCDs (commit), add support for rotation on the Blizzard LCD ctrl (commit), add support for the 2430SDP LCD (commit), add support for the 3430SDP LCD (commit), add support for the Amstrad Delta LCD (commit), add support for the Apollon LCD (commit), add support for the Gumstix Overo LCD (commit), add support for the OMAP2EVM LCD (commit), add support for the OMAP3 Beagle DVI output (commit), add support for the OMAP3 EVM LCD (commit), add support for the ZOOM MDK LCD (commit)