Summary: This release includes support for metadata checksums in ext4, userspace probes for performance profiling with tools like SystemTap or perf, a sandboxing mechanism that can filter syscalls, a new network queue management algorithm designed to fight bufferbloat, support for checkpointing and restoring TCP connections, support for TCP Early Retransmit (RFC 5827), support for Android-style opportunistic suspend, btrfs I/O failure statistics, and SCSI over FireWire and USB. Many small features, new drivers and fixes are also available.

1. Prominent features in Linux 3.5

1.1. ext4 metadata checksums

Modern filesystems such as ZFS and Btrfs have proved that ensuring the integrity of the filesystem using checksums is a valuable feature. Ext4 has added the ability to store checksums of various metadata fields. Every time a metadata field is read, the checksum of the data read is compared with the stored checksum; if they differ, the metadata is corrupted. Note that this feature only covers the internal metadata structures, not file data, and that it doesn't have "self-healing" capabilities. The amount of code added to implement this feature is 1659 insertions(+), 162 deletions(-).
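The read-time check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ext4's on-disk format: ext4 uses the crc32c polynomial and stores the checksum in fields inside each metadata structure, while this sketch uses zlib.crc32 as a stand-in and simply appends the checksum to the block. The function names seal() and verify() are hypothetical.

```python
import struct
import zlib


def seal(metadata: bytes) -> bytes:
    """Write path: append a 32-bit checksum to a metadata block.

    Assumption: zlib.crc32 stands in for the crc32c used by ext4.
    """
    return metadata + struct.pack("<I", zlib.crc32(metadata))


def verify(block: bytes) -> bytes:
    """Read path: recompute the checksum and compare it with the stored one."""
    metadata = block[:-4]
    (stored,) = struct.unpack("<I", block[-4:])
    if zlib.crc32(metadata) != stored:
        # Corruption is detected but not repaired: there is no self-healing.
        raise IOError("metadata checksum mismatch: block is corrupted")
    return metadata
```

A single flipped bit in the sealed block makes verify() raise instead of returning silently corrupted metadata.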

Any ext4 filesystem can be upgraded to use checksums using the "tune2fs -O metadata_csum" command, or "mkfs -O metadata_csum" at creation time. Once this feature is enabled in a filesystem, older kernels with no checksum support will only be able to mount it in read-only mode.

As far as performance impact goes, it shouldn't be noticeable for common desktop and server workloads. A mail server ffsb simulation shows nearly no change. On a test doing only file creation and deletion and extent tree modifications, a performance drop of about 20 percent was measured. However, that workload is very heavily oriented towards metadata. In most real-world workloads metadata is a small fraction of total I/O, so unless your workload is metadata-oriented, the cost of enabling this feature should be negligible.

1.2. Uprobes: userspace probes

Uprobes, the user-space counterpart of kprobes, makes it possible to place probes at any memory address of a user application and collect debugging and performance information non-disruptively, which can be used to find performance problems. These probes can be placed dynamically in a running process; there is no need to restart the program or modify the binaries. The probes are usually managed with an instrumentation application, such as perf probe, SystemTap or LTTng.

A sample usage of uprobes with perf could be to profile libc's malloc() calls (the libc path varies between distributions):

$ perf probe -x /lib64/libc.so.6 malloc

A probe has been created. Now, let's record the global usage of malloc across the whole system for 1 second:

$ perf record -e probe_libc:malloc -agR sleep 1

Now you can watch the results in the TUI with "$ perf report", or get plain text output without the call graph info with "$ perf report -g flat --stdio".

If you don't know which function you want to probe, you can get a list of probe-able functions in libraries and executables using the -F parameter, for example: "$ perf probe -F -x /lib64/libc.so.6" or "$ perf probe -F -x /bin/zsh". You can use multiple probes as well and mix them with kprobes, regular PMU events or kernel tracepoints.

The uprobes code is one of the longest-standing out-of-tree patches. It originates from SystemTap and has been included for years in Fedora and RHEL kernels.

1.3. Seccomp-based system call filtering

Seccomp (short for "secure computing") is a simple sandboxing mechanism added back in 2.6.12 that allows a process to transition to a state where it cannot make any system calls except a very restricted set (exit, sigreturn, and read and write to already open file descriptors). Seccomp has now been extended: instead of a fixed and very limited set of system calls, it has evolved into a filtering mechanism that allows processes to install an arbitrary filter (expressed as a Berkeley Packet Filter program) that decides which system calls are permitted and which are forbidden. This can be used to implement different types of security mechanisms; for example, the Linux port of the Chromium web browser supports this feature to run plugins in a sandbox.
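To make the filtering idea more concrete, here is a small Python model of how an allow-list filter written in classic BPF is evaluated. It does not install anything in the kernel; it only interprets the three instruction types such a filter needs. The opcode and return values match the kernel headers, the syscall numbers are the x86-64 ones, and the helper names build_allow_list() and run_filter() are hypothetical. A real filter should also check the architecture field of the syscall data before trusting the syscall number; that check is left out here for brevity.

```python
import struct

# Classic BPF opcodes, as defined in the kernel's BPF headers.
BPF_LD_W_ABS = 0x20  # A = 32-bit word at absolute offset k
BPF_JEQ_K = 0x15     # if A == k skip jt instructions, else skip jf
BPF_RET_K = 0x06     # return the constant k

SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000

# x86-64 syscall numbers, used for illustration only.
SYSCALLS = {"read": 0, "write": 1, "open": 2, "rt_sigreturn": 15, "exit": 60}


def build_allow_list(allowed):
    """Build a list of (code, jt, jf, k) instructions that allows only `allowed`."""
    prog = [(BPF_LD_W_ABS, 0, 0, 0)]  # load the syscall number (offset 0)
    for name in allowed:
        # On a match fall through to ALLOW, otherwise skip over it.
        prog.append((BPF_JEQ_K, 0, 1, SYSCALLS[name]))
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL))  # default action
    return prog


def run_filter(prog, nr):
    """Interpret the program against a simplified syscall data record."""
    data = struct.pack("<iI", nr, 0)  # syscall number, then an unused arch field
    acc = 0
    pc = 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            (acc,) = struct.unpack_from("<I", data, k)
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("unsupported opcode: %#x" % code)
```

With a program built from the four syscalls of the original strict mode, run_filter() returns SECCOMP_RET_ALLOW for write and SECCOMP_RET_KILL for open.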

The systemd init daemon has added support for this feature. A unit file can use the SystemCallFilter option to specify the list of syscalls that the service is allowed to make; any other syscall is denied.

1.4. Bufferbloat fighting: CoDel queue management

CoDel (short for "controlled delay") is a new queue management algorithm designed to fight the problems associated with excessive buffering across an entire network path, a problem known as "bufferbloat". According to Jim Gettys, who coined the term bufferbloat, this work is the culmination of three major attempts by its authors to solve the problems with AQM algorithms over the last 14 years.
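The text above does not describe how CoDel works, so the following Python sketch is offered only as a rough illustration. It assumes the published defaults (a 5 ms target delay and a 100 ms interval) and the basic control law: once the time packets spend in the queue has stayed above the target for a full interval, drop a packet, and keep dropping at intervals that shrink with the square root of the drop count. The class and attribute names are hypothetical, and refinements of the real algorithm (such as reusing the drop count when re-entering the dropping state) are omitted.

```python
import math
from collections import deque

TARGET = 0.005    # acceptable standing queue delay, in seconds
INTERVAL = 0.100  # window over which the delay must stay above TARGET


class CoDelQueue:
    """Simplified CoDel: drops are decided at dequeue time from sojourn time."""

    def __init__(self):
        self.queue = deque()          # entries are (enqueue_time, packet)
        self.first_above_time = None  # deadline after which dropping may start
        self.dropping = False
        self.drop_next = 0.0
        self.count = 0
        self.dropped = 0

    def enqueue(self, packet, now):
        self.queue.append((now, packet))

    def _ok_to_drop(self, sojourn, now):
        if sojourn < TARGET:
            self.first_above_time = None  # the queue is draining fine
            return False
        if self.first_above_time is None:
            self.first_above_time = now + INTERVAL
            return False
        return now >= self.first_above_time

    def dequeue(self, now):
        while self.queue:
            enqueued_at, packet = self.queue.popleft()
            if not self._ok_to_drop(now - enqueued_at, now):
                self.dropping = False
                return packet
            if not self.dropping:
                # Delay stayed above TARGET for a whole INTERVAL: start dropping.
                self.dropping = True
                self.count = 1
                self.drop_next = now + INTERVAL / math.sqrt(self.count)
                self.dropped += 1
                continue
            if now >= self.drop_next:
                # Still too slow: drop again, a little sooner each time.
                self.count += 1
                self.drop_next += INTERVAL / math.sqrt(self.count)
                self.dropped += 1
                continue
            return packet
        self.dropping = False
        self.first_above_time = None
        return None
```

The point of the design is that the queue length itself is never inspected; only the delay experienced by each packet matters, which is what makes the algorithm work without per-link tuning.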

1.5. TCP connection repair

As part of an ongoing effort to implement process checkpointing/restart, Linux adds in this release support for stopping a TCP connection and restarting it on another host. Container virtualization implementations will use this feature to relocate an entire network connection from one host to another transparently for the remote end. This is achieved by putting the socket in a "repair" mode that allows gathering the necessary information from it, or restoring previous state into a new socket.
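As a rough illustration, repair mode is switched on and off with a socket option. The sketch below probes whether the running system lets the current process do that. It assumes the TCP_REPAIR option number from the Linux headers (19), which Python's socket module may not export, and the function name repair_mode_available() is hypothetical. Entering repair mode requires the CAP_NET_ADMIN capability, so for an unprivileged process the function is expected to return False.

```python
import socket
import sys

# Assumption: option number taken from <linux/tcp.h>.
TCP_REPAIR = 19


def repair_mode_available():
    """Return True if a fresh TCP socket can be switched into repair mode."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    except OSError:
        return False
    try:
        # 1 enters repair mode, 0 leaves it again.
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)
        return True
    except OSError:
        # Typically EPERM (no CAP_NET_ADMIN) or an unsupported kernel.
        return False
    finally:
        sock.close()
```

A checkpointing tool would keep the socket in repair mode while it reads out or restores the connection state, and only then switch it back to normal operation.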

1.6. TCP Early Retransmit

TCP (and SCTP) Early Retransmit (RFC 5827) allows the transport, under certain conditions, to reduce the number of duplicate acknowledgments required to trigger a fast retransmission. This lets it use fast retransmit to recover segment losses that would otherwise require a lengthy retransmission timeout. In other words, connections recover from lost packets faster, which improves latency. A large-scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP".

Early retransmit is controlled with the tcp_early_retrans sysctl, found at /proc/sys/net/ipv4/tcp_early_retrans. It accepts three values: "0" (disables early retransmit), "1" (enables it), and "2", the default, which enables early retransmit but delays fast recovery and fast retransmit by a fourth of the RTT (this mitigates spurious recoveries when the network has a small degree of reordering).
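The three settings can be summarized in a short Python helper that reads the sysctl and explains its value. This is a sketch that only covers the values described above; the function names are hypothetical, and on systems where the file is missing or unreadable current_mode() returns None.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/tcp_early_retrans")

# Meanings of the three values described for this release.
MODES = {
    0: "early retransmit disabled",
    1: "early retransmit enabled",
    2: "early retransmit enabled, fast retransmit delayed by RTT/4",
}


def describe(value):
    """Return a human-readable description of a tcp_early_retrans value."""
    return MODES.get(value, "unknown value")


def delayed_retransmit_wait(rtt_ms):
    """Delay applied in mode 2: a fourth of the round-trip time."""
    return rtt_ms / 4


def current_mode():
    """Read the sysctl, or return None if it cannot be read."""
    try:
        return int(SYSCTL.read_text().strip())
    except (OSError, ValueError):
        return None
```

For example, with a 100 ms round-trip time, mode 2 waits 25 ms before the early retransmit, which gives slightly reordered segments a chance to arrive first.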

1.7. Android-style opportunistic suspend

The most controversial issue in the merge of Android code into Linux is the functionality called "suspend blockers" or "wakelocks". They are part of a specific approach to power management, which is based on aggressive utilization of full system suspend as much as possible. The natural state of the system is a sleep state, in which energy is only used for refreshing memory and providing power to a few devices that can wake the system up. The system only uses the full power state when it has to do some real work, and when it finishes it goes back to a suspend state.

This is a good idea, but the kernel developers didn't like Android's "suspend blockers" (a full technical analysis on the issue can be found here). Endless flames have been going on for years, and little progress has been made. This was a huge problem for the convergence of Android and Linux: drivers of Android devices use the suspend blocker APIs, and the lack of such APIs in Linux makes it impossible to merge them. But in this release, the kernel incorporates a similar functionality, called "autosleep and wake locks". It is hoped that Android will be able to use it, and that merging drivers from Android devices will become easier.

1.8. Btrfs: I/O failure statistics, latency improvements

Support for I/O failure statistics has been added. I/O errors, CRC errors, and generation checks of metadata blocks are tracked for each drive. The Btrfs command to retrieve and print the device stats, to be included in future btrfs-progs, should be "btrfs device stats".

This release also includes fairly large changes that make Btrfs much friendlier to memory reclaim and lower latencies quite a lot for synchronous I/O.

1.9. SCSI over FireWire and USB

This release includes a driver for using an IEEE-1394 connection as a SCSI transport. This makes it possible to expose SCSI devices, for example hard disk drives, to other nodes on the FireWire bus. It's a similar functionality to FireWire Target Disk Mode on many Apple computers.

This release also adds a usb-gadget driver that does the same with USB. The driver supports two USB protocols: BBB or BOT (Bulk Only Transport) and UAS (USB Attached SCSI). BOT is advertised on alternative interface 0 (primary) and UAS is on alternative interface 1. Both protocols can work on USB 2.0 and USB 3.0. UAS utilizes the USB 3.0 feature called streams support.

4. Memory Management

Frontswap support. Frontswap is so named because it can be thought of as the opposite of a "backing" store for a swap device. The data is stored into "transcendent memory", memory that is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. When space in transcendent memory is available, a significant swap I/O reduction may be achieved. When none is available, all frontswap calls are reduced to a single pointer-compare-against-NULL resulting in a negligible performance hit and swap data is stored as normal on the matching swap device (commit 1, 2, 3, 4)
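The fallback behaviour described above can be modelled in a few lines of Python. This is a toy model, not kernel code: the class names, the dictionary standing in for the swap device and the capacity-limited backend are all hypothetical. It shows the two cases from the text: with no backend registered the hook is a single comparison against None, and when the backend has no room the page goes to the swap device as usual.

```python
class BoundedBackend:
    """Stand-in for transcendent memory: limited and free to refuse pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}

    def store(self, offset, page):
        if offset not in self.pages and len(self.pages) >= self.capacity:
            return False  # no space available right now
        self.pages[offset] = page
        return True

    def load(self, offset):
        return self.pages.get(offset)


class SwapSubsystem:
    def __init__(self, frontswap_ops=None):
        self.frontswap_ops = frontswap_ops  # None when no backend is registered
        self.swap_device = {}               # stand-in for the real swap device

    def swap_out(self, offset, page):
        ops = self.frontswap_ops
        if ops is not None and ops.store(offset, page):
            return "frontswap"  # no swap I/O needed
        self.swap_device[offset] = page
        return "device"

    def swap_in(self, offset):
        ops = self.frontswap_ops
        if ops is not None:
            page = ops.load(offset)
            if page is not None:
                return page
        return self.swap_device[offset]
```

In the model, a backend with room for one page absorbs the first swap-out and lets the second one fall through to the swap device, and both pages can still be read back.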

Add a Contiguous Memory Allocator (recommended LWN article: A deep dive into CMA). This is a memory allocator that attempts to provide big contiguous allocations of memory. It operates on memory regions from which only movable pages can be allocated. This way, the kernel can use the memory for pagecache and, when a device driver requests it, the allocated pages can be migrated (commit)

Remove swap token code and lumpy reclaim: they no longer fit in the current VM model (commit), (commit)