Linux 2.6.39

Summary: Ext4 SMP scalability improvements, an increase of the initial TCP congestion window, a new architecture called Unicore-32, IPset (a feature that allows the creation of groups of network resources), Btrfs updates, a feature that stores crash information in firmware so that it can be recovered after a reboot, open-by-handle syscalls, perf updates, and many other small changes and new drivers.

1. Prominent features (the cool stuff)

1.1. Ext4 SMP scalability

In 2.6.37, huge Ext4 scalability improvements were merged and mentioned in the changelog. However, the feature was not ready for prime time and was disabled in the source before the release, something that the changelog didn't mention. In this release it has been enabled by default. This is the text from the previous changelog:

"In this release Ext4 will use the "bio" layer directly instead of the intermediate "buffer" layer. The "bio" layer (alias for Block I/O: it's the part of the kernel that sends the requests to the I/O scheduler) was one of the first features merged in the Linux 2.5.1 kernel. The buffer layer has a lot of performance and SMP scalability issues that will get solved with this port. An FFSB benchmark in a 48 core AMD box using a 24 SAS-disk hardware RAID array with 192 simultaneous ffsb threads speeds up by 300% (400% disabling journaling), while reducing CPU usage by a factor of 3-4"

1.3. IPset

IPset allows the creation of groups of network resources (IPv4/v6 addresses, TCP/UDP port numbers, IP-MAC address pairs, IP-port number pairs, etc), called "IP sets". Those sets can then be used to define Netfilter/iptables rules. Lookups in these sets are much more efficient than in bare iptables rules, but the sets may come with a greater memory footprint. ipset provides different storage algorithms (for the data structures in memory) so that the user can select an optimum solution. IPset has been available for some time in the xtables-addons patches and is now being included in the Linux tree.

This tool is useful to do things like: store multiple IP addresses or port numbers and match against the whole collection from iptables in one swoop; dynamically update iptables rules against IP addresses or ports without a performance penalty; express complex rulesets based on IP addresses and ports with a single iptables rule and benefit from the speed of IP sets.
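The lookup-efficiency argument can be illustrated with a minimal sketch. It is not how ipset or Netfilter are implemented; it only contrasts a rule-by-rule scan with a hash-based membership test. The blocklist, the function names and the choice of a Python frozenset are assumptions made for illustration.

```python
import ipaddress

# Hypothetical blocklist; the addresses come from documentation-only ranges.
BLOCKED = ["192.0.2.10", "192.0.2.11", "198.51.100.7", "2001:db8::1"]


def match_linear(rules, addr):
    """Compare against one rule at a time, like a chain of individual rules."""
    target = ipaddress.ip_address(addr)
    for rule in rules:
        if ipaddress.ip_address(rule) == target:
            return True
    return False


def build_ip_set(addrs):
    """Build a hash-based set so membership is one lookup, whatever the size."""
    return frozenset(ipaddress.ip_address(a) for a in addrs)


def match_set(ip_set, addr):
    """Single membership test against the whole collection."""
    return ipaddress.ip_address(addr) in ip_set


ip_set = build_ip_set(BLOCKED)
print(match_linear(BLOCKED, "198.51.100.7"), match_set(ip_set, "198.51.100.7"))
print(match_linear(BLOCKED, "203.0.113.5"), match_set(ip_set, "203.0.113.5"))
```

Both functions give the same answer; the difference is that the linear scan grows with the number of rules while the set lookup does not, at the cost of keeping the set structure in memory.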

1.4. Btrfs updates

Btrfs allows different compression and copy-on-write settings for each file/directory (in addition to the per-filesystem controls). There is also the usual round of minor speedups, and tracepoints for runtime analysis.

1.5. Pstore: storing crash information across a reboot

Pstore is a filesystem interface that stores crash information so that it can be recovered after a reboot. The information is kept in places like the ERST, a mechanism specified by ACPI that allows saving and retrieving hardware error information to and from a non-volatile location (like flash).
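Because pstore is exposed as a filesystem, saved records can be read back like ordinary files once it is mounted. The following is a minimal sketch under stated assumptions: the mount point is whatever directory the administrator chose, and read_crash_records() and the file name "record-1" are hypothetical. A temporary directory stands in for a real pstore mount so the example runs anywhere.

```python
import os
import tempfile


def read_crash_records(mount_point):
    """Return {file name: contents} for every regular file under mount_point."""
    records = {}
    for name in sorted(os.listdir(mount_point)):
        path = os.path.join(mount_point, name)
        if os.path.isfile(path):
            # Crash logs may contain odd bytes, so undecodable ones are replaced.
            with open(path, "r", errors="replace") as handle:
                records[name] = handle.read()
    return records


# Stand-in directory so the sketch runs without a real pstore mount.
with tempfile.TemporaryDirectory() as fake_mount:
    with open(os.path.join(fake_mount, "record-1"), "w") as handle:
        handle.write("example oops text\n")
    records = read_crash_records(fake_mount)
    print(records)
```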

1.7. Transcendent Memory

Transcendent memory is a new type of memory with a particular set of characteristics. From LWN: "transcendental memory can be thought of as a sort of RAM disk with some interesting characteristics: nobody knows how big it is, writes to the disk may not succeed, and, potentially, data written to the disk may vanish before being read back again". This memory could be used in places like the page cache, swap, or virtualization. In this release it is used to implement a compressed in-memory caching mechanism called zcache.
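The three characteristics in the quote (unknown size, writes that may fail, data that may vanish) can be modelled with a toy class. This is only a sketch of the semantics a caller has to cope with, not the kernel's interface; the class name, method names and capacity value are hypothetical.

```python
class ToyTranscendentStore:
    """Toy model: the caller never learns the capacity and must tolerate failures."""

    def __init__(self, hidden_capacity):
        self._capacity = hidden_capacity  # not exposed to callers
        self._pages = {}

    def put(self, key, data):
        """Try to store data; return False when the store refuses the write."""
        if key not in self._pages and len(self._pages) >= self._capacity:
            return False
        self._pages[key] = data
        return True

    def get(self, key):
        """Return the data, or None if it was never stored or has vanished."""
        return self._pages.get(key)

    def reclaim(self, count):
        """The owner of the memory takes back the oldest pages without notice."""
        for key in list(self._pages)[:count]:
            del self._pages[key]


store = ToyTranscendentStore(hidden_capacity=2)
put_results = [store.put(page, b"page data") for page in range(3)]
store.reclaim(1)
print(put_results, store.get(0), store.get(1))
```

A caller such as a cache can live with these rules because every page it puts here can be fetched again from its original source if the put fails or the get comes back empty.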

1.8. BKL: That's all, folks

In 2.6.37, it was possible to compile a Linux kernel without support for the BKL. In this release, the BKL has been removed completely from the kernel sources, including the functions lock_kernel() and unlock_kernel().

1.9. Open-by-handle syscalls

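Two new syscalls have been added, name_to_handle_at() and open_by_handle_at(). name_to_handle_at() returns an opaque handle for a file, and open_by_handle_at() opens the file that such a handle refers to. This is useful for user-space filesystems, backup software and other storage management tools. A related flag, O_PATH, has also been added to the open() syscall: it yields a file descriptor that only identifies a location in the filesystem and can be passed to the *at() family of syscalls.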
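The O_PATH flag can be tried from user space with a short sketch. It covers only the flag, not the two handle syscalls, and it assumes a Linux system where Python exposes os.O_PATH and where O_PATH behaves as described in the open(2) manual page; demo_o_path() and the file name are hypothetical.

```python
import errno
import os
import tempfile


def demo_o_path():
    """Return (size seen through a dir fd, whether read() was refused).

    Returns None on platforms where os.O_PATH is not available.
    """
    if not hasattr(os, "O_PATH"):
        return None
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        with open(target, "w") as handle:
            handle.write("hello")

        dir_fd = os.open(tmp, os.O_PATH)      # refers to the directory only
        file_fd = os.open(target, os.O_PATH)  # refers to the file, not its data
        try:
            # An O_PATH directory descriptor works as the base of *at() calls.
            size = os.stat("data.txt", dir_fd=dir_fd).st_size
            try:
                os.read(file_fd, 1)
                read_refused = False
            except OSError as exc:
                # Data access through an O_PATH descriptor is expected to fail.
                read_refused = exc.errno == errno.EBADF
        finally:
            os.close(file_fd)
            os.close(dir_fd)
    return size, read_refused


result = demo_o_path()
print(result)
```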

1.10. Perf updates

Add the ability to filter monitoring based on container groups (cgroups) for both perf stat and perf record. It is possible to monitor multiple cgroups in parallel. There is one cgroup per event. The cgroups to monitor are passed via a new -G option followed by a comma-separated list of cgroup names (commit), (commit)

perf top: Introduce slang based TUI with live annotation, perf top --tui (commit), (commit)

stripe: implement merge method. A performance improvement of ~12-35% has been measured when a reasonable chunk_size is used (e.g. 64K) in conjunction with a stripe count that is a power of 2 (commit)

7. Networking

IPv4: Remove the hash based routing table implementation, make the FIB Trie implementation the default (commit)
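To show what a trie-based routing lookup does, here is a minimal longest-prefix-match sketch. It uses a plain binary trie with one level per address bit, which is a deliberate simplification and not the kernel's FIB Trie code; the class name, routes and next-hop labels are hypothetical.

```python
import ipaddress


class PrefixTrie:
    """Plain binary trie over IPv4 prefixes (one level per bit, no compression)."""

    def __init__(self):
        self._root = {}

    def insert(self, prefix, next_hop):
        """Store next_hop at the node reached by the prefix's leading bits."""
        network = ipaddress.ip_network(prefix)
        bits = format(int(network.network_address), "032b")[:network.prefixlen]
        node = self._root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, addr):
        """Walk the address bits and remember the deepest next hop seen."""
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node = self._root
        best = node.get("hop")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            best = node.get("hop", best)
        return best


table = PrefixTrie()
table.insert("0.0.0.0/0", "default-gw")
table.insert("10.0.0.0/8", "core")
table.insert("10.1.0.0/16", "branch")
print(table.lookup("10.1.2.3"), table.lookup("10.2.0.1"), table.lookup("192.0.2.1"))
```

The most specific matching route wins because the walk keeps overwriting the remembered next hop as it descends, so a /16 entry takes precedence over the /8 and the default route that cover the same address.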