Summary: This version adds two new filesystems: the distributed filesystem Ceph, and LogFS, a filesystem for flash devices. Other features are a driver for almost-native KVM network performance, the VMware balloon driver, the "kprobes jump" optimization for dynamic probes, new perf features (the "perf lock" tool, cross-platform analysis support), support for GPU switching, several Btrfs improvements, RCU lockdep, Generalized TTL Security Mechanism (RFC 5082) and private VLAN proxy arp (RFC 3069) support, asynchronous suspend/resume, several new drivers and many other small improvements.

1.1. Ceph filesystem

Ceph is a distributed network filesystem. It is built from the ground up to seamlessly and gracefully scale from gigabytes to petabytes and beyond. Scalability is considered in terms of workload as well as total storage. Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory. These usage scenarios bring typical enterprise storage systems to their knees.

Some of the key features that make Ceph different from existing file systems:

Seamless scaling: A Ceph filesystem can be seamlessly expanded by simply adding storage nodes (OSDs), and proactively migrates data onto new devices in order to maintain a balanced distribution of data.

Strong reliability and fast recovery: All data in Ceph is replicated across multiple OSDs. If any OSD fails, data is automatically re-replicated to other devices.

Adaptive MDS: The Ceph metadata server (MDS) is designed to dynamically adapt its behavior to the current workload. As the size and popularity of the file system hierarchy changes over time, that hierarchy is dynamically redistributed among available metadata servers in order to balance load and most effectively use server resources. Similarly, if thousands of clients suddenly access a single file or directory, that metadata is dynamically replicated across multiple servers to distribute the workload.
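
The scaling and recovery properties listed above can be illustrated with a small placement sketch. It does not reproduce Ceph's actual placement algorithm; it swaps in plain rendezvous (highest-random-weight) hashing, and the names `score`, `place` and the OSD labels are hypothetical. It only shows why adding a storage node moves a fraction of the data rather than all of it.

```python
import hashlib

def score(obj, osd):
    # Deterministic pseudo-random weight for an (object, OSD) pair.
    digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj, osds, replicas=2):
    # Each object is stored on the `replicas` OSDs with the highest score.
    ranked = sorted(osds, key=lambda osd: score(obj, osd), reverse=True)
    return ranked[:replicas]

objects = [f"obj{i}" for i in range(1000)]

# Placement before and after a fourth (hypothetical) OSD joins the cluster.
before = {o: place(o, ["osd0", "osd1", "osd2"]) for o in objects}
after = {o: place(o, ["osd0", "osd1", "osd2", "osd3"]) for o in objects}

# Only objects that now rank the new OSD among their top choices change placement.
moved = sum(1 for o in objects if before[o] != after[o])
print(f"{moved} of {len(objects)} objects changed placement")
```

The same function models recovery in this simplified picture: calling `place` with a failed OSD removed from the list yields the surviving replica plus one new target for each affected object.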

1.2. LogFS

LogFS is a filesystem designed for storage devices based on flash memory (SSD drives, USB sticks, etc.). It aims to scale efficiently to large devices. In comparison to JFFS2, it offers significantly faster mount times and potentially lower RAM usage. In its current state it is still experimental.

1.3. Vhost net: fast KVM networking

vhost net is a kernel-level backend for virtio networking. The main motivation for vhost is to reduce virtualization overhead for virtio-net by moving the task of converting virtio descriptors to skbs and back from qemu userspace to the vhost net driver. For virtio-net this means removing up to 4 system calls per packet: vm exit for kick, reentry for kick, iothread wakeup for packet, interrupt injection for packet. This was shown to reduce latency by a factor of 5, and improve bandwidth to almost-native performance. Existing virtio net code is used in guests without modification.

1.4. Btrfs updates

In this version, Btrfs gains the ability to change which subvolume or snapshot is mounted by default. For a while, Btrfs has had a "mount -o subvol" option, which mounts a subvolume instead of the default root. The new ioctl allows you to set this once with "btrfs subvolume set-default" and have it used as the new default for every mount (without any mount options), until you change it again. This feature is part of snapshot-assisted distro upgrades, where you can take a snapshot of your distro, update it to a beta version, and revert the default root to the old tree if you want to go back to the old, stable version. Support for such functionality has already been added to the Yum package manager when the "yum-plugin-fs-snapshot" package is installed. This plugin takes snapshots and modifies the GRUB configuration files to show different boot options for each snapshot (note that recent versions of LVM also support changing which snapshot is the default root, so you can also use this feature in LVM/Ext4 systems).

But the ioctl also sets an incompat bit on the super block, because the developers ended up doing it differently than they had planned in the disk format. People would end up with a big surprise if they mounted with 2.6.33 and got one directory tree but mounted with 2.6.32 and got another, so an incompat bit is flipped when the ioctl is run. The incompat bit is only set if you run the set-default ioctl. Code: (commit), (commit), (commit)
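
The effect of an incompat bit can be sketched as a simple mask check. This is a minimal model under stated assumptions, not the Btrfs mount code: the flag values and the names `can_mount` and `set_default_subvol` are hypothetical and chosen only for illustration.

```python
# Hypothetical flag values, for illustration only.
INCOMPAT_EXISTING_FEATURE = 1 << 0
INCOMPAT_DEFAULT_SUBVOL = 1 << 1

def can_mount(sb_incompat_flags, kernel_supported_flags):
    # A kernel refuses to mount if the super block carries any
    # incompat bit that the kernel does not know about.
    unknown = sb_incompat_flags & ~kernel_supported_flags
    return unknown == 0

def set_default_subvol(sb_incompat_flags):
    # Running the set-default ioctl flips the incompat bit.
    return sb_incompat_flags | INCOMPAT_DEFAULT_SUBVOL

old_kernel = INCOMPAT_EXISTING_FEATURE
new_kernel = INCOMPAT_EXISTING_FEATURE | INCOMPAT_DEFAULT_SUBVOL

sb_flags = INCOMPAT_EXISTING_FEATURE
print(can_mount(sb_flags, old_kernel))   # untouched filesystem, old kernel

sb_flags = set_default_subvol(sb_flags)
print(can_mount(sb_flags, old_kernel))   # after set-default, old kernel
print(can_mount(sb_flags, new_kernel))   # after set-default, new kernel
```

In this model a filesystem that never ran set-default stays mountable by the older kernel, which matches the statement that the bit is only set when the ioctl is run.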

A new userspace utility has been created: a command called "btrfs". This tool replaces the old utilities.

An ioctl has been added to list all the subvolumes on the filesystem (command "btrfs subvolume list"). This makes use of a new interface that runs tree searches from userland, which will be used for incremental backups in later btrfs-progs releases. There's also a userspace utility to list recently modified files (command "btrfs subvolume find-new"). Ioctl code: (commit)

The math for df has been changed a little to better reflect the space available for data, and it now factors in duplication for raid and single spindle dup. Also, a space info ioctl has been added, which shows (command "btrfs filesystem df") how much space is tied up in metadata, and shows the raid level used for metadata/data. Code: (commit), (commit)
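
As a rough illustration of what "factoring in duplication" means, the sketch below divides raw free space by the number of copies each profile keeps. This is a simplified model and an assumption on my part, not the formula the kernel uses; the `COPIES` table and `usable_bytes` are hypothetical names.

```python
# Number of copies kept of each block, per allocation profile
# (simplified model for illustration).
COPIES = {"single": 1, "raid0": 1, "dup": 2, "raid1": 2, "raid10": 2}

def usable_bytes(raw_free_bytes, profile):
    # Every byte of data consumes COPIES[profile] bytes of raw space.
    return raw_free_bytes // COPIES[profile]

raw_free = 100 * 1024**3  # 100 GiB of unallocated raw space
for profile in ("single", "dup", "raid1"):
    gib = usable_bytes(raw_free, profile) / 1024**3
    print(f"{profile}: about {gib:.0f} GiB available for data")
```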

The defrag code has gained the ability to compress a single file on demand and to defrag only a range of bytes in a file. Code: (commit)

When snapshots are taken, Btrfs now waits for all the delayed allocation extents to hit the disk first.

1.5. Kprobes jump optimization

Kprobes is an old feature (merged in 2004) that allows gathering information from any routine in the kernel at runtime. It is the internal system that Systemtap uses to insert probes at arbitrary points of the kernel. The current mechanism used to implement Kprobes (on x86) is a breakpoint. At the probed instruction address, an "int 3" instruction is inserted, and when the code path hits it, the exception handler is called and kprobes collects all the information needed.

This system works very well and it's quite efficient, but it can be improved. This version adds an experimental feature to improve it. In 2.6.34, a probe can optionally use a simple jump instead of a breakpoint, which is much faster. This is possible in many places, but not all, and there is no preempt support for now. Usually, a kprobe hit takes 0.5 to 1.0 microseconds to process. A jump-optimized probe hit, on the other hand, takes less than 0.1 microseconds.
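
To put those per-hit figures in perspective, here is a small back-of-the-envelope calculation. The per-hit costs come from the text above; the hit count of one million and the helper name `probe_overhead_seconds` are assumptions for illustration.

```python
def probe_overhead_seconds(hits, cost_us):
    # Total time spent in probe handling for `hits` probe hits,
    # each costing `cost_us` microseconds.
    return hits * cost_us / 1e6

hits = 1_000_000  # assumed workload: one million probe hits

breakpoint_low = probe_overhead_seconds(hits, 0.5)
breakpoint_high = probe_overhead_seconds(hits, 1.0)
jump_bound = probe_overhead_seconds(hits, 0.1)

print(f"breakpoint probes: {breakpoint_low:.1f} to {breakpoint_high:.1f} s")
print(f"jump-optimized probes: less than {jump_bound:.1f} s")
```

With these numbers, the jump optimization reduces probe overhead by at least a factor of 5 to 10.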

1.6. perf improvements, perf lock

Cross-platform analysis support. The data collected by perf can be analyzed on another system with a different architecture. A command has been added (command "perf archive") to archive in a .tar.bz2 file all the object files needed to analyze a perf record, so a user can send it to someone who can interpret the data correctly. Code: (commit 1, 2, 3, 4, 5)

1.7. RCU lockdep

RCU is a scalable locking scheme used in many parts of the Linux tree. Its use keeps spreading across the tree, but until now checking that it is used correctly had to be done by hand. This version brings lockdep-style checking to rcu_dereference().

1.9. Asynchronous suspend/resume

The power management code has been modified to allow asynchronous suspend/resume, allowing drivers to do device suspend/resume in parallel, which improves the time used to suspend/resume devices quite a lot. In this version, PCI, USB and SCSI devices do asynchronous suspend/resume by default.
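
The gain from parallel suspend can be estimated with a simple model. The sketch below assumes that a device can only suspend after all of its children have finished, so the asynchronous time is the longest chain in the device tree rather than the sum of all devices. The device names and durations are hypothetical, and this models only the timing, not the kernel implementation.

```python
# Hypothetical device tree: parent -> children.
children = {"pci0": ["usb0", "scsi0"], "usb0": ["usb-disk"]}

# Hypothetical per-device suspend durations in milliseconds.
durations = {"pci0": 10, "usb0": 50, "scsi0": 200, "usb-disk": 100}

def serial_suspend_time(durations):
    # Synchronous suspend: devices are handled one after another.
    return sum(durations.values())

def async_suspend_time(device, durations, children):
    # Asynchronous suspend: siblings proceed in parallel, but a device
    # waits for its slowest child before suspending itself.
    slowest_child = max(
        (async_suspend_time(c, durations, children)
         for c in children.get(device, [])),
        default=0,
    )
    return slowest_child + durations[device]

serial_ms = serial_suspend_time(durations)
async_ms = async_suspend_time("pci0", durations, children)
print(f"serial: {serial_ms} ms, asynchronous: {async_ms} ms")
```

In this example the slow SCSI device dominates the asynchronous case, while the USB chain suspends alongside it.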

1.10. GPU switching

Some laptops have two GPUs: a low-power, lower-performance GPU and a power-hungry, high-performance GPU. Users should be able to switch between them at runtime. In this version, Linux adds support for this feature. You need to restart X, though.

1.11. Preliminary Radeon Evergreen (Radeon HD 5xxx)

1.12. VMware balloon driver

This is a standalone version of the VMware Balloon driver. Ballooning is a technique that allows the hypervisor to dynamically limit the amount of memory available to the guest (with guest cooperation). This driver will only activate if the host is VMware.
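
The idea behind ballooning can be modelled in a few lines. This is a conceptual sketch only, not the driver's interface: the class `BalloonedGuest` and its methods are hypothetical names. The hypervisor sets a balloon target, the guest pins that many pages on the hypervisor's behalf, and those pages are no longer available to the guest.

```python
class BalloonedGuest:
    """Toy model of a guest whose memory is limited by a balloon."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.balloon_pages = 0

    def set_target(self, target_pages):
        # The hypervisor asks for a balloon size; the guest cooperates
        # by inflating or deflating, within what it actually has.
        self.balloon_pages = max(0, min(target_pages, self.total_pages))

    @property
    def available_pages(self):
        # Pages held by the balloon cannot be used by the guest.
        return self.total_pages - self.balloon_pages

guest = BalloonedGuest(total_pages=1024)
guest.set_target(256)                 # hypervisor reclaims memory
inflated = guest.available_pages
guest.set_target(0)                   # memory pressure on the host is over
deflated = guest.available_pages
print(inflated, deflated)
```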