Saturday, January 14, 2012

The Butter/Better/B-tree Filesystem, Btrfs, is supposedly destined to become the default Linux filesystem. What makes it special, and what's wrong with good old tried-and-true Ext2/3/4?

Too Many Filesystems

Linux supports a gigantic number of filesystems: removable media, network, cluster, cloud, journaling, virtual machine, compressed, embedded, hardware inter-connect, pseudo-filesystems that live only in memory, Mac and Windows filesystems, and many more.
You are doubtless familiar with the general-purpose Ext2/3/4, JFS, XFS, and Reiser filesystems that we use on our desktop PCs and servers. With all of these filesystems cluttering up the landscape, what is the point of yet another one? (There is even YAFFS: Yet Another Flash File System.)
The point is meeting new needs and workloads, and building functionality into the filesystem rather than relying on a herd of external utilities. Btrfs is rather like a blend of features from ReiserFS and ZFS, Sun's advanced copy-on-write/volume manager/RAID/snapshot/etc. filesystem.
Many Linux users yearn for a native port of ZFS, but its GPL-incompatible license (the Sun CDDL) ensures that Sun's implementation (now Oracle's) can't be included in the Linux kernel.
Even so, you can't keep a good hacker down, and so there are two ports for Linux. One is ZFS on FUSE, which runs ZFS in user-space. It's included in a lot of distros so it's an easy installation. The other one is ZFS on Linux. This is a build of ZFS as a kernel module for users to install, and so you get kernel support without a GPL violation because it is not distributed with the kernel.
It's great having those options to try out ZFS, and I applaud the maintainers of these ZFS projects. Still, it looks like Btrfs is going to take the place that ZFS could have owned were it not for its incompatible license. Oracle is the primary sponsor of Btrfs, and plans to make it the default filesystem in Oracle Unbreakable Linux sometime in 2012. Btrfs isn't just an Oracle project, but has a lot of community support from the Linux kernel team and many Linux distributions. Odds are it's included in your favorite distro. (Run cat /proc/filesystems to see what filesystems your Linux supports.)
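That check is easy to script. Here is a minimal sketch, assuming a made-up helper name (fs_registered). Keep in mind that /proc/filesystems lists only the filesystems the running kernel has registered, so a Btrfs built as a module may not show up until the module is loaded.

```shell
#!/bin/sh
# fs_registered NAME [FILE]: succeed if NAME appears as a filesystem type
# in FILE (default /proc/filesystems). The helper name is hypothetical; the
# optional FILE argument exists only so the function can be tried on sample data.
fs_registered() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

if fs_registered btrfs; then
    echo "btrfs is registered with the running kernel"
else
    echo "btrfs is not listed (the module may simply not be loaded)"
fi
```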

Btrfs Features

So what does this amazing super-duper filesystem do? How about a handy bullet-pointed list to answer this question?

RAID 0, 1, 10

COW

Incremental backup

Online defrag

gzip and LZO compression

Space-efficient packing of small files

Dynamic inode allocation

Checksums on data and metadata

Shrink and grow storage volumes

Extents

Snapshots

16 EiB maximum file size

Planned features include RAID 5 and 6, deduplication, and a ready-for-primetime filesystem checker, btrfsck. You can try out btrfsck now because it is included in btrfsprogs (which Debian, Ubuntu, Mint, and friends package as btrfs-tools, and Fedora as btrfs-progs), but it is not ready for production systems yet.
Putting the finishing touches on btrfsck is the last big step before Oracle makes Btrfs the default filesystem in its next Unbreakable Linux release. Fedora 16 was supposed to default to Btrfs, but now the project is aiming for Fedora 17 in May 2012.
I'm a big fan of RAID 10, which is RAID 1+0: mirroring plus striping. It is expensive in disks because only 50% of your total disk capacity goes to storage. But it is simple, robust, and fast. In the best case half your disks can fail without losing your data, as long as no mirror pair loses both of its members. I got burned out on RAID 5 and RAID 6 years ago; perhaps I had bad RAID mojo, but I experienced a lot of failures, and they are slow. It seemed the systems under my care were more adept at propagating parity errors than operating correctly. So for me, RAID 5 and 6 can sit on the back burner indefinitely as long as I have RAID 10.
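The 50% figure is simple arithmetic. A small sketch, assuming a classic RAID 10 layout built from two-way mirror pairs with equal-sized disks; the function names are made up for illustration.

```shell
#!/bin/sh
# raid10_usable DISKS SIZE_GB: usable capacity of a classic RAID 10 array
# built from two-way mirror pairs (an assumption of this sketch).
raid10_usable() {
    disks=$1
    size_gb=$2
    if [ $((disks % 2)) -ne 0 ] || [ "$disks" -lt 4 ]; then
        echo "need an even number of disks, at least 4" >&2
        return 1
    fi
    echo $(( disks * size_gb / 2 ))
}

# raid10_max_failures DISKS: best case, one lost disk in every mirror pair.
raid10_max_failures() {
    echo $(( $1 / 2 ))
}

echo "8 x 2000 GB disks: $(raid10_usable 8 2000) GB usable"
echo "best case survives $(raid10_max_failures 8) failed disks, one per mirror pair"
```

The best case is the important qualifier: two failures in the same mirror pair lose data no matter how many other disks are healthy.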
16 EiB is 16 exbibytes, a unit close to the more commonly used exabyte. An exbibyte is 1,024 pebibytes. In comparison, Ext4 maxes out at a volume size of one exbibyte and file sizes of up to 16 tebibytes. However you say it, it is a lot.
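If you want to see the numbers, shell arithmetic is enough. This sketch works with powers of two where it can, because 16 EiB is 2^64 bytes and will not fit in the signed 64-bit integers that shells typically use.

```shell
#!/bin/sh
eib_bytes=$(( 1 << 60 ))            # 1 EiB = 2^60 bytes
eb_bytes=1000000000000000000        # 1 EB  = 10^18 bytes

# Whole-percent difference; dividing first keeps the math from overflowing.
pct_larger=$(( (eib_bytes - eb_bytes) / (eb_bytes / 100) ))

pib_in_16_eib=$(( 16 * 1024 ))      # 1 EiB = 1,024 PiB

# 16 EiB (2^64 bytes) divided by 16 TiB (2^44 bytes), done on the exponents.
file_ratio=$(( 1 << (64 - 44) ))

echo "1 EiB = $eib_bytes bytes, about ${pct_larger}% more than 1 EB"
echo "16 EiB = $pib_in_16_eib PiB"
echo "16 EiB is $file_ratio times the 16 TiB file-size limit quoted for Ext4"
```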
Btrfs doesn't contain any database-specific optimizations, and is not a clustering filesystem. It is designed to handle very large storage volumes, protect data, simplify large storage management, and read and write fast.

COW, Volumes, Snapshots

A COW (copy-on-write) filesystem is extra-careful with writing your data. When you make a change to a file, the old data are not overwritten. Instead, the filesystem allocates new blocks for the new data, and only the changed data are given a new allocation. The downside is that this creates fragmentation, so Btrfs supports online defragmentation with the btrfs filesystem defragment command.
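As a rough sketch of how a defragment run might look, assuming a Btrfs volume mounted at /btrfs-volume. The -r (recursive) flag is taken from current btrfs-progs and may be missing from older releases, so check the help output on your system first. The script prints the command by default rather than running it.

```shell
#!/bin/sh
# Prints the command by default. Set DRY_RUN=0 and run as root to execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

target=/btrfs-volume   # assumed mountpoint of a Btrfs volume

# -r walks the directory tree and defragments the files under it.
run btrfs filesystem defragment -r "$target"
```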
COW filesystems lend themselves to easy, efficient snapshots, and Btrfs supports both snapshots and rollbacks. The easy safe way to try Btrfs is to create a new partition for testing. Gparted supports Btrfs, as you can see in figure 1.

Next, mount this partition. In this example the mountpoint is /btrfs-volume:

# mount -t btrfs /dev/sda8 /btrfs-volume
Now we can create a subvolume in this partition. Subvolumes are cool. They are like independent filesystems inside the parent filesystem, with their own mountpoints and options. Create one this way:

# btrfs subvolume create btrfs-volume/test
And that's all there is to it. You'll see this as an ordinary directory in your file manager (figure 2). You don't need to worry about allocating space like you do with normal disk partitions, because subvolumes automatically snag whatever space they need from the parent volume as you add data to them. So you can go ahead and copy some files into the test subvolume. You'll need root permissions, or you can futz with the file permissions in the usual way and change them to an unprivileged user.
Now let's create a snapshot:
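The snapshot subcommand takes a source subvolume and a destination path. Here is a sketch, assuming the test subvolume created above and a made-up snapshot name (test-snapshot). It prints the command by default rather than running it.

```shell
#!/bin/sh
# Prints the command by default. Set DRY_RUN=0 and run as root to execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

volume=btrfs-volume                  # mounted Btrfs volume from the earlier steps
source_subvol="$volume/test"         # subvolume created earlier
snapshot="$volume/test-snapshot"     # hypothetical snapshot name

run btrfs subvolume snapshot "$source_subvol" "$snapshot"
```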

Snapshots are very efficient because a snapshot shares its unchanged data with the original and stores only the changes. You can list all the snapshots in the same volume with btrfs subvolume list; you name one path inside the volume, and every subvolume and snapshot in that volume is displayed:
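The listing comes from btrfs subvolume list. The sample below is illustrative only, not captured from a real system, and the exact columns vary between btrfs-progs releases. The sketch also shows one way to pull out the numeric ID for a given path, using a hypothetical helper named subvol_id.

```shell
#!/bin/sh
# Illustrative listing only; IDs, generations, and names on a real system
# will differ, and older btrfs-progs releases omit some columns.
sample_listing='ID 256 gen 9 top level 5 path test
ID 257 gen 9 top level 5 path test-snapshot'

# subvol_id PATH: read a listing on stdin and print the ID whose path is PATH.
# Assumes the ID is the second field and the path is the last field, so it
# will not cope with paths that contain spaces.
subvol_id() {
    awk -v p="$1" '$1 == "ID" && $NF == p { print $2 }'
}

printf '%s\n' "$sample_listing" | subvol_id test-snapshot

# Against a real volume (as root):
#   btrfs subvolume list btrfs-volume | subvol_id test-snapshot
```

Both entries appear in the same format, with nothing to mark one as a snapshot of the other.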

This also shows that Btrfs sees snapshots and subvolumes as the same things. Your snapshots can be copied elsewhere as backups, or mounted independently to different mountpoints. Want to roll back to an earlier snapshot? First set the snapshot as the default. You need the snapshot ID, and then the path:

# btrfs subvolume set-default 257 btrfs-volume/
Then unmount the volume and remount it:

# umount btrfs-volume
# mount -t btrfs /dev/sda8 btrfs-volume

Is that not cool? After creating subvolumes you don't need to mount the parent volume.

Prognosis

Btrfs is still rough around the edges, and the documentation and administration tools are incomplete. If you've used ZFS then Btrfs feels like a clunky copy, because administering ZFS is faster and easier. ZFS has a several-year head start on Btrfs, though. I expect Btrfs will improve rapidly as it becomes more widely used.