Whither Btrfs?

Is the best technology the pathway to success? Nope. In this post, I'll take a strategic look at the future of the Btrfs file system.

Using B-trees (or modified B-trees) for space allocation has been the rage among file system designers in the past few years. Some of the more notable efforts are ZFS, Btrfs, Reiser4, and NILFS. The availability of open source operating systems, especially BSD and Linux, has enabled exploration of interesting new ways to manage storage and implement file systems. This is a good thing. But being technologically cool does not foretell commercial success. For the purpose of evaluating file systems, I'll define commercially successful as having a large installed base for decades. The list of commercially successful file systems is fairly small: FAT, NTFS, HFS+, UFS, and ext2/3 are perhaps the most commercially successful general-purpose file systems today. The key to commercial success is to provide good value and have a good delivery channel.

Btrfs was announced by Oracle in June 2007 and is being integrated into the Linux kernel. It offers some of the more interesting features of other file systems built on B-tree notions: snapshots, efficient backups, copy-on-write, multiple file systems in a single logical volume (called subvolumes), dynamic inode allocation, multiple device support, internal mirroring, etc. These are all cool features and represent a viable technology direction. But technological feats often run into barriers to adoption which prevent them from becoming commercially successful.
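Copy-on-write is the mechanism that makes cheap snapshots possible in both Btrfs and ZFS. As a toy illustration only (this is not Btrfs code; the `Volume` class and its methods are invented for the sketch), a snapshot can be modeled as a copy of block metadata that shares the underlying data until one side writes:

```python
# Toy copy-on-write model, NOT Btrfs internals.
# A "volume" maps block numbers to immutable block contents. Taking a
# snapshot copies only the mapping, so both views share the same data
# blocks until one of them writes -- and then only that view's mapping
# changes, leaving the other view intact.

class Volume:
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})  # block number -> bytes

    def snapshot(self):
        # O(metadata) copy: shares every data block with the original.
        return Volume(self.blocks)

    def write(self, blockno, data):
        # Replaces this view's mapping entry; the old block is untouched
        # and still visible through any snapshot that references it.
        self.blocks[blockno] = data

    def read(self, blockno):
        return self.blocks.get(blockno)

vol = Volume()
vol.write(0, b"original")
snap = vol.snapshot()
vol.write(0, b"modified")
print(vol.read(0))   # b'modified'
print(snap.read(0))  # b'original'
```

The point of the sketch is only that snapshots cost metadata, not data, which is why COW designs can offer features like efficient backups and subvolumes so cheaply.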

The most important barrier to adoption is the delivery channel. Clearly, Microsoft dominates the industry as it carefully controls the delivery channel of software onto approximately 90% of the computer systems volume. Microsoft owns (is the proprietor of) NTFS and FAT, which dominate the market. The next major vendor by volume is Apple, who owns HFS+, the default file system for OSX. The largest Linux channels, Red Hat and SuSE, use ext2/3 and seem to be planning to use ext4 in the future. Changing the default file system for a popular OS is a very expensive, time-consuming, and disruptive event, which is why OS vendors will spend a lot of time and money to fix and incrementally improve the default file system when possible. The life cycle for a default file system is measured in decades. The development of a new file system takes time, too -- on the order of 5-6 years seems to be typical, as measured by having enough stable new features that the value of migration is greater than the inertia of the legacy. The barrier here is time to maturity and time to become the default in the channel. Btrfs was introduced in 2007, so we can expect it to be mature in the 2012 timeframe. But what about the prospects of becoming the default in the channel?

Oracle has been trying to reduce their costs by eliminating the OS vendor for many years. Until recently, their efforts were to completely eliminate the OS vendor (raw iron) or to take away Linux from Red Hat (so-called Oracle Enterprise Linux, aka Larry Linux). Neither has been very successful. But Oracle's acquisition of Sun Microsystems changes the industry structure in many ways. Now, Oracle will have an entire solution stack: software, hardware, and services. The solution stack represents a channel for Oracle to deliver innovations, such as a spiffy new file system. Herein lies the problem for Btrfs: Oracle will now own ZFS. This means:

- Btrfs is not mature enough to become the default file system for OEL. ZFS is more than 5 years old and stable enough to be the default file system for Solaris 10 and OpenSolaris.

- It makes little sense for Oracle to continue funding two competing file system projects -- one trying to match features with the other. ZFS has approximately 45 associated patents, and patents have real value ($) in the US.

- Tossing Btrfs to the open-source winds is not likely to improve its schedule or channel prospects.

There are a couple of scenarios that could still play out -- Oracle could break the GPLv2 barrier that prevents Linux from accepting ZFS in the kernel, or Oracle could take a more competitive stance against Red Hat and Novell by leveraging [Open]Solaris. Either way, I don't see a good business case for Oracle to continue to invest in Btrfs. What do you think?

Comments

BTRFS is interesting to Oracle because its primary focus is on helping database applications work great. ZFS is interesting to Oracle because its primary focus is on making commodity hardware provide stable, enterprise features. Databases are only one part of what enterprise systems need. There is a lot of overlap between ZFS and BTRFS.

I'm still betting that ZFS will win, but I'm also wondering why Apple is silent on this "feature" removal.

I'm guessing it's because they want it to be supported at MacOSXForge for a bit longer, so that something really cool can fall out of having a community involved.

I would think that if both file systems are based on similar data structures, they could both be optimized in about the same ways. I.e. if I were Oracle I'd start from ZFS and make database-optimization changes to it.

My colleague and I have posted some thoughts on the Oracle / Sun merger at http://ctistrategy.com.
