We look at the amazing features in ZFS and btrfs—and why you need them.

Most people don't care much about their filesystems. But at the end of the day, the filesystem is probably the single most important part of an operating system. A kernel bug might mean the loss of whatever you're working on right now, but a filesystem bug could wipe out everything you've ever done... and it could do so in ways most people never imagine.

Sound too theoretical to make you care about filesystems? Let's talk about "bitrot," the silent corruption of data on disk or tape. One at a time, year by year, a random bit here or there gets flipped. If you have a malfunctioning drive or controller—or a loose/faulty cable—a lot of bits might get flipped. Bitrot is a real thing, and it affects you more than you probably realize. The JPEG that ended in blocky weirdness halfway down? Bitrot. The MP3 that startled you with a violent CHIRP!, and you wondered if it had always done that? No, it probably hadn't—blame bitrot. The video with a bright green block in one corner followed by several seconds of weird rainbowy blocky stuff before it cleared up again? Bitrot.

The worst thing is that backups won't save you from bitrot. The next backup will cheerfully back up the corrupted data, replacing your last good backup with the bad one. Before long, you'll have rotated through all of your backups (if you even have multiple backups), and the uncorrupted original is now gone for good.

Contrary to popular belief, conventional RAID won't help with bitrot, either. "But my raid5 array has parity and can reconstruct the missing data!" you might say. That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).

What might save your data, however, is a "next-gen" filesystem.

Let's look at a graphic demonstration. Here's a picture of my son Finn that I like to call "Genesis of a Supervillain." I like this picture a lot, and I'd hate to lose it, which is why I store it on a next-gen filesystem with redundancy. But what if I didn't do that?

As a test, I set up a virtual machine with six drives. One has the operating system on it, two are configured as a simple btrfs-raid1 mirror, and the remaining three are set up as a conventional raid5. I saved Finn's picture on both the btrfs-raid1 mirror and the conventional raid5 array, and then I took the whole system offline and flipped a single bit—yes, just a single bit from 0 to 1—in the JPG file saved on each array. Here's the result:

Original image.

Corrupted image: RAID5.

Corrupted image: btrfs-raid1.

The raid5 array didn't notice or didn't care about the flipped bit in Finn's picture any more than a standard single disk would. The next-gen btrfs-raid1 system, however, immediately caught and corrected the problem. The results are pretty obvious. If you care about your data, you want a next-gen filesystem. Here, we'll examine two: the older ZFS and the more recent btrfs.

What is a “next-generation” filesystem, anyway?

"Next-generation" is a phrase that gets handed out like sales flyers in a mall parking lot. But in this case, it actually means something. I define a "generation" of filesystems as a group that uses a particular "killer feature"—or closely related set of them—that earlier filesystems don't but that later filesystems all do. Let's take a quick trip down memory lane and examine past and current generations:

Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.

Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.

Generation 2: Hierarchy. A flat list of files gets unwieldy fast, so files gained folders (directories) that could nest inside one another. Think Apple's ProDOS or MS-DOS's FAT.

Generation 3: Metadata—ownership, permissions, etc. As the user count on machines grew higher, the ability to restrict and control access became necessary. This includes AT&T UNIX, Netware, early NTFS, etc.

Generation 4: Journaling! This is the killer feature defining all current, modern filesystems—ext4, modern NTFS, UFS2, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.

That's quite a laundry list (per-block checksumming, self-healing redundant arrays, atomic copy-on-write snapshots, incremental replication, online compression, and more), and one or two individual features from it have shown up in some "current-gen" systems (Windows has Volume Shadow Copy to correspond with snapshots, for example). But there's a strong case to be made for the entire list defining the next generation.

Justify your generation

The quickest objection you could make to defining "generations" like this would be to point at Windows' Volume Shadow Copy Service (VSS) or at the Linux Logical Volume Manager (LVM), each of which can take snapshots of filesystems mounted beneath them. (FreeBSD's UFS2 also offered limited snapshot capability.) However, these snapshots can't be replicated incrementally, meaning that backing up 1TB of data requires groveling over the full 1TB every time you do it. Worse yet, you generally can't replicate them as snapshots, with references intact, which means your remote storage requirements balloon and the difficulty of managing backups grows right along with them. With ZFS or btrfs replicated snapshots, you can have a single, immediately browsable, fully functional filesystem with 1,000+ versions of the filesystem available simultaneously.

Using VSS with Windows Backup, you must use VHD files as a target. Among other limitations, VHD files are only supported up to 2TiB in size, making them useless for even a single backup of a large disk or array. They must also be mounted with special tools not available on all versions of Windows, which limits them even further to being tools for specialists only.
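
To make "replicated incrementally" concrete, here is a minimal sketch using ZFS's send/receive. The pool, dataset, and host names (tank, backup, backuphost) are hypothetical:

Code:

you@box:~$ sudo zfs snapshot tank/data@monday
you@box:~$ sudo zfs send tank/data@monday | ssh backuphost zfs receive backup/data
you@box:~$ sudo zfs snapshot tank/data@tuesday
you@box:~$ sudo zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backup/data

The second send transfers only the blocks that changed between the two snapshots, and the receiving side ends up with both versions immediately browsable.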

Finally, Microsoft's VSS typically depends on "writer" components that interface with applications (such as MS SQL Server) which can themselves hang up, making it difficult to successfully create a VSS snapshot in some cases. To be fair, when working properly, VSS writers offer something that simple snapshots don't—application-level consistency. But VSS writer bugs are a real problem, and I've encountered lots of Windows Servers which were quietly failing to create Shadow Copies. (VSS does not automatically create a writer-less Shadow Copy if the system times out; it just logs the failure and gives up.) I have yet to encounter a ZFS or btrfs filesystem or array that won't immediately create a valid snapshot.

At the end of the day, both LVM and VSS offer useful features that a lot of sysadmins do use, but they don't jump right out and demand your attention the way filenames, folders, metadata, or journaling did when they came onto the market. Still, this is only one feature out of the entire laundry list. You could make a case that snapshots made the fifth generation, and the other features in ZFS and btrfs make the sixth. But by the time you finish this article, you'll see that, however you slice it, btrfs and ZFS constitute a new generation that is easily distinguishable from everything before them.

Still not sure what a snapshot is or why you'd want it? Well, imagine you're about to do something potentially dangerous to your system like apply an automatic update to a big cranky application you rely on a lot and don't trust, or manually delete a bunch of stuff in system directories to try to uninstall a program that you can't remove normally. (We're talking big things that could go wrong and may be difficult or impossible to undo.) Before doing them... take a snapshot. Then, if $BigScaryProcedure goes wrong, roll back to the snapshot you just took. Poof, everything is peachy again.
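
For a concrete sketch of that workflow in ZFS terms (the dataset and snapshot names here are made up), it's just two commands:

Code:

you@box:~$ sudo zfs snapshot tank/home@pre-upgrade
you@box:~$ sudo zfs rollback tank/home@pre-upgrade    # only if $BigScaryProcedure went wrong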

In practice, I take a snapshot every hour on the hour on my own machines and delete the old snapshots as necessary to recover disk space. That gives me the best possible chance to recover from something unexpected going horribly, horribly wrong.
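
If you'd rather automate that than remember it, a hypothetical cron entry (dataset name and naming scheme are placeholders; note that % has to be escaped in crontab files) might look like this:

Code:

# /etc/cron.d/zfs-hourly (hypothetical)
0 * * * *  root  /sbin/zfs snapshot tank/home@auto-$(date +\%Y\%m\%d-\%H00)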

Might be worth mentioning things like OS X's Time Machine (which does exactly this, albeit to a separate volume rather than additional space on the original volume).

A file system with extra protection from unexpected shutdown corruption seems like a real winner to me. I've nearly lost my desktop's hard drive a few times to brownouts in my area forcing my machine into multiple shutdown-restarts before I could plunge behind my desk to unplug my power cord.

Data integrity has been forgotten in the race of ever increasing drive sizes. I am excited for the future where we can combat this head on as everyday people start adding terabytes of data to their lives.

I think the fact that drives are so large and cheap now is what has made people forget about file systems. Let's face it, it's not the sexiest part of an operating system, and when you can buy a 2 terabyte drive for less than a hundred bucks, backups are cheap and easy.

In fact, the only time I really even notice file systems is when I'm reminded that Windows STILL does not read ext4 out of the box.

Having used and abused ZFS (using zfsonlinux), I would never trust my data to anything else ever again. Wow.

By the way, saying things like "ZFS compression needs to be enabled on the entire filesystem" isn't really accurate the way one might think, because ZFS changes the way one thinks of the filesystem versus the storage device. You should think of ZFS file systems as cheap and infinitely mutable. You want /home/user/Documents to be compressed? Fine! just do zfs create compression=on /home/user/Documents.
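
Strictly speaking, zfs create takes properties behind -o and a pool-relative dataset name rather than a mount path, so assuming a hypothetical pool named tank whose datasets mount under /, that would look more like this (-p creates any missing parent datasets):

Code:

you@box:~$ sudo zfs create -p -o compression=on tank/home/user/Documents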

Given the potential instability of btrfs, and disregarding the use case of "lots of huge VM images that get copied and slightly modified on a regular basis," would you recommend ZFS for new deployments over btrfs?

Yes, right now ZFS is the way to go to get next-gen features. It's available in Solaris and FreeBSD as part of their base distributions, and in Linux by way of third-party loadable kernel modules - in the case of Ubuntu, you can add it directly from a PPA:
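
For Ubuntu at the time, that meant the zfs-native PPA; a sketch of the install commands would be something like this:

Code:

you@box:~$ sudo add-apt-repository ppa:zfs-native/stable
you@box:~$ sudo apt-get update
you@box:~$ sudo apt-get install ubuntu-zfs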

The caveat under Linux is that you can't currently (at least, not easily enough to make it worth contemplating) boot directly to ZFS, so you end up needing an ext4 (or whatever) root filesystem, and then you can attach your zpools later.

There are also a few papercuts involved with running ZFS on Linux, but nothing that's likely to cause you serious trouble - generally just "tunables that actually need tuning".

My highly unscientific guess(tm) is that btrfs will be production-ready in about a year or two. It's pretty close now - I'm using it on a very small scale in "test production" on systems I actually care about the condition of - but as of right now, since ZFS does exist, I really wouldn't recommend btrfs to anybody who isn't very well aware of, and willing to regularly contemplate, its bleeding-edge status.

Hey, great to see Jim on the front page, and not just as the always helpful and friendly guy in the Linux subforums.

I've just two little pokes at the article:

1) It sounds as if ZFS, in contrast to BTRFS, did not support different compression algorithms. AFAIK it does, at least under FreeBSD, where one is, iirc, LZJB and the other one is gzip, with a compression level of your choice.

2) When outlining BTRFS' indeed more flexible options for pool mutability, it's done with a mirror. Adding and removing mirrors, though, is exactly what ZFS can do too. It is absolutely correctly stated that ZFS' immutability applies to RAIDZ pools, but I still feel this part is explained a bit unfortunately.

I would have loved to also see a section about how ZFS (and BTRFS) can make use of log devices and cache devices to get you much better performance while staying safe and essentially create quite awesome hybrid drives, but that's probably mostly my own addiction to next gen file systems, with ZFS in particular.
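
In ZFS terms, that just means attaching an SSD as a separate intent log (SLOG) and another as an L2ARC cache device; a minimal sketch, with hypothetical pool and device names:

Code:

you@box:~$ sudo zpool add tank log /dev/disk/by-id/ssd-slog-part1
you@box:~$ sudo zpool add tank cache /dev/disk/by-id/ssd-l2arc-part1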

Nope. There's btrfs-gui, which is just what it sounds like - a graphical interface for managing btrfs filesystems - but it's both still in very early stages and not really designed to be a NAS-in-a-bottle the way FreeNAS is.

If you want a fire-and-forget homebrew NAS, FreeNAS is the no-brainer answer. None of the additional features btrfs offers are very relevant in a small home NAS environment, and so far there hasn't been much demand to try to recreate everything that FreeNAS does on any other platform.

This article reads like a review of HFS+ features. (Without checksumming). Snapshots, COW, et cetera. Sadly, HFS+ is never mentioned.

IMHO checksumming is the single biggest feature of these "next gen" file systems.

Other features are neat and really useful. But being sure that the data you put on a drive is the same as the data you read from a drive should really be considered a "basic feature" of any filesystem.

It's worth adding to this article that checksumming only gets you so far. You also need to scrub your data (that is, read through all of it and correct any errors found). It's no use only checking when you read the data back, as by then it might be too late and any redundant data you had may already be corrupt.
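
For reference, a scrub is a one-liner on either filesystem; the pool name and mountpoint below are hypothetical:

Code:

you@box:~$ sudo zpool scrub tank
you@box:~$ sudo btrfs scrub start /mnt/data

Each command walks every block, verifies it against its checksum, and repairs anything bad from redundancy where it can; progress can be checked with zpool status or btrfs scrub status.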

This is one reason why I'm kind of skeptical of my DroboFS. It claims to do scrubbing, but I'm also pretty sure that I've had data go bad on it.

(Related to this, it's also kind of mind boggling that when you do a "back up" and copy your files to a new NAS you're not entirely sure that you actually have a correct copy unless you do a proper checksum on the results.)

One nitpick: 7 billion people times 2 terabytes is 14 zettabytes, which is considerably larger than 16 exabytes. A mere 8 million people with 2 terabyte drives would outstrip the capacity of btrfs if all their storage were pooled together. NYC, for instance.

A bit confused about the "online balancing" - how does the example differ from a zpool?

The examples got snipped a bit in final editing; it's easy to get confused. Forget the mirrors for a moment: say you've got a working single-disk btrfs-on-root Linux install on /dev/sda1, and three more disks with identical partition layouts at /dev/sdb, /dev/sdc, and /dev/sdd.

Code:

you@box:~$ sudo btrfs dev add /dev/sdb1 /dev/sdc1 /dev/sdd1 /

Now you've added the three other disks to your btrfs filesystem. So far, this isn't much different than adding three more vdevs to a zpool.
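
The live conversion itself is a single balance command; a minimal sketch, assuming you want raid10 for both data and metadata, looks like this:

Code:

you@box:~$ sudo btrfs balance start -dconvert=raid10 -mconvert=raid10 /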

THAT's where it gets different - you've just live-rebalanced your formerly single-disk root btrfs install into a raid10 array, still bootable, still root. (You need to grub-install the other three drives if you want to be able to boot directly from any of them as well, but that's as simple as it sounds, and yes, it does work.)

Might be worth mentioning things like OS X's Time Machine (which does exactly this, albeit to a separate volume rather than additional space on the original volume).

No. No, no, no, no, a thousand times no.

Time Machine is a file-level tool, not block level. It copies entire files. Change one cluster in a 5GB file and Time Machine will back up the entire 5GB file; change another cluster and TM will back it up again. It can potentially waste tremendous amounts of space unless you're careful about what you're excluding from it (virtual machines, for example, should be excluded from your TM backup unless you just love chewing through your TM space).

The block-level snapshots provided by smart file systems like ZFS and btrfs are completely different animals. They capture only the changed clusters and store only those. They have other advantages, as Jim states (atomicity is one, and they can be replicated, too), but the biggest is that if you change one cluster in that 5GB file, only that one changed cluster gets backed up.

Don't get me wrong: I use Time Machine and it's great, but it's NOT the same thing.

Quote:

This article reads like a review of HFS+ features. (Without checksumming). Snapshots, COW, et cetera. Sadly, HFS+ is never mentioned.

I like seeing articles like this, but I am a little disappointed at the lack of Storage Spaces + ReFS in the comparison. It'd be nice to see how it really stacks up against the more-mature ZFS and the 'fully' open source btrfs.

I built a machine to be my ZFS file server recently, but since I'm trying to get a couple of Hyper-V certs, it's hosting a Server 2012 R2 Hyper-V hypervisor with Storage Spaces right now, and I'm actually rather pleased with the performance and features. (There ARE a couple of features that look really annoying... I have an SSD cache defined, but replacing that SSD cache with a larger drive in the future looks nigh-impossible without wiping the volumes in the process.)

You want /home/user/Documents to be compressed? Fine! just do zfs create compression=on /home/user/Documents.

This is correct... assuming you've created /home/user/Documents as a separate ZFS filesystem rather than as a folder. In practice, that's fine for a folder or two, but on a more complex system it can descend into rather a lot of complexity (now you have to deal with snapshotting tons of ZFS filesystems instead of one, and so on). There are answers to managing that complexity, but it's still complexity that's easier to avoid in the first place if you can set compression on or off per folder or file rather than having to do it per filesystem (or subvolume, in btrfs' jargon).

Different admins may have different opinions, but as a guy who has used (and loved!) ZFS on a hundred-ish systems over the last five years... being able to do so much per-folder and per-file in btrfs is nice.
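
As an illustration of the per-folder side, on btrfs you can mark a directory with the compression attribute, and files created in it afterward get compressed automatically; the path here is just an example, and already-existing data isn't recompressed retroactively:

Code:

you@box:~$ chattr +c /home/user/Documents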

There are also a few papercuts involved with running ZFS on Linux, but nothing that's likely to cause you serious trouble - generally just "tunables that actually need tuning".

Any references on those tunables? As noted in my previous post, I'm prepping my ZFS file server in the near future, and it will be linux-hosted (due to my familiarity with linux over *bsd/solaris) and other things I want to host on the same box.

Great article, but I felt like the author was too much of a btrfs "cheerleader". A good balanced article talks about the good and the bad, the deficiencies and the advantages of all the products, and I think this article (good as it was) missed that journalistic mark.

Having used and abused ZFS (using zfsonlinux), I would never trust my data to anything else ever again. Wow.

By the way, saying things like "ZFS compression needs to be enabled on the entire filesystem" isn't really accurate the way one might think, because ZFS changes the way one thinks of the filesystem versus the storage device. You should think of ZFS file systems as cheap and infinitely mutable. You want /home/user/Documents to be compressed? Fine! just do zfs create compression=on /home/user/Documents.

Furthermore, if you use LZ4 compression, it will automatically abort the compression and write the block uncompressed if it detects that the compression ratio is too low. The compression overhead is so low that it makes sense to just set it on and forget it, even if the volume will hold nothing but multimedia, except in the most extreme performance-crucial situations. Here is a reference for that claim.
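
The set-and-forget approach is a one-liner too; a sketch with a hypothetical pool named tank (child datasets inherit the setting, and lz4 requires a reasonably recent ZFS):

Code:

you@box:~$ sudo zfs set compression=lz4 tank
you@box:~$ zfs get -r compression tank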

who thought it was cute to put that little 'i' in the acronyms for KB, MB, GB, TB, PB..?

The little i actually means something. A KiB is 1,024 bytes; a KB is 1,000 bytes. Similarly, a MiB is 1,024 KiB, whereas a MB is 1,000 KB.

Blame hard drive manufacturers for this - they started using powers-of-ten approximations when rating their drives' storage capacities a long, long time ago - presumably because they made for bigger numbers, and consumers will buy the thing with the bigger number on it.

The bigger the units get, the bigger the disparity. A TB is only about nine-tenths of a TiB, and an EB is only about 87 percent of an EiB. I hate that we have to talk about "tebibytes" and "exbibytes" too, but... it really makes a difference.

Correct me if I'm wrong, and it may even be a bit off topic, but the traditional RAID 5 array is set up on _four_ drives, innit? Your point is still valid, but I thought it would still be worth mentioning.

Nope, raid 5 is 3 or more drives, with X-1 drives' worth of storage space. So, 4 drives gives you a higher usable-storage percentage than 3 drives, but it's still raid 5.

Correct me if I'm wrong, and it may even be a bit off topic, but the traditional RAID 5 array is set up on _four_ drives, innit? Your point is still valid, but I thought it would still be worth mentioning.

RAID5 requires a minimum of 3 disks, but can go as big as you feel comfortable going. Nothing wrong with a 4, 5, 6, 7, 8, or even larger number of disks (except that recoverability starts to get scary).

Edit for more - R5 is striping with parity, so if you have 4 disks, your data is striped across all 4 of them, and one disk's chunk of that stripe holds parity data. It's a different disk for each stripe, so every disk holds mixed real data and parity data. You can lose any one disk and rebuild the disk's contents from all the other disks, but you can't lose more than one. That's why R5 arrays are often referred to as "N+1"—like, if you have a 4-disk R5 array, you've got a "3+1" array. You'll get 3 disks' worth of usable space out of the 4-disk array, but you're protected from one disk failure.

There are also a few papercuts involved with running ZFS on Linux, but nothing that's likely to cause you serious trouble - generally just "tunables that actually need tuning".

Any references on those tunables? As noted in my previous post, I'm prepping my ZFS file server in the near future, and it will be linux-hosted (due to my familiarity with linux over *bsd/solaris) and other things I want to host on the same box.

Hands-down the biggest issue is that ZFS on Linux is too slow to relinquish RAM from the ARC back to the system, so you will end up with OOM issues, VMs refusing to start once they've been stopped, and so on, unless you manage it manually by setting zfs_arc_max.

Extra annoyingly, yes it really DOES have to be specified in BYTES, not in KB or MB. Yuck. It's also AFAICT not a tunable that can be set after the zfs ko has already been loaded, meaning that you'll have to completely unload your zpool(s) and the kernel module in order to change it (usually easiest done just by rebooting the system entirely, in practice).

A pretty good general-purpose rule of thumb is to limit the ARC to half the system RAM. If that ends up not working for you, you can further tune it up or down as desired; I ended up changing my mind about how much I wanted to let it have on that particular server a couple of times.
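
For reference, a sketch of what that setting looks like as a module option (the 8GiB figure is just an example, and yes, it has to be in bytes):

Code:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592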

Great article, but I felt like the author was too much of a btrfs "cheerleader". A good balanced article talks about the good and the bad, the deficiencies and the advantages of all the products, and I think this article (good as it was) missed that journalistic mark.

I'll take the hit - but in my defense, this was originally supposed to be an article about btrfs that only mentioned zfs in passing, and at the last minute got changed to be "an article about btrfs and zfs". Mea culpa.
