We look at the amazing features in ZFS and btrfs—and why you need them.

Most people don't care much about their filesystems. But at the end of the day, the filesystem is probably the single most important part of an operating system. A kernel bug might mean the loss of whatever you're working on right now, but a filesystem bug could wipe out everything you've ever done... and it could do so in ways most people never imagine.

Sound too theoretical to make you care about filesystems? Let's talk about "bitrot," the silent corruption of data on disk or tape. One at a time, year by year, a random bit here or there gets flipped. If you have a malfunctioning drive or controller—or a loose/faulty cable—a lot of bits might get flipped. Bitrot is a real thing, and it affects you more than you probably realize. The JPEG that ended in blocky weirdness halfway down? Bitrot. The MP3 that startled you with a violent CHIRP!, and you wondered if it had always done that? No, it probably hadn't—blame bitrot. The video with a bright green block in one corner followed by several seconds of weird rainbowy blocky stuff before it cleared up again? Bitrot.

The worst thing is that backups won't save you from bitrot. The next backup will cheerfully back up the corrupted data, replacing your last good backup with the bad one. Before long, you'll have rotated through all of your backups (if you even have multiple backups), and the uncorrupted original is now gone for good.

Contrary to popular belief, conventional RAID won't help with bitrot, either. "But my raid5 array has parity and can reconstruct the missing data!" you might say. That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).
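
To make the ambiguity concrete, here's a minimal Python sketch of a three-data-disk RAID5 stripe. The block values and sizes are made up for illustration; the point is that parity flags the stripe as inconsistent but is satisfied by any of several mutually exclusive explanations:

```python
# A toy model of a 3-data-disk RAID5 stripe. Values and block sizes are
# made up for illustration; real stripes use much larger blocks.

def xor_parity(blocks):
    """XOR all the blocks together, byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

data = [b'\x8e\x10', b'\x42\x42', b'\x07\x99']   # drives 0, 1, 2
parity = xor_parity(data)                        # drive 3 (parity)

# Silently flip one bit on drive 1 -- no I/O error, no warning.
corrupted = [data[0], bytes([data[1][0] ^ 0x01, data[1][1]]), data[2]]

# A parity check during a scrub DOES notice the stripe is inconsistent...
assert xor_parity(corrupted) != parity

# ...but every "suspect" produces an equally self-consistent rebuild.
# Parity says "something in this stripe is wrong," never "drive 1 lied."
for suspect in range(3):
    others = [blk for i, blk in enumerate(corrupted) if i != suspect]
    rebuilt = xor_parity(others + [parity])
    print(f"if drive {suspect} is the liar, its data 'should' be {rebuilt.hex()}")
```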

What might save your data, however, is a "next-gen" filesystem.

Let's look at a graphic demonstration. Here's a picture of my son Finn that I like to call "Genesis of a Supervillain." I like this picture a lot, and I'd hate to lose it, which is why I store it on a next-gen filesystem with redundancy. But what if I didn't do that?

As a test, I set up a virtual machine with six drives. One has the operating system on it, two are configured as a simple btrfs-raid1 mirror, and the remaining three are set up as a conventional raid5. I saved Finn's picture on both the btrfs-raid1 mirror and the conventional raid5 array, and then I took the whole system offline and flipped a single bit—yes, just a single bit from 0 to 1—in the JPG file saved on each array. Here's the result:

Original image.

Corrupted image: RAID5.

Corrupted image: btrfs-raid1.

The raid5 array didn't notice or didn't care about the flipped bit in Finn's picture any more than a standard single disk would. The next-gen btrfs-raid1 system, however, immediately caught and corrected the problem. The results are pretty obvious. If you care about your data, you want a next-gen filesystem. Here, we'll examine two: the older ZFS and the more recent btrfs.

What is a “next-generation” filesystem, anyway?

"Next-generation" is a phrase that gets handed out like sales flyers in a mall parking lot. But in this case, it actually means something. I define a "generation" of filesystems as a group that uses a particular "killer feature"—or closely related set of them—that earlier filesystems don't but that later filesystems all do. Let's take a quick trip down memory lane and examine past and current generations:

Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.

Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.

Generation 2: Folders. As the number of files on a device grew, a single flat namespace stopped cutting it, and filesystems gained directories—folders that can contain files and other folders. Think MS-DOS's FAT, or Apple's ProDOS.

Generation 3: Metadata—ownership, permissions, etc. As the user count on machines grew higher, the ability to restrict and control access became necessary. This includes AT&T UNIX, Netware, early NTFS, etc.

Generation 4: Journaling! This is the killer feature defining all current, modern filesystems—ext4, modern NTFS, UFS2, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.

Generation 5: Copy-on-write snapshots, per-block checksumming, self-healing redundant arrays, volume management, asynchronous incremental replication, online compression. This is where ZFS and btrfs come in.

That's quite a laundry list, and one or two individual features from it have shown up in some "current-gen" systems (Windows has Volume Shadow Copy to correspond with snapshots, for example). But there's a strong case to be made for the entire list defining the next generation.

Justify your generation

The quickest objection you could make to defining "generations" like this would be to point at NTFS' Volume Snapshot Service (VSS) or at the Linux Logical Volume Manager (LVM), each of which can take snapshots of filesystems mounted beneath them. However, these snapshots can't be replicated incrementally, meaning that backing up 1TB of data requires groveling over 1TB of data every time you do it. (FreeBSD's UFS2 also offered limited snapshot capability.) Worse yet, you generally can't replicate them as snapshots—with references intact—which means that your remote storage requirements grow linearly with every backup you keep, and the difficulty of managing backups grows along with them. With ZFS or btrfs replicated snapshots, you can have a single, immediately browsable, fully functional filesystem with 1,000+ versions of the filesystem available simultaneously. Using VSS with Windows Backup, you must use VHD files as a target. Among other limitations, VHD files are only supported up to 2TiB in size, making them useless for even a single backup of a large disk or array. They must also be mounted with special tools not available on all versions of Windows, which goes even further toward limiting them to use by specialists only.

Finally, Microsoft's VSS typically depends on "writer" components that interface with applications (such as MS SQL Server) which can themselves hang up, making it difficult to successfully create a VSS snapshot in some cases. To be fair, when working properly, VSS writers offer something that simple snapshots don't—application-level consistency. But VSS writer bugs are a real problem, and I've encountered lots of Windows Servers which were quietly failing to create Shadow Copies. (VSS does not automatically create a writer-less Shadow Copy if the system times out; it just logs the failure and gives up.) I have yet to encounter a ZFS or btrfs filesystem or array that won't immediately create a valid snapshot.

At the end of the day, both LVM and VSS offer useful features that a lot of sysadmins do use, but they don't jump right out and demand your attention the way filenames, folders, metadata, or journaling did when they came onto the market. Still, this is only one feature out of the entire laundry list. You could make a case that snapshots made the fifth generation, and the other features in ZFS and btrfs make the sixth. But by the time you finish this article, you'll see it's hard to argue that btrfs and ZFS don't constitute a new generation—one easily distinguishable from everything before them.

634 Reader Comments

Can the checksum itself become corrupted? Then it would attempt to correct uncorrupted data.

The probability of the bits of the checksum becoming corrupted is much, much smaller than the probability of the bits of the data it represents becoming corrupt (the checksum has orders of magnitude fewer bits than the data it represents).
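
A quick back-of-the-envelope illustration in Python, assuming a 128KiB record (ZFS's default recordsize) and a 256-bit checksum—plausible but illustrative figures:

```python
# Illustrative sizes: a 128KiB record (ZFS's default recordsize) vs. a
# 256-bit checksum. The exact figures vary; the ratio is the point.
data_bits = 128 * 1024 * 8    # 1,048,576 bits of data
checksum_bits = 256           # the checksum protecting them

# With a uniform per-bit corruption probability, the data is ~4096 times
# more likely to take the hit than its checksum is.
print(data_bits / checksum_bits)    # 4096.0
```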

you never hear anyone complaining about corrupt/inconsistent zfs filesystems, whereas with Btrfs, people are screaming for an fsck.

People screamed for an fsck in ZFS for years, and still are (see http://openindiana.org/pipermail/openin ... 08822.html for a recent example). Not because it needs it, but because that's what people are used to, and because when people DO end up losing data, they think "I NEED AN FSCK!", not realizing that the kinds of errors an fsck tool could fix just don't happen in the first place.

FWIW, I'd take that list with a pretty huge grain of salt. It's a few years out of date, and as far as I can tell, its author never actually used btrfs, he just went on a hunt to find things to justify not using it.

One minor point on checksumming though - doesn't that come at a pretty big performance penalty?

It really doesn't. It's a tiny flyspeck in terms of modern CPU capability. In actual practice, differences in the actual on-disk filesystem structure make a TREMENDOUSLY larger impact than checksumming does - and they're as often in favor of ZFS or btrfs as they are against.

When I benchmarked storage on Windows Server 2008 R2 guests under Linux KVM, ZFS and btrfs set up properly were the highest performers in most categories (due largely to more advanced caching algorithms) and still performed quite well even in the most demanding platter-chatter areas (sub-4K random reads and writes). I've tried disabling checksumming on ZFS before just out of curiosity, and it had little to no impact on performance in any real-world situation I'm likely to encounter.

It depends. Afaik ZFS uses fletcher4 (I might actually be off here, don't trust this) as its standard checksum. That is a very fast algorithm and really has next to no CPU overhead.

You can also tell ZFS to use other algorithms, like SHA-256, which makes little sense for checksumming. You reduce the chance of an undetected corruption from "might win the lottery ten times in a row" to "might win the lottery ten times in a row and have a lightning strike open my beer bottle on top of that". (If you use deduplication, though, SHA-256 makes sense, as the hash is used as an identifier for the data block, for which fletcher is far from good enough.)

So checksumming per se has an overhead dependent on the used algorithm. In practice, you use algorithms that are optimized to work with a low overhead.
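
For the curious, here's a from-scratch Python toy comparing the two approaches. The fletcher4 below (four chained 64-bit accumulators over little-endian 32-bit words) follows the commonly described structure, not ZFS's actual code—and note that in pure Python, interpreter overhead dominates, so hashlib's C-accelerated SHA-256 may well "win" here even though a native fletcher4 costs far fewer cycles per byte:

```python
# A from-scratch toy, NOT ZFS's actual code: fletcher4 as commonly
# described (four chained 64-bit accumulators over little-endian 32-bit
# words) vs. SHA-256. Caveat: in pure Python, interpreter overhead
# dominates, so C-accelerated hashlib.sha256 will likely be faster here
# even though a native fletcher4 costs far fewer cycles per byte.
import hashlib
import struct
import time

def fletcher4(data: bytes):
    a = b = c = d = 0
    mask = (1 << 64) - 1          # accumulators wrap at 64 bits
    for (w,) in struct.iter_unpack('<I', data):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

payload = bytes(range(256)) * 4096    # 1 MiB of sample data

start = time.perf_counter()
fletcher4(payload)
t_fletcher = time.perf_counter() - start

start = time.perf_counter()
hashlib.sha256(payload).digest()
t_sha = time.perf_counter() - start

print(f"fletcher4 (pure Python): {t_fletcher:.4f}s, sha256 (C): {t_sha:.4f}s")
```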

@Jim: Have you ever tested how much of a difference using SHA-256 for checksums makes? I'm simply curious, and it might actually guzzle a notable chunk of CPU time.

It is getting better though - PC-BSD is a leaps-and-bounds improvement over what getting a desktop running on FreeBSD (and maintaining it afterwards!) was like, back in the day.

PC-BSD even has a GUI installer that understands RAIDZ and will let you do an install directly to a RAIDZ root. So... some caveats are definitely in order, but I don't think it's necessarily fair to dismiss it out of hand.

I'd love to have my Ubuntu desktop use ZFS, but I don't want to have to mess with it. I am not a file system expert and while I can learn I'd rather just have it work with minimal fuss. Hopefully Canonical will integrate it at some point.

@Jim: Have you ever tested how much of a difference using SHA-256 for checksums makes? I'm simply curious, and it might actually guzzle a notable chunk of CPU time.

No, I haven't. ZFS and btrfs are both capable of handling different checksum algorithms than the default (IIRC there were some guides floating around that excitedly recommended using a more complex fletcher algorithm), but I've never really felt the need to screw around with the default choices.

That's an interesting point about dedup, though. BTW - for what it's worth, btrfs doesn't do automatic online dedup the way ZFS can right now, but it IS capable of dedup, and with considerably less overhead; there's a userspace tool in beta called bedup that can do this on-the-fly and at a pretty granular level - for example you might decide to manually dedup two existing VMs running the same guest operating system. Neat stuff.

HFS+ does not support copy-on-write anything, and it does not support snapshots. Time Machine in 10.8 and newer is implemented in what is effectively a copy-on-write manner, according to Apple, but that is not a file system feature:

There are indications that CoreStorage is designed to support snapshots, but the feature is not there yet (unless they snuck it in in 10.9 and I missed it). My personal wish feature is that Apple implements snapshots and block-level checksumming in CoreStorage. With those, they could make "Time Machine 2" to make block-level backups and lazily verify file integrity while backing up, silently restoring a broken file from backup. If that's too hard, start checksumming the journal and the file directory (I think ext4 does this for the journal?), so at least the disk directory will be safe.

HFS+ should not be mentioned in an article about file systems, except as an example of how not to do it. The only one worse still in active use is FAT and its derivatives.

The worst thing is that backups won't save you from bitrot. The next backup will cheerfully back up the corrupted data, replacing your last good backup with the bad one. Before long, you'll have rotated through all of your backups (if you even have multiple backups), and the uncorrupted original is now gone for good.

A bit of a nitpick here, but order of operations does matter. If the file was backed up before the bitrot (i.e., the random flipped bit) occurred, then it can be properly restored. If the bit flipping occurs before the backup, then the backup software will back up the flipped bits.

Quote:

Contrary to popular belief, conventional RAID won't help with bitrot, either. "But my raid5 array has parity and can reconstruct the missing data!" you might say. That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).

RAID5 can catch bitrot by checking the parity on read. As stated, parity check on read is usually disabled, so this is often overlooked; this bitrot recovery scheme depends on it, and only a few RAID5 implementations support it. So how is it able to recover? First, the parity check fails on a read. Then comes the brute-force restoration of each drive's block to determine which one was bad. Say you have drives A, B, C, and D in an array, and the parity check on read fails. A, B, and C contain data, with D holding the parity in this example. B, C, and D are used to regenerate A, which is then compared to what was initially read from A; A, C, and D are used to regenerate B, and so on. If none of the block restorations detects an error in any of the individual blocks, the parity block is regenerated and compared. Obviously this recovery technique is slow, but it should only be invoked after a parity check on read fails, so typical performance should not suffer. (To clarify: enabling parity check on read does incur a performance penalty, but enabling the recovery functionality would not, given how infrequent bitrot is.) The RAID controller can then tell the drive which block is bad so that it is not used anymore.

It can detect bitrot, but it can't tell which drive has the error. Parity only tells you something in a group is different, not which one in the group is different.

It can detect which drive produced the error and correct it. Finding which drive contains the flipped bit does take a bit of effort, though. I'll do a basic pseudo-trace of the algorithm to help illustrate it, using the same A-B-C-D drives from my previous example. D contains the parity, and C will contain the bitrot.

1) ABC is read and parity-checked against D. The parity check fails.
2) A' is generated from BCD as if drive A had failed. Is A' the same as A? Yes; continue.
3) B' is generated from ACD as if drive B had failed. Is B' the same as B? Yes; continue.
4) C' is generated from ABD as if drive C had failed. Is C' the same as C? No; we have found the drive which contains the error. Set that block as bad and write C' to a different block to correct the error.

As I mentioned previously, not all RAID5 controllers support verify on read which this algorithm depends upon. Even fewer actually implement it.

[RAID] can detect bitrot, but it can't tell which drive has the error. Parity only tells you something in a group is different, not which one in the group is different.

Actually Kevin G pointed out a way that it can isolate the error - it's incredibly slow and cumbersome, but if you regenerate every data block in a stripe against the parity block and discover that only one of those blocks doesn't match, you found your bad block (and can replace it). If you regenerate each block in a stripe and NONE of them match, then either your parity block is bad (and you can regenerate and replace it), or more than one block in the stripe is bad (and you're screwed).

On the one hand, I can't believe I never thought of that. On the other hand, I've never seen a system that actually implemented that! I would say "mdraid devs, time to get busy", but TBH I have a feeling btrfs is going to largely make mdraid irrelevant in the next few years anyway.

It doesn't work with RAID5 (it could work on RAID6, based on the assumption that you wouldn't have two errors in the same block at the same time). Demonstration:

Disk1: 1
Disk2: 0
Parity: 0

It's easy to see that it does not match. But do try to find the incorrect bit among all three (you can check the files by hand if it's a JPEG, for example, but good luck with a DLL or something).

I'd love to have my Ubuntu desktop use ZFS, but I don't want to have to mess with it. I am not a file system expert and while I can learn I'd rather just have it work with minimal fuss. Hopefully Canonical will integrate it at some point.

If you've lived so far without zfs, you can live a couple more years until btrfs becomes officially stable and production-ready. Two more years won't change anything.

Also, setting up a simple zfs configuration on Ubuntu is very easy. You'll still need to read a couple of web pages, but it's not exceptionally difficult.

For the sake of completeness:You can use other algorithms for dedup, too, and also ask ZFS to not trust that each hash is unique for all data blocks. (Verifying when using SHA256 is kind of pointless, as the chances of different data generating the same hash are practically zero. Git also works on that assumption, btw. For other algorithms, hash collisions can be quite likely, though.)

You can tell ZFS to use fletcher4 for dedup and verify uniqueness of hashes, though. And I'm not even sure if ZFS would let you be stupid enough to disable verification with fletcher4, as hash collisions are to be expected quite often with that. In some cases fletcher4 plus verify is supposed to be faster, though. I haven't done any further research, but I suppose that's the case for weak CPUd boxen with a rather small amount of differing data to be deduped.
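
The verify knob amounts to this: either trust the digest as an identity, or double-check byte-for-byte before sharing a block. A minimal Python sketch (hypothetical names, with SHA-256 standing in for whatever hash is configured):

```python
# Hash-as-identity dedup, sketched. "verify" mimics ZFS's dedup=verify:
# before sharing a block, compare the actual bytes, not just the digest.
# All names here are made up for illustration.
import hashlib

class DedupTable:
    def __init__(self, verify: bool):
        self.verify = verify
        self.blocks = {}              # digest -> stored block

    def store(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        existing = self.blocks.get(digest)
        if existing is not None:
            if self.verify and existing != block:
                # A real collision: same digest, different data. Without
                # verify, reads of this block would return the WRONG data.
                raise RuntimeError("hash collision; cannot deduplicate")
            return digest             # deduplicated: no new block written
        self.blocks[digest] = block
        return digest

table = DedupTable(verify=False)      # with SHA-256, verify is arguably overkill
d1 = table.store(b"the same data")
d2 = table.store(b"the same data")    # the second copy costs nothing
assert d1 == d2 and len(table.blocks) == 1
```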

On the Windows side, Server 2012 and Windows 8.1 have implemented a new file system called Resilient File System (ReFS). This FS has many of the same goals as ZFS, the most important being avoiding bitrot. It is meant to be used in conjunction with Storage Spaces to provide all the benefits; without getting into the architectural details, I think MS did it that way because it's easier to add features to Storage Spaces in future OS releases than it is to push more modifications to the base file system out to users.

ZFS is more mature and feature-rich, but ReFS v1 plus Storage Spaces is the only option Windows users have, and it helps with the most important goal of avoiding bitrot, along with a few other things—so it definitely qualifies as a next-gen file system.

edit: And I'm surprised this was ignored. There are certainly going to be WAY more people on Windows than on systems that support ZFS. It would be helpful to a lot of people to amend the original article with info about ReFS.

I got my original reconstruction of Kevin G's description of recovery kinda wonky. Let's step through the process using something like your example:

Three disk RAID5, single-bit blocksize:

1 0 0 : [1 + 0 + 0] = 1

Note that the parity block is actually written on the first disk, after the first data block. A three drive RAID5 has three data blocks and one parity block per stripe, not two data and one parity.

Now, flip the bit on the second data block:

1 1 0 : [1 + 1 + 0] != 1

Parity doesn't match. Reconstruct each block from parity and the other blocks:

[1 - 1 - 0] [1 - 1 - 0] [1 - 1 - 1]
0 0 1

Compare the parity-reconstructed data with the original data:

0 0 1
1 0 0

Reconstructed blocks 1 and 3 depended on the data as written in block 2 - and the reconstructed versions are different from the versions as written. Reconstructed data from block 2 did NOT depend on data as written in block 2 - and it isn't different from the version as written. This tells us that the data as written in block 2 is wrong, and should be replaced with the reconstructed version from parity, leaving us with:

1 0 0 : [1 + 0 + 0] = 1

Which is the original data, and which does match the parity.

What if we'd flipped the parity block instead?

1 0 0 : 1
becomes
1 0 0 : 0
and now when we reconstruct, we get
0 1 1
ALL blocks changed, which means that our parity block is wrong (or that literally every single block is wrong), so let's flip the parity back:
1 0 0 : 1
and now we match the parity again.

It hurts my brain, but it does work, as long as only one block was actually corrupted. (If you corrupt more than one block, your results become impossible to decipher beyond "this stripe is screwed, reload from backup lol.")

Also worth noting in the conventional RAID reconstruction example: relatively simple multiple corruptions can pass under the radar. If you turn an 8E into an 8F in one data block and then turn a 4E into a 4F at the same position in another data block in the same stripe, the same bit flips in both blocks, and you'll pass parity with flying colors. The same trick would not pass muster with fletcher or SHA checksumming; yes, hash collisions are still possible, but they're MUCH less likely.
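
Here's a quick Python check of that scenario—flipping the same bit position in two blocks of a stripe cancels out in the XOR parity, while a content checksum still catches it (toy one-byte blocks, using the byte values from the example above):

```python
# Flip the same bit position in two data blocks of one stripe: the two
# flips cancel in the XOR parity, but a content checksum still catches
# it. Toy one-byte blocks, using the 8E->8F / 4E->4F bytes from above.
import hashlib

def xor_parity(blocks):
    out = 0
    for blk in blocks:
        out ^= blk
    return out

stripe = [0x8E, 0x4E, 0x33]
parity = xor_parity(stripe)

bad = [0x8F, 0x4F, 0x33]              # low bit flipped in two blocks

assert xor_parity(bad) == parity      # parity scrub: "all good!"
assert (hashlib.sha256(bytes(bad)).digest()
        != hashlib.sha256(bytes(stripe)).digest())   # checksum: caught
print("parity passed, checksum failed -- corruption slid under the radar")
```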

Where do SSDs fit into this? Does the fact that there aren't real platters make any difference? I know that when they first came out, NTFS did not play so nice with them.

Not to a big degree - some problems go away with SSD (lots of fragmentation due to COW on heavily re-written files doesn't matter much with no platters), and some other things don't need to be DONE with SSD (no point in trying to write heavily-used files "at the front of the drive", defragging is not only unnecessary but a bad idea), but those things are pretty much true across the board.

For what it's worth, btrfs is aware of SSDs if the operating system is (and Linux generally is, automatically) and it will automatically not try to do spinning-rust-oriented "optimizations" to an SSD. ZFS isn't automatically aware of SSDs, but doesn't really need to be either.

I also noticed that BTRFS has work underway on an fsck. The reason for this is that the FS can be in an inconsistent state; the rebalancing and RAID conversion are examples of how this can happen. ZFS was designed from the ground up to never need an fsck, because the FS should never be in an inconsistent state. Not to say this can't happen because of bugs.

ZFS is taking a while to get stuff like rebalancing because ZFS guarantees everything to be atomic, and it is much harder to do that with such a guarantee. Think it's not important? Do you want to be the sysadmin who has to tell their boss that the 10PB SAN is down and will take two weeks of fsck to fix?

The other issue is that BTRFS essentially has race conditions, which are about the only reason something isn't atomic. This means you can test a conversion in dev and get different results in live because of load.

As someone who runs ZEVO for ZFS on the Mac for a home NAS server, I love the idea of next gen file systems, but considering that ZEVO is now a dead product, I'm stuck. I don't want to regress to a lesser file system after running ZFS for a while, but I'm stuck on OSX 10.8 Server until I can find a way to either migrate from ZEVO to MacZFS or buy a NAS from QNAP or Netgear or Synology.

Excellent article, Jim. I knew of zfs and btrfs before now, but didn't have any real idea of what differentiated them from what I guess I thought of as "traditional" file systems (ext4, NTFS, HFS). I'm just a regular user, so I'm going to hold off on playing with btrfs for the moment, but I'm looking forward to using it in a year or two when it's more polished.

The worst thing is that backups won't save you from bitrot. The next backup will cheerfully back up the corrupted data, replacing your last good backup with the bad one. Before long, you'll have rotated through all of your backups (if you even have multiple backups), and the uncorrupted original is now gone for good.

A bit of a nitpick here, but order of operations does matter. If the file was backed up before the bitrot (i.e., the random flipped bit) occurred, then it can be properly restored. If the bit flipping occurs before the backup, then the backup software will back up the flipped bits.

Quote:

Contrary to popular belief, conventional RAID won't help with bitrot, either. "But my raid5 array has parity and can reconstruct the missing data!" you might say. That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).

RAID5 can catch bitrot by checking the parity on read. As stated, parity check on read is usually disabled, so this is often overlooked; this bitrot recovery scheme depends on it, and only a few RAID5 implementations support it. So how is it able to recover? First, the parity check fails on a read. Then comes the brute-force restoration of each drive's block to determine which one was bad. Say you have drives A, B, C, and D in an array, and the parity check on read fails. A, B, and C contain data, with D holding the parity in this example. B, C, and D are used to regenerate A, which is then compared to what was initially read from A; A, C, and D are used to regenerate B, and so on. If none of the block restorations detects an error in any of the individual blocks, the parity block is regenerated and compared. Obviously this recovery technique is slow, but it should only be invoked after a parity check on read fails, so typical performance should not suffer. (To clarify: enabling parity check on read does incur a performance penalty, but enabling the recovery functionality would not, given how infrequent bitrot is.) The RAID controller can then tell the drive which block is bad so that it is not used anymore.

It can detect bitrot, but it can't tell which drive has the error. Parity only tells you something in a group is different, not which one in the group is different.

It can detect which drive produced the error and correct it. Finding which drive contains the flipped bit does take a bit of effort, though. I'll do a basic pseudo-trace of the algorithm to help illustrate it, using the same A-B-C-D drives from my previous example. D contains the parity, and C will contain the bitrot.

1) ABC is read and parity-checked against D. The parity check fails.
2) A' is generated from BCD as if drive A had failed. Is A' the same as A? Yes; continue.
3) B' is generated from ACD as if drive B had failed. Is B' the same as B? Yes; continue.
4) C' is generated from ABD as if drive C had failed. Is C' the same as C? No; we have found the drive which contains the error. Set that block as bad and write C' to a different block to correct the error.

As I mentioned previously, not all RAID5 controllers support verify on read which this algorithm depends upon. Even fewer actually implement it.

You have the incorrect assumption that there are an A', B', and C'. This is not true. What you have is ABC and (ABC)'. If you stored a separate parity for each block, you'd essentially have mirroring.

The math for RAID5 is simple. D = A ^ B ^ C

If A is corrupted, you will only know that D is no longer equal to A ^ B ^ C. But is it C, or is it B, or is it A? No idea. One of the bits is flipped, but you can't tell which.

Example:
1 ^ 0 ^ 1 = 0
0 ^ 1 ^ 1 = 0
1 ^ 1 ^ 0 = 0

It's a many to one relationship. In all cases, the parity is the same.

A three drive RAID5 has three data blocks and one parity block per stripe, not two data and one parity.

With n drives, you have n blocks per stripe (one on each drive). You need one for parity. You get n-1 for data. If you wrote 4 blocks per stripe on 3 drives, that would mean either (basic maths):

- 2 data blocks on the same disk, which you can't recover on failure (single parity block), or
- 1 block of data and 1 of parity on the same drive, which you can't recover either.

Have a look here if you don't trust me: http://en.wikipedia.org/wiki/RAID_5#RAID_5

edit: I've now read your post further, and actually you assume that you can compare against the original file:

Reconstructed blocks 1 and 3 depended on the data as written in block 2 - and the reconstructed versions are different from the versions as written. Reconstructed data from block 2 did NOT depend on data as written in block 2 - and it isn't different from the version as written. This tells us that the data as written in block 2 is wrong, and should be replaced with the reconstructed version from parity, leaving us with:

You don't need RAID if you have the original data. Just compare each bit to the original data, and you can detect and correct errors.
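
A small Python sketch of this objection: corrupt one block, then rebuild each drive from all the others and compare against what was read. Every rebuild mismatches—not just the corrupted drive's—so without the original data (or a per-block checksum) there's nothing to break the tie:

```python
# Corrupt one block, then rebuild each drive from all the others and
# compare with what was read. EVERY rebuild mismatches, not just the
# corrupted drive's, so parity alone can't break the tie.
def xor_all(values):
    out = 0
    for v in values:
        out ^= v
    return out

data = [0x10, 0x20, 0x30]             # drives A, B, C
parity = xor_all(data)                # drive D

read = [0x10, 0x21, 0x30]             # B silently corrupted

for i, name in enumerate("ABC"):
    others = [v for j, v in enumerate(read) if j != i] + [parity]
    rebuilt = xor_all(others)
    print(f"drive {name}: read {read[i]:#04x}, rebuilt {rebuilt:#04x}, "
          f"match={rebuilt == read[i]}")
# Prints match=False for all three drives -- the tie stands.
```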

On the windows side, Server 2012 and Windows 8.1 have implemented a new file system called Resilient File System (ReFS). This FS has many of the same goals as ZFS, the most important being avoiding bit rot. It is meant to be used in conjunction with storage spaces to provide all the benefits and without bothering to get into the architectural details of it, I think MS did it that way because it's easier to add features to storage spaces in future OS releases than it is to make more modifications to the base file system (on the users).

ZFS is more mature and feature rich, but ReFS v1 + SS is the only option windows users have and it helps with the most important goal of avoiding bit rot along with a few other things and so it definitely classifies as a Next Gen file system.

edit: And I'm surprised this was ignored. There are certainly going to be WAY more people on windows than systems that support ZFS. Would be helpful to a lot of people to amend original article with info about ReFS.

One of the more common complaints about next-generation filesystems is that if you hammer them extremely hard with a never-ending stream of really punishing random I/O, they will eventually fall down harder than simpler conventional filesystems.

Err ... If a "next gen" file system cannot handle random I/O like the current gen .... then that file system is NOT next gen ... it's a step backwards.

It's like saying ... hey! The 2015 cars will have 3 wheels with the 4th wheel missing... but that doesn't matter! It has an ipod dock on the dashboard .... But this is a next gen car because it has a new feature.

who thought it was cute to put that little 'i' in the acronyms for KB, MB, GB, TB, PB..?

The little i actually means something. A KB is 1024 bytes. A KiB is 1000 bytes. Similarly, a MB is 1024KB, whereas a MiB is 1000 KiB.

Blame hard drive manufacturers for this - they started using powers-of-ten approximations when rating their drives' storage capacities a long, long time ago - presumably because they made for bigger numbers, and consumers will buy the thing with the bigger number on it.

The bigger the units get, the bigger the disparity. A TiB is only nine-tenths of a "real terabyte", and an EiB is only 85% of a "real exabyte". I hate that we have to talk about "tebibytes" and "exbibytes" too, but... it really makes a difference.

Other way around, dude.

Now, I hate the KiB-style notation and think we should just use KB—the fact that it is bytes tips you off that it should be powers of 2. The HD manufacturers should be sued or regulated until they either use powers of 2 or have to use special units. But that is just what I would like.

What actually happens is that KiB-style means powers of 2, while KB means powers of 10 (1 KB = 1000 B). They did this because "kilo" literally means 1000, and means it throughout the SI system. And because they are morons who can't pick up on "subtle" clues, like the word "byte".

One of the more common complaints about next-generation filesystems is that if you hammer them extremely hard with a never-ending stream of really punishing random I/O, they will eventually fall down harder than simpler conventional filesystems.

Err ... If a "next gen" file system cannot handle random I/O like the current gen .... then that file system is NOT next gen ... it's a step backwards.

It's like saying ... hey! The 2015 cars will have 3 wheels with the 4th wheel missing... but that doesn't matter! It has an ipod dock on the dashboard .... But this is a next gen car because it has a new feature.

By that argument cache coherence protocols have also gotten worse over the last few years.

If it doesn't matter whether the result is correct it's very easy to make it arbitrarily fast.

who thought it was cute to put that little 'i' in the acronyms for KB, MB, GB, TB, PB..?

The little i actually means something. A KB is 1024 bytes. A KiB is 1000 bytes. Similarly, a MB is 1024KB, whereas a MiB is 1000 KiB.

Blame hard drive manufacturers for this - they started using powers-of-ten approximations when rating their drives' storage capacities a long, long time ago - presumably because they made for bigger numbers, and consumers will buy the thing with the bigger number on it.

The bigger the units get, the bigger the disparity. A TiB is only nine-tenths of a "real terabyte", and an EiB is only 85% of a "real exabyte". I hate that we have to talk about "tebibytes" and "exbibytes" too, but... it really makes a difference.

Other way around, dude.

Now, I hate the KiB-style notation and think we should just use KB—the fact that it is bytes tips you off that it should be powers of 2. The HD manufacturers should be sued or regulated until they either use powers of 2 or have to use special units. But that is just what I would like.

What actually happens is that KiB-style means powers of 2, while KB means powers of 10 (1 KB = 1000 B). They did this because "kilo" literally means 1000, and means it throughout the SI system. And because they are morons who can't pick up on "subtle" clues, like the word "byte".

I did the math on this some years ago, and you lose roughly 2% more space for every factor up on your hard drive (from megabytes to gigabytes to terabytes), so this is only going to get worse until manufacturers stop using base ten to do their storage capacity calculations and use the actual 1 KB = 1024 bytes factor.
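
A few lines of Python bear this out—each step up the prefix ladder multiplies the gap by another 1000/1024, or about 2.3%:

```python
# Each step up the prefix ladder multiplies the decimal/binary gap by
# another 1000/1024 -- about 2.3% per step, compounding.
prefixes = ["KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
for n, name in enumerate(prefixes, start=1):
    ratio = 1000 ** n / 1024 ** n
    print(f"1 {name[0]}B = {ratio:.3f} {name}  (gap: {100 * (1 - ratio):.1f}%)")
# 1 KB = 0.977 KiB (2.3%) ... 1 TB = 0.909 TiB (9.1%) ... 1 EB = 0.867 EiB (13.3%)
```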

you never hear anyone complaining about corrupt/inconsistent zfs filesystems, whereas with Btrfs, people are screaming for an fsck.

People screamed for an fsck in ZFS for years, and still are (see http://openindiana.org/pipermail/openin ... 08822.html for a recent example). Not because it needs it, but because that's what people are used to, and because when people DO end up losing data, they think "I NEED AN FSCK!", not realizing that the kinds of errors an fsck tool could fix just don't happen in the first place.

FWIW, I'd take that list with a pretty huge grain of salt. It's a few years out of date, and as far as I can tell, its author never actually used btrfs, he just went on a hunt to find things to justify not using it.

I've just seen lots of mailing list woes where someone had to power-cycle their Btrfs server and ended up with an unmountable FS. My point was not supposed to be about fsck; it's that when it comes to actually being consistent on-disk, ZFS is way ahead. The fsck screaming was simply in response to the unusable filesystems, whether or not it would have actually helped. To be fair, Btrfs is still a major work in progress.

Yes it does; it uses block checksumming at the disk sector level. FC/SAS drives are formatted using 520-byte sectors, and every 4096-byte WAFL block has a 64-byte checksum. SATA drives do not support 520-byte sectors, so WAFL instead uses eight 512-byte sectors for data and writes a 64-byte checksum to a ninth, leaving 448 bytes empty—which results in SATA drives losing approximately 11% of their rated capacity.

Right, but WAFL also has certain advantages over btrfs or for that matter any other bolt-on FS—namely, it's run on storage arrays and is fully block-aware, and can exercise total block-level control over what goes where.

The little i actually means something. A KB is 1024 bytes. A KiB is 1000 bytes. Similarly, a MB is 1024KB, whereas a MiB is 1000 KiB.

You've got that backwards. KiB means 1024 bytes, MiB means 1024KiB.

KB and MB are short for kilobyte and megabyte. Kilo and mega are SI prefixes that mean 1000 and 1000000. However historically these terms have been used both with their proper SI meaning and as multiples of 1024.

Support for using the SI prefixes exclusively in their correct sense has been increasing over the past decade. I think that's a good thing.