Wednesday, October 10, 2007

I don't like the point-by-point quote and response format — it's way too much like an old-school Usenet flamewar. So I will simply try to hit the high points of their arguments.

Where we agree

ZFS is not ready to deploy to the entire Mac OS X user base today. There's still some work to be done.

ZFS isn't necessary for most of today's Macintosh computers. If you have been using your Mac with no storage-related problems, then you can keep on using it that way. Perform regular backups and you'll be just fine.

It would be an absolutely terrible idea to take people's perfectly working HFS+ installations on existing computers and forcibly convert them to ZFS, chuckling evilly all the while. Not quite sure where that strawman came from.

ZFS fatzaps are expensive for small files. If it were true that 20% of the files in a Mac OS X installation required a fatzap (pdf link to ZFS-on-disk specification), that would indeed be unnecessarily wasteful.

A typical Mac OS X 10.4.x installation has on the order of 600,000 files.

I think that's about it. But of course there are a number of places where we disagree too.

ZFS would be awfully nice for a small segment of the Mac OS X user base if it were ready today.

If you spend any amount of time managing storage — if drives have gone bad on you, if you have ever run out of space on a desktop system and needed to add a drive (or two), if you have a RAID array — then you are the sort of user that could see some immediate benefit.

But of course as we already agreed, it's not ready today. You haven't been "cheated" and I'm sure you don't feel that way. But feel free to look forward to it: I sure am.

ZFS — or something with all the features of ZFS — will be more than nice, it will be necessary for tomorrow's Macintosh computers.

Both storage sizes and consumer consumption of storage grow exponentially. I tried to make this point last time, but MWJ seems to have misunderstood and accused me of misquoting. Let's try again.

In 1997, 20GB of storage meant a server RAID array. Ten years later, in 2007, 20GB of storage is considered "not enough" by most people. Across my entire household I have drives larger than that in my computer, in my TiVo, in my PlayStation 3, and even in my iPod. Now let's extrapolate that into the future.

In 2007, 20TB of storage means a server RAID array. Ten years from now, in 2017, 20TB of storage will similarly be considered "not enough". MWJ scoffed at ZFS because it's really pretty good at the problems of large storage. But you know what? A solution to managing that much data will need to be in place in Mac OS X well before 20TB drives become the norm. Better hope someone's working on it today.
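For the arithmetic-inclined, here's that extrapolation in a few lines of Python. The 1997 and 2007 figures are the ones above; the per-year growth factor is derived from them, not measured:

```python
# 20 GB (server RAID, 1997) -> 20 TB (server RAID, 2007) is a
# 1000x increase over ten years.
start_gb = 20.0
end_gb = 20_000.0                          # 20 TB, in GB
years = 10
growth = (end_gb / start_gb) ** (1 / years)
print(round(growth, 2))                    # ~2.0: capacity roughly doubles yearly
# The same curve puts ~20 PB arrays another decade out:
print(start_gb * growth ** 20 / 1e6)       # ~20.0 million GB, i.e. ~20 PB
```

Obviously real growth isn't that smooth, but the trend line is the point.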

Meanwhile — and this is what scares the pants off me — the reliability numbers for hard drives have improved much more slowly than capacity.

Here's a fairly typical Seagate drive with a capacity of ~150GB = ~1.2 x 10^12 bits. The recoverable error rate is listed as 10 bits per 10^12 bits. Let's put those numbers together. That means that if you read the entire surface of the disk, you'll typically get twelve bits back that are wrong and which a retry could have fixed. (Updated Oct 11 2007: In the comments, Anton corrected me: I should've used the unrecoverable error rate here, not the recoverable error rate. The net result is that in ideal operating conditions bit errors occur over 100x less frequently than I originally suggested. However, it's still not zero, and it's still a looming problem when you scale it across (installed base) x (storage consumption) x (time). See the comment thread.)
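If you want to check my math, here's the calculation spelled out. The 10^-16 unrecoverable rate is the figure Anton cites in the comments below, so treat it as illustrative rather than gospel:

```python
capacity_bits = 1.2e12             # the ~150 GB drive, in bits
recoverable_rate = 10 / 1e12       # spec sheet: 10 recoverable errors per 10^12 bits
print(capacity_bits * recoverable_rate)    # ~12 retry-fixable errors per full read

# Corrected figure: the unrecoverable error rate is what actually matters,
# and it's orders of magnitude lower -- rare, but not zero.
unrecoverable_rate = 1e-16
print(capacity_bits * unrecoverable_rate)  # ~0.00012 per full read
```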

Yes, really. Did you catch the implications of that? Silent single-bit errors are happening today. They happen much more often at high-end capacities and utilizations, and we often get lucky because some types of data (video, audio, etc) are resistant to that kind of single-bit error. But today's high end is tomorrow's medium end, and the day after tomorrow's low end. This problem is only going to get worse.

Worse, bit errors are cumulative. If you read and get a bit error, you might wind up writing it back out to disk too. Oops! Now that bit error just went from transient to permanent.

Apple using ZFS rather than writing their own is a smart choice.

As I hope I made abundantly clear in the last post, extending HFS+ to the future that we can see looming is just not an option — its structure is simply too far removed from these problems. It's really just not worth it. It's pretty awesome that the original HFS design scaled as far as it did: how many people can come up with a 20-year filesystem? But you have to know when to throw in the towel.

So if you accept that the things I described above are real, looming problems, then Apple really does need a filesystem with at least several of the more important attributes of ZFS.

The choices at this point are essentially twofold: (1) start completely from scratch, or (2) use ZFS. There's really no point in starting over. ZFS has a usable license and has been under development for at least five years by now. By the time you started over and burned five years on catching up it would be too late.

And I really do want to reiterate that the shared community of engineers from Apple, Sun, and FreeBSD working on ZFS is a real and measurable benefit. I've heard as much from friends in CoreOS. I can't understand the hostility to this very clear and obvious fact. It's as if Apple suddenly doubled or tripled the number of filesystem engineers it has available, snagging some really brilliant guys at the top of their profession in the process, and then multiplied its testing force by a factor of 10.

(To respond to a query voiced by MWJ, HFS+ never gathered that community when it was open-sourced because the design was already quite old at that point. It frankly didn't have anything new and exciting to offer, and it was saddled with performance problems and historical compromises of various kinds, so very few people were interested in it.)

ZFS fatzaps are unlikely to be a significant problem.

This gets a bit technical. Please skip this section if you don't care about this level of detail.

MWJ really pounded on this one. That was a bit weird to me, since it seemed to be suggesting that Apple would not expend any engineering effort on solving any obvious glaring problems with ZFS before releasing it. That's not the Apple I know.

But okay, let's suppose that we're stuck with ZFS and Mac OS X both frozen as they stand today. Let's try to make an a priori prediction of the actual cost of ZFS fatzaps on a typical Mac OS X system.

Classic HFS attributes (FinderInfo, ExtendedFinderInfo, etc) are largely unnecessary and unused today because the Finder uses .DS_Store files instead. In the few cases where these attributes are set and used by legacy code, they should fit easily in a small number of microzaps.

Extended attributes may create fatzaps. Today it seems like extended attributes are typically used on large files: disk images, digital photos, etc. This may provoke squawking from the peanut gallery, but once a file is above a certain size — roughly a couple of megabytes — using an extra 128KiB is negligible. If you have a 4MiB file and you add 128KiB to track its attributes, big deal: you've added 3%. It's not nothing, but it's hardly a significant problem.

Another likely source of fatzaps in ZFS on Mac OS X is the resource fork. But with Classic gone, new Macs ship with virtually no resource forks on disk. There are none in the BSD subsystem. There are a handful in /System and /Library, mostly fonts. The biggest culprits are large old applications like Quicken and Microsoft Office. A quick measurement on my heavily-used one-year-old laptop shows that I have exactly 1877 resource forks out of 722210 files — that's about 0.26%, not 20%.

(Fun fact: The space that would be consumed by fatzap headers for these resource files comes out to just 235 MiB, or roughly six and a half Keyboard Software Updates. Again: not nothing, but hardly a crisis to scream about.)
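If you'd like to verify that arithmetic, it's one-liner territory:

```python
KIB = 1024
fatzap_header = 128 * KIB                   # one fatzap header, in bytes

# A 4 MiB file gaining a 128 KiB fatzap header: ~3% overhead.
print(fatzap_header / (4 * 1024 * KIB))     # 0.03125

# 1877 resource forks, each costing one fatzap header:
total_mib = 1877 * fatzap_header / (KIB * KIB)
print(round(total_mib, 1))                  # ~234.6 MiB

# And the fraction of files affected:
print(1877 / 722210)                        # ~0.0026, i.e. about a quarter percent
```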

Want to measure it yourself? Amit Singh's excellent hfsdebug utility will show you a quick summary. Just run "sudo hfsdebug -s" and look at the numbers for "files" and "non-zero resource forks". Or try "sudo hfsdebug -b attributes -l any | less" to examine the files which have extended attributes on your disk.

ZFS snapshots don't have to be wasteful

The cheesesteak analogy was cute. But rather than imagining that snapshots just eat and eat and eat storage until you choke in a greasy pile of death, it would help if we all understood how hard drive storage is actually used in practice, and how ZFS can work with that.

There are three major classes of stored data.

Static data is data that you want to keep and almost never modify. This is your archive. Photographs, music, digital video, applications, email, etc. Archives are additive: unless you really run out of room, you rarely delete the old — you only add new stuff. You want the contents safe and immediately accessible, but they are essentially unchanging.

Snapshotting static data is close enough to free that you won't notice: the only cost is the basic cost of the snapshot. No extraneous data copies are ever created, because you never modify or delete this stuff anyway.

Dynamic data is data that you want to keep, but are modifying with some frequency. This is whatever you are working on at the moment. It might be writing a novel, working in Photoshop, or writing code: in all cases you keep saving new versions over the old.

Snapshotting dynamic data is more expensive, because if you do it too much without recycling your old snapshots then you can build up a large backlog.

Transient data is data that should not be persistent at all. These are your temporary files: local caches, scratch files, compiler object files, downloaded zip files or disk images, etc. These may be created, modified, or deleted at any moment.

Snapshotting transient data is generally a bad idea — by definition you don't care that much about it and you'd prefer it to be deleted immediately.

Got all that? Okay. Now I need to make a couple of points.

First, I assert that virtually all of the data on personal computer hard drives is static most of the time. Think about that. The operating system is static the whole time you are using it, until you install a system update. (And even then, usually just a few hundred megabytes change out of several gigabytes.) Your /Applications folder is static. Your music is static. And so on. Usually a few percent of your data is dynamic, and a few more percent is transient. But in most cases well over 95% is static. (Exceptions are easy to come up with: Sometimes you generate a large amount of transient data while building a disk image in iDVD or importing DV footage. That can shift the ratio below 95%. But once that task is complete you're back to the original ratio.)

Second, the biggest distinction that matters when snapshotting is separating persistent data from transient data. Taking snapshots of transient data is what will waste disk space in a hurry. Taking snapshots of dynamic data as a local backup is often valuable enough that it's okay to burn the small amount of disk space that it takes, because remember: that's the actual data that you're actively working on. And as we already mentioned, snapshots of static data are free.

Now here's where it gets interesting.

With ZFS, snapshots work on the filesystem level. Because it no longer uses the "big floppy" model of storage, new filesystems are very cheap to create. (They are almost as lightweight as directories, and often used to replace them.) So let's create one or more special filesystems just for transient data and exclude them from our regular snapshot process. In fact on Mac OS X that's easy: we have well-defined directories for transient data: ~/Library/Caches, /tmp, and so on. Link those all off to one or more transient filesystems and they will never wind up in a snapshot of the important stuff. I wouldn't expect users to do this for themselves, of course — but it could certainly be set up that way automatically by Apple.

Once the transient data is out of the picture, our snapshots will consist of 95% or more static data — which is not copied in any way — and a tiny percentage of dynamic data. And remember, the dynamic data is not even copied unless and until it changes. The net effect is very similar to doing an incremental backup of exactly and only the files you are working on. This is essentially a perfect local backup: no duplication except where it's actually needed.

Will you want to allow snapshots to live forever? Of course not. One reasonable model for taking backup snapshots might be to remember 12 hourly snapshots, 7 daily snapshots, and 4 weekly snapshots. If you are getting tight on storage the system could take new snapshots less frequently and expire them more aggressively. Remember: when nothing is changing the snapshots don't take up any space.
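To make that retention model concrete, here's a rough sketch of the expiry policy in Python. The function name and the age-based approach are mine, purely for illustration — a real implementation would work on snapshot timestamps and actual zfs destroy calls, not a list of ages:

```python
def snapshots_to_keep(ages_hours, hourly=12, daily=7, weekly=4):
    """Given snapshot ages in hours, pick which survive a
    12-hourly / 7-daily / 4-weekly retention policy."""
    kept, seen_days, seen_weeks = [], set(), set()
    for age in sorted(ages_hours):
        if age < hourly:
            kept.append(age)                 # recent: keep every hourly snapshot
        elif age < daily * 24:
            day = int(age // 24)
            if day not in seen_days:         # keep one snapshot per day
                seen_days.add(day)
                kept.append(age)
        elif age < weekly * 7 * 24:
            week = int(age // (24 * 7))
            if week not in seen_weeks:       # keep one snapshot per week
                seen_weeks.add(week)
                kept.append(age)
    return kept                              # everything else expires

# A day's worth of hourly snapshots: the 12 newest survive as hourlies,
# plus one representative for the rest of "today".
print(len(snapshots_to_keep(list(range(24)))))   # 13
```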

Wrap-up: Listen to the smart guys

Some very smart people at Sun started the ball rolling by putting an awful lot of thought into the future of storage, and they came up with ZFS.

After they announced it and started talking about it, other brilliant people at Apple (and FreeBSD, and NetBSD) paid attention to what they were doing. And they listened, and thought about it, and looked at the code, and wound up coming around to the side of ZFS as well.

If you think I'm smart, just know that I'm in awe of some of the guys who've been involved with this project.

If you think I'm stupid, why, I look forward to hearing from you in the comments.

41 comments:

Sladuuch said...
One point MWJ brought up that I haven't seen addressed anywhere is how portable and removable hard drives will be handled. Pooling disks together to create logical volumes is awesome until you need to remove one of those drives--or, more likely, you have to disconnect a laptop from the drive tethered to your desk to take it somewhere. What then? Does the whole filesystem fail and die? Does the data located on that drive become inaccessible? Or is there some way to divide the physical drives into different logical drives, essentially replicating the "big floppy" model?

Basically, (and correct me if I'm wrong) it seems silly to wax poetic about the fact that disks can be used as "building blocks" if they have to be able to be removed. Take enough blocks away from a tower and it will collapse. Since disks have to be removable, what does this say about ZFS? Will we have to lose one of ZFS's coolest user features to accommodate removable disks?

A system can have a basically unlimited number of pools. Thus, in a typical scenario, you'd use one pool for all the internal disk(s), and one for each external disk. When attaching a previously-unseen disk, the system might ask the user whether to create a new pool ("big floppy") or integrate it into an existing pool. It would presumably default to new-pool for externals, and perhaps to existing-pool for internal disks (in a Mac Pro, for example).

The MWJ rebuttal claims "without RAID-Z, ZFS can only tell you that the data is bad." This is not true.

ZFS has significant self-healing capabilities even when used on a single disk. Specifically, the filesystem's uberblock and all metadata blocks are replicated. ZFS also allows file data to be replicated via ditto blocks. While it is possible that every copy of a block could be corrupted, this is extremely unlikely.

But even if MWJ's claims were true, ZFS is still a huge leap over HFS+ because ZFS can detect silent data corruption. Most filesystems, even journaling ones, do a very bad job of this. The Advanced Systems Lab at U. Wisconsin–Madison published several papers on this topic. See specifically "Model-based Failure Analysis of Journaling File Systems" and "IRON File Systems". Those papers looked at Linux filesystems (ext3, reiserfs, and JFS) and found they offer almost no protection against silent corruption.

The same group did a study of the virtual memory systems on Linux and FreeBSD (see "Dependability Analysis of Virtual Memory Systems"). They found the same lack of detection for silent corruption. They were able to trick the VM into returning pages from one process to another when they were paged in from disk. Putting your swap on a ZVOL is one way to prevent this problem.

I was really struck that MacJournal's article sounded a lot like what everyone was saying when Apple was switching to NeXT's OS: a lot of scare-mongering about how it's the wrong fit, isn't needed, and how what we're used to now will be good enough in the future. ZFS is amazing, and if Apple can put a decent GUI on it we'd be fools to not want it. I can't wait for the linux community to get their act together and replace ext3/4 with it for my servers.

A few specific comments:

1) Replacing drives in pools: zpool replace and zpool attach/detach (with resilvering) should do most of what they want.

2) snapshot size issues are identical with Apple's TimeMachine, but there they are on a per-file basis, not a per-block basis. Either way you need a good GUI to decide which snapshots to delete, and having good file system support can only help.

3) Performance concerns are irrelevant. There are 6 orders of magnitude in difference between speeds for CPUs and disks (ns to ms), so the more time your CPU can spend figuring out how to be clever with the disk the better.

The other thing that is rarely mentioned is that snapshots via copy-on-write are MUCH more efficient than the file-based snapshots of Time Machine for large files. One place of growing importance these show up is virtual machines - all the more common with the Intel switch.

Every time you boot your Windows or Linux VM, the whole virtual disk (Vista's minimum is over 10GB) will be changed. With Time Machine, my understanding is all of that data will be backed up, even though 99% of the drive file is static. Under ZFS, only those few changed blocks will be backed up. Much more efficient and lightweight.
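The arithmetic behind that claim is simple to spell out. The 1% figure below is an assumption matching the "99% static" estimate above, not a measurement:

```python
vm_disk_gb = 10.0                 # a Vista-era guest's virtual disk file
changed_fraction = 0.01           # assume ~1% of blocks touched during a boot

per_file_backup = vm_disk_gb                       # file-based: recopy the whole image
per_block_backup = vm_disk_gb * changed_fraction   # COW: only the changed blocks

print(per_file_backup / per_block_backup)          # ~100x less data per backup cycle
```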

"Will you want to allow snapshots to live forever? Of course not. One reasonable model for taking backup snapshots might be to remember 12 hourly snapshots, 7 daily snapshots, and 4 weekly snapshots. If you are getting tight on storage the system could take new snapshots less frequently and expire them more aggressively. Remember — when nothing is changing the snapshots don't take up any space."

Amen, anonymous. I appreciate everything Mr. Thaler is doing to educate people, but I also think he's feeding the troll. Just the fact that he responded made MWJ think they were worth listening to, instead of just uninformed, sneering jackasses.

It's unfortunate people are copying Daring Fireball's style, as in the original MWJ post. Gruber can be a sneering jackass, but at least he's usually informed.

MWJ is relevant to the Macintosh community, including developers, so we can just leave that question at the door. If you're asking the question, then you aren't inside the beltway, and that's fine. But you can't exclude him because you've never heard of him. Matt's the real deal. (I am biased.)

Question MWJ raised not answered in this nice response: How do you remove drives from ZFS?

The problem with this debate is that ZFS is over- respected ("it will solve every problem") and HFS+J is way under-respected ("it's crap").

Also, HFS+ is not 20 years old; it will only be 10 next year. Conflating HFS+ and HFS is not productive. They have similar names but are very different: HFS allows 31-character Mac Roman names, while HFS+ allows 255-character Unicode names.

The ZFS marketing machine has done a wonderful job of confusing error rates with *undetected* error rates. But there's a very big difference!

I won't go into a tutorial on error-correcting codes; if readers aren't familiar with them, there's plenty of good (if overly simplified) information available on the web. Remember these two facts, though:

1. All storage is inherently error-prone; hence nearly all storage in computing systems is protected by some sort of error-correction code. (Memory is the main exception on desktop systems; hopefully we'll see mainstream ECC within the next few years.)

2. Error-correction codes are very good at detecting errors, and very good at not mis-correcting errors.

The recoverable error rate is specified as 10^-11. This says that if you read the whole disk, on average the disk will correct 12 bit errors during the process. There is no corruption here, and no data loss. There's most likely only one block affected, since bit errors on magnetic media tend to run together (adjacent magnetic domains). Unless you're carefully monitoring statistics, you will never know that there were any errors ... because the disk fixed them for you!

The non-recoverable error rate is specified as 10^-16. This says that if you read the whole disk 8,333 times, you're likely to encounter one bad block. Not a block with incorrect data, but a block for which the disk had detected an error and reported it to the computer. If you're using a standard file system, with no mirroring, you've lost data at this point. A simple mirror would protect you, of course. (How big is 10^16 bits? If you read constantly at 50 MB/second, then after 9.5 months you would expect to encounter your first bad block.)
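Anton's numbers check out; here's the same arithmetic in Python for anyone who wants to fiddle with the assumptions:

```python
uer = 1e-16                       # unrecoverable errors per bit read
disk_bits = 1.2e12                # the ~150 GB drive again

# How many full-disk reads before you expect one bad block?
full_reads = 1 / (uer * disk_bits)
print(round(full_reads))          # ~8333

# How long is 10^16 bits at a constant 50 MB/s?
read_rate_bits = 50e6 * 8                          # bits per second
months = 1e16 / read_rate_bits / (3600 * 24 * 30.44)
print(round(months, 1))                            # ~9.5 months
```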

What the checksums in ZFS (and WAFL, incidentally, and some other file systems and database systems) protect you from is the next category of errors, those which are mis-corrected. How frequent are those? Less than 10^-16. Probably much less, on the order of 10^-18 or 10^-20, depending on the details of which error-correcting code the drive uses. At 50 MB/second, you'll see one of those after somewhere between 79 and 7900 years.

What does this say? That the ZFS checksums really are protection against one of two cases. Either you have so much data that you're likely to encounter one of these very rare events (and there are a few petabyte data installations in the world today), or you have an unreliable transport mechanism. Fibrechannel, modern SCSI, SAS and SATA use fairly reliable checksumming mechanisms to protect data in transit. Parallel ATA and older SCSI systems do not (they use parity instead). It's an option with iSCSI.

====

So in summary, drives are not out there randomly returning bad data. When bad data does wind up on a disk on a modern laptop or desktop system, it's most likely because it was corrupted in memory -- where it stayed the longest without a strong error-correction code, and where ZFS won't help at all.

Don't get me wrong. Testing for data integrity is a good thing, especially for very large storage subsystems. That's why many of these systems have added it over the years! In fact, RAID 6 has two major advantages: the ability to survive a double-disk failure, and the ability to detect and correct silent data corruption. And tape and optical archiving systems have used multiple file copies, together with checksums, for years.

But most of us will never see the benefit of this. Think of it as an insurance policy against your house being struck by a meteor. It's potentially a very bad thing (though in most cases a small roof repair will be all you need), but the chances of it happening to you are very, very low.

Anton's comments are very interesting and apropos. The checksumming is harder to justify, and would be a real issue if it consumed unjustifiable amounts of processor. We can see about that with a real implementation in OS X later.

In the meantime, the userspace advantages for ZFS (snapshots, pool management, fully reliable writes and never overwriting live data) are much more interesting. The fact is that we live in a world of changing disk requirements, power outages and unforeseen circumstances. These are VERY important features to laptop users and up.

One thing that's been puzzling me is where snapshots are stored. I mean, the basic functionality from a top down view is the same as Time Machine. How they work underneath is different but they accomplish the same task. The thing is, Time Machine writes to another volume. Does ZFS already support the writing of snapshots to another volume specified by the user or is it something that could in theory be added without much fuss?

Oh, OK. So I guess MWJ is like another Wil Shipley, reclining on a throne of undeserved celebrity status thereby allowing his doting litter of starving, blind fanboys a chance to poke their heads within the ever-expanding confines of his belt(way) for a chance to suckle at his life-giving he-teat. What an exciting blogosphere of absolute relevance and butt-ugly HTML. I only wish there were some way I could give them money for all this exciting content.

Anton: You're right, I was wrong. I was using the RER (recoverable error rate) instead of the UER (unrecoverable error rate). My apologies — I don't normally get quite down into that level of detail. I'll update the post above to correct that.

However, despite the orders of magnitude between them, the UER is plenty bad enough at the scale of deployed installations that we're talking about. Some quick math:

Let's suppose that the installed base of reasonably modern Macs is about 20 million, or 2x10^7. That's about three years' worth of sales.

The drives used in your Mac have unrecoverable error rates of anywhere from 10^-14 to 10^-16. Overheated enclosures can make that worse, as some earlier commenters mentioned. But let's stick with maybe 10^-15.

Daily hard disk reads are harder to quantify. There's a huge variance, but let's take a wild educated guess and ballpark the daily reading on a single average Mac at maybe 1.25GB per day = 10^10 bits. I actually think that's AT LEAST an order of magnitude too low — it's only about a minute or two of disk thrash if it were all run together — but I'm trying to be persuasive. (Not conservative: the conservative position would be to overestimate both I/O transfer amounts and UERs.)

Now multiply. The amount of data transferred on those Macs comes out to roughly 2 x 10^17 bits per day. Multiply again by 10^-15 and we find roughly 200 Macs getting a UER bit error each day. Over the course of a year, over sixty thousand such errors will occur to just the Macs that are less than three years old.

This rate is pretty darn low. But it's not zero. (And if you have an iMac which is more prone to heat problems, or if you live somewhere hot or humid, or if you do way more I/O than the average bear ... well, sucks to be you.) Worse, this problem more than doubles every year — storage consumption doubles each year; the number of Macs keeps increasing; some up-and-coming types of media like flash have higher error rates; and so on. A solution really is needed for the future.
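Spelled out in code, with all three assumptions visible (installed base, daily I/O, and UER are all the rough estimates from above):

```python
macs = 2e7                 # ~20 million reasonably modern Macs
bits_per_day = 1e10        # ~1.25 GB of disk reads per Mac per day
uer = 1e-15                # middle-of-the-road unrecoverable error rate

errors_per_day = macs * bits_per_day * uer
print(errors_per_day)              # ~200 Macs hit a UER bit error each day
print(errors_per_day * 365)        # ~73,000 per year across the installed base
```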

Pilky: ZFS snapshots are stored in-place. Let me see if I can come up with an analogy.

Imagine one of your documents as a physical document, on a piece of paper. There are two ways to modify that document: "rewrite" — going in with white-out and retyping over it — or "copy-on-write" — overlaying a sheet of transparency and making your changes on that. (Like the kind used on projectors in class. Do they still do that?) Taking a snapshot means simply placing a new, clean sheet of transparency over the existing stack.

If you rewrite, whatever you erase is gone. If you want history, you'll need to continually copy chunks (or the whole document) somewhere else.

If you use the copy-on-write method, your original data is still there: it hasn't really gone away. Something like Time Machine could simply peel back the layers of transparencies. Off-site backups are still necessary, mind you, but snapshots would be able to shoulder the burden of many of the common uses.

This buys you a bunch of interesting capabilities: you can take the top one off and throw it away, reverting completely to a previous state. You can peek underneath to see what the document looked like at any previous snapshot. You don't want an infinite stack of transparencies, so you can merge your document with the bottommost transparency, committing its changes permanently and recycling the space used. You can make a copy of that transparency and store it somewhere else. You can even branch your document by starting a new, parallel layer of transparencies for it. (Okay, that last one stretches the physical analogy a bit.)
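The transparency stack maps surprisingly directly onto code. Here's a toy model — plain Python dicts standing in for disk blocks, nothing remotely like real ZFS internals:

```python
class COWDocument:
    """A toy copy-on-write 'stack of transparencies'."""
    def __init__(self, base):
        self.layers = [dict(base)]      # layer 0 is the original document

    def snapshot(self):
        self.layers.append({})          # lay a clean transparency on top

    def write(self, key, value):
        self.layers[-1][key] = value    # changes land on the top layer only

    def read(self, key):
        # Reads fall through to the newest layer that has the block.
        for layer in reversed(self.layers):
            if key in layer:
                return layer[key]

    def rollback(self):
        self.layers.pop()               # peel off the top transparency

doc = COWDocument({"p1": "draft"})
doc.snapshot()
doc.write("p1", "final")
print(doc.read("p1"))     # final
doc.rollback()
print(doc.read("p1"))     # draft -- the original was never touched
```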

Now, if you grok how that could be done with a document, let me blow your mind.

Your entire filesystem is a document. Documents are data, and filesystems are data. Anything you can do with that document, you can do with a filesystem. That's basically how ZFS treats it. Your only constraint is that the total space used by (original data) + (data on transparencies) needs to stay less than your total pool capacity.

It's pretty freakin' cool. Copy-on-write in the filesystem is really an enabling technology: there is a lot of great stuff that it makes possible. Most of us don't miss it because we've never had it. But once you have it and you start using things built on those capabilities, you're never going to want to go back.

Re: recoverable vs. unrecoverable vs. mis-recovered errors. One thing you must remember regarding the checksums in ZFS is that they buy you "memory-to-memory" protection. Data errors do not only come from the actual reading of data from the disk, they are actually more likely to come due to transient errors in the controllers and EMI in the cables. I know of more than one case where people started using ZFS and discovered that they had flaky controllers that were delivering (previously) silent data corruption.

So if we think you're smart, are we not allowed to post? I totally agree that HFS is not up to the task 10 years from now. It's starting to look creaky now.

Meanwhile, I would hope you count ~/Downloads, which comes with Leopard, as transient. We need a symbol for the tree cloud (/Library, /System/Library, ~/Library) as I presume the "and so on" meant. %/Caches ?

I think Apple needs snapshots to be able to support install rollback. Currently, if you have to rollback and they don't have a downgrade installer, you have to reinstall with preserve. Not nearly as bad as Windoze without System Restore, but it could be a lot better.

So, I've read this blogwar but I don't think anyone has covered - at a high level - how ZFS works (or at least how you would envision utilizing it). The concept of a filesystem pool is quite staggering (and not at all unlike a SAN). I could envision having a pool for the OS, temp/transients, a special pool for VM, and Users. All would auto expand, of course, with optional manual management. I would hope that ZFS could prioritize storage location. I would want my VM pool on the outer edges of the disk followed by the current hot clustering or whatever brilliant idea comes along. I could also see API support for applications to offer never-expiring undo and maybe other special features. I'd love to hear about options ZFS could support having copy on write at the FS level.

There's no reason we couldn't set up an HFS disk for video editing, too.

MWJ has been around for a long time and they bill themselves as geekspeak translators. They should be quite knowledgeable, so it makes for an interesting debate. Everyone wins as we are learning about what will likely replace HFS+.

HFS+ is more than just filenames. It's still a follow-on generation of HFS. Speaking of filenames, I can't imagine that Apple could take away 128 chars of filename space without pissing off someone.

Keep in mind that currently, Apple is only shipping ZFS (hopefully this month) with read-only ZFS support. I hope that changes with a dot-rev.

Even though disk sizes march larger and larger, our voracious appetite for space leads me to think more and more users will want to have multiple drives. RAID isn't pretty (1+0 seems the best but still wasteful; 5 or 6 could be good but needs hardware (Apple) support). So, it behooves us to have another option soon too.

Although I agree with the vast majority of what you say (and as far as I am concerned ZFS can't come soon enough) I think it's slightly strange to speak of Apple and Sun being forward-looking whilst simultaneously making the case for fatzap not being a problem on account of there being so few extended attributes in use on Mac OS X today. I would like to see richer metadata in future. Who is to say how many xattrs (at 128KiB each) I will want to attach to my photos in future? 128KiB might become vanishingly small in my storage pool, but why should design decisions in the filesystem hold up the hyperminiaturisation of my portable devices? As for .DS_Store files, they should be retired ASAP (half the stuff stored in them should be per-user anyway).

Keep up the good work though -- I was pretty incensed by that MWJ article too!

Pecos Bill: How ZFS might work in practice seems like a whole new blog post, which I'll make some time for in the future. Re 256->128 filenames, it doesn't seem like such a big deal to me; even "long" filenames typically tend to peak somewhere around 64 chars because there's a limit to what can be displayed nicely in the UI. (Have you seen a real 128-char filename? It's awfully large.) Big old apps actually tend to use an OLDER standard like 8.3 or 31 chars.

Honestly, Apple can't change anything without pissing off someone. The question is whether it's worth it. :-) It's not as if HFS+ is ever going away; there will just be a new kid in town.

Hamish: There are two things to consider: ZFS can change, and Mac OS X can change. The ZFS on-disk format has been static for a while, but it may still be revved (e.g. for case-insensitivity). If it is, there's a lot of room for a mid-range between 8-byte microzaps and 128KiB fatzaps. But even if that doesn't happen, the main point is that user data — where the bulk of xattrs will probably live — is a tiny, tiny fraction of the 600,000+ files that exist on your disk.

Also, fatzaps are a 128KiB header, not 128KiB per attribute. Attributes are stored as chains of zap_leaf_array structs inside one or more normal, variable-size ZFS blocks.
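To make the space trade-off concrete, here's a rough back-of-the-envelope sketch in Python. The 50-character name and 8-byte value limits for microzap entries come from the ZFS on-disk specification; the 20% figure and 600,000-file count are just the numbers debated above, so treat the output as illustrative rather than authoritative.

```python
# Rough model of when a ZFS attribute directory needs a fatzap,
# based on the microzap limits in the on-disk specification:
# a microzap entry holds a name of up to 50 characters and a
# single 64-bit (8-byte) value; anything bigger forces a fatzap.

MICROZAP_NAME_MAX = 50      # MZAP_NAME_LEN from the spec
MICROZAP_VALUE_MAX = 8      # one uint64_t per entry
FATZAP_HEADER = 128 * 1024  # 128 KiB fatzap header

def needs_fatzap(attrs):
    """attrs: dict mapping attribute name -> value size in bytes."""
    return any(len(name) > MICROZAP_NAME_MAX or size > MICROZAP_VALUE_MAX
               for name, size in attrs.items())

# If 20% of a 600,000-file install really required a fatzap,
# the 128 KiB headers alone would cost:
files = 600_000
fatzap_fraction = 0.20
overhead_gib = files * fatzap_fraction * FATZAP_HEADER / 2**30
print(f"{overhead_gib:.1f} GiB of fatzap headers")  # about 14.6 GiB
```

Which is exactly why the "20% of files" premise matters so much: if the real fraction of fatzap-bearing files is tiny, so is the overhead.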

One place where zfs would be REALLY handy, even for consumers, would be in the AirPort disk sharing combined with Leopard's Time Machine.

If you're automatically backing up to an AirPort-shared volume, it would be awfully handy to be able to expand the pool by simply adding another USB drive to a hub, followed by some UI clicking in the AirPort admin app.

"Link those all off to one or more transient filesystems and they will never wind up in a snapshot of the important stuff."

Tbh I don't like the thought of limiting selected folders to "old filesystem" behaviour. Those folders might particularly benefit from flexible volume sizes (for extensive swapping, huge downloads, ...). Not to mention the surprising system behaviour once your swap volume runs out of space ("but I added another drive!").

Jon: Nope, you can easily use any filesystem anywhere. ZFS is just one more filesystem that OSX supports. Format a drive as HFS+ and it would work just fine.

There is some benefit to having external drives formatted as standalone ZFS pools, though. External drives for laptops are (a) usually cheaper drives and (b) get a lot more G-shock than internals, so they tend to go bad more quickly.

martind: The transient filesystems could (and should) be ZFS, not HFS+. Then they get to live in the same storage pool. I just meant "transient" in the sense that they would be deliberately kept out of the normal snapshotting process.

Drew, great posts. Some more info: the relevant drive specs are SATA drives at 10^14. Those are what SOHO users buy. The costly enterprise drives have less capacity and better specs, so UREs are much less likely.

10^14 is one URE every 12.5 TB. This is why RAID 5 no longer works reliably with large SATA drives. A drive fails, and then you have a very good chance of a URE during the rebuild, at which point the RAID controller barfs and you go to whatever backup you have.
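The arithmetic behind that claim is easy to check. Assuming independent per-bit errors at the quoted 1-in-10^14 rate (a simplification — real errors cluster), the chance of hitting at least one URE during a full read works out like this:

```python
# Probability of at least one unrecoverable read error (URE)
# while reading back N terabytes, assuming an independent per-bit
# error rate of 1 in 10^14 (the quoted SATA spec). Real errors
# cluster, so treat this as a rough model only.

URE_RATE = 1e-14  # errors per bit read

def p_at_least_one_ure(terabytes):
    bits = terabytes * 1e12 * 8
    return 1 - (1 - URE_RATE) ** bits

# 10^14 bits is 12.5 TB, so reading 12.5 TB gives roughly 2-in-3 odds:
print(f"{p_at_least_one_ure(12.5):.0%}")  # ~63%
# Rebuilding a RAID 5 of five 750 GB drives means reading ~3 TB:
print(f"{p_at_least_one_ure(3.0):.0%}")   # ~21%
```

So even a modest array rebuild has a material chance of tripping over a URE, which is the scenario described above.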

Also, Anton's disk-centric view of the problem misses the point: there are LOTS of ways data gets corrupted to and from a platter.

I use ZFS in RAID-Z2 mode on my home server. 6x750GB drives, and an even higher level of redundancy than RAID1 as I can lose any two drives without losing data. It also clocks in at 160MBps sustained I/O throughput on Bonnie++
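For anyone wondering how that pool's capacity works out: RAID-Z2 stores two drives' worth of parity per stripe, so the usable space of an N-drive group is roughly N-2 drives (a simplified sketch — actual usable space is a bit less due to metadata overhead):

```python
# Usable capacity of a RAID-Z2 vdev: two drives' worth of each
# stripe goes to parity, so an N-drive group stores about N-2
# drives of data and tolerates any two simultaneous drive failures.

def raidz2_usable_gb(num_drives, drive_gb):
    assert num_drives >= 3  # need at least one data drive plus double parity
    return (num_drives - 2) * drive_gb

print(raidz2_usable_gb(6, 750))  # ~3000 GB usable from 6x750 GB
```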

I also have it running home directories on my company's network and users come all the time to praise the ability to go back to a previous day's snapshot to recover data lost due to a silly mistake.

The snapshot functionality alone is something you'd otherwise have to pay $50K+ for in a NetApp filer, not to mention RAID-Z2 or the 128-bit ECC.

The MWJ authors' comments are in the same ridiculousness league as the people who claimed 640KB was all the memory anyone could need, or that a non-protected kernel like OS 9 was acceptable. I find it very hard to take anything they say seriously. Sure, migration from HFS+ won't be trivial, both because of the work ISVs will have to do to qualify their apps and because bootable ZFS isn't ready for prime time yet (the OpenSolaris guys have it working in GRUB right now, so as soon as Apple desupports PowerPC, it won't be an issue any longer). I can't wait until HFS+ gets a well-deserved stake in the heart.

Well, I won't call you 'stupid', but your reading comprehension needs work. Anton's correction was not that you should be using the unrecoverable error rate rather than the recoverable error rate when discussing the 'silent errors' that ZFS protects against: it was that the pertinent rate was the undetected (or in his terms 'miscorrected') error rate, which is orders of magnitude lower than the unrecoverable error rate.

It's really difficult to make a strong case for ZFS in a single-disk system (including systems that use ancillary removable disks as a storage supplement rather than as an integral part of the main storage). Not that ZFS is a particularly bad choice there - it just doesn't offer any compelling attractions, whereas in a many-disk array its ease-of-management and intelligent self-repair capabilities really shine. Snapshots are certainly nice, but incremental backup has been available from many vendors for decades even when not thus assisted by the file system - and for really good moment-to-moment coverage I'd prefer continuous data protection.

ZFS is indeed 'cool', though one's admiration for its coolness should be leavened by the knowledge that NetApp's WAFL has been just about equally 'cool' for well over a decade (of course it's not available in client systems, nor open sourced). Rather than go on in more detail here, I'll just point to Robin's ZDNet blog entry (http://blogs.zdnet.com/storage/?p=202#comments) - my comments start about 2/3 of the way down.

Actually, I read correctly but disagreed with a part of Anton's statement. Sorry if I didn't clarify that; I was trying to keep the word count down in an already verbose reply.

I went to the trouble of checking with some of my old contacts at a hard disk manufacturer via IM. Their consensus was that many classes of "unrecoverable" errors can actually be recovered from in the host by re-reading the data — if you are persistent and if you can correctly identify good data (vs miscorrected data) when you get it. So despite the name I think the UER really is the hard disk error rate that matters.

An unrecoverable disk read error is not returned lightly: before returning it, the disk itself tries to re-read the data, using multiple strategies, sometimes for periods lasting many seconds. (Unless, that is, it's told that it's part of a RAID where another good copy of the data probably exists, in which case it should return the error promptly so that the other copy can be used without significant delay, after which the RAID should rewrite the original bad copy with good data, possibly redirected by the disk to an alternate location if necessary.)

Furthermore, you don't need ZFS to detect unrecoverable disk errors: you just need background 'scrubbing' of the disk (some disks even do this themselves, but the OS or FS can do it if they don't). The scrub detects the unrecoverable error before the data is actually needed (and it can even perform its own additional retries if it wants to), allowing the data to be repaired before some disaster may befall the remaining good copy (or applicable parity stripe).

So the error rate that applies in determining the value of ZFS's particular brand of parent-checksumming aids is indeed the undetected - or possibly miscorrected - error rate, not the unrecoverable error rate.

Re: Bill Todd - You're ignoring errors that happen once the data has left the drive. CERN's data reliability study found that the storage path from memory -> PCI -> controller -> disk is much less reliable than the disk itself. ZFS end-to-end checksumming detects errors that occur on this path.

Further, ZFS's checksumming also detects errors caused by dropped writes and misdirected reads and writes. Background scrubbing by a drive's controller won't catch these. Neither will an fsck, in some cases.

You clearly aren't referring to the CERN "Data Integrity" paper that I've got (Google using those terms to find it as the first hit), so a reference would be nice.

One section of this CERN paper identified 80% of the errors encountered in one test (which exercised the entire path from RAM to disk and back again - the same path protected by ZFS's in-parent checksums) as coming from a WD disk firmware bug. Another 10% were errors in 'sector- or page-sized regions' (origin not specified, but one might suspect host RAM given the reference to 'page', in which case ZFS's checks would not necessarily help there), and the final 10% were single-bit errors, some of which were also correlated with ECC errors in system RAM (and hence would not necessarily be helped by ZFS's checks either). This test found a total of 500 errors after about 2.5 PB was written to disk and then read back and checked, for an error rate of about 1 in every 5 TB (80% of which definitely occurred at the disk due to the WD firmware bug, though as noted the provenance of the remaining one-error-per-25-TB was less clear). Even including the WD firmware errors, this incidence is comparable to the disk manufacturer's expected unrecoverable error rate, hence the rest of the path clearly wasn't introducing additional errors of any greater significance.

A second test validating the consistency of RAID-5 data yielded an error rate of less than 4 in 10^15 bits: though the data may well only have been exposed to the RAID-controller-to-disk path rather than to the entire path between RAM and disk and back again, this figure is lower than the disk manufacturer's expected unrecoverable error rate, so no other error sources of greater significance seem to have been present in that (short) path.

CERN's final test of full-path data integrity found 22 errors in 8.7 TB of data, or about one in every 400 GB: this is indeed significantly higher than disk BERs would predict, but if 80% of those errors were due to the WD firmware bug (as the paper explicitly assumed) and some of the remaining 20% were due to RAM problems then the residue is once again pretty close to the expected rate due to unrecoverable disk errors and hence no additional path-related contributions of significance seem to have been present.
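Those per-terabyte figures reduce to simple division; for anyone checking along at home, here is a quick sketch using only the error counts and data volumes reported above:

```python
# Sanity check of the error rates quoted from the CERN study,
# using the counts and volumes reported in the discussion above.

def tb_per_error(errors, terabytes_transferred):
    """Average terabytes transferred per observed error."""
    return terabytes_transferred / errors

# First test: 500 errors after ~2.5 PB (2500 TB) written and re-read:
print(tb_per_error(500, 2500))   # one error per 5 TB
# Final test: 22 errors in 8.7 TB:
print(tb_per_error(22, 8.7))     # one error per ~0.4 TB (~400 GB)
```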

I'm well aware that ZFS's checks also catch 'wild' and 'lost' writes, but their incidence is even lower than what CERN is discussing: not the sort of thing that any sensible desktop user would spend even a fraction of a second worrying about, especially given the far more important potential exposures in that environment.

From their slides: In 41 PB of data generated across 3500 machines, they found roughly 1,000 errors on 170 machines. Nearly two thirds of those were large chunks of corrupted data (usually in 64 KB chunks). This sounds like a bad RAID controller.

But the larger point is that these 1,000 errors were only found because they were looking for them. The drive firmware, RAID controller, ECC memory, and filesystem code didn't catch these errors. The authors of the paper specifically recommend end-to-end checksumming to help detect them. Will ZFS catch all of the error types they encountered? No. But it will catch many of them.

Similar studies by U. Michigan's Advanced Systems Lab on filesystem and swap space reliability have reached the same conclusions. I referenced those papers in an earlier comment (google for "Iron File Systems"). The authors recommend that filesystems store redundant metadata, checksum metadata and file data, and repair latent errors automatically.

Thanks for the CERN pointer: it looks like a slightly later version of the paper that I already had.

1. The 'Type I' errors reported correlated strongly with RAM errors, which strongly suggests that unless they happened to occur in the very brief interval between the point that ZFS generates its just-before-disk-write checksum and the point where the data is actually sent to disk (or possibly in the similarly brief interval when the data is re-read and checked, though given the nature of memory errors later corruption might be a less likely scenario) ZFS's integrity checks would not catch them.

2. The origin of the 'Type II' errors was not established but their character (and the fact that they appeared related to out-of-memory situations) led the author to speculate that they were related to slab-allocation corruption: if that was indeed the case, then the above observation that ZFS would likely not have been of any help applies to them as well.

3. The 'Type III' errors appear to be those which the earlier paper stated were due to a WD on-disk firmware bug that was exacerbated by the demands (and/or less-than-strident error-reporting) of the 3Ware RAID controller (the earlier paper indicated that they had identified this bug and were updating the firmware on their 3,000 WD disks as a result). While ZFS's checks would indeed catch such errors, if most of them came from that interaction then they would not occur (or at least not in nearly as great numbers) in a normal desktop environment (the environment under discussion here).

4. The very-low-incidence 'Type IV' errors seem to have been observed after the earlier paper was published, and the slides state that they may not warrant a separate category.

Note that none of the above errors (with the possible exception of the very-low-incidence Type IV errors and, if the author's speculation about their origin turns out to be incorrect, Type II errors) appear to have occurred in the path between RAM and disk, contrary to your original suggestion: that's why I questioned it. The closest one can come to blaming that path is to blame the 3Ware RAID controller for not complaining more stridently about the disk time-outs that it saw, but the problem *was* detectable via non-ZFS mechanisms and the underlying problem was at the disk, not in the path.

It's interesting that all the errors observed apparently occurred on only about 5% of their tested machine population (with an average of about 6 errors per machine, while the other 95% of the machines had no errors), which cannot possibly have been happenstance (why they didn't manage to correlate this with something mystifies me).

Thanks also for reiterating your earlier UWisc citations: I've had considerable respect for their CompSci department ever since the days of DeWitt, Carey, et al. there. It turns out that I had downloaded Vijayan's thesis over a month ago but had not yet gotten around to reading it.

Unfortunately, a quick skim of possibly relevant sections of that thesis plus the other citation there seems to indicate that they performed no actual evaluation of the *incidence* of 'silent' (or any other) errors but only evaluated what effect certain kinds of errors *if present* would have on various current file systems. And the primary observation seems to have been that several current file systems have design bugs related to *detectable* errors, rather than specifically due to the far-lower-incidence 'silent errors' the handling of which is ZFS's claim to fame in this area. Furthermore, the mention of path ('Transport') errors that I noticed in the thesis (page 12) again discussed *detected* errors, not 'silent' ones.

So I'll suggest that you (and you're far from the first) have inadvertently conflated the problems of 'silent' corruption with the far more common incidence of *detectable* errors and the also somewhat startling problem of mis-handling these detected errors that even fairly respected existing file systems appear to have. Such file system bugs are not 'path' bugs, but the fact that ZFS-style in-parent checksums would catch them is relevant - so the question becomes whether the actual incidence of such error-induced improper behavior is frequent enough to worry about a lot more than the incidence of 'silent errors' is, and I've seen no evidence to that effect.

In sum, while the information about possible design flaws in some existing file systems is certainly alarming, I still see no reason to disagree with Drew's assertion in the original post above that "ZFS isn't necessary for most of today's Macintosh computers. If you have been using your Mac with no storage-related problems, then you can keep on using it that way. Perform regular backups and you'll be just fine."

My main dispute with Drew involved his misunderstanding about which disk error rate was applicable to ZFS's almost-unique strengths, and my dispute with you related to your assertion that "the storage path from memory -> PCI -> controller -> disk is much less reliable than the disk itself" (which, as I've explained, really doesn't seem to be supported by the references you cited). Far from being a "ZFS hater", I applaud it (as I've said elsewhere) as a breath of fresh air, and feel that it represents a notable step forward in reliability in those environments sufficiently controlled (unlike virtually all desktop environments) that other sources of error do not make ZFS's additional integrity disappear in the noise.

Nice posts Drew! Wrt. snapshots, I think the best way to experiment around with ZFS and the actual day-to-day utility of snapshots is to *dive in and have a go*, rather than theorize about their effects.

I've done that since 2005, and find them incredibly useful for my day to day desktop usage.

It is interesting to do some reading on ZFS/flash solutions. That might be the single best place for ZFS to outshine other filesystems in the future. Flash tends to develop corrupt blocks, and ZFS can find and replace those blocks with parity/copies.