Delayed allocation and the zero-length file problem

A recent Ubuntu bug has gotten slashdotted, and has started raising a lot of questions about the safety of using ext4. I’ve actually been meaning to blog about this for a week or so, but between a bout of the stomach flu and a huge todo list at work, I simply haven’t had the time.

The essential “problem” is that ext4 implements something called delayed allocation. Delayed allocation isn’t new to Linux; xfs has had delayed allocation for years. Pretty much all modern file systems have delayed allocation; according to the Wikipedia Allocate-on-flush article, this includes HFS+, Reiser4, and ZFS, and btrfs has this property as well. Delayed allocation is a major win for performance, both because it allows writes to be streamed more efficiently to disk, and because it can reduce file fragmentation so that files can later be read more efficiently from disk.

This sounds like a good thing, right? It is, except for badly written applications that don’t use fsync() or fdatasync(). Application writers had gotten lazy, because ext3 by default has a commit interval of 5 seconds, and uses a journalling mode called data=ordered. What does this mean? The journalling mode data=ordered means that before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction will be forced out to disk. This is primarily done for security reasons; if this is not done (which would be the case if the disk is mounted with the mount option data=writeback), then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing uninitialized data that had previously belonged to another user (say, their e-mail or their p0rn), which would be a Bad Thing from a security perspective.

However, this had the side effect of essentially guaranteeing that anything that had been written would be on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we’ll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data, even though POSIX never really made any such guarantee. This became especially noticeable on Ubuntu, which uses many proprietary, binary-only drivers, which caused some Ubuntu systems to become highly unreliable, especially for Alpha releases of Ubuntu Jaunty, with the net result that some Ubuntu users have become used to their machines regularly crashing. (I use bleeding edge kernels, and I don’t see the kind of unreliability that apparently at least some Ubuntu users are seeing, which came as quite a surprise to me.)

So what are the solutions to this? One thing is that the applications could simply be rewritten to properly use fsync() and fdatasync(). This is what is required by POSIX, if you want to be sure that data has gotten written to stable storage. Some folks have resisted this suggestion, on two grounds: first, that it’s too hard to fix all of the applications out there, and second, that fsync() is too slow. This perception that fsync() is too slow was most recently caused by a problem with Firefox 3.0. As Mike Shaver put it:

On some rather common Linux configurations, especially using the ext3 filesystem in the “data=ordered” mode, calling fsync doesn’t just flush out the data for the file it’s called on, but rather on all the buffered data for that filesystem.

Fundamentally, the problem is caused by “data=ordered” mode. This problem can be avoided by mounting the filesystem using “data=writeback” or by using a filesystem that supports delayed allocation, such as ext4. This is because if you have a small SQLite database on which you are calling fsync(), and in another process you are writing a large 2 megabyte file, the 2 megabyte file won’t be allocated right away, and so the fsync operation will not force the dirty blocks of that 2 megabyte file to disk; since the blocks haven’t been allocated yet, there is no security issue to worry about with the previous contents of newly allocated blocks if the system were to crash at that point.

Another solution is a set of patches to ext4 that has been queued for the 2.6.30 merge window. These three patches (with git id’s bf1b69c0, f32b730a, and 8411e347) will cause any delayed allocation blocks belonging to a file to be allocated immediately when that file is replaced. This gets done when the file is closed, for files which were truncated using ftruncate() or opened via O_TRUNC, and when a file is renamed on top of an existing file. This solves the most annoying set of problems where an existing file gets rewritten, and thanks to the delayed allocation semantics, that existing file gets replaced with a zero-length file. However, it will not solve the problem for newly created files, of course, which would still have delayed allocation semantics.

Yet another solution would be to mount ext4 volumes with the nodelalloc mount option. This will cause a significant performance hit, but apparently some Ubuntu users are happy using proprietary Nvidia drivers, even if it means that when they are done playing World of Goo, quitting the game causes the system to hang and they must hard-reset the system. For those users, it may be that nodelalloc is the right solution for now — personally, I would consider that kind of system instability to be completely unacceptable, but I guess gamers have very different priorities than I do.

A final solution which might not be that hard to implement would be a new mount option, data=alloc-on-commit. This would work much like data=ordered, with the additional constraint that all blocks that had delayed allocation would be allocated and forced out to disk before a commit takes place. This would probably give slightly better performance compared to mounting with nodelalloc, but it shares many of the disadvantages of nodelalloc, including making fsync() potentially very slow, because it would once again force all dirty blocks out to disk.

What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync. Modern filesystems are all going to be using delayed allocation, because of its inherent performance benefits, and whether you think the future belongs to ZFS, or btrfs, or XFS, or ext4: all of these filesystems use delayed allocation.
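For the newly created file case, which the patches above do not cover, here is a minimal sketch of what "use fsync" might look like from an application. It is written in Python purely for brevity; `os.fsync()` is a thin wrapper around the fsync() system call. The function name `write_new_file_durably` is made up for this illustration, and the sketch assumes the caller wants creation to fail if the file already exists.

```python
import os


def write_new_file_durably(path: str, data: bytes) -> None:
    """Create a new file, write data to it, and wait for it to reach stable storage.

    Hypothetical helper for illustration only. O_EXCL makes the call fail if
    the file already exists, so this never truncates existing contents.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Whether the new directory entry itself survives a crash is a separate question; a cautious application would probably also fsync the containing directory.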

What do you think? Do you think all of these filesystems have gotten things wrong, and delayed allocation is evil? Should I try to implement a data=alloc-on-commit mount option for ext4? Should we try to fix applications to properly use fsync() and fdatasync()?

If that’s what you want, you can mount the filesystem using -o sync. Performance will be terribly, terribly slow. Worse yet, it still won’t necessarily help you because application writers take shortcuts. For example, if the application writer truncates the file down to zero, and then rewrites the file with the new contents, because the application writer wants to preserve the file ACL without actually saving and restoring the ACL entries, and the system crashes right after the file is truncated, there’s really not much you can do.

So I agree with you that performance is the #1 priority, assuming that the application writer is doing things correctly. If the application writer does something stupid, such as truncating the file down to zero, there are real limits to what you can do to protect against application writer stupidity. We are in fact adding some heuristics to try to protect the data in the face of application writer malpractice, but at that point, I think we do need to trade off performance versus trying to protect against application writer stupidity. After all, as the old saying goes, the problem with trying to create a fool-proof solution is that fools are so ingenious….

Perhaps you’re confused about who owns the computer. If it’s in fact yours, you’re free to have it work any way you want it to, including the way that you described.

Of course, simply getting data to disk is nowhere nearly sufficient to guarantee its utility in all cases: you often have to do things like group updates appropriately for atomicity, for example – though I’ve only written a couple of file systems that did this because facilities that go far beyond the industry’s least common denominator tend not to get used all that much.

The reality of the situation is that many people (though perhaps not you) get annoyed at ‘tolerating’ the kinds of performance constraints that you describe when there’s absolutely no benefit to be gained from them (because just as some situations call for far more care in handling data than you suggested, others require far less: it depends on the details of how the application uses its storage). Since the file system cannot by definition satisfy the entire range of such needs with a single approach, good designs offer a variety of approaches from which each application can choose what best suits its needs.

Another reality, unfortunately, is that many people can see only what *they* want from a file system rather than the full range of needs that it should satisfy. Perhaps if they actually wrote one and had the opportunity to experience feedback from a few thousand vocal users they’d develop more appreciation for this concept.

One note from a usability angle, and (shudder) from a Linux user who appreciates something MicroSoft got right:

Linux got many things right: security, stability well before XP, and quite a lot else. Linux excelled at being technically correct, at least compared to MicroSoft Windows.

Meanwhile, MicroSoft got more or less one thing right in Windows: usability. Where Linux was secure, open, and so on, Microsoft knew the value of being something that would work for people, and that people without computer science backgrounds could figure out. Apple understood the importance of user-centeredness too, even if they didn’t make the best business decisions. The 90% market share achieved by MicroSoft is because however many things they got wrong, however badly they bungled stability, security, and so on and so forth, they sold people a way that they could figure out how to use their computers. Only recently has Linux caught up with this way of putting users at the center.

The basic argument for ext4 is that it is more correct compared to a precise reading of specifications. If that causes large-scale practical instability for users who failed to exercise the due diligence of only using programs whose source may contain open; write; close; rename; without including an fsync, then this is not a problem with the file system. It’s a more correct read on the spec, so it’s an improvement to the filesystem, and if there are consequences, that’s Not Our Problem.

I wince at saying this, but I’d like to see developers think a little more like MicroSoft here.

Maybe this wasn’t made clear enough, but the first thing that I did, before writing this blog article, was to create heuristics for ext4 that worked around broken application behavior. That is, ext4 tries to determine if applications are trying to update files in a dangerous way (i.e., update-via-truncate and update-via-rename without using fsync), and it will force an implied file system flush to avoid data loss most of the time. Unfortunately, if an application truncates a file, and then the system crashes before the application gets around to writing the new data, there’s not much that can be done at the file system level.

But I did first work around application programmer stupidity, and then called on application programmers to be, well, less stupid. That was because I knew application programmers outnumbered file system developers by several orders of magnitude. So with all due respect, I was thinking from a user-centric point of view; the first thing I did was to try to avoid as much data loss as possible without application programmer assistance.

One advantage Microsoft has, that Linux kernel programmers don’t have, is the Windows logo compatibility program. If there is some really stupid thing that Microsoft wants to prohibit, they can add a requirement to the Windows application logo compatibility program, and software companies won’t be able to put the Windows logo on their software packages unless they conform to all of the requirements of the Windows logo program. We don’t have that big stick to beat over the heads of application programmers, so all we can try to do is persuade application programmers to do better, via blog posts such as this one.

Jonathan, I’m afraid that you, like so many others here, Just Don’t Get It.

This is not a discussion about correctness and specs: it’s very much a discussion about usability – the ability to use a file system to satisfy a wide variety of needs for a wide range of applications written to support a wide range of users.

If users got to use the file system directly rather than predominantly through intermediary applications then the file system *might* be able to provide default behavior that the majority would find appropriate. Instead, a wide variety of applications using the file system in a wide variety of ways are what the users see, and the file system cannot, by definition, serve all these applications (and their users) well with a single approach even if one assumes that all users want the same things: only applications can do that, because only they understand the ways they’re using the file system and what implications this has for the user experience.

Perhaps if you understood both the Linux and Windows file systems better you’d be less inclined to hold up the latter as some sort of paragon of usability even in the face of the kind of application incompetence being discussed here. Like nearly all modern file systems, Windows defers most on-disk updates in the absence of application instructions to the contrary. For example, when a user clicks on ‘Save’ and the application issues a standard file system Write request, nothing goes to disk for some period of time: only if the application recognizes that ‘Save’ means that the user wants the data to move to disk Right Now (just in case that ominous thunder outside presages a power outage) and explicitly flushes the data to disk immediately after issuing its normal Writes does the user get the behavior desired.

There’s another layer involved as well, since desktop systems are typically configured to enable the disk’s own internal write-back cache (as usual, for performance reasons: users do get annoyed at slow computers far more frequently than they get annoyed because they’ve lost some data, after all). So when the file system gets instructions from an application to force data to disk it in turn must tell the disk to force it to the platters (which competent file systems of course do).

Windows systems typically ship with the disk’s write-back cache enabled, because that’s what users seem to want. And Windows file systems don’t subvert that facility without explicit instructions from the application (or to protect their own internal consistency). So if applications fail to tell the Windows file system what to do, it will in most cases just as happily leave their data subject to loss should an interruption occur as ext4 will when its applications do the same.

Ted has very thoughtfully back-stopped broken applications in this one specific area, perhaps because it has relatively little performance down-side, perhaps because he feels some responsibility for having set false (though completely undocumented) expectations in ext3, perhaps because it was relatively easy to do. Don’t make the mistake of thinking that such back-stopping for application incompetence should (let alone could) be applied across the board.

“This gets done for files which were truncated using ftruncate() or opened via O_TRUNC when the file is closed, and when a file is renamed on top of an existing file.”

Seriously, Ted. You complain about people relying on nice side effects of ext3 and now you’re adding yet another one to ext4. I understand the motivation to make replacing rename()s safe, because the method used suggests that the user wanted consistency. But users who open files with O_TRUNC don’t care about consistency, they want to delete data. Please revert that part of the patch, or make that not a default.

Unfortunately, there are very many application programmers that attempt to update an existing file’s contents by opening it with O_TRUNC. I have argued that those application programs are broken, but the problem is that the application programmers are “aggressively ignorant”, and they outnumber those of us who are file system programmers. When we try to tell them that no, the right way to update a file is to create a new file, write the contents to the new file, fsync the new file, and then use rename() to rename the new file on top of the old file, they question our competence; they even question our paternity.

One could argue that they don’t care about consistency, but the problem is they do it anyway, and then when you combine it with users who use these broken applications, and then use Ubuntu systems with broken proprietary video drivers which crash the entire system whenever you breathe on them wrong (or when you exit certain 3D graphical games) — unfortunately the users don’t blame the application programmers, they blame the file system developers.

So unfortunately, because application programmers aren’t willing to even acknowledge that their programming styles are broken, we have to include that as a workaround. You can strace programs such as certain editors, and find that they really do update existing files (including files that might be considered precious, such as someone’s Ph.D. thesis or C source code) by opening the file with O_TRUNC. And if you crash between the time the file is truncated and the time that the data blocks are safely written to disk, you’ll lose data. And unfortunately, we have ample evidence that (a) users don’t blame the application programmers, and (b) there are a significant number of application programmers who refuse to fix their programs.
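To make the window concrete, here is a small sketch (in Python, whose `os.open()` passes the flags straight through to open(2)) that opens an existing file with O_TRUNC and reports its size before any replacement data has been written. The helper name and the file name are hypothetical. This only shows the state the kernel presents to running processes; exactly what survives a crash depends on when the truncate and the later data blocks get committed.

```python
import os
import tempfile


def size_after_truncating_open(path: str) -> int:
    """Open path with O_TRUNC and return its size before writing anything.

    Hypothetical helper: it mimics the first step of an update-via-truncate
    and then stops, the way a crash at the wrong moment would.
    """
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    thesis = os.path.join(tmp, "thesis.txt")  # hypothetical precious file
    with open(thesis, "w") as f:
        f.write("years of work\n")
    # The old contents are already gone and the new ones do not exist yet.
    print(size_after_truncating_open(thesis))  # prints 0
```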

Actually, I do think that “write to new file, then rename to old file” is the ONLY sane way of doing things. Truncate-rewrite is asking for trouble even if the system itself is solid as a rock.

After all, things can go plenty awry even without the system crashing. Suppose that the app hits a permission problem, or runs out of disk space, or somehow manages to segfault at an inopportune moment? (some bugs have made that happen in some of my favorite programs). With the write-rename method, those troubles pounce the application BEFORE the original data is destroyed or replaced.

So, in my opinion, workarounds designed to accommodate inherently broken apps (i.e., truncate-rewriters) should get less priority than those to accommodate proper apps (write fsync rename fsync-on-dir). A way of “encouraging” application writers to stop being fsync shy.

Though I do agree that fsync should NEVER punish an app more than needed. Waiting for your own file to hit the disk is quite expected, but getting ambushed with a massive cascading writeout of everyone else’s files is a violation of the “principle of least surprise”, so to speak.

I don’t usually want to side with people who write code that is incorrect, but are you really blaming application developers and proprietary drivers???

I actually don’t blame the filesystem per se, but the steps required to read a file, change something, and write it back are absolutely ridiculous. Maybe it should be fixed in glibc or something, but I’m not surprised at all that people screw it up. Let’s see:

1.) read file
2.) make changes in mem
3.) create new file
4.) modify acls/permissions on new file so they match old file
5.) write new file
6.) fsync new file (oh wait it fsyncs everything in reality… huge pause)
7.) rename new file -> old file
8.) fsync containing directory
9.) ok, now show that the file has been saved

but instead of adding that to some library people just blame application developers.
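For what it's worth, a library helper along those lines need not be long. Below is a rough sketch in Python covering steps 3 through 8 of the list above. The function name `replace_file_atomically` and the `.tmp-` prefix are invented for this example, and it assumes that copying the permission bits is enough; ACLs, ownership, and extended attributes are deliberately not handled.

```python
import os
import shutil
import tempfile


def replace_file_atomically(path: str, data: bytes) -> None:
    """Replace the contents of path with data via write, fsync, rename.

    Illustrative sketch only, not a hardened library routine.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Step 3: create the new file in the same directory, so that the
    # rename below never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            # Step 4: copy permission bits from the old file, if there is one.
            if os.path.exists(path):
                shutil.copymode(path, tmp_path)
            tmp.write(data)          # step 5: write the new contents
            tmp.flush()              # drain Python's userspace buffer
            os.fsync(tmp.fileno())   # step 6: force the data to disk
        os.replace(tmp_path, path)   # step 7: rename() over the old file
    except BaseException:
        # On any failure the original file is untouched; drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Step 8: fsync the containing directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would read the file and make its changes in memory (steps 1 and 2), call `replace_file_atomically(path, new_bytes)`, and only report the save as done (step 9) once it returns without raising.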

Also, I’m no fan of proprietary drivers, but nvidia has consistently had the best drivers (proprietary or open source) available for linux of any manufacturer. Blaming nvidia is definitely a red herring. You’re essentially saying “computers should never crash and if we didn’t have proprietary drivers they wouldn’t!!!”

Of course we blame application developers: they’re the ones incorrectly using the tools they have to work with, rather than seeking some other line of work because they don’t understand how to use those tools correctly (or are simply too lazy to, given that the rules in this area have been very clear for decades in Unix environments).

Would easier-to-use tools be nice? Perhaps – though extra layers add extra overheads and interface detail (even though perhaps making specific tasks easier). Would easier tools be nice for just one of many file systems in use on Linux? That’s a lot less clear – but (as you suggest) a library approach, with appropriate specialization for each such file system, might handle that.

Libraries are, of course, application rather than system code, so anyone can write one (and system developers tend to leave that up to others: they’ve got enough problems of their own to handle). If the library is sufficiently successful it might even become a standard, with the advantage of being portable across many different environments.

In this particular case you’re asking for a specialized kind of transaction, something eminently achievable at application (or library) level. Transactions have not traditionally fallen within the scope of file system responsibilities on *any* common platform, which may help explain why there’s a bit of resistance to being told that any individual file system is at fault for not providing them. The atomicity of the rename operation itself is transactional in nature, but only within a single action – and that is what allows the traditional sequence used to update a file atomically to be as concise as it is.

I suspect that the reason that no one has developed the kind of library function which you seem to be advocating is that this particular situation covers only one small part of what applications require to be robust in the face of unexpected interruptions – and hence addresses so small a part of the post-interruption clean-up they must perform that it’s not worth special-casing. I suspect the main reason for this tempest in a teapot is that application developers found a convenient (though undocumented and unintentional) short-cut with ext3 which they unwisely assumed would exist in perpetuity, and that their unsuspecting users are now looking for a single scapegoat because that’s less intellectually challenging than understanding why the file system really has good reason to work the way it traditionally has.

That last would certainly be consistent with the general collapse of analytical competence in the U.S. during this decade, and I see little evidence that the technical community has in some way remained immune to that (much as it would be comforting to believe otherwise).

First, thanks a bunch Ted for explaining the zero-length-file issue, the delayed-write nature of EXT-4, and the proper application of fsync(). You helped me easily solve an otherwise baffling bug.

Can anyone (Ted?) tell me the effect of EXT-4’s Delayed Write feature on msync(…,MS_SYNC) calls? I use memory-mapped files, and since my switch to EXT-4, my files are being corrupted across power failures. Does msync(…MS_SYNC) REALLY commit data to the DISK, or does it just commit it to the FILE SYSTEM, where it might be further hung up by Delayed Write?

Bill R., msync is supposed to have a similar effect as an fsync call. At a minimum this should schedule the pertinent data for immediate writeout to disk.

However, I don’t think most linux filesystems (by default at any rate) actually flush the device write cache when fsync is called, for performance reasons. They should wait for the scheduled write to actually make it to the device, however.

In the event of a power loss, anything in the device write cache that hasn’t yet been written will be lost, so the easiest thing to do is disable it. This will have performance implications of course. I imagine this will fix your problem.
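For reference, here is a minimal sketch of flushing a memory-mapped update, written in Python because its `mmap.flush()` is, as far as I know, implemented with msync(MS_SYNC) on Unix. The helper name is hypothetical, and the extra fsync() on the descriptor is a belt-and-braces assumption on my part, not something the msync specification requires. None of this addresses the device write cache discussed above.

```python
import mmap
import os


def update_mapped_bytes(path: str, offset: int, payload: bytes) -> None:
    """Overwrite part of an existing, non-empty file through a shared mapping.

    Hypothetical helper for illustration; the range being written must lie
    entirely within the current file size.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Length 0 maps the whole file; the default mapping is shared and writable.
        with mmap.mmap(fd, 0) as mapping:
            mapping[offset:offset + len(payload)] = payload
            # Write the dirty pages back to the file and wait for completion.
            mapping.flush()
        # Extra caution (an assumption, see above): also fsync the descriptor.
        os.fsync(fd)
    finally:
        os.close(fd)
```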

@Nathaniel, Ted: Is there any new information about disabling fsync in laptop_mode?
I am pretty sure it’s not in there yet, as in 2.6.35 my netbook still keeps spinning up every time I close a browser tab. If I use libeatmydata it doesn’t happen. But of course libeatmydata is pretty inflexible and not a very good solution to the problem.
I’d really love to see this for my netbook, it would certainly give some extra battery time.