Monday, October 25, 2010

WAL Reliability

I recently learned, somewhat to my chagrin, that operating systems are pathological liars, and in particular that they habitually lie about whether data has actually been written to disk. If you use any database product, you should care about this, because it can result in unfixable, and in some cases undetected, corruption of your database. First, a question. On which of the following operating systems do fsync() and related calls behave properly out of the box?

A. Linux
B. Windows
C. MacOS

As far as I can tell, the correct answer is "none of these". In the case of Linux, the fine manual sums up the problem pretty well:

If the underlying hard disk has write caching enabled, then the data may not
really be on permanent storage when fsync() / fdatasync() return.

Well, that sucks, doesn't it? Getting the data into the drive's write cache doesn't really help very much, unless perhaps it's a battery-backed cache. It seems that you'll need to fiddle with the hdparm utility, or perhaps some other tool if you have something other than an IDE hard drive, to turn off write caching. Your choice of filesystem and filesystem mount options can also have an impact on behavior, and apparently most Linux file systems don't handle this very well, though Greg Smith has some good things to say about the newest versions of ext4, at least in this limited regard.

MacOS X has a similar problem. On the MacBook Pro provided to me by my employer, EnterpriseDB, I get the following pgbench results with an out-of-the-box configuration.

Hmm. Even if we rather optimistically assume that a single disk flush writes the commit records for all four threads to disk in a single go, that's still 323 writes per second. Each commit should require a full revolution of the hard drive, so apparently this puppy has a 20K RPM drive in it? Doesn't sound likely. MacOS X is kind enough to provide an system call that bypasses write caching and really writes data to disk. If I set wal_sync_method=fsync_writethrough in postgresql.conf, performance drops from 1292 tps to 27 tps. Ouch. I guess that's why people don't run critical production databases on laptop-class machines.

I'm not terribly familiar with Windows, except as an (occasional) user, so I can't comment on the situation there in detail. However, the latest PostgreSQL documentation on this topic reads:

On Windows if wal_sync_method is open_datasync (the default), write caching is disabled by unchecking My Computer\Open\{select disk drive}\Properties\Hardware\Properties\Policies\Enable write caching on the disk. Also on Windows, fsync and fsync_writethrough never do write caching.

What that means is that if you're running the default configuration, you'd better make sure to have unchecked that box; otherwise, you'd better change the default fsync method. Assuming this documentation reflects the actual behavior of Windows, it sounds like they're a bit ahead of the curve, actually making fsync() force a flush of the drive's write cache out of the box. On the other hand, apparently they only do this for fsync(), and not for similar methods such as opening the file with O_DSYNC, which is supposed to be similar in effect to an fsync() after every write, but more efficient.

I think the reason why there's so much operating system (and database!) misbehavior in these cases is that everyone is afraid that doing the right thing will lead to massive loss of performance on benchmarks. You can understand why. In the above example with MacOS X, performance dropped by 50x when doing it the right way. If the next release of MacOS X were to enable this behavior by default, or if the next release of PostgreSQL were to use wal_sync_method=fsync_writethrough on MacOS X by default, someone would doubtless complain that there'd been a huge performance regression "for no reason". The same is presumably true for Linux or Windows. This is an unfortunate state of affairs, and it's not clear whether it will ever get fixed, but unless and until it does, watch out! Don't get fooled into believing that your database is safe without very careful verification.

If you need to improve performance (and don't have the budget for a machine with a battery-backed write cache), consider setting up the system for reliable operation and then configuring synchronous_commit=off. This should recapture much of the performance you've lost through making your write ahead log reliable, without risking database corruption. You may lose a few transactions in the event of a power failure, but for many applications that's an acceptable risk, and certainly better than losing the whole database.

We've been working on updating the documentation on these topics lately, so that it will be more clear what you need to do on your system to be certain that data is being properly flushed to disk. If you haven't read the first section of Chapter 29, on Reliability and the Write-Ahead Log, it's worth your time to do so. You may want to read the version in the developer documentation, which has been recently updated with some additional details that aren't present in the released versions.

Even with a UPS, power failures can and do happen. There's always the risk of someone kicking the cord out, or the UPS failing. Of course, if you're willing to take the risk of data corruption, then, as you say, write caching is a huge win. A better bet still is a battery-backed write cache, which will allow you several hours to get the power restored in the event of an outage.

I'm kinda surprised that this is news to a Postgresql hacker! (I'll assume you've just never had to work at that layer - lucky you.) The lies don't stop at hdparm, though - your hardware may also simply be incapable of guaranteeing write-through to non-volatile storage. Most consumer hard disks fall into that class.

Anyway, very glad to hear this discussion is making it into the (excellent) Postgresql docs! Seems obvious now, but I never realized it was missing. Just did some Googling and found a mention on the Dark Side.

You just have to look at the nonsense Phoronix has been publishing for months about databases…(for instance:http://www.phoronix.com/scan.php?page=article&item=linux_2634_fs&num=4Hey, BTRFS sucks, it's the only one not doing 800TPS on our SATA drive :) )

-- . A better bet still is a battery-backed write cache, which will allow you several hours to get the power restored in the event of an outage.

To me, anyway, this implies that a BBU write-cache (which I also assume is part and parcel of some disc controller hardware) can sit around with its bits being refreshed while the rest of the machine (including that controller and any other machine/disc in the room) is unpowered. And that when the machine is re-powered, PG (which presumably didn't perform a graceful exit when the plug was pulled) picks up where it left off?? How does it manage that?

The be fair here, it's not that operating systems are pathological liars. It's the hard drives that are. Operating systems are getting increasingly good at smacking the data onto the disk regardless of their lies, but there's still progress to be made. Yang suggests that most consumer hard drives are incapable of guaranteeing write-through, but that's really not true--you just have to send them exactly the right cache flushing commands to do so. The only drives I'm aware of that ignore those, too, are many of the popular SSD models out there, including all of Intel's SSD products.

For Windows, the situation is just like the Mac. Don't fool with the drive write caches. Use fsync_writethrough as your wal_sync_method. Problem solved in any common hardware for PostgreSQL, while keeping the advantage of the write cache for non-database use intact.

I've started a PostgreSQL wiki page titled Reliable Writes that gathers information about this topic, in hopes of providing a secondary resource to the main PostgreSQL documentation that isn't afraid to single out bad vendors and equipment. The sample chapter from my book available for download there answers most of the questions raised by Robert Young here. It works via a combination of the cached blocks being saved on the controller being replayed once power is restored, followed by replaying the write-ahead log in PostgreSQL to correct any lingering damage at startup time.

@Greg Smith - very interesting. I think there's too much conflicting information on consumer device reliability - I've read that many of them simply can't disable the write cache, and are thus prone to data loss (no references handy though, conveniently). And my labmates say the enterprise versions of the Intel SSDs *are* reliable (with write caches disabled; when enabled, are enterprise devices typically durable?). There seems to be a lack of concrete, coherent information, and I'm glad to see the wiki section start to address that concern.

Aside: It seems messy to have both an fsync and an fsync_writethrough. Why is one a "soft" fsync, and the other one "I really mean it"?