You see, when you yank the power cord out of the wall, not all parts of the computer stop functioning at the same time. As the voltage starts dropping on the +5 and +12 volt rails, certain parts of the system may last longer than other parts. For example, the DMA controller, hard drive controller, and hard drive unit may continue functioning for several hundred of milliseconds, long after the DIMMs, which are very voltage sensitive, have gone crazy, and are returning total random garbage. If this happens while the filesystem is writing critical sections of the filesystem metadata, well, you get to visit the fun Web pages at http://You.Lose.Hard/ .

I was actually told about this by an XFS engineer, who discovered this about the hardware. Their solution was to add a power-fail interrupt and bigger capacitors in the power supplies in SGI hardware; and, in Irix, when the power-fail interrupt triggers, the first thing the OS does is to run around frantically aborting I/O transfers to the disk. Unfortunately, PC-class hardware doesn't have power-fail interrupts. Remember, PC-class hardware is cr*p.

Why doesn't ext3 get hit by this? Well, because ext3 does physical-block journaling. This means that we write the entire physical block to the journal, and only when the updates to the journal are committed do we write the data to the final location on disk. So, if you yank out the power cord, and inode tables get trashed, they will get restored when the journal gets replayed.

XFS and ReiserFS do what is called logical journaling. This means that if you modify the mod time of an inode, instead of writing the entire inode table block to the journal, they just write a note in the journal stating that "inode X now has a mod time of Y". This takes much less space, so you can pack many more journal updates into a single disk block. For filesystem benchmarks that try to saturate the filesystem's write bandwidth, and/or that also have very high levels of metadata updates, XFS and ReiserFS will tend to do much better than does ext3. Fortunately, many real-world workloads don't have this characteristic, which is why ext3 tends to perform just fine in practice in many applications, despite what would appear to be much worse benchmarks numbers, at least for some benchmarks.

The problem with logical journaling is that if you do have an unexpected power drop, and the storage system scribbles garbage on the inode table, there's no way to recover. In some cases, the only thing that is left to be done is mkfs and restore from backups. (Backups? You do keep backups, right?)

The practical upshot of this is that if you use XFS or ReiserFS, and you are on crappy PC-class hardware, you ***MUST*** have a UPS, and use a serial/USB cable, so that system can do a graceful shutdown when the UPS's batteries are exhausted.

This issue is completely different from the XFS issue of zeroing all open files on an unclean shutdown, of course. Having a UPS won't save you from that particular XFS misfeature. The reason why it is done is to avoid a potential security problem, where a file could be left with someone else's data. Ext3 solves this problem by delaying the journal commit until the data blocks are written, as opposed to trashing all open files. Again, it's a solution that can impact performance, but at least in my opinion, for a filesystem, performace is Job #2. Making sure you don't lose data is Job #1.