This blog is about the Google Summer of Code project "ZFS filesystem for FUSE/Linux"

Friday, June 30, 2006

Why ZFS is needed even in desktops and laptops

Personally, I can't use my computer unless I know I have reliable software and hardware. I think I have achieved that goal mostly, since my computer pretty much never crashes or loses data (apart from the occasional application bugs).

Now, even though I use a reliable journaling filesystem (XFS) in my Linux system, I like to do a filesystem consistency check once in a while (usually not less than once every 3 months), which can only happen in those rare times when I need (or want) to reboot. Today was one of those days.

And here are the results: xfs_repair.txt. I ended up with 90 files and empty dirs in lost+found. Why did this happen? It could be a hardware problem - either the hard disk, the SATA cable, the SATA controller or even the power supply; or a software bug - either in the SATA driver, the XFS code or somewhere else in the Linux kernel.

I actually suspect this is a hardware problem. This particular machine, back when I was using a different SATA controller and a different hard disk, had the very annoying problem of not flushing the disk write cache on reboots. This caused *a lot* of the problems you see above in the xfs_repair log. I even tried MS Windows, which would chkdsk on every other reboot. So the problem was definitely hardware related. Even though I never fixed the problem, fortunately I never lost an important file!

Now, after the hard disk died, I bought a new one and changed SATA controller (my motherboard has 2), just to be on the safe side. But, well, as you can see above, something's still definitely not working correctly.

This is one of the reasons I need ZFS. I don't want to lose or end up with misteriously corrupted files. I want to see how often data is corrupted. I want to see if corruption only happens after a reboot (which means it's a disk write cache flush problem), or if it happens while the system is running (I can't fsck XFS filesystems while they're being used). Of course, I want to do this in order to diagnose the problem and fix it.

And even if the hardware only corrupts data once in a blue moon, I need my filesystem to properly report a checksum error and retry (or return an error), instead of returning corrupted data. Basically, I want a reliable computer..