Silent Data Corruption Is Real

This means there was a data error on the drive. But it’s worse than a typical data error: this is an error that the hardware did not detect. ZFS and btrfs write a checksum with every block (both data and metadata) written to the drive, and verify that checksum at read time. Most filesystems don’t do this, because in theory the hardware should detect all errors. In practice it doesn’t always, which can lead to silent data corruption. That’s why I use ZFS wherever I possibly can.
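To make the idea of checksum-on-write and verify-on-read concrete, here is a minimal Python sketch. It is not how ZFS or btrfs is implemented: the `ChecksummedStore` class and its methods are hypothetical names invented for illustration, a dictionary stands in for the drive, and SHA-256 is used only because it is convenient.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block's contents no longer match its stored checksum."""


class ChecksummedStore:
    """Toy block store that keeps a checksum for every block (hypothetical)."""

    def __init__(self):
        self.blocks = {}     # block id -> bytearray, stands in for the drive
        self.checksums = {}  # block id -> digest recorded at write time

    def write(self, block_id, data):
        self.blocks[block_id] = bytearray(data)
        self.checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id):
        data = bytes(self.blocks[block_id])
        # Verify on every read; never hand back data that fails the check.
        if hashlib.sha256(data).digest() != self.checksums[block_id]:
            raise ChecksumError(f"block {block_id} failed verification")
        return data


store = ChecksummedStore()
store.write(0, b"important data")
store.write(1, b"untouched data")
store.blocks[0][3] ^= 0x01  # simulate a bit flip the hardware did not report

try:
    store.read(0)
    detected = False
except ChecksumError:
    detected = True
```

A filesystem without the checksum step would have returned the flipped byte to the application as if nothing had happened.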

As I looked into this issue, I saw that ZFS repaired about 400KB of data. I thought, “well, that was unlucky” and just ignored it.

Then a week later, it happened again. Pretty soon, I noticed it happened every Sunday, and always to the same drive in my pool. The highest I/O load on the machine occurs on Sundays, because I have a cron job that runs zpool scrub that day. This operation forces ZFS to read and verify the checksums on every block of data in the pool, and is a nice way to guard against unreadable sectors in rarely-used data.

I finally swapped out the drive, but to my frustration, the new drive exhibited the same issue. The SATA protocol does include a CRC32 checksum, so it seemed (to me, at least) that the problem was unlikely to be a cable or chassis issue. I suspected the motherboard.

It so happened I had a 9211-8i SAS card. I had purchased it off eBay a while back when I built the server, but could never get it to see the drives. I wound up not filling the server with as many drives as planned, so the on-board SATA did the trick. Until now.

As I poked at the 9211-8i, noticing that even its configuration utility didn’t see any devices, I finally started wondering if the SAS/SATA breakout cables were a problem. And sure enough – I realized I had a “reverse” cable and needed a “forward” one. $14 later, I had the correct cable and things are working properly now.

One other note: RAM errors can sometimes cause issues like this, but this system uses ECC DRAM and the errors would be unlikely to always manifest themselves on a particular drive.

So over the course of this, had I not been using ZFS, I would have had several megabytes of reads with undetected errors. Thanks to using ZFS, I know my data integrity is still good.

shitty disk, even though a bitter taste stays if the problem always occurred during scrub.
that’s why on every halfway serious disk array * you were able to throttle the “scrub/sniff/patrol/verify” rate.
yes the disk was smelly, but a scrub should be a scrub and not a stress test at *the same* time.

(you know they also had per block crc/ecc since ages, plus long enough of T10 error detection, right?)

ZFS does implement throttling. There are two ways to influence the load of scrub and resilver. The first is a sleep interval in “ticks” (multiples of 1 ms on a FreeBSD system unless you changed kern.hz). The more relevant one is the per-VDEV I/O scheduling. ZFS issues I/O requests in batches and implements its own scheduling. I/O requests are typed ({ async, sync } x { read, write } + { scrub }). For each type of I/O there is a reservation and an upper bound per scheduler invocation. These can be changed to optimise ZFS for a workload, and the default limits scrubbing to two I/O requests at a time per VDEV.
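The reservation-plus-upper-bound idea can be pictured with a toy scheduler. This is only a sketch of the general technique, not ZFS's actual code: the function name `pick_batch`, the two-pass policy, the priority order, and every limit other than the scrub maximum of two are assumptions made up for illustration.

```python
from collections import deque

# (reservation, upper bound) per I/O class, in priority order.
# Only the scrub upper bound of 2 comes from the description above;
# all other numbers are illustrative assumptions.
LIMITS = {
    "sync_read": (2, 4),
    "sync_write": (2, 4),
    "async_read": (1, 3),
    "async_write": (1, 3),
    "scrub": (1, 2),
}


def pick_batch(queues, slots):
    """Choose up to `slots` requests to issue to one VDEV in this invocation."""
    issued = {cls: 0 for cls in LIMITS}
    batch = []

    def take(cls):
        batch.append((cls, queues[cls].popleft()))
        issued[cls] += 1

    # Pass 1: honour each class's reservation so no class is starved.
    for cls, (lo, _hi) in LIMITS.items():
        while queues[cls] and issued[cls] < lo and len(batch) < slots:
            take(cls)

    # Pass 2: hand out the remaining slots, capped at each class's upper bound.
    for cls, (_lo, hi) in LIMITS.items():
        while queues[cls] and issued[cls] < hi and len(batch) < slots:
            take(cls)

    return batch


queues = {cls: deque() for cls in LIMITS}
queues["scrub"].extend(range(100))    # a scrub wants to read everything
queues["sync_read"].extend(range(3))  # a little foreground traffic
batch = pick_batch(queues, slots=10)
scrub_issued = sum(1 for cls, _ in batch if cls == "scrub")
```

Even with a hundred scrub requests waiting and free slots available, only two are issued per invocation, so foreground reads are not crowded out.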

I once had a SATA drive with some sort of hardware problem related to EDAC. I used it on and off (at work) for a few months and kept getting strange crashes on the code I was building. My co-worker kept snickering and mumbling that my code must be bad, but it worked fine in the debugger. Eventually I suspected the disk and began comparing large binary files that should have been identical. I saw random errors every few hundred kilobytes that were different each time I read the file. That one bad drive cost me a lot of time. NEVER ASSUME ANYTHING.
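For anyone wanting to repeat that kind of comparison, here is a small Python sketch that reports which blocks differ between two files that should be identical. The function name `differing_blocks` and the 4 KiB block size are arbitrary choices for illustration, and the demo runs on temporary files with one deliberately flipped bit. On a real system a repeated read may be served from the operating system's cache rather than the drive, so treat this as a starting point only.

```python
import os
import tempfile


def differing_blocks(path_a, path_b, block_size=4096):
    """Return the byte offsets of blocks that differ between two files."""
    offsets = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                offsets.append(offset)
            offset += block_size
    return offsets


# Demo on two temporary files that differ by a single flipped bit.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    payload = bytes(range(256)) * 64   # 16 KiB of sample data
    corrupted = bytearray(payload)
    corrupted[5000] ^= 0x10            # lands in the second 4 KiB block
    with open(good, "wb") as f:
        f.write(payload)
    with open(bad, "wb") as f:
        f.write(corrupted)
    result = differing_blocks(good, bad)
    same = differing_blocks(good, good)
```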

ZFS requires redundancy to recover corrupted data. By default at least two copies of metadata are maintained, on top of any redundancy the VDEVs themselves provide (mirroring, RAID-Z). The storage allocator tries to spread copies over different VDEVs.
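Why a checksum alone detects but cannot repair can be shown in a few lines of Python. This is a simplified sketch under stated assumptions, not ZFS's repair logic: `read_with_repair` is a hypothetical helper, bytearrays stand in for copies of one block on different devices, and SHA-256 is used for convenience.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def read_with_repair(copies, expected):
    """Return verified data from redundant copies, repairing damaged ones.

    `copies` is a list of bytearrays standing in for the same block on
    different devices; `expected` is the checksum recorded at write time.
    Returns (data, repaired_count); data is None if no copy verifies.
    """
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        return None, 0  # corruption detected, but nothing to repair from
    repaired = 0
    for c in copies:
        if bytes(c) != good:
            c[:] = good  # overwrite the damaged copy in place
            repaired += 1
    return good, repaired


block = b"payload"
expected = checksum(block)

mirror = [bytearray(block), bytearray(block)]
mirror[0][0] ^= 0xFF  # one side of the mirror goes bad
data, repaired = read_with_repair(mirror, expected)

single = [bytearray(block)]
single[0][0] ^= 0xFF  # no second copy to fall back on
lost, _ = read_with_repair(single, expected)
```

With two copies the bad one is found and rewritten from the good one; with a single copy the error is still detected, but the data cannot be recovered.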