How data scrubbing protects against data corruption

Small steps to big protection for your storage

Data scrubbing, as the name suggests, is a process that inspects volumes and repairs detected inconsistencies. As time goes by, stored data may fall victim to slow degradation that gradually erodes its integrity. Worse still, this degradation occurs silently, without any warning. Take photos as an example: it would be a real disaster if bit rot struck one of the precious photos capturing your indelible memories. The two images below show an original photo and a corrupt copy that has suffered from bit rot. Read on to see how data scrubbing protects your digital assets from data corruption.

Before we go into detail about data scrubbing, let us first introduce RAID arrays. RAID stands for redundant array of independent disks. Simply put, a RAID combines multiple drives into a single storage pool, offering fault tolerance and data redundancy. Here we will briefly introduce RAID 5.

RAID 5: It requires at least three drives and uses block-level striping with distributed parity. When writing a block of sequential data to the array, RAID 5 writes it to A1, A2, A3, B1, B2, and B3 in sequence, and reads data back in the same order. What about Pa, Pb, and Pc? They are parity blocks distributed across the drives. When writing A1, A2, and A3, RAID 5 uses the following XOR calculation to derive Pa and writes it to the corresponding block.

Pa = A1 XOR A2 XOR A3 (Function 1)

If one of the drives fails, RAID 5 can rebuild the missing data by using Pa and the contents of the remaining two drives. Suppose the drive containing A2 breaks; we can then perform the following XOR calculation to reconstruct it:

A2 = A1 XOR A3 XOR Pa (Function 2)

The recovered contents are what we call a redundant copy. This is how RAID 5 achieves redundancy, protecting your data against drive failure.
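Functions 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration of byte-wise XOR parity; the block contents are made up for the example and are not real RAID data.

```python
# Sketch of RAID 5 parity (Function 1) and reconstruction (Function 2)
# using byte-wise XOR over equal-length blocks.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

A1, A2, A3 = b"data-one", b"data-two", b"data-3rd"

# Function 1: compute the parity block.
Pa = xor_blocks(A1, A2, A3)

# Function 2: the drive holding A2 fails; rebuild it from the
# surviving blocks and the parity.
rebuilt_A2 = xor_blocks(A1, A3, Pa)
assert rebuilt_A2 == A2  # the redundant copy matches the original
```

The reconstruction works because XOR is its own inverse: XORing Pa with A1 and A3 cancels them out, leaving exactly A2.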

RAID Scrubbing

Now that we have a basic understanding of RAID 5, we can move on to data consistency. The parity information in each stripe should satisfy Function 1 above. If it holds true, the data in the array is consistent, and upon the failure of a single drive we can use Function 2 to calculate the redundant copy and recover the contents. If it does not hold, there is a data inconsistency, and any data reconstructed from the stripe will be incorrect.

Failure to recover your data is serious, so it's vital to maintain data consistency. RAID scrubbing scans all the contents in an array, making sure every parity stripe satisfies Function 1. If a stripe fails the XOR check, its parity is recalculated and rewritten so that all the values are consistent again.
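The scrubbing pass described above can be sketched as follows. The stripe layout and dictionary structure here are hypothetical, purely for illustration.

```python
# Minimal sketch of RAID scrubbing: walk every stripe, recompute the
# parity from the data blocks (Function 1), and rewrite it on mismatch.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def scrub(stripes: list) -> int:
    """Each stripe is {'data': [blocks...], 'parity': bytes}.
    Returns the number of stripes whose parity was repaired."""
    repaired = 0
    for stripe in stripes:
        expected = xor_blocks(*stripe["data"])
        if stripe["parity"] != expected:   # Function 1 violated
            stripe["parity"] = expected    # restore consistency
            repaired += 1
    return repaired

stripes = [
    {"data": [b"aa", b"bb", b"cc"], "parity": xor_blocks(b"aa", b"bb", b"cc")},
    {"data": [b"dd", b"ee", b"ff"], "parity": b"\x00\x00"},  # stale parity
]
print(scrub(stripes))  # prints 1: one stripe needed repair
```

So is a fully consistent array enough to guarantee that the data itself is correct?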

Can consistency alone guarantee that the data is correct? Unfortunately, the answer is no. We cannot guarantee that the data written to the drives is always accurate. Some data corruption goes unnoticed: it can occur during the write-to-drive process without being reported. Such errors have various causes: hardware faults, electromagnetic interference, and many more.

The problem is that RAID scrubbing can only ensure data consistency; it cannot tell which data block is incorrect. If a block is corrupted, every other block becomes "consistently corrupt" along with it. Sole reliance on RAID scrubbing therefore poses a risk. Say we want to reconstruct Pa from A1, A2, and A3 (as shown in Function 1 above): if any of A1, A2, or A3 is corrupt, the calculation will still complete, only to yield the wrong content and make things even worse.
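A short sketch makes the "consistently corrupt" trap concrete: once the parity is rewritten to match a silently corrupted block, the stripe passes the Function 1 check again, and Function 2 faithfully rebuilds the corruption. The block contents are illustrative only.

```python
# Why parity consistency alone cannot catch silent corruption.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

A1, A2, A3 = b"photo-01", b"photo-02", b"photo-03"
Pa = xor_blocks(A1, A2, A3)            # Function 1: healthy stripe

# A2 suffers a silent single-bit flip on disk.
A2_bad = bytes([A2[0] ^ 0x01]) + A2[1:]

# RAID scrubbing sees the mismatch and rewrites the parity ...
Pa_new = xor_blocks(A1, A2_bad, A3)

# ... so the stripe is consistent again, but consistently wrong:
# Function 2 now rebuilds the corrupted A2, not the original.
assert xor_blocks(A1, A3, Pa_new) == A2_bad
assert xor_blocks(A1, A3, Pa_new) != A2
```

Nothing in the XOR arithmetic distinguishes "A2 is wrong" from "Pa is wrong"; detecting which block rotted requires extra information, such as a checksum.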

This is where Btrfs data scrubbing comes in.

As you may have noticed, the volume information under the Data Scrubbing tab in Storage Manager shows not only whether your RAID type supports RAID scrubbing, but also whether file system scrubbing is supported.

Btrfs data scrubbing

File system data scrubbing employs a checksum mechanism to check volumes on the Btrfs file system. Once you enable data checksum when creating a shared folder, the Btrfs file system calculates a checksum (the data checksum) for every written file, and further protects that data checksum with another checksum (the metadata checksum). If any data inconsistent with its checksum is detected, the system tries to use the redundant copy to repair the data.

Every time data scrubbing runs, the file system recalculates each file's checksum and compares it with the previously stored data checksum. Meanwhile, the data checksum is cross-checked against its corresponding metadata checksum to make sure the data checksum itself is intact. In other words, if the recalculated checksum does not match the data checksum, a cross-check with the metadata checksum follows to determine whether it is the file or the data checksum that is wrong. Once data corruption is detected, the system tries to repair the corrupt data by retrieving the redundant copy (RAID 5).
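The write-time checksum and scrub-time verification described above can be sketched as follows. The in-memory "store" and helper names are hypothetical, and a plain CRC32 stands in for the real on-disk checksums, which Btrfs keeps in its metadata trees.

```python
import zlib

def write_file(store: dict, name: str, data: bytes) -> None:
    # A data checksum is computed and stored at write time.
    store[name] = {"data": data, "checksum": zlib.crc32(data)}

def scrub_file(store: dict, name: str, redundant_copy: bytes) -> str:
    entry = store[name]
    if zlib.crc32(entry["data"]) != entry["checksum"]:
        # Mismatch: silent corruption detected. Repair from the
        # RAID redundant copy and restore the checksum.
        entry["data"] = redundant_copy
        entry["checksum"] = zlib.crc32(redundant_copy)
        return "repaired"
    return "ok"

store = {}
original = b"precious photo bytes"
write_file(store, "photo.jpg", original)

# Simulate bit rot: a single bit flips on disk (CRC32 is guaranteed
# to detect any single-bit error).
store["photo.jpg"]["data"] = bytes([original[0] ^ 0x01]) + original[1:]

print(scrub_file(store, "photo.jpg", original))  # prints "repaired"
print(scrub_file(store, "photo.jpg", original))  # prints "ok"
```

Unlike bare parity, the checksum pins down which copy is wrong, so the repair can pull the correct data from the redundant copy rather than propagating the corruption.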

One thing to note is that the Btrfs data checksum may take a toll on system performance. Enabling data checksum is not recommended for shared folders that store databases, virtual machines, or surveillance video recordings. If a shared folder only stores documents or photos, or is used for ordinary file access and sharing, you can rest easy, as the impact on performance is very modest.

Keeping data integrity risk at bay

Can’t decide which data scrubbing you should employ? No worries, you can have it both ways. Synology’s Data Scrubbing integrates Btrfs data scrubbing and RAID scrubbing to ensure data integrity. When you run data scrubbing on a Btrfs volume, file system data scrubbing is performed first to make sure the data is accurate, and RAID scrubbing follows to achieve data consistency. Together they mitigate the risk of silent data corruption and help you maintain a healthy storage system.