ZFS Failure Modes

As a combined file system and volume manager, ZFS can exhibit many different
failure modes. This chapter begins by outlining the various failure modes,
then discusses how to identify them on a running system, and concludes by
discussing how to repair the problems. ZFS can encounter three basic types
of errors:

Missing devices

Damaged devices

Corrupted data

Note that a single pool can experience all three errors, so a complete
repair procedure involves finding and correcting one error, proceeding to
the next error, and so on.

Missing Devices in a ZFS Storage Pool

If a device is completely removed from the system, ZFS detects that
the device cannot be opened and places it in the UNAVAIL state. Depending
on the data replication level of the pool, this might or might not result
in the entire pool becoming unavailable. If one disk in a mirrored or RAID-Z
device is removed, the pool continues to be accessible. If all components
of a mirror are removed, if more than one device in a RAID-Z (raidz1)
device is removed, or if a single-disk, top-level device is removed, the pool
becomes FAULTED. No data is accessible until the device
is reattached.
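
The rules above can be sketched as a small Python model. This is only an illustration of the paragraph, not ZFS code: the function names, the vdev labels ("disk", "mirror", "raidz1"), and the "ACCESSIBLE" result string are assumptions made for the sketch, and it covers single-parity RAID-Z only.

```python
def vdev_accessible(kind: str, total: int, missing: int) -> bool:
    """Return True if a top-level device can still serve data.

    kind is one of "disk", "mirror", or "raidz1" (hypothetical labels).
    total is the number of component devices; missing is how many are gone.
    """
    if missing < 0 or missing > total:
        raise ValueError("missing must be between 0 and total")
    if kind == "disk":
        # A single-disk, top-level device has no redundancy.
        return missing == 0
    if kind == "mirror":
        # A mirror survives as long as at least one component remains.
        return missing < total
    if kind == "raidz1":
        # Single-parity RAID-Z tolerates the loss of one device.
        return missing <= 1
    raise ValueError(f"unknown vdev kind: {kind}")


def pool_state(vdevs) -> str:
    """A pool stays accessible only if every top-level device does."""
    if all(vdev_accessible(kind, total, missing) for kind, total, missing in vdevs):
        return "ACCESSIBLE"
    return "FAULTED"


if __name__ == "__main__":
    print(pool_state([("mirror", 2, 1)]))                    # one side of a mirror removed
    print(pool_state([("raidz1", 5, 2)]))                    # two devices of a raidz1 removed
    print(pool_state([("mirror", 2, 0), ("disk", 1, 1)]))    # single-disk device removed
```

The model treats the pool as faulted as soon as any one top-level device is lost, which matches the cases listed above; it does not attempt to represent intermediate health states.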

Damaged Devices in a ZFS Storage Pool

The term “damaged” covers a wide variety of possible errors.
Examples include the following errors:

Transient I/O errors due to a bad disk or controller

On-disk data corruption due to cosmic rays

Driver bugs resulting in data being transferred to or from
the wrong location

Another user overwriting portions of the physical device by accident

In some cases, these errors are transient, such as a random I/O error
while the controller is having problems. In other cases, the damage is permanent,
such as on-disk corruption. Even so, permanent damage does not necessarily
indicate that the error is likely to occur again. For example,
if an administrator accidentally overwrites part of a disk, no type of hardware
failure has occurred, and the device need not be replaced. Identifying exactly
what went wrong with a device is not an easy task and is covered in more detail
in a later section.

Corrupted ZFS Data

Data corruption occurs when one or more device errors (indicating missing
or damaged devices) affect a top-level virtual device. For example, one half
of a mirror can experience thousands of device errors without ever causing
data corruption. If an error is encountered on the other side of the mirror
in the exact same location, the result is corrupted data.
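
The mirror example can be illustrated with a short Python sketch. It assumes a two-way mirror and represents each side's device errors as a collection of block offsets; the function name and the offsets are hypothetical and chosen only for illustration.

```python
def corrupted_offsets(side_a_errors, side_b_errors):
    """Return the offsets at which both halves of a two-way mirror are bad.

    An error on only one side is a device error, because the other side
    still holds a good copy. Data is corrupted only where the two sets
    of errors overlap.
    """
    return sorted(set(side_a_errors) & set(side_b_errors))


if __name__ == "__main__":
    side_a = range(0, 5000)          # thousands of errors on one half
    print(corrupted_offsets(side_a, []))        # other half is clean
    print(corrupted_offsets(side_a, [9000]))    # error elsewhere on the other half
    print(corrupted_offsets(side_a, [4096]))    # error at the same location
```

Only the last call reports a corrupted offset, because it is the only case in which no good copy of the block remains.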

Data corruption is always permanent and requires special consideration
during repair. Even if the underlying devices are repaired or replaced, the
original data is lost forever. Most often this scenario requires restoring
data from backups. Data errors are recorded as they are encountered, and can
be controlled through routine disk scrubbing as explained in the following
section. When a corrupted block is removed, the next scrubbing pass recognizes
that the corruption is no longer present and removes any trace of the error
from the system.
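
The way recorded data errors appear and disappear can be pictured with a toy Python model. It is a sketch of the behavior described above, not of how ZFS stores its error log: the class, its method names, and the block numbers are all assumptions made for this example.

```python
class ErrorLog:
    """Toy model of a pool's record of data errors."""

    def __init__(self):
        self.errors = set()

    def record(self, block):
        """Record a data error at the moment it is encountered."""
        self.errors.add(block)

    def scrub(self, still_corrupt):
        """Model a scrubbing pass.

        still_corrupt is the set of blocks that exist and are corrupted
        at scrub time. After the pass, the log reflects exactly that set,
        so an error whose block has been removed leaves no trace.
        """
        self.errors = set(still_corrupt)
        return sorted(self.errors)


if __name__ == "__main__":
    log = ErrorLog()
    log.record(7)
    log.record(42)
    print(log.scrub({7, 42}))   # both blocks are still corrupted
    print(log.scrub({42}))      # block 7 was removed before this pass
```

In this model, removing the corrupted block and running another pass is what clears the entry; repairing the underlying device alone does not bring the original data back.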