Oracle Blog

Reflections on OS integration

ZFS and FMA

In this post I'll describe the interactions between ZFS and FMA (Fault Management Architecture). I'll cover the support that's present today, as well as what we're working on and where we're headed.

ZFS today (phase zero)

The FMA support in ZFS today is what we like to call "phase zero". It's basically the minimal amount of integration needed in order to leverage (arguably) the most useful feature in FMA: knowledge articles. One of the key FMA concepts is to present faults in a readable, consistent manner across all subsystems. Error messages are human readable, contain precise descriptions of what's wrong and how to fix it, and point the user to a website that can be updated more frequently with the latest information.

In this example, one of our disks has experienced multiple checksum errors (because I dd(1)ed over most of the disk), but they were automatically corrected thanks to the mirrored configuration. The error message describes exactly what happened (we self-healed the data from the other side of the mirror) and the impact (none - applications are unaffected). It also directs the user to the appropriate repair procedure, which is either to clear the errors (if they are not indicative of a hardware fault) or to replace the device.

It's worth noting here the ambiguity of the fault. We don't actually know if the errors are due to bad hardware, transient phenomena, or administrator error. More on this later.

ZFS tomorrow (phase one)

If we look at the implementation of 'zpool status', we see that there is no actual interaction with the FMA subsystem apart from the link to the knowledge article. There are no generated events or faults, and no diagnosis engine or agent to subscribe to the events. The implementation is entirely static, and contained within libzfs_status.c.

This is obviously not an ideal solution. It doesn't give the administrator any notification of errors as they occur, nor does it allow them to leverage other pieces of the FMA framework (such as upcoming SNMP trap support). This is going to be addressed in the near term by the first "real" phase of FMA integration. The goals of this phase, in addition to a number of other fault capabilities under the hood, are the following:

Event Generation - I/O errors and vdev transitions will result in true FMA ereports. These will be generated by the SPA and fed through the FMA framework for further analysis.

Simple Diagnosis Engine - An extremely dumb diagnosis engine will be provided to consume these ereports. It will not perform any sort of predictive analysis, but will be able to keep track of whether these errors have been seen before, and pass them off to the appropriate agent.

Syslog agent - The results from the DE will be fed to an agent that simply forwards the faults to syslog for the administrator's benefit. This will give the same messages as seen in 'zpool status' (with slightly less information) synchronous with a fault event. Future work to generate SNMP traps will allow the administrator to be emailed or paged, or to implement a poor man's hot spare system.
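
To make the flow of these three pieces concrete, here is a minimal sketch in Python. It is not the actual FMA or ZFS code: the Ereport fields, the class and function names, and the device names are all hypothetical, and the rule that only the first report for a given device and error class is forwarded is an assumption about what "seen before" might mean in practice.

```python
import logging
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ereport:
    """Hypothetical error report; field names are illustrative, not FMA's schema."""
    pool: str
    vdev: str
    error_class: str  # e.g. "checksum" or "io"


class SimpleDiagnosisEngine:
    """Counts reports per (pool, vdev, class); performs no predictive analysis."""

    def __init__(self, agent):
        self._counts = Counter()
        self._agent = agent  # callable invoked with the ereport

    def consume(self, ereport):
        key = (ereport.pool, ereport.vdev, ereport.error_class)
        self._counts[key] += 1
        count = self._counts[key]
        # Assumption: only the first report for a key is forwarded as a fault.
        if count == 1:
            self._agent(ereport)
        return count


def logging_agent(ereport):
    """Stand-in for the syslog agent: writes one line per new fault."""
    logging.getLogger("zfs-fault").warning(
        "pool %s: %s errors on device %s",
        ereport.pool, ereport.error_class, ereport.vdev)
```

Constructing SimpleDiagnosisEngine(logging_agent) and calling consume() once per report would log a message the first time a given device reports a given class of error and stay quiet on repeats, while still counting them.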

ZFS in the future (phase next)

Where we go from here is rather an open road. The careful observer will notice that ZFS never makes any determination that a device is faulted due to errors. If the device fails to reopen after an I/O error, we will mark it as faulted, but this only catches cases where a device has gotten completely out of whack. Even if your device is experiencing a hundred uncorrectable I/O errors per second, ZFS will continue along its merry way, notifying the administrator but otherwise doing nothing. This is not because we don't want to take the device offline; it's just that getting the behavior right is hard.

What we'd like to see is some kind of predictive analysis of the error rates, in an attempt to determine whether a device is truly damaged or whether it was just a random event. The diagnosis engines provided by FMA are designed for exactly this, though the hard part is determining the algorithms for making this distinction. ZFS is both a consumer of FMA faults (in order to take proactive action) and a producer of ereports (detecting checksum errors). To be done right, we need to harden all of our I/O drivers to generate proper FMA ereports, implement a generic I/O retire mechanism, and link it in with all the additional data from ZFS. We also want to gather SMART data from the drives to notice correctable errors fixed by the firmware, as well as doing experiments to determine the failure rates and pathologies of common storage drives.
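
As an illustration of the simplest form such analysis could take, the sketch below flags a device once more than a given number of errors fall within a sliding time window. This is only one possible heuristic, not the algorithm ZFS or FMA uses; the class name, the limit, and the window length are all hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class ErrorRateWindow:
    """Flags a device once more than `limit` errors fall within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._times = deque()

    def record(self, timestamp):
        """Record one error; return True if the recent error rate exceeds the limit."""
        self._times.append(timestamp)
        # Drop errors that have aged out of the window.
        while self._times and timestamp - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) > self.limit
```

A burst of errors trips the threshold, while the same number of errors spread far apart does not. Picking a limit and window that separate damaged hardware from transient events is exactly the hard part described above.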

As you can tell, this is not easy. I/O fault diagnosis is much more complicated than CPU and memory diagnosis. There are simply more components, as well as more chances for administrative error. But between FMA and ZFS, we have laid the groundwork for a truly fault-tolerant system capable of predictive self healing and automated recovery. Once we get past the initial phase above, we'll start to think about this in more detail, and make our ideas public as well.

This sounds very nice.
However, I think it would also be nice if 'zpool status' explicitly stated which devices were faulted.
Yes, you can see it in the CKSUM column, but an explicit mention of the faulted device would be more in the sense of human readable error messages.

I concur with Florian. Although you guys do provide an explanation that a non-zero device checksum count is bad, it would be better to display the device name immediately.

Put another way, how is the administrator going to navigate to the long URL? Do you expect them to type it into the browser? Is it possible to just grab the HTML from the URL and display it in a man page format?

'zpool online' to clear the errors seems really counter-intuitive when the device status says 'ONLINE' already. Is there a clash of terminology there, and would 'zpool clear' or similar not be better (or perhaps the status of the device should read 'FAULTED')?

You mention the poor man's hot spare system; in what "phase" might we see the rich man's version?
I'd have a much greater sense of security if I could set up hot spares for RAID-Z pools, or had some other way to make sure that a RAID-Z pool could survive the loss of more than one disk. Depending on humans to replace failed drives in a timely manner creates too many opportunities for human error.
Thanks for writing this blog - it's fascinating.