The Nature of Failure Modes

The Pathology of Failure

At 4:00:36 a.m. on March 28, 1979, the pumps feeding the secondary cooling loop of reactor number 2 at the Three Mile Island nuclear plant in central Pennsylvania shut down. An alarm sounds in the control room, but it is ignored because the backup pumps have automatically started.

The backup pumps sit on a 'D'-shaped section of the pipework. At the corners of this 'D' are bypass valves, which are normally open but were shut a day earlier so that technicians could perform routine maintenance on the Number 7 Polisher. The technicians completed the maintenance but forgot to reopen the valves, so the backup pumps are pumping vacuum instead of cooling water.

Because the primary cooling system can no longer shed heat into the secondary loop, its pressure rises, and a pilot-operated relief valve (PORV) on top of the reactor vessel opens automatically and begins to vent steam and water into a drain tank on the floor of the containment building.

Nine seconds have elapsed since the pumps failed, and now control rods of silver, indium, and cadmium are automatically lowered into the reactor core to slow the reaction. In the control room the indicator light for the PORV turns off, but the valve is still open: its failure mode is to fail open, just as a fire-escape door is always hung to swing outward from the jamb.

Water as well as steam now vents from the PORV, a condition known as a Loss of Coolant Accident. Two minutes into the accident, Emergency Injection Water (EIW) automatically activates to replace the lost coolant.

The human operators see that the EIW has turned on, but they believe the PORV is shut and that pressure is decreasing, so they switch off the EIW.

At the eight-minute mark, an operator notices that the bypass valves of the secondary cooling loop are closed, and he opens them. Gauges in the control room falsely report that the water level is high, when in fact it has been dropping.

An hour and 20 minutes into the accident, the pumps on the primary cooling loop begin to shake as steam is forced through them. An operator takes this to mean the pumps are malfunctioning, so he shuts off half of them. These are the last two that were still operating, so now there is no circulation of cooling water in the core at all, and the water level drops until the top of the core is exposed.

Superheated steam reacts with the zirconium alloy cladding of the fuel rods, producing hydrogen gas that escapes through the PORV.

At two hours and 45 minutes, a radiation alarm sounds, a site emergency is declared, and all non-essential staff are evacuated. Half of the core is now exposed, but the operators don't know it, and assume the temperature readings are erroneous.

Seven and a half hours into the accident, the operators decide to pump water into the primary loop and open a backup relief valve to lower pressure.

Nine hours in, and the hydrogen in the containment building explodes. The operators hear it as a dull thump and assume it was a ventilator damper.

Fifteen hours in, and the primary loop pumps are turned back on. Half the core has melted, but now that water is circulating the core temperature is finally brought back under control.

But even if the operators had done nothing at all, Three Mile Island had an inherently high-quality failure mode: it was a negative void coefficient reactor. This meant that as steam increased (voids), the nuclear reaction decreased (negative coefficient).

One frosty Monday morning you yawn and step out of your front door in your pajamas to pick up the newspaper, and as you bend down a gust of icy wind blows the front door shut--locking you out. You are now in a failure mode.

At work, you've created a service that half the company uses, and on the weekend you go camping, the service crashes, and nobody knows how to restart it. Half of the company is now in a failure mode.

A failure mode is a degradation of quality, but it is not yet a catastrophe. In your personal life and in your work, you should always think about what kind of quality you'll be limping along with if some component or assumption were to fail. If you find that quality is unpalatable, then it's time to go back to the drawing board and try again.

Fault recovery and insurance policies

Back at home--cold, shivering, and with the neighbors watching--you grope around under the doormat and find the spare key you've hidden there. You've now recovered from the fault and exited the failure mode.

And at the office, your service is being monitored by some kind of watchdog process that detects the crash and restarts the service.
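A watchdog can be sketched in a few lines of Python. This is a minimal illustration, not a production supervisor (systemd, runit, or supervisord also handle logging and alerting); the command and the limits are invented for the example.

```python
import subprocess
import sys
import time

# Stand-in for a real service binary: a process that exits immediately.
SERVICE_CMD = [sys.executable, "-c", "pass"]

def watchdog(cmd, max_restarts=3, backoff_seconds=0.1):
    """Re-launch `cmd` each time it exits, up to max_restarts times.

    Returns how many times the service was (re)started.
    """
    starts = 0
    while starts < max_restarts:
        proc = subprocess.Popen(cmd)
        proc.wait()                  # block until the service exits
        starts += 1
        time.sleep(backoff_seconds)  # avoid a tight crash loop
    return starts
```

Note that once `max_restarts` is exhausted, the watchdog itself gives up, which is exactly the point made below: the insurance policy is just another component that can fail.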

The spare key and the watchdog process are insurance policies, but they should not be confused with the quality of your failure mode. You enter failure mode when a component fails, but insurance policies are just another kind of component. Maybe the spare key wasn't there this time, or maybe the watchdog itself crashed. Either way, you're still in a condition of degraded quality.

Having an insurance policy doesn't absolve you from engineering as much quality as possible into each level of failure mode. All that you've done is create more possible failure modes.

Engineering quality into each failure mode

Having learned your lesson, you now either get dressed and put on a coat before going outside to fetch the paper, or you switch to an online subscription, or you defer reading the newspaper until after you're dressed and on the train for work. You've now either eliminated or changed the quality of the failure mode to something more acceptable.

At work it's not likely to be so simple, but consider some of the general techniques below as a starting point.

The component increases the accuracy of data

Address correction services, for example, can identify addresses with missing apartment numbers, incorrect zip codes, and so on. They can help you cut back on reshipping costs and the penalty fees charged by UPS for false or inaccurate data. If this service fails, you'll want your software to transparently time out and pass the address through as-is, and the business will simply cope with higher shipping costs until the service is restored.
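The pass-through fallback might look like the following sketch, where `call_correction_service` is a hypothetical stand-in for the vendor API (here it simulates an outage so the fallback path runs):

```python
def call_correction_service(address, timeout):
    # Stand-in for a real vendor API call; here it simulates an
    # outage so the fallback path below is exercised.
    raise ConnectionError("correction service unreachable")

def corrected_address(address, timeout_seconds=2.0):
    """Ask the correction service to clean up `address`; on any
    timeout or network failure, pass the address through as-is so
    the order can still ship."""
    try:
        return call_correction_service(address, timeout=timeout_seconds)
    except OSError:    # covers timeouts and connection failures
        return address  # degraded quality, not a catastrophe
```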

The component updates a database

If the database is down then you can't take any new orders. Consider submitting database updates through a queue that can hold messages on the client machine when the server is unavailable. This changes the quality of the failure from a complete inability to take orders to an inability to fulfill orders until the service is restored. Your customers can probably tolerate a delay in fulfillment, but they'll go elsewhere if they can't even submit an order at all.
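A store-and-forward queue of this kind can be sketched as below; `db_execute` is a hypothetical stand-in for the real database call, and a production version would persist the backlog to disk rather than hold it in memory:

```python
from collections import deque

class BufferedWriter:
    """Queue updates locally when the database is unavailable and
    flush the backlog, in order, once it comes back."""

    def __init__(self, db_execute):
        self.db_execute = db_execute
        self.pending = deque()

    def submit(self, order):
        self.pending.append(order)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.db_execute(self.pending[0])
            except ConnectionError:
                return  # server still down; keep the backlog
            self.pending.popleft()
```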

The component queries a database

There are two strategies, which may be used individually or combined:

Use a REST-style web service and place a caching proxy such as Squid in front of it, giving the proxy plenty of memory to cache data in. If the database fails, the proxy will be able to continue serving popular requests.

Break up your data into logical groupings with their own separate services, such as product description, inventory, customer recommendations, reviews, photography, etc. Design the client application to display as much as it can get and transparently omit what doesn't respond in time. This is the strategy Amazon.com uses to build product pages: if the "customers also bought" service isn't responding, the page is built anyway without that section.
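The second strategy can be sketched as a parallel fan-out with a deadline; the section names and fetch functions are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def build_product_page(sections, deadline_seconds=0.5):
    """Call each section's service in parallel and assemble whatever
    responds in time; slow or failing sections are silently omitted.

    `sections` maps a section name to a zero-argument fetch function.
    """
    page = {}
    with ThreadPoolExecutor(max_workers=len(sections)) as pool:
        futures = {name: pool.submit(fetch)
                   for name, fetch in sections.items()}
        for name, future in futures.items():
            try:
                page[name] = future.result(timeout=deadline_seconds)
            except Exception:
                continue  # omit the section rather than fail the page
    return page
```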

The component validates a transaction

An example is checking credit card numbers for validity and sufficient funds, which depends on the success of both your code and the gateway service you're paying for. Some systems are set up to refuse a transaction if they're unable to validate it, even if the transaction would otherwise be valid. You need to design your system so that validation takes place after a transaction has been received and recorded, and so that the transaction is considered "on hold" until validation is performed.
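A record-first, validate-second flow might look like this sketch using SQLite; the schema and the `validate` callback (standing in for the payment gateway) are invented for the example:

```python
import sqlite3

def new_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders"
               " (id INTEGER PRIMARY KEY, card TEXT,"
               "  amount REAL, status TEXT)")
    return db

def take_order(db, card_number, amount, validate):
    """Record the order first, marked 'on_hold', and only then attempt
    validation. If the gateway is down the order stays on hold to be
    validated later, instead of being refused outright."""
    cur = db.execute(
        "INSERT INTO orders (card, amount, status)"
        " VALUES (?, ?, 'on_hold')", (card_number, amount))
    order_id = cur.lastrowid
    try:
        status = "accepted" if validate(card_number, amount) else "declined"
        db.execute("UPDATE orders SET status = ? WHERE id = ?",
                   (status, order_id))
    except ConnectionError:
        status = "on_hold"  # gateway unreachable: retry validation later
    db.commit()
    return order_id, status
```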

Two or more changes must happen together

Transferring money between two accounts is the canonical example of why transactions are important: if a fault happens after you've decreased the balance of the sender but before you've increased the balance of the receiver, the money disappears. Databases support transactions, but now so do filesystems, queues, and even memory. If your platform supports distributed transactions, you can tie everything together: pulling a message off a queue, updating multiple tables in multiple databases, and writing to an audit log file can all happen in the same transaction, so that if the final step of writing to the audit log fails, everything is rewound all the way back to the queue for the chance to try again. If the operation continues to fail, you'll still have a system in a consistent state.
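The canonical transfer looks like this sketch against a single SQLite database (a distributed transaction across queues and multiple databases needs a coordinator, which this deliberately omits); the schema is invented for the example:

```python
import sqlite3

def transfer(db, sender, receiver, amount):
    """Move money between accounts atomically: if anything fails
    between the debit and the credit, the whole change is rolled
    back and the money never disappears."""
    try:
        with db:  # connection as context manager = one transaction
            db.execute("UPDATE accounts SET balance = balance - ?"
                       " WHERE name = ?", (amount, sender))
            row = db.execute("SELECT balance FROM accounts"
                             " WHERE name = ?", (sender,)).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            db.execute("UPDATE accounts SET balance = balance + ?"
                       " WHERE name = ?", (amount, receiver))
    except ValueError:
        return False  # rolled back: both balances unchanged
    return True
```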

Changes should be retroactive

This morning you get a notification that a Big Box Store wants you to ship large orders with "signature required" service, and to do that they're changing part of the EDI spec to contain a new boolean value. The problem is that the notification was sent a month ago, and nobody remembered to forward it to the IT department until the change had already gone into effect. If you were deleting their EDI files after importing them into the database, you'd be in hot water. Even with backups to restore from it would still be awkward, because now you've got a thousand workflows in various stages that have to be changed.

To solve this, you'll want a system that records changes to a workflow's state but doesn't actually change any data. The workflow has to begin with the EDI document and use it as the bible all the way to the end. You pull the minimum details into your database for tracking and indexing, and build an abstraction layer to query the original document for everything that isn't performance-sensitive (such as getting the spec for the shipping label). Now when the document's format changes, you only have to change the abstraction layer.
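The abstraction layer might look like this sketch; the field names are invented, and JSON stands in for a real EDI parser:

```python
import json

class OrderDocument:
    """Thin abstraction over the retailer's original document. The raw
    document is kept verbatim as the source of truth; only the
    accessors know its layout, so a spec change (like the new
    signature-required flag) is absorbed here instead of in a
    thousand workflows."""

    def __init__(self, raw):
        self.raw = raw              # stored verbatim, never modified
        self.doc = json.loads(raw)  # a real system would parse EDI

    def ship_to(self):
        return self.doc["ship_to"]

    def signature_required(self):
        # Field added retroactively by the retailer; default to
        # False for documents that predate the change.
        return self.doc.get("signature_required", False)
```

Because the raw document is retained, documents imported before the spec change answer the new question correctly too, with no reprocessing of in-flight workflows.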

Every system has failure modes, all the way from the trivialities of your personal life to the global economy, and the truth is that we are always operating in at least one failure mode. My car's suspension needs work; we just lost an employee who walked out with a lot of unwritten knowledge; access to credit has dried up and the economy is shrinking. And yet my car still runs, the company is still in business, and we can still buy and sell things. We cope with the degradation of quality and work to improve it somehow.