Monday, 23 May 2011

Few developers consider, when trying to build robust platforms, all the possible modes of failure. Indeed, it is difficult to consider them all, let alone plan for them, or design tests which exercise particular symptoms.

In this post, I discuss some of the types of failure we can see in real systems.

Complete server failure

Most developers DO consider this. In a "Complete server failure", what generally happens is:

* The server stops processing new requests, completely.
* The server's OS no longer responds to any network request at all (e.g. "ping").
* Processing does not continue within the server.
* The contents of memory are immediately and irretrievably lost.

Typically, the server recovers: it is rebooted and restored to full health. All writes which were acknowledged before its failure have been persisted.
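The assumption that acknowledged writes survive only holds if the software waits for the data to reach stable storage before acknowledging. A minimal sketch of that ordering follows, assuming a simple append-only file; the function name `durable_append` is hypothetical, and real storage engines do considerably more (for example, syncing the containing directory when a file is first created).

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append record to path, returning only after it has been synced to disc.

    The caller should send its acknowledgement only after this returns.
    """
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device
```

If the power is cut before `os.fsync` returns, the write was never acknowledged, so losing it does not break the promise above.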

This is very easy to simulate (just hit the "power off" button in your VM hypervisor) and fairly easy to plan for; most robust systems consider this kind of scenario.

Network failure

There are many different kinds of network failure, but consider the simplest, most severe network failure:

* One or more machines in the infrastructure lose network connectivity.
* None of them can talk to anything at all, including each other.
* Local processing on these servers continues as normal.
* No machines need to be rebooted to fix the fault; once it is repaired, everything is back to normal.

This is what you might see when, for example, a switch fails completely.

I won't discuss the other kinds of network failure here, but there are many. My experience suggests that the most common is partial or complete loss of internet connectivity from one location (datacentre).

IO subsystem failures

* One or more discs / volumes suddenly become unavailable.
* The OS does not reboot; processes do not stop.

These are the kinds of failures which developers typically don't consider, and they are a lot more difficult to simulate. For example, the power might fail for a disc enclosure unit but not for its host server. In this case the OS and its boot discs remain available, but the data discs do not. Failover might then not be triggered at all, or might behave incorrectly.
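One way to notice this kind of fault is for the health check to exercise the data volume itself, rather than only confirming that the process is still running. The sketch below is an illustration under assumptions, not a complete solution: the function name `volume_is_writable` is hypothetical, and the path passed in is whatever directory your data lives on.

```python
import os
import tempfile


def volume_is_writable(path: str) -> bool:
    """Return True if a small file can be written, synced and read back under path."""
    payload = b"probe"
    try:
        # Create the probe file on the volume being checked, not in /tmp.
        fd, name = tempfile.mkstemp(prefix=".probe-", dir=path)
    except OSError:
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the device
        with open(name, "rb") as f:
            return f.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.unlink(name)
        except OSError:
            pass  # the volume may have vanished between write and cleanup
```

Note that a failed disc does not always return an error promptly; the calls above can simply hang. A probe like this is best run under a deadline, so that "no answer" is also treated as a failure.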

Heavy load or unexpected poor performance

* A single server unexpectedly starts performing very badly.
* In the extreme, it is left without sufficient capacity to do useful work.
* But it has not failed; no subsystem is individually totally unavailable.
* Sometimes the effect is severe enough to prevent operations engineers logging in to diagnose / fix the fault.
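A health check that only asks "did it answer?" will report a server in this state as healthy. The sketch below treats a probe that misses its deadline as its own kind of result, distinct from both success and outright failure. The names `classify_health`, `slow_probe` and `broken_probe` are hypothetical, and the probe would in practice be a representative request against the real service.

```python
import concurrent.futures
import time
from typing import Callable


def classify_health(probe: Callable[[], object], deadline: float) -> str:
    """Run probe once and return "ok", "too_slow" or "failed"."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(probe)
        try:
            future.result(timeout=deadline)
        except concurrent.futures.TimeoutError:
            return "too_slow"  # still alive, but not doing useful work in time
        except Exception:
            return "failed"    # the probe itself raised an error
        return "ok"
    finally:
        # Do not wait for a stuck probe; its thread finishes in the background.
        pool.shutdown(wait=False)


def slow_probe() -> None:
    """Stand-in for a request to an overloaded server."""
    time.sleep(0.3)


def broken_probe() -> None:
    """Stand-in for a request to a server that refuses connections."""
    raise ConnectionError("refused")
```

What to do with a "too_slow" result is a policy decision: failing over too eagerly on slowness can make an overload worse, so some systems require several consecutive slow results first.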

These kinds of faults usually cause a larger problem, because failover systems aren't triggered, or cannot take over in a timely fashion. Common causes can be

"Zombie" systems or, back from the dead

* A system fails in a catastrophic way and can't be remotely recovered.
* Operations engineers assume that it's going to be completely dead until physically replaced (they are some distance away and don't raise a "remote hands" request, or are unable to recover it by doing so).
* Another system is provisioned in its place, and takes over its IP address, role etc.
* Then one day... the "Zombie" system unexpectedly comes back from the dead to haunt its successor ... Brraaainss....

Of course this could be months later, after many software updates (possibly security updates). The "zombie" system is running an old build and will not carry out correct processing if it is given work to do.
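One common defence, not described in this post but widely used, is to give each owner of a role a generation number and to refuse work from anything holding an older one. The sketch below assumes a hypothetical `RoleRegistry` that lives somewhere the zombie cannot overwrite (a coordination service or database, say); here it is just an in-memory dictionary to show the logic.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RoleRegistry:
    """Tracks which generation currently owns each role (illustrative only)."""

    current: Dict[str, int] = field(default_factory=dict)

    def promote(self, role: str) -> int:
        """Hand the role to a new owner and return that owner's generation."""
        self.current[role] = self.current.get(role, 0) + 1
        return self.current[role]

    def may_work(self, role: str, generation: int) -> bool:
        """Only the most recently promoted owner of a role may accept work."""
        return self.current.get(role) == generation


registry = RoleRegistry()
original = registry.promote("db-primary")     # the machine that later dies
replacement = registry.promote("db-primary")  # provisioned in its place
```

When the original machine comes back months later still holding its old generation, `registry.may_work("db-primary", original)` is false, so it can be turned away before it processes anything with its outdated build.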

Conclusion

These are just a few of the annoying types of failures which happen to real systems in production. Expect the unexpected (as if that's not a contradiction!).