failure probability and clusters

A high-availability cluster of two nodes is generally configured so that if one node fails the other takes over. The surviving node uses some common operation (such as accessing a shared storage device or pinging a router) to determine that the other node is dead and that it’s not merely a networking problem. Therefore if you lose one node, the system keeps operating until you lose the other.

When you run a three-node cluster the general configuration is that a majority of nodes is required. So if the cluster is partitioned, one node on its own will shut down all services while two nodes that can talk to each other will continue operating as normal. This means that to lose the cluster you need to lose all inter-node communication or have two nodes fail.

Let N be the probability of a node surviving for the time interval required to repair a node that has already died (N is a number between 0 and 1, where 1 means it is certain to survive and 0 means it is certain to fail). For a two node cluster the probability of the second node surviving long enough for the dead node to be fixed is N. For a three node cluster both of the two surviving nodes must stay up, and the probability of that is N^2. Since N is less than 1, N^2 is smaller than N, therefore a three node cluster is more likely to experience a critical second failure than a two node cluster.

For a four node cluster you need three active nodes to have quorum. So after one node has died, none of the three remaining nodes may fail, and the probability of that is N^3 – even worse again!
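To put numbers on the two, three and four node cases, here is a minimal sketch. It assumes node failures are independent, and the function name survive_all is just something I made up for illustration. It uses awk for the arithmetic.

```shell
# Probability that every one of the remaining nodes survives the repair window.
# $1 = number of remaining nodes that must all stay up
# $2 = N, the per-node survival probability from the text
# (survive_all is a hypothetical helper name, not part of any cluster tool)
survive_all() {
    awk -v k="$1" -v n="$2" 'BEGIN { printf "%.4f\n", n ^ k }'
}

for n in 0.9 0.99; do
    echo "N=$n two-node=$(survive_all 1 "$n") three-node=$(survive_all 2 "$n") four-node=$(survive_all 3 "$n")"
done
```

With N=0.9 this gives 0.9000, 0.8100 and 0.7290 for the two, three and four node clusters respectively.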

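For a five node cluster you can lose two nodes without losing the cluster. If you have already lost a node, the cluster survives as long as at least three of the remaining four nodes stay up. The probability of that is N^4+(1-N)*N^3*4: N^4 for all four surviving, plus (1-N)*N^3*4 for exactly one of the four failing. As long as N is greater than 0.8 (the break-even point is about 0.77) the probability of keeping three nodes out of four is greater than N, the probability of a single node not failing.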
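The break-even point can be checked by evaluating the five node expression for a few values of N and comparing each result with N itself. This is only a sketch under the same independence assumption, and five_node is a name I picked for illustration.

```shell
# Probability that at least three of the four remaining nodes survive,
# i.e. N^4 + 4*(1-N)*N^3 from the text.
# $1 = N, the per-node survival probability
# (five_node is a hypothetical helper name)
five_node() {
    awk -v n="$1" 'BEGIN { printf "%.4f\n", n^4 + 4 * (1 - n) * n^3 }'
}

# Compare each result with N: below the break-even point the five node
# cluster does worse than a single surviving node, above it does better.
for n in 0.70 0.75 0.80 0.90 0.99; do
    echo "N=$n five-node=$(five_node "$n")"
done
```

At N=0.75 the result (0.7383) is still below N, while at N=0.80 it is 0.8192 and at N=0.90 it is 0.9477.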

To see the probabilities of five and four node clusters avoiding a catastrophic failure after one node has died, run the following shell script with N set to different values (0.9 and 0.99 are reasonable values to try). The first number printed is for five nodes and the second is for four; subtract either from 1 to get the probability of losing the cluster. You might hope that the probability of a second node remaining online while the first node is being repaired is significantly higher than 0.9. However, the first node’s failure might have been partially caused by the ambient temperature, power supply problems, vibration, or other factors that affect multiple nodes, so I don’t think it’s impossible for the probability to be as low as 0.9.

N=0.9
echo "$N^4+(1-$N)*$N^3*4" | bc -l ; echo "$N^3" | bc -l

So it seems that if reliability is your aim in having a cluster then your options are two nodes (if you can be certain of avoiding split-brain) or five nodes. Six nodes is not a good option: a six node cluster needs four nodes for quorum, so like a five node cluster it fails when three nodes are lost, and the probability of losing three nodes out of six is greater than the probability of losing three nodes out of five. Seven and nine node clusters would also be reasonable options.
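The comparison between cluster sizes can be generalised. The sketch below assumes a strict majority quorum, independent node failures, and that one node has already died. It sums the binomial probabilities of enough remaining nodes surviving. The name quorum_survival is hypothetical, and the function does not apply to the two node case, which relies on a tie-breaker rather than a majority.

```shell
# Probability that a majority-quorum cluster keeps quorum after one node
# has already died.
# $1 = total cluster size (3 or more), $2 = N, per-node survival probability
# (quorum_survival is a hypothetical helper name)
quorum_survival() {
    awk -v size="$1" -v n="$2" 'BEGIN {
        remaining = size - 1          # one node is already dead
        quorum = int(size / 2) + 1    # strict majority of the full cluster
        p = 0
        for (k = quorum; k <= remaining; k++) {
            # binomial coefficient C(remaining, k), built up step by step
            c = 1
            for (i = 1; i <= k; i++)
                c = c * (remaining - k + i) / i
            # exactly k of the remaining nodes survive
            p += c * n^k * (1 - n)^(remaining - k)
        }
        printf "%.4f\n", p
    }'
}

for size in 3 4 5 6 7 9; do
    echo "$size nodes: $(quorum_survival "$size" 0.9)"
done
```

With N=0.9 the five node cluster comes out at 0.9477 and the six node cluster at 0.9185, which matches the point above that six nodes is worse than five.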

But it’s not surprising that a Google search for “five node” cluster high-availability gives about a tenth as many results as a search for “four node” cluster high-availability. Most people in the computer industry like powers of two more than they like maths.