This blog covers my attendance at the QCon 2011 conference in San Francisco, California from November 16 to 18. QCon is a software development conference on many useful topics, which I will try my best to summarize in my posts. Please feel free to post your comments and get some conversation started. Enjoy!

Thursday, November 17, 2011

Reliability Engineering Matters, Except When It Doesn't

A presentation on Reliability Engineering (RE) and how it can be used for software systems. It was quite a confusing presentation, going full circle from "RE is good for many domains, including software" to "RE is very complex for software systems" to "but it is still a tool that you can use, just not an absolute one"...

Presentation content:

The host introduced the presenter's book, "Release It!" by Michael T. Nygard, as an excellent book on the subject.

Reliability Engineering (RE) involves a lot of maths, so the speaker tried to present concrete examples. What if the hotel bar lighting rack falls on happy drinkers (ex: Pascal)? We could analyse the supporting chain and everything else using statics and mechanics principles. This is fine, but it abstracts away other possibilities affecting reliability, such as earthquakes, the beam wearing out, drunk people occasionally hanging on it, etc. Or we can analyse this from another perspective: the rack held properly yesterday, today is a lot like yesterday, so the rack will be OK again today... But the point is that it will eventually break for sure, and this is what RE is about.

RE Maths

The presenter went through many mathematical models with hazard equations, fault density, etc. These come from the many disciplines that software RE draws on, in which the probability of failure increases with time; but software does not really wear out... so we cannot blindly apply those models.

By taking a single-server example and then a multiple-server example, he presented Reliability Graphs (http://en.wikipedia.org/wiki/Reliability_block_diagram), where there is a start node and an end node, and every path in between is a successful reliability path. A single path suffers a global system failure if any subsystem on it fails.
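To make the reliability graph idea concrete, here is a minimal sketch of the textbook series/parallel arithmetic. It was not shown in this form in the talk: the function names and the 0.99 figures are my own assumptions for illustration, and the whole thing assumes independent failures, which is exactly the assumption the speaker questions later.

```python
from math import prod

def series_reliability(reliabilities):
    """A single path works only if every block on it works."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """Redundant paths fail as a whole only if every path fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical numbers: three blocks in a row, each 99% reliable.
single_path = series_reliability([0.99, 0.99, 0.99])

# Two identical paths side by side, assumed fully independent.
two_paths = parallel_reliability([single_path, single_path])

print(f"single path:         {single_path:.6f}")
print(f"two redundant paths: {two_paths:.6f}")
```

Chaining blocks in series always lowers the result, while adding parallel paths raises it, but only as long as the paths really fail independently.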

3 Types of failures

Independent failure: Failure of one unit does not make another unit more likely to fail (excellent! but is that true for software systems? not likely)

Correlated failure: Failure of one unit makes another unit more likely to fail.

Common mode failure: Something else, external to the measured system, makes 2 redundant units likely to fail (ex: redundant LEDs in an Apollo capsule that were both subject to overheating)

Important Note for RE when applied to software systems:

Many software reliability analyses make the mistake of assuming perfect independence between duplicated resources. The last two types of failures are far more common.
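A small sketch of why the independence assumption matters so much. The model and the numbers are my own assumptions, not the speaker's: a redundant pair fails either because a common cause takes out both units, or because both units happen to fail on their own.

```python
def pair_failure_independent(p_unit):
    """Both units of a redundant pair fail, assuming perfect independence."""
    return p_unit ** 2

def pair_failure_with_common_mode(p_unit, p_common):
    """Either a common cause kills both units, or it does not
    and both units still fail independently."""
    return p_common + (1.0 - p_common) * p_unit ** 2

# Hypothetical probabilities, chosen only for illustration.
p_unit = 0.01      # chance a single unit fails
p_common = 0.001   # chance of a shared cause (e.g. overheating)

print(f"independent:      {pair_failure_independent(p_unit):.6f}")
print(f"with common mode: {pair_failure_with_common_mode(p_unit, p_common):.6f}")
```

With these made-up numbers, even a rare common cause dominates the result: the pair is roughly ten times more likely to fail than the independent analysis suggests.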

Blabla you should probably skip unless you know Andrey Markov (the mathematician)...

The speaker then went through more real-life examples, mainly to lead us to the various pitfalls of formal analysis with regard to software reliability. He showed that, because of load balancing algorithms for example, redundant systems are not independent: a failure on one certainly increases the probability of failure on the others (which receive the load of the failed ones in this example). Also, if you run a system with 9 servers, it is certainly because you need most of them to be alive for the system to work (otherwise, you overbuilt your system...); he introduced the concept of the minimum number of systems required to be alive out of the total number of systems deployed. Interestingly, one of the pitfalls of increasing reliability seems to be that failover mechanisms tend to bring their own failure paths with them...

In the end, he took quite a small system as an example (maybe 5 machines) and put the systems that are "invisible" in the logical system block diagram (ex: switches, routers, hard drives, etc.) into the probability of failure equation; the result was nothing less than spectacular (I can't write this equation down!). It certainly proved again that true independence rarely exists in software systems.

Also, unlike in the famous construction analogies, a failed software system can come back up. To handle that aspect, he introduced the work of the mathematician Andrey Markov and his Markov model (essentially state diagrams with state-change probabilities on each arc). Markov models are awfully complex for even the simplest systems...
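The "minimum number of systems alive out of the total deployed" idea is usually called a k-out-of-n model. Here is a minimal sketch using the binomial formula; the function name and the numbers are my own assumptions, and it assumes identical, independent servers, which (as noted above) load balancing breaks in practice.

```python
from math import comb

def k_out_of_n_reliability(k, n, r):
    """Probability that at least k of n identical, independent
    servers (each up with probability r) are alive."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# Hypothetical cluster: 9 servers, each up 99% of the time.
for needed in (9, 8, 7):
    print(f"need {needed} of 9: {k_out_of_n_reliability(needed, 9, 0.99):.6f}")
```

The fewer servers you truly need out of the 9, the higher the computed reliability; needing all 9 is the same as putting them in series.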

Limitations of RE with regards to software systems:

Intractable math (only a few systems have closed-form distributions like Gaussian; more often they have exponential (ex: a single server), log-normal (ex: multiplicative failures) or Weibull (ex: hardware) distributions). There are also the "repair" distributions, which could be modeled with a Poisson distribution. But in the end, which distributions do we apply to software systems? None: software fails based on load, not on time (as opposed to typical engineering disciplines)

Curse of Dimensionality (you should have seen the Markov model for the simple system)

Mean Time Between Failure (MTBF) is Bullshit! Note: See Google's analysis from a few years back on hard drive failures. It would be good to get this data for other devices (servers)

Other big killers:

human error (50-65% of all outages!)

interiority

distributed failure modes

Lack of independence between nodes and layers
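To give a feel for the Markov model and the curse of dimensionality mentioned in the list above, here is a rough sketch of the smallest possible case: one server that is either up or down. The per-step probabilities and function names are hypothetical, my own choices for illustration rather than anything from the talk.

```python
def steady_state_availability(p_fail, p_repair, steps=10_000):
    """Iterate a two-state (up/down) Markov chain and return the
    long-run probability of being in the 'up' state."""
    up, down = 1.0, 0.0  # start with the server up
    for _ in range(steps):
        up, down = (
            up * (1 - p_fail) + down * p_repair,    # stay up, or get repaired
            up * p_fail + down * (1 - p_repair),    # fail, or stay down
        )
    return up

def state_count(n_components):
    """Number of up/down combinations for n two-state components."""
    return 2 ** n_components

# Hypothetical per-step probabilities of failing and of being repaired.
print(f"availability: {steady_state_availability(0.01, 0.5):.5f}")
for n in (1, 5, 20):
    print(f"{n} components -> {state_count(n)} states")
```

For the two-state chain the iteration settles on p_repair / (p_fail + p_repair). The trouble is the second function: every extra component doubles the number of states, which is why the diagram for even a "simple" system gets out of hand so quickly.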

So in the end, should we abandon Reliability Engineering for software systems?

NO! Even if RE cannot tell you when your system is OK, it can tell you when it is not. Modeling reduces uncertainty. Use RE like any other model and apply it when your system is at risk.

This brings us back to the ambivalent presentation title: "Reliability Engineering Matters ... Except When It Doesn't".