Can defective software be safe or secure?

Let’s distinguish between systems where the hazards of failure are material (critical) and those where they are not. If a system’s bugs don’t matter, they don’t matter. Since the question (posed in a LinkedIn forum) asks about safety and security, we’re talking about critical systems.

There is a long-standing debate in reliability engineering about the relationship between latent defects and field reliability. Reliability is based on observed failures, which are determined by usage — if defective code is never executed, it cannot fail. Some therefore argue that “debug testing,” which aims to reveal defects, can be replaced with testing that aims to accurately mimic anticipated field usage. Then, once sufficient reliability is observed in test, you can release with confidence that field failures will be tolerably infrequent.
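As a toy illustration of why this works as a reliability argument but not a safety one (all names and rates here are hypothetical), usage-profile testing samples inputs at the field rate, so a defect on a rarely executed path can pass thousands of tests undetected:

```python
import random

# Toy model of usage-profile testing: a latent defect fires only on a rare
# input class, so each profile-driven test triggers it at the field rate.
def runs_without_failure(n_tests: int, trigger_rate: float,
                         rng: random.Random) -> bool:
    """True if all n profile-driven tests pass despite the latent defect."""
    return all(rng.random() >= trigger_rate for _ in range(n_tests))

rng = random.Random()
# With a 1-in-10,000 trigger rate, a few thousand tests usually all pass —
# yet the defect is still there, waiting for field usage to find it.
print(runs_without_failure(3_000, 1e-4, rng))
```

The defect’s absence from the test record says nothing about its absence from the code — only about how rarely the profile exercises it.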

From a purely probabilistic perspective, this makes sense. As a testing strategy, it maximizes field reliability and minimizes testing cost. But from a safety perspective, it is reckless. Only a fool willingly plays Russian roulette with a one-in-six chance of losing. At what odds would you willingly play? One in 10,000?
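Those odds also compound with exposure. A sketch of the arithmetic (the figures are illustrative, not drawn from any real system):

```python
# Probability of at least one failure across n independent demands,
# given a per-demand failure probability p (illustrative numbers only).
def prob_at_least_one_failure(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# 1-in-10,000 per demand looks tolerable in isolation...
single = prob_at_least_one_failure(1e-4, 1)
# ...but over 100,000 demands, at least one failure is nearly certain.
lifetime = prob_at_least_one_failure(1e-4, 100_000)
print(single, lifetime)
```

A per-use risk that sounds acceptable becomes a near-certainty over a deployed system’s lifetime of demands.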

We never really know what damage a defect will do until it is actually triggered. One of the weirder characteristics of software is its sensitivity to seemingly trivial errors – one wrong bit in a billion has been enough to crash and burn. Contrast that to mechanical or structural systems, where comparable defects of impurity or tolerance variation are nearly always irrelevant. And, although many defects are “cosmetic,” they are often indicative of systemic development sloppiness. As Tom DeMarco observed, when you see a roach scurry across your table at a restaurant, you don’t say, ‘whew – there goes THE roach.’
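The single-bit point is easy to demonstrate (a constructed example, not the incident alluded to above): flip one bit of a 64-bit float’s IEEE-754 encoding and 1.0 becomes infinity.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

# Flipping bit 62 (the top exponent bit) of 1.0 yields infinity:
print(flip_bit(1.0, 62))  # prints inf
```

No mechanical analogue exists: a one-part-per-billion impurity in a beam does not turn its stiffness infinite.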

Security failures often result from exploits of bugs and blunders that are not triggered under normal usage, so the consequences of “trivial” or “cosmetic” defects can become severe with sufficiently malicious provocation.
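A Heartbleed-style toy (entirely hypothetical data and names, greatly simplified) shows how a bug that benign usage never notices becomes an exploit under malicious input:

```python
# Simplified echo service: it returns `length` bytes of its buffer, trusting
# a length supplied by the client. Bug: the length is never validated, and
# the payload sits in a buffer adjacent to sensitive data.
MEMORY = bytearray(b"PAYLOAD!" + b"secret-session-key")

def echo(offset: int, length: int) -> bytes:
    return bytes(MEMORY[offset:offset + length])

print(echo(0, 8))   # normal request: returns just the 8-byte payload
print(echo(0, 26))  # crafted request: leaks the adjacent secret too
```

Under the anticipated usage profile the bug is invisible; it takes a deliberately malformed request to turn it into a disclosure.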

And, as complexity grows, so does the likelihood of “emergent behavior” — the production of completely unanticipated (and usually undesirable) behavior from nominally correct implementations under stress. The “Flash Crash” outages in equity markets are an example. My multi-dimensional testing approach targets exactly these failure modes. It is based on Didier Sornette’s “Dragon Kings” — a notion of emergent behavior as a kind of positive feedback that produces catastrophic failures.
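A toy positive-feedback model (my own construction for illustration, not Sornette’s) shows the mechanism: each forced sale deepens the price drop, which forces further sales, so a shock past the tipping point runs away while a smaller one damps out:

```python
# Toy cascade: each agent sells once the price drop passes its threshold,
# and each sale pushes the drop further (hypothetical parameters throughout).
def cascade(initial_drop: float, thresholds: list[float], impact: float) -> float:
    price_drop = initial_drop
    triggered = set()
    changed = True
    while changed:
        changed = False
        for i, t in enumerate(thresholds):
            if i not in triggered and price_drop >= t:
                triggered.add(i)
                price_drop += impact  # each forced sale deepens the drop
                changed = True
    return price_drop

thresholds = [0.01 * k for k in range(1, 101)]  # 100 agents, staggered triggers
print(cascade(0.005, thresholds, 0.002))  # small shock: nothing triggers
print(cascade(0.02, thresholds, 0.02))    # past the tipping point: full cascade
```

Every agent in the model behaves correctly in isolation; the catastrophe is a property of their interaction, which is exactly why component-level testing cannot find it.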

So, although it is true that some software defects are trivial and inconsequential, a mindset that tolerates and excuses them in critical systems is dangerous.