28 January 2011

The Normalization of Deviance in Software Development

Twenty-five years ago today, on January 28, 1986, the Space Shuttle Challenger was destroyed, taking the lives of all 7 of the astronauts aboard. Like millions of others, I remember where I was and what I was doing when I heard the news. I had always been a keen follower of the space program - I remember Apollo 11 landing on the moon, even though I was a couple of months short of 4 years old at the time - so I was very interested in the investigation into the cause behind the disaster.

Fast forward to 2004, and I was a presenter at DPI Canada's Professional Development Week in Ottawa. After I had given my session on Transitioning to Agile, I attended one of the keynote talks given by Mike Mullane. Mike is a former Shuttle astronaut, and knew the Challenger crew (and some of the Columbia crew who died in 2003). He gave a talk called Countdown to Teamwork, which was funny and inspiring. In that talk, I was introduced to a term that has stuck with me for 7 years, and one that I believe the software world needs to learn and heed.

In the ensuing years I've poked around at the term, discussing it with others in the software and Agile community, and often speaking about it with clients. After some quick research I found that the term had been coined by sociologist Diane Vaughan, while writing her book The Challenger Launch Decision. Ms. Vaughan had spent many years investigating the culture of NASA and attempting to find the root cause or causes of what led to the loss of Challenger.

She wrote of how the culture at NASA had become so focused on hitting launch dates that once unacceptable situations or conditions had become acceptable risks, mainly because nothing bad had happened yet. When first built, the O-rings on the solid rocket boosters were to have no erosion at all by the hot gases from inside the combustion chamber. However, on each flight there was some erosion occurring. The engineers made some changes and the erosion, while it still occurred, was stable.

In other words, a once unacceptable condition - erosion of the O-rings - was now deemed acceptable. The deviance had been normalized. Management even applied spin to the process... an O-ring that had been eroded by 1/3 of its diameter was deemed to have a "safety factor" of 3!

Twenty-five years ago, in 1986, it was unseasonably cold in the Cape Canaveral area of Florida with temperatures dropping below freezing during the night of January 27th and into the morning of January 28th. The Challenger sat on the launch pad, receiving a "cold soak". Remember Dr. Feynman's experiment? Well, in the cold temperatures the O-rings didn't flex like they were supposed to, and what had been partial erosion of the O-rings became a complete breach, leading to the destruction of the Challenger 73 seconds after liftoff and the loss of the astronauts on board.

So, what does all of this have to do with software development? How does Normalization of Deviance apply?

Ask yourself this question: when did anything more than 0 defects in software become acceptable, and then expected?

We have normalized the deviance of improperly built software to the point that people are actually nervous if no defects are found. There's a saying among the test community: No program has 0 defects - it only has ones we haven't found yet!

I know what you're thinking... "But, Dave, you're being naive. We only need to ship something good enough to market in order to be successful. Look at Windows 95!! It simply isn't cost effective to write perfect or near perfect software."

Or, possibly... "But, Dave, you're being naive. We aren't building web sites here - we build (insert their product here). It's immensely complex and we'd go out of business if we tried to ship without defects."

I may have bought those arguments back when I was first starting to get into the software development profession, which was about the time Challenger was destroyed. My experience since, and certainly over the past 10 years that I've been using XP and Agile methods, says that near defect free software isn't only achievable and cost effective, but may become necessary for society's sake.

From a cost perspective, look at your team or organization. How much time did you spend fixing defects in the last 12 months? Did you miss any deadlines or have to remove promised features from a release at the last minute because the massive testing effort was still finding defects days before release? How many field issues do you get from customers? Do you need a support team as big as your development team?

All in the name of short term cost savings.

The issues with the O-rings in the Space Shuttle's solid rocket boosters were known in 1977, 9 years before Challenger was lost. The segmented design of the boosters was a problem from the start. So why was this design used? NASA had proposed a single segment design in the first place, but Congress balked at the cost. The problematic multi-segmented design was the lower cost option.

Endeavour, the shuttle that replaced Challenger, cost about 2 billion US dollars to build. The Shuttle program was halted for 2.5 years, and many design changes were made to the shuttle fleet to improve safety. The halt and these changes also cost hundreds of millions of dollars.

If Congress had approved the higher funding in the first place, it would have cost a fraction of that. Seven astronauts likely would still be alive today as well.

So, think about all this when someone tells you that you can't possibly use Test-Driven Development because writing all those tests takes too long. Think about it when someone questions why you're wasting company time by Pair Programming. Think about it when it's suggested that automating Acceptance Tests will be expensive because it takes so long and you won't be able to ship as many features. Think about it when someone tells you not to waste time Refactoring because the code is good enough already. Think about it when someone doesn't want you to spend time automating a build because it's complex and will eat a few days of your time. Think about it when someone says that defect free software is a fallacy.

We know how to write code that is near defect free, and we know that it makes us go faster and thus costs less in the medium and certainly the long term. My own experience is that in time periods of anything greater than a week or two it's faster and thus less costly to just "do it right" than to cut corners and hope for the best.

So much of our society today relies on software that we can no longer afford to think that we can't afford to write defect free software. We have normalized the deviance of that single first defect because nothing really, really bad has happened yet. All it took was a cold day in January 25 years ago to prove what happens when we become complacent with those sorts of risks.

4 comments:

In the late 90s and the run up to Y2K, I was testing 911 location software. This was the kind of system where if you release a defect to prod, people could die. The Y2K testing effort for that system was one of the best professional projects I have ever been a part of.

I slept soundly on Dec. 31 1999. I KNEW the system I had tested would work the following day.

Thank you for this post, I will enjoy referring people to it! I saw Mike Mullane do that talk last year. He noted that basically the same mistakes were made with Columbia, because the corporate memory did not last 17 years, they didn't keep revisiting what they had learned from Challenger, and they fell into the same trap again. I also like what he says about our sacred duty to be a fully engaged team member. None of us should ever become a "passenger" on our s/w team.

I think that 'zero defect' is a noble goal for the software that you are coding, however it would be naive to think that you can achieve zero defects on an end-to-end customer facing application.

One simple reason: The programming language, Operating system, kernels, server, hardware, network connection etc. are factors that are largely out of a software developer's control. Edge cases appear occasionally, and they impact your product's performance. Your customer doesn't care if it's your code, or the OS that is borked, they just want it fixed.

Yes, of course, if the whole world lived in perfect unity and all systems had zero defects, then we wouldn't have that problem. Sadly, that is a long way off.

The cost of attaining 0 defects may rise asymptotically the closer you get to 0. However, we already see the costs of not figuring out how close to 0 we can get.

Eleven years ago I likely would have provided the same comment you have. Nine years ago, and today, I would have said that I now know how to drive the number of defects to levels that are a few orders of magnitude lower than industry norms. Extreme Programming, Lean Software Development and newer techniques such as Behaviour-Driven Development and a raft of new tools have allowed me to work in that mode, allowed me to coach other people to work in that mode, and allowed me to understand that we can and should seek to attain 0 defects.

So, let me ask you this: If the OS is borked, wouldn't it have been nice if the people who wrote it busted their collective asses to reach 0 defects?