brain food for hackers

Richard Feynman, the Challenger Disaster, and Software Engineering

Feb 20th, 2008

On January 28th, 1986, Space Shuttle Challenger was launched at 11:38am on the 6-day STS-51-L mission. During the first 3 seconds of liftoff the o-rings (o-shaped loops used to seal the joint between two cylinders) in the shuttle’s right-hand solid rocket booster (SRB) failed. As a result, hot gases with temperatures above 5,000 °F leaked out of the booster, vaporized the o-rings, and damaged the SRB’s joints. The shuttle started its ascent, but seventy-two seconds later the compromised SRB pulled away from the Challenger, leading to sudden lateral acceleration. Pilot Michael J. Smith uttered "Uh oh" just before the shuttle broke up. Torn apart by excessive force, it disintegrated rapidly. Within seconds the severed but nearly intact crew cabin began to free fall and seven astronauts plunged to their deaths. I was a child then and remember watching in horror as Brazilian TV showed the footage.

At the time I didn’t know that SRB engineers had previously warned about problems in the o-rings, but had been dismissed by NASA management. I also didn’t know who Richard Feynman or Ronald Reagan were. It turns out that President Reagan created the Rogers Commission to investigate the disaster. Physicist Feynman was invited as a member, but his independent intellect and direct methods were at odds with the commission’s formal approach. Chairman Rogers, a politician, remarked that Feynman was "becoming a real pain." In the end the commission produced a report, but Feynman’s rebellious opinions were kept out of it. When he threatened to remove his name from the report altogether, they agreed to include his thoughts as Appendix F – Personal Observations on the Reliability of the Shuttle.

It is a good thing it was included, because the 10-page document is a work of brilliance. It has deep insights into the nature of engineering and into how reliable systems are built. And you see, I didn’t put ‘software’ in the title just to trick you. Feynman’s conclusions are general and very much relevant for software development. After all, as Steve McConnell tirelessly points out, there is much in common between software and other engineering disciplines. But don’t take my word for it. Take Feynman’s:

The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes.

So software is not the only discipline where the longer a defect stays in the process, the more expensive it is to fix. It’s also not the only discipline where a "top down" design, made in ignorance of detailed bottom-up knowledge, leads to problems. There is, however, a difference here between design and requirements. The requirements for the engine were clear and well defined. You know, go to space and back, preferably without blowing up. Feynman is arguing not so much against Joel’s functional specs, but rather against top-down design such as that advocated by the UML-as-blueprint crowd. On goes Feynman:

The Space Shuttle Main Engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of, or outside of, previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in the top-down manner, they are difficult to find and fix. The design aim of a lifetime of 55 missions equivalent firings (27,000 seconds of operation, either in a mission of 500 seconds, or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts, such as turbopumps, bearings, sheet metal housings, etc.

Unfortunate top-down manner, difficult to find and fix, failure to meet design requirements, frequent maintenance. Sound familiar? Is software engineering really a world apart, removed from its sister disciplines? Feynman elaborates on the difficulty in achieving correctness due to the ‘top down’ approach:

Many of these solved problems are the early difficulties of a new design. Naturally, one can never be sure that all the bugs are out, and, for some, the fix may not have addressed the true cause.

Whether it’s the Linux kernel or shuttle engines, there are fundamental cross-discipline issues in design. One of them is the folly of a top-down approach, which ignores the reality that detailed knowledge about the bottom parts is a necessity, not something that can be abstracted away. He then talks about the avionics system, which was done by a different group at NASA:

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product.

Yes, go ahead and pinch yourself: this is unit testing described in 1986 by the Feynman we know and love. Not only unit testing, but ‘step by step increase’ in scope and ‘adversarial testing attitude’. It’s common to hear we suck at software because it’s a "young discipline", as if the knowledge to do it right has not yet been attained. Bollocks! We suck because we constantly ignore well-established, well-known, empirically proven practices. In this regard management is also to blame, especially when it comes to dysfunctional schedules, wrong incentives, poor hiring, and demoralizing policies. Management/engineering tensions and the effects of bad management are keenly discussed by Feynman in his report. Here is one short example:

To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history.

This is one of many passages. I picked it because it touches on other points, such as the ‘attitude of highest quality’ and the ‘process of gradually fooling oneself’. I encourage you to read the whole report, unblemished by yours truly. With respect to software, I take away four main points:

Engineering can only be as good as its relationship with management

Big design up front is foolish

Software has much in common with other engineering disciplines

Reliable systems are built by rigorously tested, incremental bottom-up engineering with an ‘attitude of highest quality’
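
To make that last point a little more concrete, here is a minimal Python sketch of the bottom-up progression Feynman describes for the avionics software: check the smallest unit first, widen the scope step by step, then let an independent check attack the finished system as a customer would. Everything in it (clamp, valve_command, control_plan, the 0 to 100 range) is a hypothetical stand-in I made up for illustration. None of it comes from the report or from any real shuttle code.

```python
# Hypothetical illustration only: these names and ranges are invented,
# not taken from Feynman's report or from NASA software.

def clamp(value, low, high):
    """Smallest unit: restrict value to the range [low, high]."""
    return max(low, min(value, high))


def valve_command(requested_percent):
    """Module built on the unit: a valve opening is always 0 to 100 percent."""
    return clamp(requested_percent, 0, 100)


def control_plan(requests):
    """Complete system: turn a list of requests into valve commands."""
    return [valve_command(r) for r in requests]


def check_bottom_up():
    # Step 1: check the smallest unit on its own.
    assert clamp(5, 0, 10) == 5
    assert clamp(-1, 0, 10) == 0
    assert clamp(11, 0, 10) == 10
    # Step 2: widen the scope to the module that uses it.
    assert valve_command(104) == 100
    # Step 3: widen again to the complete system.
    assert control_plan([50, -20, 250]) == [50, 0, 100]


def independent_verification(system):
    """Adversarial check that treats the system as a black box:
    feed it hostile inputs and judge only the output."""
    hostile = [-10**9, -1, 0, 100, 101, 10**9, 3.7]
    return all(0 <= out <= 100 for out in system(hostile))


check_bottom_up()
print(independent_verification(control_plan))
```

The design choice worth noticing is that independent_verification knows nothing about clamp or valve_command. It only sees the delivered product, which is roughly the ‘adversary attitude’ in the quote above. A real verification group would of course go far beyond a handful of hostile inputs.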

There are other interesting themes in there, and Feynman’s insight can’t be captured in a few bullet points, much less by me. What do you get out of it?