Single point of failure

September 1st, 1983. Korean Air Lines Flight 007, flying from New York City to Seoul via Anchorage, disappeared a few hours after taking off from its Anchorage stopover. Only later was it discovered that the plane had deviated from its original route; instead of flying through air corridor R-20, it entered Soviet airspace and was shot down by a Soviet interceptor. All 269 people on board were killed.

During an investigation conducted by the National Transportation Safety Board (NTSB), it became clear that the plane had been cruising far north of where it should have been. Instead of flying over international waters, the plane somehow entered Soviet airspace, prompting the Soviets to shoot it down in the belief that it was on a spying mission. How did the plane deviate that much from its assigned route? The NTSB came up with two possible explanations, both pointing to human error.

The first option was that the aerial waypoints were typed incorrectly. These are latitude / longitude pairs that the co-pilot enters and the captain validates, and together they form the flight's route. Mistyping one digit may take the plane way off its planned route, possibly into hostile territory. The NTSB also mentioned another possibility - not turning the coordinates-based auto-pilot (INS) on, and instead flying in the Magnetic Heading auto-pilot mode. The Magnetic Heading mode is always on during take-off, so the pilot has to remember to change the auto-pilot mode. If he failed to do so, the INS would not use the coordinates the crew typed to guide the plane, since it would be off.

The captain of KAL flight 007 had years of flying experience: 10+ years at KAL, and many years before that in the air force. Therefore, the NTSB deemed the second option "less likely". They thought it far more likely that someone had typed a number incorrectly and not bothered to verify it than that a very experienced pilot had forgotten to flip a switch right after take-off. It is a switch you flip on every flight, after all.

Years later, after the Soviet Union fell apart and the investigation could be concluded using the plane's original black box, the real reason for the deviation was discovered. It turns out the captain forgot to switch the INS on, so the plane was cruising on Magnetic Heading. Had he remembered to switch the INS on at any point during the flight, he would have caught the error and redirected the plane to its assigned route, probably avoiding the disaster.

In the software world we have a lot of slogans, methodologies and names for patterns. Single point of failure is not just a slogan. In this case, the system had many single points of failure, and it was only a matter of time before one of them had fatal consequences. I'm pretty sure this was not the only time a pilot forgot to switch to INS mode; it is the only time (that I know of) it caused death. Of everyone on an entire 747.

The Single Point of Failure in this case is not a system crash, or a bottleneck. It is the assumption that the operator will always remember to do the right thing at the right time. And that assumption is wrong, even if your user has 10+ years of flawless experience. I'm consciously avoiding the discussion of the poor UX of the auto-pilot system, and this is why I left out some details relating to it. Yes, you can mitigate this with some UX tricks, like checklists or blinking signs or whatever, but then in the best scenario you are just making the failure less likely to happen, which is not good enough.

If it is the common practice to always first have magnetic heading mode turned on, and then switch to something else (not necessarily INS), then having it as a dedicated mode is a wrong assumption. But here I'm talking UX again, so we'll stop here.

When designing any software, not to mention complex systems, don't ever allow for a single point of failure, and don't ever assume it is only about preventing bottlenecks or crashes. In some systems you might save lives, but in most systems you'll just save yourself a lot of support calls.
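
To make the idea concrete in code, here is a minimal sketch of a system that does not depend on one person remembering one step. It is a toy model, not real avionics: the Mode enum, the Autopilot class, the check_autopilot function and the 20 km threshold are all hypothetical names and numbers invented for this illustration. The point is that two independent checks watch for the same failure, so a forgotten mode switch is caught by the system rather than by someone's memory.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    HEADING = "heading"  # default mode after take-off in this toy model
    INS = "ins"          # follows the programmed waypoints


@dataclass
class Autopilot:
    # Hypothetical state: which mode is engaged and whether a route was entered.
    mode: Mode = Mode.HEADING
    has_route: bool = False


def check_autopilot(
    ap: Autopilot,
    off_track_km: float,
    max_off_track_km: float = 20.0,  # assumed threshold, for illustration only
) -> list[str]:
    """Run two independent checks; neither relies on the operator's memory."""
    warnings = []
    # Check 1: a route was programmed, but the mode that follows it is not engaged.
    if ap.has_route and ap.mode is not Mode.INS:
        warnings.append("route programmed but INS mode not engaged")
    # Check 2: regardless of mode, flag any large deviation from the planned track.
    if off_track_km > max_off_track_km:
        warnings.append(f"off track by {off_track_km:.0f} km")
    return warnings


if __name__ == "__main__":
    forgotten_switch = Autopilot(mode=Mode.HEADING, has_route=True)
    print(check_autopilot(forgotten_switch, off_track_km=300.0))
```

What a real system should do with such warnings (refuse to proceed, escalate, switch modes by itself) is a separate design decision; the sketch only shows that the failure no longer hides behind a single forgotten step.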

You can read the full story, with all the details, on the Wikipedia page. National Geographic had an episode on it in the excellent "Air Crash Investigation" series, which you can watch here. The image above is from that show.