Post by PhantomWolf on Jul 20, 2010 8:45:04 GMT -4

and yet they seemed to get away with some startlingly courageous decisions (Apollo 12 comes to mind in particular). But with the Shuttles the policy seemed to be "Launch unless proven unsafe".

What about Apollo 12? Do you mean the decision not to abort after the lightning strikes? I attribute that to the MOCR rules "If you don't know what to do, do nothing" and "Never call an abort without two independent indications."

I'd say that the decision to launch was courageous as the Manned Space Flight Center Launch Mission Rule 1-404 stated that "the vehicle will not be launched when its flight path will carry it through a cumulonimbus (thunderstorm) cloud formation." This was waived to allow the launch despite the storm, something that resulted in the lightning strike.

It must be fun to lead a life completely unburdened by reality. -- JayUtah

"On two occasions, I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question." -- Charles Babbage (1791-1871)

Post by JayUtah on Jul 20, 2010 15:03:30 GMT -4

Constraint waivers are too often misused. A waiver is meant to explain why a certain constraint doesn't hold in a given case. A waiver that passes a 0.51 percent per day leaky tire had better be accompanied by the engineer's computations showing that the tire still satisfies some larger safety requirement.
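To make that concrete, here's the sort of back-of-the-envelope arithmetic that should accompany such a waiver. All the figures here are invented for illustration; the point is that the waiver is defensible only if the computed worst case still clears the underlying safety requirement.

```python
# Hypothetical numbers: does a 0.51%/day leak still satisfy the
# larger safety requirement over the full exposure window?
initial_psi = 340.0    # assumed service pressure
min_safe_psi = 300.0   # assumed minimum pressure for safe operation
leak_rate = 0.0051     # 0.51 percent of current pressure lost per day
mission_days = 14      # assumed interval between servicings

# Compound decay: each day the tire keeps (1 - leak_rate) of its pressure.
pressure = initial_psi * (1 - leak_rate) ** mission_days
print(f"pressure after {mission_days} days: {pressure:.1f} psi")
print("waiver justified" if pressure >= min_safe_psi else "waiver NOT justified")
```

With these (made-up) numbers the tire ends the window above the floor, so the waiver is defensible; change the exposure window or the floor and the same arithmetic can just as easily say no.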

In Challenger's case, the SRBs flew under what was effectively a standing waiver. This is because acknowledging a design flaw in a Criticality 1R assembly means grounding the fleet for two years or more while it's fixed. Waivers that simply sidestep safety constraints in order to improve production capacity will eventually bite you.

What happens is a phenomenon called the normalization of risk. It is the bad end of the probability game. It means that if you allow an unsafe condition to occur, and no consequence follows immediately, you wrongly believe that the system remains safe. The shuttle did not explode the first time the SRB field joints failed. Hence there arose the notion that those elastomeric seals were not as critical as originally believed. How wrong they were.

We introduce design margins to accommodate unforeseen conditions. The system operates safely in those cases because although it wanders briefly outside the operational envelope, it does not exceed the physical envelope. The normalization of risk often employs a design margin to increase production capacity. When that occurs, the system can no longer accommodate momentarily excessive circumstances. Yes, the SRB field joints can accommodate erosion under normal flight conditions. But under cold-weather conditions and excessive wind shear (i.e., excessive bending moments in the casing), the safety that would have been provided by the design margin simply isn't there. And the system fails.

Normalization of risk is a chronic human-nature condition that plagues all engineering.

Although I'm not a forensic engineer, from time to time I do read accident investigation reports.

That's the work product of a forensic investigation. It's a great way to get an overview of the field and to get a glimpse into the methods.

They [accidents] invariably seem to happen after a long series of events, any one of which could have prevented the accident had it not happened.

Those are multiple-mode failures. Several things have to conspire -- often in an improbable combination -- to fail the system. Weather, combined with operator inattention, combined with some key failure, for example. Any one of them alone would be inconsequential. These occur because they are insanely difficult to design against. In any complex system, the number of single-mode failures is already daunting. The 2- and 3-way combinations are simply too numerous to be imagined.

The Apollo 13 accident is a classic example. Had any of a number of steps toward failure not been taken, the accident would have been averted. But each step seems reasonably innocent. Dropping the tank is unfortunate, but recoverable. Running the heater on GSE was a standard procedure. Failure to test the thermostat under high electrical load was not itself seen as disastrous. We accept these momentary risks because we believe the system as a whole to be resilient. If Step G doesn't catch the failure, Step R will.

This brings up a condition we call systemic loafing, or social loafing. It's social loafing when human activity dominates the system, and systemic loafing when automation predominates. We've long known that if you have a process that employs one quality-control inspector, adding another inspector in sequence actually reduces overall system quality. This is because when there is only one inspector, he knows that the buck stops with him. But with two or more inspectors, one will always believe that any mistake he misses will be caught by the other one. So he will tend to be less diligent.
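Whether the redundant inspector actually helps depends on how far each inspector's diligence drops. With these illustrative (entirely made-up) catch rates, the loafing pair lets more defects through than the single inspector who knows the buck stops with him:

```python
# One diligent inspector vs. two loafing inspectors (illustrative numbers).
p_alone = 0.98     # assumed catch rate when one inspector bears full responsibility
p_loafing = 0.85   # assumed catch rate when each believes the other will catch it

miss_one = 1 - p_alone               # defect escapes the lone inspector
miss_two = (1 - p_loafing) ** 2      # both must miss it, assuming independence

print(f"escape rate, one diligent inspector: {miss_one:.4f}")
print(f"escape rate, two loafing inspectors: {miss_two:.4f}")
```

The redundancy only pays off if diligence stays high; once each inspector's catch rate sags past a break-even point, the two-inspector line is worse than the one it replaced.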

The opposite of a multiple-mode failure is a common-mode failure. That is when two or more components fail because of the failure of some third component to which they are commonly connected. We like to design systems to reduce the criticality. We like to reduce coupling. These design factors affect how the system behaves in the face of component failure.

If you have a fluid reservoir that also functions as a heat exchanger, failure there will produce both thermal and quantity-related effects. The system will run hotter. It may also suffer from over- or under-capacity errors.

You see the same thing in shipping.

Actually that's a very insightful arena because the shipping accident rate has remained largely unimproved for 50 years. Despite huge strides in automation, satellite navigation, and shipbuilding, we still experience a relatively constant rate of accidents.

This is because advances intended toward safety, such as autopilots, are being used to achieve greater production capacity. For example, if a GPS system improves your knowledge of the ship's position, ship operators realize they can run harbor channels at a faster speed because they no longer need the positioning margin. Automation likewise means you can run a ship with fewer crew, which leads to more fatigue-related accidents.

What this tells engineers is that humans have an inherent (and largely fixed) notion of acceptable risk. When customers say, "We want the system to be safer," what they often end up saying is, "We want to get more out of our system for the same level of safety."

I've also learned that there are definite limits to human reliability.

Indeed, and as systems become more complex and harder to understand, the operators of these systems run up against hard and fast limits in human comprehension.

We saw this in the Three Mile Island nuclear power plant accident, and in the Apollo 13 accident. If you look at how operators responded to the problem, you find that they were simply unable to grasp the scope and nature of the failure at the time. It took them a long time to realize that they were experiencing something beyond a simple failure.

Operators tend to adopt a de minimis hypothesis early in the accident sequence and to filter incoming information based on the hypothesis. For nearly an hour Apollo 13 controllers believed they were looking at a simple failure that was being aggressively misreported in the telemetry. For about that long, Three Mile Island operators failed to consider that their safety systems themselves were malfunctioning.

Humans just aren't well suited to situations where nothing happens for a very long time, and then suddenly and without warning you have to make a crucial decision.

Quite true. I had brunch on Sunday with someone who trains engineers (the train-driving kind) for Union Pacific. Fatigue and attention deficit are the biggest problems he faces.

I think there's an unwarranted belief out there that the human is ultimately more reliable than the machine, and that's just not always so.

That's very true. The joke goes that a modern airliner should be flown by one man and one dog. The dog is there to bite the man if he tries to touch the controls, and the man is there to feed the dog. Modern flight control systems are exceedingly adept, and are much more capable than a human of flying an airplane safely and efficiently under normal circumstances, and even under many abnormal ones.

As you note, sometimes simple automation is best. In loss-of-control flight accidents, very often you see (i.e., by examining the DFDR) the pilot trying to regain control of an airplane that has gone into spins, dives, or other uncontrolled maneuvers. And very often you can see that the pilot's command inputs are largely ineffective because he lacks an appropriately detailed spatial awareness of his situation. In those same instances the autopilot is shown to have been better at recovering the airplane. This is because autopilots are dumb: they're just simple control systems that map inputs to outputs. The roll-channel controller says, "Hm, my roll attitude is way off and my roll rate is excessive; let me apply my ailerons at the properly aggressive position." And this works because the controller is paying attention to only two inputs, and has only one output to manipulate. Similarly pig-headed thoughts are going on in the minds of the pitch and yaw controllers. The combined effect is a deliberate application of the right combination of control inputs to correct the overall attitude errors. It isn't a human pilot flailing at the controls. The autopilot isn't awash in adrenalin and pushed beyond its capacity by a survival instinct.
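That "dumb" roll-channel controller can be sketched in a few lines. This is only an illustration of the idea -- a proportional-derivative loop watching two inputs and driving one output -- not any real flight control law; the gains and limits are invented:

```python
# Minimal sketch of a roll-channel controller: two inputs (roll attitude,
# roll rate), one output (aileron command). Gains and limits are made up.

def roll_controller(roll_deg, roll_rate_dps, target_deg=0.0,
                    kp=0.8, kd=0.3, max_aileron_deg=25.0):
    """Return an aileron command that opposes attitude error and damps rate."""
    error = target_deg - roll_deg
    command = kp * error - kd * roll_rate_dps   # proportional + derivative damping
    # Clamp to the physical deflection limit of the control surface.
    return max(-max_aileron_deg, min(max_aileron_deg, command))

# A wing drop: 60 degrees of bank, still rolling further at 20 deg/s.
print(roll_controller(60.0, 20.0))   # → -25.0 (full opposite aileron)
```

No panic, no tunnel vision: the loop just keeps commanding the correct opposing deflection until the attitude error is gone.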

But there is a similar backlash misconception among some engineers that automation is inherently more reliable. In fact engineering a system for reliability and self-regulation more often than not requires engineered safety devices (ESDs). And these are themselves engineered components that can go wrong. For example at Three Mile Island a pressure-operated relief valve opened as scheduled to relieve pressure in the coolant loop, but then failed to close again. Often we rely on ESDs to sit dormant for many years untested and untried, and then to function perfectly in the one instance where they are required.

At best ESDs are additions to the system complexity. They will help if they are working properly. They will hurt if they themselves are not well built and maintained.

There is a fine art to designing safety and warning systems. Warning systems that go off for no good reason annoy operators and normalize them to the danger they represent. Often operators will disable a "faulty" warning system, or simply ignore it unless it signals a condition that is visibly harmful.

My theater employs a large mobile stage system designed by Scala in Canada, the same company that automates Cirque du Soleil. It is phenomenally powerful, and rather complex. One of its ESDs is a dead-man's switch that is meant to be operated by a spotter down near the machinery. The spotter inspects the stage in motion and releases the switch (stopping the mechanism) if something goes wrong. The dead-man's switch spends most of its time wedged between a conduit and the wall, held closed and unattended.

Another is a set of astragals that guard pinch and shear hazards. "Astragal" is the technical term for those rubber bumpers on the edges of elevator doors that signal a blockage. The astragals are extremely sensitive, and the normal operation of the stage sometimes bumps the astragals and trips the system. As with most ESDs, the safety interlock cuts power to the actuators and the control relays. Resetting the astragal, rebooting the controller, and advancing to the appropriate part of the program takes, at best, 45 seconds -- an eternity in live theater. Hence there was considerable production pressure to avoid tripping the stage.

It took a near-fatal accident in which a stage hand was dragged gruesomely into a formerly-guarded pinch hazard to shock the operators into restoring the human safety factors and to re-engineer the control and safety system so that it would (a) not be so sensitive to normal operational modes, and (b) could be reset fast enough to maintain normal show operations.

"Oh, that alarm always goes off -- it's never right," is one of the big hassles in safety engineering.

Humans are best when it comes to making complex reasoned judgments with plenty of time to do so...

Well, reasonably good. There is no machine that can substitute for human judgment, but human judgment is as likely to fail the system as it is to save it. Humans have the ability to think creatively, and that's why operators are still required.

Post by nomuse on Jul 20, 2010 18:27:42 GMT -4

Multi-mode failures. Thanks. I've been wanting a term for that. I used to try to explain to my old shop foreman why a problem with a piece of scenery shouldn't be considered in isolation. You have to consider not just the failure, but what happens if this OTHER thing fails at the same time.

One wonderful example of the above was a limit switch on a motorized platform at the Huntington Stage that required the trip relay be pulled DOWN by a connection through the switch. Power to the relay box was in an extension cord running along the back wall until it plugged into an outlet under the pinrail. You can see what is coming. The operator failed to stop the platform, and a rail operator had managed to kick the plug out of the wall. The platform continued on its merry way until it hit the rear wall of the theater. By the by, can you guess where the relay box was located? That machine suicide made it even easier to convince them to pull out the whole thing and rewire it correctly the next time.

Post by PhantomWolf on Jul 20, 2010 23:37:40 GMT -4

Great post, Jay. As I was reading it I was reminded of the British Midland crash near Kegworth in January 1989. That was caused by a blade failure compounded by lack of operator experience with a new design, and distrust of the warning systems. Worse was that the feedback they relied on told them that they were doing the right thing, when in fact they were doing exactly the wrong one.


Post by ka9q on Jul 21, 2010 5:03:56 GMT -4

I'd say that the decision to launch was courageous as the Manned Space Flight Center Launch Mission Rule 1-404 stated that "the vehicle will not be launched when its flight path will carry it through a cumulonimbus (thunderstorm) cloud formation."

Courageous? There's a difference between courage and stupidity.

I didn't know that the Apollo 12 launch actually violated any rules. I thought the rule was simply that there be no active lightning, and there wasn't. The rule writers didn't know that the Saturn V could actually induce lightning simply by its presence. The S-IC's long, ionized plume was an electrical conductor, and it extended a path to ground up into the charged cloud and triggered the two strikes.

The Apollo 12 strikes led to the installation of an elaborate network of electric field sensors at Cape Canaveral, and they are now part of launch commit criteria.

Post by ka9q on Jul 21, 2010 5:36:53 GMT -4

Operators tend to adopt a de minimis hypothesis early in the accident sequence and to filter incoming information based on the hypothesis. For nearly an hour Apollo 13 controllers believed they were looking at a simple failure that was being aggressively misreported in the telemetry.

If anyone is at all interested in understanding how the crew and Mission Control initially troubleshot the Apollo 13 emergency, I strongly recommend getting Sy Liebergot's book "Apollo EECOM: Journey of a Lifetime". The CD-ROM in the back contains several hours of the flight director and EECOM loops as Sy and his relief EECOM work things out with the backroom team.

I never thought I could be so riveted by "techie talk" as I was by these recordings, especially for an event 40 years in the past that I was already quite familiar with. I kept thinking "C'mon, can't you tell? The O2 valves for cells 1 and 3 are closed, even though it doesn't show it. Try cycling them!" but they never do. Not that it would have made any difference, of course.

Nor was it fair to judge them with crystal clear hindsight. They had absolutely no reason to think that the explosion had shocked those two valves closed. They had no indication of their positions in telemetry. Not even the crew did, because the H2 valves were still open and the indicators were wired to show "open" unless both H2 and O2 valves were closed. In fact, they didn't even have a reason to think there'd been an explosion. At least not right away.

Post by ka9q on Jul 21, 2010 22:54:02 GMT -4

The explosion wasn't QUITE as dramatic as shown in the film Apollo 13, though they did feel it.

Yeah. Neither were the attitude excursions during the manually controlled burn quite as violent as depicted.

One thing they apparently depicted right, or maybe even less violently than the real thing, was S-IC/S-II staging. Looking at the Saturn V flight reports for these missions you can easily see that the acceleration compressed the stack, and S-IC shutdown excited a longitudinal mode - essentially the stack oscillated like an accordion for a second or two, with the accelerometers showing both negative and positive longitudinal g-force peaks. The movie "Apollo 13" simply showed them being thrown forward into their straps rather than being thrown back and forth a few times.

I did get the opportunity to ask two Saturn V veterans - Bill Anders and Alan Bean - about this and they both thought the Apollo 13 depiction was more or less accurate.