An Anthropologist's views on bald apes in groups chasing pieces of paper in pursuit of happiness.

Friday, November 06, 2009

18 rules of complex system failure

The list below appeared on ZDNet and is a cut-and-paste job from the brief paper "How Complex Systems Fail" by Richard I. Cook. I suggest reading Charles Perrow's Normal Accidents if this field of extreme risk interests you.

In finance, complex failure at the largest scale is called systemic risk, and it is what we are currently facing.

How many items below relate to a recent failure or proposed solution (real estate, CDS and AIG, the US debt bubble, stimulus...)? The comments in parentheses are my opinions. My own solution is a lo-fi finance approach.

1. Complex systems are intrinsically hazardous systems. The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. It is the presence of these hazards that drives the creation of defenses against hazard that characterize these systems.

(Finance is nothing more than risk allocation.)

2. Complex systems are heavily and successfully defended against failure. The high consequences of failure lead over time to the construction of multiple layers of defense against failure. The effect of these measures is to provide a series of shields that normally divert operations away from accidents.

(Robust checks and balances need to be in place at all levels. Innovation is frequently a form of subverting these for the sake of efficiencies (capital, tax or otherwise). Risk eventually shows up.)

3. Catastrophe requires multiple failures - single point failures are not enough. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

(The media, reflecting audience desire, seeks to identify point failures. There is rarely just a bad guy or a broken part.)
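To put rough numbers on rules 2 and 3, here is a minimal sketch of why layered defenses make catastrophe rare but never impossible. The 5% failure rate per layer and the function name `joint_failure_probability` are my own illustrative assumptions, not figures from Cook's paper, and the calculation assumes the layers fail independently.

```python
from math import prod


def joint_failure_probability(layer_failure_probs):
    """Chance that every defensive layer fails at the same time.

    Assumes the layers fail independently of each other (a strong,
    hypothetical assumption made only to keep the arithmetic simple).
    """
    return prod(layer_failure_probs)


# Hypothetical numbers: each layer of defense fails 5% of the time.
single = joint_failure_probability([0.05])
layered = joint_failure_probability([0.05, 0.05, 0.05])

print(f"one layer fails:         {single:.6f}")
print(f"all three layers fail:   {layered:.6f}")
```

The independence assumption is the weak spot. When the small failures are correlated, as they tend to be in a credit bubble, the chance of them lining up is higher than this multiplication suggests.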

4. Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations.

(People and organizations fail all the time. All things fail eventually. Systems need to allow for system failure. To paraphrase someone else: capitalism without failure is like religion without hell. It doesn't really function as a social tool.)

5. Complex systems run in degraded mode. A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

(Optimization comes at the expense of safety: 100:1 leverage or some other "innovation" may seem optimal, but may be unstable.)

6. Catastrophe is always just around the corner. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.

(Only the paranoid survive. Find a culture or organization that is fat and happy and you will find unacknowledged risks.)
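The 100:1 leverage mentioned under rule 5 shows how close "just around the corner" can be. The sketch below is simple arithmetic under my own simplifying assumptions: a single asset, no margin calls, no funding costs, and hypothetical function names (`equity_after_move`, `wipeout_move`).

```python
def equity_after_move(equity, leverage, price_change):
    """Equity left after the asset price moves by `price_change`.

    `price_change` is a fraction (-0.01 means a 1% fall). The position
    size is equity * leverage; gains and losses on the whole position
    land on the equity. Ignores margin calls and funding costs.
    """
    position = equity * leverage
    return equity + position * price_change


def wipeout_move(leverage):
    """Price change (as a fraction) that takes the equity to zero."""
    return -1.0 / leverage


# Hypothetical comparison of 10:1 and 100:1 leverage on 1.0 of equity.
for lev in (10, 100):
    left = equity_after_move(1.0, lev, -0.01)
    print(f"{lev}:1 leverage, 1% fall -> equity left {left:.2f}, "
          f"wiped out by a move of {wipeout_move(lev):.1%}")
```

At 10:1 a 1% fall costs a tenth of the equity; at 100:1 the same fall takes all of it. The position looks optimal right up to the move that ends it.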

7. Post-accident attribution to a ‘root cause’ is fundamentally wrong. Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident.

(See the earlier comment regarding the media; this also applies to congressional committees, which are typically witch hunts.)

8. Hindsight biases post-accident assessments of human performance. Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

(This is the forehead-slap effect that follows a bubble event.)

9. Human operators have dual roles: as producers & as defenders against failure. The system practitioners operate the system in order to produce its desired product and also work to forestall accidents. This dynamic quality of system operation, the balancing of demands for production against the possibility of incipient failure, is unavoidable.

(Unfortunately, positive feedback in the form of earnings, bonuses and industry recognition amplifies this bias. A fair assessment of skill rarely gets in the way of ego amplified by culture. Cultures, be they national, organizational or group, that assume instant monetary reward directly equates to insight or skill almost always fail due to this bias.)

10. All practitioner actions are gambles. After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse, that successful outcomes are also the result of gambles, is not widely appreciated.

(Few do post-mortems on successful outcomes that deviate from the norm. These are better than post-mortems on failures, as they paid for themselves.)

11. Actions at the sharp end resolve all ambiguity. Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of practitioners at the sharp end of the system. After an accident, practitioner actions may be regarded as ‘errors’ or ‘violations’ but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.

(See CDOs, black boxes and any obfuscation. Wall Street is excellent at packaging and selling things, fairly mediocre at purchasing them. See the performance of the mutual fund industry, among others, in this regard.)

12. Human practitioners are the adaptable element of complex systems. Practitioners and first line management actively adapt the system to maximize production and minimize accidents. These adaptations often occur on a moment by moment basis.

(Wrong incentives and the confusion of short-term motivations with long-term risks are part of the system. We need to design organizations and regulations with this in mind.)

13. Human expertise in complex systems is constantly changing. Complex systems require substantial human expertise in their operation and management. Critical issues related to expertise arise from (1) the need to use scarce expertise as a resource for the most difficult or demanding production needs and (2) the need to develop expertise for future use.

(Expertise is a false notion in complex systems due to their changing nature. For this reason anything seeking to "optimize" a system can make it unstable.)

14. Change introduces new forms of failure. The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures. Because these new, high consequence accidents occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.

(See above, regulatory reform, etc. The law of unintended consequences runs deep in complex systems. The ratings agencies offering stamps of approval to structured products is a classic case of this.)

15. Views of ‘cause’ limit the effectiveness of defenses against future events. Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents.

(Boundary conditions that assume point failure and human error (greed, stupidity and crowd blindness) need to be built in.)

16. Safety is a characteristic of systems and not of their components. Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. The state of safety in any system is always dynamic; continuous systemic change ensures that hazard and its management are constantly changing.

(See above: large stable systems are the result of small stable systems. Consider the role of audit integrity and other subfunctions of the financial system.)

17. People continuously create safety. Failure free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance. These activities are, for the most part, part of normal operations and superficially straightforward. But because system operations are never trouble free, human practitioner adaptations to changing conditions actually create safety from moment to moment.

(System participants need to look out for changes, be they over- or under-performance relative to a system's normal behaviour. Keeping an eye out for "innovation" in finance is highly recommended.)

18. Failure free operations require experience with failure. Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope.

(More work needs to be done discussing multiple points of failure in a system, including the hubris or collective myopia that leads to the failure. My own belief is that the thing to watch for is a culture or group that reflects hubris, is obsessed with over-optimizing, or believes that today's profit means they are "right". For quants out there: beware of geeks bearing gifts.)
