Do we have to accept “Normal Accidents”?

My esteemed colleague Andrew Bishop sent our team Malcolm Gladwell's 1996 New Yorker article "Blowup" this afternoon. If you're interested in complex systems, technology and risk, it makes for an interesting read. Gladwell draws on Charles Perrow's work on "Normal Accidents" to discuss the risk management challenges posed by modern, complex systems such as spacecraft and nuclear power stations, and concludes that we simply need to come to terms with the fact that "the potential for high-tech catastrophe is embedded in the fabric of day-to-day life".

I disagree with this conclusion, although not because I think that there is a magic bullet in risk management (I don't). Instead, I believe we need to think about why the fabric of day-to-day life rewards the complexity that in turn creates the risk, and try to solve for this. In some cases, that will mean changing the fabric by moving away from certain practices and technologies.

Coincidentally, over the weekend I spoke to a friend of mine, Denise Caruso from Carnegie Mellon, about Perrow and his ideas. She mentioned that she had interviewed Perrow (who wrote the book Normal Accidents in 1984) a few years ago. Perrow felt many people had misinterpreted his concept of normal accidents as implying (as Gladwell does) that we need simply to accept the inevitability of catastrophic risks in technological systems as "normal". Perrow emphasized that in the end we choose to work with complex organizations and complex physical systems that can have catastrophic impacts without appreciating the true downsides – in Warren Buffett's words, we "pick up nickels in front of a bulldozer". He argues that we should be willing to discard complex technologies and forms of organization that offer short-term gains yet threaten catastrophic failures in the medium and long term.

Perrow’s arguments around the dangers of complex systems are supported by Joseph Tainter’s thesis that complexity offers declining (in his examples, social) returns over time – Tainter’s point is that the marginal value of complexity becomes negative at some point and in fact precipitates social collapse. Part of the reason why this occurs can be linked to Perrow’s characterization of multiple, unanticipated failures in tightly coupled systems – keeping such failures at bay requires significant amounts of energy (which may not always be available). Both Tainter and Perrow argue that, at some point, an additional unit of complexity (for want of a better measure) is a bad thing, since it increases the threat to the entire system.
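To see the arithmetic behind this, here is a toy sketch (my own illustration, not a model from Perrow or Tainter). If a tightly coupled system fails whenever any one of its components fails, then adding components drives the system-wide failure probability toward certainty, even when each component is individually very reliable – one crude way to picture why each additional unit of complexity eventually threatens the whole:

```python
# Toy model: in a tightly coupled system, a single component failure
# can cascade, so the system fails if *any* component fails.
# Illustrative only; the 0.1% per-component failure rate is made up.

def system_failure_prob(n_components: int, p_fail: float) -> float:
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p_fail) ** n_components

# Each component fails with probability 0.001, yet the system as a
# whole becomes almost certain to fail as complexity grows:
for n in (10, 100, 1000, 5000):
    print(f"{n:>5} components -> failure probability {system_failure_prob(n, 0.001):.3f}")
```

With 10 components the system-wide failure probability is about 1%; with 1,000 it is roughly 63%. The real world is messier than independent components, of course, but the direction of the effect is the point.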

So why can't we simply find or design rational solutions to the challenges of increasing complexity? Can't we remove or reduce complexity, or invent better risk management systems that can cope? Well, the truly interesting bit of Tainter's argument is a form of "ratchet" principle which means increasing complexity is far easier than decreasing complexity (in particular within social systems). One possible explanation for this phenomenon is the existence of adaptive agents with entrenched interests within the system. Hence, the response to newly perceived risks is most often to ADD risk management systems (or additional bureaucratic measures, redundancies, protocols, fail-safe mechanisms etc.) to existing organizational structures, rather than to remove or reduce structures – which increases overall complexity. So it's tough to just "de-complexify". Clay Shirky riffed on this in the context of business models in a great blog post last year.

At the same time, as Perrow points out, risk management in the face of complexity requires directly contradictory organizational forms: the prospect of catastrophic failures produced by complex systems requires decentralized, adaptive operators that can respond quickly to signals in proximate subsystems, but also highly centralized, routine-driven operators that appreciate the extent of interconnections across subsystems, such that local intervention doesn't create further problems. You can't have both centralized and decentralized operators, hence a fundamental organizational dilemma exists that makes managing the risks posed by complex systems extremely challenging indeed. Lesson: it's tough to find organizational solutions to complexity. Worse, seeking to do so will tend to increase the system's complexity, and hence its risk.

Which brings me to why I disagree with Gladwell's conclusion. While I appreciate his addition of risk homeostasis as a contributing factor towards catastrophic risk (and could add it to the reasons why returns to increasing complexity can turn negative), I don't believe the world is faced with a stark choice between accepting the risks of beneficial technologies on one hand and giving up the comforts of modern life on the other, as he suggests. This is a false dilemma that serves only to push us into accepting heightened risk, because objecting would make us look like kill-joys.

In fact, we don't have to sacrifice many of the technologies that make us safer: as Perrow argues, the jet engine is safer, cleaner and more reliable than the prop engine, and doesn't increase systemic risk when used as a substitute. But we should be wary of those technologies and organizational forms whose potential for dangerous interaction with our societies and environment is weighed against small, uncertain or remote benefits (particularly when those benefits are measured in terms of material gain rather than true prosperity). We have a choice about whether to sanction projects such as geo-engineering, genetic engineering and even large-scale nuclear technologies. Despite what some people argue, we don't have to live with large, global banks which far exceed the capacity of national regulators to supervise.

While it is politically difficult in practice, as Perrow points out, we could choose instead to adopt local solutions such as solar power, and have smaller organizations that don't co-opt political and social power as they increase in size and complexity. We could identify those technologies and organizational forms that offer false efficiencies: efficiency over a short timescale (i.e. without accounting for the cost of potential risks), or efficiency to an end that actually represents very low value (e.g. increased efficiency in distributing unwanted or unneeded goods). We could then choose alternatives that give us what we need without the attached cost of catastrophic failure. And where we do see the possibility of certain systemic risks thanks to complex, interconnected systems (such as the Internet) but weigh the benefits of enhanced communication and transparency above threats to privacy or intellectual property, we should actively resist the urge to layer them with regulations and safeguards that ultimately only increase total risk by increasing complexity. What you don't want public on the internet, you don't put on the internet.

And if all that fails, as Clay Shirky pointed out in that same post, the really smart people will look for examples where complexity is already threatening collapse, and find opportunities to take advantage of simpler and fundamentally more efficient ways of achieving even better results. Rather than shutting our eyes to the impending risks of the complex organizations and technologies that define “the fabric of day-to-day life”, a portion of our efforts should go towards looking at how new approaches, technologies and thinking can leapfrog older, complex business and organizational models in order to create more sustainable, simple ways of living. That’s certainly what we should have been doing in the immediate aftermath of the financial crisis. And it’s where the focus should be in Egypt and Tunisia right now.

Thoughts?

Addendum: Last June, Perrow contributed an interesting piece to the Energy Collective blog here about why we should abandon deepwater drilling, based on his principle of normal accidents. If you enjoyed this post, you might find it interesting. In fact, you might find it much more interesting than this post, so go read it!

Responses


Nick, thanks for your thoughts on the matter. Actually I was reminded of Nassim Taleb’s “Ten principles for a Black Swan-proof world” which include these two particularly relevant notes:

1. What is fragile should break early while it is still small. Nothing should ever become too big to fail. Evolution in economic life helps those with the maximum amount of hidden risks – and hence the most fragile – become the biggest.

[…]

5. Counter-balance complexity with simplicity. Complexity from globalisation and highly networked economic life needs to be countered by simplicity in financial products. The complex economy is already a form of leverage: the leverage of efficiency. Such systems survive thanks to slack and redundancy; adding debt produces wild and dangerous gyrations and leaves no room for error. Capitalism cannot avoid fads and bubbles: equity bubbles (as in 2000) have proved to be mild; debt bubbles are vicious.

Nick, glad to have you back in the blogosphere. There are many choices in the design and management of complex systems, which you rightly address. Normal accidents, as discussed, do get "accepted" as accidents instead of design failures. Ideally, systems such as Three Mile Island would be studied for MTBF involving human UI factors and situations. The other trade-off is robustness vs. efficiency: tightly coupled systems tend to be efficient, but brittle. Complexity usually delivers expanded feature sets, which are desired; the opposite is elegance and simplicity.

The human drive for doing more with less pushes these variables toward unseen normal failure, or what is sometimes referred to after the fact as "hidden" paths in the system. Usually the hidden paths lead to negative or extremely negative outcomes.

I also agree that "too big to fail" is not an excuse. A bank that becomes large enough to threaten the system has by definition extended the boundaries of its "operation" to include others, and therefore should be managed or shrunk in such a way that its actions or failures do not impinge on others within the system.

Readers of this may be interested in joining the Black Swan group on LinkedIn. The discussion is about these types of design risks, human factors and systemic issues. Membership is for professionals in risk, educators in risk, or those managing over $100m in fiscal or physical assets.

Nick, what role does contingency play in the increase in complexity? I note Nick Grogerty's comments regarding efficiency and brittleness, but wonder if efficiency, or "tightly coupled" systems, and contingency are mutually exclusive? Similarly, if systems are grown diagrammatically rather than vertically, are the associated costs prohibitive? Regarding Grogerty's comments on "too big to fail": is it possible to separate one system, in this case the banks, from another system, the economy? At what point is there a diaphragm to separate these systems, which are otherwise mutually reliant?

Hey Jeremy, good to hear from you! Hope you’re doing well. Have you been up to Berlin to visit Rick and Shell? I have yet to get my butt (and travel plans) into gear…

Re the trade-offs associated with contingency, I think it depends what you mean by contingency, since it can be used in so many different ways. In the way I think of it, complex systems are inherently contingent in that a wide range of future states depend on variables in ways that are difficult to predict in advance. In the sense that contingency allows you to plan for such future states and create safeguards, the challenge is the sheer number of possible states, and the fact that some contingencies might solve certain risks but create others.

A case in point: a key contributing factor in the crash of Lufthansa flight 2904 was actually a risk management mechanism that had been implemented on the Airbus A320, a safeguard that made it impossible to engage reverse thrust (air braking) without weight on both landing struts. This had been installed after the crash of a Lauda Air flight that had broken up mid-air when one of its engines accidentally went into reverse thrust during flight.

Unfortunately this fail-safe mechanism contributed to the deaths of two people on flight 2904 when, thanks to a mix-up over the wind conditions, the plane landed faster than normal on one wheel (banked for a cross-wind) and, thanks to water on the runway, took 9 seconds for the second wheel to touch down. During this time the plane was unable to brake, and it crashed into an embankment at the end of the runway, starting a fire that penetrated the cabin. Hearing that story from a Lufthansa official just before boarding a flight back to Geneva did not make me happier about flying!

This same principle leads straight into your point about separating systems. In one sense, the challenge with the banking system is that it is intended to reduce the transaction costs of allocating capital in the economy. Even the operations of hedge funds and speculation on commodity markets are designed to have this effect. So to try and create a barrier between banks and the real economy seems counter-productive. Unfortunately, in recent years the complexity of the products and operations (many of which were, ironically, created and marketed as ways to *reduce* risk) meant that they were used in ways that drastically increased risks for certain parts of the system. It is unfortunate indeed that the least sophisticated parts of the system (the banks doing the lending, the home-owners relying on property markets, the businesses needing lines of credit etc.) ended up absorbing this risk.

So, as Nick G suggests, the trick is not so much to sacrifice efficiency by creating barriers, but instead to redefine efficiency over longer time-frames (and from the perspective of less sophisticated parts of the system) by looking for elegance and simplicity rather than overlaying successive sets of contingency measures that deepen complexity. The cost of doing this might be far higher upfront (redesigning systems from the ground up, as you can probably confirm from your work, is more expensive in the short term), but if the design is right, it has the potential to be far more effective in the long term.

Do you have any examples from design and architecture that might be able to counter or confirm this view? I spend too much time getting all conceptual, so some more practical examples would be very interesting.

Contingency isn't exactly mutually exclusive. Tightly coupled systems mean there is less response time (measured in the time or event domain) for contingency to be enabled or to act.

On complexity: each extra bit of complexity adds extra paths of system behavior. These paths may be hidden "unintended consequences". From an engineering perspective, extra complexity increases the degrees of freedom in the system by increasing the potential environments in which it acts. For example, a "human" kill switch installed in a system suddenly has exposure to human error as well as safer system maintenance. The added safety contingency has increased the complexity of the system, but it is crucial from a design perspective that the increased complexity is a net system benefit, i.e. increasing system reliability and management oversight. Too many "safety valves or monitors" can overwhelm operators, which is what happened at Three Mile Island. The important thing to acknowledge in any system is that there is no absolute: every design change brings good and bad. If a person can't highlight the areas of "extra risk", however remote, added with a design change, get a new systems designer or engineer.
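The kill-switch trade-off can be put in rough numbers. This is a back-of-the-envelope sketch with made-up probabilities, not an engineering model: a safeguard is only a net benefit when the risk it removes outweighs the new failure paths (such as human error) it introduces.

```python
# Sketch of the safeguard trade-off: adding a contingency removes some
# of the original risk but introduces its own failure modes.
# All probabilities below are invented for illustration.

def net_failure_prob(base_risk: float, coverage: float, new_risk: float) -> float:
    """Failure probability after adding a safeguard.

    base_risk: probability of the original failure mode
    coverage:  fraction of that risk the safeguard eliminates
    new_risk:  probability of a failure introduced by the safeguard itself
               (e.g. human error around a kill switch)
    Assumes the residual and new failure modes are independent.
    """
    residual = base_risk * (1 - coverage)
    return 1 - (1 - residual) * (1 - new_risk)

without_safeguard = 0.010
helpful = net_failure_prob(0.010, coverage=0.9, new_risk=0.001)  # net improvement
harmful = net_failure_prob(0.010, coverage=0.9, new_risk=0.020)  # net worsening

print(f"no safeguard:      {without_safeguard:.4f}")
print(f"reliable safeguard: {helpful:.4f}")
print(f"clumsy safeguard:   {harmful:.4f}")
```

The same 90%-effective safeguard either roughly halves the failure probability or doubles it, depending entirely on how often the safeguard itself misfires – which is the "no absolute, every design change brings good and bad" point above.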

[…] be interrupted by such rippling. And even more than the need for a subject to experience the risk, as I’ve written about before, we very often contribute to the creation of risk inadvertently, even while actively trying to […]