Devops, complexity and anti-fragility in IT: Risk and anti-fragility

I am taking a few posts to explore the rise of new software development and operations models, and why these models are critical to the enterprise. Today, I want to explore the risk economics of software development and the concept of “anti-fragility.”

Enterprise IT organizations have spent decades trying to create systematic approaches to control and (hopefully) eliminate disruption in computing operations. The standard approach to date has been to strictly control change. Now, concepts like continuous integration and deployment, modularized application systems, and “fail fast” agile processes encourage continuous change.

Embracing anti-fragility

So why would anyone want to promote an approach that encourages constant change, when failure in the form of outages or breaches or large-scale processing errors exacts such a heavy toll on businesses? The short answer is because some application domains require it, but that’s also a bit glib. Instead, let me bring in the concept of “anti-fragility,” as coined by Nassim Nicholas Taleb in his book “Antifragile: Things That Gain from Disorder.”

I explained the gist last week:

“Anti-fragility is the opposite of fragility: as Taleb notes, where a fragile package would be stamped with ‘do not mishandle,’ an anti-fragile package would be stamped ‘please mishandle.’ Anti-fragile things get better with each (non-fatal) failure.”

Anti-fragile systems benefit from variability and can take advantage of deviations from the “normal” to ultimately gain value. They behave in such a way that failures due to change exact a small cost, while successful change drives disproportionately higher value, so the system gains overall. Taleb argues this is only achieved by keeping the scope of each activity small enough that the downside risk is manageable (and results in strengthening the system), and that any gains can be sustained over time.
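Taleb’s asymmetry can be made concrete with a toy model (the numbers here are mine, purely illustrative, not Taleb’s): a portfolio of many small experiments, each with a bounded loss and an occasional large payoff, versus one big bet with the same total stake and the same odds.

```python
import random

random.seed(42)

def many_small_bets(n=100, cost=1, payoff=20, p_success=0.1):
    """n small experiments: each loses `cost` on failure and
    gains `payoff` on success, so any single failure is cheap."""
    total = 0
    for _ in range(n):
        total += payoff if random.random() < p_success else -cost
    return total

def one_big_bet(cost=100, payoff=2000, p_success=0.1):
    """One large release with the same total stake and odds:
    a single failure wipes out the whole investment."""
    return payoff if random.random() < p_success else -cost

# Expected value is identical in both cases
# (100 * (0.1*20 - 0.9*1) = 110, and 0.1*2000 - 0.9*100 = 110),
# but the small-bet portfolio rarely ends deeply negative,
# while the big bet loses everything 90% of the time.
print(many_small_bets())
print(one_big_bet())
```

The point is not the specific numbers but the shape of the payoff: capping the downside of each individual change is what lets the occasional big win dominate.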

Taleb shows why the traditional approach of operations (making change hard, since change is risky) is flawed: ‘the problem with artificially suppressed volatility is not just that the system tends to become extremely fragile; it is that, at the same time, it exhibits no visible risks. . . . These artificially constrained systems become prone to Black Swans. Such environments eventually experience massive blowups. . . . catching everyone off guard and undoing years of stability or, in almost all cases, ending up far worse than they were in their initial volatile state’ . . .

Jez Humble, co-author of “Continuous Delivery,” makes the connection to software: “This is a great explanation of how many attempts to manage risk actually result in risk management theatre — giving the appearance of effective risk management while actually making the system (and the organization) extremely fragile to unexpected events. It also explains why continuous delivery works. The most important heuristic we describe in the book is ‘if it hurts, do it more often, and bring the pain forward.’ The effect of following this principle is to exert a constant stress on your delivery and deployment process to reduce its fragility so that releasing becomes a boring, low-risk activity.”

Today’s IT models don’t demonstrate that behavior, at least at the project level. As Humble noted, most IT projects are highly fragile — a few relatively small errors during development or operations can send the entire project crashing down at an inopportune time. IT projects (and individual project releases, for that matter) tend to:

have giant scopes of hundreds or thousands of requirements.

be managed through a series of organizational silos with weak feedback loops between them.

introduce new operations vulnerabilities with each release, due to dependence upon manual process steps, and highly context-specific, fragile “scripting.”

Change, therefore, is artificially suppressed, or at least intensely controlled. This just makes projects more fragile in the long term, especially from the perspective of meeting constantly changing business needs.

Approaching anti-fragility through devops

It doesn’t have to be this way.

One solution to that problem is highlighted today in the form of devops or “noops”-driven software organizations like Netflix and Etsy. The software approach these organizations take is one of releasing small changes as often as possible, with heavy reliance on automation, and — this is very important — measuring the resulting effect on dynamics important to the business stakeholders.

Oh, and they can quickly reverse or replace stuff that doesn’t work out as expected. Which happens fairly often. Which leaves them no worse off than they were before they tried the change. See the anti-fragility yet?
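One common mechanism that makes reversal this cheap is the feature flag. Here is a minimal sketch; the names and thresholds are hypothetical illustrations, not Netflix’s or Etsy’s actual tooling:

```python
# Hypothetical feature-flag sketch. Flag names, metrics and thresholds
# are illustrative stand-ins, not any specific organization's tooling.

FLAGS = {"new_checkout_flow": True}  # flipped via config, not a redeploy

def old_checkout(cart):
    return sum(cart)

def new_checkout(cart):
    return sum(cart) * 0.95  # e.g., experimental discount logic

def checkout(cart):
    # Both code paths stay deployed; the flag picks one at runtime.
    if FLAGS["new_checkout_flow"]:
        return new_checkout(cart)
    return old_checkout(cart)

def rollback_if_regressed(error_rate, baseline=0.01):
    """Reverse the change by flipping the flag, not by redeploying."""
    if error_rate > baseline:
        FLAGS["new_checkout_flow"] = False
```

Because the old path is still deployed, reversing a change that doesn’t work out is a configuration flip that takes effect immediately, which is what leaves the organization no worse off than before the experiment.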

However, in order to get to this state of low-risk, constant experimentation, these organizations have had to employ skills, tools, processes and practices that are significantly different from the change-management techniques of the past. The most obvious qualities of their devops systems are:

Automation enforces certain practices, such as running automated test suites with every build (e.g., regression tests must pass before a build moves from dev to staging).

Culture enforces practices, such as Etsy’s practice of allowing developers to own their mistakes without fear of reprisal, which encourages tribal knowledge of how to avoid such mistakes in the future. Culture also dictates that dev, ops, security, business and other stakeholders all work together over the entirety of the application lifecycle.

Prudent measurement of all elements of the processes, tools and applications provides the key feedback necessary to continually strive for improvement.
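The automation and measurement points above can be sketched together as a promotion gate: a build only moves from dev to staging if its regression tests pass, and every attempt is recorded so the feedback loop has data. This is a hypothetical sketch; the test runner and metrics store are stand-ins for whatever CI and monitoring tools an organization actually uses.

```python
# Hypothetical promotion gate. The in-memory test results and metrics
# log are stand-ins for a real CI system and monitoring backend.

metrics_log = []  # stand-in for a real metrics/monitoring store

def run_regression_tests(build):
    """Stand-in for a real test suite; returns (passed, failures)."""
    failures = [t for t in build["tests"] if not t["passed"]]
    return len(failures) == 0, failures

def promote(build, from_env="dev", to_env="staging"):
    """Move a build forward only if the gate passes; record every attempt."""
    passed, failures = run_regression_tests(build)
    metrics_log.append({
        "build": build["id"],
        "from": from_env,
        "to": to_env,
        "promoted": passed,
        "failures": len(failures),
    })
    return passed

build = {"id": "b42", "tests": [{"name": "t1", "passed": True},
                                {"name": "t2", "passed": True}]}
print(promote(build))  # True: this build may move to staging
```

The key design choice is that the gate and the measurement are the same step: a human cannot promote a build without leaving a record, so the feedback loop cannot be skipped under deadline pressure.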

Devops and anti-fragility are by no means synonymous, however. Devops can be implemented in such a way that it doesn’t exhibit the trait of being anti-fragile — like when high developer turnover results from always putting the “best people” on the “next great thing,” and knowledge or culture is lost as a result.

Anti-fragility can also be achieved without devops, though I’m not aware of another consistent methodology for doing so. Nonetheless, anti-fragility is a trait to strive for, not a methodology in itself, so there are options for achieving that trait.

No silver bullet

And, lest you think we’ve hit upon yet another “drop everything and change the way you do things” approach to enterprise IT, I would caution against applying anti-fragility religion where the investment wouldn’t pay off.

Given the difference between devops and most “construction-method” approaches to IT that we see today, I would argue that enterprises should adopt devops, and address anti-fragility, first for those IT projects that would benefit from continuous change, such as marketing applications, business process automation and so on. Less critical are systems like core ERP databases and infrastructure that rarely need to undergo change.

You can’t crowbar a change-averse technology into a change-driven methodology. However, over time you might be able to adopt a few of the benefits to lessen risk when change is necessary in those systems.

Here’s a heuristic I’m experimenting with: the more that differentiation and adaptation matter to the solution at hand, the more anti-fragility is worth pursuing. Undifferentiated activities (such as running data center facilities or core SAP packages) should strive for resiliency, but can perhaps adopt automation and similar practices over time as part of a more traditional approach to software project control.

This heuristic is backed up, somewhat, by the work of my friend Simon Wardley, who has one of the most comprehensive theories of the evolution of enterprise activities from innovation through commoditization. Activities at different stages of that spectrum benefit from different practices, and IT is no exception to that rule.

In my next post, I’ll go into more detail about this spectrum of practices, define the “stability-resiliency” tradeoff and explain how enterprise IT can navigate it. In the meantime, this is an opportunity for you to express your thoughts on the subject of new IT models and old, either here in the comments or via Twitter, where my handle is @jamesurquhart.

For more discussion on devops and next-generation systems management, check out the panel discussion (which includes James Urquhart) from Structure 2012.

DevOps is not a technology or a team. It’s a culture change, and unfortunately only a few are getting it right because people keep getting in the way. I once heard “anything with two heads is a monster”; that’s why IT and development struggle to deliver even the simplest code. They are a two-headed monster. DevOps merges the two heads into one and tries to build a culture around teams working together instead of against each other. Here’s a post I wrote on the Shades of DevOps.

DevOps is nothing more than good old shell scripting on steroids. It looks like developers don’t have enough work, so they have started to muck around with system administration. The result is actually a slower rate of change, because what could previously be a simple shell script change now has to be floated to the DevOps team, who will put it on their agile voodoo dashboard, check it out from svn, make changes, check it back in, and run it through dev, QA and ultimately to production. On top of that, they are always backlogged, so what used to take a day in the olden days is now a two-week process with the DevOps team. Basically this is all part of the overall process to dehumanize our jobs.

The arc of innovation is not the only thing relevant to adoption of the “everyone makes mistakes” approach. When times are interesting and sub-organisations are on the downbeat, it is hard for an employee to say “we should try this and see if it works”; middle management may be looking over their shoulders rather than holding their heads up.