Ops, Entrepreneurism, Tools, and Thoughts.

Failure Hurts

As the Internet, technology and consumers collide, we often hear, ‘What are you going to do to ensure this never happens again?’ For a while, this really bothered me, and that one sentence was enough to take me from zenlike calm to ultra rage. While I am sure we all endeavour to keep failure from occurring, it does. Thermodynamics tells us that closed systems always seek a state of maximum entropy, and so it goes in Web Operations. This is why our entire existence is formed around MTTR. One day, in my state of annoyance with my inability to project certainty, I came across this tweet from Mark Burgess:

I was missing the point. All along, I thought there was nothing meaningful in failure (and no one needs nothing) but as it turns out, there is a great deal of meaning if you can get past the hurt. Here’s the pattern I employ:

Assumptions

There is some type of metrics collection in use so you can measure response to inputs

There is a willingness to employ a continuous improvement loop

The ‘Lawrence of Arabia’ Pattern

blameless post mortem

You need to understand what happened and why it hurts. It’s not enough to say the internet is hard and move on. As much as possible, quantify your pain. Can you reduce an outage to dollars per minute? Can you measure the impact on your brand reputation? To get meaning from failure, you simply must know what impact the failure has. Here are some examples of how to enact the blameless postmortem:
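For the dollars-per-minute question in particular, a back-of-the-envelope calculation is often enough to start the conversation. The sketch below is only an illustration: it assumes revenue is spread evenly across the day (rarely true), and the function name, parameters, and figures are all hypothetical.

```python
def outage_cost(revenue_per_day, outage_minutes, impact_fraction=1.0):
    """Rough revenue lost to an outage.

    Assumes revenue arrives evenly over 24 hours. impact_fraction is the
    share of traffic affected (1.0 means everyone), a hypothetical knob
    for partial outages.
    """
    minutes_per_day = 24 * 60
    revenue_per_minute = revenue_per_day / minutes_per_day
    return revenue_per_minute * outage_minutes * impact_fraction


# Hypothetical numbers: $144,000/day, a 30 minute outage, half of users affected.
cost = outage_cost(144_000, 30, impact_fraction=0.5)
print(f"Estimated loss: ${cost:,.2f}")
```

A figure like this will not capture brand damage, but it gives the postmortem a concrete number to improve against.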

one human correction

Once you have a bead on the pain, the next step is to enact your continuous improvement loop. With my team, we always set the intention that we will make (at least) one improvement based on human behaviour in the wake of an outage event. We don’t mind that it hurt. We do want to make sure that the pain we feel next time is new and different. Some questions we ask to identify that human-based improvement:

What business assumptions have changed?

What technology assumptions have changed?

What could we do to make it easier to do the right thing?

example

Once upon a time we suffered pain from hysterical stop-the-world GC events. The human improvement was to find a way to post to a dashboard when these hysterical GCs were occurring so that we, as the operators, could quickly see that it was a GC loop and act accordingly. A little Graphite and a small daemon go a long way to minimizing MTTR.
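To give a feel for how small such a daemon can be, here is a minimal sketch. It tails a GC log, pulls out pause times, and sends them to Graphite using Carbon's plaintext protocol (one "path value timestamp" line per data point, port 2003 by default). The log line shape, the metric name, and the host are assumptions for illustration; adjust the pattern to whatever your JVM actually logs.

```python
import re
import socket
import time

# Assumed log line shape, e.g.
# "Total time for which application threads were stopped: 0.0123 seconds"
PAUSE_RE = re.compile(r"application threads were stopped: ([0-9]+\.[0-9]+) seconds")


def parse_pause_seconds(line):
    """Return the pause length in seconds, or None if the line is not a pause."""
    match = PAUSE_RE.search(line)
    return float(match.group(1)) if match else None


def graphite_line(path, value, timestamp):
    """Format one data point for Carbon's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send formatted lines to Carbon over TCP (host is a placeholder)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


def follow(log_path, poll_seconds=1.0):
    """Yield lines appended to the file, like `tail -f`."""
    with open(log_path) as handle:
        handle.seek(0, 2)  # start at the end of the file
        while True:
            line = handle.readline()
            if not line:
                time.sleep(poll_seconds)
                continue
            yield line


def run(log_path, metric="ops.jvm.gc.pause_seconds"):
    """Loop forever, forwarding each observed pause to Graphite."""
    for line in follow(log_path):
        pause = parse_pause_seconds(line)
        if pause is not None:
            send_to_graphite([graphite_line(metric, pause, time.time())])
```

Calling run() with the path to your GC log starts the loop; a dashboard graph of that metric then makes a GC storm visible at a glance.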

one technology correction

Everything changes and changes fast. What assumptions have you made about your technology collection in your complex system? Are the queues queuing well? Is the JVM tuned for the way people use your system? Can you do anything to catch this and make the feedback loop shorter? I encourage the introspection so that one improvement is made on the technology side.

example

Let’s stay on the GC example for fun. After torpedoing the heap, the obvious tuning parameter is to merely add more. Indeed, that was our first approach to buy some time. Then our introspection kicked in and we realized that our user base had grown and their habits had evolved. We decided to look at a different JVM that would more closely match how people used our software.

conclusion

If you were to take one thing away from this pattern, I hope that it is the inspiration to find meaning in failure. Like Lawrence of Arabia, there is beauty in the desert of failure, and if you can withstand the pain it’s a wonderful way to improve what you do.