Steve Wilson has over 15 years of experience in individual contributor and management roles focused on applying technology to business strategy. He is currently a Sr. Solutions Engineer at Turbonomic. His experience working with companies across industries, in both development and operations, gives him keen insight into cross-organizational performance challenges. With these credentials, Steve brings an innovative perspective on IT and business collaboration across the entire lifecycle of an application.

The DevOps Challenge of MTTR: The Chains that Bind IT, Pt. 2


In part one of this series, we focused on the need to move away from mean time to repair (MTTR) in order to reduce technical debt. To do so, a team must focus on creating a system that increases the mean time between failures (MTBF). This is the challenge that must be addressed in order to have a meaningful impact on the business and start doing DevOps at speed.
Here we’ll look at two other ways of viewing the problem to begin this shift to MTBF: Proactive vs. Preventive and Imitate vs. Iterate.

MTTR says be proactive, MTBF says be preventive

Conventional practice states that you have to reduce the amount of time between a problem occurring and its resolution. I cannot tell you how many times I have spoken to operations teams who say they want to be proactive. When asked what that means, they say they want to know when problems are occurring and fix them before the end user notices. That is not being proactive; that is being hyper-reactive.

The problem with trying to reduce the time from failure to fix is that it is impossible to predict how long that will take, or whether it can be done at all. Today most systems are becoming one-way adaptive: all the components of an application adapt to changes in the environment, so when you try to back out the change that caused the failure, the system never truly returns to its previous state. Managing the environment becomes even harder when a change cannot be undone. In this context, failing forward is the only way to operate.

This inability to truly back out changes makes mean time between failures even more important. Having a way to absorb these changes and reroute around problems is a better way to address the imperfections injected into the system. This is why understanding application demand, and mapping that demand to the environment, is critical: it allows more change to happen while insulating the business riding above from small failures.

Companies like Netflix are doing this right now. Netflix uses a well-documented tool called Chaos Monkey, which actively goes through the environment and causes issues so that the resiliency and failover of the application components can be tested. This limits the impact of failure by creating resilient components and services, and it allows for a more flexible delivery channel. It also lets the operations and development teams work together to understand how applications and infrastructure respond to imperfection in the system. Through these exercises, Netflix is building a platform that is preventive in nature, allowing development teams to rapidly test and deploy new features and functions while ensuring that end users are never truly impacted.
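The idea can be sketched in a few lines of Python. This is a toy model, not Netflix's actual tooling (which terminates real cloud instances): a hypothetical chaos_monkey function randomly kills replicas, and the caller fails over to the next healthy one instead of surfacing the error. All names here are illustrative assumptions.

```python
import random

random.seed(42)  # seeded only so the sketch is reproducible

class Replica:
    """A toy stand-in for a redundant service instance."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def chaos_monkey(replicas, kill_probability=0.3):
    """Randomly take replicas down to test the cluster's resilience."""
    for replica in replicas:
        if random.random() < kill_probability:
            replica.healthy = False

def resilient_call(replicas, request):
    """Absorb individual failures by rerouting to the next healthy replica."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue  # this node failed; try the next one
    raise RuntimeError("all replicas down")

cluster = [Replica(f"node-{i}") for i in range(5)]
chaos_monkey(cluster)                          # some nodes go down...
print(resilient_call(cluster, "GET /titles"))  # ...but the request still succeeds
```

The point of running this kind of exercise continuously is that the failure is contained and routed around, not repaired: as long as any replica survives, the request succeeds and the end user never sees the fault.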

MTTR says imitate, MTBF says iterate

A lot of the focus within companies trying to implement a DevOps practice has been on finding best practices and then integrating those practices into the corporate culture: look at how other companies are addressing technical debt, then copy and paste that here. The problem is that DevOps is not an imitation of best practices. It is an iteration of common-sense philosophies guided by the needs and culture of the business. It should be the business that drives IT to deliver a platform that is failure-resistant, secure, and flexible.

An MTTR-based approach wants to inject these best practices into policy and procedure with the mindset of shortening the root-cause cycle. This tends to slow change volume down so that it can be managed. This metered release cadence lets change pile up, only to be pushed all at once.

This is a challenge because when there is a problem, it still takes a human to intervene, interpret all the data, and try to figure out what exactly is going on. Because of the practices mentioned earlier, more data and hyper-reactivity, there are many factors to account for when trying to understand the problem. Out of desperation to put the environment back in a healthy state, many changes have to be rolled back.

During this time, technical debt accrues heavily. The bigger the problem, the more precious resources are redirected from top projects to investigate and remediate it. Any change that is backed out is more work piled onto the next release window, and failed changes mean more work goes into ensuring the problem does not happen again.

An MTBF-based view wants to speed up change. Smaller changes, delivered more frequently, make it easier to identify problems and remediate them. They make small failures even smaller. They create blameless postmortems in which the way applications are delivered and controlled can be constantly evaluated to eliminate waste.

Extending mean time between failures does not mean the whole system stays up; failure at some small scale will still occur. It means that small parts can fail while the whole application continues to work. That is what you want, not a platform fortified against an onslaught. Those are the architectures that collapse spectacularly when they finally fail. You want a system that will bend but not break.

In a virtual environment with multiple nodes, when one node fails, the workload on the other nodes changes in response. The demand shifts, and that shift can cause issues. If you are managing the environment based on demand, then automation can adjust, moving the demand to where it can be met with supply. This mitigates the failure to just that node and keeps the problem from rippling up to the end user.
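That demand-to-supply matching can be sketched in miniature. The node and workload structures below are hypothetical (real platforms do this against live utilization data and many more resource dimensions); the point is simply that orphaned demand gets placed wherever headroom remains.

```python
# Illustrative sketch, not any specific product's API: when a node fails,
# move its workloads to the healthy nodes with the most spare capacity.

def rebalance(nodes, failed):
    """Re-place the failed node's workloads on nodes with headroom."""
    orphans = nodes[failed]["workloads"]
    nodes[failed]["workloads"] = []
    nodes[failed]["up"] = False
    for wl in orphans:
        # Pick the healthy node with the most unused capacity for this demand.
        target = max(
            (n for n in nodes.values() if n["up"]),
            key=lambda n: n["capacity"] - sum(w["demand"] for w in n["workloads"]),
        )
        target["workloads"].append(wl)

nodes = {
    "a": {"up": True, "capacity": 10, "workloads": [{"name": "web", "demand": 4}]},
    "b": {"up": True, "capacity": 10, "workloads": [{"name": "db", "demand": 6}]},
    "c": {"up": True, "capacity": 10, "workloads": [{"name": "cache", "demand": 3}]},
}
rebalance(nodes, "a")  # node "a" fails; its "web" workload must land somewhere
print([w["name"] for w in nodes["c"]["workloads"]])
```

Here "web" lands on node "c", the node with the most headroom, rather than overloading "b". The failure is absorbed by the remaining supply instead of surfacing to the end user.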

It is hard to think against the grain when MTTR is so often the way the performance of an IT team is judged. If you really think about the problem you are trying to solve, it is not shortening the time to fix a problem; it is keeping the problem from impacting the end user, or from ever happening at all. You can never remove error from the system; that is impossible. But you can reduce the impact a failure has and build a resilient platform that can take a hit and still deliver QoS. Next time you hear talk of needing to manage MTTR, think about turning the conversation to the true problem and look at ways to increase the MTBF.