Acronyms you should know: MTTD and MTTR

If you’re a SUMO contributor, there are two acronyms you will start to hear more often from us developers: MTTD and MTTR.

They mean “mean time to detect” and “mean time to resolve,” respectively, and they refer to how long it takes to detect an issue in production, and how long it takes to resolve that issue once it’s detected.

As we move toward continuous deployment, these are two of the metrics we’ll be using to gauge the effectiveness of our tools and processes.

For major production issues, our MTTR is actually fairly good right now—if it’s something that cannot wait until the next scheduled release, it takes us 60-90 minutes from becoming aware of an issue to pushing a fix. I think we can do better with better release processes, but we’re starting off pretty good and going to get better, which is great.

Our MTTD, on the other hand, needs work. SUMO 2.8.1 upgraded Django and included a sweeping change to our CSRF protection—this necessarily affected every form on the site. We discovered three related issues that warranted immediate hotfixes, but we didn’t discover two of them for almost two days when our contributors brought them to our attention.

It’s great that our contributors pointed out these issues to us. Our community is a critical part of “detection” and I want to encourage everyone to point out issues in the forums or IRC. It’s extremely helpful!

But there are things we can do, too, to notice things faster. One thing we are working to add is business metric graphs. We have useful data in Ganglia right now, but we will be using Graphite and Etsy‘s StatsD to peer into what our users are doing. If we deploy a change and notice that no one is previewing articles, for example, we know immediately that we have an issue and can start diagnosing and fixing it.

If you follow SUMO development, you’ll hear us start using terms like MTTD, MTTR, “detection,” more, and talking about how to reduce them. We welcome your input and ideas as we start working on these challenges. And of course, keep telling us when things are broken!