Links for 2014-07-22

We believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position as an operations-exclusive concern and places it more appropriately alongside its peers as an engineering process. Given access to the real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.

It breaks down into the following principles: ‘Instrumentation-as-Code’, ‘Single Source of Truth’, ‘Developers Curate Visualizations and Alerts’, ‘Alert on What You See’, ‘Show me the Graph’, and ‘Don’t Measure Everything (YAGNI)’. We do all of these at Swrve, naturally (a technique I happily stole from Amazon).
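A minimal sketch of what the ‘Instrumentation-as-Code’ principle might look like in practice: metric and alert definitions live in the codebase, versioned and reviewed in the same PR as the feature they instrument. The names here (`Registry`, `AlertRule`, `checkout.errors`) are invented for illustration, not taken from the article.

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    metric: str
    threshold: float

@dataclass
class Registry:
    metrics: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def counter(self, name):
        # Register a named counter and return a function that increments it.
        self.metrics.setdefault(name, 0)
        def incr(n=1):
            self.metrics[name] += n
        return incr

registry = Registry()
checkout_errors = registry.counter("checkout.errors")

# The alert threshold is declared next to the instrumentation, so it goes
# through the same code review as the feature itself.
registry.alerts.append(AlertRule(metric="checkout.errors", threshold=10))

checkout_errors()
checkout_errors(2)
print(registry.metrics["checkout.errors"])  # 3
```

The point isn’t the toy registry itself; it’s that the definitions are plain code, so they get diffed, reviewed, and rolled back like everything else.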

2 Comments

The most important point is that it shouldn’t be the developers producing the dashboards. That usually results in a 10×10 matrix (I’ve seen worse) of very small graphs recording things that seem important to a developer but aren’t useful for actually diagnosing a problem in an emergency. I’ve even seen dashboards where two sets of dense graphs were separated by a set of tables of raw values. Assuming they’ve been trained on the system or helped design it, SREs (or whatever the role is called) usually have a much better idea of which metrics are critical and where to present them (above the fold is important).

Another issue that’s either brushed over or possibly wrong is what to alert on. It’s important to be clear that alerting on symptoms is the best option (though not a strict rule). I’ve almost never seen a developer come up with useful symptom-based alerts; they always seem to prefer ones that reflect their internal state assumptions (like asserts in code), e.g. alerting when a backend is problematic rather than on the problem the iffy backend causes.
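To make the symptom-vs-cause distinction concrete, a hedged sketch: the metric names and thresholds below are invented for illustration, not from the comment.

```python
# A cause-based alert encodes an internal state assumption ("this backend
# is iffy"); a symptom-based alert fires on what users actually experience.

def cause_alert(backend_error_rate: float) -> bool:
    # Fires on internal state, even if retries or failover mean no user
    # ever notices the problem.
    return backend_error_rate > 0.05

def symptom_alert(success_rate: float, p99_latency_ms: float) -> bool:
    # Fires only when requests are actually failing or slow at the edge.
    return success_rate < 0.999 or p99_latency_ms > 500

# A flaky backend hidden behind a retry layer: the cause alert pages
# someone at 3am, while the symptom alert correctly stays quiet.
print(cause_alert(backend_error_rate=0.10))                    # True
print(symptom_alert(success_rate=0.9995, p99_latency_ms=120))  # False
```

When the symptom alert does fire, the cause-level metrics still earn their keep, but on the dashboard, for diagnosis, not as pages.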

It may be coming from a very devops-y POV. In Amazon, most services are
operated by their developers (i.e. the original devops approach), so it’s
very much in the dev’s interest to make sure the dashboard surfaces the
important operational metrics right at the top, so that when they themselves
are paged at 3am they don’t have to perform chart spelunking.

This also drives the addition of good, alertable metrics to the code,
since they tend to get pissed off when they’ve been paged at 3am three
nights running for some internal P99 latency hitting 100ms, when everything’s
fine at the service level ;)

Good point though. Maybe that only works when the devs are opsing.

(also: 10×10. jaysus. At least with Amazon or Graphite it quickly becomes
obvious that the dashboard is not viable with that many graphs when it OOMs
your browser tab!)