SPONSOR MESSAGE

Are your incident management skills sharp, or are you continuously fighting fires? Take the free, online incident management assessment from VictorOps and compare your practices against leading DevOps methodologies: http://try.victorops.com/ima/sreweekly

Articles

This article is about the risks of automation. While automation can reduce risk by making errors less likely, it also disengages human operators from what’s actually happening, meaning that they’re less likely to catch and correct problems.

The author spent seven months sifting through, categorizing, and documenting over 1700 production incidents. The result was impressive: a massive improvement in the SRE team’s incident response process and documentation. It’s got me wondering if we can do something similar at $JOB.

A danger of a microservice architecture is that one failing service can affect those that depend on it, even indirectly. The Netflix API handles over 10000 requests per second, and it was carefully designed to avoid the case where a slow dependency breaks unrelated requests.

Nuclear Family is an interactive play in which the audience is presented with critical decisions as the characters move inexorably toward a nuclear plant disaster. The goal is to demonstrate local rationality, the principal that people make the best decision they can with the information they have at hand — even if in retrospect that decision led to an adverse outcome.

Last year, PagerDuty moved toward giving developers operational responsibilty for the systems they create. The really cool thing about their transition is that they have hard stats on reduction of incidents, decrease in MTTR, and increase in changes deployed to production.