Articles

The big story this week is the release of the inaugural issue of Increment, a newsletter by Stripe, edited by Susan Fowler. They bill it as “A digital magazine about how teams build and operate software systems at scale”, and the first issue, dedicated to on-call, certainly delivers. Below, I’ll include my short take on each article in the issue.

Increment interviewed over thirty companies to build a picture of the common practices in incident response. I’m actually pretty surprised to hear that “it turns out that they all follow similar (if not completely identical) incident response processes”, but apparently the commonalities don’t stop at just process:

Slack and PagerDuty appear to be two points of failure across the entire tech industry

After laying a solid groundwork of suggestions for avoiding burn-out in on-call, this next Increment article raises a really important point: on-call affects people differently based on privilege. Example: single parents are going to have a much harder time of it.

[…] if you set up an on-call rotation with a schedule or intensity that assumes the participants have no real responsibilities outside of the office, you are limiting the people who will be able to participate on your team.

Outages

A DDoS took out their DNS service, taking down customer domains as well as sites that they host for customers. While this is a news article and not a formal post-analysis, it does include some pretty interesting technical detail from an interview with their CTO. I’m not sure that he did himself any favors by quoting the definition of their SLA:

“People look at 99.9 per cent and think that’s seconds of downtime, but you work it out and it’s 45 minutes.”
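For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the 99.9 per cent figure is measured over a calendar month, which the article does not state, and the function name is my own invention for illustration.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


# Assumption: the SLA window is one calendar month (not stated in the article).
for days in (30, 31):
    budget = downtime_budget_minutes(99.9, days)
    print(f"{days}-day month at 99.9%: {budget:.1f} minutes")
```

Under that assumption the budget comes out to roughly 43 to 45 minutes per month, which is in line with the quoted figure. Measured over a 365-day year, the same percentage would allow about 8.8 hours.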