This is the first issue sent to over 2000 email subscribers (not to mention the 500+ Twitter followers and an unknown number of RSS subscribers!). Wow! Thank you all so much for reading and for all the great feedback you’ve sent over the past year and a half. You make this fun.

Articles

The holy grail of high availability is a multi-datacenter (or cloud) active/active architecture. This article explains why, with examples of common pitfalls in traditional disaster recovery solutions.

Neat idea: here’s a Stack Overflow question asking for critique of a proposed outline for a post-incident analysis. It’s a great start already, and the answers include some pretty top-notch suggestions.

Last week, I linked to an article about debugging an overloaded ELB node. This week we have the sequel, a deep dive into the intricate details behind the problem, complete with a trip into the glibc source code.

Zayna Shahzad, a PagerDuty software engineer, did customer support for a day, and she learned a ton. As SREs, we have the customer experience directly in our sights, so this kind of thing sounds like a really great idea.

Charity Majors does not want to be an SRE. Find out why by watching this 5-minute video interview between her and Rob Hirschfeld. I don’t often link to videos, because who has time to watch stuff? But this one is pretty intriguing.

Do what better? Prevent and end illegal and unethical actions like discrimination, harassment, and retaliation. This article is by Susan Fowler, who has been featured here a bunch, and while it’s not directly related to SRE, it’s so important that I urge you to read it.

Outages

Monitorama (and a swathe of Portland) suffered a power outage last week. The organizers created a status site post (linked) and quickly organized a disaster recovery site: an entirely separate conference venue. Seriously amazing work, and oddly appropriate given the conference subject matter.

If you didn’t make it to Monitorama, here’s a summary from LinkedIn SRE Michael Kehoe.