“What Is SRE? An Introduction to Site Reliability Engineering” (free, but registration required) is an ebook by Kurt Andersen & Craig Sebenik, published by O’Reilly. The following is a summary (abridged copy and paste) of the parts I found most useful, with a few of my own notes. The original is about 9,000 words; this is about 2,000.

I highly recommend reading the original in its entirety, if you have time, and I’m a big fan of the Accelerate book too. As with all the other summaries I create, this is just a way to help me digest and understand an excellent article.

I caught a talk by Tori Wieldt at the New Relic booth at AWS re:Invent on “SRE principles”. Even though it was a short talk in the expo hall, rather than a formal scheduled one, it had a ton of good SRE material.

Some notes on the “Reactive DDD – When Concurrent Waxes Fluent” talk by Vaughn Vernon (author of Implementing Domain-Driven Design) at QCon 2018. (Currently I think you need to be logged in as a ticket holder to see the talk – I will post a link if it becomes public.)

“Distributed Systems Observability” is a book from Cindy Sridharan (find her on Twitter and Medium), available as a free download here (registration required). At a little over 30 pages and 8,000 words, it is not a difficult read, and I definitely recommend it.

“Chaos Engineering” is a book from O’Reilly (free download), written by folks from “The Chaos team” at Netflix. It is a GREAT read for anyone interested in resilience engineering. This post is essentially a cut and paste of the most salient parts (the original is about 16,000 words; this is about 3,000), with some paraphrasing and merging/rewriting of sections for brevity.