“Distributed Systems Observability” is a book by Cindy Sridharan (find her on Twitter and Medium), available as a free download here (registration required). At a little over 30 pages and 8,000 words, it is not a difficult read, and I definitely recommend it.

“Testing in production” used to be a joke. The implication was that by claiming to test in production, you didn’t really test anywhere, and instead just winged it: deploying to production and hoping that it all worked. Times have changed, however, and testing in production is becoming accepted as a best practice.

“Chaos Engineering” is a book from O’Reilly (free download), written by folks from “The Chaos Team” at Netflix. It is a great read for anyone interested in resilience engineering. This post is essentially a cut and paste of the most salient parts (the original is about 16,000 words; this is about 3,000), with some paraphrasing and merging/rewriting of sections for brevity.

Pre-release tests are essential, but the ability to debug, monitor, and observe your application suite post-release is what allows you to detect, and quickly fix, the production problems that will inevitably arise.