Articles

Managing the burden of on-call is critical to any organization’s incident response. Tired incident responders make mistakes, miss pages, and don’t perform as effectively. In SRE, we can’t afford to ignore this stuff. Thanks to VictorOps for doing the legwork on this!

Not strictly related to reliability (unless you’re providing ELK as a service, of course), but I’ve found ELK to be very valuable in detecting and investigating incidents. Scaling ELK well can be an art, and in this article, Etsy describes how they set theirs up.