WebOps Hack #1: Simple Availability Report for Busy Teams

I created this spreadsheet for tracking availability and “days since last outage”. Along with the availability and uptime calculations, it asks the following questions:

What broke?

Why?

What fixed it?

What did we learn?

How can we prevent recurrence?

Who owns follow-up?

I’ve found this to be the “simplest thing that could possibly work” for identifying problems and tracking issues before a formal incident tracking system is in place, or for keeping vendors and other teams honest. Please let me know if it’s helpful for you and how it might be improved. (Feel free to improve upon it yourself too; it’s licensed under Creative Commons Attribution-ShareAlike.)

Link to the Google doc is here. You need to “Copy to a new spreadsheet” to be able to use it.
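For anyone who would rather script the numbers than keep them in a spreadsheet, here is a minimal Python sketch of the two calculations mentioned above: availability over a reporting period and “days since last outage”. This is not the spreadsheet’s actual formula set. The names `Outage`, `availability_percent`, and `days_since_last_outage` are hypothetical, and the sketch assumes outages are recorded with a start and end time and do not overlap one another.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Outage:
    """One row of the report (hypothetical layout, not the spreadsheet's)."""
    start: datetime
    end: datetime
    what_broke: str


def availability_percent(outages: List[Outage],
                         period_start: datetime,
                         period_end: datetime) -> float:
    """Percentage of the period during which the service was up.

    Assumes outages do not overlap; overlapping rows would be double-counted.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    down = 0.0
    for outage in outages:
        # Count only the part of the outage that falls inside the period.
        start = max(outage.start, period_start)
        end = min(outage.end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (total - down) / total


def days_since_last_outage(outages: List[Outage],
                           today: date) -> Optional[int]:
    """Whole days since the most recent outage ended, or None if none is recorded."""
    if not outages:
        return None
    last_end = max(outage.end for outage in outages)
    return (today - last_end.date()).days
```

As a usage example, a single outage lasting 7 hours 12 minutes in a 30-day month works out to 99.0% availability, since 7.2 hours is 1% of 720 hours. The remaining questions (why, what fixed it, what we learned, who owns follow-up) could be added as further text fields on `Outage`.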


There’s not a lot to discuss about it, I guess, but: I really appreciated this post. It’s a good definition-by-example of the concept of “simplest thing that could possibly work”. And, without dwelling on them, it illustrates by example many common Radar themes. Nice piece!

I’m a sole proprietor business guy, myself. This particular spreadsheet doesn’t do anything for my particular needs, but I like the idea of tools as content and hope that by saying so I encourage more.