Reliance on this service was highlighted by the vast number of services which suffered downtime or degraded service as a result. The root cause turned out to be human error followed by cascading system failures.

With a growing dependence on the cloud for computing and with no signs of demand for cloud resources abating, we really need to treat those resources like the on-premises data center that we relied on for so many years.

As the article “3 Steps to Ensure Cloud Stability in 2017” points out, “it’s critical to ensure the stability of your cloud ecosystem,” and that starts with monitoring. The article offers the following advice: “Ensure that you have access to reports which can give you actionable, predictive analytics around your cloud so that you can stay ahead of any issues. This goes a long way in helping your cloud be stable.”

Of course, I couldn’t agree more! Server Density even built an app to send notifications when cloud providers have outages.

But note that the outage only affected the US East region. Other regions were unaffected, yet the fact that so many services suffered outages indicates that they rely on a single region for their deployments. AWS runs many zones within each region; these are equivalent to individual data centers but still sit within one logical group and a small geographical area. Cross-region deployment is typically reserved for mitigating geographic events such as storms, but it should also be used to mitigate software and system failures. Good systems practice means code changes get rolled out gradually, and indeed AWS states that regions are entirely isolated and operated independently, so a faulty change in one region should not take down another.
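To make the idea concrete, here is a minimal sketch of region failover: try each region in order of preference and use the first one whose health check passes. The function name `pick_region` and the `is_healthy` callback are hypothetical names of my own, and the health check is deliberately left to the caller since how you probe a region depends on your service. The region identifiers are only examples.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose health check passes, or None.

    `regions` is ordered by preference. `is_healthy` is a caller-supplied
    (hypothetical) probe, e.g. an HTTP check against your own service.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None


# Illustrative usage with a fake health check that simulates
# the primary region being down.
preferred = ["us-east-1", "us-west-2"]
active = pick_region(preferred, lambda region: region != "us-east-1")
```

This only covers choosing where to send traffic. It assumes your data and application are already deployed in the secondary region, which is the expensive part.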

S3 itself has a feature which automates cross-region replication. Of course, this doubles your bill because you have data in two regions, but it does allow you to switch over in the event an entire region is lost. Whether that cost is worth it depends on the type of service you’re running. Expecting an hour a year of downtime is the starting point for the cost-benefit calculation, but this particular outage took the service offline for more than that.
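The cost-benefit calculation can be sketched in a few lines. This is a deliberately simplified model: it assumes a second region would eliminate the downtime entirely, and every figure in the usage example is made up for illustration, so plug in your own numbers. The function names are my own.

```python
def expected_downtime_cost(hours_per_year: float, cost_per_hour: float) -> float:
    """Expected annual cost of downtime when running in a single region."""
    return hours_per_year * cost_per_hour


def replication_pays_off(
    hours_per_year: float,
    cost_per_hour: float,
    extra_annual_cost: float,
) -> bool:
    """True if the expected downtime cost exceeds the cost of a second region.

    Simplifying assumption: the second region removes the downtime completely.
    """
    return expected_downtime_cost(hours_per_year, cost_per_hour) > extra_annual_cost


# Hypothetical figures: one hour of downtime a year, and a second
# region that adds 1,200 a year to the bill.
high_stakes = replication_pays_off(1.0, 5000.0, 1200.0)  # downtime costs 5,000/hour
low_stakes = replication_pays_off(1.0, 500.0, 1200.0)    # downtime costs 500/hour
```

With these invented numbers the second region is worth it for the first service and not for the second. A longer outage than the one hour you planned for shifts the balance quickly.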

Outages will always throw up something interesting, such as the AWS Status Dashboard itself being hosted on S3. The key is knowing when something is going wrong, having a plan and closing the loop with a postmortem.