Lessons Learned from Public Cloud Outages

A recent, well-publicized public cloud outage has many businesses reviewing their Hybrid IT strategies. That’s a good thing. If anything can be learned from this outage, it’s that even a 99.999% uptime guarantee isn’t foolproof. It isn’t a 100% guarantee. Outages happen.

However, one concern we have is that business leaders will draw the wrong conclusions from this event, such as that public clouds are unreliable and that mission-critical apps should not run in the public cloud. To avoid that, we’ve gathered a few lessons to be learned (or perhaps re-learned) from public cloud outages.

The Right Lessons

#1 Know which apps and datasets are mission critical. A lot of companies make the mistake of treating everything as mission-critical, leading them to overinvest in infrastructure services. At the end of the day, some systems are more vital than others. For example, if you’re an online retailer, website availability is crucial to your business, but you might be able to do without your marketing automation systems for a few hours. You will want to ensure your most mission-critical applications and datasets are on the most reliable platform, whether that’s on-premises or in the cloud.

#2 Determine RTOs and RPOs by application and dataset. Your Recovery Point Objective (RPO) is how much data you can afford to lose without taking a significant hit to your business. Your Recovery Time Objective (RTO) measures how long you can afford to be down. As we stated in lesson #1, RTOs and RPOs will be different for mission-critical applications than for less vital systems, and even mission-critical functions may not all share the same requirements. Setting these metrics will help you further refine your strategy and choose the best deployment options.
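As a loose illustration of this kind of tiering (the application names and targets below are hypothetical, not from any real deployment), RTO and RPO can be captured as simple per-application data and sorted to show where recovery investment should go first:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class RecoveryTargets:
    """Recovery objectives for one application or dataset."""
    app: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of data loss

# Hypothetical tiering: the storefront is mission-critical,
# marketing automation can wait a few hours.
targets = [
    RecoveryTargets("storefront", rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    RecoveryTargets("marketing-automation", rto=timedelta(hours=8), rpo=timedelta(hours=24)),
]

# Tightest RTO first: these apps justify the most redundancy spend.
for t in sorted(targets, key=lambda t: t.rto):
    print(f"{t.app}: RTO={t.rto}, RPO={t.rpo}")
```

Even a simple inventory like this forces the honest conversation about which systems truly need minutes-level recovery and which can tolerate hours.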

#3 Don’t put all your most important eggs in one basket. You don’t need to invest in redundant systems for all your applications and data sets, but having a safety net for your most critical ones is a wise move.

#4 Don’t confuse uptime SLAs with durability metrics. In public cloud discussions, so much focus is put on uptime that many people forget about durability. Uptime measures whether your servers are available; durability measures how well your data is protected against loss. For some applications, e.g., financial systems, durability is just as important as availability. When choosing a cloud vendor, look at both metrics.

#5 Follow the vendor’s architecture best practices. Public clouds are often treated as DIY implementations, but public cloud vendors have a vested interest in minimizing damage to their customers, even if their servers are only down a couple of hours a year. Their published architecture best practices have helped many customers avoid downtime and data loss. If you don’t have staff with the skills or experience to implement the recommendations, we highly recommend bringing in a managed service provider who can supply the necessary expertise.

Honesty is the Best Policy

The biggest lesson of all may be that you need to be honest with yourself. We’ve mentioned a couple of times that you should deploy your most mission-critical apps wherever you can get the highest reliability (and durability). It can be hard for many IT leaders to admit, but that may not be your on-premises systems.

Uptime is easy to calculate, but it’s important to carry the percentage out to at least a couple of decimal places and to measure server uptime over an extended length of time. For example, if your servers were down for two days in February but not at all in March, quoting March’s 100% uptime doesn’t give an honest picture of your datacenter’s reliability.

The reason you need to carry the number out to a couple of decimal places is that those fractions of a percent make a big difference. For example, 99% uptime may sound good, but it means a company’s servers could be down as much as 3.65 days, or 87 hours and 36 minutes, a year. An uptime of 99.99%, on the other hand, means the servers are down only about 53 minutes a year. That’s a pretty big difference.
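The arithmetic above is easy to verify. A minimal sketch, assuming a non-leap 365-day year, converts an uptime percentage into allowed downtime and also shows why the measurement window matters:

```python
def annual_downtime_minutes(uptime_percent: float) -> float:
    """Allowed downtime per year, in minutes, at a given uptime percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - uptime_percent / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    hours, minutes = divmod(annual_downtime_minutes(pct), 60)
    # 99.0% works out to 87 h 36 min of downtime per year
    print(f"{pct}% uptime -> {hours:.0f} h {minutes:.0f} min down per year")

# Window choice matters: two days down in February but none in March
# is roughly 96.6% uptime over those 59 days, not March's 100%.
feb_mar_minutes = (28 + 31) * 24 * 60
measured_uptime = 100 * (1 - (2 * 24 * 60) / feb_mar_minutes)
print(f"Feb-Mar uptime: {measured_uptime:.1f}%")
```

Running the loop makes the gap concrete: 99% allows over three and a half days of downtime a year, while 99.99% allows under an hour.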

Need More?

TierPoint helps clients get more from their cloud investments, with stronger protection and faster recovery. We can help you:

Reduce downtime. TierPoint has fully redundant hosting and storage with common storage protocols to help reduce downtime.

Minimize risk. Our DRaaS solution is protocol agnostic and natively supports replication and failover across regions, avoiding the risks of keeping everything in a single storage location.

Provide a safety net. We provide local cloud hosting with access to the major hyperscale cloud providers, including AWS and Microsoft Azure, providing your data and apps a safety net.

Bob Hicks, with more than 30 years of experience in the industry, oversees TierPoint’s growing collection of enterprise-class data centers. He is responsible for local operations and data centers in Arkansas, North Carolina, Pennsylvania, and Tennessee. He joined TierPoint as Senior Vice President and General Manager, Pennsylvania, through the company’s acquisition of Xand in 2014. Bob is committed to helping clients solve unique and evolving technology challenges.
