Designing Services For Downtime

We have an array of wonderful tools and platforms available to help us automatically monitor, scale, heal and fail over our servers. Designing our products for Five Nines of availability (99.999%, or just over 5 minutes of downtime per year) is absolutely achievable in 2017. However, this may not be necessary for your business. Here’s why:

High availability is expensive. Maintaining multiple servers across multiple data centers costs time and money, and it’s difficult to justify this “just-in-case” business expense.

Redundancy requires complexity. If your team doesn’t have the necessary expertise to manage this complexity, continuing to add Kubernetes clusters and redundancy could even make things worse.

If you’re relying on trusted third-party providers, it often doesn’t make sense to build in redundancy. For example, when Amazon S3 suffered a 5-hour outage in early 2017, very few companies had an automatic failover built into their architecture. This is because the cost of maintaining one was hard to justify given how unlikely S3 was to go down. While you could back up your third-party DNS, CDN and cloud server… is it really worth it?
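To put the availability figures above in concrete terms, here is a minimal sketch of the arithmetic. It assumes an average year of 365.25 days, and the function names are made up for this illustration.

```python
# Sketch: convert between an availability target and yearly downtime.
# Assumes an average year of 365.25 days (leap years included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability_percent(downtime_minutes: float) -> float:
    """Availability over one year, given the total downtime in minutes."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows {minutes:.1f} minutes of downtime per year")

    # A single 5-hour outage, with no other downtime that year.
    print(f"{achieved_availability_percent(5 * 60):.3f}% availability")
```

Under these assumptions, Five Nines leaves a budget of about 5.26 minutes a year, while a single 5-hour outage on its own puts that year at roughly 99.94%.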

For most of us, downtime is inevitable. While you can’t control everything, you can manage the experience customers have when your services do start to go down.

In this article, we look at how teams can design their downtime to be less disruptive to customers, and reduce the cost of this downtime to the business.

Prioritize Your Effort

Depending on your business type, some parts of your service will be more critical to maintain than others. What are the most important actions customers need to perform? Protect these first.

Prioritize redundancies by deciding how critical each service is to the running of your product. Is your ecommerce site primarily a catalogue? If so, your SQL replica won’t need to be promoted to primary when things go wrong, as users will still be able to browse your products from the replica. Is search critical to your site? If not, you can take the money set aside for your redundant Elasticsearch instance and invest it elsewhere.
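One way to make this prioritization explicit is to write down each customer action, whether it is critical, and which services it needs. The sketch below does that for a hypothetical catalogue-style shop. The action names, service names and the critical flags are all assumptions for illustration, not a recommendation for any particular stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerAction:
    name: str
    critical: bool         # must this keep working during an outage?
    depends_on: frozenset  # names of the services this action needs


# Hypothetical catalogue-style ecommerce site; all names are illustrative.
ACTIONS = [
    CustomerAction("browse products", True, frozenset({"web", "sql-reads"})),
    CustomerAction("place order", False, frozenset({"web", "sql-writes", "payments"})),
    CustomerAction("search", False, frozenset({"web", "search"})),
]


def services_to_protect(actions):
    """Services that at least one critical action depends on."""
    protected = set()
    for action in actions:
        if action.critical:
            protected |= action.depends_on
    return protected


def services_to_deprioritize(actions):
    """Services used only by non-critical actions."""
    all_services = set().union(*(a.depends_on for a in actions))
    return all_services - services_to_protect(actions)


if __name__ == "__main__":
    print("Protect first:", sorted(services_to_protect(ACTIONS)))
    print("Lower priority:", sorted(services_to_deprioritize(ACTIONS)))
```

For a shop that lives on orders rather than browsing, flipping the critical flag on "place order" would pull the write path and payments into the protected set.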

With unlimited time and resources, we could have a bulletproof service that never goes down, but that’s not the case for most of us. (If it is the case for you, and you’re hiring, please give me a call…) Instead, we need to prioritize our time to make sure the most critical services stay up.

Degrading Gracefully

If one tiny error brings your website crashing to its knees and flashing 500 status codes, you’re going to upset a lot of customers. Your service needs to handle brief moments of instability or unavailability gracefully. For example, Google offers an HTML-only version of Gmail when loading is slow. This means that customers can still get the information they require, albeit without the full experience.

Offline First is a popular trend at the moment, especially for mobile. It involves caching data and assets on the client so that the service still works when the server is unavailable. Its primary focus is mobile: 4G is widely available, but the connection deteriorates when customers are on the move, such as on a train. But even if your app is designed for desktop browsing, where the connection is generally stable, Offline First will allow your app to continue functioning if your server goes down. Obviously, a customer has to have visited your site or downloaded your app beforehand, so this technique only helps returning customers, not new ones.

For the web, the service worker is a great way to start implementing offline-first paradigms. It’s a programmable network proxy that lives in the browser and lets you intercept network requests as they happen. The service worker also has access to a cache in the browser (the Cache API), meaning it can store the results of network requests for future use. If a customer makes an HTTP request and the server returns a 500 error, the service worker can intercept the error and instead return the last successful response to that request.
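The fallback logic described above can be sketched independently of the browser. The example below is not a service worker (a real one would be written in JavaScript against the browser’s own APIs). It is a plain Python illustration of the same “serve the last good response on failure” idea. The class name and the shape of the fetch function are assumptions made for this sketch.

```python
class StaleOnErrorFetcher:
    """Wraps a fetch function and replays the last good response on failure."""

    def __init__(self, fetch):
        # Assumption: `fetch(url)` returns a (status_code, body) tuple and
        # may raise ConnectionError when the server cannot be reached.
        self._fetch = fetch
        self._last_good = {}  # url -> body of the most recent 2xx response

    def get(self, url):
        """Return (body, is_stale) for the given URL."""
        try:
            status, body = self._fetch(url)
        except ConnectionError:
            status, body = None, None  # treat "unreachable" like a server error

        server_failed = status is None or status >= 500
        if not server_failed:
            if 200 <= status < 300:
                self._last_good[url] = body  # remember successful responses
            return body, False

        if url in self._last_good:
            # Server is failing: fall back to the last successful response.
            return self._last_good[url], True

        raise RuntimeError(f"{url} failed and no cached copy exists")


if __name__ == "__main__":
    # Fake server: succeeds once, then starts returning 500s.
    responses = iter([(200, "catalogue v1"), (500, "internal error")])
    fetcher = StaleOnErrorFetcher(lambda url: next(responses))
    print(fetcher.get("/products"))  # fresh response
    print(fetcher.get("/products"))  # stale copy served during the failure
```

Returning the is_stale flag alongside the body is a deliberate choice here: it lets the UI tell customers they are looking at saved data rather than silently showing old content.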

Single page applications (web pages that load a single HTML page and update dynamically as the user interacts with the app) can use other techniques for degrading gracefully. Since network requests are primarily made using Ajax, you can control the user experience after receiving a 500 response. For example, if your search server is down, you can display an inline message with more helpful information, rather than serving up a generic error page. If you’re storing data in the web browser’s local storage, you could even offload search to the client to provide a degraded version to the end user. Local storage can also hold user-provided data, allowing a form to be resubmitted automatically when your server is back online. Clever use of all of these technologies, along with good UX to set customer expectations, means you can still provide a stable service when your servers are burning!
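The “store the form and resubmit later” idea can be sketched as a small queue. This is a Python illustration of the logic only, with a dict-like object standing in for the browser’s local storage. The class name, the storage key and the send callback are assumptions made for the example.

```python
import json


class PendingSubmissions:
    """Holds form submissions that could not be delivered yet."""

    def __init__(self, storage):
        # Assumption: `storage` is any dict-like object holding strings.
        # It stands in for the browser's local storage in this sketch.
        self._storage = storage
        self._key = "pending_submissions"

    def _load(self):
        return json.loads(self._storage.get(self._key, "[]"))

    def _save(self, items):
        self._storage[self._key] = json.dumps(items)

    def submit(self, payload, send):
        """Try to send now; keep the payload for later if the server is down.

        Assumption: `send(payload)` raises ConnectionError on failure.
        """
        try:
            send(payload)
            return True
        except ConnectionError:
            self._save(self._load() + [payload])
            return False

    def flush(self, send):
        """Resend queued payloads in order, stopping at the first failure.

        Returns the number of payloads delivered.
        """
        items = self._load()
        sent = 0
        for payload in items:
            try:
                send(payload)
            except ConnectionError:
                break
            sent += 1
        self._save(items[sent:])
        return sent
```

In practice you would call flush whenever you have reason to believe the server is back, for example after the next successful request, and tell the customer that their submission has been saved rather than lost.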

Understand How Your Services Depend On Each Other

It’s important to keep track of your architecture: how different services depend on each other, and what happens to the end user if one of them is unavailable. For example, a common SaaS stack would include a key-value store like Redis. How does your application react if Redis runs out of memory, is overloaded or goes down? If you’re only caching data in Redis, it should be possible for your service to continue running when it becomes unavailable. If you’re storing session data in Redis, what happens if a customer tries to log in? Can you fall back to a different mechanism, like file storage? Does the affected customer see a sensible error message or just a 500 page? Keep these scenarios in mind when adding services to your stack, and you’ll be able to keep your customers satisfied when the worst happens.
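For the caching case, one approach is to treat a cache outage as a cache miss, so the request falls through to the source of truth instead of failing. The sketch below uses a hypothetical cache interface and exception rather than a real Redis client. With a real client you would catch that client’s own connection error instead.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache is unreachable."""


class InMemoryCache:
    """Stand-in for a healthy cache, used only to exercise the sketch."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class BrokenCache:
    """Stand-in for a cache that is down or overloaded."""

    def get(self, key):
        raise CacheUnavailable()

    def set(self, key, value):
        raise CacheUnavailable()


def get_with_cache(key, cache, load_from_source):
    """Read through a cache, but treat a cache outage as a cache miss."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: skip it entirely and go to the source of truth.
        return load_from_source(key)

    value = load_from_source(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # failing to populate the cache should not fail the request
    return value
```

The trade-off is load: while the cache is down, every request reaches the underlying data store, so this only helps if that store can cope with the extra traffic. Session data is a different case, because there the key-value store is the source of truth and a fallback has to be designed explicitly.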

Keeping Customers Informed

While you’re working on bringing service back up to normal, your customers are waiting on you. They don’t know what’s happening behind the curtain, and they don’t know if the downtime will last a few seconds or many hours longer.

Keep in mind:

Don’t make promises you can’t keep. Exact time estimates are the worst offenders. If you suggest the service will be back up in 10 minutes, and it actually takes 15, you’ll have a lot of frustrated customers knocking on your door!

Don’t throw your vendors under the bus. Even if it is the fault of your DNS provider, you chose them in the first place, and you didn’t set up a backup!

Don’t “apologize for the inconvenience”. It doesn’t mean anything to customers any more. Instead, offer a genuine, specific apology for the trouble they are seeing.

Status pages that function independently of your architecture, like Statuspage.io or Sorry App, are a good investment. They’ll provide an easy way for customers to keep apprised of the situation, as long as you keep them updated. In-app notifications can also help explain why customers are seeing unusual behavior, but are more difficult to set up on the fly. Working with your marketing team to update social media feeds or send emails to key customers might be an easier way to spread the word about degraded performance.

When service resumes, it’s often helpful to provide customers with a post-mortem. This can include a brief explanation of what went wrong, how it was fixed and what you’re doing to ensure it doesn’t happen again. Customers trust transparent companies, so being open after an outage can win back some of the trust you lost.

Working With Your Tech Support Team

Most of the time, your engineering team won’t be dealing directly with customers. Instead, you’ll have an army of customer support agents between you and the angry mobs trying to access the site. These people are your friends – they transfer information from your customers to you, and vice versa. Relying on your customer support team to communicate with customers means you can focus on getting everything back online.

To work effectively with your frontline teams, you must have processes set up before the storm hits. Know who your Single Point of Contact (SPOC) is and how best to contact them. Understand what information they need to provide good service to customers throughout the outage. Let them know what information from customers is helpful. Do you want error codes? Steps to reproduce the problem? Your customer support team can get all of these for you.

Working with your tech support team will help ease the effect on customers, and hopefully help you identify the root cause even faster.

Planning For The Worst

Balancing the need for high availability with available resources inevitably results in some tradeoffs. But the good news is that you can always help direct how your customers experience service degradation.

With proper planning and communication, your customers will stick around, even when you go down.