It's not me it's you

How to handle failures

Disaster

A large part of our codebase revolves around ensuring we are sending around 3 to 10 thousand messages to Popular Online Platform(tm). We've been updating the infrastructure around this process over the last few weeks, attempting to optimize the process as well as introduce new features.

At 8PM on a weekday, after everyone had left for the day, alarms went off for our most critical system.

We'll come back to this in a moment though.

Culture

Our engineering team strives toward practices like unit tests and CI/CD for peace of mind. Some projects have better unit testing than others, but we tend to verify, and fix, bugs via unit testing. Most of our projects have full CI/CD, but our business requirements change constantly, so some mitigating factors govern how and when we deploy, and we can't deploy that way with all critical systems.

We do have one rule: Be honest about your failures with your team and your clients.

In our case, our clients are all in house, either the business team or the chain up to the CEO. We strive to be open and honest about failures and roadblocks. We push to consider alternate paths when hitting the roadblocks. We perform root cause analysis on our failures so we can understand and communicate to the clients what happened, why the failure happened, how we mitigate, and when we will have a preventative measure in place for the future.

Why do this? This approach seems terrible to some at first glance, but has some interesting side effects. First and foremost, you gain trust with your clients. Second, it's the "right thing to do" morally. Third, to be honest with your client, you have to be honest with yourself.

Reaction

Let's take the Microsoft Azure Status History Blog. Forget for a moment that they have a pretty chart showing the current status; they have a blog that goes into frank root cause explanation. This blog is pushed to Twitter as well as an RSS feed that our team subscribes to in Slack. The explanation often has to do with hardware or equipment that is not in Microsoft's control. When it is a Microsoft problem, they are open and honest about it.

8/20/2018: The incident was caused by a bug in the Storage Resource Provider (SRP). The bug was triggered in the create account path when we were evaluating the scale unit to create the account in. The bug only manifests under specific configuration settings which was only present in West Europe. Due to the bug, customers would have experienced account creation failures. The bug was not found during testing, and the service had been deployed to other production regions without issue before encountering the incident in West Europe.

As a user (or client) of Azure, I have confidence that when a system breaks, I will know why, I will know if there is anything I can do about it, and I will get regular updates as a fix is put in place.

Back to the Disaster

On the other hand, when Popular Online Platform(tm) went down, the issue was not even acknowledged for over 12 hours. The platform's status board remained green, indicating no problems. Tickets that were opened with the platform were immediately closed by the engineering team with the comment that they had "experienced an unexpected failure."

Fortunately, I was able to reach out to a sister company who also works with Popular Online Platform(tm) to verify that it wasn't just us. My explanations to my clients, as information became available, went something like this (some information {redacted} to protect the innocent/guilty/secret sauce):

Summary:
At approximately 8PM on {Date}, Popular Online Platform(tm) stopped processing our messages, as confirmed by checking their API endpoint that tracks submissions and verified with {Sister Company}. Subsequent communication from Popular Online Platform(tm) suggested that the failure on their end was related to {reasons here}.

Impact:
No {type} messages were processed by Popular Online Platform(tm) between 8PM {date} and the following day around 11AM. During this time, the ability to manually process messages on the platform's website portal was also disabled, and the service status for the platform remained green, indicating no problems. Between 11AM and 1PM the day following the initial incident, messages were received at an extremely low rate, about 1000/minute. Therefore all information displayed on Popular Online Platform(tm) was stale for approximately 15 hours. As the platform came back online, it took several hours to process all the messages queued on the platform. Anecdotally, the fewer messages being processed, the faster the system became.

Mitigation:
The engineering team disabled the mechanism that sends the batched messages to the platform because 60 messages were showing as in process at the time (usually there is at most 1 message in process). The team also changed the mechanism by which messages are sent to the platform so that the origin system holds messages back, rather than releasing them, depending on the state of the platform.

Next Steps:

1. Add a process to poll the platform API and cache the state, polling as often as their platform allows.

2. Implement automation to poll the state (as described in step 1). Use the resulting state to enable mitigation paths in software that decrease cloud computing usage, reduce the number of state messages that are "in flight", and keep the number of messages being processed by the platform to a minimum at any given time.

3. Invest in Twilio or some other SMS provider to send critical alerts so the team, stakeholders, and engineers understand what is happening, why, and what mitigation steps can be/should be/have been automagically taken.
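
For the curious, here is a rough sketch of what the first two next steps might look like. It is not our production code, and every name in it is made up for illustration: `fetch_in_process` stands in for whatever call returns the platform's in-process message count, `min_interval` stands in for whatever polling rate the platform allows, and the default threshold of 1 simply mirrors the "usually at most 1 message in process" observation above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PlatformState:
    in_process: int    # messages the platform reports as still in process
    fetched_at: float  # clock reading taken when this state was fetched


class CachedPlatformState:
    """Caches the platform state and re-polls at most once per min_interval seconds."""

    def __init__(
        self,
        fetch_in_process: Callable[[], int],  # hypothetical: wraps the platform API call
        min_interval: float,                  # assumed polling limit, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_in_process
        self._min_interval = min_interval
        self._clock = clock
        self._state: Optional[PlatformState] = None

    def get(self) -> PlatformState:
        now = self._clock()
        stale = (
            self._state is None
            or now - self._state.fetched_at >= self._min_interval
        )
        if stale:
            # Only hit the platform when the cached value is missing or too old.
            self._state = PlatformState(self._fetch(), now)
        return self._state


def should_release(state: PlatformState, max_in_process: int = 1) -> bool:
    """Return True when it looks safe to release more messages to the platform."""
    return state.in_process <= max_in_process
```

The sender would call `should_release(cache.get())` before releasing each batch and hold the batch at the origin when it returns `False`. A real version would also need to decide what to do when the poll itself fails, which this sketch leaves out.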