Incident Report: Amsterdam Data Center DNS Failure (April 14, 2020)

Anthony Eden, 14 April 2020

What happened?

On Tuesday, April 14th, 2020, we saw a significant increase in ALIAS resolution failures in our Amsterdam (AMS) data center. The incident started at 07:00 UTC with an increase in SERVFAIL responses for certain requests in the AMS region. This correlated with an increase in ingress traffic, although the volume of traffic was not directly responsible for the incident. At the same time, customers in Europe began reporting resolution failures with their ALIAS records.

Why did it happen?

Multiple contributing factors were identified:

Updated name server software was deployed to support ECS (EDNS Client Subnet) functionality in our ALIAS resolution. This update introduced a bug that caused some requests we had previously answered correctly to be returned as SERVFAIL responses.

Traffic increased from a particular source to which we responded with SERVFAIL. This source appears to have kept increasing the number of requests it sent us in response to the SERVFAIL messages.

ALIAS resolution was also impacted for a small number of records due to issues with the ECS support enabled in our resolver software.

There was a loss of IPv6 traffic into our AMS data center at the same time the incident started. It is unclear if this was a contributing factor.

How did we respond and recover?

Team members in Europe opened an incident and began investigating the issue after receiving reports from customers of ALIAS resolution failures. We ultimately identified that the issue was at least partially due to the new software version of the name server. We rolled back to the previous version in response.

We also reverted the resolver configuration changes made the previous day, removing ECS support, to mitigate the impact on a small subset of ALIAS records.

How might we prevent similar issues from occurring again?

We are changing our deployment procedures for name servers to perform greater analysis of error logs over a longer period of time in our canary environment.

We are updating our testing procedures for name server changes to include tests with a variety of resolvers. This should increase the chance of catching issues that arise only with specific resolver configurations when resolving against our authoritative name servers.

We are adding new metrics and monitors to identify abnormal response patterns, so we can identify these issues before they escalate and impact customers.

We are improving our name server error logging to provide more pertinent information in our log aggregator.
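
To illustrate the kind of monitor described above for abnormal response patterns, here is a minimal sketch in Python that tracks the share of SERVFAIL responses over a sliding window of recent responses. It is not our production monitoring: the class name ServfailRateMonitor, its parameters, and the threshold values are all hypothetical and chosen only for illustration.

```python
from collections import deque


class ServfailRateMonitor:
    """Hypothetical monitor: share of SERVFAIL among the last `window` responses."""

    def __init__(self, window=1000, threshold=0.05, min_samples=100):
        # All three values are illustrative assumptions, not tuned settings.
        self.window = window
        self.threshold = threshold
        self.min_samples = min_samples
        self._recent = deque(maxlen=window)  # True where the response was SERVFAIL
        self._servfail_count = 0

    def record(self, rcode):
        """Record the response code (e.g. "NOERROR", "SERVFAIL") of one response."""
        is_servfail = rcode == "SERVFAIL"
        # When the window is full, the oldest entry is evicted by the append
        # below, so remove it from the running count first.
        if len(self._recent) == self.window and self._recent[0]:
            self._servfail_count -= 1
        self._recent.append(is_servfail)
        if is_servfail:
            self._servfail_count += 1

    def ratio(self):
        """Fraction of SERVFAIL responses in the current window."""
        if not self._recent:
            return 0.0
        return self._servfail_count / len(self._recent)

    def is_abnormal(self):
        """True once enough samples exist and the ratio exceeds the threshold."""
        return len(self._recent) >= self.min_samples and self.ratio() > self.threshold
```

A monitor like this would be fed one response code per answered query, for example from the name server's logs, and would raise an alert when is_abnormal() returns True. Keeping a separate instance per data center would let a regional spike, such as the one in AMS, stand out instead of being averaged away globally.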

Our goal is to provide you with solid authoritative ALIAS resolution that you can trust to never fail. While we failed to live up to that goal during this incident, we are working with the knowledge gained to improve our system and processes to avoid incidents like this in the future.

Thank you for your trust and your business – all of us at DNSimple appreciate it.