How the Dyn outage affected Cloudflare

Last Friday the popular DNS service Dyn suffered three waves of DDoS attacks that affected users first on the East Coast of the US, and later users worldwide. Popular websites, some of which are also Cloudflare customers, were inaccessible. Although Cloudflare was not attacked, joint Dyn/Cloudflare customers were affected.

Almost as soon as Dyn came under attack we noticed a sudden jump in DNS errors on our edge machines and alerted our SRE and support teams that Dyn was in trouble. Support was ready to help joint customers and we began looking in detail at the effect the Dyn outage was having on our systems.

An immediate concern internally was that since our DNS servers were unable to reach Dyn they would be consuming resources waiting on timeouts and retrying. The first question I asked the DNS team was: “Are we seeing increased DNS response latency?” rapidly followed by “If this gets worse are we likely to?”. Happily, the response to both those questions (after the team analyzed the situation) was no.

However, that didn’t mean we had nothing to do. Operating a large scale system like Cloudflare that deals with the continuously changing nature of the Internet means that there’s always something to learn.

Back in July 2015 Dyn had an outage that also affected some of our customers and we changed our handling of so-called infrastructure DNS records in response to prevent a similar problem, from any provider, affecting Cloudflare.

Based on what we learned last Friday we are making some changes to our internal DNS infrastructure so that it performs better when a major provider is having problems or an outage (whether caused by DDoS or not). To understand those changes it’s helpful to take a look at the role of DNS and what we saw on Friday.

A little bit about DNS

The Domain Name System (DNS) provides an address book service for the Internet. It is responsible for converting the friendly, human-readable domain names we type into our web browsers to IP addresses for websites. Let’s walk through the life of an example web request to see where DNS plays a role.

We can start by entering a web address into our browser, https://www.cloudflare.com/. The browser translates this name into an IP address so it can contact the server that’s hosting the page, it will do this using DNS. We can make these DNS queries ourselves using the dig command line tool to see what values are returned.

The DNS data model is split into two core concepts, names and records. The name here is www.cloudflare.com and the record type we have queried is A, which is used to store IPv4 addresses. There are other types of records for storing other types of data, e.g AAAA records for IPv6 addresses. We can see from the answer above that there are two IPv4 addresses for www.cloudflare.com; the browser picks one of these to use.

Records in the DNS also have an associated TTL which defines how long the data should be cached for, these records have a TTL of 10 seconds. This means the browser can store this information and skip making further DNS queries for the domain for the next 10 seconds.

For Cloudflare customers, the answer will contain our Anycast IPs instead of the origin ones (the IP addresses of the web hosting provider). The browser will then send requests to us, and we will serve content from our cache or proxy the request to the origin web server.

There are two common ways of configuring origins on Cloudflare. The first is to specify A and AAAA records, which explicitly provides us with the IP addresses of the origin. In this situation, our network knows ahead of time where it can contact the origin, so no further DNS resolution is required. For example, if www.example.com uses Cloudflare and has specified 2001:db8:5ca1:ab1e as the IP address of the origin server in the Cloudflare control panel, we can connect directly to the origin server to retrieve resources.

The other is to use a CNAME, which is a pointer to another DNS name.

When our servers receive a request with the origin configured using a CNAME, we have to perform an external DNS lookup to resolve the target of the CNAME to IP addresses. This information is cached, based on the TTL defined on the CNAME record. In this case, our ability to serve content (that is not in the cache) entirely depends on an external DNS lookup to resolve the CNAME to IPs.

For example, suppose www.example.com had set up a CNAME in the Cloudflare control panel pointing to server11.myhostingprovider.biz it would be necessary to look up the IP address of server11.myhostingprovider.biz before contacting the origin server.

In many cases the target of a CNAME is handled by a third party DNS provider. If the third party provider is unable to answer our query, we are unable to resolve the domain to an origin IP and cannot serve the request.

What Friday’s Dyn outage looked like

As Dyn says in their discussion of the DDoS attack there were three distinct waves. For Cloudflare that manifested itself in two periods during which our internal DNS query error rate spiked.

The first attack started at 1110 UTC and mostly affected DNS resolution on the US East Coast. This world map from our monitoring systems shows the Cloudflare data centers where the DNS error rate was spiking because of the Dyn outage.

The green dots on the map are Cloudflare data centers that were unaffected by the Dyn DDoS. The largest effect was on the US East Coast, although the attack had a knock-on effect in Singapore and some parts of Europe. This is most likely because the architecture of the Internet does not directly line up with geography. Locations that are physically disparate can sometimes appear ‘close’ on the Internet because of undersea cables or decisions on how to route traffic.

The chart shows the DNS error rate in each Cloudflare data center affected by the outage. It’s possible to see the attack ramp up rapidly and then remained sustained until Dyn was able to tackle it.

Later in the day the attackers returned with greater force and had a worldwide impact. This map shows the Cloudflare data centers seeing errors because Dyn was inaccessible. As you can see almost the entire planet was affected (with the exception of our China locations; we’ll return to why below).

Once again it’s possible to see the attack ramping up at 1550 UTC and continuing for some time. Dyn reports that the attack was fully mitigated at 1700 UTC.

Media and Dyn reported a third wave of attacks later on Friday, but Dyn mitigated that wave immediately and so fast that it did not have any affect on Cloudflare protected websites and applications and did not show up in our systems.

Why China was unaffected

During the most intense period of attack on Dyn our locations in China were almost completely unaffected. That’s because we handle DNS a little differently inside China.

To cope with sometimes fluctuating network conditions inside China our data centers are configured to keep DNS records for origin servers cached in our servers for longer than the rest of the world. This caching meant that even though Dyn was down and couldn’t be reached from anywhere (including China) we still had cached DNS records for sites that used Dyn on our China servers. Thus we were able to reach origin servers and continue serving content. That shows up as green dots on the map above.

Unfortunately, there’s a downside to hanging on to DNS records for a long time: if one of our customers changes their origin’s DNS records we’ll keep using the old DNS records and IP addresses. That could lead to downtime, or poor service.

The ideal system would recheck DNS records frequently so that changes are reflected quickly but in the event that the upstream DNS provider was unavailable (because of an attack or other outage) it would be able to use the DNS records it has cached.

Doing so is known as ‘serve stale while revalidating’. Our upstream DNS resolvers will cache records checking frequently for changes. If the upstream DNS is unavailable we’ll continue to serve from cache until it’s possible to refresh the DNS records.

We are testing and rolling out that change now and expect this to greatly diminish the impact of events similar to the Dyn DDoS for all of our customers who use CNAME’d DNS records that rely on a third-party DNS provider.

Conclusion

The Internet is a shared space. Because companies, people, and institutions work together we have a global, connected network that allows us to work and play from almost anywhere. Cooperation means that we work together on standards and interoperability to keep the network running and evolving.

But the Internet is very complex and, as with many things, the devil is in the details and operating Internet infrastructure is a process of constant improvement. Although the Dyn DDoS felt scary to many people unfamiliar with how the Internet operates, such attacks result in a stronger network. Just as Cloudflare is making changes to its software and configuration, so are others across the net.

We are always looking to hire smart people interested in making DNS and the Internet better for everyone. Jobs can be found here.