THE TRUTH WASN'T OUT THERE —

GoDaddy outage was caused by router snafu, not DDoS attack

Web host debunks rumors that a debilitating DDoS attack took it down.

Monday's five-hour outage, which left GoDaddy unable to serve the millions of websites that depend on it for Web hosting, was not caused by an external attack, as an anonymous hoaxer claimed. An internal network error was at fault, company officials said Tuesday morning.

"It was not a 'hack' and it was not a denial of service attack (DDoS)," GoDaddy Interim CEO Scott Wagner wrote in an e-mail. "We have determined the service outage was due to a series of internal network events that corrupted router data tables."

Once engineers identified the problem, they were able to restore e-mail and connectivity to the company's and customers' websites, Wagner added. Customer data was never at risk of being exposed during the outage, which prevented people from accessing many or all of the websites that rely on GoDaddy.

The high-profile outage caused no shortage of frustration and disruption for the millions of individuals and businesses who rely on GoDaddy to provide e-mail and Web connectivity. Shortly after it began, an unidentified individual took to Twitter to claim the outage was the result of a distributed denial-of-service attack. The individual provided no evidence to support the claim, but it was widely reported as fact by news outlets and blogs alike.

GoDaddy's statement marks the second time this week that unverified and anonymous hacking claims have been debunked. On Monday, a Florida-based publisher said it was the source of a leak of one million iPhone and iPad universal device IDs. The revelation that the numbers were lifted from a compromised database contradicted claims made by people who identified themselves as members of the AntiSec hacking crew. These anonymous individuals said the UDIDs came from an FBI laptop they had hacked.

The incidents are a reminder that it pays to remain skeptical of unverified hacking claims.

81 Reader Comments

In short, it shouldn't have happened unless someone really shit the bed.

Yes. But admins shitting the bed always has been and always will be independent of how many separate networks are being run. Jim's original point was claiming that GoDaddy must not actually have two separate networks and that the existence of two DNS addresses must only be to fool the customer into believing there is non-existent redundancy. The problem GoDaddy experienced does not support those assertions. It supports the perhaps related, but still distinct, assertion that separate networks do not provide foolproof insulation against problems. Fools can screw up two parallel networks almost as easily as they can screw up a single one.

But that does not mean that GoDaddy's two networks provide no actual redundancy.

Any redundancy they built in should have been to multiple physical locations - and again, this problem would have been avoided had that been the case.

As you said, the nature of the outage runs contrary to both their redundancy claims and their claims about the nature of the problem. Router TCAM tables do not simply become "corrupt" - especially not across multiple routers at the same time. TCAM is "ternary content-addressable memory". This memory is populated by "services" running on the router that need to store - and parse - information relating to logic-based requests such as access-control lists, CEF, etc. It doesn't just become "corrupt" across multiple routers due to "internal network events". You couldn't even push a configuration that would result in a "corrupted router table", because that's not how it works.
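
For readers following along: a TCAM stores (value, mask) pairs and matches a lookup key against every entry in parallel, which is why it gets used for ACL and forwarding lookups. Here's a minimal software sketch of the matching logic - purely illustrative, since real TCAM does this in hardware in a single clock cycle, and the example entries are made up:

    def tcam_match(key, entries):
        """Return the first (value, mask) entry that matches the key.

        A key matches when it agrees with the entry's value on every
        bit the mask cares about; entry order encodes priority."""
        for value, mask in entries:
            if key & mask == value & mask:
                return (value, mask)
        return None

    # Example: an ACL-style table keyed on a 32-bit IPv4 address.
    entries = [
        (0xC0A80000, 0xFFFF0000),  # matches 192.168.0.0/16
        (0x00000000, 0x00000000),  # wildcard entry: matches everything
    ]
    print(tcam_match(0xC0A80102, entries))  # 192.168.1.2 -> the /16 entry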

Again, they're flat out lying. The most likely scenario is that they had no redundancy, they fucked something up majorly, and now they're trying to hide it. Either way, the "corrupt router tables" answer is a total crock.

The assertions run wild. I'm not trying to defend GoDaddy. I have no dog in this fight. They clearly screwed something up. I'm just not seeing that it is necessarily the case that what they screwed up was that they lacked a level of redundancy that they implied they have.

Physical location, now that I think about it, doesn't come to mind with other registrars either. I remember back in 2005 when Hurricane Katrina hit New Orleans: DirectNIC had to keep their datacenter up. I assume it was their only location, as who would risk their lives just to keep a server alive if they had a second working location?

In short, it shouldn't have happened unless someone really shit the bed.

Jim's original point was claiming that GoDaddy must not actually have two separate networks and that the existence of two DNS addresses must only be to fool the customer into believing there is non-existent redundancy.

Not exactly. My point was that the networks are not truly separate, not that there aren't two networks at all. We really have no way of knowing whether or not there are "two networks" in the most basic sense. The point is that, REGARDLESS, for their story to be true - or even true-ish - those networks were not actually separated in any real sense. Which yesterday's outage made pretty clear.

As you said, the nature of the outage runs contrary to both their redundancy claims and their claims about the nature of the problem. Router TCAM tables do not simply become "corrupt" - especially not across multiple routers at the same time. TCAM is "ternary content-addressable memory". This memory is populated by "services" running on the router that need to store - and parse - information relating to logic-based requests such as access-control lists, CEF, etc. It doesn't just become "corrupt" across multiple routers due to "internal network events". You couldn't even push a configuration that would result in a "corrupted router table", because that's not how it works.

I also meant to point out that the claim of "corrupt router tables" is not limited to a scenario where an in-place table somehow loses integrity. All those "services" you talk about get their data from somewhere. If they are being fed bad info from another part of the network, then the routers are all operating with bad routing info. Perhaps you might not call that data technically "corrupt" (since that tends to imply data which has become garbled as opposed to just being inherently wrong). But then we're just parsing the word "corrupt", and that's not very interesting.

(Obviously, setting up a network so that the routers are getting bad routing data is not easy to accomplish, or it would happen all the time. It probably requires a real screw-up somewhere. But that still doesn't necessarily support the assertions that are being thrown around about GoDaddy's lack of redundancy. It probably only supports the assertion that GoDaddy really screwed something up internally, which they are not denying.)

"We have determined the service outage was due to a series of internal network events that corrupted router data tables."

If they're claiming that their DNS outage was caused by corrupted router TCAM, that's a little wonky, and here's why:

1: DNS runs from servers, not routers.
2: If they somehow had set up router QoS rules for DNS (irresponsible at best)...

Why are you making the leap to QoS? Routers have... well... routes in their data tables, as well. There are routers in front of servers (DNS or no), pretty much period, and if those routers don't route, packets don't get to the servers behind them.
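
Point 1 is easy to see concretely: a DNS query is just a UDP packet aimed at port 53 on a server, and every router along the way merely forwards it. A minimal hand-rolled sketch (no retries or error handling; 8.8.8.8 is just a convenient public resolver, nothing GoDaddy-specific) - if the routers in front of the server don't route, this packet simply never arrives:

    import socket
    import struct

    def dns_query_a(name, server="8.8.8.8"):
        """Send a bare-bones DNS A query over UDP and return the raw reply."""
        # Header: ID=0x1234, flags=0x0100 (recursion desired), one question.
        header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
        # Question: QNAME as length-prefixed labels, then QTYPE=A, QCLASS=IN.
        qname = b"".join(bytes([len(p)]) + p.encode()
                         for p in name.split(".")) + b"\x00"
        question = qname + struct.pack(">HH", 1, 1)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(3)
        sock.sendto(header + question, (server, 53))
        return sock.recvfrom(512)[0]  # response parsing left as an exercise

    print(len(dns_query_a("godaddy.com")), "bytes of DNS response")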

I do agree with you that the whole "internal network events led to corruption" is total LOL though, and in the absence of considerably more detail, I stand by my restatement - "herp derp, I accidentally the whole router".

If they had a multiple-router configuration where one router pushed routing tables to the other routers, all it takes is something to cause the "master" router to become corrupted somehow, then push a corrupted routing table to the other routers. That seems to me like the most likely scenario as to what happened IF it wasn't a hack.

I had considered this to be a possibility, except that such a configuration makes absolutely zero sense for where they're claiming this happened. Even with such a configuration, none of the routing protocols they'd be running would account for this type of behavior. Routing protocols - by nature - have safeguards against such problems.

In short, it shouldn't have happened unless someone really shit the bed.

Jim's original point was claiming that GoDaddy must not actually have two separate networks and that the existence of two DNS addresses must only be to fool the customer into believing there is non-existent redundancy.

Not exactly. My point was that the networks are not truly separate, not that there aren't two networks at all. We really have no way of knowing whether or not there are "two networks" in the most basic sense. The point is that, REGARDLESS, for their story to be true - or even true-ish - those networks were not actually separated in any real sense. Which yesterday's outage made pretty clear.

I'm not sure what you mean by "truly separate". Since your original point was about customer expectations, do you think customers expect that GoDaddy runs two independent networks, each of which runs different hardware and software and is configured by separate teams of admins? Because that would be the only way to create "truly separate" networks that share no vulnerabilities. Even then, some data would still need to be shared between the two networks, and you'd have to find some way to verify that data as it moved between them.

The existence of this problem simply does not support your level of certainty regarding GoDaddy's network topology.

Hey Panther Modern, where are you getting the TCAM bit from? I haven't seen any source saying anything more specific than "corrupt router tables". Have you seen something we haven't?

TCAM = Ternary Content-Addressable Memory

This is the type of memory in which Cisco routers (and others) store data such as CEF (Cisco Express Forwarding) tables, access-control-list logic tables, etc. When they say "corrupted router tables", they are specifically talking about a data array in TCAM, unless they are just using buzzwords without knowing their meaning.

Just saying "corrupted router table" is meaningless unless they said "corrupted routing table". Corrupted "routing" table doesn't make sense in this situation either. Any other "table" located in the router would not account for this issue either, since content in the TCAM does not "propagate" between routers: routers populate this memory themselves from frames/packets that enter them, not from rules in place on other routers.

Routing information needs to come from outside of the router, doesn't it? Routers are not omniscient.

Big misconception.

Routing information comes from several sources:

1: Networks that are directly connected to the router (ie: an interface on the router is configured with a specific IP address and subnet mask, indicating that all IPs within that mask reside on that interface)
2: Networks that are statically configured (ie: your default route, such as "ip route 0.0.0.0 0.0.0.0 <nexthop>" or "ip route 192.168.0.0 0.0.255.255 <physical interface name or nexthop IP>")
3: Networks it receives from neighboring routers (ie: the branch office in the next city)
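
However the routes get into the table, forwarding itself is then a longest-prefix match against that table. A toy sketch using Python's standard ipaddress module (the route entries are invented for illustration):

    import ipaddress

    # One made-up route from each of the three sources above.
    routes = {
        ipaddress.ip_network("192.168.0.0/16"): "eth0",          # 1: connected
        ipaddress.ip_network("0.0.0.0/0"): "nexthop-ISP",        # 2: static default
        ipaddress.ip_network("10.1.0.0/16"): "nexthop-branch",   # 3: learned
    }

    def lookup(dst):
        """Return the next hop for dst: the most specific matching prefix wins."""
        addr = ipaddress.ip_address(dst)
        candidates = [net for net in routes if addr in net]
        best = max(candidates, key=lambda net: net.prefixlen)
        return routes[best]

    print(lookup("192.168.4.7"))  # -> eth0 (/16 beats the default route)
    print(lookup("8.8.8.8"))      # -> nexthop-ISP (only the default matches)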

Here's a thought experiment (completely hypothetical):

GoDaddy has two separate "networks" for its DNS infrastructure.

Network-A is physically located in Los Angeles. Network-B is physically located in San Francisco. The DNS service infrastructure is located in both facilities, to provide redundancy. The DNS servers or the "routers" located in San Francisco develop a problem.

Here's where the hypothetical situation gets muddy:

If the DNS servers located in San Francisco malfunction and attempt to propagate corrupted records to the DNS servers in Los Angeles, why didn't they just say that? Totally believable, has happened before, not out of the ordinary.

If the edge router in San Francisco malfunctions and stops propagating DNS traffic, this does not in any way affect the Los Angeles edge router. The only thing that happens is DNS replication between the sites stops, and Los Angeles begins serving all DNS traffic.

If the edge router in San Francisco tries to send a routing update to the edge router in Los Angeles, that routing update must contain one of the following: a "hello, I'm alive", a "keep paying attention to me, I'm still alive", or "I have an update for you about the networks that I know about."

The last scenario is the only scenario in which routers will exchange "router table" data as described by GoDaddy. In this scenario, the San Francisco router has detected (or been programmed with) a change to its internal routing table, which it attempts to communicate to the Los Angeles router. Here's the important part: the Los Angeles router does not overwrite its own routing table - it performs a series of boolean operations to determine which pieces of the routing update are relevant to it, and populates (or removes) a routing entry from its table based upon the update.
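
In code, that merge-don't-overwrite behavior looks roughly like this - a toy model only (real protocols also weigh metrics, administrative distance, and so on), with made-up prefixes and next hops:

    def apply_update(table, update):
        """Merge a neighbor's update into our table, entry by entry.

        table: {prefix: nexthop}; update: list of (action, prefix, nexthop).
        Our own routes survive untouched; only advertised deltas change."""
        for action, prefix, nexthop in update:
            if action == "add":
                table[prefix] = nexthop
            elif action == "withdraw":
                table.pop(prefix, None)  # ignore withdrawals of unknown routes
        return table

    la_table = {"10.1.0.0/16": "local", "0.0.0.0/0": "transit"}
    sf_update = [("withdraw", "10.2.0.0/16", None),   # SF lost a subnet
                 ("add", "10.3.0.0/16", "sf-link")]   # SF learned a new one
    print(apply_update(la_table, sf_update))
    # {'10.1.0.0/16': 'local', '0.0.0.0/0': 'transit', '10.3.0.0/16': 'sf-link'}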

For one router to "wipe or corrupt" the routing table of a neighboring router through a malformed update runs contrary to the actual way these routing protocols work. What could happen is the physical interface in San Francisco that is attached to the switch that the DNS servers are connected to "goes down" due to a disconnected cable, and as a result, sends a "this network is no longer active on my device" message to its peers. The most that would happen is that the DNS servers inside the San Francisco facility would be unreachable.

The scenario that GoDaddy is claiming is inconsistent with the design and operation of routers. Period.

Just to be clear, carrying out a DDoS attack really isn't what most security people consider hacking. It certainly doesn't involve rooting a server or accessing customer data or other proprietary information. So I'm not sure why some commenters here seem to think there's this huge specter of shame attached to being on the receiving end of a DDoS attack.

Both networks go to the same physical facility, and, potentially, to the same physical device.

Again, GoDaddy's "corrupt router tables" answer falls straight through the floor. What probably happened is that, because they lacked proper redundant infrastructure, they experienced an event that exposed a single point of failure - a huge no-no in network and systems engineering when designing enterprise-class networks and services.

So, in conclusion:

GoDaddy is definitely giving a bullshit answer by saying "corrupt router tables", as well as lying about the redundancy of their infrastructure.

The scenario that GoDaddy is claiming is inconsistent with the design and operation of routers. Period.

Unless you're getting info from somewhere other than the Ars article, I'm not really sure how you can know exactly what scenario they are claiming. A one-sentence quote from an email from the CEO (rather than a detailed after-incident report) simply does not provide you with the level of certainty you are claiming.

A one-sentence quote from an email from the CEO (rather than a detailed after-incident report) simply does not provide you with the level of certainty you are claiming.

It does combined with the other evidence. It's a buzzword-bingo answer. As a CEO, you simply kludge words together and hope they mean something technical. The plebs never know the difference and most people won't call you out on it because it sounds "plausible". It isn't.

See above: both DNS networks trace to the same physical location, and, potentially, to the same physical device. "Corrupt router tables" is obviously not the answer, except for one single scenario: someone fucked up.

Since your original point was about customer expectations, do you think customers expect...

Given that we're talking about DNS, a service with a VERY long-established set of best practices regarding redundancy, and the fact that this is one of the largest, if not THE largest, private DNS providers, here's what I would expect when I see two subnets, and each domain assigned one host record on each subnet:

The equivalent of hosting EC2 instances in separate Amazon Availability Zones, or servers (physical or virtual) in separate SoftLayer datacenters, or the equivalent with another host: facilities which are independent of one another and can be relied upon not to fail from a single failure in layers 0-6. A layer 7 failure can, of course, take out a service hosted redundantly in multiple datacenters.

GoDaddy is not claiming a layer 7 failure. They're claiming a layer 3 failure. This is not difficult!

When selecting secondary servers, attention should be given to the various likely failure modes. Servers should be placed so that it is likely that at least one server will be available to all significant parts of the Internet, for any likely failure.

Consequently, placing all servers at the local site, while easy to arrange, and easy to manage, is not a good policy. Should a single link fail, or there be a site, or perhaps even building, or room, power failure, such a configuration can lead to all servers being disconnected from the Internet.

Secondary servers must be placed at both topologically and geographically dispersed locations on the Internet, to minimise the likelihood of a single failure disabling all of them.

That is, secondary servers should be at geographically distant locations, so it is unlikely that events like power loss, etc, will disrupt all of them simultaneously. They should also be connected to the net via quite diverse paths. This means that the failure of any one link, or of routing within some segment of the network (such as a service provider) will not make all of the servers unreachable.
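
(That guidance is essentially RFC 2182's advice on secondary server placement.) A quick way to eyeball whether a zone follows it is to resolve its NS records and see where the addresses land. A sketch assuming the third-party dnspython package is installed ("pip install dnspython"); note that two NS hosts on two subnets proves nothing about geography, which is the whole argument above - you'd still need traceroutes:

    import dns.resolver  # third-party: dnspython

    def nameserver_ips(zone):
        """Print each NS host for a zone along with its A records."""
        for ns in dns.resolver.resolve(zone, "NS"):
            host = str(ns.target)
            addrs = [str(a) for a in dns.resolver.resolve(host, "A")]
            print(host, "->", addrs)

    nameserver_ips("godaddy.com")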

Thanks for running the traceroute, PM. That's pretty much what I'd expected to see - shame there's no easy way to tell if there are even separate machines inside the DC, isn't it?

I've pinged each IP individually; the ping times from each IP differ by around 40ms - 42ms, consistently, indicating that they go to separate physical devices (but not through another Layer 3 device). They're probably on different switches, on separate floors of the same building.

Worth noting: the redundancy that's implied by them giving you one NS record on each of two different subnets (216.69.185.x and 208.109.255.x) is clearly no more than that - an implication. If they were truly separate networks, a router failure in one network would not have affected the other, and vast swathes of the internet would not have been made inaccessible for the majority of the day yesterday.

What you fail to recognize, though, is that the problem was a corruption of the router tables, not a failure of a router. There is no practical way that any site could prevent an outage if its sole database for the routing information propagates bad data, and no practical way to implement distinct redundant databases that collectively only propagate valid data.

IME a simple ping just doesn't have the fine level of detail to detect topology at as small a scale as "different floor of the building" or "different switch same router". I don't typically see even one full ms of latency across a switch. And given that pings are the absolute lowest priority of any device to return, you're a lot more likely to see differences in load on the machine (or even in what OS two machines are running!) than you are to see the difference in where they're sitting in a building.

This is a ping from one subnet to another, in the same large public datacenter:

I get almost identical results pinging one box from another on my home gigabit LAN. And my router, which is literally ten feet of cable away from my PC, shows longer ping times than my wife's computer, which is upstairs on the other end of the house - which tells you that the router's busier than my wife's computer.
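
The variance point is easy to check for yourself: ping the same host repeatedly and look at the spread. A sketch for Linux/macOS (ping flags differ on Windows; the target below is a hypothetical address in the 216.69.185.x range discussed above, so substitute whatever you want to test):

    import re
    import statistics
    import subprocess

    def rtt_samples(host, count=20):
        """Run the system ping and pull out the individual RTTs in ms."""
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        return [float(m) for m in re.findall(r"time=([\d.]+)", out)]

    samples = rtt_samples("216.69.185.2")  # hypothetical nameserver IP
    print(f"min={min(samples)} max={max(samples)} "
          f"stdev={statistics.stdev(samples):.2f} ms")
    # If the spread within one host exceeds the difference between two
    # hosts, ping is telling you about load, not about floor plans.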

Reading this thread is painful, with all the assumptions made and accusations thrown around. Most of us will never really know what took place here. I'm in no way defending GD, but to see a technical community throwing up so many what-ifs, it's just not helpful.

First off, the two-separate-networks theory: the only thing the traceroutes prove is that the front ends to the DNS service could be in the same location, and I have some doubt about even that, as you can see from the traceroutes:

The last hop looks like it connected to the same device in both cases; however, the first one shows 130ms for the hop vs. 56ms on the second. This could be caused by any number of things, one being a long haul to different facilities, another being the inherently slow processing of ICMP packets. Either way, there is doubt and no proof.

For the sake of argument, let's say they are in the same facility (ns01/ns02), or at least the public IPs for each of those are. It's very likely someone as big as GD doesn't just assign those IPs to individual boxes, but instead uses some sort of load balancing, maybe MANY layers of load balancing, all of which would be hidden from view (traceroutes). That is to say, if you had two subnets for DNS and the front end was in the same facility, there is every reason to believe the backends could be elsewhere. Not to mention the traces provided don't indicate whether they are tracing to an IP or an A record.

It's very easy with DNS to change what IP you get for a given A record based on anycast, which is probably what they are doing. I would announce the same subnets from different locations and no one would be the wiser. Have you ever tracerouted to 4.2.2.2 from different points around the world? You are usually never more than 5 or 6 hops from it. You know why? Anycast.

So with three traces from around the world, you can plainly see ns01.domaincontrol.com is in three different places - based not only on hop counts, but on the fact that there is no way in the world you could reach the same point from LA, Virginia, and Europe in under 20ms. It takes 50ms to get from LA to Virginia, and another 80ms to get to Europe.
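
That physical claim is worth making precise: light in fiber travels at roughly two-thirds the speed of light in vacuum, so a round-trip time puts a hard ceiling on how far away the responding box can be. A back-of-the-envelope sketch:

    # ~2/3 of c (300,000 km/s) through fiber, expressed in km per millisecond.
    C_FIBER_KM_PER_MS = 200

    def max_one_way_km(rtt_ms):
        """Upper bound on server distance implied by a round-trip time."""
        return rtt_ms / 2 * C_FIBER_KM_PER_MS

    for rtt in (20, 50, 80):
        print(f"{rtt}ms RTT -> server is at most {max_one_way_km(rtt):,.0f} km away")
    # A 20ms RTT caps the distance at ~2,000 km, so sub-20ms answers from both
    # LA and Europe cannot be coming from one box: it has to be anycast.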

With that I'm going to end my rant, as all the QoS/TCAM BS and the definition of "corrupt" will have to wait for another time. I do after all have a network to run.

You guys can argue all day about what did or didn't happen, but I'll say the same thing I said yesterday on the previous article concerning this situation:

My company interfaces directly with GoDaddy. Because of legal agreements I can't give details but I can say that I know multiple employees, personally and professionally, of GoDaddy.

Their offices are a few miles from my company HQ.

I can say, without a doubt, that regardless of any hardware issues they may have had as a result, this started as a DDoS attack and was the result of a combination of hacks and attacks. This is coming from the horse's mouth.

GoDaddy is a registrar - the largest registrar in the US. They aren't going to come out and say they were hacked publicly if they think they can get away with avoiding the negative press.

Yeah it's anycast, go to http://lookingglass.pccwglobal.com/ and you can run traceroutes from all over the world to test. They don't appear to have a European datacenter, but they've got several in the US and one in Singapore.

You still end up going to the SAME datacenter for both your NS records no matter which zone you happen to be in, though - ns01 and ns02, or any other matched pair, will be on separate subnets routed into the exact same datacenter for any given region. So if that datacenter goes belly up, so does your DNS resolution.

Not to mention that they've still got ALL their datacenters tied together too tightly if one "internal network event" brought all of their datacenters offline at once yesterday. That really, really should not be possible. Unless of course the "internal network event" was something along the lines of "lol, let's push updates to the firmware of all our routers at once without testing first".

A one-sentence quote from an email from the CEO (rather than a detailed after-incident report) simply does not provide you with the level of certainty you are claiming.

It does combined with the other evidence. It's a buzzword-bingo answer. As a CEO, you simply kludge words together and hope they mean something technical. The plebs never know the difference and most people won't call you out on it because it sounds "plausible". It isn't.

See above: both DNS networks trace to the same physical location, and, potentially, to the same physical device. "Corrupt router tables" is obviously not the answer, except for one single scenario: someone fucked up.

Nice deflection. Of course someone fucked up. You just don't know enough to know exactly how they fucked up. Yet that doesn't stop you from acting like the smartest guy in the room making assertions off of essentially zero information.

Thanks for running the traceroute, PM. That's pretty much what I'd expected to see - shame there's no easy way to tell if there are even separate machines inside the DC, isn't it?

I've pinged each IP individually; the ping times from each IP differ by around 40ms - 42ms, consistently, indicating that they go to separate physical devices (but not through another Layer 3 device). They're probably on different switches, on separate floors of the same building.

In short... no, they definitely don't have the redundancy in place that they claim to have.

Is that info pre- or post-incident? For all you know, they have something temporary in place while they make permanent fixes to their network.

Again, I'm not denying that GoDaddy fucked up. I'm just denying that you really have enough information to make the assertions you're making.

Y'know what, Chuck? He does. PM wrote in the present tense: "they don't have the redundancy ... they claim."

As far as I could tell from here, that's true. As you say, whether they had redundancy before the Giant Screwup is unknown, at least to me. However, a dozen or so hours ago when these messages were posted, they didn't.

If one runs the same tests now, one gets timeouts. Can't really tell very much from that.

I've been a GoDaddy customer since it was a small business run by an ex-Marine. I'm not sure I can ever completely untangle my relationship with GoDaddy, but I've become convinced that friends don't let friends use GoDaddy. This very evening I'm in the process of helping a friend migrate away. And I've got a list of other friends.