Posted by CmdrTaco on Wednesday May 12, 2010 @10:00AM
from the wo-ist-jones.de dept.

An anonymous reader writes "Due to an error on behalf of DENIC, the German DNS registrar for second-level .de domains, millions of .de domains fell over the edge (auf Deutsch) of the Internet today. The cause of this GAU (GröYter anzunehmender Unfall = maximum credible accident) is still unknown, as DENIC officials haven't answered any questions from journalists at the time of writing."

I wonder if this had anything to do with my own DNS outage yesterday. There seemed to be a rolling DoS attack which hit a couple of my nameservers. It hit a slightly out of date version of bind, which made it barf. Of course I have the servers monitoring themselves, so they kept bringing it back up, just to be knocked down again a few minutes later. The solution? Upgrade to current.

Did anyone else see this, or was it two isolated (and unrelated) cases?

I experienced an array of network cuts recently as well! I thought I was the only one, but this confirms my suspicion that something in the network is buggy. I use Clearwire, as does my mom, and just as she was calling me to say that the internet didn't work, mine was acting buggy too. What's stranger is that even though all the diagnostics detected errors in the network connection, I could still connect remotely to mom's computer from mine! After this I did another netstat to see what was going on but i

Why complain? It's nothing other than the typical bad work of the so-called "editors" of Slashdot. They also did not notice that a charset conversion error occurred. The German phrase is "Größter anzunehmender Unfall", not "GröYter anzunehmender Unfall". But why should we expect paid editors to actually do any work?

Nah; it was just as stated: a charset conversion screwup. The pages I get say <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> in the header section. Do yours say UTF-8 or some other Unicode charset?
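For anyone who hasn't run into this before, here is a minimal Python sketch of the general mechanism: text encoded as UTF-8 but decoded as a single-byte Western charset. It is only an illustration. It produces the classic "Ã¶" / "ÃŸ" garbling and does not reproduce the exact "Y" in the summary, whose conversion path is not known from this thread.

```python
# Illustration only: UTF-8 bytes read back with the wrong single-byte charset.
original = "Größter anzunehmender Unfall"
utf8_bytes = original.encode("utf-8")

# Each of "ö" and "ß" is two bytes in UTF-8, so each becomes two characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")
as_cp1252 = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no byte was dropped along the way.
round_trip = as_latin1.encode("iso-8859-1").decode("utf-8")

print(as_cp1252)
print(round_trip == original)
```

If the bytes are stripped or "normalized" somewhere in between, the round trip no longer works, which is usually how a single stray letter ends up in published text.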

It would be nice to be able to quote things in non-Western languages here, especially now that China+Korea+Japan are the majority of the Internet, and where most of our hardware is now produced. But I guess it'll still be a while before those of us dealing with non-Western langu

Actually, we reserve the term "Super-GAU" for that. "GAU" translates to "most severe expected accident"; it's still something you design your facility to handle. Consequently, a Super-GAU is an accident that exceeds what you planned for. An important point is that nuclear plants are required not to emit any radioactive material even in case of a GAU. Therefore, any accident during which the plant does release radioactive material is a Super-GAU.

Three Mile Island is a good example: Back then it was a Super-GAU, as nobody designed reactors to handle gas buildup. With modern reactors it's a regular GAU, since modern designs are required to consider that failure mode and mitigate it.

In short: A GAU is "well, I guess after we're done decontaminating and repairing the plant we'll need to do quite a bit of lobbying to get it back online". A Super-GAU is "we just contaminated how much land?".

True. The engineering term for a GAU is "Auslegungsstörfall" - you gotta love German compound nouns. It roughly translates to "design basis accident" - the biggest accident covered by your fail-safes. The "GAU" acronym is mostly misused these days.

A worst-case scenario is more likely a "Super-GAU", i.e. what you get when you take political and commercial "interpretations" away and only look at what could actually happen. And, yes, "Super-GAU" is a proper term, not my invention.

Technically, no. SNAFU refers to a situation that is Normal (but always All Fouled Up). GAU refers to the worst possible situation that could be anticipated, i.e. more fouled up than usual, and about as fouled up as your imagination can take you.

Stop calling it a root server (shame on you, Golem.de!). The root servers serve the root zone, which contains the top-level domains. The affected servers in this incident were the .de TLD servers, which serve the second-level domain records.

Lufthansa (in German): "Ground, what is our start clearance time?"
Ground (in English): "If you want an answer you must speak in English."
Lufthansa (in English): "I am a German, flying a German airplane, in Germany. Why must I speak English?"
Unknown voice from another plane (in a beautiful British accent): "Because you lost the bloody war."

Shiza sounds like one of those districts of Tokyo that nerds dream of going to because it consists of obscure shops that sell flavours of Pocky not available in the United States, like vodka-scented curried tayberry, or the infamous "Pocky flavoured Pocky" whose very meta-ness has driven some Westerners mad....

The problem did not affect all domains, and it did not affect all nameservers for the German TLD. The nameservers reached through "c.de.net" (== c.nic.de) and "s.de.net" (== s.nic.de) more or less worked fine during the outage; they failed to answer only for a short period. The other nameservers for .de, however, lost most of the domains under the TLD and returned NS records only for domain names starting with a digit or with the letters a to e. So, for example, br-online.de worked fine, while web.de did not. The really bad part is that the affected nameservers did not simply stop answering but instead answered with NXDOMAIN. In other words, they said they had no record for the query, which in effect means "This domain does not exist". Unfortunately, such negative answers are cached for a time determined by the authoritative nameserver. DENIC's nameservers tell clients to cache this result for 7200 seconds, so the outage continued to cause problems for up to two hours after it was fixed, unless the DNS caches were cleared.
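To make the caching point concrete, here is a toy sketch in Python. The class and its names are made up for illustration and are not how any real resolver is implemented; the only number taken from above is the 7200-second negative TTL. It shows a cached NXDOMAIN outliving the fix on the server side.

```python
NEGATIVE_TTL = 7200  # seconds; the figure quoted above for DENIC's servers


class CachingResolver:
    """Toy resolver (hypothetical) that remembers NXDOMAIN answers."""

    def __init__(self, authoritative_has):
        # authoritative_has: callable(name) -> bool, stands in for a TLD server
        self.authoritative_has = authoritative_has
        self.negative_until = {}  # name -> time when the cached NXDOMAIN expires

    def exists(self, name, now):
        expiry = self.negative_until.get(name)
        if expiry is not None and now < expiry:
            return False  # answered from the negative cache, server not asked
        self.negative_until.pop(name, None)
        if self.authoritative_has(name):
            return True
        self.negative_until[name] = now + NEGATIVE_TTL
        return False


zone = set()  # broken server: the delegation is missing
resolver = CachingResolver(lambda name: name in zone)

during_outage = resolver.exists("web.de", now=0)    # NXDOMAIN gets cached
zone.add("web.de")                                  # server is repaired
just_after_fix = resolver.exists("web.de", now=60)  # still the cached NXDOMAIN
after_ttl = resolver.exists("web.de", now=7200)     # cache expired, asks again

print(during_outage, just_after_fix, after_ttl)
```

Flushing the cache corresponds to emptying `negative_until`, which is why clearing DNS caches ended the problem early.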

One more thing to note: some sites claim that four of the six nameservers for .de were affected, because six hostnames are listed as nameservers for .de and, as I said, two of them did work. However, both a.nic.de and z.nic.de resolve to anycast IPs, which are routed to a number of different servers around the world depending on your own location. So there are more than six servers in total.

According to the DENIC registrar's mailing list, this was just an administrative fuckup. DENIC apparently runs BIND (on at least the 4 affected logical servers), and they reloaded BIND with an empty zone file. Since the six logical servers are all authoritative, the empty-zone-file servers replied with NXDOMAIN (as they should have).
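Why an empty zone is worse than a dead server can be sketched like this. It is a deliberately simplified model in Python: the server functions and the tiny zone are invented for the example, and real resolvers are more involved. The point is that an authoritative NXDOMAIN is a final answer, while a timeout makes the resolver try the next server.

```python
FULL_ZONE = {"web.de", "br-online.de"}  # stand-in for the real .de zone


def healthy(name):
    return "NOERROR" if name in FULL_ZONE else "NXDOMAIN"


def empty_zone(name):
    # Authoritative for .de, but loaded with no delegations at all.
    return "NXDOMAIN"


def unreachable(name):
    return "TIMEOUT"


def resolve(name, servers):
    """Ask servers in order; the first definitive answer ends the search."""
    for server in servers:
        answer = server(name)
        if answer in ("NOERROR", "NXDOMAIN"):
            return answer  # authoritative answer, no reason to ask anyone else
        # TIMEOUT (or SERVFAIL) is not an answer, so fall through to the next
    return "SERVFAIL"


print(resolve("web.de", [empty_zone, healthy]))   # the healthy server is never asked
print(resolve("web.de", [unreachable, healthy]))  # falls back and succeeds
```

So if the affected servers had simply been down, most lookups would have been slower but would still have succeeded via the working ones.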

The parent is correct: non-existent domain responses should only be cached for 2 hours.

Since .de is the largest ccTLD (by count of registrations), this is a pretty big deal. On April 3 2010, there were 13.5 million [domainnews.com] registered .de domains. I wonder how long it took BIND to start with that many zones!

The DE-NIC finally spoke out [denic.de]. If you don't speak German: the statement doesn't contain anything that wasn't already well known. Yes, there was a problem starting at about 13:00, and it was fixed around 15:45.
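Putting those times together with the 7200-second negative-cache figure mentioned elsewhere in this thread gives a rough worst case. A small Python sketch, treating the approximate 13:00 and 15:45 as exact offsets from midnight (that exactness is an assumption):

```python
from datetime import timedelta

# Offsets from midnight; the statement only says "about" and "around".
outage_start = timedelta(hours=13)
outage_end = timedelta(hours=15, minutes=45)
negative_ttl = timedelta(seconds=7200)  # cache time quoted in this thread

server_side_duration = outage_end - outage_start
last_stale_answer = outage_end + negative_ttl

print(server_side_duration)  # how long the servers themselves were wrong
print(last_stale_answer)     # latest time a cached NXDOMAIN could survive
```

That is for a resolver that cached a bad answer just before the fix and never had its cache flushed.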

The zone information was only partially available from some servers. That could be the result of the size increase caused by the additional (large) DNSSEC records. Perhaps some automated zone update process ran out of space or time. This is only speculation though.

I heard this claim coming from the DFN as well, but I really suspect that it's bullshit. Why? As far as I understand it (I admit I lack proper knowledge of DNSSEC), the introduction of DNSSEC should only affect clients that are actually capable of DNSSEC and that ask the nameserver for it, since DNSSEC is done through additional records in the DNS. Old clients will just request records as they always did and will get normal answers as they always have.
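For what it's worth, the "clients have to ask for it" part can be seen on the wire. Below is a rough Python sketch that builds two DNS query packets by hand: a plain one, and one carrying an EDNS0 OPT record with the "DNSSEC OK" (DO) bit set. The function names, query ID and 4096-byte payload size are arbitrary choices for the example, and nothing is actually sent.

```python
import struct


def encode_name(name):
    """Encode a hostname as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, dnssec_ok=False, query_id=0x1234):
    """Build a raw DNS query; qtype 1 is A, class 1 is IN."""
    arcount = 1 if dnssec_ok else 0
    # Header: ID, flags (RD set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)
    if not dnssec_ok:
        return header + question
    # OPT pseudo-record: root name, TYPE 41, CLASS = UDP payload size,
    # TTL = ext-rcode 0, version 0, flags with the DO bit (0x8000), RDLEN 0
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, 0x00008000, 0)
    return header + question + opt


plain = build_query("web.de")
signed = build_query("web.de", dnssec_ok=True)
print(len(plain), len(signed))
```

A server only adds signature records to the answer when that bit is set, so a client sending the plain packet sees the same kind of answer it always did. That says nothing about what the rollout did to the zone files on the server side, though.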

We wouldn't need to speculate if the DE-NIC gave out more details. As far as I'm concerned, the DFN NOC has more credibility than the DE-NIC.

There are hundreds of ways to get a DNSSEC deployment wrong. The error is not disturbing by itself. The time needed to roll back whatever change they made is, IMHO, as is the lack of any plan for what to do when something like this happens. Don't get me started on the information policy...

Wow, someone decided to mod me down as overrated. Talk about mod abuse.

However, it is said that the DNSSEC testbed worked fine during the outage, so this is a strong indicator that DNSSEC was not the culprit. I also got a credible statement from a DENIC technician that DNSSEC was not the reason, and the DFN NOC is - as I said - making a ridiculous claim without any background knowledge. DENIC still has not provided an explanation, but it appears that for some reason the zones were only transferred in part, wh

The reason is AFAIK not DNSSEC itself, but the process of the introduction. Why should someone delete zone files if not due to changes made to the zones? I would guess, any nameserver gets for each update only a diff and not a full dump. In this case the diff contained empty zones (my guess).

The Association of German Spammers has already claimed billions of dollars in losses due to the outage.
The association chairman claims: "We need a rescue package to recover our losses".
Billions of SPAM emails to valid accounts were flagged as "does not exist".
Using the same calculations as the music industry, the rescue package is estimated to exceed 50 billion euros to recover today's losses.

the KISS principle is perhaps one of the greatest principles in engineering, and frequently keeps people's minds grounded in the deliverables, and prevents them from spinning out of control into overly complex solutions, which are in fact the source of most software bugs, not the solution to them

if this is the principle you are referring to, i don't know where the source of your animosity to it lies, nor why it has anything to do with this particular subject matter

Once upon a time, the DE-NIC was very respected in the German internet community. But several things have happened lately that have eroded that trust. There were internal power struggles [heise.de], the rising influence of domain traders [denic.de] inside the DE-NIC, and the surprising distribution of the two-letter-domain rush [www.egm.at] (25% of all domains ending up in the hands of a single person). Perhaps this outage will be a wakeup call. If we only count the time spent on customers calling the hotline, the damage for my company is several thousand dollars.