Posted
by
timothy
on Sunday September 26, 2010 @11:44AM
from the three-letters-bad dept.

Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."

I can understand why that may cause people to think the problem is with DNS. The error message looks like it came from an HTTP proxy. That would suggest that either the user had a proxy configured or Facebook was using a reverse proxy. If it was the latter, the DNS "problem" would be inside their network.

Easy. They absolutely do use reverse proxies - every large site does, because you just can't scale a web site to Facebook's size without them.

In the post-mortem, they mention the need to effectively "turn off" the entire site, and the easiest way to do that is to remove its DNS. In this case, however, it was most likely more effective to remove the DNS entries for the back-end hosts that the proxies forward queries to, rather than the entries for www.facebook.com. This is most likely what generated the DNS errors users saw.

Bingo! The DNS issue was internal to Facebook's load-balancing cluster. Anyone who's hosted a busy web site should be familiar with this kind of setup. Internal DNS is often used for such purposes, as it can transparently provide round-robin functionality. Every time you resolve the hostname, you get a different IP (caches notwithstanding), so while the back-end servers need to be conscious of the load-balancing in a generic fashion, the actual distribution of work is trivial, and adding more back-end nodes is easy.
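A minimal sketch of that round-robin idea, with invented names and addresses (in real deployments the rotation comes from the nameserver itself, e.g. BIND-style record rotation, not application code like this):

```python
# Toy round-robin resolver. Each lookup rotates the record list by one,
# so successive clients land on different back-end hosts.
ZONE = {"db.internal.example": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
_counters = {}

def resolve(name):
    """Return the zone's records for `name`, rotated per lookup."""
    ips = ZONE[name]
    n = _counters.get(name, 0)
    _counters[name] = n + 1
    k = n % len(ips)
    return ips[k:] + ips[:k]

first = resolve("db.internal.example")[0]   # "10.0.0.1"
second = resolve("db.internal.example")[0]  # "10.0.0.2"
```

Adding a back-end node is then just appending one more A record to the list; clients need no reconfiguration.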

I think they changed their internal DNS config, screwed it up, and when their front-facing webservers tried to look up their database servers and failed, they tried the backup/rollover DB servers, failed... these cascading errors caused their internal DNS servers to melt down.

After they'd been down for a while (the site spun down slowly over about half an hour), somebody in charge asked "WHY ARE WE DOWN", was told "DNS error", and then changed the front-facing webservers to spit up HTML that said "DNS ERROR".

Indeed. I don't do Facebook, but if I had got such a message, my first response would be to look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list, but I can live with that...

What percentage of slashdotters actually noticed the Facebook outage when it happened, as opposed to merely participating in the post-hoc commentary after they read about it? It should have been posted to Slashdot's Idle category.

It is the most used website in the world (more user-hours/month spent on Facebook than any other site), the fastest growing internet community (when measured in new users/month), etc... And as such it is an engineering masterpiece (in software engineering and probably in several other areas, too). When it goes down for several hours, it is a newsworthy event.

For us who work for advertising agencies, FB downtime is also a financially notable event.

Yet you failed to notice that /. is a site for nerds. Many nerds do not strive to cultivate their social skills. Checking their friends' status on a social network might not be at the top of their agendas. So: the event was notable, but not very important to many slashdotters.

It wasn't your browser having a DNS error, it was the user facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back end server cluster from their internal DNS.

The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site; Orkut or MySpace would have been mildly humorous. I am mildly surprised they don't have an emergency box with a simple static "undergoing repair" page, but, whatever...

So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.
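What "zapping the A records" amounts to can be sketched in a few lines (names and addresses invented; a real internal DNS is a nameserver, not a dict, but the off switch works the same way):

```python
# Toy model: front-end proxies find back-ends via an internal DNS table;
# deleting the records is the "off switch", re-adding them is the ramp-up.
internal_dns = {"backend.site.internal": ["10.1.0.1", "10.1.0.2"]}

def forward_request(name):
    """Proxy behavior: resolve the back-end, or report a DNS failure."""
    ips = internal_dns.get(name)
    if not ips:
        return "503: DNS error"          # roughly what users saw
    return f"200 via {ips[0]}"

ok = forward_request("backend.site.internal")
internal_dns.pop("backend.site.internal")            # take the site down
down = forward_request("backend.site.internal")
internal_dns["backend.site.internal"] = ["10.1.0.1"] # ramp back one node
up_again = forward_request("backend.site.internal")
```

Re-adding records one cluster at a time gives exactly the controllable ramp-up discussed below, modulo caching of the old answers.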

I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.

I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up...

Both approaches allow for a controllable ramp up given the right software on their servers. And I think with typical off-the-shelf software, neither of them allows for a controllable ramp up.

But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.

When you hear hoofs, think horses not zebras.

Seeing my servers spike to 100% CPU or 100% I/O and stay there, I'd look outside first before looking inside... So my first goal would be to aim for a controllable ramp up of user requests. If the systems are so overloaded I can't troubleshoot at 100% of users, maybe I COULD log in and troubleshoot at 50 or 90% load.

Also, I've worked at places that won't upgrade until outages due to high utilization are some large multiple of the cost of upgrading; this outage would probably qualify.

This whole situation does explain why my mother appeared to be sick on the couch at my parent's place on Thursday afternoon when I paid them a visit. With all the shaking and huddling under the covers and looking pale-faced I presumed she had come down with the flu or something.

Then again we're dealing with farmville addicts and you can't reason with addicts.

They aren't addicts, that's patently unfair. They can stop any time they want. What is most admirable about them is that they are simply so time-savvy that they coincide those times at which they wish to stop with the periods during which the site happens to be down.

No. DNS has a few security issues, but they're mostly minor. The fact that DNS works for millions of people every day without issue at least 99% of the time proves that DNS is a successful design, even if it could use some security updating.

1. TFA is about DNS.
2. There is no "TCPv6". There is TCP over IPv4 and TCP over IPv6. They are, however, the same TCP.
3. TCP/IP is also used as a broad term for the entire network stack. For example, DNS is an application-level protocol implemented on top of TCP and UDP over IP. But the entire thing is, loosely speaking, TCP/IP technology.

Some people think technology should be replaced just because it is old. But really, it should be replaced if it doesn't suit our needs and there is a different technology that does.

It is better to replace a 1 year old technology that does not suit our needs than to replace a 50 year old one that does. Usually when replacing, you want to replace with something newer. But in some cases it may turn out to be better to replace a new and misdesigned technology with an older and proven one.

That said, there are improvements to both IP and DNS which should be rolled out because they fix real problems. The rollouts are not happening as fast as they ought to, mainly because it is problematic to roll out a change to the entire Internet, especially when not everybody involved is cooperating.

I'll take the quality of design of IP or DNS over what passes for "The Web" these days. The browser as a concept is bending towards its breaking point as it tries to cope with the fact that it's treated as a clown car.

I guess it's historical legacy that we started with HTML and crap like that for browser interaction and everything sort of grew from there, but we're doing the whole "web as an applications platform" wrong.

And it's definitely showing its age. There's been a big cry for years from those working at the really high end of networking that we need to replace (really just extend) TCP because it doesn't work well with high bandwidth-delay-product links. This is because the max window size and ramp-up algorithm (slow start) don't allow you to saturate the pipe quickly enough or even at all. There are several proposed extensions floating around to fix the problem but none of them have widespread adoption.
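The bandwidth-delay-product argument is easy to put in numbers. A back-of-the-envelope sketch (the link speed and RTT here are illustrative, not from the article):

```python
# Bandwidth-delay product: bytes that must be "in flight" to keep a pipe full.
# Example: a 1 Gbit/s long-haul link with 100 ms round-trip time.
bandwidth_bps = 1_000_000_000            # 1 Gbit/s
rtt_s = 0.100                            # 100 ms RTT
bdp_bytes = int(bandwidth_bps / 8 * rtt_s)       # 12.5 MB must be in flight

# Classic TCP without window scaling caps the receive window at 64 KiB,
# which bounds throughput to window / RTT regardless of link capacity.
max_window = 64 * 1024
max_throughput_bps = int(max_window / rtt_s * 8)  # ~5.2 Mbit/s on a 1 Gbit/s link
```

With a 64 KiB window the sender stalls waiting for ACKs after each window's worth of data, so the gigabit pipe runs at well under 1% utilization; window scaling and more aggressive ramp-up algorithms exist to close that gap.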

How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free. Second, requiring strong digital ID makes it difficult for someone living under a government that suppresses speech.

If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server"

Then it would still fall under the "Requires immediate total cooperation from everybody at once" line of the well-known copypasta [craphound.com], and possibly "Mailing lists and other legitimate email uses would be affected" and "Many email users cannot afford to lose business or alienate potential employers" depending on how it is implemented.

I suppose if I were an angst-ridden bitter friendless teenager I may have found it amusing too. Luckily, I'm an adult. (How sad that this comment is currently marked insightful.)

And - really? Genuine panic? I think that says more about the specific subset of Facebook users within your anecdote set than anything else. Or do you also extrapolate out from the frequent racist troll comments on Slashdot?

LOL Yeah it was hilarious [publicradio.org] when people [dailymail.co.uk] were complaining about being unable to get on Facebook. So funny that people need services to keep in contact with others, it's like why don't you just talk to them in person? I mean like, HELLO, am I the only one getting this? Geeze. If it's so important to you then you should be more redundant with your services, like, everyone knows that!

I don't know that finger pointing is necessarily healthy - that tends to suggest CYA and childish blame games. But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

I can't remember ever seeing an article about a major outage from some big website where they delivered enough information that one could learn from it. So what did you learn from this article? Don't fuck up when you are an admin, or maybe to create a better error page. Yes, very informative and of interest to nerds, indeed.

With all the ad-blocking software, no scripting, and other misc Firefox plugins, I see no ads from Facebook, let alone any other page. Firefox is set to delete cookies and BS on exit and I keep my machine clean with BleachBit.

Facebook allows me to connect with lots of people I never would see, like buddies in the Army based in Japan for instance or friends in New York etc, etc.

I hide all the annoying spam ads for people's stupid farms, and I have convinced many of my Facebook friends and family to stop playing them.

I had never heard of them (I'm an old Slackware hand, and more recently Arch), but Mint's webpage is so incredibly slow to load, it's impossible to see what that agenda is. It doesn't inspire much confidence in them. :-|

You won't find it on the home page. It was a post by a developer on the dev blog. He later removed it and apparently moved it to his personal blog.

Palestine | Written by Clem on Sunday, May 3rd, 2009 @ 12:34 am | Main Topics

This is not the place to talk about this but I am deeply touched by what is happening over there. I feel disgust and guilt with us passively witnessing it and our money and weapons supporting it. I don't want to use my name or this project to push my own ideas about this but I spend a lot of time working and giving away, sharing and receiving to and from a lot of people.

I'm only going to ask for one thing here. If you do not agree I kindly ask you not to use Linux Mint and not to donate money to it.

I hope for these people to be able to live decently in the future and for me not to have anything to do with the misery they're in at the moment.

I promise not to talk about this anymore. I don't want any money or help coming from Israel or people who support the action of their current government.

The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.

Basically, when asking directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which they should, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking!

This only lasted for a little while though, so it might have been a glitch, or possibly a deliberate action related to how they were trying to fix the underlying issue, diverting traffic until they actually solved the problem.
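The expected behavior can be modeled with a toy in-memory zone (records invented): when the CNAME target is in the same zone, the authoritative server holds the A record too, and should chase the chain and return both in a single answer.

```python
# Toy authoritative zone. An answer containing only the CNAME for an
# in-zone target is incomplete; the server should append the target's
# A record so the resolver never has to ask again.
ZONE = {
    "www.example.com": ("CNAME", "star.example.com"),
    "star.example.com": ("A", "192.0.2.10"),
}

def answer(name):
    """Build the full answer section, following in-zone CNAMEs."""
    records = []
    while name in ZONE:
        rtype, value = ZONE[name]
        records.append((name, rtype, value))
        if rtype != "CNAME":
            break
        name = value        # target is inside the same authority: keep going
    return records

resp = answer("www.example.com")
```

A response missing the second tuple is exactly the half-answer described above: the resolver gets a CNAME, has nowhere sensible left to ask, and gives up.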

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
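The loop quoted above can be sketched as a toy simulation (client counts and capacities are invented; the real system was memcached in front of MySQL, not anything this simple):

```python
# Toy model of the feedback loop: clients read through a cache, and on a
# DB *error* they wrongly treat the value as invalid, delete the cache
# key, and requery the origin - so overload perpetuates itself.
DB_CAPACITY = 5          # queries the origin can serve per round

def run_round(clients, cache):
    """One round of traffic; returns how many queries hit the DB."""
    misses = clients if "key" not in cache else 0
    served = min(misses, DB_CAPACITY)
    errors = misses - served
    if errors > 0:
        cache.pop("key", None)   # the bug: errors evict the cached value
    elif served > 0:
        cache["key"] = "valid"   # healthy path: cache repopulates
    return misses

hot = {}
load = [run_round(100, hot) for _ in range(3)]   # overloaded: [100, 100, 100]

calm = {}
recovering = [run_round(3, calm), run_round(3, calm)]  # under capacity: [3, 0]
```

With 100 clients against a capacity of 5, every round produces errors, every error deletes the key, and the DB sees full load forever; under capacity, one round repopulates the cache and traffic collapses to zero. That is the feedback loop that forced them to turn the site off.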

Even when the database has a valid value, if failures to get a value from the database can create a growing cascade of errors, then this design is still poised for a future failure from simple things like a partial outage of databases or network access to them. Ideally, once the data was valid, the number of clients not getting a valid value should gradually decrease as more and more get valid values and don't have to requery. But if the scale was such that none could get anything when all were trying at once, the system could never dig itself out.
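One hedged sketch of the kind of fix this argues for (my illustration, not Facebook's actual change): distinguish "the DB returned an invalid value" from "the DB call failed", and on failure serve the cached entry instead of evicting it.

```python
# Cache-aside read that never evicts on a transport failure. Serving a
# stale value keeps the origin's load flat while it recovers.
cache = {"key": "stale-but-usable"}

def read(db_ok):
    """Fetch from the DB if possible; fall back to cache on failure."""
    try:
        if not db_ok:
            raise ConnectionError("db unavailable")
        value = "fresh"
        cache["key"] = value       # successful fetch refreshes the cache
        return value
    except ConnectionError:
        # Do NOT delete the key: a failed call says nothing about
        # whether the cached value is invalid.
        return cache.get("key", None)

served_during_outage = read(db_ok=False)
served_after = read(db_ok=True)
```

The key property is that outage-time reads generate zero extra origin traffic, so the number of clients without a valid value shrinks naturally once the DB is healthy again.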

The "error page" is clearly a Facebook server reporting a DNS failure within Facebook's own network.
Facebook requests are processed by user-facing servers which make RPC calls (not HTTP) into Facebook's internal network.
Machines in multiple locations may be involved in generating a single Facebook page. If their in-house DNS system for organizing their internal network failed, they might produce messages like that.

It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.

You seem informed. Maybe you can explain why it is that clients would not be picking up the corrected info and reducing their "attack" on the database servers (more so than everything being turned off and back on).

This is explained in the post-mortem. Basically, the problem was that clients were reacting to corrupt data being served up by the origin DB cluster the same way that they reacted to bad data coming from the memcached cluster: by deleting the offending entry in memcached and re-sending the query to the origin DB. So a client queried the origin, got bad data, and then deleted the key from memcached, resulting in every other client (tens of thousands of them, most likely) then querying the cluster for the same key and repeating the cycle.