Posted by CmdrTaco on Thursday December 30, 2010 @10:17AM
from the don't-die-during-christmas dept.

aabelro writes "On December 22nd at 1600 GMT, Skype services started to become unavailable, at first for a small share of users and then for more and more, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage."

"Doubles" refers to the last two digits in your post number (22 in this case).

Every post on 4chan is numbered, with each forum having its own individual counter. So while something small like /int/ (International) might have tens of thousands of posts, something more popular like /v/ (Video Games) or /b/ (Random, the sewage drain of the Internet) has millions.

There are often posts such as "doubles/triples/quads names my dog", or games wherein events are determined by post numbers like a roll of the dice. D

Here's why: "Your organization's Internet use policy restricts access to this web page." Reason: "Internet Telephony is filtered." So I'm glad Slashdot linked to the blog so I'd be able to read what was going on. My workplace is so backwards they still use old-fashioned telephone lines rather than internet phones. Oh, and hot water radiators with that classic "thunk thunk thunk" sound when they turn on. Feels like I'm living in the 1930s. ;-)

If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.

No one expects 40% of a globally distributed network to crash at once. No one. FTFA:

The initial crashes happened just before our usual daily peak-hour (1000 PST/1800 GMT), and very shortly after the initial crash, which resulted in traffic to the supernodes that was about 100 times what would normally be expected at that time of day.

Not even a multi-billion dollar company would have a disaster plan that provisions 100x capacity as a hot/cold spare. Though I bet their new plan includes automatic spawning of nodes on EC2 or some other distributed CDN.

I agree. But it wasn't an initial 100x surge, right? It was a cascading failure where eventually supernodes were up 100% because there were fewer and fewer of them. It's a matter of prevention, not cure.

Skype seems clueless. They're thinking of using "processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again." Contrariwise - this would only make the matter worse. What if the _current_ version were the one with the problem, and an automated

Ah, but it's a brave new world where the client/server relationship is becoming fuzzier all the time. The part I think you are missing is that, if you read the actual post, it is obvious that everything that was crashing was an application on clients' computers. It appears that some clients are promoted to server status to handle routing requests.

As for bad design/software, I would instead say they had features without consideration of consequences. Here is where their problems are, from what I can see.

It was their own widely deployed buggy software that caused the big chunk to go offline. Any other organization with a big "deploy everywhere" button would understand the importance of an equally big "roll back" button, and of heavy testing before pressing either. I guess "Skype's clients are also their servers, so they have no control" is an excuse? Is it a good one?

I'll believe that when I hear a Chinese person (one who hasn't been out of the country for decades) say that China will rule the world for any reason other than being a superior race or culture. China is quite deluded, even more so than the US. Half the world (the Occident) is helping them get even more deluded, and the other half (the Orient) is too afraid to help them cut through any kind of delusion.

That doesn't mean, of course, that China isn't becoming a superpower. They may be, or may not, I don't know the future. Militarily, they already are...

... relying on dodgy peer to peer VOIP telephony for business purposes is retarded.

We've got people bitching at work about how it doesn't work from time to time, and about why I've blocked its ability to do voice/video at the firewall. If you want VOIP, use something that uses standard SIP or some other documented, configurable traffic.

Ahh, so YOU'RE the one blocking my Skype. ;-) I don't understand why Net Admins (such as yourself) block useful tools like Skype. Or streaming radio. I don't see any harm in letting those things into the office space, and it provides a more pleasant working environment (to distract from the boredom of sitting at a desk all day).

Why do I block Skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports. Which means you allow any random program to do any random shit through your firewall to the outside network. It's a massive, massive security issue you could drive an oil tanker through.

Also, many companies pay for bandwidth. I don't want all of my bandwidth chewed up on video calls instead of mission critical apps.

It's not just because we're nazis, it's because the Skype protocol is completely fucked when it comes to the ability of your admin to control resources. Want VOIP/video? Use something else.

Just let me clarify: corporate networks are different from your home network. Your home network? Fine, use Skype. In the office, where you've got several hundred PCs that may or may not have malicious software and/or users at the helm, allowing all outgoing connections is just begging for trouble.

Egress filtering is a good thing.

Making your day at work "less boring" by enabling you to do non-work related shit with company resources is not what my job is about. It is about ensuring the continued operation of the company's network - and skype is a liability.

Careful there, BOFH. Here I'll help:

Making your day at work "less boring" by enabling you to do non-work related shit with company resources is none of my business. Get it requested through the proper channels and you can have it. I don't make the business decisions here, I just do what the company needs done to be successful.

Look, I'm all for business driven IT, but sometimes you have to save your managers from themselves. It's not being a BOFH to look out for the corporate network. You were hired to have the expertise to make recommendations and keep things as secure as possible. If it gets shoved through anyway then it may be time to start looking for someplace that actually values your skills.

Deep packet is the only way to block Skype (or so I've heard.) The real danger lies not in the voice/videoconferencing but in the potential for tunneling and/or circumvention of data loss prevention controls.

Its a massive, massive security issue you could drive an oil tanker through.

Oh, come on. Sure, egress filtering is a polite thing to do, but it's inbound connections that put you at risk. And chances are, if you do fall victim to some nefarious piece of malware that's making unwanted outbound connections, simple packet filtering will be useless anyway because it will fall back to TCP 80, or TCP 443, or even UDP 53, to tunnel out. Just l

Because Skype wasn't written that way. You want standard voice/video, use a SIP program. Skype was written deliberately by the developers to allow it to talk to anywhere and everywhere through your network so it can route other people's calls, and connect to random other nodes for your own call routing. That free lunch you're eating? Paid for by others' use of your bandwidth.

Multiply 500 users by 48 kbit. That's 24 megabit in streaming audio. That you can get off that fucking $10 FM radio on your desk. Now I'm not sure how expensive bandwidth is where you are, but a business-grade 24 meg METERED (say, 300 gigs) internet connection here is about 5-10 grand a month. The business is not going to wear the cost of 5-10k per month for our users to listen to shitty quality streaming MP3. That's before you take into account the increased latency to mission critical apps, or remote end points on crappy satellite connections paying anywhere up to $7 per MEG of data
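For anyone who wants to check the numbers in the post above, here is a rough Python sketch of the arithmetic. The 500 users, 48 kbit streams and 300 GB cap come from the post; the 8-hour day and 22 working days per month are assumptions added purely for illustration.

```python
def aggregate_mbit(users: int, kbit_per_stream: float) -> float:
    """Total bandwidth in Mbit/s if every user runs one stream (1 Mbit = 1000 kbit)."""
    return users * kbit_per_stream / 1000.0


def monthly_gigabytes(mbit_per_s: float, hours_per_day: float, days: int) -> float:
    """Data transferred in a month at a constant rate (Mbit -> MB -> GB, decimal units)."""
    seconds = hours_per_day * 3600 * days
    return mbit_per_s * seconds / 8 / 1000.0


total = aggregate_mbit(500, 48)           # figures from the post
volume = monthly_gigabytes(total, 8, 22)  # assumed: 8 h/day, 22 working days
print(f"{total:.0f} Mbit/s, roughly {volume:.0f} GB per month against a 300 GB cap")
```

Under those assumptions the streaming alone would exceed a 300 GB metered allowance several times over, which is the point the post is making.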

Any decent company I've ever worked with would have separate internet links for the "mission-critical" stuff and the regular internet traffic.
They would have a dedicated link to the servers but users would have access to the internet through regular consumer broadband.
Works great, you get the best of both worlds. Maybe you should leave your BOFH nest and consider this option and try to become less hated by your users (I know I would hate you).

While I think that comparing banning streaming music to record burnings is a bit over the top, you do make a good point about bandwidth. The cost of the bandwidth for audio streaming is trivial on a per user basis. Decent companies spend dramatically more than that to try to make work a pleasant place to be. Even crappy places to work often spend more than that. The claim "It's company equipment, so you shouldn't be using it for personal things." is basically a company statement that working for them shoul

Back in the day I worked at a place that banned streaming audio because one day there wasn't enough bandwidth for the actual business applications to go about their business when everyone was listening to their streamed music.

In places where DSL or cable internet is cheap, it seems basic common sense to have a "toy" internet connection with a wireless router. That's like $25 a month per 100 users (that's what we have where I work).

Note that I'm not suggesting 100 people could actually use it at the same time, but out of 100 people actually working, maybe 100 use any real bandwidth at once.

Within the network I manage, it boils down to bandwidth, security, and slacking off.

We have two large offices and a few small offices. All of the internet traffic is routed through the WAN to the main office that has a 10Mb link which is shared with our internet facing servers. The other large office acts only as a backup and has a 5Mb internet connection. The WAN links are 3Mb with the exception of the main office having a 6Mb one. Regular business WAN traffic is a steady 1Mb across the board with the usua

Well, we don't block Skype here... Though we do block streaming radio. I can give you a couple good reasons for both.

1) Bandwidth. A service like Skype or streaming some radio station may not actually take all that much bandwidth itself... But if you've got 10 or 100 or 1,000 folks using it simultaneously the bandwidth requirements get quite steep. And it's unnecessary bandwidth. You could pick up your phone and not hit the Internet, you could turn on a regular radio and not hit the Internet. Busin

Sorry if this is off topic or an ignorant question, but how does Skype define supernodes? Does the company just randomly choose users who are online a lot and declare them supernodes without the owner's knowledge, or is there some other process? Cheers

I was merely suggesting that it's just fine and dandy as far as SKYPE the company goes to rip people's bandwidth off. If you cbf reading the license and just click OK for the free shit then you deserve whatever raping you get. Nothing is free.

The information is there for you to read, should you care to look it up.

This article will be "google-able" within a week by 100% of the internet.

People who have read about this outage will be more informed, should anyone care to ask around "about this Skype thing I've heard about" and information from geeks is immensely free-flowing (sometimes you can't shut us up).

Don't go shouting and swearing at the GP who is trying to point out that lusers - don't fucking care - about the stuff we try and tell th

the lusers as you call them are whom the internet is for. the point is to make the internet to their standards: none and few, rather than making it to your standards: computer science majors only. the internet is not an exclusive club for the technically sophisticated

think of it as an engineering exercise in robustness and hardiness and elasticity in the face of abuse. because your current inferior attitude that some sort of technical proficiency is required to use the internet is a standard that will simpl

Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.

And in hindsight (I don't know that they should be blamed for not considering this before), the number of supernodes should probably be ~100-150% more than needed to service expected load. That way, if a third of them die, they _still_ have more than needed to handle the expected load. (And thus, hopefully, more than needed to handle the excessive load without causing them to shut down).
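The headroom argument above can be checked with a couple of lines of Python. This is only a sketch of the poster's reasoning, assuming all supernodes have equal capacity and that load stays at the expected level; `surviving_capacity_ratio` and `headroom_needed` are names made up for this example.

```python
def surviving_capacity_ratio(headroom: float, loss_fraction: float) -> float:
    """Surviving capacity divided by required capacity.

    headroom=1.0 means provisioning 100% more supernodes than needed;
    loss_fraction is the share of supernodes that die at once.
    """
    return (1.0 + headroom) * (1.0 - loss_fraction)


def headroom_needed(loss_fraction: float) -> float:
    """Smallest headroom that still covers expected load after the loss."""
    return 1.0 / (1.0 - loss_fraction) - 1.0


for headroom in (1.0, 1.5):
    ratio = surviving_capacity_ratio(headroom, 1 / 3)
    print(f"{headroom:.0%} extra, a third lost -> {ratio:.2f}x required capacity")
```

A ratio above 1.0 means the survivors can still carry the expected load. It says nothing about a surge in traffic on top of that, which is what the article describes.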

If they had the power to stop bugged clients from becoming supernodes, why not just use that same power to make them patch? You're sort of assuming that they ever imagined that this could have happened. It's pretty clear that they didn't...

It's subtle, but it's there at the bottom where they admit 'we need to test our crap first and we need some way of making people patch' - which is kind of a known thing in the modern software world.

Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes

OK, FTFA

Approximately 40% of all Skype users that were online crashed, taking down around 30% of all supernodes.

- so supposedly this means that 30% of the supernodes went offline due to the bug, is this correct?

But look at the number: 40% of ALL Skype users went offline! That's insane, that's almost half. At the same time ONLY 30% of the supernodes went offline due to this bug, right?

Something does not add up.

FTFA:

Clients that continued to be up and running, and clients that restarted the application had their network searches directed to the supernodes still running, leading to an overload of those. Since Skype has in place a protection when a supernode is overloaded, so it would not consume too much of a client’s system’s resources, the supernodes started to shutdown automatically one after another, leading to a generalized failure of the network.

- so the sequence of events is supposedly this:

1. Bug causes 40% of all Skype clients to stop functioning, this includes 30% of all supernodes.
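The cascade described in the quoted passage can be illustrated with a toy model in Python. Everything here is an assumption made for the sketch: load is split evenly across live supernodes, each supernode has a fixed capacity, and a supernode shuts itself down as soon as its share exceeds that capacity. It is not Skype's actual protection logic.

```python
from typing import List


def simulate_cascade(capacities: List[float], total_load: float,
                     max_rounds: int = 1000) -> List[int]:
    """Return the number of live supernodes after each round of shutdowns."""
    alive = sorted(capacities)
    history = [len(alive)]
    for _ in range(max_rounds):
        if not alive:
            break
        per_node = total_load / len(alive)                # load is spread evenly
        survivors = [c for c in alive if c >= per_node]   # overloaded nodes shut down
        if len(survivors) == len(alive):
            break                                         # stable: nobody is overloaded
        alive = survivors
        history.append(len(alive))
    return history


# Ten supernodes with capacities 10..19 carry a load of 100 without trouble.
print(simulate_cascade(list(range(10, 20)), 100))
# Lose three of them up front and the rest are pushed over their limits in turn.
print(simulate_cascade(list(range(13, 20)), 100))
```

In this toy model the initial loss is only three nodes out of ten, yet each round of protective shutdowns raises the load on the survivors until none are left.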

"At its core, Skype relies on a third generation P2P network that has lots of peer nodes and a number of supernodes, one for several hundreds of nodes. Since Skype does not have a centralized directory to support finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client enters Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients who might want to establish a communication."
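A minimal Python sketch of the "supernode as directory" idea in the quote, to make the register/lookup roles concrete. The class and function names, the example users and the IP addresses are all hypothetical; the real protocol is proprietary and certainly more involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Supernode:
    """Hypothetical supernode holding a directory of the clients registered with it."""
    name: str
    directory: Dict[str, str] = field(default_factory=dict)

    def register(self, user: str, ip: str) -> None:
        # A client announces its IP address when it comes online.
        self.directory[user] = ip

    def lookup(self, user: str) -> Optional[str]:
        return self.directory.get(user)


def find_user(supernodes: List[Supernode], user: str) -> Optional[str]:
    """Ask each live supernode in turn (a stand-in for the real network search)."""
    for node in supernodes:
        ip = node.lookup(user)
        if ip is not None:
            return ip
    return None


sn_a, sn_b = Supernode("sn-a"), Supernode("sn-b")
sn_a.register("alice", "198.51.100.7")
sn_b.register("bob", "203.0.113.9")
print(find_user([sn_a, sn_b], "bob"))  # found while sn-b is up
print(find_user([sn_a], "bob"))        # sn-b is gone, so bob cannot be found
```

The second lookup shows why losing supernodes hurts: the peers are still online, but nobody can find them.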

Skype is a peer-to-peer network? Like torrent? So the supernode is like a tracker website, to connect peers to one another? No supernode==no tracker==no calls going through. Hmmmm. Maybe they should try DHT.

Lots of users were using an old outdated buggy version of Skype, lots of client crashes at once brought down big chunks of the P2P network, the remaining network couldn't handle the load and went down too, and it took a while for Skype to put its own supernodes up to help get the network self-sustaining again.

They're considering an auto-update feature now since such a feature could have kept this from happening. Personally I think old versions should be blocked from making or receiving calls too, so users would be encouraged to update (works for Team Fortress 2). Of course auto updates would make updating super easy anyway so impact from that would be minimal.

The problem with the auto-update feature in Skype vs. gaming is that most gaming computers will be close to top-of-the-line. Most computers used for Skyping will not be top of the line.

From experience, the 5.0 version of Skype doesn't work as well as the 3.8 branch. Switching between windowed and full-screen video on the 5.0 branch takes ~4 sec to accomplish, with the audio becoming choppy at the same time. In addition, the video is choppy and audio quality is scratchy at best. The 3.8 branch doesn't have t

...unless you need something in the newer version (feature, security update etc.). Of course us geeks like to have the latest to fiddle with, but for the average Joe end-user, if it ain't broke, don't fix it. There is always the risk that the newer software will contain new bugs. At one point the buggy version of the Skype software was the latest version and was what users were being pushed to upgrade to. If the crash had happened then, I wonder if they'd find a new way to scapegoat users.

By the way, new versions breaking existing functionality isn't theoretical, or rare. I'm currently installing software on my new laptop. I've had to downgrade both Zonealarm and Virtualbox. The former broke remote desktop. The latter broke file sharing. No idea why, but in each case uninstalling and installing an older version I knew worked fixed the issue for me.

The problem is that it is broke, you just often don't realize it. Older doesn't mean more secure or more stable inherently. New versions fix bugs discovered in old versions. If everyone did update immediately, then everyone would have had the bug fix and this outage wouldn't have happened.

You're suffering from sample bias. Newer software is also 'broke' and you also don't know that. I think the point would be, if it is 'broke' but not impacting you in a way that you'd know it, do you care? In some cases yes, in other cases no.

How about they release some supernode-only software that people can set up on a server, and possibly the ability to set up Skype to use a preferred supernode? Then a business could set up a supernode of its own and point its users to it. But that supernode would also be part of the collective of supernodes and route Skype connections for everyone else too. This would hopefully give Skype more supernodes out there that are up 24/7 and are not desktop computers routing the traffic.

If problems with the client can lead to problems with the server, then the server system lacks robustness. For applications like this the servers should be practically immune to any client state muck-ups.

There's an exception to the client-server divide, and this is a classic example: if your mistake causes a big chunk of your client base to DoS your infrastructure, it's going to go down, no matter how good your infrastructure is.

The QA of this release is way down. On top of that, skype auto-updated people from 4.0 to 5.0. Within a few days, the buggy 5.0 had enough penetration (50%) to bring them down.

The Windows client has been widely reported to:
- consume 2x as much CPU (33% to 60% on mine after upgrade)
- leak RAM (starts out OK, but after some use over 1.5 gig is needed)
- have a slow GUI, so the fade effects on some computers (mine) cause video tearing; it is no longer possible to run full-screen (320x240 is all I get before tearing sets in)
- render the fonts in the video area incorrectly

It should be noted that I have an AMD X2 1.6 and a Radeon 1200 card in this computer. It's not shabby. But the 5.0 client brought it to its knees.

It plays SCII just fine (albeit on the lowest setting).

It comes at a bad time when they are trying for more corporate agreements, but can't run on my 3-year-old hardware.

Back when I was doing one of the first VOIP solutions (this one mostly for LAN use) we dreamed up something like Skype, that would work in similar fashion. The big advantage is that it could be done by any reasonably large group of users and no phone company at all need be involved -- no charge to anyone, no control over anyone by some big monolithic corp. It could still be done, and I wonder why no one in the open source area has managed?
Critical mass issue; selling the first phone is a bear -- who you

I hate when apps run auto update daemons. This is precisely the reason why I don't use any Google desktop software on my computers.

The proper thing to do in this case is simply to refuse logins, with a message telling users they need to upgrade their client if they want to continue to use the app. Simple thing to do, rather than each app running a daemon. Soon enough there will be a hundred update daemons on each user's computer, eating resources, connecting online all the time and bogging down the user experience. Thanks but no thanks. I refuse to use any of those.

"We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down"

Maybe I'm missing something, but why are supernodes coded to shut down during increased load instead of simply throttling requests? It seems like the idea of 'too many requests, shut down' is what caused the cascade. Can someone enlighten me as to why this is the preferred overload handling mechanism?
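For comparison, this is roughly what "throttle instead of shut down" could look like: a token-bucket limiter sketched in Python. It is a generic technique, not anything from Skype's code, and the rate and burst values are arbitrary. The clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill tokens for the time that has passed, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # serve this request
        return False      # reject (or queue) it, but keep the node running


bucket = TokenBucket(rate=2, burst=3)
print([bucket.allow(0.0) for _ in range(5)])  # burst is served, excess is refused
print(bucket.allow(1.0))                      # a second later there are tokens again
```

An overloaded node that rejects excess requests stays in the pool, whereas one that shuts down hands its whole share of the load to its neighbours.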

Assume your socket connections will always work, and don't bother handling errors, throttling or connection requests; it's the cheapest, easiest way after all. It's probably not even "too many requests, shut down" but "too many requests, crash". Once there, ship and let your users be damned.

Only in this case, the company found out why you should hire the best devs you can and not the cheapest. If your business is software, you need to treat it like an asset, not a cost.

One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one.

Yeah. Right. Because all recent Skype updates (starting with version 3(?)) were known to contain mostly only one of these: more ads or more UI bloat. And occasional breakages.

There is an option between "auto-update" and "update when you want": deprecated versions. If a version has a known major bug in it that could compromise the system, require updates for only those versions. That way only the bad version will be replaced and we won't be updating everyone at every release. The main advantage is that the system is kept safe without unnecessary updates.
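A small Python sketch of that "deprecated versions" idea: the login check refuses only clients on a known-bad list and leaves everyone else alone. The version string in `DEPRECATED` is a placeholder for illustration, not a real Skype release number, and `login_decision` is a made-up name.

```python
from typing import Set, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as '5.0.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def login_decision(client_version: str, deprecated: Set[Version]) -> str:
    """Force an update only for versions explicitly marked as deprecated."""
    if parse_version(client_version) in deprecated:
        return "update-required"
    return "ok"


DEPRECATED = {parse_version("5.0.0.1")}  # placeholder for a known-bad build

print(login_decision("5.0.0.1", DEPRECATED))  # refused until the client updates
print(login_decision("4.2.0.3", DEPRECATED))  # older but not deprecated: allowed
```

The same check could in principle gate which clients are eligible for promotion to supernode, as suggested elsewhere in the thread.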

NAT is evil. Skype needs to build an overly complex networking protocol because too many people are behind NAT gateways. Skype *could* probably get away with their basic available hardware if only they got to design for a NAT free world.

One could also say they were trying to cheap out and not invest as much hosting required to assure reliability of their chosen networking architecture.

Of course, on the flip side, Skype as a service would be nearly useless in a NAT-free world. No need for a coordinating e

Approximately 40% of all Skype users that were online crashed, taking down around 30% of all supernodes. Clients that continued to be up and running, and clients that restarted the application had their network searches directed to the supernodes still running, leading to an overload of those. Since Skype has in place a protection when a supernode is overloaded, so it would not consume too much of a client’s system’s resources, the supernodes started to shutdown automatically one

A non-telephone company had a cascading problem with its ad-hoc peer-to-peer networking that provides telephony and video services at costs way below any telephone (or cable) company. The company is profitable enough to make its own way in this world.

This story was broadcast pretty-much worldwide by all media.

The non-telephone company was embarrassed and released a statement to the media about how this happened as a means by which it might encourage everyone to download new, fr

Google video chat, perhaps? Or maybe acknowledge that it's fairly impossible to provide both 100% uptime and free video chat at the same time, without the resources of a major player behind you to promote goodwill?

Seriously, they were down for some percentage of the people for 1% of one year, during which time many competitive products were available. This is not an earth-shattering catastrophe.

I think we're talking about better up-time than that for Skype. If we believe the outage numbers presented on their Wikipedia page (http://en.wikipedia.org/wiki/Skype), they've had a total of 72 hours of down-time since the initial release in 2003. Assuming a 100% outage in all cases (which was not the case here), their up-time works out to something like:

99.9988%

Seven years and 72 hours of total down-time... It might not be five nines, but does seem a pretty respec
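The calculation can be reproduced in a few lines of Python, using the poster's inputs of 72 hours of down-time over seven years. The 365.25-day year is an assumption for the sketch.

```python
def uptime_percent(downtime_hours: float, period_years: float) -> float:
    """Up-time as a percentage of the whole period (365.25-day years assumed)."""
    total_hours = period_years * 365.25 * 24
    return 100.0 * (1.0 - downtime_hours / total_hours)


print(f"{uptime_percent(72, 7):.4f}%")
```

With those inputs the result comes out near 99.88%, which is between two and three nines, so the 99.9988% quoted above is worth rechecking. The gap looks like a down-time fraction of about 0.0012 being read as 0.0012%.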

The uptime of Skype to the user is the product of Skype's uptime, that of the user's Internet service, that of her electrical service, and that of her hardware. That product might exceed one 9 but it won't come near 5 9s.
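The product rule in the post above is easy to try out in Python. The individual availability figures below are invented for illustration only; the one thing taken from the post is that the pieces multiply.

```python
from math import prod
from typing import Iterable


def end_to_end_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up at once."""
    return prod(components)


# Assumed, illustrative availabilities: service, ISP, electricity, hardware.
chain = [0.9988, 0.995, 0.999, 0.99]
print(f"{100 * end_to_end_availability(chain):.2f}%")
```

The end-to-end figure can never be better than the weakest link, so a five-nines service behind a two-nines connection is still at best a two-nines experience.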

Well said. Skype is primarily a piece of technology aimed at the individual consumer. It is made completely clear at the outset that it doesn't claim to be a landline replacement, so anyone who lost business as a result of the outage doesn't get much sympathy from me.

The downtime period for me was about a day and a half, which amounts to 0.41% of the year. No biggie, I have SIP and mobile alternatives. Or both if I run a SIP client over my wireless internet dongle or phone tether.

Skype For SIP is the perfect way to integrate Skype with your existing PBX, allowing the communications from your PBX to be complemented by Skype functionality – head over to the Business blog to find out more about the Beta programme.

Somehow I don't think PBX interoperability is aimed at the consumer market. (though SIP support might help some consumers)

It's sheer laziness to not patch your software. Yes, sometimes, a buggy update is unleashed upon the world. However, this is a case in point against running unpatched software.

No, commodore64 is right. There needs to be a reason to patch and that reason needs to outweigh both the hassle of doing it AND the risk that something new will be broken.

If you're not handing over fresh new dollar bills for a piece of software, expect it to be assembled with the bare minimum effort. This includes all patches. The likelihood that one of these will suck worse than the problem it's attempting to fix is very, very high.