
brajesh writes to tell us that Skype has blamed its outage over the last week on Microsoft's Patch Tuesday. Apparently the huge numbers of computers rebooting (and the resulting flood of login requests) revealed a problem with the network allocation algorithm resulting in a couple days of downtime. Skype further stressed that there was no malicious activity and user security was never in any danger.

Yeah, but Patch Tuesday usually involves a dozen patches or less, any handful of which (2-3) might apply to any one system. This one included more than 50 patches, 12 of which were needed by most computers in my office.

It depends. For the US East Coast, the patches are installed overnight - which would place Europe in the early morning hours and West Coast in the evening, and the machines are rebooted. The vast majority of such computers then hang at the login screen, so Skype doesn't load. I'd expect the bulk of the oncoming Skype traffic to come at 9AM GMT and then again at 9AM Eastern Time when people log into their workstations.

installing two patches, two dozen patches or even two thousand patches...

You still typically need to reboot when done. In this case, I don't think the load should have been a big issue - other than what was mentioned by another reply, namely that it would increase the variance of time for when the reboots occurred (differing connection speeds). This would actually be to the advantage of Skype I'd think.

Skype said the problem wasn't the specific patches, but the fact that everybody rebooted at once. Patch Tuesday doesn't always require rebooting your machine, but my home machine got rebooted; my work machine also rebooted but sometimes that's because of what else my IT department wants to do when they're downloading the Microsoft patches, so it's hard for me to tell.

Maybe the average machine had more downtime on this month's reboot? Or the reboots happened in a more concentrated time window?

Under the circumstances, I think it was funny that they recommended leaving the client running in order to reconnect automagically once the login service was fixed. Sounds like a bad idea while having login issues...
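
On the reconnect worry: the usual way to keep a crowd of clients from hammering a recovering login service is randomized exponential backoff. Nothing in this thread says how Skype's client actually spaces its retries, so the sketch below is purely illustrative; `backoff_delays`, `base`, and `cap` are made-up names and values, not anything from Skype.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Uses "full jitter": each wait is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed at the
    same moment are unlikely to retry at the same moment.
    All parameter values here are hypothetical.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client that kept running during the outage would sleep for each delay in turn before trying to log in again, which spreads the retries out instead of stacking them.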

But then, if it was the resulting flood of log-ons that caused the problem, either a whole lot of people all got on their computers and logged in to Skype at the same time, or a whole lot of people had their computers reboot after applying the patches all at the same time and had them set up to automatically log in to Windows and Skype (and last time I checked, you needed TweakUI for the former). Either one seems pretty unlikely to me...

I had to leave town and usually leave Thunderbird up and running to filter my mail on my IMAP account so my laptop syncs without having to redo all the filters I have in place. After no reboot on Tuesday I was relieved that I wouldn't have an issue with a down T-bird unless the power went out - which never happens unless I leave town (happened only once before). Sure enough, none of my mail is filtered after Thursday. Come home this morning and see "Your computer has been recently updated" balloon.

Given that this baby [wiretap law] was steamrolled through the Congress two weeks ago, the outage seems coincidental.

Interesting point, but Skype is based in Luxembourg and has no obligation to US law. Then again, they are owned by eBay, but just because they are owned by a US company does not mean much: they do not have to follow every shareholder's local law.

You are so, so wrong. If a US company owns them, then they are subject to US law. This is to prevent US based companies from just setting up a shell and providing services to, say....Cuba or any other restricted country. There are countless examples of subsidiaries getting in trouble for things that are illegal in the US -- but not where their offices are.

Otherwise, Foster Wheeler would just set up a shell in another country and start building refineries for Cuba.

I, personally, know of companies who have gotten into trouble when their equipment, somehow, found its way to a restricted country (Cuba, Sudan, Syria, Iran, etc). The US treasury department publishes a list. [doc.gov] Admittedly, this is only the voluntary actions but I am certain there are involuntary actions as well (i.e., criminal cases). See the entry about Varian (Switzerland) for a specific example of what I am talking about.

"You are so, so wrong. If a US company owns them, then they are subject to US law. This is to prevent US based companies from just setting up a shell and providing services to, say....Cuba or any other restricted country. There are countless examples of subsidiaries getting in trouble for things that are illegal in the US -- but not where their offices are."

Or the other way round... In Norway, denying services due to e.g. nationality is illegal. If a US owned company operating in Norway does not serve Cuban customers, they could face discrimination charges. As they should, US law should not apply here.

"Given that this baby [wiretapping law] was steamrolled through the Congress two weeks ago, the outage seems coincidental."

Consider that Skype could not tell the users of the real reason even if they wanted to: the law mandates that the forced cooperation be kept in secret.

Yes, the US government ordered Skype (a UK company, btw) to shut down for two days and blame it on Microsoft, and they complied. Hint: The aluminum foil goes on your head, not crammed forcibly into your ear.

Except that that's not the problem. Skype uses the resources of its users to do everything, and when a huge portion of their users go offline simultaneously, then log back on at the same time, then no "logon servers" (read: network peers) are available.

Something was different last week wrt Microsoft. I had six servers reboot that had autoupdates turned off. My desktop system running 2003R2 and my laptop running XP also rebooted w/o my permission. We have quite a few pissed-off customers because of the updates. It was an unusual situation.

It just goes to show that you DON'T have control over your machine when it's running Microsoft Windows and it's on the internet. We have seen problems that result from this level of consumer trust in Microsoft before. I just have to wonder how much more consumers will tolerate. Seems like plenty, since most people think that anything Microsoft does is normal.

For the love of God editors, I understand that it is fine to write a sensationalist title on some articles but that is blatantly FALSE. It is a complete LIE. People at Skype specifically stated that the fault was in *their* log-in mechanisms.

Really this kind of journalism is disgusting... I am tagging this story as LIE which I hope other people do as well, unless editors change the title.

I find it hard to believe Slashdot has sunk so low... this and the speculative digg-like "articles" ending with a question mark "?". What the fuck.

"For the love of God editors, I understand that it is fine to write a sensationalist title on some articles but that is blatantly FALSE. It is a complete LIE. People at Skype specifically stated that the fault was in *their* log-in mechanisms."

Really? So when they said [skype.com], "[t]he disruption was triggered by a massive restart of our users computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update", they didn't really mean it?

Come on, just admit that you're wrong. It was a fault with their auth service in the sense that it wasn't able to cope with a Patch Tuesday-induced slashdotting that it wasn't designed for.

After watching Sicko, I am now very afraid to live in the USA. How can you live there?

They're "running" on whatever you're running on. Skype runs distributed across the network of PCs belonging to its users.

Skype's model is somewhat controversial. My own company does not allow employees to run Skype on company issued laptops because the closed code is running distributed and there is no way of knowing where company confidential conversations might be landing.

There's a difference between a reason and an excuse. The *reason* the network went down was related to the MS patches. That's not an excuse -- Skype admits there is no excuse, and is now fixing their code.

No, it's just another example of your moronic blinding hatred of a company. EVERY other software distribution has [frequent, but not necessarily monthly] updates that require a restart like this. Sane software distributions make fixes available as soon as they are ready [including Microsoft, for sufficient values of criticality]. For marketing and big dumb company reasons, Microsoft saves them up for a once a month ordeal instead of letting users have things in a timely fashion and choose their time and s

The minute I saw the headlines on some of the blogs about this, I KNEW it'd be on Slashdot with the same misleading headline.

Normally Skype's peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

SKYPE is blaming Skype for the outage quite contrary to the completely misleading headline on this article.

No, I don't know better [slashdot.org]. They have taken some part of the blame but a M$ anomaly was the initiating cause. To be fair to Skype you have to admit that 85% of the world's computers turning off at the same time is not an event a normal person would predict nor could such an event be tested in advance. M$'s synchronized forced updates are a menace.

I can imagine that an awful lot of people rebooted and logged back into the service crashing their servers. It seems to me that this type of thing should be built into surge capacity so that if the servers started getting hammered, they would just bounce the users that they could not handle while sending back a message saying the server was busy and to try again later. Other services do this. And it's not like patch Tuesday isn't well known. It sounds like bad planning on their part. A large scale power outa
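
The "bounce the users you can't handle" idea above can be sketched as a simple admission gate. This is only an illustration of load shedding under assumed numbers; `LoginGate`, its per-tick capacity, and the reply strings are hypothetical and say nothing about how Skype's login service is really built.

```python
class LoginGate:
    """Admit at most `capacity` logins per time slice; turn the rest away.

    Hypothetical sketch: a real service would tie ticks to a clock and
    track far more state than a single counter.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.admitted_this_tick = 0

    def next_tick(self):
        # Start a new time slice with a fresh admission budget.
        self.admitted_this_tick = 0

    def try_login(self, user):
        if self.admitted_this_tick >= self.capacity:
            # Shed load cheaply instead of queueing work we cannot finish.
            return ("busy", "server busy, try again later")
        self.admitted_this_tick += 1
        return ("ok", f"welcome {user}")
```

The point of the design is that a rejected login costs the server almost nothing, so a surge degrades service for some users instead of taking it down for everyone.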

It was just a few days ago the Open Source elders asked people to stop bashing Microsoft.
Skype did not blame Microsoft for the outage. They admitted the fault was in their software. We are not children here or part of a cult. This type of child's play is not appreciated here.

Skype blames global warming on Colonel Mustard. In the conservatory (greenhouse). With the pipe. Since Colonel Mustard callously smashed all the windows in the greenhouse, it released all sorts of greenhouse gases into the environment thus dooming all the gay, baby polar bears unless the polar bears cooled themselves off by running the AC units of their Hummers at full blast. Why does Colonel Mustard hate the environment?

Yes, but the reboots would be clustered near the supernodes, because of the correlation between latency and time zone (nearby systems have lower latency to each other and are likely to be in the same time zone). So there would be a rolling overstress of their P2P architecture.

That's because this isn't a vulnerability. Furthermore, MS only has the power to reboot machines when explicitly granted that power. But the rest of your post makes sense. Oh, wait, that nonsense comprises the entirety of your post. Nevermind.

Yes, that's a power they explicitly grant to themselves by default. I'd imagine that 80% of Windows boxes are set to reboot automatically. After all, some would argue that's the way to ensure the patches get loaded.

Well that's an easy one. We'd format them and install Linux instead, so it can't happen to our friends again.

Of course, we'd put Windows right back on for our customers, since 2 hours sitting on your ass and getting paid for it is always good, and Windows virtually assures you'll get to do it again in the future, too.

"The disruption was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update."

I think this demonstrates the goofiness of a p2p telephone system. If I use Skype, I depend upon my data flowing through other users' computers because I am too dumb to allow incoming VOIP connections to my computer. VOIP connections should be direct encrypted connections from my computer to the computer of the person whom I wish to contact. Period.

<sarcasm> Sure, we can do that. Just before you make any calls we'll need you to lay copper directly from your location, to the location of the person you are trying to reach. </sarcasm>

Hello, it's the freaking internet, your call is going to get routed to hell and back. Encrypted or not, you're going to be bouncing from routers to ISPs, to backbones, and back down the other side, and depending on your flavor you may even have a 3rd party provider to talk to in the loop.

That's exactly what Skype does.
All voice (video/chat/file) flows are encrypted, and they go from you to your party.
Only if both of you are behind a NAT and/or firewall does Skype route the call through another node.
If you want more info, have a look at
"Revealing Skype Traffic: when randomness plays with you" and references therein...
http://www.sigcomm.org/ccr/drupal/?q=node/245 [sigcomm.org]

"On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update." This has been going on for years now. You will note that the outage occurred on *Thursday* August 16th. Microsoft's patching schedule is every Tuesday. Typically computers reboot on Wednesd

Gee, I hope no one tried to call 911 during the outage. That "enhanced" (insert guffaw, it's like calling a hamburger without the meat and just a bun "enhanced") 911 didn't do a tinker's damn worth of good for anyone whose service was out.

This is why I won't even consider VoIP. Why in the world would I want to take risks like this? I live in a house my family has lived in for over 60 years, with the same old phone line and it's NEVER GONE DOWN IN SIXTY YEARS! A couple of times a month my Internet craps out, though, though usually for less than an hour. And sometimes the router needs to be reset, like many people find they have to do periodically. What happens if I need 911 during one of those times, and I can't get around it?

"Internet phone", "digital phone" whatever they want to call it, anything but a REAL land-line from the local phone company is a substandard service by definition. They can throw whatever words out there to make it sound super-dooper, but it's a substandard service just like anyone who experienced this outage can tell you.

"I live in a house my family has lived in for over 60 years, with the same old phone line and it's NEVER GONE DOWN IN SIXTY YEARS!"

It's not as simple as you describe. For example, in the United States at least, a large number of landlines were unable to initiate any phone calls on September 11, 2001, whereas internet based services such as e-mail had no problems on that day.

Even for people who need a landline for 911, VoIP is still a useful complement for a landline. You can use VoIP for calling overseas, and the landline for local calls. In fact, you don't even need to subscribe to a VoIP service -- any calls that you place overs

There are already stories that when Verizon installs FIOS, they conveniently remove the copper wire connection that has served you so faithfully for sixty years. If you ask them to leave it in place they are supposed to honor the request, but other stories suggest that if you aren't physically present when they install the service that request is apt to get overlooked. The ultrareliable telephone service the U. S. has known for about a century is going away. It just doesn't make much money for the carriers,

How do you know your phone service has never been out in 60 years? Do you monitor it? How many calls a day do you make? Are you home 24/7 and do you use the phone all the time, as in more than 10,000 minutes per month?

Sure, you've never been affected by an outage of your phone service, but that doesn't mean it hasn't been out of service ever.

Plus, you pay for it too. At $30-40/month per line, you expect minimal outages. When you are paying $30/year or even nothing, a two day outage, while annoying, isn't surprising, especially when operated on a public network. Your phone line is on a private, dedicated network. You simply can't compare the two when it comes to uptime.

If all of Skype's customers paid $30-40/month, I'm much more confident that they wouldn't have had this outage.

"How do you know your phone service has never been out in 60 years? Do you monitor it? How many calls a day do you make? Are you home 24/7 and do you use the phone all the time, as in more than 10,000 minutes per month?"

Nice attempt at deflection of the topic, but the answer is very simple. No one who has lived in my house in 60 years has ever picked up the phone and it not worked.

Reminds me of the late 90s, when AOL's crashing mail servers ended up bringing down my university's server (and those of many other organizations) because of the surge of load when AOL came back online and started sending backlogged mail.

Does anyone know what OS those Skype servers are running? If the OS is Linux, then I blame Skype administrators. If it is any flavour of Windows, then I blame Microsoft. Now, some of you might say I am biased.

I don't remember where/when this happened, so it might be an urban legend. But the story is that many years ago an earthquake rattled a California town. No major damage was done, but it killed all the phones in the town for several days.

The earthquake had jostled thousands of telephones off hook. The central office switches survived the quake just fine, but crashed due to a bug that seems eerily like the one Skype just described. Basically the switch kept a list of phones that were off hook. The switch is responsible for playing "dial tone" to those phones, but the central office only had a certain number of units that could play dial tone and listen for dialing. So the first "n" phones off hook got dial tone; the rest were put into a FIFO list of phones waiting for dial-tone equipment.

There were so many phones off hook due to the earthquake that the FIFO list overflowed, crashing the switch.

When the switch rebooted, it had to figure out which phones needed dial-tone. So it had to examine each phone line in turn, putting the ones that were off hook into the queue for a dial tone...thus overflowing the list and crashing the switch again. And again. And again.

After a while the telco folks figured out what was wrong, but then couldn't tell anyone about it...since the phones were down. They eventually had police and fire trucks driving all over town, stopping to hang up all the pay phones that were jostled off hook, and blaring over megaphones for people to hang up their phones. :)

Eventually enough phones were hung up so the switch could reboot without crashing - end of crisis.
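
The crash loop in this story (possibly an urban legend, as the poster says) is easy to reproduce in a toy model. The sketch below assumes a fixed pool of dial-tone units and a bounded waiting list that crashes the switch when it overflows; every name and number is invented for illustration, and nothing here describes a real telephone switch or Skype's code.

```python
def boot_switch(off_hook_count, tone_units, queue_size):
    """Scan the off-hook lines once, as the rebooting switch does.

    Returns "up" if the scan completes, or "crash" if the FIFO of
    lines waiting for dial tone overflows. Sizes are hypothetical.
    """
    busy_units = 0
    waiting = []  # FIFO of lines waiting for a dial-tone unit
    for line in range(off_hook_count):
        if busy_units < tone_units:
            busy_units += 1          # this line gets dial tone at once
        elif len(waiting) < queue_size:
            waiting.append(line)     # this line waits its turn
        else:
            return "crash"           # list overflow takes the switch down
    return "up"


def reboots_until_stable(off_hook, tone_units, queue_size,
                         hung_up_per_round, max_attempts=1000):
    """Count boot attempts while phones are hung up between crashes.

    Returns the attempt number that succeeds, or None if the switch
    is still crashing after `max_attempts` tries.
    """
    for attempt in range(1, max_attempts + 1):
        if boot_switch(off_hook, tone_units, queue_size) == "up":
            return attempt
        off_hook = max(0, off_hook - hung_up_per_round)
    return None
```

With `hung_up_per_round=0` the function never succeeds, which is the point of the anecdote: rebooting alone cannot fix a failure whose trigger is still present on the next boot.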

If it is a flaw in the self-healing mechanism, then I don't know if this is such a good reason. A whole internet rebooting is something to be scared of though. I presume that can be helped by rebooting systems in some sort of time-schedule. I think the "mono-culture" thing is an interesting argument, but nobody is going to add or change operating systems because of this reason. So the argument is mostly academic. Furthermore, to solve this problem, you would need to replace the Skype mono-culture, not the Wi

"I think the "mono-culture" thing is an interesting argument, but nobody is going to add or change operating systems because of this reason. So the argument is mostly academic. Furthermore, to solve this problem, you would need to replace the Skype mono-culture, not the Windows mono-culture."

Yes, that's also a good point. My argument isn't specific to an operating system monoculture; it applies equally to an application-level monoculture. This is why I believe in multiple implementations around a central o

"you would need to replace the Skype mono-culture, not the Windows mono-culture." Not really. Why do you think that any exploit for Windows is so dangerous? Even then, if you think about it, the idea that EVERY Windows system is going to have to reboot on a certain day is just laughable.

While your point is valid it's not really relevant to this particular situation since it was a single implementation of VOIP that died.

Skype going down had zero impact on my life or my network. If a computer is relying solely on Skype for VOIP then your statements would be relevant to the story. This is why I have both Cisco VOIP and Vertical's VOIP implemented into my network. The Cisco as a backup to my primary PBX. It's not as functional but during a failure mode it will still allow us to call out and to

A hard-nosed person might say the real solution is to design a secure OS.

A reasonable one might suggest Slashdot change their misleading headline, and recommend Skype fix their network. It's not like this is the first Patch Tuesday in history, or the last.

It's convenient to blame Microsoft, I think. Skype knows all over the internets today people will be waxing poetic about "software monoculture" and "M$ Windoze is teh suxxorz" instead of questioning why a simple DoS they're supposed to be able to handl

Read the article. They are not blaming MS for the failure, they are blaming their own code. It was just because of the mass reboot that their own flaw became apparent. Headline is factually inaccurate.

Perhaps it would be troubling if they were blaming Microsoft. In this case they explained that the large number of simultaneous reboots and subsequent logins simply stressed their servers. They further stated that their "self healing" did not function as designed. It is strange that earlier "patch Tuesdays" did not cause this to occur, but as I write code I find that many behaviors I see in my applications are strange until I truly understand their root cause. It may have been that the software was resilient to a point and then just fell over. Perhaps the point that it fell over was when the "self healing" kicked in and hit its fatal bug.

Load testing is hard. I know. I used to do it. It is hard to anticipate what your peak load might be. It can also be hard to generate the right kinds and volumes of loads that your service might experience. Proper load testing requires a realistic test bed with enough machines running client simulation scripts to sufficiently load the machine. This requires a deep understanding from management that spending large amounts of money on non-production systems is essential. Your setup might deal with some kinds of load well and fail on others. Perhaps Skype had considered what might happen during a natural disaster with a large number of calls originating at the same time, but neglected to see login as a significant risk, especially if they had weathered that storm before.

My least proud moment in quality assurance was seeing my company's service go down for a weekend due to excessive database load. We had a new version of our web service software that required significant database changes to each user account (including database structure redesign...go ahead and wade through that hard book on database principles before you start coding my friends...funny, it's what I'm doing right now as I go from QA dude to programmer). We made an upgrade script that ran when each user logged in, which brought the user's data up to date with the current version of our software. The thing is I knew about the risk, measured a high load at user login, notified engineering about the potential problem, but didn't demand that the upgrade be placed on hold until the issue could be better quantified. Ah, live and learn.

Hey look, if I'm a skilled corporate comms officer -- and I have no doubt Skype has one of those --, and I have to lie about an outage, I'd do it so that it would be believable. All they had to say was: We recently upgraded our login server authentication routines, and in spite of our testing, we missed something.

The underlying problem with Skype has always been the auth server: everything has to go through it. Worse, when a supernode goes down (e.g., reboots due to a planned install), everything connected to that supernode has to go through it. Now, Skype has been growing pretty fast, pretty much every week their auth servers handle more traffic than the previous week. Your average user might not reboot all computers at the same moment, but what about big enterprises?

And how does Skype pick its supernodes? We know one of the criteria is bandwidth. So let's say in some part of the world where a bunch of little skype clients are wired to a few big bandwidth providers, patch Tuesday hits, and a bunch of those supernodes reset at the same time. The Auth server is hit with the traffic, not from the rebooting supernode, but from all the clients connected to it. That's "peak load" for your auth server, and it increases every patch Tuesday.
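
The fan-out argument above comes down to simple arithmetic, sketched here with invented figures. It follows this poster's model (clients of a rebooted supernode all hit the auth server), which other comments in the thread dispute; `peak_login_rate` and every number passed to it are hypothetical.

```python
def peak_login_rate(rebooting_supernodes, clients_per_supernode, window_seconds):
    """Average logins per second if every displaced client re-authenticates
    within the given window. All inputs are made-up illustration values."""
    displaced_clients = rebooting_supernodes * clients_per_supernode
    return displaced_clients / window_seconds


# Same number of displaced clients, squeezed into a tighter reboot window.
spread_out = peak_login_rate(1000, 500, 5000)   # 100.0 logins per second
bunched_up = peak_login_rate(1000, 500, 500)    # 1000.0 logins per second
```

Under this model the load scales with how tightly the reboots cluster, not just with how many machines reboot, which fits the earlier question about whether this month's reboots fell in a more concentrated window.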

...Huh? Why would this affect MS in the least, let alone strike "a huge blow against Windows on the workstation"? It's not as if this will happen every month: It was a one-off due to a bug in Skype's network algorithms, it's already been fixed, and the chances of it happening again are negligible.