Lessons Learned from Skype’s Outage

On December 22nd at 16:00 GMT, Skype services started to become unavailable, at first for a small number of users, then for more and more, until the whole network went down. The outage lasted about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage.

At its core, Skype relies on a third-generation P2P network that has many peer nodes and a number of supernodes, roughly one for every few hundred nodes. Since Skype does not have a centralized directory for finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client signs in to Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients that want to communicate with it. When someone wants to start an IM or voice/video session with another client, a supernode is queried, the target IP is obtained, and a direct communication link is opened between the two clients. Supernodes are vital elements of Skype’s network.
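The register-then-lookup role of a supernode can be illustrated with a minimal sketch. This is not Skype's actual protocol, which is proprietary; the `Supernode` class, its method names, the user names and the IP addresses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Supernode:
    """Hypothetical in-memory directory mapping user names to IP addresses."""
    directory: Dict[str, str] = field(default_factory=dict)

    def register(self, user: str, ip: str) -> None:
        # A client announces itself when it signs in.
        self.directory[user] = ip

    def lookup(self, user: str) -> Optional[str]:
        # Another client asks where a user can be reached.
        return self.directory.get(user)


supernode = Supernode()
supernode.register("alice", "203.0.113.10")
supernode.register("bob", "203.0.113.20")

# Alice wants to call Bob: she asks the supernode for his address,
# then would open a direct link to that address herself.
target_ip = supernode.lookup("bob")
```

The point of the sketch is that the supernode only answers the "where is this user?" question; the conversation itself goes directly between the two clients.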

Skype also runs a number of support servers responsible for delivering offline messages. Because of an unexpectedly large number of undelivered messages, these servers delivered the messages with a delay. A bug in Skype for Windows version 5.0.0.152 made the application crash when it received these late messages. The latest Skype version, 5.0.0.156, earlier Windows versions, and all versions for non-Windows machines were not affected by the bug. The problem was that around 50% of the users were running the faulty version, which was the initial version shipped when Skype 5 was released. Approximately 40% of the Skype clients that were online crashed, taking down around 30% of all supernodes.

Clients that were still up and running, and clients that restarted the application, had their network searches directed to the supernodes that were still running, which overloaded them. Skype has a protection mechanism that shuts down an overloaded supernode so that it does not consume too much of its host system’s resources. As a result, the supernodes started to shut down automatically one after another, leading to a general failure of the network.
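The cascade described above can be sketched with a toy simulation. It assumes, purely for illustration, that the total query load is shared evenly among the supernodes that are still running and that a supernode shuts down as soon as its share exceeds its capacity. The function name `simulate_cascade` and all capacity and load figures are invented; they are not Skype's numbers.

```python
from typing import List, Tuple


def simulate_cascade(total_load: float, capacities: List[float]) -> Tuple[int, int]:
    """Return (surviving supernodes, shutdown rounds) for a fixed total load.

    Assumption: load is split evenly among the nodes still alive, and any
    node whose share exceeds its capacity shuts itself down.
    """
    alive = list(capacities)
    rounds = 0
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # stable: every remaining node can carry its share
        alive = survivors
        rounds += 1
    return len(alive), rounds


# Ten hypothetical supernodes with different capacities.
capacities = [100, 120, 140, 160, 180, 200, 220, 240, 260, 280]

healthy = simulate_cascade(1000, capacities)
# The same load after 30% of the supernodes have crashed.
degraded = simulate_cascade(1000, capacities[:7])
```

With all ten nodes the network is stable, but once three of them are gone the same load pushes the weakest survivors over their limit, their load moves to the rest, and the toy network empties in two rounds.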

Skype cannot function without supernodes, so the company started first hundreds, then thousands of supernodes, hoping to restore the service. They did not specify what systems they used for that; perhaps some of their own, or some on Amazon EC2. The network started to rebuild itself around these supernodes, and the service was restored after 24 hours. They said they later stopped most of the supernodes they had started, leaving a few running in case there was any sign of trouble, since the network is heavily used during Christmas.

One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one. Skype is going to review the auto-updating process, perhaps implementing something like Google Chrome has:

We will also be reviewing our processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again.

Another lesson is that one should do everything possible to make sure the software is thoroughly tested. Skype has decided to review their “testing processes to determine better ways of detecting and avoiding bugs which could affect the system.”

The last, but not least, lesson concerns the capacity of the Skype servers supporting the network, such as those serving offline IM. Rabbe mentioned that they “will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems.”

The article is misleading in that it blames the user for not upgrading skype. The reality is that it is the company, Skype.com, which is at fault. The newer versions of skype have an auto-updating feature and the user has NO CONTROL OVER IT! The update proceeds whether the user wants it or not!

So the system-wide crash was really caused by skype's own updating process, NOT by user error/inaction. Users were running the most current version of skype they were given.

I agree. The day after, I found my Skype automatically updated without me doing anything. I remember being offered the upgrade the day before and declining (yes, I said no and the damned thing updated anyway). Skype's policy of silently updating on Windows shot their network in the foot. I managed to get back to the 4.2 version that I was on and was reasonably happy with. However, the article is logically misleading.

The bug affected a particular version of the client when it had undelivered messages. If a user is like me and was silently upgraded to the buggy version (before they fixed it) and then started it, each start will crash Skype. If the user was either automatically upgraded to a buggy version or knowingly updated to the latest version and restarted (potentially not triggering the bug if they got back online before anyone sent them a message), then they aren't particularly at fault. Moreover, after updating, their client crashes, and by the time it stops crashing the network is offline anyway! The update system can't function properly until the client (that the user just updated to) stops crashing, so the network has to die before those users realise they need ANOTHER update, after the one they did yesterday, to fix the crashing. In this particular case upgrading the version caused the problem; people on the old client were perfectly fine. If anything, people keeping up to date encouraged this bug, in addition to Skype's magical update-without-your-permission system.

This might have been avoided if Skype didn't automatically update silently and without/against permission. They might have had a better chance of getting people on the non-buggy version before the network went down entirely.

The inherent design of the system is at fault. If the system is designed to prevent overload of a supernode by removing the supernode then it is obvious that the problem will cascade as clients keep querying and overwhelming other supernodes. Duh. I believe this was covered in my second year of computer science at University.

Hi "skype user", the article does not try to blame anyone. While I could find a culprit, that was not my intention. I did not blame the users for not having the fixed version. I said "One important lesson to be learned is this: many users do not update their software if they don’t have to." And that is true. And it is not true just for Skype, it is true for all software in general. Microsoft has serious problems convincing some people to move to a newer/safer version of Windows. Some are still using Windows 95.

Regarding the update process: Lars Rabbe, CIO at Skype, said "Since a bug was identified in Skype for Windows (version 5.0.0.152), we had provided a fix to v5.0 of our Windows software prior to the incident." So, they provided the update. On the other hand, some users complain that the update does not work, or that they do not want it, and their reports are a bit confusing, making one wonder what the problem actually is:

Why did Skype automatically update my skype 4 version to 5 when I logged in? What if I don't want that new "fancy" version? Don't you think it's a bit rude to do this without asking me at all, not to mention that when I clicked the X button on the updating window it did NOTHING.

Hi, I'm currently using Skype ver. 4.2.0.169, but the issue I'm about to describe has been around since at least the first release of version 4.

I've selected the tick boxes for 'Notify me' and 'Automatically download and install' under 'When a new version of Skype is available...' in Advanced settings and yet I never receive such a warning. Instead I always have to manually check for new updates from the Help menu. This happens if a major release is available as well.

Is the auto update functionality broken, or is it just me?

Well, I am personally using Skype 4.2 and it does not auto-update because I have not set the option "Automatically download and install it" under the "Notify me" one. For me, the update process works fine. From the forum posts I quoted it is hard to grasp what is actually going on. Are some users really having a problem, or are they not using Skype's settings correctly? I do not intend to clarify that. With this article I just wanted to find out what lessons software companies can learn from Skype's outage.

the newest client doesnt seem to be a miracle cure-all
by mitko didkov

i registered here only because i disagree with the "fault" the article implies to the end users. SKYPE ALREADY UPDATES ITSELF super stealthily!

therefore i am using only the following:
- "skype 3.6 portable.exe": it stores data in %appdata% and works as normal, only no auto update is going on behind the scenes. it works like a charm, and the protocol has backward compatibility
- the web based imo.imo service

i will never install skype 4.x or skype 5.x k10x bai

Re: the newest client doesnt seem to be a miracle cure-all
by Abel Avram

Hi Mitko, the article I wrote does not try to find fault with the users or Skype. The article just presents some lessons to be learned. One of them is: "many users do not update their software if they don’t have to." And that is true. What you said serves as proof of that. You are still using an older version of Skype, and you said you won't upgrade. This can prove to be very detrimental in some cases, as happened with Skype.

Your analogy describes it perfectly! I wonder how a Skype client could kill a server component. I don't believe automated updates can solve the problem entirely; perhaps they should inspect the supernode programming thoroughly to make it fail-safe.

Supernodes are NOT "server components". They are simply clients that agreed to dedicate some resources to act as a directory. I don't think that, as a client, you would appreciate your home computer consuming all its resources simply to help the network.

But on the other hand, I agree that maybe there could be some throttling mechanism that limits usage instead of shutting the supernode down.
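A rough sketch of what such throttling could look like is below. It is only an assumption about one possible design, not how Skype behaves: each supernode serves requests up to its own capacity and rejects the rest rather than switching itself off. The function name `serve_with_throttling` and the figures are hypothetical.

```python
from typing import List, Tuple


def serve_with_throttling(total_load: float, capacities: List[float]) -> Tuple[float, float]:
    """Return (served, rejected) load when nodes cap their work instead of quitting.

    Assumptions: `capacities` is non-empty and the load is split evenly
    among the supernodes.
    """
    share = total_load / len(capacities)
    # Each node handles at most its own capacity and turns away the excess.
    served = sum(min(capacity, share) for capacity in capacities)
    return served, total_load - served


served, rejected = serve_with_throttling(800, [100, 150, 200, 250])
```

Some lookups fail and would have to be retried, but every supernode stays online, so the load is never pushed onto a shrinking set of survivors.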