Building a Robust Skype in the Microsoft Ecosystem

During my career I have worked with several high-tech companies, as both employers and clients. One common thread is that you build new offerings, or modify existing ones, based on the performance experience and opportunities that the previous offerings exposed. Developer toolkits become more efficient, cloud-based services demonstrate significant cost savings, and new platforms, such as smartphones and tablets, become popular. And the back-end infrastructure of an offering can significantly impact both features and performance.

The migration of offerings from PCs to mobile has also exposed significant challenges for many applications, especially resource-hungry ones such as Skype. Skype initially relied heavily on direct connections between users’ PCs, with a minimum of servers (a directory server, for example). It has since become apparent that a much more robust server-based back end was necessary to support its growing user base and the increased focus on mobile devices. Skype’s acquisition by Microsoft provided both the financial and the technology resources required to address these issues.

A recent email authored by Skype’s Principal Architect, Matthew Kaufman, provided an overview of what Skype/Microsoft have been working on over the past couple of years to improve the robustness (who wants another outage?), reliability (is it a dial-tone-ready service, especially on mobile?) and communications efficiency (who wants to miss a message when one party is not available?) of Skype. Not only do we get a subtle feel for infrastructure issues that arose during the eBay and Silver Lake ownership periods, but we also learn:

….. what is driving Skype to move not just the supernodes but actually many other parts of our calling and messaging infrastructure “to the cloud”, …. is the amazing growth of mobile and tablet computing. The Skype peer-to-peer network, and many of its functions (such as instant messaging) was built for a world where almost every machine is powered by a wall socket, plugged into broadband Internet, and on for many hours a day.

Dealing with issues that are exposed to the user:

1. Outages and how to avoid them.

It turns out that Skype’s original peer-to-peer architecture, which provided tight encryption but relied on “electing” certain users’ PCs as supernodes, was prone to outages, such as one caused by a crashing bug in a Windows Skype client (recall December 2010 and August 2007). Like recharging a superconducting MRI magnet that has lost its liquid helium (been there), “bootstrapping the network back into existence afterwards was painful and lengthy.” Deploying servers that Skype/Microsoft control to act as these nodes not only improves robustness but also makes it possible to scale more reliably and run much more efficient code.

“… nodes that we control, can handle orders of magnitudes more clients per host, are in protected data centers and up all the time, and running code that is less complex that the entire client code base.”

2. The challenges of running Skype on mobile

Skype on a PC has an effectively unlimited power source, takes advantage of very fast processors and runs effortlessly over WiFi and broadband Internet connections. (Recall that Skype’s initial success ten years ago was due in part to the rapidly expanding availability of broadband Internet connections at the time.) For the most part PC users either have no data caps or have large ones, and the connection is “always on”. Users can see your availability; etiquette evolved such that you would first ask whether the other party could take a call. With sufficient upload bandwidth, a robust Internet connection and the right webcam, video calls are supported at up to 1080p resolution.

But when it comes to mobile it’s been a different story:

Battery drain is probably the major issue. If you have a few hundred contacts, presence monitoring alone can drain the battery in a few hours. I’ve seen my iPhone drained in less than four hours.

Processor speed limits the video resolution that can be supported, as well as the ability to cope with an unstable network connection, amongst other issues.

Mobile phone plans usually have a data cap. A “free” 10-minute Skype-to-Skype voice call can use 15 to 25 MB, and a video call uses more.

When roaming outside the home carrier’s territory, use of WiFi only is advised due to high roaming data charges.

A user is not “always on”. Receiving a call or message requires the application to be “loaded” on the contact’s device, but even then iOS suspends an application that is not in the foreground until a “notification” activates it. It’s not a true multi-tasking OS.

Following several instant messaging threads can contribute to battery drain.
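To put the data cap point in perspective, here is a rough back-of-the-envelope sketch in Python. It takes the 15 to 25 MB per 10-minute call figure quoted above, assumes decimal megabytes, and uses a 500 MB monthly cap that is purely my own illustrative assumption, not any particular carrier's plan.

```python
MB = 1_000_000  # bytes per megabyte (assumption: decimal megabytes)


def implied_kbps(megabytes: float, minutes: float) -> float:
    """Average bitrate, in kilobits per second, of a call that
    consumed `megabytes` of data over `minutes` of talk time."""
    return megabytes * MB * 8 / (minutes * 60) / 1000


def call_minutes_in_cap(cap_mb: float, mb_per_10_min: float) -> float:
    """Minutes of calling that fit in a data cap of `cap_mb`,
    given the data used by a 10-minute call."""
    return cap_mb / mb_per_10_min * 10


if __name__ == "__main__":
    ASSUMED_CAP_MB = 500  # hypothetical monthly plan, for illustration only
    for usage in (15, 25):  # MB per 10-minute voice call, from the text
        print(
            f"{usage} MB per 10 min is about {implied_kbps(usage, 10):.0f} kbps; "
            f"{call_minutes_in_cap(ASSUMED_CAP_MB, usage):.0f} minutes fit in "
            f"{ASSUMED_CAP_MB} MB"
        )
```

Under those assumptions a voice call averages somewhere between 200 and 333 kbps, and the whole monthly allowance is gone after roughly three and a half to five and a half hours of talking, before any other data use.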

Yet the number of Skype users on mobile devices has become a significant portion of Skype activity. From Kaufman’s email:

And these devices are a lot different: they’re running on battery, sometimes on WiFi but often on expensive (both in money and battery) 2G or 3G data networks, and essentially “off” most of the time. On iOS devices, applications are killed and evicted from memory when they attempt to do too much background processing or use too much memory. On Windows RT and Windows 8 Modern applications, when the application is not in the foreground we only get a few seconds of CPU execution time every 15 minutes and again, strict memory limitations if we want to stay loaded.

3. Instant Messaging has its own challenges

If the receiving party was not online, messages were lost. There was no buffering, and no way to retrieve them when the other party did come online. The need to support asynchronous messaging became apparent with the rise of mobile device use.

Today instant messages are buffered and are available when you log into Skype. You will see recent activity show up in the messaging/call logging screen. You can even mark them all as read, which is especially useful if your account is also on a PC. On PCs they go back 30 days; on mobile devices they can be recalled going back 2 weeks.
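The idea of server-side buffering can be illustrated with a minimal store-and-forward sketch. This is not Skype's implementation; the class and function names are hypothetical, and the only figures taken from the text are the 30-day (PC) and 2-week (mobile) recall windows. Messages are held on the server whether or not the recipient is online, and each device receives them in chronological order when it next syncs.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Recall windows from the text; the dictionary itself is illustrative.
RETENTION = {"pc": timedelta(days=30), "mobile": timedelta(days=14)}


@dataclass(frozen=True)
class Message:
    sender: str
    text: str
    sent_at: datetime


class MessageStore:
    """Hypothetical server-side buffer holding messages per recipient."""

    def __init__(self) -> None:
        self._history = defaultdict(list)

    def deliver(self, recipient: str, message: Message) -> None:
        # Accept the message even if the recipient is offline.
        self._history[recipient].append(message)

    def sync(self, recipient: str, device: str, now: datetime) -> list:
        # Return messages inside the device's recall window,
        # oldest first, so the thread reads chronologically.
        cutoff = now - RETENTION[device]
        recent = [m for m in self._history[recipient] if m.sent_at >= cutoff]
        return sorted(recent, key=lambda m: m.sent_at)


if __name__ == "__main__":
    now = datetime(2013, 7, 1, 12, 0)
    store = MessageStore()
    store.deliver("bob", Message("alice", "yesterday", now - timedelta(days=1)))
    store.deliver("bob", Message("alice", "three weeks ago", now - timedelta(days=20)))
    store.deliver("bob", Message("alice", "too old", now - timedelta(days=40)))
    print([m.text for m in store.sync("bob", "pc", now)])
    print([m.text for m in store.sync("bob", "mobile", now)])
```

The design point is simply that the sender no longer depends on the recipient's device being awake: the server accepts the message, and ordering and retention are decided centrally rather than by whichever peers happen to be online.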

How has Skype/Microsoft addressed these issues? Again, quoting Kaufman:

Servers. Lots of them, and more and more often in the Windows Azure cloud infrastructure. In the case of instant messaging, we have merged the Skype and Windows Messenger message delivery backend services, and this now gets you delivery of messages even when the recipient is offline, and other nice features like spam filtering and malicious URL removal. For calling, we have the dedicated supernodes already, and additional services to help calls succeed when the receiving client is asleep and needs a push notification to wake up. And over time you will see more and more services move to the Skype cloud, offloading memory and CPU requirements from the mobile devices everyone wants to enjoy to their fullest and with maximum battery life.
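Kaufman's remark about helping calls succeed "when the receiving client is asleep and needs a push notification to wake up" can be pictured with a toy simulation. Everything here is an assumption for illustration: the `Device` and `CallRouter` names are invented, and the `on_push` method merely stands in for a real platform push notification; the actual Skype services are not described in that detail in the email.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical mobile client that is asleep unless woken."""
    user: str
    awake: bool = False
    log: list = field(default_factory=list)

    def on_push(self) -> None:
        # Stand-in for the OS delivering a push notification.
        self.awake = True
        self.log.append("woken by push")

    def ring(self, caller: str) -> None:
        self.log.append(f"ringing: call from {caller}")


class CallRouter:
    """Hypothetical cloud service that knows each user's device."""

    def __init__(self) -> None:
        self._devices = {}

    def register(self, device: Device) -> None:
        self._devices[device.user] = device

    def place_call(self, caller: str, callee: str) -> bool:
        device = self._devices.get(callee)
        if device is None:
            return False  # callee has no registered device
        if not device.awake:
            device.on_push()  # wake the sleeping client first
        device.ring(caller)
        return True


if __name__ == "__main__":
    router = CallRouter()
    bob = Device("bob")
    router.register(bob)
    print(router.place_call("alice", "bob"))
    print(bob.log)
```

The contrast with the original peer-to-peer design is that the caller's client never has to find the callee's device itself; an always-on server keeps track of it and wakes it only when there is a reason to, which is what spares the battery.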

Bottom line: Yes, with the Microsoft acquisition Skype’s back end has changed significantly, but with the goal of making Skype more robust, more reliable and scalable to billions of users. On mobile devices it needs to be a “dial tone” offering where, whenever someone calls, there is no technology impediment to answering the call. To make a call, you select a contact and launch the call; if the called party is on the device, the call should come through regardless of any other activity. Instant messages should stay in chronological order (one recent improvement), and instant messaging should be truly asynchronous.

Making an $8.5B investment and then ignoring these issues would only open up opportunities for others and undermine efforts to penetrate business communications, whether for small businesses using Skype or large enterprises using Lync connected to Skype. Skype has stated that its focus is on improving the mobile experience; we now have some sense of the level of effort required to get there.

Question: how much similar infrastructure will be required when WebRTC calling becomes mainstream? Will it only make the voice/video connection? What will support robustness, reliability and scalability? To what extent will it evolve into having the complete scope of Skype’s offerings? Who will be responsible for putting this level of infrastructure in place?

As for government monitoring and the like, I’ll leave that to others to discuss. However, as pointed out indirectly in Kaufman’s email, it was not the driving force for making back-end changes. I have worked with several developer teams over the years; they are quietly proud (in a positive sense) and want to bring the best performance to their users. But it’s also an ongoing evolutionary process that builds on previous experience and feedback; if it means the end of peer-to-peer, so be it.

In the end it’s the user experience that is important; the underlying technology should be transparent to the user. As traxanderson.com points out in the post linked below:

And that as long as you are not discussing how to go on a crazy shooting spree or blow some innocent folks to thine kingdom come, you shouldn’t really bother much. This is just the beginning and there won’t be any ending soon or ever so better get use to it.

Following the U.S. government efforts and the travels of Edward Snowden is starting to read like a Tom Clancy novel. Where’s James Bond when we need him? Oh, I forgot – M died! But there are times when you make changes for the sake of a more reliable, robust and scalable offering; that appears to be the horse that goes before the cart in this case.