For a couple of years now, WebRTC has captivated the communications industry with high expectations. Although the interest has been constant, the reasons for the excitement have changed. Even the defining characteristics have changed. Initially, what was noted was that there is no need for any downloads. Slowly this point was de-emphasized, since many realized that though the codecs and media engine are not downloaded, the signaling procedure, a significant component, is being downloaded. Then the attention shifted to ease of development. But subsequently, people realized that many developers strong in web technologies usually lack certain communication-oriented skills, like signaling protocols and the use of STUN/TURN. Lately, with the emphasis on mobile devices and Apple’s continued silence on supporting WebRTC in the iOS environment, it has become necessary to focus on native apps for communication, avoiding the browser altogether. Given this necessity, it looks like any app can claim the WebRTC moniker as long as it uses just a small aspect of the original idea. Some apps that do not have any aspect of “web” are now routinely being called WebRTC-enabled. The purpose of this post is to define what WebRTC is, why WebRTC is attractive and how to use WebRTC capability. (The other three Ws, Who, When and Where, are really not that interesting.)

(Note: In the following, “media” refers not only to voice and video, but also to data transfer.)

What is WebRTC

This section enumerates the defining characteristics of WebRTC. Not all of them are mandatory. Since the use cases are varied, some characteristics are not needed for some use cases. But there are a couple of them that are mandatory for all use cases. Accordingly, the list is arranged in order of importance.

1. The session is initiated with an HTTP GET/POST. This is the natural way to do it within a browser. Though a native app is free to use any mechanism for self-contained initiation, it should allow this or a custom URI scheme for an external entity that requires the services of the app.

2. Use of the triangular connection model. Both the initiator and the recipient of the session are served by the same entity, the “server”.

3. The session control procedure is dynamically downloaded at the time of session initiation. This may not be required for some native apps that are dedicated to connecting only to pre-determined server(s) which use a fixed set of procedures.

4. The link that carries session control messages and the link that carries session media traffic are separate.

5. It is expected that the end-points have UI elements, like a screen, where sophisticated and context-specific information can be presented. This will of course be further dependent on the specific use case.

Why WebRTC

This section elaborates on the rationale for the requirements listed in the previous section, by pointing out the benefits.

Traditionally, communication systems have been stand-alone, monolithic entities. Consequently, it has been difficult to integrate them with other processes or to interwork with other communication systems. For example, BECP has been touted for a long time. But communication systems required that session initiation be done within the system. If there is a need to bring in any historical information, which would naturally be contained in the application associated with the business process, then these two systems must agree on the method and format of such an exchange. This results in a logistical and coordination quagmire. But requirement 1 simplifies this enormously. The URI scheme makes passing information from the business process to the communication system straightforward. Additionally, the communication system can be upgraded or changed to an alternate system with minimal disruption. For example, the business process can utilize HTTP redirect to maintain the original URI, but redirect to a different entity with minimal administrative intervention.
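To make the point about passing information through the URI concrete, here is a minimal sketch in Python. The host name, path and parameter names are made up for illustration; the only assumption is that the business process and the communication system agree to carry context as ordinary query parameters on the reach URL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_reach_url(base_url, context):
    """Business-process side: append context to the reach URL as query parameters."""
    query = urlencode(sorted(context.items()))
    separator = "&" if urlsplit(base_url).query else "?"
    return f"{base_url}{separator}{query}"


def read_context(reach_url):
    """Communication-system side: recover the context from the incoming GET."""
    return {key: values[0] for key, values in parse_qs(urlsplit(reach_url).query).items()}


# Hypothetical host and parameters, used only for illustration.
url = build_reach_url(
    "https://rtc.example.com/reach/support",
    {"customer": "c-1042", "topic": "billing"},
)
context = read_context(url)
```

Because the context travels in the URL itself, swapping the communication system only requires that the new system understands the same parameters; an HTTP redirect can carry the query string along unchanged.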

Use of the triangular connection, coupled with dynamic download of the signaling procedure, eliminates interoperability and feature compatibility issues. A server can introduce a new feature and try out a different UI without worrying whether clients will be able to handle the change.

If the session control messages use a link independent of the media link, then it becomes much easier and quicker to stop and restart media flow, without incurring session setup time, which will be longer due to the required authentication procedure.

How to use WebRTC

WebRTC apps can be designed for a wide variety of applications, requiring both high-end and low-end scalability.

Personal communication system: A small footprint WebRTC app server that will allow the host’s friends to initiate a communication session from a browser as and when needed. This eliminates the need for a 3rd party provider. Here WebRTC helps to eliminate the need for network effect at the application layer.

Contact center: A potential customer browsing a website can initiate a communication session. The website can dynamically construct the session’s reach URL based on the customer’s profile and browsing history and other cookie information. Since the reach URL is dynamically generated, the website owner can change the WebRTC app server very easily. The communication provider has been commoditized.

Unified Communication System: An organization can decide to gradually introduce a new UC system. Unlike traditional systems, a system based on WebRTC allows for “guest access”. This way early adopters do not suffer from lack of network effect. Indeed early adopters can become advocates of the new system because they will be in a position to demonstrate the benefits to late adopters.

In a recent post by WebRTC “activists” on the impact of WebRTC on UC, Alan Quayle writes,

The application diversity being driven by the person we’re trying to communicate with and their preferences. So what impact will WebRTC have on UC? None. Because the problem is in federation of presence, not in the standardization of media codecs, and the lack of federation is driven more by commercial issues than lack of standardization.

There is a way for a WebRTC-based system to address the natural application diversity you identify. There is a fundamental problem in the current way of distributing Presence information. The problem arises because Presence information is usually pushed to the recipients. While federating, for various reasons it is preferable to share this information selectively outside of the local organization. There is no dependable way to do that with a push model. If instead the Presence information is pulled, then it is easy to share it selectively, depending on the person querying it. A WebRTC-based system can universally support pull requests via HTTP.
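The pull model can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular system implements it; the class name, the status strings and the identifiers are all hypothetical, and I am assuming that the requester has already been authenticated by the time the query arrives (for example, as part of an HTTP GET).

```python
from dataclasses import dataclass, field


@dataclass
class PresenceBook:
    """Presence that is pulled: the answer depends on who is asking."""

    default_status: str = "unavailable"
    per_requester: dict = field(default_factory=dict)

    def set_status(self, requester_id, status):
        # The owner decides what each requester is allowed to see.
        self.per_requester[requester_id] = status

    def query(self, requester_id):
        # Unknown requesters only ever see the default.
        return self.per_requester.get(requester_id, self.default_status)


book = PresenceBook()
book.set_status("alice@partner.example", "available")
seen_by_alice = book.query("alice@partner.example")
seen_by_stranger = book.query("mallory@elsewhere.example")
```

Nothing is broadcast, so there is nothing to filter at the federation boundary; the selective sharing happens at the moment of the query.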

The next issue federation has to address is signaling protocol. But WebRTC tackles that by dynamically downloading the signaling procedure. This is why it is important to recognize the benefit of triangular connection afforded by WebRTC. Very often, WebRTC is credited with standardizing media codecs. But by allowing dynamic download of signaling procedure, it has eliminated the need to standardize the signaling protocol.

The final point Alan makes is very valid. Till now, federation between two organizations has meant that an elaborate organizational agreement must be reached even before the administrative setup can be made. But a WebRTC-based system allows an organization to unilaterally give “Guest access” to some or all of the members of the partnering organization, as long as the partnering organization has a federating id mechanism like SSO. The local organization can enforce guest privileges using the federated id and by maintaining whitelists and blacklists.
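As a rough sketch of how the local organization might enforce this, consider the following Python function. The rule ordering (blacklist first, then an optional whitelist) and the email-like form of the federated id are my assumptions for illustration; a real deployment would take the id from its SSO assertion.

```python
def may_enter_as_guest(federated_id, partner_domains, whitelist=frozenset(), blacklist=frozenset()):
    """Decide locally, without any agreement from the partner, whether to admit a guest."""
    user, _, domain = federated_id.rpartition("@")
    if not user or domain not in partner_domains:
        return False  # not a member of a partnering organization
    if federated_id in blacklist:
        return False  # explicitly barred
    if whitelist and federated_id not in whitelist:
        return False  # access limited to "some" members only
    return True       # empty whitelist means "all" members of the partner


partners = {"partner.example"}
open_to_all = may_enter_as_guest("bob@partner.example", partners)
barred = may_enter_as_guest("bob@partner.example", partners, blacklist={"bob@partner.example"})
limited = may_enter_as_guest("bob@partner.example", partners, whitelist={"carol@partner.example"})
```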

Apart from Alan’s points, WebRTC is going to impact the UC market in a major way. Thus far, it has been very difficult to roll out a UC system incrementally. More often than not, users of a UC system get to utilize the full feature set only when they are interacting with other users of the same UC system. But “Guest access” allows for incremental rollout. This is going to have an impact on the current players as well. Interestingly, Skype for Business and Cisco have announced their plans to offer “Guest access”. We have to wait and see how they will be impacted.

But there is a cautionary point that needs to be noted: there is a major gotcha for “Guest access”. If an enterprise will not allow UDP traffic out of its Intranet, then “Guest access” will fail. The current WebRTC/ICE mechanism does not allow for the originating enterprise to be involved. There are proposals to address this point. It is critical that this gets resolved soon.

In a couple of hours there will be a VUC session on this topic. So I thought it would be useful to record some of my observations and outstanding questions.

A user or administrator of the local network must have a way to designate STUN and TURN servers that override the ones specified by the application. A STUN server is analogous to a DNS server, and just as we are at liberty to specify the DNS servers, we must be able to specify the STUN server. Depending on the security considerations, a network may be obligated to record all conversations. To facilitate that, a network may deploy a TURN server and may require all RTC traffic to flow through this server. This can be done simply if the browser were to tacitly utilize its own TURN server and assign the highest priority to the corresponding ICE candidate. This is analogous to using a SOCKS proxy for HTTP flow.
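The override I have in mind could look something like the sketch below. It is not an existing browser feature; the function names are hypothetical, and I am assuming each server entry is a dictionary with a single `urls` string, loosely modeled on the `iceServers` entries an application passes to the browser.

```python
def server_kind(server):
    """Classify an entry as 'stun' or 'turn' from its URL scheme."""
    scheme = server["urls"].split(":", 1)[0]
    return "turn" if scheme in ("turn", "turns") else "stun"


def effective_ice_servers(app_servers, network_servers):
    """Network-designated servers replace the application's servers of the same kind."""
    overridden = {server_kind(s) for s in network_servers}
    kept = [s for s in app_servers if server_kind(s) not in overridden]
    # Network servers are listed first so that they are preferred.
    return list(network_servers) + kept


app = [{"urls": "stun:stun.app.example"}, {"urls": "turn:turn.app.example"}]
network = [{"urls": "turn:turn.corp.example"}]
servers = effective_ice_servers(app, network)
```

Here the enterprise designates only a TURN server, so the application's STUN server is kept while its TURN server is dropped.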

Both the users and application providers should recognize that external STUN and TURN providers have access to session metadata.

TURN adds overhead, and the overhead grows further when ReTURNs are used. TURN needs this additional overhead to multiplex multiple streams between a TURN client and server. Most of the WebRTC use cases will involve a single stream. I think it is a good tradeoff to consume the occasional additional ports at the server, rather than consuming additional bandwidth for all the flows. So, it might be worthwhile to use a relay server rather than a full-fledged TURN server.

Some have expressed concern about sharing local addresses with other clients. Given that Trickle ICE is part of WebRTC, a modification to how ICE candidates are listed should be considered. Browsers should not include local addresses in the initial candidate set. Instead, they should be added if and only if the peer’s server-reflexive or peer-reflexive address matches its own and the connectivity test passes. Of course, we have to recognize that the call setup time may increase slightly.
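Here is a small sketch of the proposed rule. It is a proposal, not how browsers behave today; the candidate representation (a dictionary with `type` and `address`) and the function names are mine, and the addresses are documentation examples.

```python
def initial_candidates(local_candidates):
    """First trickle: everything except host (local address) candidates."""
    return [c for c in local_candidates if c["type"] != "host"]


def late_candidates(local_candidates, peer_reflexive_addresses, connectivity_ok):
    """Reveal host candidates only if the peer appears to sit behind the same
    public address as we do and the connectivity test has passed."""
    own_reflexive = {
        c["address"] for c in local_candidates if c["type"] in ("srflx", "prflx")
    }
    if connectivity_ok and own_reflexive & set(peer_reflexive_addresses):
        return [c for c in local_candidates if c["type"] == "host"]
    return []


local = [
    {"type": "host", "address": "10.0.0.5"},
    {"type": "srflx", "address": "198.51.100.7"},
]
first = initial_candidates(local)
same_nat = late_candidates(local, ["198.51.100.7"], connectivity_ok=True)
different_nat = late_candidates(local, ["203.0.113.9"], connectivity_ok=True)
```

A matching reflexive address suggests that the two end-points are behind the same NAT, which is the one case where the local address is actually useful to the peer.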

TURN is required only when both the end-points are behind symmetric NATs. If it is known a priori that this will not be the case (as when the session is always to the app’s own device/server), then we can dispense with relay addresses as ICE candidates. If we further know that the app’s own device/server will have a public Internet presence, then even STUN can be eliminated, since that device/server can use the peer-reflexive address it learns as part of Trickle ICE.
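The reasoning above reduces to a short decision, sketched here in Python. The function and parameter names are hypothetical; the candidate type names (`host`, `srflx`, `relay`) are the usual ICE ones.

```python
def candidate_types_to_gather(both_may_be_behind_symmetric_nat=True, peer_is_public=False):
    """Which ICE candidate types are worth gathering, given what is known a priori."""
    types = ["host", "srflx", "relay"]
    if not both_may_be_behind_symmetric_nat:
        types.remove("relay")      # no TURN needed
        if peer_is_public:
            types.remove("srflx")  # no STUN needed; peer learns our address as peer-reflexive
    return types


general_case = candidate_types_to_gather()
own_server = candidate_types_to_gather(both_may_be_behind_symmetric_nat=False)
own_public_server = candidate_types_to_gather(
    both_may_be_behind_symmetric_nat=False, peer_is_public=True
)
```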

As part of connectivity test, the two end-points must perform authentication of the other end before meaningful information is exchanged.

In a post that prompted me to write this, Tsahi discusses different alternative signaling protocols one can use in a WebRTC-enabled app. In this post, I approach the issue from a different angle and I hope this sheds additional light and helps you to reach a choice appropriate for you.

Before we dig deep, we have to recognize that we have to decide on two independent matters: 1) how will the signaling messages be carried and 2) what will be the signaling protocol. There are very many variables that will affect the optimal answer for your scenario. So it is best that we discuss them in general and let you decide on a case by case basis.

First let us consider the transport mechanism.

Pure HTTP: Since the app will be accessed from a browser, an easy choice would be to use HTTP as the transport. It works great if the browser is initiating a signaling procedure and the server responds.

HTTP w Long Polling/Comet: But there are times when the server needs to initiate asynchronously. Some examples are when the server wants to notify one user of another’s action, like placing the mic or speaker on mute. Or the server would like to notify the user of an incoming call request. Since the server cannot autonomously initiate an HTTP session, an alternative is to use long polling or Comet. This may increase the load on the server due to excessive polling, or may introduce latency with its undesirable effect on UX.

HTTP w Push Notification: Alternatively the server can use Push Notification offered by both Chrome and Firefox to push a notification and upon receiving such a notification, the browser can initiate an HTTP session to continue the procedure. Of course this addresses the server load, but does not address the latency issue, especially for “in-session” procedures. Worse, the latency is affected by a third party service.

Websocket: This is where the use of Websocket has its advantages. Websocket starts as an HTTP session, which is then converted to a persistent TCP session. Almost all browsers (most recent versions) support Websocket and there are server implementations that are very efficient. So it addresses both the issues.

Websocket w Push Notification: If maintaining a Websocket connection during an idle period (so as to inform of an incoming session request) is a concern, then one can use Push Notification during idle periods and use Websocket only during active sessions.

Data Channel w X: The final choice is for the server not to be involved during an active session, but to allow the browsers to handle the signaling procedures directly between themselves via a WebRTC Data Channel. But this approach does not address how to handle notification during idle periods.

As you can see there are many choices with each having its own trade-offs. But knowing the trade-offs, you can decide the appropriate transport for your use case.
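One way to read the list above is as a short chain of questions. The sketch below encodes my reading of it in Python; the function and its parameters are hypothetical, and it deliberately leaves out the Data Channel option, since that one still needs some other transport for idle periods.

```python
def choose_transport(server_must_notify, idle_connection_acceptable=True):
    """Pick a signaling transport from two yes/no questions about the use case."""
    if not server_must_notify:
        # The browser always initiates, so plain request/response is enough.
        return "HTTP"
    if idle_connection_acceptable:
        # Persistent connection: low latency without polling load.
        return "WebSocket"
    # Wake the browser with a push while idle, then open a WebSocket for the session.
    return "WebSocket + Push Notification"


browser_only = choose_transport(server_must_notify=False)
always_connected = choose_transport(server_must_notify=True)
mostly_idle = choose_transport(server_must_notify=True, idle_connection_acceptable=False)
```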

Deciding which protocol to use is either a “no-brainer” or “not so fast”. If the paramount objective is to work with an already deployed system and the WebRTC app is just another access mechanism, then there is nothing more to consider. It is optimal just to use the signaling procedure used by the deployed system and that is that. Otherwise, it is better to start from scratch and ask questions differently. From the time of Q.931 in ISDN Basic Access up to and including SIP, the standards bodies have focused on defining the protocol so as to ensure interoperability between two autonomous systems. Since the end-points will be of different capabilities and present different user experiences, the best a standard can do is to design a protocol that drives a basic user interface. Thus, for example, when the far-end places a call on hold, the near-end is not notified. It is not clear how to abstract the notification so that all variations in the UI can be handled.

Next, let me quickly dismiss a faux use case, but one that is widely considered. It is known as the “trapezoidal connection”. In this connection, the two end-points are each connected to their own WebRTC app and the two apps are federating between themselves. The fact that the two end-points are using WebRTC as access is incidental; the real crux is that the two apps are federating and they have agreed on a protocol for this. So what the apps will select for a protocol belongs to the “no-brainer” category. The apps will select a protocol that is optimal for the agreed upon federation protocol.

So the really interesting use case is where the end-points are directly connected to the app server, the so-called “triangular connection”. Since both the end-points are directly connected to the app server and the server can dynamically download the signaling procedures via Javascript, it is in a position to offer a rich user experience by dynamically driving UI elements. The app designer can freely devise the needed signaling procedures; conforming to a standard is not critical. A good analogy is to compare the choice to paint by number and free-form painting. At first glance, paint by number looks straightforward; but in fact it is tedious, leaves no room for error and is not very expressive. On the other hand, free-form painting, if you are good at it, is fluid, very expressive and gives lots of freedom. If the choice were only free-form painting, then I would have only a blank canvas; with paint by number, there is hope that I will have something that looks like a painting. So I say, to each his own.

Recently, Carl Ford was musing about potential ideas for a WebRTC Hackathon. One idea he had was exploring different UI designs associated with “Video on Hold”. This post is a summary of our design thoughts and the decisions we made for a WebRTC application that is part of EnThinnai.

He felt that the design used in phone systems doesn’t work well for smartphones. So we should probably have a different approach for video calls. As an example, he was wondering how the user should be notified when she has gone to a different browser tab while the held video call is being retrieved by the other party.

In a followup post he elaborates his point. He suggests that we may imitate the idea used in 1A2 Key Systems phones. To see how far we can carry its design we need to go into a bit more detail.

These phones had some white buttons, each one controlling a line the phone has access to, and at most one button can be engaged. There was a red button that places the currently engaged line on hold. All these buttons can be lit and can also flash to signify the status of the line. For example, a quick flashing light signifies that there is an incoming call; a slow flashing light signifies that the line has been put on hold; and a steady light indicates an active call. Subsequently, Avaya carried over this design idea to their digital sets as well. This concept of “call appearances” and “active call appearance” is natural and very familiar in computer systems using windows. It is easy to observe that call appearances are nothing more than open windows and the active call appearance is the active window. When the user selects a window to be active, the OS tacitly places other windows on hold.
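The lamp behavior described above is easy to capture as a tiny state model. This Python sketch follows my description in this post rather than any hardware specification; the class and method names are made up, and placing the previously engaged line on hold when another button is pressed is the simplification I use throughout this discussion.

```python
from enum import Enum


class Lamp(Enum):
    OFF = "idle"
    FAST_FLASH = "incoming call"
    SLOW_FLASH = "on hold"
    STEADY = "active call"


class KeySet:
    """Line buttons with lamps; at most one line is engaged at any time."""

    def __init__(self, line_count):
        self.lamps = [Lamp.OFF] * line_count
        self.engaged = None

    def ring(self, line):
        self.lamps[line] = Lamp.FAST_FLASH

    def press_hold(self):
        # The red button: hold whatever line is currently engaged.
        if self.engaged is not None:
            self.lamps[self.engaged] = Lamp.SLOW_FLASH
            self.engaged = None

    def press_line(self, line):
        # Engaging a new line first puts the previously engaged one on hold.
        if self.engaged is not None and self.engaged != line:
            self.press_hold()
        self.lamps[line] = Lamp.STEADY
        self.engaged = line


phone = KeySet(3)
phone.ring(0)
phone.press_line(0)   # answer the first call
phone.ring(1)
phone.press_line(1)   # answer the second call; the first goes on hold
```

Note that every lamp reflects only what the local user has done. Nothing in this model can express that the far-end has placed the call on hold.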

But the analogy goes only so far. In a computer, even if a window is not active, activities can go on in it. For example, the user may be playing a YouTube video in an inactive window. Also, we should note here that a 1A2 Key System phone indicates whether the local user has placed the call on hold or not; it does not know whether the far-end user has placed the call on hold or whether he is retrieving the call, which is the use case Carl wants to explore.

There is one other fundamental difference between the 1A2 Key System and the environment a WebRTC app will find itself in. The phone can safely decide that when a call appearance becomes active, the call that was active must be placed on hold. But that may not be appropriate in the case of WebRTC. For example, the user may want to continue the call while viewing and interacting with the contents of another window. Or the user may have multiple WebRTC sessions going at the same time in an attempt to emulate a bridged call. So the only safe approach is to let the user explicitly select whether a video call must be placed on hold or not.

If we dig a bit deeper, we will question the basic need to place a call on hold in the first place. In PSTN systems, a call must be placed on hold if the user wants to attend to another call, because the access line can carry only one call at a time. But that is not so with WebRTC. The user can equivalently decide to turn off the camera or the display or both, instead of placing the whole call on hold.

Recently, call centers have responded to frustrations expressed by callers due to excessive hold times by introducing a feature called “callbacks” or “virtual queuing”. A WebRTC app can offer a similar feature in an elegant manner by making the app a multi-modal one, with the text chat session used to periodically update the status and as a link to provide audio and video cues when an agent becomes available.

These thoughts are captured in the current user interface design used in EnThinnai:

Inasmuch as the main utility of a formal living room is to entertain visiting guests, a WebRTC app allows guests to initiate a communication session with the subscriber of the app.

Many go to enormous lengths to furnish and decorate a living space normally called Living Room. Notwithstanding the expense involved and the name, it is used mostly when guests are visiting. When we are entertaining guests and they are using the amenities in that room, there is no question of whether the guests have similar room and similar amenities in their houses. The only requirement is that they visit you and that you are ready to host them.

So is the case with WebRTC apps. The main reason for the app, and for you to sign up for one, is so people can initiate a communication session with you. The only requirements are that your guests have a compatible browser and that you are willing to communicate with them.

Just because you have a lavish formal Living Room does not mean that when you visit one of your friends you will experience similar luxury. Similarly, subscribing to a WebRTC app does not imply that you can initiate a communication session to one of your friends. In this respect, WebRTC apps are for receiving only. This is critical. Anyone suggesting differently is misleading you.

It’s easy to get an OpenID; in fact, you probably already have one. If you have a Google account, you can use your profile id number as your login (which can be found in your profiles.google.com url). Similarly, if you have a Yahoo account, you can use your username as your OpenID login.

Other sources for OpenIDs include 3rd party providers like Verisign Labs. If you use WordPress to host a blog, you can also install a plug-in to be your own OpenID provider.

If you have an account with one of the above providers, then you can derive your OpenID using the following rules:

ffonio.in is a web application that people can use to have IM, voice and video chats with their friends and family. Users can run this app on their own devices such as their WiFi router, Raspberry Pi or a cloud instance like Digital Ocean Droplet. As long as friends and family have an OpenID and use a browser that supports WebRTC, they do not have to host this application themselves.

The following are highlighted features of ffonio.in:

Use of OpenID for authentication. (Registered users can assign an unverified, if unsecure, “OpenID” to unregistered users in an ad-hoc fashion.)

“Availability status”, in lieu of Presence. Users can present different status to different persons.

Only users who have been previously authorized can initiate an IM, voice or video chat. The authorization can be changed at any time.

Seamlessly move to a voice or video chat from an IM chat session.

Ability for either user to mute sound or turn off camera.

Ability to buzz the other user to catch their attention.

Once the IM chat session has ended, the transcript is made available to both the users. (We plan to also make recordings of voice and video chats available in the near future.)

The app has a built-in simple relay server (two-sided NAT) to assist in NAT Traversal, replacing the functions of a TURN server.

Generate a custom reach-URL (which users can share in their email or business card) or an embed code (which users can add to their websites.)

Although our primary objective was to help individuals run their own IM, voice and video chats, this system can also be used on much larger scale, such as within an enterprise. Companies can use this product for both internal communication between employees and external communication with outside partners and customers. We plan on pursuing this direction in the near future by integrating this application with CRM systems like Salesforce and Sugar CRM.

As early as 2008, EnThinnai supported the ability to conduct IM and voice chats. At that time a Java applet was dynamically downloaded to the browser. Then the browser maintained a two-way signaling channel with the server that allowed asynchronous notification from the server to the browser – a proto Websocket you may say. The applet also contained Speex codec which was used to provide real-time speech capability – fully anticipating WebRTC.

And we were in a bind to extend this feature further. There was no freely available video codec to extend the feature to support video communication. Leading mobile devices did not support Java. Users were disabling Java due to security concerns. For us, it is a defining use case for an unregistered user to initiate a communication session with a registered user (Guest access). This means the capability afforded by the Java applet must be universally available. This is precisely the objective of WebRTC.

Now that WebRTC has reached a stable stage, we have replaced the Java applet with WebRTC. So users can use any WebRTC-enabled browser to communicate under EnThinnai.

Skype is celebrating its 10th Anniversary. On this occasion, I thought it would be interesting to revisit my early comments. They were published as a guest blog post on Gigaom on March 27, 2004. But due to some server malfunction, they were lost. I am republishing that post here. I am proud to say that many of my opinions have stood the test of time, including the claim that Skype will be forced to bring all the Supernodes in house.

It is very likely that you have heard about Skype; it is even probable that you are using Skype. (Fair disclosure: I am not a subscriber of Skype.) Michael Powell, FCC Chairman suggests that the telephony market place has changed dramatically since the arrival of Skype. Is Skype really so special compared to other VoIP service providers? Of course Skype thinks so. They say that unlike other VoIP service providers, Skype has a very intuitive user interface that does not require technical skills, but is easy to configure. They also suggest that unlike other VoIP service providers, they solve NAT Traversal problem without the use of Proxies with the resultant better voice quality. Of course the clincher is that Skype is P2P and so is infinitely scalable and resilient.

Before I analyze these points, let me describe the workings of Skype based on my understanding and what is available in public.

1. There is a Global Index Server where all clients log in, authenticate themselves and exchange security key information.

2. Based on this exchange, the client will be assigned a Supernode, which will maintain the presence information; Supernodes also communicate with other Supernodes while locating other end-points.

3. The clients and Supernodes use the well-documented UDP Hole Punching algorithm to solve the NAT Traversal problem.

Upon a little reflection, we can see that functionally this architecture is equivalent to other VoIP architectures like SIP. Global Index Server is equivalent to the Registrar; the function described in item 2 is equivalent to Location Server and the function described in item 3 is Session Border Controller. What is more, many SBC vendors solve NAT Traversal problems using similar optimization techniques with the same rate of success. Consequently, the clients in other environments also do not require complicated configuration setup.

Skype users have commented positively about its voice quality. Global IP Sound indicates that Skype uses its codec, in particular iLBC. GIPS also supplies its codec to other VoIP clients. X-ten also uses the iLBC codec. So one can get Skype-like quality in other systems as well.

The Global Index Server is a single point of failure. If it fails, clients cannot log in. I suppose new Supernodes cannot be drafted either. In my opinion, this is not a serious failure, because the existing system can continue to function and a replacement GIS can be easily brought online.

But my concern regarding Supernode is more substantial. It is suggested that since the Supernodes are nothing more than other Skype clients, Skype is infinitely scalable. I submit that this may not be the case. To begin with, a client is eligible to be a Supernode only if it has enough processing power and bandwidth capacity to perform the functions of a Supernode. Additionally, it is a requirement that they be present on the public Internet or behind a “transparent” NAT and a “permissive” Firewall. I am betting that such clients will be scarce in relation to the total number of clients (a single Supernode serves around 100 clients).
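The scarcity argument is simple arithmetic, and a few lines of Python make it explicit. The figure of roughly 100 clients per Supernode is the one I quoted above; the population size and eligibility fractions in the example are purely hypothetical numbers chosen to show the calculation.

```python
import math


def supernodes_needed(total_clients, clients_per_supernode=100):
    """How many Supernodes a client population requires."""
    return math.ceil(total_clients / clients_per_supernode)


def enough_supernodes(total_clients, eligible_fraction, clients_per_supernode=100):
    """True if the eligible clients can cover the whole population."""
    eligible = total_clients * eligible_fraction
    return eligible >= supernodes_needed(total_clients, clients_per_supernode)


# Hypothetical population of one million clients.
needed = supernodes_needed(1_000_000)
half_percent_eligible = enough_supernodes(1_000_000, 0.005)
two_percent_eligible = enough_supernodes(1_000_000, 0.02)
```

In other words, at 100 clients per Supernode, at least one client in a hundred must be both capable and publicly reachable at all times, which is the condition I am betting will not hold.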

If Supernodes need to have special capabilities, then it is likely that they will demand some form of compensation. It is not clear whether Skype is set up for this. Additionally, it is not clear how the individual clients are protected from a misbehaving Supernode. It is true that the media is encoded. But the Supernode is involved in the signaling phase. Since the Supernode has network connectivity to the client, it is tempting to use it for extra and unwanted commercial activity. So Skype may deploy their own Supernodes, eliminating one more difference between it and other VoIP providers.

Some have expressed reservation because Skype is proprietary. There have been previous instances where proprietary consumer items have found wide adoption without incurring huge collective cost. VCR is one of the examples that come to mind. But in this case there are some differences:

Alternatives, based on standards are available

Skype uses mostly well-known and open technologies; only the protocol semantics is proprietary

Even though Skype (for that matter VoIP) is naturally a “product” and not a “service”, Skype views it as service. For example, they do not allow an enterprise to use their own GIS, instead of the global one, even if communication will be restricted to internal use alone.

As I am told, there is no way to directly address another client, even if the IP address is known. Windows Messenger from Microsoft has the same limitation, whereas NetMeeting allowed direct communication.

In this respect also, they are just like other VoIP providers. It is disheartening to see that even those whose middle name should be P2P think like this. I am reminded of an ad that appeared in a New York based Indian newspaper in 1982. The ad was taken out by an Indian restaurant that offered two free alcoholic drinks in exchange for a ticket stub for the movie Gandhi. In summary, Skype shares the same functional architecture with other VoIP providers. It shares the same business plan and outlook. But they have artificially cloaked it in a proprietary system. I guess this is their “economic moat”, to use a Buffett term. From a consumer point of view, the beauty of VoIP is that there is no moat and current technology is sufficient to realize direct IP Communications that does not require any intermediation.

Aswath Rao has 20 years of experience in the telecommunications field, having worked for leading R&D firms. He has worked on ISDN, Frame Relay, BISDN, wireless and satellite communications. For the past 5 years he has been working on VoIP related issues. Long before intelligence at the end became acceptable, he advocated “functional terminals” in ISDN. His proposal for Inter Connect Function has been incorporated in the TIPHON architecture and currently it is known as Session Border Controller. He has developed ways to offer PSTN subscribers many of the features available to VoIP subscribers. He maintains a blog. He can be reached at [email protected]