In my post on WebRTC standardization I mentioned that one of the controversial points of discussion in the W3C context was whether the SDP Offer/Answer model and the current API provide the level of flexibility that a wide range of WebRTC use cases require. To sidestep the endless and repetitive discussions that have already occurred on this topic, developers unsatisfied with the current API have just announced an alternative to the existing WebRTC API. This new proposal is called the WebRTC Object API; the motivation behind it is presented in this IETF draft, and some example code can be found on GitHub. Note that this is not the first time an alternative API aiming to give web developers more control has been proposed: Microsoft's CU-RTC-Web, introduced last year, took a similar approach with an alternative API along with a working prototype.

In order to better understand the principles behind the WebRTC Object API, today we had a chat with Iñaki Baz (@ibc_tw), one of the main authors of this initiative. Iñaki has an extensive background in SIP/VoIP. He is an active contributor to the IETF and W3C and one of the authors of the 'SIP on the Web' project, which includes the OverSIP server and the well-known JsSIP library for WebRTC.

webrtcHacks: Iñaki, what’s wrong with the current API for your SIP-oriented use cases?

Iñaki: My JsSIP partner, José Luis Millán (@jomivi), and I found it was very difficult to implement a SIP stack in JavaScript for WebRTC. This is ironic, since reusing SIP's SDP O/A mechanism should have made it relatively easy. The biggest problem is that many typical SIP use cases, such as putting a call on hold, require direct access to the SDP – for example, to detect that the remote party has put the call on hold by indicating a=inactive or a=sendonly. This means both browsers in the call have to parse and potentially modify the SDP. More logic, and more that could go wrong.

webrtcHacks: SDP is used by billions of SIP calls – what exactly is wrong with it for WebRTC?

Iñaki: SDP comes from the telco world with no consideration for different call models that will exist in the web world. SDP O/A is an artifact of SIP that is completely unnecessary in other signaling schemes.

WebRTC signaling is not limited to SIP or XMPP/Jingle. Unfortunately the current API is essentially based on these protocols because of the SDP O/A model. That doesn’t make sense in a lot of use cases.

The SDP format is very “flexible”. The bad side of this flexibility is that each browser in a call could generate completely different SDPs while still adhering to the same semantics. SDP generated by WebRTC compatible SIP servers in the middle of the call flow would add further complexities.

The biggest problem with SDP (and draft-roach-mmusic-unified-plan-00, which defines the SDP format for WebRTC) is that the call initiator not only offers the remote party what he is going to send, but also indicates what that remote party can send back. Specifically, if the call initiator only offers to send audio from its microphone, this results in a single m=audio line in the SDP offer. This means the remote peer can join with audio communication, but cannot add video (since the SDP answer must have the same number of m lines). Adding remote video would require another SDP O/A round trip.
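As a simplified illustration (this is a minimal sketch, not a complete WebRTC-grade SDP), an offer carrying only one m=audio line looks something like this:

```
v=0
o=- 0 0 IN IP4 192.0.2.1
s=-
t=0 0
m=audio 5004 RTP/SAVPF 111
a=sendrecv
```

Because the answer must mirror the offer's m lines one-for-one, the answerer can only accept or reject that single audio section; to add video, a new offer containing an m=video line has to be generated and exchanged in a fresh O/A round trip.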

webrtcHacks: Can you provide some other WebRTC use cases where SDP O/A doesn’t make sense?

Iñaki: There are currently many WebRTC applications that simply project audio and video remotely to a pre-defined device or window. Applications like Chromecast stream your browser screen and audio to a TV. The TV is obviously not going to send any media back to the browser. Why should the TV have to respond to your request and negotiate what codecs to use in both directions?

webrtcHacks: So why was SDP chosen in the first place? Does it ever make sense?

Iñaki: SDP O/A is suitable for symmetric communications where both ends are sending the same number of media streams, or “media stream tracks” as they are called in JavaScript. The best example of this is an audio-only phone call in both directions.

But what if you want to do a multi-party video conference in the browser? To start the conference, one browser sends an SDP offer with one m=audio line and, perhaps, one m=video line too. The conference server accepts this and replies with an SDP answer that has the same number of m lines. Then the conferencing server must immediately generate a new SDP offer with a set of m=audio and m=video lines for each participant in the conference. For example, 10 participants would mean 20 m lines.

Some members of the WebRTC WG suggest that another option is to require the conferencing server to be the entity that initiates the call. Using web signaling mechanisms, the server would be the party responsible for initiating the media session with the browser. It would offer an SDP with all the required m lines, and all the media flows would be negotiated in a single SDP O/A round trip.

Is this not a clear limitation and constraint in the design of future multimedia applications? Do we really need this? Why should a future-oriented, signaling agnostic standard like WebRTC need to inherit mechanisms from SIP?

webrtcHacks: In this use case you are describing, are you referring to video conferencing done in a Peer-2-Peer manner (e.g. a mesh of media streams), or video conferencing done with a central server doing mixing (e.g. hub-and-spoke media streams)? If a mesh, would that not be done with N peerconnections (and N unique SDP O/As), instead of trying to accomplish it with one SDP having N m= lines? If it is the hub-and-spoke use case, then would that not just be one (or 2) m= lines for N participants?

Iñaki: Typically (in both SIP and XMPP) the conference server acts as a central signaling node; that is, all the participants talk only to the server at the signaling level. The server also behaves as a central media node, which can be done in two ways:

Media mixer: the old-fashioned way, in which each participant receives a single audio track containing the mix of all the audio tracks from the other participants, so the participant cannot mute one of them or adjust its volume, etc.

Media relayer: like Google Hangouts, in which the server relays all the media tracks from all the participants to each participant in different RTP flows. In the case of WebRTC this would be done with multiplexed RTP, with each track carrying a different SSRC value indicated in a single SDP sent by the server. There is no SDP O/A for each participant – that is not needed at all and would make the whole protocol really complex. The browser does not need to open N local ports to send the same audio track N times (which would cause network upload congestion). Instead, the browser sends its audio over a single connection, just once. It also receives N tracks from the conference server over the same connection.

So I mean case #2. Due to SDP rules, an SDP answer must have the same number of m lines as the SDP offer. And using "unified" SDP (the draft defining SDP syntax for WebRTC), each track is represented as a separate m (audio or video) line. This means that if the participant initiates the call, it offers just two m lines (its audio and video tracks), the server must reply with two m lines as well, and later the server must re-invite the participant with a new SDP containing as many m lines as there are participants in the conference.

Alternative WebRTC API proposal: Object RTC API

webrtcHacks: What are the main principles behind the new API you’re presenting?

Iñaki: The ORTC API (Object RTC) focuses exclusively on what a peer indicates that it will send to the other, and never on what it expects the remote will send back (for that, the developer can use custom signaling if the application requires it). In the previous example in which a browser contacts a conference server, the browser would tell the server “these are the audio and video tracks that I’m sending” and the server would reply “these are mine”. Period. The number of tracks does not necessarily coincide since the conference server obviously emits many more video and audio channels (per participant).

ORTC also eliminates the requirement to handle a monolithic SDP blob as the API surface. Instead, the API's functions and events expect and provide JavaScript objects as arguments.

The developer is free to serialize this information to the remote destination however he wishes. Furthermore, ORTC separates the transport and media layers by providing specific classes for each purpose (RTCSocket, RTCConnection, RTCMediaSession and others). The developer can choose how many connections to create and how to associate audio and video tracks with those connections.
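To make the serialization point concrete, here is a minimal sketch of object-based signaling. The message shape below is purely illustrative – it is not defined by the ORTC draft – but it shows the core idea: each peer describes only what it will send, as a plain object, and the application picks its own wire format.

```javascript
// Build a local description listing only the tracks this peer intends to send.
// The { kind, id } shape is an assumption for illustration, not the draft's schema.
function buildLocalDescription(tracks) {
  return {
    type: 'tracks',
    tracks: tracks.map(function (t) {
      return { kind: t.kind, id: t.id };
    })
  };
}

// The app is free to choose any wire format -- JSON over a WebSocket, for example.
var msg = JSON.stringify(buildLocalDescription([
  { kind: 'audio', id: 'mic-1' },
  { kind: 'video', id: 'cam-1' }
]));

// The remote side parses the object and learns what it will receive --
// no requirement that it answer with a matching number of tracks.
var remoteView = JSON.parse(msg);
```

Note that nothing here constrains what the other peer sends back: it builds and sends its own description independently, which is exactly the asymmetry SDP O/A disallows.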

webrtcHacks: So is this mainly to make the PeerConnection process simpler or will it also enable new or different types of applications?

Iñaki: Let’s say that the current RTCPeerConnection design is a very simple subset of what can be done with ORTC.

webrtcHacks: How is ORTC different from Microsoft’s CU-RTC-Web?

Iñaki: CU-RTC-Web is a very powerful API, but perhaps too low level. This should not have been a reason to dismiss it because libraries can be written on top of it to simplify it for basic use cases without sacrificing flexibility for those that need more advanced functionality. This is analogous to how the jQuery library augments and simplifies programming for JavaScript.

ORTC offers similar functionality, but it is easier to implement for basic use cases.

webrtcHacks: What’s the goal of the ORTC group? A brand new API? Extend the current specs?

Iñaki: Ideally, the current specification would be totally discarded and ORTC would be used in WebRTC 1.0, before WebRTC is widely established and production services are launched. However, abandoning the current spec seems unlikely at this point. Alternatively, we propose that ORTC be implemented as WebRTC 2.0, with backwards compatibility with a WebRTC 1.0 based on the current specifications.

webrtcHacks: How is the new approach compatible with the existing API? Can ORTC be layered on top of the current API specs?

Iñaki: Totally. Without going into too many technical details, WebRTC 1.0 provides the RTCPeerConnection class as the main WebRTC class, while ORTC offers RTCConnection instead. An ORTC-compatible browser (assuming ORTC becomes WebRTC 2.0) could retain the RTCPeerConnection class, but internally the browser would generate a set of RTCConnection instances instead.

Another option is to implement this compatibility layer as a JavaScript library, which would relieve browser vendors of the responsibility.

webrtcHacks: Can ORTC be used with SDP?

Iñaki: Absolutely. SDP O/A is perfectly doable with ORTC. It would be done at a pure JavaScript level, of course, which means no browser vendor/version dependency, but would require a JS library (similar to JS apps based on jQuery).
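A hedged sketch of what such a JS-level shim could look like: rendering object track descriptions back into SDP m-lines for interop with SDP-based endpoints. The port and payload-type values below are placeholders, not the result of real negotiation, and the function name is invented for illustration.

```javascript
// Render an array of track descriptions into SDP m-lines.
// A real shim would also emit codec, ICE, and DTLS attributes; this only
// shows that SDP generation can live entirely in application JavaScript.
function tracksToMLines(tracks) {
  return tracks
    .map(function (t) {
      // Port 9 (discard) and payload type 111 are placeholder values.
      return 'm=' + t.kind + ' 9 UDP/TLS/RTP/SAVPF 111';
    })
    .join('\r\n');
}

var mLines = tracksToMLines([{ kind: 'audio' }, { kind: 'video' }]);
```

Because this runs in page JavaScript, the SDP dialect can be updated by shipping a new library version rather than waiting for a browser release – the "no browser vendor/version dependency" point above.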

webrtcHacks: How should current WebRTC developers proceed in light of the O/A discussion and your proposal? Is there a way they can minimize the risk of the API changing?

Iñaki: We expect that nothing will change for WebRTC 1.0, so developers should not worry. Anyhow, if ORTC becomes WebRTC 2.0 it can easily provide backwards compatibility, so applications designed on top of the current RTCPeerConnection would keep working. At the same time, developers will find ORTC a richer and more powerful API for designing RTC applications without the SDP O/A constraints.

webrtcHacks: What are the next steps with your proposal?

Iñaki: Officially publishing ORTC related drafts in both the W3C and the IETF RTCWEB Working Group. Please follow and post comments on the project on the ORTC GitHub page or the W3C Object RTC group.

webrtcHacks: Can you provide some sample code for our readers?

Iñaki: The following ORTC example code is similar to the one in the current WebRTC 1.0 spec:
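(The original example referenced here lived in the ORTC GitHub repository. The snippet below is a reconstruction sketch, not a verbatim copy: the RTCConnection and RTCMediaSession class names come from the draft discussed above, but the constructor arguments, method names, and the `signalingChannel` object are assumptions. The API only ever existed in draft form, so treat this as pseudocode rather than runnable browser code.)

```javascript
// Hypothetical ORTC-style flow; compare with RTCPeerConnection plus
// createOffer/setLocalDescription in WebRTC 1.0.

var connection = new RTCConnection();          // transport layer
var session = new RTCMediaSession(connection); // media layer, bound to the transport

navigator.getUserMedia({ audio: true, video: true }, function (stream) {
  // Describe only what we will send -- no offer/answer round trip.
  stream.getTracks().forEach(function (track) {
    session.sendTrack(track);
  });

  // Serialize the local description (a plain object) over any signaling channel.
  signalingChannel.send(JSON.stringify(session.getLocalDescription()));
});

// When the remote peer describes its own tracks, simply start receiving them;
// the counts on each side are free to differ.
signalingChannel.onmessage = function (event) {
  session.receiveTracks(JSON.parse(event.data));
};
```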
