Archive

Recently I started using WebRTC. Cool technology. You can build a web chat in just tens of lines of code, you can take pictures with your webcam and using canvas you can manipulate them. And these are just a few examples of what WebRTC can do for you and your web app.

Tens of lines of code indeed. But not easy lines at all. Even though WebRTC will enable peer-to-peer communication, you still need some server components. These server components are involved in the session initiation.

First there is a ICE/STUN/TURN server that it’s used for a client to discover its public IP address if it is located behind a NAT. Depending on your requirements could not be necessary to build/deploy your own server, but use an already public (and free) existing one – here‘s a list. You can also deploy an open source one like Stuntman.

Then it comes the signaling part, used by two clients to negotiate and start a WebRTC session. There is no standard here and you have a few options.

You can use an XMPP server with a Jingle extension. Developing your own XMPP server (or component) just for this will be an overkill, so you should definitely consider using an existing one. But even installing, configuring and integrating an existing one could be too much if you don’t use it for anything else. But if you already have one in your infrastructure, you could hook up into it. On the client side, you can use an existing XMPP JavaScript library.

You can also use SIP, a protocol much more encountered for VoIP. Like XMPP, SIP it is too much to be used just for WebRTC signaling. If you already have it in your infrastructure, then could be easy to use it. As for SIP, in Java you have two development options.

The high level one is SIP servlets with its most notable implementation, Mobicents. If you develop a non-GPL application, Mobicents commercial license could be quite expensive just for WebRTC signaling.Mobicents is developed on top of Tomcat or JBoss, so, depending on your environment, this could also be a drawback.

The lower level option for SIP is JAIN-SIP, on which even Mobicents is developed on. It is completely free. There you have different protocol options, like TCP, UDP or websockets. And for WebRTC the latter will be more appropriate.

With both options, you’ll still have to develop the server logic and SIP is a pretty cumbersome protocol so prepare yourself for a few headaches.

If you decide for SIP, on the client side, things could be a little bit brighter. There are already JavaScript SIP signaling solutions that you can easily integrate into your web applications. I looked into JsSIP and SIPJS and I ended up using the latter. SIPJS is actually forked from JsSIP, but it encapsulates the intricacies of the protocol better, which makes it a little bit easier to integrate.

Another option is to develop your own signaling protocol using something like websockets. Then you have to develop from scratch both the client and the server side part. Developing the server side part, in my opinion, will be easier than the previous options. You have to practically develop a messaging system, that forwards messages between users, without caring too much about the content. @ServerEndpoint abstracts development in an elegant manner and you can even use your existing authentication system as you can bind a websocket session with the HTTP one.

In the client side, things will be a little bit more complicated, as you will have to develop the entire message flow, the server will just forward your messages to the appropriate peer. This entire flow should be asynchronous, something like a request-response paradigm.

The development difficulty arises here mainly from the different behavior on various browsers. And by various, I mean Firefox and Chrome, the main ones supporting WebRTC.

When it comes to WebRTC you don’t have to code anything for actually streaming, but only for initiating the session. Theoretically, WebRTC session initiation process is as follows. Let’s suppose A wants to talk B. And for the sake of simplicity we will use A signals B with the meaning that A sends a message to B through the signaling channel.
1. A will create and initialize an RTCPeerConnection. Few things involved here: a list of addresses of the ICE/TURN/STUN servers, a local stream to share with B.
The list of ICE servers is passed in the constructor, the local stream is obtained through navigator.getUserMedia() and added afterwards.
2. Using that connection A creates a SDP(session description protocol) offer.
3. A signals B the offer
3. When B receives the SDP offer from A (or even before) will create the RTCPeerConnection like in step 1.
4. B will set the remote session description like RTCPeerConnection.setRemoteDescription(new RTCSessionDescription(SDPOfferFromA)), where SDPOfferFromA is the JSON object received through the signaling channel.
5. B will create a SDP answer.
6. B signals A the answer
7. When A receives the answer, sets the remote description like in step 4, using SDPAnswerFromB.
8. Asynchronously, when a new ICE candidate is discovered and presented through the onicecandidate event of RTCPeerConnection, A (or B) should signal it to B (or A). When the other party receives it, B (or A) will add it using RTCPeerConnection.addIceCandidate(new RTCIceCandidate(candidateDescriptionReceived)).

Again, theoretically. In practice …

First of all, because WebRTC is not final, but in draft state, all the types are prefixed, like mozRTCPeerConnection (in Firefox) or webkitRTCPeerConnection (in Chrome and Opera). But this can be easily fixed.

Then it comes the ICE candidates part. In Firefox the ICE candidates must be gathered before creating an SDP offer or answer. So before creating an offer/answer, check RTCPeerConnection.iceGatheringState and do it only if it is "complete".

There are also few inconsistencies when it comes to the media constraints passed to navigator.getUserMedia().

In the end, I can conclude that WebRTC is really cool and you can build media application in matter of tens or hundreds of lines of code. I guess it could have been designed to be easier to use. Even though there are a lot of opinions stating that the signaling protocol is better left out of the standard, I still think that it will be better if an optional one is included. Or at least a minimal easy to implement API.