Kundan Singh

Tuesday, May 02, 2017

The command line SIP endpoint included in my rtclite project can register with a SIP server to receive SIP calls. For example, you can sign up for a free SIP account at iptel.org, download rtclite and py-audio on OS X, open two terminals, and run the rtclite.app.sip.caller app on one to register, and on the other to place a call, as follows:
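A sketch of the two invocations is shown below. The option names (--user, --register, --to, --port) are illustrative assumptions, not necessarily the actual flags of rtclite.app.sip.caller; check the module's help for the exact interface, and supply your SIP account password via its authentication option.

```shell
# Terminal 1: register with iptel.org and wait for incoming calls
# (flag names are illustrative; see the module help for the real ones)
python -m rtclite.app.sip.caller -v \
    --user sip:myuser@iptel.org --register

# Terminal 2: place a call to the registered user from the same machine,
# listening on a different SIP port, 5094, to avoid a conflict
python -m rtclite.app.sip.caller -v --port 5094 \
    --user sip:anything@my-local-host --to sip:myuser@iptel.org
```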

This will register the first instance as sip:myuser@iptel.org and place a call from the second instance, as sip:anything@my-local-host, to the first instance. The app is configured by default to auto-answer incoming calls. Since both instances are on the same machine, I use a different listening SIP port, 5094, for the second instance. Use the -v option to enable a detailed SIP log. Use Ctrl-C to terminate the call and to unregister.

In this article I describe how to use this command line SIP endpoint to register with Twilio to receive calls. I also describe the main differences between normal SIP registration and Twilio specific SIP registration based call flows.

Register to Twilio from command line SIP endpoint to receive calls

Since my last post, Twilio has added the SIP registration feature. The approach is described on Twilio's website, including the steps for creating a SIP domain/endpoint such as yourname.sip.twilio.com. Once the SIP domain is created, configure the IP access control list as well as SIP credentials for the domain. Using one is enough, but using both provides more protection. Also enable SIP registration for that domain using the SIP credentials. I have put two SIP accounts, myuser1 and myuser2, in my credentials list for the example below.

Create and assign a voice URL TwiML for handling incoming voice calls on that domain. The following example shows how to use the To header as is. Alternatively, you can create programmable server-side scripts to derive the target SIP address.

<Response>
  <Dial timeout="20"><Sip>{{To}}</Sip></Dial>
</Response>

Open two terminals on your Mac OS X as before, and run the caller app instances as follows.
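A sketch of the two invocations follows, using the same illustrative option names as before (the actual flags of rtclite.app.sip.caller may differ; check the module help, and supply the SIP credential passwords via its authentication option):

```shell
# Terminal 1: register myuser2 with the Twilio SIP domain in the us1 zone
python -m rtclite.app.sip.caller -v \
    --user sip:myuser2@yourname.sip.us1.twilio.com --register

# Terminal 2: register as myuser1 on port 5094 and call myuser2
python -m rtclite.app.sip.caller -v --port 5094 \
    --user sip:myuser1@yourname.sip.us1.twilio.com \
    --to sip:myuser2@yourname.sip.us1.twilio.com
```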

The first instance registers as myuser2; the second instance uses myuser1 to call the first instance, myuser2. Twilio's SIP registration is currently enabled only in one zone, hence the specific domain containing us1 must be specified for both the receiver and caller instances. Use the -v option to enable a detailed SIP log. Use Ctrl-C to terminate the call and to unregister. As in the previous post, this uses G.711 audio.

The SIP destination specified in the TwiML could instead be any other SIP address, e.g., on a non-Twilio domain. For example,
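a TwiML like the following (sip:alice@example.com is a placeholder address, not from the original configuration) would dial out to a fixed external SIP destination:

```xml
<Response>
  <Dial timeout="20"><Sip>sip:alice@example.com</Sip></Dial>
</Response>
```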

Differences with regular SIP registration call flow

Media anchored at Twilio: With Twilio, the SIP media path always flows through the Twilio server, unlike the previous iptel.org based example, where the media path is end-to-end. The Twilio call flow works by decoupling the caller and receiver: the caller app connects to the Twilio service, which runs the TwiML specified above in the first call leg, which in turn dials out the SIP destination in the second call leg. Since the SIP destination happens to be one registered with the Twilio service, the service connects to the receiver app. Being in the media path enables the service to inject intermediate TwiML elements such as interactive dialogs or digit collection. On the other hand, it can introduce additional latency on the media path.

Proxy authentication vs. authentication: Unlike the iptel.org service, which challenges the registering app using a 401 response code, the Twilio service challenges using a 407 response code. Although most user agents do not care, and provide the right authorization response when challenged, the semantics of the two are different. A 407 response code indicates that a SIP outbound proxy or B2BUA is challenging any request reaching the service, whereas a 401 indicates a challenge from the serving user agent, such as a SIP registrar.
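For illustration, the two challenges differ in the response code and the header pair used, per RFC 3261 (the realm and nonce values below are placeholders):

```
SIP/2.0 401 Unauthorized
WWW-Authenticate: Digest realm="iptel.org", nonce="..."
  (the client retries the REGISTER with an Authorization header)

SIP/2.0 407 Proxy Authentication Required
Proxy-Authenticate: Digest realm="sip.twilio.com", nonce="..."
  (the client retries the REGISTER with a Proxy-Authorization header)
```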

Registered endpoint is not reachable directly: On the iptel.org service, when an endpoint registers as a user, myuser, any other SIP endpoint can reach that user via that SIP service. On Twilio, when an endpoint registers as a user, myuser, it can only be reached by the Twilio service, e.g., using the example TwiML shown above. If a SIP endpoint attempts to reach, say, sip:myuser@yourname.sip.us1.twilio.com directly via SIP, it will reach the Twilio service, which will not reach the registered user, myuser, unless the associated voice URL TwiML is configured to dial out to that SIP address. For example, dialing out to myuser could create a SIP call to anotheruser, or join a conference, or queue the call, depending on the TwiML. This is more flexible, but not the same as the standard SIP registration based call flow.

If you happen to try out SIP registration and call flows using my rtclite project, please drop me a note. If you face any issues, please send the SIP log using the -v option.

Saturday, December 31, 2016

I did my first project on real-time communications in 1997. In the past two decades I have done numerous projects in this general area, in particular related to voice over IP (VoIP), multimedia and web communication. As 2016 comes to an end, I reminisce about my 20-year journey and attempt to capture the gist of my projects. How did my project themes and ideas evolve over time? (I use initials instead of names for the people I worked with. These people shaped my journey, gave directions and helped clear my path.)

1997-99 The door opened - "Age of ITU-T"

As part of a semester-long practical training in the final year of my bachelor of engineering curriculum, I worked at Motorola India. Working with another student, SKP, under the mentorship of SA and SA, we built our first PC-based video phone. They already had two ongoing video call projects, one with H.324 for modem and another with H.320 for ISDN. They wanted to dive into IP-based H.323 video calls.

The project had quite a successful demonstration. Both of us, final year students, had job offers from elsewhere. SA offered to keep us at the company to continue and improve our project, and we stayed there. The previous project was further enhanced to include many other features and voice codecs. We also took part in ITU-T interoperability events.

While at Motorola, I did a few other projects too. During this time I also worked with and was inspired by ST and HS, who taught me that code can be beautiful! There were ongoing projects on H.320, H.323 and H.324, so it made sense to create interworking functions too. I mentored two student projects. One was on an H.323-H.324 gateway, where a PC with both a modem and ethernet could be used as a protocol translator. Another was about porting an existing H.320 system onto a real-time operating system, pSOS. Developing on embedded systems had its own difficulties related to debugging and testing. In collaboration with another person in the QA department, I created a framework to help developers.

Assert assistant: A framework and supporting C/C++ libraries and macros, that enabled quick debugging of software. The idea is to automatically inject trace statements in function entry and exit points, and to dynamically enable or disable them in various modules. Keywords: debugging, testing, framework, embedded system, C/C++, macros.

After about two years at Motorola, I decided to leave to pursue higher studies. During the transition time, I worked on the system design of a personal project, a multimedia communication developer's kit. The core idea was to create and abstract basic multimedia communication elements such as camera, microphone, speaker, display and network, and to create a drag-and-drop user interface that allows creating real time applications by interconnecting them. I did not implement the idea at that time, but got a chance to implement the core concept in another project, aRtisy, fifteen years later.

As I started applying for higher studies in the US, SKP told me about the emerging work on SIP at Columbia's CS department. SIP was invented as an alternative to H.323, and was similar to other Internet style protocols. It looked innovative. I applied there as well, and was accepted by HGS in his research group named IRT (Internet Real Time) laboratory.

1999-2003 Rapid growth in learning - "Age of SIP"

My first project at Columbia was to create a gateway between H.323 and SIP. With my prior H.323 experience and the relative simplicity of the emerging SIP, it was a perfect fit.

I did a successful demonstration of a voice call between the two protocols, using our locally developed SIP e*phone and Microsoft's H.323 NetMeeting. HGS had another student working on that topic before I joined, but without much progress. My work became an instant hit within the lab as well as outside. It was the first such successful attempt at interoperability between these two competing protocols. It was also quite complete in its translation of incompatible concepts, e.g., fast/slow-start vs. three-way handshake, logical channels and capability negotiations vs. offer-answer. I co-wrote academic papers [pdf] and Internet drafts [pdf], and did presentations [ppt] at the Voice on the Net (VON) conference on this topic. In collaboration with a bunch of other folks, we created the H.323-SIP IWF effort [rfc] in the IETF. The software was further refined, productized, and sold [link] by a Columbia spin-off named SIPquest, later renamed First Hand Technologies [link].

I built the SIP side of the gateway from our ongoing SIP server project named sipd. I extracted the SIP parsing, formatting and transaction related code, added SIP dialog and other user agent capabilities, and created a higher-layer reusable library. This was later used in a number of other projects at Columbia whenever we needed SIP user agent functionality.

Furthermore, I wrote a platform abstraction layer, so that the multi-threading and socket interfaces could be cross-platform. Many of my earlier projects were built to run on both Windows and Unix (Solaris, Linux, FreeBSD, Tru64) systems. I took HGS's class on Internet Systems Programming [link]. Among other things, it dealt with various Unix make/gcc quirks, system calls and interprocess communication. This further motivated and inspired me to improve the build system and cross-platform capabilities. During this time I worked closely with JL, another student, especially on the SIP stack and the SIP server.

Over the next year, while completing my MS at Columbia, I built a few more SIP applications and services. I also mentored some other students in their projects with HGS. The SIP voicemail system, built using a distributed and scalable SIP+RTSP architecture, and the SIP multi-party conferencing server, with voice mixing and later video forwarding capabilities, were particularly popular.

Sipum - SIP/RTSP unified messaging: A modular Internet multimedia mail system using SIP and RTSP, to allow message recording and access via any Internet connected device, using off-the-shelf streaming software. The media path goes directly from the caller to the media server, not via the voice mail server. It also works with access from the PSTN via a gateway. The access user interface was written using Tcl. Keywords: Voicemail, unified messaging, SIP, RTSP, C/C++, Tcl, Win32, Unix. [link,pdf,ppt]

During this time I worked closely with another student, XW. I often used his SIP user agent for testing and demonstrating my server side systems. Another student, TK, helped with parts of the voicemail user interface implementation. The conference server was initially implemented as part of a class project jointly with GN for the Advanced Internet Services [link] class taught by HGS. Another student, SN, helped with various enhancements such as IPv6, TLS and database scalability in many of my projects. I also co-authored papers on the topics of unified messaging [pdf] and conferencing [pdf] for presentations at some conferences and workshops.

For a class project in Web enHanced Information Management [link] taught by GK, I built my first web-based phone.

Hello2web - web based IP telephony client: A Java applet in a web page that allows sending and receiving voice calls from within the browser. Applets did not have a voice (microphone) interface at the time, hence a plugin was created for Solaris/Netscape to delegate the real-time voice capture and playback functionality. It uses a backend server to gateway with an actual SIP phone, and to perform encoding/decoding and RTP transport of the voice path. Keywords: Java, applet, plugin, click-to-call, VoIP, web.

I enjoyed the kinds of projects I was doing. By the time I was close to finishing my master's degree, I decided to enroll in the PhD program with HGS. I collaborated with other students, SN, JL, WJ, and XW to create an integrated VoIP architecture, named CINEMA, Columbia InterNet Extensible Multimedia Architecture.

CINEMA - Columbia InterNet Extensible Multimedia Architecture: An IP telephony architecture with a SIP server at the core, together with a bunch of other services such as voice mail, conferencing, interactive voice response, media server, PSTN gateways, H.323 translator, etc., and various SIP user agents, media players and phone devices at the edges. It has the ability to replace the traditional PBX based communication system in departments, organizations and university campuses. [link,link,poster,ppt,spec,paper]

Over the next year or so, the architecture became quite popular, both as our internal test bed [link] and for external demonstrations [link] and publications [paper,paper,paper]. The core idea was to keep the system scalable and robust by keeping Internet style signaling, and by distributing various tasks to different servers. Furthermore, additional synchronous and asynchronous collaboration services were implemented to create a comprehensive multi-platform collaboration system [paper]. We also deployed the system within the Computer Science department [paper]. We assisted with the university-wide deployment as part of the Internet2 effort [link] to adopt SIP/VoIP. For some time, during the early days, CINEMA was available as open source. Later, due to business needs of our sponsoring organization, it was made closed source [faq].

The project couldn't have been successful without some of the students who worked collaboratively at that time. XW took care of the SIP user agent. JL oversaw the SIP server, call routing and software distribution. WJ worked on PSTN interoperability and PBX interconnection. SN did many things, including TLS, IPv6, a faster database, presence and performance measurement. I was responsible for many of the server-side components, including voicemail, conferencing, interworking, voice dialogs, the media server, Win32 portability, the SIP server farm, and so on.

By this time we already had many SIP services. We also had a gateway to connect between SIP/VoIP and our department PBX. One service missing was interactive voice dialog, e.g., to access the voicemail or conferencing system using an authentication PIN. I took it upon myself to implement such an IVR system using SIP.

The project was done jointly with another student, AN, who helped in implementing some pieces of the server. Later, we published our work in a conference [pdf]. I also mentored other students in creating various VoiceXML apps on top of this service.

Mentoring other project students played a big part in my graduate experience. During the course of my MS and PhD, I mentored more than 35 student projects. Click here to see the list of student projects. Many of these projects eventually became part of our CINEMA test bed, and some of them got used in my own research agenda in various ways. Most of these projects enhanced our test bed, either to incorporate new ways of collaboration or to improve the servers, e.g., for codec support. Some of these projects created entirely new systems unrelated to the test bed. A significant number of the projects dealt with SIP and RTP for voice and video. Some projects were related to media streaming based on RTSP, and some others to improving the user interface of our system or adding new forms of collaboration in our test bed.

During the summer of 2002, I did an internship at Bell Labs, on an entirely different topic: IP mobility. The core idea was to allow NATed devices to change their IP addresses without breaking the end-to-end transport connections.

MobileNAT - mobility across heterogeneous address spaces: Decouple the identity and routing aspects of IP addressing in the endpoint by using two separate IP addresses, virtual vs. real. Implement four pieces: (a) a Windows driver to intercept certain IP messages and apply NAT, (b) client host application to act as a DHCP proxy to decouple the exposed and real IP addresses, (c) mobility manager in the network to manage changes in the NAT mappings, and (d) enhancements to the DHCP server to distribute virtual IP addresses in addition to the real IP addresses. Keywords: mobility, NAT, DHCP, iptables, Windows drivers, Linux. [paper,paper,ppt]

I successfully demonstrated connection persistence of telnet and real time streaming using this novel technique. The internship was quite successful, with a couple of academic publications and a patent. It was also instrumental in getting me a job offer in the same department after I graduated.

At Columbia, by this time, we had a simple web interface to configure user accounts, contacts, etc. HGS created the initial user interface. Later, I improved and expanded it significantly using Tcl-based CGI scripts [link]. The web user interface included many advanced features, for example, the ability to start/stop the servers, configure database entries, or launch backup services. Our system had evolved into a complete VoIP and collaboration platform. I had presented the architecture and demonstrations numerous times, and had co-authored a few academic papers. As my projects evolved from simple "can you hear me?" demonstrations to more mature systems, I started evaluating system performance. Measuring the quality of the voice path, server health and scalability constraints was very important. Equally important was applying the architecture to create a complete and comprehensive collaboration framework.

2004-2006 Preparing to take on the world - "Age of the Scale"

I started my final stretch towards completing my PhD, which involved the thesis proposal and defense. I realized that all my prior projects on various servers and systems would only form a minor subset of my thesis. I would need a lot of material on performance evaluation and improvement. I would need to create systems that significantly change the state of the art. This resulted in a dedicated and focussed effort on two areas: (1) a scalable and robust server farm architecture, and (2) peer-to-peer telephony to avoid using expensive servers.

For the first part, I explored both vertical and horizontal scalability and robustness of SIP servers. HGS proposed a two-stage server farm for SIP servers. I implemented it using software derived from our SIP proxy. I showed that the performance is linear in the number of servers, indicating very good horizontal scalability of the architecture. Furthermore, I applied the same basic principle of two-stage routing within a single application, and compared the effect of multi-threading and multi-processing, where different stages reside in different threads or processes.

I also wrote academic papers [pdf] on this topic. The SIP server scalability and architecture formed a big part of my thesis. However, server-based systems inherently suffered from scalability and robustness limits, particularly in disaster scenarios with limited Internet connectivity.

By then, Skype was quite popular. Another student, SAB, had spent some time understanding its behavior. I had spent some time understanding the various structured P2P networks or Distributed Hash Tables (DHTs). In one of the weekly meetings with HGS, we came up with the idea of using SIP messages to create a peer-to-peer network, and then using that network to route SIP calls.

SIPpeer - SIP based P2P IP telephony client adaptor: A piece of SIP software that runs on client host or local network, and turns existing SIP user agents and phones into P2P-SIP system. It creates a self-organizing, scalable and robust DHT-based P2P network, and uses it for various SIP functions such as registration, call routing, offline message delivery, and even multi-party conference. Keywords: P2P, SIP, DHT, C/C++. [pdf]

Using an external DHT as a SIP location service: Modify a SIP user agent (SIPc) to use an external DHT (OpenDHT) as the location service to register and lookup users. Explore data vs service model. Securely store user location to avoid misuse and to enable authentication. Extend the system for presence and offline message storage. Keywords: P2P, SIP, DHT, Tcl. [pdf,spec]

The P2P-SIP effort got a lot of hype within the IETF. A working group was formed to further explore the architecture and implementation. While I graduated and moved on to other projects, HGS and SAB continued the effort both within the lab and externally.

In 2006, I finished the final steps of my PhD, defending my thesis by mid-year.

I started my professional journey implementing voice and video features for telephony and conferencing. I continued on to implementing many different types of communication and collaboration services. By the end of my PhD, I had found my new love for decentralized peer-to-peer systems. The VoIP industry was moving towards more tight control and managed service offerings. But I wanted to implement peer-to-peer communication and collaboration systems. I wanted to build loosely coupled telephony components, without control from a single application or service provider. I wanted to take the path less travelled!

2006-2011 Jumping on a roller coaster - "Age of Startups"

I joined Bell Labs immediately after completing my research at Columbia. My first project was on the design of a serverless mobile gaming platform. It was loosely based on peer-to-peer self-organizing concepts, but not quite the same. However, I got pulled into another project about implementing a Java-based user interface for Lucent's attack detection system. As I finished this one, and was working on the serverless mobile gaming platform, I got a call to join Adobe Systems. A small team of people, in an internal startup mode, were to add various VoIP capabilities to Adobe Flash Player. This was to be based on SIP/RTP, to enable a point-to-point real-time media path. The goal was to make it available to all the web developers to use and innovate [link]. Adobe already had a plugin based proprietary enterprise collaboration application [link]. But they wanted to create a standards-based VoIP function directly in the Flash Player. Instead of getting involved as yet another VoIP vendor or provider, the goal was to enable other developers to easily create VoIP systems. A weapon supplier always profits no matter who wins the war!

HS, who knew me, was already working there. BS, who hired me, presented a vision of how some of my research ideas on SIP and P2P could integrate with Flash Player and have a tremendous impact. At Adobe, I worked closely with SC, who taught me that doing it right is more important than getting it to work! We created various working prototypes and pre-beta implementations. The project at that time was called Flash Voice, later renamed Adobe Pacifica.

I contributed to the RTP-based media stack in Flash Player. I created a SIP stack in ActionScript. I created several working demonstrations of use cases, e.g., Flash-based click-to-call, a Flash-based soft phone to send and receive SIP calls, and a Flash-based multi-protocol (SIP+XMPP) communicator for instant messaging, presence and voice calls. Furthermore, I created a DHT/P2P implementation in ActionScript. It built a self organizing peer-to-peer network of Flash Player instances, and allowed them to discover and reach each other for VoIP, presence and messaging. Some projects were in C/C++, but some others were in ActionScript. As I learned ActionScript, I really started to appreciate the framework, and more importantly, the development environment.

As the team started to grow, and competing business interests interfered, it became clear that the original goal of the project would not be met. People changed. Directions changed. One such change was that Flash voice would only be available in AIR apps, not in Flash Player, significantly limiting its potential. I decided to leave. There were some public news about the project after I had left. From what I heard later, the project was grounded before takeoff.

After that, I spent some time without a real job, working on open source projects and learning Python. My original passion for P2P systems had not died yet. I created a Python implementation of a SIP stack. Thus, the 39peers project [link,link] was born.

Although the project was intended for P2P-SIP, it had implementations of the complete SIP and related standards. I wrote a very extended tutorial [web,pdf] on how to implement SIP in Python. It was a way to document my project source code in a bottom-up manner. As I got pulled into other real jobs, I had to abandon my 130+ page book, and it still remains incomplete. However, the chapters on how to write fully functional SIP, SDP and RTP stacks are complete and available there.

I contributed to another project, Python based RTMP server, and created a SIP-RTMP gateway. Later on, this project became quite popular. It was a quick and easy way to bridge Flash-based applications with VoIP service providers. The project itself was not tied to a service provider. It started as a voice gateway, and later, was extended to include video.

Siprtmp - SIP-RTMP gateway for voice and video: Using Adobe's Flash Player and Python SIP code, implement a PC-to-phone calling system which should allow both inbound and outbound calls between web browser and SIP. Allow any third-party to build user interface for the web-based soft-phone. Keywords: SIP, RTMP, Python, ActionScript. [link,link]

Those few months without a job were when I learned rapid prototyping with Python. I was very productive in creating working code. Over the years, the Python-based SIP and RTMP projects became quite mature. They included support for tunneling, a peer-to-peer mode, a client library, cloud hosting, better performance and a range of sample applications.

When NT of Tokbox approached me and described their work, it sounded very exciting. The goal was to create a Flash based video call and messaging system. They wanted to allow phone users to connect too. My background in VoIP, and more recently in Flash and ActionScript, was a perfect match. And I liked the enthusiasm of the leadership, NT and RH. After joining the company, within a month, I was able to show a successful demonstration of a web to phone call using my new gateway written in Java. It used the open source Red5 for the RTMP side, and the open source NIST SIP stack for the SIP/RTP side. However, it took them a really long time to decide to bring that feature to the customers.

After that initial work, I got absorbed into the main line development of the video calling system. I restructured the system from old Flash to newer Flex. I replaced the ad hoc signaling with robust XMPP. I made several improvements in the video call and conferencing application. I wrote most of the code using Flex, in MXML and ActionScript. RH, an amazing programmer, was always striving for better and cleaner software. In a small team of talented developers, notably CW, GG, JF, BS and BR, we created many cool features. Besides those, I also worked on improvements to the voice and video quality, e.g., with respect to echo cancellation and network quality monitoring. Later on, I got involved in reviving the Flash to SIP calling feature, and integrating the emerging RTMFP based peer-to-peer support of Flash Player.

After about one year, the leadership changed. As the tide turned, it became clear that the boat would steer in another direction. After I separated, I decided to focus again on my open source effort. I created a web forum for project ideas and to help students with projects [link]. I continued my work on the previous SIP and RTMP based open source projects, and also created a few more open source projects related to web and Flash-based communication. Click here to see all my open source projects. These projects were initially hosted on Google Code, but were moved to GitHub after Google decided to shut down its code hosting website.

As I dived deeper into web services, and integration of web and communication, I learned various ways of doing the same thing - and the ability to distinguish the right from the wrong. I enjoyed the gratification of implementing non-trivial software.

Next, I went to work for 6Connex, a subsidiary of the more established DesignReactor. The project initially looked similar to my previous work, but was actually quite different. LP wanted to quickly create a web-based video conference and messaging framework that worked with virtual events. He wanted to sell it to existing enterprise software vendors. SB designed the user experience. The Flash based client was implemented by RN, a talented programmer. The server side was implemented in Java, to tie in nicely with other existing enterprise software. I wrote a lot of server side code. In the new role, I did less coding, and more software design and architecture of the system. Nevertheless, I did implement a few pieces here and there. After about a year there, I felt it was time for me to move on. In the long term, I wanted to be in the RnD field, and working in a narrowly focussed startup was not the path leading to that desire. I was without a job again for a few months.

During the free time, I got a chance to critically evaluate various programming approaches to create Flash based video conferencing. I concluded that many different types of web video conferencing and messaging applications can be built using only one Flash widget - a video box that represents a publish or play video stream.

Flash-VideoIO - reusable generic Flash application to record or play: A reusable generic Flash application to record and play live audio and video content. It can be used for a variety of use cases in audio/video communication, e.g., live camera view, recording of multimedia messages, playing video files from a web server or via streaming, and live video call and conference using client-server as well as peer-to-peer technology. [link,paper,slides]

This project got used in many of my other projects demonstrating a wide range of use cases. For example, multiparty conference web app [link,video], two party video call desktop application [link,video], a video chat roulette type app [link], a Facebook app for video call and broadcast, an online video office integrated with Google Chat [link], an online platform to connect to experts [link], and implementation of SIP user agent in JavaScript [link,video]. Initially, the project was closed source, but later was made open source.

During this time, I got a chance to collaborate with CD, who hosted VoIP conference and expo. I did a couple of presentations at that conference [pdf,pdf,link]. In collaboration with CD, HS, WW and AJ, I also created the voice and video on web (vvow) project at IIT.

Although this project started out as a student project, I ended up doing most of the implementation. Initially, it used Flash-VideoIO, and it was later expanded to use plugin-free WebRTC. Later, a bunch of us jointly published the work in a conference [pdf, ppt]. The paper compared various ideas on how web based communication can evolve, and described how my Flash based multiparty collaboration used web-oriented RESTful APIs.

As I was racing on in my open source work, I got called by Twilio. I liked JL and EC, but I also wanted to continue my open source and research activities, so I decided to work part time for Twilio, focussing on its RnD type activities. I spent the first couple of months creating prototype mobile apps for Android and iOS, which used SIP/RTP between the mobile device and the cloud service, with pjsip as the client side SIP stack. I also created an implementation of a SIP-RTMP gateway, largely based on my open source Python project, but re-written in Java, using Flazr for the RTMP side and NIST SIP for the SIP/SDP side. This enabled a web browser to create a voice path to their cloud service. Both these projects eventually were productized and made available to Twilio developers. Later, I did some voice quality measurements and recommendations for improvement, and enhanced the web client and gateway to add a Flash-based H.264 video path from the browser to the cloud service. At Twilio, I worked closely with EC, JB and NV on separate projects.

During that time, I also did consulting for a bunch of other companies - mostly startups. I had created my own consulting business [link]. The goal of the company was to support my open source projects, and to enable other companies to use them in their products and services. For example, I helped Bittorrent, working with JK, with my Python-based open source RTMP code, and with custom packetization of RTMP over an unreliable transport.

I worked closely with MT of Emergent Communications, a startup specializing in providing SIP-based NG911 emergency call systems. In particular, I wrote a web-based call taker terminal using Flash [link,video], a Python-based public safety answering point (PSAP) director [link], and a Python-based location to service translation (LoST) client. The software enabled a phone or SIP caller to deliver emergency calls via voice, video, instant messaging or real-time text. The server included a conference server and recorder, and supported various NG911 standards. I used my previous open source projects to implement the server side. I also helped in interoperability testing of the system with other vendors. I hosted the service on Amazon EC2 for their trials. This was the first web-based NG911 call taker system that I knew of.

Finally, I continued my open source work. I created two new projects: first, Py-WebRTC [link], to expose various objects and methods of the voice and video engine of Google's WebRTC stack in Python, and second, SIP-JS. The first one never got completed. The goal of the second, SIP-JS, was to implement a complete SIP/SDP stack in JavaScript. With a related project, Flash-Network, it allowed RTP transport and media capture for such web-based SIP user agents.

It started as a demonstration of SIP over WebSocket. The goal was to move feature-rich SIP from the server side to the client for web-based telephony applications. Unlike the SIP-RTMP gateway, where the SIP stack runs in the server or gateway, the SIP-JS approach moved it to the browser. Things that were unavailable in the browser, such as media capture and transport, were implemented using Flash Player. Later, the project matured, with WebRTC media stack integration and an Android app.

In the last five years, I had quite a roller coaster ride - hopping from one company to another, doing many great projects one after another, not being satisfied by a single project, wanting to do so many things, and lacking a systematic research direction. This was quite in contrast with the next four years, which brought more stability and a longer term research and development vision, but still a wide range of projects.

2012-2016 Cruising in the distant seas - "Age of WebRTC"

When VK told me about an opening in his department at Avaya Labs, I eagerly applied. I had known some folks there, including XW and AJ. XW had high regard for the department and the work he did there, and AJ was quite happy as well. In my presentation to the labs folks, I talked about the evolution of video conferencing, my contributions to Flash-based applications in the browser, and the emergence of web-based communication [pdf]. Thus, in a way, I set my general research direction in the area of web-based voice and video collaboration and endpoint driven systems.

At Avaya, I worked on several RnD projects with three common connecting themes: (1) endpoint driven systems where most of the application logic runs in the endpoint [web], (2) separation of the application logic from the user data so that the end user or her organization can control, manage and store the data independent of a single application, and (3) exploration and application of the emerging WebRTC technology to web, mobile and cloud systems for various enterprise use cases. However, one of my first assignments was actually not on WebRTC. It was based on my prior open source work: fixing the H.264 video path conversion between SIP/RTP and Flash RTMP systems for their one-touch video system [link].

My employer's business was largely in the enterprise market. I decided to analyze the threats and propose recommendations: how can enterprises adopt the emerging WebRTC technology? How does WebRTC affect enterprises? What are the novel use cases that were previously not possible? And how can enterprises deal with these use cases? As part of this effort, I refined my previous SIP in JavaScript open source work [link]. More specifically, I used WebRTC for such an endpoint driven SIP system. Furthermore, I proposed some disruptive enterprise applications enabled by WebRTC and HTML5 technologies. Some of these showed existing use cases of call, conference, messaging and presence. Many others showed non-trivial ways of collaboration such as virtual presence, video presence, web annotations, digital trail, contextual collaboration, and so on.

In collaboration with JY and AJ, I presented threats and potential solutions for traversing WebRTC flows through enterprise firewalls. We identified problems and proposed solutions on how enterprise policies such as authentication and recording can be applied to WebRTC flows. The work got published and presented in a conference and a journal [paper,details,slides,video].

Secure edge - apply enterprise policies to WebRTC on any website: A system of media relay, firewall and browser extension to apply enterprise policies to WebRTC flows - enable only authorized flows on third-party websites using the user's enterprise credentials, record unencrypted media for all flows, restrict bandwidth or media types, or hide private IP addresses for such flows. [paper,details,slides,notes,video]

With the emergence of WebSocket and WebRTC technologies, many web communication apps emerged. There was a general tendency among developers to create custom messaging on top of WebSocket to enable WebRTC signaling. This created a fragmented world of many walled garden applications - each locking in the user data, and hindering independent innovation. I extended and refined my resource oriented software architecture - creating a very lightweight, robust and scalable Python based resource server, and a generic, easy to use client-server API for shared data access and event notification. I proposed, and later implemented, numerous complex endpoint driven communicating applications per this architecture. Many of the projects listed below were covered and described in my co-authored conference papers - one on building communicating web applications leveraging endpoints and cloud resource service [paper,slides,poster,video], and another on a private overlay of enterprise social data and interactions in the public web context [paper,slides,video].
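The shared data access and event notification API can be sketched in a much simplified, in-memory form as follows. All the names here (ResourceStore, put, get, on) are illustrative placeholders, not the actual API of my resource server:

```python
# A minimal in-memory sketch of a resource-oriented API: hierarchical
# resources identified by paths, with change notifications to subscribers.

class ResourceStore:
    def __init__(self):
        self._data = {}   # resource path -> current value
        self._subs = {}   # resource path -> list of callbacks

    def on(self, path, callback):
        """Subscribe to change notifications on a resource path."""
        self._subs.setdefault(path, []).append(callback)

    def put(self, path, value):
        """Create or update a resource, notifying all subscribers."""
        self._data[path] = value
        for callback in self._subs.get(path, []):
            callback(path, value)

    def get(self, path):
        """Read the current value of a resource, or None if absent."""
        return self._data.get(path)


# Example: two endpoints share conference membership state.
store = ResourceStore()
events = []
store.on('/conf/room1', lambda path, value: events.append(value))
store.put('/conf/room1', {'members': ['alice']})
print(store.get('/conf/room1'), events)
```

A real deployment would expose the same put/get/subscribe operations over HTTP and WebSocket, with persistence and access control, so that all the application logic can stay in the endpoints.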

Living-Content: A private overlay of the enterprise digital trail on the public web. Combine web annotations, virtual presence, ad hoc conversation, co-browsing and client-side application mash ups for many enterprise use cases. An endpoint driven system to leave a digital trail of employees' interactions on, and in the context of, the public web, while keeping the data private - visible to employees but not outside.

Enterprise personal wall: A context sensitive user's personal wall for social sharing within an organization. Includes automatically populated as well as user generated content. Changes appearance depending on how or where it is embedded. Ability to initiate interaction or contact request using digital visiting card.

With living content, you could edit or annotate any web page, see other employees' annotations, see who else was viewing the page at that instant or browsing the website, initiate ad hoc communication with them, and see the past conversation history around that web page or site. The text and drawing annotations, and the interaction history, could allow an organization to keep a digital trail of the social interactions of its employees. It also integrated with our in-house corporate directory as well as a third-party social directory. For example, it added additional data and presented a presence and click-to-call button on fellow employees' LinkedIn profiles when viewing them from within the enterprise network. A browser extension enabled various client side changes in the web page without help from those third-party websites. I also initiated a project to unify access to various social and enterprise data from sources including the social directory and the internal mail system. Thus, a new authorized web or mobile application could easily and quickly access and update a user's social data.

aRtisy developer platform: A platform to quickly create communicating web apps. An app editor with drag and drop for building an app by connecting various widgets, and an assortment of pre-built widgets for common communication and collaboration tasks, e.g., phone call, video publisher, conference state, automatic call distribution, call queue, click-to-call, or text-chat.

Video recorder plugin: A browser plugin (not extension) to record audio and video from the webcam to an MP4 or wave file for video messaging. Supports NPAPI and ActiveX, and popular browsers, Chrome, Firefox, Internet Explorer and Safari. Flexible JavaScript API to control and monitor the plugin behavior, and to exchange the recorded file. Ability to selectively enable or disable the plugin only on certain websites.

Vclick - endpoint driven enterprise WebRTC: A full-featured, purely web based audio, video and text conferencing application. Implements popular collaboration features such as screen sharing, white board, shared notepad, etc., as an endpoint driven system, using the resource oriented software architecture. Uses a browser extension to enable click-to-call and presence in our corporate directory website. Extends to mobile, and has a version that does not depend on the browser extension. Has cloud hosting with appropriate security.

Always-on video presence: A web based application for distributed teams to stay in touch during office hours. Shows periodic snapshots of people in a web room. Allows initiating or joining a conversation in one click. Goal is to replicate the behavior of people working in offices of a building, but using video presence, and still remain non-intrusive when possible.

The Vclick project was a joint work with JY and AJ, and was actually quite popular in our internal demonstrations. It did many things differently, e.g., authentication, call initiation, separation of call intent from session negotiation, unidirectional peer connection, and so on. It changed the common perception of how video call and conferencing could work in the emerging web/mobile era. I integrated it with many other existing systems within the company. After several failed attempts to bring it out as an experimental product, we eventually got permission to publish and present our work in conferences - one paper on how to use browser extensions to facilitate enterprise collaboration [paper,details,slides,notes,video1,video2], and another on the project motivation and system implementation [paper,slides,notes,video].

As part of our cloud hosting trials, I learned many things about cloud software and services, and implemented many new security features. I implemented a portal to host other similar web applications. Later, in collaboration with JB, we created a more robust and secure cloud portal for customer trials of this and other emerging systems. We also published our work in a conference [paper,slides,notes,video]. Continuing the collaboration with JB, I implemented a few other software pieces: team spaces for mobile; annotations and interactions in team spaces; and a video wall. In particular, I created a mobile app for his connected spaces project - a team space system that enables sharing of documents and other editable contents on the web with other team members. I integrated parts of my previous projects to enable impromptu audio/video communication on a shared document, and to create annotations as an overlay in the context of that document. The goal was to embed communication within the existing context of what the user is doing, instead of requiring her to launch a communication app outside her context. All our joint projects were hosted on the cloud for internal customer trials, and were available on both desktop browsers and mobile.

As the demand for mobile versions of my projects grew, I started exploring various options. Cordova is a cross-platform development tool that converts web code to mobile apps. Since many of my projects were already HTML5 and Chrome compatible, it made sense to use Cordova, and particularly the Chrome Cordova Apps framework and tools. I converted many of my existing web applications to Android, and some to iOS apps. I also created some new mobile oriented projects. In particular, I built a mobile client app to connect to Avaya IP Office over voice and video, an endpoint driven multi-party conference logic for a client-server media path, a phone dialer to send outbound voice calls via Avaya Breeze, a server-less video phone to discover and connect with others in the local area network, and so on. Each of these cross platform apps could work as a web app, an installed desktop app, as well as an installed mobile app. Jointly with JB, I published and presented our bag of tricks and findings in a conference [paper,slides,notes,video].

There were several communication apps, many based on WebRTC, internally as well as in the public domain. With numerous voice and video applications, each with its own walled garden, it became evident that we needed something to unify the diversity. Past attempts at server side translation or multi-protocol clients either failed or did not work well. In collaboration with VK, LP and others, I created another app, and its supporting service, for managing the user's popular contacts. It enabled the user to reach a contact quickly irrespective of which application the contact was using. This was also an endpoint driven system, based on the resource oriented software architecture.

Strata Top9: An app to quickly connect with your popular contacts on whichever application they are on. Includes built-in WebRTC clients for some of our in-house systems - Vclick, IP Office, media server, Scopia, Breeze, Messaging, and so on. Uses modern HTML5 technologies and cloud hosted services.

The project became quite popular in our internal demonstrations. I also published and presented the project motivation and system architecture in a conference [paper,slides,notes,video]. It included ideas on dynamic contacts, the ability to derive the right reachability address, an endpoint driven software architecture, and many other non-trivial design decisions. The app was also written in HTML5, and converted to mobile using Cordova. There were a few other projects that I worked on - video wall, media-as-a-service and telemedicine collaboration - but, due to the lack of any publications on those topics, I decided not to describe them here.

While building these numerous applications, I encountered several novel concepts and challenges, and innovative ways to solve them. For example, how to create animations using CSS transitions? How to design for mobile and desktop alike? What kinds of asynchronous programming model make sense? And how can iframes be used to create web components? I created a set of best practices that helped me write cross platform and mobile compatible software in vanilla JavaScript, i.e., without using JavaScript frameworks such as jQuery or Angular. I extensively used CSS3 for various animations and graphics. I relied on iframe based components or modules, instead of the bulky and slow single page application model. I also shared my bag of tricks internally and externally with other researchers. My two publications [first,second] cover many of these tips and tricks.

Although I enjoyed working on these exciting projects, my employer struggled to meet its business goals. When it was time to focus on those goals, long term research became an overhead that could be avoided. Subsequently, when a good incentive was offered to leave or be reassigned, I decided to leave. It was hard for me to abandon so many great projects, and to see my software vanish. Luckily, being in the research organization had allowed me to publish and present many of my projects in academic conferences, and thus brag about them. A bunch of us who had enjoyed working together decided to continue doing so. And thus, Koopid [link], a new startup, was formed!

As I transitioned from one job to another, I decided to spend some of my time on other open source activities. In particular, I consolidated various project repositories related to real-time communication protocols and systems into a single project [link,blog]. The goal was to create lightweight implementations of various protocols and applications in Python. I already had many of the pieces in my previous open source projects. I added a few more modules and applications, e.g., the ability to make phone calls from the command line, to connect to the Twilio voice path from the command line, and to bootstrap web apps using a lightweight notification service. I did lots and lots of refactoring!

The end of the year is usually the time when I reflect on my accomplishments, shortcomings and future goals. The end of 2016 is particularly special, as it marks my 20 years in the area of real-time communications. So I decided to reflect on my journey of two decades. And thank you! You are my patient audience, having read all the way to the end.

2017-

(these pages are waiting to be filled with many more exciting projects)

Thursday, December 29, 2016

A lot of my work in the past decade has focussed on endpoint driven systems, e.g., peer-to-peer Internet telephony (P2P-SIP) [1,2] for inherent scalability/robustness, Rich Internet Applications for web video conferencing [3,4], and more recently, resource-based software architecture [5,6]. In this article, I emphasize the importance of such systems, and differentiate them from other system architectures in the context of real-time communication.

Endpoint driven systems are those where most, if not all, of the application logic runs in the endpoint. Such an endpoint may use external services or other endpoints, e.g., for storing persistent data, redundancy, or traversal through restricted networks. If the application logic such as conference membership or chat interaction is abstracted as a program function that works on some data or interacts with other functions, where does that function run? Does it run in the endpoint or some server?

Client-server systems are examples of distributed systems that have clients connecting to servers over the network for accessing some data or service. Depending on the application, such systems can be endpoint driven, e.g., emerging HTML5 applications often use a web server just to host the web files, but perform much of the user interaction logic in the browser client. A client-server system has different roles for the two communicating entities. This is unlike a pure peer-to-peer system where every client can also act as a server when needed, and collectively they serve each other in the application.

Service oriented systems are those where services are clearly identified, and are accessed in a loosely coupled manner using well defined interfaces. Depending on the type of service and the interface, a communication system can be both endpoint driven and service oriented, e.g., Vclick [7, 8] is a web-based video collaboration application that uses a resource service with well defined APIs for shared data storage and notifications, but still runs all the application logic in the browser endpoint. RESTful client-server architectures further encourage pushing the application logic into the client.

Loosely coupled systems separate the different modules or elements so that they can be easily replaced or failed over. Fault tolerance is a big objective of such systems. However, the terminology is tangential to the other terms I mentioned, and can be applied to those systems. A SIP phone client and a SIP proxy server are by design loosely coupled. However, often a provider requires a particular vendor's phone or device, making it tightly coupled to the service. Similarly, existing client-server or service oriented web applications are often tightly coupled - the HTML/JavaScript must use the web service on that particular web server.

Data driven systems are those where the focus is on a large set of data, and one writes modular and small pieces of application logic or functions that operate on that data. If the application logic runs in the endpoint, such systems can still be endpoint driven. However, existing examples of data driven systems are often server or cloud based. In a communication context, such applications are usually for offline data analytics, and not for real-time communication.

Data oriented systems (or applications) are those that separate the data and the application logic, and more importantly, allow plug and play of the two elements. For example, the ability of a communication system to easily swap the user accounts and contacts data from one social network or directory to another, or the ability to use a third-party chat application with existing user accounts and contacts. Many web applications are internally data oriented, but such plug and play is not exposed externally to the end users. (Data oriented design or programming is completely unrelated to this, and is about optimizing the data layout, unlike object oriented design or programming.)

A personally controlled system takes the data oriented application to the next level, where the end user is in control of her data, and can potentially use any application to work on it. For example, a user could manage and store her own account information, contacts and chat history wherever she wants, and could use any compatible user agent or softphone to work on that data and to connect with her contacts, while giving that user agent temporary access to her data. If an application that uses users' social data becomes obsolete, it does not render the social data itself useless - users could easily import their data into a new application or platform.

Rich Internet Applications are usually web applications that behave similar to desktop applications, and are often delivered via browser plugins or extensive HTML5 features. Thus, most of the application logic runs in the endpoint, and they are often also classified as endpoint driven systems. However, due to the legacy use of the terminology, even a simple Flash application that displays only a phone dialpad, but runs everything else in the server, is mistakenly called a Rich Internet Application.

Thin client refers to hardware systems with limited capabilities, which often use external server side systems to fulfill many of the application demands. In the communication context, a thin client system puts very little load on the endpoint device, and performs the bulk of its functions in the server. However, due to the legacy use of the terminology, some web based rich communication applications, which are delivered via a browser and run within the browser, are also mistakenly called thin client applications. Nevertheless, thin client often means thick server, and unlike endpoint driven applications, often requires expensive servers.

As an example, Vclick can be categorized as an endpoint driven, client-server, loosely coupled, data oriented, rich Internet application. Strata is an endpoint driven, client-server but also peer-to-peer, tightly coupled, service oriented system.

There are other terms that are often used in the context of communication systems. For example, a "managed" service typically refers to a telecom style billed service in which the service provider controls many aspects of the end user behavior, e.g., who the users can call, who they can add to their contacts, etc. For example, call blocking can clearly be an endpoint service, but is often implemented and marketed as a managed service. Many modern web communication applications controlled by a single provider or website are examples of this. Integrated platform or ecosystem is often a euphemism for such single vendor dictatorship. As the user demands grow and more diverse systems appear, such systems then become open to federation or one-on-one integration with other big players. A federated system is often a euphemism for a closed and/or proprietary system that selectively decides to collude with one or more other such systems.

So what are the benefits of endpoint driven systems? They are inherently (or can easily be made) scalable and robust. The scalability is usually limited by the service, and the lesser the dependency on the service, the easier it is to scale and fail over. Traditional software often followed the endpoint driven, personally controlled approach, e.g., an app to edit your files or to send emails irrespective of where or how the data was stored or obtained. With the emerging trends in web apps and the walled garden approach of social software, systems became more "managed", data controlling and service oriented. Typically, an expensive on-premise server or a cloud hosted subscription based service is a sign that it is not an endpoint driven system. With the emergence of mobile apps, endpoint driven systems are becoming popular again.

The endpoint driven trait is not protocol specific. People familiar with public telephone systems know that the endpoints or phones there are very dumb devices, and most of the intelligence lies in the network or servers. The Internet, with its end-to-end approach, attempts to change this basic notion by keeping the intelligence in the endpoints. Ideally, a failure in an intermediate router or device does not break the end-to-end transport session for an app such as file transfer or remote login. The Internet communication protocol, SIP (session initiation protocol), was invented to be end-to-end, with lightweight (and sometimes optional) servers. However, business constraints have forced many real-world deployments to be largely single provider systems - although all the user devices and services use SIP internally, it won't work outside the ecosystem, i.e., the end user must register her user agent with one service, and can only talk to other people who are registered with that one service.

As another example, in the SIP-based IMS architecture, a fully compliant SIP endpoint is treated as a very dumb end device, which must connect with only so-and-so IP addresses, and must only use so-and-so ways of reachability. Many of the end-to-end capabilities of the SIP terminal are constrained if the network element does not "support" a feature by passing it through the various intermediate proxies. For example, applications such as file transfer, picture sharing or even text chat can be implemented end-to-end in SIP by just conveying that session information in the SDP. As long as the two endpoints understand it, and the intermediate proxies do not mangle the session description, things often work without problem in this endpoint driven system. Good luck doing this with an IMS compliant system!
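As a concrete sketch of carrying such session information end-to-end in the SDP, here is what a file transfer offer could look like, loosely modeled on the RFC 5547 format. All the addresses, ports and file details below are illustrative placeholders, not from a real session:

```python
# An illustrative SDP offer for end-to-end file transfer: the "m=message"
# line and "a=file-selector" attribute only need to be understood by the
# two endpoints; proxies that pass the body through unmodified stay out
# of the way.
sdp_offer = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.0.2.1",
    "s=File transfer",
    "c=IN IP4 192.0.2.1",
    "t=0 0",
    "m=message 7654 TCP/MSRP *",       # file transfer over MSRP
    "a=sendonly",
    "a=accept-types:message/cpim",
    'a=file-selector:name:"photo.jpg" type:image/jpeg size:4092',
])

print(sdp_offer)
```

If an intermediate proxy mangles or strips these attribute lines, the endpoint driven feature silently breaks, which is exactly the IMS-style constraint described above.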

With emerging WebRTC, the trend continues to grow, as each website now becomes its own VoIP provider, preventing its users from talking to anyone outside that website. Compare this with a SIP user agent in line with the original endpoint driven vision of SIP, which could register with one or more third-party SIP services, talk to people on any SIP address, and incorporate new features such as video, text chat, file sharing and even device control, without depending on the intermediate servers. Independent innovation is the real power of endpoint driven systems.

The services, tools and applications should be independently developed to promote innovation in each dimension. Imagine what would have happened if every email provider had its own email client, or allowed sending emails only to its own registered users. Or imagine what would have happened if every business developed its own web browser and server applications. In the world of communication, we are essentially going down that route - islands of highly controlling providers, who manage their own communication apps, have their own communication services, and control every aspect of communication among their users. There are emerging solutions to this problem, such as the endpoint driven approach I proposed [9, 10], but one should avoid the problem altogether in the first place.

When endpoint driven systems are combined with the other virtues of loosely coupled, data oriented, and personally controlled system designs, one can truly create a free communication system - one that is free to use by anyone, free from the control of a single provider, and free to evolve without unnecessary constraints - similar to how email and the original web evolved. What are the building blocks of such a system? An independent data storage that allows the end user to store and access her data, get notified when something changes or when someone views her data, and permit her data to be selectively used by certain applications. A marketplace of endpoint driven communication applications that use this data to enable end-to-end secure and reliable communication and collaboration. And finally, an underlying network infrastructure that stays out of the way of the endpoint driven applications.

Sunday, December 25, 2016

I am really impressed by the impress.js presentation framework based on CSS3. It is pretty low level, requires knowledge of HTML and CSS, and is small enough that it can stay out of your way. Others have created very impressive 3D presentations with it. I also did my share of 3D presentations in 2015, as various conference paper presentations, while at Avaya Labs.

Below, I list my presentations and associated demo videos, largely based on the impress.js framework. When viewing a presentation, use an HTML5 capable browser such as Chrome or Safari, and please wait for it to load completely, and for the browser's loading icon to stop spinning, before starting the slide show.

The following presentations and demos were already publicly disclosed with the employer's permission at various academic conferences in 2012-16. The employer may hold pending or granted IPR/patents for ideas shown there.

Developing WebRTC-based team apps with a cross platform mobile framework

Paper presented at IEEE CCNC, Las Vegas, 2016.
The paper was presented by my co-author and covers the challenges, tips and tricks of creating cross platform mobile apps written using HTML5 with the help of Chrome Cordova Apps tools and framework. It also includes a demo of a real-time streaming video widget running on various platforms, mobile devices, and in native app vs. web app modes.

Vclick: endpoint driven enterprise WebRTC

Paper presented at IEEE ISM, Miami, 2015.
The paper describes the architecture, implementation and core ideas behind an endpoint driven multimedia collaboration system that includes audio/video conferencing, text chat, screen sharing, shared whiteboard, notepad and others. All the application logic runs in the endpoint, while using a simple resource server in the cloud. It also includes a demo of the cloud hosted software.

User reachability in multi-apps environment

Paper presented at IEEE ISM, Miami, 2015.
The paper describes the problem of user reachability in today's segregated multi-apps environment, where a user of one service cannot easily talk to one on another. It also describes the architecture and implementation of Strata, a mobile-first but cross-platform app, which can dynamically figure out the right reachability, and can integrate with a bunch of existing services such as the phone system, Avaya IP Office, Scopia, AMS, etc. It also includes a demo of the app.

Enterprise WebRTC powered by browser extensions

Paper presented at IPTcomm, Chicago, 2015.
The paper shows how browser extensions can be used to solve critical problems in adopting WebRTC in enterprises: in particular, enforcing policy so that authorized WebRTC flows can traverse restricted enterprise firewalls while other flows are blocked, and seamlessly integrating WebRTC with existing enterprise communication systems. It also includes a couple of demos of our cloud-hosted software.

Avaya Labs Innovations Cloud Engagement

Paper presented at IPTcomm, Chicago, 2015.
The paper presents the architecture, implementation, and tips and tricks of our cloud-hosted application portal for team collaboration systems. In particular, it describes the challenges and solutions related to the multi-tenancy and self-service aspects of the system. It also includes a demo of our cloud-hosted multi-tenant portal.

Building communicating web applications leveraging endpoints and cloud resource service

Paper presented at IEEE Cloud, Santa Clara, 2013.
The paper describes the problems caused by user data being controlled by apps, and presents an architecture and implementation to separate user data from applications. It describes a resource-based application architecture, generic WebRTC signaling for communicating web applications, and a developer platform for building such communicating apps. It also includes a demo of the Artisy app builder and widgets.

Private overlay of enterprise social data and interactions in the public web context

Paper presented at IEEE CollaborateCom, Austin, 2013.
The paper describes the architecture and implementation of our living content system, which demonstrates enterprise social interactions such as web annotations, virtual presence, co-editing, click-to-call from a corporate or social directory, and real-time WebRTC-based collaboration on third-party websites. It also includes a demo of living content in several use cases.

Wednesday, July 06, 2016

Twilio Client [1] enables embedding voice conversations in web and mobile apps by creating a voice pipe between your browser or mobile device and the service. One thing missing there is the ability to create such a voice pipe from non-mobile, non-web programs, such as a command line application. There is a shell script [2] to place a test call from the command line, but it uses pre-defined text or a recorded file for media, and loses the real-time interactive nature of the voice path. In particular, it does not create a voice pipe between the local machine and the service.

Motivation

The ability to connect a real-time interactive voice path from an application opens doors to a wide range of other use cases, such as media path processing and analysis, e.g., for real-time transcription, or bridging calls between diverse services, e.g., translating between IM and a voice call. Secondly, such a mechanism is independent of a specific browser or mobile platform, and can work in headless mode for automated testing or client-side programmability of a voice call or its media path.

At a high level, there are four potential ways to create such a voice pipe. Two of these can be accomplished using my rtclite project, and we describe them in more detail in this article.

1. WebRTC - using the Twilio 1.3+ web client with a command line WebRTC app
2. Mobile - using the Twilio mobile client interface ported to a command line app
3. RTMP - using the Twilio 1.2 web client API with a command line RTMP client
4. SIP - send/receive SIP calls to/from Twilio [3]

The first two approaches essentially implement a command line version of the web and mobile SDKs. For example, a WebRTC stack compiled for Linux/OS X may be used to accomplish (1). The third approach uses a command line RTMP client with an older version of the client API, and the fourth uses a command line SIP endpoint to dial into the service.

We describe how to do the last two using software pieces from our open source rtclite project [5]. The following video demonstration shows the command line call initiation. Don't forget to view it in full screen!

Connect to Twilio from command line SIP endpoint

The approach is described on the provider's website [3], including the steps for creating the SIP domain/endpoint such as yourname.sip.twilio.com. The description there is targeted at VoIP providers rather than clients. Currently it is not possible [4] to configure your SIP softphone to use the Twilio service as a SIP server. Once the SIP domain is created, you can send requests to "sip:something@yourname.sip.twilio.com". The sender's IP address needs to be white-listed, or the caller's credentials need to be preapproved for authentication. We have only tried the IP address white-listing, not the SIP authentication using credentials.

After configuring a SIP endpoint, you should be able to see the SIP domain and its associated voice URL on the provider's website, e.g., "yourname.sip.twilio.com" mapped to a voice URL of "http://yourserver/yourtwiml.xml". Simple call forwarding can be done using the following TwiML. These steps can be used to connect to any other TwiML application from the command line SIP endpoint, if you configure your SIP endpoint's voice URL accordingly.
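For example, a TwiML document along these lines forwards the incoming SIP call to a regular phone number; the numbers are placeholders, and on a trial account both may need to be verified, as noted later:

```xml
<Response>
<Dial callerId="+1415xxxxxxx">+1415yyyyyyy</Dial>
</Response>
```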

You can also create a programmable server-side script to derive the target number from the called SIP address. For example, send the call to sip:1415yyyyyyy@yourname.sip.twilio.com, and in your script, extract the user name part of the URI to populate the Dial tag's content.
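A minimal sketch of such a script follows, assuming the service supplies the called SIP address in a "To" parameter; the caller ID is a placeholder:

```python
import re

def twiml_for(to_header, caller_id="+1415xxxxxxx"):
    """Build call-forwarding TwiML by extracting the user part of the
    called SIP URI, e.g. sip:1415yyyyyyy@yourname.sip.twilio.com."""
    match = re.match(r"sip:([^@]+)@", to_header)
    number = match.group(1) if match else ""
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<Response><Dial callerId="%s">%s</Dial></Response>'
            % (caller_id, number))
```

Your web framework would return this string with a text/xml content type from the voice URL configured for the SIP domain.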

Once the initial configuration is complete, use the SIP endpoint available in the rtclite project to initiate a call. More details on the command line SIP endpoint are available in my previous blog post as well as on the project website [6]. After setting up the initial dependencies of the project, run the caller module as follows.
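Hedging on the exact flag spellings, which are documented on the project website [6], the invocation looks something like this; the domain and number are placeholders:

```shell
python -m rtclite.app.sip.caller --domain yourname.sip.twilio.com \
    --to sip:1415yyyyyyy@yourname.sip.twilio.com --use-lf --samplerate 48000
```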

The "domain", "use-lf" and "samplerate" options are described on the project website. The "to" option is important: it represents the target address to call. Once the call is established, the py-audio project's modules are used to interface with the audio device, and to send and receive audio in the call.

Some important notes follow. For a demo account with the provider, the callerId and the target number in the Dial tag must both be verified for your account. Some Internet Service Providers (ISPs) may block SIP requests on port 5060. In fact, I sometimes experience a Wi-Fi reset of my residential equipment when I send a SIP packet out from my home machine to an outside VoIP service. Router interference with SIP messages may also be the reason for incorrect CRLF handling, and hence the need for the use-lf option. Using TLS may be a way to work around router interference. Make sure that the sample rate specified on the command line matches the allowed sample rate of the audio device on your client machine. On Mac OS X, audio capture is usually done at 48kHz.

Connect to Twilio from command line RTMP client

Here, we exploit the older Flash-based Twilio Web Client SDK, version 1.2. First, we describe how to figure out the client-server RTMP message flow, and then we show how to reproduce it using the rtmpclient module from the rtclite project.

Inspecting the client SDK's JavaScript shows that the Flash-based client-server connection is accomplished using the MediaStream class, which internally attempts an RTMP connection to the service.

Using the hello-client-monkey.php example from the provider, but hosted on your own website, and running Wireshark on the default RTMP port 1935, you can find more information about this client-server exchange. Make sure to use http instead of https for your hosted web app to avoid encryption. The following screenshots show a few example RTMP messages, their order, and the parameters sent in the initial connect request.

In summary, the NetConnection's connect is done to "rtmp://chunder.twilio.com/chunder" with six additional arguments. The first argument looks like the capability token generated by the helper library. The second and third are null and an empty string, respectively. The fourth is a JSON-formatted string with some client-side attributes. The fifth looks like the account SID. And the last one looks like the client SDK version. After connect is complete, there are two createStream calls and a startCall RPC method. This is followed by a callsid RPC method received from the server. Next, the "input" stream name is published, and the "output" stream name is played. This is followed by a bunch of audio data.
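To make the ordering concrete, here is a small sketch that reconstructs the six connect arguments as observed in the capture; the JSON attribute keys are illustrative assumptions, not the actual keys sent by the SDK:

```python
import json

def connect_args(capability_token, account_sid, sdk_version="1.2"):
    """Return the six extra arguments of the RTMP connect call, in the
    order observed in the Wireshark capture."""
    client_attrs = json.dumps({"platform": "cmdline"})  # assumed keys
    return [
        capability_token,  # 1: capability token from the helper library
        None,              # 2: null
        "",                # 3: empty string
        client_attrs,      # 4: JSON string with client-side attributes
        account_sid,       # 5: account SID
        sdk_version,       # 6: client SDK version string
    ]
```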

We wrote an example application, client.py [7], which uses the rtmpclient module from the rtclite project and automates the client-server exchange process. It then connects the audio sent/received on the two streams to the local audio device. Although not well tested, there are other command line options in this application that allow recording the received stream, or playing a file to the sent stream.

The command line application takes three mandatory parameters, which, if not supplied, will be prompted for: the account SID, the auth token, and the application SID. The client application then connects to the service using RTMP with the supplied parameters. An example command line invocation is shown below, where the three parameters are fake - use the correct values in your test!
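Hedging on the exact option names, which may differ in client.py [7], the invocation looks something like this; all three values below are fake:

```shell
python client.py --account-sid ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
    --auth-token yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy \
    --application-sid APzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
```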

Comparison

Here, we attempt to compare our two approaches for command line client: SIP and RTMP.

The audio codecs used are G.711 for SIP and Speex for RTMP. Thus the bandwidth requirement is higher for SIP, but it can theoretically give better quality, e.g., for real-time transcription. It may be possible to send G.711 in RTMP to the server, but it is not clear how to force the server to send back G.711 in RTMP. The media path is over UDP (RTP) for SIP vs. TCP for RTMP, making RTMP more susceptible to network issues such as latency in interactive conversations. While both of these are voice-only at this time, using a WebRTC or Mobile SDK based approach may allow a video pipe. The py-webrtc project may become useful for this, once it is completed.

The SIP approach has provider-supplied documentation on how to do incoming calls, but the RTMP approach requires more work to figure that out. The SIP approach seems to target server-to-server call flows, e.g., connecting your soft PBX to the provider service. Due to router interference with SIP messages, it may not work in all cases or all the time from a client machine. On the other hand, the RTMP approach is inspired by the client SDK, and hence suitable for clients. However, the Flash-based client has been deprecated by the provider, and the corresponding service may no longer be available in the near future. Moreover, RTMP is a somewhat obsolete technology. Using the WebRTC or Mobile SDK approach will work better in that case. Once SIP registration becomes available on the provider service, a standard command line SIP endpoint [6] should be enough to send and receive calls.