I am considering building a virtual assistant for incoming phone calls, but I am very new to SIP. Several people have recommended using Asterisk/FreeSWITCH, but I am concerned about the scalability of that solution.

Alternatively, I am considering existing services (Plivo, Tropo, Twilio) for Text To Speech (TTS) generation, combined with my own proprietary solution for speech recognition and automated dialog generation. However, the best I have been able to build with these services is a dialog where the user is prompted to speak with a beep and the audio only becomes available once the user has finished speaking. That is hardly an improvement over previous technologies.

I wonder whether I could use SIP Sorcery to fork the incoming phone call into two legs: one for TTS output, which would be forwarded to Plivo, Tropo or Twilio so I can use their APIs to provide output to the user, and one for audio input, i.e., the incoming RTP, so we can run our speech recognition solution on it.

In particular, I would appreciate a solution that solves two problems: how to match incoming RTP streams to incoming calls on the cloud-based IVR product, and how to guarantee that the speech recognition leg receives only audio from the user and nothing from the TTS output.

Yes, I am a newbie on SIP, so please try not to use overly fancy terms and, if you do, send me a link. I have considered getting it all done with an Asterisk or FreeSWITCH box, but I am concerned about scalability and would prefer to have existing solutions on the market take care of SIP signaling and TTS generation.

novice wrote:I wonder whether I could use SIP Sorcery to fork the incoming phone call into two legs: one for TTS output, which would be forwarded to Plivo, Tropo or Twilio so I can use their APIs to provide output to the user, and one for audio input, i.e., the incoming RTP, so we can run our speech recognition solution on it.

When you fork an incoming call with SIP, you forward the call request to two or more destinations. The first destination to answer is the one that establishes the call, and the call requests to the other forks are cancelled.
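As a rough illustration of that "first answer wins" behaviour, here is a small Python simulation. It does not speak SIP at all; the destination names and answer delays are made up, and asyncio tasks simply stand in for the forked call requests.

```python
import asyncio


async def ring(destination: str, answer_delay: float) -> str:
    """Simulate one fork: the destination answers after answer_delay seconds."""
    await asyncio.sleep(answer_delay)
    return destination


async def fork_call(destinations: dict[str, float]) -> tuple[str, list[str]]:
    """Ring every destination at once; keep the first answer, cancel the rest."""
    tasks = {
        asyncio.create_task(ring(name, delay)): name
        for name, delay in destinations.items()
    }
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    winner = done.pop().result()

    cancelled = []
    for task in pending:
        task.cancel()  # plays the role of the CANCEL sent to the losing forks
        cancelled.append(tasks[task])
    # Wait for the cancelled tasks to finish unwinding.
    await asyncio.gather(*pending, return_exceptions=True)
    return winner, sorted(cancelled)


if __name__ == "__main__":
    # Hypothetical destinations: the second one picks up much sooner.
    answered, dropped = asyncio.run(
        fork_call({"tts-provider": 0.2, "asr-server": 0.01})
    )
    print("answered by:", answered)
    print("cancelled:", dropped)
```

Only one destination ever ends up in the call, which is why a fork on its own cannot give you one leg for TTS and another for recognition.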

In other words, with a SIP fork you still end up with only two parties on the call: the caller and the callee. It sounds like you are after a three-way call with the caller and two callees. That said, based on your description, if you forward the call to Tropo or a similar service, I'm fairly sure it can forward the call on to your own server while still letting you execute commands and inject media into the call from the Tropo server. So you could forward the call to your speech recognition server and then, when you need to play a response, call a Tropo web service to have it inject text-to-speech into the call.

Based on what I've understood of your scenario, there's unlikely to be anything in the sipsorcery service that can help you with it.

novice wrote:In particular, I would appreciate a solution that solves two problems: how to match incoming RTP streams to incoming calls on the cloud-based IVR product, and how to guarantee that the speech recognition leg receives only audio from the user and nothing from the TTS output.

The standard way of matching RTP to a SIP call is by the IP socket it uses. When a SIP call is answered the SIP response contains information about where the call media (audio, video etc) should be sent.

While there's nothing stopping a SIP user agent from using a single socket for the RTP of multiple calls, it would need some extra logic to distinguish between the different RTP streams. There is nothing in the SIP or RTP standards that provides a mechanism for this.
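One possible form that extra logic could take (this is my own assumption, not something the standards prescribe for matching calls) is to group packets by the SSRC field that every RTP header carries. The sketch below parses the fixed 12-byte RTP header and buckets payloads by SSRC. Note that it only separates the streams; tying each SSRC back to a particular SIP call would still be up to the application. Header extensions are ignored and the packets are synthetic.

```python
import struct
from collections import defaultdict


def parse_rtp(packet: bytes) -> tuple[int, bytes]:
    """Return (ssrc, payload) from a raw RTP packet (header extensions ignored)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    first, _marker_pt, _seq, _timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    csrc_count = first & 0x0F
    header_len = 12 + 4 * csrc_count  # fixed header plus any CSRC entries
    return ssrc, packet[header_len:]


def demux_by_ssrc(packets: list[bytes]) -> dict[int, list[bytes]]:
    """Group payloads that arrived on one shared socket by their SSRC."""
    streams: dict[int, list[bytes]] = defaultdict(list)
    for packet in packets:
        ssrc, payload = parse_rtp(packet)
        streams[ssrc].append(payload)
    return dict(streams)


def make_packet(ssrc: int, seq: int, payload: bytes) -> bytes:
    """Build a synthetic RTP packet (version 2, payload type 0) for the demo."""
    return struct.pack("!BBHII", 0x80, 0, seq, seq * 160, ssrc) + payload


if __name__ == "__main__":
    mixed = [
        make_packet(0x1111, 1, b"a"),
        make_packet(0x2222, 1, b"b"),
        make_packet(0x1111, 2, b"c"),
    ]
    for ssrc, payloads in demux_by_ssrc(mixed).items():
        print(hex(ssrc), payloads)
```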

I read the article about SIP one-way audio situations (https://sipsorcery.wordpress.com/2009/0 ... -problems/) and thought it described exactly the type of situation I am trying to recreate: RTP audio from the client is sent to one IP address, but incoming audio arrives from a different address.

I wonder whether SIP Sorcery or some kind of SIP proxy could help establish a dialog between a SIP client and a service like Twilio/Tropo/Plivo in which the RTP packets from the client are sent to my voice recognition server, while it is Twilio/Tropo/Plivo that sends RTP packets to the client.