Elias Majic and Spencer Lord have guest-authored this post on speech recognition in JavaScript. Do you have something to share? Consider writing your own guest post and contact us!

Recently, Google’s free text-to-speech API has made the rounds. The reverse is also possible: converting speech to text.

With speechapi.com’s JavaScript API, it is possible to build interesting speech-web mashups that include both speech-to-text and text-to-speech.

A combination of several technologies and open source tools makes this possible. In the browser, Flash is used to access the microphone and stream the audio to an RTMP server. Red5 is used because it's a versatile media server with the benefit of being open source and free.

Once that audio is received on the server, it needs to be converted to text. There are many speech recognition engines to choose from. Many are proprietary and provide very good accuracy, but they are pricey and closed source. There are some state-of-the-art open source speech recognition engines too, such as Julius and Sphinx, to name a couple. The speechapi service uses Sphinx because it is license-friendly and has a strong community.

Now this is great: we can transmit audio and convert it to text, but we need to control the process and use the results in the web page. That is where JavaScript comes in. Speechapi.com provides a JavaScript API. There is a setupRecognition method that sets up the grammar used in the speech-to-text process. There is a simple grammar mode, where you can just provide a comma-separated list of words. JSGF is also supported and is useful for more complex grammars. There are also methods that tell the Flash control when to start and stop transmitting audio. You can also use the Flash control's built-in press-to-speak button to specify the speech endpoints.
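To make the two grammar modes concrete, here is a small sketch. The `toSimpleGrammar` helper is a hypothetical convenience of our own (not part of the API) that builds the comma-separated word list that simple mode expects, and the JSGF string shows what a more structured grammar for the same words could look like. The `"JSGF"` mode name in the commented call is an assumption.

```javascript
// Hypothetical helper: build the comma-separated word list that the
// simple grammar mode expects from an array of words.
function toSimpleGrammar(words) {
  return words.map(function (w) { return w.trim().toLowerCase(); }).join(",");
}

// For more complex grammars, a JSGF grammar string can be used instead:
var jsgf =
  "#JSGF V1.0;\n" +
  "grammar digits;\n" +
  "public <digit> = one | two | three | four | five;";

// speech.setupRecognition("SIMPLE", toSimpleGrammar(["one", "two", "three"]));
// speech.setupRecognition("JSGF", jsgf);

console.log(toSimpleGrammar(["One", " Two", "Three"])); // "one,two,three"
```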

Recognition results are returned to your web page in a callback that you specify in the speechapi constructor. The results are passed from the server to the client as a JSON string. The result object contains the raw text results as well as other information that can be useful for your speech client, like pronunciation and “grammar tags” that can be useful for semantic interpretation of the results.
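A sketch of handling that JSON string follows. Only the `text` field is confirmed by the description above; `pronunciation` and `tags` are illustrative stand-ins for the extra information mentioned, and the payload itself is made up.

```javascript
// Hypothetical result payload; only "text" is a confirmed field name.
var payload = '{"text":"three","pronunciation":"th r iy","tags":["number"]}';

// Parse the JSON string the server sends and pull out the recognized text.
function parseResult(json) {
  var result = JSON.parse(json);
  return result.text;
}

console.log(parseResult(payload)); // "three"
```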

We think this technology is pretty cool and we encourage you to try it out. You can try it for free at speechapi.com, where you just include a few lines of JavaScript and HTML in your web page to enable speech recognition. We are also open sourcing the package over the next few months, so sign up at our site if you're interested.

A Basic Speech Page

If you were to copy and paste the code below into a web page, it would show a basic Flash control that recognizes a number between one and five. Since it interacts with a Flash app, you can't just load the file off your hard drive. You need to load it from a website, so copy it to your Apache /var/www or equivalent. Let's take a closer look at the code. First you need to include our API JavaScript file. Then you need to construct the SpeechapiObject. The constructor requires a username and password, the callback function for the recognition-complete event, the callback for the text-to-speech-complete event, the name of the swf object that you can use to place the Flash UI in the web page, and the URL of the speech server.

<!-- include the speechapi JavaScript file here, then: -->
<script type="text/javascript">

// Construct the SpeechapiObject. The argument order follows the
// description above: username, password, recognition callback, TTS
// callback, id of the swf container, and the speech server URL.
// The values here are placeholders.
var speech = new SpeechapiObject("username", "password",
    onResult, onTtsComplete, "swfcontainer", "http://your-speech-server");

// When the Flash control is ready and loaded, we set up recognition with
// the words we want to recognize: "one,two,three,four,five". "SIMPLE" is
// the grammar type, which is just words separated by commas.
function onStartup() {
    speech.setupRecognition("SIMPLE", "one,two,three,four,five");
}

// OK, so we have a result. Let's use TTS to play it back.
function onResult(result) {
    speech.speak(result.text, 'female');
}

// Just show a JavaScript alert when the TTS has completed.
function onTtsComplete() {
    alert("tts complete");
}

</script>

<body>

...

<!-- somewhere in the body include the swf -->

<div id="swfcontainer"></div>

...

</body>

Running the Basic Speech Page

To run this application, load the new web page in your favorite browser from a web location. You should see the Adobe Flash Player Settings dialog appear at the location where you placed the swfcontainer. You will need to click Allow if you want to be able to send audio to the server.

Then the speech widget will be displayed in its place. To trigger recognition, press and hold the green button, speak a word between one and five (we set up that grammar earlier), and then release the button. The button will turn yellow while it is depressed, indicating you are sending audio to the recognizer.

If all goes well, the system will speak back the text that was recognized. Once the result has been spoken, an alert will appear indicating that playback has finished.

Conclusion

We have shown that using existing open source technologies you can add speech-to-text and text-to-speech to your web pages. The speechapi.com JavaScript API is designed to be easy to use yet powerful.

In the next installment we will talk about how you can build multi-modal UIs using JSGF tags and rules (to help provide semantic interpretation of recognition results) and jQuery selectors. If you want a preview, check out some of the demos at http://speechapi.com/demos.
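As a small taste of what those tags look like, here is a sketch of a JSGF grammar that attaches a semantic value (in curly braces) to each spoken word; the exact way the tags come back in the result object is a topic for that next installment.

```
#JSGF V1.0;
grammar digits;
public <digit> = one {1} | two {2} | three {3} | four {4} | five {5};
```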

There’s also the WAMI speech API (Web-Accessible Multimodal Applications) which I think was featured here a while ago. It uses a Java applet to capture audio and I made a mashup of it using Ext JS if you want to try it out: http://ext-scheduler.com/examples/speech/speech.html

Having an HTML5 solution would be ideal, but getting access to the microphone (and webcam) is not yet possible without Flash (or an applet). It will be possible in the future with the device tag. As far as I know, the specification for the device tag is still in the draft phase. Once it is available, it should be easy to replace the “Flash part”.