Menu

Category Archives: Text-to-Speech

An Interesting Opportunity

Voice recognition and text to speech technologies make a good combo for future user interfaces. Imagine browsing the web by voice and listening to blogs like you listen to podcasts. Your mobile phone will be able to provide these features in the near future. Voice web browsing is not only useful for the visually impaired, but also for users who wish to do other things while surfing the web. Voice recognition and text to speech feature prominently on the iPhone 4S, but voice web browsing is not incorporated.

This blog post includes code which allows you to extract the main text content from a web page and convert it to a playable audio file. The process is triggered by simple voice commands such as “receive hacker news”, “read article 7” or “save article 19”. The resulting audio file can also be synchronized to other services such as DropBox, Amazon Cloud Drive, and Apple iCloud to be played later. We use IKVM to allow us to run the boilerpipe library, which is written in Java, over .NET. We choose .NET because it comes with “batteries included”: ready to use voice recognition and text to speech capabilities.

Good voice recognition and text to speech systems are expensive and require training. Companies which provide these services do not offer a more granular way to access a web service or run a local engine. For example, AT&T Natural Voice for TTS sounds good, but their licensing terms are prohibitive for small companies and startups. On the voice recognition side of the equation, Nuance has been accumulating patents to strengthen their market position, making it difficult for others to compete. Although many other companies offer voice recognition, most of them actually use Nuance’s technology. See for example: Siri, Do You Use Nuance Technology? Siri: I’m Sorry, I Can’t Answer That. Voice recognition systems require training to improve their accuracy. SpinVox, a Nuance subsidiary, used “conversion experts”. They built teams which listened to audio messages and manually converted them to text. If you want to use this approach, you’ll need to wait for Amazon’s Mechanical Turk to offer micro jobs in real time.

What is missing here? None of the leading companies offer good quality voice recognition on a charge per use basis. Google seems to be actively researching voice recognition, and has achieved impressive results with “experimental” speech recognition technology. Sadly, Google’s voice recognition and text to speech APIs can not be used to develop all desktop and server applications. Their use is restricted to Android phones, Chrome’s beta html5 support, and Chrome’s extensions. It would be nice for Google to remove this restriction and include this service on their web APIs Console.

Our code demonstrates that it is possible to use voice recognition and text to speech while avoiding the licensing, patent and API conundrum.

Using VR and TTS under .NET

Prerequisites

If you use our VoiceWebBrowsing code, also available on GitHub, you will just need Microsoft Visual Studio 2010. However if there is a new version of boilerpipe you will have to generate the boilerpipe .NET assemblies yourself as follows:

See Also

Where can you go from here?

You can write a continuous Hacker News front page reader which constantly checks the feed and reads you new titles.

You can write a voice oriented mobile web browser for Windows Phone, Google Android, and for Google Chrome using their APIs. To write a mobile browser for iPhones, you will have to wait for iOS Siri API.

You can write a cloud service to provide a web page to a podcast converter service, store the audio file in the cloud, and automatically convert Google Reader’s starred items.