Mobile Development at U2U Consult

Based in Brussels, Belgium, U2U Consult has been offering consulting & development services in the EMEA region for over 10 years. We offer mobile development on all platforms from Apple to Android & Microsoft.

This article introduces you to speech recognition (speech-to-text, STT) and speech synthesis (text-to-speech, TTS) in a Universal Windows XAML app. It comes with a sample app – code attached at the end- that has buttons for

opening the standard UI for speech recognition,

opening a custom UI for speech recognition,

speaking a text in a system-provided voice of your choice, and

speaking a text from a SSML document.

On top of that, the Windows Phone version of the sample app comes with a custom control for speech and keyboard input, based on the Cortana look-and-feel.

Here are screenshots from the Windows and Phone versions of the sample app:

Speech Recognition

For speech recognition there’s a huge difference between a Universal Windows app and a Universal Windows Phone app: the latter has everything built in, the former requires some extra downloading, installation, and registration. But after that, the experience is more or less the same.

Windows Phone

For speech-to-text, Windows Phone comes with the SpeechRecognizer class, just create an instance of it: it will listen to you, think for a while, and then come up with a text string. You can make the recognition easier by giving the control some context in the form of Constraints, like

a predefined or custom Grammar – words and phrases that the control understands, or

a topic, or

just a list of words.

Contradictory to the documentation, you *have* to provide constraints, and a call to CompileConstraintsAsync is mandatory.

You can trigger the default UI experience with RecognizeWithUIAsync. This opens the UI and starts waiting for you to speak. Eventually it returns the result as a SpeechRecognitionResult that holds the recognized text, together with extra information such as the confidence of the answer (as a level in an enumeration, and as a percentage). Here’s the full use case:

This is how the default UI looks like. It takes the top half of the screen:

If you don’t like this UI, then you can start a speech recognition session using RecognizeAsync. The recognized text is revealed in the Completed event of the resulting IAsyncOperation. Here’s the whole UI-less story:

Windows App

The SpeechRecognizer class is only native to Windows Phone apps. But don’t worry: if you download the Bing Speech Recognition Control for Windows 8.1, you get roughly the same experience. Before you can actually use the control, you need to register an application in the Azure Market Place to get the necessary credentials. Under the hood the control delegates the processing to a web service that relies on OAUTH. Here’s a screenshot of the application registration pages:

If you’re happy with the standard UI then you just drop the control on a page, like this:

<sp:SpeechRecognizerUxx:Name="SpeechControl"/>

Before starting a speech recognition session, you have to provide your credentials:

For Windows as well as Phone projects that require speech-to-text, don’t forget to enable the ‘Microphone’ capability. That also means that the user will have to give consent (only once):

Speech Synthesis

You now know how to let your device listen to you, it’s time to give it a voice. For speech synthesis (text-to-speech) both Phone and Windows apps have access to a Universal SpeechSynthesizer class. The SynthesizeTextToStreamAsync method transforms text into an audio stream (like a *.wav file). You can optionally provide the voice to be used; by selecting from the list of voices (SpeechSynthesizer.AllVoices) that are installed on the device. Each of these voices has it own gender and language. The voices depend on the device and your cultural settings: my laptop only speaks US and UK English, but my phone seems to be fluent in French and German too.

When the speech synthesizer's audio stream is complete, you can play it via a MediaElement on the page. Note that Silverlight 8.1 apps do not need a MediaElement, they can call the SpeakTextAsync method, which is not available for Universal apps.

Here’s the full flow. I wrapped it in a Voice class in the shared project (in an MVVM app this would be a 'Service'):

In a multi-page app, you must make sure that there is a MediaElement on each page. In the sample app, I reused a technique from a previous blog post. I created a custom template for the root Frame:

<Application.Resources><!-- Injecting a media player on each page --><Stylex:Key="RootFrameStyle"TargetType="Frame"><SetterProperty="Template"><Setter.Value><ControlTemplateTargetType="Frame"><Grid><!-- Voice --><MediaElementIsLooping="False"/><ContentPresenter/></Grid></ControlTemplate></Setter.Value></Setter></Style></Application.Resources>

And applied it in app.xaml.cs:

// Create a Frame to act as the navigation context and navigate to the first page
rootFrame = new Frame();
// Injecting media player on each page.
rootFrame.Style = this.Resources["RootFrameStyle"] as Style;
// Place the frame in the current Window
Window.Current.Content = rootFrame;

The voice class can now easily access the MediaElement from the current page:

The speech synthesizer can also generate an audio stream from Speech Synthesis Markup Language (SSML). That’s an XML format in which you can not only write the text to be spoken, but also the pauses, the changes in pitch, language, or gender, and how to deal with dates, times, and numbers, and even detailed pronunciation through phonemes. Here’s the document from the sample app:

<?xmlversion="1.0"encoding="utf-8" ?><speakversion='1.0'xmlns='http://www.w3.org/2001/10/synthesis'xml:lang='en-US'>
Your reservation for <say-asinterpret-as="cardinal"> 2 </say-as> rooms on the <say-asinterpret-as="ordinal"> 4th </say-as> floor of the hotel on <say-asinterpret-as="date"format="mdy"> 3/21/2012 </say-as>, with early arrival at <say-asinterpret-as="time"format="hms12"> 12:35pm </say-as> has been confirmed. Please call <say-asinterpret-as="telephone"format="1"> (888) 555-1212 </say-as> with any questions.
<voicegender='male'><prosodypitch='x-high'> This is extra high pitch. </prosody><prosodyrate='slow'> This is the slow speaking rate. </prosody></voice><voicegender='female'><s>Today we preview the latest romantic music from Blablabla.</s></voice>
This is an example of how to speak the word <phonemealphabet='x-microsoft-ups'ph='S1 W AA T . CH AX . M AX . S2 K AA L . IH T'> whatchamacallit </phoneme>.
<voicegender='male'>
For English, press 1.
</voice><!-- Language switch: does not work if you do not have a french voice. --><voicegender='female'xml:lang='fr-FR'>
Pour le français, appuyez sur 2
</voice></speak>

Speech Recognition User Control

The Windows Phone project in the sample app contains a SpeechInputBox. It’s a user control for speech input through voice or keyboard that is inspired by the Cortana look-and-feel. It comes with dependency properties for the text and a highlight color, and it raises an event when the input is processed:

The control behaves like the Cortana one:

it starts listening when you tap on the microphone icon,

it opens the onscreen keyboard if you tap on the text box,

it notifies you when it listens to voice input,

the arrow button puts the control in thinking mode, and

the control says the result out loud when the input came from voice (not from typed input).

The implementation is very straightforward: depending on the state the control shows some UI elements, and calls the API’s that were already discussed in this article. There’s definitely room for improvement: you could add styling, animations, sound effects, and localization.

Here’s how to use the control in a XAML page. I did not use data binding in the sample, the main page hooked an event handler to TextChanged:

<Controls:SpeechInputBoxx:Name="SpeechInputBox"Highlight="Cyan"/>

Here are some screenshot of the SpeechInputBox in action, next to its Cortana inspiration:

If you intend to build a Silverlight 8.1 version of this control, you might want to start from the code in the MSDN Voice Search for Windows Phone 8.1 sample (there are quite some differences with Universal apps: in the XAML as well as in the API calls). If you want to roll your own UI, make sure you follow the Speech Design Guidelines.

Code

Here’s the code, it was written in Visual Studio 2013 Update 3. Remember that you need to register an app to get the necessary credentials to run the Bing Speech Recognition control: U2UC.WinUni.SpeechSample.zip (661.3KB).