One recent evening I was running late for an after-work meeting with an old friend. I knew he was already driving to the rendezvous, so calling him would be out of the question. Nevertheless, as I dashed out of my office and ran toward my car, I grabbed my Windows Phone and held down the Start button. When I heard the “earcon” listening prompt, I said, “Text Robert Brown,” and when the text app started up, I said, “Running late, leaving office now,” followed by “send” to send the text message.

Without the speech features in the built-in texting app, I would’ve had to stop running and fumble around in frustration to send a text because I find the keypad hard to use with my fat fingers and the screen difficult to read while on the run. Using speech to text saved me time, frustration and no small amount of anxiety.

Windows Phone 8 offers these same speech features for developers to interact with their users through speech recognition and text-to-speech. These features support the two scenarios illustrated in my example: From anywhere on the phone, the user can say a command to launch an app and carry out an action with just one utterance; and once in the app, the phone carries on a dialog with the user by capturing commands or text from the speaker’s spoken utterances and by audibly rendering text to the user for notification and feedback.

The first scenario is supported by a feature called voice commands. To enable this feature, the app provides a Voice Command Definition (VCD) file to specify a set of commands that the app is equipped to handle. When the app is launched by a voice command, it receives a query string containing parameters, such as the command name, parameter names and the recognized text, that it can use to execute the command the user specified. This first installment of a two-part article explains how to enable voice commands in your app on Windows Phone 8.

In the second installment I’ll discuss in-app speech dialog. To support this, Windows Phone 8 provides an API for speech recognition and synthesis. This API includes a default UI for confirmation and disambiguation as well as default values for speech grammars, timeouts and other properties, making it possible to add speech recognition to an app with just a few lines of code. Similarly, the speech synthesis API (also known as text-to-speech, or TTS) is easy to code for simple scenarios; it also provides advanced features such as fine-tuned manipulation via the World Wide Web Consortium Speech Synthesis Markup Language (SSML) and switching between end-user voices already on the phone or downloaded from the marketplace. Stay tuned for a detailed exploration of this feature in the follow-up article.
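To give a flavor of the SSML mentioned above, here's a minimal sketch of a prompt an app like Magic Memo might speak after saving a memo. The wording, pause length and speaking rate are illustrative assumptions and aren't taken from the sample code; only the element names and namespace come from the W3C SSML 1.0 specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical confirmation prompt; the text and timing values are assumptions. -->
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Memo saved.
  <!-- A short pause separates the confirmation from the follow-up hint. -->
  <break time="300ms"/>
  <!-- Slow the hint down slightly and stress the phrase the user can say next. -->
  <prosody rate="slow">
    Say <emphasis>new memo</emphasis> to dictate another.
  </prosody>
</speak>
```

Plain-text synthesis would cover the same prompt in a single call; markup like this earns its keep only when pacing or emphasis matters.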

To demonstrate these features, I’ve developed a simple app called Magic Memo. You can launch Magic Memo and execute a command by holding the Start button and then speaking a command when prompted. Inside the app, you can enter your memo using simple dictation or navigate within the app and execute commands using speech. Throughout this article, I’ll explain the source code that implements these features.

Requirements for Using Speech Features in Apps

The Magic Memo app should work out of the box, assuming your development environment meets the hardware and software requirements for developing Windows Phone 8 apps and testing on the phone emulator. When this article went to press the requirements were as follows:

64-bit version of Windows 8 Pro or higher

4GB or more of RAM

Second Level Address Translation supported by the BIOS

Hyper-V installed and running

Visual Studio 2012 Express for Windows Phone or higher

As always, it’s best to check MSDN documentation for the latest requirements before attempting to develop and run your app.

Three other things to keep in mind when you develop your own app from scratch:

Ensure that the device microphone and speaker are working properly.

Add capabilities for speech recognition and microphone to the WMAppManifest.xml file, either by checking the appropriate boxes in the properties editor or manually by including the following in the XML file:

<Capability Name="ID_CAP_SPEECH_RECOGNITION"/>

<Capability Name="ID_CAP_MICROPHONE"/>

When attempting speech recognition, you should catch the exception thrown when the user hasn’t accepted the speech privacy policy. The GetNewMemoByVoice helper function in MainPage.xaml.cs in the accompanying sample code download gives an example of how to do this.

The Scenario

On any smartphone, a common scenario is to launch an app and execute a single command, optionally followed by more commands. Doing this manually requires several steps: finding the app, navigating to the right place, finding the button or menu item, tapping on that, and so on. Many users find this frustrating even after they’ve become accustomed to the steps.

For example, to display a saved memo—for instance, memo No. 12—in the Magic Memo sample app, the user must find and launch the app, tap on “View saved memos” and scroll down until the desired memo is displayed. Contrast this with the experience of using the Windows Phone 8 voice commands feature: The user holds the Start button and says “Magic Memo, show memo 12,” after which the Magic Memo app is launched and the desired memo is displayed in a message box. Even for this simple command, there’s a clear savings in user interaction.
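A VCD file that declares the "show memo" command from this example might look roughly like the following. This is a minimal sketch rather than the file shipped with the sample: the command set name MagicMemoEnu, the command name ShowMemo, the phrase list label num and the target page /ShowMemo.xaml are all hypothetical names chosen for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch of a Voice Command Definition file; all Name, Label and Target values are assumptions. -->
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.0">
  <CommandSet xml:lang="en-us" Name="MagicMemoEnu">
    <!-- The phrase the user says first to address this app. -->
    <CommandPrefix>Magic Memo</CommandPrefix>
    <Example>show memo 12</Example>

    <Command Name="ShowMemo">
      <Example>show memo 12</Example>
      <!-- {num} refers to the phrase list with the matching label below. -->
      <ListenFor>show memo {num}</ListenFor>
      <Feedback>Showing memo {num}</Feedback>
      <!-- Hypothetical page that displays the requested memo. -->
      <Navigate Target="/ShowMemo.xaml"/>
    </Command>

    <!-- Memo numbers the recognizer will accept; shortened here for brevity. -->
    <PhraseList Label="num">
      <Item>1</Item>
      <Item>2</Item>
      <Item>12</Item>
    </PhraseList>
  </CommandSet>
</VoiceCommands>
```

With a definition along these lines, the app would look for the command name (ShowMemo) and the value matched for the num label in the query string it receives at launch.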

There are three steps to implementing voice commands in an app and an optional fourth step for handling dynamic content. The following sections outline those steps.

Specifying the User Commands to Recognize...

Last month, in part 1 (msdn.microsoft.com/magazine/jj721592) of this two-part series, I discussed enabling voice commands in a Windows Phone 8 app. Here, I’ll discuss dialog with the user in a running app using speech input and output.

Once an app has been launched, many scenarios can benefit from interaction between the user and the phone using speech input and output. A natural one is in-app dialog. For example, the user can launch the Magic Memo app (see previous article) to go to the main page and then use speech recognition to enter a new memo, receive audio feedback and confirm the changes. Assuming no misrecognitions, the user can completely enter and save several memos without touching the phone (other than the first long push on the Start button).

You can imagine many other scenarios using speech dialog starting out in the app. For example, once the user has navigated to a page showing a list of saved favorites such as memos, movies or memorabilia, she could use recognition to choose one and take an action: edit, play, order, remove and so on. Speech output would then read back the selection and ask for confirmation.

In the following sections I’ll lay out examples using speech for input and output, starting with simple examples and working up to more complex examples. I’ll show how easy it is to implement the simple cases and show some of the richer functionality available for advanced scenarios.

Communicating to the User: Speech Synthesis API

Computer-generated speech output is variously called text-to-speech (TTS) or speech synthesis (though strictly speaking, TTS encompasses more than speech synthesis). Common uses include notification and confirmation, as mentioned earlier, but it’s also essential to other use cases such as book readers or screen readers.

...

Speech Input: Speech Recognition API

The two broad classes of use cases for speech recognition in an app are text input and command and control. In the first use case, text input, the app simply captures the user’s utterance as text; this is useful when the user could say almost anything, as in the “new memo” feature of the sample code.

In the second use case, command and control, the user manipulates the app by spoken utterance rather than by tapping buttons or sliding a finger across the face of the phone. This use case is especially useful in hands-free scenarios such as driving or cooking.

A Simple Example of Speech Recognition

Before going into detail about the features of speech recognition in an app, let’s take a look at the simplest case: text input in a few lines of code.

...

Introduction to Speech Recognition Grammars

Modern speech recognition engines all use grammars to constrain the set of phrases through which the recognition engine must search (hereafter called the “search space”) to find a match to the user’s utterance, and thus improve recognition accuracy. Grammar rules may allow recognition of phrases as simple as a list of numbers or as complex as general conversational text.

In the Windows Phone 8 speech API you can specify a grammar in three ways, as described in the following sections. For each case, you add the grammar to a collection of grammars on the SpeechRecognizer object.

Simple List Grammar

The easiest way to specify a custom grammar for an app is to provide a list of all the phrases for which the recognizer should listen in a simple string array. These list grammars are handled by the on-device speech recognition engine. The code to create and add a list grammar can be as simple as the following for a static list of button names to recognize against:

...

The Next ‘Killer App’

The speech features for apps on Windows Phone 8 represent, among all smartphone offerings, the first fully functional developer platform for speech featuring both on-device and remote recognition services. Using voice commands and in-app dialog, you can open up your app to many compelling scenarios that will delight your users. With these speech features, your app could catch the buzz and be the next “killer app” in the marketplace.

So the ball is in your court now. Let’s see those awesome voice-enabled apps!