Next week, I look forward to joining over 100,000 industry partners from over 29 countries at COMPUTEX 2016 in Taipei. The partner ecosystem is critical to our success and we continue to work hard with our hardware partners to create opportunities, drive innovation, and deliver incredible Windows devices to the world.

On June 1st, I will take the stage to discuss how Microsoft is enabling modern Windows devices and experiences. I look forward to sharing how Windows 10 technologies like Windows Hello, Windows Ink, Cortana and more are coming to life in all new and exciting ways on a broad range of devices. And, how Microsoft productivity apps and cloud services provide rich experiences across devices to help people achieve more – and have more fun.

This year’s COMPUTEX is particularly exciting because I’ll be joined by Terry Myerson, executive vice president, Windows & Devices Group, and Alex Kipman, technical fellow and inventor of Microsoft HoloLens, to explore how Windows 10 can inspire all new devices.

I look forward to seeing our partners in Taipei and we’ll share our keynote news right here, on the Windows Blog.

In the previous article, we introduced the idea of recognizing speech inside of a Windows 10 Universal Windows Platform (UWP) app and took a look at the SpeechRecognizer class and some of what it can do to enable speech recognition in our apps.

In this article, we’re going to dig further into some of the options for speech recognition with SpeechRecognizer. First, however, we’re going to take a detour and look at speech synthesis via the recognizer’s friend, the SpeechSynthesizer.
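The starting point is a small helper that asks the SpeechSynthesizer to turn a piece of text into audio. Here’s a minimal sketch of what that might look like (the method name is an assumption):

// Requires: using System; using System.Threading.Tasks; using Windows.Media.SpeechSynthesis;
async Task<SpeechSynthesisStream> SynthesizeTextAsync(string text)
{
    using (var synthesizer = new SpeechSynthesizer())
    {
        // Produce a stream of synthesized audio - note that nothing is played yet.
        return await synthesizer.SynthesizeTextToStreamAsync(text);
    }
}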

This is short and sweet, but it’s not quite the inverse of the functionality that we saw in the previous article with SpeechRecognizer, which actively listened to a microphone for speech. The SpeechSynthesizer does not itself emit audio via a speaker or headphones; it simply hands back a stream of synthesized audio.

If we call this function with a string like “Hello World,” then we get an audio stream back that we then need to play by some means, such as a XAML MediaElement. The method below builds on the previous one to add that extra piece of functionality:
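Here’s a sketch of how that combined method might look (again, the name is an assumption), handing the synthesized stream to a MediaElement for playback:

// Requires: using System.Threading.Tasks; using Windows.Media.SpeechSynthesis;
// using Windows.UI.Xaml.Controls;
async Task SpeakTextAsync(string text, MediaElement mediaElement)
{
    // Synthesize the text to an audio stream...
    SpeechSynthesisStream stream = await this.SynthesizeTextAsync(text);

    // ...and play it through the supplied MediaElement. PlayStreamAsync is the
    // extension method discussed a little further below.
    await mediaElement.PlayStreamAsync(stream);
}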

If we have a MediaElement defined in a XAML UI and named 'uiMediaElement', then we could call this new method with a snippet like this:
async void Button_Click(object sender, RoutedEventArgs e)
{
    await this.SpeakTextAsync("Hello World", this.uiMediaElement);
}

The sharp-eyed amongst you may know that there previously was no MediaElement.PlayStreamAsync method, so we added one as an extension to tidy up the code. Here’s a possible implementation of that extension method:
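The sketch below sets the stream as the MediaElement’s source, starts playback and, as one possible design choice, waits for the MediaEnded event so that callers can await the speech finishing:

// Requires: using System; using System.Threading.Tasks; using Windows.Storage.Streams;
// using Windows.UI.Xaml; using Windows.UI.Xaml.Controls;
static class MediaElementExtensions
{
    public static async Task PlayStreamAsync(
        this MediaElement mediaElement, IRandomAccessStreamWithContentType stream)
    {
        var completion = new TaskCompletionSource<bool>();

        // Signal completion when playback of the stream finishes.
        RoutedEventHandler handler = (s, e) => completion.TrySetResult(true);
        mediaElement.MediaEnded += handler;

        mediaElement.SetSource(stream, stream.ContentType);
        mediaElement.Play();

        await completion.Task;
        mediaElement.MediaEnded -= handler;
    }
}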

Now we’re talking (literally!) and we can vary the output here by manipulating properties on the MediaElement such as PlaybackRate. But there are other and perhaps better ways of affecting the voice being used here.

Where is the voice coming from?

The voice that’s being used for synthesis by the previous code snippet will be the default voice configured in the system’s speech settings.

This choice of voice is reflected in the SpeechSynthesizer.DefaultVoice property but can be changed by making use of the SpeechSynthesizer.AllVoices property to find and use other voices, as in the snippet below:
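Here’s a sketch of that, picking out the first female voice as an example:

// Requires: using System.Linq; using System.Threading.Tasks; using Windows.Media.SpeechSynthesis;
async Task<SpeechSynthesisStream> SynthesizeWithFemaleVoiceAsync(string text)
{
    using (var synthesizer = new SpeechSynthesizer())
    {
        // Find the first female voice installed on the system, if there is one...
        VoiceInformation femaleVoice =
            SpeechSynthesizer.AllVoices.FirstOrDefault(v => v.Gender == VoiceGender.Female);

        // ...and use it for synthesis (otherwise the default voice is kept).
        if (femaleVoice != null)
        {
            synthesizer.Voice = femaleVoice;
        }

        return await synthesizer.SynthesizeTextToStreamAsync(text);
    }
}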

This code changes the voice used for synthesis to the first female voice that it finds on the system, and your own app can similarly offer the user a choice for which voice they’d like to hear.

Taking more control of speech delivery

There’s more that an app can do to control how speech is delivered to the user via the SpeechSynthesizer.SynthesizeSsmlToStreamAsync method.

SSML is Speech Synthesis Markup Language and it’s an XML grammar that can be used to control many aspects of speech generation, including volume, pronunciation, and pitch. The complete specification is on the W3C site. It’s fairly involved and perhaps more suited to specialized scenarios, but it’s relatively easy to write a similar method that synthesizes speech from any SSML file:
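Here’s a sketch of such a method, assuming the SSML lives in a file packaged with the app (the file handling and names here are assumptions):

// Requires: using System; using System.Threading.Tasks;
// using Windows.Media.SpeechSynthesis; using Windows.Storage;
async Task<SpeechSynthesisStream> SynthesizeSsmlFileAsync(string fileName)
{
    // Load the SSML file that has been packaged into the app.
    StorageFile file = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri($"ms-appx:///{fileName}"));

    string ssml = await FileIO.ReadTextAsync(file);

    using (var synthesizer = new SpeechSynthesizer())
    {
        // Hand the markup, rather than plain text, to the synthesizer.
        return await synthesizer.SynthesizeSsmlToStreamAsync(ssml);
    }
}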

If we use the opening lines of ‘Macbeth’ as an example piece of text, then below is a simple way of marking it up as SSML:

<?xml version='1.0' encoding='utf-8'?>
<speak
  version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis"
  xml:lang="en-US">
  When shall we three meet again in thunder, lightning, or in rain?
  When the hurlyburly's done
  When the battle's lost and won
  That will be ere set of sun
  Where the place?
  Upon the heath
  There to meet with Macbeth
</speak>

But we can drive quite different output from the synthesizer if we take some control in our SSML and add some pauses, emphasis, and speed settings:
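Something along these lines, using the standard SSML break, emphasis and prosody elements (the particular timings and levels here are just examples):

<?xml version='1.0' encoding='utf-8'?>
<speak
  version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis"
  xml:lang="en-US">
  <prosody rate="slow">
    When shall we three meet again in
    <emphasis level="strong">thunder</emphasis>,
    <emphasis level="strong">lightning</emphasis>,
    or in <emphasis level="strong">rain</emphasis>?
  </prosody>
  <break time="750ms"/>
  When the hurlyburly's done,
  when the battle's <emphasis>lost</emphasis> and <emphasis>won</emphasis>.
  <break time="750ms"/>
  That will be ere set of sun.
  <break time="500ms"/>
  Where the place?
  <break time="500ms"/>
  Upon the heath.
  <break time="500ms"/>
  <prosody rate="x-slow">
    There to meet with <emphasis level="strong">Macbeth</emphasis>.
  </prosody>
</speak>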

Taking more control of recognition

Continuing on this theme of ‘taking control,’ let’s now return to the SpeechRecognizer that was the subject of the previous article and see how we can apply a little more control there.

As we saw previously, the recognizer always constrains the speech that it is listening for by maintaining a list of SpeechRecognizer.Constraints of different types, which we’ll work through in the following sections.

Recognition with dictionaries and hints

If you do not explicitly add constraints to your SpeechRecognizer, it will default to a predefined dictation grammar. You can instead add a SpeechRecognitionTopicConstraint, whose SpeechRecognitionScenario lets you choose between options such as Dictation, WebSearch and FormFilling, together with a topic hint.

It’s important to note that (1) constraining speech this way requires the user to opt in to the ‘Get to know me’ option in the system’s privacy settings and (2) that recognition is performed by a remote web service, which means that there are potential implications around privacy, performance, and connectivity.

As an example, if you wanted to ask your user for a telephone number, then that’s easily done:
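Here’s a sketch using a form-filling topic constraint (the topic hint string and method name are assumptions):

// Requires: using System.Threading.Tasks; using Windows.Media.SpeechRecognition;
async Task<string> RecognizePhoneNumberAsync()
{
    using (var recognizer = new SpeechRecognizer())
    {
        // Hint that we are form-filling and that a phone number is expected.
        recognizer.Constraints.Add(
            new SpeechRecognitionTopicConstraint(
                SpeechRecognitionScenario.FormFilling, "Phone Number"));

        await recognizer.CompileConstraintsAsync();

        SpeechRecognitionResult result = await recognizer.RecognizeAsync();

        return result.Status == SpeechRecognitionResultStatus.Success ? result.Text : null;
    }
}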

This snippet did a very decent job of recognizing my own phone numbers whether spoken with or without country codes.

Recognizing lists of words or commands

Another area where the recognizer works really well is when it is guided to listen only for a specific list of words or commands. We can replace the SpeechRecognitionTopicConstraint in the previous snippet with a SpeechRecognitionListConstraint, as per the snippet below:
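Here’s a sketch with two tagged lists of commands (the particular words and tags are just examples):

// Requires: using System.Diagnostics; using System.Threading.Tasks;
// using Windows.Media.SpeechRecognition;
async Task RecognizeCommandAsync()
{
    using (var recognizer = new SpeechRecognizer())
    {
        // The recognizer will now only listen for the words in these two lists.
        recognizer.Constraints.Add(
            new SpeechRecognitionListConstraint(
                new[] { "play", "pause", "next", "previous", "louder", "quieter" },
                "mediaCommands"));

        recognizer.Constraints.Add(
            new SpeechRecognitionListConstraint(
                new[] { "zoom in", "zoom out", "shoot" },
                "cameraCommands"));

        await recognizer.CompileConstraintsAsync();

        SpeechRecognitionResult result = await recognizer.RecognizeAsync();

        if (result.Status == SpeechRecognitionResultStatus.Success)
        {
            // Constraint.Tag identifies which list matched; Text is the word spoken.
            Debug.WriteLine($"{result.Constraint.Tag}: {result.Text}");
        }
    }
}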

This might make sense for ‘command’-based scenarios, where voice shortcuts can supplement existing mechanisms of interaction.

Note that in the snippet above, the two constraints have been tagged – the SpeechRecognitionResult.Constraint can be checked after recognition and the Tag used to identify what has been recognized.

One example of this might be to use “play/pause/next/previous/louder/quieter” commands for a media player, while another might be to control a remote camera with “zoom in/zoom out/shoot” commands.

It’s possible to use this type of technique to implement a state machine whereby your app can drive the user through a sequence of speech interactions, selectively enabling and disabling the recognition of particular words or phrases based on the user’s state.
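One lightweight way to sketch this is to keep all of the constraints in place and toggle their IsEnabled flags as the user moves between states (depending on the constraint types involved, a further call to CompileConstraintsAsync may be needed):

// Sketch: enable only the constraints whose Tag matches the current state
// (assumes constraints were added and tagged as in the earlier snippet).
void EnableConstraintsForState(SpeechRecognizer recognizer, string stateTag)
{
    foreach (var constraint in recognizer.Constraints)
    {
        constraint.IsEnabled = (constraint.Tag == stateTag);
    }
}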

This is attractive in that it can be very dynamic, but it can also get complex. This is where a custom grammar can step in and make life easier.

Grammar-based recognition

Grammar-based recognition is usually based on a grammar that is static but which might contain many options. The grammars understood by the SpeechRecognitionGrammarFileConstraint are Speech Recognition Grammar Specification (SRGS) Version 1.0 grammars, as specified by the W3C.

That’s a big specification to plough through. It may be easier to look at some examples.

If we have the function below, which opens up a grammar.xml file embedded in the application and uses it to recognize speech, what then becomes interesting is the content of the grammar file and the way in which the recognition results can be interpreted, post-recognition, to replace the “Process the results…” comment:
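Here’s a sketch of what that function might look like (the method name is an assumption):

// Requires: using System; using System.Threading.Tasks;
// using Windows.Media.SpeechRecognition; using Windows.Storage;
async Task RecognizeWithGrammarFileAsync()
{
    // grammar.xml is assumed to be packaged as content within the app.
    StorageFile grammarFile = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///grammar.xml"));

    using (var recognizer = new SpeechRecognizer())
    {
        recognizer.Constraints.Add(new SpeechRecognitionGrammarFileConstraint(grammarFile));

        await recognizer.CompileConstraintsAsync();

        SpeechRecognitionResult result = await recognizer.RecognizeAsync();

        // Process the results...
    }
}

As a first, minimal example, grammar.xml might contain nothing more than a single rule named “foodItem” that recognizes one of two words and surfaces the choice as a semantic property named “foodType”. The grammar below is a hypothetical reconstruction, written to line up with the result-handling code that follows:

<?xml version="1.0" encoding="utf-8"?>
<grammar xml:lang="en-US" root="foodItem" version="1.0"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="foodItem" scope="public">
    <one-of>
      <item>pizza<tag>out.foodType = "pizza";</tag></item>
      <item>burger<tag>out.foodType = "burger";</tag></item>
    </one-of>
  </rule>
</grammar>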

The following snippet can then be used to inspect the SpeechRecognitionResult.SemanticInterpretation properties to interpret the results, and the SpeechRecognitionResult.RulePath can identify which rule steps have been followed, although there is only one in this example:

SpeechRecognitionResult result = await recognizer.RecognizeAsync();

if (result.Status == SpeechRecognitionResultStatus.Success)
{
    // This will be "foodItem", the only rule involved in parsing here
    string ruleId = result.RulePath[0];

    // This will be "pizza" or "burger"
    string foodType = result.SemanticInterpretation.Properties["foodType"].Single();
}

Example 2: Expanding the options

Let’s expand out this grammar to include more than just a single word and make it into a more realistic example of ordering food and a drink:
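Here’s a sketch of what that expanded grammar might look like; it’s a hypothetical reconstruction written to line up with the phrases and semantic properties described below:

<?xml version="1.0" encoding="utf-8"?>
<grammar xml:lang="en-US" root="order" version="1.0"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="order" scope="public">
    <item repeat="0-1">please</item>
    <one-of>
      <item>I would like to order</item>
      <item>I need to buy</item>
      <item>I want</item>
      <item>can I buy</item>
    </one-of>
    <item repeat="0-1">a</item>
    <ruleref uri="#foodItem"/>
    <tag>out.foodType = rules.foodItem;</tag>
    <item repeat="0-1">
      <ruleref uri="#drinkItem"/>
      <tag>out.drinkType = rules.drinkItem;</tag>
    </item>
    <item repeat="0-1">please</item>
  </rule>

  <rule id="foodItem">
    <one-of>
      <item>
        <one-of><item>hamburger</item><item>cheeseburger</item></one-of>
        <tag>out = "burger";</tag>
      </item>
      <item>
        <one-of><item>pizza</item><item>pie</item></one-of>
        <tag>out = "pizza";</tag>
      </item>
    </one-of>
  </rule>

  <rule id="drinkItem">
    <item repeat="0-1">and a</item>
    <ruleref uri="#drinkChoice"/>
    <tag>out = rules.drinkChoice;</tag>
  </rule>

  <rule id="drinkChoice">
    <one-of>
      <item>cola<tag>out = "cola";</tag></item>
      <item>lemonade<tag>out = "lemonade";</tag></item>
      <item>orange juice<tag>out = "orange juice";</tag></item>
    </one-of>
  </rule>
</grammar>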

This grammar has a root rule called “order”, which references other rules named “foodItem”, “drinkItem”, and “drinkChoice”. These rules present various items that are either required or optional and which break down into a number of choices. It is meant to allow for phrases such as the following:

“I would like to order a hamburger please”

“I need to buy a pie”

“I want a cheeseburger and a cola”

“Please can I buy a pizza and a lemonade”

It allows for quite a few other combinations, as well, but the process of parsing the semantic intention is still very simple and can be covered by this snippet of code for processing the results:

SpeechRecognitionResult result = await recognizer.RecognizeAsync();

if (result.Status == SpeechRecognitionResultStatus.Success)
{
    IReadOnlyDictionary<string, IReadOnlyList<string>> properties =
        result.SemanticInterpretation.Properties;

    // We could also examine the RulePath property to see which rules have
    // fired.
    if (properties.ContainsKey("foodType"))
    {
        // this will be "burger" or "pizza"
        string foodType = properties["foodType"].First();
    }

    if (properties.ContainsKey("drinkType"))
    {
        // this will be "cola", "lemonade" or "orange juice"
        string drinkType = properties["drinkType"].First();
    }
}

Hopefully, you get an impression of the sort of power and flexibility that a grammar can give to speech recognition and how it goes beyond what you might want to code yourself by manipulating word lists.

Voice-command-file-based recognition

For completeness, the last way in which the SpeechRecognizer can be constrained is with a SpeechRecognitionVoiceCommandDefinitionConstraint. This applies in scenarios where your app is activated by Cortana with a VoiceCommandActivatedEventArgs argument containing a Result property, which carries a Constraint of this type for interrogation. It is specific to those scenarios and not something that you would construct in your own code.

Wrapping up

We’ve covered quite a lot of ground in this and the previous post about speech recognition and speech synthesis, all using the APIs provided by the UWP.

In the next post, we’ll look at some speech-focused APIs available to us in the cloud and see where they overlap and extend the capabilities that we’ve seen so far.