The MIX07 keynote includes a brief screen capture of the Daily Mail eReader, which reminded me that I had been meaning to blog about the speech synthesis portion of the app. It's a feature many users may not have seen, because it's only available on Windows Vista. Vista includes version 5.3 of the Speech API (SAPI), whose text-to-speech (TTS) engine is greatly superior to the SAPI 5.1 engine that shipped with Windows XP.

How it looks

The way the UI works is pretty straightforward - it's meant to mimic an autocue. The user triggers the functionality and the window you see on the right appears. As the computer speaks the news story, the text slowly scrolls up. The word currently being spoken is always bold and always appears in the letterbox.

I'm personally a bit tickled that this bit of UI made it into Ray Ozzie's keynote presentation (even if it was only for about a quarter of a second) because I developed this part of the eReader and wrote this UI.

Getting SAPI to produce speech

The TTS functionality is actually quite easy to kick off. All it takes is a call to the appropriate function, passing in a string holding the text to be read, and SAPI will begin to speak. Specifically, .NET 3.0 provides a System.Speech assembly which does all the hard work for you. This assembly includes the class System.Speech.Synthesis.SpeechSynthesizer, which has a method public void Speak(string textToSpeak). This is easy to use if all you want is text-to-speech.
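As a minimal sketch, the synchronous case really is a couple of lines (the greeting string here is just illustrative):

```csharp
using System.Speech.Synthesis;

class Program
{
    static void Main()
    {
        // Requires a reference to the System.Speech assembly (.NET 3.0+).
        using (var synthesizer = new SpeechSynthesizer())
        {
            // Blocks until the entire string has been spoken
            // through the default audio device.
            synthesizer.Speak("Hello from the eReader.");
        }
    }
}
```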

The problem is that this is a synchronous call. So the call will block until the speech rendering is complete. To get around this, the SpeechSynthesizer class also includes a method SpeakAsync. This does pretty much what you'd expect and runs the TTS activity on a background thread.
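The asynchronous variant might look something like this - a sketch rather than the app's actual code. SpeakAsync returns immediately, and the SpeakCompleted event signals when rendering has finished:

```csharp
using System;
using System.Speech.Synthesis;

class AsyncSpeechDemo
{
    static void Main()
    {
        var synthesizer = new SpeechSynthesizer();

        // Raised once the background rendering finishes
        // (note: not on the thread that called SpeakAsync).
        synthesizer.SpeakCompleted += (sender, e) =>
            Console.WriteLine("Finished speaking.");

        // Returns immediately; speech renders on a background thread.
        synthesizer.SpeakAsync("This call does not block.");

        Console.WriteLine("Free to do other work here.");
        Console.ReadLine(); // keep the process alive while speech plays
    }
}
```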

Getting updated with TTS progress

The SpeechSynthesizer class also provides some helpful events relating to the progress of the speech rendering. However, it turns out that the SAPI libraries regard these events as incidental to their main job - that is, they raise the events when they can. So there is every possibility that an event will arrive some time after the actual speech rendering of a word (or phoneme, etc.) has started. The SAPI library also seems to stop raising the events altogether if the event handlers consume too much time. This was probably a deliberate design decision by Microsoft: the quality of the speech shouldn't be degraded by calls into user code.
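The most useful of these is the SpeakProgress event, which identifies the word currently being rendered. A sketch of subscribing to it (the console output is just for illustration - in the real app this is where the autocue update would be driven from):

```csharp
using System;
using System.Speech.Synthesis;

class ProgressDemo
{
    static void Main()
    {
        var synthesizer = new SpeechSynthesizer();

        synthesizer.SpeakProgress += (sender, e) =>
        {
            // e.CharacterPosition locates the word within the original
            // string; e.Text is the word itself. Keep this handler cheap:
            // heavy work here can cause SAPI to stop raising the event.
            Console.WriteLine("Speaking '{0}' at offset {1}",
                e.Text, e.CharacterPosition);
        };

        synthesizer.Speak("Each word triggers a progress event.");
    }
}
```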

This last point means that you need to be very careful how you write your event handlers. I suspect that a fair amount of time is consumed by the transition from the unmanaged SAPI libraries back into the System.Speech assembly, which doesn't leave much time for your C# code to do anything useful. It certainly doesn't leave enough time to update a XAML UI. With my initial approach, the UI would update for about two or three words and then updating would cease altogether.

The solution was (a) to be very, very careful to write efficient code; (b) to use the Output window for debug info rather than trying to break into the running code; and (c) to make good use of asynchronous delegates. The SAPI events are raised on a background thread, which means a Dispatcher.Invoke call would normally be needed before any code could touch the UI. The simplest solution was to use Dispatcher.BeginInvoke instead, so the UI update happens asynchronously with respect to the event handler and the handler returns to SAPI almost immediately.
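Putting points (a) and (c) together, a handler along these lines is what I mean - a hypothetical sketch where UpdateHighlight is an illustrative method standing in for the real autocue update:

```csharp
using System;
using System.Speech.Synthesis;
using System.Windows.Threading;

// Assumed to live in a WPF window or control, so 'Dispatcher' is available.
partial class ReaderWindow
{
    void OnSpeakProgress(object sender, SpeakProgressEventArgs e)
    {
        // Inside the SAPI callback, capture only cheap value data...
        int position = e.CharacterPosition;
        string word = e.Text;

        // ...and queue the expensive XAML work asynchronously. BeginInvoke
        // returns at once, so SAPI keeps raising progress events.
        Dispatcher.BeginInvoke(
            DispatcherPriority.Background,
            new Action(() => UpdateHighlight(position, word)));
    }

    void UpdateHighlight(int position, string word)
    {
        // Illustrative placeholder for the real UI update.
    }
}
```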

Constructing the UI

It took a little bit of trial and error to get a XAML UI that could update efficiently without continually doing a lot of layout work. Ironically, the grey letterbox with its opacity and the opacity gradient of the textual content were the easy bits! Getting a thick window border of Vista glass was also pretty trivial. The part which took some effort was working out that, to display the content, I needed to use a ScrollViewer control with three separate Run elements - one for the text already read, one for the word being read, and one for the text still to be read. The updates from the TTS engine can then simply be translated into shuffling characters between the various Run elements.
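The shuffling idea can be sketched like this. The element names (spokenRun and so on) and the storyText field are hypothetical, not taken from the eReader; the point is that each progress event only changes the Text of three Runs rather than rebuilding the document:

```csharp
// Assumed XAML, roughly:
//
// <ScrollViewer x:Name="viewer">
//   <TextBlock TextWrapping="Wrap">
//     <Run x:Name="spokenRun"/>
//     <Run x:Name="currentRun" FontWeight="Bold"/>
//     <Run x:Name="pendingRun"/>
//   </TextBlock>
// </ScrollViewer>

void UpdateHighlight(int position, string word)
{
    // storyText holds the complete article text that was passed to
    // SpeakAsync, so SpeakProgress offsets index directly into it.
    spokenRun.Text = storyText.Substring(0, position);
    currentRun.Text = word;
    pendingRun.Text = storyText.Substring(position + word.Length);
}
```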

Getting the highlighted word to stay in the letterbox also took some trial and error. The solution lay in the following line of code: