Using Events with TTS (SAPI 5.3)

This tutorial covers a basic text-to-speech example but uses a Windows application with a graphical interface.

Setting up the project

Create the project

The code is generated by Visual C++ 6.0 and uses the "Hello, World" example. To make the sample base, create a new Win32 Application project and call it "Test." In the subsequent wizard, select "a typical 'Hello World!' application." The resulting project is lengthier than the command line version. Most of the new complexity has little to do with SAPI, however, since graphical interfaces require more code to function.

Set SAPI paths

The SAPI paths need to be declared. Add the directory containing Sapi.h to the include path:

On the Tools menu, click Options.

Click the Directories tab.

In the Show directories for drop-down list, select Include files.

Add the path by clicking the first unused line in the Directories list and entering "C:\Program Files\Microsoft Speech SDK 5.3\Include".

Create speak menu item

To speak on demand, one modification is required: a mechanism to initiate speech. In Visual C++, add a File menu item called Speak with a resource ID of IDM_SPEAK. The code handling the event from this menu item is addressed later in this example. Compile and run the application to make sure everything works. The application displays nothing other than "Hello, World" along the top of the screen. Even so, it's a good start.

Using the sample

This sample is not a practical one since it speaks only one sentence. The sentence is hard coded, something few applications would do in a practical situation. A more complete or robust application would retrieve the text from a dialog box, resource, or file. However, the sample represents the foundation of text-to-speech and showcases many of its mechanisms.

More importantly, it demonstrates the interaction between SAPI and the application. Text-to-speech would be only marginally useful if speaking were all it did. Using this interaction, however, the application can determine which words are being spoken. In two separate examples using this information, the application displays the words on the screen and highlights them in real time. In doing so, the application also demonstrates the eventing model for SAPI. This includes a brief explanation of speech messages and a related feature, interests. Interests are unique to SAPI.

Furthermore, the interaction is not limited to determining words spoken. A multitude of activities involving SAPI or speech engines could interest the application. SPEVENTENUM lists these possible activities. For instance, if your application is animating a character for speech, you would be interested each time a new viseme is encountered. The viseme essentially represents a change in the mouth position during speech. Accordingly, the character's mouth would move, or even close. In the same way, starting and stopping of the speech audio stream could interest the application. In general, these activities are called interests.

Step 1: Initialize COM

As with any SAPI application, COM must be successfully initialized. This is done in a simple manner illustrated below in a snippet from WinMain(). The only restriction is that COM must be available before any SAPI-specific code is implemented and it must be active during the time SAPI is used. Since SAPI is implemented in InitInstance(), the COM statements come before InitInstance() and after the event loop, essentially enclosing the entire initialization and message loop.
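The arrangement above can be sketched as follows. InitInstance() and the message loop come from the wizard-generated code, so details will vary; the key point is that the COM calls enclose everything that touches SAPI:

```cpp
#include <windows.h>

// WinMain() sketch: CoInitialize() before InitInstance(), and
// CoUninitialize() only after the message loop has ended.
int APIENTRY WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
                     LPSTR lpCmdLine, int nCmdShow)
{
    // COM must be available before any SAPI-specific code runs.
    if (FAILED(::CoInitialize(NULL)))
        return FALSE;

    MSG msg;
    if (!InitInstance(hInstance, nCmdShow))   // voice is created in here
    {
        ::CoUninitialize();
        return FALSE;
    }

    while (GetMessage(&msg, NULL, 0, 0))
    {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }

    // COM stays active for the entire time SAPI is in use.
    ::CoUninitialize();
    return (int)msg.wParam;
}
```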

Step 2: Setting up voices

Once COM is running, the next step is to create the voice. Simply declare the instance and use CoCreateInstance(). As mentioned in the command line example, SAPI uses intelligent defaults. This requires a minimal amount of initialization and you can use the voice immediately. The defaults are located in Speech properties in Control Panel and include a selection of voices (if more than one is available on your system), and languages (English, Japanese, etc.). While some defaults are obvious, others are not (speaking rate, pitch, etc.). Nevertheless, you can change all defaults either through Speech properties or programmatically.

This example makes several exceptions for the sake of brevity and convenience. First, it uses InitInstance() to initialize the voice; for this demonstration, InitInstance() is the least intrusive place for the call. Applications, especially those using speech recognition (SR) instances, may have their own initialization procedures so that the speech code is more isolated. Second, the voice is declared globally. Depending on your application's design and requirements, you may not need a global declaration. Third, the instance is immediately released and the memory freed. Obviously, if the voice is to be used, it cannot be released beforehand; in fact, even this application will not keep those statements for long. Last, if the initialization fails, this application simply stops. A more robust application would check errors more extensively and report more detailed information.
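A minimal sketch of the voice creation described above, using the global pVoice declaration that later steps of this example rely on:

```cpp
#include <sapi.h>

ISpVoice *pVoice = NULL;   // declared globally, as described above

// Inside InitInstance(): one call creates a voice that uses the
// intelligent defaults from Speech properties in Control Panel.
BOOL CreateVoice()
{
    HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                  IID_ISpVoice, (void **)&pVoice);
    if (FAILED(hr))
        return FALSE;   // a robust application would report details

    // For this first pass only, the instance is released immediately;
    // later steps remove these lines so the voice can speak and
    // raise events.
    pVoice->Release();
    pVoice = NULL;
    return TRUE;
}
```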

Step 3: Speak!

Fortunately, the most interesting part of the task is also the simplest. Speaking a sentence involves calling one line. The text to be spoken is provided as a parameter. The source of that text depends on the application. As mentioned previously, the string usually comes from a dialog box or a file. Alternatively, the string can come from a stream, but that is handled by another call, ISpVoice::SpeakStream. This example uses a simple, hard-coded sentence. While Speak could have used an inline string such as:

pVoice->Speak( L"I am glad to speak.", SPF_ASYNC, NULL);

the string will be used several times during the application: the application retrieves each word and parses it accordingly. For that reason, it is copied to a global string before being used.

The code is placed inside the window messaging area within WndProc(). Selecting Speak from the File menu speaks the sentence: "I am glad to speak."
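The menu handling can be sketched as follows, assuming the global pVoice and theString used elsewhere in this example:

```cpp
// Inside WndProc(): handle the Speak item added to the File menu.
case WM_COMMAND:
    switch (LOWORD(wParam))
    {
    case IDM_SPEAK:
        // SPF_ASYNC returns immediately; the engine speaks in the
        // background while the message loop keeps running.
        pVoice->Speak(theString, SPF_ASYNC, NULL);
        break;
    }
    break;
```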

Step 4: Setting events

Like most Windows applications, there are interactions among the components, and messages are sent to indicate them. SAPI is no different. As information is processed by either the TTS or SR engine, certain activities are initiated or completed. Often these activities by SAPI or SAPI engines are of interest to the application. For example, the application could be informed when a recognition process starts, so that the user can be notified. Likewise, the application may want to know when there is no more information to process, perhaps to inform the user of this condition, or even to shut down the engine or the application itself when it is safe to do so.

An application processes the information of these activities in a two-step operation. First, it receives a general message from SAPI or a SAPI engine. This message is similar to other messages, such as window events, mouse clicks, or a myriad of other messages used by the operating system. Since the message is not defined by the operating system, the application must define it. However, all activities from SAPI use the same message. To determine the exact activity taking place, additional information is provided by SAPI and is called an interest. A complete list of interests is found in SPEVENTENUM.

The second step comes after trapping the message. The application examines an event structure completed by SAPI and retrieves the relevant information.

Setting interests

During initialization, SAPI can be informed of which interests to pass back to the application. This is done using ISpEventSource::SetInterest. By default, TTS does not set any interests and SR uses only recognition (SPEI_RECOGNITION). That is, if the SetInterest call were omitted entirely, TTS would not pass back any interest information to the application and SR would report only successful recognitions. Values can be combined with logical OR statements. Using this combination, two or more interests can be specifically set, while excluding others at the same time. Using the first parameter, the application can be notified when a specific interest occurs. The second parameter queues the interest for later retrieval. For the moment, keep the two parameters of SetInterest identical, since the application will need to store information later. Interests can be changed at any time in the application as the user's requirements change.
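The word-boundary interest this example relies on could be set as follows; the SPFEI macro converts an SPEVENTENUM value into the bit-flag format that SetInterest expects:

```cpp
// Both parameters are kept identical: notify the application of
// word boundaries AND queue them for retrieval with GetEvents.
// Additional interests could be combined with |, for example:
//   SPFEI(SPEI_WORD_BOUNDARY) | SPFEI(SPEI_VISEME)
HRESULT hr = pVoice->SetInterest(SPFEI(SPEI_WORD_BOUNDARY),
                                 SPFEI(SPEI_WORD_BOUNDARY));
```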

Setting messages

Regardless of the interests set, the application has to associate a message with SAPI. This is done with ISpNotifySource::SetNotifyWindowMessage. If this call is not included, no messages can be sent back to the application. There are three types of message notifications, and at least one must be included to receive messages. A fourth type is for multithreaded applications and is not used here. All four are explained in the ISpNotifySource interface section. The actual message name and ID are determined by the application. This example uses the standard WM_USER for private messages.
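A one-line sketch of that notification setup, assuming hWnd is the application's main window handle:

```cpp
// Ask SAPI to post WM_USER to the main window whenever an event
// matching the interests set earlier occurs. The last two parameters
// are passed back unchanged in the message's wParam and lParam.
HRESULT hr = pVoice->SetNotifyWindowMessage(hWnd, WM_USER, 0, 0);
```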

Step 5: Determining events

As mentioned previously, working with events is a two step process. The first is a simple and standard approach to Windows events. A message (however generated) is sent back to the application and the message loop dispatches it accordingly. In this example, WndProc() receives the WM_USER message. Once the message is trapped, the rest relies on SAPI.

The second step is to determine which interest occurred. Since the SetInterest call requested only SPEI_WORD_BOUNDARY, it is likely an SPEI_WORD_BOUNDARY interest. However, in larger applications, or if several interests were set, the application must be able to determine the exact one. SAPI provides this information through the SPEVENT structure and the GetEvents method. Used together, they retrieve specific information about the SAPI event, including the type of interest. The value of the eEventId member coincides with the values used by SetInterest. The SPEVENT structure must be initialized before first use and cleared before reuse, because information can persist from call to call. The helper function SpClearEvent clears the event.

It is possible for events and interests to occur faster than the application can process them. This is a common situation, especially if a viseme interest is set, because it generates an event for each sound encountered. GetEvents can retrieve more than one event at a time. This allows for batch processing of events, should a more specialized application need to do so. Another way to handle this situation is to use a while loop, which retrieves each event one at a time. Regardless of the design, once a valid SPEVENT is available, the application has only to compare the interest type from eEventId with an action. Again for simplicity, a switch statement filters interests and subsequent code completes the action.
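The while-loop approach can be sketched as follows, inside the WM_USER handler in WndProc(); SpClearEvent comes from the SDK helper header sphelper.h:

```cpp
#include <sphelper.h>   // SpClearEvent helper

// Inside WndProc(): WM_USER was registered with SetNotifyWindowMessage.
case WM_USER:
{
    SPEVENT event;
    SpClearEvent(&event);                    // initialize before first use
    // GetEvents returns S_OK as long as it fetched the requested
    // single event; drain the queue one event at a time.
    while (pVoice->GetEvents(1, &event, NULL) == S_OK)
    {
        switch (event.eEventId)              // which interest occurred?
        {
        case SPEI_WORD_BOUNDARY:
            // React to the word boundary here (see Step 6).
            break;
        }
        SpClearEvent(&event);                // clear before reuse
    }
    break;
}
```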

Step 6: Reacting to events

Once the event and interest are determined, the programming becomes more standard. How an actual interest is handled is the application's own design and implementation. In this example, the application identifies individual words using the SPEI_WORD_BOUNDARY interest. Whenever this interest is returned, the SAPI engine has found a distinct word, usually delimited by white space or certain punctuation. Also in this case, relevant information is passed back from an ISpVoice::GetStatus call using the SPVOICESTATUS structure.

The individual words are noted as offsets from the complete string, marking the positions of the first and last letters of the word. For demonstration, the words are then displayed in a Win32 message box on the screen. One subtlety to notice is that each word is displayed as soon as possible. That is, the screen is updated during the actual speaking of the text. This behavior is controlled by the SPF_ASYNC flag of the ISpVoice::Speak method:

pVoice->Speak( theString, SPF_ASYNC, NULL);

The alternative is to wait until all the speech is complete and then process the events and interests. For example, if the second parameter were replaced with NULL (that is, SPF_DEFAULT), the message boxes would still display, but not until the speaking is complete. The difference in timing may be important, depending on the application's needs.
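The word-boundary handling described in this step can be sketched as follows. wordBuf is a hypothetical local buffer; pVoice, theString, and hWnd are assumed from earlier in the example:

```cpp
// On SPEI_WORD_BOUNDARY: ask the voice where it is in the input text.
SPVOICESTATUS status;
HRESULT hr = pVoice->GetStatus(&status, NULL);
if (SUCCEEDED(hr))
{
    // ulInputWordPos and ulInputWordLen are character offsets into
    // the string passed to Speak; copy the current word out of it.
    WCHAR wordBuf[64] = L"";
    ULONG len = status.ulInputWordLen < 63 ? status.ulInputWordLen : 63;
    wcsncpy(wordBuf, theString + status.ulInputWordPos, len);
    wordBuf[len] = L'\0';

    // For demonstration, show each word as it is spoken.
    MessageBoxW(hWnd, wordBuf, L"Word spoken", MB_OK);
}
```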