Natural vs. Directed Dialog and How VoiceXML Enables Both

This article is the third in a three-part series introducing VoiceXML, along with SRGS, SSML, and SISR, for building conversational web applications. The first installment discussed building VoiceXML dialogs with the menu and form elements. The second outlined how VoiceXML takes advantage of the distributed web-based application model, along with advanced features such as local validation and processing, audio playback and recording, context-specific and tapered help, and reusable subdialogs. This final piece discusses natural versus directed dialog and how VoiceXML enables both by allowing input grammars to be specified at the form level, not just at the field level.

To review from the first two articles: the web has primarily delivered information and services through visual interfaces, and as a result has largely bypassed customers who rely on the telephone, for whom voice input and audio output are the primary means of interaction.
Building on the market established in 1999 by the VoiceXML Forum's VoiceXML 1.0 specification [VXML1], VoiceXML 2.0 and several complementary standards are changing the way we interact with voice services and applications by simplifying the way those services and applications are built.

Natural Dialog

VoiceXML as presented in the first two articles provides the capability for simple "directed" dialogs, meaning that the computer directs the conversation at each step by prompting the user for the next piece of information. Dialogs between humans of course don't operate on this simple model. In a natural dialog each participant may at various stages take the initiative in leading the conversation. A computer-human dialog modeled on this idea is referred to as a "mixed-initiative" dialog because either the computer or the human may take the initiative in leading the conversation.

The field of spoken interfaces is not nearly as mature as the field of visual interfaces, so standardizing an approach to natural dialog is more difficult than designing a standard language for describing visual interfaces like HTML. Nevertheless VoiceXML takes some modest steps toward allowing applications to be built that give the user some degree of control over the conversation.

Recall the directed ordering dialog from the first article, in which the browser prompts for each field in turn:

Browser: What would you like to drink?
User: Orange juice.
Browser: What sandwich would you like?
User: Roast beef, lettuce, and swiss on rye.

The set of phrases that the user could speak in response to each field prompt was specified by a separate grammar for each field. This approach allows the user to supply only one field value at a time, in a fixed sequence.
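As a sketch of this field-level approach (the markup below is illustrative, not the series' original example; the form and field names are assumptions), each field carries its own inline SRGS grammar:

```xml
<!-- Sketch: a directed dialog where each field has its own grammar.
     The browser prompts for "drink", then matches the reply against
     only that field's grammar. -->
<form id="order">
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
    <grammar version="1.0" root="drink" mode="voice">
      <rule id="drink">
        <one-of>
          <item>orange juice</item>
          <item>coffee</item>
          <item>milk</item>
        </one-of>
      </rule>
    </grammar>
  </field>
  <!-- ...a "sandwich" field with its own grammar would follow... -->
</form>
```

Because each grammar is active only while its field is being visited, the user cannot answer a later question early; that limitation is what the form-level grammars discussed below remove.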

Consider a dialog for making airline travel reservations in which the user must supply a date, a city to fly from, and a city to fly to. A directed dialog conversation for completing such a form might proceed as follows:

Browser: Where are you traveling from?
User: New York.
Browser: Where are you traveling to?
User: Chicago.
Browser: When would you like to travel?
User: ...
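The directed version of this travel form could be sketched as follows; the field names, grammar URIs, and submit target are illustrative assumptions, not taken from the original article:

```xml
<!-- Sketch of a directed travel form: the interpreter visits each
     field in document order, playing its prompt and applying only
     that field's grammar. -->
<form id="travel">
  <field name="from_city">
    <prompt>Where are you traveling from?</prompt>
    <grammar src="city.grxml" type="application/srgs+xml"/>
  </field>
  <field name="to_city">
    <prompt>Where are you traveling to?</prompt>
    <grammar src="city.grxml" type="application/srgs+xml"/>
  </field>
  <field name="travel_date">
    <prompt>When would you like to travel?</prompt>
    <grammar src="date.grxml" type="application/srgs+xml"/>
  </field>
  <filled mode="all">
    <submit next="reserve.jsp" namelist="from_city to_city travel_date"/>
  </filled>
</form>
```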

By contrast, a somewhat more natural dialog might proceed as follows:

Browser: How can I help you?
User: I'd like to fly from New York to Chicago.
Browser: When would you like to travel?
User: ...

VoiceXML enables such dialogs by allowing input grammars to be specified at the form level, not just at the field level, and by annotating the rules in the grammar to identify which portion of the user's utterance is intended for which field in the form. This annotation is accomplished through semantic interpretation.
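A minimal sketch of such a mixed-initiative form, under stated assumptions (the field names, the phrasing accepted by the grammar, and the SISR script in the <tag> elements are illustrative, not the article's original markup), might look like this:

```xml
<!-- Sketch: a form-level grammar plus an <initial> prompt. The <tag>
     elements (SISR, tag-format semantics/1.0) copy matched words into
     named properties of the result; properties whose names match field
     names fill those fields, so one utterance can fill several fields. -->
<form id="travel">
  <grammar version="1.0" root="request" mode="voice"
           tag-format="semantics/1.0">
    <rule id="request">
      I'd like to fly from
      <ruleref uri="#city"/> <tag>out.from_city = rules.latest();</tag>
      to
      <ruleref uri="#city"/> <tag>out.to_city = rules.latest();</tag>
    </rule>
    <rule id="city">
      <one-of>
        <item>New York</item>
        <item>Chicago</item>
      </one-of>
    </rule>
  </grammar>
  <initial name="start">
    <prompt>How can I help you?</prompt>
  </initial>
  <field name="from_city">
    <prompt>Where are you traveling from?</prompt>
  </field>
  <field name="to_city">
    <prompt>Where are you traveling to?</prompt>
  </field>
  <field name="travel_date">
    <prompt>When would you like to travel?</prompt>
    <grammar src="date.grxml" type="application/srgs+xml"/>
  </field>
</form>
```

If the user answers the initial prompt with "I'd like to fly from New York to Chicago," both city fields are filled at once, and the interpreter moves on to prompt only for the still-empty travel date.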

Although such a form-level grammar looks fairly complex, it isn't really. It differs from the earlier field-level grammars only in that it includes semantic interpretation information, indicated by the <tag> elements throughout.