Abstract

This document specifies VoiceXML, the Voice Extensible Markup
Language. VoiceXML is designed for creating audio dialogs that
feature synthesized speech, digitized audio, recognition of
spoken and DTMF key input, recording of spoken input, telephony,
and mixed-initiative conversations. Its major goal is to bring
the advantages of web-based development and content delivery to
interactive voice response applications.

This document is a Note made available by W3C for discussion
only. This work does not imply endorsement by, or the consensus
of the W3C membership, nor that W3C has, is, or will be
allocating any resources to the issues addressed by the Note.
This document is a work in progress and may be updated, replaced,
or rendered obsolete by other documents at any time.

A list of current W3C technical documents can be found at the
Technical Reports page.

About the VoiceXML Forum

The VoiceXML Forum is an industry organization founded by
AT&T, IBM, Lucent and Motorola. It was established to develop
and promote the Voice eXtensible Markup Language (VoiceXML), a
new computer language designed to make Internet content and
information accessible via voice and phone.

With the backing and technology contributions of its four
world-class founders, and the support of leading Internet
industry players, the VoiceXML Forum has made speech-enabled
applications on the Internet a reality.

Implementation Scope

VoiceXML 1.0 was designed for speech-based telephony
applications. Where a specific application environment does not
require all the features of v1.0, exceptions should be clearly
noted, and publicly documented as a subset of VoiceXML 1.0. Any
vendor-specific additions or changes should be similarly noted as
proprietary extensions to VoiceXML 1.0. The VoiceXML Forum
provides no support for, and makes no guarantee of, future
compatibility with such changes.

Disclaimers

This document is subject to change without notice and may be
updated, replaced or made obsolete by other documents at any
time.

The VoiceXML Forum disclaims any and all warranties, whether
express or implied, including (without limitation) any implied
warranties of merchantability or fitness for a particular
purpose.

The descriptions contained herein do not imply the granting of
licenses to make, use, sell, license or otherwise transfer any
technology required to implement systems or components conforming
to this specification. The VoiceXML Forum, and its member
companies, makes no representation on technology described in
this specification regarding existing or future patent rights,
copyrights, trademarks, trade secrets or other proprietary
rights.

By submitting information to the VoiceXML Forum, and its
member companies, including but not limited to technical
information, you agree that the submitted information does not
contain any confidential or proprietary information, and that the
VoiceXML Forum may use the submitted information without any
restrictions or limitations.

Revision History

Version    Date           Description
0.9        17 Aug 1999    Initial release. Provided as baseline in support of
                          comment period from supporters.

This document introduces VoiceXML, the Voice Extensible Markup
Language. VoiceXML is designed for creating audio dialogs that
feature synthesized speech, digitized audio, recognition of
spoken and DTMF key input, recording of spoken input, telephony,
and mixed-initiative conversations. Its major goal is to bring
the advantages of web-based development and content delivery to
interactive voice response applications.

Here are two short examples of VoiceXML. The first is the
venerable “Hello World”:

The top-level element is <vxml>, which is mainly a
container for dialogs. There are two types of dialogs:
forms and menus. Forms present information and gather
input; menus offer choices of what to do next. This example has a
single form, which contains a block that synthesizes and presents
“Hello World!” to the user. Since the form does not
specify a successor dialog, the conversation ends.
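
As an illustration, the document described above could be written
as follows. This is a minimal sketch, not a normative listing:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <!-- The text in the block is synthesized and played to the user. -->
    <block>Hello World!</block>
  </form>
</vxml>
```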

Our second example asks the user for a choice of drink and
then submits it to a server script:
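
A sketch of such a document is shown here. The grammar file
drink.gram, the prompt wording, and the script URI are illustrative
assumptions:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <!-- The field collects the drink choice into the variable "drink". -->
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.gram" type="application/x-jsgf"/>
    </field>
    <!-- Once the field is filled, submit it to the server script. -->
    <block>
      <submit next="http://www.drink.example/drink2.asp"/>
    </block>
  </form>
</vxml>
```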

This section contains a high-level architectural model, whose
terminology is then used to describe the goals of VoiceXML, its
scope, its design principles, and the requirements it places on
the systems that support it.

The architectural model assumed by this document has the
following components:

Figure 1: Architectural Model

A document server (e.g. a web server) processes
requests from a client application, the VoiceXML
Interpreter, through the VoiceXML interpreter context.
The server produces VoiceXML documents in reply, which are
processed by the VoiceXML Interpreter. The VoiceXML interpreter
context may monitor user inputs in parallel with the VoiceXML
interpreter. For example, one VoiceXML interpreter context may
always listen for a special escape phrase that takes the user to
a high-level personal assistant, and another may listen for
escape phrases that alter user preferences like volume or
text-to-speech characteristics.

The implementation platform is controlled by the
VoiceXML interpreter context and by the VoiceXML interpreter. For
instance, in an interactive voice response application, the
VoiceXML interpreter context may be responsible for detecting an
incoming call, acquiring the initial VoiceXML document,
and answering the call, while the VoiceXML interpreter conducts
the dialog after answer. The implementation platform generates
events in response to user actions (e.g. spoken or character
input received, disconnect) and system events (e.g. timer
expiration). Some of these events are acted upon by the VoiceXML
interpreter itself, as specified by the VoiceXML document, while
others are acted upon by the VoiceXML interpreter context.

VoiceXML’s main goal is to bring the full power of web
development and content delivery to voice response applications,
and to free the authors of such applications from low-level
programming and resource management. It enables integration of
voice services with data services using the familiar
client-server paradigm. A voice service is viewed as a sequence
of interaction dialogs between a user and an implementation
platform. The dialogs are provided by document servers, which may
be external to the implementation platform. Document servers
maintain overall service logic, perform database and legacy
system operations, and produce dialogs. A VoiceXML document
specifies each interaction dialog to be conducted by a VoiceXML
interpreter. User input affects dialog interpretation and is
collected into requests submitted to a document server. The
document server may reply with another VoiceXML document to
continue the user’s session with other dialogs.

VoiceXML promotes service portability across implementation
platforms. It is a common language for content providers, tool
providers, and platform providers.

VoiceXML is easy to use for simple interactions, and yet provides
language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a
majority of voice response services, services with stringent
requirements may best be served by dedicated applications that
employ a finer level of control.

The language describes the human-machine interaction provided
by voice response systems, which includes:

Output of synthesized speech (text-to-speech).

Output of audio files.

Recognition of spoken input.

Recognition of DTMF input.

Recording of spoken input.

Telephony features such as call transfer and disconnect.

The language provides means for collecting character and/or
spoken input, assigning the input to document-defined request
variables, and making decisions that affect the interpretation of
documents written in the language. A document may be linked to
other documents through Uniform Resource Identifiers
(URIs).

The language promotes portability of services through
abstraction of platform resources.

The language accommodates platform diversity in supported
audio file formats, speech grammar formats, and URI schemes.
While platforms will respond to market pressures and support
common formats, the language per se will not specify them.

The language supports ease of authoring for common types of
interactions.

The language has a well-defined semantics that preserves the
author's intent regarding the behavior of interactions with the
user. Client heuristics are not required to determine document
element interpretation.

The language has a control flow mechanism.

The language enables a separation of service logic from
interaction behavior.

The language is not intended for heavy computation, database
operations, or legacy system operations. These are assumed to be
handled by resources outside the document interpreter, e.g. a
document server.

General service logic, state management, dialog generation,
and dialog sequencing are assumed to reside outside the document
interpreter.

The language provides ways to link documents using URIs, and
also to submit data to server scripts using URIs.

VoiceXML provides ways to identify exactly which data to
submit to the server, and which HTTP method (get or post) to use
in the submittal.
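
For instance, a fragment along the following lines would send only
two variables using the post method. The variable names and the
script path are assumptions chosen for illustration:

```xml
<!-- Submit only the city and state variables, using HTTP POST. -->
<submit next="/servlet/weather" namelist="city state" method="post"/>
```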

The language does not require document authors to explicitly
allocate and deallocate dialog resources, or deal with
concurrency. Resource allocation and concurrent threads of
control are to be handled by the implementation platform.

This section outlines the requirements on the
hardware/software platforms that will support a VoiceXML
interpreter.

Document acquisition. The interpreter context is
expected to acquire documents for the VoiceXML interpreter to act
on. In some cases, the document request is generated by the
interpretation of a VoiceXML document, while other requests are
generated by the interpreter context in response to events
outside the scope of the language, for example an incoming phone
call.

Audio output. An implementation platform can provide
audio output using audio files and/or using text-to-speech (TTS).
When both are supported, the platform must be able to freely
sequence TTS and audio output. Audio files are referred to by a
URI. The language does not specify a required set of audio file
formats.

Audio input. An implementation platform is required to
detect and report character and/or spoken input simultaneously
and to control input detection interval duration with a timer
whose length is specified by a VoiceXML document.

It must report characters (for example, DTMF) entered
by a user.

It must be able to receive speech recognition grammar
data dynamically. Some VoiceXML elements contain speech grammar
data; others refer to speech grammar data through a URI. The
speech recognizer must be able to accommodate dynamic update of
the spoken input for which it is listening through either method
of speech grammar data specification.

It should be able to record audio received from the
user. The implementation platform must be able to make the
recording available to a request variable.

A VoiceXML document (or a set of documents called an
application) forms a conversational finite state machine.
The user is always in one conversational state, or dialog,
at a time. Each dialog determines the next dialog to transition
to. Transitions are specified using URIs, which define the
next document and dialog to use. If a URI does not refer to a
document, the current document is assumed. If it does not refer
to a dialog, the first dialog in the document is assumed.
Execution is terminated when a dialog does not specify a
successor, or if it has an element that explicitly exits the
conversation.

There are two kinds of dialogs: forms and menus.
Forms define an interaction that collects values for a set of
field item variables. Each field may specify a grammar that
defines the allowable inputs for that field. If a form-level
grammar is present, it can be used to fill several fields from
one utterance. A menu presents the user with a choice of options
and then transitions to another dialog based on that choice.
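
A menu might be sketched as follows. The dialog name, the choice
wording, and the target URIs are hypothetical:

```xml
<menu id="main_menu">
  <!-- <enumerate/> speaks the text of each choice in turn. -->
  <prompt>Welcome. Say one of: <enumerate/></prompt>
  <!-- Each choice names the dialog or document to go to when matched. -->
  <choice next="#weather_info">weather</choice>
  <choice next="sports.vxml">sports</choice>
  <noinput>Please say one of: <enumerate/></noinput>
</menu>
```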

A subdialog is like a function call, in that it
provides a mechanism for invoking a new interaction, and
returning to the original form. Local data, grammars, and state
information are saved and are available upon returning to the
calling document. Subdialogs can be used, for example, to create
a confirmation sequence that may require a database query; to
create a set of components that may be shared among documents in
a single application; or to create a reusable library of dialogs
shared among many applications.

A session begins when the user starts to interact with
a VoiceXML interpreter context, continues as documents are loaded
and processed, and ends when requested by the user, a document,
or the interpreter context.

An application is a set of documents sharing the same
application root document. Whenever the user interacts
with a document in an application, its application root document
is also loaded. The application root document remains loaded
while the user is transitioning between other documents in the
same application, and it is unloaded when the user transitions to
a document that is not in the application. While it is loaded,
the application root document’s variables are available to
the other documents as application variables, and its
grammars can also be set to remain active for the duration of the
application.

Figure 2 shows the transition of documents (D) in an
application that share a common application root document
(root).

Each dialog has one or more speech and/or DTMF grammars
associated with it. In machine directed applications, each
dialog’s grammars are active only when the user is in that
dialog. In mixed initiative applications, where the user
and the machine alternate in determining what to do next, some of
the dialogs are flagged to make their grammars active
(i.e., listened for) even when the user is in another dialog in
the same document, or on another loaded document in the same
application. In this situation, if the user says something
matching another dialog’s active grammars, execution
transitions to that other dialog, with the user’s utterance
treated as if it were said in that dialog. Mixed initiative adds
flexibility and power to voice applications.

VoiceXML provides a form-filling mechanism for handling
"normal" user input. In addition, VoiceXML defines a mechanism
for handling events not covered by the form mechanism.

Events are thrown by the platform under a variety of
circumstances, such as when the user does not respond, doesn't
respond intelligibly, requests help, etc. The interpreter also
throws events if it finds a semantic error in a VoiceXML
document. Events are caught by catch elements or their syntactic
shorthand. Each element in which an event can occur may specify
catch elements. Catch elements are also inherited from enclosing
elements "as if by copy". In this way, common event handling
behavior can be specified at any level, and it applies to all
lower levels.
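
The following sketch illustrates handlers at two levels. The
document-level handlers apply to every dialog in the document, while
the handler inside the field applies only to that field. All names
and prompt texts are illustrative:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level handlers: inherited by every dialog below. -->
  <noinput>I did not hear anything. <reprompt/></noinput>
  <catch event="help">Please answer yes or no.</catch>
  <form id="confirm">
    <field name="answer" type="boolean">
      <prompt>Shall we continue?</prompt>
      <!-- Field-level handler: applies only to this field. -->
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
    </field>
  </form>
</vxml>
```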

A link supports mixed initiative. It specifies a
grammar that is active whenever the user is in the scope of the
link. If user input matches the link’s grammar, control
transfers to the link’s destination URI. A <link> can
be used to throw an event or to go to a destination URI.
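
Two links are sketched below: one that transfers to a destination
URI and one that throws an event. The grammar contents and the
target document are assumptions made for illustration:

```xml
<!-- Go to another document whenever the user says "operator". -->
<link next="operator_xfer.vxml">
  <grammar type="application/x-jsgf"> operator </grammar>
</link>

<!-- Throw the help event whenever the user says one of these phrases. -->
<link event="help">
  <grammar type="application/x-jsgf"> assistance | help me </grammar>
</link>
```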

A VoiceXML document is primarily composed of top-level
elements called dialogs. There are two types of dialogs:
forms and menus. A document may also have
<meta> elements, <var> and <script> elements,
<property> elements, <catch> elements, and
<link> elements.

Execution within one document. Document execution
begins at the first dialog by default. As each dialog executes,
it determines the next dialog. When a dialog doesn’t
specify a successor dialog, document execution stops.

Here is “Hello World!” expanded to illustrate some
of this. It now has a document level variable called
“hi” which holds the greeting. Its value is used as
the prompt in the first form. Once the first form plays the
greeting, it goes to the form named “say_goodbye”,
which prompts the user with “Goodbye!” Because the
second form does not transition to another dialog, it causes the
document to be exited.
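
A sketch of the expanded document, consistent with the description
above, might read as follows. It is an illustration rather than a
normative listing:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level variable holding the greeting. -->
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      <!-- Transition to the second form in this document. -->
      <goto next="#say_goodbye"/>
    </block>
  </form>
  <form id="say_goodbye">
    <!-- No successor is specified, so the document is exited. -->
    <block>Goodbye!</block>
  </form>
</vxml>
```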

version

The version of VoiceXML of this document (required). The
initial version number is 1.0.

base

The base URI.

lang

The language and locale type for this document.

application

The URI of this document’s application root document,
if any.

Executing a multi-document application.

Normally, each document runs as an isolated application. In
cases where you want multiple documents to work together as one
application, you select one document to be the application
root document, and refer to it in the other documents’
<vxml> elements.

When this is done, every time the interpreter is told to load
a document in this application, it also loads the application
root document if it is not already loaded. The application root
document remains loaded until the interpreter is told to load a
document that belongs to a different application. Thus one of the
following two conditions always holds during interpretation:

The application root document (or a stand-alone document) is
loaded and the user is executing in it.

The application root document and one other document in the
application are both loaded and the user is executing in the
non-root document.

There are two benefits to multi-document applications. First,
the application root document’s variables are available for
use by the other documents in the application, so that
information can be shared and retained. Second, the grammars of
the application root document may be set to remain active even
when the user is in other application documents, so that the user
can always interact with common forms, links, and menus.
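
A two-document application of this kind might be sketched as
follows. The file names match those used in the text; the prompt
wording and the grammar contents are assumptions. First, the
application root document, app-root.vxml:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Application variable, visible to other documents as application.bye. -->
  <var name="bye" expr="'Ciao'"/>
  <!-- Active throughout the application. -->
  <link next="operator_xfer.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
</vxml>
```

Second, main.vxml, which names the root document in its application
attribute:

```xml
<?xml version="1.0"?>
<vxml version="1.0" application="app-root.vxml">
  <form id="say_goodbye">
    <field name="answer" type="boolean">
      <prompt>Shall we say <value expr="application.bye"/>?</prompt>
      <filled>
        <if cond="answer">
          <exit/>
        </if>
        <!-- On "no", clear the field so that it is asked again. -->
        <clear namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```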

In this example, the application is designed so that main.vxml
must be loaded first. Its application attribute specifies that
app-root.vxml should be imported as the application root
document. So, app-root.vxml is then loaded, which creates the
application variable bye and also defines a link that navigates
to operator_xfer.vxml whenever the user says
“operator”. The user starts out in the say_goodbye
form:

C: Shall we say Ciao?

H: Si.

C: I did not understand what you said. (a
platform-specific default message.)

H: Ciao

C: I did not understand what you said.

H: Operator.

C: (Goes to operator_xfer.vxml, which
transfers the caller to a human operator.)

Note that when the user is in a multi-document application, at
most two documents are loaded at any one time: the application
root document, and unless the user is actually interacting with
the application root document, one other application
document.

If a document refers to a non-existent application root
document, or if an application root document itself has a
reference to another application root document, an error.semantic
event is thrown.

Subdialogs.

A subdialog is a mechanism for decomposing complex sequences
of dialogs to better structure them, or to create reusable
components. For example, the solicitation of account information
may involve gathering several pieces of information, such as
account number and home telephone number. A customer care
service might be structured with several independent applications
that could share this basic building block, so it would be
reasonable to construct it as a subdialog. This is illustrated in
the example below. The first document, app.vxml, seeks to adjust
a customer’s account, and in doing so must get the account
information and then the adjustment level. The account
information is obtained by using a subdialog element that invokes
another VoiceXML document to solicit the user input. While the
second document is being executed, the calling dialog is
suspended, awaiting the return of information. The second
document provides the results of its user interactions using a
<return> element, and the resulting values are accessed
through the variable defined by the name attribute on the
<subdialog> element.
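
The arrangement described above might be sketched as follows. The
second file name (acct_info.vxml), the dialog and variable names,
and the script path are illustrative assumptions. First, the calling
document, app.vxml:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="billing_adjustment">
    <var name="account_number"/>
    <var name="home_phone"/>
    <!-- Suspend this form and run the "basic" dialog of acct_info.vxml. -->
    <subdialog name="accountinfo" src="acct_info.vxml#basic">
      <filled>
        <!-- Returned values are properties of the variable "accountinfo". -->
        <assign name="account_number" expr="accountinfo.acctnum"/>
        <assign name="home_phone" expr="accountinfo.acctphone"/>
      </filled>
    </subdialog>
    <field name="adjustment_amount" type="currency">
      <prompt>What is the value of your account adjustment?</prompt>
      <filled>
        <submit next="/cgi-bin/updateaccount"/>
      </filled>
    </field>
  </form>
</vxml>
```

Second, the document invoked as a subdialog:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="basic">
    <field name="acctnum" type="digits">
      <prompt>What is your account number?</prompt>
    </field>
    <field name="acctphone" type="phone">
      <prompt>What is your home telephone number?</prompt>
      <filled>
        <!-- Hand both values back to the calling dialog. -->
        <return namelist="acctnum acctphone"/>
      </filled>
    </field>
  </form>
</vxml>
```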

Subdialogs add a new execution context when they are invoked.
The subdialog could be a new dialog within the existing document,
or a new dialog within a new document. The invocation of a
subdialog limits the scope of active grammars to the subdialog
only.

Figure 3 shows the execution flow when a sequence of documents
(D) transitions to a subdialog (SD) and then back.

Figure 3: Execution flow when invoking
a subdialog composed of two documents.

A form contains:

A set of form items, elements that are visited in the
main loop of the form interpretation algorithm. Form items are
subdivided into field items, those that define the
form’s field item variables, and control items,
those that help control the gathering of the form’s
fields.

Declarations of non-field item variables.

Event handlers.

“Filled” actions, blocks of procedural logic that
execute when certain combinations of field items are filled
in.

Form attributes are:

id

The name of the form.

scope

The default scope of the form’s grammars. If it is
dialog, then the form grammars are active only in the form. If the
scope is document, then the form grammars are active during any
dialog in the same document. If the scope is document and the
document is an application root document, then the form grammars
are active during any dialog in any document of this application.
A form grammar that has dialog scope is active only in its form.
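
As a sketch, the two forms below differ only in grammar scope. The
form names, field names, and grammar files are hypothetical:

```xml
<!-- Grammars stay active while the user is in any dialog of this document. -->
<form id="car_rental" scope="document">
  <grammar src="car_rental.gram" type="application/x-jsgf"/>
  <field name="car_class">
    <prompt>What class of car would you like?</prompt>
  </field>
</form>

<!-- Grammars are active only while the user is in this form. -->
<form id="hotel" scope="dialog">
  <grammar src="hotel.gram" type="application/x-jsgf"/>
  <field name="nights" type="number">
    <prompt>How many nights will you stay?</prompt>
  </field>
</form>
```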

This section describes some of the concepts behind forms, and
then gives some detailed examples of their operation.

Forms are interpreted by an implicit form interpretation
algorithm (FIA). The FIA has a main loop that repeatedly selects
a form item and then visits it. The selected form item is the
first in document order whose guard condition is not satisfied. For
instance, a field item’s default guard condition tests to
see if the field item variable has a value, so that if a simple
form contains only field items, the user will be prompted for
each field item in turn.

Interpreting a form item generally involves:

Selecting and playing one or more prompts;

Collecting a user input, either a response that fills in one
or more fields, or a throwing of some event (help, for instance);
and

Interpreting any <filled> actions that pertain to the
newly filled in fields.

The FIA ends when it interprets a transfer of control
statement (e.g. a <goto> to another dialog or document, a
<submit> of data to the document server). It also ends with
an implied <exit> when no form item remains eligible to
select.

A field item specifies a field item variable to gather
from the user. Field items have prompts to tell the user what to
say or key in, grammars that define the allowed inputs, and event
handlers that process any resulting events. A field item may also
have a <filled> element that defines an action to take just
after the field item variable is filled in. Field items are
subdivided into:

<field>

A field item whose value is obtained via ASR or DTMF
grammars.

<record>

A field item whose value is an audio clip recorded by the
user. A <record> element could collect a voice mail
message, for instance.

<transfer>

A field item which transfers the user to another telephone
number. If the transfer returns control, the field variable will
be set to the result status.

<object>

This field item invokes a platform-specific "object" with
various parameters. The result of the platform object is an
ECMAScript Object with one or more properties. One platform
object could be a built-in dialog that gathers credit card
information. Another could gather a text message using some
proprietary DTMF text entry method. There is no requirement for
implementations to provide platform-specific objects, although
support for the <object> element is required.

<subdialog>

A <subdialog> field item is roughly like a function
call. It invokes another dialog on the current page, or invokes
another VoiceXML document. It returns an ECMAScript Object as its
result.

<block>

A sequence of procedural statements used for prompting and
computation, but not for gathering input. A block has a (normally
implicit) form item variable that is set to true just before it
is interpreted.

<initial>

This element controls the initial interaction in a mixed
initiative form. Its prompts should be written to encourage the
user to say something matching a form level grammar. When at
least one field item variable is filled as a result of
recognition during an <initial> element, the form item
variable of <initial> becomes true, thus removing it as an
alternative for the FIA.

Each form item has an associated form item variable,
which by default is set to undefined when the form is entered.
This form item variable will contain the result of interpreting
the form item. A field item’s form item variable is also
called a field item variable, and it holds the value
collected from the user. A form item variable can be given a name
using the name attribute, or left nameless, in which case an
internal name is generated.

Each form item also has a guard condition, which
governs whether or not that form item can be selected by the form
interpretation algorithm. The default guard condition just tests
to see if the form item variable has a value. If it does, the
form item will not be visited.

Typically, field items are given names, but control items are
not. Generally form item variables are not given initial values
and additional guard conditions are not specified. But sometimes
there is a need for more detailed control. One form may have a
form item variable initially set to hide a field, and later
cleared (e.g., using <clear>) to force the field’s
collection. Another field may have a guard condition that
activates it only when it has not been collected, and when two
other fields have been filled. A block item could execute only
when some condition holds true. Thus, fine control can be
exercised over the order in which form items are selected and
executed by the FIA. In general, however, many dialogs can be
constructed without resorting to this level of complexity.

In summary, all form items have the following attributes:

name

The name of a dialog-scoped form item variable that will hold
the value of the form item.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be executed unless the form item variable is
cleared.

cond

An expression to evaluate in conjunction with the test of the
form item variable. If absent, this defaults to true, or in the
case of <initial>, a test to see if any field item variable
has been filled in.
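
The effect of these attributes can be sketched with a hypothetical
survey form. All names and prompts are illustrative:

```xml
<form id="pet_survey">
  <field name="has_pet" type="boolean">
    <prompt>Do you have a pet?</prompt>
  </field>
  <!-- cond: selectable only while has_pet is true and pet_count is unfilled. -->
  <field name="pet_count" type="number" cond="has_pet">
    <prompt>How many pets do you have?</prompt>
  </field>
  <!-- expr: initialized to a value, so it is skipped until it is cleared. -->
  <field name="newsletter" type="boolean" expr="false">
    <prompt>Would you like our pet care newsletter?</prompt>
  </field>
  <block>Thank you for your answers.</block>
</form>
```

In this sketch, a caller who answers "no" to the first question is
never asked the second, and the third question is asked only if some
other action clears the newsletter variable.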

The simplest and most common type of form is one in which the
form items are executed exactly once in sequential order to
implement a computer-directed interaction. Here is a weather
information service that uses such a form.

The form interpretation algorithm’s first iteration
selects the first block, since its (hidden) form item variable is
initially undefined. This block outputs the main prompt, and its
form item variable is set to true. On the FIA’s second
iteration, the first block is skipped because its form item
variable is now defined, and the state field is selected because
the dialog variable state is undefined. This field prompts the
user for the state, and then sets the variable state to the
answer. The third form iteration prompts and collects the city
field. The fourth iteration executes the final block and
transitions to a different URI.

Each field in this example has a prompt to play in order to
elicit a response, a grammar that specifies what to listen for,
and an event handler for the help event. The help event is thrown
whenever the user asks for assistance. The help event handler
catches these events and plays a more detailed prompt.
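
A form matching this description might be sketched as follows. The
grammar files, prompt wording, and script path are assumptions:

```xml
<form id="weather_info">
  <!-- First iteration: play the main prompt. -->
  <block>Welcome to the weather information service.</block>
  <!-- Second iteration: collect the state. -->
  <field name="state">
    <prompt>What state?</prompt>
    <grammar src="state.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please speak the state for which you want the weather.
    </catch>
  </field>
  <!-- Third iteration: collect the city. -->
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please speak the city for which you want the weather.
    </catch>
  </field>
  <!-- Fourth iteration: transition to a different URI. -->
  <block>
    <submit next="/servlet/weather" namelist="city state"/>
  </block>
</form>
```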

Here is a second directed form, one that prompts for credit
card information:

Fields are the major building blocks of forms. A field
declares a variable and specifies the prompts, grammars, DTMF
sequences, help messages, and other event handlers that are used
to obtain it. Each field declares a VoiceXML field item variable
in the form’s dialog scope. These may be submitted once the
form is filled, or copied into other variables.

Each field has its own speech and/or DTMF grammars, specified
explicitly using <grammar> and <dtmf> elements, or
implicitly using the type attribute. The type attribute is used
for standard built-in grammars, like digits, boolean, or number.
The type attribute also governs how that field’s value is
spoken by the speech synthesizer.

Each field can have one or more prompts. If there is one, it
is repeatedly used to prompt the user for the value until one is
provided. If there are many, they must be given count attributes.
These determine which prompt to use on each attempt. In the
example, the prompts become shorter. This is called tapered
prompting.
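
Tapered prompting might be sketched like this, using a hypothetical
card number field:

```xml
<field name="card_num" type="digits">
  <!-- Played on the first attempt. -->
  <prompt count="1">What is your sixteen digit card number?</prompt>
  <!-- Played on the second and later attempts. -->
  <prompt count="2">Card number?</prompt>
</field>
```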

The <catch event="help"> elements are event handlers
that define what to do when the user asks for help. Help messages
can also be tapered. A <catch event="help"> element can be
abbreviated as <help>, so that the following two elements are
equivalent:
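
The two forms can be sketched as follows, with an illustrative help
message:

```xml
<!-- Full form. -->
<catch event="help">
  Please say your sixteen digit card number.
</catch>

<!-- Shorthand form with the same effect. -->
<help>
  Please say your sixteen digit card number.
</help>
```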

The <filled> element defines what to do when the user
provides a recognized input for that field. One use is to specify
integrity constraints over and above the checking done by the
grammars, as with the date field above.
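
A constraint of this kind might be sketched as follows. The field is
hypothetical; the built-in digits grammar accepts any digit string,
so the <filled> action checks the length itself:

```xml
<field name="card_num" type="digits">
  <prompt>What is your sixteen digit card number?</prompt>
  <filled>
    <!-- Reject the value and ask again if it is not sixteen digits long. -->
    <if cond="card_num.length != 16">
      <prompt>Card numbers have sixteen digits.</prompt>
      <clear namelist="card_num"/>
    </if>
  </filled>
</field>
```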

The last section talked about forms implementing rigid,
computer-directed conversations. To make a form mixed
initiative, where both the computer and the human direct the
conversation, it must have one or more <initial> form items and
one or more form-level grammars.

If a form has form-level grammars:

Its fields can be filled in any order.

More than one field can be filled as a result of a single user
utterance.

Also, the form’s grammars can be active when the user is
in other dialogs. If a document has two forms on it, say a car
rental form and a hotel reservation form, and both forms have
grammars that are active for that document, a user could respond
to a request for hotel reservation information with information
about the car rental, and thus direct the computer to talk about
the car rental instead. The user can speak to any active grammar,
and have fields set and actions taken in response.

Example. Here is a second version of the weather
information service, showing mixed initiative. It has been
“enhanced” for illustrative purposes with advertising
and with a confirmation of the city and state:

<form id="weather_info">
  <grammar src="cityandstate.gram" type="application/x-jsgf"/>
  <!-- Caller can't barge in on today's advertisement. -->
  <block>
    <prompt bargein="false">
      Welcome to the weather information service.
      <audio src="http://www.online-ads.example/wis.wav"/>
    </prompt>
  </block>
  <initial name="start">
    <prompt>
      For what city and state would you like the weather?
    </prompt>
    <help>
      Please say the name of the city and state for which you
      would like a weather report.
    </help>
    <!-- If user is silent, reprompt once, then
         try directed prompts. -->
    <noinput count="1"><reprompt/></noinput>
    <noinput count="2">
      <reprompt/>
      <assign name="start" expr="true"/>
    </noinput>
  </initial>
  <field name="state">
    <prompt>What state?</prompt>
    <help>
      Please speak the state for which you want the weather.
    </help>
  </field>
  <field name="city">
    <prompt>
      Please say the city in <value expr="state"/> for which
      you want the weather.
    </prompt>
    <help>
      Please speak the city for which you want the weather.
    </help>
    <filled>
      <!-- Most of our customers are in LA. -->
      <if cond="city == 'Los Angeles' &amp;&amp; state == undefined">
        <assign name="state" expr="'California'"/>
      </if>
    </filled>
  </field>
  <field name="go_ahead" type="boolean" modal="true">
    <prompt>
      Do you want to hear the weather for
      <value expr="city"/>, <value expr="state"/>?
    </prompt>
    <filled>
      <if cond="go_ahead">
        <prompt bargein="false">
          <audio src="http://www.online-ads.example/wis2.wav"/>
        </prompt>
        <submit next="/servlet/weather" namelist="city state"/>
      </if>
      <!-- Caller said no: clear everything and start over. -->
      <clear namelist="start city state go_ahead"/>
    </filled>
  </field>
</form>

C: Mostly sunny today with highs in the 80s. Lows tonight from
the low 60s …

The go_ahead field has its modal attribute set to true. This
causes all grammars to be disabled except the ones defined in the
current form item, so that the only grammar active during this
field is the built-in grammar for boolean.

An experienced user can get things done much faster (but is
still forced to listen to the ads):

C: Mostly sunny today with highs in the 80s. Lows tonight from
the low 60s …

Controlling the order of field collection.

The form interpretation algorithm can be customized in several
ways. One way is to assign a value to a form item variable, so
that its form item will not be selected. Another is to use
<clear> to set a form item variable to undefined; this
forces the FIA to revisit the form item.
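
As a minimal sketch of these two techniques, the following form
pre-assigns a field's variable so that the FIA skips the field, and
uses <clear> to force a revisit. The form, the field names, and
the known_size variable are hypothetical and serve only as an
illustration:

```xml
<form id="order">
  <!-- Hypothetical value, assumed to be known before the form runs. -->
  <var name="known_size" expr="'large'"/>
  <block>
    <!-- Giving the form item variable "size" a value means the FIA
         will not select the size field. -->
    <assign name="size" expr="known_size"/>
  </block>
  <field name="size">
    <prompt>What size would you like?</prompt>
    <grammar type="application/x-jsgf"> small | medium | large </grammar>
  </field>
  <field name="ok" type="boolean">
    <prompt>You chose <value expr="size"/>. Is that right?</prompt>
    <filled>
      <if cond="!ok">
        <!-- Setting both variables back to undefined makes the FIA
             visit the size field and then the ok field again. -->
        <clear namelist="size ok"/>
      </if>
    </filled>
  </field>
</form>
```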

Another method is to explicitly specify the next field item to
visit using <goto nextitem>. This forces an immediate
transfer to that field item. If the <goto nextitem> occurs
in a <filled> action, the rest of the <filled> action
and any pending <filled> actions will be skipped.

Here is an example of a <goto nextitem> executed in response
to the exit event:

<form id="survey_2000_03_30">
  <catch event="exit">
    <goto nextitem="confirm_exit"/>
  </catch>
  <block>
    <prompt>
      Hello, you have been called at random to answer questions
      critical to U.S. foreign policy.
    </prompt>
  </block>
  <field name="q1" type="boolean">
    <prompt>Do you agree with the IMF position on
      privatizing certain functions of Burkina Faso’s
      agriculture ministry?</prompt>
  </field>
  <field name="q2" type="boolean">
    <prompt>If this privatization occurs, will its
      effects be beneficial mainly to Ouagadougou and
      Bobo-Dioulasso?</prompt>
  </field>
  <field name="q3" type="boolean">
    <prompt>Do you agree that sorghum and millet output
      might thereby increase by as much as four percent per
      annum?</prompt>
  </field>
  <block>
    <submit next="register" namelist="q1 q2 q3"/>
  </block>
  <field name="confirm_exit" type="boolean">
    <prompt>You have elected to exit. Are you
      sure you want to do this, and perhaps adversely affect
      U.S. foreign policy vis-à-vis sub-Saharan Africa for
      decades to come?</prompt>
    <filled>
      <if cond="confirm_exit">
        Okay, but the U.S. State Department is displeased.
        <exit/>
      <else/>
        Good, let’s pick up where we left off.
        <clear namelist="confirm_exit"/>
      </if>
    </filled>
  </field>
</form>

If the user says “exit” in response to any of the
survey questions, an exit event is thrown by the platform and
caught by the <catch> event handler. This handler directs
that confirm_exit be the next visited field. The confirm_exit
field would not be visited during normal completion of the survey
because the preceding <block> element transfers control to
the registration script.

Whenever a form is entered, it is initialized. Internal prompt
counter variables (in the form’s dialog scope) are reset to
1. Each variable (form-level <var> elements and form item
variables) is initialized, in document order, to undefined or to
the value of the relevant expr attribute.

The main loop of the FIA has three phases:

The select phase: the next form item is selected for
visiting.

The collect phase: the next unfilled form
item is visited, which prompts the user for input, enables
the appropriate grammars, and then waits for and collects an
input (such as a spoken phrase or DTMF key presses) or an
event (such as a request for help or a no input
timeout).

The process phase: an input is processed by filling
form items and executing <filled> elements to perform
actions such as input validation. An event is processed by
executing the appropriate event handler for that event type.

Note that the FIA may be given an input (a set of grammar
slot/slot value pairs) that was collected while the user was in a
different form’s FIA. In this case the first iteration of
the main loop skips the select and collect phases, and goes right
to the process phase with that input.

Select phase

The purpose of the select phase is to select the next form item
to visit. This is done as follows:

If the last main loop iteration’s process phase ended with a
<goto nextitem>, then the specified form item is
selected.

Otherwise the first form item whose guard condition
is false is chosen to be visited.

If no guard condition is false, then the last iteration
completed the form without encountering an explicit transfer of
control, so the FIA does an implicit <exit> operation.

Collect phase

The purpose of the collect phase is to collect an input or an
event. The selected form item is visited, which performs
actions that depend on the type of form item:

If a field item is visited, the FIA selects and queues up any
prompts based on the field item’s prompt counter and the
prompt conditions. Then it listens for the field level grammar(s)
and any active higher-level grammars, and waits for a grammar
recognition or for some event.

If an <initial> is visited, the FIA selects and queues
up prompts based on the <initial>’s prompt counter
and prompt conditions. Then it listens for the form level
grammar(s) and any active higher-level grammars. It waits for a
grammar recognition or for an event.

A <block> element is visited by setting its form item
variable to true, evaluating its content, and then bypassing the
process phase. No input is collected, and the next iteration of
the FIA’s main loop is entered.

Process phase

The purpose of the process phase is to process the input or
event collected during the collect phase, as follows:

If an event (such as a noinput or a hangup) occurred, then
the applicable catch element is identified and executed. This can
cause the FIA to terminate (e.g. if it transitions to a different
dialog or document or it does an <exit>), or it can cause
the FIA to go into the next iteration of the main loop (e.g. as
when the default help event handler is executed).

If an input matches a grammar from a <link> then that
link’s transition is executed, or its event is thrown. If
the <link> throws an event, the event is processed in the
context of the current form item.

If an input matches a grammar in a form other than the
current form, then the FIA terminates, the other form is
initialized, and that form’s FIA is started with this input
in its process phase.

If an input matches a grammar in this form, then:

The input’s grammar slot values are assigned to the
corresponding field item variables.

The <filled> actions triggered by these assignments are
identified as described in section 15.

Each identified <filled> action is executed in document
order. If a <goto> or <throw> is encountered, the
remaining <filled> elements are not executed, and the FIA
either terminates or continues in the next main loop
iteration.

After completion of the process phase, interpretation
continues by returning to the select phase.

A more detailed form interpretation algorithm can be found in
Appendix C.

A menu is a convenient syntactic shorthand for a form
containing a single anonymous field that prompts the user to make
a choice and transitions to different places based on that
choice. Like a regular form, it can have its grammar scoped such
that it is active when the user is executing another dialog. The
following menu offers the user three choices:
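
A sketch of such a menu might look like the following. The URIs
are placeholders assumed for illustration, and the third choice
uses the “Stargazer astrophysics news” phrase discussed under
Grammar Generation:

```xml
<menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <!-- Each choice names the document to go to when it is selected. -->
  <choice next="http://www.sports.example/vxml/start.vxml">
    Sports
  </choice>
  <choice next="http://www.weather.example/intro.vxml">
    Weather
  </choice>
  <choice next="http://www.stargazer.example/voice/astronews.vxml">
    Stargazer astrophysics news
  </choice>
  <noinput>Please say one of <enumerate/></noinput>
</menu>
```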

Menu element

This identifies the menu, and determines the scope of its
grammars. Menu attributes are:

id

The identifier of the menu. It allows the menu to be the
target of a <goto> or a <submit>.

scope

The menu’s grammar scope. If it is dialog – the
default – the menu’s grammars are only active when
the user transitions into the menu. If the scope is document, its
grammars are active over the whole document (or if the menu is in
the application root document, any loaded document in the
application).

dtmf

When set to true, any choices that do not have explicit DTMF
elements are given the implicit ones "1", "2", etc.

Choice element

The <choice> element serves several purposes:

It specifies a speech grammar fragment and/or a DTMF grammar
fragment that determines when that choice has been selected.

DTMF in menus. Menus can rely purely on speech, purely
on DTMF, or both in combination by including a <property>
element in the <menu>. Here is a DTMF-only menu with
explicit DTMF sequences given to each choice, using the
choice’s dtmf attribute:
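
A minimal sketch of such a menu follows. It assumes that the
property in question is inputmodes with the value dtmf, and the
URIs are placeholders:

```xml
<menu>
  <!-- Assumed property setting: accept DTMF input only. -->
  <property name="inputmodes" value="dtmf"/>
  <prompt>
    For sports press 1, for weather press 2,
    for Stargazer astrophysics press 3.
  </prompt>
  <!-- The dtmf attribute gives each choice an explicit key sequence. -->
  <choice dtmf="1" next="http://www.sports.example/vxml/start.vxml"/>
  <choice dtmf="2" next="http://www.weather.example/intro.vxml"/>
  <choice dtmf="3" next="http://www.stargazer.example/voice/astronews.vxml"/>
</menu>
```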

Enumerate element

The <enumerate> element is an automatically generated
description of the choices available to the user. It specifies a
template that is applied to each choice in the order they appear
in the menu. If it is used with no content, a default template
that lists all the choices is used, determined by the interpreter
context. If it has content, the content is the template
specifier. This specifier may refer to two special variables:
_prompt is the choice’s prompt, and _dtmf is the
choice’s assigned DTMF sequence. For example, if the menu
were rewritten as

The <enumerate> element may also be used analogously in
prompts for <field> elements that contain a set of
<option> elements as discussed in Section 14.1.3.
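
As an illustration of a template specifier, the sketch below uses
both special variables inside <enumerate>. The menu content and
URIs are assumptions made for the example:

```xml
<menu dtmf="true">
  <prompt>
    <!-- The template is applied once per choice, in menu order. -->
    <enumerate>
      For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
    </enumerate>
  </prompt>
  <choice next="http://www.sports.example/vxml/start.vxml">
    sports
  </choice>
  <choice next="http://www.weather.example/intro.vxml">
    weather
  </choice>
</menu>
```

With implicit DTMF assignment, a platform would be expected to
render a prompt along the lines of “For sports, press 1. For
weather, press 2.”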

Grammar Generation

Any choice phrase specifies a set of words and phrases
to listen for. The user may say any phrase consisting of any
subset of the words of the choice phrase in the same order in
which they occur in the choice phrase. A choice phrase is
constructed from the PCDATA of the elements contained directly or
indirectly in the <choice> element. For example, in
response to the prompt “Stargazer astrophysics news”
a user could say “Stargazer”,
“astrophysics”, “Stargazer news”,
“astrophysics news”, and so on. The equivalent JSGF
rule would be “[Stargazer] [astrophysics] [news]”
(where […] indicates optionality).

As an example of the use of PCDATA contained in descendants of
the <choice> element, consider the following example:

This choice would be read from the audio file, or as
“Stargazer Astrophysics News” if the file
could not be played. The grammar for the choice would be the
equivalent of “[Stargazer] [astrophysics] [news]”
gleaned from the PCDATA of the <choice> element’s
descendants.
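
A choice of this kind could be sketched as follows. The fragment
belongs inside a <menu>, and the URIs are placeholders:

```xml
<choice next="http://www.stargazer.example/voice/astronews.vxml">
  <!-- The text inside <audio> is spoken only if the file cannot be
       played; it also supplies the PCDATA for the choice grammar. -->
  <audio src="http://www.stargazer.example/audio/astronews.wav">
    Stargazer Astrophysics News
  </audio>
</choice>
```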

Interpretation model

A menu behaves like a form with a single field that does all
the work. The menu prompts become field prompts. The menu event
handlers become the field event handlers. The menu grammars
become form grammars.

Upon entry, the menu’s grammars are built and enabled,
and the prompt is played. When the user input matches a choice,
control transitions according to the value of the next, expr, or
event attribute of the <choice>, only one of which may be
specified.

A <link> element has one or more grammars, which are
scoped to the element containing the <link>. Grammar
elements contained in the <link> are not permitted to
specify scope. When one of these grammars is matched, the link
activates, and either transitions to a new document or dialog
(like <goto>), or throws an event (like <throw>).

The <link> element can be a child of <vxml>,
<form>, or of a form item. A link at the <vxml> level
has grammars that are active throughout the document. A link at
the <form> level has grammars active while the user is in
that form. If an application root document has a document-level
link, its grammars are active no matter what document of the
application is being executed.

If execution is in a modal form item, then link grammars at
the form or document level are not active.

You can also define a link that, when matched, throws an event
instead of going to a new document. This event is thrown at the
current location in the execution, not at the location where the
link is specified. For example, if the user matches this
link’s grammar, a help event is thrown in the form item the
user was visiting:
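
A link of this kind might be written as in the sketch below. The
grammar phrases are illustrative assumptions:

```xml
<link event="help">
  <!-- Any of these phrases throws a help event in the form item
       being visited, wherever the link itself is declared. -->
  <grammar type="application/x-jsgf">
    help | I need help | what can I say
  </grammar>
</link>
```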

VoiceXML variables are in all respects equivalent to
ECMAScript variables. The variable naming convention is as in
ECMAScript, but names beginning with the underscore character
(“_”) are reserved for internal use.

<field name="num_tickets" type="number">
<prompt>How many tickets do you wish to purchase?</prompt>
</field>

Variables declared without an explicit initial value are
initialized to the ECMAScript undefined value. Variables must be
declared before being used.
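
For illustration, the following <var> declarations show a
variable without an initial value and two with initial
expressions. The variable names and values are hypothetical:

```xml
<!-- No expr: the variable starts out as ECMAScript undefined. -->
<var name="home_phone"/>
<!-- expr is an ECMAScript expression; string literals need quotes. -->
<var name="pi" expr="3.14159"/>
<var name="city" expr="'Sacramento'"/>
```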

In a form, the variables declared by <var> and those
declared by form items are initialized when the form is entered.
The initializations are guaranteed to take place in document
order, so that this, for example, is legal:

When the user visits this <form>, the form’s
initialization first declares the variable one and sets its value
to 1. Then it declares the field item variable two and gives it
the value 2. Then the initialization logic declares the variable
three and gives it the value 3. The form interpretation algorithm
then enters its main interpretation loop and begins at the go_on
field.
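
A form consistent with this description could be sketched as
follows. The form id and the prompt wording are assumptions:

```xml
<form id="test">
  <var name="one" expr="1"/>
  <!-- The field already has a value, so the FIA will not visit it. -->
  <field name="two" expr="one + 1" type="number">
  </field>
  <var name="three" expr="two + 1"/>
  <!-- First form item without a value: the FIA starts here. -->
  <field name="go_on" type="boolean">
    <prompt>Say yes or no to continue.</prompt>
  </field>
</form>
```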

9.2. Variable Scopes

Variables can be declared in the following scopes:

session

These are read-only variables that pertain to an entire user
session. They are declared and set by the interpreter context.
New session variables cannot be declared by VoiceXML documents.
See Section 9.4.

application

These are declared with <var> elements that are
children of the application root document's <vxml> element.
They are initialized when the application root document is
loaded. They exist while the application root document is loaded,
and are visible to the root document and any other loaded
application leaf document.

document

These variables are declared with <var> elements that
are children of the document’s <vxml> element. They
are initialized when the document is loaded. They exist while the
document is loaded, and are visible only within that
document.

dialog

Each dialog (<form> or <menu>) has a dialog scope
that exists while the user is visiting that dialog, and which is
visible to the elements of that dialog. Dialog variables are
declared by <var> child elements of <form>, by
<var> elements inside executable content (e.g.
<block> content or catch element content), and by the
various form item elements. The child <var> elements of
<form> are initialized when the form is first visited. The
<var> elements inside executable content are initialized
when the executable content is executed. The form item variables
are initialized when the form item is collected.

(anonymous)

Each <block>, <filled>, and catch element defines
a new anonymous scope to contain variables declared in that
element.

The following diagram shows the scope hierarchy:

Figure 8: The scope hierarchy.

The curved arrows in this diagram show that each scope
contains a variable whose name is the same as the scope and that
refers to the scope itself. This allows you, for example, in the
anonymous, dialog, and document scopes, to refer to a variable
X in the document scope using document.X.

The expression language used in cond and expr is precisely
ECMAScript. Note that the cond operators “>”,
“<”, “>=”, “<=”, and
“&&” must be escaped in XML (to
“&gt;” and “&lt;” and so on). For
clarity, examples in this document do not use XML escapes.

Variable references match the closest enclosing scope
according to the scope chain given above. You can prefix a
reference with a scope name for clarity or to resolve ambiguity.
For instance to save the value of a form field item variable for
use later on in a document:

<assign name="document.ssn" expr="dialog.ssn"/>

If the application root document has a variable x, it is
referred to as application.x in non-root documents, and either
application.x or document.x in the application root document.
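
The following sketch places the assignment above in context. A
document-level variable and a field share the name ssn, and scope
prefixes resolve the ambiguity. The document structure and prompt
are assumptions:

```xml
<vxml version="1.0">
  <!-- Document scope: visible throughout this document. -->
  <var name="ssn"/>
  <form id="get_ssn">
    <!-- Dialog scope: within the form, a bare "ssn" refers to this
         field item variable rather than to document.ssn. -->
    <field name="ssn" type="digits">
      <prompt>Please say your social security number.</prompt>
      <filled>
        <assign name="document.ssn" expr="dialog.ssn"/>
      </filled>
    </field>
  </form>
</vxml>
```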

session.telephone.ani

Automatic Number Identification. This variable provides
the result from the Automatic Number Identification service that
provides the receiver of a telephone call with the number of the
calling phone. This information is provided only if the service
is supported, and is undefined otherwise.

session.telephone.dnis

Dialed Number Identification Service. This variable
provides the result from the Dialed Number Identification Service
that identifies for the receiver of a call the number that the
caller dialed. This information is provided only if the service
is supported, and is undefined otherwise.

session.telephone.iidigits

Information Indicator Digit. This variable provides
information about the originating line (e.g. payphone, cellular
service, special operator handling, prison) of the caller.
Telcordia publishes the complete list of II digits in Section 1
of each volume of the "Local Exchange Routing Guide". This
information is provided only if the service is supported, and is
undefined otherwise.

session.uui

User to User Information. This variable returns
supplementary information provided as part of an ISDN call set-up
from a calling party. This information is provided only if the
service is supported, and is undefined otherwise.
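
Because these variables may be undefined, a document can test them
before use. The sketch below does so for session.telephone.dnis.
The prompts are illustrative, and the platform is assumed to
define the session.telephone object even when the service is
unavailable:

```xml
<form id="greet">
  <block>
    <if cond="session.telephone.dnis == undefined">
      <prompt>Welcome.</prompt>
    <else/>
      <prompt>Welcome. You have reached
        <value expr="session.telephone.dnis"/>.</prompt>
    </if>
  </block>
</form>
```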

The <grammar> element is used to provide a speech grammar
that

specifies a set of utterances that a user may speak to perform
an action or supply information, and

provides a corresponding string value (in the case of a field
grammar) or set of attribute-value pairs (in the case of a form
grammar) to describe the information or action.

The <grammar> element is designed to accommodate any
grammar format that meets these two requirements. At this time,
VoiceXML does not specify a grammar format nor require support of
a particular grammar format. This is similar to the situation
with recorded audio formats for VoiceXML, and with media formats
in general for HTML.

The <grammar> element may be used to specify an
inline grammar or an external grammar. An inline
grammar is specified by the content of a <grammar>
element:

<grammar type="mime-type">
inline speech grammar
</grammar>

It may be necessary in this case to enclose the content in a
CDATA section. For inline grammars the type parameter
specifies a MIME type that governs the interpretation of the
content of the <grammar> tag.
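
As a sketch of why a CDATA section can be needed, the inline JSGF
grammar below uses angle brackets for its rule name, which would
otherwise be read as markup. The grammar and rule names are
hypothetical, and whether a platform accepts a complete JSGF
grammar inline is an assumption:

```xml
<grammar type="application/x-jsgf">
<![CDATA[
  #JSGF V1.0;
  grammar drinks;
  public <drink> = coffee | tea | orange juice;
]]>
</grammar>
```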

An external grammar is specified by an element of the form

<grammar src="URI" type="mime-type"/>

The MIME type is optional in this case because this
information may be obtained via the URI protocol (as in the case
of HTTP), and may be inferred from the filename extension. If the
type is not specified, and cannot be inferred, the default type
is platform specific. However, if the type is specified using the
type attribute, it overrides other information about the
type.

See Appendix D for notes on using the Java
Speech API Grammar Format (JSGF) with VoiceXML. (Note: Java is a trademark
of Sun Microsystems Inc.)

Attributes of <grammar> include:

src

The URI specifying the location of the grammar, if it is
external.

scope

Either document, which makes the grammar active in all
dialogs of the current document (and relevant application leaf
documents), or dialog, to make the grammar active throughout the
current form. If omitted, the grammar scoping is resolved by
looking at the parent element.

type

The MIME type of the grammar. If this is omitted, the
interpreter context will attempt to determine the type
dynamically.

The <dtmf> element is used to specify a DTMF grammar that

defines a set of key presses that a user may use to perform an
action or supply information, and

defines the corresponding string value that describes that
information or action.

The <dtmf> element is designed to accommodate any
grammar format that meets these two requirements. VoiceXML does
not specify nor require support for any particular grammar
format: as with <grammar>, it is expected that standards
efforts and market pressures will cause each widely used VoiceXML
interpreter context to support a common set of formats.

The <dtmf> element can refer to an external grammar:

<dtmf src="URI" type="mime-type"/>

or to an inline grammar:

<dtmf type="mime-type">
<!-- inline dtmf grammar -->
</dtmf>

The attributes of <dtmf> are precisely those of
<grammar>:

src

The URI specifying the location of the grammar, if it is
external.

scope

Either document, which makes the grammar active in all
dialogs of the current document (and relevant application leaf
documents), or dialog, to make the grammar active throughout the
current form. If omitted, the grammar scoping is resolved by
looking at the parent element.

type

The MIME type of the grammar. If this is omitted, the
interpreter context will attempt to determine the type
dynamically.

Field grammars are always scoped to their fields, that is,
they are not active unless the interpreter is visiting that
field. Grammars contained in fields cannot specify a scope.

Link grammars are given the scope of the element that contains
the link. Thus, if they are defined in the application root
document, links are also active in any other loaded application
document. Grammars contained in links cannot specify a scope.

Form grammars are by default given dialog scope, so that they
are active only when the user is in the form. If they are given
scope document, they are active whenever the user is in the
document. If they are given scope document and the document is
the application root document, then they are also active whenever
the user is in another loaded document in the same application. A
grammar in a form may be given document scope either by
specifying the scope attribute on the form element or by
specifying the scope attribute on the <grammar> element. If
both are specified, the grammar assumes the scope specified by
the <grammar> element.

<menu> grammars are also by default given dialog scope,
and are active only when the user is in the menu. But they can be
given the document scope and be active throughout the document,
and if their document is the application root document, also be
active in any other loaded document belonging to the application.
Grammars contained in menu choices cannot specify a scope.

Sometimes a form may need to have some grammars active
throughout the document, and other grammars that should be active
only when in the form. One reason for doing this is to minimize
grammar overlap problems. To do this, each individual
<grammar> and <dtmf> element can be given its own
scope if that scope should be different from the scope of the
<form> element itself:
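
A minimal sketch, assuming hypothetical grammar files and field
names:

```xml
<form id="calendar" scope="document">
  <!-- No scope attribute: inherits the form's document scope and is
       active throughout the document. -->
  <grammar src="calendar_commands.gram" type="application/x-jsgf"/>
  <!-- Overrides the form's scope: active only while in this form. -->
  <grammar src="date_entry.gram" scope="dialog"
           type="application/x-jsgf"/>
  <field name="day" type="date">
    <prompt>For which day?</prompt>
  </field>
</form>
```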

When the interpreter waits for input as a result of visiting a
field, the following grammars are active:

grammars for that field, including grammars contained in links
in that field;

grammars for its form, including grammars contained in links
in that form;

grammars contained in links in its document, and grammars for
menus and other forms in its document which are given document
scope;

grammars contained in links in its application root document,
and grammars for menus and forms in its application root document
which are given document scope.

In the case that an input matches more than one active
grammar, the list above defines the precedence order. If the
input matches more than one active grammar with the same
precedence, the precedence is determined using document order.
Menus behave with regard to grammar activation like their
equivalent forms (see Section 7).

If the form item is modal (i.e., its modal attribute is set to
true), all grammars except its own are turned off while waiting
for input. If the input matches a grammar in a form or menu other
than the current form or menu, control passes to the other form
or menu. If the match causes control to leave the current form,
all current form data is lost.

The platform throws events when the user does not respond,
doesn't respond intelligibly, requests help, etc. The interpreter
throws events if it finds a semantic error in a VoiceXML
document, or when it encounters a <throw> element. Events
are identified by character strings.

Each element in which an event can occur has a set of catch
elements, which include:

<catch>

<error>

<help>

<noinput>

<nomatch>

An element inherits the catch elements (“as if by
copy”) from each of its ancestor elements, as needed. If a
field, for example, does not contain a catch element for nomatch,
but its form does, the form’s nomatch catch element is
used. In this way, common event handling behavior can be
specified at any level, and it applies to all descendants.
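
The sketch below illustrates this inheritance. The form, field
names, and grammar files are hypothetical:

```xml
<form id="travel">
  <!-- Form-level handler: used by any field without its own. -->
  <nomatch>
    Sorry, I did not understand.
    <reprompt/>
  </nomatch>
  <field name="origin">
    <!-- No nomatch here, so the form's handler applies. -->
    <prompt>Where are you leaving from?</prompt>
    <grammar src="cities.gram" type="application/x-jsgf"/>
  </field>
  <field name="destination">
    <prompt>Where are you going?</prompt>
    <grammar src="cities.gram" type="application/x-jsgf"/>
    <!-- This field's own handler takes precedence over the form's. -->
    <nomatch>
      Please say the name of your destination city.
    </nomatch>
  </field>
</form>
```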

count

The occurrence of the event (default is 1). The count allows
you to handle different occurrences of the same event
differently. Each form item and <menu> maintains a counter
for each event that occurs while it is being visited; these
counters are reset each time the <menu> or form item's
<form> is re-entered.

cond

An optional condition to test to see if the event may be
caught by this element. Defaults to true.

An element inherits the catch elements (“as if by
copy”) from each of its ancestor elements, as needed. When
an event is thrown, the scope in which the event is handled and
its enclosing scopes are examined to find the best
qualified catch element, according to the following
algorithm:

Form an ordered list of catches consisting of all catches in
the current scope and all enclosing scopes (form item, form,
document, application root document, interpreter context),
ordered first by scope (starting with the current scope), and
then within each scope by document order.

Remove from this list all catches whose event name does not
match the event being thrown or whose cond evaluates to
false.

Find the “correct count”: the highest count among
the catch elements still on the list less than or equal to the
current count value.

Select the first element in the list with the “correct
count”.
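
To illustrate the count rule, the following sketch declares two
nomatch handlers on one field. The field, grammar, and messages
are assumptions:

```xml
<field name="color">
  <prompt>Please say a primary color.</prompt>
  <grammar type="application/x-jsgf"> red | yellow | blue </grammar>
  <!-- Selected on the first and second nomatch: 1 is the highest
       count that does not exceed the current count. -->
  <nomatch count="1">
    Sorry, I did not understand.
    <reprompt/>
  </nomatch>
  <!-- Selected from the third nomatch onward. -->
  <nomatch count="3">
    You can say red, yellow, or blue.
    <reprompt/>
  </nomatch>
</field>
```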

The name of a thrown event matches the catch element event
name if it is either an exact match or a prefix match. A prefix
match occurs when the catch element event attribute is a prefix
of the name of the event being thrown. For example, a catch
element whose event attribute is telephone.disconnect matches
both telephone.disconnect.hangup and
telephone.disconnect.transfer.
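
A catch element relying on a prefix match could be sketched as
follows. The #cleanup target is a hypothetical dialog:

```xml
<!-- Matches telephone.disconnect.hangup and
     telephone.disconnect.transfer by prefix. -->
<catch event="telephone.disconnect">
  <goto next="#cleanup"/>
</catch>
```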

The interpreter is expected to provide implicit default catch
handlers for the noinput, help, nomatch, cancel, exit, and error
events if the author did not specify them.

The system default behavior of catch handlers for various
events and errors is summarized by the definitions below that
specify (1) whether any audio response is to be provided, and (2)
how execution is affected. Note: where an audio response is
provided, the actual content is platform dependent.

Event Type             Audio Provided   Action

cancel                 no               don’t reprompt
error                  yes              exit interpreter
exit                   no               exit interpreter
help                   yes              reprompt
noinput                no               reprompt
nomatch                yes              reprompt
telephone.disconnect   no               exit interpreter
all others             yes              exit interpreter

Specific platforms and locales will differ in the default
prompts presented.

There are pre-defined events and application-defined events.
Events are also subdivided into plain events (things that happen
normally), and error events (abnormal occurrences). The error
naming convention allows for multiple levels of granularity.

The pre-defined events are:

cancel

The user has requested to cancel playing of the current
prompt.

telephone.disconnect.hangup

The user has hung up.

telephone.disconnect.transfer

The user has been transferred unconditionally to another line
and will not return.

exit

The user has asked to exit.

help

The user has asked for help.

noinput

The user has not responded within the timeout interval.

nomatch

The user input something, but it was not recognized.

The predefined errors are:

error.badfetch

A failed fetch. This may be the result, for example, of a
missing document, a malformed URI, a communications error during
the process of fetching the document, a timeout, a security
violation, or a malformed document.

error.semantic

A run-time error was found in the VoiceXML document, e.g. a
divide by 0, a substring bounds error, or a reference to an
undefined variable.

error.noauthorization

The user is not authorized to perform the operation requested
(such as dialing an invalid telephone number, or one that
the user is not allowed to call).

error.unsupported.format

The requested resource has a format that is not supported by
the platform, e.g. an unsupported grammar format, audio file
format, object type, or MIME type.

error.unsupported.element

The platform does not support the given element. For
instance, if a platform does not implement <record>, it
must throw error.unsupported.record. This allows an author to use
event handling to adapt to different platform capabilities.
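
As a sketch of such adaptation, the form below falls back to
another dialog when <record> is unavailable. The dialog names,
prompts, and attribute values are assumptions:

```xml
<form id="leave_message">
  <!-- Thrown by a platform that does not implement <record>. -->
  <catch event="error.unsupported.record">
    Sorry, this system cannot record messages.
    <goto next="#main_menu"/>
  </catch>
  <record name="msg" beep="true" maxtime="30s">
    <prompt>Please leave your message after the beep.</prompt>
  </record>
</form>
```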

Application-specific error types should use a format like that
of the following examples:

error.com.mot.mix.noauth

Access to personal profile information is not
authorized.

error.com.ibm.portal.restricted

The document tried to access a restricted resource.

Catches can catch specific events (cancel) or all those
sharing a prefix (error.unsupported).

Fetching of content from a URI occurs in a VoiceXML
interpreter context to: (1) fetch VoiceXML documents to
interpret, or (2) fetch other document types, such as audio
files, objects, grammars, and scripts. All occasions for fetching
content in a VoiceXML interpreter context are governed by the
following three attributes:

caching

Either safe to force a query to fetch the most recent copy of
the content, or fast to use the cached copy of the content if it
has not expired. If not specified, a value derived from the
innermost caching property is used.

fetchtimeout

The interval to wait for the content to be returned before
throwing an error.badfetch event. If not specified, a value
derived from the innermost fetchtimeout property is used.

fetchhint

Defines when the interpreter context should retrieve content
from the server. prefetch indicates a file may be downloaded when
the page is loaded, whereas safe indicates a file that should
only be downloaded when actually needed. In the case of a very
large file (implying long download times) or a streaming audio
source, stream indicates to the interpreter context to begin
processing the content as it arrives and should not wait for full
retrieval of the content. If not specified, a value derived from
the innermost relevant *fetchhint property is used.

When content is fetched from a URI, the caching attribute
determines where it is located (in the cache or not), the
fetchtimeout attribute determines how long to wait for the
content (starting from the time when the resource is needed), and
fetchhint determines when the content is fetched. The caching
policies for a VoiceXML interpreter context are explained in more
detail in the next section.

The fetchhint attribute helps interpreter contexts that can
improve their performance by exploiting information about when
content can be fetched. An interpreter context is not required to
change when it fetches content in response to any setting other
than safe. However, any interpreter context that is capable of
operating in a prefetch or stream setting must also be able to
operate under the safe setting.
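
The sketch below shows the three attributes on <audio>
elements. The URIs and the timeout value (including its “5s”
notation) are assumptions:

```xml
<prompt>
  <!-- Always check for a newer copy; fetch only when needed. -->
  <audio src="http://www.example.com/audio/headlines.wav"
         caching="safe" fetchhint="safe" fetchtimeout="5s"/>
  <!-- An unexpired cached copy is acceptable; may be fetched when
       the page is loaded. -->
  <audio src="http://www.example.com/audio/jingle.wav"
         caching="fast" fetchhint="prefetch"/>
  <!-- Long or streaming source: start playing as content arrives. -->
  <audio src="http://www.example.com/audio/live_feed"
         fetchhint="stream"/>
</prompt>
```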

When transitioning from one dialog to another, through either
a <subdialog>, <goto>, <submit>, <link>,
or <choice> element, there are additional rules that affect
interpreter behavior. If the referenced URI names a document
(e.g. "doc#dialog") or query data is provided (through POST or
GET), then a new document is obtained (either from the local
cache or from a server). When it is obtained, the document goes
through its initialization phase (i.e., obtaining and
initializing a new application root document if needed,
initializing document variables, and executing document scripts).
The requested dialog (or first dialog if none is specified) is
then initialized and execution of the dialog begins. If the
referenced URI names only a fragment (e.g. "#dialog") then no
document is obtained, and no initialization of the document is
performed. The requested dialog is processed as before.
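
These cases can be sketched with the following transitions, which
would appear in executable content. The document, dialog, and
variable names are hypothetical:

```xml
<!-- Fragment only: stays in the current document, with no document
     initialization. -->
<goto next="#confirm"/>

<!-- Names a document: it is obtained and initialized, then the
     dialog "confirm" is entered. -->
<goto next="orders.vxml#confirm"/>

<!-- Query data is provided, so a new document is obtained. -->
<submit next="/servlet/orders" namelist="item quantity"/>
```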

Elements that fetch VoiceXML documents also support the
following additional attribute:

fetchaudio

The URI of the audio clip to play while the fetch is being
done. If not specified, the fetchaudio property is used, and if
that property is not set, no audio is played during the
fetch.

The fetchaudio attribute is useful for enhancing a user
experience when there may be noticeable delays while the next
document is retrieved. This can be used to play background music,
or a series of announcements. When the document is retrieved, the
audio file is interrupted if it is still playing.
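
As an illustration, a <submit> might name a clip to play while
the server prepares its response. The audio URI and the timeout
value are assumptions:

```xml
<!-- The clip plays during the fetch and is cut off when the next
     document arrives. -->
<submit next="/servlet/weather" namelist="city state"
        fetchaudio="http://www.example.com/audio/please_wait.wav"
        fetchtimeout="15s"/>
```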

The VoiceXML interpreter context, just like HTML visual
browsers, can use caching to improve performance in fetching
documents and other resources; audio recordings (which can be
quite large) are as common to VoiceXML documents as images are to
HTML pages. In a visual browser it is common to include end user
controls to update or refresh content that is perceived to be
stale. This is not the case for the VoiceXML interpreter context,
since it lacks equivalent end user controls. Thus enforcement of
cache refresh is at the discretion of the application program
through appropriate use of the caching policies employed by
VoiceXML interpreter contexts.

The default caching policy for VoiceXML interpreter contexts
is one commonly employed in HTML browsers:

If the document referenced by a URI is unexpired in the cache,
then use the cached copy.

If the document referenced by a URI is expired or not present
in the cache, then fetch it from the server using get. Note: it
is an optimization to perform a “get if modified” on
an expired document still present in the cache.

In VoiceXML this caching policy is known as fast. But because
fast cache usage can lead to anomalous results, VoiceXML
interpreter contexts also implement a safe caching policy:

Even if the document referenced by a URI is in the cache and
is unexpired, still do a “get if modified” operation.
This will force a more recent version of the document to replace
the cached version, if a more recent version exists. If no more
recent version exists, the server does not go to the expense of
transferring the document.

If the document referenced by a URI is expired or not present
in the cache, then fetch it from the server using get. Note: it
is an optimization to perform a “get if modified” on
an expired document still present in the cache.

The safe caching policy ensures that the VoiceXML interpreter
context always has the most up to date version of a document, at
the expense of performance (due to the extra access to the
document server). The safe policy is similar to the effect of
always reloading or refreshing a web page in an HTML visual
browser.

VoiceXML allows the author to select which caching policy to
use. The caching attribute of certain elements may be set to safe
or fast to determine what default policy to use for that element.
If the attribute is not specified, the policy is determined by a
<property> element that specifies a value for the caching
property (see Section 17).

One common practice will be to use safe caching during
development, when documents and resources change continually, and
then use fast caching with selected resources fetched
“safely” as the application goes into system test and
then production.

It is also possible, though perhaps less likely, to have a
production application that uses safe caching by default and
fetches some resources using the fast caching policy.
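The first practice described above (fast caching by default, with selected resources fetched safely) might be written as in the following sketch. The audio file names are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-wide default: use unexpired cached copies. -->
  <property name="caching" value="fast"/>

  <form id="news">
    <block>
      <!-- Rarely changing jingle: the fast default applies. -->
      <audio src="audio/jingle.wav">Welcome.</audio>
      <!-- Frequently updated clip: always do a "get if modified". -->
      <audio src="audio/headlines.wav" caching="safe">
        Headlines are not available right now.
      </audio>
    </block>
  </form>
</vxml>
```

Swapping the two values gives the less common arrangement: safe by default, with individual resources marked fast.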

The prompt element controls the output of synthesized speech
and prerecorded audio. Conceptually, prompts are instantaneously
queued for playing, so interpretation proceeds until the user
needs to provide an input. At this point, the prompts are played,
and the system waits for user input. Once the input is received
from the speech recognition subsystem (or the DTMF recognizer),
interpretation proceeds.

Prompts have the following attributes:

bargein

Controls whether a user can interrupt a prompt. Default is
true.

cond

An expression telling if the prompt should be spoken. Default
is true.

count

A number that allows you to emit different prompts if the
user is doing something repeatedly. If omitted, it defaults to
“1”.

timeout

The timeout that will be used for the following user input.
The default noinput timeout is platform specific.

Sometimes text needs to be rendered using a particular style.
For example, a telephone number adhering to the North American
Dialing Plan needs a break after the first three digits, and
another break after the second three digits. To effect this, use
the class attribute:
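A minimal sketch, assuming the class attribute is carried by the <sayas> speech markup element and that the platform recognizes a class value of phone; the telephone number is made up.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="readback">
    <block>
      <prompt>
        <!-- class="phone" asks the platform to render the digits as a
             telephone number, with breaks after the third and sixth
             digits for a North American number. -->
        Your number is <sayas class="phone">2125551212</sayas>.
      </prompt>
    </block>
  </form>
</vxml>
```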

While the interpreter must tolerate the full set of speech
markup, if its implementation platform uses a text-to-speech
engine that doesn’t have this level of speech markup
functionality, the platform will have to map the VoiceXML markups
as best it can. Specifically, all platforms must allow all speech
markup elements, and if an element with contained text is not
supported, the contained text must still be spoken.

If the audio file cannot be played (e.g. unsupported format,
invalid URI, etc.), the content of the audio element is played
instead. The content may include text, speech markup, or another
audio element. If the audio file cannot be played (e.g.
unsupported format, invalid URI, etc.) and the content of the
audio element is empty, an appropriate error event will be
thrown.

Attributes of <audio> include:

src

The URI of the audio prompt. See Appendix E for suggested
audio file formats.

If an implementation platform supports barge-in, the service
author can specify whether a user can interrupt, or
“barge-in” on, a prompt. This speeds up
conversations, but is not always desired. If the user must hear
all of a warning, legal notice, or advertisement, barge-in should
be disabled. This is done with the bargein attribute:

<prompt bargein="false">
<audio src="legalese.wav"/>
</prompt>

Users can interrupt a prompt whose bargein attribute is true,
but must wait for completion of a prompt whose bargein attribute
is false. In the case where several prompts are queued, the
bargein attribute of each prompt is honored during the period of
time in which that prompt is playing. If bargein occurs during
any prompt in a sequence, none of the subsequent prompts is played.
If bargein is not specified, then the value of the bargein
property is used.

Tapered prompts are those that may change with each
attempt. Information-requesting prompts may become more terse
under the assumption that the user is becoming more familiar with
the task. Help messages may become more detailed, under the
assumption that the user needs more help. Or, prompts can change
just to make the interaction more interesting.

Each form item and each menu has an internal prompt counter
that is reset to one each time the form or menu is entered.
Whenever the system uses a prompt, its associated prompt counter
is incremented. This is the mechanism supporting tapered
prompts.

For instance, here is a form with a form level prompt and
field level prompts:
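A minimal sketch of such a form, assuming an inline grammar whose format is platform-dependent. The form id is illustrative; the field name flavor and the prompt wording follow the conversation shown in this section.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="tapered">
    <!-- Form level prompt, played once when the form is entered. -->
    <block>
      <prompt bargein="false">Welcome to the ice cream survey.</prompt>
    </block>
    <field name="flavor">
      <grammar type="application/x-jsgf"> vanilla | chocolate | strawberry </grammar>
      <!-- Used while the field's prompt counter is 1 or 2. -->
      <prompt count="1">What is your favorite flavor?</prompt>
      <!-- Used once the prompt counter reaches 3. -->
      <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    </field>
  </form>
</vxml>
```

A conversation using a form like this could go as follows: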

C: What is your favorite flavor? (the
“flavor” field’s prompt counter is
1)

H: Pecan praline.

C: I do not understand.

C: What is your favorite flavor? (the
prompt counter is now 2)

H: Pecan praline.

C: I do not understand.

C: Say chocolate, vanilla, or strawberry.
(prompt counter is 3)

H: What if I hate those?

C: I do not understand.

C: Say chocolate, vanilla, or strawberry.
(prompt counter is 4)

H: …

When it is time to select a prompt, the prompt counter is
examined. The child prompt with the highest count attribute less
than or equal to the prompt counter is used. If a prompt has no
count attribute, a count of “1” is assumed.

A conditional prompt is one that is spoken only if its
condition is satisfied. In this example, a prompt is varied on
each visit to the enclosing form.
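A minimal sketch of conditional prompts, assuming a form level variable r that is re-initialized on each visit to the form. The dialog names are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="another_joke">
    <!-- r gets a fresh random value every time the form is entered. -->
    <var name="r" expr="Math.random()"/>
    <field name="another" type="boolean">
      <!-- Exactly one of these two conditions holds on each visit. -->
      <prompt cond="r &lt; 0.5">Would you like to hear another elephant joke?</prompt>
      <prompt cond="r >= 0.5">For another joke say yes. To exit say no.</prompt>
      <filled>
        <if cond="another">
          <goto next="#pick_joke"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="pick_joke">
    <block>Here is another joke.</block>
  </form>
</vxml>
```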

The timeout attribute specifies the interval of silence
allowed while waiting for user input after the end of the last
prompt. If this interval is exceeded, the platform will throw a
noinput event. This attribute defaults to the value specified by
the timeout property (see Section 17).

The reason for allowing timeouts to be specified as prompt
attributes is to support tapered timeouts. For example, the user
may be given five seconds for the first input attempt, and ten
seconds on the next.

The prompt timeout attribute determines the noinput timeout
for the following input:
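A minimal sketch of tapered timeouts, using the five and ten second values mentioned above. The field name and the inline grammar are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="car_color">
    <field name="color">
      <grammar type="application/x-jsgf"> black | blue | red </grammar>
      <!-- First attempt: noinput is thrown after 5 seconds of silence. -->
      <prompt count="1" timeout="5s">Pick a color for your new car.</prompt>
      <!-- Later attempts: the user gets 10 seconds. -->
      <prompt count="2" timeout="10s">
        Please choose black, blue, or red. Take your time.
      </prompt>
    </field>
  </form>
</vxml>
```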

A form item is an element of a <form> that can be
visited during form interpretation. They include <field>,
<block>, <initial>, <subdialog>,
<object>, <record>, and <transfer>.

All form items have the following characteristics:

They have a result variable, specified by the name attribute.
This variable may be given an initial value with the expr
attribute.

They have a guard condition specified with the cond
attribute.

Form items are subdivided into field items, those that
define the form’s field item variables, and control
items, those that help control the gathering of the
form’s fields. Field items (<field>,
<subdialog>, <object>, <record>, and
<transfer>) generally may contain the following
elements:

<filled> elements containing some action to execute at
the moment the result field is filled in.

<property> elements to specify properties that are in
effect for this field item.

<prompt> elements to specify prompts to be played when
this field is visited.

<grammar> and <dtmf> elements to specify allowable
spoken and character input for this field item.

<catch> elements and catch shorthands that are in effect
for this field item.

Each field item may have an associated set of shadow
variables. Shadow variables are used to return results from
the execution of a field item, other than the value stored under
the name attribute. For example, it may be useful to know the
confidence level that was obtained as a result of a recognized
grammar in a <field> element. A shadow variable is
referenced as name$.shadowvar where name is
the value of the field item’s name attribute, and
shadowvar is the name of a specific shadow variable. For
example, the <field> element returns a shadow variable
confidence. The code fragment below illustrates how this shadow
variable is accessed.
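A minimal sketch of such a fragment, assuming a hypothetical grammar URI and an arbitrary confidence threshold of 0.4:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="get_state">
    <field name="state">
      <prompt>Please say the name of a state.</prompt>
      <grammar src="http://www.example.com/grammars/states.gram"/>
      <filled>
        <!-- state$.confidence is the shadow variable of field "state".
             Low-confidence results are treated as a nomatch. -->
        <if cond="state$.confidence &lt; 0.4">
          <clear namelist="state"/>
          <throw event="nomatch"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```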

A field specifies an input item to be gathered from the user.
Attributes of fields include:

name

The field item variable in the dialog scope that will hold
the result.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

type

The type of field, i.e., the name of an internal grammar.
This name must be from a standard set supported by all conformant
platforms. If not present, <grammar> and/or <dtmf>
elements can be specified instead.

slot

The name of the grammar slot used to populate the variable
(if it is absent, it defaults to the variable name). This
attribute is useful in the case where the grammar format being
used has a mechanism for returning sets of slot/value pairs and
the slot names differ from the field item variable names. If the
grammar returns only one slot, as do the built-in type grammars
like boolean, then no matter what the slot’s name, the
field item variable gets the value of that slot.

modal

If this is false (the default) all active grammars are turned
on while collecting this field. If this is true, then only the
field’s grammars are enabled: all others are temporarily
disabled.

The shadow variables of a <field> element whose name is
name are:

name$.confidence

The confidence level in the recognized result from 0.0-1.0. A
value of 0.0 indicates minimum confidence, and a value of 1.0
indicates maximum confidence. More specific interpretation of a
confidence value is platform-dependent.

name$.utterance

The raw string of words that were recognized. The exact
tokenization and spelling is platform-specific (e.g. “five
hundred thirty” or “5 hundred 30” or even
“530”).

The <field> type attribute is used to specify a built-in
grammar for one of the fundamental types, and also specifies how
its value is to be spoken if subsequently used in a value
attribute in a prompt. An example:
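A minimal sketch of a boolean field; the field name and wording are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="meal">
    <field name="lo_fat_meal" type="boolean">
      <prompt>Do you want a low fat meal on this flight?</prompt>
      <help>Low fat means less than 10 grams of fat.</help>
      <filled>
        <!-- The value is rendered as an affirmative or negative phrase. -->
        <prompt>I heard <value expr="lo_fat_meal"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```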

In this example, the boolean type indicates that inputs are
various forms of true and false. The value actually put into the
field is either true or false. The field would be read
“yes” or “no” in prompts.

In the next example, digits indicates that input will be
spoken or keyed digits. The result is stored as a string, and
rendered as digits, i.e., “one-two-three”, not
“one hundred twenty-three”. The <filled> action
tests the field to see if it has 12 digits. If not, the user
hears the error message, and a nomatch event is thrown to cause a
reprompt.
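A minimal sketch of the digits field just described. The field name is illustrative, and clearing the field before throwing nomatch is an assumption made here so that the field is eligible to be visited again.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="ticket">
    <field name="ticket_num" type="digits">
      <prompt>Read the 12 digit number from your ticket.</prompt>
      <help>The 12 digit number is to the lower left.</help>
      <filled>
        <!-- The result is a string, so its length can be tested. -->
        <if cond="ticket_num.length != 12">
          <prompt>Sorry, I didn't hear exactly 12 digits.</prompt>
          <clear namelist="ticket_num"/>
          <throw event="nomatch"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```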

It is important that there be input conventions for each
built-in type, so that, for instance, generic prompt and help
messages can be written that apply to all implementations of
VoiceXML. These are locale-dependent, and a certain amount of
variability is allowed. For example, the boolean type’s
grammar should minimally allow “yes” and
“no” responses, but each implementation is free to
add other choices, such as “yeah” and
“nope”. In cases where an application requires a
different behavior, it should use explicit field grammars.

In addition, each built-in type has a convention for the
format of the value returned. These are independent of locale and
of the implementation. The return type for built-in fields is
string except for the boolean field type. To access the actual
recognition result, the author can reference the shadow
variable name$.utterance.

All built-in types must support both voice and DTMF entry.

The built-in types are:

boolean

Inputs include affirmative and negative phrases appropriate
to the current locale. DTMF 1 is yes and 2 is no. The result is
ECMAScript true for “yes” or false for
“no”. The value will be submitted as the string
“true” or the string “false”. If the
field value is subsequently used in a prompt, it will be spoken
as an affirmative or negative phrase appropriate to the current
locale.

date

Valid spoken inputs include phrases that specify a date,
including a month, day, and year. DTMF inputs are: four digits for
the year, followed by two digits for the month, and two digits
for the day. The result is a fixed-length date string with format
yyyymmdd, e.g. “20000704”. If the year is not
specified, yyyy is returned as “????”; if the month
is not specified mm is returned as “??”; and if the
day is not specified dd is returned as “??”.

digits

Valid spoken or DTMF inputs include one or more digits, 0
through 9. The result is a string of digits. If the field value
is subsequently used in a prompt, it will be spoken as a sequence
of digits. A user can say for example “two one two
seven”, but not “twenty one hundred and
twenty-seven”.

currency

Valid spoken inputs include phrases that specify a currency
amount. For DTMF input, the “*” key will act as the
decimal point. The result is a string with the format UUUmm.nn,
where UUU is the three character currency indicator according to
ISO standard 4217:1995 or null if not spoken by the user. If the
field value is subsequently used in a prompt, it will be spoken
as a currency amount appropriate to the current locale.

number

Valid spoken inputs include phrases that specify numbers,
such as “one hundred twenty-three”, or “five
point three”. Valid DTMF input includes positive numbers
entered using digits and “*” to represent a decimal
point. The result is a string of digits from 0 to 9 and may
optionally include a decimal point (“.”) and/or a
plus or minus sign.

phone

Valid spoken inputs include phrases that specify a phone
number. DTMF asterisk “*” represents “x”.
The result is a string containing a telephone number consisting
of a string of digits and optionally containing the character
“x” to indicate a phone number with an extension. For
North America, a result could be
“8005551234x789”.

time

Valid spoken inputs include phrases that specify a time,
including hours and minutes. The result is a five character
string in the format hhmmx, where x is one of “a” for
AM, “p” for PM, “h” to indicate a time
specified using 24 hour clock, or “?” to indicate an
ambiguous time. Input can be via DTMF. Because there is no DTMF
convention for specifying AM/PM, in the case of DTMF input, the
result will always end with “h” or “?”.
If the field value is subsequently used in a prompt, the value
will be spoken as a time appropriate to the current locale.

When a simple set of alternatives is all that is needed to
specify the legal input values for a field, it may be more
convenient to use an option list than a grammar. An option list
is represented by a set of <option> elements contained in a
<field> element. Each <option> element contains
PCDATA that is used to generate a grammar for the spoken input it
accepts using the same method described for <choice>. It
also has attributes specifying the DTMF key for selecting the
option and the value to assign to the field when the option is
chosen.

The following field offers the user three choices and assigns
the value of the value attribute of the selected option to the
maincourse variable:
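A minimal sketch of such a field. The option wording and the submit URI are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <field name="maincourse">
      <prompt>Please select an entree: swordfish, roast beef, or chicken.</prompt>
      <!-- Saying "roast beef" or pressing 2 assigns "beef" to maincourse. -->
      <option dtmf="1" value="fish">swordfish</option>
      <option dtmf="2" value="beef">roast beef</option>
      <option dtmf="3" value="chicken">chicken</option>
      <filled>
        <submit next="/cgi-bin/maincourse.cgi" method="post" namelist="maincourse"/>
      </filled>
    </field>
  </form>
</vxml>
```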

The value attribute gives the string to assign to the field item
variable when a user selects this option, whether by speech or DTMF.
The default value for this attribute is the CDATA content of the
<option> element with leading and trailing white space removed.

Some built-in field types can be parameterized. This may be
done by explicitly referring to built-in grammars using a
special-purpose “builtin:” URI scheme and a URI-style
query syntax of the form type?param=value in
the src attribute of a <grammar> or <dtmf> element,
or in the type attribute of a field. In a src attribute the URI
takes the form builtin:grammar/X or builtin:dtmf/X,
where X is one of the built-in field types
(boolean, date, etc.). The digits and boolean grammars may be
parameterized as follows:

digits?minlength=n

A string of at least n digits.

digits?maxlength=n

A string of at most n digits.

digits?length=n

A string of exactly n digits.

boolean?y=d

A DTMF grammar that treats the keypress d as an
affirmative answer.

boolean?n=d

A DTMF grammar that treats the keypress d as a
negative answer.

Note that more than one parameter may be specified, separated
by “;” (for instance digits?minlength=3;maxlength=5). In <grammar> or
<dtmf> elements, the src attribute URI must start with
builtin:grammar/ or builtin:dtmf/. When a
<grammar> element is specified in a <field>, it
overrides the default speech grammar implied by the type
attribute of the field. Likewise, when a <dtmf> element is
specified in a <field>, it overrides the default DTMF
grammar.
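The parameterized forms described above can be sketched as follows. The field names and the choice of keys 7 and 9 are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="survey">
    <!-- Parameterized built-in type given in the type attribute. -->
    <field name="pin" type="digits?minlength=4;maxlength=6">
      <prompt>Please enter your PIN.</prompt>
    </field>

    <!-- Built-in grammars referenced explicitly through builtin: URIs. -->
    <field name="agree">
      <prompt>Do you agree? Say yes or no, or press 7 for yes or 9 for no.</prompt>
      <grammar src="builtin:grammar/boolean"/>
      <dtmf src="builtin:dtmf/boolean?y=7;n=9"/>
    </field>
  </form>
</vxml>
```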

The form item variable is automatically set to true just
before the block is entered. Therefore, blocks are typically
executed just once per form invocation.

Sometimes you may need more control over blocks. To do this,
you can name the form item variable, and set or clear it to
control execution of the <block>. This variable is declared
in the dialog scope of the form.
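A minimal sketch of a named block whose form item variable is cleared so that the block runs again; the names are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="intro">
    <!-- greeting is set to true just before the block is entered. -->
    <block name="greeting">
      <prompt>Welcome to the ice cream survey.</prompt>
    </block>
    <field name="again" type="boolean">
      <prompt>Would you like to hear the welcome again?</prompt>
      <filled>
        <if cond="again">
          <!-- Clearing both variables makes the block and this field
               eligible to be visited again. -->
          <clear namelist="greeting again"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```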

Attributes of <block> include:

name

The name of the form item variable used to track whether this
block is eligible to be executed; defaults to an inaccessible
internal variable.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

In a typical mixed initiative form, the <initial>
element is visited when the user is initially being prompted for
form-wide information, and has not yet entered into the directed
mode where each field is solicited individually. Like field
items, it has prompts, catches, and event counters. Unlike field
items, <initial> has no grammars, and no <filled>
action. For instance:
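A minimal sketch of a mixed initiative form with an <initial> element. The grammar URIs and names are hypothetical; the form level grammar is assumed to fill both city fields from one utterance.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="get_from_and_to_cities">
    <grammar src="http://www.example.com/grammars/from_to.gram"/>

    <initial name="bypass_init">
      <prompt>Where would you like to go from and to?</prompt>
      <nomatch count="1">
        Please say something like from Atlanta Georgia to Toledo Ohio.
        <reprompt/>
      </nomatch>
      <nomatch count="2">
        I will ask you for the information one piece at a time.
        <!-- Setting the form item variable makes the initial item
             ineligible, so the next form item is selected. -->
        <assign name="bypass_init" expr="true"/>
      </nomatch>
    </initial>

    <field name="from_city">
      <grammar src="http://www.example.com/grammars/city.gram"/>
      <prompt>From which city are you leaving?</prompt>
    </field>
    <field name="to_city">
      <grammar src="http://www.example.com/grammars/city.gram"/>
      <prompt>Which city are you going to?</prompt>
    </field>
  </form>
</vxml>
```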

While visiting an <initial> element, no field grammar is
active. If an event occurs while visiting an <initial>,
then one of its event handlers executes. As with other form
items, <initial> continues to be eligible to be visited
while its form item variable is undefined and while its cond
attribute is true. If one or more of the field item variables is
set by user input, then all <initial> form item variables
are set to true, before any <filled> actions are
executed.

An <initial> form item variable can be manipulated
explicitly to disable, or re-enable the <initial>'s
eligibility to the FIA. For example, in the program above, the
<initial>'s form item variable is set on the second nomatch
event. This causes the FIA to no longer consider the
<initial> and to choose the next form item, which is a
<field> to prompt explicitly for the origination city.
Similarly, an <initial>’s form item variable could be
cleared, so that <initial> gets selected again by the
FIA.

Note: explicit assignment of values to field item variables
does not affect the value of an <initial>’s form item
variable.

Attributes of <initial> include:

name

The name of a form item variable used to track whether the
<initial> is eligible to execute; defaults to an
inaccessible internal variable.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

A <subdialog> element invokes a “called”
dialog (known as the subdialog) identified by its src
attribute. The subdialog executes in a new execution context. The
subdialog proceeds until the execution of a <return>
element which causes the subdialog to return. When the subdialog
returns, its execution context is deleted, and execution resumes
in the calling dialog with any appropriate <filled>
elements. An execution context includes all declarations and
state information for the dialog, the dialog’s document,
and the application root (if present). Subdialogs permit the
reuse of a common dialog, such as one that prompts a user
for credit card information, and can be used to build libraries
of reusable applications.

The attributes are:

name

The result returned from the subdialog, an ECMAScript object
whose properties are the ones defined in the namelist attribute of
the <return> element.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

modal

Controls which grammars are active during the subdialog. If
true (the default) all grammars active in the calling dialog are
disabled. If false, they remain active.

namelist

Same as namelist in <submit>, except that the default
is to submit nothing. Only valid when fetching another
document.

The <subdialog> element may contain elements common to
all form items, and may also contain <param> elements. The
<param> elements of a <subdialog> specify the
parameters to pass to the subdialog. These parameters must be
declared in the subdialog using <var> elements; it is a
semantic error to attempt to set a form item variable or an
undeclared variable using <param>. When a subdialog
initializes, its variables are initialized in document order to
the corresponding <param> value, if they don't have an expr
attribute. Thus <param> elements can only initialize
<var> elements without expr attributes.

In the example below, the birthday of an individual is used to
validate their driver's license. The src attribute of the
subdialog refers to a form that is within the same document. The
<param> element is used to pass the birthday value to the
subdialog.
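A minimal sketch of this arrangement. The ECMAScript function validdrivelicense is a hypothetical stand-in for real validation, and the submit URI is made up.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <script><![CDATA[
    // Hypothetical check; a real service would do proper validation.
    function validdrivelicense(license, birthday) {
      return license.length == 8 && birthday.length == 8;
    }
  ]]></script>

  <!-- Calling dialog -->
  <form id="caller">
    <field name="birthday" type="date">
      <prompt>What is your birthday?</prompt>
    </field>
    <subdialog name="result" src="#getdriverslicense">
      <!-- Passes the caller's birthday into the subdialog. -->
      <param name="birthday" expr="birthday"/>
      <filled>
        <submit next="http://www.example.com/cgi-bin/process" namelist="birthday result"/>
      </filled>
    </subdialog>
  </form>

  <!-- Subdialog in the same document -->
  <form id="getdriverslicense">
    <!-- Declared without expr so that <param> can initialize it. -->
    <var name="birthday"/>
    <field name="drivelicense" type="digits">
      <prompt>Please say your driver's license number.</prompt>
      <filled>
        <var name="status" expr="validdrivelicense(drivelicense, birthday)"/>
        <return namelist="drivelicense status"/>
      </filled>
    </field>
  </form>
</vxml>
```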

The driver’s license value is returned to the calling
dialog, along with a status variable that indicates whether
the license is valid.

This example also illustrates the convenience of using
<param> to forward data to the subdialog and instantiate
values there without using server side scripting. An alternate
solution that uses scripting is shown below.
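A minimal sketch of what a server side script might generate in that case. The script location /cgi-bin/getlib, the embedded date, and the length check are all hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical output of a script at /cgi-bin/getlib, requested by
       a caller such as:
       <subdialog name="result" src="/cgi-bin/getlib#getdriverslicense"
                  namelist="birthday"> -->
  <form id="getdriverslicense">
    <!-- The script embedded the submitted birthday as a literal. -->
    <var name="birthday" expr="'19800210'"/>
    <field name="drivelicense" type="digits">
      <prompt>Please say your driver's license number.</prompt>
      <filled>
        <!-- Stand-in validity check for illustration only. -->
        <var name="status" expr="drivelicense.length == 8"/>
        <return namelist="drivelicense status"/>
      </filled>
    </field>
  </form>
</vxml>
```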

In the above example, a server side script had to generate the
document and embed the birthday value.

When a subdialog is interpreted, the only active grammars are
those in dialog-scope of the subdialog and the default grammars
defined by the interpreter context (e.g. help, cancel). The set
of active grammars remains limited for all subsequent dialogs
until a <return> is executed. For example, if subdialog A
transitions to dialog B, then the interpretation of B considers
only active grammars in its dialog scope and the default
grammars.

One last example is shown below that illustrates a subdialog
to capture general credit card information. First the subdialog
is defined in a separate document; it is intended to be reusable
across different applications. It returns a status, the credit
card number, and the expiry date; if a result cannot be obtained,
the status is returned with value "no_result".
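A minimal sketch of such a subdialog document. The variable names and the success value 'result' are assumptions; only the "no_result" value comes from the text above.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="getcredit">
    <var name="status" expr="'no_result'"/>

    <field name="creditcardnum" type="digits">
      <prompt>What is your credit card number?</prompt>
      <help>I am trying to collect your credit card number.</help>
      <nomatch count="3">
        <!-- Give up: status still holds 'no_result'. -->
        <return namelist="status"/>
      </nomatch>
    </field>

    <field name="expirydate" type="date">
      <prompt>What is the expiry date of this card?</prompt>
      <nomatch count="3">
        <return namelist="status"/>
      </nomatch>
    </field>

    <!-- Reached only after both fields have been filled. -->
    <block>
      <assign name="status" expr="'result'"/>
      <return namelist="status creditcardnum expirydate"/>
    </block>
  </form>
</vxml>
```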

An application that includes a calling dialog is shown below.
It obtains the name of a software product and operating system
using a mixed initiative dialog, and then solicits credit card
information using the subdialog.
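A minimal sketch of such a calling application, assuming the credit card subdialog is a form named getcredit in a hypothetical document creditcard.vxml that returns status, creditcardnum, and expirydate. The grammar and submit URIs are also hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="buysoftware">
    <var name="ccn"/>
    <var name="exp"/>
    <grammar src="grammars/buysoftware.gram"/>

    <initial name="start">
      <prompt>Please say the product and operating system you want.</prompt>
      <nomatch count="2">
        <assign name="start" expr="true"/>
      </nomatch>
    </initial>
    <field name="product">
      <grammar src="grammars/product.gram"/>
      <prompt>Which software product would you like to buy?</prompt>
    </field>
    <field name="operatingsystem">
      <grammar src="grammars/os.gram"/>
      <prompt>Which operating system do you use?</prompt>
    </field>

    <subdialog name="cc_results" src="creditcard.vxml#getcredit">
      <filled>
        <!-- The returned object's properties come from <return namelist>. -->
        <if cond="cc_results.status == 'no_result'">
          <prompt>Sorry, your credit card information could not be obtained.</prompt>
          <exit/>
        <else/>
          <assign name="ccn" expr="cc_results.creditcardnum"/>
          <assign name="exp" expr="cc_results.expirydate"/>
        </if>
      </filled>
    </subdialog>

    <block>
      <prompt>We will now process your order. Please hold.</prompt>
      <submit next="http://www.example.com/cgi-bin/buy"
              namelist="product operatingsystem ccn exp"/>
    </block>
  </form>
</vxml>
```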

A VoiceXML implementation platform may have platform-specific
functionality that an application wants to use, such as speaker
verification, native components, additional telephony
functionality, and so on. Such platform-specific objects are
accessed using the <object> element, which is analogous to
the HTML <OBJECT> element. For example, a native credit
card collection object could be accessed like this:

In this example, the <param> element (Section 18) is
used to pass parameters to the object when it is invoked. When
this <object> is executed, it returns an ECMAScript object
as the value of its form item variable. This <block>
presents the values returned from the credit card object:
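A minimal sketch combining the object invocation and the block that reads its result. The classid and data URIs, the parameter names, and the properties of the returned object (card, card_no, approval_code) are hypothetical, since they depend entirely on the platform-specific object.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <var name="amt" expr="19.95"/>
  <var name="vendor_num" expr="'12345'"/>

  <form id="gather_pay">
    <object name="debit"
            classid="method://credit_card/gather_and_debit"
            data="http://www.example.com/objects/credit.jar">
      <!-- Parameters passed to the object when it is invoked. -->
      <param name="amount" expr="document.amt"/>
      <param name="vendor" expr="vendor_num"/>
    </object>

    <block>
      <!-- debit holds the ECMAScript object returned by the object. -->
      <prompt>
        The card type is <value expr="debit.card"/>.
        The card number is <value expr="debit.card_no"/>.
        The approval code is <value expr="debit.approval_code"/>.
      </prompt>
    </block>
  </form>
</vxml>
```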

The user is first prompted for the pager message, then keys it
in. The <block> copies the message to the variable
document.message.

Attributes of <object> include:

name

When the object is evaluated, it sets this variable to an
ECMAScript value whose type is defined by the object.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

classid

The URI specifying the location of the object’s
implementation. The URI conventions are platform-dependent.

codebase

The base path used to resolve relative URIs specified by
classid, data, and archive. It defaults to the base URI of the
current document.

codetype

The content type of data expected when downloading the object
specified by classid. When absent it defaults to the value of the
type attribute.

data

The URI specifying the location of the object’s data.
If it is a relative URI, it is interpreted relative to the
codebase attribute.

type

The content type of the data specified by the data
attribute.

archive

A space-separated list of URIs for archives containing
resources relevant to the object, which may include the resources
specified by the classid and data attributes. URIs which are
relative are interpreted relative to the codebase
attribute.

If an <object> element refers to an unknown object, the
error.unsupported.object event is thrown. There is no requirement
for implementations to provide platform-specific objects,
although support for the <object> element is required.

The <record> element is a field item that collects a
recording from the user. The recording is stored in the field
item variable, which can be played back or submitted to a server,
as shown in this example:
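A minimal sketch of a recording form. The attribute values and the submit URI are illustrative, and the audio type is only an assumption about what a platform might support.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="record_greeting">
    <record name="greeting" beep="true" maxtime="10s"
            finalsilence="4000ms" dtmfterm="true" type="audio/wav">
      <prompt>At the tone, please say your greeting.</prompt>
      <noinput>I didn't hear anything, please try again.</noinput>
    </record>

    <field name="confirm" type="boolean">
      <!-- Plays the recording back to the user. -->
      <prompt>Your greeting is <value expr="greeting"/>.</prompt>
      <prompt>To keep it, say yes. To discard it, say no.</prompt>
      <filled>
        <if cond="confirm">
          <submit next="save_greeting.pl" method="post" namelist="greeting"/>
        </if>
        <!-- Rejected: clear the form so the user can record again. -->
        <clear/>
      </filled>
    </field>
  </form>
</vxml>
```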

The user is prompted for a greeting and then records it. The
greeting is played back, and if the user approves it, is sent on
to the server for storage using the HTTP POST method. Notice that
like other field items, <record> has prompts and catch
elements. It may also have <filled> actions. If the
platform supports simultaneous recognition and recording, form
and document scoped grammars can be active while the recording is
in progress.

The attributes of <record> are:

name

The field item variable that will hold the recording.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

modal

If this is true (the default) all higher level speech and
DTMF grammars are turned off while making the recording. If this
is false, speech and DTMF grammars scoped to the form, document,
application, and calling documents are listened for. Most
implementations will not support simultaneous recognition and
recording.

beep

If true, a tone is emitted just prior to recording. Defaults
to false.

maxtime

The maximum duration to record.

finalsilence

The interval of silence that indicates end of speech.

dtmfterm

If true, a DTMF keypress terminates recording. Defaults to
true. The DTMF tone is not part of the recording.

type

The MIME format of the resulting recording. Defaults to a
platform-specific format.

The <record> shadow variable name$ has the
following ECMAScript properties after the recording has been
made:

name$.duration

The duration of the recording in milliseconds.

name$.size

The size of the recording in bytes.

name$.termchar

If the dtmfterm attribute is true, and the user terminates
the recording by pressing a DTMF key, then this shadow variable
is the key pressed (e.g. “#”). Otherwise it is
null.

Occasionally, it is appropriate to suspend the session between
the user and the interpreter and initiate a session with another
entity. The most common use for this capability in current
practice is to connect a user in a telephone conversation with an
interpreter to a third party through the telephone network. The
<transfer> element directs the interpreter to make such a
third party connection. Two scenarios are supported:

bridging

the original caller resumes his session with the
interpreter.

blind transfer

no resumption is possible; as soon as the call connects, the
platform throws a telephone.disconnect.transfer event.

The form item variable is used to store the outcome of the
transfer attempt. Here are the possible values:

busy

The endpoint refused the call.

noanswer

There was no answer within the specified time.

network_busy

Some intermediate network refused the call.

near_end_disconnect

The call completed and was terminated by the caller.

far_end_disconnect

The call completed and was terminated by the callee.

network_disconnect

The call completed and was terminated by the network.

This example attempts to transfer the user to a customer
support operator and then wait for that conversation to
terminate.
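A minimal sketch of such a bridge transfer. The destination URI scheme is platform-dependent, and the phone number and report URI are made up.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="support">
    <var name="mydur" expr="0"/>
    <block>
      <prompt>Calling customer support. Please wait.</prompt>
    </block>

    <!-- bridge="true": interpretation suspends until the call ends. -->
    <transfer name="mycall" dest="phone://18005551234"
              connecttimeout="30s" bridge="true">
      <filled>
        <assign name="mydur" expr="mycall$.duration"/>
        <if cond="mycall == 'busy'">
          <prompt>The support line is busy. Please try again later.</prompt>
        <elseif cond="mycall == 'noanswer'"/>
          <prompt>The support line is not answering.</prompt>
        </if>
      </filled>
    </transfer>

    <block>
      <submit next="/cgi-bin/report" namelist="mycall mydur"/>
    </block>
  </form>
</vxml>
```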

During a bridge transfer, the platform can listen for DTMF
input from the caller. In particular, if a DTMF grammar appears
inside the <transfer> element, DTMF input matching that
grammar will terminate the transfer and return control to the
interpreter. A bridge transfer may be terminated by recognition
of an utterance matching an enclosed <grammar> element;
support of this feature is not required. The <transfer>
element is modal in that no grammar defined outside its scope is
active.

Attributes include:

name

The outcome of the transfer attempt.

expr

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

dest

The URI of the destination (phone, IP telephony
address).

destexpr

An ECMAScript expression yielding the URI of the
destination.

bridge

This attribute determines what to do once the call is
connected. If bridge is true, document interpretation suspends
until the transferred call terminates.

If it is false, as soon as the call connects, the platform throws
a telephone.disconnect.transfer event.

connecttimeout

The time to wait while trying to connect the call before
returning the noanswer condition. Default is platform
specific.

maxtime

The time that the call is allowed to last, or 0 if it can
last arbitrarily long. Only applies if bridge is true. Default is
0.

The <transfer> shadow variable (name$) has the
following ECMAScript properties after a transfer completes:

name$.duration

The duration of a successful call in seconds
(floating-point).

Events thrown inside a <transfer> include:

telephone.disconnect.hangup

If the caller hung up.

telephone.disconnect.transfer

If the caller has been transferred unconditionally to another
line and will not return.

The <filled> element specifies an action to perform when
some combination of fields is filled by user input. It may occur
in two places: as a child of the <form> element, or as a
child of a field item.

As a child of a <form> element, the <filled>
element can be used to perform actions that occur when a
combination of one or more fields is filled. For example, the
following <filled> element does a cross-check to ensure
that a starting city field differs from the ending city
field:
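A minimal sketch of such a cross-check, assuming two hypothetical fields named start_city and end_city and a placeholder grammar URI:

```xml
<form id="get_starting_and_ending_cities">
  <field name="start_city">
    <!-- The grammar URI is a placeholder. -->
    <grammar src="http://www.grammars.example/voicexml/city.gram"
             type="application/x-jsgf"/>
    <prompt>What is the starting city?</prompt>
  </field>
  <field name="end_city">
    <grammar src="http://www.grammars.example/voicexml/city.gram"
             type="application/x-jsgf"/>
    <prompt>What is the ending city?</prompt>
  </field>
  <!-- Runs once both fields are filled and at least one of them
       was filled by the last user input. -->
  <filled mode="all" namelist="start_city end_city">
    <if cond="start_city == end_city">
      <prompt>You cannot fly from and to the same city.</prompt>
      <!-- Clear both fields so that they are collected again. -->
      <clear namelist="start_city end_city"/>
    </if>
  </filled>
</form>
```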

If the <filled> element appears inside a field item, it
specifies an action to perform after that field is filled in by
user input. This is a notational convenience for a form-level
<filled> element that triggers on a single field item:
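As a sketch, the two forms below are intended to behave the same way; the field name, form ids and grammar URI are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Field-level <filled>: triggers when "city" is filled. -->
  <form id="field_level">
    <field name="city">
      <grammar src="http://www.grammars.example/voicexml/city.gram"
               type="application/x-jsgf"/>
      <prompt>Please say a city.</prompt>
      <filled>
        <prompt>You said <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>

  <!-- Equivalent form-level <filled> that names the single field. -->
  <form id="form_level">
    <field name="city">
      <grammar src="http://www.grammars.example/voicexml/city.gram"
               type="application/x-jsgf"/>
      <prompt>Please say a city.</prompt>
    </field>
    <filled namelist="city">
      <prompt>You said <value expr="city"/>.</prompt>
    </filled>
  </form>
</vxml>
```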

After each gathering of the user’s input, all the fields
mentioned in the input are set, and then the interpreter looks at
each <filled> element in document order (no preference is
given to ones in fields vs. ones in the form). Those whose
conditions are matched by the utterance are then executed in
order, until there are no more, or until one transfers control or
throws an event.

Attributes include:

mode

Either all (the default), or any. If any, this action is
executed when any of the specified fields is filled by the last
user input. If all, this action is executed when all of the
mentioned fields are filled, and at least one has been filled by
the last user input. A <filled> element in a field item
cannot specify a mode.

namelist

The fields to trigger on. For a <filled> in a form,
namelist defaults to the names (explicit and implicit) of the
form’s field items. A <filled> element in a field
item cannot specify a namelist; the namelist in this case is the
field item name.

The <meta> element specifies meta-data, as in HTML,
which is data about the document rather than the document’s
content. There are two types of <meta>. The first type
specifies a meta-data property of the document as a whole. For
example to specify the maintainer of a VoiceXML document:
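A minimal sketch, using a placeholder e-mail address:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- The address is a placeholder, not a real mailbox. -->
  <meta name="maintainer" content="jpdoe@anycompany.example"/>
  <form>
    <block>Hello.</block>
  </form>
</vxml>
```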

The interpreter could use this information, for example, to
compose and email an error report to the maintainer.

VoiceXML does not specify required meta-data properties, but
the following are recommended:

author

Information describing the author.

copyright

A copyright notice.

description

A description of the document for search engines.

keywords

Keywords describing the document.

maintainer

The document maintainer’s email address.

robots

Directives to search engine web robots.

The second type of <meta> specifies HTTP response
headers. In the following example, the first <meta> element
sets an expiration date that prevents caching of the document;
the second <meta> element sets the Date header.
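A sketch of the two header-style elements; the date shown is an arbitrary illustrative value.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- An Expires value of 0 marks the document as already expired,
       so it should not be served from a cache. -->
  <meta http-equiv="Expires" content="0"/>
  <!-- Arbitrary example date. -->
  <meta http-equiv="Date" content="Sat, 01 Jan 2000 12:00:00 GMT"/>
  <form>
    <block>Hello.</block>
  </form>
</vxml>
```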

The <property> element sets a property value. Properties
are used to set values that affect platform behavior, such as the
recognition process, timeouts, caching policy, etc.

Properties may be defined for the whole application, for the
whole document at the <vxml> level, for a particular dialog
at the <form> or <menu> level, or for a particular
form item. Properties apply to their parent element and all the
descendants of the parent. A property at a lower level overrides
a property at a higher level. Properties specified in the
application root document provide default values for properties
in every document in the application; properties specified in an
individual document override property values specified in the
application root document.

In some cases, <property> elements specify default
values for element attributes, such as timeout or bargein. For
example, to turn off bargein for all the prompts in a particular
form:
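A minimal sketch, with a hypothetical form id and prompt text:

```xml
<form id="no_bargein_form">
  <!-- Applies to every prompt in this form unless a prompt
       sets its own bargein attribute. -->
  <property name="bargein" value="false"/>
  <block>
    <prompt>This introductory prompt cannot be interrupted.</prompt>
    <prompt>Neither can this one.</prompt>
    <!-- An explicit attribute overrides the property. -->
    <prompt bargein="true">This prompt can be interrupted.</prompt>
  </block>
</form>
```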

Properties are also used to specify platform-specific data and
settings. For example, to set a platform-specific property to
prepend one second of silence before each recording made by a
particular document:
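A sketch only: the property name and the unit of its value below are invented for illustration, since platform-specific properties are defined by each platform rather than by this specification.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical platform-specific property; the name and the
       millisecond value are assumptions. -->
  <property name="com.example.recording.leadsilence" value="1000"/>
  <form id="leave_message">
    <record name="message" beep="true">
      <prompt>Please leave your message after the tone.</prompt>
    </record>
  </form>
</vxml>
```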

confidencelevel

The speech recognition confidence level, a float value in the
range of 0.0 to 1.0. Results are rejected (a nomatch event is
thrown) when the engine’s confidence in its interpretation
is below this threshold. A value of 0.0 means minimum confidence
is needed for a recognition, and a value of 1.0 requires maximum
confidence. The default value is 0.5.

sensitivity

Set the sensitivity level. A value of 1.0 means that it is
highly sensitive to quiet input. A value of 0.0 means it is least
sensitive to noise. The default value is 0.5.

speedvsaccuracy

A hint specifying the desired balance between speed vs.
accuracy. A value of 0.0 means fastest recognition. A value of
1.0 means best accuracy. The default value is 0.5.

completetimeout

The speech timeout value to use when an active grammar is
matched. The default is platform-dependent. See
Appendix F.

incompletetimeout

The speech timeout to use when no active grammar has been
matched. The default is platform-dependent. See
Appendix F.

Several generic properties pertain to DTMF grammar
recognition:

interdigittimeout

The inter-digit timeout value to use when recognizing DTMF
input. The default is platform-dependent. See Appendix F.

termtimeout

The terminating timeout to use when recognizing DTMF input.
The default value is "0s". See Appendix F.

termchar

The terminating DTMF character for DTMF input recognition.
The default value is "#". See Appendix F.
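As an illustration, the DTMF properties can be set on a single field. The field name and the timeout values below are arbitrary assumptions.

```xml
<form id="collect_pin">
  <field name="pin" type="digits">
    <!-- Wait up to 3 seconds between key presses. -->
    <property name="interdigittimeout" value="3s"/>
    <!-- After the last permitted digit, wait up to 2 seconds
         for the optional terminating character. -->
    <property name="termtimeout" value="2s"/>
    <property name="termchar" value="#"/>
    <prompt>Enter your PIN, followed by the pound key.</prompt>
  </field>
</form>
```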

These properties apply to the fundamental platform prompt and
collect cycle:

bargein

The bargein attribute to use for prompts. Setting this to
true allows barge-in by default. Setting it to false disallows
barge-in. The default value is "true".

timeout

The time after which a noinput event is thrown by the
platform. The default value is platform-dependent. See Appendix F.

These properties pertain to the fetching of new documents and
resources:

caching

Either safe to never trust the cache when fetching, or fast
to always trust the cache. The default value is fast.

audiofetchhint

This tells the platform whether or not it can attempt to
optimize dialog interpretation by pre-fetching audio. The value
is one of: safe, to say that audio is only fetched when it is
needed, never before; prefetch, to permit, but not require, the
platform to pre-fetch the audio; or stream, to allow it to stream
the audio fetches. The default value is prefetch.

documentfetchhint

Tells the platform whether or not documents may be
pre-fetched. The value is either safe (the default), or
prefetch.

grammarfetchhint

Tells the platform whether or not grammars may be
pre-fetched. The value is either prefetch (the default), or
safe.

objectfetchhint

Tells the platform whether the URI contents for
<object> may be pre-fetched or not. The values are prefetch
(the default), or safe.

scriptfetchhint

Tells whether scripts may be pre-fetched or not. The values
are prefetch (the default), or safe.

fetchaudio

The URI of the audio to play while waiting for a document to
be fetched. The default is not to play any audio. There are no
fetchaudio properties for audio, grammars, objects, and
scripts.

fetchtimeout

The timeout for fetches. The default value is
platform-dependent.

This property determines which input modality to use:

inputmodes

The input modes to enable: dtmf and voice. On platforms that
support both modes, inputmodes defaults to “dtmf
voice”. To disable speech recognition, set inputmodes to
“dtmf”. To disable DTMF, set it to
“voice”. One use for this would be to turn off speech
recognition in noisy environments. Another would be to conserve
speech recognition resources by turning them off where the input
is always expected to be DTMF.
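A minimal sketch of a form that accepts DTMF only; the form and field names are assumptions.

```xml
<form id="account_entry">
  <!-- Speech recognition is disabled for this form. -->
  <property name="inputmodes" value="dtmf"/>
  <field name="account" type="digits">
    <prompt>Please key in your account number.</prompt>
  </field>
</form>
```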

Our last example shows several of these properties used at
multiple levels.
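The sketch below sets properties at the document, form, and form item levels. The grammar URI, names, and values are illustrative assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document level: defaults for every dialog in this document. -->
  <property name="caching" value="safe"/>
  <property name="audiofetchhint" value="safe"/>
  <property name="confidencelevel" value="0.75"/>

  <form id="address_book">
    <!-- Form level: overrides the document defaults for this form. -->
    <property name="confidencelevel" value="0.5"/>
    <property name="bargein" value="false"/>
    <block>
      <prompt>Welcome to the voice address book.</prompt>
    </block>
    <field name="person">
      <!-- Form item level: applies to this field only. -->
      <property name="timeout" value="5s"/>
      <!-- Placeholder grammar URI. -->
      <grammar src="http://www.grammars.example/voicexml/names.gram"
               type="application/x-jsgf"/>
      <prompt>Who would you like to call?</prompt>
    </field>
  </form>
</vxml>
```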

The <param> element is used to specify values that are
passed to subdialogs or objects. It is modeled on the HTML
<PARAM> element. Its attributes are:

name

The name to be associated with this parameter when the object
or subdialog is invoked.

expr

An expression that computes the value associated with
name.

value

Associates a literal string value with name.

valuetype

One of data or ref, by default data; used to indicate to an
object if the value associated with name is data or a URI (ref).
This is not used for <subdialog>.

type

The MIME type of the result provided by a URI if the
valuetype is ref; only relevant for uses of <param> in
<object>.

Exactly one of expr or value must be present. The use of
valuetype and type is optional in general, although they may be
required by specific objects. When <param> is contained in
a <subdialog> element, the values specified by it are used
to initialize dialog <var> elements in the subdialog that
is invoked. When <param> is contained in an <object>,
the use of the parameter data is specific to the object that is
being invoked, and is outside the scope of the VoiceXML
specification.

Below is an example of <param> used as part of an
<object>. In this case, the first two <param>
elements have expressions (implicitly of valuetype="data"), the
third <param> has an explicit value, and the fourth is a
URI that returns a MIME type of text/plain. The meaning of this
data is specific to the object.
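A sketch of such an <object>; the classid, data URI, parameter names, and the variables referenced by the expressions are all hypothetical.

```xml
<form id="pay">
  <!-- Hypothetical variables supplying the first two parameters. -->
  <var name="amt" expr="25"/>
  <var name="vendor_num" expr="1234"/>
  <!-- classid and data are placeholders for a platform-specific object. -->
  <object name="debit"
          classid="method://credit-card/gather_and_debit"
          data="http://www.recordings.example/prompts/credit/jesse.jar">
    <!-- Expressions, implicitly of valuetype="data". -->
    <param name="amount" expr="amt"/>
    <param name="vendor" expr="vendor_num"/>
    <!-- An explicit literal value. -->
    <param name="application_id" value="ADC5678-QWOO"/>
    <!-- A URI reference whose result is of type text/plain. -->
    <param name="authentication_server"
           value="http://auth-svr.example"
           valuetype="ref" type="text/plain"/>
  </object>
</form>
```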

If it occurs in executable content, it declares a variable in
the anonymous scope associated with the enclosing <block>,
<filled>, or catch element. This declaration is made only
when the <var> element is executed. If the variable is
already declared in this scope, subsequent declarations act as
assignments, as in ECMAScript.

If a <var> is a child of a <form> element, it
declares a variable in the dialog scope of the <form>. This
declaration is made during the form’s initialization phase
as described in Section 6.6.1. The <var> element is not a
form item, and so is not visited by the Form Interpretation
Algorithm’s main loop.

If a <var> is a child of a <vxml> element, it
declares a variable in the document scope. This declaration is
made when the document is initialized; initializations happen in
document order.

Attributes of <var> include:

name

The name of the variable that will hold the result.

expr

The initial value of the variable (optional). If there is no
expr attribute, the variable retains its current value, if any.
Variables start out with the ECMAScript value undefined if they
are not given initial values.
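The three placements described above can be sketched in one document; all variable names and values are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document scope: declared when the document is initialized. -->
  <var name="greeting" expr="'hello'"/>
  <form id="demo">
    <!-- Dialog scope: declared during form initialization. -->
    <var name="attempts" expr="0"/>
    <block>
      <!-- Anonymous scope of this block: declared when executed. -->
      <var name="message" expr="greeting + ', attempt ' + (attempts + 1)"/>
      <prompt><value expr="message"/></prompt>
    </block>
  </form>
</vxml>
```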

Prompts can appear in executable content, in their full
generality, except that the <prompt> count attribute is
meaningless. In particular, the cond attribute can be used in
executable content. Prompts may be wrapped with <prompt>
and </prompt>, or represented using PCDATA. Wherever
<prompt> is allowed, the PCDATA xyz is interpreted
exactly as if it had appeared as
<prompt>xyz</prompt>.
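A short sketch of both notations inside a <block>; the variable and the wording are illustrative.

```xml
<form id="welcome">
  <var name="first_visit" expr="true"/>
  <block>
    <!-- Bare PCDATA: handled as if it were wrapped in a prompt element. -->
    Welcome to the weather service.
    <!-- cond is honored in executable content. -->
    <prompt cond="first_visit">
      Say help at any time for instructions.
    </prompt>
  </block>
</form>
```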

The FIA assumes that when a catch element is executed, it has
queued appropriate prompts. Therefore the FIA normally suppresses
playing of prompts on the iteration of the FIA following the
execution of a catch element. However, if a <reprompt> is
executed in the catch, this tells the FIA that when it selects
the next form item to visit, it should do the normal prompt
processing (which includes selection of a prompt and incrementing
the prompt counter).

For example, this noinput catch expects the next form item
prompt to be selected and played:
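A minimal sketch of such a handler, with an assumed field name and prompt text:

```xml
<form id="dessert">
  <field name="want_ice_cream" type="boolean">
    <prompt>Do you want ice cream for dessert?</prompt>
    <prompt count="2">
      If you want ice cream, say yes. If you do not, say no.
    </prompt>
    <noinput>
      I could not hear you.
      <!-- Ask the FIA to select and queue the field's next prompt. -->
      <reprompt/>
    </noinput>
  </field>
</form>
```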

Note that if no <reprompt> is executed in a catch, then
the FIA skips the prompt selection and queuing phase of the
selected form item. The form item’s prompt counter is
therefore not incremented.

If a <reprompt> is executed, then the FIA executes the
form item’s prompt selection and queuing phase. This does
increment the form item’s prompt counter. A
<reprompt> does not cause the prior prompt to be played, in
general, but will cause prompt(s) to be played based on the
current value of the prompt counter and the current values of the
prompt conditions.

The URI may be absolute or relative to the current document.
You may specify the starting dialog in the next document using a
fragment that corresponds to the value of the id attribute of a
dialog. If no fragment is specified, the first dialog in that
document is chosen.

Note that transitioning to another dialog in the current
document causes the old dialog’s variables to be lost, even
in the case where a dialog is transitioning to itself.
Transitioning to another document will likewise drop the old
document level variables, even if the new document is the same
one that is making the transition. If you want data to persist
across multiple documents, store data in the application
scope.

Attributes of <goto> are:

next

The URI to which to transition.

expr

An ECMAScript expression that yields the URI.

nextitem

The name of the next form item to visit in the current
form.

expritem

An ECMAScript expression that yields the name of the next
form item to visit.
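The sketch below uses <goto> to reach a form item, another dialog in the same document, and a dialog in another document. All ids, names, and the external URI are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="start">
    <var name="use_confirm" expr="true"/>
    <block name="route">
      <if cond="use_confirm">
        <!-- Visit a named form item in the current form. -->
        <goto nextitem="confirm"/>
      <else/>
        <!-- Transition to another dialog in this document. -->
        <goto next="#goodbye"/>
      </if>
    </block>
    <field name="confirm" type="boolean">
      <prompt>Shall we continue?</prompt>
      <filled>
        <!-- Transition to a dialog in another document (placeholder URI). -->
        <goto next="http://www.example.com/next.vxml#main"/>
      </filled>
    </field>
  </form>
  <form id="goodbye">
    <block>Goodbye.</block>
  </form>
</vxml>
```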

The <submit> element is similar to <goto> in that
it results in a new document being obtained. Unlike <goto>,
it lets you submit a list of variables to the document server via
an HTTP GET or POST request. For example, to submit a set of form
items to the server you might have:
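A minimal sketch, assuming two hypothetical fields and a placeholder server URI:

```xml
<form id="payment">
  <field name="account" type="digits">
    <prompt>What is your account number?</prompt>
  </field>
  <field name="amount" type="currency">
    <prompt>How much would you like to pay?</prompt>
  </field>
  <block>
    <!-- Send both field values to the server with an HTTP POST. -->
    <submit next="http://www.example.com/cgi-bin/pay"
            method="post" namelist="account amount"/>
  </block>
</form>
```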

Like next, except that the URI is dynamically determined by
evaluating the given ECMAScript expression. One of next or expr
is required.

namelist

The list of variables to submit. By default, all the named
field item variables are submitted. If a namelist is supplied, it
may contain individual variable references which are submitted
with the same qualification used in the namelist.

method

The request method: get (the default) or post.

enctype

The MIME encoding type of the submitted document. The default
is application/x-www-form-urlencoded. Interpreters may support
additional encoding types.

Returns control to the interpreter context which determines
what to do next.

<exit/>

This element differs from <return> in that it terminates
all loaded documents, while <return> returns from a
<subdialog> invocation. If the <subdialog> caused a
new document (or application) to be invoked, then <return>
will cause that document to be terminated, but execution will
resume after the <subdialog>.

Note that once <exit> returns control to the interpreter
context, the interpreter context is free to do as it wishes. It
may play a top level menu for the user, drop the call, or
transfer the user to an operator, for example.

Attributes include:

expr

A return expression (e.g. “0”, or
“oops!”).

namelist

Variable names to be returned to interpreter context. The
default is to return no variables; this means the interpreter
context will receive an empty ECMAScript object.
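As a sketch, a closing dialog might hand a status variable back to the interpreter context; the variable name and value are assumptions.

```xml
<form id="finish">
  <var name="status" expr="'completed'"/>
  <block>
    Thank you for calling. Goodbye.
    <!-- Terminate all loaded documents and return "status". -->
    <exit namelist="status"/>
  </block>
</form>
```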

Return ends execution of a subdialog and returns control and
data to a calling dialog. The attributes are:

event

Return, then throw this event.

namelist

Variable names to be returned to calling dialog. The default
is to return no variables; this means the caller will receive an
empty ECMAScript object.

In returning from a subdialog, an event can be thrown at the
invocation point, or data is returned as an ECMAScript object. A
return element that is encountered when not executing as a
subdialog throws a semantic error. The example below shows an
event propagated from a subdialog to its calling dialog when the
subdialog fails to obtain a recognizable result. It also shows
data returned under normal conditions.

Form with calling dialog

<form>
  <subdialog name="result" src="#getssn">
    <nomatch>
      <!-- a no match event that is returned by the
           subdialog indicates that a valid social security
           number could not be matched. -->
      <goto next="http://myservice.example/ssn-problems.vxml"/>
    </nomatch>
    <filled>
      <submit namelist="result.ssn"
              next="http://myservice.example/cgi-bin/process"/>
    </filled>
  </subdialog>
</form>

The subdialog event handler for <nomatch> is triggered
on the third failure to match; when triggered, it returns from
the subdialog, and includes the nomatch event to be thrown in the
context of the calling dialog. In this case, the calling dialog
will execute its <nomatch> handler, rather than the
<filled> element, where the resulting action is to execute
a <goto> element. Under normal conditions, the
<filled> element of the subdialog is executed after a
recognized social security number is obtained, and then this
value is returned to the calling dialog, and is accessible as
result.ssn.
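A sketch of a subdialog form consistent with this description; the grammar URI and prompt text are assumptions.

```xml
<form id="getssn">
  <field name="ssn">
    <!-- Placeholder grammar URI. -->
    <grammar src="http://www.grammars.example/voicexml/ssn.gram"
             type="application/x-jsgf"/>
    <prompt>Please say your social security number.</prompt>
    <!-- On the third failure to match, return to the caller
         and throw nomatch there. -->
    <nomatch count="3">
      <return event="nomatch"/>
    </nomatch>
    <!-- On success, return the value; the caller reads it as result.ssn. -->
    <filled>
      <return namelist="ssn"/>
    </filled>
  </field>
</form>
```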

The <script> element allows the specification of a block
of client-side scripting language code, and is analogous to the
HTML <SCRIPT> element. For example, this document has a
script that computes a factorial.
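A minimal sketch of such a document; the function, form, and field names are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- The CDATA section keeps the "<=" operator from being
       read as markup. -->
  <script>
  <![CDATA[
    function factorial(n) {
      return (n <= 1) ? 1 : n * factorial(n - 1);
    }
  ]]>
  </script>
  <form id="form">
    <field name="fact" type="number">
      <prompt>Tell me a number and I'll tell you its factorial.</prompt>
      <filled>
        <prompt>
          <value expr="fact"/> factorial is
          <value expr="factorial(fact)"/>
        </prompt>
      </filled>
    </field>
  </form>
</vxml>
```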

A <script> element may occur in the <vxml>
element, or in executable content (in <filled>, <if>,
<block>, <catch>, or the short forms of
<catch>). Scripts in the <vxml> element are evaluated
just after the document is loaded, along with the <var>
elements, in document order. A <script> element in
executable content is executed, like other executable elements,
as it is encountered.

The ECMAScript scope chain (see section 10.1.4 in
http://www.ecma.ch/stand/ECMA-262.htm) is set up so that
variables declared with <var> are put into the scope
associated with the element in which the <var> element
occurs. All variables must be declared before being assigned or
referenced by ECMAScript scripts, or by VoiceXML elements.

Time designations follow those used in W3C's Cascading Style
Sheet recommendation
(http://www.w3.org/TR/REC-CSS2/syndata.html#q20). They consist of an
unsigned integer followed by an optional time unit identifier.
The time unit identifiers are:

ms

milliseconds

s

seconds

Procedural logic that occurs in <block>,
<filled>, and event handlers.

field item

A form item whose purpose is to input a field item
variable. Field items include <field>, <record>,
<object>, <subdialog>, and <transfer>.

form

A dialog that interacts with the user in a
highly flexible fashion with the computer and the user
sharing the initiative.

form item

An element of <form> that can be visited during form
execution: <initial>, <block>, <field>,
<record>, <object>, <subdialog>, and <transfer>.

form item variable

A variable, either implicitly or explicitly defined,
associated with each form item in a form. If the
form item variable is undefined, the form interpretation
algorithm will visit the form item and use it to interact with
the user.

implementation platform

A computer with the requisite software and/or hardware to
support the types of interaction defined by VoiceXML.

link

A set of grammars that when matched by something the
user says or keys in, either transitions to a new dialog or
document or throws an event in the current form item.

menu

A dialog that presents the user with a set of
choices and takes action on the selected one.

mixed initiative

A computer-human interaction in which either the computer or
the human can take initiative and decide what to do next.

The form interpretation algorithm (FIA) drives the interaction
between the user and a VoiceXML form or menu. A menu can be
viewed as a form containing a single field whose grammar and
whose <filled> action are constructed from the
<choice> elements.

The FIA must handle:

Form initialization.

Prompting, including the management of the prompt counters
needed for prompt tapering.

Grammar activation and deactivation at the form and form item
levels.

Entering the form with an utterance that matched one of the
form’s document-scoped grammars while the user was visiting
a different form or menu.

Leaving the form because the user matched another form, menu,
or link’s document-scoped grammar.

Processing multiple field fills from one utterance, including
the execution of the relevant <filled> actions.

Selecting the next form item to visit, and then processing
that form item.

Choosing the correct catch element to handle any events thrown
while processing a form item.

First we define some terms and data structures used in the
form interpretation algorithm:

active grammar set

The set of grammars active during a VoiceXML interpreter
context’s input collection operation.

utterance

A summary of what the user said or keyed in, including the
specific grammar matched, and a dictionary of slot name/slot
value pairs. An example utterance might be: “grammar 123
was matched, and the slots are from_city = ‘chicago’,
to_city = ‘new orleans’, and flight_num =
2233”.

execute

To execute executable content – either a block, a
filled action, or a set of filled actions. If an event is thrown
during execution, the execution of the executable content is
aborted. The appropriate event handler is then executed, and this
may cause control to resume in a form item, in the next iteration
of the form’s main loop, or outside of the form. If a
<goto> is executed, the transfer takes place immediately,
and the remaining executable content is not executed.

Here is the conceptual form interpretation algorithm. The FIA
can start with no initial utterance, or with an initial utterance
passed in from another dialog:

//
// Initialization Phase
//
foreach ( <var> and form item variable, in document order )
    Declare the variable, initializing it to the value of
    the “expr” attribute, if any, or else to undefined.
foreach ( field item )
    Declare a prompt counter and set it to 1.
if ( there is an initial item )
    Declare a prompt counter and set it to 1.
if ( user entered form by speaking to its
     grammar while in a different form )
{
    Enter the main loop below, but start in
    the process phase, not the select phase:
    we already have a collection to process.
}
//
// Main Loop: select next form item and execute it.
//
while ( true )
{
    //
    // Select Phase: choose a form item to visit.
    //
    if ( the last main loop iteration ended
         with a <goto nextitem> )
        Select that next form item.
    else if ( there is a form item with an
              unsatisfied guard condition )
        Select the first such form item in document order.
    else
        Do an <exit/> -- the form is full and specified no transition.
    //
    // Collect Phase: execute the selected form item.
    //
    // Queue up prompts for the form item.
    unless ( the last loop iteration ended with
             a catch that had no <reprompt> )
    {
        Select the appropriate prompts for the form item.
        Queue the selected prompts for play prior to
        the next collect operation.
        Increment the form item’s prompt counter.
    }
    // Activate grammars for the form item.
    if ( the form item is modal )
        Set the active grammar set to the form item grammars,
        if any. (Note that some form items, e.g. <block>,
        cannot have any grammars).
    else
        Set the active grammar set to the form item
        grammars and any grammars scoped to the form,
        the current document, the application root
        document, and then elements up the <subdialog>
        call chain.
    // Execute the form item.
    if ( a <field> was selected )
        Collect an utterance or an event from the user.
    else if ( a <record> was chosen )
        Collect an utterance (with a name/value pair
        for the recorded bytes) or event from the user.
    else if ( an <object> was chosen )
        Execute the object, setting the <object>’s
        form item variable to the returned ECMAScript value.
    else if ( a <subdialog> was chosen )
        Execute the subdialog, setting the <subdialog>’s
        form item variable to the returned ECMAScript value.
    else if ( a <transfer> was chosen )
        Do the transfer, and (if bridge is true) set the
        <transfer> form item variable to the returned
        result status indicator.
    else if ( the <initial> was chosen )
        Collect an utterance or an event from the user.
    else if ( a <block> was chosen )
    {
        Set the block’s form item variable to a defined value.
        Execute the block’s executable content.
    }
    //
    // Process Phase: process the resulting utterance or event.
    //
    // Process an event.
    if ( the form item execution resulted in an event )
    {
        Find the appropriate catch for the event.
        Execute the catch (this may leave the FIA).
        continue
    }
    // Must have an utterance: process ones from outside grammars.
    if ( the utterance matched a grammar from outside the form )
    {
        if ( the grammar belongs to a <link> element )
            Execute that link’s goto or throw, leaving the FIA.
        if ( the grammar belongs to a menu’s <choice> element )
            Execute the choice’s goto or throw, leaving the FIA.
        // The grammar belongs to another form (or menu).
        Transition to that form (or menu), carrying the utterance
        to the other form (or menu)’s FIA.
    }
    // Process an utterance spoken to a grammar from this form.
    // First copy utterance slot values into corresponding
    // form item variables.
    Clear all “just_filled” flags.
    foreach ( slot in the user’s utterance )
    {
        if ( the slot corresponds to a field item )
        {
            Copy the slot value into the field item’s form item variable.
            Set this field item’s “just_filled” flag.
        }
    }
    // Set <initial> form item variable if any field items are filled.
    if ( any field item variable is set as a result of the user utterance )
        Set the <initial> form item variable.
    // Next execute any <filled> actions triggered by this utterance.
    foreach ( <filled> action in document order )
    {
        // Determine the form item variables the <filled> applies to.
        N = the <filled>’s “namelist” attribute.
        if ( N equals “” )
        {
            if ( the <filled> is a child of a form item )
                N = the form item’s form item variable name.
            else if ( the <filled> is a child of a form )
                N = the form item variable names of all the form
                    items in that form.
        }
        // Is the <filled> triggered?
        if ( any form item variable in the set N was “just_filled”
             AND ( the <filled> mode is “all”
                       AND all variables in N are filled
                   OR the <filled> mode is “any”
                       AND any variables in N are filled ) )
            Execute the <filled> action.
    }
}

specify a set of utterances that a user may speak to perform
an action or supply information, and

provide a corresponding string value (in the case of a field
grammar) or set of attribute-value pairs (in the case of a form
grammar) to describe the information or action.

JSGF supports the first requirement above by providing a
language for describing context-free grammars.
The following table is a summary of the features of JSGF.

Feature               Purpose

word or "word"        words (terminals, tokens) need not be quoted
<rule>                rule names (non-terminals) are enclosed in <>
[x]                   optionally x
(...)                 grouping
x {tag text}          arbitrary "tag" text may be associated with any
                      of the above
x*                    0 or more occurrences of x
x+                    1 or more occurrences of x
x y z ...             a sequence of x then y then z then ...
x | y | z | ...       a set of alternatives of x or y or z or ...
<rule> = x;           a private rule definition
public <rule> = x;    a public rule definition

The JSGF tag facility provides a means for meeting the second
requirement of providing values for forms to describe the action
requested. In the case of field grammars, where only a single
string value is needed, a tag may be used to supply the value. If
no tag is specified, the text of the utterance itself is used as
the value.

As described in the section on grammars, a grammar element may be
either inline or external. Furthermore, in the case
of JSGF, an inline grammar may be either a grammar
fragment or complete grammar. These three cases are
described below.

Inline grammar fragment.

The content of the <grammar> element is the
right-hand-side of a JSGF rule. (In JSGF terminology this is
called a "rule expansion"). In the most common case, where no
reference to non-terminals is made, no use is made of the XML
reserved special characters, and so the rule expansion may be
specified inline without need for quoting or use of a CDATA
section. This form is thus particularly convenient for expressing
simple lists of alternative ways of saying the same thing, for
example:
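Two fragments consistent with the description that follows can be sketched as below; the phrases chosen for the help link are illustrative.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- First example: any of these phrases throws a help event. -->
  <link event="help">
    <grammar type="application/x-jsgf">
      help | i am lost | please help me
    </grammar>
  </link>

  <form id="order">
    <!-- Second example: the tags in braces supply the field value;
         without a tag, the spoken text itself is the value. -->
    <field name="sandwich">
      <prompt>Which sandwich would you like?</prompt>
      <grammar type="application/x-jsgf">
        hamburger | burger {hamburger} | (chicken [sandwich]) {chicken}
      </grammar>
    </field>
  </form>
</vxml>
```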

In the first example, any of the ways of saying "help" result
in a help event being thrown. In the second example, the user may
say "hamburger" or "burger" and the "sandwich" field will be
given the value "hamburger", or the user may say "chicken" or
"chicken sandwich" and the "sandwich" field will be given the
value "chicken".

Inline complete grammar.

The content of the <grammar> element is a complete JSGF
grammar, consisting of one or more rule definitions, with
possible reference to external grammars. In this case all public
rules in the supplied grammar are used. Since this form requires
the use of XML reserved special characters, a CDATA section will
generally be needed.
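A sketch of an inline complete grammar; the grammar name, rule names, and vocabulary are assumptions. The JSGF text is wrapped so that the angle brackets around rule names are not parsed as XML markup.

```xml
<form id="beverage">
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
    <grammar type="application/x-jsgf">
    <![CDATA[
      #JSGF V1.0;
      grammar drinks;
      // Only the public rule is used for recognition.
      public <drink> = <hot> | <cold>;
      <hot> = coffee | tea;
      <cold> = milk | juice | water;
    ]]>
    </grammar>
  </field>
</form>
```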

External grammar.

A complete JSGF grammar is found at the URI specified by the
src attribute of the grammar element; the <grammar> element
content must be empty. The specified URI may take the form of

a URI naming a whole document, in which case all public rules
in the grammar contained in the document at the specified URI are
used, or

a URI naming a document fragment, that is, a URI ending with
#fragment, in which case the fragment name is
taken to be the name of a public rule from the grammar contained
in the document at the specified URI; only the rule so named is
used.
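Both URI forms can be sketched as follows; the grammar location and the rule name us_cities are hypothetical.

```xml
<form id="travel">
  <field name="from_city">
    <prompt>Which city are you leaving from?</prompt>
    <!-- Whole document: all public rules in the grammar are used. -->
    <grammar src="http://www.grammars.example/voicexml/cities.gram"
             type="application/x-jsgf"/>
  </field>
  <field name="to_city">
    <prompt>Which city are you going to?</prompt>
    <!-- Fragment: only the public rule named us_cities is used. -->
    <grammar src="http://www.grammars.example/voicexml/cities.gram#us_cities"
             type="application/x-jsgf"/>
  </field>
</form>
```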

The various timing properties for speech and DTMF recognition
work together to define the user experience. The ways in which
these different timing parameters function are outlined in the
timing diagrams below. In these diagrams, the wait for DTMF input
and the wait for user speech both begin when the last prompt has
finished playing.

termchar and interdigittimeout, Grammar Can Terminate

In the example below, a termchar is non-empty, and is entered
by the user before an interdigittimeout expires, to signify that
the user's DTMF input is complete; the termchar is not included as
part of the recognized value.

termchar Empty When Grammar Must Terminate

In the example below, the entry of the last DTMF has brought
the grammar to a termination point at which no additional DTMF is
expected. Since termchar is empty, there is no optional
terminating character permitted, thus the recognition ends and
the recognized value is returned.

Figure 9: Timing diagram for termchar empty when grammar must
terminate.

termchar Non-Empty and termtimeout When Grammar Must
Terminate

In the example below, the entry of the last DTMF has brought
the grammar to a termination point at which no additional DTMF is
allowed by the grammar. If the termchar is non-empty, then the
user can enter an optional termchar DTMF. If the user fails to
enter this optional DTMF within termtimeout, the recognition ends
and the recognized value is returned. If the termtimeout is 0s
(the default), then the recognized value is returned immediately
after the last DTMF allowed by the grammar, without waiting for
the optional termchar.

Figure 10: Timing diagram for termchar non-empty and termtimeout
when grammar must terminate.

termchar Non-Empty and termtimeout When Grammar Must
Terminate

In this last DTMF example, the entry of the last DTMF has
brought the grammar to a termination point at which no additional
DTMF is allowed by the grammar. Since the termchar is non-empty,
the user enters the optional termchar within termtimeout causing
the recognized value to be returned (excluding the termchar).

incompletetimeout with Speech Grammar Unrecognized

In the example above, the user provided an utterance that is
not as yet recognized by the speech grammar but is the prefix of
a legal utterance. After a silence period of incompletetimeout
has elapsed, a nomatch event is thrown.

The initial value of the form item variable; default is
ECMAScript undefined. If initialized to a value, then the form
item will not be visited unless the form item variable is
cleared.

cond

A boolean condition that must also evaluate to true in order
for the form item to be visited.

modal

If this is true (the default) all higher level speech and
DTMF grammars are turned off while making the transcription. If
false, speech and DTMF grammars scoped to the form, document,
application, and calling documents are also listened for (if the
implementation supports that).

beep

If true, a tone is emitted just prior to transcription.
Defaults to false.

maxtime

The maximum duration to transcribe.

finalsilence

The interval of silence that indicates end of speech.

dtmfterm

If true, a DTMF keypress terminates transcription. Defaults
to true. The DTMF tone is not part of the transcription.

The <transcribe> shadow variable (name$) has the
following ECMAScript properties after the transcription has been
made:

name$.confidence

The confidence level in the transcription, from 0.0 to 1.0. A
value of 0.0 indicates minimum confidence, and a value of 1.0
indicates maximum confidence. More specific interpretation of a
confidence value is platform-dependent.

name$.termchar

If the dtmfterm attribute is true, and the user terminates
the transcription by pressing a DTMF key, then this shadow
variable is the key pressed (e.g. “#”). Otherwise it
is null.

name$.utterance

The raw string of words that were recognized. The exact
tokenization and spelling is platform-specific (e.g. “five
hundred thirty” or “5 hundred 30” or even
“530”). For example, the raw utterance might be
“I need a hundred and twenty five dollars by tonight”
for a final transcription of “I need $125 by
tonight”.

// Execute the form item.
if ( a <field> was selected )
    Collect an utterance or an event from the user.
else if ( a <record> was chosen )
    Collect an utterance (with a name/value pair
    for the recorded bytes) or event from the user.
else if ( a <transcribe> was chosen )
    Collect an utterance (with a name/value pair
    for the transcription) or event from the user.
else if ( an <object> was chosen )
    Execute the object, setting the <object>’s form
    item variable to the returned ECMAScript value.
…