Abstract

This document describes a loosely coupled architecture for
multimodal user interfaces, which allows for co-resident and
distributed implementations, and focuses on the role of markup and
scripting, and the use of well defined interfaces between its
constituents.

Status of this Document

This section describes the status of this document at the time
of its publication. Other documents may supersede this document. A
list of current W3C publications and the latest revision of this
technical report can be found in the W3C technical reports index at
http://www.w3.org/TR/.

This is the
12 January 2012 W3C
Candidate Recommendation of "Multimodal
Architecture and Interfaces".
W3C publishes a technical report as a
Candidate Recommendation to indicate that the document is believed
to be stable, and to encourage implementation by the developer
community.

Publication as a
Candidate Recommendation does not imply
endorsement by the W3C Membership. This is a draft document and may
be updated, replaced or obsoleted by other documents at any time.
It is inappropriate to cite this document as other than work in
progress.

This specification defines a general and
flexible framework providing interoperability among
modality-specific components from different vendors - for example,
speech recognition from one vendor and handwriting recognition from
another. There is no normative change since the second Last Call
Working Draft in September 2011, though several clarifications were
added to the second Last Call Working Draft in order to address
detailed feedback from the public list. Please check the Disposition
of Comments received during the first Last Call period. A diff-marked
version of this document is also available for comparison
purposes.

The entrance criteria to the Proposed
Recommendation phase require at least two independently developed
interoperable implementations of each feature. Detailed
implementation requirements and the invitation for participation in
the Implementation Report are
provided in the Implementation Report Plan.
We expect to meet all requirements of that report within the
Candidate Recommendation period closing 29 February 2012. The
Multimodal Interaction Working Group
will advance Multimodal Architecture and
Interfaces to Proposed Recommendation no sooner than 29 February 2012.

The following features in the current draft
specification are considered to be at risk of
removal due to potential lack of implementations:

The Confidential field in all life-cycle
events.

The RequestAutomaticUpdate field of the
StatusRequest message.

The AutomaticUpdate field of
the StatusResponse message.

Comments for this specification are welcomed
until 29 February 2012 and should have
a subject starting with the prefix '[ARCH]'. Please send them to
<www-multimodal@w3.org>,
the public email list for issues related to Multimodal. This list
is archived
and acceptance of this archiving policy is requested automatically
upon first post. To subscribe to this list send an email to
<www-multimodal-request@w3.org> with the word subscribe in the subject line.

This document was produced by a group
operating under the 5 February 2004 W3C Patent Policy. W3C
maintains a public list of any patent disclosures made in
connection with the deliverables of the group; that page also
includes instructions for disclosing a patent. An individual who
has actual knowledge of a patent which the individual believes
contains Essential Claim(s) must disclose the information in
accordance with section 6 of the W3C Patent Policy.

Transport and format of Life-Cycle Event messages may be
implemented in any manner, as long as their contents conform to the
standard Life-Cycle Event definitions given in 6.2 Standard Life Cycle Events.
Any implementation that uses XML format to represent the life-cycle
events must comply with the normative MMI XML schemas contained in
C Event Schemas.

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL
in this specification are to be interpreted as described in [IETF RFC 2119].

The terms BASE URI and RELATIVE URI are used in this specification
as they are defined in [IETF RFC 2396].

Any section that is not marked as 'informative' is
normative.

2 Summary

This section is informative.

This document describes a loosely coupled architecture for
multimodal user interfaces, which allows for co-resident and
distributed implementations, and focuses on the role of markup and
scripting, and the use of well defined interfaces between its
constituents.

3 Overview

This section is informative.

This document describes the architecture of the Multimodal
Interaction (MMI) framework [MMIF] and the
interfaces between its constituents. The MMI Working Group is aware
that multimodal interfaces are an area of active research and that
commercial implementations are only beginning to emerge. Therefore
we do not view our goal as standardizing a hypothetical existing
common practice, but rather providing a platform to facilitate
innovation and technical development. Thus the aim of this design
is to provide a general and flexible framework providing
interoperability among modality-specific components from different
vendors - for example, speech recognition from one vendor and
handwriting recognition from another. This framework places very
few restrictions on the individual components, but instead focuses on
providing a general means for communication, plus basic
infrastructure for application control and platform services.

Our framework is motivated by several basic design goals:

Encapsulation. The architecture should make no assumptions
about the internal implementation of components, which will be
treated as black boxes.

Distribution. The architecture should support both distributed
and co-hosted implementations.

Extensibility. The architecture should facilitate the
integration of new modality components. For example, given an
existing implementation with voice and graphics components, it
should be possible to add a new component (for example, a biometric
security component) without modifying the existing components.

Recursiveness. The architecture should allow for nesting, so
that an instance of the framework consisting of several components
can be packaged up to appear as a single component to a
higher-level instance of the architecture.

Modularity. The architecture should provide for the separation
of data, control, and presentation.

Even though multimodal interfaces are not yet common, the
software industry as a whole has considerable experience with
architectures that can accomplish these goals. Since the 1980s, for
example, distributed message-based systems have been common. They
have been used for a wide range of tasks, including in particular
high-end telephony systems. In this paradigm, the overall system is
divided up into individual components which communicate by sending
messages over the network. Since the messages are the only means of
communication, the internals of components are hidden and the
system may be deployed in a variety of topologies, either
distributed or co-located. One specific instance of this type of
system is the DARPA Hub Architecture, also known as the Galaxy
Communicator Software Infrastructure [Galaxy]. This is a distributed,
message-based, hub-and-spoke infrastructure designed for
constructing spoken dialogue systems. It was developed in the late
1990s and early 2000s under funding from DARPA. This
infrastructure includes a program called the Hub, together with
servers which provide functions such as speech recognition, natural
language processing, and dialogue management. The servers
communicate with the Hub and with each other using key-value
structures called frames.

Another recent architecture that is relevant to our concerns is
the model-view-controller (MVC) paradigm. This is a well-known
design pattern for user interfaces in object-oriented programming
languages, and has been widely used with languages such as Java,
Smalltalk, C, and C++. The design pattern proposes three main
parts: a Data Model that represents the underlying logical
structure of the data and associated integrity constraints, one or
more Views which correspond to the objects that the user
directly interacts with, and a Controller which sits
between the data model and the views. The separation between data
and user interface provides considerable flexibility in how the
data is presented and how the user interacts with that data. While
the MVC paradigm has been traditionally applied to graphical user
interfaces, it lends itself to the broader context of multimodal
interaction where the user is able to use a combination of visual,
aural and tactile modalities.

4 Design versus Run-Time considerations

This section is informative.

In discussing the design of MMI systems, it is important to keep
in mind the distinction between the design-time view (i.e., the
markup) and the run-time view (the software that executes the
markup). At the design level, we assume that multimodal
applications will take the form of multiple documents from
different namespaces. In many cases, the different namespaces and
markup languages will correspond to different modalities, but we do
not require this. A single language may cover multiple modalities
and there may be multiple languages for a single modality.

At runtime, the MMI architecture features loosely coupled
software constituents that may be either co-resident on a device or
distributed across a network. In keeping with the loosely-coupled
nature of the architecture, the constituents do not share context
and communicate only by exchanging events. The nature of these
constituents and the APIs between them is discussed in more detail
in the sections below. Though nothing in the MMI architecture
requires that there be any particular correspondence between the
design-time and run-time views, in many cases there will be a
specific software component responsible for each different markup
language (namespace).

4.1 Markup and The Design-Time View

At the markup level, an application consists of multiple
documents. A single document may contain markup from different
namespaces if the interaction of those namespaces has been defined.
By the principle of encapsulation, however, the internal structure
of documents is invisible at the MMI level, which defines only how
the different documents communicate. One document has a special
status, namely the Root or Controller Document, which contains
markup defining the interaction between the other documents. Such
markup is called Interaction Manager markup. The other documents
are called Presentation Documents, since they contain markup to
interact directly with the user. The Controller Document may
consist solely of Interaction Manager markup (for example a state
machine defined in CCXML [CCXML] or SCXML [SCXML]) or it may contain Interaction Manager
markup combined with presentation or other markup. As an example of
the latter design, consider a multimodal application in which a
CCXML document provides call control functionality as well as the
flow control for the various Presentation documents. Similarly, an
SCXML flow control document could contain embedded presentation
markup in addition to its native Interaction Management markup.

These relationships are recursive, so that any Presentation
Document may serve as the Controller Document for another set of
documents. This nested structure is similar to the 'Russian Doll' model
of Modality Components, described below in 4.2 Software Constituents and The Run-Time
View.

The different documents are loosely coupled and co-exist without
interacting directly. Note in particular that there are no shared
variables that could be used to pass information between them.
Instead, all runtime communication is handled by events, as
described below in 6 Interface between the
Interaction Manager and the Modality Components. Note,
however, that this restriction applies only to non-root documents.
The IM, which loads the root document, interacts directly with the
Modality Components (which hold the other documents and/or
namespaces) through life-cycle events.

Furthermore, it is important to note that the asynchronicity of
the underlying communication mechanism does not impose the
requirement that the markup languages present a purely asynchronous
programming model to the developer. Given the principle of
encapsulation, markup languages are not required to reflect
directly the architecture and APIs defined here. As an example,
consider an implementation containing a Modality Component
providing Text-to-Speech (TTS) functionality. This Component must
communicate with the Interaction Manager via asynchronous events
(see 4.2 Software Constituents and The
Run-Time View). In a typical implementation, there would
likely be events to start a TTS play and to report the end of the
play, etc. However, the markup and scripts that were used to author
this system might well offer only a synchronous "play TTS" call, it
being the job of the underlying implementation to convert that
synchronous call into the appropriate sequence of asynchronous
events. In fact, there is no requirement that the TTS resource be
individually accessible at all. It would be quite possible for the
markup to present only a single "play TTS and do speech
recognition" call, which the underlying implementation would
realize as a series of asynchronous events involving multiple
Components.

Existing languages such as HTML may be used as either the
Controller Documents or as Presentation Documents. Further examples
of potential markup components are given in 5.2.7 Examples.

4.2 Software Constituents and The Run-Time View

At the core of the MMI runtime architecture is the distinction
between the Interaction Manager (IM) and the Modality Components,
which is similar to the distinction between the Controller Document
and the Presentation Documents. The Interaction Manager interprets
the Controller Document while the individual Modality Components
are responsible for specific tasks, particularly handling input and
output in the various modalities, such as speech, pen, video,
etc.

The Interaction Manager receives all the events that the various
Modality Components generate. Those events may be commands or
replies to commands, and it is up to the Interaction Manager to
decide what to do with them, i.e., what events to generate in
response to them. In general, the MMI architecture follows a
'targetless' event model. That is, the Component that raises an
event does not specify its destination. Rather, it passes it up to
the Runtime Framework, which will pass it to the Interaction
Manager. The IM, in turn, decides whether to forward the event to
other Components, or to generate a different event, etc.
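The following example is informative. It is a minimal sketch of the 'targetless' model described above, assuming a single-process deployment. The class names, the event names (speechResult, showText, heartbeat) and the forwarding rule are hypothetical illustrations and are not defined by this specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    name: str
    source: str                      # address of the Component that raised the event
    data: dict = field(default_factory=dict)


class InteractionManager:
    """Decides what, if anything, to send in response to each incoming event."""

    def __init__(self) -> None:
        # (target, event) pairs that the IM has decided to send out
        self.outbox: List[Tuple[str, Event]] = []

    def on_event(self, event: Event) -> None:
        # Hypothetical application logic: turn a speech result into a
        # different event addressed to a graphics Component.
        if event.name == "speechResult":
            self.outbox.append(("gui", Event("showText", "im", event.data)))
        # Every other event is received but not forwarded.


class RuntimeFramework:
    """Passes every raised event up to the IM; the raiser names no target."""

    def __init__(self, im: InteractionManager) -> None:
        self.im = im

    def raise_event(self, event: Event) -> None:
        self.im.on_event(event)


im = InteractionManager()
framework = RuntimeFramework(im)
framework.raise_event(Event("speechResult", "speech", {"text": "hello"}))
framework.raise_event(Event("heartbeat", "speech"))
```

In this sketch the speech Component never learns that a graphics Component exists; only the Interaction Manager knows where the resulting event should go.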

Modality Components are black boxes, required only to implement
the Modality Component Interface API which is described below. This
API allows the Modality Components to communicate with the IM and
thus indirectly with each other, since
the IM is responsible for delivering events/messages among the
Components. Since the internals of a Component are hidden, it is
possible for an Interaction Manager and a set of Components to
present themselves as a Component to a higher-level Interaction
Manager. All that is required is that the IM implement the
Component API. The result is a "Russian Doll" model in which
Components may be nested inside other Components to an arbitrary
depth. Nesting components in this manner is one way to produce a
'complex' Modality Component, namely one that handles multiple
modalities simultaneously. However, it is also possible to produce
complex Modality Components without nesting, as discussed in 5.2.3 The Modality Components.

In addition to the Interaction Manager and the modality
components, there is a Runtime Framework that provides
infrastructure support, in particular a transport layer which
delivers events among the components.

Because we are using the term 'Component' to refer to a specific
set of entities in our architecture, we will use the term
'Constituent' as a cover term for all the elements in our
architecture which might normally be called 'software
components'.

4.3 Relationship to EMMA

The Extensible MultiModal Annotation markup language [EMMA] is part of a set of specifications for multimodal
systems, and provides details of an XML markup language for
containing and annotating the interpretation of user input. For
example, a user of a multimodal application might use both speech
to express a command, and pen gestures to select or draw
command parameters. The Speech Recognition Modality would express
the user command using EMMA to indicate the input source (speech).
The Pen Gesture Modality would express the command parameters using
EMMA to indicate the input source (pen gestures). Both modalities
may include timing information in the EMMA notation. Using the
timing information, a fusion module combines the speech and pen
gesture information into a single EMMA notation representing both
the command and its parameters. The use of EMMA enables the
separation of the recognition process from the information fusion
process, and thus enables reusable recognition modalities and
general purpose information fusion algorithms.
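The following example is informative. It sketches how a fusion module might use timing information to combine a speech interpretation with a pen gesture interpretation. The data structure, the field names, and the one-second window are assumptions made for illustration; the sketch does not use EMMA syntax and does not reflect any particular fusion algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpretation:
    mode: str        # input source, e.g. "speech" or "pen"
    start_ms: int    # when the input began
    end_ms: int      # when the input ended
    value: dict      # recognized content


def fuse(speech: Interpretation, pen: Interpretation,
         max_gap_ms: int = 1000) -> Optional[dict]:
    """Combine two interpretations if they overlap or are close in time."""
    # Negative or zero when the intervals overlap; otherwise the silence
    # between the end of the earlier input and the start of the later one.
    gap = max(speech.start_ms, pen.start_ms) - min(speech.end_ms, pen.end_ms)
    if gap > max_gap_ms:
        return None  # too far apart to be treated as one user action
    return {**speech.value, **pen.value, "modes": [speech.mode, pen.mode]}


command = Interpretation("speech", 0, 800, {"command": "zoom"})
region = Interpretation("pen", 600, 1200, {"region": [10, 20, 30, 40]})
late_region = Interpretation("pen", 5000, 5500, {"region": [0, 0, 5, 5]})

fused = fuse(command, region)
not_fused = fuse(command, late_region)
```

Because the recognizers only produce annotated interpretations, the same fusion function could be reused with any pair of modalities that report timing.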

5 Overview of Architecture

Here is a list of the Constituents of the MMI architecture. They
are discussed in more detail below.

the Interaction Manager, which coordinates the different
modalities. It is the Controller in the MVC paradigm.

the Data Component, which provides the common data model and
represents the Model in the MVC paradigm.

the Modality Components, which provide modality-specific
interaction capabilities. They are the Views in the MVC
paradigm.

the Runtime Framework, which provides the basic infrastructure
and enables communication among the other Constituents.

5.1 Run-Time Architecture Diagram

5.2 The Constituents

This section presents the responsibilities of the various
constituents of the MMI architecture.

5.2.1 The Interaction Manager

All life-cycle events that the Modality Components generate MUST be
delivered to the Interaction Manager. All life-cycle events that
are delivered to Modality Components MUST be sent
by the Interaction Manager.

Due to the Russian Doll model, Modality Components MAY contain
their own Interaction Managers to handle their internal events.
However these Interaction Managers are not visible to the top level
Runtime Framework or Interaction Manager.

If the Interaction Manager does not contain an explicit handler
for an event, it MUST respect any default behavior that has
been established for the event. If there is no default behavior,
the Interaction Manager MUST ignore the event. (In effect, the
Interaction Manager's default handler for all events is to ignore
them.)
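The following example is informative. It sketches the handler lookup order described above: an explicit handler first, then any established default behavior, and otherwise the event is ignored. The EventDispatcher class and the way handlers are registered are hypothetical; a real Interaction Manager would normally derive them from its markup.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventDispatcher:
    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}   # explicit, application-defined
        self.defaults: Dict[str, Handler] = {}   # established default behavior

    def dispatch(self, name: str, event: dict) -> str:
        if name in self.handlers:
            self.handlers[name](event)
            return "handled"
        if name in self.defaults:
            self.defaults[name](event)
            return "default"
        # No explicit handler and no default behavior: ignore the event.
        return "ignored"


log: List[str] = []
dispatcher = EventDispatcher()
dispatcher.handlers["ExtensionNotification"] = lambda e: log.append("explicit")
dispatcher.defaults["StatusRequest"] = lambda e: log.append("default")

results = [
    dispatcher.dispatch(name, {})
    for name in ("ExtensionNotification", "StatusRequest", "SomethingElse")
]
```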

The following paragraph is informative.

Normally there will be specific markup associated with the IM
instructing it how to respond to events. This markup will thus
contain a lot of the most basic interaction logic of an
application. Existing languages such as SMIL, CCXML, SCXML, or
ECMAScript can be used for IM markup as an alternative to defining
special-purpose languages aimed specifically at multimodal
applications. The IM fulfills multiple functions. For example, it
is responsible for synchronization of data and focus, etc., across
different Modality Components as well as the higher-level
application flow that is independent of Modality Components. It
also maintains the high-level application data model and may handle
communication with external entities and back-end systems.
Logically these functions could be separated into separate
constituents and implementations may want to introduce internal
structure to the IM. However, for the purposes of this standard, we
leave the various functions rolled up in a single monolithic
Interaction Manager component. We note that state machine languages
such as SCXML are a good choice for authoring such a multi-function
component, since state machines can be composed. Thus it is
possible to define a high-level state machine representing the
overall application flow, with lower-level state machines nested
inside it handling the cross-modality synchronization at each
phase of the higher-level flow.

5.2.2 The Data Component

This section is informative.

The Data Component is responsible for storing application-level
data. The Interaction Manager is a client of the Data Component and
is able to access and update it as part of its control flow logic,
but Modality Components do not have direct access to it. Since
Modality Components are black boxes, they may have their own
internal Data Components and may interact directly with backend
servers. However, the only way that Modality Components can share
data among themselves and maintain consistency is via the
Interaction Manager. It is therefore a good application design
practice to divide data into two logical classes: private data,
which is of interest only to a given modality component, and public
data, which is of interest to the Interaction Manager or to more
than one Modality Component. Private data may be managed as the
Modality Component sees fit, but all modification of public data,
including submission to back end servers, should be entrusted to
the Interaction Manager.

This specification does not define an interface between the Data
Component and the Interaction Manager. This amounts to treating the
Data Component as part of the Interaction Manager. (Note that this
means that the data access language will be whatever one the IM
provides.) The Data Component is shown with a dotted outline in the
diagram above, however, because it is logically distinct and could
be placed in a separate component.

5.2.3 The Modality Components

This section is informative.

Modality Components, as their name would indicate, are
responsible for controlling the various input and output modalities
on the device. They are therefore responsible for handling all
interaction with the user(s). The only requirement this specification
places on them is that they implement the interface defined in 6
Interface between the Interaction Manager and the Modality
Components. Any further definition of their
responsibilities will be highly domain- and application-specific.
In particular we do not define a set of standard modalities or the
events that they should generate or handle. Platform providers are
allowed to define new Modality Components and are allowed to place
into a single Component functionality that might logically seem to
belong to two or more different modalities. Thus a platform could
provide a handwriting-and-speech Modality Component that would
accept simultaneous voice and pen input. Such combined Components
permit a much tighter coupling between the two modalities than the
loose interface defined here. Furthermore, modality components may
be used to perform general processing functions not directly
associated with any specific interface modality, for example,
dialog flow control or natural language processing.

In most cases, there will be specific markup in the application
corresponding to a given modality, specifying how the interaction
with the user should be carried out. However, we do not require
this and specifically allow for a markup-free modality component
whose behavior is hard-coded into its software.

5.2.4 The Runtime Framework

The Runtime Framework is a cover term for all the infrastructure
services that are necessary for successful execution of a
multimodal application. This includes starting the components,
handling communication, and logging, etc. For the most part, this
version of the specification leaves these functions to be defined
in a platform-specific way, but we do specifically define a
Transport Layer which handles communications between the
components.

5.2.4.1 The Event Transport Layer

The Event Transport Layer is responsible for delivering events
among the IM and the Modality Components. Clearly, there are
multiple transport mechanisms (protocols) that can be used to
implement a Transport Layer and different mechanisms may be used to
communicate with different modality components. Thus the Event
Transport Layer consists of one or more transport mechanisms
linking the IM to the various Modality Components.

We place the following requirements on all transport
mechanisms:

Events MUST be delivered reliably. In particular, the
event delivery mechanism MUST report an error if an event cannot be
delivered, for example if the destination endpoint is
unavailable.

Events MUST be delivered to the destination in the
order in which the source generated them. There is no guarantee on
the delivery order of events generated by different sources. For
example, if Modality Component M1 generates events E1 and E2 in
that order, while Modality Component M2 generates E3 and then E4,
we require that E1 be delivered to the Runtime Framework before E2
and that E3 be delivered before E4, but there is no guarantee on
the ordering of E1 or E2 versus E3 or E4.
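The following example is informative. It sketches one way a transport mechanism could satisfy both requirements: an error is reported when the destination is unavailable, and events from the same source are delivered in the order in which they were generated. The InMemoryTransport class, its method names, and the addresses are hypothetical and assume a single-process deployment.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class DeliveryError(Exception):
    """Reported to the sender when an event cannot be delivered."""


class InMemoryTransport:
    """Hypothetical single-process transport; not part of this specification."""

    def __init__(self) -> None:
        # destination address -> (source, event) pairs in arrival order
        self.inboxes: Dict[str, List[Tuple[str, str]]] = {}
        # source address -> FIFO queue of (destination, event) awaiting delivery
        self.pending: Dict[str, Deque[Tuple[str, str]]] = {}

    def register(self, address: str) -> None:
        self.inboxes[address] = []

    def send(self, source: str, destination: str, event: str) -> None:
        if destination not in self.inboxes:
            raise DeliveryError(f"endpoint unavailable: {destination}")
        self.pending.setdefault(source, deque()).append((destination, event))

    def deliver_one(self, source: str) -> None:
        """Deliver the oldest pending event generated by one source."""
        destination, event = self.pending[source].popleft()
        self.inboxes[destination].append((source, event))


transport = InMemoryTransport()
transport.register("im")
transport.send("M1", "im", "E1")
transport.send("M1", "im", "E2")
transport.send("M2", "im", "E3")
transport.send("M2", "im", "E4")

# Sources may be interleaved arbitrarily; each source stays in FIFO order.
for source in ("M2", "M1", "M1", "M2"):
    transport.deliver_one(source)

arrived = [event for _, event in transport.inboxes["im"]]
```

Keeping one queue per source is enough to preserve the required ordering without imposing any ordering between M1 and M2.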

5.2.4.1.1 Event and Information Security

This section is informative.

Events will often carry sensitive information, such as bank
account numbers or health care information. In addition, events must
be reliable for both sides of a transaction: for example, if an
event carries an assent to a financial transaction, both sides of
the transaction must be able to rely on that assent.

We do not currently specify delivery mechanisms or internal
security safeguards to be used by the Modality Components and the
Interaction Manager. However, we believe that any secure system
will have to meet the following requirements at a minimum:

The following two optional requirements can be met by using the
W3C's XML-Signature Syntax and
Processing specification [XMLSig].

Authentication. The event delivery mechanism should be able to
ensure that the identities of components in an interaction are
known.

Integrity. The event delivery mechanism should be able to
ensure that the contents of events have not been altered in
transit.

The remaining optional requirements for event delivery and
information security can be met by following other
industry-standard procedures.

Authorization. A component should provide a method to ensure
only authorized components can connect to it.

Privacy. The event delivery mechanism should provide a method
to keep the message contents secure from any unauthorized access
while in transit.

Non-repudiation. The event delivery mechanism, in conjunction
with the components, may provide a method to ensure that if a
message is sent from one constituent to another, the originating
constituent cannot repudiate the message that it sent and that the
receiving constituent cannot repudiate that the message was
received.

Multiple protocols may be necessary to implement these
requirements. For example, TCP/IP and HTTP provide reliable event
delivery, but additional protocols such as TLS or HTTPS could be
required to meet security requirements.

5.2.5 System and OS Security

This section is informative.

This architecture does not and will not specify the internal
security requirements of a Modality Component or Runtime
Framework.

5.2.6 Media stream handling

Media streams do not typically flow through the Interaction
Manager. This specification does not specify how media connections
are established, as the main focus of this specification is the
flow of control data. However, all control data logically sent
between modality components MUST flow through the Interaction Manager.

5.2.7 Examples

This section is informative.

For the sake of concreteness, here are some examples of
components that could be implemented using existing languages. Note
that we are mixing the design-time and run-time views here, since
it is the implementation of the language (the browser) that serves
as the run-time component.

CCXML [CCXML] could be used as both the
Controller Document and the Interaction Manager language, with the
CCXML interpreter serving as the Runtime Framework and Interaction
Manager.

SCXML [SCXML] could be used as the
Controller Document and Interaction Manager language.

In an integrated multimodal browser, the markup language that
provided the document root tag would define the Controller Document
while the associated scripting language could serve as the
Interaction Manager.

6 Interface between the Interaction Manager and the Modality Components

The most important interface in this architecture is the one
between the Modality Components and the Interaction Manager.
Modality Components communicate with the IM via asynchronous
events. Constituents MUST be able to send events and to handle
events that are delivered to them asynchronously. It is not
required that Constituents use these events internally since the
implementation of a given Constituent is a black box to the rest of
the system. In general, it is expected that Constituents will send
events both automatically (i.e., as part of their implementation)
and under mark-up control.

The majority of the events defined here come in request/response
pairs. That is, one party (either the IM or an MC) sends a request
and the other returns a response. (The exceptions are the
ExtensionNotification, StatusRequest and StatusResponse events,
which can be sent by either party.) In each case it is specified
which party sends the request and which party returns the response.
If the wrong party sends a request or response, or if the request
or response is sent under the wrong conditions (e.g., a response
without a previous request) the behavior of the receiving party is
undefined. In the descriptions below, we say that the originating
party "MAY" send the request, because it is up to
the internal logic of the originating party to decide if it wants
to invoke the behavior that the request would trigger. On the other
hand, we say that the receiving party "MUST" send the response, because it is
mandatory to send the response if and when the request is
received.

6.1 Common Event Fields

The concept of 'context' is basic to the events described
below. A context represents a single extended interaction with zero
or more users across one or more modality components. In a simple
unimodal case, a context can be as simple as a phone call or SSL
session. Multimodal cases are more complex, however, since the
various modalities may not be all used at the same time. For
example, in a voice-plus-web interaction, e.g., web sharing with an
associated VoIP call, it would be possible to terminate the web
sharing and continue the voice call, or to drop the voice call and
continue via web chat. In these cases, a single context persists
across various modality configurations. In general, the 'context'
SHOULD
cover the longest period of interaction over which it would make
sense for components to store information.

6.1.1 Context

A URI that MUST be unique for the lifetime of the system.
It is used to identify this interaction. All events relating to a
given interaction MUST use the same context URI. Events
containing a different context URI MUST be
interpreted as part of other, unrelated, interactions.

6.1.2 Source

A URI representing the address of the sender of the event. The
recipient of the event MUST be able to send an event back to the
sender by using this value as the 'target' of a message.

6.1.3 Target

A URI that MUST represent the address to which the event
will be delivered.

6.1.4
RequestID

A unique identifier for a Request/Response pair. Most life-cycle
events come in Request/Response pairs that share a common
RequestID. For any such pair, the RequestID in the Response event
MUST match
the RequestID in the request event. The RequestID for such a pair
MUST be
unique within the given context.

6.1.5 Status

An enumeration of 'Success' and 'Failure'. The Response event of
a Request/Response pair MUST use this field to report whether it
succeeded in carrying out the request.

6.1.6
StatusInfo

The Response event of a Request/Response pair MAY use this
field to provide additional status information.

6.1.7 Data

Any event MAY use this field to contain arbitrary data.
The format and meaning of this data is application-specific.

6.1.8
Confidential

Any event MAY use this field to indicate whether the
contents of this event are confidential. The default value is
'false'. If the value is 'true', the Interaction Manager and
Modality Component implementations MUST NOT log
the information or make it available in any way to third parties
unless explicitly instructed to do so by the author of the
application.
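
The common fields above can be pictured as a single record. The following Python sketch is illustrative only: the class name LifeCycleEvent, the helpers new_request and make_response, and the use of a UUID as RequestID are assumptions made for this example and are not defined by this specification.

```python
from dataclasses import dataclass
from typing import Any, Optional
import uuid


@dataclass
class LifeCycleEvent:
    """Hypothetical container for the common event fields of section 6.1."""
    context: str                       # 6.1.1: URI identifying the interaction
    source: str                        # 6.1.2: URI of the sender
    target: str                        # 6.1.3: URI of the recipient
    request_id: str                    # 6.1.4: unique within the context
    status: Optional[str] = None       # 6.1.5: 'Success' or 'Failure' (responses)
    status_info: Optional[str] = None  # 6.1.6: optional extra status detail
    data: Any = None                   # 6.1.7: application-specific payload
    confidential: bool = False         # 6.1.8: the default value is 'false'


def new_request(context: str, source: str, target: str) -> LifeCycleEvent:
    # A UUID is one possible way to obtain a fresh RequestID (an assumption).
    return LifeCycleEvent(context, source, target, request_id=uuid.uuid4().hex)


def make_response(request: LifeCycleEvent, succeeded: bool,
                  status_info: Optional[str] = None) -> LifeCycleEvent:
    # The response keeps Context and RequestID and swaps Source and Target,
    # so that it is addressed to the sender of the request.
    return LifeCycleEvent(
        context=request.context,
        source=request.target,
        target=request.source,
        request_id=request.request_id,
        status="Success" if succeeded else "Failure",
        status_info=status_info,
    )
```

Swapping Source and Target in the response reflects the requirement in 6.1.2 that the recipient be able to reply by using the Source value as the Target of a message.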

6.2 Standard
Life Cycle Events

The Multimodal Architecture defines the following basic
life-cycle events which the Interaction Manager and Modality
Components MUST support. These events allow the
Interaction Manager to invoke modality components and receive
results from them. They thus form the basic interface between the
IM and the Modality components. Note that the ExtensionNotification
event offers extensibility since it contains arbitrary content and
can be raised by either the IM or the Modality Components at any
time once the context has been established. For example, an
application relying on speech recognition could use the 'Extension'
event to communicate recognition results or the fact that speech
had started, etc.

In the definitions below, all fields are mandatory, unless
explicitly stated to be optional.

6.2.1
NewContextRequest/NewContextResponse

A Modality Component MAY send a NewContextRequest to the IM to
request that a new context be created. If this event is sent, the
IM MUST
respond with the NewContextResponse event. The NewContextResponse
event MUST
ONLY be sent in response to the NewContextRequest event. Note
that the IM MAY create a new context without a previous
NewContextRequest by sending a PrepareRequest or StartRequest
containing a new context ID to the Modality Components. Furthermore,
the IM may respond with the same context in response to
NewContextRequests from different
(multiple) Modality Components, since the interaction can be
started by different Modality Components independently.

6.2.1.1 NewContextRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

6.2.2
PrepareRequest/PrepareResponse

The IM MAY send a PrepareRequest to allow the
Modality Components to pre-load markup and prepare to run. Modality
Components are not required to take any particular action in
response to this event, but they MUST return a
PrepareResponse event. Modality Components that return a
PrepareResponse event with Status of 'Success' SHOULD be
ready to run with close to 0 delay upon receipt of the
StartRequest.

The Interaction Manager MAY send multiple PrepareRequest events to a
Modality Component for the same Context before sending a
StartRequest. Each request MAY reference a different ContentURL or
contain different in-line Content. When it receives multiple
PrepareRequests, the Modality Component SHOULD
prepare to run any of the specified content.

6.2.2.1 PrepareRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . Note that the IM MAY use the
same context value in multiple PrepareRequest events when it wishes
to execute multiple instances of markup in the same context.

ContentURL Optional URL of the content that the
Modality Component SHOULD prepare to execute.

Content Optional Inline markup that the Modality
Component SHOULD prepare to execute.

The IM MUST NOT specify both the ContentURL and
Content in a single PrepareRequest. The IM MAY leave both
ContentURL and Content empty. In such a
case, the Modality Component MUST revert to its default behavior. For
example, this behavior could consist of returning an error event or
of running a preconfigured or hard-coded script.

6.2.2.2 PrepareResponse
Properties

RequestID . See 6.1.4
RequestID . This MUST match the RequestID in the PrepareRequest
event.

Context See 6.1.1
Context . This MUST match the value in the PrepareRequest
event.

6.2.3
StartRequest/StartResponse

To invoke a modality component, the IM MUST send a
StartRequest. The Modality Component MUST return a
StartResponse event in response. The IM MAY include a
value in the ContentURL or Content field of this event. In this
case, the Modality Component MUST use this value.

If a Modality Component receives a new StartRequest while it is
executing a previous one, it MUST either cease execution of the previous
StartRequest and begin executing the content specified in the most
recent StartRequest, or reject the new StartRequest, returning a
StartResponse with status equal to 'Failure'.

6.2.3.1 StartRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . Note that the IM MAY use the
same context value in multiple StartRequest events when it wishes
to execute multiple instances of markup in the same context.

ContentURL Optional URL of the content that the
Modality Component MUST attempt to execute.

Content Optional Inline markup that the Modality
Component MUST attempt to execute.

The IM MUST NOT specify both the ContentURL and
Content in a single StartRequest. The IM MAY leave both
ContentURL and Content empty. In such a case, the Modality
Component MUST run the content specified in the most
recent PrepareRequest in this context, if there is one. Otherwise
it MUST
revert to its default behavior. For example, this behavior could
consist of returning an error event or of running a preconfigured
or hard-coded script.
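
The content selection rules for a StartRequest can be sketched as follows. This is a minimal illustration, not a normative algorithm: the function names, the tuple representation of content, and the use of None to stand for "revert to default behavior" are assumptions made for this example.

```python
from typing import Optional, Tuple

# Hypothetical representation: ("url", value) or ("inline", value).
ContentRef = Tuple[str, str]


def validate(content_url: Optional[str], content: Optional[str]) -> None:
    # The IM MUST NOT specify both ContentURL and Content in one request.
    if content_url and content:
        raise ValueError("ContentURL and Content are mutually exclusive")


def resolve_start_content(content_url: Optional[str],
                          content: Optional[str],
                          last_prepared: Optional[ContentRef]
                          ) -> Optional[ContentRef]:
    """Pick what a Modality Component should run for a StartRequest."""
    validate(content_url, content)
    if content_url:
        return ("url", content_url)      # value given in the StartRequest
    if content:
        return ("inline", content)       # inline markup in the StartRequest
    if last_prepared is not None:
        return last_prepared             # most recent PrepareRequest content
    return None                          # caller reverts to default behavior
```

A caller receiving None would then apply its default behavior, for example by returning an error event or by running a preconfigured script.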

6.2.3.2 StartResponse
Properties

RequestID . See 6.1.4
RequestID . This MUST match the RequestID in the StartRequest
event.

Context See 6.1.1
Context . This MUST match the value in the
StartRequest event.

The DoneNotification event is intended to indicate the
completion of the processing that has been initiated by the
Interaction Manager with a StartRequest. As an example, a voice
modality component might use the DoneNotification event to indicate
the completion of a recognition task. In this case, the
DoneNotification event might carry the recognition result expressed
using EMMA. However, there may be tasks which do not have a
specific end. For example, the Interaction Manager might send a
StartRequest to a graphical modality component requesting it to
display certain information. Such a task does not necessarily have
a specific end and thus the graphical modality component might
never send a DoneNotification event to the Interaction Manager.
Thus the graphical modality component would display the screen
until it received another StartRequest (or some other lifecycle
event) from the Interaction Manager.

6.2.5
CancelRequest/CancelResponse

The IM MAY send a CancelRequest to stop processing in
the Modality Component. In this case, the Modality Component MUST stop
processing and then MUST return a CancelResponse.

6.2.5.1 CancelRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . This MUST match the value in the StartRequest
event.

6.2.6
PauseRequest/PauseResponse

The IM MAY send a PauseRequest to suspend processing
by the Modality Component. Modality Components MUST return a
PauseResponse once they have paused, or once they determine that
they will be unable to pause.

6.2.6.1 PauseRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . This MUST match the value in the
StartRequest event.

6.2.7
ResumeRequest/ResumeResponse

The IM MAY send the ResumeRequest to resume
processing that was paused by a previous PauseRequest. The IM MUST NOT
send the ResumeRequest to a context that is not paused due to a
previous PauseRequest. Implementations that have paused MUST attempt
to resume processing upon receipt of this event and MUST return a
ResumeResponse afterwards. The 'Status' MUST be
'Success' if the implementation has succeeded in resuming
processing and MUST be 'Failure' otherwise.

6.2.7.1 ResumeRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . This MUST match the value in the
StartRequest event.

6.2.8
ExtensionNotification

This event MAY be generated by the IM and MAY be
generated by the Modality Component. It is used to encapsulate
application-specific events that are extensions to the framework
defined here. For example, if an application containing a voice
modality wanted that modality component to notify the Interaction
Manager when speech was detected, it would cause the voice modality
to generate an ExtensionNotification event (with a 'name' of
something like 'speechDetected') at the appropriate time.

6.2.8.1 ExtensionNotification
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Name The name of the application-specific
event.

Context See 6.1.1
Context . This MUST match the value in the
StartRequest event.

6.2.9
ClearContextRequest/ClearContextResponse

The IM MAY send a ClearContextRequest to indicate
that the specified context is no longer active and that any
resources associated with it may be freed. Modality Components are
not required to take any particular action in response to this
command, but MUST return a ClearContextResponse. Once the
IM has sent a ClearContextRequest to a Modality Component, it MUST NOT
send the Modality Component any more events for that context.

6.2.9.1 ClearContextRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . This MUST match the value in the
StartRequest event.

6.2.10
StatusRequest/StatusResponse

The StatusRequest message and the corresponding StatusResponse
are intended to provide keep-alive functionality. Either the IM or
the Modality Component MAY send the StatusRequest message. The
recipient MUST respond with the StatusResponse
message.

6.2.10.1 StatusRequest
Properties

RequestID . See 6.1.4
RequestID . A newly generated identifier used to identify
this request.

Context See 6.1.1
Context . Optional specification of the context for which
the status is requested. If it is present, the recipient MUST respond
with a StatusResponse message indicating the status of the
specified context. If it is not present, the recipient MUST send a
StatusResponse message indicating the status of the underlying
server, namely the software that would host a new context if one
were created.

RequestAutomaticUpdate . A boolean value. If it is
'true' the recipient SHOULD send periodic StatusResponse messages
without waiting for an additional StatusRequest message. If it is
'false', the recipient SHOULD send one and only one StatusResponse
message in response to this request.

6.2.10.2 StatusResponse
Properties

RequestID . See 6.1.4
RequestID . This MUST match the RequestID in the StatusRequest
event.

AutomaticUpdate . A boolean value. If it is 'true'
the sender MUST keep sending StatusResponse messages in
the future without waiting for another StatusRequest message. If it
is 'false', the sender MUST wait for a subsequent StatusRequest
message before sending another StatusResponse message.

Context See 6.1.1
Context . An optional specification of the context for
which the status is being returned. If it is present, the response
MUST
represent the status of the specified context. If it is not
present, the response MUST represent the status of the underlying
server.

Status An enumeration of 'Alive' or 'Dead'. The
meaning of these values depends on whether the 'context' parameter
is present. If it is, and the specified context is still active and
capable of handling new life cycle events, the sender MUST set this
field to 'Alive'. If the 'context' parameter is present and the
context has terminated or is otherwise unable to process new life
cycle events, the sender MUST set the status to 'Dead'. If the
'context' parameter is not provided, the status refers to the
underlying server. If the sender is able to create new contexts, it
MUST set
the status to 'Alive', otherwise, it MUST set it to
'Dead'.
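
The rule for choosing the Status value can be written as a small function. This is only a sketch: the function name and its parameters (a set of active context URIs and a flag saying whether new contexts can be created) are assumptions made for this example.

```python
from typing import Optional, Set


def status_for(context: Optional[str],
               active_contexts: Set[str],
               can_create_contexts: bool) -> str:
    """Return 'Alive' or 'Dead' for a StatusResponse."""
    if context is not None:
        # Context present: report on that specific context.
        return "Alive" if context in active_contexts else "Dead"
    # Context absent: report on the underlying server.
    return "Alive" if can_create_contexts else "Dead"
```

Here membership in active_contexts stands in for "still active and capable of handling new life cycle events"; a real implementation would need its own test for that condition.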

A Modality Component
States

[This section is informative]

Within an established context, a Modality Component can be
viewed as functioning in one of three states: Idle, Running or
Paused. Lifecycle events received from the Interaction Manager
imply specific actions and transitions between states. The table
below shows possible MC actions, state transitions and response
contents for each Request event the IM may send to an MC in a
particular state.

A Failure: ErrorMessage annotation indicates that the
specified Request event is either invalid or redundant in the
specified state. In this case, the Modality Component responds by
sending a matching Response event with Status=Failure and
StatusInfo=ErrorMessage. In all other cases, the Modality performs
the requested action, possibly transitioning to another state as
indicated.

event / state

PrepareRequest
  Idle: preload or update content
  Running: preload or update content
  Paused: preload or update content

StartRequest
  Idle: Transition: Running. Use new content if provided, otherwise use last available content.
  Running: stop processing current content, restart as in Idle
  Paused: Transition: Running. Stop processing current content, restart as in Idle.
  Failure: NoContent if MC requires content to run and none has been provided
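
The PrepareRequest and StartRequest rows above can be read as a small state machine. The following Python sketch covers only those two rows. The class and method names are hypothetical, and leaving the state unchanged on a NoContent failure is an assumption, since the table does not say which state results.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    PAUSED = "Paused"


class ModalityComponent:
    """Toy MC covering only the PrepareRequest and StartRequest rows."""

    def __init__(self, requires_content: bool = True):
        self.state = State.IDLE
        self.content = None                  # last available content
        self.requires_content = requires_content

    def prepare(self, content=None):
        # Preload or update content; no state transition in any state.
        if content is not None:
            self.content = content
        return ("Success", None)

    def start(self, content=None):
        # Use new content if provided, otherwise the last available content.
        if content is not None:
            self.content = content
        if self.content is None and self.requires_content:
            # Assumption: the state is left unchanged on this failure.
            return ("Failure", "NoContent")
        # From Running or Paused the current run is stopped and restarted
        # as in Idle; in every case the MC ends up in the Running state.
        self.state = State.RUNNING
        return ("Success", None)
```

Each method returns a (Status, StatusInfo) pair corresponding to the matching Response event.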

C.9 StartRequest.xsd

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xs:schema xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="qualified" elementFormDefault="qualified" targetNamespace="http://www.w3.org/2008/04/mmi-arch">
<xs:annotation>
<xs:documentation xml:lang="en">
StartRequest schema for MMI Life cycle events version 1.0.
The Runtime Framework sends the event StartRequest to invoke a Modality Component
(to start loading a new GUI resource or to start the ASR or TTS). The Modality Component
must return a StartResponse event in response. If the Runtime Framework has sent a previous
PrepareRequest event, it may leave the ContentURL and Content fields empty, and the Modality
Component will use the values from the PrepareRequest event. If the Runtime Framework includes
new values for these fields, the values in the StartRequest event override those in the
PrepareRequest event.
</xs:documentation>
</xs:annotation>
<xs:include schemaLocation="mmi-datatypes.xsd"/>
<xs:include schemaLocation="mmi-attribs.xsd"/>
<xs:element name="StartRequest">
<xs:complexType>
<xs:choice>
<xs:sequence>
<xs:element name="ContentURL" type="mmi:contentURLType"/>
<xs:element minOccurs="0" name="data" type="mmi:anyComplexType"/>
</xs:sequence>
<xs:sequence>
<xs:element name="Content" type="mmi:anyComplexType"/>
<xs:element minOccurs="0" name="data" type="mmi:anyComplexType"/>
</xs:sequence>
</xs:choice>
<xs:attributeGroup ref="mmi:group.allEvents.attrib"/>
</xs:complexType>
</xs:element>
</xs:schema>

C.11
DoneNotification.xsd

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xs:schema xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="qualified" elementFormDefault="qualified" targetNamespace="http://www.w3.org/2008/04/mmi-arch">
<xs:annotation>
<xs:documentation xml:lang="en">
DoneNotification schema for MMI Life cycle events version 1.0.
The DoneNotification event is intended to be used by the Modality Component to indicate that
it has reached the end of its processing. For the VUI-MC it can be used to return the ASR
recognition result (or the status info: noinput/nomatch) and TTS/Player done notification.
</xs:documentation>
</xs:annotation>
<xs:include schemaLocation="mmi-datatypes.xsd"/>
<xs:include schemaLocation="mmi-attribs.xsd"/>
<xs:include schemaLocation="mmi-elements.xsd"/>
<xs:element name="DoneNotification">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" name="data" type="mmi:anyComplexType"/>
<xs:element minOccurs="0" ref="mmi:statusInfo"/>
</xs:sequence>
<xs:attributeGroup ref="mmi:group.allResponseEvents.attrib"/>
</xs:complexType>
</xs:element>
</xs:schema>

C.18
ExtensionNotification.xsd

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<xs:schema xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" xmlns:xs="http://www.w3.org/2001/XMLSchema" attributeFormDefault="qualified" elementFormDefault="qualified" targetNamespace="http://www.w3.org/2008/04/mmi-arch">
<xs:annotation>
<xs:documentation xml:lang="en">
ExtensionNotification schema for MMI Life cycle events version 1.0.
The ExtensionNotification event may be generated by either the Runtime Framework or the
Modality Component and is used to communicate (presumably changed) data values to the
other component. E.g., when the VUI-MC has signaled a recognition result for any field displayed
on the GUI, the event will be used by the Runtime Framework to send a command to the
GUI-MC to update the GUI with the recognized value.
</xs:documentation>
</xs:annotation>
<xs:include schemaLocation="mmi-datatypes.xsd"/>
<xs:include schemaLocation="mmi-attribs.xsd"/>
<xs:element name="ExtensionNotification">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" name="data" type="mmi:anyComplexType"/>
</xs:sequence>
<xs:attributeGroup ref="mmi:group.allEvents.attrib"/>
<xs:attributeGroup ref="mmi:extension.name.attrib"/>
</xs:complexType>
</xs:element>
</xs:schema>

D Ladder
Diagrams for the MMI Architecture with a Web Browser and VXML
Interpreter

[This section is informative]

D.1 Creating a Session

The following ladder diagram shows a possible message sequence
upon a session creation. We assume that an Interaction Manager
session is already up and running. The user starts a multimodal
session for example by starting a web browser and fetching a given
URL.

After loading the initial documents (and scripts), the modality
component implementation issues an
mmi:NewContextRequest message to the
IM. The IM may load a corresponding markup document, if necessary,
and initializes and starts a new session.

In this scenario, the Interaction Manager logic issues a
number of mmi:StartRequest messages to the
various modality components. One message is sent to the graphical
modality component (GUI) to instruct it to load an HTML document.
Another message is sent to a voice modality component (VUI) to play
a welcome message.

The voice modality component has (in this example) to create a
VoiceXML session. As VoiceXML 2.1 does not provide an external
event interface, a CCXML session will be used for external
asynchronous communication. Therefore, the voice modality component
uses the session creation interface of CCXML 1.0 to create a
session and start a corresponding script. This script will then
make a call to a phone at the user device (which could be a regular
phone or a SIP soft phone on the user's device). This scenario
illustrates the use of a SIP phone, which may reside on the user's
mobile handset.

After successful setup of a CCXML session and the voice
connection, the voice modality component instructs the CCXML browser
to start a VoiceXML dialog and passes it a corresponding VoiceXML
script. The VoiceXML interpreter will execute the script and play
out the welcome message. After the execution of the VoiceXML script
has finished, the voice modality component notifies the Interaction
Manager using the mmi:done event.

D.2 Processing User
Input

The next diagram gives an example of a possible message flow
while processing user input. In the given scenario, the user
wants to enter information using the voice modality component. To
start the voice input, the user has to use the "push-to-talk"
button. The "push-to-talk" button (which might be a hardware button
or a soft button on the screen) generates a corresponding event
when pushed. This event is issued as an
mmi:Extension event towards the
Interaction Manager. The Interaction Manager logic sends an
mmi:StartRequest to the voice modality
component. This mmi:StartRequest message contains a URL
which points to a corresponding VoiceXML script. The voice modality
component again starts a VoiceXML interpreter using the given URL.
The VoiceXML interpreter loads the document and executes it. Now
the system is ready for the user input. To notify the user about
the availability of the voice input functionality, the Interaction
Manager might send an event to the GUI upon receiving the
mmi:StartResponse event (which
indicates that the voice modality component has started to execute
the document). Note, however, that this is not shown in the picture.

The VoiceXML interpreter captures the user's voice input and uses
a speech recognition engine to recognize the utterance. The speech
recognition result will be represented as an EMMA document and sent
to the Interaction Manager using the mmi:done message. The
Interaction Manager logic sends an
mmi:Extension message to the GUI
modality component to instruct it to display the recognition
result.

D.3 Ending a Session

In the following scenario, a modality component instance will be
destroyed as a reaction to a user input, e.g. because the user
chose to change to the GUI-only mode. In this case, an
mmi:ClearContextRequest will be issued
to the voice modality component. The voice modality component
wrapper will then destroy the CCXML (and VoiceXML) session.

The application logic (i.e. the IM) may also decide to indicate
the removed voice functionality and disable an icon on the screen
which indicates the availability of the voice modality.

E Localization and
Customization

[This section is informative]

The MMI architecture specification describes a set of lifecycle
events which define the basic interface between the interaction
management and the modality components. The
StartRequest lifecycle event defines
the "Content" and "ContentURL" elements, which may contain markup
code (or references to markup code). The markup has to be executed
by the modality component. Using the
"Content" or "ContentURL" elements introduces a
dependency of the lifecycle event on a specific modality component
implementation. In other words, the interaction manager has to
issue different StartRequests, depending on which
markup a GUI modality component may be able to process.

But multimodal applications may want to support different
modality component implementations, such as HTML or Flash, for the
same application. In this case the interaction manager should be
independent of the modality component implementation and hence not
generate a markup specific lifecycle event (e.g. containing a link
to HTML or even HTML content), but a further abstracted description
of the command.

Furthermore, localization needs to be taken into account. If the
interaction manager sends markup code to the modality component (or
references to it), this markup code should not contain any
dependencies to the user's language. Instead the interaction
manager needs to send the locale information to the modality
component and let it select the appropriate strings.

Here is an example showing how these two issues could be
addressed within the lifecycle events. This example uses a generic
data structure to carry the locale information (within the xml:lang
attribute) and the data to be visualized at a GUI.

This StartRequest carries a generic
<gui> structure as its payload which contains a "resourceid"
and the xml:lang information. The "resourceid" has to be
interpreted by the modality component (either to load an HTML
document or a corresponding dialog, e.g. if it is a Flash app),
whereas "xml:lang" is used by the modality component to select the
appropriate string tables.

The content of the <gui> structure is an application
specific (but generic) description of data to be used by the
modality component. This could contain a description of the status
of GUI elements (such as "enabled" or "disabled") or a list of
items to be displayed. The following example shows a
StartRequest to display a list of
songs. The list of songs will be loaded from a backend system and
is dynamic. The representation of the song list is agnostic to the
modality component implementation. It is the responsibility of the
modality component to interpret the structure and to display its
content appropriately.
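
As an illustration of such a markup-independent payload, the following Python sketch builds a StartRequest that carries a <gui> structure with a "resourceid", an xml:lang value and a list of songs. It is not taken from this specification: the function name, the attribute names Context, Source, Target and RequestID on the event, the placement of <gui> inside a data child, and the <song> element are all assumptions made for this example, and the result is not claimed to validate against the schemas in Appendix C.

```python
import xml.etree.ElementTree as ET

MMI_NS = "http://www.w3.org/2008/04/mmi-arch"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("mmi", MMI_NS)


def build_start_request(context, source, target, request_id,
                        resource_id, lang, songs):
    """Serialize a StartRequest whose payload is a generic <gui> structure."""
    start = ET.Element(f"{{{MMI_NS}}}StartRequest", {
        # Attribute names are assumed for illustration.
        f"{{{MMI_NS}}}Context": context,
        f"{{{MMI_NS}}}Source": source,
        f"{{{MMI_NS}}}Target": target,
        f"{{{MMI_NS}}}RequestID": request_id,
    })
    data = ET.SubElement(start, f"{{{MMI_NS}}}data")
    # The MC interprets resourceid and uses xml:lang to pick string tables.
    gui = ET.SubElement(data, "gui", {
        "resourceid": resource_id,
        f"{{{XML_NS}}}lang": lang,
    })
    for title in songs:
        # Hypothetical item element; the list itself is markup-agnostic.
        ET.SubElement(gui, "song").text = title
    return ET.tostring(start, encoding="unicode")
```

The same event could be sent unchanged to an HTML-based or a Flash-based GUI modality component, since neither the resource identifier nor the song list refers to a particular markup language.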

F HTTP transport
of MMI lifecycle events

[This section is informative]

The "Multimodal Architecture and Interfaces" specification
supports deployments in a variety of topologies, either distributed
or co-located. In case of a distributed deployment, a protocol for
the lifecycle event transport needs to be defined. HTTP is the
major protocol of the web. It is widely adopted, supported
by many programming languages, and used in particular by web browsers.
Technologies like AJAX provide asynchronous transmission of
messages for web browsers and make it possible to build modality
components on top of them in distributed environments. This section describes how
the HTTP protocol should be used for MMI lifecycle event transport
in distributed deployments. Modality components and the Interaction
Manager need an HTTP processor to send and receive MMI lifecycle
events. The following picture illustrates a possible modularization
of the Runtime Framework, the Interaction Manager and the Modality
Components. It shows internal lifecycle event interfaces (which
abstract from the transport layer) and the HTTP processors. The
HTTP processors are responsible for assembling and disassembling
HTTP requests, which carry MMI lifecycle event representations as
payloads.

The following sections describe how the HTTP protocol should be
used to transport MMI lifecycle events.

HTTP defines the concept of client and server [RFC2616]. One possible deployment of the
multimodal architecture is shown in the following figure:

In this deployment scenario the Interaction Manager acts as an
HTTP server, whereas modality components are HTTP clients, sending
HTTP requests to the Interaction Manager. But other configurations
are possible.

The multimodal architecture specification requires an
asynchronous bi-directional event transmission. To achieve this (in
the given scenario, where modality components are HTTP clients and
the Interaction Manager acts as an HTTP server), separate (parallel)
HTTP requests
(referred to as send and receive
channels in the picture) are used to send and receive lifecycle
events.

Modality components use HTTP/POST requests to send MMI lifecycle
events to the IM. The request contains the following URL request
parameters:

Context (or token)

Source

The lifecycle event itself is contained in the body of the
HTTP/POST request. The Content-Type header field of
the HTTP/POST request has to be set according to the lifecycle
event format, e.g. “text/xml”.

The URL request parameters Context and Source are equivalent to the
respective MMI lifecycle event attributes. The Context must be used whenever
available. The Context is only unknown to the
modality component during startup of a multimodal session, as the
Context will be returned from the
Interaction Manager to the Modality Component with the NewContextResponse lifecycle event.
Hence, when sending a NewContextRequest, the context is
unknown. Therefore, a token is used to associate the
NewContextRequest and NewContextResponse
messages.

The token is a unique id (preexisting knowledge,
e.g. generated by the modality component during registration) to
identify the channel between a modality component and the
Interaction Manager.

Once the Context is exchanged, the Context must be used with
subsequent requests and the token must not be used anymore.
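The rule above (token before the Context is known, Context afterwards) can be sketched as a small helper. This is only an illustration: the function name `channel_params` is hypothetical, and the exact spelling of the parameter names on the wire is assumed to match the names used in this section.

```python
from typing import Dict, Optional


def channel_params(source: str,
                   context: Optional[str] = None,
                   token: Optional[str] = None) -> Dict[str, str]:
    """Return the URL request parameters for one request to the IM.

    Hypothetical helper; parameter spellings are assumed from the text.
    """
    if context is not None:
        # Once the Context is known it is used and the token is dropped.
        return {"Context": context, "Source": source}
    if token is None:
        raise ValueError("either a Context or a token is required")
    # Before the NewContextResponse arrives only the token is available.
    return {"token": token, "Source": source}
```

A modality component would call it with only a token while sending its NewContextRequest, and with the returned Context for every later request.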

The response to an HTTP/POST request that carries a lifecycle
event from a Modality Component to the Interaction Manager must
not contain any content, and the HTTP response code must be “204 No
Content”.
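As a rough sketch of the send channel described above, a modality component could build and send the HTTP/POST request with Python's standard library as follows. The Interaction Manager URL, the helper names and the placeholder event body are assumptions made for illustration; the body is not real MMI lifecycle event markup.

```python
import urllib.parse
import urllib.request
from typing import Dict


def build_event_post(im_url: str, params: Dict[str, str],
                     event_xml: str) -> urllib.request.Request:
    """Build the POST request that carries one lifecycle event."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        url=f"{im_url}?{query}",
        data=event_xml.encode("utf-8"),        # the event travels in the body
        headers={"Content-Type": "text/xml"},  # matches the event format
        method="POST",
    )


def send_event(request: urllib.request.Request) -> None:
    """Send the request and insist on an empty 204 response."""
    # urlopen raises urllib.error.HTTPError for 4XX and 5XX responses.
    with urllib.request.urlopen(request) as response:
        if response.status != 204:
            raise RuntimeError(f"expected 204 No Content, got {response.status}")


# Hypothetical usage; "im.example" and the body are placeholders.
request = build_event_post(
    "http://im.example/mmi",
    {"Context": "ctx-7", "Source": "asr-1"},
    "<placeholder-event/>",
)
```

Calling `send_event(request)` would perform the actual network exchange; it is left out of the usage lines so that the sketch can be read without a running Interaction Manager.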

The HTTP processor of the Interaction Manager is expected to
handle POST requests (which contain lifecycle events sent from the
modality component to the Interaction Manager) as follows:

Modality components that are not HTTP servers (such as modality
components built on top of web browsers) are not able to receive
HTTP requests. Thus, to receive MMI events from the Interaction
Manager, such modality components need to poll for events. The
modality component has to send an HTTP/GET request to the
Interaction Manager to request the next MMI event. To optimize
network performance, the HTTP processor of the Interaction Manager
may block the HTTP request for a certain time to avoid delay and
network traffic (a long-lived HTTP request). The modality component
may control the maximum delay using the optional parameter timeout
(in milliseconds). The request contains the following URL request
parameters:

Context (or token)

Source

timeout (optional)

See the discussion of the parameter Context in the previous section.
The parameter Source describes the source of the request, i.e. the
modality component's id. The parameter timeout is optional and
describes the maximum delay in milliseconds. Only positive integer
values are allowed for the parameter timeout. A request with
timeout set to “0” returns immediately. The Interaction Manager may
limit the timeout to a (platform specific) maximum value. If the
parameter timeout is absent, the Interaction Manager uses a
platform specific default.
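A polling modality component could assemble the URL of its HTTP/GET request along these lines. This is a sketch under stated assumptions: the base URL and the function name `build_poll_url` are hypothetical, and the check accepts 0 as well as positive integers because the text gives 0 a defined meaning.

```python
import urllib.parse
from typing import Optional


def build_poll_url(im_url: str, context: str, source: str,
                   timeout_ms: Optional[int] = None) -> str:
    """Build the URL used to ask the IM for the next MMI event."""
    params = {"Context": context, "Source": source}
    if timeout_ms is not None:
        # Assumption: 0 ("return immediately") and positive values are valid.
        if not isinstance(timeout_ms, int) or timeout_ms < 0:
            raise ValueError("timeout must be a non-negative integer (ms)")
        params["timeout"] = str(timeout_ms)
    # Without a timeout parameter the IM applies its platform default.
    return f"{im_url}?{urllib.parse.urlencode(params)}"
```

Omitting `timeout_ms` leaves the choice of delay to the Interaction Manager, which mirrors the optional nature of the parameter.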

The HTTP response body contains the lifecycle event as a string.
The HTTP response header must contain the Content-Type
header field, which describes the format of the lifecycle event
string (e.g. “text/xml”).

The HTTP processor of the Interaction Manager is expected to
handle HTTP/GET requests (which are used by the Modality Component
to receive lifecycle events) as follows:

Check for corresponding events (i.e. whether there are events to
send from the Interaction Manager to this particular Modality
Component). This step might block for a certain time (according to
the timeout parameter) to optimize network performance.
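The blocking check for pending events might be sketched on the Interaction Manager side with one queue per modality component. The default and maximum timeouts, the function names and the use of `None` for "no event arrived in time" are all assumptions for illustration; this section does not say what is returned when the wait expires.

```python
import queue
from typing import Dict, Optional

DEFAULT_TIMEOUT_MS = 10_000  # hypothetical platform specific default
MAX_TIMEOUT_MS = 30_000      # hypothetical platform specific maximum


def effective_timeout_ms(requested: Optional[int]) -> int:
    """Apply the platform default and the platform maximum."""
    if requested is None:
        return DEFAULT_TIMEOUT_MS
    return min(requested, MAX_TIMEOUT_MS)


def next_event(mailboxes: Dict[str, "queue.Queue[str]"], source: str,
               requested_timeout_ms: Optional[int] = None) -> Optional[str]:
    """Return the next event queued for `source`, waiting if necessary."""
    mailbox = mailboxes.setdefault(source, queue.Queue())
    wait_ms = effective_timeout_ms(requested_timeout_ms)
    try:
        if wait_ms == 0:
            return mailbox.get_nowait()  # timeout 0: answer immediately
        return mailbox.get(timeout=wait_ms / 1000.0)  # long-lived request
    except queue.Empty:
        return None  # assumption: caller decides how to answer
```

Holding the request open on `mailbox.get` is what saves the repeated round trips that short polling intervals would otherwise cause.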

For modality components that are HTTP servers themselves, the
Interaction Manager needs to send a lifecycle event through an
HTTP/POST request. The request contains the following
parameters:

Context

Target

See the discussion of parameters in the previous sections. Again,
the parameter Target is equivalent to the corresponding MMI
lifecycle event attribute and describes the receiver of the event.
Hence, the receiver of the HTTP request uses this parameter to
identify the corresponding modality component.
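On the receiving side, the lookup by Target could be sketched as below. The handler registry, the function name and the choice of 400 and 404 for the failure cases are assumptions for illustration; only the use of the Context and Target parameters comes from the text above.

```python
import urllib.parse
from typing import Callable, Dict

Handler = Callable[[str, str], None]  # receives (context, event body)


def dispatch_event(request_url: str, body: str,
                   components: Dict[str, Handler]) -> int:
    """Route one incoming POST and return an HTTP status code."""
    query = urllib.parse.urlparse(request_url).query
    params = urllib.parse.parse_qs(query)
    try:
        context = params["Context"][0]
        target = params["Target"][0]
    except KeyError:
        return 400  # assumption: a missing parameter is a client error
    handler = components.get(target)
    if handler is None:
        return 404  # assumption: no component is registered for Target
    handler(context, body)
    return 204
```

One HTTP endpoint can therefore serve several modality components, with Target selecting the one that receives the event.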

F.4 Error handling

Various MMI lifecycle events (especially response events)
contain Status and StatusInfo fields. These
fields should be used for error indication whenever possible.
However, a failure during delivery of a lifecycle event needs to be
indicated using HTTP response codes.

The HTTP processor of the Interaction Manager has to use HTTP
response codes to indicate success or errors during request
handling. If a request is processed successfully (successful in
terms of transport, i.e. an event has been successfully delivered),
a 2XX status code (e.g. "204 No Content") has to be returned.
Transport related errors, which lead to failure in delivery of a
lifecycle event, are indicated using 4XX or 5XX response codes. 4XX
codes indicate client errors (wrong parameters etc.), whereas 5XX
codes indicate server errors (see also HTTP response codes in
[RFC2616]).

The treatment of transport errors is up to the implementation,
but the implementation should make errors visible to author code
(e.g. raise an event within the Interaction Manager when a
lifecycle event has not been successfully delivered to a Modality
Component).
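One possible way to map response codes to outcomes and surface failures to author code is sketched here. The outcome labels, the function names and the callback signature are hypothetical; how errors are actually raised is left to the implementation.

```python
from typing import Callable


def classify_status(status: int) -> str:
    """Map an HTTP status code to a transport outcome."""
    if 200 <= status < 300:
        return "delivered"
    if 400 <= status < 500:
        return "client-error"  # e.g. wrong parameters
    if 500 <= status < 600:
        return "server-error"
    return "unexpected"


def report_delivery(status: int, target: str,
                    on_error: Callable[[str, str, int], None]) -> bool:
    """Return True on delivery; otherwise notify author code."""
    outcome = classify_status(status)
    if outcome == "delivered":
        return True
    # Make the transport failure visible instead of dropping it silently.
    on_error(outcome, target, status)
    return False
```

Errors reported through the Status and StatusInfo fields of a lifecycle event are a separate matter and would not pass through this path.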

G Glossary

[This section is informative]

CCXML: [CCXML] is designed to provide
telephony call control support for dialog systems, such as
VoiceXML.

Controller Document: A document that contains markup defining
the interaction between the other documents. Such markup is called
Interaction Manager markup.

Data Component: The Data Component is a sub-component of the
Runtime Framework which is responsible for storing
application-level data.

Interaction Manager: The Interaction Manager (IM) is the
component that is responsible for handling all life-cycle events
that the other Components generate. It is responsible for
synchronization of data and focus, etc., across different Modality
Components as well as the higher-level application flow that is
independent of Modality Components.

Life cycle events: The Multimodal Architecture defines basic
life-cycle events which must be supported by all modality
components. These events allow the Runtime Framework to invoke
modality components and receive results from them. They form the
basic interface between the Runtime Framework and the Modality
components.

Modality Component: Modality Components are responsible for
controlling the various input and output modalities on the device.
Modality components may also be used to perform general processing
functions not directly associated with any specific interface
modality, for example, dialog flow control or natural language
processing.

Nested components: An Interaction Manager and a set of
Components can present themselves as a Component to a higher-level
Framework. All that is required is that the IM implement the
Component API. The result is a "Russian Doll" model in which
Components may be nested inside other Components to an arbitrary
depth.

Runtime Framework: The Runtime Framework is responsible for
starting the application and interpreting the Controller Document.
It provides the basic infrastructure into which the IM and the
various Modality Components plug, and controls the
communication among the other Constituents.

Software Constituent: An architecturally significant entity in
the architecture. Because we are using the term 'Component' to
refer to a specific set of entities in our architecture, we will
use the term 'Constituent' as a cover term for all the elements in
our architecture which might normally be called 'software
components'.

H Types of Modality Components

[This section is informative]

H.1 Simple modality components

Modality components can be classified into one of three
categories: simple, complex or nested.

A simple modality component presents information to a user or
captures information from a user as directed by an interaction
manager. A simple modality component is atomic in that it cannot
be partitioned into two or more simple modality components that send
events among themselves. A simple modality component is like a
black box in that the interaction manager cannot directly access
any function inside the black box other than by using life-cycle
events.

A simple modality component might contain functionality to
present one of the following types of information to the user or
user agent, for example:

TTS—generates synthetic speech from a text string

Audio replay—replays an audio file to a user

GUI presentation—presents HTML on a display device.

Ink replay—replays one or more ink strokes

Video replay—replays one or more video clips

A simple modality component might contain functionality to
capture one of the following types of information from the user or
user agent as directed by a complex modality or interaction
manager:

Audio capture—records user utterances

ASR—captures text from the user by using a grammar to convert
spoken voice into text

DTMF—captures integers from a user by using a grammar to
interpret digits represented by the sounds created by a touch-tone
keypad on a phone

Ink capture—captures one or more ink strokes

Ink recognition—captures one or more ink strokes and interprets
them as text by using a grammar.

Speaker verification—determines if a user is who the user
claims to be by comparing spoken voice characteristics with the
voice characteristics known to be associated with the user

Speaker identification—determines who a speaker is by comparing
spoken voice characteristics with a set of preexisting voice
characteristics of several individuals.

Face verification—determines if a user is who the user claims
to be by comparing face patterns with the face patterns known to be
associated with the user

Face identification—determines who a user is by comparing face
pattern characteristics with a set of preexisting face patterns of
several individuals

GPS—captures the current GPS location of a device.

Keyboard or mouse—captures information entered by the user
using a keyboard or mouse.

Figure 1: Two simple modality components

Figure 1 illustrates two simple modality components—ASR modality
for capturing input from the user and TTS for presenting output to
the user. Note that all information exchanged between the two
modality components must be sent as life-cycle events to the
interaction manager which forwards them to the other modality
component.
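The routing constraint illustrated by Figure 1 can be sketched as a toy relay in which components never hold references to each other. Everything here is hypothetical (the class, the tuple-shaped events and the component names); a real interaction manager would apply application logic rather than blindly forwarding.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str, str]  # (source, target, payload), a toy event shape


class InteractionManager:
    """Toy relay: every event between components passes through here."""

    def __init__(self) -> None:
        self._components: Dict[str, Callable[[Event], None]] = {}
        self.log: List[Event] = []

    def register(self, name: str, receive: Callable[[Event], None]) -> None:
        self._components[name] = receive

    def send(self, event: Event) -> None:
        # Record the event, then forward it to the target component.
        self.log.append(event)
        _source, target, _payload = event
        self._components[target](event)


# The ASR component only knows the manager, never the TTS component.
spoken: List[str] = []
im = InteractionManager()
im.register("asr", lambda event: None)
im.register("tts", lambda event: spoken.append(event[2]))
im.send(("asr", "tts", "hello"))
```

Because the manager sees every event, it can synchronize data and focus across components, which is the reason direct links between simple components are excluded.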

H.2 Complex modality components

A complex modality component may contain functionality of two or
more simple modality components, for example:

GUI—presents information to the user, and captures keystrokes
and mouse movements

VXML—presents a VoiceXML dialog to the user that both presents
speech to the user and captures the user's speech

GUI/VUI—enables user to both speak and listen, and read and
type.

Figure 2: A basic modality component with two
functions

Figure 2 illustrates a complex modality component containing two
functions, ASR and TTS. The ASR and TTS functions within the
complex modality component may communicate directly with each
other, in addition to sending and receiving life-cycle events with
the interaction manager.

H.3 Nested modality components

A nested modality component is a set of modality components and
a script (possibly written in SCXML) that manages them. The script
communicates with the child modality components using life-cycle
events. The script communicates with the interaction manager using
only life-cycle events. The child modality components may not
communicate directly with each other.

In effect, the script within a nested modality component can be
thought of as an interaction manager that manages the child
modality components, so a nested modality component is itself a
nested interaction manager. This is the so-called "Russian Doll"
model of nested interaction managers.
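The nesting idea can be illustrated with a small sketch in which a group of children is exposed to its parent through the same receive interface as a single component. The class name, the string events and the "child:payload" addressing scheme are all invented for illustration and are not part of the architecture.

```python
from typing import Callable, Dict, List

Receiver = Callable[[str], None]


class NestedComponent:
    """Presents a group of child components as a single component."""

    def __init__(self, send_to_parent: Receiver) -> None:
        self._send_to_parent = send_to_parent
        self._children: Dict[str, Receiver] = {}

    def add_child(self, name: str, receive: Receiver) -> None:
        self._children[name] = receive

    def receive(self, event: str) -> None:
        # Called by the parent; hypothetical "child:payload" addressing.
        child, _, payload = event.partition(":")
        self._children[child](payload)

    def child_event(self, name: str, payload: str) -> None:
        # Children report here, never to each other; forward upward.
        self._send_to_parent(f"{name}:{payload}")


# Two levels of nesting: top-level IM -> outer -> inner ("voice") -> tts.
top: List[str] = []
heard: List[str] = []
outer = NestedComponent(top.append)
inner = NestedComponent(lambda event: outer.child_event("voice", event))
outer.add_child("voice", inner.receive)
inner.add_child("tts", heard.append)

outer.receive("voice:tts:hello")  # travels down to the tts child
inner.child_event("asr", "yes")   # travels up to the top level
```

Since `inner` is registered with `outer` through nothing more than its `receive` method, the same pattern can be repeated to an arbitrary depth.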

Galaxy Communicator: Galaxy Communicator is an open source
hub-and-spoke architecture for constructing dialogue systems that was
developed with funding from the Defense Advanced Research Projects
Agency (DARPA) of the United States Government.