CSS Extensions for Multimodal Interaction

Dave Raggett

W3C/Canon

dsr@w3.org

Max Froumentin

W3C/ERCIM

mf@w3.org

ABSTRACT

This note describes experimental ideas for
extending CSS style sheets to add multimodal capabilities to XHTML
documents without the need for any changes to the markup, and in a
manner that is fully backwards compatible with existing browsers.
The approach is modality independent and can be used with key strokes,
speech and pen-based input, together with aural and visual prompts.
This paper describes original work by the authors, and should not
be taken to represent the views of the W3C, nor those of any W3C
working group.

Keywords

User Interface, Browsers, Multimodal, Stylesheets, CSS, XHTML

1. INTRODUCTION

The vast majority of Web browsers today rely on the graphical
user interface as invented by Doug Engelbart in the 1960s and
subsequently refined by Xerox PARC and Apple Computer. We are
used to using a pointer device to click on links and scroll the
window, and using the keyboard to fill in text fields. Developments
in speech and handwriting recognition are providing opportunities
for new modes of interaction with Web pages. Electronic pens
allow for gestures, drawings and visual notations like mathematics
and music, as well as for handwriting. Speech allows for hands
and eyes free operation, as well as being more convenient than
the limited keypads available on small devices like cellphones.
The true potential of natural language—the medium of choice
for human to human communication—is only just beginning
to be tapped.

This paper describes ideas for extending Web browsers to
support multiple modes of interaction, and explores some ways
in which style sheets could be extended to describe interaction
independent of whether the user chooses to respond with speech,
handwriting or keystrokes. The idea that users should be free
to make such choices is proposed as a key principle:

The principle of modality independence
— users should be free to select the mode of
interaction, subject to device capabilities, while designers
should be empowered to provide an effective user interface
for whichever choices are made.

To enable widespread adoption, and to repeat the success
of the GUI Web, new authoring languages need to be easy to
learn for existing Web designers skilled in HTML, and just as
importantly, these languages should be backwards compatible
with existing browsers. This allows users to view the pages
with older browsers that are widely deployed, whilst enabling
users with newer browsers to gain the benefits of multimodal
interfaces. If the languages aren't backwards compatible, few
designers will feel justified in developing new content while
there are only a relatively small number of people with access
to the new browsers. The lack of such content could then inhibit
the deployment of such browsers.

Backwards compatibility speeds adoption
— designers can use the same content for old and
new browsers

A related issue concerns how much effort is needed to create
an effective user interface. A language offering only low level
abstractions may provide plenty of flexibility, but can be
hard for most designers to work with. A high level language can
suffer from the opposite problem, being easy to work with, but
restricted in flexibility. The experience with the Web suggests
the value in providing an easy to learn declarative language for
the most common cases, together with the means to add flexibility
via scripting, to cope with the less common cases.

The 80/20 rule for language design
— 80% of cases should be easy to address with
declarative solutions, while the remaining 20% should
be amenable with the addition of scripting capabilities.

Some previous approaches such as SALT[1] and X+V[2] have
approached the challenge of multimodal interaction through
mixing XHTML[3] markup with additional markup for speech. This
has two consequences. The first is a risk of incompatibility
with existing browsers, and the second is a reduction in
flexibility when it comes to allowing users to freely select
the mode of interaction. As an example of the first problem,
SALT prompt text may be unintentionally visible as part of
the body of a document. This occurs because browsers are
designed to ignore start and end tags they don't recognize,
as required by the HTML specification. As an example of
the second problem, a SALT enabled browser may reprompt the
user even though a value was provided via the keyboard.

These problems may be avoided by separating the user interface
from the application markup. This paper explores the view
that application user interfaces can be treated as a matter of
styling, and that variations in user preferences and device
capabilities can be treated as a matter of applying different
style sheets. This idea is perhaps better known as skin-able
user interfaces.

Skin-able user interfaces
— rely on the separation of the user interface
from the application to offer users a choice of
interfaces as appropriate to device capabilities and
user preferences.

A well known example is "Winamp", a Windows application for
playing music. Users can download a range of skins and try them
out in turn. The skin not only allows for wide variations in the
visual appearance of the application, it also changes the details
of how you interact with the application. This idea is very much
relevant to the needs for Web authoring languages given the wide
variations in device capabilities, such as the display size and
supported modes of interaction. W3C's Cascading Style Sheets (CSS)
language[4] is a well established means for styling Web content,
but currently assumes that the user interface is part of the
application and not part of the style sheet. This limits the
potential of CSS for defining skin-able user interfaces, which
need to cover the choice of user interface controls and the details
of the interaction with the user, as well as the visual appearance.
By treating interaction as part of styling, it is easy to change
the user interface simply by switching the style sheet.

2. PATTERNS OF INTERACTION

This section of the paper gives a brief introduction to what
it means to provide a multimodal interface to Web pages. This
will be used to motivate the choice of features used for the
CSS extensions proposed in section 3.

The user interface for Web browsers may be characterized in
terms of the following categories:

Browser controls

These include the Back, Forward, Stop, Reload and Home
buttons, as well as the means to pick a Website from the
browser favorites or to enter a URI explicitly. Browsers
generally allow users to set preferences such as the
default fonts. A further function is support for scrolling
through documents larger than the document window.

Navigation

The means to follow hypertext links whether these are
textual or image based, plus the means to move the input
focus from one input control to the next. This is critically
important for text fields, but is also needed when using
the keyboard without a pointing device.

Selection

Menus, radio buttons and checkboxes all involve making
a selection of some kind.

Text fields

Single or multi-line text entry fields.

Special purpose controls

These are application specific and often created using external
formats such as Macromedia Flash.

Browser controls can be bound to a fixed set of spoken commands.
For navigation, the commands will be application specific. The
simplest approach is to make the spoken command the same as the
visible label for the link or control, since this makes them easier
to learn—the idea of "say what you see". The same applies to
selection. For text fields, there is a choice between an application
specific grammar, and free text entry using statistical dictation
models for speech recognition. The accuracy of using speech for
free text entry is generally less than that for application
specific grammars, but nonetheless, is expected to become
increasingly practical. One advantage of speech is the
opportunity to fill out multiple fields or to give multiple
commands with a single utterance. This is especially valuable
when using network based recognition due to the increased
latency compared with embedded (i.e. local) recognition.

For devices with an electronic pen or stylus, the browser
controls, navigation, and selection can be operated in much the
same way as when using a mouse pointer. You just need to tap on the
button or link. The platform may also support a set of pen gestures
where a particular movement of the pen is interpreted as a command.
In principle, applications could define their own gestures, but
this is difficult in practice as there are no established standards
for defining such gestures. For handwriting recognition, the user
may be free to write directly on the text field, or be required to
write one character at a time in a special area. In either case,
the speed of text entry tends to be slower for experienced users
than when using a conventional full sized keyboard. The application
may also be able to collect ink traces for processing on a server.
This gives the designer the freedom to enable the use of ink for
scribbled notes, drawings, and specialized notations. W3C is
developing an XML format for ink traces called "InkML"[8] with this
in mind.

When interacting with Web pages in a conventional manner, the
user is free to choose which link to click on, and in which order
to fill out fields etc. This is known as "user directed"
interaction. When carrying out a task like placing an order, the
application can direct the user through a sequence of pages, e.g.
for selecting products, confirming the selection, collecting
payment and delivery details, and placing the order. This is known
as "application directed" interaction. This is appropriate for
tasks that must be performed in a particular order, or when the
user needs guidance, e.g. for an unfamiliar or complex task.

Application directed interaction is also useful when the user
fails to respond as expected within a reasonable time, or when
there are uncertainties in speech or handwriting recognition.
In such situations, the application can be designed to provide
progressively stronger guidance with what are known as "tapered
prompts". This involves transiting between a sequence of dialog
states. Such states can also be used to ask the user for
confirmation, or to select from a short list of recognition
hypotheses, or to provide feedback on progress, and to set the
user's expectation for what the application is currently doing.
In principle, transiting between dialog states could be
implemented by asking the server for the next page (e.g. as a
kind of form submission). But this can be expensive in terms of
increased latency and network usage—factors that are
important for mobile applications. This makes it worth
providing the means to support dialog transitions without
requiring such page loads.

Prompts guide the user to respond within the expectations
described by application grammars. Sometimes the prompt and
grammar are static, but in other situations, it will be
necessary to refer back to something the user previously input.
This creates a need for dynamically generated prompts and
grammars that are computed as functions of application data.
This could be computed server-side or client-side depending on
whether dialog transitions involve page loads or not.

This section concludes with a look at issues concerning
coupling across different modes of input that need to be
addressed by multimodal authoring languages. The first case
is where there are two text fields and the user selects the
second using the keyboard or pointing device, whilst still
saying the text for the first field. This may be considered by
the user as akin to type ahead. Users are likely to be upset
if their input for the first field is discarded or placed into
the wrong field. This suggests that the notion of input focus
should be taken as an indication of intent and interpreted in
coordination with the speech modality. If the second field is
associated with a spoken prompt, this should be deferred until
the user has finished talking, just as would normally be the
case when two people are in a conversation. The same holds for
activating the speech grammar for the second field. It would
considerably simplify the application designer's task if
the implementation were to manipulate the event queue to hide
the details of how this is achieved. This is assumed to be
the case for the proposal described in this paper.

The second case is where the user can provide input using
various combinations of two modes. An example is an application
involving a map where the user can zoom or pan the map, or ask
for information about a specific location or region. In principle,
this could be implemented in terms of a form with two fields, one
for speech and one for ink trace data. Depending on what the
user says, the application may or may not expect ink data;
equally, depending on what the user draws, the application may
be fine with just ink data or may expect accompanying speech.
Such paired fields should share the input focus since
simultaneous speech and pen input needs to be allowed. In
principle, such multimodal map controls could be implemented
using W3C's Scalable Vector Graphics format[11] with the addition
of a means to collect ink traces. The flexibility involved in such
composite controls is likely to involve a limited amount of
scripting, but otherwise falls within the scope of the ideas
described in this paper.

3. THE PROPOSAL

This section of the paper describes proposed extensions to
W3C's Cascading Style Sheets language[4]. The extensions cover
prompts, grammars, declarative event handlers and named dialog
states. The proposal is based upon the idea of text as an
abstract modality. Speaking, writing or typing are just different
ways to enter text. Likewise, the application can present text
using either the display or synthetic speech. This reflects the
principle of modality independence as described in the
introduction. For an effective user interface, the Web designer
needs to control how text is obtained or presented in different
modes. It isn't enough to simply leave this to the browser. You
might think that HTML Forms are generic across modes, but they
rely on the assumption that input is reliable and unambiguous,
which while true for key strokes and pointer input, is not the
case for speech and handwriting.

In principle, style sheets alter the style of an application
but not its substance. Thus you should still be able to do
everything when style sheets are disabled, although perhaps
not using the same modes or in exactly the same way. This
constrains the range of actions that can be applied as part
of style sheets. Please note that this paper is not intended as
a complete specification due to practical limitations on the
length of conference papers. The authors expect to be able to
demonstrate a working implementation of the ideas at the
conference.

3.1 The 'prompt' property

prompt: none | auto | url(<address>) | expr(<expression>) | <string>

The prompt property is used to specify prompts that
guide users as to how to interact with the application. Prompts
are triggered by events as determined by the CSS selector, and
may be presented in a variety of ways depending on the CSS media
type, the device features and user preferences, for example, via
speech synthesis, tool tip or status bar message.

The typical use of the prompt property is with a string, e.g.

body { prompt: "Welcome to ring-tones galore!" }
#search:focus { prompt: "please write or say the name of a band or artist" }

In the first example, a prompt is defined that will be played
in response to the onload event when the document is
first loaded. When the document is unloaded, any prompts that
are playing will be stopped in response to the onunload
event. In the second example, a text input field with an XML ID
value of "search" is associated with a prompt to be played when
the field is given the input focus. Prompts are handled via a
prompt queue. If several prompts are triggered by the same event,
then they should be queued in document order.

Designers need to take care to keep prompts consistent with
the text and graphics in the markup. The following example
illustrates the kind of mismatch to avoid:

<h1 style="prompt: 'say yes or no'">Say black or white</h1>

Prompts can also be defined using external Web formats such as
W3C's Speech Synthesis Markup Language[6], multimedia presentations
expressed in SMIL[12] or scalable vector graphics represented in
SVG[11]. References to such resources are expressed using the CSS
url() syntax as follows:

body { prompt: url(welcome.ssml) }

For dynamically computed prompts, you can use an expression
that returns a typed value such as a string, or an XML resource
like SSML. For example:

body { prompt: expr("Welcome back " + userName) }

The expression is evaluated dynamically just prior to the
prompt being presented. An open question is whether the syntax
for expressions should be described as part of CSS or whether
an arbitrary ECMAScript expression is acceptable.

The special value auto is reserved for indicating that the
prompt is to be automatically constructed based upon the selected
element, for instance, based upon the associated label element in
XForms[13] or the title attribute in XHTML[3]. The precise means for
constructing the prompt is dependent on the markup language.

To allow for styling of prompts, you can use the ::prompt
pseudo-element. This can be combined with CSS pseudo-classes
as in:

#search:focus::prompt { voice-family: female }

The special value none can be used to suppress a
cascaded prompt when no prompt is intended.

Many speech interfaces allow the user to start talking before
the prompt has finished. This is often referred to as "barge in",
and it may be desirable to inhibit such behavior, forcing the user
to wait until the prompt has finished. This can be supported through
the barge-in property, e.g.

body {
  prompt: url(disclaimer.ssml);
  barge-in: avoid;
}

A spoken prompt has a natural duration. This may not be the
case when the prompt is rendered visually. The user may thus be
able to respond immediately regardless of the barge-in property.
An open question is whether there should be a way for application
designers to set a minimum prompt duration to ensure that users
are given sufficient time to read legal notices etc. This time
interval could be specified as part of the barge-in property.
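
One illustrative sketch, assuming a hypothetical minimum
duration could follow the avoid keyword (the file name is
also illustrative):

body {
  prompt: url(legal-notice.ssml);
  barge-in: avoid 10s;
}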

3.2 The 'grammar' property

The grammar property is used to enable text input that
is subject to the syntactic and semantic constraints specified by
the grammar property value. The grammar is activated according to
the selector in the same way as for prompts, for instance on the
onload or onfocus events. If barge-in is inhibited, then activation
is delayed until the prompt has finished.

The special value any is reserved for use with free
text entry. Some speech recognizers may be unable to support this.
For constrained text entry, simple grammars can be expressed inline
using a subset of the W3C Speech Recognition Grammar Specification[7]
ABNF notation, for example:
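
A minimal sketch of such an inline grammar (the selector and
the rule itself are illustrative):

#drink:focus { grammar: "coffee | tea | cola | nothing {}" }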

A word is defined here to be a string that doesn't contain whitespace
characters or any of the other special characters including brackets of
all kinds, semicolons, vertical bars, quotation marks and other punctuation
symbols. The tag-content is a string that cannot contain a "}" character.
In the absence of an explicit tag, the input string is used in its place.
Note the use of empty curly braces "{}" to return a null string as the
result. More complex normalizations, e.g. those requiring calculations,
may be done using SRGS[7] together with the Semantic Interpretation
specification[10].

Here is an example showing how tags can be used to normalize
input values:
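
A sketch consistent with the normalization described below
(the selector is illustrative):

#drink:focus { grammar: "pepsi {pepsi cola} | coke {coca cola}" }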

If the user enters "pepsi" the input value is normalized to
"pepsi cola", while "coke" is normalized to "coca cola".

The ABNF format could in principle be extended to allow for
the use of regular expressions for constraining text input,
and for using InkML[8] as a way to bind pen gestures to
semantic results. For this, InkML would be used to define
examples of gestures that can be matched against the user's
input.

Larger grammars are supported by reference to external
resources, e.g.
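
One illustrative possibility, with a hypothetical grammar file:

#airport:focus { grammar: url(us-airports.srgs) }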

It may be worth allowing for a small number of built-in
grammars to reduce the need for complex grammars and to
enable the platform to use a platform specific input method.
In the following example, a built-in calendar control could
be used for convenience in selecting a date.
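
A sketch of what this might look like, assuming a built-in
named via the type() notation introduced below:

#depart:focus { grammar: type(date) }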

The use of type(auto) is intended for markup languages
like XForms[13] where the type information is supplied as part of
the markup, or where it is practical to automatically create the
grammar from the set of labels, e.g. for the XForms <select>
and <select1> elements.

You can also use an expression for dynamically computed
grammars, with the same syntax for expressions as for prompts.
Similarly, the special value none can be used to indicate
that no input is expected.

The default processing of the input value depends on the
nature of the element to which the associated grammar property
is bound. For grammars associated with hypertext links or buttons,
any non-null value activates the link or button. For elements acting
as check boxes, or as choices in a multiple selection list, any
non-null value activates the selection, while a null value
de-activates the selection.
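
As an illustrative sketch, a grammar bound to a hypertext
link:

a#help { grammar: "help | help me" }

Here any utterance matching the grammar yields a non-null
value and so activates the link.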

A complication is the need to distinguish between setting the
focus and filling out a text field. The reason for this is that
speech recognition and handwriting recognition are imperfect. The
chances of success are enhanced if the user is constrained in
what they say or write. In a mobile Web application, the current
Web page may have only a few hypertext links and form controls,
while a text field on that page may be associated with a relatively
large grammar, for instance, the set of names of US airports.

A solution is to make the behavior dependent on whether the
selected element has the input focus. If it has the focus, the
element is updated with the returned value, otherwise, the input
focus is given to the selected element and its value is left
unchanged. In essence, this means that a text field must already
have the focus if the field's value is to be updated.

This default processing may be overridden with a scripted
event handler. This could be used, for example, to interpret
EMMA[9] documents representing annotated interpreted input.

An open question is what level of control to provide over
how grammars are de-activated. In the "tap and talk" idiom, the
user of a pen enabled device, taps on a field and then speaks
to fill it out. The grammar is then de-activated upon a no input,
a no match or a match event. In some situations, the grammar
should remain active. It can then raise a succession of match
events as the user's speech is matched against the grammar. This
behavior could be enabled through an additional CSS property.

3.3 The 'reprompt' property

reprompt: none | [<time>] [<action> [<action>]]

The reprompt property enables reprompting when
the user doesn't respond within a specified timeout following
the presentation of the prompt, or when the user's input doesn't
match the associated grammar. The property has no effect unless
the associated element has the input focus and a non-empty
value for the grammar property.

The value of the reprompt property is a timeout followed
by the action to be taken on no input, or unexpected input
that doesn't match the associated grammar (no match). If the
action is missing, the default is to re-enter the current state.
In the following example, the prompt is repeated after 3 seconds
if the user doesn't provide a number:
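
A sketch (the selector, prompt and grammar are illustrative):

#copies:focus {
  prompt: "how many copies do you want?";
  grammar: "1 | 2 | 3 | 4 | 5";
  reprompt: 3s;
}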

If only one action is provided it will apply to both the
no input and no match events. If two actions are given
(separated by whitespace), the first applies to no input
and the second to no match events. As for other properties,
the special value none can be used to suppress
cascades.

An open question is whether this property should be split
into separate properties for no input and no match events.
One reason for doing so would be to provide finer grain control
over different kinds of speech time outs. For instance, a
babble time out where the user is still talking when the time
out occurs, but the recognizer hasn't yet found a match with
the associated grammar. It might be worth using different
properties for time out values and actions to allow for
cascading of time outs, and to allow them to be overridden
by a user style sheet.

3.4 The 'next' property

The next property is used to handle match events
where the user's input matches the associated grammar. It only
affects elements that have values other than 'none' for either
the grammar or prompt properties. It gives authors the means
to automatically advance the focus to a specified element or
document, thereby overriding the normal flow of interaction.
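
Based on the description that follows, the value of this
property can be sketched roughly as:

next: none | [<time>] [expr(<filter>)] [<condition> <action>]* [<action>]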

The time parameter may be used to specify a time interval
after which the actions designated by the property take effect.
This time interval gives users the chance to take the initiative
before the application steps in to give them a hand. Of course,
users don't have to follow directions, and may choose to do
something different, e.g. by tapping with the stylus on a
different field, or by speaking a navigation command.

The filter expression may be used to filter the input before
evaluating the conditions. The expression should return a
string value that will be used in place of the recognition
result. This gives considerable flexibility for dealing with
structured results when these are expressed in EMMA[9]. For
example, when driving a printing application, the user might
say "print 3 copies, A4, best quality". The semantic
interpretation rules associated with the grammar could be used
to map this to an interpretation like:
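
One possible shape for such an interpretation (the element
names are illustrative):

<interpretation>
  <copies>3</copies>
  <size>A4</size>
  <quality>best</quality>
</interpretation>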

A simple scripted function can pick out these values and
use them to fill out each of the corresponding form fields.
In principle, such a script could be replaced by a mechanism
to match the interpretation with the form using some kind of
unification algorithm. The 'next' property could then provide
a syntax such as unify(#form-identifier). However, the
flexibility of W3C's extensible multimodal annotation language
(EMMA) is such that it would be impractical to provide declarative
solutions for all cases. The filter mechanism and scripting are
thus still of value even if such a unification mechanism is
provided.

Conditions take one of the following forms:

A quoted string that is matched against the
interpreted input, e.g. as obtained by applying the associated
grammar to the user's input. This can be used to control which
action is taken based upon the user's input.

on(expression) is an expression
in a scripting language that is dynamically evaluated to obtain
a boolean value.

If several conditions evaluate to true, only the action
associated with the first such condition will be taken.
Actions have the following forms:

#identifier is used to transfer the
input focus to the element with the corresponding XML ID.

do(expression) defines an action
in terms of the evaluation of an expression in a scripting
language.

activate(#identifier) identifies
the XML ID of an element to be activated, e.g. a hypertext link
or button. For a text field this has the effect of giving it the
input focus.

activate has the effect of activating the
current element.

submit which has the effect of submitting
the form associated with this element, provided there are no
unfilled required fields.

repeat which has the effect of re-entering
the current state.

back which has the effect of transiting
to the previous state.

If the action is missing the default is the next element in
the document defined tab order that can accept text input.
The tab order is the order in which you can move through the
input controls via pressing the tab key or equivalent. This
can be controlled via the XHTML tabindex attribute.

on(expression) and do(expression) enable the author to
determine which actions to take according to the application
state rather than just the current text value. They can
also be used in combination with markup components exposed
to scripting via binding mechanisms such as XBL.

Here is an example that determines which element to set
the focus to based upon the user's input:
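
A sketch, assuming the condition/action syntax described
above:

#drink:focus {
  prompt: "would you like a drink?";
  grammar: "yes | no";
  next: "yes" #beverage "no" #no;
}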

This example asks whether the user wants a drink. If the user
responds "yes", the focus will be set to the element with the
XML ID of "beverage"; if "no", the focus will be set to the
element with the XML ID of "no". If the final action lacks a
preceding condition
string, it acts as the default action for the situation where none
of the previous condition strings match the result. If this default
action is not provided, the default will be the next appropriate
element in the document tab order.

For menus, check boxes and radio buttons etc. the text value
is the internal value of the field. For the XHTML <input> and
<option> elements this is given by the value attribute.
If the condition string is missing, any non-empty string will be
matched. The timeout only applies after a match is detected.

Consider the situation where a user has filled out a field in
a way that matches the grammar, and later wants to change the value.
Upon setting the focus back to that field, it would be inappropriate
for the focus to be immediately moved away on account of a matching
condition in the 'next' property, since this would preclude the
ability of the user to update the field's value. To avoid this
happening, there is the precondition that after an element gets
the focus, it has to receive input before the conditions given
by the 'next' property are evaluated. This precondition does not
apply if the element is associated with a null grammar using the
special value none.

The submit mechanism should be conditional on required fields
having been filled out. What is the most convenient way to deal
with this? One possibility is a property that indicates that the
associated field is required. The submit action would then set
the focus to the first unfilled required field, where "first" is
defined in terms of the document specific tab order. A further
idea would be to extend the submit action to name dependent
fields, e.g. a voice command might require a pen gesture input
via a scribble control.

In principle, the CSS @media rule can be used to tailor the
timeouts and control flow depending on the media type. The CSS
media type "speech" is applicable when the modes of interaction
are restricted to speech and DTMF. See also the
CSS3 Speech[5] properties for styling the rendering of XML to
speech.

Further work is needed to enrich CSS Media Queries to fully
realize their potential for customizing interaction to match
user preferences and device capabilities. The current set of
media names in CSS 2.1 is inadequate for expressing the
possible choices of input and output modes and their combination
with other device characteristics.

3.5 The ':state' pseudo-class

The :state(name) pseudo-class allows you to define
named states for use in defining dialogs. Here is an example
where it is used to change the prompt when reprompting:
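
One way this might be written, assuming an action of the form
state(name) for entering a named state:

#search:focus {
  prompt: "please say the name of a band or artist";
  grammar: url(artists.srgs);
  reprompt: 5s state(retry);
  next: submit;
}

#search:focus:state(retry) {
  prompt: "sorry, please say just the name of a band or artist";
}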

The CSS cascade ensures that the retry state inherits the
properties defined with :focus. In the above example, this
applies to the 'grammar', 'reprompt' and 'next' properties,
but not to 'prompt' which is overridden.

The next example shows how :state can be used to provide
an apology after an unrecognized input:
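
Continuing the sketch above, the second reprompt action (taken
on a no match) could transition to an apology state:

#search:focus:state(retry) {
  prompt: "sorry, please say just the name of a band or artist";
  reprompt: 5s repeat state(apology);
}

#search:focus:state(apology) {
  prompt: "sorry, I still didn't understand. You can also type or write your answer";
}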

The ability to create named sub-states was inspired by the
work of David Harel on statecharts[14], which are now part of the
UML standard. Harel's work allows for hierarchically nested states
and for concurrent activation of multiple states. This can be
used to formalize the way in which CSS selectors and properties
are used in this paper to bind behaviors to XML markup.

3.6 Dealing with uncertain input

A speech or handwriting recognizer may have difficulties in
correctly identifying what the user said. In some circumstances,
it is sufficient to assume the most likely recognition hypothesis,
and to ask the user for confirmation before submitting the form
etc. In other cases, it will be necessary to ask the user to
indicate which interpretation was intended. How is this to be
supported?

Many systems provide the application designer with access
to the N-best list of recognition hypotheses along with the
associated confidence scores. The designer can then apply
some kind of thresholding on these scores to determine how to
proceed. The problem with this is that the scores tend to be
platform dependent, making interoperability problematic. One
way around this is to leave the thresholds to the platform.
Here is one possibility:

confidence: <high> <medium> <low>

Where <high>, <medium> and <low> are the
actions to be taken if the last input was above the upper
threshold, between the upper and lower thresholds, or below
the lower threshold, respectively. The actions would give
the names of interaction states as defined in section 3.5.
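
For instance (a sketch, with illustrative state names and the
same assumed state(name) action form as in section 3.5):

#destination:focus {
  grammar: url(us-airports.srgs);
  confidence: state(accept) state(confirm) state(retry);
}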

The absolute confidence score is not the only cue that matters.
If there are several matches with similar confidence scores, then
disambiguation will be needed. Some of the proposed matches may
be improbable based upon the current application state. In principle,
an application's script can use the current state to re-rank the
matches proposed by the recognizer. This can be done using the
filter mechanism. Finally, if the user is having problems with one
mode of input, it may be much better to encourage the user to switch
to another mode of input, or to combine more than one mode.

4. CONCLUSION

This paper started with design principles, covering modality
independence, the complementary roles of declarative representations
and scripting, and the need to separate the user interface from the
application. W3C's work on CSS and XForms helps, but isn't really
adequate when it comes to enabling effective multimodal user
interfaces. This paper explores the view that interaction is really
a matter of styling, and proposes a set of extensions to CSS
based upon a study of different patterns of interaction. Regrettably,
there isn't enough space in this paper to make a comparison with
other proposals such as SALT and X+V. More information about these
can be found via the references.

The emergence of embedded speech is creating an opportunity
for extending mobile devices to support multimodal interaction.
To encourage widespread adoption and a vigorous growth in the
available content, it will be critically important to provide an
effective end-user experience. The approach taken in this paper
is to try to simplify the design effort needed for such
applications. Whilst this paper has focused on extending CSS,
another possibility would be an XML language for interaction
sheets.

The authors plan to demonstrate the CSS approach at the
conference using an implementation based upon Internet
Explorer and SALT, with scripts for interpreting the CSS
extensions and dynamically compiling them to SALT. This
work has shown the viability of using scripting to explore
declarative approaches, and further experiments are under
consideration.

REFERENCES

[1] The SALT specification is available from the SALT Forum at
http://www.saltforum.org/.

[2] The XHTML+Voice specification is available from IBM at
http://www.ibm.com/software/pervasive/multimodal/x%2Bv/11/spec.htm

ACKNOWLEDGEMENTS

During the W3C Multimodal Interaction workshop, held in Sophia
Antipolis in July 2004, one of the participants suggested that W3C
should try to develop a simple standard for authoring multimodal
applications for use on mobile devices. In discussions after the
workshop had ended on how to address this goal, the idea came up
of extending CSS to describe multimodal interaction. This seemed
like an intriguing possibility and well worth exploring. The authors
would like to thank Debbie Dahl and Bert Bos for their helpful
suggestions.