Much like traditional mobile applications, multimodal applications (which allow user input through a variety of methods, including voice and motion) have already put down deep roots in the global marketplace. Wise developers will stay on top of this trend as well as the development of the XML-based languages that facilitate multimodal input.

by Erin Gannon, Associate Editor

Apr 8, 2005

The proliferation of handheld devices means that users increasingly rely on them to perform routine, daily tasks; the cell phone, with its video card and Internet access, is becoming a constant companion. With this ubiquity comes the necessity to make these devices smaller and more convenient to carry, and that means smaller displays and keypads.

"The age of bulky 3G handsets is over," says Hiroshi Nakaizumi, Sony-Ericsson's Head of Design, in the company's latest press release touting the K600 UMTS handset, which weighs no more than your average 2G handset yet still delivers video telephony, a 1.3-megapixel camera, and high-performance download capabilities.

Phones just keep getting smaller, and their capabilities keep getting more complex. To support increasingly complex interaction with devices that keep shrinking, you're going to need to learn how to equip mobile applications with input modes besides a keypad or a stylus. You're going to need to learn about multimodal development.

Kirusa's Voice SMS (KV.SMS) is a typical example of a multimodal application, integrating voice messaging with text-based SMS and multimedia-based MMS. Using this program, users can dictate and send SMS messages using only their voices, send voice messages to phones without actually ringing the phone, click on an SMS message to hear a voice message, or respond to a voice or SMS message by voice or text. Ultimate convenience.

Another typical example of a multimodal application is found in the Ford Model U SUV. This car's multimodal interface uses speech technology to allow drivers to control navigation, make phone calls, operate entertainment features such as the radio or an MP3 player, adjust the climate control and the retractable roof, and personalize preferences.

The Sony K600 UMTS is representative of future handheld devices.

Another major application of multimodal technology lies in the special needs sector. Speech-enabled technologies, in particular, are of great help to those whose disabilities prevent them from taking full advantage of a GUI. eValues (e-library Voice Application for European Blind, Elderly and Sight-impaired) is a project to use multimodal development for the benefit of those whose disabilities present barriers to reading. This Internet-based service uses advanced text-to-speech conversion to allow blind or sight-impaired users to download any online book or document and listen to it. This capability works not only with PCs, but also with common PDAs.

How Do They Do That?
The widespread adoption of XML and derivative markup languages has, for all intents and purposes, enabled the advent of multimodal development. The existence of an independent translator for stored data frees developers from having to develop for specific devices. XML and, most significantly, VoiceXML make it remarkably easy for developers to create flexible interfaces with which to access varying clients.

The three building-block languages for multimodal development are: SALT (Speech Application Language Tags), X+V (XHTML + Voice), and EMMA (Extensible MultiModal Annotation). All three have been submitted to the W3C for consideration as standards for telephony and/or multimodal applications. Currently, all three are under consideration for the next version of VoiceXML.

SALT: This language is an extension of HTML and other markup languages (cHTML, XHTML, WML). It's used to add speech interfaces to Web pages, and it's designed for use with both voice-only browsers and multimodal browsers, meaning cellular phones, tablet PCs, and wireless PDAs.

Microsoft developed SALT specifically to enable speech across a wide range of devices and to allow telephony and multimodal dialogs. Because SALT uses the data models and execution environments of its host environments (HTML forms and scripting), it is more familiar to Web developers. Its event-driven interaction model is useful for multimodal applications.

However, SALT is merely a set of tags for specifying voice interaction that can be embedded into other "containing" environments. Because of this dependency on an external environment, developers using SALT may need to generate differing versions of an application for each device; for instance, an application for use on cell phones will require separate versions for Nokia and Motorola phones.
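To make the "tags inside a host page" idea concrete, here is a minimal sketch in Python. It does not run a speech recognizer; it only parses a simplified, hypothetical page and mimics what a SALT-capable browser would do when a recognition result arrives: follow each bind tag and copy the recognized value into the host form field. The page, the field ID txtCity, the grammar file name, and the recognition result are all invented for illustration, and the namespace URI is the one I believe the SALT Forum specification uses, so treat it as an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed SALT namespace URI; check it against the SALT Forum specification.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical host page: an ordinary form field plus SALT tags that feed it.
PAGE = """
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <input type="text" id="txtCity"/>
    <salt:listen id="recoCity">
      <salt:grammar src="city.grxml"/>
      <salt:bind targetelement="txtCity" value="//city"/>
    </salt:listen>
  </body>
</html>
"""

def apply_binds(page_xml, result_xml):
    """Copy values from a recognition result into the host page's fields."""
    page = ET.fromstring(page_xml.strip())
    result = ET.fromstring(result_xml)
    # Index every host-page element that carries an id attribute.
    by_id = {el.get("id"): el for el in page.iter() if el.get("id")}
    for bind in page.iter(f"{{{SALT_NS}}}bind"):
        # "//city" becomes ".//city", a path ElementTree can search.
        node = result.find("." + bind.get("value"))
        target = by_id.get(bind.get("targetelement"))
        if node is not None and target is not None:
            target.set("value", node.text)
    # Report the form fields as the host page would now see them.
    return {el_id: el.get("value") for el_id, el in by_id.items()
            if el.tag == "input"}

# Simulated recognizer output for the spoken word "Seattle".
print(apply_binds(PAGE, "<result><city>Seattle</city></result>"))
```

Notice that the speech tags own no data of their own; everything ends up in the host page's form field, which is why the host environment matters so much to a SALT application.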

X+V: This IBM-sponsored language combines XHTML with VoiceXML 2.0, the XML Events module, and a third module containing a small number of attribute extensions to both XHTML and VoiceXML. This allows VoiceXML (audio) dialogs and XHTML (text) input to share multimodal input data.

The fact that X+V is built using previously standardized languages makes it easy to modularize, that is, to break apart its code into modes, where one mode is for speech recognition, one is for motion recognition, etc.

But using the XML Events standard is what really differentiates X+V from SALT. Whereas events drive the creation of X+V, thus defining the environment, SALT merely attaches its tags to events within a pre-existing environment. Because X+V is self-sufficient in this manner, applications written with it are generally more portable.
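The role of XML Events is easier to see with a small sketch. The Python below parses a simplified, hypothetical X+V-style page and lists which visual elements hand off to which voice dialogs, and on which event. The page itself, the IDs askCity and cityBox, and the helper name event_bindings are invented for illustration; a real X+V page would also need grammars and other pieces this sketch omits.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"      # VoiceXML 2.0 namespace
EV_NS = "http://www.w3.org/2001/xml-events"  # XML Events namespace

# Hypothetical page: a voice dialog in the head, wired to a text box by an event.
PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <vxml:form id="askCity">
      <vxml:field name="city">
        <vxml:prompt>Which city?</vxml:prompt>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <input type="text" id="cityBox" ev:event="focus" ev:handler="#askCity"/>
  </body>
</html>
"""

def event_bindings(page_xml):
    """List (element id, event, dialog id, dialog exists) for each binding."""
    root = ET.fromstring(page_xml.strip())
    # Collect the ids of every voice dialog declared in the page.
    dialogs = {form.get("id") for form in root.iter(f"{{{VXML_NS}}}form")}
    bindings = []
    for element in root.iter():
        event = element.get(f"{{{EV_NS}}}event")
        handler = element.get(f"{{{EV_NS}}}handler")
        if event and handler:
            target = handler.lstrip("#")  # "#askCity" -> "askCity"
            bindings.append((element.get("id"), event, target, target in dialogs))
    return bindings

print(event_bindings(PAGE))
# [('cityBox', 'focus', 'askCity', True)]
```

Here the event attributes on the text box are what tie the visual and voice halves together: when the box receives focus, the askCity dialog runs.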

EMMA: This language was developed to provide semantic interpretations for speech, natural language text, keyboard, and ink input (a type of stylus input that includes handwriting recognition).

EMMA is a complementary language to SALT and X+V, functioning as a sort of middleman between a multimodal application's components, that is, between a user's input and the X+V- or SALT-based interpreter. This frees developers from having to worry about writing code to interpret user input. EMMA simply translates input into a format that the application language can interpret, greatly simplifying the process of adding multiple modes to an application.
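As a rough illustration of that middleman role, the Python sketch below reads an EMMA-style document and picks the interpretation with the highest confidence. The sample document is hypothetical (a made-up flight query with two candidate readings), the destination element is application data I invented, and the namespace URI and attribute names follow the W3C EMMA drafts as I understand them, so verify them against the version you target.

```python
import xml.etree.ElementTree as ET

# Namespace from the W3C EMMA drafts; treat it as an assumption.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognizer output: two candidate readings of one spoken request.
SAMPLE = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
                         emma:tokens="flights to boston">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.20"
                         emma:tokens="flights to austin">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""

def best_interpretation(emma_xml):
    """Return (confidence, slots) for the highest-confidence interpretation."""
    root = ET.fromstring(emma_xml.strip())
    best = None
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        confidence = float(interp.get(f"{{{EMMA_NS}}}confidence", "0"))
        # The child elements are the application-specific meaning of the input.
        slots = {child.tag: (child.text or "").strip() for child in interp}
        if best is None or confidence > best[0]:
            best = (confidence, slots)
    return best

print(best_interpretation(SAMPLE))
# (0.75, {'destination': 'Boston'})
```

The application never sees raw audio or pen strokes, only the annotated result, so the same consuming code could work whether the input was spoken, typed, or handwritten.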