Multimodal dialogue

Mobile interfaces need to allow the user and system to adapt their choice of communication modes according to user preferences, the task at hand, and the physical and social environment. We describe a multimodal application architecture which combines finite-state multimodal language processing, a speech-act based multimodal dialogue manager, dynamic multimodal output generation, and user-tailored text planning to enable rapid prototyping of multimodal interfaces with flexible input and adaptive output. ...

We address two problems in the field of automatic optimization of dialogue strategies: learning effective dialogue strategies when no initial data or system exists, and evaluating the result with real users. We use Reinforcement Learning (RL) to learn multimodal dialogue strategies by interaction with a simulated environment which is “bootstrapped” from small amounts of Wizard-of-Oz (WOZ) data.
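As a rough illustration of the idea in the abstract above, the sketch below learns a dialogue strategy with tabular Q-learning against a simulated user whose behavior is estimated from a handful of counts. Everything in it is an assumption made for illustration: the counts in WOZ_COUNTS, the two-slot task, the action names, the reward values, and the choice of Q-learning itself. It is not the authors' system or data.

```python
import random

# Hypothetical counts standing in for a small WOZ corpus: how often a user
# supplied the requested slot after the wizard asked for it.
WOZ_COUNTS = {"answered": 42, "not_answered": 8}
P_ANSWER = WOZ_COUNTS["answered"] / sum(WOZ_COUNTS.values())

N_SLOTS = 2  # assumed toy task: two slots must be filled before presenting
ACTIONS = ["ask_slot", "present_results"]


def simulated_user_step(filled, action, rng):
    """Return (next_filled, reward, done) for one system action."""
    if action == "ask_slot":
        if filled < N_SLOTS and rng.random() < P_ANSWER:
            filled += 1
        return filled, -1.0, False  # every turn carries a small cost
    # present_results ends the dialogue; it pays off only with all slots filled
    return filled, (20.0 if filled == N_SLOTS else -10.0), True


def learn_policy(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    """Tabular Q-learning; the state is the number of filled slots."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_SLOTS + 1) for a in ACTIONS}
    for _ in range(episodes):
        state, done, turns = 0, False, 0
        while not done and turns < 20:
            if rng.random() < epsilon:  # explore
                action = rng.choice(ACTIONS)
            else:  # exploit the current estimate
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = simulated_user_step(state, action, rng)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state, turns = nxt, turns + 1
    # Greedy policy: best action for each state
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_SLOTS + 1)}


policy = learn_policy()
```

In this toy setting the learned policy keeps asking until both slots are filled and only then presents results; a strategy learned this way would still need to be evaluated with real users, as the abstract notes.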

This book is based on publications from the ISCA Tutorial and Research
Workshop on Multi-Modal Dialogue in Mobile Environments held at Kloster
Irsee, Germany, in 2002. The workshop covered various aspects of development
and evaluation of spoken multimodal dialogue systems and components
with particular emphasis on mobile environments, and discussed the state-of-the-art
within this area. On the development side the major aspects addressed
include speech recognition, dialogue management, multimodal output generation,
system architectures, full applications, and user interface issues.

Human face-to-face conversation is an ideal model for human-computer dialogue. One of the major features of face-to-face communication is its multiplicity of communication channels that act on multiple modalities. To realize a natural multimodal dialogue, it is necessary to study how humans perceive information and determine the information to which humans are sensitive. A face is an independent communication channel that conveys emotional and conversational signals, encoded as facial expressions.

We demonstrate a multimodal dialogue system using reinforcement learning for in-car scenarios, developed at Edinburgh University and Cambridge University for the TALK project. This prototype is the first “Information State Update” (ISU) dialogue system to exhibit reinforcement learning of dialogue strategies, and also has a fragmentary clarification feature. This paper describes the main components and functionality of the system, as well as the purposes and future use of the system, and surveys the research issues involved in its construction. ...

The system is an in-car multimodal dialogue system for an MP3 application. It is used as a testing environment for our research in natural, intuitive mixed-initiative interaction, with particular emphasis on multimodal output planning and realization aimed to produce output adapted to the context, including the driver’s attention state w.r.t. the primary driving task.

We describe how context-sensitive, user-tailored output is specified and produced in the COMIC multimodal dialogue system. At the conference, we will demonstrate the user-adapted features of the dialogue manager and text planner. ... three-dimensional walkthrough of the finished bathroom. We will focus on how context-sensitive, user-tailored output is generated in the third, guided-browsing phase of the interaction. Figure 2 shows a typical user request and response from COMIC in this phase.

Navigation in large, complex and multidimensional information spaces is still a challenging task. The search is even more difficult on small devices such as MP3 players, which have only a reduced screen and lack a proper keyboard. In the MIAMM project we have developed a multimodal dialogue system that uses speech, haptic interaction and advanced techniques for information visualization to allow natural and fast access to music databases on small-scale devices.

Multimodal dialogue systems allow users to input information in multiple modalities. These systems can handle simultaneous or sequential composite multimodal input. Different coordination schemes require such systems to capture, collect and integrate user input in different modalities, and then respond to a joint interpretation. We performed a study to understand the variability of input in multimodal dialogue systems and to evaluate methods to perform the collection of input information.
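One simple way to collect composite input before joint interpretation is a time-window scheme, sketched below. This is only an illustrative assumption, not the method evaluated in the study: the InputEvent type, the group_inputs function, and the 1.5-second max_gap threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    modality: str   # e.g. "speech" or "gesture"
    content: str    # recognizer output for this event
    start: float    # seconds
    end: float      # seconds


def group_inputs(events, max_gap=1.5):
    """Group events into composite inputs.

    An event joins the current group when it starts no later than
    max_gap seconds after the latest end time in that group. Overlapping
    (simultaneous) events yield a negative gap and are always grouped.
    """
    groups = []
    for ev in sorted(events, key=lambda e: e.start):
        if groups and ev.start - max(e.end for e in groups[-1]) <= max_gap:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


events = [
    InputEvent("speech", "put that there", 0.0, 1.2),
    InputEvent("gesture", "point:lamp", 0.5, 0.7),    # overlaps the speech
    InputEvent("gesture", "point:table", 2.0, 2.2),   # follows within the gap
    InputEvent("speech", "thanks", 6.0, 7.0),         # separate turn
]
groups = group_inputs(events)
```

Here the first three events form one composite input and the final utterance forms its own group. A fixed threshold is the simplest choice; the variability of user input that the study examines is exactly what makes choosing such a threshold hard.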

This paper addresses the issue of how linguistic feedback expressions, prosody and head gestures, i.e. head movements and facial expressions, relate to one another in a collection of eight video-recorded Danish map-task dialogues. The study shows that in these data, prosodic features and head gestures significantly improve automatic classification of dialogue act labels for linguistic expressions of feedback.

This paper describes Dico II+, an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogue and a “speech cursor” which enables menu navigation as well as browsing long lists using haptic input and spoken output.

We investigate the use of machine learning in combination with feature engineering techniques to explore human multimodal clarification strategies and the use of those strategies for dialogue systems. We learn from data collected in a Wizard-of-Oz study where different wizards could decide whether to ask a clarification request in a multimodal manner or else use speech alone. We show that there is a uniform strategy across wizards which is based on multiple features in the context. These are generic runtime features which can be implemented in dialogue systems. ...

The aim of this paper is to develop animated agents that can control multimodal instruction dialogues by monitoring users’ behaviors. First, this paper reports on our Wizard-of-Oz experiments, and then, using the collected corpus, proposes a probabilistic model of fine-grained timing dependencies among multimodal communication behaviors: speech, gestures, and mouse manipulations.

Automatic segmentation is important for making multimedia archives comprehensible, and for developing downstream information retrieval and extraction modules. In this study, we explore approaches that can segment multiparty conversational speech by integrating various knowledge sources (e.g., words, audio and video recordings, speaker intention and context). In particular, we evaluate the performance of a Maximum Entropy approach, and examine the effectiveness of multimodal features on the task of dialogue segmentation. ...

This paper describes MIMUS, a multimodal and multilingual dialogue system for the in-home scenario, which allows users to control some home devices by voice and/or clicks. Its design relies on Wizard of Oz experiments and is targeted at disabled users. MIMUS follows the Information State Update approach to dialogue management, and supports English, German and Spanish, with the possibility of changing language on-the-fly. MIMUS includes a gestures-enabled talking head which endows the system with a human-like personality. ...
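The Information State Update approach can be pictured as a shared state record plus an ordered list of update rules, each with a precondition and an effect. The sketch below is a minimal illustration under that reading only; the field names, rule names, move format, and device list are hypothetical and are not taken from MIMUS.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InformationState:
    language: str = "en"
    devices: dict = field(default_factory=lambda: {"lamp": "off", "tv": "off"})
    pending_device: Optional[str] = None  # device selected, action not yet given
    agenda: list = field(default_factory=list)  # system moves still to realize


def rule_switch_language(state, move):
    """Change the dialogue language when asked for a supported one."""
    if move.get("type") == "set_language" and move.get("value") in {"en", "de", "es"}:
        state.language = move["value"]
        state.agenda.append(("confirm_language", move["value"]))
        return True
    return False


def rule_select_device(state, move):
    """A click (or utterance) names a device without saying what to do."""
    if move.get("type") == "select" and move.get("device") in state.devices:
        state.pending_device = move["device"]
        state.agenda.append(("ask_action", move["device"]))
        return True
    return False


def rule_execute_command(state, move):
    """Run a command, taking the device from the move or from a prior selection."""
    if move.get("type") != "command":
        return False
    device = move.get("device") or state.pending_device
    action = move.get("action")
    if device in state.devices and action in {"on", "off"}:
        state.devices[device] = action
        state.pending_device = None
        state.agenda.append(("confirm", device, action))
        return True
    return False


RULES = [rule_switch_language, rule_select_device, rule_execute_command]


def update(state, move):
    """Apply the first rule whose precondition holds; otherwise ask to clarify."""
    for rule in RULES:
        if rule(state, move):
            return rule.__name__
    state.agenda.append(("clarify",))
    return None


state = InformationState()
update(state, {"type": "select", "device": "lamp", "modality": "click"})
update(state, {"type": "command", "action": "on", "modality": "voice"})
update(state, {"type": "set_language", "value": "es"})
```

The click followed by a spoken "on" shows how a state record lets input from different modalities combine across turns, and the last move switches language mid-dialogue.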

The three-tiered discourse representation defined in (Luperfoy, 1991) is applied to multimodal human-computer interface (HCI) dialogues. In the applied system the three tiers are (1) a linguistic analysis ... The third tier is the knowledge base (KB) that describes the belief system of one agent in the dialogue, namely, the backend system being interfaced to. Figure 1 diagrams a partitioning of the information available to a dialogue processing agent.

We explore the problem of resolving the second person English pronoun “you” in multi-party dialogue, using a combination of linguistic and visual features. First, we distinguish generic and referential uses, then we classify the referential uses as either plural or singular, and finally, for the latter cases, we identify the addressee. In our first set of experiments, the linguistic and visual features are derived from manual transcriptions and annotations, but in the second set, they are generated through entirely automatic means.
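The three-stage cascade described above can be sketched as a function that chains three pluggable classifiers. The toy classifiers below are hand-written stand-ins assumed purely for illustration (keyword cues for the linguistic side, gaze duration for the visual side); they are not the features or models used in the paper.

```python
def resolve_you(utterance, is_generic, is_plural, pick_addressee):
    """Cascade: generic vs. referential, then plural vs. singular, then addressee."""
    if is_generic(utterance):
        return ("generic", None)
    if is_plural(utterance):
        return ("plural", None)
    return ("singular", pick_addressee(utterance))


# Hypothetical stand-in classifiers; a real system would use trained models.
def toy_is_generic(utt):
    return "you know" in utt["text"].lower()


def toy_is_plural(utt):
    text = utt["text"].lower()
    return any(cue in text for cue in ("you guys", "you all", "both of you"))


def toy_pick_addressee(utt):
    # Visual cue: the participant the speaker looked at for longest.
    gaze = utt["gaze_seconds"]
    return max(gaze, key=gaze.get)


def classify(utt):
    return resolve_you(utt, toy_is_generic, toy_is_plural, toy_pick_addressee)


generic = classify({"text": "It's, you know, hard.", "gaze_seconds": {"B": 1.0}})
plural = classify({"text": "Can you guys send it?", "gaze_seconds": {"B": 0.5, "C": 0.6}})
singular = classify({"text": "Could you send the report?", "gaze_seconds": {"B": 0.4, "C": 1.9}})
```

Passing the classifiers in as arguments mirrors the two experimental conditions in the abstract: the same cascade can run on features from manual annotation or on automatically generated ones.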

This paper describes the NECA MNLG, a fully implemented Multimodal Natural Language Generation module. The MNLG is deployed as part of the NECA system which generates dialogues between animated agents. The generation module supports the seamless integration of full grammar rules, templates and canned text. The generator takes input which allows for the specification of syntactic, semantic and pragmatic constraints on the output.

Multimodal conversational spoken dialogues using physical and virtual agents provide a potential interface to motivate and support users in the domain of health and fitness. The paper presents a multimodal conversational Companion system focused on health and fitness, which has both a stationary and a mobile component.