The falling cost of simple computing combined with wireless networking means that it's likely there will be a very wide range of future devices that don't fit into today's neat categories such as "cellphone" or "home entertainment device". Furthermore, I expect that in the longer term user interfaces will become disassociated from individual devices, with multiple devices participating in a user experience. Consider, for example, use cases such as "follow me" video calls which switch between the mobile phone and the TV as I move around the house. Therefore specifications should be as open as possible to new devices and new multi-device interaction paradigms.

KATERINA PASTRA: 5, 5, 3, 4, 5, 3, 3, 5, 5, 5, 6

Simon Harper: 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5

This seems to be the wrong way of looking at things to me. I think that by asking these questions you pre-judge a use case / scenario. This kind of decision - by developers - is exactly the kind of thing that has created problems in the Disability community from the outset. Is multimodal interaction important? If the answer is yes, then we should make sure all devices can handle this at a rating of 5.

Jose ROUILLARD: 5, 4, 5, 3, 4, 3, 3, 3, 1, 2, 3

Healthcare (patient augmented room, smart night table ...)

Kostas Karpouzis: 5, 1, 4, 1, 5, 1, 4, 5, 4, 4, 6

Alex Pfalzgraf: 1, 2, 5, 4, 4, 4, 4, 4, 4, 5, 6

Norbert Reithinger: 1, 3, 5, 4, 5, 3, 4, 4, 1, 5, 6

Multimodal interaction will be most beneficial outside the traditional desktop, i.e. while on the go (mobile phone, car) and when controlling remote devices (home appliances). The success of multimodal interaction on the Wii (compared to the technically superior but traditional PS3) shows that gaming is also quite interesting for MMI.

Biometrics might include both current and new technologies such as EMG and expression recognition. Referring to my earlier comment, the user interface may (must) be able to integrate multiple clues to the user's intention. For example, if an accelerometer on a phone suggests I'm running and GPS shows I'm in an airport, then I'm probably running to catch a plane. This implies the use of certain UI modalities, e.g. voice interactions only. The questionnaire makes some implicit assumptions which might not hold during the next decade, e.g. that the UI is always intentional - the user specifies exactly what he/she wants to do - whereas UIs will likely become more deductive, assembling clues from multiple sensors. Also, referring to my earlier comment, there is no reason for a UI to be bound to a single interaction model for the duration of an interaction.
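A minimal sketch of this kind of deductive, clue-assembling interface, assuming hypothetical sensor readings, context labels, and function names (nothing below comes from any existing specification):

```python
# Illustrative only: a toy rule-based fusion of two sensor clues into a guessed
# user context, which then constrains the interaction modality. All names here
# (SensorClues, infer_context, choose_modality) are hypothetical.
from dataclasses import dataclass

@dataclass
class SensorClues:
    accelerometer_activity: str  # e.g. "still", "walking", "running"
    location_type: str           # e.g. "home", "office", "airport"

def infer_context(clues: SensorClues) -> str:
    """Combine independent clues into a single best-guess user context."""
    if clues.accelerometer_activity == "running" and clues.location_type == "airport":
        return "rushing_to_gate"
    if clues.accelerometer_activity == "still" and clues.location_type == "home":
        return "relaxed_at_home"
    return "unknown"

def choose_modality(context: str) -> str:
    """Map the deduced context to the least intrusive usable modality."""
    return {
        "rushing_to_gate": "voice_only",       # hands and eyes are busy
        "relaxed_at_home": "voice_plus_screen",
    }.get(context, "default_gui")

print(choose_modality(infer_context(SensorClues("running", "airport"))))  # voice_only
```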

KATERINA PASTRA: 5, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6

Simon Harper: 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5

I think there should not be a standard input and output method, but a standard method FOR input/output which can be used for all of the technologies above. It may be more difficult to implement now, but it will pay off in the future.

Combinations of any and all input technologies may be used, going beyond those suggested here. E.g. combining eye tracking and voice recognition so that when I say "stop" we can deduce that I'm probably instructing the device I'm looking towards. Another issue to be addressed is what happens when there are multiple devices / applications monitoring the same source of user input, e.g. watching my expression or listening to my voice. How do we disambiguate which input is routed to which system?
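A minimal sketch of the gaze-plus-voice disambiguation idea above, with hypothetical event handlers and device names; a real system would also need timing windows and confidence handling:

```python
# Illustrative sketch only: route a recognized voice command to the device the
# user is currently looking at. The event structure and device names are
# hypothetical.
from typing import Optional

class InputRouter:
    def __init__(self) -> None:
        self.gazed_device: Optional[str] = None

    def on_gaze(self, device_id: str) -> None:
        # Latest gaze fixation wins; a real system would smooth/filter this.
        self.gazed_device = device_id

    def on_voice_command(self, command: str) -> str:
        if self.gazed_device is None:
            return f"ignored '{command}': no gaze target"
        return f"sent '{command}' to {self.gazed_device}"

router = InputRouter()
router.on_gaze("tv")
print(router.on_voice_command("stop"))   # sent 'stop' to tv
router.on_gaze("music_player")
print(router.on_voice_command("pause"))  # sent 'pause' to music_player
```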

KATERINA PASTRA: 4, 3, 4, 6

Simon Harper: 6, 6, 6, 5

Non-conventional forms such as sign language, etc.

Jose ROUILLARD: 5, 3, 4, 3

Gesture + voice (Bolt like, SVG ?).

Kostas Karpouzis: 4, 5, 1, 6

Alex Pfalzgraf: 5, 2, 5, 5

Graphics + Speech (ASR/TTS) + Pointing Devices/Haptics + Sensor Input

Norbert Reithinger: 4, 1, 3, 5

Most interesting will be haptic devices (e.g. accelerometers in smartphones) and the semantics and interpretation of their signals. Voice is mature and will be successful within its current boundaries unless recognition quality improves in real-life environments.

The interplay of speech, gesture and haptics is a basic requirement for spontaneous and successful human interaction with devices of any kind, depending on scenario and content (multimedia manipulation, menu navigation).

On the client and also on the server (e.g. a simple ASR on the client and a powerful ASR on the server so that the device can function in a limited mode when not connected): 2, 4, 6, 4, 1

Distributed between client and server (e.g. extract speech features on the client, send to the server where powerful ASR does the rest): 2, 2, 4, 6, 3

Others (please specify below): 1, 16

Averages (all responders):

On the client: 3.82
On the server: 3.24
On the client and also on the server (e.g. a simple ASR on the client and a powerful ASR on the server so that the device can function in a limited mode when not connected): 3.76
Distributed between client and server (e.g. extract speech features on the client, send to the server where powerful ASR does the rest): 4.12
Others (please specify below): 5.94
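To make the "distributed between client and server" option concrete, here is a minimal sketch of the division of labour it describes; the function names and payloads are hypothetical, not taken from any real ASR library.

```python
# Hypothetical sketch of the "distributed between client and server" option:
# the client reduces raw audio to compact features, only the features cross
# the network, and the heavyweight recognizer runs server-side.

def client_extract_features(raw_audio: bytes) -> dict:
    # Placeholder for on-device feature extraction (e.g. MFCC frames).
    frame_size = 320  # illustrative number of bytes per analysis frame
    return {"num_frames": len(raw_audio) // frame_size, "features": []}

def server_recognize(features: dict) -> dict:
    # Placeholder for a large-vocabulary recognizer running on the server.
    return {"transcript": "call home", "confidence": 0.87,
            "frames_used": features["num_frames"]}

# In a real deployment the feature dict would be serialized and sent over
# HTTP or a WebSocket; here the "network hop" is just a direct function call.
result = server_recognize(client_extract_features(b"\x00" * 3200))
print(result["transcript"])  # -> call home
```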

Details

Ratings per responder, in the order: on the client; on the server; on the client and also on the server; distributed between client and server; others (please specify below).

Lawrence Catchpole: 5, 2, 3, 4, 6

Patrick Nepper: 5, 1, 1, 1, 6

hiromi honda

Nicholas Jones: 5, 5, 5, 5, 5

Given the huge variation in computing capability between the simplest clients (such as a sensor node with a single-pixel display) and the most complex (e.g. a PC), no assumptions can be made about UI distribution. Also, why can't we have the UI operating in a peer-to-peer mode distributed across multiple clients? This fits the model I mentioned earlier where many devices may co-operate in an interaction.

KATERINA PASTRA: 1, 5, 5, 5, 6

Simon Harper: 5, 1, 1, 1, 6

Jose ROUILLARD: 6, 6, 6, 6, 6

Kostas Karpouzis: 4, 2, 4, 6, 6

Alex Pfalzgraf: 2, 5, 4, 5, 6

Norbert Reithinger: 5, 1, 3, 2, 6

As the mobile client changes frequently (c.f. the model changes for smartphones), client-side solutions must be reusable and adhere to standards not tied to a specific vendor.

Jan Alexanderson: 6, 6, 4, 4, 6

Quan Nguyen: 5, 3, 3, 6, 6

Ali Choumane: 1, 1, 3, 5, 6

Garland Phillips: 5, 3, 5, 4, 6

Hirotaka Ueda: 3, 3, 4, 4, 6

Massimo Romanelli: 2, 4, 4, 5, 6

Daniel Sonntag: 1, 3, 4, 2, 6

I think that at this abstract level this is more a question of belief than a concrete consideration.

Details

Ratings per responder, in the order: flat file; ECMAScript/JavaScript; XML; EMMA; others (please specify below).

Lawrence Catchpole: 3, 3, 5, 2, 6

Patrick Nepper: 6, 6, 6, 6, 6

hiromi honda

Nicholas Jones: 5, 2, 5, 4, 5

This may depend on the granularity and timeliness of the information. E.g. some form of RPC might be best for real-time interactions. It may also depend on the capability of the participating devices; e.g. sensor nodes probably don't want to parse XML.

KATERINA PASTRA: 3, 3, 5, 4, 6

Simon Harper: 5, 1, 5, 6, 6

Jose ROUILLARD: 2, 1, 4, 4, 6

Kostas Karpouzis: 2, 1, 4, 5, 6

Alex Pfalzgraf: 1, 1, 5, 5, 6

Norbert Reithinger: 1, 1, 2, 5, 6

Prestructured XML schemata like EMMA help a lot in interoperability. The script based solutions lack clarity.
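As a small illustration of why a pre-structured schema helps interoperability, the sketch below reads a simplified, hand-written EMMA-style interpretation with Python's standard XML library; the markup is illustrative, not a normative EMMA sample.

```python
# Illustrative only: reading a small, hand-written EMMA-style result with the
# Python standard library. The document below is simplified for this sketch
# and is not copied from the EMMA specification.
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
doc = f"""<emma:emma xmlns:emma="{EMMA_NS}" version="1.0">
  <emma:interpretation id="int1" emma:confidence="0.82" emma:mode="voice">
    <command action="stop" target="tv"/>
  </emma:interpretation>
</emma:emma>"""

root = ET.fromstring(doc)
interp = root.find(f"{{{EMMA_NS}}}interpretation")
command = interp.find("command")
print(interp.get(f"{{{EMMA_NS}}}confidence"),
      command.get("action"),
      command.get("target"))
# -> 0.82 stop tv
```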

Please reconsider XHTML+Voice. I believe it was a very promising step forward!

hiromi honda

Nicholas Jones: 5, 4, 5, 3, 5, 5, 6

I haven't thought about this in detail, but we probably have a variety of different roles which may need different languages. E.g. orchestrating a UI involving multiple devices may need some high-level framework, while talking to a GPS needs a low-level API.

KATERINA PASTRA: 5, 5, 5, 3, 4, 5, 6

Simon Harper: 4, 6, 5, 4, 6, 5, 6

Jose ROUILLARD: 5, 4, 4, 4, 4, 3, 6

Kostas Karpouzis: 4, 5, 5, 3, 5, 3, 6

Alex Pfalzgraf: 5, 5, 4, 5, 5, 4, 6

Norbert Reithinger: 5, 5, 3, 3, 5, 3, 6

Jan Alexanderson: 2, 2, 2, 2, 4, 4, 6

Dialog management: most of these things, e.g. VoiceXML, are too limited to be really useful for natural and intuitive applications. All of the things mentioned lack a discourse memory, which is vital to our applications.
Synchronization is relevant, but not using VoiceXML (I don't understand that, btw).
Data extraction is irrelevant.

Details

(Location-based) Mobile access to the web, accessibility for disabled people and the elderly.

hiromi honda

Nicholas Jones

Everything?

KATERINA PASTRA

Multimodal fusion for indexing, retrieval and summarization

Simon Harper

Use cases are misleading IMO, because they pre-judge the user and are filtered through technologists. In real life the number of use cases is so extreme that designing with this in mind is the only real solution.

Jose ROUILLARD

Kostas Karpouzis

Alex Pfalzgraf

Web-based services via smartphone (esp. location-based services)
Combination of home appliance control with web-based services
Combination of car control with web-based services (Car2X) and sensor input

Norbert Reithinger

Use cases that will dominate the multimodal area will address mobile users that are not tied to a big screen or other stationary devices.

Jan Alexanderson

Quan Nguyen

Ali Choumane

Garland Phillips

Hirotaka Ueda

Various devices around the user coordinate with each other and provide various services according to the user's context.

Massimo Romanelli

Interaction with smartphones, multimedia manipulation

Daniel Sonntag

Different end devices run by the same dialogue engine.

Yoshitaka Tokusho

The personal computing area is most important because we spend most of our working time in front of a personal computer, working on the computer display.

Details

Ratings per responder for "ability to plug-in/reuse the modality components":

Lawrence Catchpole: 5

Patrick Nepper: 5

We have already created some prototypes using VoiceXML and XHTML+Voice, respectively. Besides some minor problems with X+V, we are satisfied with their simplicity and degree of abstraction (e.g. as opposed to EMMA).

hiromi honda

Nicholas Jones: 6

KATERINA PASTRA: 5

Simon Harper: 1

Jose ROUILLARD: 5

Kostas Karpouzis: 5

Alex Pfalzgraf: 4

Norbert Reithinger: 5

I was involved in multiple multimodal projects and (partial) reusability is an absolute must. See also my remark above about model changes.

Please feel free to add any comments, justifications, and additional responses that you feel are important.

Details

Responder

General comments

Lawrence Catchpole

Patrick Nepper

Again, I would like to see a revitalization of XHTML+Voice. Both XHTML and VoiceXML have already proven their advantages for their respective modalities. Combining both to build a multimodal specification language seems very promising to me (as far as our prototype implementations are concerned).

hiromi honda

Nicholas Jones

I think maybe there needs to be an explicit statement about how UI design is expected to evolve so we can discuss the target. I suspect we have at least three stages: (1) a simple UI (like a GUI); (2) a composite UI involving lots of input/output technologies; (3) an environmental UI involving lots of separate devices in an intelligent environment.

KATERINA PASTRA

Simon Harper

Jose ROUILLARD

Kostas Karpouzis

Alex Pfalzgraf

Norbert Reithinger

Excuse any typos and glitches from a non-native speaker :-)

Jan Alexanderson

Quan Nguyen

Ali Choumane

Garland Phillips

Hirotaka Ueda

Massimo Romanelli

Daniel Sonntag

I think the form is well-structured, but more concrete scenarios should be considered. Different integration environments and characteristics (e.g. time, effort, and skills) naturally result in different standardisation requirements.