The other position papers represented a very broad range of interests and application areas for voice-user interfaces (VUIs), but there were many areas of overlap. One of the most interesting discussions at the workshop, for me, concerned the extent to which it is reasonable to suggest that users can just ‘speak naturally’ rather than learning the syntax and vocabulary of utterances that can be easily understood by a particular VUI. This is something that many guides to designing for VUIs are quite insistent about – there is often a clear suggestion that you should not try to change the way users speak, or teach them how to speak to your VUI (Google explicitly say: “Give users credit. People know how to talk. Don’t put words in their mouth.”). The idea is that users should think of it as a natural conversation rather than perceiving the exchange as inputs and outputs to a system.

Many of those at the workshop had expertise in conversation analysis, and were not particularly convinced that it is accurate (or even helpful) to view interactions with VUIs as genuinely conversational.

The team from Nottingham are studying use of VUIs in everyday life. Martin Porcheron is particularly interested in multi-party interactions, and presented a paper at CHI on a study of Alexa use in real homes, using ethnomethodology and conversation analysis. This work highlights some fundamental challenges with VUIs, and Martin’s co-authors discussed some of the ways in which interaction is currently very limited. Stuart Reeves pointed out that categorisation by Alexa (and similar) is quite restrictive. For example, despite aspirations to provide a genuinely conversational experience, the system decides early on whether an utterance is a ‘request’ or a ‘question’ and then sticks to that interpretation. Joel Fischer invited us to consider what is missed by the system, and pointed to the potential for VUIs to be more sensitive to hesitations, elongations and pauses.

Alex Taylor, from City, University of London, further highlighted the extent to which language is rich with other things besides the written word. He is interested in how we shape our talk to be recognised, how we talk differently to VUIs, and how indexicality might work with VUIs.

Despite reservations about the extent to which such interfaces can be genuinely conversational, the workshop organisers were generally in agreement that understanding how conversation works is important for designing VUX.

This does not mean, however, that users who are adept conversationalists already know how to interact with VUIs. One of the most obvious challenges for VUIs is the lack of visibility. Users face huge difficulties in knowing what they can say that is likely to be understood.

Jofish Kaye, from Mozilla, pointed out the many different forms of conversation that exist, and suggested that it may be important to consider the need for the design of a specialist voice programming language. He also made the point that if we are looking for expert users of VUIs, it might be wise to seek input from visually impaired users.

In the CONVER-SE project, we are investigating a variety of approaches for supporting end-user programming interactions with VUIs, including scaffolding, modelling, elicitation, visual prompts, and harnessing natural tendencies towards conversational alignment.

For these more challenging types of interactions, at least, it is clear to us that it will be necessary to teach users how they can be understood. This is by no means the same as insisting that users adopt unnatural ways of speaking – our first study is entirely focused on understanding natural expression in situ and using this to design VUI support for end-user programming activities. However, it would be quite naïve to believe that this removes the need for the interface to support users in learning how they can interact with it. The data we looked at during the workshops showed that even for simpler applications, such as playing a game or searching for a recipe, users cannot simply ‘speak naturally’ and expect to be understood. Improving recognition can help, but we also need to develop better ways of revealing to users what kinds of utterances VUIs can understand and act on.