81

Exploring the Voice User Interface

December 11, 2014

Episode Summary

In this episode of The Digital Life, we discuss the rise of voice recognition and voice user interfaces. More so than any other interface type, the VUI has the potential to be seamless and “magical”. We’ve all used Siri, and there are plenty of VUIs for car dashboards, but what’s coming down the road? From a first take on the new Amazon Echo to the future possibilities of voice services, we share our thoughts on the landscape of speech recognition.

Jon:

Welcome to Episode 81 of The Digital Life, a show about our adventures in the world of design and technology. I’m your host, Jon Follett, and with me is founder and co-host, Dirk Knemeyer.

Dirk:

Hey, Jon. What are we going to be talking about today?

Jon:

A good one, Dirk. I thought we could talk about the rise of voice recognition and voice user interfaces, which are really starting to come to the fore in technology products this year. I think more so than a lot of other interface types, the VUI has the potential to be this seamless interface between man and machine. I think we’re so used to the idea of touch screens now or point-and-click interfaces that we basically grew up on or you can have the controllers for your gaming systems or if you’re a coder or just an old-school computer user, maybe you’re used to the command line interface. All of these interfaces are a slight hurdle to get the information from your head into this machine, into the computer.

I think more so than any of these other types, the voice user interface seems natural, seems like what you should be doing when you’re trying to convey information. Certainly, it’s what we do with each other every day. We convey information from person to person all the time just using our voice. We do all kinds of business transactions that way. We attend conferences where we learn things or go to school. Pretty much every transaction in our life has some voice element to it, with the exception of some of these other user interfaces that we use. For me, the idea that voice is going to become at least one of the next frontiers for user experience design, I find really exciting. What are your thoughts, Dirk?

Dirk:

Oh, I don’t find it exciting at all. Yeah. I can’t jump on the happy, happy, joy, joy train with you, my friend. I’ve got two issues with it. One is that the promise of it just is never realized. I am an Apple user for the most part and I have Siri, and, I don’t know, I find that Siri screws up half-ish of the time. If I say, “John Smith,” Siri gets it. If there’s any kind of nuance or longer, more complicated stuff, it’s a shit show. I don’t even really bother with Siri anymore unless I’m driving, there’s no way I can drive and text and I’m really under some duress to have to try voice. There’s just too many fails and this is pretty close to the latest and greatest in voice recognition technology. The inability to recognize my voice, recognize my intent, and to garble it into some crap really is poor.

Then sticking with Siri as the example, the software behind it isn’t necessarily great either. Something like “search for the nearest Starbucks,” the software then gives me a text menu where I have to look and look at street names of different places, which is totally worthless to me. If I’m asking for the nearest something, there is implicitly an ignorance to street names and to place and to specifics. I just want the damn thing to start taking me where I want to go. That’s not the fault of the voice recognition, but it’s the incompleteness of the software in and around the voice recognition that just makes it nonsense. It seems very logical. It seems like a great idea. It seems like it should work, but it sure as hell doesn’t for me. That’s my first big problem and objection with it. Let’s talk about that, and then I’ll circle back to my second.

Jon:

Yeah. I think the breadth of what Siri can do is probably a detriment to that VUI. What I mean by that is if the product was purely for Siri driving, so that’s what the product is and all they do is they hone that so when you say, “Siri, where is the nearest Starbucks?” the system is optimized to say, “Oh, go and take a left here and it’s on Main Street.” I think the systems are fully capable technologically of doing that, but the mistake, I think, is that it’s the opposite of the design tack that Apple took when it created the iPod. The iPod solves a very specific problem and does it so beautifully that you don’t even really notice that it’s the design integration. It just becomes part of you.

Whereas with Siri, there’s all these fails everywhere because it’s certainly capable of providing you with some information, but, instead, the use cases are so broad that they can’t do any one of them well. I think the driving one is a perfect example of that because certainly there are voice user interfaces that are being implemented in cars that are much better than Siri. It’s because it’s this general use kind of mobile, first out, they want to be first to market. I think that’s probably a botched rollout there. I agree that there’s no reason why the nearest Starbucks should necessarily be a text list when you probably have to stop your car first to look at it. I agree with you on that point.

Dirk:

Yeah, I think we’re on the same page there. What about the part that the voice recognition is often garbling what I’m saying if there’s any length to it, if there’s any complexity? Oftentimes, street names do have weird complexity or, again, most of my use cases for Siri are in the car. Ethnic restaurants can have some complexity relative to the oldie standard English. I think that’s another huge problem that the voice recognition is so questionable.

Jon:

Yeah. I don’t know the level of growth that the technology has in terms of understanding and learning what you’re saying over time. I imagine that that is probably more of an engineering and a database reference problem where there’s just not enough reference audio material, not that that couldn’t be solved. I think a better, more honed use case for Siri in a driving scenario would probably take those restaurant names or the inflection of your voice, perhaps take that a little bit more seriously in the design case there and come up with a better solution.

It’s one of those things where I feel like I’m watching the rollout of the mp3 player again. I can’t tell you how many of those I bought early on where it was a disaster. I tried really hard to make the technology work for me, and sometimes it did and sometimes it didn’t, because I had a lot of interest in making mp3s work both for entertainment and for incorporating into one of the bands I was working with at the time. I’d just have some background music for our tracks. I really wanted those to work, and I spent lots of money on early mp3 players, and they were all awful.

When the iPod came along, I was just, oh, geez, another one of these things? This is going to be terrible; I’m never going to get it. I was a late adopter of the iPod, really late. Basically, all the members of my family got them before I did because I was so burned by basically funding the R&D for that product. On my wish list, people would say what’s an mp3 player? I’d say, oh, here’s the Amazon link or whatever, go and get me this. Then when the iPod came out, I was just like, you guys are fools, but it turned out to be the killer product for that category. My guess is that these early misses are probably just that. They’re early in the cycle and they’re not honed, so they’re going to make early adopters frustrated as all heck.

Dirk:

Yeah. I think that’s a really good analysis. Let’s assume that it’s solved really well and they figure it all out. That’s going to lead into my second major objection, which is I think that the use cases where voice is appropriate in the real world are relatively limited. I don’t want people walking around in public spaces talking to the device in their pocket. Right? I don’t want people at the next cubicle talking to activate their work station. I don’t want my family members in the same room talking at devices.

To me, the use cases where voice is appropriate, where voice seamlessly and in a friendly way fits into environments are environments where we’re alone, in our car or in an office that has some noise protection and privacy to it. I just think the use cases are that limited. The intrusiveness of voice on other people is, for me, a real problem. Even though I can see in certain cases voice can be really nice and useful and convenient, I don’t see it as this big game-changing silver bullet. I see it as something that is either a lot more niche, or if it becomes this huge paradigm controlling user interface, boy, I think the world will be a lot worse for it.

Jon:

I hear you on that, but let’s look at it from a slightly different angle. Let’s look at this voice interface as your personal assistant. Let’s look at it as a person who can do things for you. Let’s say I was your personal assistant, and in the morning you’d say, “Hey, Jon, can you turn on that computer for me and call up my e-mail.” Then I would go and do that. You wouldn’t find that interaction to be awful. In fact, it might be helpful to you because maybe you’re holding a cup of coffee and a Danish. Right?

Dirk:

Sure, sure, but you’ve reduced it, though, from a technology to an app. Right? We’ve just gone from, oh, this voice is going to be such a game-changing UI platform to, oh, a personal assistant app that’s voice activated sure makes a lot of sense. I agree, it does make a lot of sense.

Jon:

Yeah. What I’m saying more is if you look at voice as the way of looking at your computer as more of a person instead of as a … I was just using personal assistant as a type of work that a person could do for you. There certainly are lots of other types of tasks that this computerized person could do for you. I was just raising that in contrast to the idea that you’re … Right now, if you had a room full of people who were all talking to their personal assistants, hey, that sounds chatty, right? But I know that there are lots of environments in work spaces where … For instance, the reporting environment where you have a bullpen of reporters, there’s lots of people talking to lots of other people on the phone all the time and those environments work fairly well.

Dirk:

They work, but they’re not friendly environments. They’re not environments that you or I would want to spend our day in. We would be complaining to our wives and not really happy with our jobs if we were there. That’s not for everyone.

Jon:

Sure. Then, hopefully, for people like us who might like some quiet in a space where there isn’t as much chatter, then the voice user interface would operate equally well in a less chatty environment. All I’m crafting here is this idea that there’s a more humane way to work with this computer interface, in a more personable way, excepting that I’m going to be speaking to it rather than typing to it.

Dirk:

Yeah, I hear you. I think there’s validity to what you’re talking about, but if we look at how people use their devices in public spaces or around other people today, it’s generally in unfriendly ways. We will, as we’re going down the street, just like we could tip-tip-tap with our fingers currently, we will be talking to our device and ignoring or not caring about the strangers who are inhabiting spaces around us. Computing devices already, even just in the head-down, tap-tap-tap, are unfriendly. The more that we move into voice, it will be unfriendly as well. People have proven that they’ll treat strangers, they’ll treat people in public spaces and environments with them like flotsam and jetsam and won’t have concern and consideration. I don’t know, we’ll see, but I think it’s going to be yet another hit against civility and community and treating others well.

Jon:

Okay, yeah. There is definitely a lot of validity to what you’re saying as well. I agree that it could be very awkward scenarios where everyone is talking to their mobile device. One example, and part of the reason why I’ve been enjoying at least one of the voice user interfaces that’s out on the market now, is the Amazon Echo, which I just got in the mail this weekend. I was playing with that, and that has very, very limited use cases, but from my perspective, they are quite fun in terms of Amazon Echo.

You call it Alexa, so in terms of Alexa finding the music that I want to play and telling me things like what the weather is or what the headlines are. I noticed with my young son, he’s really taken with that interface because one of the things that Alexa can do is tell him jokes. He’s just endlessly entertained by that particular feature of the product. I think watching him interact with it, what it made me realize was that our voices and the words we say, we are really creating our own reality as we speak. Right?

Dirk:

Yeah.

Jon:

When I’m speaking to you, I can create a positive reality; I can talk about objects that are not present with us; I can talk about the future. We’re creating something as we’re having this discussion. It goes back, I think, even farther to the idea that there’s magic and power in words. If you’re into sci-fi and fantasy or even if you’re not, going back in human history, the incantation has an awful lot of power in it, special words that you say that create reality. I think somewhere in there, sort of mapping that experience where I saw my son’s delight as he’s asking Alexa to tell him more jokes and just won’t stop. Right?

Dirk:

Mm-hmm (affirmative).

Jon:

When we got in the car this morning, he said, “Dad, I didn’t talk to Alexa this morning.” Like Alexa was a person.

Dirk:

Hmm. Yeah.

Jon:

From that, my imagination gets going and I start thinking what parts of reality can we make happen with the voice user interface? That, I think, is the promise ultimately of the technology, notwithstanding all of the perhaps horrible instantiations of voice user interfaces that we will without a doubt experience until we get there. I think there’s going to be … This is probably a poor example, but when you say, “Hey, house, boy it would be really wonderful when I got home if I could have some Chinese food delivered,” and then that’s just done. That’s not as magical as an incantation that summons a new reality, but that’s not bad. I think there’s an awful lot of promise there, and I think that promise comes not so much from the technology where it is now, but the promise of what we can do as human beings with our words.

Dirk:

Yeah. Who knows, right? I could be completely wrong, but we get to find out if it’s a positive experience.

Jon:

Listeners, remember that while you’re listening to the show, you can follow along with the things we’re mentioning here in real time. Just head over to thedigitalife.com, that’s just one “l” in thedigitalife, and go to the page for this episode. We’ve included links to pretty much everything mentioned by everybody, so it’s a rich information resource to take advantage of while you’re listening or afterward if you’re trying to remember something that you liked.

If you want to follow us outside of the show, you can follow me on Twitter @JonFollett, that’s J-O-N F-O-L-L-E-T-T. Of course, this whole show is brought to you by Involution Studios, which you can check out at goinvo.com. That’s G-O-I-N-V-O.com. Dirk?

Dirk:

You can follow me on Twitter @DKnemeyer, that’s @D-K-N-E-M-E-Y-E-R or e-mail me, Dirk@goinvo.com.

Jon:

That’s it for Episode 81 of The Digital Life. For Dirk Knemeyer, I’m Jon Follett, and we’ll see you next time.