Thinking Out Loud: Understanding Voice UI, and How To Build for It

At work, we talk a lot about ‘voice’: what is it good for? Is it the post-mobile platform? And our clients ask us a lot about ‘voice’, and how to build a branded app. But I’m not sure everyone is talking about the same thing, and I’m just as unsure that anyone knows what makes a really good branded ‘voice’ app. I mean, I’m fairly sure I don’t.

This arti­cle is my attempt at defin­ing what we’re talk­ing about when we talk about ‘voice’; and, based on my expe­ri­ence as a user and devel­op­er of ‘voice’, try­ing to nail down some of the oppor­tu­ni­ties for brand­ed third-par­ty apps.

Defining ‘voice’

I think what we mean when we talk about ‘voice’ is: interaction with a digital assistant (Alexa, Assistant, Cortana, Siri, etc.) through an interface that converts voice to text. Sometimes there’s a screen that’s interactive (Assistant on Android), sometimes a screen that’s mostly passive (Echo Show), but most often there’s no screen at all (Home Mini, Echo Dot, etc.).

To be even more pre­cise, I sus­pect that we’re most­ly talk­ing about Alexa and Assistant—the two biggest play­ers in the mar­ket that are exten­si­ble through a devel­op­ment plat­form. A voice app, in this case, is a web app that man­i­fests itself not through a brows­er, but through Alexa or Assis­tant com­pat­i­ble devices: Alexa calls them Skills, and Assis­tant, Actions.

I also think that the sin­gle ‘voice’ label is slight­ly con­fus­ing, because real­ly it’s not one thing; there are prob­a­bly four com­po­nents to it:

Input: mes­sage tran­scrip­tion, such as com­pos­ing texts.

Com­mand: when we expect an action to be exe­cut­ed: ‘call my wife’, ‘play Stranger Things’.

Enquiry: when we expect an answer: ‘will I need an umbrel­la tomor­row?’, ‘how do I start exer­cis­ing?’

Con­ver­sa­tion: a mul­ti-step engage­ment with a dig­i­tal assis­tant appli­ca­tion—either free nat­ur­al lan­guage, or guid­ed through prompts.

Brand Opportunities

Each of the components of voice offers some opportunities to businesses, although Input perhaps offers the least, as it largely takes place at the OS level, in the input boxes of messaging apps, for example. Unless the service on offer is for taking notes or making lists, there’s probably not much scope here.

Con­ver­sa­tion is, I think, the area that most brands think of when they con­sid­er a voice app, because it offers the longest engage­ment (‘dwell time’, in the stan­dard met­rics). But it’s unproven that users want to spend more than a hand­ful of sec­onds at a time talk­ing to a voice-only device; per­son­al­ly, I def­i­nite­ly don’t want to. I think too many brand­ed voice apps are try­ing to max­imise engage­ment and, as a result, miss­ing out on what makes voice dif­fer­ent from, say, the web.

This com­po­nent also has oth­er draw­backs: the more ‘nat­ur­al’ and com­pre­hen­sive the con­ver­sa­tion aims to be, the more resource-inten­sive it becomes to build; and, while respons­es on voice devices with screens are easy to skim, voice-only respons­es can be an inef­fi­cient deliv­ery method for infor­ma­tion.

In my opin­ion, Enquiry is prob­a­bly the area that offers most val­ue to cus­tomers; as the name implies, dig­i­tal assis­tants are there to help peo­ple get to some­thing (usu­al­ly an answer) quick­ly, with as lit­tle fric­tion in the process as pos­si­ble. There might be great val­ue in, for exam­ple, ‘ask Brand X to rec­om­mend a thing for me’ and get­ting back an appro­pri­ate answer; although dwell time would be low, sat­is­fac­tion would be high.

There are, perhaps, two big challenges in this approach. First, knowing enough about the customer and their intent to give them a response that satisfies them. Second, competing against what the digital assistant already knows; if a service provides, say, cocktail recipes, it would need to be distinctive enough to offer more than “OK Google, how do I make a martini?”.

Finally, Command is similarly interesting for brands, but probably only useful if you offer some kind of media, product, or service endpoints: ‘ask Brand X to tell me my points balance’, or ‘tell Brand X to cancel my last order’ would be short but satisfying interactions.

Discovery

One drawback of brand apps on digital assistants is that you have to include the invocation phrase in your request; ‘play some soothing sounds’ might be memorable, but ‘ask Brand X to play some soothing sounds’ is harder to remember. This is a known and continuing problem in getting people to discover and, especially, make repeated use of a voice app.

On Alexa and Assis­tant, each voice app is reg­is­tered with a hand­ful of explic­it invo­ca­tions (‘talk to Brand X’; ‘ask Brand X how to start run­ning’) so that peo­ple can invoke them direct­ly if they know the right phrase; this is usu­al­ly what brands will pro­mote to dri­ve traf­fic.

But, know­ing that dis­cov­ery on voice plat­forms tends to be intent-based, Assis­tant also offers implic­it invo­ca­tions, such as ‘how to start run­ning’; so when a user asks for ‘advice on how to start run­ning’, Assistant’s rec­om­men­da­tion algo­rithm could match their intent with a brand­ed app. (If you think this sounds like key­word adver­tis­ing, I agree; and I expect it to be mon­e­tised that way in the future.)
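To make the difference between the two kinds of invocation concrete, here is a minimal sketch in Python. It is not how Alexa or Assistant actually resolve requests; the VoiceApp class, the resolve function, and the simple prefix and substring matching are all my own assumptions, chosen only to illustrate the idea that an explicit invocation names the app while an implicit one is matched on intent.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VoiceApp:
    """Hypothetical registration record for a branded voice app."""
    name: str
    # Intent phrases the platform may match when the app is not named.
    implicit_phrases: list[str] = field(default_factory=list)


# Assumed trigger words that precede the app name in an explicit invocation.
EXPLICIT_PREFIXES = ("talk to ", "ask ", "tell ")


def resolve(utterance: str, apps: list[VoiceApp]) -> Optional[tuple[str, str]]:
    """Return (app name, match type) for an utterance, or None if nothing matches."""
    text = utterance.lower().strip()

    # Explicit invocation: the user names the app directly.
    for app in apps:
        if any(text.startswith(prefix + app.name.lower()) for prefix in EXPLICIT_PREFIXES):
            return app.name, "explicit"

    # Implicit invocation: the app is not named, so match on the intent phrase.
    for app in apps:
        if any(phrase.lower() in text for phrase in app.implicit_phrases):
            return app.name, "implicit"

    return None


if __name__ == "__main__":
    apps = [VoiceApp("Brand X", ["how to start running"])]
    print(resolve("Ask Brand X how to start running", apps))
    print(resolve("advice on how to start running", apps))
```

A real platform would rank several candidate apps rather than return the first match, which is exactly where the keyword-advertising comparison comes in.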

The Right Approach?

My hunch is that the right approach for most brands right now would be a combination of Enquiry and Conversation, heavy on the former and light on the latter. That means considering the intents that users might have towards the brand, optimising for discovery around those, and trying to answer them in the best way. Responses would be personalised to deliver optimised results, with conversation used to disambiguate or steer back towards the intents, but the essential aim is to deliver a response and get out of the way as much as possible.
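As a rough illustration of ‘heavy on Enquiry, light on Conversation’, the sketch below answers in a single turn when the intent is clear and only asks a follow-up question when two intents tie. Everything in it is hypothetical: the intent names, keywords, and canned answers are placeholders I made up, and keyword counting stands in for whatever natural language understanding a real platform provides.

```python
from __future__ import annotations

# Hypothetical intents for an imaginary brand; all values are placeholders.
INTENTS = {
    "recommendation": {
        "keywords": {"recommend", "suggest"},
        "label": "a recommendation",
        "answer": "Here is something I think you would like.",
    },
    "points_balance": {
        "keywords": {"points", "balance"},
        "label": "your points balance",
        "answer": "Here is your points balance.",
    },
}


def respond(utterance: str) -> dict:
    """Answer in one turn if possible; otherwise ask a single clarifying question."""
    words = set(utterance.lower().replace("?", "").split())
    # Score each intent by how many of its keywords appear in the utterance.
    scores = {name: len(words & spec["keywords"]) for name, spec in INTENTS.items()}
    best = max(scores.values())

    if best == 0:
        # Nothing matched: say so and get out of the way.
        return {"speech": "Sorry, I can't help with that.", "end_session": True}

    matched = [name for name, score in scores.items() if score == best]
    if len(matched) == 1:
        # Clear intent: deliver the response and end the session.
        return {"speech": INTENTS[matched[0]]["answer"], "end_session": True}

    # Tie: use conversation only to disambiguate, keeping the session open.
    options = " or ".join(INTENTS[name]["label"] for name in matched)
    return {"speech": f"Did you want {options}?", "end_session": False}


if __name__ == "__main__":
    print(respond("what is my points balance"))
    print(respond("recommend something for my points"))
```

The design choice that matters is the end_session flag: the session stays open only in the one case where a follow-up question is genuinely needed.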

This would be affected by a lot of variables, and I’d prefer it if my recommendation were based on data rather than experience and intuition, so I’ll be looking for opportunities to measure and test my hypotheses.