The Inevitable Transition to Voice Tech That Just Works

Siri. Alexa. Cortana. These are the voice technologies most of us think of first: anthropomorphic agents trapped in tin cans. But if conversational AI was the framework that brought us these music-playing assistants, then auditory AI will be the framework that enables voice technologies to generate real value. The next generation of products will be better listeners, fading into the background and proactively presenting us with feedback, notes, and analytics.

Put another way, I’m talking about the difference between active and passive products. Active products have agency and identity, and are the most familiar to us. Gmail is an active product. We intentionally seek it out as a tab in our browser and directly interact with it to accomplish our goals.

Passive products, on the other hand, are the introverts of the technology world. They’re hardly noticeable until they offer up an output. Collision avoidance systems in cars are passive products. We don’t say good morning to our backup sensor on our drive to work; in fact, we don’t even have to realize the feature exists to get value from it.

From a UX perspective, active products require an interface, while passive products do not. Historically, with digital interfaces largely limited to pixels on a screen, interacting with active products hasn’t felt overly awkward or burdensome because we’re already familiar with the input interface of buttons in the physical world.

But with machine learning’s ability to predict intent, computing is no longer confined to digital displays. With voice emerging as a key method of human-computer interaction, many early product designers simply tried to apply the active product paradigm to this new medium. Unfortunately, this doesn’t usually work in practice. The uncanny valley notwithstanding, giving voice products agency and memorable monikers only burdens users with a cumbersome, unnatural user experience.

It’s difficult to find a mode of interaction for voice other than some variant of an artificial agent; talking to a wall somehow seems worse. The only way around the conundrum is to remove the interaction altogether. Our future voice assistants might better be called hearing aids, because they deliver the most value when they are left to listen, showing their presence only when they have something to contribute.

Companies like Chorus.ai and Cogito, addressing sales and customer support respectively, are relatively early movers in this emerging space. Both products listen in on phone calls for keywords and intonations, both to coach humans to perform better and to collect critical analytics for evaluation at the team level. The tech stack for companies in this rapidly expanding niche is very similar to that of traditional voice companies, but with less emphasis on speech synthesis (see Lyrebird).
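
To make the passive pattern concrete, here is a minimal sketch in Python of a listener that stays silent until it has something to contribute. It assumes the call has already been transcribed by some speech recognition system. The cue words, the coaching tips, and names such as COACHING_CUES, Utterance, and listen are hypothetical, invented for illustration. This is not how Chorus.ai or Cogito actually work, and it ignores intonation entirely.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical cue words and coaching tips, invented for illustration only.
COACHING_CUES = {
    "price": "Pricing came up; consider restating the value first.",
    "cancel": "Possible churn risk; acknowledge the concern before offering options.",
}


@dataclass
class Utterance:
    speaker: str
    text: str


def listen(utterances):
    """Scan transcribed utterances and speak up only when a cue word appears."""
    tips = []           # (speaker, tip) pairs surfaced to the human on the call
    counts = Counter()  # per-cue totals, usable for team-level analytics
    for u in utterances:
        # Normalize each word: strip trailing punctuation and lowercase it.
        words = {w.strip(".,!?").lower() for w in u.text.split()}
        for cue, tip in COACHING_CUES.items():
            if cue in words:
                counts[cue] += 1
                tips.append((u.speaker, tip))
    return tips, counts


call = [
    Utterance("customer", "Hi, I have a question about my account."),
    Utterance("customer", "Honestly the price is too high, I might cancel."),
]
tips, counts = listen(call)
for speaker, tip in tips:
    print(f"[{speaker}] {tip}")
```

The first utterance produces no output at all, which is the point: the product is invisible until a cue appears, and the counts accumulate quietly in the background for later review.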

A number of founders recently joined us at a dinner BSV hosted on the future of voice in AI. Despite many attendees running competing companies, a majority agreed that passive and vertical is the way to go, at least where uttering a query to an ASR system isn’t already a natural action. Roxy, a digital hotel concierge, is a good example of that exception. We’re used to picking up the phone to request things for our hotel room, so agency isn’t an outlandish concept there.

This isn’t to say that there’s no room for products like Alexa. But it seems that even consumer-focused tech giants are gearing up for a pivot to passive. If Google’s Duplex circus showed us anything, it’s that passive voice products are inevitable once the technical and ethical kinks are worked out.

For many, this may feel a bit too Big Brother for comfort, and that’s OK. Over time we will iron out the right consumer best practices for such products. But in the meantime, there are plentiful opportunities in dozens of enterprise verticals, and an exciting generation of companies working to capture them.