We’re on the Brink of a Revolution in Crazy-Smart Digital Assistants

Francesco Muzzi

Here’s a quick story you’ve probably heard before, followed by one you probably haven’t. In 1979 a young Steve Jobs paid a visit to Xerox PARC, the legendary R&D lab in Palo Alto, California, and witnessed a demonstration of something now called the graphical user interface. An engineer from PARC used a prototype mouse to navigate a computer screen studded with icons, drop-down menus, and “windows” that overlapped each other like sheets of paper on a desktop. It was unlike anything Jobs had seen before, and he was beside himself. “Within 10 minutes,” he would later say, “it was so obvious that every computer would work this way someday.”

As legend has it, Jobs raced back to Apple and commanded a team to set about replicating and improving on what he had just seen at PARC. And with that, personal computing sprinted off in the direction it has been traveling for the past 40 years, from the first Macintosh all the way up to the iPhone. This visual mode of computing ended the tyranny of the command line—the demanding, text-heavy interface that was dominant at the time—and brought us into a world where vastly more people could use computers. They could just point, click, and drag.

In the not-so-distant future, though, we may look back at this as the wrong PARC-related creation myth to get excited about. At the time of Jobs’ visit, a separate team at PARC was working on a completely different model of human-computer interaction, today called the conversational user interface. These scientists envisioned a world, probably decades away, in which computers would be so powerful that requiring users to memorize a special set of commands or workflows for each action and device would be impractical. They imagined that we would instead work collaboratively with our computers, engaging in a running back-and-forth dialog to get things done. The interface would be ordinary human language.

Pipe Down, Jarvis

For decades, the talking tech in movies has eclipsed anything we’ve been able to build in the real world. That’s finally starting to change.

HAL 9000 from 2001: A Space Odyssey | HAL, the psychotic AI with an FM-DJ voice, is able to control every last detail of a mission to Jupiter.

KITT from Knight Rider | Michael Knight’s in-dash AI partner is sarcastic, indestructible, and always ready to get Knight out of a jam.

Jarvis from Iron Man | You never see Jarvis, but his diagnostics, worried nagging, and instant calculations are crucial to Iron Man’s superheroness.

Samantha from Her | She starts by reading his email—and eventually becomes much more than a helpful assistant in Theodore Twombly’s ear.

One of the scientists in that group was a guy named Ron Kaplan, who today is a stout, soft-spoken man with a gray goatee and thinning hair. Kaplan is equal parts linguist, psychologist, and computer scientist—a guy as likely to invoke Chomsky’s theories about the construction of language as he is Moore’s law. He says that his team got pretty far in sketching out one crucial component of a working conversational user interface back in the ’70s; they rigged up a system that allowed you to book flights by exchanging typed messages with a computer in normal, unencumbered English. But the technology just wasn’t there to make the system work on a large scale. “It would’ve cost, I don’t know, a million dollars a user,” he says. They needed faster, more distributed processing and smarter, more efficient computers. Kaplan thought it would take about 15 years.

“Forty years later,” Kaplan says, “we’re ready.” And so is the rest of the world, it turns out.

Today, Kaplan is a vice president and distinguished scientist at Nuance Communications, which has become probably the biggest player in the voice interface business: It powers Ford’s in-car Sync system, was critical in Siri’s development, and has partnerships across nearly every industry. But Nuance finds itself in a crowded marketplace these days. Nearly every major tech company—from Amazon to Intel to Microsoft to Google—is chasing the sort of conversational user interface that Kaplan and his colleagues at PARC imagined decades ago. Dozens of startups are in the game too. All are scrambling to come out on top in the midst of a powerful shift under way in our relationship with technology. One day soon, these companies believe, you will talk to your gadgets the way you talk to your friends. And your gadgets will talk back. They will be able to hear what you say and figure out what you mean.

If you’re already steeped in today’s technology, these new tools will extend the reach of your digital life into places and situations where the graphical user interface cannot safely, pleasantly, or politely go. And the increasingly conversational nature of your back-and-forth with your devices will make your relationship to technology even more intimate, more loyal, more personal.

But the biggest effect of this shift will be felt well outside Silicon Valley’s core audience. What Steve Jobs saw in the graphical user interface back in 1979 was a way to expand the popular market for computers. But even the GUI still left huge numbers of people outside the light of the electronic campfire. As elegant and efficient as it is, the GUI still requires humans to learn a computer’s language. Now computers are finally learning how to speak ours. In the bargain, hundreds of millions more people could gain newfound access to tech.

Voice interfaces have been around for years, but let’s face it: Thus far, they’ve been pretty dumb. We need not dwell on the indignities of automated phone trees (“If you’re calling to make a payment, say ‘payment’”). Even our more sophisticated voice interfaces have relied on speech but somehow missed the power of language. Ask Google Now for the population of New York City and it obliges. Ask for the location of the Empire State Building: good to go. But go one logical step further and ask for the population of the city that contains the Empire State Building and it falters. Push Siri too hard and the assistant just refers you to a Google search. Anyone reared on scenes of Captain Kirk talking to the Enterprise’s computer or of Tony Stark bantering with Jarvis can’t help but be perpetually disappointed.

Ask around Silicon Valley these days, though, and you hear the same refrain over and over: It’s different now.

One hot day in early June, Keyvan Mohajer, CEO of SoundHound, shows me a prototype of a new app that his company has been working on in secret for almost 10 years. You may recognize SoundHound as the name of a popular music-recognition app—the one that can identify a tune for you if you hum it into your phone. It turns out that app was largely just a way of fueling Mohajer’s real dream: to create the best voice-based artificial-intelligence assistant in the world.

The prototype is called Hound, and it’s pretty incredible. Holding a black Nexus 5 smartphone, Mohajer taps a blue and white microphone icon and begins asking questions. He starts simply, asking for the time in Berlin and the population of Japan. Basic search-result stuff—followed by a twist: “What is the distance between them?” The app understands the context and fires back, “About 5,536 miles.”

Then Mohajer gets rolling, smiling as he rattles off a barrage of questions that keep escalating in complexity. He asks Hound to calculate the monthly mortgage payments on a million-dollar home, and the app immediately asks him for the interest rate and the term of the loan before dishing out its answer: $4,270.84.
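The arithmetic behind that exchange is the standard fixed-rate amortization formula, and the app's follow-up questions amount to checking which inputs are still missing. Here is a minimal Python sketch of both steps. The function names, the slot names, and the 4 percent, 30-year example are assumptions for illustration. The article does not say which rate or term Mohajer supplied, so this will not reproduce the $4,270.84 figure, and nothing here reflects how Hound actually works.

```python
# Hypothetical sketch, not Hound's implementation.
# The three inputs a fixed-rate mortgage calculation needs.
REQUIRED_SLOTS = ("principal", "annual_rate", "years")


def missing_slots(request):
    """Return the inputs the assistant still has to ask the user for."""
    return [slot for slot in REQUIRED_SLOTS if request.get(slot) is None]


def monthly_payment(principal, annual_rate, years):
    """Standard fixed-rate amortization: P * r / (1 - (1 + r) ** -n)."""
    n = years * 12          # number of monthly payments
    r = annual_rate / 12    # monthly interest rate
    if r == 0:
        # With no interest, the loan is just split evenly across the months.
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


# The user names only the price, so the assistant must ask for the rest.
request = {"principal": 1_000_000}
to_ask = missing_slots(request)

# Assumed answers to the follow-up questions (illustrative values only).
request.update(annual_rate=0.04, years=30)
payment = monthly_payment(**request)
print(f"${payment:,.2f} a month")
```

The point of separating the two functions is that the conversation, not the math, is the hard part: the calculation is one line, while knowing what to ask next is what makes the exchange feel like dialog.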

“What is the population of the capital of the country in which the Space Needle is located?” he asks. Hound figures out that Mohajer is fishing for the population of Washington, DC, faster than I do and spits out the correct answer in its rapid-fire robotic voice. “What is the population and capital for Japan and China, and their areas in square miles and square kilometers? And also tell me how many people live in India, and what is the area code for Germany, France, and Italy?” Mohajer would keep on adding questions, but he runs out of breath. I’ll spare you the minute-long response, but Hound answers every question. Correctly.
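One way to picture a nested question like the Space Needle one is as a chain of lookups, where each answer becomes the subject of the next. The Python sketch below uses a tiny hand-built fact table. The table, the relation names, and the `resolve` function are all hypothetical, and a real assistant would also need to parse the spoken sentence into this chain, which is not shown.

```python
# Hypothetical toy fact table: (entity, relation) -> related entity.
FACTS = {
    ("Space Needle", "country"): "United States",
    ("United States", "capital"): "Washington, DC",
    ("Empire State Building", "city"): "New York City",
}


def resolve(entity, relations):
    """Follow each relation in order, feeding every answer into the next lookup."""
    for relation in relations:
        entity = FACTS[(entity, relation)]
    return entity


# "The capital of the country in which the Space Needle is located"
capital = resolve("Space Needle", ["country", "capital"])

# "The city that contains the Empire State Building"
city = resolve("Empire State Building", ["city"])

print(capital)
print(city)
```

The final step, looking up the population of whatever place comes back, would be one more link in the same chain.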

Hound, which is now in beta, is probably the fastest and most versatile voice recognition system unveiled thus far. It has an edge for now because it can do speech recognition and natural language processing simultaneously. But really, it’s only a matter of time before other systems catch up.

After all, the underlying ingredients—what Kaplan calls the “gating technologies” necessary for a strong conversational interface—are all pretty much available now to whoever’s buying. It’s a classic story of technological convergence: Advances in processing power, speech recognition, mobile connectivity, cloud computing, and neural networks have all surged to a critical mass at roughly the same time. These tools are finally good enough, cheap enough, and accessible enough to make the conversational interface real—and ubiquitous.

But it’s not just that conversational technology is finally possible to build. There’s also a growing need for it. As more devices come online, particularly those without screens—your light fixtures, your smoke alarm—we need a way to interact with them that doesn’t require buttons, menus, and icons.

At the same time, the world that Jobs built with the GUI is reaching its natural limits. Our immensely powerful onscreen interfaces require every imaginable feature to be hand-coded, to have an icon or menu option. Think about Photoshop or Excel: Both are so massively capable that using them properly requires bushwhacking through a dense jungle of keyboard shortcuts, menu trees, and impossible-to-find toolbars. Good luck just sitting down and cropping a photo. “The GUI has topped out,” Kaplan says. “It’s so overloaded now.”

That’s where the booming market in virtual assistants comes in: to come to your rescue when you’re lost amid the seven windows, five toolbars, and 30 tabs open on your screen, and to act as a liaison between apps and devices that don’t usually talk to each other.

You may not engage heavily with virtual assistants right now, but you probably will soon. This fall a major leap forward for the conversational interface will be announced by the ding of a push notification on your smartphone. Once you’ve upgraded to iOS 9, Android 6, or Windows 10, you will, by design, find yourself spending less time inside apps and more chatting with Siri, Google Now, or Cortana. And soon, a billion-plus Facebook users will be able to open a chat window and ask M, a new smart assistant, for almost anything (using text—for now). These are no longer just supplementary ways to do things. They’re the best way, and in some cases the only way. (In Apple’s HomeKit system for the connected house, you make sure everything’s off and locked by saying, “Hey Siri, good night.”)

At least in the beginning, the idea behind these newly enhanced virtual assistants is that they will simplify the complex, multistep things we’re all tired of doing via drop-down menus, complicated workflows, and hopscotching from app to app. Your assistant will know every corner of every app on your phone and will glide between them at your spoken command. And with time, they will also get to know something else: you.

Let’s quickly clear something up: Conversational tech isn’t going to kill the touchscreen or even the mouse and keyboard. If you’re a power user of your desktop computer, you’ll probably stay that way. (Although you might avail yourself more often of the ability to ask a virtual assistant things like “Where’s the crop tool, again?”)

But for certain groups of people, the rise of the conversational interface may offer a route to technological proficiency that largely bypasses the GUI. Very young people, for instance, are already skipping their keyboards and entering text through microphones. “They just don’t type,” says Thomas Gayno, cofounder and CEO of voice messaging app Cord. And elsewhere on the age spectrum, there are an enormous number of people for whom the graphical user interface never really worked in the first place. For the visually impaired, the elderly, and the otherwise technologically challenged, it has always been a little laughable to hear anyone describe a modern computer interface as “intuitive.”

Chris Maury learned this the hard way. In the summer of 2010, the then-24-year-old entrepreneur was crashing on a friend’s air mattress in Palo Alto and interning at a startup called ImageShack, having just dropped out of a PhD program to chase the Silicon Valley dream. And in the midst of his long commutes and fiendishly late nights, he realized his prescription eyeglasses weren’t cutting it anymore. An ordinary optometrist appointment led to a diagnosis of Stargardt’s disease, a degenerative condition that doctors told him would eventually leave him legally blind.

Maury, who had every intention of staying in tech, was immediately forced to consider how he might use a computer without his vision. But for the 20-some million people in the US who can’t see, there’s only one real option for staying connected to computers: a 30-year-old technology called a screen reader.

To use one of these devices, you move a cursor around your screen using a keyboard, and the machine renders into speech whatever’s being selected—a long URL, a drop-down menu—at a mind-numbing robotic clip. Screen reader systems can cost thousands of dollars and require dozens of hours of training. “It takes sometimes two sessions before you can do a Google search,” Maury tells me. And as digital environments have gotten more and more complex, screen readers have only gotten harder to use. “They’re terrible,” Maury says.

As his vision started to go downhill, Maury immersed himself in Blind Twitter (yes, there’s Blind Twitter) and the accessibility movement. He came to realize how pissed off some visually impaired people were about the technology available to them. And at the same time, he was faintly aware that the potential ingredients for something better—an interface designed first for voice—were, at that moment, cropping up all over Silicon Valley.

So he set out to redeem technology for blind people. Maury founded a company, Conversant Labs, in the hope of building apps and services that put audio first. Conversant’s first product is an iPhone app called SayShopping, which offers a way to buy stuff from Target.com purely through speech. But Maury has much bigger designs. Before the end of the year, Conversant Labs is releasing a framework that lets iOS developers add conversational interaction to their apps. And Maury wants to soon build a prototype of a fully voice-based computing environment, as well as an interface that will use head movements to give commands. “That’s all possible right now,” he says. “It just needs to be built.”

One day in the fall of 2014, out of nowhere, Amazon announced a new product called the Echo, a cylindrical, talking black speaker topped with a ring of blue lights that glow when the device speaks. The gadget’s persona is named Alexa. At the sound of its “wake word,” the Echo uses something called far-field voice recognition to isolate the voice that is addressing it, even in a somewhat noisy room. And then it listens. The idea is that the Echo belongs in the middle of your living room, kitchen, or bedroom—and that you will speak to it for all sorts of things.

It’s a funny thing, trying to make sense of a technology that has no built-in visual interface. There’s not much to look at, nothing to poke around inside of, nothing to scroll through, and no clear boundaries on what it can do. The technology press was roundly puzzled by this “enigmatic” new product from Amazon. (At least one scribe compared the Echo to the mysterious black monolith from the beginning of 2001: A Space Odyssey.)

When I started using Alexa late last year, I discovered it could tell me the weather, answer basic factual questions, create shopping lists that later appear in text on my smartphone, play music on command—nothing too transcendent. But Alexa quickly grew smarter and better. It got familiar with my voice, learned funnier jokes, and started being able to run multiple timers simultaneously (which is pretty handy when your cooking gets a little ambitious). In just the seven months between its initial beta launch and its public release in 2015, Alexa went from cute but infuriating to genuinely, consistently useful. I got to know it, and it got to know me.

This gets at a deeper truth about conversational tech: You only discover its capabilities in the course of a personal relationship with it. The big players in the industry all realize this and are trying to give their assistants the right balance of personality, charm, and respectful distance—to make them, in short, likable. In developing Cortana, for instance, Microsoft brought in the videogame studio behind Halo—which inspired the name Cortana in the first place—to turn a disembodied voice into a kind of character. “That wittiness and that toughness come through,” says Mike Calcagno, director of Cortana’s engineering team. And they seem to have had the desired effect: Even in its early days, when Cortana was unreliable, unhelpful, and dumb, people got attached to it.

There’s a strategic reason for this charm offensive. In their research, Microsoft, Nuance, and others have all come to the same conclusion: A great conversational agent is only fully useful when it’s everywhere, when it can get to know you in multiple contexts—learning your habits, your likes and dislikes, your routine and schedule. The way to get there is to have your AI colonize as many apps and devices as possible.

To that end, Amazon, Google, Microsoft, Nuance, and SoundHound are all offering their conversational platform technology to developers everywhere. The companies know that you are liable to stick with the conversational agent that knows you best. So get ready to meet some new disembodied voices. Once you pick one, you might never break up.