
Google, Microsoft take voice-search to new levels

Your voice — coupled with a machine's ability to recognize and process speech — is assuming a bigger role in the era of "Internet of Things." Ed Baig visits Google's voice lab and talks to researchers at Microsoft and Nuance.

MOUNTAIN VIEW, Calif. — In a hidden anechoic chamber at Google's headquarters, I'm recording sample phrases into a microphone: "Where is the nearest Mexican restaurant?" "What is the temperature in Tokyo?" "What do apples, oranges and bananas have in common?"

The chamber is suspended off the ground, with foam on the walls and ceiling and a bouncy floor that reminds you of a trampoline. The sprinkler and vent systems are isolated from the rest of the building.

This elaborate setup ensures that no external noises spoil the recording of my voice, except any background sounds that Google chooses to overlay after the fact — audio, for example, from a car, loud party or café.

Google uses this acoustic lab to help test the performance of apps and devices that depend on voice.

Google's anechoic chamber lab. (Photo: Edward C. Baig, USA TODAY)

Your voice — coupled with a machine's ability to recognize and process speech — is assuming a bigger role as we usher in the era of wearable devices, connected cars, home automation and "Internet of Things" appliances. So are the voices inside such devices and things.

Most of us are already comfortable gabbing into smartphones and having the devices talk back, whether we're requesting directions or searching for a restaurant.

Such virtual assistants are not only getting smarter at recognizing your voice and understanding the intent of what you're asking, but also at communicating back within the proper context.

You might ask Siri, for example, "What's the weather like today?" and Siri will pull up information based on your current location. You can then say, "What about in Austin?" and Siri knows you're still talking about weather. Then ask, "What about this weekend?" and it will pull up the Saturday and Sunday forecast.
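The Siri exchange above boils down to carrying context between turns: the assistant remembers the last intent and its slots, and each follow-up only overwrites what it restates. A minimal sketch of that idea, with illustrative intent and slot names that are assumptions rather than any vendor's actual implementation:

```python
# A toy sketch of follow-up handling like the Siri exchange above.
# The intent names, slot names, and keyword parsing are illustrative
# assumptions, not any vendor's actual implementation.

def parse(query):
    """Crude keyword parser: returns only the slots this query mentions."""
    slots = {}
    if "weather" in query:
        slots["intent"] = "weather"
    if "Austin" in query:
        slots["location"] = "Austin"
    if "weekend" in query:
        slots["when"] = "weekend"
    return slots

class Assistant:
    def __init__(self):
        # Defaults stand in for "here" and "now".
        self.context = {"intent": None, "location": "current location", "when": "today"}

    def ask(self, query):
        # Carry forward anything the new query doesn't restate.
        self.context.update(parse(query))
        c = self.context
        return f"{c['intent']} for {c['location']}, {c['when']}"

a = Assistant()
print(a.ask("What's the weather like today?"))  # weather for current location, today
print(a.ask("What about in Austin?"))           # weather for Austin, today
print(a.ask("What about this weekend?"))        # weather for Austin, weekend
```

The second and third queries never say "weather," yet the answer stays on topic because the intent persists until something replaces it.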

Speech recognition has come a long way. Google reports word error rates of 8%, down from about 25% just a couple of years ago. Some of the improvement can be traced to advances in computing power, some to machine learning techniques and natural language processing.
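The word error rate behind figures like Google's 8% is a standard metric: the minimum number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A self-contained sketch using ordinary edit distance:

```python
# Word error rate (WER): edit distance between the recognized words and
# the reference transcript, normalized by the reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("where is the nearest mexican restaurant",
          "where is the nearest mexico restaurant"))  # 1 error / 6 words ≈ 0.167
```

By this measure, an 8% rate means roughly one word in twelve comes out wrong, versus one in four a couple of years earlier.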

It all ties into Google search. "Things like 'how to make a Mai Tai' or 'what are shinsplints' — we couldn't answer those questions before. It's the combination of the speech getting better and the percentage of questions that we can answer getting higher," says Tamar Yehoshua, vice president of product management at Google.

These days, you can speak to the Google app in 58 languages. Voice searches have more than doubled in the past year alone.

Through smartphones, researchers are better able to gather speech data from myriad languages, accents and dialects, not to mention pitch and tone.

GOOGLE'S 'VOICE HUNTER'

Still, there's complexity.

Google's "voice hunter" Linne Ha (Photo: Chris George)

Senior program manager Linne Ha, aka Google's "voice hunter," runs the Pygmalion team of linguists and is involved in a project helping develop a "universal model" of the world's spoken languages, so that the Google app can understand whatever language you are speaking.

"We're coming up with rules and exceptions to train the computer," Ha says. "Why do we say 'the president of the United States'? And why do we not say 'the president of the France'? There are all sorts of inconsistencies within our language and within every language. For humans it seems obvious and natural, but for machines it's actually quite difficult."

The goal is to make it all seem natural. "We want to be able to have somebody have a conversation with Google as you would have a conversation with a friend," says Yehoshua.

Microsoft is taking a similar path with Cortana. "Our approach … includes defining a voice with an actual personality," notes Marcus Ash, group program manager for Cortana. Microsoft actually has a trained voice actress read thousands of responses to questions so that Cortana isn't "just programmed with robotic responses."

Cortana will enter the limelight later this year, when she moves beyond smartphones into Windows 10 PCs and tablets.

Voice becomes a bigger deal on devices with small screens — or perhaps no screens at all. Folks who buy the Apple Watch when it appears next month will be able to communicate by saying, "Hey, Siri."

Before long you may be speaking out loud to your garage door, crockpot, sofa and shower head: "Hot water, please."

WILD WEST

"As this sea of devices really swells, it challenges your thinking about how to design these things," says Tim Lynch, a lead designer at Nuance Communications. "We're sort of in the Wild West with all of these devices. You have to learn how to interact, and there's not necessarily a standard. There's an opportunity and a need for something natural like speech to be that unifying thread."

Tim Lynch of Nuance (Photo: Edward C. Baig)

Much of today's input conforms to what Microsoft's Ash refers to as a "single shot" speech query: you say a command, Cortana responds, and that's the end of your speech session. Now, he says, "There's a bunch of research being done in what we call 'multi-turn' speech. In the future you'll be able to pivot — say something like, 'What's the weather in Bellevue? (And then) show me some great restaurants there.'"

One of the complicating factors remains ambient noise. "The hardest thing is something called end pointing," says Yehoshua. "Knowing when you're done. If there's too much noise that it recognizes near you then it's going to keep trying to recognize (those fringe noises)." But she says the ability to distinguish background sounds from what you're saying has dramatically improved.
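The end-pointing problem Yehoshua describes can be illustrated with a toy energy detector: declare the utterance over once the signal stays quiet for long enough after speech has started. Real systems are far more sophisticated; the frame energies and thresholds below are arbitrary assumptions for illustration.

```python
# A toy version of "end pointing": decide the speaker is done when the
# signal's energy stays below a threshold for enough consecutive frames.
# Thresholds and frame values here are arbitrary assumptions.

def end_point(frames, energy_threshold=0.1, silence_frames_needed=3):
    """Return the index of the frame where the utterance is judged
    complete, or None if the speaker never clearly stops."""
    silent_run = 0
    started = False
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            started = True       # speech (or loud noise) detected
            silent_run = 0
        elif started:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None

# Quiet lead-in, speech, then sustained silence: end point found.
print(end_point([0.02, 0.5, 0.6, 0.4, 0.03, 0.02, 0.01]))  # 6
# Background chatter keeps the energy up: no clear end point.
print(end_point([0.02, 0.5, 0.6, 0.4, 0.3, 0.25, 0.3]))    # None
```

The second call shows exactly the failure Yehoshua mentions: fringe noise above the threshold keeps resetting the silence counter, so the recognizer "keeps trying to recognize" and never decides you're done.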

For the most part, smartphones have excellent directional microphones, which help filter out noises on the edge. Presumably there'll be decent microphones in wearables and Internet of Things devices that lean heavily on voice.

Ash of Microsoft points to another issue: How will a small-form factor device that depends on voice do in an environment without an Internet connection and not much processing power or storage? "With devices that can have a connected and disconnected state, how do you build a speech interface that can work in both cases, and how do you communicate that to customers? That's something we're spending some time thinking about."
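One common shape for the hybrid design Ash describes is a cloud recognizer with a small on-device fallback grammar, plus an honest message about which mode the user is in. A sketch under those assumptions — the function names and offline command set are hypothetical, not Microsoft's design:

```python
# A sketch of a connected/disconnected speech interface: use the full
# cloud recognizer when online, fall back to a small on-device command
# set when offline, and tell the user which mode handled the request.
# All names and the offline command set are hypothetical.

OFFLINE_COMMANDS = {"set a timer", "play music", "turn on the lights"}

def recognize(utterance, online):
    if online:
        # In a real device this would call the cloud speech service.
        return ("cloud", utterance)
    if utterance in OFFLINE_COMMANDS:
        return ("on-device", utterance)
    return ("on-device", "Sorry, I can only handle basic commands offline.")

print(recognize("what's the weather in Bellevue", online=True))
print(recognize("set a timer", online=False))
print(recognize("what's the weather in Bellevue", online=False))
```

The design choice is the one Ash flags: the interface must degrade gracefully and communicate its limits, rather than silently failing when the connection drops.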

Lots of progress has been achieved. But the biggest tech companies are still working diligently to make sure your voice is heard — loud and clear.