Future of voice recognition: Assistants that learn from you

Voice-activated assistants are playing an increasingly prominent role in the technology world, with Apple's introduction of Siri for the iPhone 4S and Google's (rumored) work on a Siri competitor for Android phones.

Voice-activated technology isn't new—it's just getting better because of increasingly powerful processors and cloud services, advancements in natural language processing, and improved algorithms for recognizing voice. We spoke with Nuance Communications, maker of Dragon software and one of the biggest names in voice recognition technologies, about why voice is becoming more popular and what advancements we can expect in the future.

Peter Mahoney, Nuance chief marketing officer and general manager of the Dragon desktop business, told Ars one of the most significant improvements coming in the next few years is a far more conversational voice-activated assistant that remembers everything you say. This should create better responses to casual questions.

"I think you'll see systems that are more conversational, that have the ability to ask more sophisticated followup questions and adapt to the individual," Mahoney said. "They'll be able to remember what you're talking about. Talking to one of these assistants today is like dealing with someone with no short-term memory. They don't remember what you just said. These systems over time are going to get better and better short-term and long-term memory."

Asking Dragon Go! for iPhone "Where can I eat?" results in many interesting responses

How about Thai?

As an example, Mahoney said if you ask Nuance's Dragon Go! app for iPhone to make a reservation in an Italian restaurant tomorrow night at 8pm for four people but don't like the results, you basically have to start over. "You'd like to be able to say 'how about Thai?' instead of trying to repeat the same thing over again, or 'how about next Thursday?' You should be able to follow up and the system should be able to remember your conversation. They don't do that that well today."

Siri is taking steps toward providing a natural, conversation-like experience with voice-activated assistants, as Jacqui Cheng noted in the Ars iPhone 4S review. "When given direct and clear tasks, Siri performs well, and it's nice not having to memorize a strict list of commands," Cheng wrote. "The best part about Siri is the fact that you can (or should be able to, anyway) speak to it like you would speak to a person without having to conform to a special speaking syntax—the number one turn-off for 'regular' people using voice control features."

Siri's limits

Still, the Ars review noted some shortcomings. Siri often misinterpreted casually spoken commands, making it easier in many cases to perform the tasks manually.

The limits of Siri's conversational abilities may be seen in a video by Macworld's Jason Snell. Like the example Mahoney gave, Snell asked Siri "where can I have lunch?" After receiving the results, including 12 nearby restaurants, Snell asked "how about downtown?" Siri's response: "I don't know what you mean by 'how about downtown?'"

Snell said the same request worked on a previous attempt. "Sometimes the Siri software figures out what you mean and it's kind of like magic. Other times it doesn't really work that well," he said. "I think that's one of the reasons Apple called this a beta."

The maker of Dragon has many products for desktops and mobile phones, as well as industry-specific software for in-car navigation systems, health care settings, and more. There have been rumors that Nuance technology even powers Apple's Siri on the back end. Nuance told us it can't comment on "specific capabilities or devices." However, the company did confirm that "Apple licenses Nuance's voice technology for use in some of its products."

Headquartered in Burlington, Massachusetts, Nuance has more than 1,000 engineers around the world. Research and development is divided into several categories, Mahoney explained. There's acoustic modeling, for processing audio and mapping it to sounds and words. Language modeling experts, including linguists, help build systems capable of understanding the structure of language and grammar. Natural language processing experts help extract meaningful information from the data gathered by Nuance services.
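To make the interplay between those first two pieces concrete, here is a toy sketch of the classic noisy-channel view of speech recognition. This is not Nuance's implementation, and all scores below are invented for illustration: the recognizer picks the transcript W that maximizes P(audio | W) × P(W), where the acoustic model supplies the first factor and the language model the second.

```python
# Toy illustration of the noisy-channel view of speech recognition:
# choose the transcript W that maximizes P(audio | W) * P(W), where the
# acoustic model supplies P(audio | W) and the language model P(W).
# All numbers below are invented log-probabilities, purely for illustration.

ACOUSTIC = {  # log P(audio | transcript): how well each candidate fits the sound
    "recognize speech": -4.0,
    "wreck a nice beach": -3.5,   # acoustically a slightly better match
}

LANGUAGE = {  # log P(transcript): how plausible each candidate is as English
    "recognize speech": -2.0,
    "wreck a nice beach": -6.0,   # far less likely as an English phrase
}

def decode(candidates):
    """Return the candidate with the highest combined log-score."""
    return max(candidates, key=lambda w: ACOUSTIC[w] + LANGUAGE[w])

print(decode(ACOUSTIC))  # the language model outweighs the acoustic edge
```

Even in this two-candidate caricature, the language model's strong preference for "recognize speech" overrides the acoustic model's slight preference for the homophone, which is exactly the division of labor between the acoustic and language modeling teams described above.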

Nuance's simplest products are available in more than 70 languages. But the more complex desktop applications, such as Dragon NaturallySpeaking for PC, only support about a half-dozen.

"For each of the languages you need a good understanding of what puts the language together, how it's built and how the sounds translate into words," Mahoney said.

Apple's embrace of voice with Siri has increased consumer interest in the technology. "Apple's so strong with user experience that when they embrace voice as a core differentiator, that says to a lot of people it might be good enough, because Apple wouldn't do it if it's not great," Mahoney said.

Siri and Nuance's Dragon Go! for iPhone and Android aren't that different in the technology they use on the back end. They are different implementations, Mahoney said. "What the Siri application does is it tries to interpret what you're asking for and brings you through a very structured set of potential results Apple can deliver to you," he said. "It's very neatly controlled. It tends to be a great experience."

Dragon Go! connects users to results from more than 200 Web properties covering the most likely searches to provide information on restaurants, music and entertainment, local businesses, and other topics. For restaurant queries, Dragon takes users to Yelp for reviews and OpenTable to make reservations.

Machine learning will make voice-activated assistants smarter

Nuance has built up its user experience with various iterations over the years, but it's still a manual process. "Most of the systems require a human to define the different categories of information you're going to support," Mahoney said. "Some person has to decide what are the kinds of things people will ask. You can fool any one of these systems if you work hard enough because they can't answer every single kind of question."

Over the next few years, we'll see voice recognition technologies learn from their users and improve themselves without manual intervention, Mahoney said. "As more machine learning capability gets implemented and more of these systems sort of build themselves, you'll see better and better coverage. The systems will learn from use about what kinds of things they need to cover, and they'll get smarter over time."

I was hoping to hear something about the Vlingo acquisition, and whether some of the new ideas Mahoney mentioned for voice recognition improvement would find their way into that product. Vlingo is more rigid than Siri in the commands it can interpret, but it works pretty darn well, has a good in-car feature, and a verbal wake-up to begin a command (vs. having to hit a button). It's been a great addition to my OG Droid for my commute...

Dragon NaturallySpeaking goes on-line August 4th, 1997. Human decisions are removed from voice recognition. Dragon NaturallySpeaking begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.

So true. It makes googling for voice recognition next to impossible, I'm sure. But I'm sick of all speech recognition being done remotely. Why can't we have local speech recognition already? Please? If it isn't as good because of a lack of horsepower/storage... at least make it a fallback until the local tech gets there... but I don't like having to wait on everything to go out to another server and come back, dependent on my connection.

I remember the first time I came in contact with the software made by Dragon. The first thing I did was of course to ensure my favourite words were all present and accounted for. Once I had all of the four-letter words I could think of, I was ready to go. However, much to my surprise, those words had a tendency to appear when not asked for....

I have been a Dragon NaturallySpeaking user on the PC for a long time. I believe I started with version 3.0 or perhaps even a bit earlier. I find that Dragon NaturallySpeaking works well for the broad strokes, such as dictating the body of a letter or an article. Sort of like painting a ceiling with a paint roller: it works well for the center, but when it comes to the edge you need a brush. Nuance suggests using speech recognition and the keyboard together, and I agree.

I found early on that having a computer with the most CPU power and RAM was vital to good speech recognition performance. In those days I made a lot of money building SCSI-based machines because those early devices did not have enough RAM capability to hold the program itself in RAM while operating and the hard drive 'swap drives' were simply too slow. Nowadays it seems that many people think you can buy a $500 PC or laptop and successfully run speech recognition. Although many people try to do so, the results are usually disappointing and have created a poor perception of speech recognition capability as a result.

I am dictating this with Dragon 10.1 speech recognition on an older desktop with a 3 GHz dual-core processor, 64-bit Win7 Pro, and 8 GB of RAM, and the system is just barely adequate. My more recent slimmed-down Lenovo laptop, with a second-generation Sandy Bridge 3.2 GHz i5 dual core, also 64-bit Win7 Pro and 8 GB of RAM, does a much better job.

The most recent version of Dragon NaturallySpeaking (11.x) uses considerably more sophisticated speech algorithms and, I believe, would run much better, but only on a computer with a quad-core processor. Unfortunately, the "gamers" own the high-performance computer market these days, and one cannot get a high-performance laptop that is not optimized for gaming, with its attendant larger screen and accompanying weight penalty.

Good voice recognition (i.e., what words did you actually say) and natural language parsing (i.e., what did you actually mean by those words) are both required for Siri and Dragon Go! to function. As noted above, the "easy" aspect (voice recognition) is still struggling to meet broad expectations for usability, and the hard part (natural language parsing) doesn't work particularly well even when digitized text is processed, so I expect this technology will take a while to mature.

Dragon = Siri? <.<

Your geek card has now been revoked.

Nah, you misunderstood. I was suggesting Instant Sunrise might be wrong and that Siri is the correct Skynet, not D.N.S.