The author is a Forbes contributor. The opinions expressed are those of the writer.



When artificial-intelligence guru Andrew Ng joined Chinese Internet pioneer Baidu last May as chief scientist, he was a little cagey about what he and his team might work on at a newly opened lab in Sunnyvale, Calif. But he couldn't help revealing better speech recognition as a key area of interest in the age of the smartphone.

Today, Baidu, often called China's Google, unveiled the first results of what the former Google researcher, Stanford professor and Coursera cofounder had in mind. In a paper published today on Cornell University Library's arXiv.org site, Ng and 10 members of his Baidu Research team, led by research scientist Awni Hannun, said they've come up with a new method of more accurately recognizing speech, an increasingly important feature used in Apple's Siri and Dictation services as well as Google's voice search. Baidu's Deep Speech beat other methods, such as those offered by Google and Apple, on standard benchmarks that measure the error rate of speech recognition systems, according to Ng.

In particular, Deep Speech works better than the others in noisy environments, such as in a car or a crowd. That's key, of course, to making speech recognition truly useful in the real world. In noisy backgrounds, Ng said, tests showed that Deep Speech outperformed several speech systems--the Google Speech API, wit.ai, Microsoft's Bing Speech, and Apple Dictation--by over 10% in terms of word error rates.

Baidu offered supporting comments from two university professors. "This recent work by Baidu Research has the potential to disrupt how speech recognition will be performed in the future," Ian Lane, assistant research professor of engineering at Carnegie Mellon University, said in a press release. The company requested that the details not be revealed before this morning's publication of the paper, so Google, Apple, and others couldn't be contacted for comment. I'll add what they have to say if they choose to comment later.

Andrew Ng, chief scientist at Baidu

Like other speech recognition systems, Baidu's is based on a branch of AI called deep learning. The software attempts to mimic, in very primitive form, the activity in layers of neurons in the neocortex, the roughly 80 percent of the brain where thinking occurs. Deep learning systems learn to recognize patterns in digital representations of sounds, images, and other data--ideally lots and lots of data. "The first generation of deep learning speech recognition was reaching limits," Ng said in an interview.
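The "layers of neurons" idea can be sketched in a few lines of code. This is a toy illustration only, assuming nothing about Baidu's actual architecture: each layer is a linear transform followed by a simple nonlinearity, and stacking several such layers is what makes a network "deep."

```python
import numpy as np

def relu(x):
    # A common nonlinearity: pass positive values, zero out the rest.
    return np.maximum(0.0, x)

def forward(x, layers):
    """Run an input through a stack of (weight, bias) layers.

    Each layer loosely plays the role of one layer of neurons:
    a linear transform followed by a nonlinearity.
    """
    for w, b in layers:
        x = relu(x @ w + b)
    return x

# Toy example: a 3-layer net mapping a 10-dim "sound feature" vector
# to a 4-dim output. Weights are random here; in practice, training on
# lots and lots of data is what fits them to recognize patterns.
rng = np.random.default_rng(0)
dims = [10, 32, 16, 4]
layers = [(rng.standard_normal((d_in, d_out)) * 0.1, np.zeros(d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]
out = forward(rng.standard_normal(10), layers)
```

The depth matters because each layer can build on the patterns detected by the one below it, which is why these systems need so much training data.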

The Baidu team collected some 7,000 hours of speech from 9,600 people, mostly in quiet environments--though sometimes speakers wore headphones playing loud background noise so they would change their pitch or inflections the same way they would in a noisy environment. Then, using a principle of physics called superposition, the team added about 15 types of noise, such as ambient noise from restaurants, cars, and subways, to those speech samples. That essentially expanded the speech samples to 100,000 hours of data. Then it let the system learn to recognize speech even amid all that noise.

It's a much simpler method than today's speech recognition systems, Ng says. Those systems chain together a series of modules that analyze phonemes and other parts of speech, often relying on hand-designed components and statistical models called Hidden Markov Models, which require lots of human tuning to handle noise and speaker variation. Baidu's system replaces those modules with deep learning algorithms trained on a recurrent neural network, or simulation of connected neurons, making the system much simpler, Ng says.
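To make the contrast concrete, here is a toy recurrent layer that maps audio frames straight to per-frame character probabilities. This shows only the general shape of an end-to-end recurrent system, not Baidu's actual Deep Speech network; all dimensions and weights here are invented for illustration.

```python
import numpy as np

def softmax(x):
    # Turn raw scores into probabilities that sum to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_char_probs(frames, Wx, Wh, Wy):
    """One recurrent layer mapping audio frames directly to per-frame
    character probabilities: no phoneme dictionary and no Hidden
    Markov Model in the loop. The hidden state h carries context
    from earlier frames forward through the utterance."""
    h = np.zeros(Wh.shape[0])
    probs = []
    for x in frames:
        h = np.tanh(x @ Wx + h @ Wh)   # recurrence: new state from input + old state
        probs.append(softmax(h @ Wy))  # per-frame distribution over characters
    return np.array(probs)

# Toy dimensions: 20-dim spectrogram frames, 32 hidden units,
# 29 output symbols (26 letters plus space, apostrophe, and a blank).
rng = np.random.default_rng(2)
Wx = rng.standard_normal((20, 32)) * 0.1
Wh = rng.standard_normal((32, 32)) * 0.1
Wy = rng.standard_normal((32, 29)) * 0.1
frames = rng.standard_normal((50, 20))  # roughly half a second of audio
probs = rnn_char_probs(frames, Wx, Wh, Wy)
```

In a trained system, the noise modeling and speaker variation that hand-tuned modules once handled are instead absorbed by the learned weights, which is the source of the simplification Ng describes.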