Google AI researchers are applying computer vision techniques to visual representations of sound waves to achieve state-of-the-art speech recognition performance without the use of a language model. Researchers say the SpecAugment method requires no additional data and can be used without adapting the underlying language models.

“An unexpected outcome of our research was that models trained with SpecAugment out-performed all prior methods even without the aid of a language model,” Google AI resident Daniel S. Park and research scientist William Chan said in a blog post today. “While our networks still benefit from adding a language model, our results are encouraging in that it suggests the possibility of training networks that can be used for practical purposes without the aid of a language model.”

SpecAugment works in part by applying data augmentation techniques borrowed from visual analysis to spectrograms, which are visual representations of speech. SpecAugment was applied to Listen, Attend, and Spell networks for speech recognition tasks to achieve a 2.6% word error rate (WER) with LibriSpeech 960h, a collection of about 1,000 hours of spoken English, and a 6.8% word error rate with the Switchboard 300h collection of 260 hours of telephone conversations in English.
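To make the idea of augmenting a spectrogram as if it were an image more concrete, here is a minimal sketch of the masking portion of the technique. The SpecAugment paper describes blocking out bands of frequency channels and bands of time steps (plus a time-warping step that is omitted here). The function name `mask_spectrogram`, its parameters, the use of zero as the mask value, and the list-of-rows layout are all assumptions made for illustration, not the authors' implementation.

```python
import random


def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a copy of `spec` with one frequency band and one time band zeroed.

    `spec` is a list of rows: each row is a frequency bin, each column a time
    frame. All names and defaults here are hypothetical, chosen for the sketch.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero `f` consecutive frequency bins starting at `f0`.
    f = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero `t` consecutive time frames starting at `t0`.
    t = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0

    return out


if __name__ == "__main__":
    toy = [[1.0] * 10 for _ in range(8)]  # 8 frequency bins x 10 frames
    for row in mask_spectrogram(toy, rng=random.Random(0)):
        print("".join("#" if v else "." for v in row))
```

Because a fresh random mask can be drawn each time an utterance is seen during training, the network sees many slightly different versions of the same audio without any new recordings being collected.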

Automatic speech recognition (ASR) systems convert speech into text for conversational AI such as Google Assistant in Home smart speakers, or for Gboard’s dictation tool, which Android smartphone users can use to compose email or text messages. Reductions in word error rates can be a key factor in conversational AI adoption rates, according to a 2018 PricewaterhouseCoopers survey.
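For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that calculation with a standard edit-distance table. The function name `word_error_rate` and the sample sentences are made up for illustration, and real evaluations typically normalize text (case, punctuation) first, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; it is rolled forward row by row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one substituted word ("lights" -> "light")
    # against a five-word reference gives 2 / 5.
    print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

Under this definition, a 2.6% WER means roughly 26 word errors for every 1,000 words in the reference transcripts.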

The achievement was detailed in “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” a paper published on arXiv on April 18.

Continuous improvement is part of the pitch that makers of assistants like Alexa frequently make, and Google and Amazon have shared a number of papers in recent months detailing methods used to accelerate that improvement.