We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On the Google Voice Search task, LAS achieves a word error rate (WER) of 14.2% without a dictionary or a language model, and 11.2% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 10.9%.
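The pyramidal structure of the listener is the part of the architecture most easily shown in code: each pyramid layer halves the time resolution by concatenating adjacent frames before the next recurrent layer runs. The sketch below is a minimal, hedged illustration of just that downsampling step (the recurrent layers and attention are omitted, and the function name `pyramid_reduce` is our own, not from the paper):

```python
import numpy as np

def pyramid_reduce(frames: np.ndarray) -> np.ndarray:
    """Concatenate consecutive frame pairs: (T, D) -> (T//2, 2*D)."""
    T, D = frames.shape
    T = T - (T % 2)              # drop a trailing odd frame, if any
    return frames[:T].reshape(T // 2, 2 * D)

# A 3-layer pyramid shrinks a 512-frame filter bank sequence by 8x,
# which is what makes attention over long utterances tractable.
x = np.random.randn(512, 40)     # 512 frames of 40-dim filter bank features
for _ in range(3):
    x = pyramid_reduce(x)
print(x.shape)                   # (64, 320)
```

In the paper each such reduction feeds a bidirectional LSTM layer; here only the frame-stacking arithmetic is shown.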

Discussion

The authors need to compare on a dataset that is publicly available. No one outside Google can access their data, so no follow-up paper can compare against the authors' method.

George Dahl wrote 4 years ago

It is very hard to build a state-of-the-art baseline recipe on a new dataset, so although it would of course be nice to see results on public data, it is more important to have the strongest possible baseline. The speech community has slightly different norms for these things: people generally prefer comparing one system against itself with some capabilities ablated, rather than comparing against other published numbers. Since there are so many pieces in the modern speech pipeline, comparisons to published numbers are less meaningful because so much of the system will differ. That is why open-source toolkits like Kaldi are so useful for research.