Abstract

We describe a large vocabulary speech recognition system that is accurate, has low
latency, and yet has a small enough memory and computational footprint to run
faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long
Short-Term Memory (LSTM) acoustic model trained with connectionist temporal
classification (CTC) to directly predict phoneme targets, and further reduce its
memory footprint using an SVD-based compression scheme. Additionally, we minimize
our memory footprint by using a single language model for both dictation and voice
command domains, constructed using Bayesian interpolation. Finally, to handle
device-specific information, such as proper names and other
context-dependent information, we inject vocabulary items into the decoder graph
and bias the language model on-the-fly. Our system achieves a 13.5% word error rate
on an open-ended dictation task, running with a median speed that is seven times
faster than real-time.
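
To make the two footprint-reduction ideas above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the system's actual implementation: the abstract does not specify the quantization scheme or the ranks used, so the uniform 8-bit scheme, the function names (`quantize`, `dequantize`, `low_rank_params`), and the example sizes are all hypothetical.

```python
def quantize(weights, num_bits=8):
    """Uniform linear quantization of floats to integer codes.

    Assumption: a simple min/max uniform scheme; the abstract does not
    state which quantization scheme the system actually uses.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1
    # Guard against a constant vector, where the value range is zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def low_rank_params(rows, cols, rank):
    """Parameter count after replacing a rows x cols matrix with two
    factors of shapes rows x rank and rank x cols, as a truncated SVD
    would produce."""
    return rank * (rows + cols)


# Hypothetical weights, chosen only for illustration.
weights = [-0.8, -0.1, 0.0, 0.35, 1.2]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Hypothetical sizes: a 1000 x 1000 matrix kept at rank 100.
full_params = 1000 * 1000
compressed_params = low_rank_params(1000, 1000, 100)
```

With these hypothetical sizes, the factored matrix stores 200,000 parameters instead of 1,000,000, and each weight is stored in one byte rather than a 32-bit float. The factorization saves parameters only when the rank is below rows * cols / (rows + cols).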