ASA 130th Meeting - St. Louis, MO - 1995 Nov 27 .. Dec 01

5aSC25. Recent progress in the INRS speech recognition system.

For large-vocabulary continuous-speech recognition, a two-pass search
allows inexpensive first-pass models, with pruned search spaces represented by
word graphs. Powerful language models and detailed acoustic-phonetic models
are then applied in the second pass. A first-pass Viterbi lexicon search is
avoided by using tables of estimated phone scores and durations, obtained from
backward-Viterbi searches of much smaller graphs that impose diphone rather
than full lexical constraints on phonetic transcriptions. These estimates are used to
calculate approximate acoustic matches for arbitrary phonetic transcriptions
(one floating-point operation/phone). The speaker-independent system uses WSJ0
data (5000-word vocabulary), with separate male and female models: 3-state full
right-context models, and code books of 14 static and 15 dynamic cepstral
parameters. The first pass uses VQ models with one covariance matrix and 256
means. The word inclusion rate is about 97%. For the second pass, trigram
language models with perplexity 104 and continuous-HMM acoustic models achieved
about 90% word-recognition accuracy on the development set. To achieve a good
trade-off between acoustic-model complexity and trainability, a
shared-distribution clustering approach uses distortion measures based only
on the weights of the Gaussian mixtures rather than on all parameters. Word accuracy
increased by 6% for the ATIS corpus. [Work supported by NSERC-Canada.]
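
The table-based approximate match described above can be illustrated with a
minimal sketch. The abstract does not specify the table layout, so everything
here is an assumption made for illustration: the table is keyed by (start
frame, phone) and stores an estimated log score and an estimated duration in
frames, and the names PhoneTable, approximate_match, and toy_table are
hypothetical. The point it shows is only that, once such a table exists,
scoring an arbitrary phonetic transcription costs one floating-point addition
per phone.

```python
from typing import Dict, List, Tuple

# Assumed (hypothetical) layout: (start_frame, phone) -> (log score, duration in frames).
PhoneTable = Dict[Tuple[int, str], Tuple[float, int]]


def approximate_match(
    table: PhoneTable, phones: List[str], start_frame: int = 0
) -> Tuple[float, int]:
    """Return (approximate log score, end frame) for a phonetic transcription."""
    total = 0.0
    frame = start_frame
    for phone in phones:
        score, duration = table[(frame, phone)]
        total += score      # the one floating-point operation for this phone
        frame += duration   # integer bookkeeping: where the next phone starts
    return total, frame


# Toy values, invented purely to exercise the function.
toy_table: PhoneTable = {
    (0, "k"): (-4.0, 3),
    (3, "ae"): (-6.5, 5),
    (8, "t"): (-3.25, 2),
    (3, "ah"): (-9.0, 4),
    (7, "t"): (-5.0, 3),
}

cat_score, cat_end = approximate_match(toy_table, ["k", "ae", "t"])
cut_score, cut_end = approximate_match(toy_table, ["k", "ah", "t"])
print(cat_score, cat_end)
print(cut_score, cut_end)
```

In this toy setting, two competing transcriptions are compared by their summed
table entries alone, with no lexicon-level Viterbi search. A missing table
entry raises KeyError here; how a real system handles unseen (frame, phone)
pairs is not stated in the abstract.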