While the title mentions optimizations for mobile speech applications, they seemed absent from the talk. It was mostly an advertisement-style talk, similar to other talks from Google I have seen. However, in addition to the standard Google advertisement, there was some interesting information.

Michiel claimed this was the “golden age of speech recognition”. Of course people have been saying this for many years, but he did try to provide some evidence. For instance, he showed a clip of Saturday Night Live mocking standard dialogue systems from 2007. It highlighted not only how poorly the systems performed, but also that the general population recognized it. Contrast that with today, when voice is a commonly accepted way to interact with our phones and other devices. It has also become profitable for companies to develop devices where speech is the primary modality for interaction.

Google has also recently been working on removing the GMM from speech recognition. Until now, a DNN required an initial GMM to perform the initial labeling of the data along with the state-tying. They can now get similar performance by training the DNN from a flat start, with no initial GMM. This is more complicated than you might expect because of the priors. The DNN estimates state posteriors given the acoustics, whereas the GMM models the likelihood of the acoustics given the state, so a state prior is required to convert one into the other. A poor prior can lead to a poor DNN model due to errors in alignment. They find it is important to update the prior model frequently during training.
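
To make the role of the prior concrete: in a hybrid DNN-HMM system the network outputs state posteriors p(s|x), while the decoder expects something proportional to the likelihood p(x|s), so the posterior is divided by the state prior p(s). The sketch below is only an illustration of that conversion, not the system described in the talk. The function names, the toy alignment, and the choice to estimate the prior from state counts in the current alignment are all assumptions made for the example.

```python
import math


def estimate_log_priors(alignment, num_states, floor=1e-8):
    """Estimate log p(s) as the relative frequency of each state in a
    frame-level alignment (assumed prior estimator, floored to avoid log(0))."""
    counts = [0] * num_states
    for state in alignment:
        counts[state] += 1
    total = len(alignment)
    return [math.log(max(count / total, floor)) for count in counts]


def scaled_log_likelihoods(log_posteriors, log_priors):
    """Return log p(s|x) - log p(s), which equals log p(x|s) up to a
    constant that is the same for every state in the frame."""
    return [post - prior for post, prior in zip(log_posteriors, log_priors)]


# Toy alignment over 3 states: state 0 dominates, state 2 is rare.
alignment = [0, 0, 0, 1, 1, 2, 0, 0]
log_priors = estimate_log_priors(alignment, num_states=3)

# Hypothetical DNN posteriors for a single frame.
frame_log_posteriors = [math.log(p) for p in (0.5, 0.3, 0.2)]
scores = scaled_log_likelihoods(frame_log_posteriors, log_priors)
best_state = scores.index(max(scores))
```

In this toy frame the rare state 2 ends up with the highest scaled likelihood even though it has the lowest posterior, which shows how strongly the prior can influence the alignment. Re-running `estimate_log_priors` on each new alignment is one simple way to keep the prior up to date during training.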

Finally, he discussed some more recent work with long short-term memory (LSTM) models. The major takeaway is that they can achieve performance similar to a DNN with a much smaller model. I think this will be an active area of research in the future: finding alternative models that are similar to DNNs, but require fewer parameters.

Multilingual speakers outnumber monolingual speakers worldwide. Given this, it is surprising how little work has been done in the area of code-switching. It is a difficult task due to the lack of training data and the speaker dependency. This was a perfect talk for Singapore. The mixture of languages here and the fluency with which many people switch back and forth (even at the word level) is a perfect illustration of this problem.

While I tend to only think about acoustic modeling issues within this domain, Tanja discussed difficulties in designing a lexicon (including the acoustic unit inventory), training an acoustic model, and building a language model.

In designing a lexicon, the simplest approach is to merge lexicons from multiple languages. This produces two main issues. The first is that two languages may share homographs whose pronunciations and semantics differ greatly. The other issue is the set of acoustic units. If you are using something like IPA, it is questionable whether phones that share a symbol in different languages are actually identical. Tanja also introduced a tool developed in her lab for dealing with these issues: the Rapid Language Adaptation Toolkit (RLAT).
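
Both lexicon issues can be seen in a small sketch of the naive merge. This is not how RLAT works (the talk summary gives no such detail). It is a hypothetical illustration in which `merge_lexicons`, the toy entries, the ASCII phone labels, and the language-tagging option are all assumptions made for the example.

```python
from collections import defaultdict


def merge_lexicons(lexicons, tag_phones=True):
    """Merge per-language lexicons into one word -> pronunciations map.

    Each pronunciation is stored with its language so homographs keep
    every variant. With tag_phones=True, phones get a language suffix,
    so identical symbols from different languages stay separate units.
    """
    merged = defaultdict(list)
    for lang, lexicon in lexicons.items():
        for word, pronunciations in lexicon.items():
            for pron in pronunciations:
                phones = tuple(f"{p}_{lang}" if tag_phones else p for p in pron)
                entry = (lang, phones)
                if entry not in merged[word]:
                    merged[word].append(entry)
    return dict(merged)


def phone_inventory(merged):
    """Collect the set of acoustic units used anywhere in the merged lexicon."""
    return {phone for entries in merged.values() for _, phones in entries for phone in phones}


# Toy lexicons with illustrative ASCII phone labels (not real IPA).
lexicons = {
    "en": {"die": [("d", "ay")], "hand": [("hh", "ae", "n", "d")]},
    "de": {"die": [("d", "iy")], "hand": [("hh", "a", "n", "t")]},
}

tagged = merge_lexicons(lexicons, tag_phones=True)
shared = merge_lexicons(lexicons, tag_phones=False)
```

Here the homograph "die" keeps two pronunciations, one per language. With tagging, the inventory holds separate units such as `d_en` and `d_de`; without it, both languages share a single `d`, which is exactly the assumption that may not hold.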

For acoustic models, previous work has generally shown that monolingual models outperform multilingual models. She claimed that this is no longer true of more recent work. I know there has been some success in using multilingual bottleneck features, but I do not think the evidence for acoustic models is clear yet.

Language modeling is probably the least investigated aspect of this task, but potentially the most difficult. Since code-switching is a phenomenon of conversational speech, finding adequate amounts of text for language model training is nearly impossible. In addition, there are no rules for code-switching. The variation between speakers makes general models of code-switching ineffective.

Yifan Gong: Selected Challenges and Solutions for DNN Acoustic Modeling

The talk was mostly a brief synopsis of multiple DNN research topics being investigated at Microsoft. I found a couple of topics of particular interest. As with Tanja’s talk, Yifan also discussed multilingual acoustic models. The standard approach with DNNs was used: a single DNN was trained on multiple languages, with a language-dependent final output layer. In this case, they had hundreds of hours of audio in alternative languages and very limited training data in the target language. I do wonder if the improvements disappear in the case where you have greater amounts of untranscribed in-language data.
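
The multilingual architecture can be sketched as one set of hidden layers shared by all languages, with a separate softmax output layer per language. The code below is a minimal forward pass under that reading, not the Microsoft system. The layer sizes, weights, and the names `shared_hidden`, `output_heads`, and `forward` are all made up for illustration.

```python
import math


def affine(x, weights, bias):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One hidden layer shared by every language (2 inputs -> 3 hidden units).
shared_hidden = {
    "weights": [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],
    "bias": [0.0, 0.1, 0.0],
}

# Language-dependent output layers; each language has its own state set.
output_heads = {
    "lang_a": {
        "weights": [[0.2, 0.1, -0.1], [0.0, 0.3, 0.2], [-0.2, 0.1, 0.4], [0.1, -0.3, 0.0]],
        "bias": [0.0, 0.0, 0.0, 0.0],
    },
    "lang_b": {
        "weights": [[0.3, -0.1, 0.2], [-0.2, 0.2, 0.1]],
        "bias": [0.0, 0.0],
    },
}


def forward(features, language):
    """Run the shared hidden layer (ReLU), then the head for `language`."""
    hidden = [max(0.0, h) for h in affine(features, **shared_hidden)]
    head = output_heads[language]
    return softmax(affine(hidden, head["weights"], head["bias"]))


posteriors_a = forward([1.0, 2.0], "lang_a")
posteriors_b = forward([1.0, 2.0], "lang_b")
```

In training, data from every language would update the shared layer, while each output head would only see frames from its own language. That is why a target language with very little data can still benefit from the others.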

He also presented work on reducing the size of DNNs while minimizing the impact on accuracy. One drawback of DNNs is their large number of parameters, especially for mobile applications. In this work, they trained a large DNN and a small DNN jointly. After updating each network for a particular batch, they would send an additional update to the small DNN to minimize the KL divergence between their outputs. Although this approach does not reduce the cost of training the initial system, it allows a smaller model to be used during decoding.
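
The extra update can be illustrated with a toy example for a single frame. This sketch assumes the divergence is KL(large || small) and, to stay short, applies the gradient step directly to the small network's output logits rather than backpropagating into its weights. The function names, logit values, and learning rate are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same states."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_step_on_small_logits(large_logits, small_logits, learning_rate=0.5):
    """One gradient step on the small model's logits.

    For q = softmax(small_logits), the gradient of KL(p || q) with
    respect to the logits is simply q - p.
    """
    p = softmax(large_logits)
    q = softmax(small_logits)
    return [z - learning_rate * (qi - pi) for z, pi, qi in zip(small_logits, p, q)]


# Hypothetical output logits of the two networks for one frame.
large_logits = [2.0, 0.5, -1.0]
small_logits = [0.0, 0.0, 0.0]

kl_before = kl_divergence(softmax(large_logits), softmax(small_logits))
small_logits = kl_step_on_small_logits(large_logits, small_logits)
kl_after = kl_divergence(softmax(large_logits), softmax(small_logits))
```

After the step, `kl_after` is smaller than `kl_before`: the small model's output distribution has moved toward the large model's. In a real system the same gradient would flow back through the small network's weights.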

Yifan ended his presentation by stressing that robustness is still a major research area for DNNs. He supported this with some experimental results showing that DNNs are not necessarily more robust to variation than GMMs (overall performance is better, but the relative effects of different types of variation are similar).