Interspeech had a great selection of speakers this year. My thoughts on the individual talks are below. Note that Interspeech had five keynotes, but I only describe three; unfortunately, I was unable to attend the other two. Their absence here does not indicate a lack of quality or interest, just my inability to wake up.

Anne Cutler, ISCA Medalist
Learning about Speech

As the newest ISCA medalist, Anne had the privilege of giving the first talk of the conference. She began by advertising a large number of PhD and postdoc positions (though they do not appear to be posted yet), thanks to a recent large grant.

Much of her work deals with infant language learning. She played a recording made from inside the womb of a mother—I am not sure how they get a microphone in there—and some aspects were remarkably clear. While the speech itself was not intelligible, the gender of the speaker and the prosody were obvious. Infants begin speech learning in the womb, specifically during the final trimester.

When infants are born, they already have a preference for known speakers and for languages similar to their native tongue. Contrary to popular conception, infants can actually cope with continuous speech at a very young age (10 months): they are able to pick out and recognize words presented in a continuous utterance. Even more interesting, the type of language used when speaking to infants (Motherese) appears to be controlled by feedback from the infant. Your child controls your speech.

She also presented some interesting results demonstrating how learning can shape perception. For instance, babies seem to be able to discriminate between speakers only when those speakers are using the babies' native language.

The adaptation experiments were also interesting. They were based on the distinction between /s/ and /f/. Listeners first heard words in which the /s/ phone had been replaced with a phone closer to /f/—and vice versa—but only in words where swapping /s/ and /f/ does not produce a confusable pair. When the listeners were then presented with confusable words, their perception of the phone had measurably shifted. Humans can quickly adapt their speech perception to cope with new speakers, a task that is still very difficult for machines.
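As a rough analogy (my own sketch, not the actual experiment), you can think of the listener's /s/–/f/ decision as a threshold on a single acoustic cue, with a few lexically disambiguated tokens pulling that threshold toward the new speaker. All of the numbers below are arbitrary and purely illustrative.

```python
import numpy as np

# Toy model: classify /s/ vs /f/ by thresholding one hypothetical cue
# (say, frication centroid, in arbitrary units).
s_cue, f_cue = 7.0, 4.0            # illustrative mean cue values per category
boundary = (s_cue + f_cue) / 2     # initial category boundary: 5.5

# Exposure phase: ambiguous tokens (cue ~ 5.0) heard in words where the
# lexicon forces an /s/ reading, so the /s/ category absorbs them.
exposure = np.full(5, 5.0)
s_mean = np.mean(np.concatenate([[s_cue], exposure]))

# The boundary shifts toward /f/, so more tokens are now heard as /s/.
adapted_boundary = (s_mean + f_cue) / 2
print(boundary, adapted_boundary)
```

The point of the toy is only that a handful of disambiguated examples is enough to move the decision boundary, which mirrors how quickly listeners adapt in the experiment.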

Her final comment contrasted human perception with machine learning. I believe her point was that human beings are perceptual animals, highly motivated to learn in this setting; perhaps our machine learning algorithms lack this motivation. I am not sure whether the comment was tongue-in-cheek or serious, but if she was implying that our objective functions are not ideal, there is some truth to that.

Lori Lamel

Lori began by discussing multilingual models. Multilingual modeling was mostly a string of failures until recently; some labs are now starting to see improvements with multilingual bottleneck features and even hybrid acoustic models. The current resurgence may be due in part to the development of standardized corpora. The number of datasets available in a variety of languages continues to increase, and publishing on standard datasets is always easier than publishing on your own private data. There is still a heavy reliance on annotated language resources, which require a large amount of human effort.

The focus then changed to unsupervised acoustic model training, which has been a focus at Limsi for more than ten years now. She showed some examples of why pronunciation models are crucial: with incorrect pronunciations, the alignments will be incorrect, and these errors carry over to the acoustic model, leading to poor models for certain contexts. I can see how this is a problem in the supervised case, so I understand why it may be even worse in the unsupervised case.

A brief overview of the IARPA Babel project followed. One point was to highlight how much worse performance on Babel is compared to previous CTS (conversational telephone speech) work. More interesting were her language analysis results. One example was the breathiness at the end of French words, a property that appeared relatively recently and has slowly increased over the years. Their analysis of French broadcast news data confirmed this.

Li Deng
Achievements and Challenges of Deep Learning – From Speech Analysis And Recognition To Language And Multimodal Processing

This was a very dense talk. Luckily Li Deng is a very engaging speaker, so the audience tried to keep up. His first point was that too many people equate Deep Learning with Deep Neural Networks. They are not the same thing; a DNN is only one kind of deep model. Deep generative models also exist. Li referred to the large amount of work presented at this past ICML.

Another detailed slide covered the differences between generative models and neural networks, focusing on their strengths and weaknesses. One of the obvious—and frequently discussed—advantages of generative models is their interpretability; complaints that you cannot know what a neural network is doing are common. He highlighted this advantage not only because it is intellectually satisfying, but also because it means you can more easily add explicit knowledge constraints to the model. Another advantage is the ability of a generative model to handle uncertainty.
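To illustrate those two advantages (this is my own hedged example, not one from the talk), here is a one-feature Gaussian class-conditional model. Its parameters are directly readable as means, variances, and priors, and it reports uncertainty as a posterior probability rather than a bare label. The class names and numbers are invented for illustration.

```python
import math

# Hypothetical two-class generative model: each class is a 1-D Gaussian.
# Parameters are interpretable: (mean, variance, prior) per class.
params = {"A": (0.0, 1.0, 0.5), "B": (3.0, 1.0, 0.5)}

def posterior(x):
    """Bayes rule: p(class | x) from class-conditional likelihoods."""
    lik = {c: p * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
           for c, (m, v, p) in params.items()}
    z = sum(lik.values())           # evidence p(x)
    return {c: l / z for c, l in lik.items()}

print(posterior(1.5))   # midway between the class means: roughly 50/50
print(posterior(-1.0))  # far from class B: confidently class A
```

A discriminative network would give you a score for the same inputs, but here you can point at exactly which parameter encodes which assumption, which is what makes adding explicit knowledge constraints easier.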

A major cited advantage of neural networks is their ease of computation. This may seem counterintuitive considering how much effort it takes to train the models, but he was referring to the fact that neural networks basically require performing the same operation billions of times. This is something that can be parallelized, and GPUs can greatly speed up the entire process.
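To make that "same operation billions of times" point concrete, here is a minimal sketch (mine, not from the talk) of a feed-forward pass: it is nothing but repeated matrix multiplies and an elementwise nonlinearity, which is exactly the workload that vectorizes and parallelizes well. The layer sizes are arbitrary stand-ins, not from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.standard_normal((128, 440))            # a minibatch of feature vectors
weights = [rng.standard_normal((440, 1024)) * 0.01,    # hidden layer 1
           rng.standard_normal((1024, 1024)) * 0.01,   # hidden layer 2
           rng.standard_normal((1024, 3000)) * 0.01]   # output layer

def forward(x, layers):
    # Each hidden layer is one large matmul followed by an elementwise ReLU.
    for w in layers[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ layers[-1]                          # linear outputs (pre-softmax)

out = forward(batch, weights)
print(out.shape)                                   # (128, 3000)
```

Every step above is dense linear algebra applied uniformly across the whole batch, so moving it to a GPU is essentially free parallelism; a generative model with heterogeneous inference steps does not map onto hardware this cleanly.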

It is difficult to give an overview of the talk as it was so detailed and he hit so many major points. If there was a main point, I think it was that there is more to deep learning than DNNs. Also, the combination of neural networks with generative models is an exciting and promising direction.