Toward unsupervised learning of speech representations

Speaker

Mirco Ravanelli - Mila, University of Montreal (Canada)

Abstract

The success of deep learning techniques strongly depends on the quality of the representations that are automatically discovered from data. These representations should capture intermediate concepts, features, or latent variables, and are commonly learned in a supervised way using large annotated corpora. Even though this is still the dominant paradigm, it has crucial limitations. Collecting large amounts of annotated examples, for instance, is very costly and time-consuming. Moreover, supervised representations are likely to be biased toward the task at hand, limiting their transferability to other problems and applications. A natural way to mitigate these issues is unsupervised learning. Unsupervised learning attempts to extract knowledge from unlabeled data, and can potentially discover representations that capture the underlying structure of such data. This modality, sometimes referred to as self-supervised learning, is gaining popularity within the computer vision community, while its application to high-dimensional, long temporal sequences such as speech remains more challenging.

In this presentation, I will summarize my recent efforts to learn general, robust, and transferable speech representations using unsupervised/self-supervised approaches. In particular, I will discuss a novel technique called Local Info Max (LIM), which learns speech representations by maximizing mutual information. I will then describe the recently proposed problem-agnostic speech encoder (PASE), which is learned by jointly solving multiple self-supervised tasks. PASE is a first step toward a universal neural speech encoder and has proven useful for a wide variety of applications, such as speech recognition, speaker identification, and emotion recognition.
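To make the mutual-information idea behind LIM concrete, here is a minimal toy sketch (not the talk's actual implementation) of the common binary cross-entropy lower bound on mutual information: a discriminator is trained to give high scores to embedding pairs drawn from chunks of the same utterance and low scores to pairs from different utterances. The `encode` function, the dot-product discriminator, and all dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned speech encoder (assumption: a linear map + tanh).
def encode(chunk, W):
    """Map a raw speech chunk to an embedding."""
    return np.tanh(W @ chunk)

def lim_loss(z_anchor, z_pos, z_neg):
    """Binary cross-entropy form of a mutual-information lower bound:
    pull together embeddings of chunks from the SAME utterance (positive pair)
    and push apart chunks from DIFFERENT utterances (negative pair)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    pos = sigmoid(z_anchor @ z_pos)   # dot-product discriminator score
    neg = sigmoid(z_anchor @ z_neg)
    eps = 1e-9                        # numerical safety for the logs
    return -(np.log(pos + eps) + np.log(1.0 - neg + eps))

dim_in, dim_z = 40, 16
W = rng.normal(scale=0.1, size=(dim_z, dim_in))

utt_a = rng.normal(size=dim_in)                    # one utterance
chunk1 = utt_a + 0.05 * rng.normal(size=dim_in)    # two noisy chunks of it
chunk2 = utt_a + 0.05 * rng.normal(size=dim_in)
utt_b = rng.normal(size=dim_in)                    # a different utterance

loss = lim_loss(encode(chunk1, W), encode(chunk2, W), encode(utt_b, W))
print(f"LIM-style loss: {loss:.4f}")
```

In a real system the encoder and discriminator would be neural networks trained by gradient descent on this loss; the sketch only shows how a single positive/negative pair contributes to the objective.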
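The multi-task idea behind PASE can be sketched in the same toy style: one shared encoder feeds several small self-supervised "workers", each regressing a different target derived from the signal itself, and their losses are summed so gradients from every task shape the shared representation. The two targets below (frame reconstruction and log energy) are illustrative assumptions, not the actual PASE task set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shared encoder (assumption: linear map + tanh).
def encoder(x, W_enc):
    return np.tanh(W_enc @ x)

# Each worker is a small linear head on top of the shared embedding.
def worker(z, W_k):
    return W_k @ z

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

dim_in, dim_z = 80, 32
W_enc = rng.normal(scale=0.1, size=(dim_z, dim_in))

x = rng.normal(size=dim_in)  # a toy speech frame
# Stand-in self-supervised targets, all computed from the signal itself:
targets = {
    "waveform": x,                                   # reconstruct the frame
    "log_energy": np.array([np.log(np.sum(x**2))]),  # predict frame energy
}
workers = {name: rng.normal(scale=0.1, size=(t.shape[0], dim_z))
           for name, t in targets.items()}

z = encoder(x, W_enc)
losses = {name: mse(worker(z, W_k), targets[name])
          for name, W_k in workers.items()}
total_loss = sum(losses.values())  # gradients would flow into the shared encoder
print(f"total multi-task loss: {total_loss:.3f}")
```

The design point is that no single pretext task has to be "the right one": solving many of them jointly pushes the shared encoder toward general-purpose features.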

About the Speaker

Mirco Ravanelli is currently a post-doc researcher at Mila (Université de Montréal), working under the supervision of Prof. Yoshua Bengio. His main research interests are deep learning, speech recognition, far-field speech recognition, robust acoustic scene analysis, cooperative learning, speaker recognition, and unsupervised learning. He is the author or co-author of more than 40 papers on these topics. He received his PhD (with cum laude distinction) from the University of Trento in December 2017. During his PhD, he focused on deep learning for distant speech recognition, with a particular emphasis on noise-robust deep neural architectures. He also contributed to the European DIRHA project and collaborated with international institutions such as Mila and the International Computer Science Institute (University of California, Berkeley). His research on cooperative neural networks received the IBM Best Paper Award at ICASSP 2017, while his PhD work received the “Best FBK Student Award 2017” and the “2016-2017 Best Doctorate in ICT” award at the University of Trento.