Panel Questions

At the end of the workshop we plan to have a panel with top speech, NLP, and deep learning scientists to discuss “interpretability and robustness in audio, speech, and language”. At the following link, you can submit your questions on the topic:

Workshop description

Domains of natural and spoken language processing have a rich history deeply rooted in information theory, statistics, digital signal processing, and machine learning. With the rapid rise of deep learning (the ‘‘deep learning revolution’’), many of these systematic approaches have been replaced by variants of deep neural methods that often achieve unprecedented performance levels in many fields. With more and more of the spoken language processing pipeline being replaced by sophisticated neural layers, feature extraction, adaptation, and noise robustness are learned inherently within the network. More recently, end-to-end frameworks that learn a mapping from speech (audio) to target labels (words, phones, graphemes, sub-word units, etc.) have become increasingly popular across speech processing, in tasks ranging from speech recognition, speaker identification, language/dialect identification, multilingual speech processing, code switching, and natural language processing to speech synthesis and much more. A key aspect behind the success of deep learning lies in the discovered low- and high-level representations, which can potentially capture relevant underlying structure in the training data. In the NLP domain, for instance, researchers have mapped word and sentence embeddings to semantic and syntactic similarity and argued that the models capture latent representations of meaning.

Nevertheless, some recent works on adversarial examples have shown that it is possible to easily fool a neural network (such as a speech recognizer or a speaker verification system) by just adding a small amount of specially constructed noise. Such a remarkable sensitivity to adversarial attacks highlights how superficial the discovered representations can be, raising serious concerns about the actual robustness, security, and interpretability of modern deep neural networks. This weakness naturally leads researchers to ask crucial questions about what these models are really learning, how we can interpret what they have learned, and how the representations provided by current neural networks can be revealed or explained in a fashion that allows modeling power to be enhanced further. These open questions have recently raised interest in the interpretability of deep models, as witnessed by the numerous works recently published on this topic at all the major machine learning conferences. Moreover, workshops at Neural Information Processing Systems 2016 and 2017 and at Interspeech 2017 have promoted research and discussion around this important issue.
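Such attacks typically derive the perturbation from the model's own gradients. As a minimal illustration (a toy logistic classifier standing in for a full speech model, with a deliberately exaggerated step size), the fast gradient sign method can be sketched as:

```python
import numpy as np

def fgsm_perturb(x, w, b, y_true, eps):
    """Fast gradient sign method for a logistic classifier p(y=1|x) = sigmoid(w.x + b).
    Moves x by eps in the direction that most increases the loss for the true label."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # model confidence for class 1
    grad_x = (p - y_true) * w                # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)         # worst-case perturbation of size eps

# Toy demo: a confidently classified input is flipped by the crafted perturbation.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, -1.0])                    # w.x + b = 3  -> class 1
x_adv = fgsm_perturb(x, w, b, y_true=1.0, eps=2.0)
print((w @ x + b) > 0, (w @ x_adv + b) > 0)  # True False
```

Real audio attacks apply the same idea to deep acoustic models, with the perturbation scaled small enough to be imperceptible to a listener.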

With our initiative, we wish to further foster progress on the interpretability and robustness of modern deep learning techniques, with a particular focus on audio, speech, and NLP technologies. The workshop will also analyze the connection between deep learning and models developed earlier for machine learning, linguistic analysis, signal processing, and speech recognition. In this way we hope to encourage a discussion amongst experts and practitioners in these areas, with the expectation of better understanding these models and building upon the existing collective expertise.

The workshop will feature invited talks, panel discussions, as well as oral and poster contributed presentations.
We welcome papers that specifically address one or more of the leading questions listed below:

Is there a theoretical/linguistic motivation/analysis that can explain how networks encapsulate the structure of the training data they learn from?

Does the visualization of this information (MDS, t-SNE) offer any insights for creating a better model?

How can we design more powerful networks with simpler architectures?

How can we exploit adversarial examples to improve system robustness?

Do alternative methods offer any complementary modeling power to what the networks can memorize?

Can we explain the path of inference?

How do we analyze the data requirements for a given model? How does multilingual data improve learning power?
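As an illustration of the visualization question above, classical MDS projects learned embeddings to 2-D while preserving pairwise distances; t-SNE instead optimizes a neighborhood-preserving objective. A minimal numpy sketch, assuming an embedding matrix has already been extracted from a model, is:

```python
import numpy as np

def classical_mds(X, k=2):
    """Project the row vectors of X to k dimensions, preserving pairwise
    Euclidean distances as well as possible (classical MDS / PCoA)."""
    # Squared Euclidean distance matrix between all pairs of embeddings.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Double-center: B = -1/2 * J D2 J, with J the centering matrix.
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # The top-k eigenpairs of B give the low-dimensional coordinates.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Toy demo: 3-D "embeddings" that actually lie on a 2-D plane are recovered
# with their pairwise distances intact.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 3))  # rank-2 data in 3-D
Y = classical_mds(X, k=2)
```

Scatter-plotting `Y` (colored by phone, speaker, or word class) is the usual way such projections are inspected for structure.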

Workshop/NIPS Tickets

The NIPS tickets sold out in a few minutes this year. As IRASL workshop organizers, we have a pool of reserved tickets (valid for both the workshops and the main conference) to assign to accepted papers (a maximum of one reserved ticket per accepted paper).

Call for papers

We will accept both short paper (4 pages) and long paper (8 pages) submissions.
A few papers may be selected as oral presentations, and the other accepted papers
will be presented in a poster session. Accepted contributions will be made
available on the workshop website.

Submissions are double-blind, peer-reviewed on OpenReview, open to already
published works, and should follow the NIPS 2018 style guide.

Even though we would prefer original works, double submissions are accepted. Papers submitted to other conferences or workshops can thus be submitted, but the authors must notify the organizers (by email to irasl@googlegroups.com).
Preprints already published on arXiv, ResearchGate, Academia, or similar repositories can be submitted as well. Accepted papers will be presented at the workshop as an oral or poster presentation. Even though official proceedings will not be released, the accepted papers will be made available on the workshop website.
Double-submitted papers will be published on the website only if no conflicts with other conferences arise.

Venue

Planned Activities

The workshop will consist of a single day, with time slots for invited talks, contributed talks, poster presentations, and a panel discussion. In particular, the workshop will feature a number of invited talks by researchers from academia and industry, representing various points of view on the problem of interpretability and robustness in modern models. The panel discussion will allow participants to exchange ideas and will encourage debate. The organizers take responsibility for publicizing the workshop and managing the reviewing process for the contributed papers.

Invited Speakers

Abstract: “In machine learning often a tradeoff must be made between accuracy and intelligibility: the most accurate models (deep nets, boosted trees and random forests) usually are not very intelligible, and the most intelligible models (logistic regression, small trees and decision lists) usually are less accurate. This tradeoff limits the accuracy of models that can be safely deployed in mission-critical applications such as healthcare where being able to understand, validate, edit, and ultimately trust a learned model is important. In this talk I’ll present a case study where intelligibility is critical to uncover surprising patterns in the data that would have made deploying a black-box model risky. I’ll also show how distillation with intelligible models can be used to understand what is learned inside a black-box model such as a deep net, and show a movie of what a deep net learns as it trains and then begins to overfit.”

Hynek Hermansky (Professor at Johns Hopkins University and Director of the Center for Language and Speech Processing)

Title: “Learning - not just for machines anymore”

Abstract: “It is often argued that in processing of sensory signals such as speech, engineering should apply knowledge of properties of human perception - both have the same goal of getting information from the signal. We show on examples from speech technology that perceptual research can also learn from advances in technology. After all, speech evolved to be heard and properties of hearing are imprinted on speech. Subsequently, engineering optimizations of speech technology often yield human-like processing strategies. Further, fundamental difficulties that speech engineering still faces could indicate gaps in our current understanding of the human speech communication process, suggesting directions of further inquiries.”

Title: “Observations in Joint Learning of Features and Classifiers for Speech and Language”

Abstract: “In relation to launching the Google @home product we were faced with the problem of far-field speech recognition. That setting gives rise to problems related to reverberant and noisy speech which degrades speech recognition performance. A common approach to address some of these detrimental effects is to use multi-channel processing. This processing is generally seen as an “enhancement” step prior to ASR and is developed and optimized as a separate component of the overall system. In our work, we integrated this component into the neural network that is tasked with the speech recognition classification task. This allows for a joint optimization of the enhancement and recognition components. And given that the structure of the input layer of the network is based on the “classical” structure of the enhancement component, it allows us to interpret what type of representation the network learned. We will show that in some cases this learned representation appears to mimic what was discovered by previous research and in some cases, the learned representation seems “esoteric”. The second part of this talk will focus on an end-to-end letter to sound model for Japanese. Japanese uses a complex orthography where the pronunciation of the Chinese characters, which are a part of the script, varies depending on the context. The fact that Japanese (like Chinese and Korean) does not explicitly mark word boundaries in the orthography further complicates this mapping. We show results of an end-to-end, encoder/decoder model structure to learn the letter-to-sound relationship. These systems are trained from speech data coming through our systems. This shows that such models are capable of learning the mapping (with accuracies exceeding 90% for a number of model topologies). Observing the learned representation and attention distributions for various architectures provides some insight as to what cues the model uses to learn the relationship. But it also shows that interpretation remains limited since the joint optimization of encoder and decoder components allows the model the freedom to learn implicit representations that are not directly amenable to interpretation.”

Abstract: “For decades, the general architecture of the classical state-of-the-art statistical approach to automatic speech recognition (ASR) has not been significantly challenged. The classical statistical approach to ASR is based on Bayes decision rule, a separation of acoustic and language modeling, hidden Markov modeling (HMM), and a search organisation based on dynamic programming and hypothesis pruning methods. Even when deep neural networks started to considerably boost ASR performance, the general architecture of state-of-the-art ASR systems was not altered considerably. The hybrid DNN/HMM approach, together with recurrent LSTM neural network language modeling currently marks the state-of-the-art on many tasks covering a large range of training set sizes. However, currently more and more alternative approaches occur, moving gradually towards so-called end-to-end approaches. By and by, these novel end-to-end approaches replace explicit time alignment modeling and dedicated search space organisation by more implicit, integrated neural-network based representations, also dropping the separation between acoustic and language modeling, showing promising results, especially for large training sets. In this presentation, an overview of current approaches to ASR will be given, including variations of both HMM-based and end-to-end modeling. Approaches will be discussed w.r.t. their modeling, their performance against available training data, their search space complexity and control, as well as potential modes of comparative analysis.”

Title: “Learning from the move to neural machine translation at Google”

Abstract: “At Google we replaced over the last few years the phrase-based machine translation system by GNMT, the Google Neural Machine Translation system. This talk will describe some of the history of this transition and explain the challenges we faced. As part of the new system we developed and used many features that hadn’t been used before in production-scale translation systems: A large-scale sequence-to-sequence model with attention, sub-word units instead of a full dictionary to address out-of-vocabulary handling and improve translation accuracy, special hardware to improve inference speed, handling of many language pairs in a single model and other techniques that a) made it possible to launch the system at all and b) to significantly improve on previous production-level accuracy. Some of the techniques we used are now standard in many translation systems – we’d like to highlight some of the remaining challenges in interpretability, robustness and possible solutions to them. “

Abstract: “Neural encoder-decoder models have had significant empirical success in text generation, but there remain major unaddressed issues that make them difficult to apply to real problems. Encoder-decoders are largely (a) uninterpretable in their errors, and (b) difficult to control in areas such as phrasing or content. In this talk I will argue that combining probabilistic modeling with deep learning can help address some of these issues without giving up their advantages. In particular I will present a method for learning discrete latent templates along with generation. This approach remains deep and end-to-end, achieves comparably good results, and exposes internal model decisions. I will end by discussing some related work on successes and challenges of visualization for interpreting encoder-decoder models.”

Title: “Good and bad assumptions in model design and interpretability”

Abstract: “The seduction of large neural nets is that one simply has to throw input data into a big network and magic comes out the other end. If the output is not magic enough, just add more layers. This simple approach works just well enough that it can lure us into a few bad assumptions, which we’ll discuss in this talk. One is that learning everything end-to-end is always best. We’ll look at an example where it isn’t. Another is that careful manual architecture design is useless because either one big stack of layers will work just fine, or if it doesn’t, we should just give up and use random architecture search and a bunch of computers. But perhaps we just need better tools and mental models to analyze the architectures we’re building; in this talk we’ll talk about one simple such tool. A final assumption is that as our models become large, they become inscrutable. This may turn out to be true for large models, but attempts at understanding persist, and in this talk, we’ll look at how the assumptions we put into our methods of interpretability color the results.”

Title: BiLSTM-FSTs and Neural FSTs
Abstract: How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? Finite-state transducers (FSTs) are a well-understood formalism for scoring such edit sequences, which represent latent hard monotonic alignments. I will discuss options for combining this architecture with neural networks. The BiLSTM-FST scores each edit in its full input context, which preserves the ability to do exact inference over the aligned outputs using dynamic programming. The Neural FST scores each edit sequence using an LSTM, which requires approximate inference via methods such as beam search or particle smoothing. Finally, I will sketch how to use the language of regular expressions to specify not only the legal edit sequences but also how to present them to the LSTMs.
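The exact inference over monotonic alignments mentioned in the abstract is, at its core, an edit-distance-style dynamic program. A minimal sketch with unit costs (in a BiLSTM-FST, each edit would instead carry a learned, context-dependent score) is:

```python
def edit_distance(src, tgt):
    """Minimum number of insertions, deletions, and substitutions turning
    src into tgt, computed with the classic dynamic program. In an FST,
    each edit would carry a learned score instead of a unit cost."""
    n, m = len(src), len(tgt)
    # dp[i][j] = cost of the best monotonic alignment of src[:i] with tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all of src[:i]
    for j in range(m + 1):
        dp[0][j] = j                      # insert all of tgt[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i-1][j-1] + (src[i-1] != tgt[j-1])   # copy or substitute
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    return dp[n][m]

print(edit_distance("sing", "sang"))   # → 1 (one substitution)
```

Because the recurrence only ever looks one position back in each string, the same table structure supports exact inference even when the per-edit costs come from a network that has read the whole input.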

Organization committee

Mirco Ravanelli:

Mirco is a post-doc researcher at the Montreal Institute for Learning Algorithms (MILA) of the Université de Montréal. His main research interests are deep learning, speech recognition, far-field speech recognition, robust acoustic scene analysis, cooperative learning (ICASSP 2017 IBM best paper award), and unsupervised learning. He received his PhD (with cum laude distinction) from the University of Trento in December 2017. In his PhD thesis, he focused on deep learning for distant speech recognition, with a particular emphasis on noise-robust deep neural network architectures.

Dmitriy Serdyuk:

Dmitriy is a PhD student and researcher at the Montreal Institute for Learning Algorithms (MILA) of the Université de Montréal. His research interests are deep learning, understanding of sequence model training, and robust speech recognition in unseen environments. Dmitriy has worked on end-to-end speech recognition with attention networks, recurrent neural network regularization, and domain adaptation for noisy speech recognition.

Ehsan Variani:

Ehsan is a Senior Research Scientist at Google Inc. His main research interests are statistical machine learning and information theory with the focus on speech and language recognition. He has contributed to the robustness and large-scale training of Google voice-search and Google Home. Furthermore, he is actively publishing in different speech recognition conferences, including ICASSP and Interspeech. He has done extensive research on comparison of traditional speech recognition modules and deep neural network based models with the focus on interpretability and robustness.

Bhuvana Ramabhadran:

Bhuvana (IEEE Fellow, 2017, ISCA Fellow 2017) currently leads a team of researchers at Google (Senior Staff Research Scientist), focusing on multilingual speech recognition and synthesis. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM’s world wide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She was the elected Chair of the IEEE SLTC (2014–2016), Area Chair for ICASSP (2011–2018) and Interspeech (2012–2016), was on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), and is currently an ISCA board member. She is a Fellow of ISCA and an adjunct professor at Columbia University. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on understanding neural networks and finding alternate models that can beat or perform as well as deep networks.