Q: What was the research about?
A: In 2008, much research in Natural Language Processing (NLP) relied on shallow (rather than deep) learning with models like Support Vector Machines (SVMs), with a strong focus on hand-engineered features that were fed into such systems. Tasks such as part-of-speech tagging, chunking and parsing, word-sense disambiguation, and semantic-role labeling were trained separately and pipelined into larger systems, which can lead to a cascade of errors, with one system causing errors in the next.

Our paper defined a contrasting approach for Natural Language Processing: a unified architecture, achieved by training a deep neural network in which all of these tasks are integrated into a single system and trained jointly. End-to-end learning can help a system avoid cascading errors, because each module is trained to deal with the possibly noisy input from the other modules.
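To make the idea of joint training concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the architecture from the paper: the two tasks, their labels, the vocabulary, the embedding size, and the names `make_head`, `predict`, and `sgd_step` are all invented for illustration. What it shows is the sharing pattern: every task reads from one word-embedding table, and a gradient step on any task also updates that shared table.

```python
import math
import random

random.seed(0)  # reproducible toy run

DIM = 4  # embedding size, chosen arbitrarily for illustration
VOCAB = ["the", "cat", "sat", "on", "mat"]

# Shared lookup table: one trainable vector per word, used by every task.
embeddings = {w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in VOCAB}

def make_head(num_labels):
    """Task-specific linear layer: one weight vector per label."""
    return [[random.uniform(-0.1, 0.1) for _ in range(DIM)]
            for _ in range(num_labels)]

def predict(head, word):
    """Softmax over label scores computed from the shared embedding."""
    vec = embeddings[word]
    scores = [sum(w * x for w, x in zip(row, vec)) for row in head]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(head, word, label, lr=0.2):
    """One cross-entropy gradient step on the head AND the shared embedding."""
    vec = embeddings[word]
    probs = predict(head, word)
    grads = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # The embedding gradient is computed before the head weights are changed.
    vec_grad = [sum(g * row[i] for g, row in zip(grads, head))
                for i in range(DIM)]
    for g, row in zip(grads, head):
        for i in range(DIM):
            row[i] -= lr * g * vec[i]
    for i in range(DIM):
        vec[i] -= lr * vec_grad[i]

# Two made-up tasks over the same words (labels are illustrative only).
pos_data = [("the", 0), ("cat", 1), ("sat", 2), ("on", 3), ("mat", 1)]
content_data = [("the", 0), ("cat", 1), ("sat", 1), ("on", 0), ("mat", 1)]
tasks = {"pos": (make_head(4), pos_data),
         "content": (make_head(2), content_data)}

def accuracy(head, data):
    """Fraction of examples whose most probable label is the correct one."""
    correct = 0
    for word, label in data:
        probs = predict(head, word)
        correct += probs.index(max(probs)) == label
    return correct / len(data)

# Joint training: pick a task, pick one of its examples, take a step.
for _ in range(2000):
    head, data = tasks[random.choice(list(tasks))]
    word, label = random.choice(data)
    sgd_step(head, word, label)

for name, (head, data) in tasks.items():
    print(name, accuracy(head, data))
```

Because both heads read the same `embeddings` table, anything one task learns about a word is immediately visible to the other, which is the contrast with a pipeline of separately trained systems.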

Overall, the paper also made a number of key contributions that have been influential over the years.

A: We were sharing an office and wanted to try something new, and we both had a long-term dream of being able to talk to our computers. So we decided to dig into Natural Language Processing (NLP) research, even though neither of us had experience in this specific area. Perhaps our lack of experience was actually a good thing, as we didn’t do the usual things NLP researchers were doing at the time.

Ronan had neural network (NN) experience, and he was successfully brainwashing Jason to move from support vector machines (SVMs) to neural networks. We were also influenced by existing work, particularly Yoshua Bengio’s work on language modeling with neural networks and Yann LeCun’s work on convolutional neural networks (CNNs).

At one point there was too much noise in the lab, so we went outside to hear each other and had our meeting there. We generated our best ideas on our walks around the garden.

Q: What happened when the paper was originally published, and how was it received in the community then?

A: The work was not well received by everyone at the time. An earlier paper we wrote for the Association for Computational Linguistics (ACL) conference in 2007, “Fast Semantic Extraction Using a Novel Neural Network Architecture,” got very little attention. In fact, it got only 68 citations in ten years. Our more complete follow-up to the ICML paper, “NLP almost from scratch,” was written in 2009, and it took two years to get published. We could have had all of these ideas and no one would ever have known about them.

Several NLP researchers argued with us at the end of our ACL presentation. One issue was one of the metrics we used. This was our fault, as we did not have a good understanding of the right metrics and datasets in NLP, but we corrected that in subsequent work. That was one reason not to like it.

But more importantly, they didn’t like it because we were presenting a completely new approach, and that meant that the approach they had invested so much time in (feature engineering) might not be optimal. Our work was too different. We were saying you don’t have to hand-engineer features, but rather design a model. This has now become the main paradigm: designing models, not features. It took a long time, but now neural networks are accepted almost everywhere.

Not everyone was against our work, of course, and there was more traction at ICML than at ACL. Some researchers were definitely interested; in particular, they were excited about the word embeddings, which were discussed during the questions after the talk.

Q: How has the work been built upon? Is there any impact of this work in products we see today?

A: At the time there was almost no neural network research in Natural Language Processing, and there were few such papers at machine learning conferences like ICML either. This paper really influenced the use of neural networks in NLP. Several techniques in the paper were influential: the use of word embeddings and how they are trained, the use of auxiliary tasks and multitasking, the use of convolutional neural nets in NLP, and even (which is less well known) a kind of attention approach. Since then, many works have taken those directions further, and they are all still important today, when neural networks are the dominant approach. For example, Facebook’s recent state-of-the-art machine translation and summarization tool Fairseq uses convolutional neural networks for language, while AllenNLP’s ELMo learns improved word embeddings via a neural net language model and applies them to a large number of NLP tasks.
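As an illustration of one technique named above, convolution over word embeddings, here is a small sketch in plain Python. It is a toy and not the network from the paper: the function name `conv_max_over_time`, the window size, the zero padding, and all numbers are assumptions made for the example. It slides each filter over windows of consecutive word vectors and keeps the largest response per filter, so sentences of any length map to a feature vector of fixed size.

```python
def conv_max_over_time(word_vectors, filters, window=3):
    """Convolve filters over a sentence, then keep the max response per filter.

    word_vectors: list of equal-length vectors, one per word.
    filters: list of weight lists, each of length window * vector size.
    Assumes an odd window so zero padding keeps one window per word.
    """
    dim = len(word_vectors[0])
    pad = [[0.0] * dim] * (window // 2)  # zero vectors at both sentence ends
    padded = pad + word_vectors + pad
    features = []
    for filt in filters:
        responses = []
        for start in range(len(padded) - window + 1):
            # Concatenate the vectors in this window into one flat input.
            flat = [x for vec in padded[start:start + window] for x in vec]
            responses.append(sum(w * x for w, x in zip(filt, flat)))
        features.append(max(responses))  # max over all positions in the sentence
    return features

# Two sentences of different lengths, with 2-dimensional word vectors.
sentence_a = [[0.1, 0.3], [0.7, -0.2]]
sentence_b = [[0.1, 0.3], [0.7, -0.2], [0.0, 0.5], [0.4, 0.4], [-0.6, 0.1]]

# Two filters, each with window * dim = 6 weights.
filters = [[1.0] * 6, [0.5, -0.5] * 3]

# Both sentences yield one feature per filter, regardless of their length.
print(len(conv_max_over_time(sentence_a, filters)),
      len(conv_max_over_time(sentence_b, filters)))
```

The fixed-size output is what lets a single classifier layer sit on top of sentences of varying length.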

Q: Were there any surprises along the way?

A: The language model we published was a bit of a surprise. It came out of an experiment we had almost forgotten about, which we had left training on a machine for two months. Often in machine learning, if you see that an experiment isn’t working after two weeks, you stop. We were close to abandoning it. In fact, it took us six months to make the language model work properly. On modern machines it would train much faster, of course.

Q: What is your current focus?

A: We continue to be excited about pursuing the dream of being able to talk to our computers. We are looking at many other missing aspects that are “the next level” of language understanding: deeper semantics, memory, reasoning, end-to-end speech recognition and generation, and real end-goal language tasks like dialogue and question answering, rather than intermediate tasks like part-of-speech tagging or parsing. Dialogue also opens up the possibility of learning more naturally than with supervised learning alone, for example by learning during dialogue through talking to people and asking questions. Jason works on the dialogue research platform ParlAI and is currently organizing a dialogue competition, ConvAI2, at NIPS 2018. Ronan works on wav2letter, an end-to-end lightweight speech recognition system.