In the 3C group and eClub, I now engage in some fundamental research in Natural Language Understanding. I stepped back from my daily Question Answering churn (building the YodaQA system) and took a little look around and decided the right thing to focus for a while are the fundamentals of the NLP field so that our machine learning works better and makes more sense. Warning: We’ll use some scientific jargon in this one post.

So, in the first months of 2016 I focused huge chunk of my research on deep learning of natural language. That means neural networks used on unstructured text, in various forms, shapes and goals. I have set some audacious goals for myself, fell short in some aspects but still made some good progress hopefully. Here’s the deal – a lot of the current research is about processing a single sentence, maybe to classify its sentiment or translate it or generate other sentences. But I have noticed that recently, I have seen many problems that are about scoring a pair of two sentences. So I decided to look into that and try to build something that (A) works better, (B) actually has an API and we can use it anywhere for anything.

My original goal was to build awesome new neural network architectures that will turn the field on its head. But I noticed that the field is a bit of a mess – there is a lot of tasks that are about the same thing, but very little cross-talk between them. So you get a paper that improves the task of Answer Sentence Selection, but could the models do better on the Ubuntu Dialogue task then, or on Paraphrasing datasets? Who knows! Meanwhile, each dataset has its own format and a lot of time is spent only in writing the adapter code for it. Training protocols (from objectives to segmentation to embedding preinitializations) are inconsistent, and some datasets need a lot of improvement. Well, my goal turned to sorting out the field, cross-check the same models on many tasks and provide a better entry point for others than I had.

Software: Getting a few students of the 3C group together, we have created the dataset-sts platform for all tasks and models that are about comparing two sentences using deep learning. We have a pretty good coverage (of both tasks and models), and more brewing in some side branches. It’s in Python and uses the awesome Keras deep learning library.

We have a lofty goal of building an universal text comprehension model, a sort of black box that eats your sentences and produces embeddings that correspond to their meaning, which you can use for whatever task you need to do. Long way to go, but we have found that a simple neural model trained on very large data is doing pretty good in this exact setting, and even if applied to tasks and data that look very different from the original. Maybe we are on to something.

It’s hard to compare neural models because if you train a model 16 times with the same data, the result will always be somewhat different. Not a big deal with large test datasets, but a very big deal with small test datasets which are still popular in the research community. Almost all papers ignore this! If you look at evolution of performance of models in some areas like Answer Sentence Selection, we have found that most differences over the last year are deep below per-train variance we see.

Please take a look, and tell us what you think! We’ll shortly cover a follow-up paper here that we also already posted, and we plan to continue the work by improving our task and model coverage further, fixing a few issues with our training process and experimenting with some novel neural network ideas.

In some of the followup posts, we will also talk about how the abstract-sounding research connects with some very practical technology that is materializing in eClub.