This course covers a wide range of tasks in Natural Language Processing, from basic to advanced: sentiment analysis, summarization, and dialogue state tracking, to name a few. Upon completing it, you will be able to recognize NLP tasks in your day-to-day work, propose approaches, and judge which techniques are likely to work well. The final project is devoted to one of the hottest topics in today's NLP: you will build your own conversational chat-bot that assists with search on the StackOverflow website. The project is based on the practical assignments of the course, which give you hands-on experience with tasks such as text classification, named entity recognition, and duplicate detection.
Throughout the lectures, we aim to find a balance between traditional and deep learning techniques in NLP and cover them in parallel. For example, we will discuss word alignment models in machine translation and see how similar they are to the attention mechanism in encoder-decoder neural networks. Core techniques are not treated as black boxes. On the contrary, you will get an in-depth understanding of what is happening inside. To succeed in that, we expect you to be familiar with the basics of linear algebra and probability theory, the machine learning setup, and deep neural networks. Some materials are based on one-month-old papers and introduce you to the very state of the art in NLP research.
Do you have technical problems? Write to us: coursera@hse.ru

MV

Definitely best course in the Specialization! Lecturers, projects and forum - everything is super organized. Only StarSpace was pain in the ass, but I managed :)

TL

Jul 08, 2018


Anna is a great instructor. She can explain the concept and mathematical formulas in a clear way. The design of assignment is both interesting and practical.

From the lesson

Dialog systems

This week we will overview so-called task-oriented dialog systems like Apple Siri or Amazon Alexa. We will look in detail at the main building blocks of such systems, namely Natural Language Understanding (NLU) and the Dialog Manager (DM). We hope this week will encourage you to build your own dialog system as a final project!

Instructors

Anna Potapenko

Researcher

Alexey Zobnin

Associate professor

Anna Kozlova

Team Lead

Sergey Yudin

Analyst-developer

Andrei Zimovnov

Senior Lecturer

Transcript

Hi. In this video, we will talk about the intent classifier and the slot tagger in depth. Let's start with the intent classifier. How can we build it? You can use any model on bag-of-words features with n-grams and TF-IDF, that is, the classical approaches of text mining. Or you can use a recurrent architecture with LSTM cells, GRU cells, or any other. You can also use convolutional networks, namely the 1D convolutions that we overviewed in week one. Studies actually show that CNNs can perform better on datasets where the task is essentially key phrase recognition, which can happen in some sentiment detection datasets, for example. So it makes sense to try an RNN, a CNN, or any classical approach as a baseline and choose what works best.
Then there comes the slot tagger, and this is a bit more difficult task. It can use handcrafted rules like regular expressions, so that when I say, for example, "take me to Starbucks", you know that whatever comes after the phrase "take me to" is most definitely the "to" slot, or some other slot of your intent. But that approach doesn't scale, because natural language has a huge variation in how we can express the same thing. So it makes sense to do something data-driven here. You can use conditional random fields, which is a rather classical approach, or you can use an RNN sequence-to-sequence model, where you have an encoder and a decoder. A funny fact is that you can use convolutional networks for a sequence-to-sequence task as well, and you can add attention to any of these sequence-to-sequence models.
In the next slide, I want to overview the convolutional sequence-to-sequence model, because it is gaining popularity: it works faster, and sometimes it even beats RNNs in some tasks. Okay, let's see how convolutional networks can be used to model sequences. Let's say we have an input sequence which is padding, padding, then start of sequence, and three German words.
What we actually want to do, let's say, is solve the task of language modeling: when we see each new token, we need to predict which token comes next. Usually we use recurrent architectures for this, but let's see how we can use convolutions. Let's say that when we generate the next token, we actually care only about the last three tokens in the sequence that we have seen. If we assume that, then we can use a convolution to aggregate the information about the last three tokens, and this is the blue triangle here. We get some filters in the output. Let's take half of those filters as is, pass the second half through a sigmoid activation function, and then take an element-wise multiplication of these two halves. What we get is a Gated Linear Unit: the sigmoid gate is the non-linear part, so the whole unit becomes non-linear.
So this is how we look at the context that we had before and predict some hidden state or, let's say, the next token. That triangle is actually a convolutional filter, and you can slide it across the sequence using the same weights, the same learned filters, and it will work the same at every position in that sequence. So it is pretty similar to an RNN, but this way we don't have a hidden state that we need to update. We only look at the context that we had before and some intermediate representation.
But you can see that we look at only the last three tokens, and that is not very good. Maybe we need to look at the last 10 tokens or so, because an RNN such as an LSTM can actually have a very long short-term memory. Okay. From convolutional neural networks, we know how to increase the input receptive field: we stack convolutional layers. Let's stack six layers here with kernel size five, and that will result in a receptive field of 25 input elements.
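The gating and the receptive-field arithmetic described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (causal_conv, glu, receptive_field) and the list-of-lists layout are hypothetical and are not taken from the lecture slides or the paper. The receptive-field formula assumes stride-1 convolutions without dilation, where each extra layer adds kernel_size - 1 input positions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv(seq, filters):
    """Slide each filter over the sequence, looking only at the last k tokens.

    seq:     list of T token vectors, each a list of d floats
    filters: list of C filters, each a list of k rows of d weights
    Returns a list of T output vectors with C channels each.
    """
    k = len(filters[0])
    d = len(seq[0])
    # Left-pad with zero vectors so that position t never sees future tokens.
    padded = [[0.0] * d for _ in range(k - 1)] + seq
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]  # the k most recent tokens, ending at t
        out.append([
            sum(w * x for row, tok in zip(f, window) for w, x in zip(row, tok))
            for f in filters
        ])
    return out


def glu(channels):
    """Gated Linear Unit: the first half passes as is, the second half gates it."""
    half = len(channels) // 2
    linear, gate = channels[:half], channels[half:]
    return [a * sigmoid(b) for a, b in zip(linear, gate)]


def receptive_field(num_layers, kernel_size):
    """Input positions visible to one output of a stack of stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# Six stacked layers with kernel size five, as in the lecture.
print(receptive_field(6, 5))  # 25
```

Because the same filters are applied at every position and no state is carried between positions, each iteration of the loop over t is independent of the others.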
And the experiments show that 25 elements in the receptive field might be enough to model your sequences. Let's see how CNNs work for sequences. The authors provide results on a language modeling dataset, WikiText-103, and you can see that this CNN architecture actually beats the LSTM: it has lower perplexity, and it runs faster. We will go into that a little bit later. Another example is a machine translation dataset, from English to French, let's say. There they have a metric called BLEU, and the higher that metric, the better. You can see that the convolutional sequence-to-sequence model beats the LSTM here as well, and this is pretty surprising.
A good thing about CNNs is the speed benefit. The problem with an RNN is that it has a hidden state that we change through iterations, so we cannot do our calculations in parallel, because every step depends on the previous one. We can overcome that with convolutional networks, because during training we can process all time steps in parallel. We apply the same convolutional filters at each time step, and these applications are independent, so we can do them in parallel. During testing, let's say in a sequence-to-sequence manner, our encoder can do the same, because there is no dependence on the previous outputs: we use only our input tokens, so we can apply the convolutions and get our hidden states in parallel. One more good thing during testing is that GPUs are highly optimized for convolutions, so we can get a higher throughput thanks to using convolutions instead of RNNs. You can see a table here that compares a model based on LSTM with a model based on convolutional sequence-to-sequence, and you can see that the convolutional model provides a better score in terms of translation quality, and it also works 10 times faster.
So, that is a pretty good thing, because real-world systems like, let's say, Facebook need to translate a post whenever you want, and they need to translate it fast. So in order to implement machine translation in a production environment, maybe a CNN is a very good choice. By the way, this paper is by the folks from Facebook.
Let's look at one more thing. You know that when you do a sequence-to-sequence task, you want your encoder to be bi-directional, so that you look at the sequence from left to right and from right to left. The good thing about convolutions is that you can make the convolutional filters symmetric and look at the context on the left and on the right at the same time. So it is very easy to make a bi-directional encoder with CNNs. And it still works in parallel: there is no dependence on a hidden state here, it just applies all of those multiplications in parallel.
Let me remind you that we are reviewing the intent classifier and the slot tagger, and to move further we need some dataset that we can use for our overview. Let's take the ATIS dataset, the Airline Travel Information System. It was collected back in the 90s, and it has roughly 5,000 context-independent utterances, and that is important: it means that we have a one-turn dialogue, and we don't need a fancy dialogue manager here. It has 17 intents and 127 slot labels, like from-location, to-location, departure time, and so forth. The utterances look like this: "show me flights from Seattle to San Diego tomorrow". The state of the art for this task is the following: 1.7 intent error and 95.9 slot F1. So, this is pretty cool. Another thing is that you can actually learn your intent classifier and slot tagger jointly.
You don't need to train two separate tasks. You can train this supertask, because the model can learn representations that are suitable for both tasks. This way we provide more supervision for our training, and we get higher quality as a result. Let's see how this joint model might work. It is still a sequence-to-sequence model, but this time we use, let's say, a bi-directional encoder, and we can use its last hidden state for decoding the slot tags, and at the same time we can use it to decode the intent. If we train this end-to-end for the two tasks, we can get higher quality. Notice that in the decoder we have hidden states from the encoder passed just as is, which is called aligned inputs, and we also have c vectors, which are attention.
Let's see how attention works in the decoder. Let's say that we are at time step i and we have to output our new decoder hidden state s_i. That is a function of the previous hidden state, which is in blue, the previous output, which is in red, the hidden state from the encoder, and some vector which is attention. The attention vector c_i is a weighted sum of the hidden vectors from the encoder, and we need to come up with weights for these vectors. We train the system to learn these weights in such a way that it gives attention to the vectors that matter. The coefficient that defines the weight of a particular encoder vector is modeled as a feed-forward network that takes our previous decoder hidden state and all of the states from the encoder, and it needs to figure out whether we need that state from the encoder or not. You can also see an example of the attention distribution when we predict the label for the last word: when we predict a label like departure time, our model looks at phrases like from city, or city name, or something like that. Okay.
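The attention step described above (score each encoder state against the previous decoder state, normalize the scores, take the weighted sum) can be sketched as follows. This is a minimal sketch, not the lecture's exact model: the one-hidden-layer scorer v · tanh(W [s; h]), the softmax normalization, and the parameter names W and v are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score(prev_decoder_state, encoder_state, W, v):
    """Feed-forward scorer (assumed form): v . tanh(W [s; h])."""
    x = prev_decoder_state + encoder_state  # list concatenation [s; h]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def attention_context(prev_decoder_state, encoder_states, W, v):
    """Return the context vector c_i and the attention weights over encoder states."""
    weights = softmax(
        [score(prev_decoder_state, h, W, v) for h in encoder_states]
    )
    dim = len(encoder_states[0])
    context = [
        sum(a * h[j] for a, h in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return context, weights


# Toy example: with an all-zero scorer every encoder state gets equal weight,
# so the context vector is the plain average of the encoder states.
encoder_states = [[1.0, 0.0], [3.0, 2.0]]
context, weights = attention_context(
    [0.5], encoder_states, W=[[0.0, 0.0, 0.0]], v=[1.0]
)
print(weights, context)
```

In a trained model W and v are learned together with the rest of the network, which is what lets the weights concentrate on the encoder positions that matter for the current label.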
So, we can also see how our two losses decrease during training. During training we use two losses and optimize their sum. The green loss here is for intent, and the blue one is for slots. You can see that the intent loss saturates and doesn't change, but the blue curve for slots continues to decrease, so our model continues to train, because that is a harder task than intent classification. Okay. Let's look at joint training results on the ATIS dataset. If we train slot filling independently, we get a slot F1 of 95.7, and if we train our intent classifier independently, we get an intent error of two percent. But if we train those two tasks jointly using the architecture that we have overviewed, we can get a higher slot F1 and a lower intent error. Another good thing is that this joint model works faster if you use it on a mobile phone or any other embedded system, because you have only one encoder and you reuse its information for two tasks.
Okay. Let's summarize what we have overviewed. We have looked at different options for the intent classifier and the slot tagger: you can start from classical approaches and go all the way to deep approaches. People have started to use CNNs for sequence modeling and sometimes get better results than with RNNs, which is a pretty surprising fact. You can also use joint training, and it can be beneficial in terms of speed and performance for your slot tagger and intent classifier. In the next video, we will take a look at context utilization in our NLU: our intent classifier and slot tagger.
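As a footnote to the joint training discussed in this video, the summed objective can be sketched like this. The lecture only says that the two losses are summed; the use of cross-entropy for both tasks, the averaging of the slot loss over tokens, and the function names are assumptions made for this illustration.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(probs[target])


def joint_loss(intent_probs, intent_target, slot_probs, slot_targets):
    """Sum of the intent loss and the slot loss (slot loss averaged over tokens).

    intent_probs: predicted distribution over intents for the utterance
    slot_probs:   one predicted distribution over slot labels per token
    """
    intent_loss = cross_entropy(intent_probs, intent_target)
    slot_loss = sum(
        cross_entropy(p, t) for p, t in zip(slot_probs, slot_targets)
    ) / len(slot_targets)
    return intent_loss + slot_loss, intent_loss, slot_loss


# Toy example: two intents, two tokens, two slot labels.
total, intent_part, slot_part = joint_loss(
    intent_probs=[0.5, 0.5],
    intent_target=0,
    slot_probs=[[1.0, 0.0], [0.5, 0.5]],
    slot_targets=[0, 1],
)
print(total, intent_part, slot_part)
```

Tracking the two parts separately is what makes it possible to plot one curve for intent and one for slots, as on the slide.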