This course covers a wide range of tasks in Natural Language Processing, from basic to advanced: sentiment analysis, summarization, and dialogue state tracking, to name a few. Upon completing it, you will be able to recognize NLP tasks in your day-to-day work, propose approaches, and judge which techniques are likely to work well. The final project is devoted to one of the hottest topics in today’s NLP: you will build your own conversational chatbot that assists with search on the StackOverflow website. The project builds on the practical assignments of the course, which give you hands-on experience with tasks such as text classification, named entity recognition, and duplicate detection.
Throughout the lectures, we aim to find a balance between traditional and deep learning techniques in NLP and cover them in parallel. For example, we will discuss word alignment models in machine translation and see how similar they are to the attention mechanism in encoder-decoder neural networks. Core techniques are not treated as black boxes; on the contrary, you will get an in-depth understanding of what is happening inside. To succeed, we expect familiarity with the basics of linear algebra and probability theory, the machine learning setup, and deep neural networks. Some materials are based on papers only a month old and introduce you to the very state of the art in NLP research.
Do you have technical problems? Write to us: coursera@hse.ru

MV

Definitely best course in the Specialization! Lecturers, projects and forum - everything is super organized. Only StarSpace was pain in the ass, but I managed :)

TL

Jul 08, 2018

★★★★★ (5/5 stars)

Anna is a great instructor. She can explain the concept and mathematical formulas in a clear way. The design of assignment is both interesting and practical.

From the lesson

Sequence to sequence tasks

Nearly any task in NLP can be formulated as a sequence to sequence task: machine translation, summarization, question answering, and many more. In this module we will learn a general encoder-decoder-attention architecture that can be used to solve them. We will cover machine translation in more detail, and you will see how the attention technique resembles the word alignment task in the traditional pipeline.

Instructors

Anna Potapenko

Researcher

Alexey Zobnin

Associate Professor

Anna Kozlova

Team Lead

Sergey Yudin

Analyst-developer

Andrei Zimovnov

Senior Lecturer

Subtitles

Hey everyone, we're going to discuss a very important technique in neural networks. We are going to speak about the encoder-decoder architecture and about the attention mechanism. We will cover them using the example of neural machine translation, because they were originally proposed mostly for machine translation. But now they are applied to many, many other tasks. For example, you can think about summarization or simplification of texts, or sequence-to-sequence chatbots, and many others. Now let us start with the general idea of the architecture. We have some sequence as the input, and we want to get some sequence as the output. For example, these could be two sequences in different languages, right? We have our encoder, and the task of the encoder is to build some hidden representation of the input sentence. So we get this green hidden vector that tries to encode the whole meaning of the input sentence. Sometimes this vector is also called a thought vector, because it encodes the thought of the sentence. The decoder's task is to decode this thought vector, or context vector, into some output representation, for example, the sequence of words in the other language. Now, what types of encoders could we have here? Well, the most obvious type would be recurrent neural networks, but actually this is not the only option. So be aware that we also have convolutional neural networks, which can be very fast and nice, and they can also encode the meaning of the sentence. We could also have some hierarchical structures. For example, recursive neural networks try to use the syntax of the language, build the representation hierarchically from the bottom to the top, and understand the sentence that way. Okay, now what is the first example of a sequence-to-sequence architecture? This is the model that was proposed in 2014, and it is rather simple.
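As a rough illustration of the encoder side, here is a minimal NumPy sketch of a vanilla RNN that reads a toy token sequence and returns its last hidden state as the thought vector. All names, sizes, and random weights here are illustrative assumptions, not the course's actual model.

```python
import numpy as np

# Toy dimensions (illustrative, not from the lecture).
rng = np.random.default_rng(0)
emb_dim, hid_dim, vocab = 8, 16, 100

E = rng.normal(0, 0.1, (vocab, emb_dim))      # embedding matrix
W_x = rng.normal(0, 0.1, (hid_dim, emb_dim))  # input-to-hidden weights
W_h = rng.normal(0, 0.1, (hid_dim, hid_dim))  # hidden-to-hidden weights
b = np.zeros(hid_dim)

def encode(token_ids):
    """Run a vanilla RNN over the input tokens and return the last
    hidden state -- the 'thought'/context vector v."""
    h = np.zeros(hid_dim)
    for t in token_ids:
        h = np.tanh(W_x @ E[t] + W_h @ h + b)
    return h

v = encode([3, 17, 42, 7])   # a toy input "sentence"
print(v.shape)               # (16,)
```

A real system would use an LSTM (as in the 2014 model mentioned above) rather than this plain tanh RNN, but the role of the final hidden state as a summary of the whole input is the same.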
So it says, we have some LSTM module, or RNN module, that encodes our input sentence, and then at some point we have an end-of-sentence token. At this point, we understand that our state is our thought vector, or context vector, and we need to decode starting from this moment. The decoding is conditional language modelling. So you are already familiar with language modelling with neural networks, but now it is conditioned on this context vector, the green vector. Okay, as in any other language model, you usually feed the output of the previous state as the input to the next state, and generate the next words just one by one. Now, let us go deeper and stack several layers of our LSTM model. You can do this straightforwardly, like this. So let us move forward and speak about a slightly different variant of the same architecture. One problem with the previous architecture is that the green context vector can be forgotten. So if you only feed it as the input to the first state of the decoder, then you are likely to forget about it by the time you come to the end of your output sentence. So it would be better to feed it at every moment, and this architecture does exactly that. It says that every state of the decoder should have three kinds of arrows going into it: first, the arrow from the previous state, then the arrow from this context vector, and then the current input, which is the output of the previous state. Okay, now let us go into more detail with the formulas. So you have your sequence modelling task, conditional because you need to produce the probabilities of one sequence given another sequence, and you factorize it using the chain rule. Also, importantly, you see that the x variables are not needed anymore because you have encoded them into the v vector. The v vector is obtained as the last hidden state of the encoder, and the encoder is just a recurrent neural network. The decoder is also a recurrent neural network. However, it has more inputs, right?
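The decoder variant described above, where the context vector v is fed in at every step rather than only at the first one, can be sketched as a single transition that concatenates the embedding of the previous output word with v. Again, all names and dimensions are illustrative assumptions.

```python
import numpy as np

# Toy dimensions (illustrative).
rng = np.random.default_rng(1)
emb_dim, hid_dim, ctx_dim, vocab = 8, 16, 16, 100

E_out = rng.normal(0, 0.1, (vocab, emb_dim))             # output embeddings
W_in = rng.normal(0, 0.1, (hid_dim, emb_dim + ctx_dim))  # acts on [E y_{j-1}; v]
W_s = rng.normal(0, 0.1, (hid_dim, hid_dim))             # state-to-state weights

def decoder_step(prev_token, s_prev, v):
    """One decoder transition: s_j = tanh(W_in [E y_{j-1}; v] + W_s s_{j-1}).
    The three 'arrows' are the previous state s_prev, the context v,
    and the previous output word prev_token."""
    x = np.concatenate([E_out[prev_token], v])  # concat input with context
    return np.tanh(W_in @ x + W_s @ s_prev)

v = rng.normal(0, 0.1, ctx_dim)  # pretend context vector from the encoder
s = np.zeros(hid_dim)
s = decoder_step(1, s, v)        # feed a <start> token
s = decoder_step(5, s, v)        # v is fed again at this step too
print(s.shape)                   # (16,)
```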
So you see that now I concatenate the current input y with the v vector, and this means that I will use all kinds of information, all those three arrows, in my transitions. Now, how do we get predictions out of this model? Well, the easiest way is just to apply a softmax, right? So when you have your decoder RNN, you have the hidden states of your RNN, and they are called s_j. You can just apply some linear layer, and then a softmax, to get the probability of the current word given everything that we have. Awesome. Now let us try to see whether those v vectors are somehow meaningful. One way to do this is to say, okay, they are some high-dimensional hidden vectors; let us do some dimensionality reduction, for example by t-SNE or PCA, and let us plot them in just two dimensions to see what the vectors look like. So you see that the representations of some sentences are close here, and it is nice that the model can capture that active and passive voice do not actually matter for the meaning of the sentence. For example, you see that the sentences "I gave her a card" and "she was given a card" are very close in this space. Okay, even though these representations are so nice, this is still a bottleneck, so you should think about how to avoid that. And to avoid that, we will go into attention mechanisms, and this will be the topic of our next video.
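The prediction step mentioned above, a linear layer over the decoder state s_j followed by a softmax over the vocabulary, can be sketched like this; the sizes and random weights are illustrative assumptions.

```python
import numpy as np

# Toy dimensions (illustrative).
rng = np.random.default_rng(2)
hid_dim, vocab = 16, 100
W_out = rng.normal(0, 0.1, (vocab, hid_dim))  # linear output layer
b_out = np.zeros(vocab)

def word_probs(s_j):
    """Softmax over vocabulary: p(y_j | y_<j, x) from decoder state s_j."""
    logits = W_out @ s_j + b_out
    logits -= logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

s_j = rng.normal(0, 0.1, hid_dim)    # pretend decoder hidden state
p = word_probs(s_j)
print(round(float(p.sum()), 6))      # 1.0 -- a valid distribution
next_word = int(np.argmax(p))        # greedy choice of the next word
```

Greedy argmax decoding is only the simplest option; beam search over these per-step distributions is the usual choice in practice.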