Noisy channel: said in English, received in French

This course covers a wide range of tasks in Natural Language Processing, from basic to advanced: sentiment analysis, summarization, and dialogue state tracking, to name a few. Upon completion, you will be able to recognize NLP tasks in your day-to-day work, propose approaches, and judge which techniques are likely to work well. The final project is devoted to one of the hottest topics in today's NLP. You will build your own conversational chatbot that will assist with search on the StackOverflow website. The project is based on the course's practical assignments, which will give you hands-on experience with tasks such as text classification, named entity recognition, and duplicate detection.
Throughout the lectures, we aim to find a balance between traditional and deep learning techniques in NLP and cover them in parallel. For example, we will discuss word alignment models in machine translation and see how similar they are to the attention mechanism in encoder-decoder neural networks. Core techniques are not treated as black boxes. On the contrary, you will gain an in-depth understanding of what's happening inside. To succeed, we expect familiarity with the basics of linear algebra and probability theory, the machine learning setup, and deep neural networks. Some materials are based on papers published only a month ago and introduce you to the very state of the art in NLP research.
Do you have technical problems? Write to us: coursera@hse.ru

MV

Definitely best course in the Specialization! Lecturers, projects and forum - everything is super organized. Only StarSpace was pain in the ass, but I managed :)

TL

Jul 08, 2018

★★★★★ (5/5)

Anna is a great instructor. She can explain the concept and mathematical formulas in a clear way. The design of assignment is both interesting and practical.

From the lesson

Sequence to sequence tasks

Nearly any task in NLP can be formulated as a sequence to sequence task: machine translation, summarization, question answering, and many more. In this module we will learn a general encoder-decoder-attention architecture that can be used to solve them. We will cover machine translation in more detail, and you will see how the attention technique resembles the word alignment task in the traditional pipeline.

Taught By

Anna Potapenko

Researcher

Alexey Zobnin

Associate Professor

Anna Kozlova

Team Lead

Sergey Yudin

Analyst-developer

Andrei Zimovnov

Senior Lecturer

Transcript

Today, we will cover one main idea of statistical machine translation. Imagine you have a sentence, let's say, in French or in some other foreign language, and you want its translation into English. How do you do this? Well, you can try to compute the probability of the English sentence given your French sentence. And then, you want to maximize this probability and take the sentence that gives you this maximum probability, right? Sounds very intuitive. Now, let us apply Bayes' rule here. So let us say that instead of computing the probability of E given F, we would rather compute the probability of F given E, multiply it by some probability of the English sentence, and also normalize it by some denominator. Now, do you have any idea? Can we further simplify this formula? Well, actually, we can. The denominator doesn't depend on the English sentence, which means that we can just get rid of it, okay. Now, we have this formula, and the question is, why is that easier? Why do we like it more than the original formula? This slide is going to explain why. So, we have two models now. We have decoupled our complicated problem into two simpler problems. One problem is language modeling. And actually, you know a lot about it. So, this is how to produce some meaningful probability of a sentence of words. Now, the other problem is the translation model. And this model doesn't think about coherent sentences. It just thinks about a good translation of E to F, so that you do not end up with something that is unrelated to your source sentence. So, you have two models, about language and about adequacy of the translation. And then you have the argmax to perform the search in your space and find the sentence in English that gives you the best probability. Now, I have one more interpretation for you. The noisy channel is a super popular idea, so you definitely need to know about it. And it is actually super simple.
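As a toy illustration of this decoding rule, here is a minimal Python sketch. All candidate sentences and probability values below are invented for illustration (they are not from the course materials): the point is only that we pick the English sentence E maximizing P(F|E) · P(E).

```python
# Noisy-channel decoding rule, sketched with made-up numbers:
#   E* = argmax_E  P(F | E) * P(E)

# Hypothetical candidate English translations for one French input.
candidates = ["the cat sat", "cat the sat", "the dog sat"]

# P(E): a fabricated language-model score -- fluency of the English side.
p_lm = {"the cat sat": 0.5, "cat the sat": 0.01, "the dog sat": 0.49}

# P(F|E): a fabricated translation-model score -- adequacy w.r.t. the input.
p_tm = {"the cat sat": 0.6, "cat the sat": 0.6, "the dog sat": 0.05}

def decode(candidates, p_lm, p_tm):
    """Pick the candidate maximizing P(F|E) * P(E)."""
    return max(candidates, key=lambda e: p_tm[e] * p_lm[e])

print(decode(candidates, p_lm, p_tm))  # "the cat sat"
```

Note how the winner must score well on both models: "cat the sat" is an adequate word-for-word translation but not fluent, while "the dog sat" is fluent but not adequate.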
So, you have your source sentence, and you have some probability of this source sentence. Then, it goes through the noisy channel. The noisy channel is represented by the conditional probability of what you get as the output given your input to the channel. So, as the output, you obtain your French sentence. Let's say that your source sentence was corrupted by the channel, and now you obtained it in French. Now, the rest of the video is about how to model these two probabilities: the probability of the sentence, and the probability of the translation given some sentence. Okay. First, about the language model. You know a lot about it; we covered this in week two, so I will have just one slide as a recap for you. We need to compute the probability of a sentence of words. We apply the chain rule, and then we know that we can factorize it into the probabilities of the next word given some previous history. You can use the Markov assumption and end up with n-gram language models. Or you can use some neural language model, such as an LSTM: to produce the next word, you will need the previous words. Now, the translation model. Well, it is not so easy. Imagine you have a sequence of words in one language, and you need to produce the probability of a sequence of words in some other language. For example, take a foreign language, like Russian, and the English language, and these two sentences. How do you produce these probabilities? Well, it is not obvious for me. So, let us start at the word level. We can understand something at the level of separate words in these sentences. Okay. What can we do? We can have a translation table. So, here, I have the probabilities of Russian words given some English words. And they are normalized, right? So, each row in this matrix is normalized to one. And these are just translations that I learned, or that I looked up in a dictionary, or built somehow. Okay, it's doable.
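The language-model recap above can be sketched as a tiny bigram model. The toy corpus and the `<s>`/`</s>` boundary tokens are my own illustrative choices, not from the lecture:

```python
from collections import Counter

# Minimal bigram language model (Markov assumption of order 1),
# with maximum-likelihood estimates from a two-sentence toy corpus.
corpus = [["<s>", "good", "appetite", "</s>"],
          ["<s>", "good", "morning", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(tuple(sent[i:i + 2])
                  for sent in corpus for i in range(len(sent) - 1))

def p_next(word, prev):
    """P(word | prev), estimated as count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(sent):
    """Chain rule + Markov assumption: product of bigram probabilities."""
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= p_next(word, prev)
    return p

print(p_sentence(["<s>", "good", "appetite", "</s>"]))  # 0.5
```

Here "appetite" follows "good" in one of the two sentences starting with "good", so the whole sentence scores 1.0 × 0.5 × 1.0 = 0.5.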
Now, how do I build the probability of the whole sentence given these separate probabilities? We need some word alignments. The problem is that we can have reorderings in the language, like here, or even worse, some one-to-many or many-to-one correspondences. For example, the word appetit here corresponds to the appetite. And the word with here corresponds to two Russian words [FOREIGN]. It means that we need some model to build those alignments. Now, another example would be words that can appear or disappear. For example, some articles or some auxiliary words can occur in one language and then just vanish in the other language. This is where we need word alignment models, and this is the topic we will follow in the next video.
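To see how a translation table alone can already score a sentence pair, here is a sketch in the spirit of the classic IBM Model 1, which averages the table entries over all possible word alignments. The table entries and the French-English word pair below are invented for illustration (real tables are learned from parallel data):

```python
# Translation table t[(f, e)] = P(f | e); each English word's row sums to 1.
# All values are made up for this example.
t = {
    ("chat", "cat"): 0.9, ("le", "cat"): 0.1,
    ("le", "the"): 0.8, ("chat", "the"): 0.2,
}

def p_f_given_e(f_words, e_words):
    """Model-1-style score: for each foreign word, average its
    translation probability over all English words (i.e. sum over
    all alignments of that word), then multiply across positions."""
    p = 1.0
    for f in f_words:
        p *= sum(t.get((f, e), 0.0) for e in e_words) / len(e_words)
    return p

print(p_f_given_e(["le", "chat"], ["the", "cat"]))  # 0.2475
```

Because every foreign word may align to every English word, reorderings and one-to-many links come for free; richer alignment models (the topic of the next video) then learn which alignments are actually likely.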