This course covers a wide range of tasks in Natural Language Processing from basic to advanced: sentiment analysis, summarization, dialogue state tracking, to name a few. Upon completion, you will be able to recognize NLP tasks in your day-to-day work, propose approaches, and judge which techniques are likely to work well. The final project is devoted to one of the hottest topics in today’s NLP. You will build your own conversational chat-bot that will assist with search on the StackOverflow website. The project will be based on the practical assignments of the course, which will give you hands-on experience with tasks such as text classification, named entity recognition, and duplicate detection.
Throughout the lectures, we will aim to find a balance between traditional and deep learning techniques in NLP and cover them in parallel. For example, we will discuss word alignment models in machine translation and see how similar they are to the attention mechanism in encoder-decoder neural networks. Core techniques are not treated as black boxes. On the contrary, you will get an in-depth understanding of what’s happening inside. To succeed, we expect familiarity with the basics of linear algebra and probability theory, the machine learning setup, and deep neural networks. Some materials are based on papers published only a month ago and introduce you to the state of the art in NLP research.
Do you have technical problems? Write to us: coursera@hse.ru

MV

Definitely best course in the Specialization! Lecturers, projects and forum - everything is super organized. Only StarSpace was pain in the ass, but I managed :)

TL

Jul 08, 2018


Anna is a great instructor. She can explain the concept and mathematical formulas in a clear way. The design of assignment is both interesting and practical.

From the lesson

Language modeling and sequence tagging

In this module we will treat texts as sequences of words. You will learn how to predict the next word given some previous words. This task is called language modeling, and it is used for suggestions in search, machine translation, chat-bots, etc. You will also learn how to predict a sequence of tags for a sequence of words. This can be used to determine part-of-speech tags, named entities, or any other tags, e.g. ORIG and DEST in the query "flights from Moscow to Zurich". We will cover methods based on probabilistic graphical models and deep learning.
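As a small illustration of the tagging task described above: the goal is to label each word of the query with a tag. The ORIG and DEST tags come from the text; the query, the "O" (no tag) convention, and the pairing below are made up for this sketch.

```python
# Hypothetical example: one tag per word in a flight-search query.
# "O" marks words that carry no tag of interest.
query = "flights from Moscow to Zurich".split()
tags = ["O", "O", "ORIG", "O", "DEST"]

# Pair each word with its tag, as a sequence tagger would output.
print(list(zip(query, tags)))
```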

Instructors

Anna Potapenko

Researcher

Alexey Zobnin

Associate Professor

Anna Kozlova

Team Lead

Sergey Yudin

Analyst-developer

Andrei Zimovnov

Senior Lecturer

Transcript

Hey, and welcome back. This is what you have already seen at the end of our previous video. So just to remind you, we have some sequences, and we are going to predict the probabilities of these sequences. We learnt that with the bigram language model, you can factorize your probability into some terms. These are the probabilities of the next word, given the previous word. Now, take a moment to see whether everything is okay with the indices on this slide. Well, you can notice that i can be equal to 0 or to k plus 1, and it goes out of range of our sequence. But that's okay, because, if you remember our previous video, we discussed that we should have some fake tokens at the beginning of the sequence and at the end of the sequence. So this i equal to 0 and to k plus 1 will be exactly these fake tokens. So everything is good here. Let us move forward.

This is just a generalization: the n-gram language model. The only difference here is that the history gets longer, so we condition not only on the previous word but on the whole sequence of n minus 1 previous words. So just take note of the notation here. It is just a brief way to show that we have a sequence of n minus 1 words.

Great. We have some intuition for how to estimate these probabilities. You remember that we can just count some n-grams and normalize these counts. But now I want to give you not only intuition but mathematical justification. Well, we have some probabilistic model, and we have some data, and we want to learn the parameters of this model. What do we do in this case? We do likelihood maximization. By W train, I denote here my train data. This is just a concatenation of all the training sequences that I have, giving a total of big M tokens. Now, I take the logarithm of this probability because it is easier to optimize the sum of logarithms rather than the product of probabilities.
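The count-and-normalize estimate discussed here can be sketched in a few lines of Python. This is a toy illustration, not the course's assignment code: the corpus, the function name, and the `<s>`/`</s>` spellings of the fake start and end tokens are invented for the example.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate bigram probabilities by counting and normalizing (MLE).
    <s> and </s> play the role of the fake tokens at the beginning
    and the end of each sequence, as discussed in the lecture."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])                   # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # bigram counts
    # p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = train_bigram_lm(["this is the house", "this is the cat"])
print(probs[("this", "is")])    # 1.0: "is" always follows "this"
print(probs[("the", "house")])  # 0.5: "house" follows "the" half the time
```

By construction the probabilities for each history sum to one, which is exactly the normalization constraint mentioned in the derivation.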
And I just write down the probability of my data according to my model. Okay? So if I'm not too lazy, I would take the derivatives of this likelihood, and I would also think about constraints, such as normalization and non-negativity of my parameters. And I would arrive at exactly the formulas that you see at the bottom of this slide. So these counts and the normalization of these counts have a mathematical justification, which is likelihood maximization. These are just the maximum likelihood estimates.

Awesome. We can now train our language model. Can we show some example of how it works? This is a model trained on a Shakespeare corpus. You can see that the unigram model and the bigram model give something meaningful, and the 3-gram model and 4-gram model are probably even better. So you can see that the model actually generates some text which resembles Shakespeare.

Now, I have a question for you. How would you choose the best n here? Do you have any intuition, or maybe a procedure, to find the best n for your model? Well, for this case, I would say that 5-gram models are usually the best for language modeling, but it really depends on your data and on your certain task.

So the general question is: how do we decide which model is better? How do we evaluate and compare our models? One way to go is to do extrinsic evaluation. For example, we can have some machine translation system or speech recognition system, any final application, and we can measure the quality of this application. This is a good way, but sometimes we do not have the time or resources to build the whole application. So we also want to have some intrinsic evaluation, which means evaluating the language model itself. And one measure that people use all the time is called perplexity. It is called hold-out perplexity. Why? Because we have some data, and usually we hold out some of it to compute perplexity later. So this is hold-out data.
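A minimal sketch of the perplexity computation, assuming the model's per-token probabilities have already been computed (the function name and toy numbers are made up; "lower is better", as in the lecture):

```python
import math

def perplexity(token_probs):
    """Perplexity of a test set, given the model's probability for
    each of its N tokens: exp(-(1/N) * sum of log-probabilities).
    Equivalently, the inverse probability normalized by length."""
    n = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / n)

# A model that assigns probability 1/4 to every token is, on average,
# as "surprised" as if it chose uniformly among 4 words: perplexity 4.
print(perplexity([0.25] * 10))  # ~4.0
# A model that predicts every token with certainty has perplexity 1.
print(perplexity([1.0] * 5))    # 1.0
```

Note that a single zero probability makes the log-likelihood minus infinity and the perplexity infinite, which is exactly the problem discussed next.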
This is just another way to say that we need a train split and a test split. So what is perplexity? Well, you know what the likelihood is. Here, I just write down the likelihood for my test data, and perplexity is super similar: perplexity just has the likelihood in the denominator. You may be curious why exactly this formula. Well, it is really related to entropy, but we are not going into details right now. The thing that we need to know is that the lower the perplexity is, the better. Why? Because the greater the likelihood is, the better. The likelihood shows whether our model is surprised by our text or not, whether our model predicts exactly the same test data that we have in real life. Perplexity carries the same intuition. And remember: the lower the perplexity, the better.

Let us try to compute perplexity for some small toy data. This is some toy train corpus and toy test corpus. What is the perplexity here? Well, we start by computing the probabilities of our model. So I compute some probability, and I get zero. It means that the probability of the whole test data is also zero, and the perplexity is infinite. And that's definitely not what we like. How can I fix that? What can we do about it?

Well, there is actually a very simple way to fix that. Let us say that we have some vocabulary. Actually, we build some vocabulary beforehand, just by some frequencies, or we just take it from somewhere. And after that, we substitute all out-of-vocabulary tokens, in both the train and the test sets, with a special <UNK> token. Okay. Then we compute our probabilities as usual for all vocabulary tokens and for the <UNK> token, because we also see this <UNK> token in the training data. Right? And this works, because now, when we look at our test data, we see there only vocabulary tokens and the <UNK> token, and we have probabilities for all of them, and that's okay. So now we have no out-of-vocabulary words. We have fixed that.
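The vocabulary trick described here, building a vocabulary by frequency and replacing everything else with <UNK>, might look like this (a sketch; the frequency threshold, corpus, and function name are invented for the example):

```python
from collections import Counter

def replace_rare_with_unk(tokens, min_count=2):
    """Build a vocabulary from token frequencies and map every
    out-of-vocabulary token to the special <UNK> token."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat ran".split()
print(replace_rare_with_unk(train))
# words seen only once (sat, on, mat, ran) become <UNK>;
# "the" and "cat" stay, so <UNK> now has training counts of its own
```

The same vocabulary would then be applied to the test set, so the model never meets a token it has no probability for.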
Let's try to compute perplexity again. So this is the toy data. What is the perplexity? The probability of some token is still zero, because we do not see this bigram in our train data, which means the probability of the whole data is zero, the perplexity is infinite, and this is again not what we like. So for this case, we need to use some smoothing techniques. And this is exactly what our next video is about.