Probabilistic graphical models (PGMs) are a rich framework for encoding probability distributions over complex domains: joint (multivariate) distributions over large numbers of random variables that interact with each other. These representations sit at the intersection of statistics and computer science, relying on concepts from probability theory, graph algorithms, machine learning, and more. They are the basis for the state-of-the-art methods in a wide variety of applications, such as medical diagnosis, image understanding, speech recognition, natural language processing, and many, many more. They are also a foundational tool in formulating many machine learning problems.
This course is the third in a sequence of three. Following the first course, which focused on representation, and the second, which focused on inference, this course addresses the question of learning: how a PGM can be learned from a data set of examples. The course discusses the key problems of parameter estimation in both directed and undirected models, as well as the structure learning task for directed models. The (highly recommended) honors track contains two hands-on programming assignments, in which key routines of two commonly used learning algorithms are implemented and applied to a real-world problem.

This module contains some basic concepts from the general framework of machine learning, taken from Professor Andrew Ng's Stanford class offered on Coursera. Many of these concepts are highly relevant to the problems we'll tackle in this course.

Taught by

Daphne Koller

Professor

Transcript

By now you've seen a couple of different learning algorithms: linear regression and logistic regression. They work well for many problems, but when you apply them to certain machine learning applications, they can run into a problem called overfitting that can cause them to perform very poorly. What I'd like to do in this video is explain what this overfitting problem is, and in the next few videos we'll talk about a technique called regularization that will allow us to ameliorate, or reduce, this overfitting problem and get these learning algorithms to work much better. So, what is overfitting? Let's keep using our running example of predicting housing prices with linear regression, where we want to predict the price as a function of the size of the house. One thing we could do is fit a linear function to this data, and if we do that, maybe we get that sort of straight-line fit to the data. But this isn't a very good model. Looking at the data, it seems pretty clear that as the size of the house increases, the housing prices plateau, or kind of flatten out, as we move to the right. So this algorithm doesn't fit the training set very well. We call this problem underfitting, and another term for this is that the algorithm has high bias. Both of these roughly mean that it's just not even fitting the training data very well. The term bias is kind of a historical, or technical, one, but the idea is that if we're fitting a straight line to the data, then it's as if the algorithm has a very strong preconception, or a very strong bias, that housing prices are going to vary linearly with their size. And despite the evidence to the contrary, this preconception, this bias, still causes it to fit a straight line, and this ends up being a poor fit to the data.
Now, in the middle, we could fit a quadratic function to the data, and with this data set, if we fit a quadratic function, maybe we get that kind of curve, and that works pretty well. At the other extreme would be if we were to fit, say, a fourth-order polynomial to the data. So here we have five parameters, theta zero through theta four, and with that we can actually fit a curve that passes through all five of our training examples. We might get a curve that looks like this. That, on the one hand, seems to do a very good job fitting the training set. It passes through all of my data, at least. But this is sort of a very wiggly curve. It's going up and down, going all over the place, and we don't actually think that's such a good model for predicting housing prices. So this problem we call overfitting, and another term for this is that the algorithm has high variance. The term high variance is another sort of historical, or technical, one, but the intuition is that if we're fitting such a high-order polynomial, then it's almost as if the hypothesis can fit almost any function, and the space of possible hypotheses is just too large, or too variable, and we don't have enough data to constrain it to give us a good hypothesis. So that's overfitting. And in the middle, there isn't really a name, so I'm just going to write "just right", where a second-degree polynomial, a quadratic function, seems to be just right for fitting this data. To recap a bit, the problem of overfitting comes when, if we have too many features, the learned hypothesis may fit the training set very well. So your cost function may actually be very close to zero, or maybe even zero exactly. But you may then end up with a curve like this that tries too hard to fit the training set, so that it fails to generalize to new examples and fails to predict prices on new examples well.
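The three fits described above (straight line, quadratic, fourth-order polynomial) can be sketched in a few lines of Python. This is only an illustration: the five data points are made up to mimic a plateauing price curve, and the names `sizes`, `prices`, and `fit_and_score` are hypothetical, not taken from the lecture.

```python
import numpy as np

# Hypothetical training set (not from the lecture): five houses,
# size in 1000 sq ft and price in $1000s, with prices flattening out.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([150.0, 270.0, 300.0, 340.0, 330.0])


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (coefficients, training MSE)."""
    coeffs = np.polyfit(sizes, prices, deg=degree)
    residuals = np.polyval(coeffs, sizes) - prices
    return coeffs, float(np.mean(residuals ** 2))


train_mse = {}
for degree in (1, 2, 4):
    coeffs, train_mse[degree] = fit_and_score(degree)
    # Prediction for a house slightly larger than any in the training set.
    prediction = float(np.polyval(coeffs, 6.0))
    print(f"degree {degree}: training MSE = {train_mse[degree]:.4f}, "
          f"predicted price at size 6.0 = {prediction:.1f}")
```

With five parameters and five training examples, the degree-4 fit drives the training error to essentially zero, which is exactly the situation the lecture warns about: a near-zero cost on the training set says nothing about how the curve behaves on houses it has not seen.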
And here the term generalize refers to how well a hypothesis applies even to new examples, that is, to data, to houses, that it hasn't seen in the training set. On this slide, we looked at overfitting for the case of linear regression. A similar thing can apply to logistic regression as well. Here's a logistic regression example with two features, x1 and x2. One thing we could do is fit logistic regression with just a simple hypothesis like this, where, as usual, g is my sigmoid function. If you do that, you end up with a hypothesis that tries to use maybe just a straight line to separate the positive and the negative examples, and this doesn't look like a very good fit to the data. So, once again, this is an example of underfitting, or of a hypothesis having high bias. In contrast, if you were to add these quadratic terms to your features, then you could get a decision boundary that might look more like this. And that's a pretty good fit to the data, probably about as good as we could get on this training set. Finally, at the other extreme, if you were to fit a very high-order polynomial, if you were to generate lots of high-order polynomial terms as features, then logistic regression may contort itself, may try really hard to find a decision boundary that fits your training data, or go to great lengths to contort itself to fit every single training example well. And if the features x1 and x2 are for predicting, say, whether breast tumors are malignant or benign, this really doesn't look like a very good hypothesis for making predictions. So once again, this is an instance of overfitting, of a hypothesis having high variance and being unlikely to generalize well to new examples.
Later in this course, when we talk about debugging and diagnosing things that can go wrong with learning algorithms, we'll give you specific tools to recognize when overfitting and also when underfitting may be occurring. But for now, let's talk about the problem of, if we think overfitting is occurring, what can we do to address it? In the previous examples, we had one- or two-dimensional data, so we could just plot the hypothesis, see what was going on, and select the appropriate degree of polynomial. So earlier, for the housing prices example, we could just plot the hypothesis and maybe see that it was fitting this very wiggly function that goes all over the place to predict housing prices, and we could then use figures like these to select an appropriate degree of polynomial. So plotting the hypothesis could be one way to try to decide what degree of polynomial to use, but that doesn't always work. In fact, more often we may have learning problems where we just have a lot of features, and then it's not just a matter of selecting what degree of polynomial. When we have so many features, it also becomes much harder to plot the data and much harder to visualize it to decide which features to keep or not. So concretely, if we're trying to predict housing prices, sometimes we can just have a lot of different features, and all of the features seem kind of useful. But if we have a lot of features and very little training data, then overfitting can become a problem.
In order to address overfitting, there are two main options for things that we can do. The first option is to try to reduce the number of features. One thing that we could do is manually look through the list of features and use that to try to decide which are the more important features, and therefore which features we should keep and which features we should throw out. Later in this class we'll also talk about model selection algorithms, which are algorithms for automatically deciding which features to keep and which features to throw out.
This idea of reducing the number of features can work well and can reduce overfitting, and when we talk about model selection we'll go into this in much greater depth. But the disadvantage is that by throwing away some of the features, you're also throwing away some of the information you have about the problem. For example, maybe all of those features are actually useful for predicting the price of a house, so maybe we don't actually want to throw some of our information, or some of our features, away. The second option, which we'll talk about in the next few videos, is regularization. Here, we're going to keep all the features, but we're going to reduce the magnitude, or the values, of the parameters theta j. And this method works well, we'll see, when we have a lot of features, each of which contributes a little bit to predicting the value of y, like we saw in the housing price prediction example, where we could have a lot of features, each of which is somewhat useful, so maybe we don't want to throw them away. So this describes the idea of regularization at a very high level, and I realize that all of these details probably don't make sense to you yet. But in the next video we'll start to formulate exactly how to apply regularization and exactly what regularization means. And then we'll start to figure out how to use this to make our learning algorithms work well and avoid overfitting.
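As a rough preview of what "keep all the features but reduce the magnitude of the parameters" can look like, here is a minimal Python sketch. It assumes one common form of regularization, a squared (L2) penalty on theta 1 through theta 4 added to the least-squares cost, which this video has not yet specified. The data points, the strength values in `lam`, and the helper names `polynomial_features` and `fit_ridge` are all hypothetical.

```python
import numpy as np


def polynomial_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for a 1-D array x."""
    return np.vander(x, N=degree + 1, increasing=True)


def fit_ridge(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * sum(theta[1:] ** 2).

    Assumption: an L2 penalty that leaves the intercept theta_0 unpenalized.
    """
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


# Hypothetical data: five houses, sizes rescaled to (0, 1], price in $1000s.
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) / 5.0
prices = np.array([150.0, 270.0, 300.0, 340.0, 330.0])
X = polynomial_features(sizes, degree=4)  # keep all five parameters

param_norms, train_mse = [], []
for lam in (0.0, 0.01, 1.0):
    theta = fit_ridge(X, prices, lam)
    param_norms.append(float(np.linalg.norm(theta[1:])))
    train_mse.append(float(np.mean((X @ theta - prices) ** 2)))
    print(f"lambda = {lam}: ||theta[1:]|| = {param_norms[-1]:.2f}, "
          f"training MSE = {train_mse[-1]:.4f}")
```

As the penalty strength grows, the parameters shrink and the fit to the training set gets looser. That trade-off is the knob regularization provides: all features stay in the model, but none of them is allowed an arbitrarily large weight.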