Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.
This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you'll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

Enseigné par

Andrew Ng

Transcription

In this video I'd like to tell you about the principle components analysis algorithm. And by the end of this video you know to implement PCA for yourself. And use it reduce the dimension of your data. Before applying PCA, there is a data pre-processing step which you should always do. Given the trading sets of the examples is important to always perform mean normalization, and then depending on your data, maybe perform feature scaling as well. this is very similar to the mean normalization and feature scaling process that we have for supervised learning. In fact it's exactly the same procedure except that we're doing it now to our unlabeled data, X1 through Xm. So for mean normalization we first compute the mean of each feature and then we replace each feature, X, with X minus its mean, and so this makes each feature now have exactly zero mean The different features have very different scales. So for example, if x1 is the size of a house, and x2 is the number of bedrooms, to use our earlier example, we then also scale each feature to have a comparable range of values. And so, similar to what we had with supervised learning, we would take x, i substitute j, that's the j feature and so we would subtract of the mean, now that's what we have on top, and then divide by sj. Here, sj is some measure of the beta values of feature j. So, it could be the max minus min value, or more commonly, it is the standard deviation of feature j. Having done this sort of data pre-processing, here's what the PCA algorithm does. We saw from the previous video that what PCA does is, it tries to find a lower dimensional sub-space onto which to project the data, so as to minimize the squared projection errors, sum of the squared projection errors, as the square of the length of those blue lines that and so what we wanted to do specifically is find a vector, u1, which specifies that direction or in the 2D case we want to find two vectors, u1 and u2, to define this surface onto which to project the data. So, just as a quick reminder of what reducing the dimension of the data means, for this example on the left we were given the examples xI, which are in r2. And what we like to do is find a set of numbers zI in r push to represent our data. So that's what from reduction from 2D to 1D means. So specifically by projecting data onto this red line there. We need only one number to specify the position of the points on the line. So i'm going to call that number z or z1. Z here [xx] real number, so that's like a one dimensional vector. So z1 just refers to the first component of this, you know, one by one matrix, or this one dimensional vector. And so we need only one number to specify the position of a point. So if this example here was my example X1, then maybe that gets mapped here. And if this example was X2 maybe that example gets mapped And so this point here will be Z1 and this point here will be Z2, and similarly we would have those other points for These, maybe X3, X4, X5 get mapped to Z1, Z2, Z3. So What PCA has to do is we need to come up with a way to compute two things. One is to compute these vectors, u1, and in this case u1 and u2. And the other is how do we compute these numbers, Z. So on the example on the left we're reducing the data from 2D to 1D. In the example on the right, we would be reducing data from 3 dimensional as in r3, to zi, which is now two dimensional. So these z vectors would now be two dimensional. So it would be z1 z2 like so, and so we need to give away to compute these new representations, the z1 and z2 of the data as well. So how do you compute all of these quantities? It turns out that a mathematical derivation, also the mathematical proof, for what is the right value U1, U2, Z1, Z2, and so on. That mathematical proof is very complicated and beyond the scope of the course. But once you've done [xx] it turns out that the procedure to actually find the value of u1 that you want is not that hard, even though so that the mathematical proof that this value is the correct value is someone more involved and more than i want to get into. But let me just describe the specific procedure that you have to implement in order to compute all of these things, the vectors, u1, u2, the vector z. Here's the procedure. Let's say we want to reduce the data to n dimensions to k dimension What we're going to do is first compute something called the covariance matrix, and the covariance matrix is commonly denoted by this Greek alphabet which is the capital Greek alphabet sigma. It's a bit unfortunate that the Greek alphabet sigma looks exactly like the summation symbols. So this is the Greek alphabet Sigma is used to denote a matrix and this here is a summation symbol. So hopefully in these slides there won't be ambiguity about which is Sigma Matrix, the matrix, which is a summation symbol, and hopefully it will be clear from context when I'm using each one. How do you compute this matrix let's say we want to store it in an octave variable called sigma. What we need to do is compute something called the eigenvectors of the matrix sigma. And an octave, the way you do that is you use this command, u s v equals s v d of sigma. SVD, by the way, stands for singular value decomposition. This is a Much more advanced single value composition. It is much more advanced linear algebra than you actually need to know but now It turns out that when sigma is equal to matrix there is a few ways to compute these are high in vectors and If you are an expert in linear algebra and if you've heard of high in vectors before you may know that there is another octet function called I, which can also be used to compute the same thing. and It turns out that the SVD function and the I function it will give you the same vectors, although SVD is a little more numerically stable. So I tend to use SVD, although I have a few friends that use the I function to do this as wellbut when you apply this to a covariance matrix sigma it gives you the same thing. This is because the covariance matrix always satisfies a mathematical Property called symmetric positive definite You really don't need to know what that means, but the SVD and I-functions are different functions but when they are applied to a covariance matrix which can be proved to always satisfy this mathematical property; they'll always give you the same thing. Okay, that was probably much more linear algebra than you needed to know. In case none of that made sense, don't worry about it. All you need to know is that this system command you should implement in Octave. And if you're implementing this in a different language than Octave or MATLAB, what you should do is find the numerical linear algebra library that can compute the SVD or singular value decomposition, and there are many such libraries for probably all of the major programming languages. People can use that to compute the matrices u, s, and d of the covariance matrix sigma. So just to fill in some more details, this covariance matrix sigma will be an n by n matrix. And one way to see that is if you look at the definition this is an n by 1 vector and this here I transpose is 1 by N so the product of these two things is going to be an N by N matrix. 1xN transfers, 1xN, so there's an NxN matrix and when we add up all of these you still have an NxN matrix. And what the SVD outputs three matrices, u, s, and v. The thing you really need out of the SVD is the u matrix. The u matrix will also be a NxN matrix. And if we look at the columns of the U matrix it turns out that the columns of the U matrix will be exactly those vectors, u1, u2 and so on. So u, will be matrix. And if we want to reduce the data from n dimensions down to k dimensions, then what we need to do is take the first k vectors. that gives us u1 up to uK which gives us the K direction onto which we want to project the data. the rest of the procedure from this SVD numerical linear algebra routine we get this matrix u. We'll call these columns u1-uN. So, just to wrap up the description of the rest of the procedure, from the SVD numerical linear algebra routine we get these matrices u, s, and d. we're going to use the first K columns of this matrix to get u1-uK. Now the other thing we need to is take my original data set, X which is an RN And find a lower dimensional representation Z, which is a R K for this data. So the way we're going to do that is take the first K Columns of the U matrix. Construct this matrix. Stack up U1, U2 and so on up to U K in columns. It's really basically taking, you know, this part of the matrix, the first K columns of this matrix. And so this is going to be an N by K matrix. I'm going to give this matrix a name. I'm going to call this matrix U, subscript "reduce," sort of a reduced version of the U matrix maybe. I'm going to use it to reduce the dimension of my data. And the way I'm going to compute Z is going to let Z be equal to this U reduce matrix transpose times X. Or alternatively, you know, to write down what this transpose means. When I take this transpose of this U matrix, what I'm going to end up with is these vectors now in rows. I have U1 transpose down to UK transpose. Then take that times X, and that's how I get my vector Z. Just to make sure that these dimensions make sense, this matrix here is going to be k by n and x here is going to be n by 1 and so the product here will be k by 1. And so z is k dimensional, is a k dimensional vector, which is exactly what we wanted. And of course these x's here right, can be Examples in our training set can be examples in our cross validation set, can be examples in our test set, and for example if you know, I wanted to take training example i, I can write this as xi XI and that's what will give me ZI over there. So, to summarize, here's the PCA algorithm on one slide. After mean normalization, to ensure that every feature is zero mean and optional feature scaling whichYou really should do feature scaling if your features take on very different ranges of values. After this pre-processing we compute the carrier matrix Sigma like so by the way if your data is given as a matrix like hits if you have your data Given in rows like this. If you have a matrix X which is your time trading sets written in rows where x1 transpose down to x1 transpose, this covariance matrix sigma actually has a nice vectorizing implementation. You can implement in octave, you can even run sigma equals 1 over m, times x, which is this matrix up here, transpose times x and this simple expression, that's the vectorize implementation of how to compute the matrix sigma. I'm not going to prove that today. This is the correct vectorization whether you want, you can either numerically test this on yourself by trying out an octave and making sure that both this and this implementations give the same answers or you Can try to prove it yourself mathematically. Either way but this is the correct vectorizing implementation, without compusingnext we can apply the SVD routine to get u, s, and d. And then we grab the first k columns of the u matrix you reduce and finally this defines how we go from a feature vector x to this reduce dimension representation z. And similar to k Means if you're apply PCA, they way you'd apply this is with vectors X and RN. So, this is not done with X-0 1. So that was the PCA algorithm. One thing I didn't do is give a mathematical proof that this There it actually give the projection of the data onto the K dimensional subspace onto the K dimensional surface that actually minimizes the square projection error Proof of that is beyond the scope of this course. Fortunately the PCA algorithm can be implemented in not too many lines of code. and if you implement this in octave or algorithm, you actually get a very effective dimensionality reduction algorithm. So, that was the PCA algorithm. One thing I didn't do was give a mathematical proof that the U1 and U2 and so on and the Z and so on you get out of this procedure is really the choices that would minimize these squared projection error. Right, remember we said What PCA tries to do is try to find a surface or line onto which to project the data so as to minimize to square projection error. So I didn't prove that this that, and the mathematical proof of that is beyond the scope of this course. But fortunately the PCA algorithm can be implemented in not too many lines of octave code. And if you implement this, this is actually what will work, or this will work well, and if you implement this algorithm, you get a very effective dimensionality reduction algorithm. That does do the right thing of minimizing this square projection error.