Friday, February 26, 2010

Principal Components Analysis

Introduction

Real-world data sets usually exhibit relationships among their variables. These relationships are often linear, or at least approximately so, making them amenable to common analysis techniques. One such technique is principal component analysis ("PCA"), which rotates the original data to new coordinates, making the data as "flat" as possible.

Given a table of two or more variables, PCA generates a new table with the same number of variables, called the principal components. Each principal component is a linear combination of all of the original variables. The coefficients of the principal components are calculated so that the first principal component contains the maximum variance (which we may tentatively think of as the "maximum information"). The second principal component is calculated to have the second most variance, and, importantly, is uncorrelated (in a linear sense) with the first principal component. Further principal components, if there are any, exhibit decreasing variance and are uncorrelated with all other principal components.

PCA is completely reversible (the original data may be recovered exactly from the principal components), making it a versatile tool, useful for data reduction, noise rejection, visualization and data compression among other things. This article walks through the specific mechanics of calculating the principal components of a data set in MATLAB, using either the MATLAB Statistics Toolbox, or just the base MATLAB product.

Performing Principal Components Analysis

Performing PCA will be illustrated using the following data set, which consists of 3 measurements taken of a particular subject over time:

To summarize the data, we calculate the sample mean vector and the sample standard deviation vector:

>> AMean = mean(A)

AMean =

269.9733 38.9067 50.4800

>> AStd = std(A)

AStd =

1.7854 0.3751 0.3144

Most often, the first step in PCA is to standardize the data. Here, "standardization" means subtracting the sample mean from each observation, then dividing by the sample standard deviation. This centers and scales the data. Sometimes there are good reasons for modifying or not performing this step, but I will recommend that you standardize unless you have a good reason not to. This is easy to perform, as follows:
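A minimal sketch of the standardization step in base MATLAB. The table A below is a small synthetic stand-in (made-up values, not the data set used in this article), and bsxfun is just one way to apply the row vectors to every observation:

```matlab
% Synthetic stand-in data (NOT the article's table): 15 observations, 3 variables
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];

AMean = mean(A);   % 1-by-3 row vector of column means
AStd  = std(A);    % 1-by-3 row vector of sample standard deviations (n-1 divisor)

% Subtract the mean from every row, then divide each column by its standard deviation
B = bsxfun(@rdivide, bsxfun(@minus, A, AMean), AStd);
```

After this step every column of B has a sample mean of zero and a sample standard deviation of one, up to rounding.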

1. The order of the principal components from princomp is the opposite of that from eig(cov(B)). princomp orders the principal components so that the first one appears in column 1, whereas eig(cov(B)) stores it in the last column.

2. Some of the coefficients from each method have the opposite sign. This is fine: There is no "natural" orientation for principal components, so you can expect different software to produce different mixes of signs.
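Both differences can be handled explicitly when working from eig(cov(B)). The sketch below uses synthetic stand-in data (not this article's table), sorts the eigenvalues in descending order rather than relying on the order eig happens to return, and applies one possible sign convention. That convention (largest-magnitude coefficient in each column made positive) is an assumption chosen for illustration, not something PCA or any particular package requires:

```matlab
% Synthetic stand-in data (NOT the article's table), standardized into B
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
B = bsxfun(@rdivide, bsxfun(@minus, A, mean(A)), std(A));

[V, D] = eig(cov(B));                        % columns of V: eigenvectors; diag(D): eigenvalues
[latent, order] = sort(diag(D), 'descend');  % put the largest variance first
coeff = V(:, order);                         % reorder the eigenvectors to match

% Illustrative sign convention: make the largest-magnitude entry of each column positive
[~, idx] = max(abs(coeff), [], 1);
signs = sign(coeff(sub2ind(size(coeff), idx, 1:size(coeff, 2))));
coeff = bsxfun(@times, coeff, signs);
```

Flipping the sign of a column of coeff flips the sign of the corresponding principal component, but leaves its variance and its lack of correlation with the other components unchanged.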

This completes the round trip from the original data to the principal components and back to the original data. In some applications, the principal components are modified before the return trip.
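The round trip can be sketched in base MATLAB as follows, again on synthetic stand-in data rather than this article's table. The names score, BBack and ABack are illustrative:

```matlab
% Synthetic stand-in data (NOT the article's table)
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
AMean = mean(A);
AStd  = std(A);
B = bsxfun(@rdivide, bsxfun(@minus, A, AMean), AStd);   % standardize

[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);

score = B * coeff;        % forward: standardized data -> principal components
BBack = score * coeff';   % back: coeff is orthogonal, so its transpose is its inverse
ABack = bsxfun(@plus, bsxfun(@times, BBack, AStd), AMean);   % undo scaling and centering
```

With all components retained, ABack should match A to within floating-point rounding.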

Let's consider what we've gained by making the trip to the principal component coordinate system. First, more variance has indeed been squeezed into the first principal component, which we can see by taking the sample variance of the principal components:

>> var(SCORE)

ans =

2.8125 0.1809 0.0066

The cumulative proportion of variance contained in the first k principal components (k = 1, 2, 3) is easily calculated:

>> cumsum(var(SCORE)) / sum(var(SCORE))

ans =

0.9375 0.9978 1.0000

Interestingly, in this case the first principal component contains nearly 94% of the variance of the standardized table. A lossy data compression scheme which discarded the second and third principal components would compress 3 variables into 1, while losing only about 6% of the variance.
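A sketch of such a lossy scheme, keeping only the first k components and then mapping back to the original units. The data are synthetic stand-ins (so the percentages will differ from those above), and k = 1 is an illustrative choice:

```matlab
% Synthetic stand-in data (NOT the article's table)
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
AMean = mean(A);
AStd  = std(A);
B = bsxfun(@rdivide, bsxfun(@minus, A, AMean), AStd);

[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);
score = B * coeff;

k = 1;                                       % number of components kept (illustrative)
BApprox = score(:, 1:k) * coeff(:, 1:k)';    % rank-k approximation of B
AApprox = bsxfun(@plus, bsxfun(@times, BApprox, AStd), AMean);   % back to original units

explained    = cumsum(latent) / sum(latent);            % cumulative proportion of variance
lostFraction = sum(var(B - BApprox)) / sum(var(B));     % equals 1 - explained(k)
```

Note that AApprox still has three columns. What shrinks is the amount stored per observation: k scores instead of 3 values, plus the shared mean, standard deviation and coefficient vectors.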

The other important thing to note about the principal components is that they are completely uncorrelated (as measured by the usual Pearson correlation), which we can test by calculating their correlation matrix:

>> corrcoef(SCORE)

ans =

1.0000 -0.0000 0.0000
-0.0000 1.0000 -0.0000
0.0000 -0.0000 1.0000
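The same check can be made without the Statistics Toolbox. In this sketch (synthetic stand-in data, illustrative variable names) the off-diagonal correlations of the scores are zero up to rounding, and the covariance matrix of the scores is diagonal, with the eigenvalues of cov(B) on the diagonal:

```matlab
% Synthetic stand-in data (NOT the article's table), standardized into B
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
B = bsxfun(@rdivide, bsxfun(@minus, A, mean(A)), std(A));

[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
score = B * V(:, order);

R = corrcoef(score);                 % correlation matrix of the principal components
offDiag = R - diag(diag(R));         % zero out the diagonal
maxOffDiag = max(abs(offDiag(:)));   % largest remaining correlation, ideally ~0

C = cov(score);                      % approximately diag(latent)
```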

Discussion

PCA "squeezes" as much information (as measured by variance) as possible into the first principal components. In some cases the number of principal components needed to store the vast majority of variance is shockingly small: a tremendous feat of data manipulation. This transformation can be performed quickly on contemporary hardware and is invertible, permitting any number of useful applications.

For the most part, PCA really is as wonderful as it seems. There are a few caveats, however:

1. PCA doesn't always work well, in terms of compressing the variance. Sometimes variables just aren't related in a way which is easily exploited by PCA. This means that all or nearly all of the principal components will be needed to capture the multivariate variance in the data, making the use of PCA moot.

2. Variance may not be what we want condensed into a few variables. For example, if we are using PCA to reduce data for predictive model construction, then it is not necessarily the case that the first principal components yield a better model than the last principal components (though it often works out more or less that way).

3. PCA is built from components, such as the sample covariance, which are not statistically robust. This means that PCA may be thrown off by outliers and other data pathologies. How seriously this affects the result is specific to the data and application.

4. Though PCA can cram much of the variance in a data set into fewer variables, it still requires all of the variables to generate the principal components of future observations. Note that this is true, regardless of how many principal components are retained for the application. PCA is not a subset selection procedure, and this may have important logistical implications.
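Caveat 4 can be made concrete with a short sketch. The training table is synthetic and the future observation xNew is hypothetical. The point is that every original variable of xNew is needed, together with the training mean, standard deviation and coefficients, even when only k = 1 component is kept:

```matlab
% Synthetic stand-in training data (NOT the article's table)
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
AMean = mean(A);
AStd  = std(A);
B = bsxfun(@rdivide, bsxfun(@minus, A, AMean), AStd);

[V, D] = eig(cov(B));
[latent, order] = sort(diag(D), 'descend');
coeff = V(:, order);
score = B * coeff;

k = 1;                            % components retained (illustrative)
xNew = [271.2, 39.1, 50.6];       % hypothetical future observation: all 3 variables required
bNew = (xNew - AMean) ./ AStd;    % standardize with the TRAINING mean and std
scoreNew = bNew * coeff(:, 1:k);  % project onto the retained components
```

The same recipe applies to a held-out test set: the mean, standard deviation and coefficients come from the training data only.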

I greatly appreciate the practical and clear explanation. Other things available are beyond my current background, but this explanation allows me to start empirically playing with the method - in my opinion a good precursor to really understanding it formally (if I ever get there! - and if not I still learn something).

The first time I did PCA I used the following document which has pictures. I agree they help a lot. It is nice to see the MatLab code on your blog. I think it would have been better if you did an example where you actually reduced the dimensionality by selecting a subset of feature vectors. Also it's very easy to do this without the statistics toolbox.

I do something like this in MatLab to select my feature vectors:

Values = diag(Values);
[Vsort, Vindices] = sort(-1*Values);
Values = Values(Vindices);
PC = Vectors(:,Vindices(1:number_factors));

@previous poster This process can be applied to images for compression.

What do you do if you want to know the points that are, say, 1 standard deviation from the mean of the first principal component? In other words, I want to know what is being altered as I move two standard deviations from the first component.

COEFF is V, but with the opposite order, so that when it is multiplied by the normalized data, B * COEFF, the result, which is generated by princomp() as SCORE, has the first principal component in the first column. LATENT contains the variances of the principal components.

This article is very helpful for me. I am a student from Indonesia. I am now completing my thesis on data mining with PCA as a dimension reduction, but I want to use the Jacobi iteration to find the eigenvectors. Can you help me?

Hi, glad to find this blog! I have a huge set of data: 17689 approximate coefficients which were extracted by feature extraction from MRI brain images. How can I use PCA to reduce the data so that I can use a minimal data set for SVM classification? I really need your advice. Thanks. -amalina_azman_80@yahoo.com.my

"To plot the PC1 vs PC2 plot do I plot the scores first column Vs scores second column of values?"

That depends on how the principal components were calculated. If the princomp function in the Statistics Toolbox was used then, yes. If the "manual" method I describe here was used, then the order of the principal components column is reversed, so the last 2 columns are the first 2 principal components.

Will, excellent post. Thanks. Quick question: at the end of the process my new variables will be the matrix SCORE, right? If I want to add a trend line, should I use the first, second, …columns of the SCORE matrix?

Hello, I'd like to thank you very much for your tutorial. Very helpful and useful. But I do have a question. I read in many works that PCA is used as a "preprocessing" method, prior to classification (LDA or other). Therefore, I'd like to ask you: how can I use the principal component coefficients to perform a classification?

I mean, okay, I know the classes assigned to the training matrix T, but when I do Coeff=princomp(A), I find that I know nothing about Coeff; I don't know its classes. So how can I perform a classification?

Thank you very much for your blog. Please excuse me if my question is misplaced, and please point me to the appropriate forum to ask it.

PCA itself is not a classification method. PCA merely rearranges the data to exploit linear structure. As I note in this posting, PCA may or may not help classification, which is a separate process (performed by some classification algorithm: discriminant analysis, neural networks, etc.).

Hello! This is one of the best posts I've seen on PCA - thank you! I've used this post as a starting point. However, for a better interpretation of which variables comprise my factors, I would like to rotate my factors using the varimax method of rotation. My question is, once I've obtained my COEFF, SCORE, and latent, what is the next step to rotation? Thank you~

Thank you very much for your good explanation. I am wondering about one thing though. Once we get the principal components by using the princomp function of matlab, can we say that the first principal component is related to the first column of the original data matrix? Or is it possible that the first column of the original data matrix does not have as much variance as the second column; therefore, the first column of the principal components corresponds to the second column of the original data matrix? How can we know which column of the principal components is related to which column of the original data matrix? Thanks in advance.

Hi! Lately I've been reading a lot about PCA and I've found this post very useful, kudos to you! I've learned that a useful method for validating a PCA model (choosing how many PCs to retain) is by cross validation. Taking your example in Matlab and applying the 'Leave One Out' method (I really need to implement this in my work) how would you do it? I understand the concept behind CV but I'm a bit confused as how to apply it here (or in any other example), because of my lack of experience. Any help would be MUCH appreciated! Thanks in advance, Nuno B.

Why is there also a sign problem, besides the first-to-last order problem, when comparing V and COEFF (for example, the last column of V against the first column of COEFF)? Which one should I use for the vector? Thanks.

The PC's are orthogonal (V*V.'), but I may say that the transformed data is not always uncorrelated. In case, in the original data, there is one variable that is a linear combination of other variables, this dependence is kept. However, PCA can help us to see these dependencies and eliminate redundant variables.

Thank you very much for this explanation. I found your page because I've been struggling for a while to understand why the coefficients given by matlab are systematically smaller than those given by Statistica. Today, I've realized that the norms of the factors (columns of COEFF), which should be equal to the factors' variances, are instead normalized to one. Do you know why, and more importantly how I can prevent this normalization in order to match Statistica's results? Thanks in advance!

Nice post; the examples make it all very clear. I recently wrote an article about what PCA actually means. This might be helpful for some, to get a more intuitive understanding: http://www.visiondummy.com/2014/05/feature-extraction-using-pca/

If you want to get the original data from B, you do not need COEFF at all. COEFF*COEFF'=identity matrix or 1. I think you should point out that V or COEFF are the eigenvectors of cov(B). It took me a while to figure out. In this case cov(B) is the same as corr(B), as they are z-scores. However, B*B' is 14*cov(B). Why the 14? It is puzzling me. I was thinking that would give the covariance matrix.
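On the factor of 14: for column-centered data, the cross-product matrix B'*B equals (n-1)*cov(B), where n is the number of observations, because the sample covariance divides by n-1. A factor of 14 would therefore correspond to 15 observations (an inference from the comment, since the table is not reproduced here). Note that the 3-by-3 product is B'*B, whereas B*B' is n-by-n. A small check on synthetic stand-in data with 15 rows:

```matlab
% Synthetic stand-in data (NOT the article's table): 15 observations, 3 variables
t = (1:15)';
A = [270  + 1.5*sin(t), ...
     39   + 0.3*sin(t)  + 0.1*cos(2*t), ...
     50.5 + 0.25*sin(t) - 0.1*cos(3*t)];
B = bsxfun(@rdivide, bsxfun(@minus, A, mean(A)), std(A));   % standardized, so column means are 0

n = size(B, 1);                       % number of observations (15 here)
G = B' * B;                           % 3-by-3 cross-product matrix
gap = norm(G - (n - 1) * cov(B));     % should be zero up to rounding
```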

Let's assume you have 3 features of an image instead of 3 measurements taken of a particular subject. I have 10 images, so my training dataset will be 10 x 3. If I use the matlab built-in function princomp and get COEFF, SCORE and LATENT, which one should I use? SCORE also gives me 3 columns. Do I need to use the first column only? How can I use this number for a better interpretation of my results, and how do I give the input to the classifier?

Hi Will, nice post explaining PCA. I wonder if you can help with my simple problem. I wish to do a GPR with input from PCA of my data, and I learned that the right way to do the CV is by doing PCA on the training set, then using the training regression coefficients to map the test set to their PCs. The following is my attempt in matlab:

After applying PCA, if I want to take only the first component and throw out the other 2 components and then reproduce the original data, my reproduced data still has three variables. Can you please tell me how this procedure affects my data?

Or if I want to predict some Y variable based on given data, then how can we use PCA for that?

For a multivariable data set, is it possible to tell which variable has the most influence on the 1st principal component? Another question: what is the significance of the negative sign in the V matrix components?

I have been going through a number of websites and textbooks to learn how to utilize PCA and was somewhat confused because of so many technical terms and formulas flying around. Your explanation just settled everything in place in my brain. Thank you!

Hey, thank you for the example. I tried it with and without using the princomp function. My solution without princomp matches your answer, but I am getting a different answer when I use princomp, as follows: coeff =

About Me

I am a data miner with more years of experience than I care to remember. I've worked in a variety of fields and used a wide array of tools, but MATLAB is my tool of choice.
Find me at:
http://www.linkedin.com/in/predictor