PCA demystified

In my scientific field (Neuroscience), Principal Component Analysis (PCA) is very trendy. Yet even though it is widely used, I have the impression that many people are scared of it. I understand that: PCA does look like a scary, advanced analysis.

Well, surprisingly again, PCA is ONLY two lines of code in Matlab. Yes, only 2 and only using good old Matlab functions without any toolbox.

These 2 lines of code are a little dense conceptually but nothing too fancy, so let’s embark on this adventure to demystify PCA!

First, as usual, we need a good example. Instead of picking something from Neuroscience, I decided to take something that would speak to a larger audience and would also be interesting: polls.

There is a presidential election very soon in France, so everybody is talking about these Polls and I thought we could see what PCA can tell us about it.

So I collected the results of all the polls since the beginning of the year.
I came up with this graph:

This graph simply shows the percentage obtained by each candidate in the last 55 polls. You can see all the candidates go up or down with time. For instance, the red curve is Jean-Luc Melenchon, a left wing candidate who has been rising steeply in the last 15 polls.

So I took this data and organized it into one single matrix called PollData.
In this matrix, each column is one candidate and each row is one poll, i.e. the distribution of percentages across all candidates for that poll.

So here: number of rows = 55 and number of columns = 9. This is how you should organize your data to do PCA, i.e. variables along the columns and repetitions along the rows.
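If you want to follow along without the real numbers, here is a minimal sketch that builds a matrix with the same layout. The values are random placeholders, not actual polls, and the names nPolls, nCandidates and raw are only there for illustration.

```matlab
% Placeholder data with the same layout as PollData:
% one row per poll, one column per candidate. NOT real poll results.
nPolls = 55;
nCandidates = 9;
raw = rand(nPolls, nCandidates);          % random positive scores

% Rescale every row so the percentages of one poll add up to 100.
PollData = 100 * bsxfun(@rdivide, raw, sum(raw, 2));

[nRows, nCols] = size(PollData);
fprintf('%d polls (rows), %d candidates (columns)\n', nRows, nCols);
```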

Now, in PCA, the first thing to do is to get the covariance matrix. Hold on, that is an easy one. The covariance matrix is just an extension of the variance. On the diagonal, it holds the variance of each variable (here the variance across polls for one candidate). The other elements are the covariances between pairs of variables, for example candidate 1 and candidate 2. If the value is positive, they covary: they go up and down together. If the value is negative, they anti-covary: one goes up when the other goes down. If it is zero, they are not correlated.

In Matlab, getting the covariance matrix is easy, just do :

CovMat=cov(PollData);

This is line number 1 of the PCA.
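To see what cov does under the hood, here is a small sketch that rebuilds the covariance matrix by hand and compares it with the built-in result. It runs on random placeholder data (X stands in for PollData), so only the comparison matters, not the values.

```matlab
X = randn(55, 9);                         % placeholder data, not real polls
n = size(X, 1);                           % number of rows (polls)

% Subtract the mean of each column, then average the cross-products.
Xc = bsxfun(@minus, X, mean(X, 1));
CovManual = (Xc' * Xc) / (n - 1);         % cov normalizes by n-1 by default

CovMat = cov(X);
maxDiff = max(abs(CovManual(:) - CovMat(:)));
fprintf('largest difference from cov(): %g\n', maxDiff);

% The diagonal holds the variance of each column.
diagDiff = max(abs(diag(CovMat)' - var(X)));
fprintf('largest difference from var(): %g\n', diagDiff);
```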

You can actually plot this matrix as an image. It is sort of interesting. Here I get this:

This matrix shows covariation. Here it is clear that Nicolas Sarkozy (right wing) has a negative covariance with Francois Hollande (left wing). That makes sense. Even more interesting, Le Pen (extreme right) has a strongly negative covariance with Sarkozy (right). That makes sense again: they fight for the same voters.
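Covariance values depend on how much each candidate varies, so to compare pairs on a common scale from -1 to 1 you can rescale the covariance matrix into a correlation matrix before plotting it. Here is a sketch of one way to do that and to display the result as an image, again on random placeholder data. The plotting choices are my assumptions, not the exact commands behind the figure.

```matlab
X = randn(55, 9);                  % placeholder data, not real polls
CovMat = cov(X);

s = sqrt(diag(CovMat));            % standard deviation of each column
CorrMat = CovMat ./ (s * s');      % divide entry (i,j) by s(i)*s(j)

imagesc(CorrMat);                  % show the matrix as a color image
axis square;
colorbar;
title('Correlation between columns (placeholder data)');
```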

Ok, that’s nice. If you are French, I am sure you are deeply enjoying this mathematical take on politics. But let’s suppose we ask the following:
What is really important in these polls? What are the most important variations in the data?

This is when PCA comes in handy.

PCA is a way to re-express the data along the directions in which it varies the most. To do so, it creates a new coordinate system whose axes follow these directions of maximal variance.

But let’s just do it and you will see what I am talking about.

We are going to take our covariance matrix, and we are going to look for the eigenvectors and the eigenvalues of this matrix, like this :

[V,D]=eigs(CovMat,4);

That’s it! This is line 2. We have done PCA. Let’s make sense of it.
So, what eigs does here is to look for the 4 eigenvectors with the largest eigenvalues. Conceptually, it looks at the covariance matrix and builds a weighted combination of all 9 candidates, a sort of new candidate, that varies the most across polls. This is principal component 1. Then it looks for a new combination that is orthogonal to the previous one and captures as much of the remaining variance as possible, and so on. Roughly speaking, if candidate 3 is very strong on the first component, its weight will tend to be weaker on the following components.

Now the weights of these combinations are in V: each column is one component and each row is one candidate.
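If you use eig instead of eigs, you get all the eigenvectors at once, but the order is not guaranteed to be largest first (for a symmetric matrix it typically comes out smallest first). Here is a sketch, on random placeholder data, that sorts them explicitly and checks that the kept columns are orthonormal. The names Vall, Dall, vals and order are just illustrative.

```matlab
X = randn(55, 9);                               % placeholder data, not real polls
CovMat = cov(X);
CovMat = (CovMat + CovMat') / 2;                % enforce exact symmetry

[Vall, Dall] = eig(CovMat);                     % all 9 eigenvectors and eigenvalues
[vals, order] = sort(diag(Dall), 'descend');    % largest variance first
V = Vall(:, order(1:4));                        % keep the 4 leading components

% Columns of V have unit length and are mutually orthogonal,
% so V' * V should be the 4-by-4 identity matrix.
orthoErr = max(max(abs(V' * V - eye(4))));
fprintf('deviation from orthonormality: %g\n', orthoErr);
```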

D is a diagonal matrix that gives you the variance of each of these new components (these are the eigenvalues).
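You can check this numerically: project the mean-centered data onto the columns of V and compute the variance of each projection. Here is a sketch on random placeholder data. Scores and explained are names I made up for the illustration.

```matlab
X = randn(55, 9) * diag(9:-1:1);        % placeholder data, unequal column variances
CovMat = cov(X);
[V, D] = eigs(CovMat, 4);               % 4 leading eigenvectors and eigenvalues

% Project the mean-centered data onto the 4 components.
Xc = bsxfun(@minus, X, mean(X, 1));
Scores = Xc * V;                        % 55-by-4, one column per component

scoreVar = var(Scores);                 % variance of each projected component
eigVals = diag(D)';                     % eigenvalues as a row vector
disp([scoreVar; eigVals]);              % the two rows should match

% Fraction of the total variance carried by each component.
explained = eigVals / trace(CovMat);
disp(explained);
```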

If we now plot V, we get the following image :

Now I am going to let you brush up on your French politics. But this is quite interesting.
This graph tells you, in the first column, that Principal Component 1 is very positive on Sarkozy and Jean-Luc Melenchon and very negative on Marine Le Pen.
In other words, the most important thing in all these polls is that both Sarkozy and Melenchon are rising and that Le Pen is going down. This is Component 1.

The math is telling us here that these 3 people are the most important changing variables in the polls. Hollande, even if he is so far the most likely winner of the election, is not part of this dynamic.
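Reading the first column of V by eye works, but you can also rank the weights. Here is a sketch that lists the three columns with the largest absolute weight on component 1. It uses random placeholder data and hypothetical labels C1 to C9 instead of the real candidate names. Keep in mind that the overall sign of an eigenvector is arbitrary, so only the relative signs within a column are meaningful.

```matlab
X = randn(55, 9);                                   % placeholder data, not real polls
names = {'C1','C2','C3','C4','C5','C6','C7','C8','C9'};   % hypothetical labels

[V, D] = eigs(cov(X), 4);

% Rank the candidates by the size of their weight on component 1.
[absW, idx] = sort(abs(V(:, 1)), 'descend');
for k = 1:3
    fprintf('%s: weight %+.2f\n', names{idx(k)}, V(idx(k), 1));
end
```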

The covariance calculation actually subtracts the mean from the data.
It does not divide by the standard deviation. Yes.
You can z-score the data first. I suppose in many cases it will be informative.
I don’t think this is part of PCA per se, as stated here: http://en.wikipedia.org/wiki/Principal_component_analysis
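For those who want to try the z-score variant, here is a sketch that standardizes each column by hand, so no toolbox is needed. It also shows that the covariance of z-scored data is the same thing as the correlation matrix of the original data. The data is a random placeholder.

```matlab
X = randn(55, 9) * diag(1:9);               % placeholder data, unequal column variances

% Z-score: subtract the column mean, divide by the column standard deviation.
Xc = bsxfun(@minus, X, mean(X, 1));
Z = bsxfun(@rdivide, Xc, std(X, 0, 1));

CovZ = cov(Z);                              % covariance of the standardized data
R = corrcoef(X);                            % correlation of the original data
zDiff = max(abs(CovZ(:) - R(:)));
fprintf('largest difference between cov(Z) and corrcoef(X): %g\n', zDiff);

[V, D] = eigs(CovZ, 4);                     % PCA on the standardized data
```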

Hello! Thank you for your article! It is very understandable for a newbie like me and it is much appreciated.

I have one question to follow up on the demystification of PCA.

I’ve seen on this website: http://matlabdatamining.blogspot.com/2010/02/principal-components-analysis.html that the matrix V has its coefficients by column from the last principal component (PC) to the first one (whereas princomp gives the opposite, i.e. PC1 PC2 … PCX). Can you confirm it? It would have been very helpful if you had given us the line of code to plot the matrix! To figure out exactly what’s happening.

This was a very clear tutorial on PCA—thank you. In the future, it would be great if you could list some examples of how different types of studies and data are analyzed with PCA. It’s clear why the poll data works (there’s a clear set of potentially correlated variables with many separate data sets), but I also have heard that PCA is used in more complicated things like image segmentation. It would be awesome if you could give us an idea of what would go in the columns and rows of the data matrix in cases like these.

Hey, these were the steps I followed for object identification from a set of training and testing images:

1. convert image from RGB to Grayscale
2. calculate the mean of all the 700 images in the database.
3. subtract the mean obtained above from each image to normalize them
4. compute covariance of each image using cov ()
5. use eigs() on each covariance matrix above and extract first 6 eigen vectors for each image

now I shall store them in a mat file and use a distance metric to compare the test images.
Correct?

Hi, I read your innovative PCA instruction and explanation. It is awesome. Thanks.
So, just out of curiosity, who finally won in this election? Did the election outcome reflect the PCA analysis?
Thanks.

Thanks. Hollande won in the end. He kept being high in the polls all the way till the end. PCA is looking at variance not the absolute mean value. So if one guy is constantly high, it will not come first. So I guess this approach is interesting to understand the dynamic between candidates, not the absolute future outcome.
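A quick way to convince yourself of this: add a constant to one column and look at the covariance matrix again. Here is a sketch on random placeholder data, where the 30-point shift is an arbitrary value chosen for the illustration.

```matlab
X = randn(55, 9);                        % placeholder data, not real polls

Xshift = X;
Xshift(:, 1) = Xshift(:, 1) + 30;        % column 1 is now 30 points higher in every poll

% The covariance matrix removes the mean, so the shift has no effect
% (up to rounding error), and neither do the principal components.
shiftDiff = max(max(abs(cov(Xshift) - cov(X))));
fprintf('largest change in the covariance matrix: %g\n', shiftDiff);
```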

Nice approach to explaining PCA in a friendly manner! I especially like your visualization of the covariance matrix as a co-occurrence grid! I recently wrote an article that takes a different approach to explaining PCA in an intuitive manner, by comparing it with well-known decorrelation and regression methods: http://www.visiondummy.com/2014/05/feature-extraction-using-pca/