Normally, to do a PCA I would calculate the covariance matrix and then find its eigenvectors and corresponding eigenvalues. I understand very well how to interpret both of these, and find it a useful way to get to grips with a data set initially.

However, I've read that with such a large data set it's better (faster and more accurate) to do the principal components analysis by doing singular value decomposition (SVD) on the data matrix instead.

I have done this using SciPy's svd function. I don't really understand SVD, so I might not have done it right (see below), but assuming I have, what I end up with is a matrix U of size $3000\times 3000$, a vector s of length $3000$, and a matrix V of size $3000\times 100079$. (I used the full_matrices=False option, otherwise it would have been $100079\times 100079$, which is just silly.)

My questions are as follows:

It seems plausible that the singular values in the s vector might be the same as the eigenvalues of the correlation matrix. Is this correct?

If so, how do I find the eigenvectors of the correlation matrix? Are they the rows of U, or its columns, or something else?

It seems plausible that the columns of V might be the data transformed into the basis defined by the principal components. Is this correct? If not, how can I get that?

To do the analysis, I simply took my data in a big $3000 \times 100079$ numpy array and passed it to the svd function. (I'm aware that one should normally center the data first, but my intuition says I probably don't want to do this for my data, at least initially.) Is this the right way to do it? Or should I do something special to my data before passing it to this function?
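In case it matters, the call looked roughly like the following (a sketch with a random placeholder array standing in for my real data):

```python
import numpy as np
from scipy.linalg import svd

# stand-in for the real 3000 x 100079 data array (same shape as in the question)
X = np.random.rand(3000, 100079)

# economy-size decomposition; note SciPy returns V already transposed,
# so the "V" here has shape 3000 x 100079 as described above
U, s, V = svd(X, full_matrices=False)
print(U.shape, s.shape, V.shape)  # (3000, 3000) (3000,) (3000, 100079)
```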

@amoeba thanks, that's a good answer - you should post it as one. (Will correct correlation to covariance, it was a thinko.)
– Nathaniel Sep 7 '14 at 1:21

@Nathaniel: Thank you, I posted my answer. I am wondering whether it settles the question for you? Let me know if anything still needs to be clarified.
– amoeba Sep 9 '14 at 8:55

@amoeba I think it pretty much does. I'm pretty busy at the moment, so I haven't had much of a chance to think about this stuff or go back to looking at my data, but your answer is definitely very helpful.
– Nathaniel Sep 9 '14 at 9:08

3 Answers

I think the first thing to remember is that the singular value decomposition $A = U \Sigma V^T$ coincides with the eigenvalue decomposition $A = S \Lambda S^{-1}$ only when $A$ is a symmetric positive (semi-)definite matrix, in which case $A = Q \Lambda Q^T$. Having said that, and going back to your first question: yes, it is possible for the singular values to be numerically the same as the eigenvalues, but in general, as shown below and noted by @amoeba, the singular values of $A$ are the square roots of the non-zero eigenvalues of $A^T A$.

Coming to your second question: assuming $A_{m \times n} = U \Sigma V^T$, the eigenvectors you are looking for are the columns of $V$, where $U$ and $V$ are unitary matrices: $V^T V = I_n$ and $U^T U = I_m$. I think this point also answers your third question. To make it more explicit: $A = U \Sigma V^T$ gives $A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T$, because $\Sigma^T \Sigma = \Sigma^2$ and $U^T U = I$. So $\Sigma^2 = \Lambda$. (Be careful: you most probably need a normalizing factor $\frac{1}{n-1}$ to achieve this equality with the covariance matrix.)
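A quick numerical check of this identity (a sketch with a small random matrix, not the OP's data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))  # small m x n stand-in

# singular values of A vs. eigenvalues of A^T A
s = np.linalg.svd(A, compute_uv=False)            # returned in descending order
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]  # sort descending to match

print(np.allclose(s**2, lam))  # True: Sigma^2 equals Lambda
```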

Regarding your final point: I usually work in the $m > n$ regime, so the eigen-decomposition of the covariance matrix is more efficient, and that takes care of the centring immediately. Having said that: yes, your intuition is correct; no, if you are looking to use the SVD to calculate principal components, you do not need to centre your data first. There is a nice discussion of this topic in the following thread: When should you center your data & when should you standardize?

My first references regarding the SVD and its connection to the eigen-decomposition are G. Strang's Introduction to Linear Algebra, Chapt. 6 Sect. 7, and I. T. Jolliffe's Principal Component Analysis, Chapt. 3 Sect. 5. Both are usually easy to find as worn library copies and should serve as a good introduction if you wish to move on to more advanced texts later.

The singular values of the data matrix are not the same as the eigenvalues of the covariance matrix (as the last sentence in your first paragraph seems to imply); instead, they are given by the square roots of the latter.
– amoeba Sep 6 '14 at 15:27

@amoeba: Apologies for the confusion; I never meant to imply that they are the same, but rather that they can be equal (for example, if $A$ is a strictly positive diagonal matrix). I guess I used the term "same" to match the OP's question. I should be more careful in my wording; I will fix it. I clarify the matter ($\Sigma^2 = \Lambda$) in the following paragraph.
– usεr11852 Sep 6 '14 at 20:22

No, this is incorrect: singular values of the data matrix (your $s$) are equal to the square roots of the eigenvalues of the covariance matrix, up to a scaling factor $\sqrt{N-1}$ where $N$ is the number of data points.

Eigenvectors of the covariance (NB: covariance! not correlation) matrix are given by the columns of $U$.

Almost correct: the columns of $V$ are the projections on the principal axes, but scaled to unit norm! The principal components themselves are given by the columns of $V$, each multiplied by the respective singular value.
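A small numerical check of these three points (a sketch under the convention used above: variables in rows, the $N$ data points in columns, and the data centred so that the covariance matrix applies):

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 5, 200                               # 5 variables, N = 200 data points
X = rng.standard_normal((p, N))
Xc = X - X.mean(axis=1, keepdims=True)      # centre each variable (row)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

C = np.cov(Xc)                              # p x p covariance matrix (divides by N-1)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]  # put in descending order

# 1) squared singular values, divided by N-1, are the eigenvalues
print(np.allclose(s**2 / (N - 1), evals))
# 2) columns of U are the eigenvectors of the covariance matrix (up to sign)
print(np.allclose(np.abs(U), np.abs(evecs)))
# 3) principal components = columns of V scaled by the singular values,
#    i.e. the rows of diag(s) @ Vt, which equal the projections U^T Xc
print(np.allclose(np.diag(s) @ Vt, U.T @ Xc))
```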

The two functions linked below compute the PCA using either np.linalg.eig or np.linalg.svd. They should help you see how to go between the two approaches. There's a larger PCA class in that module that you might be interested in. I'd like to hear some feedback on the PCA class if you do end up using it. I'm still adding features before we merge it in.

You can see the PR here. It won't let me post a deep link for some reason, so look for def _pca_svd and def _pca_eig.
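For reference, here is a minimal illustration of the two routes (not the code from the PR; the function names below are just placeholders echoing _pca_svd and _pca_eig, and rows are treated as observations):

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the centred data matrix (rows = observations)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (X.shape[0] - 1)   # variances along the principal axes
    components = Vt                     # rows are the principal axes
    scores = U * s                      # projections of the data (principal components)
    return eigvals, components, scores

def pca_eig(X):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]   # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs.T, Xc @ eigvecs

# the two routes agree (up to the signs of the axes)
X = np.random.default_rng(2).standard_normal((100, 6))
lam_svd, _, _ = pca_svd(X)
lam_eig, _, _ = pca_eig(X)
print(np.allclose(lam_svd, lam_eig))  # True
```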