Decompositions

With a large matrix, we usually cannot afford to compute the entire
SVD. Instead, we compute the top 200 singular values and vectors using
the svds function from the rARPACK package.

library("rARPACK")

dtm_svd <- svds(dtm, 200)

For principal components analysis, we would usually “center” the matrix before
computing the singular value decomposition; that is, we would subtract the
column means from each row of the matrix. The right singular vectors would
then be the eigenvectors of the sample covariance matrix. The problem with
doing so is that centering makes the matrix dense, so in practice most
people skip the centering step. For that reason, when I talk about
“dispersion” below, I mean dispersion around zero (variance is dispersion
around the mean).
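To see the connection between centering and PCA, here is a sketch on a small dense matrix (a hypothetical random example, not the document-term matrix): after centering, the right singular vectors coincide, up to sign, with the eigenvectors of the sample covariance matrix.

```r
# Hypothetical small dense matrix, for illustration only.
set.seed(1)
x <- matrix(rnorm(20), 5, 4)

# Center: subtract each column's mean from every row.
xc <- scale(x, center = TRUE, scale = FALSE)

# Right singular vectors of the centered matrix...
v <- svd(xc)$v

# ...are the eigenvectors of the sample covariance matrix,
# up to the sign of each column.
e <- eigen(cov(xc))$vectors

max(abs(abs(v) - abs(e)))  # essentially zero
```

With a sparse document-term matrix, forming xc explicitly would destroy the sparsity, which is exactly why the centering step gets skipped.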

To see how much dispersion is explained by each component, look at the squares
of the singular values:

d <- dtm_svd$d
plot(d^2)

The total dispersion is equal to the sum of squares of all singular values,
which is equal to the sum of squares of all elements of the data matrix.

(disp_tot <- sum(dtm^2))

[1] 3769.374
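This identity (the squared Frobenius norm equals the sum of squared singular values) is easy to check on a small dense matrix where the full SVD is affordable; the example below uses a hypothetical random matrix.

```r
# Hypothetical small matrix: verify that the sum of squared singular
# values equals the sum of squared matrix entries.
set.seed(1)
x <- matrix(rnorm(30), 6, 5)

s <- svd(x)
all.equal(sum(s$d^2), sum(x^2))  # TRUE
```

With a truncated SVD like the 200-component one above, sum(d^2) captures only part of this total, which is what the next plot measures.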

The following graph shows the cumulative dispersion explained by the leading components of the singular value decomposition:

plot(cumsum(d^2) / disp_tot)
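The cumulative curve is easiest to sanity-check on a small example where all singular values are available: the explained fraction is nondecreasing and reaches exactly 1 when every component is kept. (Again a hypothetical random matrix, not the document-term matrix.)

```r
# Cumulative dispersion explained on a small full-SVD example.
set.seed(1)
x <- matrix(rnorm(30), 6, 5)

d <- svd(x)$d
frac <- cumsum(d^2) / sum(x^2)

frac            # nondecreasing, ends at 1
diff(frac)      # each component's marginal contribution, all >= 0
```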

We can see that with 200 components, we are explaining over 30% of the
dispersion in the data. The original data matrix has dimensions