From the results I've seen, manifold learning methods seem to generally outperform PCA on complicated, very high-dimensional datasets like images or videos. This makes sense to me, since nonlinear dimensionality reduction (like manifold learning) should be able to capture more complex structure in high-dimensional data than linear methods like PCA can. So, if manifold learning is more powerful and successful than PCA, why do people still use linear methods like PCA at all?

$\begingroup$I believe that it has to do with simplicity, speed and scalability. If you have a problem where linear decomposition is sufficient (even though it is a rougher approximation of the truth), then you may decide to use it. If you have a case where dimension reduction is absolutely necessary, but linear methods are completely insufficient, then you turn to more complex methods.$\endgroup$
– John Yetter, Jul 21 '16 at 16:37

$\begingroup$I nearly didn't read this question because of the uninformative title. Would you mind appending something like "... the value of linear methods in the presence of nonlinear models" to hint at that second focus of your question?$\endgroup$
– Gottfried Helms, Jul 24 '16 at 6:20

1 Answer

I'll expand on some of the points mentioned in the comments, and add a few more.

Computational complexity. PCA is more efficient in terms of both time and memory than more complicated nonlinear dimensionality reduction (NLDR) techniques. This is an important issue when working with large datasets, for which NLDR techniques may not be feasible. Even simple implementations of PCA can work with large data sets, and tricks are available for scaling up massively. Scaling tricks are available for some NLDR techniques (e.g. landmark isomap and online training for autoencoders), but this isn't always the case.
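One reason PCA scales so well is that the covariance matrix can be accumulated in a single streaming pass over the data, so the eigendecomposition cost depends on the dimensionality $d$, not the number of samples $n$. Here is a minimal numpy-only sketch of that trick on simulated data (the chunk sizes and dimensions are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a dataset too large to hold in memory by streaming it in chunks.
d = 20
mean_acc = np.zeros(d)
cov_acc = np.zeros((d, d))
n_total = 0
for _ in range(10):                      # 10 chunks of 1000 rows each
    chunk = rng.normal(size=(1000, d))
    mean_acc += chunk.sum(axis=0)
    cov_acc += chunk.T @ chunk           # accumulate raw second moments
    n_total += len(chunk)

mean = mean_acc / n_total
cov = cov_acc / n_total - np.outer(mean, mean)   # covariance from moments

# Eigendecomposition of the d x d covariance: cost depends on d, not n.
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, ::-1][:, :2]     # top-2 principal directions
```

Only the d-by-d moment matrix is ever stored, however many rows stream past.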

Sometimes linearity is appropriate. Sometimes the data really do lie near a low dimensional linear manifold. In these cases, linear techniques like PCA are most appropriate. Even when the manifold isn't perfectly linear, PCA may give a good enough approximation that more complicated techniques aren't warranted.
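To make this concrete, here is a small synthetic check (numpy only, with made-up dimensions): when data lie near a low-dimensional linear subspace, the top few principal components capture essentially all the variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Data near a 2-D plane embedded in 10-D, plus small isotropic noise.
basis = rng.normal(size=(2, 10))
X = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
# Fraction of variance captured by the first two components.
var_explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

For near-linear data like this, `var_explained` is essentially 1, and nothing more elaborate than PCA is warranted.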

Ease of use. PCA is straightforward to use. Given a particular implementation, there aren't any choices to make besides the number of dimensions. NLDR techniques typically require selecting at least one hyperparameter, and in some cases many hyperparameters. Running search procedures for hyperparameter tuning increases the already large computational cost of these methods. It's also necessary to choose one out of dozens of possible NLDR techniques to use in the first place, and the choice isn't always obvious. Different NLDR methods work well in different circumstances, and you may not know a priori which one is most appropriate.
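The contrast shows up directly in typical APIs. Assuming scikit-learn is available, PCA needs only a dimension count, while an NLDR method like Isomap also needs a neighborhood size whose value can make or break the embedding (the data here are arbitrary placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))

# PCA: the only choice to make is the number of dimensions.
Z_pca = PCA(n_components=2).fit_transform(X)

# Isomap: besides n_components, the neighborhood size must be tuned;
# a bad choice can disconnect or short-circuit the neighborhood graph.
Z_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)
```

With t-SNE or an autoencoder the hyperparameter list grows further (perplexity, learning rates, architecture), which is where the tuning cost mentioned above comes from.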

Forward mapping. PCA gives a mapping from the high dimensional to the low dimensional space. This makes it possible to apply the same transformation to out-of-sample data that wasn't part of the training set. This is necessary for cross-validation, and also useful when the same procedure must be extended to new data. Some NLDR techniques (e.g. autoencoders) also provide such a mapping natively, but most don't. Out-of-sample extension procedures have been devised for other NLDR methods, but they add to the complexity of the procedure by requiring additional runtime, learning, and/or hyperparameter tuning for the mapping itself.
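The forward mapping is just a mean shift plus a fixed linear projection, so applying it to unseen data is trivial. A numpy-only sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 8))
X_new = rng.normal(size=(50, 8))       # out-of-sample points

# "Fit" PCA on the training set only.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:3].T                           # projection onto top-3 components

# The identical linear map applies directly to unseen data.
Z_train = (X_train - mean) @ W
Z_new = (X_new - mean) @ W
```

Most NLDR methods only produce coordinates for the training points, so there is no analogous `W` to reuse without an extra out-of-sample extension step.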

Nonlinear downstream algorithms. Dimensionality reduction is often used as a pre-processing step for downstream learning algorithms (e.g. supervised learning). It may not be necessary to learn nonlinear structure during pre-processing, because this can be done by downstream algorithms. If nonlinear structure is present, it may just be necessary to use more principal components than the true/intrinsic dimensionality of the data (e.g. the surface of a hemisphere is intrinsically two dimensional, but can be perfectly preserved using three dimensions). This is not to say that NLDR pre-processing can't help; in some cases it can.
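The hemisphere example can be verified numerically: the surface is intrinsically 2-D, and two principal components lose the curvature, but three preserve the points exactly (numpy only, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
# Random points on a hemisphere: intrinsically 2-D, but curved.
theta = rng.uniform(0, 2 * np.pi, 1000)
phi = rng.uniform(0, np.pi / 2, 1000)
X = np.column_stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

def recon_error(k):
    # Reconstruction error when keeping the top-k principal components.
    Xk = (Xc @ Vt[:k].T) @ Vt[:k]
    return np.linalg.norm(Xc - Xk)

# recon_error(2) is large (curvature lost); recon_error(3) is ~0.
```

So a downstream learner given three components sees the hemisphere undistorted, even though PCA never modeled its nonlinear shape.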

Overfitting. NLDR techniques have a greater capacity to overfit than PCA as a consequence of their increased model complexity, so care must be taken.

Interpretability. In some cases, we may want to use dimensionality reduction to help understand the process that generated the data. PCA weights make it easier to say something in terms of the original dimensions of the data, but this isn't the case for many NLDR methods.
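As a sketch of what that interpretability looks like in practice: each principal component is a weight vector over the original features, so large weights point at which measured variables drive the component (numpy only; the "signal shared by features 0 and 1" setup is an invented illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
# Two original features share a common signal; the rest are noise.
n = 500
signal = rng.normal(size=n)
X = 0.1 * rng.normal(size=(n, 5))
X[:, 0] += signal
X[:, 1] += signal

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top_loadings = Vt[0]          # weight of each original feature on PC1
# Large |weights| on features 0 and 1 reveal that PC1 is their shared signal.
```

The low-dimensional coordinates of a t-SNE or Isomap embedding carry no such per-feature weights to inspect.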

Anthropological issues. PCA is an old, trusted, and widely known standard, which makes it a technique that people often reach for. Paper audiences, clients, and supervisors are more likely to be familiar with it. Awareness of NLDR algorithms is simply not as great, and implementations aren't as widely available.

All of that said, NLDR is an exciting field, and there are clearly cases where NLDR obliterates PCA. I only focus on the virtues of PCA because that's what the question is about. It's all a matter of context; whether PCA or NLDR is more appropriate depends on the situation.

$\begingroup$Thanks, great answers like these are what make me love stackexchange. I understand what you're getting at in your point about forward mapping. However, I think it's worth mentioning for future readers that techniques have been devised for doing PCA- or Autoencoder-like forward mapping in NLDR methods. See papers here and here.$\endgroup$
– KFox, Jul 22 '16 at 12:56

$\begingroup$Yes, that's a good point. I was trying to allude to that at the end of that paragraph. Good references. There are also some other papers floating around that use a general regression approach to learn the mapping and work for any NLDR method.$\endgroup$
– user20160, Jul 22 '16 at 13:50


$\begingroup$+6. This is an excellent answer, well-structured and informative.$\endgroup$
– amoeba, Jul 23 '16 at 20:45