Month: October 2009

Principal Component Analysis (PCA) is an exploratory tool designed by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.

Foreword

Before you read this article, please keep in mind that it was written before the Accord.NET Framework was created and became popular. As such, if you would like to do Principal Component Analysis in your projects, download the accord-net framework from NuGet and either follow the starting guide or download the PCA sample application from the sample gallery in order to get up and running quickly with the framework.

Introduction

PCA essentially rotates the set of points around their mean in order to align with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be highly correlated and may be dropped with minimal loss of information. Please note that the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

Accord.NET Framework

This new library, which I called Accord.NET, was initially intended to extend the AForge.NET Framework through the addition of new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. However, the library I created grew larger than the original framework I was trying to extend. In a few months, both libraries will merge under Accord.NET. (Update April 2015)

Design decisions

As people who want to use PCA in their projects usually already have their own Matrix classes definitions, I decided to avoid using custom Matrix and Vector classes in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to make the code very independent. I think this also made the code simpler to understand.

The code is divided into two projects:

Accord.Math, which provides mathematical tools, decompositions and transformations, and

Accord.Statistics, which provides the statistical analysis, statistical tools and visualizations.

Both of them depends on the AForge.NET core. Also, their internal structure and organization tries to mimic AForge’s wherever possible.

The given source code doesn’t include the full source of the Accord Framework, which remains as a test bed for new features I’d like to see in AForge.NET. Rather, it includes only limited portions of the code to support PCA. It also contains code for Kernel Principal Component Analysis, as both share the same framework. Please be sure to look for the correct project when testing.

Using the code

To perform a simple analysis, you can simple instantiate a new PrincipalComponentAnalysis object passing your data and call its Compute method to compute the model. Then you can simply call the Transform method to project the data into the principal component space.

Example application

To demonstrate the use of PCA, I created a simple Windows Forms Application which performs simple statistical analysis and PCA transformations.

The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiplication of the first variable (thus also being Gaussian).

Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.

Here we perform PCA by using the Correlation method. Actually, the transformation uses SVD on the standardized data rather than on the correlation matrix, the effect being the same. As the third variable is a linear multiplication of the first, the analysis detected it as irrelevant, thus having a zero importance level.

Now we can make a projection of the source data using only the first two components.

Note: The principal components are not unique because the Singular Value Decomposition is not unique. Also the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same used by Lindsay on his Tutorial on Principal Component Analysis. The others include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further exemplify the analysis.

I hope this code and example can be useful! If you have any comments about the code or the article, please let me know.

This is the non-linear extension of Principal Component Analysis. While linear PCA is restricted to rotating or scaling the data, kernel PCA can do arbitrary transformations (such as folding and twisting the data and the space that contains the data).