Principal Component Analysis in C#

Principal Component Analysis (PCA) is an exploratory tool designed by Karl Pearson in 1901 to identify unknown trends in a multidimensional data set. It involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.

Foreword

Before you read this article, please keep in mind that it was written before the Accord.NET Framework was created and became popular. As such, if you would like to do Principal Component Analysis in your projects, download the accord-net framework from NuGet and either follow the starting guide or download the PCA sample application from the sample gallery in order to get up and running quickly with the framework.

Introduction

PCA essentially rotates the set of points around their mean so that the axes align with the principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. Please note that the signs of the columns of the rotation matrix are arbitrary, and so may differ between different PCA programs.
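To make the rotation idea concrete, here is a minimal sketch that centers the data, builds the covariance matrix, and finds only the first principal component by power iteration. It is an illustration on plain arrays, not the framework's implementation (which relies on SVD); the class and method names are hypothetical.

```csharp
using System;

public static class PcaSketch
{
    // Subtracts each column's mean so the data is centered on the origin.
    public static double[,] Center(double[,] data)
    {
        int rows = data.GetLength(0), cols = data.GetLength(1);
        var centered = new double[rows, cols];
        for (int j = 0; j < cols; j++)
        {
            double mean = 0;
            for (int i = 0; i < rows; i++) mean += data[i, j];
            mean /= rows;
            for (int i = 0; i < rows; i++) centered[i, j] = data[i, j] - mean;
        }
        return centered;
    }

    // Sample covariance matrix of already-centered data.
    public static double[,] Covariance(double[,] centered)
    {
        int rows = centered.GetLength(0), cols = centered.GetLength(1);
        var cov = new double[cols, cols];
        for (int a = 0; a < cols; a++)
            for (int b = 0; b < cols; b++)
            {
                double sum = 0;
                for (int i = 0; i < rows; i++) sum += centered[i, a] * centered[i, b];
                cov[a, b] = sum / (rows - 1);
            }
        return cov;
    }

    // Power iteration: converges to the eigenvector with the largest eigenvalue,
    // i.e. the direction of greatest variance. Assumes the all-ones start vector
    // is not orthogonal to that direction. The sign of the result is arbitrary.
    public static double[] FirstComponent(double[,] cov, int iterations = 200)
    {
        int n = cov.GetLength(0);
        var v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 1.0;
        for (int it = 0; it < iterations; it++)
        {
            var next = new double[n];
            for (int a = 0; a < n; a++)
                for (int b = 0; b < n; b++)
                    next[a] += cov[a, b] * v[b];
            double norm = 0;
            foreach (double x in next) norm += x * x;
            norm = Math.Sqrt(norm);
            if (norm == 0) break; // degenerate case: no variance along v
            for (int a = 0; a < n; a++) v[a] = next[a] / norm;
        }
        return v;
    }

    // Projects each centered row onto the component, giving one score per row.
    public static double[] Project(double[,] centered, double[] component)
    {
        int rows = centered.GetLength(0), cols = centered.GetLength(1);
        var scores = new double[rows];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                scores[i] += centered[i, j] * component[j];
        return scores;
    }
}
```

For points lying exactly on a line, such as (1, 2), (2, 4), (3, 6), the first component points along that line and the single score per point carries all of the variance.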

Accord.NET Framework

This new library, which I called Accord.NET, was initially intended to extend the AForge.NET Framework through the addition of new features such as Principal Component Analysis, numerical decompositions, and a few other mathematical transformations and tools. However, the library I created grew larger than the original framework I was trying to extend. In a few months, both libraries will merge under Accord.NET. (Update April 2015)

Design decisions

As people who want to use PCA in their projects usually already have their own Matrix class definitions, I decided to avoid using custom Matrix and Vector classes in order to make the code more flexible. I also tried to avoid dependencies on other methods whenever possible, to keep the code self-contained. I think this also made the code simpler to understand.

The code is divided into two projects:

Accord.Math, which provides mathematical tools, decompositions and transformations, and

Accord.Statistics, which provides the statistical analysis, statistical tools and visualizations.

Both of them depend on the AForge.NET core. Also, their internal structure and organization try to mimic AForge’s wherever possible.

The given source code doesn’t include the full source of the Accord Framework, which remains as a test bed for new features I’d like to see in AForge.NET. Rather, it includes only limited portions of the code to support PCA. It also contains code for Kernel Principal Component Analysis, as both share the same framework. Please be sure to look for the correct project when testing.

Using the code

To perform a simple analysis, you can simply instantiate a new PrincipalComponentAnalysis object, passing your data, and call its Compute method to compute the model. Then you can call the Transform method to project the data into the principal component space.
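The steps above might look like the following sketch. It assumes a reference to the Accord.Statistics assembly and the constructor, Compute and Transform members as they appeared in early Accord.NET releases; later releases changed this API, so treat the exact signatures as an assumption. The data values are purely illustrative.

```csharp
using Accord.Statistics.Analysis;

public static class AccordUsage
{
    public static double[,] Run()
    {
        // Rows are observations, columns are variables (illustrative values).
        double[,] data =
        {
            { 2.5, 2.4 },
            { 0.5, 0.7 },
            { 2.2, 2.9 },
            { 1.9, 2.2 },
            { 3.1, 3.0 },
        };

        // AnalysisMethod.Center subtracts the column means before the analysis.
        var pca = new PrincipalComponentAnalysis(data, AnalysisMethod.Center);

        // Computes the model (components and their importance).
        pca.Compute();

        // Projects the data into the principal component space.
        // Transform(data, 1) would instead keep only the first component.
        return pca.Transform(data);
    }
}
```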

Example application

To demonstrate the use of PCA, I created a simple Windows Forms Application which performs simple statistical analysis and PCA transformations.

The application can open Excel workbooks. Here we are loading some random Gaussian data, some random Poisson data, and a linear multiple of the first variable (thus also being Gaussian).

Simple descriptive analysis of the source data, with a histogram plot of the first variable. We can see it fits a Gaussian distribution with 1.0 mean and 1.0 standard deviation.

Here we perform PCA using the Correlation method. The transformation actually runs SVD on the standardized data rather than decomposing the correlation matrix; the effect is the same. As the third variable is a linear multiple of the first, the analysis detected it as irrelevant and assigned it a zero importance level.
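The link between standardized data and the correlation matrix can be sketched as follows: if Z holds the z-scores of each column, then Z'Z / (n - 1) is the correlation matrix. A column that is a multiple of another has correlation 1 with it, which makes the matrix rank-deficient and produces a zero eigenvalue. The helper names are hypothetical, and the sketch assumes no column is constant.

```csharp
using System;

public static class CorrelationSketch
{
    // Converts each column to z-scores: (value - mean) / standard deviation.
    // Assumes no column is constant (a zero standard deviation would divide by zero).
    public static double[,] Standardize(double[,] data)
    {
        int rows = data.GetLength(0), cols = data.GetLength(1);
        var z = new double[rows, cols];
        for (int j = 0; j < cols; j++)
        {
            double mean = 0;
            for (int i = 0; i < rows; i++) mean += data[i, j];
            mean /= rows;

            double sumSq = 0;
            for (int i = 0; i < rows; i++)
                sumSq += (data[i, j] - mean) * (data[i, j] - mean);
            double stdDev = Math.Sqrt(sumSq / (rows - 1));

            for (int i = 0; i < rows; i++) z[i, j] = (data[i, j] - mean) / stdDev;
        }
        return z;
    }

    // For standardized Z, Z'Z / (n - 1) is the correlation matrix of the original data.
    public static double[,] Correlation(double[,] data)
    {
        double[,] z = Standardize(data);
        int rows = z.GetLength(0), cols = z.GetLength(1);
        var corr = new double[cols, cols];
        for (int a = 0; a < cols; a++)
            for (int b = 0; b < cols; b++)
            {
                double sum = 0;
                for (int i = 0; i < rows; i++) sum += z[i, a] * z[i, b];
                corr[a, b] = sum / (rows - 1);
            }
        return corr;
    }
}
```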

Now we can make a projection of the source data using only the first two components.

Note: The principal components are not unique because the Singular Value Decomposition is not unique. Also the signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA.

Together with the demo application comes an Excel spreadsheet containing several data examples. The first example is the same one used by Lindsay in his Tutorial on Principal Component Analysis. The others include Gaussian data, uncorrelated data and linear combinations of Gaussian data to further exemplify the analysis.

I hope this code and example can be useful! If you have any comments about the code or the article, please let me know.

Kernel Principal Component Analysis is the non-linear extension of Principal Component Analysis. While linear PCA is restricted to rotating or scaling the data, kernel PCA can do arbitrary transformations (such as folding and twisting the data and the space that contains the data).

Your project is really very good, but I have a problem: when I open the program, it has difficulty identifying the “statistics” and “samples”, marking them as unavailable. Thus some properties in the solution cannot be read.

In SVD, singular values are equal to the square root of the eigenvalues. So the singular values are being squared (thus becoming large) to give the eigenvalues. Those eigenvalues, however, may not be the actual eigenvalues that would be obtained using an eigendecomposition, because the SVD implementation used automatically normalizes its singular values.

As the eigenvalues are used only to compute the amount of variance explained by each component, the important thing to note is that their ratio is preserved.
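A small sketch of why the scaling does not matter: squaring the singular values and dividing by their total gives the proportion of variance per component, and any common scale factor cancels out. The helper name is hypothetical, and the sketch assumes at least one non-zero singular value.

```csharp
using System.Linq;

public static class ExplainedVariance
{
    // Squares the singular values to get (scaled) eigenvalues, then
    // normalizes them so they sum to 1. Assumes a non-zero total.
    public static double[] Proportions(double[] singularValues)
    {
        double[] eigenvalues = singularValues.Select(s => s * s).ToArray();
        double total = eigenvalues.Sum();
        return eigenvalues.Select(e => e / total).ToArray();
    }
}
```

Calling Proportions on { 2, 1 } and on { 20, 10 } yields the same proportions, since only the ratio between the values matters.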

Thanks a lot for your efforts, it’s really great. I have a question: if I want to get the principal components for an image (my target is to classify faces and non-faces, so I need to apply PCA to the face images), how can I pass my data, which is an image in my case, to the PCA object? Thanks in advance.

You can always transform your image into a single vector and then pass it to PCA as you would pass any other input vector.

For example, if you have a 320×240 image, you can create a vector of 320*240=76800 positions and then copy the image pixel by pixel to this vector, in any order you wish, as long as you are consistent using the same ordering with all your images.
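The flattening described above can be sketched like this, assuming each image is already available as a grayscale double[,] array indexed as [row, column] and that all images share the same size. The names are hypothetical; row-major order is just one consistent choice.

```csharp
using System;
using System.Collections.Generic;

public static class ImageFlattening
{
    // Copies the pixels row by row into a single vector.
    public static double[] Flatten(double[,] image)
    {
        int height = image.GetLength(0), width = image.GetLength(1);
        var vector = new double[height * width];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                vector[y * width + x] = image[y, x];
        return vector;
    }

    // Stacks the flattened images so that each row of the result is one image,
    // which is the layout expected for a PCA input matrix (rows = observations).
    public static double[,] ToDataMatrix(IReadOnlyList<double[,]> images)
    {
        int length = images[0].Length;
        var data = new double[images.Count, length];
        for (int i = 0; i < images.Count; i++)
        {
            if (images[i].Length != length)
                throw new ArgumentException("All images must have the same size.");
            double[] row = Flatten(images[i]);
            for (int j = 0; j < length; j++) data[i, j] = row[j];
        }
        return data;
    }
}
```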

Dear César, thank you very much for making this valuable code available to the public. I have a problem using your code: my data is too large, about 500,000 rows and 118 columns. When I give this matrix to the DescriptiveAnalysis(sourceMatrix, sourceColumns) method, the exception “System.OutOfMemoryException” is thrown. Is there any way that, instead of giving a matrix, I can pass a comma-separated file containing the info to the methods?

Have you tried running the method on a 64 bit system? Perhaps it should work. Besides, the error happens in the DescriptiveAnalysis class, perhaps it may work if you comment this portion of the code and use only the PrincipalComponentAnalysis classes.

Hello! This code is really helpful, but my problem is that I don’t know how to use it. My project is about face recognition and, based on my research, PCA is paired with a neural network in most face recognition systems. How will I connect your code with my neural network? What exactly is the output of PCA that will be the input of the neural network? Is it the eigenvector values? I have read that the output of PCA is the eigenvectors. I really can’t understand PCA. I wish you could help me. Thank you in advance!

PCA can be seen as a linear transformation. Being a transformation, what it does is project your data into another space. The PCA output you are looking for is the projection of your original data into this space, which in the case of PCA, will be a space where your variables are (hopefully) uncorrelated.

The eigenvectors found by the analysis will form the basis for this new space, and the eigenvalues can be used to measure the importance of each of the vectors. If you discard the less important eigenvectors before performing the projection, then you can also perform dimensionality reduction in the process.
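As a sketch of the projection step, assuming the column means and the unit-length eigenvectors (sorted by decreasing eigenvalue) are already known: subtract the means from a sample and take its dot product with each retained eigenvector. Keeping fewer eigenvectors than variables gives the reduced representation that could be fed to a classifier. The names are hypothetical and not part of the framework.

```csharp
public static class ProjectionSketch
{
    // components[c] is the c-th eigenvector, ordered by decreasing eigenvalue.
    // keep is the number of components to retain (keep <= components.Length).
    public static double[] Project(double[] sample, double[] means, double[][] components, int keep)
    {
        var result = new double[keep];
        for (int c = 0; c < keep; c++)
        {
            double dot = 0;
            for (int j = 0; j < sample.Length; j++)
                dot += (sample[j] - means[j]) * components[c][j];
            result[c] = dot;
        }
        return result;
    }
}
```

Only the means and the eigenvectors are needed to project a new sample; the original training data is not.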

By the way, I have used PCA as a preprocessing step for ANNs too. If you wish, please take a look at the images in this poster; they might help to understand how PCA can be used in this scenario.

I reduced the number of samples to 50, but still the number of features (118) seems to be too much for your code. The maximum number of features the code can handle seems to be 46 for my data, otherwise it takes forever to finish pca.Compute(). Do you have any suggestions?

Well, can I have a look at your data? If Compute is taking forever (and is not throwing any exceptions) then this may be a bug. If you could provide an excerpt of your data (perhaps the 50 samples with 118 columns you mentioned) it would be great!

Hello, I think there is a small bug in the PCA implementation. When you use a matrix where the number of columns is higher than the number of rows, it does not work correctly (the Transform method returns all zeros). I think the solution is in the PCA.Compute method, where setting all params to “true” helps.

Thanks, you are correct about that. The latest version of PCA in the development branch of Accord.NET does indeed use those parameters when creating the SingularValueDecomposition, but I forgot to update the code available here.

If you have installed the framework using the executable installer, the source code will be available in the installation folder. However, if you can wait a little, I will try to release a new version of the framework this week.

I’m all of a sudden stuck with a problem concerning the dimensions of input data and pca / kda and I’d like to ask, if you have encountered similar behaviour.

Is the maximum number of principal components limited to the number of rows of the input data? I always receive an exception “Index was outside the bounds of the array” during the first call of “.Transform(matrix,pcacomponents)” after “.Compute()” with Analysis method “Center”.

In PCA this shouldn’t be a problem. I believe I had corrected this in recent versions of the Accord.NET Framework. However, for KPCA, the limit is indeed the number of rows in your data. KPCA works by performing PCA over the kernel matrix. The kernel matrix has the same dimensions as the number of rows in your data, so I guess it is not possible to generate more components than rows.

The SVD computation never ends with my data. For example, if I pass 51×400 data it works; if I pass 52×400 it stops working 😡 I did some debugging and found out that the “p” variable in the SingularValueDecomposition class is not decreasing in the switch (kase) statement (it only decreases when kase==4), which is never reached 😡

Well, sort of. When we apply PCA to categorical data, we could indeed obtain a lower-dimensional representation of the data. Please see page 339 of the book “Principal Component Analysis”, by I.T. Jolliffe. This particular page is available in Google Books. As can be seen, the author states that “For data in which all variables are binary, Gower (1966) points out that using PCA does provide a plausible low-dimensional representation.”

I discovered a bug in the PCA adjust function. If any of the standard deviations are zero this will return NaN and propagate through SVD causing the process to fail:

matrix[i, j] = (m[i, j] - columnMeans[j]) / columnStdDev[j];

I’ve updated it to do an additional check for zero standard deviations and divide by epsilon if one is found.

matrix[i, j] = (m[i, j] - columnMeans[j]) / Double.Epsilon;

I’m also a little confused by the number of eigenvalues generated when using data with more dimensions than samples.

I have a dataset consisting of 5 samples (rows) with 480 dimensions (columns). The SVD algorithm returns 480 eigenvectors, however, it only returns 6 eigenvalues. I was under the impression that there should be one eigenvalue for every eigenvector. I figured the number of Principal Components would correspond to the dimensions such that it would be possible to analyze the dimensions in their entirety. I’m still new to PCA and was wondering if you could explain.

Thanks for reporting the issue. This was already fixed a while ago in the main Accord.NET sources. If a variable has zero standard deviation then it should be removed from the data set, since it will have no impact on the analysis.
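Removing such variables before the analysis can be sketched as below. This is a hypothetical pre-processing helper on plain arrays, not the fix that went into Accord.NET; it treats a column as constant when every value equals the first one, which is exactly the case where the standard deviation is zero.

```csharp
using System.Collections.Generic;

public static class ConstantColumns
{
    // Returns a copy of the data without the columns whose values never change.
    public static double[,] Remove(double[,] data)
    {
        int rows = data.GetLength(0), cols = data.GetLength(1);

        // Collect the indices of the columns that vary.
        var kept = new List<int>();
        for (int j = 0; j < cols; j++)
        {
            bool varies = false;
            for (int i = 1; i < rows && !varies; i++)
                varies = data[i, j] != data[0, j];
            if (varies) kept.Add(j);
        }

        // Copy only the retained columns, preserving their order.
        var result = new double[rows, kept.Count];
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < kept.Count; k++)
                result[i, k] = data[i, kept[k]];
        return result;
    }
}
```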

About your second question, I have written a tutorial about using PCA through SVD. However, it is still somewhat unfinished, so if you wish I can send you a partial version by email.

This is Ananth. I downloaded the sample from your website but was not able to run the code in Visual Studio 2010. It’s saying that the accord.statistics.dll and accord.statistics.controls.dll.?? What could be the solution?

This had to do with how the matrices were handled by the classifiers in the AForge.NET Framework (which I wanted to keep compatible). The internal matrix processing routines in Accord.NET (such as the matrix decompositions) can work faster on multidimensional matrices than on jagged ones. This is possible because they make heavy use of unsafe pointer operations.

Hi Cesar, I’m working on image denoising using the PCA-LPG approach (LPG = Local Pixel Grouping). The problem is how to group the pixels that are output from your PCA. Or, better still, could you provide some insight into how I can apply your code to denoising? Your help will really be appreciated, thanks.

Thanks for letting me know about the issue! Perhaps it got lost in some refactoring. If you wish to determine how many components you need in order to achieve a given percentage of information, you can use the GetNumberOfComponents method of the PrincipalComponentAnalysis class. Then you can use this number as the second argument of the Transform function.
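The idea behind such a count can be sketched as follows: accumulate the eigenvalues in decreasing order until their share of the total reaches the requested threshold. This is a hypothetical helper illustrating the principle, not the framework's GetNumberOfComponents implementation; it assumes the eigenvalues are already sorted in decreasing order and have a non-zero sum.

```csharp
using System.Linq;

public static class ComponentCount
{
    // Returns the smallest number of leading components whose cumulative
    // proportion of variance is at least the threshold (between 0 and 1).
    public static int ForThreshold(double[] eigenvalues, double threshold)
    {
        double total = eigenvalues.Sum();
        double cumulative = 0;
        for (int i = 0; i < eigenvalues.Length; i++)
        {
            cumulative += eigenvalues[i];
            if (cumulative / total >= threshold) return i + 1;
        }
        return eigenvalues.Length;
    }
}
```

With eigenvalues { 6, 3, 1 }, a threshold of 0.8 needs two components, since the first alone explains 60% and the first two explain 90%.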

I ran the PCA and it works fine. However, I cannot find a way to save the principal component space in order to reuse it later.
In fact, I need to put my software into production and I do not want to provide the complete set of data. I just want to project a new sample into the principal component space.
Is there a way to do that?