Thoughts on Science and Nature

Principal components analysis (PCA) is one of the oldest and most important transformations in multivariate data analysis. The central idea is to generate linear combinations of the input variables that are uncorrelated and have maximum variance. Keeping only the first few of these combinations reduces the dimensionality of the data while enhancing the features of interest.

In remote sensing this technique can be used to reduce the number of bands needed for a certain analysis (e.g. classification), and so reduce computing costs while keeping as much as possible of the variability present in the data. Most GIS and remote sensing software packages in use today implement this function in one way or another. In practice, it is enough for an analyst to press a virtual button to calculate the principal components of an image. This is comfortable but boring. It robs us of the fun of understanding the basic principles and seeing how this transformation works behind the scenes. Let’s have a look!

Here we follow the explanation given by Canty (2007), although the method is well explained in many other textbooks (Schowengerdt, 2006, has a nice explanation too). The n bands of our image are the n dimensions of the data. We project these bands onto n new orthogonal bands, chosen so that they are mutually uncorrelated and each one captures as much of the remaining variance as possible. Finding these new bands can be recast as an eigenvalue problem: we compute the eigenvalues and eigenvectors of the covariance matrix of the bands, and then create the new bands by applying the resulting linear transformations to our data.
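For readers who like to see the step written out, here is a compact sketch of how the maximum-variance requirement turns into an eigenvalue problem. The symbols are notation chosen here for illustration and are not necessarily those used by Canty: x is the vector of mean-centred band values at one pixel, Σ is the n × n covariance matrix of the bands, w is a vector of weights, and λ is a Lagrange multiplier.

```latex
\begin{aligned}
& y = w^{\top} x, \qquad \operatorname{var}(y) = w^{\top} \Sigma\, w \\
& \max_{w}\; w^{\top} \Sigma\, w \quad \text{subject to} \quad w^{\top} w = 1 \\
& \frac{\partial}{\partial w}\left[\, w^{\top} \Sigma\, w - \lambda \left( w^{\top} w - 1 \right) \right] = 0
  \;\Longrightarrow\; \Sigma\, w = \lambda\, w \\
& \operatorname{var}(y_i) = w_i^{\top} \Sigma\, w_i = \lambda_i, \qquad
  \operatorname{cov}(y_i, y_j) = w_i^{\top} \Sigma\, w_j = 0 \quad (i \neq j)
\end{aligned}
```

In words: the weight vectors are the unit eigenvectors of Σ, the variance of each new band is the matching eigenvalue, and because eigenvectors of a symmetric matrix can be chosen orthogonal, the new bands come out uncorrelated.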

All procedures combine different open source tools: the code is written in Perl, using the Perl Data Language, together with Generic Mapping Tools, ImageMagick, GDAL and R. The code used for computing PCA can be found in the figshare repository. Note that this is a very eclectic approach using tools I know. I make no claim to write neat code, nor do I pretend that my code is best practice. I am sure others can do better!

As an example image we will use a subset of a Landsat 7 ETM+ scene (path/row 193/018, acquired 2002-08-04) depicting the city of Uppsala, Sweden, and its surroundings.

We see that there is a significant correlation between these bands, particularly those that are spectrally close:

Scatterplot of band data, showing the high correlation of spectrally close bands.
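The correlation visible in such a scatterplot can also be put into numbers. The following is a minimal Python sketch, not the Perl code used for this post; it computes Pearson correlation coefficients between bands given as plain lists of pixel values. The function names and the three tiny bands are made up for illustration and are not taken from the Landsat scene.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def correlation_matrix(bands):
    """Correlation coefficient for every band pair (i, j)."""
    return [[pearson(bi, bj) for bj in bands] for bi in bands]


# Made-up pixel values: the first two bands vary together, the third does not.
toy_bands = [
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 19.0, 33.0, 41.0],
    [40.0, 10.0, 35.0, 15.0],
]
corr = correlation_matrix(toy_bands)

for row in corr:
    print(" ".join(f"{value:6.3f}" for value in row))
```

Values close to 1 or -1 correspond to the narrow, elongated point clouds in a scatterplot, while values near 0 correspond to round, shapeless ones.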

Using PCA we will find a new set of bands where this correlation is eliminated. We begin by computing the covariance of each band pair (i, j), treating each band as a random vector with one element per pixel. Together these values form the n × n covariance matrix of the image.

Note that we use normalized (mean-centred) bands, where the band’s mean value has been subtracted from each pixel value. For our six bands this gives a 6 × 6 covariance matrix, with one row and one column per band.
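As a rough illustration of this step, here is a minimal Python sketch of the covariance computation; it is not the Perl Data Language code used for this post. It assumes each band is a flat list of pixel values and uses the unbiased 1/(N - 1) normalisation, which is an assumption, since the text does not say which normalisation was used. The toy values are made up.

```python
def center(band):
    """Subtract the band mean from every pixel value."""
    mean = sum(band) / len(band)
    return [value - mean for value in band]


def covariance_matrix(bands):
    """Return the n x n covariance matrix of n bands (lists of pixel values)."""
    centered = [center(band) for band in bands]
    n_pixels = len(centered[0])
    return [
        [
            # Assumption: unbiased estimate, dividing by N - 1.
            sum(a * b for a, b in zip(band_i, band_j)) / (n_pixels - 1)
            for band_j in centered
        ]
        for band_i in centered
    ]


# Made-up pixel values for three bands, not taken from the Landsat scene.
toy_bands = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = covariance_matrix(toy_bands)

for row in cov:
    print(" ".join(f"{value:8.4f}" for value in row))
```

The diagonal holds the variance of each band and the off-diagonal entries hold the covariances of the band pairs, so the matrix is symmetric.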

We then compute the eigenvalues and eigenvectors of the covariance matrix and sort them by decreasing eigenvalue. The ith principal component is a weighted sum of the n bands, with the n elements of the ith eigenvector as weights.
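The whole chain, from covariance matrix to principal components, can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind this post: the eigenvalue problem is solved with classical Jacobi rotations, chosen here only because they need nothing beyond the standard library, and the covariance uses the 1/(N - 1) normalisation. All function names and the toy bands are made up.

```python
import math


def jacobi_eigen(symmetric, sweeps=50, tol=1e-24):
    """Eigen-decomposition of a symmetric matrix by Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[i] is the unit eigenvector belonging to eigenvalues[i].
    """
    n = len(symmetric)
    a = [list(row) for row in symmetric]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    total = sum(x * x for row in a for x in row)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= tol * total:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    eigenvalues = [a[i][i] for i in order]
    eigenvectors = [[v[k][i] for k in range(n)] for i in order]
    return eigenvalues, eigenvectors


def principal_components(bands):
    """Project mean-centred bands onto the eigenvectors of their covariance."""
    centered = []
    for band in bands:
        mean = sum(band) / len(band)
        centered.append([value - mean for value in band])
    n_pixels = len(centered[0])
    cov = [[sum(x * y for x, y in zip(bi, bj)) / (n_pixels - 1)
            for bj in centered] for bi in centered]
    eigenvalues, eigenvectors = jacobi_eigen(cov)
    # Component i at pixel k: sum over bands of weight * centred band value.
    components = [
        [sum(w * band[k] for w, band in zip(vector, centered))
         for k in range(n_pixels)]
        for vector in eigenvectors
    ]
    return eigenvalues, eigenvectors, components


# Made-up pixel values for three bands, not taken from the Landsat scene.
toy_bands = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    [6.0, 4.0, 5.0, 2.0, 3.0, 1.0],
]
eigenvalues, eigenvectors, components = principal_components(toy_bands)
print("eigenvalues:", [round(value, 4) for value in eigenvalues])
```

A useful sanity check on any implementation is that the variance of the ith component equals the ith eigenvalue and that the covariance between different components is zero up to rounding.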

And our result looks like this:

Six principal components, derived from six spectral bands.

And we can see how these principal components correlate with each other:

Scatterplot of the six principal components.

As a rule the first principal components contain the largest part of the variability. The last principal component, for example, is mostly (but not only) noise. If we combine the first three principal components in one image, we get:

First three principal components of the image as RGB.
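Principal components can be negative and have very different value ranges, so they have to be rescaled before they can be shown as red, green and blue. The sketch below assumes a simple linear min-max stretch of each component to 0..255; this is only one possible choice, and the text does not say which stretch was used for the image above. The function names and component values are made up.

```python
def stretch_to_byte(values):
    """Linearly rescale a list of numbers to integers in 0..255."""
    low, high = min(values), max(values)
    if high == low:
        # A constant band carries no contrast; map it to 0.
        return [0 for _ in values]
    scale = 255.0 / (high - low)
    return [int(round((value - low) * scale)) for value in values]


def rgb_composite(pc1, pc2, pc3):
    """Return one (R, G, B) tuple per pixel from three component bands."""
    return list(zip(stretch_to_byte(pc1),
                    stretch_to_byte(pc2),
                    stretch_to_byte(pc3)))


# Made-up component values for four pixels, not taken from the Landsat scene.
pc1 = [-12.0, -3.0, 4.0, 11.0]
pc2 = [2.5, -1.5, -2.0, 1.0]
pc3 = [0.2, -0.1, 0.0, -0.1]
pixels = rgb_composite(pc1, pc2, pc3)
print(pixels)
```

Because each component is stretched on its own, the colours in such a composite show relative differences within each component, not the amount of variance each one carries.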

This technique is useful for reducing the number of bands needed for some processing tasks, as it keeps most of the variability, which is what we actually need. This is handy when using multispectral data, and crucial when using hyperspectral data.
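How much variability is kept can be read directly from the eigenvalues, since each eigenvalue is the variance of its component. A minimal Python sketch follows; the six eigenvalues are hypothetical numbers chosen for illustration, not the ones computed for the Uppsala image.

```python
def explained_variance(eigenvalues):
    """Fraction of total variance per component, plus the running total."""
    total = sum(eigenvalues)
    fractions = [value / total for value in eigenvalues]
    cumulative = []
    running = 0.0
    for fraction in fractions:
        running += fraction
        cumulative.append(running)
    return fractions, cumulative


# Hypothetical eigenvalues, sorted in decreasing order.
toy_eigenvalues = [60.0, 25.0, 10.0, 3.0, 1.5, 0.5]
fractions, cumulative = explained_variance(toy_eigenvalues)

for index, (part, kept) in enumerate(zip(fractions, cumulative), start=1):
    print(f"PC{index}: {part:6.1%} of the variance, {kept:6.1%} cumulative")
```

With these made-up numbers the first three components would already hold 95% of the total variance, which is the kind of figure one would check before dropping the remaining bands.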

Just stumbled across your article after having toyed around with PCA on satellite images myself (in Python). Very nice piece, thanks for sharing it (and the many other useful articles on your site – wish I had discovered it earlier).
Best, Harald