Syntax

Description

coeff = pca(X) returns
the principal component coefficients, also known as loadings, for
the n-by-p data matrix X.
Rows of X correspond to observations and columns
correspond to variables. The coefficient matrix is p-by-p.
Each column of coeff contains coefficients for
one principal component, and the columns are in descending order of
component variance. By default, pca centers the
data and uses the singular value decomposition (SVD) algorithm.
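
As a minimal sketch of this syntax, the following assumes the hald sample data set that ships with Statistics and Machine Learning Toolbox (its matrix ingredients has 13 observations of four variables). The orthoError variable is a hypothetical name introduced only to check that the columns of coeff are orthonormal.

```matlab
% Sketch only: assumes the hald sample data set is on the path.
load hald                                  % provides the 13-by-4 matrix ingredients
coeff = pca(ingredients);                  % p-by-p matrix of loadings, here 4-by-4

% With default settings the columns of coeff are orthonormal,
% so coeff'*coeff should be close to the identity matrix.
orthoError = norm(coeff'*coeff - eye(4));
```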

coeff = pca(X,Name,Value) returns
any of the output arguments in the previous syntaxes using additional
options for computation and handling of special data types, specified
by one or more Name,Value pair arguments.

For example, you can specify the number of principal components pca returns
or an algorithm other than SVD to use.

[coeff,score,latent,tsquared,explained,mu] = pca(___) also returns explained,
the percentage of the total variance explained by each principal component,
and mu, the estimated mean of each variable in X.
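
The relationship between these outputs can be checked directly. This sketch again assumes the hald sample data; explainedCheck and muCheck are hypothetical names used only for the comparison.

```matlab
% Sketch only: assumes the hald sample data set (no missing values).
load hald
[coeff,score,latent,tsquared,explained,mu] = pca(ingredients);

% explained is the percentage of total variance for each component.
explainedCheck = 100*latent/sum(latent);

% mu is the estimated mean of each variable (a row vector).
muCheck = mean(ingredients);
```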

The rows of coeff contain the coefficients for the four ingredient variables, and its columns correspond to four principal components.

PCA in the Presence of Missing Data

Find the principal component coefficients when
there are missing values in a data set.

Load the sample data set.

load imports-85

Data matrix X has 13 continuous variables
in columns 3 to 15: wheel-base, length, width, height, curb-weight,
engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm,
city-mpg, and highway-mpg. The variables bore and stroke are missing
four values in rows 56 to 59, and the variables horsepower and peak-rpm
are missing two values in rows 131 and 132.

Perform principal component analysis.

coeff = pca(X(:,3:15));

By default, pca performs the action specified
by the 'Rows','complete' name-value pair argument.
This option removes the observations with NaN values
before calculation. Rows of NaNs are reinserted
into score and tsquared at the
corresponding locations, namely rows 56 to 59, 131, and 132.
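
One way to confirm where the NaN rows are reinserted is to look for them in score and tsquared. This is a sketch that assumes the imports-85 data loaded above; nanRows is a hypothetical variable name.

```matlab
% Sketch only: assumes imports-85 provides the data matrix X.
load imports-85
[coeff,score,~,tsquared] = pca(X(:,3:15));   % default 'Rows','complete'

% Rows of score that were reinserted as NaN.
nanRows = find(any(isnan(score),2));

% The same rows should be NaN in tsquared and in the input columns.
sameAsTsquared = isequal(nanRows, find(isnan(tsquared)));
sameAsInput    = isequal(nanRows, find(any(isnan(X(:,3:15)),2)));
```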

Use 'pairwise' to perform the principal
component analysis.

coeff = pca(X(:,3:15),'Rows','pairwise');

In this case, pca computes the (i,j)
element of the covariance matrix using the rows with no NaN values
in the columns i or j of X.
Note that the resulting covariance matrix might not be positive definite.
This option applies when the algorithm pca uses
is eigenvalue decomposition. When you don’t specify the algorithm,
as in this example, pca sets it to 'eig'.
If you require 'svd' as the algorithm, with the 'pairwise' option,
then pca returns a warning message, sets the algorithm
to 'eig' and continues.

If you use the 'Rows','all' name-value
pair argument, pca terminates because this option
assumes there are no missing values in the data set.
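
A hedged sketch of how to observe this behavior without stopping a script is to wrap the call in try/catch. The exact wording of the error message is not reproduced here because it may differ between releases.

```matlab
% Sketch only: assumes imports-85 provides the data matrix X with NaN entries.
load imports-85
try
    coeff = pca(X(:,3:15),'Rows','all');   % expected to fail because of NaNs
catch err
    disp(err.message)                      % show whatever message pca reports
end
```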

Another way to compare the results is to find the angle between the two spaces spanned by the coefficient vectors. Find the angle between the coefficients found for complete data and data with missing values using ALS.

subspace(coeff,coeff1)

ans = 8.2686e-16

This is a small value. It indicates that the result of pca with the 'Rows','complete' name-value pair argument on data with no missing values is close to the result of pca with the 'Algorithm','als' name-value pair argument on data with missing values.

In this case, pca removes the rows with missing values, and y has only four rows with no missing values. pca returns only three principal components. You cannot use the 'Rows','pairwise' option because the covariance matrix is not positive semidefinite and pca returns an error message.

Find the angle between the coefficients found for complete data and data with missing values using listwise deletion (when 'Rows','complete').

subspace(coeff(:,1:3),coeff2)

ans = 0.3576

The angle between the two spaces is substantially larger. This indicates that these two results are different.
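
The variables coeff1, coeff2, and y used in this comparison are not defined in this excerpt. The following is only a sketch of how they might be produced, assuming the hald sample data and a hypothetical random mask that removes roughly 30% of the entries. The angles you get depend on that mask and on the random starting values of ALS, so no output is shown.

```matlab
% Sketch only: the data set, the mask, and the 30% rate are assumptions.
load hald
coeff = pca(ingredients);                 % complete data, default options

y = ingredients;
rng('default')                            % for reproducibility
y(rand(size(y)) < 0.30) = NaN;            % hypothetical missing-value mask

coeff1 = pca(y,'Algorithm','als');        % ALS handles the NaN entries
coeff2 = pca(y,'Rows','complete');        % listwise deletion of rows with NaNs

% Angles between the spanned subspaces; coeff2 may have fewer columns.
angleALS      = subspace(coeff,coeff1);
angleListwise = subspace(coeff(:,1:size(coeff2,2)),coeff2);
```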

All four variables are represented in this biplot by a vector, and the direction and length of the vector indicate how each variable contributes to the two principal components in the plot. For example, the first principal component, which is on the horizontal axis, has positive coefficients for the third and fourth variables. Therefore, the vectors for these two variables are directed into the right half of the plot. The largest coefficient in the first principal component is the fourth, corresponding to the fourth variable.

The second principal component, which is on the vertical axis, has negative coefficients for three of the variables and a positive coefficient for the remaining one.

This 2-D biplot also includes a point for each of the 13 observations, with coordinates indicating the score of each observation for the two principal components in the plot. For example, points near the left edge of the plot have the lowest scores for the first principal component. The points are scaled with respect to the maximum score value and maximum coefficient length, so only their relative locations can be determined from the plot.

The data shows the largest variability along the first principal component axis. This is the largest possible variance among all possible choices of the first axis. The variability along the second principal component axis is the largest among all possible remaining choices of the second axis. The third principal component axis has the third largest variability, which is significantly smaller than the variability along the second principal component axis. The fourth through thirteenth principal component axes are not worth inspecting, because they explain only 0.05% of all variability in the data.

To skip any of the outputs, use ~ in place of the corresponding output argument. For example, if you don’t want the T-squared values, replace the fourth output, tsquared, with ~.
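
A minimal sketch of such a call, assuming the imports-85 data matrix X used earlier:

```matlab
% Sketch only: assumes imports-85 provides the data matrix X.
load imports-85
% The fourth output (tsquared) is skipped with ~.
[coeff,score,latent,~,explained] = pca(X(:,3:15));
```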

Input Arguments

X — Input data
matrix

Input data for which to compute the principal components, specified
as an n-by-p matrix. Rows of X correspond
to observations and columns to variables.

Data Types: single | double

Name-Value Pair Arguments

Specify optional
comma-separated pairs of Name,Value arguments. Name is
the argument name and Value is the corresponding value.
Name must appear inside quotes. You can specify several name and value
pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Algorithm','eig','Centered',false,'Rows','all','NumComponents',3 specifies
that pca uses the eigenvalue decomposition algorithm,
does not center the data, uses all of the observations, and returns only
the first three principal components.
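
As a sketch, the same combination of options can be applied to the hald sample data (an assumption; any numeric matrix without NaN values would do):

```matlab
% Sketch only: assumes the hald sample data set (13-by-4, no NaNs).
load hald
[coeff,score] = pca(ingredients, ...
    'Algorithm','eig', ...      % eigenvalue decomposition
    'Centered',false, ...       % do not subtract column means
    'Rows','all', ...           % use every observation; error on NaN
    'NumComponents',3);         % keep the first three components

sizeCoeff = size(coeff);        % p-by-k, here 4-by-3
sizeScore = size(score);        % n-by-k, here 13-by-3
```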

'Algorithm' — Principal component algorithm
'svd' (default) | 'eig' | 'als'

Principal component algorithm that pca uses
to perform the principal component analysis, specified as the comma-separated
pair consisting of 'Algorithm' and one of the following.

Value

Description

'svd'

Default. Singular value decomposition (SVD) of X.

'eig'

Eigenvalue decomposition (EIG) of the covariance matrix. The
EIG algorithm is faster than SVD when the number of observations, n,
exceeds the number of variables, p, but is less
accurate because the condition number of the covariance is the square
of the condition number of X.

'als'

Alternating least squares (ALS) algorithm. This
algorithm finds the best rank-k
approximation by factoring X into
an n-by-k left
factor matrix, L, and a
p-by-k right
factor matrix, R, where k is the
number of principal components. The factorization
uses an iterative method starting with random
initial values.

ALS is designed to better handle missing values.
It is preferable to pairwise deletion
('Rows','pairwise') and deals
with missing values without listwise deletion
('Rows','complete'). It can
work well for data sets with a small percentage of
missing data at random, but might not perform well
on sparse data sets.

Example: 'Algorithm','eig'
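
On well-conditioned data with no missing values, the 'svd' and 'eig' algorithms should give essentially the same coefficients. The sketch below assumes the hald sample data and compares absolute values so that a possible sign difference in a column does not matter; maxDiff is a hypothetical name.

```matlab
% Sketch only: assumes the hald sample data set.
load hald
coeffSvd = pca(ingredients,'Algorithm','svd');
coeffEig = pca(ingredients,'Algorithm','eig');

% Largest elementwise difference, ignoring the sign of each entry.
maxDiff = max(max(abs(abs(coeffSvd) - abs(coeffEig))));
```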

'Centered' — Indicator for centering columns
true (default) | false

Indicator for centering the columns, specified as the comma-separated
pair consisting of 'Centered' and one of these
logical expressions.

Value

Description

true

Default. pca centers X by
subtracting column means before computing singular value decomposition
or eigenvalue decomposition. If X contains NaN missing
values, nanmean is used to find the mean with any
available data. You can reconstruct the centered data using score*coeff'.

false

In this case pca does not center the
data. You can reconstruct the original data using score*coeff'.

Example: 'Centered',false

Data Types: logical
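
The two reconstruction statements can be verified numerically. This sketch assumes the hald sample data and uses implicit expansion (available in R2016b and later) to add or subtract the row vector mu; the err* names are hypothetical.

```matlab
% Sketch only: assumes the hald sample data set and R2016b or later.
load hald
X = ingredients;

% Centered (default): score*coeff' reproduces the centered data.
[coeff,score,~,~,~,mu] = pca(X);
errCentered = norm(score*coeff' - (X - mu));
errOriginal = norm(score*coeff' + mu - X);   % add mu back for the original data

% Not centered: score*coeff' reproduces the original data directly.
[coeffU,scoreU] = pca(X,'Centered',false);
errUncentered = norm(scoreU*coeffU' - X);
```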

'Economy' — Indicator for economy size output
true (default) | false

Indicator for the economy size output when the degrees of freedom, d,
is smaller than the number of variables, p, specified
as the comma-separated pair consisting of 'Economy' and
one of these logical expressions.

Value

Description

true

Default. pca returns only the first d elements
of latent and the corresponding columns of coeff and score.

This
option can be significantly faster when the number of variables p is
much larger than d.

false

pca returns all elements of latent.
The columns of coeff and score corresponding
to zero elements in latent are zeros.

Note that when d < p, score(:,d+1:p) and latent(d+1:p) are
necessarily zero, and the columns of coeff(:,d+1:p) define
directions that are orthogonal to X.
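
The effect is easiest to see when there are fewer observations than variables. The following sketch uses hypothetical synthetic data with n = 5 and p = 10, so that with centering d = n - 1 = 4.

```matlab
% Sketch only: synthetic data chosen so that d = 4 is smaller than p = 10.
rng(0)                                         % for reproducibility
X = randn(5,10);

[coeffE,~,latentE] = pca(X);                   % 'Economy',true (default)
[coeffF,~,latentF] = pca(X,'Economy',false);   % full-size output

sizeE = size(coeffE);                          % p-by-d, here 10-by-4
sizeF = size(coeffF);                          % p-by-p, here 10-by-10
tailVariance = latentF(5:end);                 % variances beyond d
```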

'NumComponents' — Number of components requested
scalar integer

Number of components requested, specified as the comma-separated
pair consisting of 'NumComponents' and a scalar
integer k satisfying 0 < k ≤ p,
where p is the number of original variables in X.
When specified, pca returns the first k columns
of coeff and score.
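
A short sketch, assuming the hald sample data, shows that the truncated result is simply the leading columns of the full result:

```matlab
% Sketch only: assumes the hald sample data set.
load hald
coeffAll = pca(ingredients);                           % all four components
[coeff2,score2] = pca(ingredients,'NumComponents',2);  % first two only

% The two requested columns should match the first two of the full result.
truncationError = norm(coeff2 - coeffAll(:,1:2));
```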

'Rows' — Action to take for NaN values
'complete' (default) | 'pairwise' | 'all'

Action to take for NaN values in the data
matrix X, specified as the comma-separated pair
consisting of 'Rows' and one of the following.

Value

Description

'complete'

Default. Observations with NaN values
are removed before calculation. Rows of NaNs are
reinserted into score and tsquared at
the corresponding locations.

'pairwise'

This option only applies when the algorithm is 'eig'.
If you don’t specify the algorithm along with 'pairwise',
then pca sets it to 'eig'. If
you specify 'svd' as the algorithm, along with
the option 'Rows','pairwise', then pca returns
a warning message, sets the algorithm to 'eig' and
continues.

When you specify the 'Rows','pairwise' option, pca computes
the (i,j) element of the covariance
matrix using the rows with no NaN values in the
columns i or j of X.

Note
that the resulting covariance matrix might not be positive definite.
In that case, pca terminates with an error message.

'all'

X is expected to have no missing values. pca uses
all of the data and terminates if any NaN value
is found.

Example: 'Rows','pairwise'

'Weights' — Observation weights
ones (default) | row vector

Observation weights, specified as the comma-separated pair
consisting of 'Weights' and a vector of length n containing
all positive elements.

Data Types: single | double

'VariableWeights' — Variable weights
row vector | 'variance'

Variable weights,
specified as the comma-separated pair consisting of 'VariableWeights' and
one of the following.

Value

Description

row vector

Vector of length p containing all
positive elements.

'variance'

The variable weights are the inverse of sample variance.
If you also assign weights to observations using 'Weights',
then the variable weights become the inverse of weighted sample variance.

If 'Centered' is
set to true at the same time, the data matrix X is
centered and standardized. In this case, pca returns
the principal components based on the correlation matrix.
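
One way to check the link to the correlation matrix is to compare the component variances with the eigenvalues of corr(X). This sketch assumes the hald sample data; corrEig and maxDiff are hypothetical names.

```matlab
% Sketch only: assumes the hald sample data set.
load hald
[wcoeff,~,latent] = pca(ingredients,'VariableWeights','variance');

% Eigenvalues of the correlation matrix, largest first.
corrEig = sort(eig(corr(ingredients)),'descend');
maxDiff = max(abs(latent - corrEig));
```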

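'Coeff0' — Initial value for coefficients
p-by-k matrix

Initial value for the coefficient matrix coeff,
specified as the comma-separated pair consisting of 'Coeff0' and
a p-by-k matrix, where p is
the number of variables, and k is the number of
principal components requested.

Note

You can use this name-value pair only when 'Algorithm' is 'als'.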

'Score0' — Initial value for scores
n-by-k matrix

Initial value for the scores matrix score,
specified as the comma-separated pair consisting of 'Score0' and
an n-by-k matrix, where n is
the number of observations and k is the number
of principal components requested.

Note

You can use this name-value pair only when 'Algorithm' is 'als'.

Data Types: single | double

'Options' — Options for iterations
structure

Options for the iterations, specified as a comma-separated pair
consisting of 'Options' and a structure created
by the statset function. pca uses
the following fields in the options structure.

Field Name

Description

'Display'

Level of display output. Choices are 'off', 'final',
and 'iter'.

'MaxIter'

Maximum number of steps allowed. The default is 1000. Unlike in
optimization settings, reaching the MaxIter value
is regarded as convergence.

'TolFun'

Positive number giving the termination tolerance for the cost
function. The default is 1e-6.

'TolX'

Positive number giving the convergence threshold for the relative
change in the elements of the left and right factor matrices, L and
R, in the ALS algorithm. The default is 1e-6.

Note

You can use this name-value pair only when 'Algorithm' is 'als'.

You can change the values of these fields and specify the new
structure in pca using the 'Options' name-value
pair argument.
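
As a sketch, an options structure can be built with statset and passed to pca together with the ALS algorithm. The data set, the two missing entries, and the particular field values below are assumptions chosen for illustration.

```matlab
% Sketch only: hald data with two hypothetical missing entries.
load hald
y = ingredients;
y(2,1) = NaN;
y(7,3) = NaN;

% Raise the iteration limit, tighten the cost tolerance, and silence output.
opt = statset('MaxIter',2000,'TolFun',1e-8,'Display','off');

rng('default')                                   % ALS starts from random values
coeff = pca(y,'Algorithm','als','Options',opt);
```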

Output Arguments

coeff — Principal component coefficients
matrix

Principal component coefficients, returned as a p-by-p matrix.
Each column of coeff contains coefficients for
one principal component. The columns are in the order of descending
component variance, latent.

score — Principal component scores
matrix

Principal component scores, returned as a matrix. Rows of score correspond
to observations, and columns to components.

latent — Principal component variances
column vector

Principal component variances, that is, the eigenvalues of the
covariance matrix of X, returned as a column
vector.
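
This relationship can be checked with a short sketch, assuming the hald sample data and default options (centered data, no weights); covEig and maxDiff are hypothetical names.

```matlab
% Sketch only: assumes the hald sample data set and default pca options.
load hald
[~,~,latent] = pca(ingredients);

% Eigenvalues of the sample covariance matrix, largest first.
covEig = sort(eig(cov(ingredients)),'descend');
maxDiff = max(abs(latent - covEig));
```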

More About

Hotelling’s T-Squared Statistic

Hotelling’s T-squared statistic is a
statistical measure of the multivariate distance of each observation
from the center of the data set.

Even when you request fewer components than the number of variables, pca uses
all principal components to compute the T-squared statistic (computes
it in the full space). If you want the T-squared statistic in the
reduced or the discarded space, do one of the following:

For the T-squared statistic in the reduced space,
use mahal(score,score).

For the T-squared statistic in the discarded space,
first compute the T-squared statistic using [coeff,score,latent,tsquared]
= pca(X,'NumComponents',k,...), compute the T-squared statistic
in the reduced space using tsqreduced = mahal(score,score),
and then take the difference: tsquared - tsqreduced.
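
The steps above can be sketched as follows, assuming the hald sample data and k = 2 retained components; tsqdiscarded is the name used here for the difference.

```matlab
% Sketch only: assumes the hald sample data set and k = 2.
load hald
k = 2;
[coeff,score,latent,tsquared] = pca(ingredients,'NumComponents',k);

% tsquared is computed in the full space even though score has k columns.
tsqreduced   = mahal(score,score);      % T-squared in the reduced space
tsqdiscarded = tsquared - tsqreduced;   % T-squared in the discarded space
```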

Degrees of Freedom

The degrees of freedom, d,
is equal to n – 1 if the data is centered and n otherwise,
where:

n is the number of rows without
any NaNs if you use 'Rows','complete'.

n is the number of rows without
any NaNs in the column pair that has the maximum
number of rows without NaNs if you use 'Rows','pairwise'.

Variable Weights

Note that when variable weights are used, the
coefficient matrix is not orthonormal. Suppose the variable weights
vector you used is called varwei, and the principal
component coefficient matrix that pca returned is wcoeff.
You can then calculate the orthonormal coefficients using the transformation diag(sqrt(varwei))*wcoeff.
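
A sketch of this transformation, assuming the hald sample data and inverse-variance weights as a hypothetical choice of varwei:

```matlab
% Sketch only: hald data with inverse-variance weights as an example.
load hald
varwei = 1./var(ingredients);                       % row vector of length p
wcoeff = pca(ingredients,'VariableWeights',varwei); % not orthonormal

% Transform back to orthonormal coefficients.
coefforth = diag(sqrt(varwei))*wcoeff;
orthoError = norm(coefforth'*coefforth - eye(4));   % should be close to zero
```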