CRAN Task View: Multivariate Statistics

Base R contains most of the functionality needed for classical multivariate analysis.
A large number of packages on CRAN extend this methodology;
a brief overview is given below. Application-specific uses of multivariate statistics
are described in the relevant task views. For example, principal components are listed here, whereas
ordination is covered in the
Environmetrics
task view. Further information on supervised classification can be found in the
MachineLearning
task view, and unsupervised classification in the
Cluster
task view.

The packages in this view can be roughly structured into the following topics.
If you think that some package is missing from the list, please let me know.

Visualising multivariate data

Graphical Procedures:
A range of base graphics (e.g.
pairs()
and
coplot()) and
lattice
functions (e.g.
xyplot()
and
splom()) are useful for
visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities.
scatterplot.matrix()
in the
car
package provides usefully enhanced pairwise scatterplots.
Beyond this,
scatterplot3d
provides 3-dimensional scatterplots,
aplpack
provides bagplots and
spin3R(),
a function for rotating 3d clouds.
misc3d, dependent upon
rgl,
provides animated functions within R useful for visualising densities.
YaleToolkit
provides a range of useful visualisation techniques for multivariate data.
More specialised multivariate plots include the following:
faces()
in
aplpack
provides Chernoff's faces;
parcoord()
from
MASS
provides parallel
coordinate plots;
stars()
in graphics provides a choice of star, radar
and cobweb plots.
mstree()
in
ade4
and
spantree()
in
vegan
provide minimum spanning tree functionality.
calibrate
supports biplot and scatterplot
axis labelling.
geometry,
which provides an interface to the qhull library,
gives indices to the relevant points via
convexhulln().
ellipse
draws ellipses for two parameters, and provides
plotcorr(),
visual display of a correlation matrix.
denpro
provides level set trees
for multivariate visualisation. Mosaic plots are available via
mosaicplot()
in graphics and
mosaic()
in
vcd
that also contains other visualization techniques for multivariate
categorical data.
gclus
provides a number of
cluster-specific graphical enhancements for scatterplots and parallel coordinate plots.
See the links for a reference to GGobi.
rggobi
interfaces with GGobi.
xgobi
interfaces to the XGobi
and XGvis programs which allow linked, dynamic multivariate plots as well as
projection pursuit. Finally,
iplots
allows particularly powerful dynamic interactive
graphics, of which interactive parallel co-ordinate plots and mosaic plots may be of great interest.
Seriation methods are provided by
seriation
which can reorder matrices and dendrograms.
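
As a rough illustration of the base graphics functions named above, the following sketch draws a scatterplot matrix, a conditioning plot, star plots and a mosaic plot. The built-in datasets (iris, mtcars, Titanic) and the variables picked from them are assumptions made purely for illustration.

```r
# Numeric columns of the built-in iris data (dataset choice is illustrative)
num_vars <- iris[, 1:4]

# Pairwise scatterplot matrix, coloured by species
pairs(num_vars, col = as.integer(iris$Species))

# Conditioning plot: one scatterplot panel per species
coplot(Sepal.Length ~ Sepal.Width | Species, data = iris)

# Star plots for the first eight cars in mtcars
stars(mtcars[1:8, 1:7])

# Mosaic plot of a multi-way contingency table
mosaicplot(Titanic, main = "Titanic", color = TRUE)
```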

Data Preprocessing:
summarize()
and
summary.formula()
in
Hmisc
assist with descriptive functions; from the same package
varclus()
offers variable
clustering while
dataRep()
and
find.matches()
assist in exploring a given
dataset in terms of representativeness and finding matches.
Whilst
dist()
in base and
daisy()
in
cluster
provide a wide range of distance measures,
proxy
provides a framework for more distance measures, including measures between matrices.
simba
provides functions for dealing with presence / absence data including similarity matrices and reshaping.
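
A minimal sketch of how dist() switches between distance measures is given here. The three toy points are invented for illustration, and only base R is used.

```r
# Three toy points in the plane (values are illustrative)
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# Euclidean distance is the default
d_euclid <- dist(x)

# Other measures are selected through the 'method' argument
d_manhat <- dist(x, method = "manhattan")

# Convert to full symmetric matrices for inspection
as.matrix(d_euclid)
as.matrix(d_manhat)
```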

Hypothesis testing

ICSNP
provides Hotelling's T2 test as well as a range of non-parametric tests, including location tests based on marginal ranks, computation of the spatial median and spatial signs, and estimates of shape. Non-parametric two-sample tests are also available from
cramer
and spatial sign and rank tests to investigate location, sphericity and independence are available in
SpatialNP.
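
To make the one-sample Hotelling's T2 statistic concrete, the sketch below computes it by hand in base R. The function name hotelling_one_sample() is hypothetical and this is not the implementation used by any of the packages above; it simply applies the textbook formula T2 = n (xbar - mu0)' S^{-1} (xbar - mu0) and the usual F transformation, which assumes multivariate normal data.

```r
# Hypothetical helper: one-sample Hotelling's T2 test of H0: mean = mu0
hotelling_one_sample <- function(X, mu0) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  d <- colMeans(X) - mu0            # difference between sample mean and mu0
  S <- cov(X)                       # sample covariance matrix
  t2 <- n * drop(t(d) %*% solve(S, d))
  f_stat <- (n - p) / (p * (n - 1)) * t2
  p_value <- pf(f_stat, p, n - p, lower.tail = FALSE)
  list(T2 = t2, F = f_stat, df = c(p, n - p), p.value = p_value)
}

# Simulated bivariate data with true mean zero (illustrative only)
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
res <- hotelling_one_sample(X, mu0 = c(0, 0))
res
```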

Multivariate distributions

Descriptive measures:
cov()
and
cor()
in stats
will provide estimates of the covariance
and correlation matrices respectively.
ICSNP
offers several descriptive measures such as
spatial.median()
which provides an estimate of the spatial median and further functions which provide estimates of scatter. Further robust methods are provided such as
cov.rob()
in MASS which
provides robust estimates of the variance-covariance matrix by minimum volume
ellipsoid, minimum covariance determinant or classical product-moment.
covRobust
provides robust covariance estimation via nearest neighbor variance estimation.
robustbase
provides robust covariance estimation via fast minimum covariance determinant with
covMcd()
and the orthogonalized pairwise estimate of Gnanadesikan-Kettenring via
covOGK(). Scalable robust methods are provided within
rrcov
also using fast minimum covariance determinant with
covMcd()
as well as M-estimators with
covMest().
corpcor
provides shrinkage estimation of large scale covariance
and (partial) correlation matrices.
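
The following sketch contrasts the classical estimates from cov() and cor() with a robust estimate from cov.rob() in MASS. The iris measurements are used only as convenient example data, and the choice of method = "mcd" is an assumption for illustration.

```r
library(MASS)  # recommended package; provides cov.rob()

X <- as.matrix(iris[, 1:4])

# Classical covariance and correlation matrices
S <- cov(X)
R <- cor(X)

# Rescaling the covariance matrix recovers the correlation matrix
isTRUE(all.equal(cov2cor(S), R))

# Robust location and scatter via minimum covariance determinant;
# the search is randomised, so a seed is set for reproducibility
set.seed(1)
rob <- cov.rob(X, method = "mcd")
rob$center
rob$cov
```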

Densities (estimation and simulation):
mvrnorm()
in MASS simulates from the multivariate normal
distribution.
mvtnorm
also provides simulation as well as probability and
quantile functions for both the multivariate t distribution and multivariate normal
distributions as well as density functions for the multivariate normal distribution.
mnormt
provides multivariate normal and multivariate t density and distribution
functions as well as random number simulation.
sn
provides density, distribution and random number generation for the multivariate skew normal and skew t distribution.
delt
provides a range of functions for estimating multivariate densities by
CART and greedy methods.
Comprehensive information on mixtures is given in the
Cluster
view.
Some density estimates and random numbers are provided by
rmvnorm.mixt()
and
dmvnorm.mixt()
in
ks; mixture fitting
is also provided within
bayesm. Functions to simulate from the
Wishart distribution are provided in a number of places, such
as
rwishart()
in
bayesm
and
rwish()
in
MCMCpack
(the latter also has a density
function
dwish()).
bkde2D()
from
KernSmooth
and
kde2d()
from MASS provide binned and non-binned 2-dimensional kernel density
estimation;
ks
also provides multivariate kernel smoothing, as do
ash
and
GenKern.
prim
provides patient rule induction methods to attempt to find regions of high density in high-dimensional multivariate data;
feature
also provides methods for determining feature significance in multivariate data (such as in relation to local modes).
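
As a small sketch of simulation and density estimation with MASS, the code below draws from a bivariate normal distribution and then computes a 2-dimensional kernel density estimate. The mean vector, covariance matrix, sample size and grid size are all assumed values chosen for illustration.

```r
library(MASS)  # provides mvrnorm() and kde2d()

set.seed(42)
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)  # correlation of 0.8

# Simulate 500 draws from the bivariate normal distribution
x <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)
colMeans(x)
cov(x)

# Non-binned 2-dimensional kernel density estimate on a 50 x 50 grid
dens <- kde2d(x[, 1], x[, 2], n = 50)
contour(dens, xlab = "x1", ylab = "x2")
```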

Assessing normality:
mvnormtest
provides a multivariate extension
to the Shapiro-Wilk test;
mvoutlier
provides multivariate outlier detection based
on robust methods.
ICS
provides tests for multi-normality.
mvnorm.etest()
in
energy
provides an assessment
of normality based on E statistics (energy); in the same package
k.sample()
assesses a number of samples for equal distributions. Tests for Wishart-distributed covariance matrices
are given by
mauchly.test()
in stats.
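
A minimal sketch of calling mauchly.test() follows. It fits an intercept-only multivariate linear model and, with the default arguments, tests whether the covariance matrix is proportional to the identity. Using the iris measurements as the response is an assumption made only to have runnable data.

```r
# Multivariate response matrix (dataset choice is illustrative)
Y <- as.matrix(iris[, 1:4])

# Intercept-only multivariate linear model; the result has class "mlm"
fit <- lm(Y ~ 1)

# Mauchly's test of sphericity with default arguments
mt <- mauchly.test(fit)
mt
```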

Principal components:
these can be fitted with
prcomp()
(based on
svd(),
preferred) as well as
princomp()
(based on
eigen()
for compatibility
with S-PLUS) from stats.
pc1()
in
Hmisc
provides the first principal component and gives coefficients for unscaled
data. Additional support for an assessment of the scree plot can be found in
nFactors, whereas
paran
provides routines for Horn's evaluation of the number of dimensions to retain.
For wide matrices,
gmodels
provides
fast.prcomp()
and
fast.svd().
kernlab
uses kernel methods to provide a form of non-linear principal components with
kpca().
pcaPP
provides robust principal components by means
of projection pursuit.
amap
provides
further robust and parallelised methods such as a form of generalised
and robust principal component analysis via
acpgen()
and
acprob()
respectively. Further options for principal components
in an ecological setting are available within
ade4
and in a sensory setting in
SensoMineR.
psy
provides a
variety of routines useful in psychometry; in this context these include
sphpca()
which
maps onto a sphere and
fpca()
where some variables may be considered as
dependent as well as
scree.plot()
which has the option of adding simulation results to help assess the observed data.
PTAk
provides principal tensor analysis analogous to both PCA and correspondence analysis.
smatr
provides standardised major axis estimation with specific application to allometry.
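
The two base fitting routes can be compared directly, as in the sketch below. It assumes the built-in USArrests data and an analysis of the correlation matrix, in which case prcomp() and princomp() should report the same component standard deviations.

```r
X <- USArrests  # built-in data, used for illustration

# SVD-based fit on standardised variables
pc_svd <- prcomp(X, scale. = TRUE)

# Eigen-based fit on the correlation matrix
pc_eig <- princomp(X, cor = TRUE)

summary(pc_svd)

# Component standard deviations agree between the two routes
all.equal(unname(pc_svd$sdev), unname(pc_eig$sdev))

# Scree plot of the component variances
screeplot(pc_svd, type = "lines")
```

Without cor = TRUE the two functions differ slightly, because princomp() uses the divisor n for the covariance matrix whereas prcomp() uses n - 1.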

Redundancy Analysis:
calibrate
provides
rda()
for
redundancy analysis as well as further options for canonical correlation.
fso
provides fuzzy set ordination, which extends ordination beyond methods available from linear algebra.

Independent Components:
fastICA
provides fastICA
algorithms to perform independent
component analysis (ICA) and Projection Pursuit, and
PearsonICA
uses score functions.
ICS
provides either an invariant co-ordinate system or independent components.
JADE
adds an interface to the JADE algorithm, as well as providing some diagnostics for ICA.

Procrustes analysis:
procrustes()
in
vegan
provides Procrustes analysis; this package also provides functions
for ordination and further information on that area is given in the
Environmetrics
task view. Generalised procrustes analysis via
GPA()
is available from
FactoMineR.

Cluster analysis:
A comprehensive overview of clustering
methods available within R is provided by the
Cluster
task view. Standard techniques include hierarchical clustering by
hclust()
and k-means clustering by
kmeans()
in stats.
A range of established
clustering and visualisation techniques are also available in
cluster; some cluster validation routines are available in
clv
and the Rand index can be computed from
classAgreement()
in
e1071. Trimmed cluster analysis is available from
trimcluster, cluster ensembles are available from
clue, methods to assist with choice of routines are available in
clusterSim
and hybrid methodology is provided by
hybridHclust. Distance
measures (
edist()) and hierarchical clustering (
hclust.energy()) based on E-statistics are available in
energy. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from
fpc.
clustvarsel
provides variable selection within model-based clustering.
Fuzzy clustering is available within
cluster
as well as via the
hopach
(Hierarchical Ordered Partitioning and
Collapsing Hybrid) algorithm.
kohonen
provides supervised and unsupervised SOMs
for high dimensional spectra or patterns.
clusterGeneration
helps simulate clusters. The
Environmetrics
task view also gives a topic-related overview of some clustering techniques. Model based clustering is available in
mclust.
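
The standard techniques in stats can be sketched as follows. The standardised iris measurements, the average linkage and the choice of three clusters are all assumptions made for illustration.

```r
# Standardise the variables before computing distances
X <- scale(iris[, 1:4])

# Hierarchical clustering on Euclidean distances with average linkage
hc <- hclust(dist(X), method = "average")
groups_hc <- cutree(hc, k = 3)

# k-means with several random starts; seed set for reproducibility
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)

# Cross-tabulate the two partitions
table(hclust = groups_hc, kmeans = km$cluster)

# Dendrogram of the hierarchical solution
plot(hc, labels = FALSE)
```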

Tree methods:
Full details on tree methods are given in the
MachineLearning
task view.
Suffice it to say here that classification trees are sometimes considered within
multivariate methods;
rpart
is most used for this purpose.
party
provides recursive partitioning. Classification and regression training is provided by
caret.
kknn
provides k-nearest neighbour methods which can be used for regression as well as classification.
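
A minimal classification tree with rpart might look like the sketch below. The iris data and the default tuning settings are assumptions for illustration, and the accuracy computed is on the training data, so it is optimistic.

```r
library(rpart)  # recommended package

# Classification tree for species using all four measurements
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)

# Fitted classes on the training data
pred_tree <- predict(tree, type = "class")
table(predicted = pred_tree, observed = iris$Species)

# Resubstitution accuracy
acc_tree <- mean(pred_tree == iris$Species)
acc_tree
```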

Supervised classification and discriminant analysis

lda()
and
qda()
within MASS provide linear
and quadratic discrimination respectively.
mda
provides mixture and
flexible discriminant analysis with
mda()
and
fda()
as well as
multivariate adaptive regression splines with
mars()
and adaptive spline
backfitting with the
bruto()
function. Multivariate adaptive regression splines can also be found in
earth.
rda
provides classification
for high dimensional data by means of shrunken centroids regularized discriminant analysis.
Package
class
provides k-nearest
neighbours by
knn(), while
knncat
provides k-nearest neighbours for
categorical variables.
SensoMineR
provides
FDA()
for factorial discriminant analysis. A number of packages provide for
dimension reduction with the classification.
klaR
includes variable
selection and robustness against multicollinearity as well as a number of
visualisation routines.
superpc
provides principal components for
supervised classification, whereas
gpls
provides classification using
generalised partial least squares.
hddplot
provides cross-validated linear discriminant calculations to determine the optimum number of features.
ROCR
provides a range of methods for assessing classifier performance.
Further information on supervised classification can be found in
the
MachineLearning
task view.
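
Linear and quadratic discrimination with MASS can be sketched as below. The iris data are assumed as example input, and the accuracies are computed on the training data, so they overstate out-of-sample performance.

```r
library(MASS)  # provides lda() and qda()

# Linear and quadratic discriminant analysis
fit_lda <- lda(Species ~ ., data = iris)
fit_qda <- qda(Species ~ ., data = iris)

# Predicted classes for the training data
pred_lda <- predict(fit_lda, newdata = iris)$class
pred_qda <- predict(fit_qda, newdata = iris)$class

# Resubstitution accuracy of each rule
acc_lda <- mean(pred_lda == iris$Species)
acc_qda <- mean(pred_qda == iris$Species)
c(lda = acc_lda, qda = acc_qda)
```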

Correspondence analysis

corresp()
and
mca()
in MASS provide simple and
multiple correspondence analysis respectively.
ca
also provides simple, multiple and joint correspondence analysis.
ca()
and
mca()
in
ade4
provide correspondence and multiple correspondence analysis
respectively, as well as adding homogeneous table analysis with
hta().
Further functionality is also available within
vegan,
and co-correspondence analysis
is available from
cocorresp.
FactoMineR
provides
CA()
and
MCA()
which also enable simple and multiple correspondence analysis as well as associated graphical routines.
homals
provides homogeneity analysis.
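
As a brief sketch of simple correspondence analysis with MASS, the code below applies corresp() to the caith table of eye and hair colour that ships with the package. Extracting two factors (nf = 2) is an assumption made so that a biplot can be drawn.

```r
library(MASS)  # provides corresp() and the caith data

# Simple correspondence analysis of a two-way table, keeping two factors
ca_fit <- corresp(caith, nf = 2)

# Canonical correlations for the extracted factors
ca_fit$cor

# Joint display of row and column scores
biplot(ca_fit)
```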

Matrix manipulations

As a vector- and matrix-based language, base R ships with many powerful tools for
doing matrix manipulations, which are complemented by the packages
Matrix
and
SparseM.
matrixcalc
adds functions for matrix differential calculus. Some further sparse matrix functionality is also available from
spam.
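
A few of the base R matrix tools can be sketched as follows. The small symmetric positive definite matrix A and the vector b are invented for illustration.

```r
# Small symmetric positive definite matrix and right-hand side
A <- matrix(c(4, 2, 2, 3), nrow = 2)
b <- c(2, 1)

# Solve the linear system A x = b and compute the inverse
x <- solve(A, b)
A_inv <- solve(A)

# Eigen decomposition of a symmetric matrix
e <- eigen(A, symmetric = TRUE)

# Cholesky factor U is upper triangular with t(U) %*% U equal to A
U <- chol(A)

# Singular value decomposition, A = u diag(d) v'
s <- svd(A)

# crossprod(A) computes t(A) %*% A
crossprod(A)
```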

Miscellaneous utilities

abind
generalises
cbind()
and
rbind()
for arrays,
mApply()
in
Hmisc
generalises
apply()
for matrices and passes multiple functions. In addition to functions listed earlier,
sn
provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distribution.
mAr
provides for vector auto-regression.
rm.boot()
from
Hmisc
bootstraps repeated measures models.
psy
also provides a range of statistics based on Cohen's kappa including weighted measures and agreement among more than 2 raters.
cwhmisc
contains a number of support functions which may be of interest, such as
ellipse(),
normalise()
and various rotation functions.
desirability
provides functions for multivariate optimisation.
geozoo
provides plotting of geometric objects in GGobi.