This paper is a short summary of the main classes defined in the ade4 package for one-table
analysis methods (e.g., principal component analysis). Other papers will detail the
classes defined in ade4 for two-tables coupling methods (such as canonical correspondence
analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables
analysis (i.e., three-way tables), and for graphical methods.

This package is a complete rewrite of the ADE4
software ([Thioulouse et al., 1997], http://pbil.univ-lyon1.fr/ADE-4/) for the
R environment. It contains Data Analysis
functions to analyse Ecological and Environmental data in the framework of Euclidean
Exploratory methods, hence the name ade4
(the 4 is not a version number; it indicates that there are four Es
in the acronym).
The ade4 package is available on CRAN, but it can also be used directly online, thanks
to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to
provide multivariate analysis services in the field of bioinformatics, particularly for
sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/).
An example of these services is the automated analysis of the codon usage of a set of
DNA sequences by correspondence analysis ([Perrière et al., 2003], http://pbil.univ-lyon1.fr/mva/coa.php).

The basic tool in ade4 is the duality diagram [Escoufier, 1987]. A
duality diagram is simply a list that contains a triplet (X, Q, D):
- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n
points in R^p (row vectors).
- Q is a p × p diagonal matrix containing the weights of the p columns of X, and used as a
scalar product in R^p (Q is stored as a vector of length p).
- D is an n × n diagonal matrix containing the weights of the n rows of X, and used as a
scalar product in R^n (D is stored as a vector of length n).
For example, if X is a table containing normalized quantitative variables, if Q is
the identity matrix I_p and if D is equal to (1/n) I_n,
the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA).
Each basic method corresponds to a particular triplet (see table 1), but more complex methods can
also be represented by their duality diagram.

The singular value decomposition of a triplet gives principal axes, principal components,
and row and column coordinates, which are added to the triplet for later use.
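
As an illustration, the normed PCA triplet can be written out by hand in base R. This is only a minimal sketch under stated assumptions, not the ade4 implementation: the variable names (d_w, q_w, axes, row_coord) are ours, and the decomposition is obtained with the eigen function applied to the symmetric matrix Q^(1/2) X' D X Q^(1/2), whose eigenvalues are the squared singular values of the triplet.

```r
# Normed PCA of USArrests written out as the triplet (X, Q, D), base R only.
data(USArrests)
n <- nrow(USArrests)
p <- ncol(USArrests)

d_w <- rep(1 / n, n)  # D stored as a vector of row weights
q_w <- rep(1, p)      # Q stored as a vector of column weights

# Center and scale each column using the weights in D (variance with divisor n).
raw <- as.matrix(USArrests)
centered <- sweep(raw, 2, colSums(raw * d_w))
X <- sweep(centered, 2, sqrt(colSums(centered^2 * d_w)), "/")

# Symmetric form of the eigenproblem: Q^(1/2) X' D X Q^(1/2).
sym <- (t(X * d_w) %*% X) * tcrossprod(sqrt(q_w))
eig <- eigen(sym, symmetric = TRUE)

eigenvalues <- eig$values
axes <- eig$vectors / sqrt(q_w)     # principal axes, of unit norm for Q
row_coord <- X %*% (axes * q_w)     # row coordinates: X Q A

print(eigenvalues)
```

With Q = I_p and D = (1/n) I_n, these eigenvalues coincide with those of the correlation matrix, eigen(cor(USArrests))$values, and the weighted variance of each column of row_coord equals the corresponding eigenvalue.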

For example, we can use a well-known dataset distributed with R:

library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the user is not prompted interactively for the number of principal
components used to compute row and column coordinates; this number is taken from the
argument nf (by default, nf = 2). Other parameters make it possible to choose between centered,
normed or raw PCA
(the default is centered and normed), and to set arbitrary row and column weights. The pca1 object
is a duality diagram, i.e., a list made of several vectors and data frames.

pca1$lw and pca1$cw are the row and column weights that define the duality diagram,
together with the data table (pca1$tab). pca1$eig contains the eigenvalues.
The row and column coordinates are stored in pca1$li and pca1$co. The variance of
these coordinates is equal to the corresponding eigenvalue, and unit-variance coordinates are stored
in pca1$l1 and pca1$c1 (this is useful for drawing biplots).
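
The relations between these components can be checked numerically. The following sketch assumes that the ade4 package is installed; the helper weighted_ss is ours (hypothetical, not part of ade4) and computes the weighted sum of squares of each coordinate column, using the row weights pca1$lw or the column weights pca1$cw.

```r
library(ade4)

data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

# Hypothetical helper (not part of ade4): weighted sum of squares of each column.
weighted_ss <- function(coord, w) colSums(as.matrix(coord)^2 * w)

row_ss  <- weighted_ss(pca1$li, pca1$lw)  # to be compared with pca1$eig[1:3]
col_ss  <- weighted_ss(pca1$co, pca1$cw)  # to be compared with pca1$eig[1:3]
unit_ss <- weighted_ss(pca1$l1, pca1$lw)  # to be compared with 1

print(rbind(eig = pca1$eig[1:3], rows = row_ss, columns = col_ss, unit = unit_ss))
```

The first three rows of the printed table should agree, and the last row should contain ones.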

The general optimization theorems of data analysis take particular meanings for each type of
analysis, and graphical functions are provided to draw the canonical graphs,
i.e., the graphical expression corresponding to the mathematical property of the object.
For example, the normed PCA of a quantitative variable table gives a score that maximizes
the sum of squared correlations with variables. The PCA canonical graph is therefore a graph
showing how the sum of squared correlations is maximized for the variables of the data set.
For the USArrests example, we obtain the following graphs:

score(pca1)

s.corcircle(pca1$co)

Figure 1:
One-dimensional canonical graph for a normed PCA. Variables are displayed as a function
of row scores, to get a picture of the maximization of the sum of squared correlations.

Figure 2:
Two-dimensional canonical graph for a normed PCA (correlation circle):
the direction and length of arrows show the quality of the correlation between variables and
between variables and principal components.
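
The maximization property behind these canonical graphs can be checked numerically in base R. The sketch below uses prcomp instead of dudi.pca so that it runs without ade4; the helper sum_sq_cor is ours (hypothetical). It compares the sum of squared correlations obtained with the first principal component to the one obtained when each original variable is used as the score.

```r
data(USArrests)

# Normed PCA with base R (prcomp with scale. = TRUE works on the correlation matrix).
pc <- prcomp(USArrests, scale. = TRUE)
score1 <- pc$x[, 1]

# Hypothetical helper: sum of squared correlations between a score and all variables.
sum_sq_cor <- function(score, tab) sum(cor(score, tab)^2)

best <- sum_sq_cor(score1, USArrests)                      # first principal component
single <- sapply(USArrests, sum_sq_cor, tab = USArrests)   # each variable used as a score
lambda1 <- pc$sdev[1]^2                                    # first eigenvalue

print(c(first_component = best, first_eigenvalue = lambda1))
print(single)
```

The value obtained with the first principal component equals the first eigenvalue, and none of the original variables reaches it.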

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns
superimposed):

scatter(pca1)

Figure 3:
The PCA biplot. Variables are symbolized by arrows and they are superimposed on the
display of individuals. The scale of the graph is given by a grid, whose size is given in the
upper right corner. Here, the length of the side of grid squares is equal to one.
The eigenvalues bar chart is drawn in the upper left corner, with the two black bars
corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that
were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions.
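
A minimal sketch of such separate maps is given below, assuming that the ade4 package is installed. Here s.label is applied to the row coordinates pca1$li and s.corcircle to the column coordinates pca1$co; the choice of these two data frames is our assumption for this illustration.

```r
library(ade4)

data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

# Map of the rows (states) on the first two axes.
s.label(pca1$li)

# Map of the columns (variables) as a correlation circle.
s.corcircle(pca1$co)
```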

A duality diagram can also come from a distance matrix, if this matrix is Euclidean
(i.e., if the distances in the matrix are the distances between some points in a
Euclidean space). The ade4 package contains functions to compute
dissimilarity matrices (dist.binary for binary data, and dist.prop
for frequency data), test whether they are Euclidean [Gower and Legendre, 1986],
and make them Euclidean (quasieuclid, lingoes [Lingoes, 1971], and
cailliez [Cailliez, 1983]). These functions are useful to ecologists who use the
works of [Legendre and Anderson, 1999] and [Legendre and Legendre, 1998].
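
The Euclidean property can be illustrated in base R with the classical criterion: a distance matrix is Euclidean if and only if the doubly centered matrix of -d^2/2 has no negative eigenvalue. The function is_euclidean below is a hypothetical helper written for this sketch (it is not the ade4 function), the tolerance is an arbitrary choice, and the 4 x 4 matrix toy is an illustrative example built by hand: three points at mutual distance 2 and a fourth point at distance 1.1 from each of them, which is closer than any point of a Euclidean space can be.

```r
# Hypothetical helper: test whether a distance matrix is Euclidean.
is_euclidean <- function(d, tol = 1e-07) {
  m <- as.matrix(d)
  n <- nrow(m)
  centering <- diag(n) - matrix(1 / n, n, n)
  gram <- -0.5 * centering %*% (m^2) %*% centering   # doubly centered matrix
  ev <- eigen(gram, symmetric = TRUE, only.values = TRUE)$values
  min(ev) > -tol * max(ev)                           # no clearly negative eigenvalue
}

# Distances computed between actual points are Euclidean by construction.
data(USArrests)
d_points <- dist(scale(USArrests))

# Illustrative toy matrix: satisfies the triangle inequality but is not Euclidean.
toy <- matrix(c(0,   2,   2,   1.1,
                2,   0,   2,   1.1,
                2,   2,   0,   1.1,
                1.1, 1.1, 1.1, 0), nrow = 4, byrow = TRUE)

print(is_euclidean(d_points))
print(is_euclidean(as.dist(toy)))
```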

The Yanomama data set [Manly, 1991] contains three distance matrices between 19 villages
of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates
analysis (PCO, [Gower, 1966]), which gives a Euclidean representation of the 19 villages.
This Euclidean representation makes it possible to compare the geographical, genetic and
anthropometric distances.
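
A possible way to run these analyses is sketched below. It assumes that the ade4 package is installed and that the yanomama data set is a list whose components geo, gen and ant hold the three distance matrices; quasieuclid is applied first as a precaution, so that dudi.pco receives Euclidean distances.

```r
library(ade4)

data(yanomama)

# Convert each matrix to a dist object and make it Euclidean if needed.
dists <- lapply(yanomama[c("geo", "gen", "ant")],
                function(m) quasieuclid(as.dist(m)))

# One principal coordinates analysis per distance matrix, keeping three axes.
pcos <- lapply(dists, dudi.pco, scannf = FALSE, nf = 3)

# Each analysis gives coordinates for the 19 villages.
print(sapply(pcos, function(x) dim(x$li)))
```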

In sites × species tables, rows correspond to sites, columns correspond to species,
and the values are the number of individuals of species j found at site i.
These tables can have many columns and cannot be used in a discriminant analysis.
In this case, between-class analyses (between function) are a better alternative,
and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D)
for a given factor f is the analysis of the triplet
(G, Q, Dw), where G is the table of the means of table
X for the groups defined by f, and Dw is the diagonal matrix
of group weights. For example, a between-class correspondence analysis (BCA) is obtained very simply
from a correspondence analysis (CA).
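
The definition of the triplet (G, Q, Dw) can be made concrete with a short base R sketch. It is not the between function of ade4: it uses the iris data set shipped with R, starts from a normed PCA triplet rather than from a correspondence analysis, and all variable names are ours.

```r
data(iris)
f <- iris$Species                 # factor defining the groups
n <- nrow(iris)
d_w <- rep(1 / n, n)              # D: uniform row weights

# X: variables centered and scaled with the weights in D (Q is the identity).
raw <- as.matrix(iris[, 1:4])
centered <- sweep(raw, 2, colSums(raw * d_w))
X <- sweep(centered, 2, sqrt(colSums(centered^2 * d_w)), "/")

# Dw: group weights, i.e. the sum of the row weights within each group.
group_w <- as.vector(tapply(d_w, f, sum))

# G: table of weighted group means (one row per level of f).
G <- rowsum(X * d_w, f) / group_w

# Analysis of the triplet (G, Q, Dw) with Q equal to the identity.
bet_eig <- eigen(t(G * group_w) %*% G, symmetric = TRUE)$values

# Share of the total inertia explained by the groups.
ratio <- sum(bet_eig) / sum(colSums(X^2 * d_w))
print(ratio)
```

With three groups, at most two eigenvalues of the between-class analysis are non-zero, and the ratio of between-class inertia to total inertia lies between 0 and 1.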

The meaudret$fau data frame is an ecological table with 24 rows corresponding to six sampling
sites along a small French stream (the Meaudret). These six sampling sites were sampled four times
(spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13
Ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the
coa1 duality diagram. The corresponding between-class analysis is done with the between function,
considering the sites as classes (meaudret$plan$sta is a factor defining the classes).
Therefore, this is a between-sites analysis,
whose aim is to discriminate the sites, given the distribution of Ephemeroptera species.
This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.

Figure 6:
BCA plot. This is a composite plot, made of:
1- the species canonical weights (top left),
2- the species scores (middle left),
3- the eigenvalues bar chart (bottom left),
4- the plot of plain CA axes projected into BCA (bottom center),
5- the gravity centers of classes (bottom right),
6- the projection of the rows with ellipses and gravity center of classes (main graph).

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical
significance of between-class analyses. Many permutation tests are available in the ade4 package, for example
mantel.randtest, procuste.randtest,
randtest.between, randtest.coinertia, RV.rtest, and
randtest.discrimin. Several of
these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming languages.
The R version makes it possible to see how the computations are performed and to write other tests easily,
while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function.
By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7).
The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot
shows that the observed value is very far to the right of the histogram of simulated values.
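
The principle of such a test can be sketched in a few lines of base R. This is not the code of randtest.between: the statistic used here (ratio of between-class inertia to total inertia) is our assumption, the iris data set replaces the Meaudret table, and the names between_ratio, obs, sim and p_value are ours.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

data(iris)
X <- scale(as.matrix(iris[, 1:4]))  # centered and scaled variables
f <- iris$Species                   # factor defining the classes

# Hypothetical helper: between-class inertia divided by total inertia.
between_ratio <- function(X, f) {
  counts <- as.vector(table(f))
  G <- rowsum(X, f) / counts                  # table of group means
  between <- sum(G^2 * (counts / nrow(X)))    # inertia of the means, weights Dw
  total <- sum(X^2) / nrow(X)                 # total inertia, uniform row weights
  between / total
}

obs <- between_ratio(X, f)

# Permute the class labels to simulate the absence of differences between classes.
nrepet <- 999
sim <- replicate(nrepet, between_ratio(X, sample(f)))

# The observed value is counted among the simulated ones.
p_value <- (sum(sim >= obs) + 1) / (nrepet + 1)
print(c(observed = obs, p_value = p_value))

hist(sim, xlim = range(c(sim, obs)), main = "Simulated values")
abline(v = obs, lwd = 2)  # observed value
```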

We have described only the most basic functions of the ade4 package, considering only the simplest
one-table data analysis methods. Many other dudi methods are available in ade4, for example
multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of
a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc),
and decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical
correspondence analysis and redundancy analysis are the most frequently used in ecology ([Legendre and Legendre, 1998]).
The ade4 package proposes an alternative to these methods, based on the co-inertia criterion ([Dray et al., 2003]).

The third category of data analysis methods available in ade4 consists of K-tables analysis methods,
which try to extract the stable part in a series of tables. These methods come from the STATIS
strategy [Lavit et al., 1994] (statis and pta functions) or from the multiple co-inertia strategy
(mcoa function). The mfa and foucart functions perform two variants of
K-tables analysis, and the STATICO method (function ktab.match2ktabs, [Thioulouse et al., 2004]) makes
it possible to extract the stable part of species-environment relationships, in time or in space.