For software that efficiently identifies GGM networks from data, visit
the GeneNet page.

A simple method for inferring the network of (linear) dependencies among a set of variables
is to compute all pairwise correlations and then draw the corresponding
graph (for some specified threshold). While popular and often used on many types of genomic
data (e.g. gene expression, metabolite concentrations, etc.), the naive correlation approach does not
allow the dependency network to be inferred, because it cannot separate direct from indirect associations.
Instead, graphical Gaussian models (GGMs) should be used. These allow direct
influences to be identified correctly, have close connections with causal graphical models, are
straightforward to interpret, and yet are essentially as easy to
compute as naive correlation models. This page lists pointers to learning GGMs from data, including
procedures suitable for "small n, large p" data sets (category iii).

Introduction:

Graphical Gaussian Models (GGMs), also known as "covariance
selection" or "concentration graph" models, have recently become
a popular tool to study gene association networks. The
key idea behind GGMs is to use partial correlations as
a measure of (conditional) independence of any two genes. This
makes it straightforward to distinguish direct from indirect
interactions. Note that partial correlations are obtained from the
inverse of the correlation matrix. Also note that in
GGMs missing edges indicate conditional
independence.
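To make the link between partial correlations and the inverse of the correlation matrix concrete, here is a minimal Python sketch. It is not taken from GeneNet: the helper names `invert` and `partial_correlations` and the three-variable correlation matrix are made up for illustration. Writing Omega for the inverse of the correlation matrix, the partial correlation between variables i and j given all other variables is -Omega[i][j] / sqrt(Omega[i][i] * Omega[j][j]).

```python
from math import sqrt

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlation of each pair given all remaining variables."""
    omega = invert(corr)  # inverse correlation (concentration) matrix
    p = len(corr)
    return [[1.0 if i == j else -omega[i][j] / sqrt(omega[i][i] * omega[j][j])
             for j in range(p)] for i in range(p)]

# Hypothetical chain A -> B -> C: A and C are related only through B,
# so their marginal correlation is 0.8 * 0.8 = 0.64.
corr = [[1.00, 0.80, 0.64],
        [0.80, 1.00, 0.80],
        [0.64, 0.80, 1.00]]
pcor = partial_correlations(corr)
```

In this toy chain A and C have a marginal correlation of 0.64, yet `pcor[0][2]` is zero up to rounding, so a GGM would show no A-C edge, while the A-B partial correlation `pcor[0][1]` stays clearly non-zero.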

A related but fundamentally different concept
is that of the so-called gene relevance networks,
which are based on the "covariance
graph" model. In the latter, interactions are defined
through standard correlation coefficients, so that missing edges
denote marginal independence only.

There is a simple reason why GGMs should be preferred over
relevance networks for the identification of gene networks:
the correlation coefficient is a weak criterion for
measuring dependence, since marginally, i.e.
directly and indirectly, more or less all genes will be
correlated. This implies that zero correlation is in fact a
strong indicator of independence,
i.e. of the case of no edge in a network, but this is of course
not what one usually wants to find out by building a relevance
network. On the other hand, partial correlation
coefficients do provide a strong measure of
dependence and, correspondingly, offer only a
weak criterion of independence (as
most partial correlation coefficients usually vanish).

Application of GGMs to genomic data is quite challenging, as
the number of genes (p) is usually much larger than the number
of available samples (n), and classical GGM theory is not valid
in a small sample setting. With this page I'd like to provide a
commented list of some recent work dealing with GGM gene
expression analysis (there are only very few so far). In my
understanding, all of these papers fit into one of three
categories:

i. analysis with classic GGM theory,

ii. using limited order partial correlations, and

iii. application of regularized GGMs.

For small n, large p data it seems that the methods from section
iii are the most suitable (see below for references and
software).

I. Classic GGM Analysis:

The following papers simply apply classical GGM theory (i.e.
with no further modification) to analyze gene expression data.
It turns out that such an analysis is necessarily restricted to
very small numbers of genes or gene clusters so as to satisfy
n > p.

II. Limited Order Partial Correlations:

One way to circumvent the problem of computing full partial
correlation coefficients when the sample size is small compared
to the number of genes is to use partial correlation
coefficients of limited order. This results in something
in between a full GGM (with correlation conditioned on all
p-2 remaining genes) and a relevance network model (with
unconditioned correlation). This is the strategy employed in
the following papers:
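As a rough illustration of what a limited order partial correlation looks like (independent of the papers referred to above), the Python sketch below computes first-order partial correlations, i.e. the correlation between two variables conditioned on a single third one, using the standard formula (r_ij - r_ik * r_jk) / sqrt((1 - r_ik^2) * (1 - r_jk^2)). The function names, the four-variable toy correlation matrix, and the rule of keeping the weakest value over all single conditioning variables are assumptions made for this example only.

```python
from math import sqrt

def first_order_pcor(corr, i, j, k):
    """Correlation between variables i and j after conditioning on one variable k."""
    r_ij, r_ik, r_jk = corr[i][j], corr[i][k], corr[j][k]
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def weakest_first_order_pcor(corr, i, j):
    """Smallest-magnitude first-order partial correlation over all single
    conditioning variables (one possible summary rule, assumed here)."""
    others = [k for k in range(len(corr)) if k not in (i, j)]
    return min((first_order_pcor(corr, i, j, k) for k in others), key=abs)

# Hypothetical chain of four variables 0 - 1 - 2 - 3 in which the correlation
# decays as 0.8 ** (distance along the chain).
corr = [[0.8 ** abs(i - j) for j in range(4)] for i in range(4)]

direct = weakest_first_order_pcor(corr, 0, 1)    # neighbours in the chain
indirect = weakest_first_order_pcor(corr, 0, 2)  # linked only through variable 1
```

Here `indirect` vanishes (up to rounding) because conditioning on variable 1 alone already explains the association between variables 0 and 2, whereas `direct` remains substantial whichever single variable is conditioned on.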

III. Regularized GGMs:

Another possibility (and in my opinion the statistically
most sound way) to marry GGMs with small sample modeling is to
introduce regularization and moderation. This essentially boils
down to finding suitable estimates for the covariance matrix
and its inverse when n < p. This can be done either in a
full Bayesian manner, or in an empirical Bayes way via variance
reduction, shrinkage estimates, etc. Once regularized estimates
of partial correlation are available, heuristic searches
can subsequently be employed to find an optimal graphical
model (or set of models).
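The following Python sketch illustrates the general idea of regularizing the correlation matrix before inverting it. It shrinks the sample correlation matrix R towards the identity, R* = (1 - lambda) R + lambda I, which keeps the unit diagonal and damps the off-diagonal entries. Everything specific here is an assumption for illustration: the data values, the helper names, and in particular the fixed shrinkage intensity `lam = 0.2`, which in practice would be estimated from the data rather than chosen by hand.

```python
from math import sqrt

def correlation_matrix(data):
    """Sample correlation matrix of data given as rows = samples, columns = variables."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Cross-products; the 1/(n-1) factor cancels in the correlation.
    cross = [[sum(c[i] * c[j] for c in centered) for j in range(p)] for i in range(p)]
    return [[cross[i][j] / sqrt(cross[i][i] * cross[j][j]) for j in range(p)]
            for i in range(p)]

def shrink(corr, lam):
    """(1 - lam) * R + lam * I: unit diagonal kept, off-diagonals damped."""
    p = len(corr)
    return [[1.0 if i == j else (1 - lam) * corr[i][j] for j in range(p)]
            for i in range(p)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlations from the inverse of a (regularized) correlation matrix."""
    omega = invert(corr)
    p = len(corr)
    return [[1.0 if i == j else -omega[i][j] / sqrt(omega[i][i] * omega[j][j])
             for j in range(p)] for i in range(p)]

# Made-up data with n = 3 samples and p = 4 genes. With n < p the sample
# correlation matrix has rank at most n - 1 and cannot be inverted reliably.
data = [[2.1, 0.5, 1.0, 3.2],
        [1.4, 1.5, 0.7, 2.9],
        [3.0, 0.9, 1.8, 4.1]]

lam = 0.2  # assumed shrinkage intensity, fixed by hand for this sketch
shrunk = shrink(correlation_matrix(data), lam)
pcor = partial_correlations(shrunk)
```

Because the sample correlation matrix is positive semi-definite, every eigenvalue of the shrunken matrix is at least `lam`, so the inverse exists and the resulting partial correlations are well defined even though n < p.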

Outside a genomic context, the use of regularized GGMs was first
proposed by F. Wong, C.K. Carter, and R. Kohn (2003. Efficient estimation
of covariance selection models. Biometrika 90:809-830). For
gene expression data this strategy is pursued in the following
papers:

In these papers a regularized estimate of the
correlation matrix is obtained, either by Stein-type shrinkage
(3) or by bootstrap variance reduction (2). This estimate
is subsequently employed for computing partial correlations.
Network selection is based on false discovery rate multiple
testing. This method is implemented in GeneNet.
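To give a feel for how false discovery rate testing can turn a list of candidate edges into a network, here is a small Python sketch. The text above only states that FDR multiple testing is used; the Benjamini-Hochberg step-up procedure below is swapped in as a generic, well-known FDR method and is not claimed to be the procedure implemented in GeneNet. The gene labels and p-values are invented for the example.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return one reject flag per p-value, controlling the FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= k * q / m; every hypothesis
    # up to and including that rank is rejected.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for testing "partial correlation = 0" on six candidate edges.
edge_pvalues = {
    ("g1", "g2"): 0.001,
    ("g1", "g3"): 0.008,
    ("g2", "g3"): 0.039,
    ("g2", "g4"): 0.041,
    ("g3", "g4"): 0.27,
    ("g1", "g4"): 0.74,
}

edges = list(edge_pvalues)
flags = benjamini_hochberg([edge_pvalues[e] for e in edges], q=0.05)
selected = [edge for edge, keep in zip(edges, flags) if keep]
```

With these made-up numbers only the edges g1-g2 and g1-g3 are retained at q = 0.05; the edges with p-values 0.039 and 0.041 would pass an unadjusted 5% test but not the FDR criterion.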