CNIO-stats-FAQ: Frequently Asked Statistical Questions at CNIO

This document contains answers to some questions commonly asked about
statistical analyses at CNIO. The most interesting part is the last one,
where we collect miscellaneous questions and answers about common
statistical issues at CNIO. We would much appreciate it if you could check
whether your question has already been answered before asking.
This page is eternally under construction.

3.1 Software from the Bioinformatics Unit

3.2 Other software for microarray data analysis

Lots; a search in Google will be overwhelming. Our favourite is, first and
foremost, the Bioconductor set of packages. Bioconductor runs under R
and most of these packages are command-driven; thus, mouse-clicking will not
take you very far. Therefore, Bioconductor tends to be used by people who
think the learning effort is worth it (we do think it is worth it!).

Another useful program, which runs under Windows and might be a little
more user-friendly, is BRB Array Tools.

3.3 General purpose statistical software

Lots too. Of course, you have the usual commercial options (SPSS, SAS,
Statistica, etc.). But you might want to give free software a try. Our
favourite is, without doubt, R; we've used it for quite some time now,
use it for virtually all of our statistical analyses and most of our
programming, and have taught some courses. Learning R takes some time.
There are some GUIs available; we don't use them much, but from our
limited experience with R GUIs the one we like best is Rcmdr, from
John Fox; Rcmdr will run with a very similar "look-and-feel" in GNU/Linux
and Windows; we can help you get it up and running.

Another free statistical system with a GUI is Arc, which is built upon
Xlisp-Stat. Arc is particularly nice for regression diagnostics. It also
runs under both GNU/Linux and Windows.

3.4 Help with statistical software

As explained in the statistical consulting at CNIO usage rules, we are
unable to provide help with any software, except that developed by us;
note that we occasionally teach courses about the usage of GEPAS and
related programs. We may, occasionally, help with R (installation, and
some basic usage). Please come to our courses if you think you will want
to use R or Bioconductor.

I suspect I have several groups, but I don't know which nor which are
the relevant genes

Some comments about p-values

Pooling RNA

Dye-swaps or always controls with same dye?

Other lists with prejudices

On consulting a statistician before or after the experiment

4.1 Multiple testing: do I need to worry about it?

If you are asking, then you most likely do need to worry. There are lots of
good introductions available. Check Pomelo's help page for links and papers.
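
To get a feel for the size of the problem, here is a minimal sketch in Python. It is not taken from Pomelo or from any CNIO material; the array size of 10,000 genes, the 0.05 threshold and the function name are assumptions chosen for illustration. Under the null hypothesis p-values are uniformly distributed, so the sketch simulates an array on which no gene is differentially expressed and counts how many genes would still be called "significant". Bonferroni is shown only as one simple (and conservative) adjustment, not as a recommendation.

```python
import random

def count_false_positives(n_genes=10_000, alpha=0.05, seed=1):
    """Simulate p-values for an array where NO gene is differentially expressed."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_genes)]
    # Genes declared "significant" without any adjustment.
    unadjusted = sum(p < alpha for p in p_values)
    # Bonferroni threshold: alpha divided by the number of tests.
    bonferroni = sum(p < alpha / n_genes for p in p_values)
    return unadjusted, bonferroni

unadjusted, bonferroni = count_false_positives()
print("genes with unadjusted p < 0.05:", unadjusted)
print("genes passing the Bonferroni threshold:", bonferroni)
```

With these settings we expect around 500 genes (5% of 10,000) to pass the unadjusted threshold, even though none of them is truly different.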

4.2 t-test then cluster illusion

Some people do something like:

Run a t-test on all the genes of the array

Cluster subjects using as variables only those genes that have an (unadjusted) p-value
less than a given threshold (e.g., 0.05)

Bingo, now you see two perfectly separated groups in your clustering

The "great clusters" are an illusion. It is trivial to get great results using completely random
data; R code to show this is provided in the example files for the
R course taught at CNIO. The statistical explanation is also trivial: the
genes were selected precisely because they differ between the groups in this
sample, so of course the samples separate when clustered on those genes. In
summary, this procedure shows nothing of relevance (except how simple it is
to capitalize on chance and obtain apparently great results).
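
The R code from the course is not reproduced here. As a rough stand-in, the following Python sketch shows the same effect on pure noise. Everything in it is an assumption made for illustration: the function names, the 1000 genes, the 10 + 10 samples, and the cutoff |t| > 2.1, which roughly corresponds to an unadjusted p-value of 0.05 for groups of this size. Instead of drawing a dendrogram, it compares the average distance between samples of different groups with the average distance between samples of the same group; a ratio clearly above 1 is what shows up as "separated groups" in a clustering.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def distance_ratio(data, group_a, group_b, genes):
    """Mean between-group distance divided by mean within-group distance."""
    def dist(i, j):
        return math.sqrt(sum((data[g][i] - data[g][j]) ** 2 for g in genes))
    between = [dist(i, j) for i in group_a for j in group_b]
    within = [dist(i, j) for grp in (group_a, group_b)
              for i in grp for j in grp if i < j]
    return mean(between) / mean(within)

def simulate(n_genes=1000, n_per_group=10, t_cutoff=2.1, seed=1):
    rng = random.Random(seed)
    n = 2 * n_per_group
    # Pure noise: no gene truly differs between the two (arbitrary) groups.
    data = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_genes)]
    group_a = list(range(n_per_group))
    group_b = list(range(n_per_group, n))
    # Step 1: keep only the genes that look "significant" in a t-test.
    selected = [g for g in range(n_genes)
                if abs(welch_t([data[g][i] for i in group_a],
                               [data[g][i] for i in group_b])) > t_cutoff]
    # Step 2: look at group separation using all genes vs. the selected genes.
    ratio_all = distance_ratio(data, group_a, group_b, range(n_genes))
    ratio_selected = distance_ratio(data, group_a, group_b, selected)
    return len(selected), ratio_all, ratio_selected

n_selected, ratio_all, ratio_selected = simulate()
print("genes selected:", n_selected)
print("between/within distance ratio, all genes:", round(ratio_all, 2))
print("between/within distance ratio, selected genes:", round(ratio_selected, 2))
```

Using all genes the ratio should stay close to 1 (no structure), whereas on the selected genes the groups look separated, although the data are random.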

4.3 cluster then t-test illusion

It's like the mirror image of the above, but it is still an illusion. The
idea is something like this: using a set of samples and genes, you cluster
the samples, divide your samples into two groups using the dendrogram as
guidance, and then find the genes (among the genes originally used for the
clustering) that are "significant", using a t-test.

These t-tests, and their p-values, are meaningless. Please don't claim
that the genes are "statistically significantly different" between the two
groups, because the two groups were themselves defined using the very genes
you are then testing for differences…
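
A small simulation makes the point. The Python sketch below is only an illustration under stated assumptions: it uses a plain 2-means clustering as a stand-in for cutting a dendrogram into two groups, and the function names, the 20 samples, the 20 genes, the 30 repetitions and the cutoff |t| > 2.1 (roughly an unadjusted p-value of 0.05 for groups of about 10 samples) are all made up for the example. The data are pure noise, so an honest test should flag about 5% of the genes.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def two_means(samples, rng, n_iter=50):
    """Plain 2-means (Lloyd's algorithm); returns a 0/1 label per sample."""
    centroids = rng.sample(samples, 2)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2
                                          for x, c in zip(s, centroids[k])))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:
                centroids[k] = [mean(col) for col in zip(*members)]
    return labels

def fraction_significant(n_datasets=30, n_samples=20, n_genes=20,
                         t_cutoff=2.1, seed=1):
    """Cluster noise into two groups, then t-test the same genes."""
    rng = random.Random(seed)
    hits = tests = 0
    for _ in range(n_datasets):
        samples = [[rng.gauss(0, 1) for _ in range(n_genes)]
                   for _ in range(n_samples)]
        labels = two_means(samples, rng)
        a = [s for s, lab in zip(samples, labels) if lab == 0]
        b = [s for s, lab in zip(samples, labels) if lab == 1]
        if min(len(a), len(b)) < 3:
            continue  # skip degenerate splits
        for g in range(n_genes):
            t = welch_t([s[g] for s in a], [s[g] for s in b])
            hits += abs(t) > t_cutoff
            tests += 1
    return hits / tests if tests else float("nan")

fraction = fraction_significant()
print("fraction of noise genes declared 'significant':", round(fraction, 3))
```

The fraction of "significant" genes should come out well above the nominal 5%, simply because the groups were built from the same genes that are then tested.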

Sure, you have a question here about which genes are important for the
clustering, but this is not the way to approach the problem. There are
some references on variable selection for clustering. In the OVW page you
can find a paper with references to the literature and software. You might
also want to consider biclustering (see also this question).

4.4 I suspect I have several groups, but I don't know which nor which are
the relevant genes

4.5 Some comments about p-values

P-values are often misinterpreted. This comes from the help of
Pomelo:

(Recall that a p-value is the probability,
under the null hypothesis [in our case, the "natural" null would
be that there are no differences between the two classes in the level
of gene expression] of obtaining a value of the test statistic
as extreme as, or more extreme than, the one observed in the sample.
Small p-values provide evidence against the null hypothesis,
and in the "Fisherian tradition" of p-values
as strength of evidence against the null
a p-value between 0.05 and 0.01 is considered
some evidence against the null, a value between 0.01 and 0.001 is usually
considered strong evidence against the null, and a value less than 0.001 is usually
considered very strong evidence against the null. Note, however, that p-values
can sometimes be a tricky business, and there is quite a bit of misunderstanding
about what they really mean (e.g., misinterpreting the Fisherian
approach to p-values as evidence as a frequentist statement, or the importance
of tail behavior, or trying to give a bayesian-like "probability of the null"
interpretation); a nice review of some of these issues can be
found in a
paper by J. Berger.).
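
The definition quoted above can be made concrete with a permutation test, where the null distribution of the test statistic is obtained by relabelling the samples. The Python sketch below is only an illustration: the function name and the expression values are hypothetical, and the absolute difference in means is just one possible test statistic. It enumerates every way of splitting the samples into two classes of the original sizes and reports the proportion of splits whose statistic is as extreme as, or more extreme than, the observed one.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    as_extreme = n_splits = 0
    # Enumerate every relabelling of the samples into groups of the same sizes.
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        stat = abs_mean_diff(sum(pooled[i] for i in idx))
        # "As extreme as, or more extreme than, the one observed"
        # (small tolerance guards against floating-point rounding).
        if stat >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits

# Hypothetical expression values for one gene in two classes of four samples.
class_1 = [1.0, 2.0, 3.0, 4.0]
class_2 = [5.0, 6.0, 7.0, 8.0]
p = permutation_p_value(class_1, class_2)
print("permutation p-value:", p)
```

Here only 2 of the 70 possible relabellings give a difference as large as the observed one, so p = 2/70, about 0.029. Note that this is a statement about the data under the null, not the probability that the null is true.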

And it also gets complicated, because our null often includes
more than we would like it to include. The following two links are from
a recent exchange at the Bioconductor email list:

4.6 Pooling RNA

Some references include Peng et al., 2003, and Kendziorski et al., 2003,
Biostatistics, 4: 465-477. A quick summary of some issues can be found in
these two email messages from the Bioconductor email list:

4.7 Dye-swaps or always controls with same dye?

In many experiments we have to decide whether to use a dye-swap design (i.e., each
treatment gets both Cy3 and Cy5, on different slides, of course) or to always label
the control with one dye (e.g., Cy3) and the experimental samples with the other.
Some papers on experimental design for arrays deal with these issues.