SignS is a web-based tool for gene selection and signature finding in
problems where the dependent variable is patient survival or, more
generally, a right-censored variable. Four approaches have been
implemented: the threshold gradient descent method (TGD) of
Gui & Li (2005), a method that uses a combination of
gene filtering, clustering and survival model building (FCMS), very
similar to the one used in Dave et al. (2004), a method
that uses random forests with conditional inference trees by Hothorn
et al. (2006a), and a method that uses boosting with component-wise
univariate Cox models (Hothorn et al., 2006a).

There are other approaches that will analyze gene expression (in
general, high-dimensional) data when the response is a
right-censored variable. However, few other methods explicitly
attempt to recover subsets of the original genes, preserving
relevant genes even when these are highly correlated. Other
approaches (e.g., PLS) will return predictive models, but offer
little guidance as to selection of genes (i.e., prediction will be
based on the complete set of original variables ---genes), or be
intrinsically limited as to the maximum number of genes that can be
used for fitting (such as the lasso).

The "Filter, cluster, and stepwise model selection" method uses a
series of steps very similar to those in Dave et al.

Genes are filtered, using a univariate (gene-by-gene) Cox
model. Only genes with a p-value (not adjusted for multiple
comparisons) smaller than Minimum gene-wise Cox p-value are
retained for further analyses.

Genes that fulfill the previous requirement are divided into two
groups, those with a positive coefficient in the Cox model (i.e.,
those genes for which an increase in their value is associated
with a decrease in survival) and those with negative coefficients
(increase in the gene associated to increased survival). (Yes,
this is not a typo: the Cox model is a model of the hazard rate or
hazard function ---the "instantaneous risk of death"; positive
coefficients indicate a (relative) increase in the hazard, and
thus a decrease in survival).

Each of these two groups of genes is clustered separately. We
have used complete-linkage hierarchical clustering with (1 -
correlation) as the distance measure. This is a reasonable
clustering method for genes (e.g., Simon et al.,
2003) and eases the implementation of a correlation-based
criterion for cluster selection (see next).

A potential signature is a group of genes whose pairwise correlations
are all at least Min. correlation of genes in a cluster, and whose
size lies between Min. number of genes in a cluster and Max.
number of genes in a cluster.
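The acceptance rule for a candidate cluster can be sketched like this (a toy illustration with made-up correlations; SignS itself does this in R during the clustering step):

```python
def acceptable_cluster(genes, cor, min_cor=0.5, min_n=10, max_n=100):
    """A cluster is a potential signature only if its size is within
    [min_n, max_n] and every pairwise correlation is >= min_cor."""
    if not (min_n <= len(genes) <= max_n):
        return False
    return all(cor[frozenset((a, b))] >= min_cor
               for i, a in enumerate(genes)
               for b in genes[i + 1:])

# Made-up pairwise correlations, keyed by unordered gene pair:
cor = {frozenset(p): c for p, c in [(("a", "b"), 0.9),
                                    (("a", "c"), 0.6),
                                    (("b", "c"), 0.3)]}
acceptable_cluster(["a", "b"], cor, min_n=2)       # True
acceptable_cluster(["a", "b", "c"], cor, min_n=2)  # False: cor(b, c) < 0.5
```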

Note that the default settings for Min. and Max. numbers of
genes in a cluster (10 and 100) are different from those used in
Dave et al. (25 and 50). Other than interpretability and prior
assumptions about size of "paths", I see no reason to use any
particular set of values.

The numeric values of a component (signature in the sense of
Dave et al.) are obtained by taking the average of the expression levels
of all the genes in a given component.

All possible pairs of signatures are then jointly fitted with a
Cox model (i.e., a Cox model with two variables). The best
two-signature model is retained (we choose as best the model with
the largest likelihood) for the next step.
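Both steps above can be sketched in a few lines; the expression values are made up, and `loglik` below is a hypothetical stand-in for the actual two-variable Cox fit done in R:

```python
from itertools import combinations

def component_values(expr, genes, n_patients):
    """Mean expression of the cluster's genes, per patient."""
    return [sum(expr[g][i] for g in genes) / len(genes)
            for i in range(n_patients)]

def best_pair(signatures, loglik):
    """The two-signature model with the largest log-likelihood."""
    return max(combinations(sorted(signatures), 2), key=loglik)

expr = {"g1": [1.0, 3.0], "g2": [3.0, 5.0]}
component_values(expr, ["g1", "g2"], n_patients=2)  # [2.0, 4.0]

# Toy log-likelihoods for each pair of signatures:
ll = {("N.1", "P.1"): -5.0, ("N.1", "P.2"): -8.0, ("P.1", "P.2"): -10.0}
best_pair(["P.1", "P.2", "N.1"], loglik=lambda pair: ll[pair])
# -> ("N.1", "P.1"), the pair with the largest likelihood
```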

We carry out stepwise variable selection using as starting point
(and allowing both addition and deletion of terms) the best
two-signature model from the step above. From the descriptions in
the supplementary material of Dave et al., it seems they use
p-values to retain or eliminate terms. We have used, instead, the
AIC, a criterion which might be more suited for model selection. The
components retained in the final models are used as the final
selected components.
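A minimal sketch of AIC-based stepwise selection with both addition and deletion of terms; `toy_aic` is a made-up scoring function standing in for the real Cox refits:

```python
def stepwise_aic(start, candidates, aic):
    """Greedy stepwise search: add or drop one signature at a time,
    keeping the move that most lowers the AIC; stop when none does."""
    current = frozenset(start)
    best = aic(current)
    while True:
        moves = [current | {c} for c in candidates - current]
        moves += [current - {c} for c in current if len(current) > 1]
        score, model = min(((aic(m), m) for m in moves), key=lambda t: t[0])
        if score >= best:
            return set(current)
        best, current = score, model

# Toy AIC: distance from a hypothetical "true" set of signatures.
true_model = {"P.1", "P.4", "N.2"}
toy_aic = lambda m: len(m ^ true_model)
stepwise_aic({"P.1", "P.4"}, {"P.1", "P.4", "N.2", "N.7"}, toy_aic)
# -> {"P.1", "P.4", "N.2"}
```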

This method, developed by Gui and Li, uses penalized likelihood, a
general approach (with other examples such as the lasso and ridge
regression) for obtaining estimates of the regression coefficients
in high-dimensional spaces such as these (few subjects, thousands of
features). This approach has close connections to the Lasso, but
should overcome some of its limitations; for instance, the largest
number of genes that can be selected by the Lasso can be no larger
than the number of cases, and it tends to return only one of a set
of genes that show high correlations.

Two parameters have to be set for this method: a threshold parameter
τ (tau) and an infinitesimal increment ∆ν (Delta
nu). The threshold parameter (where 0 ≤ τ ≤ 1) is
related to the amount of penalization, and larger values lead to
larger numbers of coefficients being set to 0. Delta nu is an
infinitesimal increment that affects the rate of change of the
coefficients at each iteration of the algorithm.

In their
paper, Gui and Li suggest using cross-validation to choose
τ and ν. In practice, an operational approach is to decide
in advance on a set of τ values to examine (in our case, 0, 0.2,
0.4, 0.6, 0.8, 1), and use cross-validation to select the ν
for each of the pre-specified τ. For each τ, we will
choose the ν that minimizes the cross-validated partial
likelihood (CVPL), a function of the difference in the log Cox partial
likelihoods for the models with and without the ith group of
subjects. Once we find the ν for a given τ, it is
immediate to obtain the fitted coefficients for the genes and the
scores for subjects not in the sample.
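The selection scheme reduces to a small grid search; `toy_cvpl` below is a made-up stand-in for the actual cross-validation loop:

```python
def select_nu_per_tau(taus, nus, cvpl):
    """For each pre-specified tau, the nu minimising the CVPL."""
    return {tau: min(nus, key=lambda nu: cvpl(tau, nu)) for tau in taus}

# Toy CVPL surface, just to show the mechanics of the grid search.
toy_cvpl = lambda tau, nu: (nu - tau) ** 2
select_nu_per_tau([0.0, 0.5, 1.0], [0.0, 0.25, 0.5, 0.75, 1.0], toy_cvpl)
# -> {0.0: 0.0, 0.5: 0.5, 1.0: 1.0}
```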

From the original paper,
it seems that the authors used the CVPL to find the ν for a
given τ, but it is not clear that this is the criterion
ultimately used to select among τ. Rather, in the example
discussed in their paper, the authors choose the final model based
on the area under the ROC curve, and they use a set of data that
includes both training and testing data subsets. We have (so far)
only implemented selection based on the CVPL. Using ROC curves has
not yet been implemented, because of the additional computational
demands, but we are working on the implementation of this (and
another) approach for selecting the final τ, which can be
used when we have no external testing data set.

Note that we scale (subtract the mean and divide by the standard
deviation) each gene to prevent numerical problems that arise with
some data sets when exponentiating the score function. This is in
contrast to the R
code from the authors, which performed no scaling or
centering. In the future, this scaling of the data will be optional,
possibly based on the existence of numerical problems. Note that
when using scaled data the interpretation of the effects of
coefficients does not depend on the variance of a gene, and all
coefficients refer to the effect of a change in 1 s.d. unit of the
gene.
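The scaling itself is the usual z-score, sketched here gene by gene in pure Python:

```python
from statistics import mean, stdev

def scale_gene(values):
    """Centre a gene's expression values to mean 0 and divide by
    the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

scale_gene([1.0, 2.0, 3.0])  # -> [-1.0, 0.0, 1.0]
```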

NEW We have re-enabled this method. As this method is slow,
following a suggestion from Jiang Gui, we now do not examine the CVPL over
a range of threshold parameters τ (tau), but instead use a fixed
one. By default, it is set at 1. You can vary it, but you are likely to
want large values (between 0.9 and 1). The main reason is that this
method can add genes but not remove them, so small values of the threshold
will result in hundreds or thousands of genes being selected in the very first
steps of the method and then carried along all the way.

The approach of Hothorn et al. for using random forests utilizes
conditional inference trees as base learners. A conditional inference tree
is built by recursively splitting the data into two groups according to
the value of the covariables. First, test the global null hypothesis of
independence between any of the variables and the response. If this
hypothesis is rejected, select the variable with strongest association
with the response, and implement a binary split in the selected input
variable. Continue recursively (in each branch of the tree) until the
global null is no longer rejected. The tests are carried out in a
conditional inference framework (for details see Hothorn et al.,
2006b). A forest is built by fitting a conditional inference tree
to each of the many (in our case, 500) bootstrap samples of the data. The
responses returned by these forests are Kaplan-Meier curves. In
contrast to the previous two methods, no variable (gene)
selection is performed here, although, implicitly, the construction of each
tree involves choosing the best variable for a split.

Following van Wieringen et al. (2007), we fit the forests
to the best genes, as determined from gene-wise Cox models. The actual
number of genes used is the parameter "Number of genes", which can be chosen by the user of the
application. The default is 200, as in the reference above.
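The pre-filter is simply a ranking by gene-wise p-value; a sketch (the p-values are assumed to come from the R Cox fits, and the gene names are made up):

```python
def top_k_genes(p_values, k=200):
    """Keep the k genes with the smallest gene-wise Cox p-values."""
    return [g for g, _ in sorted(p_values.items(), key=lambda t: t[1])[:k]]

top_k_genes({"g1": 0.3, "g2": 0.001, "g3": 0.05}, k=2)  # ["g2", "g3"]
```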

Boosting is a general iterative optimization method that, at each
iteration, fits the predictor variable that most reduces the appropriate
criterion (Bühlmann and Hothorn, 2008) and updates the
estimates; the algorithm used is equivalent to refitting
residuals multiple times.

For survival data, Hothorn et al. (2006), and
Bühlmann and Hothorn (2008) use component-wise univariate Cox
models. Boosting requires choosing the number of boosting iterations, but
as shown in those references, we can use cross-validation to select this
parameter. This method performs variable selection, similar to the Lasso
and other L1-penalty methods. We select for the final results the
variables (genes) with non-zero coefficients at the optimal number of
iterations as chosen via cross-validation.
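To illustrate the mechanics only, here is component-wise boosting with least squares on residuals as a stand-in for the component-wise Cox loss that mboost actually uses; the structure (fit every variable to the current residuals, keep only the best one, take a small step, repeat) is the same:

```python
def boost(X, y, n_iter=100, step=0.1):
    """Component-wise boosting: each iteration refits the residuals with
    every single variable, keeps only the best one, takes a small step."""
    n, p = len(y), len(X)
    ssx = [sum(v * v for v in x) for x in X]
    coef = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        # Least-squares coefficient of each variable against the residuals.
        fits = [sum(X[j][i] * resid[i] for i in range(n)) / ssx[j]
                for j in range(p)]
        # The variable giving the largest drop in residual sum of squares.
        j = max(range(p), key=lambda j: fits[j] ** 2 * ssx[j])
        coef[j] += step * fits[j]
        resid = [resid[i] - step * fits[j] * X[j][i] for i in range(n)]
    return coef

# Toy data: y depends only on the first variable, so the second keeps
# a zero coefficient (it is never selected), mimicking gene selection.
coef = boost(X=[[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]],
             y=[3.0, 3.0, 3.0, 3.0])
```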

Assessing predictive performance with censored data is more complex
than with, say, classification data, and several alternative
approaches have been suggested. Here, we provide a simple graphical
approach, popular in the gene expression literature: for each case,
we obtain its out-of-bag cross-validated prediction (i.e., the
prediction from a model in which that subject was not used at all in
any of the steps of the model building). Using the out-of-bag
predictions we divide all subjects into two groups, those with high
scores and those with low scores (i.e., we split them into two
equally sized groups). Then, we plot the survival curves and compare
the survival of the two groups with a log-rank test.

(Note that with random forests, we use the expected survival time, not
the linear predictor scores).
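The grouping itself is just a median split of the out-of-bag scores, sketched here with made-up subject names and scores (the survival curves and log-rank comparison are done in R):

```python
def median_split(scores):
    """Order subjects by score and cut at the median into two
    equally sized groups: low scores first, high scores second."""
    order = sorted(scores, key=scores.get)
    half = len(order) // 2
    return order[:half], order[half:]

low, high = median_split({"s1": 0.2, "s2": 1.5, "s3": -0.7, "s4": 0.9})
# low == ["s3", "s1"], high == ["s4", "s2"]
```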

It must be emphasized that these are out-of-bag predictions, so these
are predictions for each subject obtained from a complete
model building process where that subject was not involved. For
comparison, we also provide the "overfit" plots where the scores
used to split the subjects are "resubstitution" scores.

Be very careful when using the above plots!!
In fact, that way of measuring or assessing predictive ability is
extremely limited, which is a soft way of saying that it is wrong, and
those p-values mean nothing, and a small p-value does not mean you
really can predict survival well, etc, etc, etc. I do not have time to
write about this in any detail right now, but do yourself a favor and
look, for instance, at rules 36 to 38 of the paper "Critical Review of
Published Microarray Studies for Cancer Outcome and Guidelines on
Statistical Analysis and Reporting" by Alain Dupuy and Richard M. Simon,
published in the JNCI (January 17, 2007).

How should we measure predictive ability then? There are a range of
approaches. I like (and have used in published papers ---see, e.g.,
Montes et al., 2011, "miRNA expression in diffuse large B-cell lymphoma
treated with chemoimmunotherapy.", Blood, 118: 1034-40, and go to the
supplementary material) what are basically extensions to the survival
data case of things such as Brier's score and the concordance index (of
course, from cross-validated/bootstrapped runs). These
indices can really address the question of the predictive quality
of your model. The above plots don't.

Why aren't Brier's and the concordance index in SignS/will they ever be
in SignS? They aren't because I started working in SignS when I was a
lot younger and, thus, a lot un-wiser. They might make their way to
SignS when/if I get the time to work on SignS again.

Why don't you remove these plots then? Ah, great question. I think they
give you a (bad) hint of what is going on. And taking them out would
require me to rewrite a lot of the help and change the examples, etc,
and I just don't have the time now. However, I've warned you not to use
them for anything serious. Oh, and I use this in class as an example of
how NOT to do things.

Be very careful when looking at the legend!!
So this I just noticed (April 2015) when a friend of mine at CNIO
(Gonzalo Gomez) asked me an innocent question. What does this "high/low"
score in the legend really mean? Get ready, because this is a confusing
mess I should some day fix. If you are using something like the FCMS
(which I recommend you don't use, but whatever) or the TGD method, or
boosting with component-wise Cox models we get the linear scores from
the model (remember, the higher the linear score, the higher the hazard)
and we split them into two groups; the blue ones are those from the
individuals with low linear scores (yes, low), and thus those from
individuals that should die later (those that have smaller hazards). In
contrast, if you use something like random forest, the scores are from
the predicted survival time, so the individuals in the blue group are
those in the group of smaller predicted survival times. Sorry, I am
aware this is very confusing.

Stability is assessed by showing the parameters chosen in each
cross-validation run (for TGD), the signatures, their
coefficients, and the composing genes (for FCMS), or the selected
genes (random forests and boosting). For all methods we show the frequency
with which each selected gene appears in the output, and the
percentage of coincidence between the different runs. (See output section, below).

Data should conform to the "genes in rows, patients (or arrays) in columns" layout.
In other words, each row of the data file is supposed to represent a different gene
or variable.

Use tab (\t) as the field separator within rows.

Use newline or carriage return (\n) between rows. It is also convenient to finish a
file with one carriage return (\n).

Array names: if you want to name your arrays (useful for the output of
the analyses) do as follows:

Place a line that starts with "#";

After the "#" put "Name" or "NAME" or "name" (don't say we
don't give you choices);

Write the array names (separated by tabs).

The first column is assumed to contain the ID information for genes, marker,
or whatever. This will be used to label the output (but it also means that whatever is
in the first column is not used in the analyses). Please note that
shorter names work better when labeling dendrograms.

You can have an arbitrary number of rows with comments. These rows must always start
with an "#".
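Putting the rules above together, a minimal covariate file (with made-up gene names and values; columns are separated by tabs) might look like this:

```
#name	array1	array2	array3
# any other comment lines also start with "#"
gene1	1.2	0.8	-0.5
gene2	-0.3	2.1	0.7
```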

Gene names and array names MUST be unique. If
they are not, the program will let you know. If you do
not want to provide array names, that is OK, and we will
name them with sequential integers starting at 1. If you
do not want to provide gene names, then put some
arbitrary labels on the first column (e.g., fill it with
a sequence of integers).
Why do we need gene and array names to be unique?
Because in many steps, we need to provide either where
we classify a given array (and what should we do if two
or more arrays are named "A"?), or the genes used in the
classifier (and what should we do if two or more genes
are named "gene B"?).

Missing values are NOT allowed. You can use the
preprocessor and do
several things with your data before sending it
to SignS. We would probably recommend you do imputation after
eliminating genes with too many (more than, say, 20%?) missing values. Anyway,
how best to deal with missing values is not a trivial issue and is outside
the scope of this help file.

A tab-separated text file with the survival or duration time (e.g.,
time to death, time to relapse). These data can be
right-censored. Separate values by tab (\t), and finish the file with a
carriage return or newline. No missing values are allowed here.

Please note that we do not allow any data set with 10 or fewer
cases. In terms of statistical analysis, the "true sample size" of
right-censored data is often much smaller than the number of cases,
and even 20 cases are way too few.

It is your responsibility to make sure the assumptions for
these data hold. The models work for right-censored data. See usual
caveats/discussions on other types of censoring, informative
censoring, etc.

A tab-separated text file indicating if the event (e.g., death, relapse)
was actually observed (this is indicated with a 1) or not (a 0); cases for
which the event was not observed are the right-censored cases.

Same format as above: Separate values by tab (\t), and finish the file with a
carriage return or newline. No missing values are allowed.

If you have
validation data, you can enter them here. It is rarely a good idea to
keep a fraction of the data for validation; statistics has shown
repeatedly that cross-validation and the bootstrap are much better
procedures (especially in terms of variance of estimates) than setting
apart a chunk of the data. (A quick, and not quite fully correct,
explanation: setting apart a chunk of the data is, if you think of
it, like a "poor man's cross-validation", because you do once what
10-fold CV does 10 times; and means of 10 samples have smaller
variances than one sample.) Thus, unless you have a very good reason,
do not use the validation data option.

But there are situations when you might have validation
data. Suppose you have a sample of 200 from a study, and another
sample of, say, 100 from a similar study but with samples from a
different hospital or different country. It is probably a bad idea
to put together all 300 cases as if they were a homogeneous
sample. You can, instead, train your model with the 200 samples and
validate it on the 100. This also provides some indication about
whether what you find on the first hospital/country is useful for
the second hospital/country.

However, DO
NOT (I repeat DO NOT) keep playing around with the parameters until things
look nice with the validation data. That would be a serious case of
overfitting. Why? Well, obviously because you would be using
your validation sample to choose the "best parameters", and
thus the validation sample would not be validating anything at all.

Of course, you can play, to your heart's content, with the
parameters using the training data (ONLY the training data). You can
do this as many times as you wish (or you have patience for). Then,
once you are satisfied with the parameters found with your training
data, run SignS, using training and validation data, with the very
same parameters you just found.

Please remember that the validation data are just that, validation
data, and should be used to validate a model (not to find good
settings of the parameters). Of course, we cannot prevent you from
doing this, and the problem with this tool is that it is open to
abuse. But if you do abuse it, you must be fully aware that this is
a clear case of overfitting and cannot be called validation.

If you use any of the currently standard identifiers for your gene IDs
for either human, mouse, or rat genomes, you can obtain additional
information by clicking on the names in the output figures. This
information is returned from IDClight based on that provided
by our IDConverter
tool.

The plots depend on the method used. For all methods we show the
survival plots as explained above.

For FCMS we also show the dendrograms
from the clustering of the positive and negative coefficient genes
(after the p-value filtering), indicating with colors the groups
that meet the restrictions of size and minimum correlation. Again,
these are the clusters that survive the filtering for p-value, and
we use color to show those that survive the filtering in terms of
maximum and minimum size of clusters. All, some, or none of these
might be finally selected by the Cox model.

Additionally, no particular meaning is to be assigned to the
cluster names. We simply name all clusters for genes with positive
coefficients starting with a "P", and those for genes with
negative coefficients starting with a "N". But the numbers
themselves mean nothing in particular (they have to do with depth
and layout in the plot). Please do not try to interpret or assign
any meaning to them; they are just names.

Few "automatically generated" and static dendrograms are to
everybody's satisfaction; how well things look depends on the
dendrogram, whether or not we annotate with labels, the size of the
labels, etc. NEW We now provide dendrograms with all nodes
labeled or only the selected clusters labeled. In addition, you can
get additional information about each node by clicking on it, if you
provided information about the organism and the type of id (and those
identifiers are among the ones used
by IDConverter). Moreover,
you can choose among three different sizes for the dendrograms. The names of the
cluster/signature in the labels of the figures correspond to the names
in the textual output.

For TGD we show a plot of the CVPL (cross-validated partial
likelihood) versus ν for different values of τ. You would
want to see six U-shaped lines, with the bottom of the U far from the
maximum you set for Maximum iterations; note that this maximum
might not be shown in the figure. You might see that different lines
finish at different ν: one algorithm we have implemented (and
which might be running when you use the program) allows
for "early stopping" when it is obvious we are getting away from the
best ν, and this is likely to happen at different ν for
different τ. For TGD we also show, following the original paper of
Gui and Li, the values of the coefficients for different settings of
τ, to give a quick idea of the amount of
regularization and of the sets of genes likely to be important.

Single-gene statistics and p-values

NEW We now provide several tables (ordered using different
criteria) that show the p-values and coefficients for the single-gene
(or gene-by-gene) Cox models that are the first part of the FCMS
method.

For random forests the statistics shown are those corresponding to
single-gene Cox models. No FDR information is shown, to emphasize that
we choose, by decree, and regardless of p-value, the best K genes (K,
by default, is 200).

Likewise, for boosting, only the coefficients have any meaningful value.

For each sample, its out-of-bag score. This is just the matrix
product of the coefficients and the values of the covariates. It is
unlikely to provide any insight by itself, but it allows one to order cases,
see which are very far away from others, etc. Again, we emphasize that
these are out-of-bag predictions, so these
are predictions for each subject obtained from a complete
model-building process in which that subject was not involved. In other
words, as in Tnasas and GeneSrF, we do what is
sometimes called "double or full cross-validation".

Model results and parameters

Each method has slightly different output,
showing the results of the models fitted, the parameters used,
etc. The output should be self-explanatory. (If it is not, please
let me know, and I'll do my best to fix it).

Some explanatory notes for FCMS:

The names of clusters in different runs are not
comparable. For instance, a cluster called N.2 in the original run
and a cluster called N.2 in the first CV run might, or might not,
be somewhat similar, but in general they will be very different
things (the only common feature is that they both group genes with
negative coefficients).

The coefficients shown always refer to the Cox-model
coefficient of each specific cluster and, again, are not
comparable among different runs. For instance, do not try to
compare the coefficient for component P.4 in CVrun1 and in CV
run3. That would not make any sense.

The number of clusters obtained (before any Cox-model
building) that survive the filtering steps need not be the same in
different CV runs. In fact, this information is unlikely to be of
much use, and thus it is not provided.

The number of genes per cluster can be completely different in
different runs (CV and all data). Of course, all of the finally
selected clusters in any run will satisfy the restrictions you set
when running the application in terms of maximum and minimum number
of genes per cluster, and minimal correlation between any two
genes in a cluster. But it is possible that, say, in one run you
get two clusters, all with positive coefficients, each with 8
genes, in another run you get one cluster of nine genes with
negative coefficients, and in another run you get one cluster with
six genes with negative coefficients and one cluster of 15 genes
with positive coefficients.

All these differences between runs and the lack of comparability
might seem confusing: they are here because it is perfectly
possible to get many great results that nevertheless share very little
(in terms of genes and clusters) among them. Part of the point of
showing the results from the CV runs is to alert you to this
fact. With some data sets results tend to be very stable from run
to run; with others they are not.

For TGD, the complete list of all the genes selected for the six
different values of the threshold (tau) is given in the file
"genes.all.out", which you can download with the rest of the output
(see below).

Stability assessments

Genes selected in each of the cross-validation runs.

Number of shared genes. The number of genes selected in
common between any two runs (e.g., between the run with the
complete sample and the 2nd cross-validated run, or between
the 4th and 7th cv run). Of course, this is a symmetric matrix.

Proportion of shared genes (relative to row total). This
is the above table divided by the total number of genes
selected by the procedure of the given row. So suppose the
third row, the 2nd cv run, selected 20 genes. Then the third
row of this table is the same as the third row of the table
above divided by 20. Thus, this is not a symmetric
matrix.
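Both tables can be sketched directly from the per-run gene sets (run names and genes below are made up):

```python
def shared_matrices(runs):
    """runs: dict mapping run name -> set of selected genes.
    Returns the symmetric shared-gene counts and the row-normalised
    proportions (shared / number selected in the row's run), which
    is why the second table is not symmetric."""
    names = list(runs)
    counts = {(a, b): len(runs[a] & runs[b]) for a in names for b in names}
    props = {(a, b): counts[a, b] / len(runs[a]) for a in names for b in names}
    return counts, props

counts, props = shared_matrices({"all": {"g1", "g2", "g3"},
                                 "cv1": {"g2", "g3", "g4", "g5"}})
# counts["all", "cv1"] == counts["cv1", "all"] == 2
# props["all", "cv1"] == 2/3, but props["cv1", "all"] == 2/4
```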

Gene freqs. in cross-validated runs of genes selected in
model with all data. Take each of the genes selected from the
complete data run, and count how many times it shows up in
the selections from the cross-validation runs.

Gene frequencies in cross-validated runs. Count how many
times any gene (that shows up at least once in the
cross-validation runs) appeared among the selected ones in
the cross-validation runs.
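Both frequency tables reduce to a count over the cross-validation runs; a sketch with made-up gene sets:

```python
from collections import Counter

def gene_frequencies(cv_runs, all_data_genes=None):
    """Count how often each gene appears across the CV runs; if the
    genes from the all-data run are given, report only those."""
    freq = Counter(g for run in cv_runs for g in run)
    if all_data_genes is None:
        return dict(freq)                        # all genes ever selected
    return {g: freq[g] for g in all_data_genes}  # genes from the full run

gene_frequencies([{"g1", "g2"}, {"g1", "g3"}], all_data_genes={"g1", "g4"})
# -> {"g1": 2, "g4": 0}
```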

Validation data

We provide the linear scores; these are just the matrix product of
the coefficients times the values of the covariates (i.e., the
selected clusters). They are unlikely to provide any insight by
themselves, but they allow one to order cases, see which are very far away
from others, etc.

For random forests, the values provided are not linear scores but
expected survival time.

You can download a single
compressed file with all figures, in both the png format shown in the web
results as well as in pdf format ---which gives better quality for
printing---, and the results (a single text file). The format is tar.gz,
understood by GNU tar, standard in GNU/Linux distributions, and also
understood by the common uncompressors/unzippers/etc. in other operating
systems.

This program was developed by Ramón
Díaz-Uriarte. This tool uses
Python for the CGI and R for the computations. The
code for the FCMS approach was written from scratch by Ramón
Díaz-Uriarte, with collaboration from David Casado de Lucas. For
TGD, I started with the R code provided by Li
and Gui, and rewrote large parts of it to increase its speed and
parallelize it (using Rmpi and Snow, see below), and to add
additional functionality. For random forests I use the party
package by Torsten Hothorn, Kurt Hornik and Achim Zeileis. For
boosting I use mboost
by Torsten Hothorn and Peter Bühlmann with contributions by Thomas
Kneib and Matthias Schmid. I have also used the following R packages: CGIwithR by David
Firth, snow,
by Luke Tierney, A. J. Rossini, Na Li and H. Sevcikova, Rmpi, by Hao Yu,
and rsprng,
by Na (Michael) Li, class
by W. Venables and B. Ripley,
survival,
S original by Terry Therneau, ported by Thomas Lumley,
GDD by Simon Urbanek,
imagemap by
Barry Rowlingson,
R2HTML
by Eric Lecoutre,
papply,
by D. Currie, and
combinat,
by S. Chasalow. We have also
used the JavaScript aqtree
code by Stuart Langridge, for the expandable lists.

This application is running on a cluster of machines using Debian GNU/Linux as operating system and Apache as web server.

We want to thank the authors and contributors of these great (and open
source) tools that they have made available for all to use. If you find this
useful, and since R and Bioconductor are developed by a team of volunteers, we
suggest you consider making a donation to the R foundation for
statistical computing.

Funding partially provided by Project TIC2003-09331-C02-02 of the
Spanish Ministry of Education and Science. This application is running on a
cluster of machines purchased with funds from MINECO project BIO2009-12458.

You acknowledge that this Software is experimental in nature and
is supplied "AS IS", without obligation by the authors or the IIB,
to provide accompanying services or support. The entire risk as to
the quality and performance of the Software is with you. The IIB
and the authors expressly disclaim any and all warranties
regarding the software, whether express or implied, including but
not limited to warranties pertaining to merchantability or fitness
for a particular purpose.

If you use SignS for any publication, we would appreciate it if you
could let us know and cite our program (you know, "credit where
credit is due"). The reference is "Diaz-Uriarte, R. 2008. SignS: a
parallelized, open-source, freely available, web-based tool for gene
selection and molecular signatures for survival and censored data. BMC
Bioinformatics, 2008, 9:30."
In addition to the paper, you might want to provide the full URL: http://signs2.iib.uam.es.

We would appreciate feedback concerning bugs, errors, or misconfigurations.
Complaints or suggestions are welcome.

Uploaded data sets are saved in temporary directories on the server and are
accessible through the web until they are erased after some time. Anybody can
access those directories; nevertheless, the names of the directories are not
trivial, so it is not easy for a third person to access your data.

In any case, you should keep in mind that communications between the client
(your computer) and the server are not encrypted at all, so it is also
possible for somebody else to look at your data while you are uploading or
downloading them.
