MOTIVATION

At present, many studies use several "omics" platforms
in parallel to study the transcriptome, genome, proteome and epigenome of a cell.
The outcome of most techniques is commonly reduced to a list of genes
(differentially expressed, hypo-/hypermethylated, mutated, amplified/deleted and so on).
Although they are evidently dependent, in many cases the genes reported by different
"omics" techniques do not overlap.
Finding functional associations between the identified lists poses a bioinformatics challenge.
Here we propose a network-based framework
incorporating robust statistical principles
to estimate the significance of the inferred models.

OUR MODEL

In our model we have two input gene lists
(referred to as gene list a and gene list b)
that have been selected (based on some experimental results)
from some general sets of genes
(denoted as list A and list B)
representing either all genes from a genome
or genes that have been profiled by "omics" platforms
(for example, all genes on a chip).
We use a reference gene network
(external knowledge) that is assumed
to model how a signal spreads in the cell
from one gene to another.
At the moment our tool employs two reference gene networks: the Reactome pathway database and the IntAct database of protein interactions.

To start computations you need to provide:

gene list a

gene list b

reference gene set for list b (list B, optional)

Ideally you should also provide list B
(the reference set of genes from which list b was selected).
This is optional.
If list B is not provided, we assume that list B comprises all known genes.

METHODS

Gene vs. Gene List
The core of our approach is a statistical model that relates a gene
(gene "a") to a gene list (list b)
given a reference gene network and a list B.
To implement this we use the scheme presented in figure 1:

First, we compute the distance from gene "a"
to all genes from list b using the reference gene network.
Distance is defined as the minimal number of steps
required to get from one gene to another along the edges
of the reference network.

Second, we define the connectivity score Sab (between gene "a" and list b)
based on the number of genes from b at distance 1, 2, 3, ..., n from gene "a".

Third, to determine whether the connectivity score is significant
we use a Monte Carlo procedure.
We randomly sample from list B a gene list "r"
of the same size as list b.
We compute the connectivity score Sr
(between gene "a" and list "r").
We repeat the procedure N times (up to N = 10 000 if required)
to obtain the distribution Srj (j = 1, 2, ..., N)
of the connectivity score between gene "a"
and a random gene list
(of the same size as the input list b).
The significance (p-value) of the score
Sab is computed as p = k/N,
where k is the number of times the score
Sab was less than or equal to a score from the Srj distribution.
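
The three steps above can be sketched in Python as follows.
This is only an illustration under stated assumptions, not the tool's actual code:
the network is assumed to be an adjacency dictionary,
the function names are hypothetical,
and the score used here (each gene at distance d contributes 1/d, up to a maximum distance)
is an assumed stand-in, because the exact formula for Sab is not given in this description.

```python
import random
from collections import deque


def bfs_distances(network, source):
    """Minimal number of edges from `source` to every reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        gene = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[gene] + 1
                queue.append(neighbour)
    return dist


def connectivity_score(dist, gene_list, max_distance=3):
    """ASSUMED score: a gene at distance d (1..max_distance) adds 1/d."""
    score = 0.0
    for gene in gene_list:
        d = dist.get(gene)  # None if the gene is unreachable
        if d is not None and 1 <= d <= max_distance:
            score += 1.0 / d
    return score


def gene_vs_list_p_value(network, gene_a, list_b, list_B,
                         n_samples=1000, seed=0):
    """Monte Carlo estimate p = k/N for the score between gene_a and list_b."""
    rng = random.Random(seed)
    dist = bfs_distances(network, gene_a)  # computed once, reused per sample
    s_ab = connectivity_score(dist, list_b)
    background = sorted(list_B)  # fixed order keeps sampling reproducible
    k = 0
    for _ in range(n_samples):
        list_r = rng.sample(background, len(list_b))  # random list "r"
        if s_ab <= connectivity_score(dist, list_r):
            k += 1
    return s_ab, k / n_samples


# Toy undirected network (hypothetical gene names).
toy_network = {
    "g1": ["g2", "g3"],
    "g2": ["g1", "g4"],
    "g3": ["g1"],
    "g4": ["g2", "g5"],
    "g5": ["g4"],
    "g6": [],
}
score, p_value = gene_vs_list_p_value(
    toy_network, "g1", ["g2", "g3"], ["g2", "g3", "g4", "g5", "g6"]
)
print(score, p_value)
```

Because the distances from gene "a" do not depend on the sampled list,
they are computed once and reused for all N random lists.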

Gene List vs. Gene List

The "Gene vs. Gene List" procedure is repeated for each gene "a"
from the input list a. In this case we test a number of hypotheses
(equal to the number of genes in list a)
and need to apply a standard FDR procedure to adjust for multiple testing.
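
As an illustration of the multiple-testing adjustment,
the sketch below applies the Benjamini-Hochberg procedure to a list of per-gene p-values.
The text above only says "standard FDR procedure",
so the choice of Benjamini-Hochberg is an assumption,
and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in input order.

    ASSUMPTION: BH is used here as one common FDR procedure; the source
    does not name the specific method.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per gene "a" in the input list.
example_p = [0.01, 0.04, 0.03, 0.5]
example_adjusted = benjamini_hochberg(example_p)
print(example_adjusted)
```

Genes whose adjusted p-value falls below a chosen FDR threshold
would then be reported as significantly connected to list b.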

OUTPUT

As output, the genes from list a are ranked by the significance
of the inferred model in relation to list b.
The p-value of the model for a gene "a" can be interpreted
as the probability of obtaining the same (or a better) connectivity score Sab
for a random gene list of the same size as list b.
For each gene "a" from list a with a significant p-value,
a visualization of the network model is provided.
An example is presented in figure 2.