Ch_15

DISEASE GENE PRIORITIZATION
Presented by Qian Huang
What is a disease
• Disease is a condition of the living animal or plant or one
•
•
•
•
•
of its parts that impairs normal function and is typically
manifested by distinguishing signs and symptoms
Definition also describes the malfunction of individual cells
or cell groups
Many diseases should be defined on a cellular level.
Sickle cell disease was first documented in 1904
sickle cell disease became the first disease to be
characterized on a molecular level in 1949
The first genetic diseases was discovered
Genetic diseases
• A genetic disease is any disease that is caused by an
•
•
•
•
abnormality in an individual's genome
It is rarely that one gene is responsible for one function.
An assembly of genes constitutes a functional module or
a molecular pathway.
a molecular pathway leads to some specific end point in
cellular functionality via a series of interactions between
molecules in the cell.
Any changes in the normally molecular interactions and
pathways may lead to disease
The specifics of a change determine the severity and the
type of the resulting disease
Genetic diseases
• inherited from the parents or caused by mutations
• a number of different types of genetic diseases
• Single gene disorder
- Mendelian or monogenetic inheritance
- caused by changes or mutations that occur in the DNA
sequence of a single gene
- over 4000 human diseases caused by single gene disorder
-occur in about 1 out of every 200 births
- dominant: Only one mutated copy of the gene will be
necessary for a person to be affected. one affected parent, 50%
chance
- recessive: Two copies of the gene must be mutated for a
person to be affected. Two unaffected people each carry one
copy of the mutated gene, 25% chance the child affected
Genetic diseases
• Chromosome abnormalities
- distinct structures made up of DNA and protein
- caused by abnormalities in chromosome number or
structure
- due to a problem with cell division
• Multifactorial gene disorder
- caused by a combination of environmental factors and
mutations in multiple genes
- heart disease, high blood pressure, cancer, diabetes,
obesity
Genetic diseases
• Identifying the relationship between human genetic
•
•
•
•
diseases and their causal genes is important in human
medical improvement
Revealing the genetic basis of human disease is a
fundamental aim of the human genetic studies
The Human Genomic Project started in 1990
The genomic studies rapid accumulate large amount of
genomic data
a lot of computational methods were proposed to prioritize
candidate casual genes by considering the relationship of
candidate genes of a given phenotype and existing known
disease genes
What is Gene Prioritization
• Gene prioritization is the process of assigning likelihood of
gene involvement in generating a disease phenotype.
• narrows down, and arranges the set of genes to be tested
experimentally.
• based on various correlative evidence that associate each
gene with the given disease and suggest possible causal
links
• Evidence comes from high-throughput experimentation,
including gene expression and function, pathway
involvement, and mutation effects
Why using Gene Prioritization
• Proving a causal link between a gene and a disease
experimentally is expensive and time-consuming
• Using computational prioritization of candidate genes prior
to experimental testing can drastically reduce the
associated costs and improve the outcomes of targeted
experimental studies
• High-throughput experimental techniques has contributed
significantly to the identification of disease-associated
genes and mutations and reported a large number of data
• Gene prioritization is a computational method to deal with
the quantity of data, effectively translate the experimental
data into legible disease-gene associations
Identification of disease-genes
• Disease results from the changes of normal function
• Four reasons of pathway function changes
(1) changes in gene expression
(2) changes in structure of the gene-product
(3) introduction of new pathway members
(4) environmental disruptions
• Defining molecular pathways whose disrupted
functionality is necessary and sufficient to cause the
disease
• All members of the affected pathways can be construed
as disease genes
• Identification of disease-genes is difficult
How to identify disease-genes
• Disease genes are most often identified using:
(1) genome wide association or linkage analysis studies
(2) similarity or linkage to co-expression with known
disease genes
(3) participation in known disease-associated pathways or
compartments.
• Methods represented by direct and indirect evidence
o Direct : evidence coming from own experimental work
and from literature
o Indirect: genes that are in any way related to already
established disease-associated genes
Indirect evidence
Very broadly, gene-disease associations are inferred from
evidence of five aspects
(1) Functional Evidence
The suspect gene is a member of the same molecular
pathways as other disease-genes
(2) Cross-species Evidence
The suspect gene has homologues implicated in
generating similar phenotypes in other organisms
Indirect evidence
(3) Same-compartment Evidence
The suspect gene is active in disease-associated pathways
(e.g. ion channels), cellular compartments (e.g. cell
membrane), and tissues (e.g. Liver)
(4) Mutation Evidence
The suspect genes are affected by functionally deleterious
mutations in genomes
(5) Text Evidence
There is ample co-occurrence of gene and disease terms in
scientific texts
Overview of gene prioritization data flow
Molecular Interactions
• Many gene prioritization tools used gene-gene (protein-
•
•
•
•
•
•
protein) interaction and pathway information to prioritize
candidate genes.
genes responsible for similar diseases often participate in the
same interaction networks
MC4R is a receptor and known to be associated with severe
obesity
The interactors of MC4R may be predicted to be linked to
obesity.
AgRP and POMC directly bind MC4R for varied purposes of
the MC4R pathway.
mutations that negatively affect normal POMC production or
processing have been shown to be obesity- associated
AgRP have been linked to food intake abnormalities
MC4R-centered protein-protein interaction
network
Regulatory and genetic linkage
• Co-regulation of genes has traditionally been thought to
point to same molecular pathways and similar disease
• Co-expressed genes often cluster together in different
species lead to genetic linkage
• Genes co-expressed with or genetically linked to other
disease genes are also likely to be disease-associated
• They also pose a problem
Problem with Regulatory and genetic
linkage
o A given disease-associated gene may be co-regulated
with or linked to another disease-associated gene
o The two diseases are not identical
o It is difficult to distinguish the actual causes of disease
and co-occurring with the disease-mutations due to
genetic linkage.
Similar sequence/structure/ function
• Prioritization tools often use functional similarity as an
input feature
• Predictors relying on functional similarity to determine
disease association will link two genes sharing a same
function
• Functionally similar genes are likely to produce similar
disease phenotypes, sequence/structure similarities are
indicators of similar disease involvement
• Disease genes are often associated with specific gene
and protein features
o higher exon number
o longer gene length
Cross-species Evidence
• Cross-species comparisons of orthologues and their
•
•
•
•
associated phenotype
Finding related phenotypes across species suggests
orthologous human candidate genes
MC4R is known to be associated with severe obesity
Polar bears have a V95I mutation on MC4R for their need
to increase body fat to adapt to their environment
may have a similar (increased body fat) effect in humans
Cross-species Evidence
• A correlation of gene co-expression across species is also
•
•
•
•
•
useful for gene prioritization
Genes that are part of the same functional module are
generally co-expressed
functionally unrelated genes also could be co-expressed
Comparing genes co-expressed in human and other
organisms can be used to infer disease-genes
A cluster of functionally unrelated genes co-expressed in
human and mouse contained a disease-gene KCNIP4
The initial list of 1,762 genes mapped to 850 OMIM
(Online Mendelian Inheritance in Man)phenotypes narrow
to twenty times fewer possible disease-causing genes.
Compartment Evidence
• Changes in gene expression in disease-affected
compartments and tissues are associated with many
complex diseases
• Predict suspect gene in the disease-associated pathway,
compartments and tissues.
• multiple storage diseases all are caused by the
impairment of the degradation pathways of the
intracellular transport.
Mutant Evidence
• Every genetic disease is associated with some sort of mutation
that alters normal functionality
• Selection of candidate genes for further analysis is often based
on mutations in diseased individuals
• not all observed mutations are associated with deleterious
effects:
(1) no effect at all - silent mutations
(2) some is deleterious with respect to normal function
(3) weakly beneficial
• strongly deleterious mutation are relatively rare because they
are rapidly removed by selection
• A candidate gene carrying a deleterious mutation is more likely
to be disease-associated than gene with other mutation or no
mutation at all
Mutant Evidence
• Structural variation (SV)
insertions and deletions, inversions, translocations..
• Nucleotide polymorphisms
SNPs (single nucleotide polymorphisms)
MNPs (multi-nucleotide polymorphisms
• 90% of human variation exists in the form of Nucleotide
polymorphisms
Structural variation
• Structural variation (SV) is the least studied of all types of
mutations
• less than 10% of human genetic variation is in the form of
genome structural variants
• Each of the structural variants is large, the total number of
base pairs affected by SVs may actually be comparable to
the number of base pairs affected by the much more
common SNPs
• Gross changes to genome sequence are very likely to be
disease associated
Structural variation
• High throughput detection of structural variants is difficult
• there are only 180 thousand structural variants reported in
one of the most complete mutation collections-DGV
• Do not currently know what proportion of genetic disease
is caused by SVs
• Disease is caused by change of a sequence, all of the
genes found in these regions of the genome are, by
default, associated with the disease, but none of them can
be considered primarily causal
• If diseases that are associated with SVs, the prioritization
of disease-causing genes is only finding those that are
directly affected by the mutation
Nucleotide polymorphisms
• A single human genome is expected to contain roughly
10–15 million SNPs per person
• 93% of all human genes contain at least one SNP
• MNPs are rare as compared to SNPs
• nearly 43 million validated human SNPs
- coding SNPs
- non-coding SNPs
coding SNPs and non-coding SNPs
(1) coding SNPs
17.5 million have been experimentally mapped to
functionally distinct regions of the genome, only 0.4M are in
coding region while the rest are in non-coding region
Coding SNPs are over-represented in disease associations
e.g. OMIM contains 2430 non-coding SNPs (0.0001% of
all) and 5327 coding ones (0.01% of all)
- synonymous (no effect on protein sequence)
- non-synonymous (single amino acid substitution)
(2) non-coding SNPs
more prevalent than coding SNPs because the majority
of the genome is non-coding
non-synonymous SNPs
• more studied
• two types:
- Missense : change results in a different amino acid
- Nonsense : produce a premature stop codon
• nonsense mutants result in early termination of the
protein, very often associated with disease
• Missense SNPs alter the protein sequence without
destroying it, may or may not be disease associated
• most methods estimate that only 25–30% of the nsSNPs
negatively affect protein function
Nucleotide polymorphisms
• Identifying and annotating functional effects of SNPs and
•
•
•
•
MNPs is important in the gene prioritization
Genes selected for further disease-association studies
are more likely to contain a deleterious mutation
A number of methods were created for identifying
mutations as functionally deleterious in regulatory regions
Coding synonymous SNPs have recently been shown to
have the same chance of being involved in a disease as
non-coding SNPs due to reasons such as codon usage
bias
Few computational methods are able make predictions
with their functional effects
Text Evidence
• Huge amounts of data could potentially improve the
performance of any gene prioritization method
• Specialized tools can be used to prioritize diseases
associated genes
• Researchers make their data computationally available
from databases
• Depositing knowledge obtained through reading and
manual curation into databases
Text Evidence
• Hidden in plain site in natural language text of scientific
publications
• A casual search in PubMed for the term breast cancer
generates over two hundred thousand matches. Limiting
the field to genetics of breast cancer reduces to fifty
thousand.
• Scientific text mining tools allow for intelligent
identification of possible gene-gene and disease-gene
correlations
The Inputs and Outputs
• Functionality of prioritization methods is defined by previously
•
•
•
•
•
known information about the disease and candidate search
space
Disease information: disease-associated genes, affected
tissues and pathways and relevant keywords
candidate search space :
- automatically selected by the tool (the entire genome)
- submitted by the user (suspect genes)
providing a list of very broad keywords may reduce the
performance specificity, while incorrect candidate search space
automatically decreases sensitivity
Output is produced based on input, produce ranked/ordered
lists of genes
The prioritization accuracy depends on the accuracy and
specificity of the inputs
The Processing
• Gene prioritization methods use different algorithms to
product all the data they extract
o mathematical/statistical models/methods
o fuzzy logic
o artificial learning devices
o Some methods use combinations of the above.
• No one methodology is better than the others for all data
inputs
• refer to relevant tool publications and method-specific
literature to get details on computational methods used in
the various approaches
Figure 5. Predicting gene-disease involvement using artificial neural networks (ANNs).
Bromberg Y (2013) Chapter 15: Disease Gene Prioritization. PLoS Comput Biol 9(4): e1002902. doi:10.1371/journal.pcbi.1002902
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002902
Gene prioritization tools
• Many different Gene prioritization tools have been
developed
• Endeavour:
- a web resource for the prioritization of candidate gene
- inferring several models (based on various genomic data
sources)
- applying each model to the candidate genes to rank
those candidates against the profile of the known genes
and
- merging the several rankings into a global ranking of the
candidate genes
Gene prioritization tools
• G2D (genes to diseases):
- a web resource for prioritizing candidates genes
- uses three algorithms based on different prioritization
strategies
- input:
(1) the genomic region where the user is looking for the
disease-causing mutation,
(2) an additional piece of information depending on the
algorithm used
- output in every case is an ordered list of candidate genes
in the region of interest
Summary
• Gene prioritization methods are developed to link genes
to diseases by extracting and combining the various
information
• Rely on experimental work such as disease gene linkage
analysis and genome wide studies to establish the search
space of candidate genes
• Use mathematical and computational models of disease
to filter the original set of genes based on gene and
protein sequence, structure, function, interaction and
expression information
• Translate the experimental data into legible disease-gene
associations effectively