Abstract

Following the publication of the complete human genomic sequence, the post-genomic
era is driven by the need to extract useful information from genomic data. Genomics,
transcriptomics, proteomics, metabolomics, epidemiological data and microbial data
provide different angles to our understanding of gene-environment interactions and
the determinants of disease and health. Our goal and our challenge are to integrate
these very different types of data and perspectives of disease into a global model
suitable for dissecting the mechanisms of disease and for predicting novel therapeutic
strategies. This review aims to highlight the need for and problems with complex data
integration, and proposes a framework for data integration. While there are many obstacles
to overcome, biological models based upon multiple datasets will probably become the
basis that drives future biomedical research.

Genetic analysis in the post-genomic era

In 1990, the human genome project was established to sequence the human genome [1], with the aim of applying the acquired genomic data to improve disease diagnosis
and determine genetic susceptibility [2]. The publication of the first draft sequence of the human genome in 2001 [3] was thus followed by a rapid growth of different approaches to extract useful information
from the genomic sequence. These approaches included, but were not limited to, the
analysis of genetic variation (genomics), gene expression (transcriptomics), and gene
products (proteomics) and their metabolic effects (metabolomics).

Each of these post-genomic approaches has already contributed to our understanding
of specific aspects of the disease process and the development of diagnostic/prognostic
clinical applications. Cardiovascular disease [4,5], obesity [6-8], diabetes [9-11], autoimmune disease [12,13] and neurodegenerative disorders [14,15] are some of the disease areas that have benefited from these types of data. Taking
the metabolic syndrome as an example, our knowledge of all aspects of the disease
has grown. The metabolic syndrome is the result of a complex bioenergetic problem
characterized by disturbances in lipid, carbohydrate and energy metabolism and blood
pressure. In combination, these metabolic factors contribute to an increased susceptibility
to cardiovascular disease, morbidity and mortality [16]. Genome-wide association (GWA) studies have identified possible genes involved in
each aspect of the syndrome, namely type 2 diabetes [11], obesity [17] and hyperlipidaemia [18]. The findings have confirmed the role of certain candidate genes as well as the polygenic
nature of the syndrome. Not surprisingly, replicate GWA studies of type 2 diabetes
revealed that the genes associated with disease, among others, are involved in beta-cell
function and adipocyte biology [11,17,19]. In contrast, genes found to be associated with obesity appear to be those that are
predominantly involved in central appetite regulation [20-22] as key contributors to positive energy balance.

Genetic association studies in epidemiology have highlighted a number of issues. Firstly,
many common disease states are related to either many genetic polymorphisms of small
effect or, in selected cases, to a few of large effect. The involvement of multiple
genes with unequal contributions to disease hints at complex gene-gene and gene-environment
interactions. The understanding of such interactions becomes a daunting task when
other modulating factors remain unknown. Secondly, some common diseases such as type
2 diabetes [12] appear to be relatively less genetically determined compared to diseases such as
rheumatoid arthritis [12] and obesity [23]. In these situations, our understanding of pathophysiology requires additional data
outside of genomic information. Thirdly, the initial failures to find robust replicable
associations between most of the identified genetic variants and common complex diseases
suggest that genomic analysis alone will not account for all of the heritability and
phenotypic variation [9,24]. For this reason, there is a growing need to incorporate information derived from
environmental studies and post-genomic data into genetic analysis.

Advantages of combining multiple types of data

It is clear that the genetic approach captures only one layer of the complexity inherent
within human biology. There is thus a need to integrate multiple 'omics' datasets
when aiming to unravel the molecular networks underlying common human disease traits
[25]. Attempts have been made to combine two datasets in relation to the clinical phenotype,
and this is reflected in the combination of terms found in the literature, for example
metagenomics, pharmacogenomics and epigenetics. Many of the post-genomic approaches
linking the genetic association data with other 'omics' layers focus on the use of
'omics'-derived phenotypic data as quantitative traits. The utility of such approaches
has previously been demonstrated in plant functional genomics, by combining genetics
and metabolomics [26]. More recently, such approaches have also been applied to human datasets. For example,
Papassotiropoulos and colleagues [15] identified clusters of cholesterol-associated susceptibility genes for Alzheimer's
disease by combining genetics with sterol profiling, while Gieger and colleagues [27] used ratios of metabolites to identify the function of putative genes. In another
study, proteomics was linked to quantitative trait loci (QTL) in an attempt to identify
changes in function rather than quantity of the protein [28].

By combining multiple types of techniques, including genetics, transcriptomics, proteomics
and metabolomics, we are expecting a shift toward 'environmentome' research, where
all available information from periconception to disease onset, using both longitudinal
and cross-sectional experimental designs, can be obtained [9]. The measurement of traits that are modulated but not encoded by the DNA sequence,
commonly referred to as intermediate phenotypes, is of particular interest. These
intermediate phenotypes include not only biochemical (metabolites) and genomic (gene
expression) traits, but also an individual's microbial (gut microflora) [29,30] and social traits. It is conceivable that by comprehensively examining an individual's
'environmentome', we would be able not only to understand both the genetic and environmental
determinants of disease, but also to develop 'feasible' personalized medicine, that
is, tailor specific personalized interventions to the individual's own environmental
profile. As a pioneering example of this kind, Orešič and colleagues [10] investigated metabolic profiles of children between birth and type 1 diabetes onset
in a large birth cohort, and established that specific metabolic phenotypes, not dependent
on human leukocyte antigen (HLA)-associated genetic risk, precede the first autoimmune
response. The excitement of this research lies in the expectation that these early metabolic
phenotypes may be validated as specific diagnostic and prognostic markers of disease,
with therapeutic implications.

Establishing disease causality as a framework for data integration

The goal of inferring disease causality and disease mechanisms from integrated data
is complicated by the fact that measuring more variables may provide a better characterization
of the process but still does not contribute directly to our understanding of cause
and effect. In fact, given the progressively increasing number of variables that we
can measure, the odds of finding spurious associations that do not reflect true causality
are much higher. Confounding and reverse causality are among the main sources of bias
for failures to replicate apparently robust associations between risk factors and
diseases [31]. Confounding refers to a spurious causal effect inferred from the association
between a risk factor and a disease when that association in fact arises from common causes
of both, that is, confounding factors. This type of spurious causal effect can be
removed if we have enough knowledge about the most likely confounding factor candidates.
However, the truth is that for most epidemiological studies confounding factors are
unknown and difficult to measure, especially in case-control studies. Reverse causality,
the second source of bias, refers to an alternative explanation for the observed association
between a risk factor and disease, which states that the 'risk factor' is a result
of the disease, rather than vice versa. The problem of reverse causality is particularly
prevalent in retrospective case-control studies.
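
The effect of an unmeasured common cause can be illustrated with a small simulation. The following is a minimal sketch, not taken from any of the cited studies: the variable names, effect sizes and sample size are assumptions chosen purely for illustration. A hypothetical confounder drives both a 'risk factor' and a 'disease score', which have no direct causal link. The crude correlation is nevertheless substantial, and it vanishes only once the confounder is adjusted for.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """Residuals of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000                # assumed sample size

# Hypothetical confounder that influences both variables below.
confounder = [rng.gauss(0, 1) for _ in range(n)]
# Neither variable depends on the other; both depend on the confounder.
risk_factor = [c + rng.gauss(0, 1) for c in confounder]
disease_score = [c + rng.gauss(0, 1) for c in confounder]

# Crude association, ignoring the confounder.
crude_r = pearson(risk_factor, disease_score)
# Association after regressing the confounder out of both variables.
adjusted_r = pearson(residuals(risk_factor, confounder),
                     residuals(disease_score, confounder))

print(f"crude correlation:    {crude_r:.2f}")
print(f"adjusted correlation: {adjusted_r:.2f}")
```

In real studies the confounder is usually unobserved, so the adjustment step shown here is not available; this is the difficulty described above.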

One example of a potential confounding association is the established epidemiological
evidence of a strong link between obesity and insulin resistance. This association
has recently been called into question by the identification of specific clinical
settings where fat mass dissociates from insulin resistance [32,33]. This implies that adipose tissue expansion typically associated with obesity per
se may not be the cause of metabolic complications. A potential alternative explanation
may be related to an individual's ability to optimally store fat. In the presence
of caloric excess, a person is likely to remain metabolically healthy despite obesity,
provided their adipose tissue can continue to expand and safely store fat [34]. Therefore, while the epidemiological evidence associates the risk of metabolic complication
with increased body weight, this relationship may not be direct and may not necessarily
reflect a truly biologically relevant process.

A randomized controlled trial (RCT) is the gold standard for excluding the spurious
association that arises from confounding and reverse causality. An RCT involves random
allocation of risk factors to subjects, such that distribution of known and unknown
confounders in the different groups is roughly equal, that is, the risk factors become
dissociated from any confounders by the randomization. Furthermore, because
randomization precedes the disease response, reverse causality is rendered highly
unlikely. However, the use of RCTs to determine causality is often
not possible due to enormous ethical, financial or technical difficulties.
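
Why random allocation removes confounding can be sketched with a second simulation. This is an illustrative example under stated assumptions, not a model of any real trial: the true effect size, the strength of the confounder and the sample size are all hypothetical. The same outcome model is evaluated twice, once with exposure that depends on a confounder (as in an observational study) and once with exposure assigned by a coin flip (as in an RCT).

```python
import random


def mean(values):
    return sum(values) / len(values)


def effect_estimate(exposed, outcome):
    """Difference in mean outcome between exposed and unexposed subjects."""
    treated = [o for e, o in zip(exposed, outcome) if e]
    control = [o for e, o in zip(exposed, outcome) if not e]
    return mean(treated) - mean(control)


rng = random.Random(1)  # fixed seed for reproducibility
n = 20000               # assumed number of subjects
TRUE_EFFECT = 0.5       # assumed true causal effect of exposure on outcome

# Hypothetical confounder, unknown to the investigator.
confounder = [rng.gauss(0, 1) for _ in range(n)]

# Observational setting: subjects with a high confounder value are more
# likely to be exposed.
observed_exposure = [c + rng.gauss(0, 1) > 0 for c in confounder]
# Randomized setting: exposure is allocated by a fair coin flip.
randomized_exposure = [rng.random() < 0.5 for _ in range(n)]


def simulate_outcome(exposure):
    """Outcome depends on exposure, on the confounder, and on noise."""
    return [TRUE_EFFECT * e + c + rng.gauss(0, 1)
            for e, c in zip(exposure, confounder)]


observational = effect_estimate(observed_exposure,
                                simulate_outcome(observed_exposure))
randomized = effect_estimate(randomized_exposure,
                             simulate_outcome(randomized_exposure))

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational:.2f}")  # biased upwards
print(f"randomized estimate:    {randomized:.2f}")     # close to the truth
```

Under these assumptions the observational estimate is inflated because exposed subjects also tend to have higher confounder values, whereas randomization balances the confounder across the two arms without it ever being measured.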

An alternative to RCTs could be Mendelian randomization, which has been proposed as
a practical strategy to overcome the problem of experimental bias while significantly
reducing the difficulties inherent to RCTs [35,36]. The experimental design of Mendelian randomization aims at providing a potential
way to discern true causality from spurious associations, provided that several basic
assumptions are valid (Figure 1). The idea of Mendelian randomization originated from Katan's letter to The Lancet
[37], where the main objective was to test the hypothesis that low serum cholesterol increases
the risk of cancer against the alternative that cancer induces a lowering of
cholesterol, that is, to test against reverse causality. Using the language
of graphical models [38], Mendelian randomization can be formulated in a triangulation representation as
shown in Figure 1. The essence of Mendelian randomization is the use of a genetic variant as a proxy
for the random assignment of a risk factor to subjects, given that the inheritance
of the genetic variant in a population is also random according to Mendel's second
law. Mendelian randomization may provide a rational approximation to RCTs that can
be used to identify real causal factors contributing to diseases.

Figure 1. A causal model based upon Mendelian randomization. The model demonstrates the core assumptions for making a valid causal inference
between a phenotype and disease. The three assumptions are: (1) genotype is independent
of the confounder; (2) genotype is associated with phenotype; (3) genotype is independent
of disease conditioning on phenotype and confounder. If these assumptions are valid,
then an observed association between genotype and disease would imply causality
from phenotype to disease.
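
The logic of the triangulation can be made concrete with a simulated example. The sketch below is an assumption-laden illustration rather than the method of any cited study: it uses the ratio (Wald) estimator, a standard instrumental-variable estimator that the text itself does not name, and all effect sizes, the allele frequency and the sample size are hypothetical. The data are generated so that the three assumptions of Figure 1 hold by construction, with an unobserved confounder acting on both phenotype and disease.

```python
import random


def covariance(x, y):
    """Population covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


rng = random.Random(2)  # fixed seed for reproducibility
n = 50000               # assumed cohort size
TRUE_EFFECT = 0.3       # assumed causal effect of phenotype on disease liability
ALLELE_FREQ = 0.3       # assumed frequency of the effect allele

# Assumption 1: genotype (0, 1 or 2 effect alleles) is drawn independently
# of the confounder, mimicking random inheritance.
genotype = [(rng.random() < ALLELE_FREQ) + (rng.random() < ALLELE_FREQ)
            for _ in range(n)]
confounder = [rng.gauss(0, 1) for _ in range(n)]

# Assumption 2: genotype is associated with the intermediate phenotype.
phenotype = [0.5 * g + u + rng.gauss(0, 1)
             for g, u in zip(genotype, confounder)]

# Assumption 3: genotype influences disease only through the phenotype.
disease = [TRUE_EFFECT * p + u + rng.gauss(0, 1)
           for p, u in zip(phenotype, confounder)]

# Naive regression slope of disease on phenotype, biased by the confounder.
naive_estimate = (covariance(phenotype, disease)
                  / covariance(phenotype, phenotype))

# Wald ratio: genotype-disease association divided by the
# genotype-phenotype association.
wald_estimate = (covariance(genotype, disease)
                 / covariance(genotype, phenotype))

print(f"true effect:    {TRUE_EFFECT:.2f}")
print(f"naive estimate: {naive_estimate:.2f}")
print(f"Wald estimate:  {wald_estimate:.2f}")
```

Because the genotype is unrelated to the confounder, the ratio of the two genotype associations recovers the causal effect approximately, while the direct phenotype-disease regression does not. If any of the three assumptions were violated in the simulation, for example by letting genotype act on disease directly, the ratio would no longer be a valid estimate.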

Data integration based upon Mendelian randomization

We envisage that the potential of combining different post-genome approaches for discovering
disease causality and mechanisms could be integrated within the framework of Mendelian
randomization. In order to apply this idea to distinguish between association and
causation, we need to first justify the three core assumptions that underlie the applicability
of Mendelian randomization (Figure 1). Two of the three assumptions (1 and 3) depend on unobserved confounding factors
and, therefore, cannot be formally tested from observable data. Consequently, the three
associations that are needed in the Mendelian randomization model, that is, the genotype-phenotype
association, the phenotype-disease association, and the genotype-disease association,
require a certain degree of initial characterization. Clearly, these initial models
will need to be continually refined as new data challenge the validity of the assumptions.
The downstream impact of these assumptions is not trivial, as a failure to detect
robust associations could invalidate the power of Mendelian randomization. While this
may imply that Mendelian randomization requires our complete understanding of the
biological system, in practice some apparent violations may not actually negate its
biological implications [36,39]. Applied carefully, Mendelian randomization can become a useful framework for data
integration.

In determining truly positive associations in the presence of a large number of variables
and relatively few samples, one needs to resort to novel statistical techniques that
can handle such complexity. Bayesian statistical methods can be seen as an alternative
to conventional hypothesis testing and appear better able to deal with large post-genomics
datasets. In contrast to conventional P-value-centered statistics, a Bayesian approach provides a measure of the probability
of a hypothesis being true by taking all available evidence into account explicitly. This is clearly
a desirable feature as it allows different forms of data to be combined into a unified
hypothetical model. Competing models are then entered into a selection framework such
that the hypotheses that are most supported by data are favored. For example, using
the language of a causal Bayesian network [40,41], Mendelian randomization can be explicitly represented in the graphical model as
shown in Figure 1, in which the directions of the arrows (or edges) between the nodes indicate non-reversible
causal relationships and reflect the three core assumptions made. The plausibility
of the graphical model can then be tested using Bayes' rule, with the evidence
provided by all available 'omics' data from different studies. A pioneering example
of using a Bayesian network to infer disease causality can be found in reference [42], where three possible model networks that characterize the relationships between
QTLs, RNA levels and disease traits were evaluated. However, it should be noted that
most of the current applications of Bayesian networks consider phenotypes and disease
traits as discrete rather than continuous variables; this is due to the computational
difficulties of model selection from an extremely large model space.
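
The idea of entering competing causal models into a selection framework can be sketched as follows. This is a deliberately simplified illustration and should not be read as the procedure of reference [42]: the three candidate structures (labelled 'causal', 'reactive' and 'independent' here), the linear-Gaussian form of each edge, the simulated data, and the use of the Bayesian information criterion (BIC) as a rough stand-in for the marginal likelihood are all assumptions made for this example. Equal prior probabilities are assumed for the three models.

```python
import math
import random


def max_loglik(y, x):
    """Maximized log-likelihood of the model y = a + b*x + Gaussian noise."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    sigma2 = rss / n  # maximum-likelihood estimate of the noise variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


rng = random.Random(3)  # fixed seed for reproducibility
n = 2000                # assumed number of individuals

# Simulated data in which genotype acts on the trait through RNA level.
genotype = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
rna = [0.8 * g + rng.gauss(0, 1) for g in genotype]
trait = [0.7 * r + rng.gauss(0, 1) for r in rna]

# Log-likelihood of each candidate network, factorized along its edges.
loglik = {
    "causal": max_loglik(rna, genotype) + max_loglik(trait, rna),
    "reactive": max_loglik(trait, genotype) + max_loglik(rna, trait),
    "independent": max_loglik(rna, genotype) + max_loglik(trait, genotype),
}

# Each network has two regressions with three parameters each.
N_PARAMS = 6
bic = {name: N_PARAMS * math.log(n) - 2 * ll for name, ll in loglik.items()}

# Convert BIC differences into approximate posterior model probabilities.
best_bic = min(bic.values())
weights = {name: math.exp(-0.5 * (b - best_bic)) for name, b in bic.items()}
total = sum(weights.values())
posterior = {name: w / total for name, w in weights.items()}
best_model = max(posterior, key=posterior.get)

for name, prob in posterior.items():
    print(f"{name:12s} approximate posterior probability: {prob:.3g}")
```

With only three candidate structures the model space can be enumerated exhaustively. The computational difficulty noted above arises because the number of possible networks grows very rapidly with the number of variables, so that exhaustive enumeration quickly becomes infeasible.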

Major methodological challenges with complex data integration

While the use of heterogeneous high-dimensional post-genomic data carries many potential
benefits, several challenges exist in the areas of biological interpretation, computing
and informatics, which will need to be addressed to take full advantage of the wealth
of post-genomic data. See Box 1 for the key issues.

Conclusion

Over the last few years, biomolecular research has progressed from the completion
of the human genome project to functional genomics and the application of this knowledge
to advance our understanding of health and disease. It is clear that genomic information
alone, although crucial, is not sufficient to completely explain disease states, which
involve the interaction between genome and environment. Post-genomic approaches attempt
to contribute to our understanding of this interaction, with each approach capturing
a different angle of the global picture. Intuitively, the next step forward is to
integrate these datasets, an approach that, if successful, could be much more informative
and predictive than working exclusively on a single platform.

Associating and correlating variables between datasets as a means of integrating the
large datasets is fraught with issues such as extracting biological meaning (biology
is not always linear and is often context dependent) and determining causality and
spurious associations. We propose that data integration should be built upon a model,
such as a Bayesian model, that takes into account the non-linearity and context-dependent
nature of human biology. We further propose that a putative biological relationship
between individual data points, identified through association studies, can be efficiently
tested (and validated) using strategies, such as Mendelian randomization, that approximate
the design strengths of an RCT. While there are clearly obstacles that need to be overcome,
biological models based upon multiple datasets are likely to become the basis that
drives future research.

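Abbreviations

GWA, genome-wide association; HLA, human leukocyte antigen; QTL, quantitative trait loci;
RCT, randomized controlled trial.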

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed equally to this work.

Authors' information

JT is a postdoctoral researcher in MO's group, focusing on developing applications
of Bayesian statistics to integration of heterogeneous genomic and post-genomic data.
MO is research professor of systems biology and bioinformatics. His main research
areas are metabolomics applications in biomedical research and integrative bioinformatics.
CYT is a clinical research fellow in AVP's group, focusing on a systems-biology approach
to studying obesity-related metabolic complications. AVP is a reader in metabolic
medicine at Cambridge University.

Acknowledgements

This project was supported by the ATHEROREMO project (FP7-HEALTH-2007-A contract number
201668) funding to MO, HEPADIP project (EU FP6 Contract LSHM-CT-2005-018734) funding
to AVP and MO, and MRC-CORD funding to AVP.