Research Goal

Common diseases such as cardiovascular disease, cancer, obesity, diabetes and psychiatric illnesses are caused by a combination of multiple genetic and environmental factors. Understanding how the genetic factors interact with each other and with the environment would allow better prevention, diagnosis and treatment of these diseases, including treatment individualized to the genetic make-up of each patient. Almost all existing approaches for studying the genetic causes of disease are restricted to the effects of a very small set of genes, and so cannot capture the subtle effects of many genes acting together. At MSR Cambridge, we are collaborating with researchers at the Wellcome Trust Sanger Institute to perform genome-level analysis that integrates genetic and functional genomic data to study the effects of multiple genes jointly.

The goal of our joint project with the Sanger Institute is to model multiple sources of genomic data within a common statistical framework using large-scale machine learning tools. By combining the expertise of the Sanger Institute and MSR Cambridge, we aim to gain new insights into genetic networks and the pathogenesis, diagnosis and treatment of human disease, whilst also driving the development of machine learning tools usable by the wider scientific community. The project uses two primary data sources: genetic variation (haplotype) sequences from the International HapMap Project, and high-throughput gene expression data from the Sanger Institute’s Population and Comparative Genomics group, in the first instance obtained by microarray analysis of cell lines from the HapMap project. The gene expression data is useful because expression acts as an intermediary between genetic variants, measured as single nucleotide polymorphisms (SNPs), and disease susceptibility. Notably, almost all previous approaches have studied the genetic basis of disease using either genetic variation data alone or gene expression data alone, whereas our approach combines these two complementary genomic data sets to understand the effects of genetic variation. Indeed, a preliminary study from the Sanger Institute analysed co-variation between pairs of SNPs and gene expression probes and discovered significant relationships between them.
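
To make the idea of pairwise co-variation between SNPs and expression probes concrete, the following is a minimal sketch and not the method used in the Sanger Institute study. It assumes each genotype is coded as an allele count (0, 1 or 2) per individual and uses plain Pearson correlation as the association measure; the function names, identifiers and toy values are all hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no association signal
    return cov / sqrt(var_x * var_y)


def pairwise_associations(genotypes, expression):
    """Correlate every SNP with every expression probe.

    genotypes:  dict of SNP id -> allele counts (0/1/2), one per individual
    expression: dict of probe id -> expression levels, one per individual
    Returns (snp_id, probe_id, r) tuples, strongest |r| first.
    """
    results = [
        (snp_id, probe_id, pearson_r(counts, levels))
        for snp_id, counts in genotypes.items()
        for probe_id, levels in expression.items()
    ]
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results


# Toy data for six individuals (hypothetical, for illustration only).
genotypes = {
    "snp_A": [0, 1, 2, 0, 1, 2],
    "snp_B": [2, 0, 1, 1, 0, 2],
}
expression = {
    "probe_1": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
    "probe_2": [2.0, 2.0, 2.1, 1.9, 2.1, 1.9],
}

ranked = pairwise_associations(genotypes, expression)
for snp_id, probe_id, r in ranked:
    print(f"{snp_id} vs {probe_id}: r = {r:+.3f}")
```

In this toy data, snp_A and probe_1 form the most strongly correlated pair. A real analysis would also need significance testing and correction for the very large number of SNP and probe pairs, which this sketch omits. Testing each pair independently is also exactly the limitation that a joint model of many genes is meant to overcome.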

Approach

The figure above illustrates our approach for jointly modelling the haplotype data and the gene expression data. The haplotype model is a statistical model of correlations within the haplotype data. SNP variants that lie close together on the genome are in non-random association with one another (linkage disequilibrium), leading to a block-like structure in the haplotypes, where blocks are conserved between different individuals. We aim to use a haplotype model that captures these correlated structures in order to represent the entire observed haplotype sequences compactly. SNPs can affect the functionality of genes in two main ways: SNPs in the coding regions of a gene can affect the protein that the gene codes for, whilst SNPs in its regulatory regions can affect how much the gene is expressed. Our framework separates out these two pathways so that direct and indirect effects on gene expression can be modelled in different ways. Genes are often co-regulated or co-expressed, leading to strong correlations in expression levels between different genes. Such correlations have been examined in previous gene expression models. As we have information from coding SNPs, we plan to extend such models to a richer interaction model that also captures the intra-cellular relationship between levels of protein activity and levels of gene expression. The final stage of our analysis will be to identify correlations between our haplotype and interaction models and the phenotypes or diseases of the individuals from whom the samples come. This should identify relationships between genetic variation and gene expression, and hence lead to an improved understanding of the genetic causes of human disease.
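
The block-like haplotype structure described above can be illustrated with the standard r² measure of linkage disequilibrium between two SNPs. This is a minimal sketch and not the project's haplotype model. It assumes phased haplotypes with alleles coded as 0 or 1; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes, i, j):
    """r^2 linkage disequilibrium between SNP positions i and j.

    haplotypes: list of phased haplotypes, each a list of 0/1 alleles.
    r^2 = D^2 / (p_i (1 - p_i) p_j (1 - p_j)), where D = p_ij - p_i * p_j.
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at SNP i
    p_j = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at SNP j
    # Frequency of haplotypes carrying allele 1 at both positions.
    p_ij = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a SNP with no variation cannot be in association
    return d * d / denom


# Toy haplotypes (hypothetical): SNPs 0-1 form one block, SNPs 2-3 another.
haplotypes = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

n_snps = len(haplotypes[0])
for i in range(n_snps):
    row = [ld_r_squared(haplotypes, i, j) for j in range(n_snps)]
    print(" ".join(f"{value:.2f}" for value in row))
```

In the printed matrix, SNP pairs within a block have r² of 1 and pairs across blocks have r² of 0. This redundancy within blocks is what allows a haplotype model to represent the full sequences compactly.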