Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Friday, October 16, 2009

A computational evolution system for open-ended automated learning of complex genetic relationships

I will be giving a talk on the following topic at IGES on Tuesday morning. I will also be presenting this as a poster at ASHG.

A computational evolution system for open-ended automated learning of complex genetic relationships.

Jason H. Moore, Doug Hill, Casey S. Greene

The failure of genome-wide association studies to reveal the genetic architecture of common diseases suggests that it is time that we embrace, rather than ignore, the complexity of the genotype-to-phenotype mapping relationship that is characterized by epistasis, plastic reaction norms, heterogeneity and other phenomena such as epigenetics. The extreme complexity of the problem suggests that simple linear models and other approaches that assume simplicity are unlikely to capture the full spectrum of genetic effects. To this end, we have developed an open-ended computational evolution system (CES) that makes no assumptions about the underlying genetic model and can learn through evolution by natural selection how to solve a particular genetic modeling problem. This is accomplished by providing the basic mathematical building blocks (e.g. +, -, *, /, LOG, <, >, =, AND, OR, NOT etc.) for models that can take any shape or form and the basic building blocks for algorithmic functions (e.g. ADD, DELETE, COPY, etc.) that can manipulate genetic models in a manner that is dependent on expert statistical and biological knowledge or prior modeling experience. We have previously demonstrated that our CES approach has excellent power to detect epistatic relationships in genome-wide data across a wide range of heritabilities and sample sizes (Moore et al. 2008, 2009). We have also previously shown that this system can learn to utilize one of many sources of expert knowledge, thus providing an important clue as to how the system solves the problem (Greene et al. 2009). Here, we add a layer to our CES approach that introduces noise into the training data (5%, 10%, 15% and 20%) to drive the discovery process toward models that are more likely to generalize. We show using simulated epistatic relationships in genome-wide data that the CES with introduced noise leads to significantly smaller models (P<0.001), thus reducing false positives and overfitting while maintaining a power of 97% to 100%. These results are important because they show how introduced noise in the data can yield more parsimonious models and reduce overfitting without the need for computationally expensive cross-validation. This study is an important step towards a paradigm of genetic analysis that makes few assumptions about a genetic architecture that is very complex.
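
To make the noise step concrete, here is a minimal Python sketch of one way noise could be introduced into training data. The abstract does not say whether noise is applied to genotypes or to case/control status, or how it is drawn. This sketch assumes that a fixed fraction of binary case/control labels is flipped. The function `add_label_noise`, its `seed` parameter and the toy label vector are hypothetical and are not part of the CES.

```python
import random


def add_label_noise(labels, rate, seed=None):
    """Return a copy of binary (0/1) labels with a fixed fraction flipped.

    Assumption: noise means flipping case/control status. The original
    abstract does not specify the noise mechanism.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    # Choose distinct positions so exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Toy data: 100 controls (0) followed by 100 cases (1).
labels = [0] * 100 + [1] * 100
for rate in (0.05, 0.10, 0.15, 0.20):
    noisy = add_label_noise(labels, rate, seed=1)
    changed = sum(a != b for a, b in zip(labels, noisy))
    print(f"{rate:.0%} noise: {changed} of {len(labels)} labels flipped")
```

A model search would then be trained on `noisy` rather than `labels`. The intuition is that large models fitted to idiosyncrasies of the training set are penalized when part of that signal is scrambled.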

With technological advances in genetic mapping studies, more of the genes and polymorphisms that underlie Quantitative Trait Loci (QTL) are now being identified. As the identities of these genes become known, there is a growing need for an analysis framework that incorporates the molecular interactions affected by natural polymorphisms. As a step towards such a framework, we present a molecular model of genetic variation in sporulation efficiency between natural isolates of the yeast Saccharomyces cerevisiae. The model is based on the structure of the regulatory pathway that controls sporulation. The model captures the phenotypic variation between strains carrying different combinations of alleles at known QTL. Compared to a standard linear model, the molecular model requires fewer free parameters and has the advantage of generating quantitative hypotheses about the affinity of specific molecular interactions in different genetic backgrounds. Our analyses provide a concrete example of how the thermodynamic properties of protein-protein and protein-DNA interactions naturally give rise to epistasis, the non-linear relationship between genotype and phenotype. As more causative genes and polymorphisms underlying QTL are identified, thermodynamic analyses of quantitative traits may provide a useful framework for unraveling the complex relationship between genotype and phenotype.
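
The claim that thermodynamics naturally produces epistasis can be illustrated with a toy calculation. The Python sketch below is not the authors' sporulation model. It assumes a generic two-state binding site whose occupancy serves as the phenotype, and two hypothetical alleles that each shift the binding free energy additively. All constants (`BASE_DG`, `EFFECT_A`, `EFFECT_B`) are made-up values in units of kT.

```python
import math


def occupancy(delta_g):
    """Bound fraction of a two-state site; delta_g is in units of kT."""
    return 1.0 / (1.0 + math.exp(delta_g))


# Hypothetical values, chosen only for illustration.
BASE_DG = 3.0     # binding free energy with neither allele
EFFECT_A = -2.0   # additive energy shift from allele A
EFFECT_B = -2.0   # additive energy shift from allele B


def phenotype(has_a, has_b):
    """Phenotype = occupancy; the alleles are additive in energy only."""
    return occupancy(BASE_DG + EFFECT_A * has_a + EFFECT_B * has_b)


p00 = phenotype(0, 0)
p10 = phenotype(1, 0)
p01 = phenotype(0, 1)
p11 = phenotype(1, 1)

# Deviation of the double-allele phenotype from additive expectation.
epistasis = p11 - p10 - p01 + p00

print(f"effect of A without B: {p10 - p00:.3f}")
print(f"effect of A with B:    {p11 - p01:.3f}")
print(f"epistasis term:        {epistasis:.3f}")
```

Although the two alleles combine additively at the level of free energy, the sigmoidal occupancy curve makes the effect of A depend on whether B is present, which is the sense of epistasis used in the abstract.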

Traditionally, we understand that individual phenotypes result primarily from inherited genetic variants together with environmental exposures. However, many studies have shown that a remarkable variety of factors, including environmental agents, parental behaviors, maternal physiology, xenobiotics, nutritional supplements and others, lead to epigenetic changes that can be transmitted to subsequent generations without continued exposure. Recent discoveries show transgenerational epistasis and transgenerational genetic effects, where genetic factors in one generation affect phenotypes in subsequent generations without inheritance of the genetic variant in the parents. Together these discoveries implicate a key signaling pathway, chromatin remodeling, methylation, RNA editing and microRNA biology. This exceptional mode of inheritance complicates the search for disease genes and represents perhaps an adaptation to transmit useful gene expression profiles from one generation to the next. In this review, I present evidence for these transgenerational genetic effects, identify their common features, propose a heuristic model to guide the search for mechanisms, discuss the implications, and pose questions whose answers will begin to reveal the underlying mechanisms.

Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm to a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in an area under the ROC curve (AUC) of ~0.84 in both datasets. In contrast, performance was poor when the SVM model or a logistic regression model was limited to dozens of known susceptibility loci. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
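
For readers less familiar with the AUC figure quoted above, it can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The Python sketch below computes it that way by direct pairwise comparison. The function name `auc` and the example scores are hypothetical, and nothing here reproduces the study's SVM or its data.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve as a pairwise ranking probability.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; ties count as half a win.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Made-up risk scores from some classifier, for illustration only.
cases = [0.9, 0.7, 0.6, 0.4]
controls = [0.5, 0.3, 0.2, 0.1]
print(f"AUC = {auc(cases, controls):.3f}")
```

An AUC of 0.5 corresponds to scores that carry no ranking information, and 1.0 to scores that separate cases from controls perfectly. The quadratic loop is fine for small examples; for datasets of the size in the abstract a rank-based implementation would be the more practical choice.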

Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia, Wikipedia, the goal of this effort is to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki 'stubs' for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki.

About Me

Edward Rose Professor of Informatics,
Director of the Institute for Biomedical Informatics, Director of the Division of Informatics in the Department of Biostatistics and Epidemiology,
Senior Associate Dean for Informatics,
The Perelman School of Medicine,
University of Pennsylvania