CORE 1: THEORETICAL AND COMPUTATIONAL STUDIES

Project 1. Atomic-level molecular-interaction models: B.Honig, H.Bussemaker, D.Murray.

Background: One of the long-range goals of MAGNet is the integration of structural information into all aspects of Systems Biology research. Our strategy is to develop a computational infrastructure that will facilitate the integration process and to apply our computational tools across the entire spectrum of MAGNet activities. Our efforts will involve the direct use of three-dimensional structural information, where available, and the extensive application of modeling techniques. The Honig and Murray labs have demonstrated the power of modeling in numerous collaborative studies (see (8, 9)). Our focus in this proposal is twofold. First, our recent discovery of the role of minor groove shape in protein-DNA recognition (7) opens up a broad range of questions to be addressed (see also DBP 1). Second, we will develop databases and models that allow us to predict the structure of proteins involved in different pathways, with a focus on the computational prediction of protein-protein and protein-membrane interactions. These tools will impact projects throughout MAGNet, with specific emphasis on modeling cancer-related pathways.

Project 2. Regulatory Molecular-Interaction Model: A.Califano, D.Anastassiou, D.Pe'er, D.Vitkup

Background: Regulatory networks are becoming increasingly valuable in the elucidation of cellular function and of its dysregulation in disease (5, 6, 22-24). Additionally, integrative genetical-genomics models that use genetics to inform causality in regulatory models have been successfully used to elucidate determinants of mammalian traits, which have been experimentally validated (25).
Yet this area of investigation is still in its infancy, and significant non-incremental improvements are necessary before these tools and methodologies can be routinely used by biologists for the elucidation of physiological and pathological mechanisms. Specifically, regulatory network models of higher eukaryotes are largely incomplete, lack context specificity, and, with few exceptions (21), address only one molecular-interaction layer: generally either the transcriptional (22) or the protein-protein interaction layer (26). Indeed, the vast majority of pathway models used in the literature are assembled from the literature or from ex vivo data, such as yeast two-hybrid assays, and are thus both biased and not specific to the cellular context of interest. Not surprisingly, there are very few examples where unbiased computational interrogation of these models has led to the elucidation of novel biological mechanisms. Rather, these models are used as conceptual tools to explore broad associations between disease and network connectivity or to explore regulatory interactions surrounding genes of specific interest. Similarly, paracrine and endocrine regulatory processes spanning multiple cell types, such as those driven by stroma-tumor (27), gut-bone (28), and glia-motor neuron (29) interactions, are virtually unmapped at a genome-wide level. Finally, several informative data modalities are poorly integrated in efforts to dissect molecular interactions. For instance, data on structure-based specificity of protein-DNA and protein-protein interactions have not been systematically integrated with functional data to reverse-engineer regulatory networks.

Project 3. Genetic Variability Models: C.Wiggins, D.Pe'er, I.Pe'er, R.Rabadan

Background: The data-driven revolution currently transforming population genetics is the focus of the third theme in Core 1.
Abundant sequence data challenge our decades-old understanding of population genetics, particularly its dynamics, and support new investigations to learn the mapping from microscopic genetic variation to macroscopic phenotypic (and disease) response. This theme spans from the fast evolutionary dynamics of small genomes (viruses) to population data of large human genomes, tied together via machine learning methods, which constrain and guide our understanding of population genetics and population dynamics. In the same way that microarray data spawned an entire new field of quantitative inquiry into transcriptional regulatory networks a decade ago, current advances in both technology (sequencing methods, in particular) and computation (algorithmic advances in regression, in particular) now allow learning the structure and even dynamics of the genotype-phenotype relationship from data.

Project 4. Software Development: A.Floratos, A.Califano, G.Kaiser

Background: A key objective of the NCBC program in general and the MAGNet Center in particular is to facilitate the dissemination of advanced computational tools and data resources to the national and international biomedical research communities. Any software platform used for that purpose must address a complex set of challenges.

Integration and user-friendliness: The fundamentally integrative nature of modern biomedical research necessitates the combination of data from multiple genomic/biomedical databases and the use of an array of advanced analysis techniques (39). Making tools and resources accessible in an integrated and interoperable manner is a prerequisite for lowering the adoption barrier for biologists who are not computationally trained.

Knowledge sharing and collaboration: Moving beyond the traditional model of providing expert-driven support to bioinformatics tool users (through mailing lists, forums, knowledge bases, etc.),
new approaches have emerged that seek to create communities of practice through activity awareness (40-42). Integrative tools can be daunting to use, and they stand to gain tremendously from the ability to automatically build "community memory" and enable knowledge sharing through the addition of transparent event (activity) collection, aggregation and mining facilities.

Seamless access to computational infrastructure: Due to their sheer size and dimensionality, genomic data sets can be computationally very demanding to analyze. It is unlikely that every biomedical researcher who would like to use such analyses will have access to local or institutional hardware resources capable of supporting their execution. It is therefore extremely important to facilitate sharing of public infrastructure, such as grid computing (43, 44).

Integration into the national biomedical computing environment: It is becoming increasingly evident that, to maximize the impact of analytical and data resources in biomedical research, it is desirable to expose them programmatically in a semantically aware manner (45, 46). The combination of programmatic accessibility and semantic clarity not only provides a level of self-documentation that increases usability and quality control but also encourages the creative incorporation of these resources into shareable workflows and innovative analysis and visualization tools (47-51).