The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. While various algorithms have been proposed for these tasks, evaluating their performance is difficult due to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. In this work, we propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank or train classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that we address several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools.
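
As a rough illustration of the quantities named in this abstract, the sketch below assumes that every ontology term already has an information content value and that annotations are sets of terms propagated up to the root. The dictionary `ia`, the toy term names, and the numeric values are all hypothetical; in the framework these values would be derived from the Bayesian network prior. Under those assumptions, remaining uncertainty sums the information of true terms the predictor missed, misinformation sums the information of predicted terms that are not true, and a semantic distance can be formed as a k-norm of the two.

```python
import math

def remaining_uncertainty(true_terms, predicted_terms, ia):
    """Information in true terms that the prediction failed to include."""
    return sum(ia[t] for t in true_terms - predicted_terms)

def misinformation(true_terms, predicted_terms, ia):
    """Information in predicted terms that are not in the true annotation."""
    return sum(ia[t] for t in predicted_terms - true_terms)

def semantic_distance(ru, mi, k=2):
    """Combine both quantities into one number (k-norm; k=2 is Euclidean)."""
    return (ru ** k + mi ** k) ** (1.0 / k)

# Hypothetical toy ontology: root -> binding, root -> catalysis -> kinase.
# The information values below are invented for illustration only.
ia = {"root": 0.0, "binding": 1.0, "catalysis": 2.0, "kinase": 3.0}

true_terms = {"root", "catalysis", "kinase"}
predicted_terms = {"root", "catalysis", "binding"}

ru = remaining_uncertainty(true_terms, predicted_terms, ia)  # missed "kinase"
mi = misinformation(true_terms, predicted_terms, ia)         # wrong "binding"
sd = semantic_distance(ru, mi)
print(ru, mi, round(sd, 4))
```

A perfect prediction gives zero for both quantities, so smaller distances indicate better predictions in this sketch.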

Motivation: Combinatorial interactions of transcription factors with cis-regulatory elements control the dynamic progression through successive cellular states and thus underpin all metazoan development. The construction of network models of cis-regulatory elements therefore has the potential to generate fundamental insights into cellular fate and differentiation. Haematopoiesis has long served as a model system to study mammalian differentiation, yet modelling based on experimentally informed cis-regulatory interactions has so far been restricted to pairs of interacting factors. Here we have generated a Boolean network model based on detailed cis-regulatory functional data connecting 11 haematopoietic stem/progenitor cell (HSPC) regulator genes. Results: Despite its apparent simplicity, the model exhibits surprisingly complex behaviour that we charted using strongly connected components and shortest-path analysis in its Boolean state space. This analysis of our model predicts that HSPCs display heterogeneous expression patterns and possess many intermediate states that can act as ‘stepping stones’ for the HSPC to achieve a final differentiated state. Importantly, an external perturbation or ‘trigger’ is required to exit the stem cell state, with distinct triggers characterising maturation into the various different lineages. By focussing on intermediate states occurring during erythrocyte differentiation, from our model we predicted a novel negative regulation of Fli1 by Gata1 which we confirmed experimentally thus validating our model. In conclusion, we demonstrate that an advanced mammalian regulatory network model based on experimentally validated cis-regulatory interactions has allowed us to make novel, experimentally testable hypotheses about transcriptional mechanisms that control differentiation of mammalian stem cells.
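
The kind of state-space analysis described in this abstract can be sketched on a toy example. The three genes and their update rules below are hypothetical and are not the 11-gene HSPC model; the asynchronous update scheme (one gene changes per step) is also an assumption. The sketch enumerates all Boolean states, finds strongly connected components with Tarjan's algorithm, and finds a shortest path between two states by breadth-first search.

```python
from collections import deque
from itertools import product

GENES = ("A", "B", "C")
# Hypothetical rules: A and B repress each other, C needs A and represses itself.
RULES = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
    "C": lambda s: s["A"] and not s["C"],
}

def successors(state):
    """States reachable by updating exactly one gene (asynchronous update)."""
    s = dict(zip(GENES, state))
    out = []
    for i, gene in enumerate(GENES):
        new = int(bool(RULES[gene](s)))
        if new != state[i]:
            out.append(state[:i] + (new,) + state[i + 1:])
    return out

def build_state_graph():
    return {st: successors(st) for st in product((0, 1), repeat=len(GENES))}

def strongly_connected_components(graph):
    """Tarjan's algorithm (recursive; fine for small state spaces)."""
    index, low, stack, on_stack, comps = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return comps

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of states or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in graph[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None

graph = build_state_graph()
components = strongly_connected_components(graph)
path = shortest_path(graph, (1, 1, 1), (0, 1, 0))
print(len(components), path)
```

In this toy network the state (0, 1, 0) has no successors and so behaves as a steady state, while the pair of states (1, 0, 0) and (1, 0, 1) forms a cyclic component.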

Motivation: Development and progression of solid tumors can be attributed to a process of mutations, which typically includes changes in the number of copies of genes or genomic regions. Although comparisons of cells within single tumors show extensive heterogeneity, recurring features of their evolutionary process may be discerned by comparing multiple regions or cells of a tumor. A particularly useful source of data for studying likely progression of individual tumors is fluorescence in situ hybridization (FISH), which allows one to count copy numbers of several genes in hundreds of single cells. Novel algorithms for interpreting such data phylogenetically are needed, however, to reconstruct likely evolutionary trajectories from states of single cells and facilitate analysis of their evolutionary trajectories. Results: In this paper, we develop phylogenetic methods to infer likely models of tumor progression using FISH copy number data and apply them to a study of FISH data from two cancer types. Statistical analyses of topological characteristics of the tree-based model provide insights into likely tumor progression pathways consistent with the prior literature. Furthermore, tree statistics from the resulting phylogenies can be used as features for prediction methods. This results in improved accuracy, relative to unstructured gene copy number data, at predicting tumor state and future metastasis. Availability: A package of source code for FISH tree building (FISHtrees) and the data on cervical cancer and breast cancer examined here are publicly available at the site ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees.
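
To make the idea of a tree over single-cell copy number states concrete, the sketch below uses a much simpler stand-in than the methods in the paper: it links observed copy number profiles by a minimum spanning tree under an L1 distance, rooted at an assumed diploid state. This is not the FISHtrees algorithm; the distance, the rooting, and the example profiles are all illustrative assumptions. The depth of each node is computed as one example of a tree statistic that could serve as a feature.

```python
def l1(a, b):
    """Number of single-copy gains or losses separating two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def spanning_tree(profiles, root):
    """Prim's algorithm; returns a child -> parent map rooted at `root`."""
    nodes = set(profiles) | {root}
    in_tree = {root}
    parent = {root: None}
    while len(in_tree) < len(nodes):
        # Cheapest edge from the tree to a node outside it; ties are
        # broken by tuple ordering so the result is deterministic.
        _, u, v = min(
            (l1(u, v), u, v) for u in in_tree for v in nodes - in_tree
        )
        parent[v] = u
        in_tree.add(v)
    return parent

def depth(node, parent):
    """Number of edges between a node and the root."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d

# Hypothetical copy number profiles for two probes (gene 1, gene 2).
diploid = (2, 2)
cells = [(2, 2), (3, 2), (4, 2), (3, 1)]

tree = spanning_tree(cells, diploid)
total_changes = sum(l1(c, p) for c, p in tree.items() if p is not None)
max_depth = max(depth(n, tree) for n in tree)
print(tree, total_changes, max_depth)
```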

Short Abstract: The rapid and accurate identification of pathogens in human tissue samples is a necessity as disease-causing pathogens increasingly develop resistance to broad spectrum antibiotics and remain one of the greatest public health burdens worldwide. With the increased affordability of high-throughput sequencing, it is now possible to investigate the microbiome of a given sample with high sensitivity. However, clinical samples contain a mixture of genomic sequences from various sources, which complicates the identification of pathogens. Here we present Clinical Pathoscope, a pipeline to rapidly and accurately remove host contamination, isolate viral reads, and deliver a diagnosis. To optimize the Clinical Pathoscope pipeline, data were simulated from human, bacterial, and viral genomes to create biologically realistic clinical samples representing a wide variety of host-pathogen landscapes. These data were then used to evaluate the accuracy, usability, and speed of multiple alignment algorithms and filtration methods. The optimal alignment algorithm and filtration method were implemented in the Clinical Pathoscope pipeline to isolate viral reads. These reads were then mapped against a robust viral database and assigned to their appropriate genomes of origin. We demonstrate our approach using sequenced nasopharyngeal aspirate samples from children with upper respiratory tract infections. Unlike other methods, Clinical Pathoscope can rapidly identify multiple pathogens from mixed samples and distinguish between very closely related species with very little coverage of the genome and without the need for genome assembly.

F1000 Poster Awards

Poster - N054

Sequence assembly and variation calling using multiple-dimension de Bruijn graphs

Sergey Lamzin, The Genome Analysis Centre, United Kingdom

Short Abstract: Recent developments in sequencing technologies have brought a renewed impetus to the development of bioinformatics tools for sequence processing and analysis. Most of the current algorithms for de novo genome assembly are based on de Bruijn graphs, which provide an effective framework for aggregating next generation sequencing (NGS) data into a convenient structure. De Bruijn graphs, however, introduce an artificial parameter that can greatly affect the results: the dimension k giving rise to k-mer building blocks. We report on the development of a novel assembly algorithm with a new data structure designed to overcome some of the limitations of a single fixed k-mer size de Bruijn graph approach and enable higher quality NGS data processing. Our approach structurally combines de Bruijn graphs for all possible dimensions k in one supergraph, leading to a flexible graph dimension. The algorithm, called StarK, is designed in such a way that it allows the assembler to dynamically adjust the de Bruijn graph dimension at any given nucleotide position. In addition to flexible k-mer lengths, the structure allows for simultaneous assembly of a consensus sequence and mutations/haplotypes directly from reads. The StarK graph uses localised coverage differences to guide the generation of connected subgraphs. This allows higher resolution of genomic differences and helps differentiate errors from potential variants within the sequencing sample.
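
The dependence on k mentioned in this abstract can be seen in a small sketch of a conventional fixed-k de Bruijn graph. This is not the StarK data structure; the function names and the two example reads are hypothetical. Nodes are (k-1)-mers, each k-mer contributes one edge, and edge counts act as a simple coverage measure.

```python
from collections import Counter, defaultdict

def de_bruijn_edges(reads, k):
    """Map (prefix, suffix) edges to the number of k-mers supporting them."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

def branching_nodes(edges):
    """Nodes with more than one distinct successor (ambiguous extension)."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    return sorted(u for u, vs in succ.items() if len(vs) > 1)

# Two hypothetical overlapping reads.
reads = ["ATGGCGT", "GGCGTGC"]

edges_k3 = de_bruijn_edges(reads, 3)
edges_k5 = de_bruijn_edges(reads, 5)
print(branching_nodes(edges_k3))  # small k: the 2-mer "TG" is ambiguous
print(branching_nodes(edges_k5))  # larger k: the path is unambiguous
```

With k = 3 the node "TG" has two possible extensions, whereas with k = 5 the same reads give a single linear path, which illustrates why the choice of k matters.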

Poster - A074

A web server for the functional characterization of drugs from gene expression following treatment

Short Abstract: Many drugs exert their therapeutic activities through the modulation of multiple targets. Moreover, this polypharmacology is often associated with both beneficial and adverse off-target effects. For most drugs these targets are largely unknown and identification among the thousands of gene products remains difficult. Yet a better knowledge about such drug-protein interactions, along with the molecular pathways involved and the associated diseases, could be of substantial value to drug development, in particular to predict side effects and explore potential drug repositioning.

DNA microarray technology enables us to observe the effect of drug treatment on the activity of all genes simultaneously and thus forms the perfect starting point for drug mode of action prediction. Hence we have developed an easy-to-use analysis suite for functional characterization of drugs based on gene expression changes following treatment. Our software provides all necessary tools for gaining new insights into the biological effects of a drug by integrating (1) preprocessing of gene expression data obtained from different Affymetrix array types; (2) quality assessment and exploratory analysis of these data; (3) genome-wide drug target prioritization; (4) prediction of pathways involved in the drug’s mode of effect; (5) identification of associated diseases enabling side effect prediction and drug repurposing; and (6) result visualization and reporting. Drug target prioritization is performed by means of an in-house developed algorithm for network neighborhood analysis, integrating the expression data with functional protein association information. All of the above functionalities are demonstrated on gene expression data for treatment with well-characterized drugs.

Poster - I13

Relating the metatranscriptome and metagenome of the human gut

Eric Franzosa, Harvard School of Public Health, United States

Xochitl Morgan (Harvard School of Public Health, Biostatistics Department, United States); Nicola Segata (Harvard School of Public Health, Biostatistics Department, United States); Levi Waldron (Harvard School of Public Health, Biostatistics Department, United States); Joshua Reyes (Harvard School of Public Health, Biostatistics Department, United States); Curtis Huttenhower (Harvard School of Public Health, Biostatistics Department, United States); Ashlee Earl (The Broad Institute, Genome Sequencing & Analysis Program, United States); Georgia Giannoukos (The Broad Institute, Genome Sequencing & Analysis Program, United States); Dawn Ciulla (The Broad Institute, Genome Sequencing & Analysis Program, United States); Wendy Garrett (Harvard School of Public Health, Department of Immunology and Infectious Diseases, United States); Andrew Chan (Massachusetts General Hospital, Gastrointestinal Unit, United States); Jacques Izard (The Forsyth Institute, Department of Microbiology, United States); Matthew Boylan (Massachusetts General Hospital, Gastrointestinal Unit, United States)

Short Abstract: Typical microbial residents and ecologies of the human microbiome have now been well-studied. However, the microbiota's >8 million genes and their transcriptional regulation remain largely uncharacterized. We conducted one of the first human microbiome studies in a well-phenotyped prospective cohort incorporating taxonomic, metagenomic, and metatranscriptomic profiling at multiple body sites. The results establish the feasibility of metatranscriptomic investigations in subject-collected samples from the Health Professionals Follow-up Study. Replicate stool and saliva samples were collected from 8 subjects, and three different RNA preservation methods were assessed (frozen, ethanol, and RNAlater). Within-subject microbial species, gene, and transcript abundances were highly concordant across sampling methods; only transcripts, and only a small fraction of them (<5%), displayed significant between-method variation. Their functions were consistent with reprogramming in response to storage media environment (carbon source and osmolarity). Next, we investigated relationships between the oral and gut microbial communities, identifying a subset of abundant oral microbes that routinely survive transit to the gut. Comparison of the gut metagenome and metatranscriptome revealed three distinct functional clusters: (i) the ~50% of microbial genes whose RNA and DNA levels are strongly correlated; (ii) genes detected only at the DNA level, including inactive biosynthesis and stress-response factors; and (iii) genes detected only at the RNA level, including functions specific to the gut’s archaeal inhabitants, e.g. methanogenesis. Globally, we observe that RNA-level functional profiles are significantly more individualized than DNA-level profiles across subjects but less variable than microbial composition, indicative of subject-specific whole-community regulation occurring at the transcriptional level.

Short Abstract: Genotype-phenotype association methods for bacterial genomes are not yet well-established. Bacteria do not have sexual reproduction, which invalidates some of the assumptions made in many of the current methods used for association studies in other organisms. Bacteria have a huge influence on human health and we need better methods to learn about how changes in genetics give clinically relevant phenotypes.

Our bug of interest is Mycobacterium tuberculosis (Mtb), a bacterial pathogen that causes pulmonary tuberculosis (TB), which kills over a million people each year. Unfortunately, Mtb is difficult to diagnose and resistance to antibiotics is becoming rampant. The current diagnostics for drug resistance take six to eight weeks. The technology for rapid molecular diagnostics exists, but requires knowledge about resistance marker mutations, which is missing.

To address the lack of knowledge about marker mutations, we present a machine-learning-based strategy that uses support vector machines to predict genotype-phenotype associations by integrating genome sequence and clinical metadata. An ensemble feature selection method enables the discovery of antibiotic resistance markers in Mtb. The impact is two-fold: (i) the feature selection procedure gives us a ranking of mutations that are associated with drug resistance and (ii) the classification model can be used together with a molecular diagnostic to predict treatment options for patients.

In this poster we discuss the methods and illustrate their capabilities on a panel of bacterial drug-resistance projects with a particular focus on Mycobacterium tuberculosis.

Short Abstract: According to the American Cancer Society, breast cancer is the second most common cause of cancer death among women. Generally, the cause of death is metastasis to another organ, not the primary tumor in the breast. A better understanding of the molecular mechanism of the metastatic process may help to improve clinical methods. For this purpose, we have used protein structure and protein networks together at the system level to explain genotype-phenotype relationships, and applied this approach to breast cancer metastasis.

We have built a comprehensive human PPI network by combining the available protein-protein interaction data from various databases. Then, we have ranked all the interactions of the human PPI network according to their relevance to genes known to mediate breast cancer metastasis to the brain and lung. We have formed two distinct metastasis PPI networks from the high-ranked interactions.

We have performed functional analyses on the brain/lung metastasis PPI networks and observed that the proteins of the lung metastasis network are also enriched in the “Cancer”, “Infectious Diseases” and “Immune System” KEGG pathways. This finding may point to a cause and effect relationship between immune system-infectious diseases and lung metastasis progression.

We have enriched the metastasis PPI networks with structural information, both from available data in the Protein Data Bank and from our protein interface predictions. In the interface prediction step, the most common protein-protein interface templates in lung metastasis are observed to come from bacterial proteins. This finding reinforces our claim about the relationship between lung metastasis and infectious diseases.

Short Abstract: Macromolecular complexes play a key role in many biological processes. In metabolic pathways, for example, assemblies of proteins bear several advantages: reactions are performed more efficiently, oversupply of intermediate products is reduced or avoided by regulating the activity of the involved enzymes via feedback loops, and toxic or highly reactive compounds are kept from being released into the cytoplasm. However, atom-level structural determination of such complexes, for example with X-ray crystallography, often fails due to the size of the complex, different binding affinities of the involved proteins, or the complex falling apart during crystallization.

We present a novel combinatorial greedy algorithm that iteratively assembles such complexes solely based on the knowledge of the approximate interface locations of any two interacting proteins in the complex, and the stoichiometry of each monomer. Prior assumptions about symmetries in the complex are not required; rather, the symmetry is detected during complex assembly. Complexes are assembled stepwise from pairwise docking poses obtained with RosettaDock and scored using a geometric compatibility constraint deduced from these docking poses. Clash detection and clustering guarantee a reasonable and diverse solution space in each iteration.

In a diverse and representative benchmark set of 304 complexes from the Protein Data Bank with more than five subunits, 199 (65%) could be reconstructed to within 3.0 Å of the reference complex, measured as the average RMSD of 14 reference points for any two contacting subunits in the reference complex. Of these, the best prediction lies within the top ten in 91% of the cases.

The Best Artwork Award - $200 Cash Prize

Cosmopolitan Chicken Research Project

Jan Aerts - University of Leuven, Belgium

The Cosmopolitan Chicken Project by Belgian artist Koen Vanmechelen aims to explore the phenotypic and genetic evolution of chicken breeds as a proxy for human evolution and diversity (http://www.koenvanmechelen.be/cosmopolitan-chicken-project). This project has created several generations of hybrids based on purebred domestic chickens.

In this data-driven sculpture, we visualize the genetic heterozygosity of one particular hybrid ("Mechelse Ancona") which descends from 12 different purebreds, including Mechelse Koekoek, Poulet de Bresse and Ancona. After genotyping, the numbers of homozygous and heterozygous loci were counted at their chromosomal positions and translated into peaks in 3-dimensional space. Each chromosome is laid out in a circle, connecting at the starting position. Peaks pointing towards the centre represent homozygous genotypes; peaks pointing outward show heterozygous regions. As a result, inbred chickens generate smoother outlines whereas crossbred ones result in a form with many outward-pointing spikes. The 3D models were developed in Processing, an open source programming language and integrated development environment based on Java.
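
The mapping from genotypes to inward and outward peaks can be sketched as follows. This is a 2D simplification in Python for illustration only, not the Processing code used for the sculpture; the window size, base radius, scaling, and example genotypes are all assumptions. Loci are counted per window along a chromosome, each window is placed at an angle on a circle, and homozygous counts pull a point towards the centre while heterozygous counts push a point outward.

```python
import math

def window_counts(genotypes, chrom_length, window):
    """Count [homozygous, heterozygous] loci in fixed-size windows."""
    n = math.ceil(chrom_length / window)
    counts = [[0, 0] for _ in range(n)]
    for pos, (allele_a, allele_b) in genotypes:
        idx = min(pos // window, n - 1)
        counts[idx][0 if allele_a == allele_b else 1] += 1
    return counts

def circular_peaks(counts, base_radius=10.0, scale=1.0):
    """Return (x, y) points: one inward and one outward peak per window."""
    n = len(counts)
    points = []
    for i, (hom, het) in enumerate(counts):
        angle = 2 * math.pi * i / n
        inward = base_radius - scale * hom   # homozygous: towards the centre
        outward = base_radius + scale * het  # heterozygous: away from it
        for r in (inward, outward):
            points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Hypothetical genotype calls: (position, two-letter genotype).
genotypes = [(5, "AA"), (15, "AG"), (18, "CT"), (35, "GG")]

counts = window_counts(genotypes, chrom_length=40, window=10)
points = circular_peaks(counts)
print(counts)
print(points[:2])
```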