Determining the clonal heterogeneity in tumors using exome sequencing data

Integrative molecular characterization of glioblastoma

Integrative data analysis of pancreatic tumor genomes

Mutational signature patterns of breast cancer tumors based on stage, grade and subtype

Breast cancers exhibit highly heterogeneous molecular profiles. Although gene expression profiles have been used to predict the risk and prognosis of breast cancers, the high variability of gene expression limits their clinical application. In contrast, genetic mutation profiles may be more advantageous because genetic mutations can be detected stably and mutational heterogeneity is widespread in breast cancer genomes.

We analyzed 98 breast cancer whole-exome samples that were sorted into three subtypes, two grades and two stages. In this case-control study, the summed deleterious effect of all mutations in each gene was scored to identify differentially mutated genes (DMGs). DMGs were corroborated using extensive published knowledge. The functional consequences of deleterious SNVs on protein structure and function were also investigated. Mutational profiling at the gene and SNV levels revealed differential patterns within each breast cancer comparison group, and the gene signatures correlate with the expected prognostic characteristics of the breast cancer classes. Some of the genes and SNVs identified in this study show high promise and are worthy of further experimental investigation.
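The gene-level scoring described above can be illustrated with a minimal sketch. It assumes each mutation already carries a numeric deleteriousness score from some predictor; the sample names, gene names, scores and the group-mean comparison are all hypothetical and are not taken from the study, which does not state its scoring tool or statistical test here.

```python
from collections import defaultdict

def gene_scores(mutations):
    """Sum per-mutation deleteriousness scores into one score per (sample, gene)."""
    totals = defaultdict(float)
    for sample, gene, score in mutations:
        totals[(sample, gene)] += score
    return dict(totals)

def mean_score_by_group(totals, groups, gene):
    """Average a gene's summed score over the samples in each comparison group.

    Samples with no scored mutation in the gene contribute 0.
    """
    return {
        group: sum(totals.get((s, gene), 0.0) for s in samples) / len(samples)
        for group, samples in groups.items()
    }

# Toy input: (sample, gene, score) triples with made-up values.
mutations = [
    ("s1", "GENE_A", 0.5),
    ("s1", "GENE_A", 0.25),
    ("s2", "GENE_A", 0.75),
    ("s3", "GENE_B", 0.5),
]
groups = {"group_1": ["s1", "s2"], "group_2": ["s3", "s4"]}

totals = gene_scores(mutations)
means = mean_score_by_group(totals, groups, "GENE_A")
print(means)
```

A gene whose mean summed score differs strongly between two groups would be a candidate DMG; a real analysis would replace the plain comparison of means with an appropriate statistical test.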

The differentially mutated genes between breast cancer subtypes

The deleterious mutation scores for the differentially mutated genes across the compared samples

High-throughput sequencing, especially exome sequencing, is fast becoming a popular diagnostic tool in the clinical setting, but it has become increasingly difficult to determine which tools are best at analyzing the resulting data. Previously, researchers had to rely on simulated data sets or gene chips for validation when choosing an analysis pipeline, but both have drawbacks and biases. In this study we use the NIST Genome in a Bottle results as a novel means of validating exome analysis pipelines, one that does not share these drawbacks.

Using the NIST Genome in a Bottle variant list as a gold standard, we combined six aligners (Bowtie2, BWA mem, BWA sampe, CUSHAW3, MOSAIK, and Novoalign) with five variant callers (FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, SAMtools mpileup, and SNPSVM) to determine which pipeline performs best on a standard human exome. SNVs were compared across all feasible pipelines by calculating sensitivities at different read depths. Among the 30 pipelines tested, Novoalign in conjunction with GATK UnifiedGenotyper exhibited the highest sensitivity while maintaining a low number of false positives for SNVs. Indels, however, remain difficult for any pipeline to handle: none of the tools achieved an average sensitivity higher than 33% or a positive predictive value (PPV) higher than 53%. This work also highlights that standard analysis pipelines miss a large majority of the variants, both SNVs and indels, present in a normal human exome. Lastly, as expected, aligners can play as vital a role in variant detection as the variant callers themselves.
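As an illustration of the two metrics reported above, the following minimal sketch computes sensitivity and PPV for one pipeline by comparing its calls against a truth set. The function name, the tuple representation of a variant and the toy variants are assumptions made for this example; they are not the scripts used in the study.

```python
def sensitivity_and_ppv(called, truth):
    """Compare a pipeline's variant calls against a truth set.

    Variants are hashable tuples, e.g. (chrom, pos, ref, alt).
    sensitivity = TP / (TP + FN), PPV = TP / (TP + FP).
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    ppv = true_pos / (true_pos + false_pos) if called else float("nan")
    return sensitivity, ppv

# Toy data: four true variants, three calls, two of which are correct.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
    ("chr2", 75, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr3", 10, "A", "C"),
}

sens, ppv = sensitivity_and_ppv(calls, truth)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

Exact tuple matching is the simplest possible comparison; indels in particular often need normalization before they can be compared fairly, which this sketch does not attempt.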

Schematic of the data analysis pipeline used

The intersection of the SNVs identified by the top five pipelines

Published articles related to this project:

Cornish A, Guda C. A comparison of variant calling pipelines using Genome-in-a-Bottle as a reference. BioMed Research International (2015). [Hindawi]

ECemble: An enzyme classification method to study the role of gut microbiome in human metabolism

Enzymes encoded by the human gut microbiome play an essential role in human metabolism. To annotate the full enzyme complements of species in genomic and metagenomic projects, we developed a method called ECemble to identify enzymes and enzyme classes and to study human gut metabolic pathways. The ECemble method uses an ensemble of machine-learning methods to model and predict enzymes from protein sequences, and it identifies enzyme classes and subclasses at the finest resolution. We applied ECemble to predict the entire enzyme complements of ten sequenced proteomes, including the human proteome. We also applied the method to predict enzymes encoded by the human gut microbiome from gut metagenomic samples, and to study the role played by microbe-derived enzymes in human metabolism. After mapping the known and predicted enzymes to canonical human pathways, we identified 48 pathways that have at least one bacteria-encoded enzyme, which demonstrates the complementary role of the gut microbiome in human gut metabolism. These pathways are primarily involved in metabolizing dietary nutrients such as carbohydrates, amino acids, lipids, cofactors and vitamins. The ECemble method is able to hierarchically assign high-quality enzyme annotations to genomic and metagenomic data. This study demonstrates a practical application of ECemble to understanding the indispensable role played by microbe-encoded enzymes in the healthy functioning of human metabolic systems.

Schematic representation of the ECemble method and its application

Fractions of known and ECemble predicted enzymes in the proteomes of 10 model organisms from UniProt

MetaID: Taxonomic profiling of metagenomic samples down to the strain level

Several computational methods are available for taxonomic profiling at the genus and species levels, but none of them is effective for strain-level identification, because variation becomes increasingly difficult to detect at that level. Here, we present MetaID, an alignment-free, n-gram based approach that, given a metagenomic sequencing dataset, can accurately identify microorganisms at the strain level and estimate the abundance of each organism in the sample.

MetaID is an n-gram based method that calculates the profile of unique and common n-grams from a dataset of 2,031 prokaryotic genomes and assigns a weight to each n-gram using a scoring function. This scoring function assigns higher weights to n-grams that appear in fewer genomes and lower weights to those that appear in many, which allows both unique and common n-grams to be used effectively for species identification. The proposed scoring function and approach are able to identify the taxa in a metagenomic community and estimate their abundance. The weights assigned to the common n-grams by our scoring function are calibrated to match reads down to the strain level. The generic approach employed in this method can be applied to identify a wide variety of microbial species (viruses, prokaryotes and eukaryotes) present in any environmental sample.
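The weighting idea can be sketched as follows. This is an illustration only: the inverse-genome-count weight (1 divided by the number of genomes containing the n-gram) is an assumed stand-in for MetaID's scoring function, which is not specified here, and the genome names and sequences are made up.

```python
from collections import defaultdict

def ngrams(seq, n):
    """Return the set of distinct n-grams in a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def build_weights(genomes, n):
    """Map each n-gram to (weight, genomes containing it).

    Assumed weight: 1 / (number of genomes containing the n-gram),
    so rarer n-grams count for more.
    """
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for gram in ngrams(seq, n):
            owners[gram].add(name)
    return {gram: (1.0 / len(names), names) for gram, names in owners.items()}

def score_read(read, weights, n):
    """Accumulate, per genome, the weights of the read's n-grams found in it."""
    scores = defaultdict(float)
    for gram in ngrams(read, n):
        if gram in weights:
            weight, names = weights[gram]
            for name in names:
                scores[name] += weight
    return dict(scores)

# Two hypothetical strains that share a prefix and differ at the 3' end.
genomes = {"strain_A": "ACGTACGTTT", "strain_B": "ACGTACGAAA"}
weights = build_weights(genomes, n=4)
scores = score_read("ACGTTT", weights, n=4)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy case the shared n-gram contributes equally to both strains, while the two n-grams unique to `strain_A` decide the assignment; summing such read assignments over a whole sample would give a rough abundance estimate.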

Protein-protein interaction (PPI) networks carry vital information about protein function. In this project, we developed graph comparison and graph mining algorithms to detect subgraphs, each representing a distinct functional module in the cell, that are common to all cancers and those that are distinct to each cancer type. This work was conducted in two phases. In the first phase, we constructed nine cancer PPI networks using differentially expressed genes from the Oncomine dataset. From these networks we discovered frequent patterns that occur in all networks and at different size levels. By using effective canonical labeling and adopting weighted adjacency matrices, we are able to perform graph isomorphism tests in polynomial running time. Validation of the frequent common patterns using GO semantic similarity showed that the discovered subgraphs scored consistently higher than randomly generated subgraphs at each size level.

The molecular profiles of different cancer types differ greatly; hence, discovering the distinct functional modules associated with specific cancer types is important for understanding their distinct functions. We developed a new graph theory based method to identify distinct functional modules from nine different cancer protein-protein interaction networks. The method is composed of three major steps: (i) extracting modules from protein-protein interaction networks using network clustering algorithms; (ii) identifying distinct subgraphs from the derived modules; and (iii) identifying distinct subgraph patterns from the distinct subgraphs. The subgraph patterns were evaluated using experimentally determined cancer-specific protein-protein interaction data from the Ingenuity knowledgebase, to identify distinct functional modules that are specific to each cancer type.

Power-law distribution of PPI networks from nine different cancers

Distribution of distinct subgraphs in PPI networks across the IPA cancer-specific networks

This has been a long-standing project in our laboratory for over a decade. We have published several computational methods for predicting proteins targeted to different subcellular locations, using amino acid composition and location-specific domains (MitoPred and pTarget) or n-gram based Bayesian classification (ngLOC). A web server was also developed to make online predictions for proteins from the prokaryotic (Gram-negative and Gram-positive), fungal, plant and animal kingdoms at http://genome.unmc.edu/ngLOC. We also released an open-source standalone software package to run proteome-scale predictions on local servers. Our ngLOC method was applied to estimate the subcellular proteomes of a number of eukaryotic species. Using the n-gram based classification approach, we developed a new method to identify class-specific subcellular localization signals, some of which are potential novel targeting signals. We have experimentally validated the protein localization predictions from our ngLOC method using GFP-fusion proteins followed by confocal microscopy.
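To give a feel for n-gram based Bayesian classification, here is a minimal sketch of a multinomial naive Bayes classifier over sequence n-grams with add-one smoothing. It is a generic textbook formulation under stated assumptions, not the ngLOC implementation; the function names, the choice of n, the toy sequences and the two class labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def ngrams(seq, n):
    """Return the list of overlapping n-grams in a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def train(examples, n):
    """Count n-grams per class from (sequence, label) pairs."""
    counts = defaultdict(Counter)
    class_totals = Counter()
    vocab = set()
    for seq, label in examples:
        class_totals[label] += 1
        for gram in ngrams(seq, n):
            counts[label][gram] += 1
            vocab.add(gram)
    return counts, class_totals, vocab

def predict(seq, model, n):
    """Return the class with the highest smoothed log-posterior."""
    counts, class_totals, vocab = model
    total_examples = sum(class_totals.values())
    best_label, best_logp = None, -math.inf
    for label in class_totals:
        logp = math.log(class_totals[label] / total_examples)
        denom = sum(counts[label].values()) + len(vocab)
        for gram in ngrams(seq, n):
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            logp += math.log((counts[label][gram] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training set with made-up peptide fragments and labels.
examples = [
    ("MKKLLLAAA", "secreted"),
    ("MLLLKKAAL", "secreted"),
    ("PKKKRKVEE", "nuclear"),
    ("KRKRPKKKR", "nuclear"),
]
model = train(examples, n=2)
label = predict("AKKKRKA", model, n=2)
print(label)
```

N-grams that receive high probability in one class and low probability in the others are natural candidates for class-specific localization signals, which is the intuition behind using the same model for signal discovery.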

The n-gram model for representing proteins in ngLOC

Experimental validation of predicted localization of human proteins

An overview of the frequency distribution of amino acids in the signal set for each of the eight subcellular organelles