Short Abstract: RNA interference (RNAi) has been established as an efficient tool for systematic, high-throughput investigation of loss-of-function phenotypes, thus representing a rich source of functional gene annotation. RNAi phenotypes cover a wide range of biology and include, for example, transcriptional readouts, morphological observations, and viability studies - based on experiments in cell culture, specific tissues or whole organisms.

The GenomeRNAi database aims to collect and make available RNAi data obtained from the literature or by direct submission. All data undergo a process of manual curation, following structured annotation guidelines to ensure consistent presentation and comparability of the data. As of version 10, the database contains phenotype data from 137 cell-based experiments in Homo sapiens, as well as 173 screens in Drosophila melanogaster, 53 of which were performed in vivo. Additionally, the database holds RNAi reagent information, which includes calculations assessing their specificity and efficiency.

GenomeRNAi is publicly available at www.genomernai.org, allowing searches by gene, reagent, or phenotype, browsing through the RNAi screen table, or downloading data in tab-delimited format. GenomeRNAi provides many links to external resources; mutual links have been established with FlyBase, UniProt and GeneCards. The website features a “Frequent hitters” page, highlighting genes that frequently show a phenotype, which is also available for download. A DAS server for GenomeRNAi phenotypes and reagents has been implemented, which also serves a dynamic genome browser on the GenomeRNAi website. Furthermore, GenomeRNAi data have been integrated into the FlyMine query tool. The implementation of an interface for direct submissions by data producers is in progress.

Poster - F03

ToxWorkshop: an extendable workflow software for data processing of massive and diverse data

Short Abstract: While the recent advent of new technologies such as DNA microarrays and next-generation sequencers has brought a growing influx of data, it also poses a serious challenge: how to analyze such massive and diverse data effectively. In toxicology in particular, these new technologies have given birth to an emerging field called toxicogenomics, in which the Toxicogenomics Project (TGP) in Japan developed TG-GATEs, a system with a database storing microarray and other toxicological data for over 150 drugs in vivo (rat) and in vitro (rat and human).

To tackle this, we developed ToxWorkshop, a .NET framework-based Windows application. ToxWorkshop is an extendable workflow environment where a user can arrange any number of analyzers (add-in packages) as a workflow. Typically, users sequentially connect analyzers for database access, data conversion, statistical analysis and visualization to finally obtain an output.

The advantage of ToxWorkshop over other software lies in its extensibility and maintainability. Users can easily and quickly build their own custom analyzers. They only have to focus on coding analytical logic, as ToxWorkshop provides the basic infrastructure for user interface, error handling, and asynchronous processing. Once created, custom analyzers can be shared with other researchers and used repeatedly.

We used ToxWorkshop to evaluate the accuracy of our newly identified predictive gene markers of liver weight gain in rats, utilizing the TG-GATEs database. ToxWorkshop can dramatically reduce the time and effort needed for such a non-uniform analysis, where manual calculation is virtually impossible. (ToxWorkshop is still under development and has not been released for public use.)

Poster - F04

Gene mapping based on whole genome sequencing data

Jürgen Claesen, Hasselt University, Belgium

Tomasz Burzykowski (Hasselt University, CenStat Belgium);

Short Abstract: The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining these technologies with pooling of segregants, as performed in bulked segregant analysis, should in principle allow the simultaneous mapping of multiple genetic loci present throughout the genome.

We propose two methods to analyze the marker data obtained by next-generation sequencing: a scatterplot smoother and a hidden Markov model. The latter includes several states, each associated with a different probability of observing the same/different nucleotide in an offspring as compared to the parent. The transitions between the SNPs imply transitions between the states of the model. After estimating the between-state transition probabilities and state-related probabilities of nucleotide (dis)similarity, the most probable state for each SNP can be selected. The most probable states can then be used to indicate regions in the genome with a high probability of nucleotide (dis)similarity, i.e., regions that are likely to contain trait-related genes.

We also present a semi-parametric approach that uses marker information from a pool of segregants and provides a smoother-based testing procedure for discovering genomic regions that contain potential gene loci contributing to the phenotypic trait of interest.
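As an illustration of the hidden Markov model idea described above, the following is a minimal sketch of posterior (forward-backward) decoding over a sequence of SNPs. It is not the authors' implementation: the function name `posterior_states`, the two-state setup, and all probabilities in the usage example are assumptions chosen for illustration, and the parameters are taken as given rather than estimated from data.

```python
def posterior_states(obs, start, trans, emit_same):
    """Most probable hidden state for each SNP (posterior decoding).

    obs        -- list of 0/1 per SNP (1 = same nucleotide as the parent)
    start      -- start[i] = initial probability of state i
    trans      -- trans[i][j] = P(state j at next SNP | state i at this SNP)
    emit_same  -- emit_same[i] = P(observation is 1 | state i)
    All parameters are hypothetical inputs; estimating them is not shown.
    """
    if not obs:
        return []
    n = len(start)

    def emit(i, o):
        return emit_same[i] if o == 1 else 1.0 - emit_same[i]

    # Forward pass, normalised at every SNP to avoid numerical underflow.
    prev = [start[i] * emit(i, obs[0]) for i in range(n)]
    total = sum(prev)
    prev = [p / total for p in prev]
    fwd = [prev]
    for o in obs[1:]:
        cur = [emit(j, o) * sum(prev[i] * trans[i][j] for i in range(n))
               for j in range(n)]
        total = sum(cur)
        prev = [c / total for c in cur]
        fwd.append(prev)

    # Backward pass, also normalised at every SNP.
    bwd = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        nxt = [sum(trans[i][j] * emit(j, obs[t + 1]) * bwd[t + 1][j]
                   for j in range(n))
               for i in range(n)]
        total = sum(nxt)
        bwd[t] = [x / total for x in nxt]

    # The posterior is proportional to forward * backward at each SNP.
    states = []
    for f, b in zip(fwd, bwd):
        post = [fi * bi for fi, bi in zip(f, b)]
        states.append(max(range(n), key=post.__getitem__))
    return states


if __name__ == "__main__":
    # Hypothetical example: state 0 = unlinked (50% similarity),
    # state 1 = linked to the trait (95% similarity).
    snps = [0] * 5 + [1] * 20 + [0] * 5
    print(posterior_states(snps, [0.5, 0.5],
                           [[0.9, 0.1], [0.1, 0.9]], [0.5, 0.95]))
```

Runs of SNPs assigned to the high-similarity state would then mark candidate regions. Viterbi decoding would be an equally reasonable choice if a single most probable state path is preferred over per-SNP posteriors.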

Short Abstract: Nonsense-mediated mRNA decay (NMD) is an RNA surveillance system that degrades transcripts containing a premature termination codon. Alternative splicing coupled with NMD is a mechanism of post-transcriptional gene regulation that we found to affect mRNA levels of thousands of human genes. The prevailing model for defining a premature termination codon in mammals is the 50nt rule: a termination codon more than 50 nucleotides upstream of an exon-exon junction is recognized as premature and triggers NMD. There is evidence that this rule works in Arabidopsis but not in other eukaryotes such as Drosophila. There are reports that a longer 3' UTR triggers NMD in plants, flies, and mammals. We have performed RNA-Seq analysis on cells in which NMD has been inhibited via knockdown of UPF1, a key effector of NMD. By comparing isoform abundance in NMD-inhibited cells to that in normal cells, we discovered that hundreds to thousands of transcripts are degraded by NMD in human, zebrafish, and fly. We found that the 50nt rule is a strong predictor of NMD in human cells and seems to act in zebrafish and, surprisingly, in fly. In contrast, we found very little correlation between the likelihood of degradation by NMD and 3' UTR length in the absence of a 50nt rule premature termination codon in any species.
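The 50nt rule stated above can be written as a short check on a spliced transcript. This is a minimal sketch under assumed conventions: the function name, the use of 1-based transcript coordinates, and the representation of a transcript as a list of exon lengths are all illustrative choices, not taken from the abstract. Testing against the last junction is sufficient, because a stop codon that is more than 50 nt upstream of any junction is also more than 50 nt upstream of the last one.

```python
def is_ptc_by_50nt_rule(stop_codon_end, exon_lengths, threshold=50):
    """Return True if the stop codon counts as premature under the 50nt rule.

    stop_codon_end -- 1-based transcript position of the last nucleotide
                      of the stop codon (assumed convention)
    exon_lengths   -- lengths of the exons in 5' to 3' order
    threshold      -- minimum distance (nt) upstream of the junction
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon-exon junction
    # Number of nucleotides upstream of the last exon-exon junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > threshold


if __name__ == "__main__":
    exons = [100, 200, 150]  # hypothetical transcript, last junction at 300
    print(is_ptc_by_50nt_rule(240, exons))  # stop 60 nt upstream of junction
    print(is_ptc_by_50nt_rule(280, exons))  # stop 20 nt upstream of junction
```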

Short Abstract: Peroxisomes are small, ubiquitous eukaryotic cell organelles that mediate a wide range of metabolic functions such as photorespiration, fatty acid beta-oxidation and response to biotic and abiotic stress. Recent advances have begun to reveal the unexpectedly large plant peroxisomal proteome to increase our understanding of metabolic pathways in peroxisomes. Large-scale plant genome sequencing will soon allow detailed comparative computational analyses of many different peroxisomal proteomes. The results should be instrumental in defining the functional and metabolic inventory of plant peroxisomes and developing molecular strategies for improvement of food and biofuel production.

We here present the first approach to functional and metabolic characterization of plant peroxisomal proteomes from different phylogenetic clades using bioinformatics and machine-learning methods. Our pipeline involves the prediction of peroxisomal proteins from diverse complete plant genomes followed by a homology search-based identification of clade-specific conserved gene families. By mapping the resulting peroxisomal proteomes to Gene Ontology terms, Pfam domain families and KEGG metabolic pathways we obtain functional and metabolic profiles of different algae, mosses, monocotyledons and dicotyledons. Our computational analyses of metabolic profiles from peroxisomal proteomes and complete genomes reveal significantly enriched peroxisomal pathways that have previously been unknown. Furthermore, we apply machine learning techniques to functional profiles from different clades to identify known and novel discriminative peroxisomal functions and pathways in algae and seed plants. Future work will comprise experimental verification of newly identified proteins and pathways and the extension of our method to other phylogenetic branches and other organelles.

Robin Smith (University of California San Francisco, Department of Bioengineering and Therapeutic Sciences/Institute for Human Genetics United States); Mee Kim (University of California San Francisco, Department of Bioengineering and Therapeutic Sciences/Institute for Human Genetics United States); Nadav Ahituv (University of California San Francisco, Department of Bioengineering and Therapeutic Sciences/Institute for Human Genetics United States); Ivan Ovcharenko (National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Computational Biology Branch United States);

Short Abstract: Tissue-specific expression is controlled by proximal promoter and distant enhancer sequences. Our results suggest that both promoters and enhancers cooperate very closely to determine tissue specificity. Moreover, the identification of particular sequence signatures is sufficient to identify enhancers de novo.

We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a Support Vector Machine (SVM) classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% (73/79) of the tissues under consideration, with an area under the receiver operating characteristic curve (AUC) between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters), providing evidence for the abundance of tissue-specific regulatory signatures in promoters. We next used these classifiers to identify tissue-specific enhancers, scanning noncoding sequences in the loci of the 200 most highly and lowly expressed genes in each of the 73 tissues with reliable classifiers, while excluding promoter regions. Thirty percent of reliable promoter-based classifiers produced consistent predictions of enhancers, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes (e.g., over 5-fold enrichment in the case of liver). The accuracy of the enhancer predictions generated by the promoter-based models was assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent (7/12) of liver-enhancer predictions yielded robust enhancer activity in the mouse liver, versus zero for the controls (0/5).

Short Abstract: Alternative splicing is central to cellular processes and substantially increases transcriptome and proteome diversity. The emergence of next-generation RNA sequencing provides an exciting new technology to analyze alternative splicing on a large scale. We present a new method and software to predict genes that are differentially spliced between two different conditions using RNA-seq data. Our method employs geometric angles between the high-dimensional vectors of exon read counts. With this, differential splicing can be detected even if the splicing events are of higher complexity and involve previously unknown splicing patterns. We applied our approach to two case studies, including neuroblastoma tumour data with favorable and unfavorable clinical courses, and show the validity of our predictions as well as the applicability of our method in the context of patient clustering, simulated experiments, and association with specific regulatory splicing factor motifs in the regulated gene sequences.
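To make the geometric idea concrete, here is a minimal sketch of the angle between two exon read count vectors of the same gene. The function name and the example counts are assumptions, and the abstract does not specify how angles are turned into a test statistic, so only the angle itself is shown. A uniform change in expression (all exons scaled by the same factor) leaves the angle near zero, while a change in relative exon usage opens the angle.

```python
import math


def exon_angle(counts_a, counts_b):
    """Angle in degrees between two exon read count vectors of one gene."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both vectors must cover the same exons")
    dot = sum(a * b for a, b in zip(counts_a, counts_b))
    norm_a = math.sqrt(sum(a * a for a in counts_a))
    norm_b = math.sqrt(sum(b * b for b in counts_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("angle is undefined for an all-zero count vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # Hypothetical counts for a three-exon gene in two conditions.
    print(exon_angle([10, 20, 30], [20, 40, 60]))  # same usage, 2x expression
    print(exon_angle([10, 20, 30], [30, 2, 28]))   # middle exon mostly skipped
```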

Short Abstract: MicroRNAs are short RNA molecules that are involved in the regulation of gene expression by binding to mRNAs, usually resulting in translational repression or mRNA degradation. There are very few experimentally validated miRNA-mRNA interactions compared to the total expected number. However, in the last few years there has been an intense proliferation of predictive algorithms to determine the targets of these non-coding RNAs. These algorithms take into account the sequence complementarity, their structure and the thermodynamics of the binding process, and assign a score to every predicted interaction. Using the existing predictive algorithms, we have measured the confidence for each of the interactions by estimating the precision of the prediction when compared to the experimentally validated information. We have also created a new combined predictive database that merges the existing predictive algorithms, giving every interaction a new combined score and its statistical confidence. This global score allows us to combine several databases without low-performing algorithms dragging down well-performing ones.

The combined database uses miRNA targets from four sources containing experimentally validated interactions: Tarbase, miRTarBase, miRWalk and miRecords. Predicted interactions were retrieved from EIMMo, DIANA-microT, Microcosm, Microrna.org, TargetScan, Mirtarget, PITA, miRWalk-predictive and TargetSpy. A few data analysis methods have been included to exploit the collected information. These include an enrichment analysis to determine the proportion of targets of every miRNA that are significantly downregulated in transcriptomic experiments. This is possible by providing gene expression information or by searching automatically in the GEO database.

Short Abstract: The study of the plant cell wall is still limited by the paucity of our knowledge of cellulose biosynthesis-related genes, despite the amount of high-throughput gene expression data available. Recently, considerable effort was undertaken to derive information which also captures the spatiotemporal dynamics of gene expression in complete cellular systems. A particular example is a gene expression compendium generated for different developmental stages as well as root cell types of Arabidopsis thaliana. Other studies extended our view beyond the transcriptome by monitoring cell-specific abundances of genes and gene products on the level of the translatome and proteome.

Clearly, such a ‘multi-omics’ data set poses challenges for data integration and, so far, there have been no clear results pertaining to the degree of similarity of patterns obtained for genes and gene products on different system levels. In particular, it has been repeatedly reported that on the level of the transcriptome and proteome only weak linear relationships are present between patterns of protein abundances and corresponding gene expression levels. By including the intermediate system level of the translatome, an analysis of cell wall-related biology is presented. In particular, classical computational approaches relying on pairwise similarities (i.e. relevance networks utilizing Pearson’s correlation coefficient) are extended by projection-based approaches, such as canonical correlation analysis (CCA). Briefly, we employ CCA to obtain linear combinations of gene co-expression patterns which are maximally correlated with their respective protein abundances. Finally, our approach allows identifying and confirming key genes associated with the production of cell wall polymers.

Poster - F11

De novo prediction of the genomic components and capabilities for microbial plant biomass degradation from (meta-)genomes

Short Abstract: Understanding the biological mechanisms used by microorganisms for plant biomass degradation is of considerable biotechnological interest. Despite the growing number of sequenced (meta-)genomes of plant biomass degraders, there is currently no technique for the systematic determination of the genomic components of this process from these data.

We describe a computational method for the discovery of the Pfam domains and CAZy families involved in microbial plant biomass degradation. Our method furthermore accurately predicts this phenotype from microbial genome sequences. Application to a manually curated data set of microbial degraders and non-degraders identified gene families of enzymes well known to be implicated in cellulose degradation, such as GH5 and GH6. Additionally, genes of enzymes that degrade other plant polysaccharides were found, as well as protein families which have not previously been related to the process. For draft genomes reconstructed from a cow rumen metagenome, our method predicted Bacteroidetes-affiliated species and a relative of a known plant biomass degrader to be plant biomass degraders. This was supported by enzymatically active glycoside hydrolases encoded in these genomes.

Our results show the potential of the method for generating novel insights into microbial plant biomass degradation from (meta-)genome data, where there is an increasing production of genome assemblages for uncultured microbes.

Short Abstract: Adipose tissue is made up of numerous cell types, including adipocytes that store energy in the form of fat. The two types of adipose tissue found in humans and other mammals are white adipose tissue (WAT) and brown adipose tissue (BAT). BAT is the prominent form of fat in newborns, and brown adipocytes contain multiple small lipid droplets and high mitochondrial numbers. In contrast, white adipocytes contain a single large lipid droplet, and WAT is the prominent adipose tissue of adult humans. BAT is specialized in energy dissipation and the generation of heat by oxidation of glucose and fatty acids, whereas WAT is wired for energy storage.

The project addresses the postnatal "transformation" of the innate brown fat to white adipose tissue in sheep (Ovis aries). From earlier studies it is known that this transformation takes place within about two weeks after birth. The idea is that, as the sheep is a large mammal, the transformation in sheep mimics the postnatal brown-to-white adipose conversion occurring in newborn human babies.

To find out the underlying mechanism of this transformation, adipose tissue was collected from sheep at multiple time points, including time points before and shortly after birth. Tag-RNA sequencing was performed on the samples to assess differential gene expression between time points. Time series analysis along with linear regression was employed to find significantly changing genes. Gene ontology and pathway enrichment analyses were performed on the significantly changing gene clusters to uncover the biological mechanism underlying the transformation.

Poster - F13

Computational estimation of microRNA targets and binding site frequencies

Short Abstract: Recent estimates suggest that almost 30% of animal genes are targets of microRNAs, which are among the most abundant non-coding RNAs in humans and plants. These endogenous, ~22-nucleotide RNAs represent a novel post-transcriptional level of regulation of gene expression. It is known that sequence complementarity with a hexamer at the miRNA 5' end (the “seed”) is sufficient for the repression of most animal mRNAs and that G:U (“wobble”) pairs are unfavorable to it. However, these rules do not apply in many cases, and it is not yet possible to reliably predict the targets of groups of miRNAs. To analyze how each miRNA affects global gene-expression programs, we compare gene expression with miRNA targets. Because direct targets are predicted to have lower expression in the presence of miRNAs, we expect them to be enriched among the downregulated genes. Here we compare two different methods to calculate the binding site enrichment of miRNAs in their downregulated putative targets. First, we determine the proportion of downregulated genes in the whole genome and compare it with the proportion of downregulated genes among miRNA targets. If the proportions are significantly different (by a Wilson approximation to the hypothesis test of equality of two proportions following a binomial distribution), this indicates a potential miRNA regulation mechanism. As a second method, we determine whether downregulated mRNAs are enriched for putative miRNA target sites, using the average number of seed sequences found per 1 kb of their 3’ UTR sequences, and compare this frequency to that of the whole transcriptome.
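The two measures described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract's Wilson approximation is not spelled out, so a plain pooled two-proportion z-test is used here as a stand-in, and the function names, the example counts, and the seed-match hexamer are all hypothetical.

```python
import math


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test (stand-in for the Wilson approximation).

    k1/n1 -- downregulated / total genes among miRNA targets
    k2/n2 -- downregulated / total genes in the comparison set
    Returns (z, two-sided p-value under the normal approximation).
    """
    pooled = (k1 + k2) / (n1 + n2)
    variance = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if variance == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (k1 / n1 - k2 / n2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def seed_sites_per_kb(utr, seed_match):
    """Number of (possibly overlapping) seed matches per 1 kb of 3' UTR."""
    utr = utr.upper().replace("U", "T")
    seed_match = seed_match.upper().replace("U", "T")
    if not utr or not seed_match:
        raise ValueError("sequence and seed match must be non-empty")
    width = len(seed_match)
    count = sum(1 for i in range(len(utr) - width + 1)
                if utr[i:i + width] == seed_match)
    return 1000.0 * count / len(utr)


if __name__ == "__main__":
    # Hypothetical counts: 60 of 100 targets vs 300 of 1000 other genes down.
    print(two_proportion_z(60, 100, 300, 1000))
    # Hypothetical 1 kb UTR containing two copies of a made-up seed match.
    print(seed_sites_per_kb("GCACTT" + "A" * 494 + "GCACTT" + "A" * 494, "GCACTT"))
```

The z-test assumes the two gene sets are independent, so in this sketch the comparison set should exclude the targets themselves.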

Poster - F14

An integrative approach based on random forest for the improved prediction of miRNA targets

Short Abstract: MicroRNAs (miRNAs) are small noncoding RNAs widely found in animals and plants. They are typically 20 to 24 nucleotides long and regulate gene expression post-transcriptionally by base-pairing interactions with an mRNA target, thereby (i) inhibiting mRNA translation or (ii) destabilizing the targeted mRNA.

We present a method for the improved prediction of miRNA targets that is based on the dissection of experimentally validated targets into (i) a direct readout component that can be represented as a position weight matrix (PWM), (ii) a sequence-dependent structural component composed of a variety of miRNA target-specific structural characteristics and (iii) a component that takes into account the positional dependencies of nucleotides (NPDs). We make use of the random forest algorithm to flexibly exploit these types of information.

We show that the predictive values of each of these components are complementary and have developed an integrative approach that flexibly combines two or three different approaches to establish the best possible prediction of microRNA targets. Results obtained with 34 miRNAs show that our method significantly improves classification accuracy for all 34 miRNAs compared with miRanda, MiRTif and MicroTar. Models developed in this study contain information about structural features specific for the miRNA-mRNA interaction and can be of great use for gaining insight into the mechanisms of miRNA binding.

Poster - F15

Decoy Features in Ensemble and Frequency-Based Feature Selection for Biomarker Discovery with Microarrays

Short Abstract: Background: Feature selection is a crucial step for biomarker discovery with microarrays. However, it has been shown that for typical microarray studies suffering from the “n << p problem” the selected features are highly dependent on the choice of training samples, and the selected biomarker panels are often unstable and not reproducible. Therefore, ensemble and frequency-based feature selection approaches have been proposed to obtain stable biomarker panels. Unfortunately, no statistically interpretable measure has been proposed for where to cut off the feature rankings obtained from these selection methods.

Results: We have therefore added as many randomly drawn “decoy features” as there were original features to several different microarray data sets. After applying ensemble and frequency-based feature selection algorithms to each extended data set, these decoy features have been used to estimate false discovery rates (FDRs) for all positions in the resulting rankings. Subsequently, these FDRs have been used to select final feature panels by setting FDR-based cut-offs (e.g., FDR < 0.05). Furthermore, the stability and the classification performance of these feature panels as well as panels resulting from alternative cut-offs have been assessed.

Conclusions: We propose a decoy feature-based approach to estimate FDRs for all list positions of ensemble and frequency-based feature selection results. This approach enables a statistically interpretable and reliable final selection of biomarker candidate panels for microarray studies. Moreover, the results of the exemplary data are expected to show improved stability and classification performance of the FDR-based feature panels in comparison to the alternative panels.
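The decoy idea above can be sketched in a few lines. The abstract does not give the exact FDR formula, so the estimator used here is an assumption: because as many decoys as original features are added, the number of decoys in the top k positions is taken as an estimate of the number of false original features there. The function names and the example ranking are hypothetical.

```python
def decoy_fdr(ranked_is_decoy):
    """Estimated FDR at every position of a feature ranking.

    ranked_is_decoy -- list of booleans in ranking order (best first),
                       True where the feature at that rank is a decoy.
    Assumed estimator: decoys so far / original features so far, capped at 1.
    """
    fdrs = []
    decoys = 0
    originals = 0
    for is_decoy in ranked_is_decoy:
        if is_decoy:
            decoys += 1
        else:
            originals += 1
        fdrs.append(min(1.0, decoys / originals) if originals else 1.0)
    return fdrs


def cutoff_position(fdrs, max_fdr=0.05):
    """Longest list prefix whose estimated FDR is below max_fdr (0 if none)."""
    best = 0
    for position, fdr in enumerate(fdrs, start=1):
        if fdr < max_fdr:
            best = position
    return best


if __name__ == "__main__":
    # Hypothetical ranking: mostly original features at the top,
    # decoys accumulating further down the list.
    ranking = [False] * 30 + [True] + [False] * 5 + [True] * 4
    fdrs = decoy_fdr(ranking)
    print(cutoff_position(fdrs, max_fdr=0.05))
```

The final panel would consist of the original (non-decoy) features ranked above the cut-off position.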

Short Abstract: Background: Prevention and personalized medicine are key issues of contemporary medical research. Multi-OMICS approaches aim at measuring the dynamics of the most important bio-molecules (i.e. genes, mRNAs, proteins and metabolites) in order to gain a better understanding of the complex regulation of a cell. In the medical context, such efforts are promising for the discovery of novel biomarkers and the development of new drug targets. However, processing and interpretation of multi-OMICS data is usually challenging and requires a structured workflow. Results: Here, we present such a scheme for the processing of Proteomics and Transcriptomics (mRNA and miRNA) data. The workflow comprises several steps of data conversion, quality control, data comparison, text mining and statistical analyses. Additionally, a software tool named CrossPlatformCommander is sketched, which facilitates several steps of the proposed workflow in a semi-automatic manner. The performance of the workflow is shown using a hepatocellular carcinoma data set obtained from the multi-OMICS project named PROFILE (http://www.profile-project.de/). Conclusion: A workflow / software solution is proposed that handles and integrates both Proteomics and Transcriptomics data. Utilization of this approach was shown for the detection of novel biomarkers. The final result is a list of ranked and annotated biomarker candidates that can be further validated using an independent data set or an independent method.

Short Abstract: Background: The rewarding effect and the physiological benefits of music-listening on human health are well acknowledged, but the underlying molecular mechanisms and biological pathways triggered by music-listening remain largely unknown. Here, using Illumina Human HT-12 v4, we analyzed the gene expression profiles in the peripheral whole blood of 41 subjects before and after music-listening to understand its effect on bodily functions.

Conclusion: These findings provide the primary evidence for the effect of music-listening on human gene expression and immune responses. A balanced immune homeostasis after music-listening substantiates the benefits of music-listening on human well-being.

Short Abstract: The Expression Atlas (http://www-test.ebi.ac.uk/gxa) is a semantically enriched database of meta-analysis based summary statistics over a curated subset of ArrayExpress, servicing queries for condition-specific gene expression patterns as well as broader exploratory searches for biologically interesting genes/samples. It is the successor to the current version, http://www.ebi.ac.uk/gxa, and supports the analysis of both baseline and (contrast-based) differential expression patterns.

Baseline analysis aims primarily at representing information about which gene transcript is present (and at what abundance) under normal conditions (e.g. tissue, cell line, cellular component), and currently includes a number of large, well-curated, comparative RNA-seq experiments (e.g. Illumina Body Map, ENCODE cell lines). The interface allows for expression analysis across various expression level cut-offs, and for focusing on expression specific to conditions of interest.

Differential analysis has been performed for microarray and RNA-seq experiments (studying mRNA and microRNA). All experiments have been well curated and analyzed to define a set of contrasts (e.g. disease vs normal, or time-course) that best capture the research intent behind each experiment. Differential expression is then calculated and presented for each experiment-gene-contrast (and array design in the case of microarray experiments). Finally, users are able to search for genes, keywords as well as contrasts within and across experiments.

Gene set enrichment analysis, and visualisation of gene/transcript expression data alongside other ‘omics data (e.g. proteomics data - for experiments in which both types of analysis were performed) are to be included in the Expression Atlas interface in the near future.

Short Abstract: Cell-type-specific gene expression is generally regulated by combinatorial interactions among transcription factors (TFs) binding to the DNA. Information about TFs’ binding affinity to distal and proximal regulatory sequences can help determine which combinations of factors work together to regulate their target genes in a cell-type-specific manner.

In this study, we detect co-regulating TF pairs in 34 healthy human cell types based on statistical analysis of estimated ranked lists of TFs’ target genes. Specifically, we first scanned all cell-type-specific DNase hypersensitive sites (DHS) for single TF-DNA binding affinities using motifs for 160 TFs and ranked the DHS by their predicted binding affinity separately for each TF. We then studied the similarity of pairs of the ranked lists stratified by cell type by applying a statistical test for multiway contingency tables. The significant TF pairs defined by the test in each cell type were validated by known protein-protein interactions (PPIs) and by co-binding of TFs in ChIP-seq data. We found that the known PPIs are significantly enriched (up to 12-fold) in the groups of our predicted co-regulating TFs and that we can recover a majority (56%) of predicted co-binding TF pairs from the ChIP-seq analysis. Furthermore, the predicted co-regulating TFs are supported in the literature to be active regulators in the corresponding cell types. Our findings show that cell-type-specific gene expression is regulated by a large number of combinatorial TF interactions with dominating central regulators. However, the TF interaction networks differ substantially even for related cell lines.

Short Abstract: The Type III secretion system is a key mechanism for the transport of effector proteins of pathogenic and endosymbiotic Gram-negative bacteria into the cytoplasm of eukaryotic cells. Despite the importance of this virulence system for public health and agriculture, tools for the identification of effector proteins in un-annotated bacterial genomes are lacking. We developed a new two-step machine learning approach for the prediction of Type III effector proteins. First, we build evolutionary profiles of the new bacterial proteins using comparisons to existing protein databases. This information is processed by a specifically-trained SVM (Support Vector Machine) that reaches 96% accuracy/precision in identifying over half of the bacterial effectors. Second, we use a set of sequence-based features, including sequence length, amino acid composition, secondary structure information, the presence of localization signals and unstructured regions, to identify additional poorly conserved effector proteins. Overall, our method achieves high levels of 86% accuracy/precision and 80% coverage/recall when evaluated on non-redundant test data.

We used our method to make predictions of secreted effectors in the proteomes of more than 2000 fully sequenced bacteria. We identify the majority of known effector proteins in annotated organisms and suggest novel candidates for further experimental validation in newly sequenced ones. Our fast and accurate approach to whole-genome screenings for Type III effector proteins provides significant insight into the nature of this system.

Poster - F21

An Ensemble framework for Small World Clustering Coefficient for protein-protein interaction networks in Plasmodium falciparum.

Short Abstract: Protein-protein interaction networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. Clustering algorithms are the most commonly used computational method for analyzing microarray gene expression data. They have been used by LeRoch et al. and Bozdech et al. to classify genes into functional modules, namely metabolisms and metabolic pathways. The results obtained have left us with many putative functional genes. Also, the experimental results in the Hagai database provide limited information about this. Recent work by Gangman et al. and Young et al. introduced the use of gene ontology, but the results are also very limited in their application to Plasmodium falciparum.

The clustering coefficient has been used successfully to summarise important features of weighted, directed graphs across a wide range of applications. Small-world networks are efficient communication networks that have two characteristics: a short path length and a high clustering coefficient. We implemented the small-world clustering coefficient introduced by Watts, D. J. and Strogatz, S. H. in C/C++ and tested it using high-throughput data of Plasmodium falciparum. We applied the algorithm to the data obtained from Plasmodium falciparum and compared our derived functional modules with the pathway and metabolism classifications in PlasmoDB.

The biological roles and relevance of these genes can be inferred from the functional modules of Plasmodium falciparum, which biologists may find useful for experimental tests to ascertain the validity of these genes.
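The two small-world characteristics named above, a high clustering coefficient and a short path length, can be computed for an unweighted, undirected network as sketched below. This is an illustrative Python sketch, not the C/C++ implementation mentioned in the abstract; the function names, the adjacency-dictionary representation, and the toy network are assumptions.

```python
from collections import deque


def clustering_coefficient(adj):
    """Average local clustering coefficient (Watts-Strogatz definition).

    adj -- dict mapping each node to the set of its neighbours (undirected).
    Nodes with fewer than two neighbours contribute 0.
    """
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        nb = list(neighbours)
        # Count edges that connect two neighbours of the current node.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest path length over all connected ordered node pairs."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("network has no connected node pairs")
    return total / pairs


if __name__ == "__main__":
    # Hypothetical toy network: a triangle A-B-C with a pendant node D on C.
    toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
    print(clustering_coefficient(toy), average_path_length(toy))
```

Comparing both values against those of a random network with the same number of nodes and edges is the usual way to judge whether a network is small-world.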

Poster - F22

Systematic characterization of novel factors involved in Th17 cell differentiation and their dynamics in mouse and human

Short Abstract: Lymphocytes are responsible for immune responses. In the presence of certain cytokine signals, naive CD4+ T helper (Th) cells can differentiate into Th17 cells, which provide immunity against extracellular bacterial and fungal pathogens but are also connected to autoimmune diseases. Immune response is commonly studied using animal models, especially mice. We have generated time-course RNA-seq gene expression data from early time points of activated T cells and Th17 cells, where activation and differentiation from naive Th cells in mouse and human are induced under the same culture conditions. Statistical analysis of the RNA-seq data shows that hundreds of orthologous genes are differentially expressed by the developing Th17 cells compared to activated cells.

Approaches based on Gaussian process regression can easily handle biological replicates and unevenly distributed observations, and do not make strong parametric assumptions about the function that describes the whole time series, which is why they are an appealing method, e.g. for calling differentially expressed genes. We apply and extend the LIGAP method [1] to compare gene expression dynamics between mouse and human. We have two models to compare: an orthologous gene has similar Th17 differentiation profiles in mouse and human, or the profiles differ. Our genome-wide results will provide a solid basis for identification of novel factors and signalling pathways crucial for Th17 cell differentiation in both human and mouse.
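The two-model comparison described above can be sketched in Python as a difference of Gaussian process log marginal likelihoods. This is an illustration of the general idea only, not the LIGAP implementation: the squared-exponential kernel, the fixed hyperparameters (length scale 1, signal variance 1, noise variance 0.1) and the toy profiles are all assumptions.

```python
import math


def rbf_kernel(t1, t2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two time points."""
    return signal_var * math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def log_marginal_likelihood(times, values, noise_var=0.1):
    """Log evidence of a zero-mean GP with fixed hyperparameters."""
    n = len(times)
    cov = [[rbf_kernel(times[i], times[j]) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    # Forward substitution: solve low * z = values, so z.z = y' K^-1 y.
    z = []
    for i in range(n):
        z.append((values[i] - sum(low[i][k] * z[k] for k in range(i)))
                 / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def log_bayes_factor_differ(times_a, values_a, times_b, values_b):
    """Log evidence ratio: separate profiles versus one shared profile."""
    separate = (log_marginal_likelihood(times_a, values_a)
                + log_marginal_likelihood(times_b, values_b))
    shared = log_marginal_likelihood(list(times_a) + list(times_b),
                                     list(values_a) + list(values_b))
    return separate - shared


# Toy profiles (hypothetical log expression values for one orthologous gene).
times = [0.0, 1.0, 2.0, 3.0]
mouse = [0.0, 1.0, 2.0, 1.0]
human_opposite = [-v for v in mouse]
print(log_bayes_factor_differ(times, mouse, times, human_opposite))
```

A positive value favours the "profiles differ" model and a negative value favours the shared profile. In practice the hyperparameters would be fitted per gene rather than fixed.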

Short Abstract: In mammals, most vital processes are subject to daily variations (circadian rhythm), which are reflected in precisely timed gene expression. A major goal of our study is to understand how circadian genes are regulated in different phases during the circadian cycle. Mapping DNase I hypersensitive (DHS) sites on chromatin is a powerful and well-established method to identify many different types of regulatory elements. Here, we report a high-throughput analysis of DHS sites in mouse liver that uses massively parallel signature sequencing of tags from a DNase library generated from C57 black 6 mouse liver at 4 h time resolution over a 24 h daily cycle (ZT conditions). Using Pol II, H3K4me3 and H3K27ac ChIP-Seq data sets around the clock, we characterize the DHS sites and study the dynamics of each mark. Finally, we use the DNA sequence motif content of each DHS site and a linear regression model to explain their temporal behavior in terms of phase specificity.

Poster - F24

Expression divergence between Escherichia coli and Salmonella enterica serovar Typhimurium and the relation to pathogenicity

Short Abstract: Escherichia coli K12 is a commensal bacterium and one of the best-studied model organisms. Salmonella enterica serovar Typhimurium, on the other hand, is a facultative intracellular pathogen. These two prokaryotic species are phylogenetically related and share a large amount of their genetic material, which is commonly termed the 'core genome'. Despite their shared core genome, the two species display very different lifestyles, and it is unclear to what extent the core genome, apart from the species-specific genes, plays a role in this lifestyle divergence. In this study, we focus on the differences in expression domains for the orthologous genes in E. coli and S. Typhimurium. The iterative comparison of coexpression methodology was used on large expression compendia of both species to uncover the conservation and divergence of gene expression. We found that gene expression conservation occurs mostly independently of amino acid similarity. According to our estimates, more than one quarter of the orthologous genes have a different expression domain in E. coli than in S. Typhimurium. Genes involved in key cellular processes are most likely to have conserved their expression domains, whereas genes showing diverged expression are associated with metabolic processes that, although present in both species, are regulated differently. The expression domains of the shared 'core' genome of E. coli and S. Typhimurium, consisting of highly conserved orthologs, have been tuned to help accommodate the differences in lifestyle and the pathogenic potential of Salmonella.

Poster - F26

Deciphering the role of long non-coding RNAs in human lymphocytes by integrating genome-mapping and de-novo strategies

Short Abstract: Recent genome-wide studies have shown that long non-coding RNAs (lncRNAs) are pervasively transcribed in the genome and are emerging as powerful new players involved in transcriptional, posttranscriptional and epigenetic mechanisms of gene regulation. Since the mechanisms that control the regulation of human lymphocytes by lncRNAs are poorly understood, as is their expression in these cells, we purified 13 different human lymphocyte subsets (from T-CD4+, T-CD8+ and B lymphocyte populations) from peripheral blood to perform a comprehensive transcriptome analysis by RNA-seq using the Illumina platform. We collected more than two billion RNA-seq reads across a panel of 63 purified lymphocyte samples to identify specific or new lncRNAs for each subset using both reference-based and de novo assembly approaches. For the identification of new lncRNAs specifically expressed in our cells, we adopted two mapping-first approaches (TopHat and STAR as mappers and Cufflinks for the identification of new transcripts) and an assembly-first de novo method based on Trinity. The new transcripts are then filtered against a set of requirements that discriminate potential new lncRNAs among all mRNAs identified (sequence length > 200 bases, at least two exons, no match to any known protein domain from Pfam, and a low predicted coding potential score). We found that different lncRNAs are preferentially expressed in specific lymphocyte subsets and that their expression patterns change in a very specific manner during T cell differentiation. Not only did we identify lymphocyte-subset-specific lincRNA signatures, but loss/gain-of-function experiments also suggest that lncRNAs are key players in maintaining lymphocyte cell identity.
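The four filtering requirements listed in this abstract can be written as a simple predicate. The Python sketch below is illustrative only: the record fields are hypothetical, and the coding-potential cutoff of 0.5 is an assumption, since the abstract says only that the score must be "low" and does not name the scoring tool.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    length: int          # spliced transcript length in bases
    exon_count: int
    pfam_hits: int       # number of matches to known Pfam protein domains
    coding_score: float  # predicted coding potential (tool not specified)


# Hypothetical cutoff; the abstract does not give a numeric threshold.
MAX_CODING_SCORE = 0.5


def is_candidate_lncrna(t, max_coding_score=MAX_CODING_SCORE):
    """Apply the four criteria: length, exon count, Pfam, coding potential."""
    return (t.length > 200
            and t.exon_count >= 2
            and t.pfam_hits == 0
            and t.coding_score < max_coding_score)


# Made-up example transcripts for illustration.
assembled = [
    Transcript("tx_a", 1500, 3, 0, 0.10),  # passes all criteria
    Transcript("tx_b", 180, 2, 0, 0.05),   # too short
    Transcript("tx_c", 2200, 1, 0, 0.08),  # single exon
    Transcript("tx_d", 3100, 5, 2, 0.90),  # Pfam hits and high coding score
]
candidates = [t.name for t in assembled if is_candidate_lncrna(t)]
print(candidates)
```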

Short Abstract: COLOMBOS is a publicly available access portal to comprehensive organism-specific cross-platform expression compendia for bacterial organisms. It provides a suite of tools for exploring, analyzing, and visualizing the data within these compendia. The expression compendia themselves are built based on a proprietary methodology that is unique in directly combining the data from different technological platforms. COLOMBOS also incorporates extensive annotations for both genes and experimental conditions; these heterogeneous data are functionally integrated in the analysis tools to interactively browse and query the compendia not only for specific genes or experiments, but also metabolic pathways, transcriptional regulation mechanisms, experimental conditions, biological processes, etc. Several improvements have been made. Content-wise, we have invested in the development of a compendia creation and management system that has enabled us to greatly expand existing compendia (Escherichia coli, Bacillus subtilis, and Salmonella Typhimurium) as well as add compendia for other species. Additionally, the current version supports the inclusion of RNAseq data. Functionally, we have revamped the interface with new interactive visualization and analysis tools, a bicluster tree algorithm for discovering complex coexpression patterns around a set of query genes, and inclusion of noise models for measurement errors, enabling analysis of differential expression with measures of statistical significance. This work is relevant to a large community of microbiologists by facilitating the use of publicly available genome-wide expression data to support their research, as well as providing a useful resource for top-down systems biology applications.

Short Abstract: Metagenome sequencing is revealing thousands of novel gene families from microbial communities with biotechnologically or medically relevant capabilities, such as the ability to survive at high temperature, digest cellulosic materials or consume the greenhouse gas methane. Identification of the genes encoding the pathways or protein complexes that define such a phenotype, beyond homologs of genes with known function, remains a challenge. We have previously described inference of functional modules with a Bayesian probabilistic topic model from the co-occurrence patterns of gene families across microbial genomes. We here describe a new probabilistic topic model-based method named Metaphor, which identifies functional modules and gene clusters distinctive for a phenotype of interest. Using 2884 genomes and 18 metagenome samples, we studied the phenotype of microbial plant biomass degradation. The inferred functional modules revealed previously unknown phenotype-defining gene families and predicted the phenotype and relevant gene clusters of lignocellulose degraders. We identified functional modules of microbial plant biomass degraders linked to lignocellulose, xyloglucan, xylan, pectin and oligosaccharide degradation. The protein family content and distribution of these functional modules and their gene clusters across genomes indicate that microbial lignocellulose degradation according to the known free enzyme and cellulosome-associated degradation ‘paradigms’ can be decomposed into different combinations of these modules. Metaphor can also be applied to study other phenotypes of interest.

Short Abstract: Long noncoding RNAs (lncRNAs) are a large and heterogeneous class of RNAs not translated into proteins that vary in size from 200 bp to >100 kb, are generally transcribed by RNA polymerase II, and are often spliced and polyadenylated. While their functional characterization is still in its infancy, several lines of evidence highlight their involvement in many cellular processes and pathological conditions, generally through the formation of complexes with proteins. Nevertheless, the protein binding capabilities of lncRNAs are not yet easily discernible from their sequence and/or structure, creating a conspicuous gap in the knowledge of the interaction network in all cellular machineries involving these molecules. We aimed at obtaining a detailed picture of the interaction networks that link lncRNAs and their protein partners by developing procedures for the inference of the rules governing protein-RNA interactions, training and testing them on experimentally determined protein-RNA pairs, and modeling the interaction at all possible levels (from primary to quaternary structure). Through the analysis of RNA folding, emerging patterns of conserved local substructures, dissection of x-ray protein-RNA complexes, RNA 3D modeling and docking, and integration with experimentally determined binding between RNA and proteins by means of CLIP-seq, RIP-seq and their variants, our goal is to determine the rules responsible for the interaction and to obtain a detailed (from 1D to 4D) picture of the determinants of protein-lncRNA interactions.

Poster - F30

The characteristics of the DNA structural profile of prokaryotic promoter regions

Short Abstract: The genomic promoter regions, which control the expression levels of their downstream genes, exhibit different characteristics in their local DNA molecular structure than the remainder of the genome. These characteristics of the DNA structure contain valuable insight into transcription processes and are indeed already frequently used in promoter prediction tools. In this study, the structural patterns present in the molecular DNA structure were compared across a variety of prokaryotic organisms based on accurate transcription start site information that is publicly available. Promoter regions were found to be on average less stable, more rigid and more curved than the genomic DNA across all studied prokaryotes. Further, sets of promoters could be grouped based on similar structural properties, with each set displaying a unique structural profile. These sets could then be related to regulation by specific sigma factors or to certain expression behaviors of the downstream genes. However, comparison between different organisms revealed large differences in the structural profiles found, with larger evolutionary distances resulting in greater differences. It could be concluded that there is great variety in the structural DNA properties of promoter regions, which is likely related to the functionality of the promoter.

Short Abstract: Most complex diseases, such as susceptibility to mastitis, have a complex inheritance and may result from variants in many genes, each contributing only a small effect to the trait. Genome-wide association studies have successfully identified numerous loci at which common variants influence complex diseases. However, the variants identified as being statistically significant have generally explained only a small fraction of the heritable component of the trait. Insufficient modelling of the underlying genetic architecture may in part explain this missing heritability. Evidence collected across GWAS for complex diseases reveals patterns that provide insight into the genetic architecture of complex traits. Although many genetic variants with small or moderate effects contribute to the overall genetic variation, it appears that multiple independently associated variants are located in the same genes and that these variants are enriched for genes that are connected in biological pathways or for likely functional effects on genes. These biological findings provide valuable insight for developing better genomic models. These are statistical models for predicting complex trait phenotypes on the basis of SNP-data and trait phenotypes and can account for a much larger fraction of the heritable component. A disadvantage is that this “black-box” modelling approach conceals the biological mechanisms underlying the trait. We propose to open the “black-box” by building SNP-set genomic models that evaluate the collective action of multiple SNPs in genes, biological pathways or other external findings on the trait phenotype. As proof of concept we have tested the modelling framework on several traits in dairy cattle.

Poster - F32

Using D2P2 to explore the association between exon architecture and protein domains and disorder

Short Abstract: D2P2 (http://d2p2.pro) is a database of protein disorder using 9 different predictors, to which we added exon architectures for a selection of 88 eukaryotic genomes, from animals, plants, algae and protists. Using this resource, we carried out a comparison between exon architecture and features at the protein level, including both compact domains and disordered protein structure.

This study has highlighted different modes of exonic evolution for certain protein families and biological functions. We show that exon length distributions for complete genomes are sufficient to relate and distinguish similar species. The previously reported phenomenon of a correlation between globular domains and exon boundaries in metazoa is found to hold true for all species in this study, including plants and protists. These findings suggest that it may be possible to provide annotation and quality control on next-generation sequencing reads based only on open-reading-frame prediction and knowledge of related genomes.

Short Abstract: The Clusters of Orthologous Groups of genes (COGs) database is a comprehensive collection of gene families. It is used as a gene-family definition database to extract the maximum amount of information from fast-growing genome collections. In recent years, however, non-orthologous groups have been discovered within the same cluster. We present here a novel approach to the analysis and curation of the COGs database.

We worked with the current COGs database, which contains 4873 clusters with a total of 138,458 proteins from NCBI. We used a nonparametric Bayesian approach to validate the consistency of the COGs database. In the framework of a nonparametric Bayesian approach, model parameter distributions of multiple discrete “support” points were observed, up to one per subject in the population. Each support point is a set of point estimates of each model parameter value, plus the probability of that set.

We found that there were outliers in unimodal distributions and evidence of mergers between two or more clusters, resulting in multimodal distributions of gene lengths. We also noted that in at least 35% of clusters from the COG database, the distribution of gene length cannot be approximated using a single distribution, but rather should be considered as a mixture of two or more distributions.

We examined the COGs for multimodality and analyzed the clusters for the presence of outliers. We have developed a publicly accessible database and a web portal illustrating our findings, which can be found at www.glacombio.net/cog

Poster - F34

tRanslatome: an R/Bioconductor package to portray translational control

Short Abstract: High-throughput technologies have led to an explosion of genomic data available for automated analysis, making it possible to simultaneously sample multiple layers of variation along the gene expression flow. This effort requires computational techniques suitable for integrating raw information coming from different ‘-omics’ layers. It has been recently demonstrated that translational control is a widespread phenomenon, with profound and still underestimated regulation capabilities. While detecting changes in total mRNA levels (the transcriptome) and in the polysomal loading of mRNAs (the translatome) is experimentally feasible in a high-throughput way, the integration of these levels is still far from being robustly approached. Here we introduce tRanslatome, a new computational suite implemented as an R/Bioconductor package, representing a complete platform for the analysis of data coming from high-throughput assays conducted simultaneously at the transcriptome and the translatome levels. The package includes most of the available statistical methods developed for next generation sequencing and microarray data, allowing the parallel comparison of differentially expressed genes and the corresponding differentially enriched biological themes. Notably, it also enables the prediction of translational regulatory elements on mRNA sequences. The utility of this tool is demonstrated with a case study.

Short Abstract: Alternative splicing is one of the main processes in eukaryotic cells that enable genes to produce different forms of proteins. Although the diversity in protein isoforms is crucial for the survival of organisms, abnormal variations can cause the development of diseases such as cancer. Therefore, a good understanding of alternative splicing is very important. Recently, Glaus et al. (2012) have developed BitSeq, a Bayesian approach for estimation of alternatively spliced transcript expression and differential expression from RNA-seq data. Kalaitzis & Lawrence (2011) have applied Gaussian process (GP) regression to model the temporal behavior of gene expression levels and suggested a Bayes factor approximation to rank the genes which show temporal changes in their expression levels.

We extend the Kalaitzis & Lawrence model by incorporating technical and biological variance estimates from BitSeq into the GP models. We fit time-independent and time-dependent GP models to log-transformed relative transcript abundances and rank the transcripts by their Bayes factors to identify the ones with temporally varying relative abundances.

Evaluation with synthetic data shows that incorporating the technical and biological variance increases both the sensitivity and specificity of the method. We also apply our method to a 10-point RNA-seq time series from the MCF7 breast cancer cell line treated with estradiol, and identify a number of genes with significant changes in their transcript ratios.

Short Abstract: In cancer biology, tumor suppressor genes play a pivotal role. They encode proteins that normally inhibit tumor formation caused by abnormal cellular proliferation. These proteins can participate in a variety of processes such as negative regulation of the cell cycle, positive regulation of apoptosis, regulation of DNA damage response, or other mechanisms. Determining the functional status of these genes is an important aspect of understanding the underlying tumor biology.

We have built a comprehensive computational framework for assessing the functional status of tumor suppressor genes. Recently, the Broad Institute published a public resource, the “Cancer Cell Line Encyclopaedia” (CCLE), which contains mRNA expression, Affymetrix SNP 6.0 profiles, OncoMap mutation screening and exome sequencing data from nearly 1000 cancer cell lines. Using all 4 data types available from this encyclopaedia, our framework can comprehensively determine tumor suppressor status.

Here we present the database of systematically and comprehensively derived status of 69 known or putative tumor suppressors across CCLE. A browsable online interface of the database is available at www.glacombio.net/tsgs.

Short Abstract: Alternative splicing and its implications in diseases such as cancer are an important and highly active field of research. Exon arrays provide an established technology for the detection of alternative splicing events (ASE). Since their introduction to the field, a variety of computational methods have been developed with the aim of predicting AS. However, no evaluation and comparison of these methods at a broad and comprehensive scale has been performed yet. To this end, we gathered and implemented different published and new algorithms.

The specific aim of our study was to assess the different methods based on certain data properties. This set of parameters potentially influencing the performance of a method is divided into known and unknown ones. Parameters that are known to the researcher are, for instance, the number of exons a gene contains or the number of samples in the study. Unknown parameters are the number of exons alternatively spliced per gene or the percentage of samples containing an ASE in one group. By evaluating a great variety of parameter settings based on artificial data, we provide a basis for choosing an adequate method in individual research scenarios.

Poster - F38

A new miRNA promoter recognition method uncovers the complex regulation of intronic miRNAs

Short Abstract: The regulation of intragenic miRNAs by their own intronic promoters, as well as the contribution of intronic versus host gene promoters in different tissues or disease states, is one of the most interesting open problems in the study of miRNA biogenesis. Although a few methods have been developed in the past few years for miRNA promoter recognition, they are unable to detect intronic miRNA promoters at high sensitivity. The difficulty of experimentally detecting and consequently annotating miRNA promoters has limited our ability to identify the regulatory circuits that control miRNA expression, and has therefore prevented a comprehensive analysis of intronic promoter characteristics and usage. We propose PROmiRNA, a new approach for miRNA promoter annotation based on a semi-supervised statistical model trained on deepCAGE data and sequence features. Compared to previous methods, PROmiRNA increases the detection rate of intronic promoters by 30%, thereby allowing a large-scale analysis of their genomic features and elucidating their contribution to tissue-specific regulation. We validate the promoters identified by the model against a significant number of existing annotated miRNA TSSs, and we demonstrate the reliability of the method in detecting new intronic miRNA promoters by confirming them with PolII occupancy data. In addition, we experimentally validate the novel intronic promoters of miR-130a by means of a promoter reporter assay. In this study we are able for the first time to uncover the different regulatory elements that distinguish intergenic and host gene miRNA promoters from intronic promoters. This provides insights into the mechanisms of miRNA regulation by means of intronic promoters.

Poster - F39

MIRTIL: towards the construction of a database for miRNAs expressed in Nile tilapia (Oreochromis niloticus)

Short Abstract: Nile tilapia (Oreochromis niloticus) is a native cichlid fish of Egypt that has been globally recognized as a commercially valuable fish due to its ease of breeding and growing in a variety of aquaculture systems. Given this importance, it would be interesting to investigate the molecular mechanisms underlying the tissue development and homeostasis of Nile tilapia to discover, for example, how to increase its production for human consumption. As it is widely known that miRNAs--short ~22-nucleotide RNA sequences that regulate gene expression through binding to complementary sequences in the 3’UTR of mRNAs--play key roles in tissue development, the exogenous control of miRNA expression would be a promising tool to produce the phenotype of interest. However, this type of manipulation is only possible if an entire list of miRNAs expressed in different tissues at different stages of development along with their regulators and targets is compiled. Moreover, such data should be organized in a way that the information regarding each miRNA can be easily retrieved. To this end, we are developing MIRTIL, a database containing tilapia miRNAs detected by next generation sequencing (RNA-seq) in which users will be able to retrieve the following information for each entry: genomic location, raw and normalized expression levels, tissue of origin, developmental stage, sex, regulating transcription factors and known and predicted targets. So far, we have collected and organized miRNAs expressed in red and white muscles from male and female individuals. In the near future, we will add miRNAs expressed in other tissues.

Poster - F40

Global meta-analysis of transcriptomics studies

Susana Vinga, INESC-ID, Portugal

José Caldas (INESC-ID, KDBIO Portugal);

Short Abstract: Meta-analysis of transcriptomics data aims at recycling existing data to derive novel biological hypotheses, and is motivated by the public availability of a large number of independent studies. Current methods are based on breaking down studies into multiple comparisons between phenotypes (e.g. disease vs. healthy), based on the studies’ experimental designs, followed by computing the overlap between the resulting differential expression signatures. While useful, in this methodology each study yields multiple phenotype comparisons, and connections are established not between studies, but rather between subsets of the studies corresponding to phenotype comparisons. We propose a rank-based statistical meta-analysis framework that establishes global connections between transcriptomics studies without breaking down studies into sets of phenotype comparisons. By coupling the rank product method with a Gamma distribution significance approach, our framework extracts global features from each study, corresponding to genes that are consistently among the most expressed or differentially expressed genes in that study. Those features are then statistically modeled via a term frequency-inverse document frequency (TF-IDF) model, which is then used for connecting studies. Our framework is fast and parameter-free; when applied to large collections of Homo sapiens and Streptococcus pneumoniae transcriptomics studies, it performs better than nonparametric correlation-based approaches in retrieving related studies, using a Medical Subject Headings (MeSH) gold standard. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections.
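The TF-IDF step described above treats each study as a "document" whose "terms" are its feature genes. The Python sketch below shows one plausible reading of that idea, not the authors' framework: the weighting formula (relative term frequency times log inverse document frequency), the cosine similarity used to connect studies, and the study and gene names are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(study_features):
    """Map each study to a sparse TF-IDF vector over its feature genes."""
    n_studies = len(study_features)
    # Document frequency: number of studies in which each gene is a feature.
    doc_freq = Counter(g for genes in study_features.values()
                       for g in set(genes))
    vectors = {}
    for study, genes in study_features.items():
        term_freq = Counter(genes)
        vectors[study] = {
            g: (count / len(genes)) * math.log(n_studies / doc_freq[g])
            for g, count in term_freq.items()
        }
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(g, 0.0) for g, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature genes per study, e.g. from a rank-based selection.
features = {
    "study_A": ["g1", "g2", "g3"],
    "study_B": ["g1", "g2", "g4"],
    "study_C": ["g1", "g5", "g6"],
}
vectors = tfidf_vectors(features)
sim_ab = cosine_similarity(vectors["study_A"], vectors["study_B"])
sim_ac = cosine_similarity(vectors["study_A"], vectors["study_C"])
print(sim_ab, sim_ac)
```

Because g1 is a feature in every study, its inverse document frequency is zero and it contributes nothing to any connection; studies A and B are linked only through the rarer shared gene g2.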

Poster - F41

Cyanobacteria pose new problems to old methods: Challenges in microarray time series analysis

Short Abstract: Cyanobacteria are phototrophic microorganisms of global importance and have recently attracted significant attention due to their capability to synthesize organic compounds, including biofuels, directly from sunlight and carbon dioxide. Unlike most heterotrophic bacteria, cyanobacteria exhibit a diurnal lifestyle, driven by changes in the availability of sunlight. Correspondingly, the transcriptomes of several cyanobacterial strains have recently been shown to exhibit diurnal oscillations, reflecting the phototrophic lifestyle of the organisms. The analysis of such genome-wide transcriptional oscillations is often facilitated by the use of clustering algorithms in conjunction with a number of pre-processing steps. Biological interpretation is usually focused on the time and phase of expression of the resulting groups of genes. However, the use of microarray technology requires the normalization and pre-processing of data, with unclear impact on the qualitative and quantitative features of the derived information on the number of oscillating transcripts and their respective phases. Here, we present a microarray-based evaluation of diurnal expression in the cyanobacterium Synechocystis sp. PCC 6803. We demonstrate that many commonly employed normalization methods introduce a systematic bias in the observed expression phases and therefore compromise biological interpretation. We suggest strategies by which such detrimental effects can be avoided. Integration of our clustering results with external biological knowledge yields insights into the regulation of cellular and metabolic processes on the transcriptional level.

Poster - F42

Discovering how Non-Gene-Encoding Elements Contribute to Stress Regulation in Yeast.

Short Abstract: This poster is based on Proceedings Submission for ISMB/ECCB 2013. The last few years have seen significant progress in the understanding of the non-gene-encoding segments of DNA, showing a picture of a wide range of elements that often have important regulatory roles in the cell. In this work, we investigate how non-gene-encoding elements may contribute to the regulation of stress response in yeast. Our experiments use both micro-array data and more recent high-throughput RNA-Seq data. In both cases we obtained expression data on yeast both in its wild type, to be used as control, and after being subjected to several forms of stress, including NaCl osmotic shock, heat stress, and ethanol stress. We investigate the main statistics of the non-coding genes in both cases and compare them to the statistics for the coding genes. The problem of detecting differential expression in micro-arrays and more recently in RNA-Seq data has been intensively studied. We apply state-of-the-art tools such as limma for micro-arrays and DESeq, edgeR or EBArrays for the RNA-Seq data and search for the most differentially expressed coding genes and non-coding genes. We validate our results by matching coding genes with their respective Gene Ontology annotations, and by matching non-coding genes with overlapping and/or nearby coding ORFs. Ultimately, our goal is to reconstruct how non-coding elements may interact with the main stress pathways. We propose a logical statistical approach towards this goal, based on the ProbLog system.

Short Abstract: Allele specific expression (ASE) is a marker for cis-regulatory variation influencing transcription. We present two inter-related projects on this topic.

First, we investigated if the magnitude of ASE varies under different biological conditions. Detection of such condition-dependent ASE can provide a functional association of genetic variants by relating them to the transition of a cellular phenotype from one state to another.

We studied condition-dependent ASE by deep transcriptome-wide sequencing of primary white blood cells from eight human individuals before and after the controlled induction of an inflammatory response by treatment with lipopolysaccharide (LPS). We developed methods for condition-dependent ASE analysis, including an ASE false discovery rate estimation procedure.

We detected condition-dependent ASE related to the inflammatory response at ten to fifty-five genetic loci per individual. These loci may represent genetic variants associated with the cellular response and transition to an inflammatory phenotype. Condition-dependent ASE was validated by real-time PCR in nine out of ten selected variants.

This is the first systematic study of condition-dependent ASE. The real-time PCR validation confirms the presence of condition-dependent ASE and suggests that our methodological approaches are useful to detect such ASE.

Second, we present a software suite, named SimulASE, to simulate allele-specific RNA-seq data. To our knowledge, no RNA-seq software available today is capable of simulating allele-specific expression.

Short Abstract: We present a novel method to score binding sites obtained from NGS experiments, and generate a highly reliable regulatory network using a combination of three scoring approaches. The combination of these scoring methods can rapidly filter out spurious binding sites and results in higher accuracy of the regulatory network. We have tested our approach with actual genome-wide NFKB/RelA peaks derived from TNF-treated A549 cells. We show using Informativity analysis of enriched GO terms that the regulatory network derived using our approach is much better than one generated using naive approaches.

Short Abstract: RNA-Seq provides an exciting new approach for semi-quantitative expression profiling. On one hand, expression estimates do not suffer from the non-linear effects seen on dye-based platforms such as typical microarrays. On the other hand, expression estimates for less strongly expressed genes are very noisy due to the sampling effects seen in count-based methods. We here compare the response characteristics of RNA-Seq and a modern custom microarray using External RNA Control Consortium (ERCC) spike-ins. ERCC spike-ins were added to mRNA samples in known ratios and abundances. Platform-specific response characteristics can then be studied by comparing the signal response to the expected values. We show that at sufficiently high expression levels, the expected ratios are accurately and precisely recovered for RNA-Seq. For microarrays, non-processed signals show non-linear saturation effects. Application of modern signal models, however, allows a correction for these technical effects, yielding results matching RNA-Seq in accuracy and precision. It is noteworthy that the compared platforms behave very differently for lower expression levels. As expected from theory, RNA-Seq suffers from strong sampling effects whereas microarrays show an attenuated signal response for weakly expressed genes.
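
The comparison of observed and expected spike-in ratios can be sketched as follows. This is an illustrative Python fragment, not the analysis used for the poster: the function names, the pseudocount and the read counts are assumptions, and the noise estimate assumes pure Poisson sampling, for which the delta method gives a variance of roughly 1/count for the natural log of a count.

```python
from math import log, log2, sqrt


def log2_ratio_error(count_a: float, count_b: float, expected_ratio: float,
                     pseudocount: float = 0.5) -> float:
    """Observed minus expected log2 ratio for one spike-in (0 means perfect recovery)."""
    observed = log2((count_a + pseudocount) / (count_b + pseudocount))
    return observed - log2(expected_ratio)


def approx_log2_ratio_sd(count_a: float, count_b: float) -> float:
    """Delta-method standard deviation of a log2 ratio of two Poisson counts."""
    return sqrt(1.0 / count_a + 1.0 / count_b) / log(2.0)


# Hypothetical read counts for spike-ins mixed at a known 4:1 ratio.
for a, b in [(4000, 1000), (40, 10), (4, 1)]:
    print(a, b, log2_ratio_error(a, b, 4.0), approx_log2_ratio_sd(a, b))
```

Under this simple model the expected spread of the log ratio grows as counts fall, which is the sampling effect described for weakly expressed genes.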

Short Abstract: Rationale: Bronchopulmonary Dysplasia (BPD) is a major complication of premature birth. Risk factors for BPD are complex and include prenatal infection and O2 toxicity. BPD pathology is equally complex and characterized by inflammation, dysmorphic airspaces and vasculature. Due to the limited availability of clinical samples, an understanding of the molecular pathogenesis of this disease, its causal mechanisms and associated biomarkers is limited. Objectives: Apply genome-wide expression profiling to define pathways affected in BPD lungs. Methods: Lung tissue was obtained at autopsy from 11 BPD cases and 17 age-matched non-BPD controls. RNA isolated from these tissue samples was interrogated using microarrays. Standard gene selection and pathway analysis methods were applied to the data set. Abnormal expression patterns were validated by qPCR and immunohistochemistry. Measurements and Main Results: We identified 159 genes differentially expressed in BPD tissues. Pathway analysis indicated previously appreciated (e.g., DNA damage regulation of cell cycle) as well as novel (e.g., B cell development) biological functions were affected. Three of the 5 most highly induced genes were mast cell (MC)-specific markers. We confirmed an increased accumulation of connective tissue MCTC (chymase expressing) mast cells in BPD tissues. Increased expression of MCTC markers was also demonstrated in an animal model of BPD-like pathology. Summary: We present a unique genome-wide expression data set from human BPD lung tissue. Our data provide information on gene expression patterns associated with BPD, and facilitated the discovery that MCTC accumulation is a prominent feature of this disease. These observations have significant clinical and mechanistic implications.

Short Abstract: Growing evidence suggests that aggregation prone proteins are both harmful and functional for a cell. How do cellular systems balance the detrimental and beneficial effect of protein aggregation? We reveal that aggregation prone proteins are subject to differential transcriptional, translational and degradation control compared to non-aggregation prone proteins, which leads to their decreased synthesis, low abundance and high turnover. Genetic modulators that enhance the aggregation phenotype are enriched in genes that influence expression homeostasis. Moreover, genes encoding aggregation prone proteins are more likely to be harmful when over-expressed. The trends are evolutionarily conserved and suggest a strategy whereby cellular mechanisms specifically modulate the availability of aggregation prone proteins to (i) keep concentrations below the critical ones required for aggregation and (ii) shift the equilibrium between the monomeric and oligomeric/aggregate form as explained by Le Chatelier's principle. This strategy may prevent formation of undesirable aggregates and keep functional assemblies/aggregates under control.

Short Abstract: We present an approach that uses information mined from scientific articles for flagging possibly toxic compounds. The mining of the scientific articles was carried out using a next-generation text mining technique for knowledge discovery based on the vector space model to relate two concepts (such as chemicals and genes) to each other. We compared the performance of next-generation text mining in predicting toxic properties with a classic approach using profiles stored in the Comparative Toxicogenomics Database. Specifically, we created chemical response-specific gene sets based on the two approaches and tested these with three different gene set analysis methods against gene expression experiment data from different types of experimental models (human cell lines and cells, mouse models, and mouse embryonic stem cells). We conclude that gene set analysis with gene sets from next-generation text mining complemented the classic approach, in particular for compounds for which information is scarce.

Poster - F49

Reproducible mRNA and small RNA sequencing across different laboratories

Short Abstract: RNA-sequencing is an increasingly popular technology for genome-wide analysis of transcript structure and abundance. However, the sources of technical and inter-laboratory variation have not been assessed in a systematic manner. To address this, seven centers of the GEUVADIS consortium sequenced mRNAs and small RNAs of 465 HapMap lymphoblastoid cell lines (incl. large numbers of replicates). The variation between laboratories appeared to be considerably smaller than the already limited biological variation. Laboratory differences mainly manifested as differences in insert size and GC content. The randomized study design allowed nearly full correction of these laboratory effects. In small RNA sequencing, the miRNA content differed widely between samples due to competitive sequencing of rRNA fragments. This did not affect relative quantification of the miRNAs. We conclude that distributed RNA-sequencing is feasible when proper standardization and randomization procedures are used. The combined sequencing data from this project significantly extended our understanding of the genetic basis of transcriptome variation and generated an unprecedented resource of novel transcripts and eQTLs.

Poster - F50

Next Generation Sequence Analysis of the Transcriptional Response to Neonatal Hyperoxia

Short Abstract: Bronchopulmonary Dysplasia (BPD) is a major complication of preterm birth associated with significant morbidity. Due to the complexity of risk factors and limited availability of clinical samples, understanding of this disease, its potential biomarkers and causal mechanisms is limited. Rodent models involving neonatal exposure to excessive oxygen concentrations (hyperoxia) have been used to study the mechanisms contributing to BPD pathology. Transcriptomic assessment of the effects of hyperoxia in neonatal mouse lungs using RNASeq will identify novel genes and pathways associated with BPD. We examined gene expression changes in lung tissue from newborn C57Bl/6 mice exposed to 100% oxygen for 10 days and room air-exposed age-matched controls. Total RNA was isolated from individual whole lung tissues and pooled in duplicates to perform RNASeq using the Illumina platform. Alignments were generated by multiple algorithms. Normalized gene expression data were filtered to remove undetected genes. SAM and CuffDiff2 were used to identify genes with differential expression. Ingenuity Pathway Analysis (IPA) was used for pathway and network analyses. Expression patterns for selected genes were examined by quantitative polymerase chain reaction (qPCR). A total of 248 genes were identified as differentially expressed between hyperoxia and control samples by SAM and CuffDiff2, and had a fold change ≥ 2. We successfully validated 24 of 31 genes by qPCR. Interestingly, most genes significantly affected in hyperoxic lungs showed the opposite expression change during organogenesis, consistent with arrested lung development. 13% of the genes demonstrating a significant response to hyperoxia were also dysregulated in human BPD lung tissue.

Short Abstract: In animals, RNA binding proteins (RBPs) and microRNAs (miRNAs) post-transcriptionally regulate the expression of virtually all genes by binding to RNA. In this context, we have developed a method to profile the protein occupancy on mRNA transcripts by next-generation sequencing (Baltz et al., 2012). Our current work focuses on streamlining and extending this protein occupancy profiling. Our objectives are to identify previously unknown protein-bound transcripts and, more importantly, to assess global and local differences in protein occupancy across different biological conditions. To this end, we have implemented poppi, the first pipeline for the analysis of protein occupancy profiles. Poppi provides modules for quality control, read mapping, visualization, comparison to genomic features and assessment of replicates reproducibility. Moreover, it allows the identification of significant differences in RBP occupancy between different biological conditions. We show the relevance of our pipeline by analyzing the impact of the ribosome footprint on the overall protein occupancy in HEK293 cells. Additionally, we present our work on local protein occupancy profile differences between two cell types (HEK293 and MCF7, a breast cancer cell line), reflecting changes in RBP-binding with potential functional consequences on (m)RNA processing. Revealing these changes on a transcriptome-wide level will open up a completely new perspective on the dynamics of post-transcriptional regulation.

Short Abstract: Sorghum is a cereal well adapted to water stress. However, drought stress is still a major factor reducing production in this crop. Developing sorghum cultivars tolerant to water stress is the main alternative for reducing losses under water-limited conditions. Functional genomic tools have enabled large-scale gene expression studies with unprecedented speed and accuracy, which may lead to the identification of genes involved in differential responses between genotypes under different stress conditions. The objective of this study was to compare gene expression of a drought-tolerant sorghum genotype in the presence and absence of the stress to identify candidate genes associated with this phenotype. Total RNA was extracted from roots of tolerant sorghum plants under normal irrigation (100%) and stress (50%) conditions, with two biological replicates per treatment. The cDNA libraries were constructed and sequenced by the company Eurofins (Alabama, USA). The differential expression analysis was performed using the GeneSifter® Analysis Edition software (Perkin Elmer). From this analysis, it was possible to identify 662 differentially expressed genes under conditions of water stress, of which 21 were up-regulated. The increase in expression varied between 10- and 65-fold. The proteins encoded by these genes can be grouped into those lacking functional annotation, those involved in post-translational modification, and those involved in response to stress. Future efforts should focus on the functional characterization and validation of candidate genes that could be used to obtain elite sorghum inbred lines tolerant to drought.

Short Abstract: Water deficit is one of the most common environmental stresses affecting plant productivity and geographic distribution. Plant mechanisms for adaptation and response to water-deficit stress are often characterized by the activation of specific genes. In this study, microarray data from roots of a drought-tolerant inbred maize line (Embrapa’s breeding program) under two water regimes were analyzed. The analysis revealed that 746 genes were differentially expressed, of which 455 were up- and 291 were down-regulated. Twenty-six differentially expressed genes were successfully amplified and evaluated using quantitative PCR (qPCR). These genes showed a moderate positive correlation (r = 0.39) between microarray (logFC) and qPCR (2^-ΔΔCT) results. No genes were found to display a divergent expression pattern between microarray and qPCR experiments, whereas some genes showed very similar expression values with both techniques. Using Gene Ontology, 353 differentially expressed genes were categorized within biological processes, and the most represented categories were stress and carbohydrate metabolic process. GO enrichment analysis was performed by comparing the GO terms of the differentially expressed genes found in this study with the GO terms across all genes represented on the array. In the Biological Process domain, three terminal categories were significantly overrepresented: “response to stress”, “response to abiotic stimulus” and “transport”. For the Cellular Component domain, the category “vacuole” was detected as differentially represented. This information provides valuable input for the selection of candidate genes related to drought tolerance in maize under water stress, which will be important for Embrapa’s breeding program.
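
For readers unfamiliar with the qPCR quantity correlated with microarray logFC above, the following Python sketch shows the standard 2^-ΔΔCT calculation and a plain Pearson correlation. It assumes roughly 100% amplification efficiency, as the 2^-ΔΔCT method does; the function names and all Ct and logFC values are hypothetical, and taking log2 of the qPCR fold change before correlating is a choice made here for illustration, not something the abstract states.

```python
from math import log2, sqrt


def ddct_fold_change(ct_target_stress: float, ct_ref_stress: float,
                     ct_target_control: float, ct_ref_control: float) -> float:
    """Relative expression by the 2^-ddCt method (target vs. reference gene)."""
    delta_stress = ct_target_stress - ct_ref_stress
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_stress - delta_control)


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical Ct values: (target stress, reference stress, target control, reference control).
ct_values = [(22.0, 18.0, 24.0, 18.0), (25.0, 18.0, 24.5, 18.0), (20.0, 18.0, 23.0, 18.0)]
microarray_logfc = [1.6, -0.2, 2.1]  # hypothetical log2 fold changes from the array

qpcr_log2fc = [log2(ddct_fold_change(*ct)) for ct in ct_values]
print(qpcr_log2fc, pearson_r(microarray_logfc, qpcr_log2fc))
```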

Short Abstract: Photo-Activatable Ribonucleoside-enhanced CrossLinking and ImmunoPrecipitation (PAR-CLIP) is a recently developed method for global identification of RNAs interacting with proteins. A strength of PAR-CLIP is the induction of specific T to C transitions at sites of protein-RNA interaction. However, current analytical tools do not distinguish between non-experimentally and experimentally induced transitions. In addition, geometric properties at potential binding sites are not taken into account. To fill this gap, we developed a two-step algorithm consisting of a non-parametric two-component mixture model and a wavelet-based peak calling procedure. Our algorithm can reduce the number of false positives by up to 24%, thereby identifying high-confidence interaction sites. We provide an implementation of our algorithm in the R package wavClusteR. wavClusteR was successfully employed in conjunction with a modified PAR-CLIP protocol to study the functional role of nuclear Moloney leukemia virus 10 (MOV10), a putative RNA helicase interacting with Argonaute2 and Polycomb. In this poster, we present our method and discuss the general applicability of wavClusteR to other substitution-based inference problems in genomics.

Daniela Boernigen, Harvard School of Public Health, Harvard University, United States

Daniela Boernigen (Harvard School of Public Health, Biostatistics Department United States); Yo Sup Moon (Harvard School of Public Health, Biostatistics Department United States); Levi Waldron (Harvard School of Public Health, Biostatistics Department United States); Eric Franzosa (Harvard School of Public Health, Biostatistics Department United States); Curtis Huttenhower (Harvard School of Public Health, Biostatistics Department United States);

Short Abstract: Biological databases of high-throughput experimental results provide vast and growing resources for medical and bioinformatic research. Open questions remain in how best to maintain such resources, access them computationally, meta-analyze their contents from hundreds of experiments, and do so reproducibly while maintaining computational best practices.

We include biological examples demonstrating the utility of ARepA for integrative analyses. When focusing on human data, ARepA's metadata database allowed us to identify and standardize 12 human prostate cancer gene expression datasets from GEO, which were subsequently meta-analyzed across six different platforms. A subsequent co-expression network analysis correctly recovered the NfκB signaling pathway along with new candidate genes with roles in prostate cancer. A similar example in mouse integrates 11 gene expression datasets selected by querying ARepA for metadata indicating germ-free and intestinal tissue conditions. Finally, multiple data types from three model microbes were integrated to assess differences in peptide secretion systems.

Short Abstract: Understanding complex diseases, such as cancer, requires an explanation of the intricate regulatory networks working within a cell. A key part of this puzzle is to find genes that are direct targets of Transcription Factors (TFs).

We present the BioConductor package Rcade (R-based analysis of ChIP-seq And Differential Expression). To isolate the binding events that have a functional effect in a given context, we use a Bayesian strategy that couples DE analysis with ChIP-seq analysis. The more robust read-based analysis is used in preference to peak-calling, with potential regulatory element universes including promoter regions, ChIP-seq peak sets and ENCODE-derived enhancers. We estimate the number of true TF targets through expected FDR computation.

By applying Rcade to data for various TFs, including STAT1 and the tumour suppressor p53, we illustrate that one should be wary of ascribing function to ChIP-seq peaks alone. Subsequently, we show that coupling ChIP-seq with expression data, through the Rcade package, allows us to focus on contextually functional binding sites. This yields an improved ability to find motifs of known cofactors, and greater connectivity between genes in terms of cofunctional network links. Thus, Rcade enriches for biologically meaningful information.

Poster - F57

RNA-Seq Analysis of Eucalyptus Genotypes that Differ in Carbon Allocation

Short Abstract: The global demand for wood combined with its diversified use requires the generation of trees that sequester and accumulate more carbon, providing differentiated raw material for the production of pulp, paper, charcoal and even cellulosic ethanol. Since differences in levels of gene expression may largely explain the observed phenotypic variation, and there is great variability among species of Eucalyptus, we decided to perform a gene expression analysis of four contrasting Eucalyptus genotypes to gain insight into the mechanisms that lead to differences in carbon allocation. Leaf, xylem and root samples from each genotype were used for transcriptome sequencing using Illumina Hi-Seq technology. After quality control analysis, the reads were mapped to the Eucalyptus grandis reference genome using TopHat, and the gene expression analysis was performed using CuffDiff. Pairwise comparisons were carried out between tissues and genotypes, and statistical tests were performed to assess differential expression. The analysis was executed within Galaxy, which also permitted us to visualize the read mapping. CummeRbund was used to generate result tables and charts. Gene Ontology terms were assigned to the genes using InterProScan, and Bingo was used to identify the enriched terms. We generated a total of 89.3Gb of reads, and between 70.75% and 90.33% of them were aligned to the reference genome. We then computed the FPKM values, thus allowing us to identify 26,190 differentially expressed genes out of 44,974 predicted genes. We expect to unveil candidate genes for carbon allocation that will be further investigated by transgenic over- and down-regulated expression approaches.
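
As a reminder of what an FPKM value expresses, the textbook definition (fragments per kilobase of transcript per million mapped fragments) can be written as a short Python function. This is only the basic formula with hypothetical inputs; the tools named above estimate abundances with a more elaborate statistical model, so their reported values will not in general equal this simple calculation.

```python
def fpkm(fragments: float, gene_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2,000 bp transcript in a library
# of 25 million mapped fragments.
print(fpkm(500, 2000, 25_000_000))
```

Normalizing by both transcript length and library size is what makes values comparable between genes and between samples of different sequencing depth.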

Poster - F58

Understanding the oncogene-driven interaction of coding and non-coding worlds of a tumor … is not easy

Short Abstract: In Ewing Sarcoma the chimeric transcription factor EWS-FLI1 is thought to be the driving force of oncogenesis. Therefore many studies have concentrated on the transcriptional network regulated by EWS-FLI1. In an attempt to augment these data we integrated several different data types: mRNA expression (microarray, RNA-seq), protein expression (SILAC), miR expression (stem-loop-PCR array), miR targets (PAR-CLIP), genome-wide functional miR ablation (AGO2 knockdown) and targets of EWS-FLI1 (ChIP-seq) upon perturbation of EWS-FLI1 expression. Combination of some data types yields a straightforward interpretation: e.g. a good correlation of mRNA and protein expression changes after EWS-FLI1 knockdown, suggesting no strong overall posttranscriptional effect under this condition. However, other data, especially PAR-CLIP, AGO2 knockdown and miR expression, are more challenging to fit into general models of oncogene-driven gene regulation. Results of the integrative analysis and features of data types will be discussed. Overall, we observe a complex picture of the coding and non-coding world in Ewing Sarcoma.

Short Abstract: Here we elucidate whether seed treatment prior to sowing (osmopriming) with CaCl2 can improve drought stress tolerance, as was previously reported for salt stress in wheat. Thus the effect of 50 mM CaCl2, previously chosen based on a pilot experiment, versus a water-treated control exposed to drought stress was investigated. Caryopses of two barley genotypes (tolerant Sebastian and susceptible Georgia) were subjected to osmopriming. Three-week-old plants were exposed to drought stress by withholding water, and their leaves were taken at three selected time points related to field water capacity (pF = 3.2, 3.6 and 4.2). The transcriptome analysis was performed using the GeneChip Barley Genome Array 22K (Affymetrix). Analysis of variance set up groups of differentially expressed genes (DEGs) according to the studied factors and their interactions. Functional analysis of DEGs and categorization of their GO terms were performed using Blast2Go (Conesa et al. 2005). Enrichment analyses were done using Fisher’s exact test in Blast2Go and verified using GSEA (Subramanian et al. 2005). Whenever possible, MapMan and PageMan (Usadel et al. 2005) were used to visualize the results of concerted analysis. The analysis revealed genotype-dependent changes of expression among well-known drought-related groups of genes, but also in groups strongly affected by CaCl2 treatment. The work was supported by the European Regional Development Fund through the Innovative Economy Program for Poland 2007-2013, project POLAPGEN-BD no. WND-POIG.01.03.01-00-101/08.
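
The kind of GO term enrichment test mentioned above can be illustrated with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The Python sketch below is a generic illustration using only the standard library; it does not reproduce Blast2Go's implementation, and the function name and gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_set: int, set_size: int,
                       hits_in_background: int, background_size: int) -> float:
    """One-sided Fisher's exact test for over-representation of a GO term.

    Returns P(X >= hits_in_set) when set_size genes are drawn without
    replacement from background_size genes, hits_in_background of which
    carry the term.
    """
    upper = min(set_size, hits_in_background)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / comb(background_size, set_size)


# Hypothetical numbers: 12 of 200 DEGs carry a term annotated to
# 150 of the 20,000 genes represented on the array.
print(enrichment_p_value(12, 200, 150, 20_000))
```

In practice the p-values obtained for many GO terms would also need a multiple-testing correction.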

Poster - F60

Extensive modulation of circadian transcription cycle by microRNAs

Andrey Ptitsyn, University of Florida, United States

Short Abstract: MicroRNAs are important modulators of gene expression. There is anecdotal evidence that the abundance of some microRNAs varies in a circadian or approximately daily rhythm. In this study, using publicly available data, we attempt a systematic analysis of co-expression between microRNAs and their prospective mRNA targets. The questions we are attempting to answer are: if a microRNA is expressed in a rhythm, do the target genes of this microRNA also oscillate in the same rhythm, and, if the abundance of a microRNA peaks at a certain time of the day, how does it affect the expression pattern of its multiple targets? An advanced analysis of periodicity with the application of digital filters in phase continuum revealed a baseline rhythmic oscillation in over 80% of both mRNA and microRNA populations. From these data we have selected only the top 26% of confidently rhythmic transcripts. For each transcript and microRNA we have identified the phase and the amplitude. For each microRNA in the study we have identified all predicted targets. The combined list has been clustered to eliminate cases of cross-modulation. For the resulting list of microRNAs we tallied the occurrence of all phases among target mRNAs. In spite of all imperfections in the quantitative estimation of gene expression and incorporated assumptions, we can confidently claim that at least 30% of tested microRNAs modulate expression of predicted targets. This computational observation adds evidence to the theory that microRNAs play an active role in modulation of rhythmic expression as a general rule rather than a special exception.

Short Abstract: Experimental efforts are targeting genomes, cells, and populations of organisms with a widening array of high throughput experimental techniques. The resulting datasets harbor associations between biological entities, e.g. for gene expression these are gene co-expression and co-regulation. We have developed an accurate and sensitive biclustering algorithm, Massive Associative K-biclustering (MAK), for the discovery of biological data associations across multiple data types. The algorithm framework models data archetypes, such as object-by-value (e.g. gene expression), object-by-feature (e.g. phylogenetic profiles), and object-by-object (e.g. protein interactions). The objects can be for example genes, proteins, regulators, orthologous sequence families, or experiments. For each data archetype we designed statistical criteria to detect the expected association patterns. True associations in biological data are mostly unknown, therefore to evaluate biclustering methods we design a simulated dataset modeled on yeast gene expression data. Algorithm evaluations indicate that MAK compares favorably to other biclustering methods. Subsequently we apply the MAK algorithm to reconstruct a condition-specific transcriptional co-expression network of Saccharomyces cerevisiae using combinations of gene expression, experimentally determined transcription factor binding, protein interaction, and phylogenetic profile data. Using gene expression data alone we find more biclusters with higher enrichment for known transcriptional regulation and functional terms than other biclustering methods. Combining gene expression with other data types results in tradeoffs between coverage and enrichment for known regulation and functions. We highlight examples of novel biclusters identified by MAK and discuss their biological implications. Our algorithm is embedded in a flexible framework to allow rapid customization and deployment for different input datasets and search problems.

Poster - F62

Investigating the Role of Transcribed Pseudogenes in Breast Cancer

Joshua Welch, The University of North Carolina at Chapel Hill, United States

Jan Prins (The University of North Carolina at Chapel Hill, Computer Science United States);

Short Abstract: Pseudogenes are genomic sequences closely resembling genes but possessing sequence differences that prevent them from encoding functional proteins. Although the human genome contains thousands of pseudogenes, these sequences are generally disregarded in functional genomic studies and are widely viewed as non-functional. However, there is increasing evidence that some pseudogenes are actually transcribed into RNA and can contribute to cancer when dysregulated. In particular, pseudogene transcripts can sequester miRNAs that would otherwise target mRNAs. In this role pseudogenes function as competing endogenous RNA (ceRNA).

To investigate the hypothesis that transcribed pseudogenes play a role in cancer, we developed a bioinformatics method for studying pseudogene transcription using RNA-seq and applied this method to 820 breast cancer samples from The Cancer Genome Atlas project. We incorporated sample-paired gene and miRNA expression data and miRNA target prediction to assess the potential ceRNA function of transcribed pseudogenes. We also performed a clustering analysis using the pseudogene expression data, determining how variation in pseudogene expression relates to known breast cancer subtypes.

Our results indicate that many pseudogenes are transcribed in breast cancer. A subset of these exhibit significant differential expression between tumor and normal samples. The expression levels of the differentially expressed pseudogenes correlate with a number of known cancer-related genes. Furthermore, our analysis incorporating miRNA target prediction and miRNA expression data suggests that a number of transcribed pseudogenes are strong candidates for ceRNA function. Taken together, these results indicate that pseudogene transcription in cancer plays a larger role than previously appreciated.

Short Abstract: Activation and differentiation of T-helper (Th) cells is a complex process orchestrated by distinct gene activation programs engaging a number of genes. This process is so important for a robust immune response that even a slight imbalance might lead to disease states such as allergy or an autoimmune disease. Therefore, identification of genes involved in these processes is important to further understand the pathogenesis of immune-mediated diseases. In this study we identified lineage-specific genes of Th1 and Th2 subsets (at an early stage of differentiation), using both traditional genome-wide transcriptional profiling (microarrays) and next-generation sequencing techniques. Next-generation sequencing techniques do not have some of the shortcomings of microarrays, such as pre-selection bias. Results from the comparison of various transcriptomic platforms are useful for future experimental design. Importantly, these results enabled us to generate a high-confidence gene list that is in agreement across all the platforms employed. We also discovered a panel of novel genes deduced from the next-generation sequencing data.

Accepted Posters

Attention Poster Authors:
The ideal poster size should be a maximum of 0.95 m (wide) x 1.30 m (high).
The organizers will provide double-sided tape for fastening the poster to the board.
View a diagram of the poster board here.
View an image of the poster board here.

Print your poster in Berlin

Delegates wishing to have their poster printed in advance can use the ICC Berlin Business Center: