
[...]: An integrated approach to analyse RNA-seq reads
Additional File 3
Results on simulated RNA-seq data.

Nicolas Philippe and Mikael Salson and Thérèse Commes and Eric Rivals
February 13,

1 Results on Drosophila simulated RNA-seq data

All analyses were performed on sets of Human and Drosophila simulated RNA-seq data to assess the impact of the reference genome. Results on Human data are presented in the manuscript, while the corresponding results on Drosophila datasets are given here. Although the Drosophila and Human genomes differ in length, gene density, and number of short introns, the results are similar between the two species for mapping, splice junction prediction, and chimeric RNA prediction.

Mapping

Figure 1 compares the sensitivity and precision of mapping between [...], BWA / BWA-SW, [...], [...], and [...] [1, 2, 3, 4, 5]. The versions and parameters used for these programs are given in Additional File 2. As for Human data, the percentages of incorrectly mapped reads (in red) are almost invisible except for BWA-SW on 200 nt reads, meaning that almost all output genomic locations are correct. However, the difference in sensitivity remains and shows that [...] exhibits both high sensitivity and high precision. Again, its behavior improves with longer reads.
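As an illustration of how the percentages of correctly mapped, incorrectly mapped, and unmapped reads in such a comparison might be computed on simulated data, here is a minimal Python sketch. The function names, the dictionary layout, and the position tolerance are assumptions made for this example; they are not taken from the manuscript or from any of the compared tools.

```python
from collections import Counter


def classify_read(true_pos, mapped_pos, tolerance=0):
    """Classify one simulated read as 'correct', 'false' or 'unmapped'.

    true_pos and mapped_pos are (chromosome, start) tuples; mapped_pos is
    None when the tool reported no location. The tolerance (in nt) is a
    hypothetical parameter of this sketch.
    """
    if mapped_pos is None:
        return "unmapped"
    true_chrom, true_start = true_pos
    chrom, start = mapped_pos
    if chrom == true_chrom and abs(start - true_start) <= tolerance:
        return "correct"
    return "false"


def mapping_summary(truth, mapped, tolerance=0):
    """Return the percentage of reads in each class, over all simulated reads."""
    counts = Counter(
        classify_read(truth[read_id], mapped.get(read_id), tolerance)
        for read_id in truth
    )
    total = len(truth)
    return {
        label: 100.0 * counts[label] / total
        for label in ("correct", "false", "unmapped")
    }


# Toy data: four simulated reads, three of which were given a location.
truth = {"r1": ("2L", 100), "r2": ("2L", 500), "r3": ("X", 42), "r4": ("3R", 7)}
mapped = {"r1": ("2L", 100), "r2": ("2L", 900), "r3": ("X", 42)}

summary = mapping_summary(truth, mapped)
print(summary)
```

In this toy example, two reads are placed at their simulated origin, one is placed elsewhere, and one is left unmapped, which corresponds to the blue bar, the red bar, and the remainder in a stacked bar plot.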

Normal and chimeric splice junctions detection.

Table 1 shows the sensitivity and precision of splice junction prediction on D. melanogaster simulated data. [...] is compared to TopHat, MapSplice, and [...] [6, 7, 8]. Again, [...] is highly sensitive and remains the most precise of all tools, even though TopHat gains 2 to 4 points in sensitivity. For instance, TopHat yields 10 to 20 times more false positive junctions than [...].

Table 1: Sensitivity and precision of splice junction detection for the different tools. TP is the number of true positives and FP the number of false positives.
Columns, for 75bp reads and then for 200bp reads: Tool, Sensitivity, Precision, TP, FP
, , , ,
MapSplice , ,
TopHat ,329 1, ,123 2,354

Table 2 shows the sensitivity and precision of chimeric junction prediction on D. melanogaster simulated data. [...] is compared to MapSplice [7], TopHat-fusion [9], and TopHat-fusion-Post (i.e., TopHat-fusion followed by a post-processing script). Here, both [...] and TopHat-fusion achieve better sensitivity than on Human data. However, [...] reaches much higher precision than any other tool, with the exception of TopHat-fusion-Post, which has 100% precision but delivers only 2 candidate chimeric junctions, that is, less than 1% sensitivity.

Table 2: Sensitivity and precision of chimera detection for the different tools. TP is the number of true positives and FP the number of false positives.
Columns, for 75bp reads and then for 200bp reads: Tool, Sensitivity, Precision, TP, FP
, ,
MapSplice ,784
TopHatFusion ,157 1,298
TopHatFusionPost
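To make the TP and FP columns concrete, the following Python sketch derives sensitivity and precision from a set of simulated junctions and a set of predicted junctions. It assumes the usual definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), and an exact match on (chromosome, donor, acceptor) coordinates; both the definitions and the junction representation are assumptions of this example, since this file does not spell out the matching criteria.

```python
def junction_metrics(true_junctions, predicted_junctions):
    """Compare predicted junctions with simulated ones.

    Junctions are hashable tuples such as (chromosome, donor, acceptor).
    Exact coordinate matching is an assumption of this sketch.
    """
    truth = set(true_junctions)
    predicted = set(predicted_junctions)
    tp = len(truth & predicted)   # predicted and really simulated
    fp = len(predicted - truth)   # predicted but not simulated
    fn = len(truth - predicted)   # simulated but missed
    sensitivity = tp / (tp + fn) if truth else 0.0
    precision = tp / (tp + fp) if predicted else 0.0
    return {"TP": tp, "FP": fp, "sensitivity": sensitivity, "precision": precision}


# Toy data: four simulated junctions.
truth = [("2L", 100, 250), ("2L", 400, 900), ("X", 10, 75), ("3R", 5, 60)]

# A tool that finds three of them and reports one spurious junction.
balanced = junction_metrics(truth, truth[:3] + [("X", 11, 75)])

# A very conservative tool: a single, correct prediction.
conservative = junction_metrics(truth, truth[:1])

print(balanced)
print(conservative)
```

The second case mirrors the situation described for TopHat-fusion-Post: a tool that outputs very few candidates can reach 100% precision while its sensitivity stays low.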

2 Additional results on Human simulated RNA-seq data

2.1 Comparison of 11 vs 42 million reads

We assessed the impact on mapping results of the size of the dataset, in terms of number of reads and hence of coverage. We performed the same analysis with a subset of 11 million reads and with the whole set of 42 million reads. The read length is 75 nt. The results for each set and for all tools are displayed in Figure 2, (A) for 11 million and (B) for 42 million reads. The impact is negligible, except for BWA, which yields more false locations (small red bar on top of the blue one in A) with the medium-size set (96.28 vs 99.13%). In particular, sensitivity and precision are not impacted by the number of reads, although this number changes the support values. For comparison, as shown in the manuscript, using longer reads has a much deeper impact on all mapping tools (Figure 3 in the manuscript).

2.2 Comparison of running times and memory usages

Table 3 gives the running times and memory usage observed for mapping and splice junction prediction with various programs when processing the 42 million 75 nt reads (Human simulated data). Times are given in days (d), hours (h), or minutes (m), while the amount of main memory is given in Gigabytes (Gb). Although [...] performs several prediction tasks (point mutations, indels, splice junctions, and chimeric RNAs), its running time is longer than that of the mapping tools but shorter than that of the splice junction prediction tools. Its memory consumption is larger due to the use of a read index, the Gk arrays. This index is indispensable to query the support profile of each read on the fly.

Programs: BWA MapSplice TopHat
Time (dhm): 7h 6h 5h 40m 9h 2d 4h 12h
Memory (Gb):
Table 3: Running times and memory usages observed for mapping or splice junction prediction with various programs.

3 Cases of failures

For some simulated datasets, we experienced failures while running other tools in our comparisons, as mentioned in the Results of the article. For instance, TopHat-fusion did not deliver results on the 200 nt read datasets [9]. TopHat-fusion was unable to process the 200 nt simulated reads for a reason that is still unknown. On that input, TopHat-fusion ran for about one month while still filling temporary files, then stopped without any error message. We tried a few times and always obtained the same result. Finally, we contacted TopHat-fusion's contributors twice via their mailing list, but did not obtain any reply.
