ABSTRACT
Background. The emergence of next-generation sequencing platforms has given rise to a new generation of assembly algorithms. Compared with Sanger sequencing data, next-generation sequencing data present shorter reads, higher coverage depth, and different error profiles. These features pose new challenges for de novo transcriptome assembly. Methodology. To explore the influence of these features on assembly algorithms, we studied the relationship between read overlap size, coverage depth, and error rate using simulated data. Based on this relationship, we propose a de novo transcriptome assembly procedure, called Euler-mix, and demonstrate its performance on a real mouse transcriptome dataset. The simulation tool and evaluation tool are freely available as open source. Significance. Euler-mix is a straightforward pipeline that focuses on handling the variation in coverage depth of short-read datasets. The experimental results showed that Euler-mix improves the performance of de novo transcriptome assembly.

Mentions:
Figure 5(a) presents an example of how overlap precision and overlap recall were computed: the overlap precision rate is (a1 + a2 + a34)/(C1 + C2) and the overlap recall rate is (a12 + a3 + a4)/(R1 + R2). Note that the true positive area may include overlapping regions of alignments, which is why we call these measures overlap measures. Also note that these measures may overestimate the performance, because overlapping regions are counted more than once. However, most works use comparable measures as the benchmark, for example, the “sequence coverage” used in Velvet and the “genome coverage” used in ABySS. Accordingly, we take the overlap measures as an upper bound on performance.
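
The two ratios above can be illustrated with a minimal Python sketch. It is not the authors' evaluation tool. It assumes that each term in a numerator is the length of one aligned region, measured on the contig side for precision and on the reference side for recall, and that the regions are simply summed without merging. The function name, the interval representation, and all numbers are hypothetical and are not taken from Figure 5(a).

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (sequence id, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def overlap_rate(aligned: List[Interval], lengths: Dict[str, int]) -> float:
    """Sum of aligned region lengths divided by the total sequence length.

    Regions are summed as given, so positions covered by more than one
    alignment are counted more than once (the overestimate noted in the text).
    """
    total = sum(lengths.values())
    if total == 0:
        raise ValueError("total sequence length must be positive")
    covered = sum(end - start for _, start, end in aligned)
    return covered / total


# Made-up example: two contigs (C1, C2) and two reference transcripts (R1, R2).
contig_lengths = {"C1": 1000, "C2": 500}
reference_lengths = {"R1": 1200, "R2": 800}

# Aligned regions on the contig side; the two C1 regions share positions 350-400.
contig_regions = [("C1", 0, 400), ("C1", 350, 700), ("C2", 100, 400)]
# Aligned regions on the reference side.
reference_regions = [("R1", 0, 750), ("R2", 0, 150), ("R2", 200, 350)]

overlap_precision = overlap_rate(contig_regions, contig_lengths)
overlap_recall = overlap_rate(reference_regions, reference_lengths)
print(f"overlap precision = {overlap_precision:.3f}")
print(f"overlap recall    = {overlap_recall:.3f}")
```

In this made-up input, positions 350 to 400 of C1 are covered by two alignments and are counted twice, which shows why the overlap measures are best read as an upper bound.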

Bottom Line:
The experimental results showed that Euler-mix improves the performance of de novo transcriptome assembly.