Bioinformatics Research Group

Computer Science, The University of Hong Kong

T-IDBA is an iterative de Bruijn graph de novo short-read assembler
for transcriptomes. It is a purely de novo assembler that relies only
on RNA sequencing reads. It uses not only the reads but also the
paired-end information to increase the k value in the accumulated
de Bruijn graph. Because of the nature of the transcriptome,
transcripts from different genes share very few repeat patterns.
Hence, the de Bruijn graph decomposes into small connected components
when k is large enough. In most cases, each component corresponds to
one gene and contains only a few transcripts. T-IDBA then uses a
heuristic algorithm based on paired-end reads to find the isoforms.
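
The decomposition described above can be illustrated with a minimal
sketch. It is not T-IDBA's implementation: the function names
(kmers, count_components) and the two toy transcripts are hypothetical,
whole transcripts stand in for reads, and reverse complements and
sequencing errors are ignored. It only shows that two sequences sharing
a 5-base repeat fall into one connected component for small k and into
separate components once k exceeds the repeat length.

```python
def kmers(seq, k):
    """Return all k-mers of seq in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_components(reads, k):
    """Count connected components of the de Bruijn graph built from reads.

    Nodes are k-mers; consecutive k-mers within a read are connected.
    Components are tracked with a simple union-find structure.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            find(km)  # register the node even if it has no edges
        for a, b in zip(ks, ks[1:]):
            union(a, b)

    return len({find(x) for x in parent})


# Two hypothetical transcripts that share only the repeat "ACGTA".
t1 = "AAGGCTTACGTACCGGA"
t2 = "TTTCGACGTAGGATCCA"

for k in (4, 5, 6):
    print(k, count_components([t1, t2], k))
# prints:
# 4 1
# 5 1
# 6 2
```

With k = 4 or k = 5 the shared repeat produces a common node, so both
transcripts end up in one component; with k = 6 no 6-mer is shared and
the graph splits into two components, one per transcript.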

(Please note that T-IDBA is no longer maintained. We recommend using
IDBA-Tran instead, which generally performs better.)

IDBA, IDBA-UD, IDBA-Hybrid and IDBA-Tran released in one package
Oct 18, 2012

All assemblers in the IDBA (iterative de Bruijn graph assembler) series
have been refined and included in this package. Many errors have been
fixed, and scaffolding on multiple levels of paired-end reads is
supported in IDBA, IDBA-UD and IDBA-Hybrid.

The basic IDBA is included only for comparison.
If you are assembling genomic data without a reference genome, please
use IDBA-UD.
If you are assembling genomic data with a similar reference genome,
please use IDBA-Hybrid.
If you are assembling transcriptome data, please use IDBA-Tran.

Abstract: RNA sequencing based on next-generation sequencing technology
is useful for analyzing transcriptomes, discovering novel genes and
studying exon/intron structures. Similar to genome assembly, de novo
transcriptome assembly does not rely on a reference genome or
additional annotated information. Most, if not all, existing de novo
transcriptome assemblers rely heavily on de novo genome assembly
techniques without fully utilizing the properties of transcriptomes,
and may produce short contigs because of the splicing nature (shared
exons) of the genes and the repeats existing in different genes. In
this paper, we analyze the properties of the mammalian transcriptome
and propose an algorithm to reconstruct expressed isoforms without a
reference genome. We extend the iterative de Bruijn graph approach
(IDBA) and use paired-end information to solve the problem of long
repeats in different genes and the problem of branching introduced by
shared exons in the same gene. The graph is then decomposed into small
components, each of which contains a few genes, if not a single gene.
The most probable isoforms, those with the most support from the
paired-end reads, are then found heuristically by depth-first search.
In practice, our de novo transcriptome assembly software, T-IDBA,
substantially outperforms ABySS (one of the newest de novo
transcriptome assembly tools) in terms of sensitivity and precision on
both simulated and real data. We also provide a theoretical analysis of
T-IDBA's performance, which shows that T-IDBA guarantees that most
isoforms can be recovered as long as the coverage of the isoforms by
reads exceeds a certain threshold; this analysis agrees with T-IDBA's
observed performance.
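
The depth-first search step mentioned in the abstract can be sketched
as follows. This is an illustrative simplification, not the algorithm
from the paper: it assumes one component has already been reduced to a
small acyclic graph of exon-like nodes, and the names (enumerate_paths,
pair_support, rank_isoforms), the toy graph and the pair counts are all
hypothetical. Each source-to-sink path is a candidate isoform, scored
by the number of read pairs whose two ends both fall on the path.

```python
def enumerate_paths(graph, node, path=None):
    """Depth-first enumeration of all paths from node to a sink.

    graph maps each node to a list of successors and is assumed acyclic.
    """
    path = (path or []) + [node]
    successors = graph.get(node, [])
    if not successors:
        yield path
        return
    for nxt in successors:
        yield from enumerate_paths(graph, nxt, path)


def pair_support(path, pair_counts):
    """Sum the counts of read pairs whose two ends both lie on the path."""
    position = {n: i for i, n in enumerate(path)}
    return sum(
        count
        for (a, b), count in pair_counts.items()
        if a in position and b in position and position[a] < position[b]
    )


def rank_isoforms(graph, sources, pair_counts):
    """Return all candidate paths, best-supported first."""
    paths = [p for s in sources for p in enumerate_paths(graph, s)]
    return sorted(paths, key=lambda p: pair_support(p, pair_counts),
                  reverse=True)


# Hypothetical component: exon E1 is followed by E2 or E3, then by E4.
graph = {"E1": ["E2", "E3"], "E2": ["E4"], "E3": ["E4"]}
# Hypothetical number of read pairs linking each pair of exons.
pair_counts = {
    ("E1", "E2"): 5, ("E2", "E4"): 4,
    ("E1", "E3"): 1, ("E3", "E4"): 1,
    ("E1", "E4"): 3,
}

ranked = rank_isoforms(graph, ["E1"], pair_counts)
print(ranked[0])  # prints ['E1', 'E2', 'E4']
```

In this toy component the path E1-E2-E4 collects 12 supporting pairs
against 5 for E1-E3-E4, so it is reported first. Exhaustive enumeration
is only reasonable here because the components are assumed to be small.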