NGS data

The transcriptome assemblies were done using Roche 454, Illumina, and ABI SOLiD. Here is a table of the raw data that went into each assembly.

| Assembly | 454 | Illumina | SOLiD |
| --- | --- | --- | --- |
| AAA (Blythe et al.) | 0.58 million SE | X | 507 million SE |
| Smed454 (Abril et al.) | 0.58 million SE | X | X |
| BIMSB (Adamidi et al.) | 1.3 million SE | 56 million PE, 20 million SE | X |
| Heidelberg (Sandmann et al.) | 1.3 million SE | 336 million PE | X |

(An "X" marks a platform that was not used in that assembly.)

Assemblies

The assembly method varies with the sequencing platform. De novo assembly of 454 reads is usually done with Newbler. De novo assembly of Illumina reads can be done with various short-read assemblers such as SOAPdenovo or Velvet. Reference assembly is also an option for Illumina reads, with BWA as the mapper and Cufflinks as the assembler. Currently, the only reliable option for assembling SOLiD reads is a reference assembly, typically done with Tophat/Bowtie and Cufflinks.
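The platform-to-strategy pairings described above can be captured in a small lookup. The tool names come from the text; the function itself (`pick_pipeline`) is just an illustrative sketch, not part of any published pipeline:

```python
# Illustrative mapping of sequencing platform + strategy to the tools
# named in the text. A None result means the combination is not a
# reliable option (e.g., SOLiD de novo assembly).
PIPELINES = {
    ("454", "de novo"): ["Newbler"],
    ("Illumina", "de novo"): ["SOAPdenovo", "Velvet"],
    ("Illumina", "reference"): ["BWA", "Cufflinks"],
    ("SOLiD", "reference"): ["Tophat/Bowtie", "Cufflinks"],
}

def pick_pipeline(platform, strategy):
    """Return candidate tools for a platform/strategy pair, or None."""
    return PIPELINES.get((platform, strategy))
```

For example, `pick_pipeline("SOLiD", "de novo")` returns `None`, reflecting that only reference assembly is currently reliable for SOLiD reads.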

The AAA data set was assembled in our lab. I was able to get two other assembled sets of transcripts, from BIMSB (Adamidi et al.) and Heidelberg (Sandmann et al.). Below are notes on each of these three assemblies.

AAA dataset

The AAA dataset was assembled using 454 and SOLiD reads. The 454 reads and ESTs were de novo assembled using Newbler, and the SOLiD reads were reference assembled using BioScope/Cufflinks. Since this assembly was done before Tophat and Bowtie supported mapping of color-space reads, we had to use BioScope to find split reads.

The 454 assembly was used as the backbone. The SOLiD assembly was used to determine strandedness where possible and to extend the 454 assembly. I believe we have saturated the transcriptome with our SOLiD reads, but because of the variable distribution of SOLiD reads across transcripts, we had a hard time assembling full-length transcripts.
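The "extend the backbone" step can be pictured as a suffix-prefix overlap merge. This is a toy model of merging two assemblies, not the actual BioScope/Cufflinks logic; `extend_backbone` and its parameters are hypothetical:

```python
def extend_backbone(backbone, contig, min_overlap=20):
    """Extend a backbone transcript with a contig whose prefix overlaps
    the backbone's suffix; return the backbone unchanged if no overlap
    of at least min_overlap bases is found."""
    max_k = min(len(backbone), len(contig))
    # Try the longest possible overlap first.
    for k in range(max_k, min_overlap - 1, -1):
        if backbone.endswith(contig[:k]):
            return backbone + contig[k:]
    return backbone  # no sufficient overlap; leave unchanged
```

With real data the overlap threshold matters a lot: too low and chimeric extensions appear, too high and genuinely overlapping contigs are left unmerged, which is one source of the redundancy noted below.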

Pros: more complete coverage of transcribed regions due to the SOLiD sequencing; strandedness can be determined with SOLiD; samples were taken from a range of regenerating time points

Cons: redundancy in transcripts; transcripts are not full length; some transcripts containing introns were likely assembled from pre-mRNA

BIMSB dataset

The BIMSB assembly had a fair amount of Illumina 36 bp paired-end reads, which were assembled with SOAPdenovo. They also had a good amount of 454 reads, assembled with Newbler. The initial assembly produced around 26,000 transcripts. Using BLAT, they were able to determine transcript fusion/fission events and recluster the ~26,000 transcripts into ~18,000.

Pros: very little redundancy; transcripts are more complete; proteomics data to support the transcripts

Cons: slightly lower coverage of all genes due to the strict transcript clustering step, in which the lowest 5% quantile was discarded; unknown sample conditions

Heidelberg dataset

The most recently assembled transcriptome is the Heidelberg transcriptome. The raw data contains a good amount of 454 reads from previous studies and a large amount of Illumina 36 bp PE reads. They were able to use the 454 + EST de novo assembly as a scaffold. Velvet + Oases were then used to assemble the Illumina data with the 454 assembly.

Pros: good coverage of all the genes; more complete transcript lengths

Cons: PE read assembly contains a lot of Ns; same redundancy problem as AAA; all samples were taken from head pieces, which might bias read composition

General issues with assemblies

Multiple sequencing platforms. NGS read assemblers are usually designed for a specific platform. There is a hybrid assembler available (MIRA), but it is mainly used for genomic assemblies. Most of the planarian transcriptome studies used two sequencing platforms, resulting in two initial assemblies followed by a merging step.

The problem with this approach is that each assembler attempts to deal with the shortcomings of its sequencing platform by different methods, so there is no standardized metric for determining the 'goodness' of an assembly. When we merge two assemblies from two different platforms, are we compounding the faults of both?
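One metric that is commonly reported, even though it says nothing about correctness, is N50. A minimal implementation, for reference:

```python
def n50(lengths):
    """N50: the contig length L such that contigs of length >= L
    together contain at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

N50 illustrates exactly the problem above: it rewards long contigs regardless of whether they are chimeric, so an assembly that fuses transcripts can score better than a more accurate one.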

Sample preparation. The read composition of the biological sample could skew assembly statistics, depending on the condition of the organism when the sample was taken and on the library preparation methods. How do the various assemblers deal with read composition?

Let's say a transcript 'X' is expressed just enough in one library to pass the read-count threshold required for assembly, but in another library it is not expressed at all. If we pool the reads of both libraries and assemble them together, do we run the risk of discarding 'X'? Is it better to assemble the libraries individually and then merge the individual assemblies? How much coverage would we lose if we did that?
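The pooling risk can be made concrete with a toy read-count model. The threshold rule here (keep a transcript only if it contributes at least some minimum fraction of the library's total reads) is a hypothetical stand-in for whatever cutoff a real assembler applies:

```python
def assembled(transcript_reads, total_reads, min_fraction=0.001):
    """Toy rule: a transcript is kept only if its reads make up at
    least min_fraction of the library's total reads."""
    return transcript_reads / total_reads >= min_fraction

# Transcript X: 150 reads in library A (100k reads total),
# 0 reads in library B (900k reads total).
kept_in_a = assembled(150, 100_000)     # 0.15% of library A: kept
kept_pooled = assembled(150, 1_000_000)  # 0.015% of pooled reads: discarded
```

Under any relative threshold like this, pooling a library where X is absent dilutes X below the cutoff, which is exactly the risk raised above.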

SOLiD reference assembly. ABI SOLiD reads are a pain to work with because they are in color-space. There are currently no reliable de novo assemblers for SOLiD (at least none that can handle the planarian's AT-rich transcriptome). The available de novo assemblers simply convert the color-space reads into nucleotide space before assembling.

The best we can do is map the reads to the genome and reference assemble them. The reliance on this incomplete genome means we cannot discover anything that isn't in the genome. Another issue is that the genome is from a sexual strain of planarians. Mapping asexual reads onto a sexual genome is obviously not ideal.

Final Thoughts

There are 4 separate transcriptomes in planarians right now: 4 independent transcriptome studies within a year of each other. Combined, these 4 studies consist of over 2 million 454 reads, over 400 million Illumina reads, and over 500 million SOLiD reads. I think that's enough said.