Saturday, April 29, 2017

Mira4 assembly of 454 reads from SRA

I want to make an assembly of the Annona squamosa fruit transcriptome data from this paper (http://dx.doi.org/10.1186/s12864-015-1248-3). They give in the paper a link to a web resource (http://www.annonatranscriptome.nabi.res.in/), but the resource appears to now be defunct, so to get contigs reads, I will have to assemble the reads myself. The reads are from two different cultivars of Annona squamosa, so I'm going to assemble each cultivar separately first, and then if that works, I'll try a combined assembly.

MIRA is a nice, free, software package that can assemble 454 data. I've had success with it before, so that's what I'll use for this project too.

Data files that need to be downloaded:
The 454 data should be downloaded from NCBI in the sff format. The sff format is much larger than the fastq format, but it contains the clipping information which is vital to a good assembly. Also the sequence names are in a format that MIRA can accept. MIRA will throw an error if an assembly is tried with fastq files obtained directly from SRA.

The reads can all be found under SRP042646 (https://www.ncbi.nlm.nih.gov/sra/?term=SRP042646), notice that there are 8 runs total, 4 for each cultivar. The different runs for each cultivar represent developmental stages, but I'll assemble all developmental stages together all at once.

fastq format and clipping information can be extracted from the sff file using sff_extract from the mira third party tools.

Here is a bash script to download the data and extract the clipping information and fastq files for input into MIRA:

Notice that you have to provide the fastq and the xml files separately as datafiles. MIRA won't know to look for the xml files unless you tell it about them. The parameter -GE:not=4 tells MIRA to use 4 threads.

Follow up:
Assembly of NMK1 with just these parameters looked pretty good, there were 62,136 contigs, which is reasonable. I'd like to get a coverage table for the contigs so I can have some idea of relative expression. A nice summary table can be found in the [name]_d_info/[name]_info_contigstats.txt file.

It seemed like maybe there were some unclipped adapter sequences, because there were a few very long contigs (2 of 10 kb), and there MIRA reported 7 "strong unresolved repeats", although no megahubs. Apart from adapter sequences, the presence of rRNA may be contaminating the library

To look for adapter sequences, I merged all of the reads into one file, and created a blast database out of it to search for potential adapters:

This gives less than 10 hits for each of the adapter sequences except for Adaptor B, which has about 12,000 hits (about 0.65 % of all reads). But only a segment of it. This suggests that perhaps this is the adapter sequence that was used. But maybe there is more to it. We should extract the reads that contain this sequence and compare them to each other to see if there is a longer consensus sequence that should be removed.

The fastq files extracted from the sff file contain upper and lowercase letters. The lowercase letters are adapter sequence and low quality sequence that should not be included in the assembly. When fastq files are downloaded directly from genbank, these sequences have already been clipped off. The BLAST above was based on the SRA fastq files, and demonstrates that there are still some un-clipped adapter sequences.