Monday, March 21, 2011

This is a small tutorial on how to work with large scale RNAseq data with open source softwares. I will be discussing various options for alignment of reads to the reference sequence and little bit on assembly.

[When I started writing this post, tophat did not have colorspace support, but now it does]
A quicklist of software programs available for nextgen data analysis can be found here:

In order to start with bowtie, you need to build index of your reference sequence. This could be typically a fasta file. For me running command . "./bowtie-build /data/vbi/references/p_sojae.fasta sojaeV4"
to generate index for a 70 MB genome in a 64 bit machine with dual core and 8 GB chip - took exactly 2 minutes. This command will generate 6 .ebwt files in the working directory.
ls -l

Good thing about bowtie(version 0.12 onwards, it supports colorspace alignment)

As of version 0.12.0, bowtie can align colorspace reads against a colorspace index when -C is specified. Colorspace is the characteristic output format of Applied Biosystems' SOLiD system.Look at color coding here for more details. See ABI's Principles of Di-Base Sequencing document for details.

All input formats (FASTA -f, FASTQ -q, raw -r, tab-delimited --12, command-line -c) are compatible with colorspace (-C). When -C is specified, read sequences are treated as colors. Colors may be encoded either as numbers (0=blue, 1=green, 2=orange, 3=red) or as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
Some reads include a primer base as the first character; e.g.:

These 3 softwares are also good for fast alignment of short reads into the genome.

BFAST stands for "BLAT-like Fast Accurate Search Tool", and it also supports colorspace

data. SHRIMP also supports colorspace reads and recently the major improvement is that,

it can work on small memory computers. VMATCH on the other hand is a software for

local alignments and needs licensing. I have not worked extensively on these 3 softwares,

so, can't say much about them

TOPHAT:

Is designed to predict the splice junctions after aligning short reads to the reference.

It usually works without a reference junction position(GFF) file, but can also work

if there is already a known gff file. This program chops the reads into smaller fragments

and stitches the alignments later in separating introns from exons. Now tophat has a

colorspace support.

Results:

This result represents alignment percentage of 3 different assembly version of the

same organism.

Tophat Output: [For larger datasets try splitting them into smaller files since tophat exits with some error hard to tell why it exited]

Caveats: I work with oomycetes pathogens and the introns and exons in this pathogen could be really much smaller than what tophat defines as default.

Running tophat may not work for everyone. For example with P.sojae, tophat produced awful number of junctions(just 700) with default parameters. Tweaking the parameters only produced worse results. As a work around, we tried aligning these transcripts to the predicted gene models as well. Comparing genome vs predicted gene model alignment results, we found around 10% of the alignments were not translated into predicted models(55% matched with genomes where as 45% match with predicted transcripts)

There is an option to try and get the sequences aligned with the genome to assemble into contigs. One easier method for this is to try ABySS

Running ABySS is quite straight forward. Unpack the distribution and follow installation instructions. Once installed you could try running ABySS with different k mers. One small shell script for running different k mers
is as belows:

export PATH=$PATH:/home/sutripa/samtools-0.1.11/

for i in {20..40};
do
./ABYSS -k $i /data/bowtieOutput/PS-1_F3V1.bam.sorted.bam -o tmp-k$i.fa
done
[NOTE: Here the input file for assembly is sorted bam files. The bam files are generated by running bowtie]

Then check the files for their N50 values [See N50 section of the blog on how to calculate N50 values]
In case of P.sojae here is the N50 results:

N50 Vs Number of scaffolds having values larger than N50 for P.sojae.
Here, the results are not very encouraging. So, you may try running with a bunch of different k values. I will discuss about running trans-abyss and cufflink in part - 2 of this series.