Friday, September 03, 2010

As next generation sequencing is still evolving, so also myriad of tools that are built in and around these sequences . Applied Biosystems Dibase sequencing that uses ligation based chemistry ensures system accuracy and high throughputness.What is SOLiD system?
SOLiD stands for "Sequencing by Oligonucleotide Ligation and Detection". Due to its 2 base encoding system, it ensures greater accuracy e.g; 99.94% .
The decoding process is kind of tricky, since each color represents a dinucleotide. Altogether there are 4 colors i.e; 0,1,2,3 representing 4 bases A,C,G,T

Color Decoding

So, if a csfasta file has the first base known, then the subsequent bases can be calculated using the decoding table as below:

Now how about an error? If the colorspace is represented by a number >=4 or a "." , how to decode the rest of the reads? I guess, in that case, we can designate the rest of the reads as 'N', this can be done especially because we generate abysmally large number of reads and ignoring some of them will not matter much.

Converting Quality Scores:

Now, the next step is to convert the quality scores into sanger fastq format. Sanger Fastq standard was defined by Jim Mullikin, gradually disseminated, but never formally documented. The biggest drawback with the phred quality scores is that the need to separate numbers with space which increases storage space and numbers are often 2 digits numbers, which again adds up to the space issue.

Phred value is calculated as:

Qphred = -10 X log10(Pe), where P stands for probability score.

From Phred to Sanger:

Converting phred quality scores to Sanger Quality score is quite straight forward. Phred values 0 - 93 are represented by ASCII 33 - 126. 33 was used as a offset because ASCII 32 represents a white space.
The paper describing the details of sanger fastq and colorspace can be found here :