Biology, sequencing, bioinformatics and more

There is more (length) to Ion Torrent reads than meets the eye (and is Ion Torrent hiding it?)

The sff files from the E coli Ion Torrent runs released by EdgeBio show much longer raw reads than the trimmed reads in the corresponding fastq/fasta files. The quality of those extra bases, however, is very low. This shows the potential for longer reads from the Ion Torrent platform.
The sff file released by Ion Torrent through their Dev Community site has these extra bases masked, which makes one wonder whether they are trying to hide something…

Part 1: EdgeBio’s data
When EdgeBio released six runs with E coli DH10B Ion Torrent data (see http://www.edgebio.com/blog/?p=191), I decided to have a look inside the sff files they provided. I downloaded the data from data.edgebio.com, and used Roche’s sffinfo command to ‘peek inside’. The sffinfo command, accompanying the 454 Life Science software suite, will list the content of the binary sff file in text format (see the post on my other blog). Other open source/open access tools, such as the ones I mention on my blog, might do this as well. Here are the ‘header’ (manifest) part and the first read of DH10B-run01:

For sake of space, I cut short a few very long lines (indicated with ‘…’). A full explanation of the output is given in my blogpost.
Of note is the line that says

# of Flows: 220

Each flow delivers one of the four nucleotides, in the order TACG, so there are 55 cycles of 4 flows. Note that the first instrument from 454, GS20, ran 42 cycles (168 flows), the GS FLX ran 100 (400 flows) and the current GS FLX Titanium runs 200 cycles (800 flows). The coming upgrade (when, Roche?) is supposed to give 400 cycles, 1600 flows…

I once looked into the number of bases one gets on average per cycle (of four flows), and this is remarkably close to 2.5. I basically chopped up E coli and human chromosomes into ‘cycles’ using flows in the order ‘TACG’ and counted how many T’s, A’s, C’s and G’s were actually read each cycle (I might write about this in another blog post). For the 454 instruments, the read length distribution, slightly depending on GC content, actually peaks around the number of bases expected at 2.5 per cycle:
– the GS20, at 42 cycles, peaked around 100 bases
– GS FLX, 100 cycles, had a ~250 bases read length
– GS FLX Titanium, 200 cycles, peaks nicely at 500 nt.
Note that the expected output of the upcoming upgrade, supposedly at 400 cycles, actually seems to peak at 850 nt. Why this is I might have to look into once data become available…
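The ‘chop a sequence into cycles’ idea can be sketched as follows. This is a minimal Python sketch (not the script I actually used), and the function names are made up for illustration. It assumes that a single flow reads a whole homopolymer run of the matching base, and that the sequence contains only T, A, C and G.

```python
FLOW_ORDER = "TACG"  # flow order as stated in the post


def bases_read(seq, n_flows, flow_order=FLOW_ORDER):
    """Count how many bases of seq are read after n_flows flows.

    Assumption: one flow incorporates the complete homopolymer run
    of the flowed base, and nothing if the next base does not match.
    """
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        # consume the whole run of this base, if any
        while pos < len(seq) and seq[pos] == base:
            pos += 1
    return pos


def bases_per_cycle(seq, n_cycles, flow_order=FLOW_ORDER):
    """Average number of bases read per cycle of len(flow_order) flows."""
    n_flows = n_cycles * len(flow_order)
    return bases_read(seq, n_flows, flow_order) / n_cycles
```

Calling bases_per_cycle on a long stretch of a genome with n_cycles=55 gives the per-cycle average for an Ion Torrent-like run of 220 flows. The sequence needs to be long enough not to run out before the flows do.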

Now, for the Ion Torrent, at 220 flows and 55 cycles, one would expect a peak around 137.5 bases (55 cycles × 2.5 bases). But the EdgeBio data peaks at 107.5 bases!
If we look at the sff file data for the first read, it says

# of Bases: 141
Clip Qual Left: 5
Clip Qual Right: 83

So, the clipped read is 79 bases long (from position 5 to 83, inclusive), but the raw read is indeed much longer. For this particular read, there is an adapter detected at the end, but for the majority of the reads in this file this is not the case.
Note that the reported sequence has the clipped-off parts in lower case, and the remainder (the trimmed read) in upper case.
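The arithmetic on the clip points can be sketched in a few lines of Python. The helper names are hypothetical. The sketch assumes both clip points are 1-based and inclusive, which is what the 5 to 83 example implies.

```python
def clipped_length(clip_left, clip_right):
    """Length of the trimmed read; clip points are 1-based and inclusive."""
    return clip_right - clip_left + 1


def trimmed_read(seq, clip_left, clip_right):
    """Return the part of the raw read between the two clip points."""
    # convert the 1-based inclusive range to a Python slice
    return seq[clip_left - 1:clip_right]
```

For the read above, clipped_length(5, 83) gives 79, against a raw length of 141 bases.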

So, I plotted this ‘raw read length’ versus the Trimmed one for DH10B-RUN01, see the graph.

The raw length peaks at 132 nt, close to, but below the expected 137.5 nt based on 55 cycles. The trimmed reads peak at 108. Also note the steep drop in read length beyond 108 bases. This drop is quite unusual: you never see it for 454 reads, for example.

So, Ion Torrent reads are actually much longer than what is being reported. The key question, though, is whether there is any useful information in these extra bases. For this, I extracted the fasta sequences and quality scores, and uploaded them to the excellent tool PrinSeq, http://edwards.sdsu.edu/prinseq_beta. I first uploaded the data as they were with the original trimpoints:

The figure shows the quality value distribution along the read length in bins of 2 bases, steadily declining towards the end of the reads.

The same analysis for the retrimmed data:

Here the bins are 9 bases. It shows that the quality reaches its minimum at the 117 bp bin. This explains why the trimming is leading to the much shorter reads: beyond say 115 nt the quality of the bases is too low to be meaningful.
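For those who want to reproduce this kind of plot without PrinSeq, binning quality scores along the read can be sketched in Python. This is a simplified stand-in, not what PrinSeq does internally: it only computes the mean per bin, and the function name is made up for illustration.

```python
from collections import defaultdict


def mean_quality_per_bin(quals, bin_size):
    """Mean quality score per position bin.

    quals: iterable of per-read lists of integer quality scores.
    Returns a dict mapping the 1-based start position of each bin
    to the mean quality of all bases falling in that bin.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for read in quals:
        for pos, q in enumerate(read):
            b = pos // bin_size
            totals[b] += q
            counts[b] += 1
    return {b * bin_size + 1: totals[b] / counts[b] for b in sorted(totals)}
```

With bin_size=9, the key of the lowest value in the returned dict is where the quality bottoms out.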

I would say that the analysis of EdgeBio’s file shows that the platform indeed has a potential to grow beyond the 100 base read length. ‘All it takes’ is to increase the quality of the bases beyond cycle 27.

Note that the number of flows for this run is 260, equaling 65 cycles. But note the many lower case n’s at the end of the reported sequence! And the zeros in the reported quality values!
How to explain the difference between the sff files? I can only imagine the ‘masking’ in the file from the Ion Torrent website to be done deliberately on the part of Life Technologies. If so, were they trying to hide the low quality bases beyond the trimpoints?

Code used:
Read length distribution: I use a home-made perl script, called fasta_length.pl, which just plots the length of the fasta files it gets. The short version goes like this:

use strict;
use warnings;

$/ = ">"; # set the record separator to the '>' symbol
<>; # read and discard the empty first 'sequence' (everything up to the first '>')
while (<>) {
    chomp; # remove the trailing '>' symbol
    my @lines = split(/\n/, $_); # split the entry into individual lines based on the newline character
    my $header = shift @lines; # the header is the first line (now without the '>' symbol)
    my $seq = join "", @lines; # the remaining lines together form the sequence
    print length($seq), "\n"; # one length per line
}

To get the distributions, I use the script in combination with sffinfo, awk and sort:

Plotting was done in excel (which I still use for these simple plots).
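The counting step can also be done in one go. Here is a minimal Python sketch, in Python rather than the Perl above, with made-up function names. It takes the fasta content as a string and returns (length, count) pairs that can be pasted into a spreadsheet. Like the Perl script, it assumes no ‘>’ appears inside a header line.

```python
from collections import Counter


def read_lengths(fasta_text):
    """Yield the sequence length of each record in a fasta-formatted string."""
    for record in fasta_text.split(">")[1:]:  # skip whatever precedes the first '>'
        lines = record.splitlines()
        # the first line is the header, the rest is sequence
        yield len("".join(line.strip() for line in lines[1:]))


def length_distribution(lengths):
    """Return sorted (length, number of reads) pairs."""
    return sorted(Counter(lengths).items())
```

For example, length_distribution(read_lengths(open("reads.fna").read())) would give the table for a (hypothetical) file reads.fna.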
Getting the raw reads was done by making use of the trimpoint resetting function of the sfffile command. The first trimpoint is always 5, the first base beyond the four-base key sequence. For the second trimpoint I took the number reported as ‘# of Bases’. (I later saw that I could just have set it to ‘0’; that would also have forced the maximum read length and would have been much easier.)

I then manually (using the ‘nano’ program) removed read 8UBVS:1145:967, which was exceptionally long (and consisted of long stretches of TTTTTTTAAAAAACCCCCCGGGGGG), to tidy up the prinseq graph. Both fna and qual files were uploaded to prinseq after gzipping them.
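Removing a single read does not have to be done by hand. This is a minimal Python sketch with a hypothetical function name. It assumes the read name is the first whitespace-separated word of the header line, and it works the same way on the fna and the qual file, since both use ‘>’ headers.

```python
def drop_record(fasta_text, read_id):
    """Return fasta_text without the record whose name equals read_id."""
    kept = []
    for record in fasta_text.split(">")[1:]:
        header = record.split("\n", 1)[0]
        words = header.split()
        name = words[0] if words else ""
        if name != read_id:
            kept.append(">" + record)  # put back the '>' removed by split
    return "".join(kept)
```

Calling drop_record on the content of each file with "8UBVS:1145:967" as read_id leaves all other reads untouched.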