Wednesday, 21 November 2012

Introduction

Cleaning FASTQ reads is the process of removing those bits of the reads that you don't deem good enough to be given to the next stage of your pipeline. At worst, this could mean removing the whole read, and if the reads were paired, this means some reads will become "orphan" single reads. The cleaning process is often called filtering, trimming, clipping, or pruning. A FASTQ read has three parts: a sequence of bases, and a quality score for each base, and a sequence ID. The cleaning process only has access to this information.

Sequence

In terms of the sequence itself, we may want to discard sequences which contain
ambiguous DNA symbols, most commonly "N" which means the base could not be called. Generally, reads with anything other than A,T,G,C are going to cause problems with most downstream analysis tools.

Preparing DNA for sequencing almost always involves ligating "adaptor" sequences to all the fragments of DNA. When things don't go perfectly, these adaptors can end up being sequenced, and be partially present at the start/end of your reads! This can wreak havoc with downstream analyses like read alignment and de novo assembly. This mainly seems to occur with Illumina reads, especially mate-pair libraries and some RNA-Seq protocols. You should always check and remove these adaptor sequences.

Quality

Usually when we think of cleaning FASTQ reads, we mean trimming on quality. Each base in a FASTQ read gets a Phred quality score which is an integer between 0 and 93 (as encoded in the Sanger FASTQ format) although you rarely see more than "Q45" or so. Low numbers are bad, high numbers are good. The numbers are on a logarithmic scale, and represent the probability of this base notbeing correct. For example, a base of Q30 is expected to be wrong about 1 in 1000 times, or we are 99.9% confident in it. This means that in the millions of Q30 bases throughout all your reads, thousands of them are expected to be wrong! When we get down to Q10, one in ten bases are dodgy. Like any analysis, the "garbage in, garbage out" rule applies, and hence removing low quality reads is often a good strategy, especially if you don't know how well the downstream tool handles errors.

In Illumina reads the quality tends to get monotonically worse at the 3' end of the reads. Simply removing bases below a quality threshold T from the end is a common form of quality trimming. Alternatively you could average quality across a small window (3 bases say) and threshold that. Taking that idea to the limit, you could simply remove whole reads with average quality < T. There are many variations on this theme, and later I will discuss the approach we take.

ID

You probably think I'm crazy suggesting we can use the sequence ID to filter reads, right? In general, of course it does not make sense, however Illumina reads do contain some information in their read IDs. The most relevant are the series of colon-separated integers which some encode the coordinates of the cluster on the flowcell lane which generated this sequence. When we got the original Illumina Genome Analyzer (version 1!) it was clearly producing bad reads that came from the edges of the flowcell, due to poorer optical performance and other physical affects. You can still see this effect today when you open a FASTQ file, whereby lots of the reads at the start of the file are full of "N"s and have low quality - these come from the first tile in the corner of the flowcell lane. I'm not advocating using this information in reality, but it is interesting to keep in mind, and may actually be useful to someone if you were given a FASTA file where the quality information had been stripped away.

What is Nesoni?

Nesoni (github) is our Swiss Army knife of NGS related tools, implemented primarily by Paul Harrison in our group. It began as a wrapper for aligning reads and calling SNPs for the 100s of bacterial genome samples we were processing, but has now evolved into an extendible, distributable pipeline system which we hope to publish soon and actually document. I'll save the full description of Nesoni for another day, and today just focus on one of the simplest but still very useful tools it has.

The Nesoni clip: tool

If you just type "nesoni clip:" you will get this descriptive help information:

Usage

The standard Illumina paired-end run gives you two files for the left and right reads:

Sample_R1.fastq.gz Sample_R2.fastq.gz
By default, Nesoni will quality clip at Q10, minimum length 24, and will remove all known Illumina adaptors from TruSeq amd Nextera:nesoni clip: clipped pairs: Sample_R1.fastq.gz Sample_R2.fastq.gz
The "clipped" is the output file prefix which is mandatory. The "pairs:" clause tells it that you are feeding it separate left and right files. It can optionally take interleaved pairs with the "interleaved:" clause. By default, the output is three files - the clean pairs (interleaved), any orphan reads, and an extensive log file which explains how many reads failed the various filters (adaptors, quality, length)

clipped_paired.fq.gzclipped_single.fq.gzclipped_log.txt

You probably want your output as separate left/right files, and uncompressed, as to be compatible with most other software. Nesoni can do this with two extra options:

Sunday, 11 November 2012

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into lots more short pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of the fragments in the library are then sequenced individually and in parallel. There are two ways of sequencing a fragment - either just from one end, or from both ends of a fragment. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F) or otherwise the sequence will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine, there is a still a safety margin in the middle.

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired end reads will overlap. Secondly, the new enyzmatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F, is a bit like a single-read of length F, except a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (as libraries are only nominally of length F, each read will vary). This gives the reads a longer context, and this particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools will produce better results, and faster.

The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!

Tuesday, 6 November 2012

The problem

The FASTQ file format is used to store short sequence reads, most commonly those from Illumina. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1M to 1B reads. Each entry in a FASTQ file contains three pieces of information: the sequence ID, the DNA sequence, and a quality string. I recently read through the documentation for thekhmer package, and noticed a comment stating that digital normalisation would work better if the FASTQ reads were in order from longest to shortest.

I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin's blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass) but it read the whole file into RAM and didn't use a Schwartzian transform.

This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file. And it works faster than I thought it would. About 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp so if that is on a different disk unit ("spindle" for my more mature readers) to your input and/or output files, you'll get better performance. There are various chunk parameters etc you can change on sort, but it probably doesn't make too much difference.

Unix sort breaks the input into chunks, sorts each of those, then merges all the chunks in the output (merge sort). Another possibility which avoids explicit sorting is to read through the input, and put reads into separate files based on their read length (a simplistic radix sort). We can do this because we have a finite number of read lengths. The final output simply concatenates all the bin files in order. Here is an implementation in Perl, which is about 2x faster on average:

Results

Here are some timing measurements across three files:

10 million readsBash real 1m42.407s user1m57.863s sys0m52.479sPerl real 0m51.698s user0m35.626s sys0m9.897s20 million readsBash real 3m58.520s user4m2.155s sys1m58.167sPerl real1m39.824s user1m12.765s sys0m20.409s30 million readsBash real 5m53.346s user6m16.736s sys2m51.483sPerl real2m23.200s user1m46.815s sys0m30.754s
On face value, it appears the shell version ("Bash") is behaving in an O(N.logN) fashion, whereas the Perl implementation seems to be O(N) linear, which is somewhat expected given it reads all the data, writes all the data, then writes it all again. More testing and thinking is needed.

Conclusion

Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream and sorting clipped FASTQ files make it work better, than this is handy to know. Nothing groundbreaking here, and I may have repeated other people's work, but I hope it is useful to somebody.