
filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.

I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips adaptor sequences and trims low-quality bases from paired end reads. It simply calls the fastx toolkit (which is assumed to be on your path): it runs fastx_clipper if an adaptor sequence is specified, pipes the output to fastq_quality_trimmer for each file, then loops through the filtered output and keeps only reads that appear in both files. Usage is something like:
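The original usage example did not survive here. An invocation consistent with the options and output files described below might look like this — the script name, adaptor sequence, and option values are placeholders, not the exact command from the post:

```shell
# illustrative invocation; ADAPTORSEQ and the numeric values are placeholders
python fastq_pair_filter.py -a ADAPTORSEQ -M 8 -t 20 -l 30 --sanger \
    en.wt.1.fastq en.wt.2.fastq
```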

Where the -a (adaptor), -M (length of adaptor match), -t (min quality threshold), and -l (min length after quality trimming) options are copied directly from (and sent directly to) the fastx toolkit. --sanger indicates that the reads have fastq qualities in the sanger encoding; if that option is not specified, qualities are assumed to be in illumina 1.3 format, where the ascii offset is 64.

Output
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim, each with the same number of corresponding records that pass the filtering above.

Adaptors
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma-delimited -a option--in a single run.

Script
It's not too pretty, but it does the job:

Comments

Hey, great post -- I was looking into a script that would trim paired reads. Are you getting a lot of adapter sequence because you're using 75-bp reads and some of your insert sizes are small enough that you get read-through into the adapter? I noticed that FastQC only recognized one of the Illumina adapters. Do you see the other one as well?

@jeffhsu3 I believe FastQC finds over-represented sequences whether they are primers or not. For example, it finds poly-A stuff left in the reads. It often finds multiple primers (and potentially different primers for the _1 and _2 reads).

I suspect you could run the barcode splitter on both ends of the reads independently, then run each pair of resulting files through this program. It should remove any reads that are unpaired -- either because of the barcode or because of quality filtering.

I tried the one downloaded from https://gist.github.com/588841. I found the output pairs DO NOT MATCH. The other version does not work either: https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py

Hi, can you confirm whether either of the scripts mentioned works properly? I tried pair_fastx_clip_trim.py and fastq_pair_filter.py. Both scripts crash with the adaptors option. Without the -a option, pair_fastx_clip_trim.py produces out-of-sync files (one shorter than the other), and fastq_pair_filter.py produces files with very few reads. With my 9GB starting files, the first script produces 7GB files and the second one 150MB files. Thanks for your help.

Hi, to any commenters having trouble: I have updated the first sentence of the post to indicate that this should not be used. I have not been maintaining this and don't intend to do so in the near future. There are multiple versions about (my fault) with various different bugs.

Hi BrentP, thanks for this tool. I came here after checking the fastx_clipper tool manually. To my understanding, it did NOT work as expected (or as praised).

For sure, it pulled out all those reads with adapter sequences. But, in addition, it also pulled out reads that originally came from my genome of interest. It was very obvious when I aligned those reads (containing adapters, as reported by the fastx toolkit) against the NCBI_NT db or my genome of interest.

Following is the command I used to see what fastx_clipper thinks are adapter-containing reads:
