
ngs / high-throughput sequencing pipeline

This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know.
I'll put it in the form of a shell script that assumes you've got this software installed.
I'll also assume your data is in the FASTQ format. If it's in Illumina's qseq format, you can convert to FASTQ with this script by sending a list of qseq files as the command-line arguments.
If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply.
This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac.
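The script assumes a few variables are set up front -- something like this (a sketch; the file names, adaptor sequence, and bowtie path below are placeholders to change for your own data):

# reads and reference genome used throughout (example names)
FASTQ=reads.fastq
REFERENCE=genome.fasta
# adaptor/primer sequence to clip (use the one FastQC reports for your data)
CLIP=AGATCGGAAGAGC
# where the bowtie binaries live
bowtie_dir=/usr/local/bowtie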

Most of the following will run as-is for any set of reads; you'll only need to change the FASTQ and REFERENCE variables above.

Seeing the Reads

If you have a file with 2GB of reads, you likely can't just read (pun intended) it and get an idea of what's going on -- though I've tried. There are a number of tools to give you a summary of the data including stats such as quality per base, nucleotide frequency, etc. While fastx toolkit will do this for you, I've found fastqc to be the best choice.
The command:

$fastqc $FASTQ

will write a folder named after the input file (e.g. "a_fastqc/") containing the HTML report in fastqc_report.html.
Here's an example of the nicely formatted and informative FastQC report before quality filtering and trimming (scroll to see the interesting stuff):

From that, we can see that (as with all Illumina datasets) quality scores are lower at the 3' end of the read. For this analysis, I'll choose to chop bases with a quality score under 28. In addition to the other information, we can see that there is an Illumina primer still present on many of the reads, so we'll want to chop that out in the filtering step below.

Filtering the Reads

The fastx toolkit is not extremely fast, but it has a simple command-line interface for most of the common operations I use to filter reads (though it--like similar libraries--lacks support for filtering paired-end reads). For this case, we want to trim nucleotides with quality less than 28 from the ends of each read and then remove any read whose remaining length is less than 40. In addition, we want to clip the Illumina adaptor sequence. This can be done by piping 2 commands together:
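Something along these lines should work (a sketch; the -M 30 and -t 28 values match the ones referenced in the comments below, and $CLIP is assumed to hold the adaptor sequence that FastQC reported):

# clip the adaptor, then trim low-quality ends and drop short reads
fastx_clipper -a $CLIP -i $FASTQ -M 30 \
    | fastq_quality_trimmer -t 28 -l 40 -o ${FASTQ}.trim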

Both fastx_clipper and fastq_quality_trimmer are provided in the fastx toolkit.
The first command clips the adaptor sequence from $FASTQ and pipes the output to the 2nd command, fastq_quality_trimmer, which chomps bases with quality less than 28 (-t), then discards any read with a remaining length less than 40 (-l), and sends the output to a .trim file.

Seeing the Filtered Reads

We then want to re-run fastqc on the trimmed, clipped reads.
The command:

$fastqc ${FASTQ}.trim

and the output looks like:

where we can see the quality looks much better and there are no primer sequences remaining.
This took the number of reads from 14,368,483 to 11,579,512 (and many shorter--trimmed--reads in the latter as well).

Analysis

Up to this point, the analysis will have been very similar for RNA-Seq, BS-Seq, and genomic reads, but you'll want to customize your filtering steps. The following will go through a simple alignment using bowtie that assumes you have genomic reads.

Aligning

The first step is to build the reference bowtie index:

${bowtie_dir}/bowtie-build --quiet $REFERENCE bowtie-index

That will create the bowtie index files (with the prefix bowtie-index) for your reference fasta. It can take a while, so you may want to download the pre-built indexes from the bowtie website if your organism is available.
Then, you can run the actual alignment as:
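A sketch of that command (the index prefix matches the bowtie-build step above; adjust paths and thread count for your machine):

${bowtie_dir}/bowtie --tryhard --phred64-quals -p 4 -m 1 -S \
    bowtie-index ${FASTQ}.trim ${FASTQ}.trim.sam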

which tells bowtie to try hard (--tryhard), assume quality scores use the latest scheme from Illumina (--phred64-quals), use 4 processors (-p 4), discard any reads that map to more than one location in the reference (-m 1), use SAM output format (-S), and then where to find the bowtie index and the reads. The output is sent to ${FASTQ}.trim.sam.
We've done a lot of experimenting with different values for -m, and it can affect your results, but -m 1 seems to be a common choice in the literature. And it's clearly less sketchy than, for example, -m 100, which would report up to 100 alignments for any read that maps to fewer than 100 locations in the genome.

View the Alignment

From there, we want to view the alignment. Most tools can handle SAM formatted files, but will perform better with an indexed bam. To get that, we do:
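With samtools, that's something like this (a sketch; newer samtools versions use samtools sort -o sorted.bam input.bam rather than an output prefix):

# convert SAM to BAM, sort, and index
samtools view -bS ${FASTQ}.trim.sam > ${FASTQ}.trim.unsorted.bam
samtools sort ${FASTQ}.trim.unsorted.bam ${FASTQ}.trim
samtools index ${FASTQ}.trim.bam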

With the sorted, indexed BAM in hand, you'll probably want to use a "real" viewer such as IGV, MochiView, or Tablet, all of which I have had some success with.

Note that you may want to choose another aligner, because bowtie does poorly with indels. Something like GSNAP will be better suited for reads where you have more complex variants.
I welcome any feedback on these methods.

Comments

A couple of questions:

1. After filtering for quality >28, the data still includes some reads with phred scores below 28. Is this because the Illumina quality score != phred score, or have I missed something?

2. You used a -M 30 param for the clipper method (fastx_clipper -a $CLIP -i $FASTQ -M 30), and -t 28 on the trimmer (fastq_quality_trimmer -t 28) but I don't find this on the hannonlab commandline page. Ummm ... point me in the right direction, please?
