HPC Lab

BMI - OSU

Benchmarking Short Sequence Mapping Tools

General information

The development of next-generation sequencing instruments has led to the generation of millions of
short sequences in a single run. Aligning these reads to a reference genome is time
consuming and demands fast and accurate alignment tools. However, the currently
proposed tools make different compromises between the accuracy and the speed of mapping. Moreover,
many important aspects are overlooked when comparing the performance of a newly developed
tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all
of these aspects. In this work, we introduce a benchmarking suite to extensively analyze short sequence mapping tools
with respect to various aspects and provide an objective comparison.
Information about the tools and the options we used in the experiments is given below. In addition,
the code used to verify the tools is included.

Related Publications

Experimental setup

The experiments in this study can be repeated following three
major steps: getting the reference genomes, generating the synthetic
data sets, and choosing the right options for the tools. Each of
these steps is described in detail below. If needed, instead of
regenerating the data sets, you can also download them from here.

Getting the reference genomes:

The reference genomes are available online from different websites.
For our experiments, we downloaded the genomes from the UCSC Genome Bioinformatics Center
(http://genome.ucsc.edu/).
There are also other available reference genomes.

Generating the synthetic data:

To generate the synthetic data, we used wgsim, which is part of the SAMtools package
( http://samtools.sourceforge.net/ ).
wgsim provides options to
set the base error rate, the mutation rate, and the fraction of indels, among others.
For our experiments, we used the following values:
-e 0.02 (base error rate, default value), -r 0.0009 (mutation rate), -R 0.0001 (fraction of indels)
In addition, depending on the experiment, we changed -N (total number of reads)
and -1 and -2 (lengths of the first and second reads).
In addition to wgsim, we used ART (
http://www.niehs.nih.gov/research/resources/software/biostatistics/art/ )
to generate reads with a varying sequencing error rate. The default options were used.
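
The wgsim settings listed above can be collected into a small shell helper. This is a minimal sketch, not the script used in the study: the function name wgsim_cmd, the file names, and the example read count and read lengths are placeholders chosen for illustration. The helper only prints the command line, so it can be inspected before anything is run.

```shell
#!/bin/sh
# Print a wgsim command line using the -e, -r, and -R values listed above.
# The read count (-N) and read lengths (-1, -2) change per experiment.
# Arguments (hypothetical names): reference FASTA, read count,
# length of read 1, length of read 2, output file prefix.
wgsim_cmd() {
    ref=$1; nreads=$2; len1=$3; len2=$4; prefix=$5
    printf 'wgsim -e 0.02 -r 0.0009 -R 0.0001 -N %s -1 %s -2 %s %s %s_1.fq %s_2.fq\n' \
        "$nreads" "$len1" "$len2" "$ref" "$prefix" "$prefix"
}

# Example with illustrative values only: 1000000 reads of length 100.
wgsim_cmd genome.fa 1000000 100 100 reads
```

Once wgsim is installed and the reference file exists, the printed line can be run by piping it to sh, for example: wgsim_cmd genome.fa 1000000 100 100 reads | sh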

Choosing the right options for the tools:

It is important to enable and disable the right options, as well as to choose the right values for them.
In the following, we show
how the different options were set to ensure a fair comparison.
Note that in all of the experiments we used
pMap on a single node to measure the execution time for some of the tools,
because not all of the tools report it.
Therefore, we first explain what pMap is and how to use it.
Then, we describe the scripts we used to run pMap with the different tools.

pMap:
pMap (
http://bmi.osu.edu/hpc/software/pmap/pmap.html) is an open-source,
MPI-based tool that enables the
parallelization of existing short sequence mapping tools.
Currently, it supports the following tools:
Bowtie, BWA, SOAP, GSNAP, MAQ, and RMAP.
In addition, it can easily be extended to integrate other tools.
The main pMap commands are the following (-pe selects paired-end mode):

pmap_index $genomefile $indexdir $indexprefix $programname
pmap_dist $workdir $outdir $readsfile [-r $readfile2]
pmap [-pe] -i $indexdir $indexprefix $workdir $outdir $programname $options
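
The order of the three pMap steps can be sketched as a dry-run helper. This is only an illustration: the function name pmap_plan and all file and directory names are hypothetical, and it is an assumption that a paired-end run combines -r in pmap_dist with -pe in pmap. How the commands are launched under MPI is not shown here, and nothing is executed.

```shell
#!/bin/sh
# Print the three pMap steps in order for one tool. Nothing is executed.
# Arguments (hypothetical names): genome FASTA, index dir, index prefix,
# work dir, output dir, tool name, tool options, reads file,
# and an optional second reads file for paired-end data.
pmap_plan() {
    genome=$1; indexdir=$2; prefix=$3; workdir=$4; outdir=$5
    program=$6; options=$7; reads=$8; reads2=${9:-}

    echo "pmap_index $genome $indexdir $prefix $program"
    if [ -n "$reads2" ]; then
        # Assumption: paired-end runs pass the mate file with -r and add -pe.
        echo "pmap_dist $workdir $outdir $reads -r $reads2"
        echo "pmap -pe -i $indexdir $prefix $workdir $outdir $program $options"
    else
        echo "pmap_dist $workdir $outdir $reads"
        echo "pmap -i $indexdir $prefix $workdir $outdir $program $options"
    fi
}

# Example: single-end Bowtie run with illustrative paths.
pmap_plan genome.fa index lancelet work out bowtie "-n 2 -l 28 -e 100 -S" reads.fq
```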

Bowtie

Before calling the mapping program,
Bowtie needs the BOWTIE_INDEXES environment variable to contain
the location of the reference genome index:

BOWTIE_INDEXES=/home/dayat/out-bowtie/index/lancelet; export BOWTIE_INDEXES

The experiment-specific options are as follows:

Quality threshold: -e 140 -n 2 -l 28 -S

Number of mismatches: -n 2 -l 28 -S -e (40, 60, 80, 100, 120, 140)

Seed length: -n 2 -l (20, 24, 28, 32, 36) -e 100 -S

Read length: -n 2 -l 28 -e 100 -S

Paired end: -n 2 -l 28 -e 100 -S -I 0 -X 500

Genome type: -n 2 -l 28 -e 100 -S

Performance: -n 2 -l 28 -e 100 -S -p (2, 4, 8)
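
The entries above that list several values in parentheses each stand for one run per value. A minimal sketch of how those option strings could be enumerated follows; the function name bowtie_sweeps is hypothetical, and the output is only the option string for each run, which would then be passed as $options to pMap.

```shell
#!/bin/sh
# Print one Bowtie option string per run for the three sweeps listed above.
bowtie_sweeps() {
    # -e sweep (the values listed under "Number of mismatches").
    for e in 40 60 80 100 120 140; do
        echo "-n 2 -l 28 -S -e $e"
    done
    # Seed length sweep: vary -l.
    for l in 20 24 28 32 36; do
        echo "-n 2 -l $l -e 100 -S"
    done
    # Performance sweep: vary the thread count -p.
    for p in 2 4 8; do
        echo "-n 2 -l 28 -e 100 -S -p $p"
    done
}

bowtie_sweeps
```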

Bowtie2

Bowtie2 is not supported by pMap. To run Bowtie2, the following command is used:

bowtie2 -t --ignore-quals -x $indexdir/$indexprefix -U $readfile -S $workdir/out.txt

The experiment-specific options are as follows:

FANGS

FANGS is not yet supported by pMap. Therefore, to run FANGS, call the following command:

fangs $indexdir/$indexprefix $readsfile -out=pslx $workdir/out.txt

FANGS is used in only one experiment (Read length), and no specific options are needed
for this experiment. By default, it allows five mismatches.