Description

BWA is a fast, lightweight tool that aligns short sequences to a sequence
database, such as the human reference genome. By default, BWA finds an alignment
within edit distance 2 of the query sequence, except that gaps close to the
end of the query are disallowed. It can also be tuned to find a fraction of
longer gaps at the cost of speed and of more false alignments.

BWA excels in speed. Mapping 2 million high-quality 35bp short reads
against the human genome can be done in 20 minutes. Usually such speed is
gained at the cost of huge memory use, disallowing gaps, and/or hard limits
on the maximum read length and the maximum number of mismatches. BWA makes
none of these trade-offs. It is still relatively lightweight (2.3GB of memory
for human alignment), performs gapped alignment, and does not set a hard limit
on read length or maximum mismatches.

Given a database file in FASTA format, BWA first builds a BWT index with
the 'index' command. The alignments in suffix array (SA) coordinates are
then generated with the 'aln' command. The resulting file contains ALL the
alignments found by BWA. The 'samse' or 'sampe' command then converts SA
coordinates to chromosomal coordinates. For single-end reads, most of the
computing time is spent on finding the SA coordinates (the aln command). For
paired-end reads, half of the computing time may be spent on pairing (the
sampe command) given 32bp reads. Using longer reads would reduce the fraction
of time spent on pairing because each end in a pair would be mapped to fewer
places.
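
The three steps above (index, aln, samse) can be sketched for single-end reads as a small shell function. This is only an illustration: the function name bwa_single_end and its arguments (reference FASTA, read file, output prefix) are hypothetical and not part of BWA itself.

```shell
# Minimal sketch of the single-end workflow described above.
# bwa_single_end is a hypothetical helper, not a BWA command.
# Arguments: reference FASTA, read file, output prefix.
bwa_single_end() {
    ref=$1
    reads=$2
    prefix=$3
    # Step 1: build the BWT index for the reference.
    bwa index "$ref" || return 1
    # Step 2: find alignments in suffix array (SA) coordinates.
    bwa aln "$ref" "$reads" > "$prefix.sai" || return 1
    # Step 3: convert SA coordinates to chromosomal coordinates.
    bwa samse "$ref" "$prefix.sai" "$reads" > "$prefix.sam"
}
```

It could be called as: bwa_single_end ref.fa reads.fq sample1 (all three names are placeholders). For paired-end reads, aln would be run once per read file and sampe would replace samse.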

How to Use

Module

There are multiple versions of BWA available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail bwa

To select a module, type

module load bwa/[ver]

where [ver] is the version of choice. This will set your $PATH variable.

Index Files

Pre-built BWA index files are available in

/fdb/igenomes/[organism]/[source]/[build]/Sequence/BWAIndex/genome.fa

[organism] is the specific organism of interest (Gallus_gallus, Rattus_norvegicus, etc.)

If the UCSC/hg19 BWA index is used, the bwa process will need at least 3 GB of memory.
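
As an illustration of how the path template expands, the sketch below fills in the placeholders for the UCSC/hg19 index mentioned above. The organism directory name Homo_sapiens is an assumption based on the naming pattern of the other organisms; check the contents of /fdb/igenomes for the actual names.

```shell
# Fill in the path template for one index.
# Homo_sapiens is an assumed directory name; UCSC and hg19 are the
# source and build mentioned in the text.
organism=Homo_sapiens
source_name=UCSC
build=hg19
BWA_INDEX=/fdb/igenomes/$organism/$source_name/$build/Sequence/BWAIndex/genome.fa
echo "$BWA_INDEX"
```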

Multithreading

BWA is a multithreaded application. That is, the bwa command can distribute its work across multiple CPUs on a
single node. The number of threads BWA will use is controlled by the -t option.
The total number of threads allocated to multiple bwa processes on the same node should not exceed the total
number of CPUs on the node.
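
The rule above amounts to simple arithmetic. The sketch below assumes a node with 16 CPUs and four bwa processes running at once; both numbers are illustrative and should be replaced with the values for your node and workload.

```shell
# Divide the CPUs on a node among concurrent bwa processes so that
# the total thread count does not exceed the CPU count.
cpus=16     # CPUs on the node (illustrative value)
procs=4     # number of bwa processes sharing the node (illustrative value)
threads=$(( cpus / procs ))
echo "Each bwa process may use up to $threads threads (-t $threads)"
```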

Submitting a single BWA batch job

1. Create a script file containing the bwa commands to run.

2. Make sure you use an appropriate number of threads (-t) for bwa processes.
For example, g72 nodes have 16 CPUs, while g4 nodes have 2 CPUs. To direct bwa
to use four threads, pass -t 4 to the bwa command.
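
A script file along these lines could be created as sketched below. The script name bwa_job.sh, the read file names, and the Homo_sapiens index path are all hypothetical placeholders, and the command used to submit the script depends on the batch system, so it is not shown.

```shell
# Write a hypothetical batch script, bwa_job.sh, that runs bwa with
# four threads. File names and the index path are placeholders.
cat > bwa_job.sh <<'EOF'
#!/bin/bash
# Load the default bwa module (use bwa/[ver] to pick a version).
module load bwa

# Assumed location of a pre-built index; adjust to your organism.
index=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa

# Align single-end reads using four threads, then convert to SAM.
bwa aln -t 4 "$index" reads.fq > reads.sai
bwa samse "$index" reads.sai reads.fq > reads.sam
EOF
```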

The -f option is required for swarm. Because bwa is multithreaded, the -t option can
be used to direct swarm to allocate multiple CPUs per bwa process. Also, because full genome alignments using bwa require
substantial memory, the -g option can be used to direct swarm to allocate a given number of GB of
memory per bwa process.

By default, swarm will execute each line on one CPU, using 1 GB of memory. In the above case, bwa requires four
threads, so the swarm command line should be:

swarm -f cmdfile -t 4 --module bwa
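
For reference, a command file for the swarm call above might hold one bwa command per line, each using four threads. The sketch below writes such a file; the index prefix small_index and the read file names are hypothetical.

```shell
# Write a hypothetical swarm command file with one bwa command per line.
# small_index and the sample file names are placeholders.
cat > cmdfile <<'EOF'
bwa aln -t 4 small_index sample1.fq > sample1.sai
bwa aln -t 4 small_index sample2.fq > sample2.sai
EOF
```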

If a larger BWA index file is used, for example hg19, then the amount of memory per bwa process must be increased
using the -g option.
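
As a sketch, the swarm call for an hg19 alignment could be assembled as follows. The value of 4 GB is an assumption chosen to sit above the 3 GB minimum stated for the UCSC/hg19 index; adjust it to the actual needs of your jobs.

```shell
# Assemble a swarm command line with extra memory per bwa process.
threads=4
mem_gb=4    # assumed value, above the 3 GB minimum for UCSC/hg19
swarm_cmd="swarm -f cmdfile -t $threads -g $mem_gb --module bwa"
echo "$swarm_cmd"
```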