Scalable Nucleotide Alignment Program

SNAP is a new sequence aligner that is 3-20x faster and just as
accurate as existing tools like BWA-mem, Bowtie2 and Novoalign. It
runs on commodity x86 processors, and supports a rich error model that
lets it cheaply match reads with more differences from the reference
than other tools. This gives SNAP up to 2x lower error rates than
existing tools (in some cases) and lets it match larger mutations that
they may miss. SNAP also natively reads BAM, FASTQ, or gzipped FASTQ,
and natively writes SAM or BAM, with built-in sorting, duplicate
marking, and BAM indexing.

ami-46513b76:
A machine image on Amazon's EC2 in the us-west-2 (Oregon)
region that has Bowtie2, BWA 0.6.2, BWA 0.7.5, Novoalign, and
SNAPbeta5 installed with hg19 indices premade for each aligner
(20mer index for SNAP as well as a 20mer index from the GATK bundle
ucsc.hg19.fasta). This image is bundled with several EBS devices
holding simulated data (generated by Mason from hg19 and by TVSim
from Venter's genome) and a real dataset from the Platinum Genomes
project (NA18507). We recommend running this image on a
cr1.8xlarge instance (see the FAQ below for more detail).
Login instructions:

What is sequence alignment, and why is it important?

As the cost of DNA sequencing continues to drop faster than Moore's Law, there is a growing need for tools that can efficiently analyze large bodies of sequence data. By mid-2013, sequencing a human genome is expected to cost $1000, at which point this technology will enter the realm of routine clinical practice. For example, it is expected that each cancer patient will have their genome and their cancer's genome sequenced.

However, current high-throughput sequencing technologies produce large numbers of short (~100 letter) reads from random locations in the genome. Assembling these reads into a coherent whole is a significant computational challenge, with current pipelines taking thousands of CPU-hours per genome. The first and most expensive step of this process is aligning each read to a known reference genome, so that differences between the patient's genome and the reference genome can be localized.
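
To make the idea of aligning a read concrete, here is a toy sketch in Python. It is not how SNAP works: it slides the read along the reference and counts mismatches at every position, which is far too slow for a real genome and ignores insertions and deletions. The function name and the sequences are made up for illustration.

```python
def naive_align(read, reference):
    """Return (position, mismatches) for the best ungapped placement of read.

    Tries every start position in the reference, so the cost grows with
    len(reference) * len(read). Ties go to the leftmost position.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Hypothetical toy data: the read differs from the reference by one letter.
reference = "ACGTTGCATGCCGATAGCTAGGCTA"
read = "CCGAAAGC"
print(naive_align(read, reference))  # (10, 1)
```

The single mismatch reported here is the kind of difference between a patient's genome and the reference that later analysis steps try to localize.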

What makes SNAP faster?

SNAP leverages a combination of three insights: increasing read lengths, which allow fast hash-based location of reads using longer "seed" sequences; increasing server memories, which allow trading memory for CPU time (SNAP is designed for server machines with tens of gigabytes of RAM); and novel set-intersection and edit-distance algorithms, together with a pruning methodology, that let SNAP reject most candidate locations without fully scoring them, dramatically reducing the cost of local alignment checks. Please refer to the SNAP paper for details.
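
The general seed-and-check idea can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not SNAP's actual implementation: the function names, the non-overlapping seed choice, and the fixed-length comparison window are all invented for the example, and SNAP's set-intersection step is not shown. The sketch builds a hash table from seeds to reference positions, looks up the seeds of a read to get candidate locations, and scores each candidate with an edit-distance routine that gives up as soon as the distance must exceed a limit.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Map every seed_len-letter substring of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def bounded_edit_distance(a, b, limit):
    """Levenshtein distance between a and b, or None if it exceeds limit.

    The minimum of each DP row never decreases from one row to the next,
    so once a row's minimum passes the limit the candidate can be rejected
    without filling in the rest of the table.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # match or substitute
        if min(cur) > limit:
            return None  # early rejection
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def align(read, reference, index, seed_len, max_dist):
    """Return (position, distance) of the best candidate, or None."""
    candidates = set()
    # Non-overlapping seeds; each hit implies a start position for the read.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        for hit in index.get(read[offset:offset + seed_len], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        dist = bounded_edit_distance(read, window, max_dist)
        if dist is not None and (best is None or dist < best[1]):
            best = (start, dist)
    return best


# Hypothetical toy data; real seeds are around 20 letters, as noted above.
reference = "ACGTTGCATGCCGATAGCTAGGCTA"
index = build_index(reference, seed_len=4)
print(align("CCGAAAGC", reference, index, seed_len=4, max_dist=2))  # (10, 1)
```

Only the one location suggested by a seed hit is scored here, instead of every position in the reference. The index costs memory proportional to the reference length, which is the memory-for-CPU trade described above.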

What do I need to run SNAP?

SNAP runs on Windows, Linux and Mac OS X. To align against the full human genome, you will need at least 64 GB of memory. SNAP can also take full advantage of multicore processors: use the -t option to set the number of threads.

What file formats does SNAP support?

SNAP supports the standard FASTQ and SAM file formats for import, as well as SAM for output. Reference genomes should be FASTA.

I get a "bad_alloc" error building an index for hg19, but I have more than 60 GB RAM

SNAP can build this index using about 50 GB of RAM, but on machines with not much more memory than that, Linux will
sometimes refuse an allocation to avoid overcommitting the total memory available. Run
sudo sysctl vm.overcommit_memory=1 to disable this check.
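
To see which overcommit policy a machine is currently using before building an index, a small Python helper like the following can read the setting back. This is a convenience sketch, not part of SNAP; the function names are our own, and it assumes the standard Linux location /proc/sys/vm/overcommit_memory.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory as documented for the Linux kernel.
MODES = {
    0: "heuristic: obviously excessive allocations are refused",
    1: "always overcommit: allocations are never refused up front",
    2: "strict: total commitments are capped, no overcommit",
}


def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current mode as an int, or None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None  # e.g. not running on Linux
    return int(p.read_text().strip())


def describe(mode):
    """Turn a mode number (or None) into a one-line human-readable summary."""
    if mode is None:
        return "vm.overcommit_memory not available on this system"
    return f"vm.overcommit_memory={mode} ({MODES.get(mode, 'unknown mode')})"


print(describe(overcommit_mode()))
```

After running the sysctl command above, this should report mode 1.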

How do you recommend running SNAP on EC2?

For the highest-throughput alignment against hg19, we recommend using the cc2.8xlarge or cr1.8xlarge
Amazon instance type (16 cores with 60 GB RAM, or 16 cores with 240 GB RAM, respectively), with Ubuntu 12.04.
Get the latest Ubuntu AMI from the Ubuntu website.
Also, once you start the machine, run
sudo sysctl vm.overcommit_memory=1 to allow SNAP to use all the memory.