Variant Detection in Massively Parallel Sequencing

There is at last some evidence that I do real scientific work, and real scientific writing outside of Massgenomics. Online today at Bioinformatics is our publication of variant detection in massively parallel sequencing of individual and pooled samples. In it we present VarScan, the culmination of my work over the past two years to develop algorithms for calling SNPs and indels in deep resequencing data from the 454 and Illumina/Solexa platforms. It’s one of the many cancer genomics tools that we have developed and made available to the research community, and the first to be published.

The Challenge

With next-generation sequencing, the problem of accurately calling variants faces two key challenges. First, to distinguish between true variants and false positives in shorter, more error-prone sequencing reads. Second, to handle the sheer volume of data, which might be millions of reads from a single run. The 454 and Illumina platforms present their own particularly difficulties as well. On the 454, it’s homopolymers – stretches of A’s, C’s, G’s, or T’s four or more bases long. The exact number of bases in a homopolymeric stretch becomes difficult to resolve by pyrosequencing, and as a result, 454 sequencing tends to overcall or undercall bases in these regions. This presents alignment difficulties and numerous artifact insertions/deletions relative to the reference sequence. On the Illumina platform, reads are shorter, and the base qualities drop sharply after 28 or so bases. Until recent improvements that have yielded 50, 75, and 100 bp read lengths, alignment of Illumina reads was particularly difficult given this base quality paradigm.

Targeted Resequencing, Pooled Samples

For whole genome sequencing of individual samples, we’ve used Maq with some success to map, assemble, and call variants from Illumina/Solexa reads. However, for various projects we’re interested in sequencing only selected regions (targeted by PCR or capture), often in many samples. Deep sequencing of pooled samples is one area, unfortunately, where Maq and RunMapper/Newbler continue to struggle. Here, the individual read counts supporting reference and variant alleles become important in distinguishing true variants and assessing their frequencies in a pool. Also, due to the high-priority of say, validating somatic variants in a tumor sample with 454 sequencing, we wanted something that could leverage faster aligners – parallelized BLAT for 454 and Bowtie for Illumina.

VarScan: SNP and Indel Calling in BLAT, Bowtie, Novoalign…

To address these challenges we developed VarScan, a program for calling SNPs and indels in next-gen sequencing data. VarScan accepts read alignments in several native formats – BLAT and Newbler for 454 data, Bowtie, cross_match, and Novoalign for Illumina/Solexa data. Given a file of read alignments, VarScan detects sequence variants, combines them by position and type, and then computes the read counts, average base quality, and number of strands supporting each allele. As a final step, one can use VarScan to limit variant calls to a set of targeted positions, which removes variants arising from spurious read alignments. Variant calls can also be filtered by base quality, read count, and variant frequency, which is useful for isolating variants at frequencies of say 1% or higher in a pool of samples.

Proof of Principle: Individual 454 and Pooled Illumina Sequencing

To demonstrate VarScan’s capabilities, we used data from a collaboration with Stephen Daiger and colleagues, who are resequencing candidate genes in a cohort of families with Retinitis Pigmentosa (RP). The Genome Center had designed 1,000 amplicons targeting the exons of dozens of genes, and obtained PCR products for these for each sample. Samples were sequenced individually on the 454 platform to ~70x coverage. Then, samples from 42 individuals were pooled and sequenced on the Illumina platform to extremely deep coverage (~6000x, or ~125x per sample). We used BLAT and Bowtie to align the 454 and Illumina reads, respectively, and applied VarScan to the resulting alignments.

Sensitivity, Specificity, SNPs, and Indels

I won’t ruin the whole story for you, but suffice it to say that most of the SNPs called by VarScan in the 454 sequencing were present in dbSNP or supported by VarScan’s Illumina/Solexa calls, or both. Incredibly, when we plotted the estimated allele frequencies from deep Illumina sequencing of the pool to the known frequencies from individual 454 results, the correlation was extremely high (>0.95, Pearson’s). This demonstrates VarScan’s capabilities not only to detect variants with good sensitivity/specificity, but to accurately estimate their frequency in a pool. To evaluate VarScan’s performance on insertion/deletion (indel) variants, we identified 77 high-confidence short (1-5 bp) indels in the 454 data and attempted to validate them with Illumina data. Since Maq, Bowtie, and most other Illumina aligners don’t allow gaps, we used another aligner, Novoalign, which claims to have higher read placement sensitivity and aligns reads with gaps in single-end libraries. We used VarScan to call indels in the Novoalign output, and we were able to validate around 60% of the high-confidence indels from 454 with Novoalign alignments of Illumina data. Taken together, these results support the utility of VarScan for variant detection in next-gen sequencing, and demonstrate the feasibility of large-scale mutational profiling by deep resequencing of pooled samples.