Bottom Line:
While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer.

Abstract:
The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
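To make the idea of separating true variants from sequencing errors concrete, the following is a minimal sketch and not the LoFreq implementation. It assumes that each base's Phred quality gives an independent error probability and asks how likely it is to see at least the observed number of non-reference bases in a column from errors alone (a Poisson-binomial tail). The function names, the `alpha` threshold, and the absence of any multiple-testing correction are illustrative assumptions.

```python
from typing import Sequence, Tuple


def phred_to_error_prob(quality: float) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-quality / 10.0)


def prob_at_least_k_errors(error_probs: Sequence[float], k: int) -> float:
    """P(X >= k), where X is a sum of independent Bernoulli(p_i) error events.

    Computed exactly by dynamic programming; dist[j] holds the probability
    of exactly j errors so far for j < k, and dist[k] accumulates the
    probability of k or more errors.
    """
    if k <= 0:
        return 1.0
    dist = [1.0] + [0.0] * k
    for p in error_probs:
        # Update from high to low so dist[j - 1] is still the old value.
        dist[k] += dist[k - 1] * p
        for j in range(k - 1, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return dist[k]


def looks_like_variant(
    base_qualities: Sequence[float],
    variant_count: int,
    alpha: float = 0.05,  # hypothetical significance threshold
) -> Tuple[bool, float]:
    """Flag a column as a candidate variant if errors alone are an unlikely cause.

    Simplification (assumption): every sequencing error is treated as if it
    produced the observed variant base, which overstates the error-only
    probability and so errs on the conservative side.
    """
    error_probs = [phred_to_error_prob(q) for q in base_qualities]
    p_value = prob_at_least_k_errors(error_probs, variant_count)
    return p_value < alpha, p_value


if __name__ == "__main__":
    # 1000 reads at Q30 (error rate 0.001): about 1 error is expected.
    print(looks_like_variant([30] * 1000, 1))   # consistent with errors
    print(looks_like_variant([30] * 1000, 10))  # unlikely to be errors alone
```

At 1000x coverage and Q30, ten variant bases correspond to a 1% frequency, which illustrates why deep coverage plus an explicit error model can support calls that a fixed frequency cutoff would discard.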

Figure 2 (gks918-F2): SNV calling in the presence of tumor sample heterogeneity. Germline and somatic variant frequencies for paired tumor-normal exome sequencing datasets from a custom samtools-based pipeline (32) are compared here with those from LoFreq (see ‘Materials and Methods’ section). As shown, while germline variants are consistently distributed around 50% (as expected for heterozygous variants), somatic variants are shifted to lower frequencies, likely due to contamination in the tumor sample from normal stromal tissue. Note that while samtools-based somatic calls appear ‘clipped’ at lower frequencies, LoFreq calls are symmetrically distributed as expected.

Mentions:
High-coverage exome and whole-genome sequencing datasets for matched tumor and normal samples from cancer patients are increasingly being generated to characterize cancer-specific somatic mutations that could play a driving role in tumorigenesis. Despite the known heterogeneity of tumors, calling of somatic variants is often limited to those in a majority of the cells or performed using ad hoc approaches (10,32,41). In addition, since tumor samples are often contaminated with normal tissue, the ability to robustly detect somatic mutations can be critical. In particular, results from a samtools analysis of 14 exome sequencing datasets for gastric tumor/normal paired samples from Zang et al. (32) revealed an asymmetric frequency distribution for the somatic SNVs called, suggesting that sample contamination can lead to significantly reduced sensitivity even with high sequencing coverage (Supplementary Figure S4). Re-analysis of these datasets with LoFreq helped to recover the full distribution (Figure 2), revealing the value of a systematic approach to call low-frequency somatic SNVs even when the goal is to only identify heterozygous and homozygous variants in high-coverage datasets.
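The shift of somatic variants to lower frequencies follows from simple mixing arithmetic, sketched below. This is an illustration under stated assumptions, not part of the original analysis: the region is taken to be diploid in both tumor and normal cells, a heterozygous variant occupies half the alleles in cells that carry it, and `purity` (the fraction of tumor cells in the sample) is a hypothetical parameter.

```python
def expected_variant_frequency(
    purity: float,
    tumor_allele_fraction: float = 0.5,   # heterozygous in tumor cells (assumption)
    normal_allele_fraction: float = 0.0,  # absent from normal cells for a somatic variant
) -> float:
    """Expected variant allele frequency in a tumor/normal cell mixture.

    Assumes equal copy number in tumor and normal cells, so the mixture
    frequency is a purity-weighted average of the two allele fractions.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return purity * tumor_allele_fraction + (1.0 - purity) * normal_allele_fraction


if __name__ == "__main__":
    # Somatic heterozygous variant in a sample with 60% tumor cells.
    print(expected_variant_frequency(0.6))
    # Germline heterozygous variant: present in both cell types.
    print(expected_variant_frequency(0.6, 0.5, 0.5))
```

Under these assumptions a germline heterozygous variant stays near 50% regardless of purity, while a somatic one is centered near half the purity, so a caller tuned for roughly 50% frequencies would miss the lower tail of the somatic distribution.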
