Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls.

Reinecke F, Satya RV, DiCarlo J - BMC Bioinformatics (2015)

Bottom Line:
PCR-enriched amplicon-sequencing data have special characteristics that have been taken into account by only one publicly available algorithm so far. We describe a new algorithm named quandico to detect copy number differences based on NGS data generated following PCR-enrichment. A weighted t-test statistic was applied to calculate probabilities (p-values) of copy number changes.

Background: Next-generation sequencing (NGS) is rapidly becoming common practice in clinical diagnostics and cancer research. In addition to the detection of single nucleotide variants (SNVs), information on copy number variants (CNVs) is of great interest. Several algorithms exist to detect CNVs by analyzing whole genome sequencing data or data from samples enriched by hybridization-capture. PCR-enriched amplicon-sequencing data have special characteristics that have been taken into account by only one publicly available algorithm so far.

Results: We describe a new algorithm named quandico to detect copy number differences based on NGS data generated following PCR-enrichment. A weighted t-test statistic was applied to calculate probabilities (p-values) of copy number changes. We assessed the performance of the method using sequencing reads generated from reference DNA with known CNVs. We detected these variants with 98.6% sensitivity and 98.5% specificity, which is significantly better than another recently described method for amplicon sequencing. The source code (R-package) of quandico is licensed under the GPLv3 and is available at https://github.com/reineckef/quandico.
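
To make the idea of a weighted t-test on log2 ratios concrete, the following is a minimal Python sketch, not the quandico implementation (which is an R package). The function names, the one-sample test against a log2 ratio of 0 (no copy number change), the form of the weighted variance, and the numerical integration used for the p-value are all assumptions made for illustration.

```python
import math


def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value of Student's t, by Simpson integration of the density.

    Hypothetical helper; 'steps' must be even for Simpson's rule.
    """
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def weighted_t_test(log2_ratios, weights):
    """One-sample weighted t-test of log2 ratios against 0 (assumed null).

    Returns (weighted mean, t statistic, two-sided p-value).
    """
    n = len(log2_ratios)
    if n < 2 or n != len(weights):
        raise ValueError("need at least two ratios and one weight per ratio")
    w_sum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, log2_ratios)) / w_sum
    # Assumed variance form: weighted spread with an n / (n - 1) correction,
    # which reduces to the ordinary sample variance for equal weights.
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, log2_ratios)) / w_sum
    var *= n / (n - 1)
    std_err = math.sqrt(var / n)
    if std_err == 0.0:
        raise ValueError("zero variance: t statistic is undefined")
    t_stat = mean / std_err
    return mean, t_stat, t_two_sided_p(t_stat, n - 1)
```

With equal weights this sketch reduces to the ordinary one-sample t-test, so any gain from weighting comes from down-weighting less reliable primer sites. The integration-based p-value loses accuracy for extremely small p-values; a statistics library would be preferable in practice.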

Conclusion: We demonstrated that our new algorithm is suitable to call copy number changes using data from PCR-enriched samples with high sensitivity and specificity even for single copy differences.

Fig3: False positive/negative rates. False positive rates (FPR) and false negative rates (FNR) observed on the comparison dataset. The optimal threshold for every algorithm was determined by selecting the value that generated the minimal sum of FPR and FNR. The scores for every individual algorithm (x-axis) were then divided by the identified threshold (normalized to 1) for comparison. For algorithm details, see legend of Figure 2.
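
The threshold selection described in this legend can be sketched as follows. This is an illustrative Python example, not the authors' code: the function names are hypothetical, and it assumes that a higher score means stronger evidence for a copy number change and that a call is made when the score reaches the threshold.

```python
def best_threshold(scores, is_cnv):
    """Return (threshold, FPR, FNR) minimizing FPR + FNR.

    Assumption: a region is called as changed when score >= threshold.
    Candidate thresholds are the observed scores themselves.
    """
    positives = sum(1 for y in is_cnv if y)
    negatives = len(is_cnv) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both true CNVs and unchanged regions")
    best = None
    for t in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, is_cnv) if s >= t and not y)
        false_neg = sum(1 for s, y in zip(scores, is_cnv) if s < t and y)
        fpr = false_pos / negatives
        fnr = false_neg / positives
        if best is None or fpr + fnr < best[1] + best[2]:
            best = (t, fpr, fnr)
    return best


def normalize_scores(scores, threshold):
    """Divide scores by the chosen threshold so that the threshold maps to 1."""
    if threshold == 0:
        raise ValueError("threshold must be non-zero to normalize")
    return [s / threshold for s in scores]
```

After normalization, every algorithm's decision boundary sits at 1 on the x-axis, which is what allows score distributions from different algorithms to be compared on one plot.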

Mentions:
Initially, a t-test statistic using log2 ratios of all primer sites in a certain cluster was used, but classification performance based on the obtained p-values alone was not satisfactory (Figure 2, naive). Removal of outliers had a significant effect, mainly on the false negative rate. A similar effect can be seen by calculating weighted means instead of simple averages per cluster (Figure 3, outliers and weighted).
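
The three steps mentioned here (log2 ratios per primer site, outlier removal, and a weighted mean per cluster) can be sketched in Python as below. The text does not state the outlier rule or the weighting scheme, so the median absolute deviation (MAD) cutoff, the pseudocount, and the caller-supplied weights are assumptions chosen for illustration; read counts are assumed to be already normalized to comparable sequencing depth.

```python
import math
import statistics


def log2_ratios(sample_counts, control_counts, pseudocount=1.0):
    """Per-primer-site log2 ratio of sample over control read depth.

    The pseudocount (an assumption) avoids division by zero.
    """
    return [math.log2((s + pseudocount) / (c + pseudocount))
            for s, c in zip(sample_counts, control_counts)]


def drop_outliers(values, weights, max_mads=3.0):
    """Drop values far from the median, using a MAD-based cutoff (assumed rule).

    Returns the kept values and their matching weights.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0.0:
        return list(values), list(weights)
    cutoff = max_mads * 1.4826 * mad  # 1.4826 scales MAD to a normal SD
    kept = [(v, w) for v, w in zip(values, weights) if abs(v - med) <= cutoff]
    return [v for v, _ in kept], [w for _, w in kept]


def weighted_mean(values, weights):
    """Weighted average; with equal weights this is the simple average."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)
```

A cluster-level estimate would chain the three functions: compute the ratios, drop outliers, then take the weighted mean of what remains. One plausible choice of weight is a measure of read depth at each primer site, so that poorly covered sites contribute less, but that choice is not specified in the text above.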
