Abstract

High-throughput sequencing of circulating tumor DNA (ctDNA) promises to facilitate personalized cancer therapy. However, low quantities of cell-free DNA (cfDNA) in the blood and sequencing artifacts currently limit analytical sensitivity. To overcome these limitations, we introduce an approach for integrated digital error suppression (iDES). Our method combines in silico elimination of highly stereotypical background artifacts with a molecular barcoding strategy for the efficient recovery of cfDNA molecules. Individually, these two methods each improve the sensitivity of cancer personalized profiling by deep sequencing (CAPP-Seq) by about threefold, and synergize when combined to yield ∼15-fold improvements. As a result, iDES-enhanced CAPP-Seq facilitates noninvasive variant detection across hundreds of kilobases. Applied to non-small cell lung cancer (NSCLC) patients, our method enabled biopsy-free profiling of EGFR kinase domain mutations with 92% sensitivity and >99.99% specificity at the variant level, and with 90% sensitivity and 96% specificity at the patient level. In addition, our approach allowed monitoring of NSCLC ctDNA down to 4 in 10^5 cfDNA molecules. We anticipate that iDES will aid the noninvasive genotyping and detection of ctDNA in research and clinical settings.

(a) Impact of alternative error suppression methods on nucleotide substitution classes. Error rates were calculated with respect to each of the four reference bases separately (Methods). (b) Distribution of background alleles uniquely eliminated by barcoding or polishing alone in healthy control cfDNA. (c) Comparison of iDES with various barcoding strategies for selector-wide error profiles and recovered hGEs. The barcoding strategy denoted by ‘2*’ maximizes the retention of sequenced molecules and is the approach used in this work (Methods). Data are presented as means ± s.e.m. (d) Analytical modeling of detection limits for various error suppression methods, as a function of available tumor-derived mutations (90% confidence detection limit; Methods). Sequencing throughput was calibrated to iDES, such that the quantity of reads needed to recover 5,000 hGEs was determined and then used to estimate the number of recovered hGEs for all other methods given their reported efficiencies (). The theoretical maximum detection limit of a given method, shown as a horizontal line, is bound by the method’s error rate. For additional details, see . The same 12 normal control samples shown in were used for the analyses in a–c.
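The dependence of the detection limit on the number of tumor-derived mutations and recovered hGEs (panel d) can be illustrated with a simple binomial sampling sketch. This is not the model described in Methods. It assumes that every reporter mutation is covered by the same number of independent molecules, that a single mutant molecule suffices for detection, and that the method's error rate acts as a hard floor. The function name `detection_limit` and its parameters are hypothetical.

```python
def detection_limit(n_mutations, hges, confidence=0.90, error_rate=0.0):
    """Smallest tumor fraction f detectable with the requested confidence.

    Illustrative assumptions (not taken from the paper's Methods):
      * each of n_mutations reporters is sampled by `hges` independent molecules
      * detection requires at least one mutant molecule across all reporters
      * the background error rate is a hard lower bound on the limit
    """
    if n_mutations <= 0 or hges <= 0:
        raise ValueError("n_mutations and hges must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")

    trials = n_mutations * hges
    # Solve 1 - (1 - f)^trials = confidence for f.
    sampling_limit = 1.0 - (1.0 - confidence) ** (1.0 / trials)
    # A method cannot resolve fractions below its own error rate.
    return max(sampling_limit, error_rate)


if __name__ == "__main__":
    for n in (1, 4, 30):
        print(n, detection_limit(n, hges=5000))
```

Under these assumptions, tracking more mutations lowers the sampling-limited detection threshold until the error-rate floor is reached, which is the qualitative behavior the horizontal lines in panel d describe.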

Noninvasive tumor genotyping with iDES-enhanced CAPP-Seq was assessed using technical controls (a–c) and patients with NSCLC (d–f). (a) A DNA reference blend containing known alleles spanning a broad AF range was diluted to 5% in normal cfDNA and analyzed in replicate (n=4) for both known variants (n=29) and 279 negative control variants (, Methods). Left: Differential impact of barcoding, polishing, and iDES on genotyping results for a single representative replicate. Only variant calls with at least 2 supporting reads are shown. Asterisks highlight the complementary background profiles removed by barcoding and polishing. Note that all variant calls are ordered along the x-axis, first by validation status and then by AF. Identical calls are aligned vertically. Right: Performance metrics across all four replicates. Genotyping thresholds were determined as described in Methods. (b) AFs determined by iDES-enhanced CAPP-Seq in the 5% variant blend from panel a (observed) versus their concentrations determined by digital PCR (expected). Only variants in the reference blend with externally validated AFs targeted by our NSCLC selector are shown (n=13; ). Data are expressed as means ± s.e.m. (n=4 replicates). (c) Heat map (top) and scatter plot (bottom) depicting candidate SNVs identified by noninvasive selector-wide genotyping of the 5% variant blend from panel a (, Methods). SNVs were tracked across three additional replicates and a tenfold lower spike. Horizontal lines depict mean AFs. (d–f) Noninvasive tumor genotyping of NSCLC patients. (d) Bottom: The number of hotspot SNVs noninvasively detected in 24 pretreatment NSCLC cfDNA samples by four methods, including iDES (barcoding + polishing). All queried variants are listed in . Top: Positive predictive value (PPV) of each method (indicated below), based on the number of hotspot SNVs that were later confirmed in matching tumor biopsies.
(e) The performance of iDES for noninvasive tumor genotyping of two plasma cohorts was assessed using observed allele fractions with a receiver operating characteristic (ROC) plot. In the first cohort (n=66 plasma samples from patients with matching tumor biopsies), hotspot variants from a predefined list of 292 variants were assessed (). Results are shown for the 46 plasma samples with at least one detectable mutation (‘All genes’, n=24 patients); specificity was assessed using variants that were detected but that could not be verified in the primary tumor. In the second cohort, EGFR hotspot variants were assessed in an extended cohort of 103 plasma samples from 41 EGFR-positive patients with NSCLC (‘EGFR’). Specificity was assessed using 27 EGFR-wildtype subjects (Methods). The pie chart shows the distribution of detected EGFR variants. Only patients with genotyped tumors were analyzed. AUC, area under the curve. (f) Noninvasive genotyping of EGFR mutations in plasma samples from 37 patients with advanced NSCLC and with biopsy-confirmed EGFR mutations. Top: Performance of iDES-enhanced CAPP-Seq for the genotyping of actionable EGFR mutations (n=36 patients; 1 of 37 patients did not have an actionable alteration). All performance metrics were assessed at the variant level. Bottom: Comparison of error-suppression methods for noninvasive tumor genotyping of the entire EGFR kinase domain in all patients with biopsy-confirmed EGFR SNVs (n=29 of 37 patients). Performance metrics were assessed separately at the variant level and patient level (using 27 EGFR-wildtype subjects). Percentages indicate iDES performance only. Further details are provided in Methods. Sn, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value.
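For readers less familiar with the abbreviations Sn, Sp, PPV, and NPV, the following minimal sketch shows how these four metrics follow from a confusion matrix. The function name `diagnostic_metrics` and the counts in the usage example are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, PPV and NPV from confusion-matrix counts.

    tp/fn: variants (or patients) truly positive that were called / missed
    tn/fp: truly negative that were correctly left uncalled / falsely called
    """
    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # Sn = TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),  # Sp = TN / (TN + FP)
        "ppv": ratio(tp, tp + fp),          # PPV = TP / (TP + FP)
        "npv": ratio(tn, tn + fn),          # NPV = TN / (TN + FN)
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(diagnostic_metrics(tp=9, fp=1, tn=19, fn=1))
```

The same function applies at either the variant level or the patient level; only the unit being counted changes.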

(a) Analysis of ctDNA detection limits using a hypermutated glioblastoma (GBM) tumor mixed into normal control cfDNA in defined proportions. Here, 30 mutations were randomly selected from a pool of 1,502 total mutations known to be present in the GBM tumor and covered by the sequencing panel. Random sampling of 30 mutations was repeated 50 times and the results are presented as means ± 95% confidence intervals. For further details, see and Methods. AF, allele fraction. (b) Comparison of error-suppression methods for the detection of ctDNA in pre- and post-treatment plasma from 30 NSCLC patients. Patient-derived somatic variants (columns; n=30 sets) were assessed in every plasma sample (rows; n=116), including 30 normal controls to evaluate specificity. The same samples were analyzed for each method (e.g., iDES) and are identically ordered in the heat map. Red squares denote a genetically matched sample (i.e., patient-derived tumor mutations were significantly detectable in a plasma sample from the same patient). Additional details are provided in . (c) Using iDES, but not other methods, ctDNA was detectable prior to clinical progression in a stage IIIB NSCLC patient. (d) Top: Analysis of variants called from tumor biopsies versus variants called directly from pretreatment cfDNA with iDES-enhanced CAPP-Seq. Estimated ctDNA levels were compared by linear regression. Open circles/squares indicate time points without significantly detectable ctDNA. ND, not detected. Time points are shown in chronological order (1, pretreatment; >1, post-treatment). Bottom: Comparison of error suppression methods for the same analysis shown above but across all eight evaluable patients (Methods). Linear regression was applied globally across all 37 plasma time points profiled for these eight patients.
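The repeated random sampling described in panel a (30 of 1,502 mutations, 50 repeats) can be sketched as follows. The summary statistic used here (mean allele fraction over each subset), the normal-approximation confidence interval, and the function name `resample_mean_af` are assumptions made for illustration; the input values in the usage example are synthetic and do not come from the study.

```python
import random
import statistics


def resample_mean_af(allele_fractions, subset_size=30, n_repeats=50, seed=0):
    """Repeatedly draw mutation subsets and summarize their mean AF.

    Returns (mean of subset means, (lower, upper)) where the interval is a
    95% confidence interval under a normal approximation (an assumption).
    """
    if subset_size > len(allele_fractions):
        raise ValueError("subset_size exceeds the number of mutations")
    if n_repeats < 2:
        raise ValueError("n_repeats must be at least 2")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    subset_means = [
        statistics.fmean(rng.sample(allele_fractions, subset_size))
        for _ in range(n_repeats)
    ]
    centre = statistics.fmean(subset_means)
    sem = statistics.stdev(subset_means) / n_repeats ** 0.5
    half_width = 1.96 * sem
    return centre, (centre - half_width, centre + half_width)


if __name__ == "__main__":
    # Synthetic allele fractions for 1,502 mutations, for illustration only.
    gen = random.Random(1)
    synthetic = [gen.uniform(0.0, 2e-4) for _ in range(1502)]
    print(resample_mean_af(synthetic))
```

Sampling without replacement within each repeat, as done here with `random.Random.sample`, mirrors selecting a distinct set of 30 reporters each time.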

Same as , but showing the impact of iDES on the probability of background errors. Post-iDES background data were derived from cfDNA samples pooled from a test cohort of 18 normal donors, none of whom were used for learning baseline background distributions. Further details are provided in Methods.