Abstract

Large-scale sequencing of cancer genomes has uncovered thousands of DNA alterations, but the functional relevance of the majority of these mutations to tumorigenesis is unknown. We have developed a computational method, called Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM), to identify and prioritize those missense mutations most likely to generate functional changes that enhance tumor cell proliferation. The method has high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations (area under receiver operating characteristic curve, >0.91; area under Precision-Recall curve, >0.79). CHASM substantially outperformed previously described missense mutation function prediction methods at discriminating known oncogenic mutations in P53 and the tyrosine kinase epidermal growth factor receptor. We applied the method to 607 missense mutations found in a recent glioblastoma multiforme sequencing study. Based on a model that assumed the glioblastoma multiforme mutations are a mixture of drivers and passengers, we estimate that 8% of these mutations are drivers, causally contributing to tumorigenesis. [Cancer Res 2009;69(16):OF6660–8]

cancer drivers

CHASM

missense mutations

random forest

somatic mutations

Introduction

Today we face a bottleneck between large-scale acquisition of genomic information discovered through medical resequencing projects and the application of this information to improved understanding of human disease. Projects to systematically resequence tumor genomes have discovered thousands of genes that were not previously linked to tumorigenesis but are somatically mutated in a relatively small fraction of tumors and may be important for tumor initiation or progression (
1–
6). Many of these somatic changes are likely to be “passengers” (
1) that have no functional effects but were already present in the cell that gave rise to the tumor or were acquired during subsequent tumor growth. Only a small fraction of the genetic alterations in a tumor are expected to drive tumor evolution by giving cells a selective advantage over their neighbors.

Determining which mutations are drivers and which are passengers is one of the most pressing challenges in cancer genetics. Although genes that are mutated very frequently (“mountains”) can be confidently classified as driver genes, most genes discovered thus far are mutated in a relatively small fraction of tumors (“hills”). The examination of large numbers of tumors can provide helpful information for classification of drivers versus passengers, but the ability of sequencing alone to provide definitive results is limited by the marked variation in mutation frequency among individual tumors and individual genes. Moreover, it has been clearly shown that genes that are mutated in only a small fraction (<1%) of tumors can still act as drivers (
6). Thus, methods that can classify mutations as either drivers or passengers on the basis of data that is independent of mutation frequency are clearly needed. Such methods include functional studies in model organisms or in cultured cells, using gene knockout, siRNA, or overexpression approaches. These methods are extraordinarily useful for elucidating the function of individual mutated genes but are not well suited to the analysis of the hundreds of gene candidates that arise from every large scale cancer genome project.

Here, we describe a novel high-throughput computational prediction method to identify the mutations most likely to be drivers. We chose to focus on missense mutations as they account for the majority of somatic mutations found in the exons of tumor-derived DNA (
6), and because their functional significance is more difficult to infer than that of nonsense or frameshift mutations.

Previous work in this area has resulted in several innovative ways to characterize the differences between driver and passenger missense mutations. Driver mutations may have characteristics similar to those causing Mendelian disease when inherited in the germ line (
7) and may be identifiable by constraints on tolerated amino acid residues at the mutated positions (
3,
7–
9). In contrast, passenger mutations may have characteristics more similar to those of nonsynonymous single nucleotide polymorphisms (nsSNP) with high minor allele frequencies (MAF; refs.
3,
7). Based on these similarities, supervised machine learning methods have been used to predict which missense mutations are drivers (
3,
7). The CAN-Predict method trains a Random Forest (
10) to discriminate between mutations from the COSMIC cancer somatic mutation database (
11) and nsSNPs with high MAFs (
3). A method specific to protein kinases (
7) trains a support vector machine (SVM; ref.
12) to discriminate between known disease kinase nsSNPs and common kinase nsSNPs. Although not specifically designed for this problem, bioinformatics methods, such as PolyPhen and SIFT (
9,
13) have also been applied to identify pathogenic, tumor-derived mutations in genes of interest (
6). These methods attempt to discriminate driver from passenger mutations by considering properties such as evolutionary conservation, compatibility of the mutant amino acid residue with the wild-type or with equivalently positioned residues in homologous proteins, the predicted protein local environment (
7), and enrichment of the protein structural domain in which mutations occur with respect to biological processes thought to be critical for cancer (
3).

We hypothesized that although existing computational methods could detect differences between somatic missense mutations observed in cancers and high MAF nsSNPs in the germline, these differences might be less relevant to the discrimination between driver and passenger mutations that occur somatically in tumors. Although high MAF nsSNPs and passenger mutations have properties in common, they also have differences. Passenger mutations may or may not have a functional impact on proteins; by definition, they are neutral with respect to cancer cell fitness. In contrast, high MAF nsSNPs have become fixed in the human genome and must be functionally neutral or have a mild functional impact with respect to normal cell fitness. We reasoned that we could train a classifier with improved specificity by representing passenger missense mutations not by high MAF nsSNPs, as done previously, but rather by in silico simulations using mutation profiles that reflected tumor type as well as mutation context.

Materials and Methods

Feature selection. We used a Random Forest classifier (
10,
14) that was trained on 49 predictive features (Supplementary Table S1). Feature selection was done with a protocol based on mutual information (Supplementary Materials and Methods: Feature Selection and Information Theory; Supplementary Fig. S1). Mutual information is a generalized version of correlation that does not make assumptions about linear relationships between two variables of interest (
15). Features with missing values were estimated with a k-nearest neighbors algorithm (Supplementary Materials and Methods: Missing Values).

Classifier training. The Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM) method is based on a Random Forest classifier (
10,
14) trained to discriminate between driver missense mutations and synthetically generated passenger missense mutations. The classifier is implemented using PARF,
5 a Fortran 95 adaptation of Leo Breiman's original Random Forest software.
6 Before training, all features were standardized with the Z score method using the scale command in R statistical software (
16). To avoid overfitting, we divided our known driver mutations and synthetic passenger mutations into two partitions, one for feature selection and one for classifier training.

This Random Forest is an ensemble of “decision trees,” specifically classification and regression trees (
17), each of which uses a hierarchical set of rules to decide whether a mutation is a driver or a passenger. The rules are based on our input features and the final score yielded for each mutation is the fraction of trees that voted for the passenger class. We used a forest with 500 trees, and default parameters (mtry = 7). The Random Forest algorithm is robust to class label contamination and performs well with high dimensional data sets (
10,
14).

Classifier assessment. We assessed Random Forest classifier performance by two threshold-independent measures— receiver operating characteristic (ROC) and Precision-recall (PR) curves (Supplementary Materials and Methods: ROC and Precision-recall Curves and Minimum Error Point). We considered both the training set out-of-bag error (
10) and the error on two held-out validation sets of known oncogenic mutations in P53 and epidermal growth factor receptor (EGFR). The out-of-bag error estimate is produced while the Random Forest is being trained and is a viable replacement for error estimates by cross-validation (
18). We compared the Random Forest with a SVM classifier (assessed with 5-fold cross-validation; Supplementary Materials and Methods: Support Vector Machine; ref.
12) and with the performance of several state-of-the-art missense mutation function prediction methods.

Probabilistic interpretation of random forest classification scores in tumor-derived glioblastoma multiforme mutations. We used the trained Random Forest to compute a classification score for each of 607 glioblastoma multiforme (GBM) missense mutations reported by Parsons and colleagues (
4). However, these scores are not probabilities and the statistical behavior of the algorithm has not been well-characterized (
10). Therefore, it is not evident where to set a trusted score cutoff for purposes of identifying driver mutations. To do this, we first interpret the scores in the framework of statistical hypothesis testing. For each of the 607 GBM mutants, we test the null hypothesis: the mutant is not functionally related to the growth of the tumor (passenger), versus the alternative hypothesis that it is (driver). We obtain a P value for a mutation by comparing its score to the null distribution, which consists of the scores of a filtered set of synthetic passengers that were held out from Random Forest training (Supplementary Materials and Methods: Filtering of Synthetically Generated Passenger Mutations), using the Benjamini-Hochberg algorithm to correct for multiple testing (Supplementary Materials and Methods: Controlling the False Discovery Rate; ref.
19).

GBM mutations. We assessed 607 GBM mutations from 21 patient samples (
4). Five of the mutations described by Parsons and colleagues (
4) were dropped because they occurred in gene transcripts that are no longer supported by the RefSeq database (
20). Three mutations were dropped because they were found in gene transcripts that were larger than 14,000 codons. For gene transcripts of this size, we were unable to generate protein multiple sequence alignments because of their high computational expense. Finally, one of the GBM tumor samples was from a patient with a hypermutator phenotype who had been treated with radiation and temozolomide. Because this sample had 17 times as many alterations as the other GBM samples and a radically different mutation spectrum (
4), these mutations were excluded from our analysis.

Estimation of fraction of drivers in GBM. We assumed that the GBM mutations are a mixture of drivers and passengers and wanted to estimate the proportion of drivers in the mixture. The probability distribution of the GBM CHASM scores should then be similar to the CHASM score distribution of a mixture of known driver and synthetic passenger mutations (
21). We numerically find the mixing proportion, which minimizes the distance between these two score distributions (Supplementary Materials and Methods: Estimating the Fraction of Drivers).

Comparison with other methods. For comparison purposes, we assessed the performance of several published methods that were possibly useful for driver mutation prediction both on our training set and the two held-out validation sets of P53 and EGFR mutations. The tested methods were as follows: PolyPhen (
22), SIFT (
9), CanPredict (
3), and KinaseSVM (
7). We also assessed a consensus prediction, based on agreement between SIFT and PolyPhen (Supplementary Materials and Methods: Comparison with Other Missense Mutation Function Prediction Methods).

Wherever possible, we assessed the performance of these methods using a numerical score, rather than a categorical prediction, so that we could construct threshold-independent ROC and PR curves. We computed precision and recall statistics (
Eq 4) when only categorical predictions were available (CanPredict and the PolyPhen/SIFT consensus).where TP is the number of drivers correctly classified, FP is the number of synthetic passengers misclassified, and FN is the number of drivers misclassified. We compared the performance of these methods to CHASM's performance on its own training set, based on out-of-bag scores, and also to CHASM's performance when all P53 and EGFR mutations were held out of its training and feature selection sets. We also compared Random Forest performance with performance of a SVM (
12), another state-of-the-art machine learning classifier, using the same training sets and predictive features. The SVM was trained using the e1071 package in R statistical software and assessed using 5-fold cross-validation and constructing ROC and PR curves.

Results

Feature selection. To develop a new classifier, we first evaluated a large number of candidate predictive features and found that >50 features contained at least some information that seemed to be useful for discriminating between driver and passenger mutations. In particular, using a method that estimates mutual information between a predictive feature and class labels, we found that the majority of the candidate predictive features were weakly informative (Supplementary Table 3; ref.
23). In our training set (described in Materials and Methods), we calculated that a feature capable of correctly classifying a mutation as a passenger or driver would require 2.05 bits of information (Supplementary Materials and Methods: Information Theory). As our top-ranked feature had only 0.06 bits of information, we compensated by using 49 features (Supplementary Table S3; Supplementary Fig. S3). This is a much larger number of features than used in previous studies (
3,
7). The sum of the information in each individual feature was 0.37 bits. However, the Random Forest works with all features jointly, which may yield much higher information content than the simple sum.

Some of our top-ranked features have not, to our knowledge, been used previously for missense mutant function prediction. These features include the average nucleotide-level conservation of the exon in which a mutation occurs in 17-way vertebrate Multiz alignments (
24), estimated by PhastCons (
25); SNP density (the number of SNPs in the exon where the mutation occurs, normalized by exon length); and frequency of missense change type in the COSMIC database of somatic variation in cancer (
11).

Datasets used for training. As noted in the Introduction, the choice of training sets is critically important to the performance of any classifier. As drivers, we selected 2,488 missense mutations previously identified as playing a functional role in cancer, culled from the COSMIC database and recent large-scale resequencing studies (see Materials and Methods). The passenger data set was derived by a two-step process. First, we selected genes that were mutated at least once in four large-scale sequencing studies of colorectal, breast, brain, or pancreatic tumors (
2,
4–
6). Second, we generated synthetic passenger missense mutations in these genes in silico, using an algorithm that recapitulated the type of base substitutions found in brain tumors (mutation context). Note that we purposefully chose genes that were mutated as the substrate for the in silico generation of synthetic mutations. This increased the likelihood that the new classifier would detect mutations that were extraordinary rather than detect genes that were extraordinary (e.g., had very different codon compositions than the average). Our classifier would thus be able to detect differences between driver and passenger mutations even if the mutations were in the same gene.

Past classifiers have often used high MAF nsSNPs as the passenger data set rather than the synthetic passenger data set described above. To determine whether there were major differences between our new data set and high MAF nsSNPs, we compared them using principal component analysis applied to the top-ranked 21 predictive features (Supplementary Table S1). As shown in
Fig. 1
, a randomly selected set of 4,395 high MAF ns SNPs from the HapMap project were distributed differently than a set of 4,500 synthetic passengers. Interestingly, the synthetic passengers formed two distinct clusters in this analysis, along the dimension of principal component four, which is dominated by feature 72. The feature is a binary descriptor of regions in proteins that are functionally interesting, as annotated in the UniProtKB database (
26). It seems that although a subset of the synthetically generated passenger mutations were located in annotated regions of functional interest, the MAF nsSNPs tended not to be located in these regions. This result is consistent with evolutionary selective pressure on MAF nsSNPs for functional neutrality. Other features with large magnitude coefficients in these principal components analysis components included predicted amino acid residue propensities for secondary structure, solvent accessibility, backbone flexibility, and additional protein-based functional annotations from UniProtKB.

Principal components analysis of nsSNPs versus synthetic passenger mutations. Synthetic passenger mutations (red) and high MAF nsSNPs from the HapMap project (blue) have substantial overlap in the space defined by principal components one, three, and four, but there are regions in the space occupied only by high MAF nsSNPs and regions occupied only by synthetic passengers.

Classifier construction. We then attempted to use these features and data sets to design a new classifier using two state-of-the-art machine learning methods, SVMs, and Random Forests. Although both methods were able to define good classifiers, the Random Forest proved superior (Supplementary Fig. S4) and was used for the remainder of the analyses. Details of the construction of the Random Forest–based classifier, henceforth termed CHASM, are described in Materials and Methods.

To test the performance of CHASM, we first assessed it with respect to its out-of-bag classification error on the training sets (equivalent to a cross-validation test (
10)). For this purpose, ROC and PR curves were used, as these metrics consider classification errors at all possible score thresholds. Using area under the curve (AUC) as a performance summary statistic, where 1.0 indicates perfect classification, CHASM yielded AUCs of 0.91 and 0.79 for ROC and PR, respectively (
Fig. 2
).

ROC and PR curves calculated for (A) CHASM, (B) PolyPhen PSIC, and (C) SIFT on the training set mutations. CHASM training out-of-bag scores were used to generate the ROC and PR curves in A. A color version is available as Supplementary Fig. S6.

This performance was then compared with that of other methods, including PolyPhen's PSIC score, SIFT, CanPredict, KinaseSVM (Supplementary Fig. S5), and a SIFT-PolyPhen consensus. The fraction of mutations that could be evaluated by these alternative methods (coverage) was considerably lower than that of CHASM (Supplementary Materials and Methods; Supplementary Table S4). Moreover, even the best performing of the alternative methods was inferior to CHASM in specificity, sensitivity, and precision (Supplementary Table S4). These differences translated to much lower AUCs for ROC and PR (
Fig. 2).

As another test of the performance of CHASM, P53 or EGFR mutations were held out of the mutation data set used for training, and then these known driver mutations were assigned scores by CHASM and the other algorithms. To evaluate both the sensitivity and specificity of each method, we also held out 590 synthetic passenger mutations. If we consider the fraction of misclassified mutations at the minimum error point, the CHASM classifier had high sensitivity and specificity for both the P53 and EGFR test sets (Supplementary Table S4). The performance of CHASM was considerably better, both in terms of sensitivity and specificity, than previously described classifiers (Supplementary Table S4). These differences are graphically illustrated in the AUCs presented in
Figs. 3
and
4
(Further detail is provided in Supplementary Table S5.).

ROC and PR curves calculated for (A) CHASM, (B) PolyPhen PSIC, and (C) SIFT on P53 and synthetic passenger mutations held out of the CHASM training set. A color version is available as Supplementary Fig. S7.

ROC and PR curves calculated for (A) CHASM, (B) PolyPhen PSIC, and (C) SIFT on EGFR and synthetic passenger mutations held out of the CHASM training set. A color version is available as Supplementary Fig. S8.

For a practical estimate of the CHASM performance, we calculated P values for each of the held out P53 and EGFR mutations, then controlled the false discovery rate (FDR) to 0.2 using the Benjamini-Hochberg procedure. We found that 195 of the 196 experimentally observed P53 mutations and 131 of the 133 experimentally observed EGFR mutations were predicted to be drivers by CHASM. In comparison, a maximum of 188 of the 196 experimentally observed P53 mutations and 101 of the 133 experimentally observed EGFR mutations were predicted to be drivers by PolyPhen or SIFT.

Analyses of GBM. The CHASM Random Forest classifier was then used to score 607 missense mutations in glioblastoma multiforme (GBM) described by Parsons and colleagues (
4). The driver data set used to train the Random Forest was the same as that described above except that all of the missense mutations actually observed in GBMs were excluded. The raw CHASM scores of the mutations, representing the fraction of trees in the forest that voted for classifying the mutation as passenger, ranged from 0 to 1 (
Fig. 5
). For each of these missense mutants, we tested the null hypothesis that the mutant was a passenger. A P value was calculated for each mutant by comparing its CHASM score to the score distribution of a filtered set of synthetic passengers (see Materials and Methods for details). The Benjamini-Hochberg procedure was used to control the FDR at the desired level of 0.2 (
19).

Histograms of CHASM scores for driver mutations and passenger mutations held out from the training set, and 607 mutations experimentally identified in GBM. Estimated kernel density for each set of scores (solid line) and fitted mixture of the driver and passenger score densities (dashed line) are shown superimposed on the histograms.

At this FDR level, CHASM classified 24 of the 607 GBM mutations as drivers (
Table 1
). Importantly, CHASM successfully identified 11 mutations that were likely to be drivers based on previous experimental data. These 11 mutations included nine in P53 or PTEN, well-known tumor suppressor genes, one in PIK3CA, a well-known oncogene and one in IDH1, a gene recently discovered to be altered in many brain tumors (
27). In addition to these, 11 CHASM identified 13 others that otherwise would not have been suspected of playing a major role in GBM tumorigenesis (
Table 1). Intriguingly, these mutations included those in genes that are likely to be involved in critical signaling pathways, such as the protein kinases STK39 and RIPK4, the protein phosphatase PTPRM, and the insulin-signaling mediator PHIP.

Driver mutations predicted by CHASM at FDR of 0.2, shown with their associated Random Forest scores and P values

Finally, to estimate the proportion of driver missense mutations in the GBM mutation set, we minimized the difference between the distributions of the CHASM scores of the GBM mutations and the CHASM scores of a mixture of known driver and synthetic passenger mutations (see Materials and Methods for details). We thereby estimated that 49 of the 607 missense mutations identified in GBM, or 8%, were drivers.

Discussion

Computational methods to predict the impact of mutations discovered in tumor resequencing are still under development. Although initial work focused on identification of driver genes rather than driver mutations (
1,
5), it has recently been suggested that the occurrence of some missense mutations in oncogenes or tumor suppressor genes are actually passengers (
7), motivating the need for a higher resolution approach that identifies individual mutations as drivers. In light of the large number of mutations that are being discovered in current large-scale cancer gene sequencing efforts, and the impossibility of assessing this large number through experimental functional studies, bioinformatic approaches to classify and prioritize mutations for further analysis are essential for progress.

Confronted with this problem, some researchers have tried to apply methods that were developed to predict the impact of germline missense variants. We found that these methods have good sensitivity in recognizing recurrent driver missense mutations in P53 and EGFR, but poor specificity (Supplementary Table S4;
Figs. 3 and
4). This result implies that there may be differences between the distinguishing characteristics of neutral mutations in the cancer genome versus the germline genome. Application of methods developed for the latter problem to the former problem yielded less than optimal results. In contrast, the CHASM classifier, specifically developed to detect somatic rather than germline driver mutations, had substantially improved sensitivity, specificity, and precision over previously described methods.

Overall, our results highlight the importance of “null model” selection in designing a predictive algorithm to identify driver mutations in cancer resequencing data. Within the context of a prediction method, the null model incorporates assumptions about what driver missense mutations do not look like. It is used explicitly in supervised learning methods such as CAN-predict, Kinase SVM, and our previous version of CHASM (
2,
4). It is also used implicitly in methods such as SIFT and PolyPhen because their utility has been assessed with a validation or benchmark set as a false-positive control. SIFT has used experimental results of functional assays in bacterial and viral proteins as a control; PolyPhen has used species divergence data from amino acid substitutions found in equivalent positions in alignments of protein orthologs. We suggest that these null models of functional neutrality do not optimally represent the passenger missense mutations found in tumors.

Although existing methods for missense mutant function prediction in cancer have provided tools to prioritize candidate driver mutations, we have developed a quantitative approach to identify candidate drivers by controlling the FDR. To our knowledge, this is the first application of FDR to the classification of missense mutations, providing a statistically meaningful threshold for discovery.

We estimate that the proportion of drivers among all GBM missense mutations in our data set is ∼8%, with 5.4% occurring outside of known gene mountains. Note that the actual number of drivers in the mutation data set of Parsons and colleagues (
4) is likely to be higher, as CHASM only considers missense mutations. Many of the tumor suppressor gene alterations that drive tumorigenesis are nonsense mutations, frameshifts, or large deletions.

Our method is high-throughput and can be easily adapted to any tumor type of interest, given a sufficient sample size to compute context-based DNA mutation rates. It also represents an advance over previous classifiers in that most mutations can be scored (coverage; Supplementary Table S4). Because the method focuses on properties of individual mutations, rather than the frequency at which mutations appear in a gene, it can potentially detect driver mutations that are present at low frequencies. These mutations may disregulate pathways that are potential new drug targets. A recent example is the isocitrate dehydrogenase (IDH1) R132 mutation, discovered in GBM resequencing (
4). In the initial screen by Parsons and colleagues (
4), this mutation was originally found in only a small proportion of GBMs, so its role as a driver was questionable. CHASM, however, shows that the mutation has a high likelihood of being a driver when present in a tumor. Subsequent studies revealed that the mutation was present in a high fraction of an uncommon GBM subtype as well as other brain tumor types (
4,
27–
30). Functional studies suggest that mutant IDH1 dominantly inhibits production of α-ketogluterate, which is required by enzymes that degrade HIF-1α, thus hyperactivating the HIF-1 pathway and promoting tumor angiogenesis. Drugs designed to be α-ketogluterate mimics might thus be useful for GBM patients with the IDH1 mutation (
31). We hope CHASM will provide a useful tool to guide follow-up experiments based on the results of the many cancer genome projects now being performed or planned.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

We would like to thank Giovanni Parmigiani and Mark Diekhans for valuable discussions and Ivan Adzubey, Josh Kaminker, and Ali Torkamani for help scoring the mutations with PolyPhen, CanPredict, and KinaseSVM.