Abstract

Although a few cancer genes are mutated in a high proportion of tumours of a given type (>20%), most are mutated at intermediate frequencies (2-20%). To explore the feasibility of creating a comprehensive catalogue of cancer genes, we analysed somatic point mutations in exome sequences from 4,742 human cancers and their matched normal-tissue samples across 21 cancer types. We found that large-scale genomic analysis can identify nearly all known cancer genes in these tumour types. Our analysis also identified 33 genes that were not previously known to be significantly mutated in cancer, including genes related to proliferation, apoptosis, genome stability, chromatin regulation, immune evasion, RNA processing and protein homeostasis. Down-sampling analysis indicates that larger sample sizes will reveal many more genes mutated at clinically important frequencies. We estimate that near-saturation may be achieved with 600-5,000 samples per tumour type, depending on background mutation frequency. The results may help to guide the next stage of cancer genomics.

Mutation patterns for one known and two novel cancer genes. EGFR shows distinctive tumor-type-specific concentrations of mutations in different regions of the gene. RHEB, which encodes a small GTPase in the RAS superfamily, shows a mutational hotspot in the effector domain. RHOA, another member of the RAS superfamily, also shows a mutational hotspot in the effector domain. Colored bars after tumor type names are copy-ratio distributions for the gene, when available (red=amplified, blue=deleted). See also . Similar diagrams for all genes are available at http://www.tumorportal.org.

Cancer genes in selected tumor types. Genes are arranged on the horizontal line according to p-value (combined value for the three tests in MutSig). Yellow region contains genes that achieve FDR q≤0.1. Orange interval contains p-values for the next 20 genes. Gene name color indicates whether the gene is a known cancer gene (blue), a novel gene with clear connection to cancer (red; discussed in text), or an additional novel gene (black). Circle color indicates the frequency (percent of patients carrying non-silent somatic mutations) in that tumor type. See also .
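The FDR threshold in this caption can be illustrated with a short sketch. It assumes the q-values are obtained with the Benjamini-Hochberg procedure, which the caption does not state, and it uses made-up gene names and p-values rather than any output of MutSig.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical per-gene p-values (illustration only, not data from the study).
example_p = {"GENE_A": 0.0001, "GENE_B": 0.004, "GENE_C": 0.03,
             "GENE_D": 0.2, "GENE_E": 0.7}
names = list(example_p)
q = benjamini_hochberg([example_p[name] for name in names])

# Genes that would fall in the "yellow region" (q <= 0.1) under this assumption.
significant = [name for name, qv in zip(names, q) if qv <= 0.1]
```

In a real analysis the number of tests would be the number of genes evaluated, so the same p-value yields a much larger q-value than in this five-gene toy example.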

Cancer genes identified in 4742-tumor dataset. X-axis indicates the q-value (FDR) in the most significant of the 21 tumor types. Y-axis indicates the q-value when the 4742 tumors are analyzed as a combined cohort. Genes in the upper left quadrant reached significance only in the combined analysis. Genes in the lower right quadrant reached significance only in one or more single-type analyses. Genes in the upper right quadrant were significant in both the combined set and in individual tumor types. Color of gene names is as in .

Down-sampling analysis shows that gene discovery is continuing as samples and tumor types are added. a. Analysis within tumor types. Each point represents a random subset of patients. Blue line is a smoothed fit. b. Analysis by adding tumor types. Each grey line represents a random ordering of the 21 tumor types. c. Analysis by adding samples. Each point is a random subset of the 4742 patients. d. Analysis in panel c broken down by mutation frequency. Genes mutated at frequencies ≥ 20% are nearing saturation, while intermediate frequencies show steep growth. See also .
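The down-sampling idea in panels a and c can be sketched as follows. This is a toy illustration, not the procedure used in the study: the cohort is simulated, the gene frequencies are hypothetical, and a simple binomial test against an assumed uniform background stands in for the actual significance analysis.

```python
import math
import random


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); terms use log-gamma to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total


def count_detected(patients, n_genes, background, alpha):
    """Count genes mutated in more patients than the background explains."""
    n = len(patients)
    detected = 0
    for gene in range(n_genes):
        k = sum(1 for mutated in patients if gene in mutated)
        if k > 0 and binom_tail(k, n, background) <= alpha:
            detected += 1
    return detected


def down_sample(patients, sizes, n_genes, rng, background=0.01, alpha=1e-4):
    """For each subset size, draw one random subset and count detected genes."""
    return [(size, count_detected(rng.sample(patients, size), n_genes,
                                  background, alpha))
            for size in sizes]


# Hypothetical per-gene mutation frequencies: 2 high, 8 intermediate, 10 background.
TRUE_FREQ = [0.30] * 2 + [0.05] * 8 + [0.01] * 10
N_GENES = len(TRUE_FREQ)

rng = random.Random(0)
# Each simulated patient is the set of gene indices carrying a mutation.
cohort = [{g for g, f in enumerate(TRUE_FREQ) if rng.random() < f}
          for _ in range(400)]
curve = down_sample(cohort, [50, 100, 200, 400], N_GENES, rng)
```

Plotting `curve` for many random subsets per size would give a scatter analogous to the points in panels a and c. In this toy setting the high-frequency genes are found at small sizes, while intermediate-frequency genes tend to appear only as the subset grows.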

Number of samples needed to detect significantly mutated genes, as a function of a tumor type’s median background mutation frequency (x-axis) and a cancer gene’s mutation rate above background (the various curves). Y-axis shows the number of samples needed to achieve 90% power for 90% of genes. Grey vertical lines indicate tumor type median background mutation frequencies. Black dots indicate sample sizes in the current study. For most tumor types, the current sample size is inadequate to reliably detect genes mutated at 5% or less above background. See also .
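The relationship in this figure can be approximated with a simple binomial power calculation. The sketch below is an assumption-laden illustration rather than the model used in the study. It treats a single gene, takes `p0` as a hypothetical per-patient probability that the gene is mutated by background processes alone, uses an assumed genome-wide significance level `alpha`, and ignores the "90% of genes" requirement, which depends on gene length and other covariates.

```python
import math


def log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def power_at(n, p0, p1, alpha):
    """Power of a one-sided exact binomial test of background rate p0 when
    the true per-patient mutation probability is p1."""
    tail0 = tail1 = power = 0.0
    # Lower the critical count k while P(X >= k | p0) stays within alpha.
    for k in range(n, -1, -1):
        tail0 += math.exp(log_pmf(k, n, p0))
        if tail0 > alpha:
            break
        tail1 += math.exp(log_pmf(k, n, p1))
        power = tail1
    return power


def samples_needed(p0, effect, alpha=5e-6, target_power=0.9,
                   step=10, max_n=10000):
    """Smallest n on a grid of `step` reaching the target power, else None."""
    for n in range(step, max_n + 1, step):
        if power_at(n, p0, p0 + effect, alpha) >= target_power:
            return n
    return None


# Hypothetical inputs: a gene mutated in 5% of patients above background,
# in a low-background and a higher-background tumor type.
n_low = samples_needed(p0=0.005, effect=0.05)
n_high = samples_needed(p0=0.02, effect=0.05)
```

Under these assumptions the required sample size rises with the background probability for a fixed effect, which is the qualitative trend the curves show. The absolute numbers depend on the assumed `alpha` and on how the background mutation frequency is converted into `p0`, so they should not be compared directly with the figure.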