The Cancer Genome Atlas (TCGA) is an initiative supported by the National Cancer Institute and the National Human Genome Research Institute. Its immediate aim is to obtain a catalog of the molecular changes that occur in hundreds of tumors from the most common cancers in the population. One year ago, the Pan-Cancer analysis group was formed within the TCGA, with the goal of taking an integrative approach to analyzing the alterations surveyed across several cancer types. The rationale was that such an integrative analysis across tumor types would provide insights into the process of tumorigenesis that could not be obtained by analyzing each dataset separately. As explained in a previous post, we have had the opportunity to participate in this exciting project, whose marker paper was recently published in Nature Genetics. Specifically, our role was to analyze the mutations of the Pan-Cancer data set to identify genes that drive tumorigenesis in one or more of these cancer types. The results of this analysis have been published today in Nature Scientific Reports.

Illustration of the four signals of positive selection used to identify driver genes and the methods that implement them.

Briefly, we analyzed 3,205 samples corresponding to 12 cancer types, which were selected based on the number of samples available and the comprehensiveness of the current TCGA data. The genes driving the disease were detected by identifying signals of positive selection in the mutations observed across a cohort of tumor samples. To this end, we used OncodriveFM and OncodriveCLUST, two methods developed in our lab, and, in collaboration with other groups (Washington University, University of Toronto and Broad Institute), we incorporated the results of three additional methods to identify drivers. The datasets were analyzed in two ways. First, all samples were pooled together to increase the statistical power for detecting drivers acting across several tumor types. Second, we analyzed the data at the per-project level to avoid the potential dilution effect in the detection of drivers that are relevant only in a subset of the cancer types. These results were combined using an ad hoc approach aimed at balancing the pros and cons of each of the methods employed, and as a result we ended up with a list of 291 high-confidence drivers acting in one or more of the cancer types under study.

Some interesting observations can be made from this list of drivers. For instance, only two of them (TP53 and PIK3CA) are mutated in the Pan-Cancer data set at a frequency higher than 10%, whereas the remaining drivers are mutated at lower rates, some of them below 1% (lowly recurrent drivers). Most of the drivers tend to be mutated across several tumor types, although some are highly specific to a single cancer type. This is the case, for instance, for the Von Hippel–Lindau tumor suppressor (VHL) in kidney renal clear cell carcinoma and the Adenomatosis Polyposis Coli tumor suppressor (APC) in colorectal adenocarcinoma.
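As a rough illustration of how such frequencies can be tabulated, here is a minimal sketch that computes, for each gene, the fraction of samples carrying at least one mutation in it, and then bins the genes using the 10% and 1% thresholds mentioned above. The input layout (pairs of sample identifier and gene symbol), the function names and the toy data are assumptions made for this example; they are not the format of the Pan-Cancer files nor the code used in the study.

```python
from collections import defaultdict


def mutation_frequency(mutations, n_samples):
    """Return {gene: fraction of samples with at least one mutation in it}.

    `mutations` is an iterable of (sample_id, gene) pairs (hypothetical layout).
    A sample with several mutations in the same gene is counted only once.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples_by_gene[gene].add(sample_id)
    return {gene: len(samples) / n_samples
            for gene, samples in samples_by_gene.items()}


def frequency_class(freq):
    """Bin a frequency using the thresholds quoted in the text (10% and 1%)."""
    if freq > 0.10:
        return "above 10%"
    if freq < 0.01:
        return "lowly recurrent (below 1%)"
    return "between 1% and 10%"


# Toy cohort of 200 samples (invented data, for illustration only).
toy_mutations = (
    [(f"s{i}", "GENE_A") for i in range(30)]   # mutated in 30 of 200 samples
    + [("s0", "GENE_A")]                       # second hit in s0, not recounted
    + [("s5", "GENE_B")]                       # mutated in 1 of 200 samples
)
frequencies = mutation_frequency(toy_mutations, n_samples=200)
for gene, freq in sorted(frequencies.items()):
    print(gene, round(freq, 3), frequency_class(freq))
```

Counting samples rather than mutations keeps a heavily mutated sample from inflating the frequency of a gene.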

(A) Network representation of the high-confidence drivers (HCDs). Trimmed version of the functional interaction network formed by 124 HCDs that either map to the five broad biological modules enriched among HCDs or connect them. Genes annotated in the Cancer Gene Census (CGC) are represented as rounded squares, HCDs not in the CGC as circles, and non-HCDs used as linkers between HCDs as diamonds. Genes with a clear preference for bearing protein-affecting mutations (PAMs) in one tumor type are colored following the project code shown in the figure legend. Colored shadows encircle genes within the five enriched biological modules. (B) Frequency of PAMs observed in the HCDs of panel A across the samples of each cancer type, following the tumor type color code. The annotations below indicate the methods that identify signals of positive selection in each gene. Genes with a clear preference for bearing PAMs in one tumor type are indicated with a colored square below the histogram, using the tumor type color code.

Another question that can be addressed with our results is how many mutational drivers are involved in each of the cancer types. As expected, the lowest number of mutational drivers occurs in acute myeloid leukemia (a median of 2 mutated drivers per sample), which is the only non-solid tumor included in the present study. On the other hand, lung cancers (lung adenocarcinoma and lung squamous cell carcinoma) and bladder urothelial carcinoma (median of 9 drivers mutated per sample) have the highest number of drivers per sample. Whether other mechanisms participate in the tumorigenesis of each disease is beyond the scope of the present study, although we do point out a higher contribution of copy number alterations in certain tumors, such as breast invasive carcinoma and ovarian serous cystadenocarcinoma.
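The per-sample figures above are medians of a simple count: for each sample, the number of distinct driver genes that carry at least one mutation. The sketch below shows one way to compute that summary per cancer type. The data layout, the function name and the toy values are hypothetical and only serve to illustrate the calculation; this is not the code used in the study.

```python
from collections import defaultdict
from statistics import median


def median_drivers_per_sample(mutations, drivers, samples_by_type):
    """Return {tumor_type: median number of mutated driver genes per sample}.

    mutations       -- iterable of (sample_id, gene) pairs (hypothetical layout)
    drivers         -- set of gene symbols considered drivers
    samples_by_type -- {tumor_type: list of sample ids in that cohort}
    Samples without any mutated driver contribute a count of zero.
    """
    mutated_drivers = defaultdict(set)
    for sample_id, gene in mutations:
        if gene in drivers:
            mutated_drivers[sample_id].add(gene)
    return {
        tumor_type: median(len(mutated_drivers[s]) for s in samples)
        for tumor_type, samples in samples_by_type.items()
    }


# Toy example with invented samples and genes.
toy_drivers = {"GENE_A", "GENE_B"}
toy_mutations = [
    ("s1", "GENE_A"), ("s1", "GENE_B"),   # s1: two mutated drivers
    ("s2", "GENE_A"), ("s2", "GENE_C"),   # s2: one driver, one non-driver
]                                          # s3: no mutated driver
toy_cohorts = {"TYPE_X": ["s1", "s2", "s3"]}
print(median_drivers_per_sample(toy_mutations, toy_drivers, toy_cohorts))
```

Passing the full sample list for each cohort matters here: samples with no mutated driver would otherwise be dropped silently, which would bias the median upwards.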

Many of the putative drivers retrieved in our study map to biological processes that are known cancer hallmarks, including emerging ones such as mRNA processing and chromatin remodeling-related pathways. Most of these drivers are, as expected, already well known to cause cancer (e.g. about 45% of them are genes included in the Cancer Gene Census), but in some cases we have extended their role to tumors in which they had not been described before. We have also identified novel candidate drivers of tumorigenesis, mostly lowly recurrent drivers that can be important in smaller groups of patients and that help to complete the landscape of mutational drivers in cancer. As TCGA projects continue to collect and analyze more samples, a similar approach should be applied to identify rare variants or variants that appear in rarer cancer types.

The list of drivers identified in our study can be downloaded from the Synapse repository system. All the Pan-Cancer data have been prepared for exploration with Gitools, and we have uploaded the results of the analysis to the IntOGen platform. I hope that these efforts will help to develop novel strategies aimed at improving patient care.