Abstract

To characterize somatic alterations in colorectal carcinoma (CRC), we conducted a genome-scale analysis of 276 samples, analyzing exome sequence, DNA copy number, promoter methylation, and mRNA and microRNA expression. A subset (97) underwent low-depth-of-coverage whole-genome sequencing. Sixteen percent of CRCs are hypermutated: three quarters of these have the expected high microsatellite instability (MSI), usually with hypermethylation and MLH1 silencing, whereas one quarter have somatic mismatch repair gene mutations. Excluding hypermutated cancers, colon and rectum cancers have remarkably similar patterns of genomic alteration. Twenty-four genes are significantly mutated. In addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9, and FAM123B/WTX. Recurrent copy number alterations include potentially drug-targetable amplifications of ERBB2 and a newly discovered amplification of IGF2. Recurrent chromosomal translocations include the fusion of NAV2 and WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive CRC and an important role for MYC-directed transcriptional activation and repression.

Associated Data Files

These data represent a data freeze from Feb 02, 2012. Please note that more recent data are available via the TCGA Data Portal.

Some archives listed for download below contain data for more samples than were included in the publication. Supplementary Table 1 should be used as the key for identifying the samples in those archives that belong to the publication.
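Using Supplementary Table 1 as a key amounts to keeping only the archive entries whose participant appears in the table. The following is a minimal sketch of that filtering step, not a TCGA tool. It relies on the TCGA barcode convention that the first three dash-delimited fields identify the participant. The function names and the example barcodes are made up for illustration. Supplementary Table 1 also lists aliquot IDs, so matching on aliquot would be stricter than the participant-level match shown here.

```python
def participant_id(barcode):
    """Return the participant portion of a TCGA barcode.

    TCGA barcodes are dash-delimited; the first three fields
    (project, tissue source site, participant) identify the participant.
    """
    return "-".join(barcode.split("-")[:3])


def filter_to_publication(barcodes, published_participants):
    """Keep only barcodes whose participant is in the key set.

    `published_participants` stands in for the participant column of
    Supplementary Table 1 (ASSUMPTION: exported to a plain list of IDs).
    """
    keep = set(published_participants)
    return [b for b in barcodes if participant_id(b) in keep]


# Hypothetical barcodes, for illustration only.
archive_barcodes = [
    "TCGA-XX-0001-01A-01R-0000-07",
    "TCGA-XX-0002-01A-01R-0000-07",
    "TCGA-YY-0003-01A-01R-0000-07",
]
key = ["TCGA-XX-0001", "TCGA-YY-0003"]
published = filter_to_publication(archive_barcodes, key)
```

Here `published` holds the first and third barcodes; the second is dropped because its participant is not in the key.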

Supplementary Data

Supplementary Table 1 [xls|html]: Following is a brief description of the contents in each worksheet:

Summary: for each participant, this table provides a summary of data types analyzed and the clinical data values used as input for analysis.

Mutations: a list of BAM files and their metadata that were used for identifying mutations.

microRNA: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of microRNA.

SNP6: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of copy number.

WGS: a list of low-pass whole genome sequencing (WGS) BAM files and their metadata that were used for identifying structural variation.

RNASeq: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of RNA Sequence gene expression.

Agilent: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of Agilent microarray gene expression.

Methylation: a list of participants, aliquot IDs, and the archives that contain the data used as input for analysis of methylation.

Affymetrix SNP 6 data: The Level 3 data were derived from Level 1 data, which are available at the DCC, using a method developed for this work. The standard Level 3 data now at the DCC were not used in this analysis.

SuperPathway.txt: The Superimposed Pathway used by the PARADIGM analysis. It contains all of the merged concepts and interactions pooled from the NCI-PID, Reactome, and BioCarta databases. Declarations of all of the concepts (genes, complexes, families, processes) appear at the top of the file. Beneath these declarations are all of the regulatory interactions: transcriptionally activating (-t>), transcriptionally inactivating (-t|), subunit to complex relations (-component>), post-transcriptionally activating (-a>), post-transcriptionally inactivating (-a|), activation of an abstract process (-ap>), inhibition of an abstract process (-ap|), and membership in a family (-member>).
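The two-part layout described above (declarations first, then interactions) can be read with a short parser. The sketch below is illustrative only. The interaction tokens come from the description above; the column layout is an assumption, namely that lines are tab-delimited, a declaration has two fields (concept type, concept name), and an interaction has three fields (source, target, token). The concept names GENE_A, GENE_B, and COMPLEX_AB are placeholders, and the real file should be inspected before relying on this layout.

```python
# Interaction tokens as listed in the file description, mapped to
# (level, effect). The tuple labels are shorthand chosen for this sketch.
INTERACTION_TYPES = {
    "-t>": ("transcriptional", "activates"),
    "-t|": ("transcriptional", "inhibits"),
    "-a>": ("post-transcriptional", "activates"),
    "-a|": ("post-transcriptional", "inhibits"),
    "-ap>": ("abstract process", "activates"),
    "-ap|": ("abstract process", "inhibits"),
    "-component>": ("complex", "subunit of"),
    "-member>": ("family", "member of"),
}


def parse_superpathway(lines):
    """Split pathway lines into concept declarations and interactions.

    ASSUMPTION: tab-delimited lines; two fields = declaration
    (concept type, name); three fields = interaction
    (source, target, token).
    """
    concepts = {}
    interactions = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) == 2:
            concept_type, name = fields
            concepts[name] = concept_type
        elif len(fields) == 3:
            source, target, token = fields
            if token not in INTERACTION_TYPES:
                raise ValueError(f"unknown interaction token: {token!r}")
            interactions.append((source, target, token))
        # Blank or differently shaped lines are skipped in this sketch.
    return concepts, interactions


# Placeholder content, not taken from the real file.
example = [
    "gene\tGENE_A\n",
    "gene\tGENE_B\n",
    "complex\tCOMPLEX_AB\n",
    "GENE_A\tGENE_B\t-t>\n",
    "GENE_A\tCOMPLEX_AB\t-component>\n",
]
concepts, interactions = parse_superpathway(example)
```

Rejecting unknown tokens, rather than skipping them silently, makes it obvious when the file uses an interaction type that the description does not list.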

tcgaCOADREAD_Expression.vNormal.MANUSCRIPT.tab: A PARADIGM-ready version of the expression data formatted as a tab-delimited file with the expression rank-ratios given as input to the PARADIGM algorithm.

tcgaCOADREAD_CNV.vNormal.MANUSCRIPT.tab: A PARADIGM-ready version of the copy number data. A tab-delimited file containing the copy number rank-ratios given as input to the PARADIGM algorithm.
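Both PARADIGM-ready files are tab-delimited, so they can be loaded with the standard library alone. The sketch below is a hedged example, not the loader PARADIGM itself uses. It assumes that the first row holds column labels, the first column holds row labels, and the remaining cells are numeric rank-ratios. Whether rows are genes or samples is not stated above, so the function treats them generically. The function name and the sample values are invented for illustration.

```python
import csv
import io


def read_matrix(handle):
    """Read a tab-delimited matrix into {row_label: {column_label: value}}.

    ASSUMPTION: the first row is a header and the first column holds row
    labels. Cells that cannot be parsed as numbers (e.g. "NA") become None.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, cells = row[0], row[1:]
        values = {}
        for column, cell in zip(columns, cells):
            try:
                values[column] = float(cell)
            except ValueError:
                values[column] = None
        matrix[label] = values
    return columns, matrix


# Invented content standing in for one of the .tab files.
sample_text = "id\tS1\tS2\nGENE_A\t0.25\t0.75\nGENE_B\tNA\t0.5\n"
columns, matrix = read_matrix(io.StringIO(sample_text))
```

With a real file, the same function can be called as `read_matrix(open(path, newline=""))`, where `path` points to one of the two `.tab` files.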

params.txt: The set of parameters needed to run PARADIGM. They determine the initial settings of the concept- and interaction-related constraints (probabilistic factors). These parameters were learned in previous rounds of training on other cancer cohorts and reused for this analysis.

config.txt: Contains the settings with which PARADIGM's inference engine was run for the CRC analysis. The file specifies that the belief propagation method for maximum likelihood inference should be used with a maximum of 10,000 iterations for convergence, and that the gene expression and copy number datasets to be used are the files listed above.

Cytoscape data (Supplementary Data File 2) [cys]: A network of the pathway concepts found by PARADIGM to be significantly modulated across the colonic and rectal tumor samples. The file contains the network as a Cytoscape session that has been tested on Cytoscape versions 2.6 and later. Nodes in the network correspond to concepts in the Superimposed Pathway and include genes (circles), complexes (hexagons), families (triangles), and cellular processes (boxes). Concepts are connected by regulatory interactions depicted as either activating (arrows) or inhibiting ("T"-bars) at the transcriptional level (solid lines) or post-transcriptional level (dashed lines). Subunit membership in complexes is depicted using undirected dashed lines. The network includes concepts with higher activation (red nodes) or inactivation (blue nodes) in tumors compared to normal. The size and opacity of each node are drawn as a function of its modulation score.