Understanding the cellular signal transduction pathways that drive cells to become cancerous is fundamental to developing personalized therapies that reduce cancer morbidity and mortality. The purpose of this study was to develop an unsupervised deep learning model for finding meaningful, lower-dimensional representations of cancer gene expression data. Ultimately, we hope to use these representations to reveal hierarchical relationships (cellular pathways) involved in cancer pathogenesis.

We downloaded 7,528 gene expression samples (each with 15,404 features) across 17 different cancer types from The Cancer Genome Atlas (TCGA) and developed a Python deep learning library, including an unsupervised implementation of a Stacked Restricted Boltzmann Machine Deep Autoencoder (SRBM-DA). Extensive model selection identified a promising hidden layer architecture for this dataset. Logistic regression on the final hidden layer representations (the layer with the fewest hidden units) predicted the pathological N-stage of the samples better than either a proportionally random classifier or a tissue-type-based classifier. Consensus clustering of the final hidden layer representations produced more robust clusters than clustering the high-dimensional input data directly. Consensus clustering of glioblastoma samples across all models identified six clusters with differing prognoses. Numerous genes, both novel and previously reported as glioblastoma subtype-specific, were significantly correlated with each glioblastoma subtype.
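
The two core steps described above, learning a low-dimensional representation with stacked RBMs and then consensus clustering that representation, can be illustrated with a minimal sketch. This is not the library developed in this study: it assumes scikit-learn's BernoulliRBM as a stand-in for each RBM layer, omits the deep autoencoder fine-tuning stage, assumes k-means on random 80% subsamples as the base clusterer, and runs on synthetic data scaled to [0, 1]. The function names, layer sizes, and hyperparameters are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import BernoulliRBM


def stacked_rbm_codes(X, layer_sizes, seed=0):
    """Greedy layer-wise RBM training; returns final hidden-layer activations."""
    codes = X
    for i, n_hidden in enumerate(layer_sizes):
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=10, random_state=seed + i)
        # Each layer is trained on the hidden-unit probabilities of the previous one.
        codes = rbm.fit_transform(codes)
    return codes


def consensus_matrix(Z, k, n_runs=20, frac=0.8, seed=0):
    """Fraction of subsampled k-means runs in which each sample pair co-clusters."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    together = np.zeros((n, n))  # times a pair landed in the same cluster
    sampled = np.zeros((n, n))   # times a pair was drawn in the same subsample
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        km = KMeans(n_clusters=k, n_init=10,
                    random_state=int(rng.integers(1 << 31)))
        labels = km.fit_predict(Z[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Pairs never drawn together keep a consensus value of 0.
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)


# Synthetic stand-in for expression data (assumption: features scaled to [0, 1]).
rng = np.random.default_rng(0)
X = rng.random((60, 50))
Z = stacked_rbm_codes(X, layer_sizes=[20, 5])  # hypothetical architecture
C = consensus_matrix(Z, k=3)                   # hypothetical number of clusters
```

Entries of C close to 0 or 1 indicate sample pairs whose cluster assignment is stable across subsamples. Comparing how sharply C separates for the codes Z versus the raw input X is one way to assess the robustness claim above.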

An SRBM-DA can be trained to represent meaningful abstractions of cancer gene expression data that provide novel insight into patient survival. Ultimately, deep learning and consensus clustering revealed a subclass of the proneural glioblastoma subtype that was enriched for G-CIMP samples and associated with improved prognosis.