ABSTRACT
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework: SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping. For testing and verification, we use high-quality gene expression profiling data of lung ADCA patients, and identify prognostic gene signatures which could cluster patients into subgroups that are significantly different in their overall survival (with p-values < 0.05). Our results are based only on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.

pone.0117135.g005: Independent verification testing of the 128 identified genes. Testing of the 128 identified genes on the ACC dataset (a) and the GSE5843 dataset (b). Kaplan-Meier plots of the sample clusters show statistically significant survival differences, with p-value = 0.0106 for the ACC dataset and p-value = 0.00672 for the GSE5843 dataset. For each verification test, the separation of the samples is obtained by running MBI with parameter k2 = 2 on the corresponding rows of the dataset (i.e., using only the part of the Y matrix of ACC or GSE5843 that corresponds to the 128 genes). (c) Testing of the 128 identified genes on Jacob stage1. The separation of the samples is obtained by running MBI on the corresponding rows of the dataset with parameter k2 = 2. The Kaplan-Meier plot of the sample clusters shows statistically significant survival differences, with p-value = 0.000817.

Mentions:
For independent verification, we used the ACC dataset. We mapped the 128 genes by gene symbol to the ACC dataset and then ran our clustering approach using only the mapped genes. From the verification, we got consistent clusters for the ACC dataset. Kaplan-Meier plots showed statistically significant differences in OS (p = 0.0106 by log-rank test) between the 2 subgroups of patients (Fig. 5A).
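
The log-rank comparison between two patient subgroups mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function name logrank_test, its arguments, and the toy survival data are assumptions introduced purely for illustration. The sketch assumes the cluster labels (0 or 1) have already been produced by some clustering step, and it implements the standard two-group log-rank statistic, whose p-value comes from a chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test (hypothetical helper, not from the paper).

    times  : follow-up time for each patient
    events : 1 if death was observed, 0 if the patient was censored
    groups : cluster label for each patient, 0 or 1
    Returns (chi2, p_value).
    """
    # Only time points with at least one observed event contribute.
    event_times = sorted({t for t, e in zip(times, events) if e})

    obs_minus_exp = 0.0  # sum over time points of (observed - expected) in group 1
    variance = 0.0       # sum of hypergeometric variances

    for t in event_times:
        n = n1 = d = d1 = 0
        for ti, ei, gi in zip(times, events, groups):
            if ti >= t:                     # patient is still at risk at time t
                n += 1
                n1 += int(gi == 1)
                if ti == t and ei:          # event occurs exactly at time t
                    d += 1
                    d1 += int(gi == 1)
        if n > 1:
            frac1 = n1 / n
            obs_minus_exp += d1 - d * frac1
            variance += d * frac1 * (1 - frac1) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups

    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (made up): group 1 has short survival, group 0 long survival.
toy_times = [1, 2, 3, 10, 11, 12]
toy_events = [1, 1, 1, 1, 1, 1]
toy_groups = [1, 1, 1, 0, 0, 0]
chi2, p = logrank_test(toy_times, toy_events, toy_groups)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

In practice an established survival-analysis package would normally be used instead of a hand-written routine; the sketch only shows what the reported p-values measure, namely whether the observed deaths in one cluster depart from what would be expected if both clusters shared the same survival curve.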

Bottom Line:
Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping.
