Contributions: JFL and FJM designed the study and wrote the manuscript, IU, RW, DK, RS, LL and FJM designed and conducted the bioinformatics analysis, LL, CL, PHS, MR, IHP, FJM and NOS conducted experiments and provided essential materials for this study.

Abstract

Stem cells are defined as self-renewing cell populations that can differentiate into multiple distinct cell types. However, hundreds of different human cell lines from embryonic, fetal, and adult sources have been called stem cells, even though they range from pluripotent cells, typified by embryonic stem cells, which are capable of virtually unlimited proliferation and differentiation, to adult stem cell lines, which can generate a far more limited repertory of differentiated cell types. The rapid increase in reports of new sources of stem cells and their anticipated value to regenerative medicine1, 2 have highlighted the need for a general, reproducible method for classification of these cells3. We report here the creation and analysis of a database of global gene expression profiles (“Stem Cell Matrix”) that enables the classification of cultured human stem cells in the context of a wide variety of pluripotent, multipotent, and differentiated cell types. Using an unsupervised clustering method4, 5 to categorize a collection of ~150 cell samples, we discovered that pluripotent stem cell lines group together, while other cell types, including brain-derived neural stem cell lines, are very diverse. Using further bioinformatic analysis6 we uncovered a protein-protein network (“PluriNet”) that is shared by the pluripotent cells (embryonic stem cells, embryonal carcinomas, and induced pluripotent cells). Analysis of published data showed that the PluriNet appears to be a common characteristic of pluripotent cells, including mouse ES and iPS cells and human oocytes. Our results offer a new strategy for classifying stem cells and support the idea that pluripotence and self-renewal are under tight control by specific molecular networks.

Cultured cell populations are traditionally classified as having the qualities of stem cells by their expression of immunocytochemical or PCR markers.7 This approach can often be misleading if these markers are used to categorize novel stem cell preparations or predict inherent multi- or pluripotent features.8 To develop a more robust classification system, we created a framework for identifying putative novel stem cell preparations by their whole genome mRNA expression phenotypes (Figure 1). The core reference dataset, which we call the Stem Cell Matrix, includes cultures of human cells that have been reported to have either stem cell or progenitor qualities, including human embryonic stem cells, mesenchymal stem cells, and neural stem cells. To provide the context in which to place the stem cells, we included non-stem cell samples such as fibroblasts and differentiated embryonic stem cell derivatives. To avoid biasing the classification methods, it was critical that we designate the input cell types with terminology that carried as little preconception about their identity as possible. Our nomenclature (“Source Code”) has two components: the first is the tissue or cultured cell line of origin. The second term captures a description of the culture itself. Supplementary Tables 1 – 8 summarize the descriptions of the core samples and their assigned Source Codes.

To sort the cell types we used an unsupervised machine learning approach to cluster transcriptional profiles of the cell preparations into stable distinct groups. Sparse nonnegative matrix factorization (sNMF) was adjusted for this task by implementing a bootstrapping algorithm to find the most stable groupings (see also Supplementary Discussion 1).4, 5 The stability of the clustering9 indicated that the dataset most likely contained about twelve different types of samples (Figure 2; Supplementary Method 2). The composition of the stable clusters revealed both predictable and unpredicted groupings of a priori designations (Figure 2 and Supplementary Figure 1). The twenty samples identified as undifferentiated human pluripotent stem cell (PSC) preparations were grouped together in one dominant cluster (Figure 2, Cluster 1) and one secondary cluster (Figure 2, Cluster 5). Sixty-two of the samples were brain-derived cells that were described as neural stem or progenitor cells based on their source, culture methods and classical markers. Most of the designated neural stem cells were distributed among multiple clusters, indicating a great deal of diversity in neural stem cell preparations. But one group of the brain-derived lines, those derived from surgical specimens from living patients (HANSE cells, see below), remained together throughout the iterative clusterings (Figure 2, Cluster 6; Supplementary Figure 3; Supplementary Method 1). The HANSE cell group consisted of transcriptional profiles that were derived from neurosurgical specimens following published protocols for multipotent neural progenitor derivation and propagation.10, 11 These cells expressed markers that are commonly used to identify neural stem cells12 (see Supplementary Figure 4), but the clustering clearly separated them from the other samples that had been derived from postmortem brains of prematurely born infants (see Figure 2).10,11

We tested the ability of our dataset to categorize additional preparations by adding 66 samples comprising new cultures derived from PSC lines that were already in the matrix, preparations that were not yet included (although their presumptive cell type was already represented), or new cell types. We chose two new types of cells: a differentiated cell type (umbilical vein endothelial cells [HUVEC]) and a recently developed new source of pluripotent cells, induced pluripotent stem cells13-16 (iPSC, Supplementary Table 9). iPSCs have been generated from somatic cells, including adult fibroblasts, by genetic manipulation of certain transcription factors.13, 15-17 We re-computed clustering results including the test dataset (Supplementary Table 10). All of the HUVEC samples clustered together and formed a distinct group. Most of the additional PSC lines (human ES cells [embryonic PSC; ePSC] and iPSCs) from several different labs were placed into a context that contained solely PSC lines. The three additional germ cell tumor lines clustered together with the tumor-derived pluripotent stem cell (tPSC) line 2102Ep and samples of three human ES cell lines: BG01v (ref. 18), Hues7 (ref. 19) and Hues13 (ref. 19). BG01v is an established aneuploid variant line and the two Hues lines were aneuploid variants of the originally euploid lines (not shown).

We used a combination of analysis tools to explore the basis of the unsupervised classification of the samples in the core dataset. Gene Set Analysis20 (GSA) is a means to identify the underlying themes in transcriptional data in terms of their biological relevance.
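As an illustration of how a gene-set score can summarize per-gene differences between two sample groups, the following is a minimal sketch. It assumes a "maxmean"-style set statistic and a Welch-type per-gene score; the function names, gene identifiers and sample labels are hypothetical, and the restandardization and permutation steps of a full gene-set analysis are omitted. It is not the analysis code used in this study.

```python
import math
from statistics import mean, stdev


def gene_scores(expr, group_a, group_b):
    """Per-gene Welch-type score comparing two sample groups.

    expr maps gene -> {sample: expression value}; all names are hypothetical.
    """
    scores = {}
    for gene, values in expr.items():
        a = [values[s] for s in group_a]
        b = [values[s] for s in group_b]
        se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
        scores[gene] = (mean(a) - mean(b)) / se if se else 0.0
    return scores


def maxmean(scores, gene_set):
    """Signed maxmean-style statistic for one curated gene set.

    Averages the positive parts and the negative parts of the gene scores
    separately (both over the whole set) and returns whichever is larger
    in magnitude, keeping its sign.
    """
    z = [scores[g] for g in gene_set if g in scores]
    pos = sum(max(v, 0.0) for v in z) / len(z)
    neg = sum(max(-v, 0.0) for v in z) / len(z)
    return pos if pos >= neg else -neg


# Toy example with made-up values: two samples per group, two genes.
toy_expr = {
    "GENE_A": {"s1": 5.0, "s2": 6.0, "s3": 1.0, "s4": 2.0},
    "GENE_B": {"s1": 3.0, "s2": 4.0, "s3": 3.5, "s4": 4.5},
}
toy_scores = gene_scores(toy_expr, ["s1", "s2"], ["s3", "s4"])
toy_set_score = maxmean(toy_scores, ["GENE_A", "GENE_B"])
```

A positive set score here means that the genes in the set are, on balance, more highly expressed in the first group; significance would still have to be assessed by permutation.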

While GSA is valuable for discovering specific differences among sample groups, it is limited to curated gene lists and cannot be used to discover new regulatory networks. The MATISSE algorithm6 (http://acgt.cs.tau.ac.il/matisse) takes predefined protein-protein interactions (e.g. from yeast-two-hybrid screens) and seeks connected subnetworks that manifest high similarity in sample subsets. The modified version used in this analysis is capable of extracting sub-networks that are co-expressed in many samples but also significantly up- or down-regulated in a specific sample cluster. Since the PSC preparations were consistently clustered together we used MATISSE to look for distinctive molecular networks that might be associated with the unique PSC qualities of pluripotence and self-renewal. A Nanog-associated regulatory network has been outlined in mouse embryonic PSC,21 and we looked for the elements of this network in human PSCs using our unbiased algorithm. We found that the algorithm predicts that human PSC possess a similar NANOG-linked network (Figure 3a; elements labelled in red). However, we also discovered that the human NANOG network appears to be integrated as a small component of a much larger protein-protein interaction network that is up-regulated in human PSCs (Figure 3). Remarkably, this PSC-specific network (termed Pluripotency associated Network, PluriNet) contains key regulators that are involved in the control of cell cycle, DNA replication, DNA repair, DNA methylation, SUMOylation, RNA processing, histone modification and nucleosome positioning (see also Supplementary Discussion 2 and www.openstemcellwiki.org). Many of the genes in the PluriNet have been linked to embryogenesis, tumorigenesis, and aging (Figure 3c and Supplementary Figure 6). 
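The idea of searching a physical interaction network for connected, co-expressed subnetworks can be illustrated with a small sketch. The code below is a greedy stand-in written for illustration only; it is not the MATISSE algorithm and does not reproduce its statistical model. The function names, the correlation threshold `min_r` and the protein identifiers are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0


def grow_module(seed, interactions, expr, min_r=0.8):
    """Greedily grow a connected, co-expressed module from a seed protein.

    A candidate is added only if it physically interacts with a current
    module member AND its mean correlation with the module is >= min_r.
    """
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    module = {seed}
    grew = True
    while grew:
        grew = False
        frontier = {n for m in module for n in neighbours.get(m, ())} - module
        for cand in sorted(frontier):
            if cand not in expr:
                continue
            r = sum(pearson(expr[cand], expr[m]) for m in module) / len(module)
            if r >= min_r:
                module.add(cand)
                grew = True
    return module


# Toy data (made up): P4 is co-expressed with P1 but only reachable via P3.
toy_interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
toy_expr = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.5],
    "P3": [4.0, 3.0, 2.0, 1.0],
    "P4": [1.0, 2.0, 3.0, 5.0],
}
toy_module = grow_module("P1", toy_interactions, toy_expr)
```

In this toy case the module stops at P1 and P2: P3 is anti-correlated and is rejected, so P4 is never reached even though its profile resembles that of P1. This shows how the connectivity constraint restricts a module to interacting proteins rather than to any set of correlated genes.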

We further explored the hypothesis that pluripotency is closely linked to PluriNet expression by analyzing published gene expression datasets from human oocytes, various types of PSCs, and murine embryos (see Table 1 for a summary of our findings in various model systems). Analysis of a microarray dataset22 that spans development from murine oocytes to the late blastocyst stage revealed that the PluriNet expression is dynamic and up-regulated during early mammalian embryogenesis (Table 1; Supplementary Figures 7 – 9).23 Also, our preliminary analyses indicate that the PluriNet is strongly up-regulated in mouse PSCs, mouse iPSCs, and mouse epiblast-derived stem cells24 when compared to somatic cells. Therefore the PluriNet may be useful as a biologically inspired gauge for classifying both murine and human PSC phenotypes (Table 1; Supplementary Figures 10 – 13).

In summary, our data indicate that an unbiased global molecular profiling approach combined with a transcriptional phenotype collection using suitable machine learning algorithms can be used to understand and codify the phenotypes of stem cells.4, 5, 25 Although it is more extensive than any stem cell dataset reported to date, we consider our database and the PluriNet to be a work in progress. As more direct evidence for protein-protein interactions in human cells becomes available, it will be possible to refine the networks we have defined and make them more useful for testing hypotheses about the nature of stem cell pluri- and multipotence. Also, our sample collection is limited to pluri- and multipotent stem cell types that grow well in culture, and does not include some of the most well-studied lineages, such as hematopoietic stem cells. Resolution and reliability of a context-based unsupervised classification can be expected to grow with the breadth and depth of the database content.26 Even with these limitations, the dataset and the PluriNet have already proved useful for categorizing cell types using unbiased criteria. As more stem cell populations become available, cultured by new methods, isolated from new sources, or induced by new methods, we will use the PluriNet and the Stem Cell Matrix as a reference system for phenotyping the cells and comparing them with existing cell lines.

Methods Summary

For an overview of the general workflow, please also refer to Figure 1. A detailed list of the samples, culture methods and reference publications is provided in the Supplementary materials.11 Generally, RNA from each sample was prepared from approximately 1 × 10^6 cultured cells. Sample amplification, labeling and hybridization on Illumina WG8 and WG6 Sentrix BeadChips were performed for all arrays in this study according to the manufacturer’s instructions (http://www.illumina.com) at a single Illumina BeadStation facility. We used the Consensus Clustering framework9 to cluster transcription profiles and to assess stability of the results. As the algorithm, we used sparse non-negative matrix factorization.5 For data perturbation, 30 sub-sampling runs were performed for each considered number of clusters (k). In each run, 80% of the data was subjected to ten random restarts. The R-script can be downloaded at the accompanying website www.stemcellmachinelearning.org. Details on the application of GSA,20 PAM,27 MATISSE6 as well as publicly available datasets used in this study can be found in the Methods section. We modified the MATISSE6 computational framework to fit the goals of this study. For the present analysis we used the human physical interaction network that we had previously assembled6 and augmented it with additional interactions from recent publications.21, 28, 29 The 64 interactions in Wang et al. 200621 were mapped to the corresponding human orthologs using the NCBI Homologene database. The microarray data have been deposited at NCBI GEO (GEO series accession number: GSE11508). They can also be accessed, processed and downloaded at www.stemcellmesa.org.
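The sub-sampling scheme described above (30 runs per candidate k, each on 80% of the samples with ten random restarts) can be sketched as follows. This is an illustrative Python sketch, not the R-script referred to above. To stay self-contained it substitutes a simple one-dimensional k-means for sparse non-negative matrix factorization, and it summarizes stability with a dispersion coefficient, which is an assumption about how stability might be scored. All function names and the toy values are hypothetical.

```python
import random
from itertools import combinations


def kmeans_1d(values, k, rng, restarts=10, iters=20):
    """Stand-in clusterer (NOT sNMF): 1-D k-means, best of several restarts."""
    best_labels, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = rng.sample(values, k)
        for _ in range(iters):
            labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                      for v in values]
            for c in range(k):
                members = [v for v, lab in zip(values, labels) if lab == c]
                if members:
                    centers[c] = sum(members) / len(members)
        cost = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def consensus_matrix(values, k, runs=30, fraction=0.8, seed=0):
    """Fraction of runs in which two samples co-cluster, counted only over
    the runs in which both samples were drawn into the sub-sample."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]
    drawn = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), int(fraction * n))
        labels = kmeans_1d([values[i] for i in idx], k, rng)
        for (a, i), (b, j) in combinations(enumerate(idx), 2):
            drawn[i][j] += 1
            drawn[j][i] += 1
            if labels[a] == labels[b]:
                together[i][j] += 1
                together[j][i] += 1
    return [[1.0 if i == j
             else (together[i][j] / drawn[i][j] if drawn[i][j] else 0.0)
             for j in range(n)] for i in range(n)]


def dispersion(consensus):
    """Equals 1.0 when every consensus entry is exactly 0 or 1 (perfectly
    stable grouping) and decreases as entries approach 0.5."""
    n = len(consensus)
    return sum(4 * (c - 0.5) ** 2 for row in consensus for c in row) / (n * n)


# Toy data (made up): two well-separated groups of five samples each.
toy_values = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
stability_by_k = {k: dispersion(consensus_matrix(toy_values, k))
                  for k in (2, 3)}
```

Comparing the stability score across candidate values of k is one way to choose the number of clusters; in the toy data the two-group solution is reproduced in every run, so its score is 1.0.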

Acknowledgments

We thank Chris Stubban, Helga Dittmer, Svenja Zapf and Hildegard Meissner for their work with various cell cultures. We are grateful to Dustin Wakeman, Rodolfo Gonzalez, Scott McKercher, Jean Pyo Lee, Hyun-Sook Park, and Shin Yong Moon for sharing their cell preparations for the type collection. We are especially grateful to Robin Wesselschmidt and Martin Pera for their unique GCT lines and George Daley for providing human iPSCs. Arif Murat Kocabas and Jose Cibelli shared their human oocyte expression data with us. Aaron Barsky let us use the CEREBRAL 2.0 plug-in before its publication. Maggie Rosentraeger helped to compile the cell culture meta-data. We thank Josef Aldenhoff, Dunja Hinze-Selch, Manfred Westphal, Katrin Lamszus, Uwe Kehler, David Barker, and Anja Fritz for their support and discussions of this project.

Financial support: This study has been supported by the following grants and awards: Christian-Albrechts University Young Investigator Award (FJM), SFB-654/C5 Sleep and Plasticity (FJM and Dunja Hinze-Selch), Hamburger Krebsgesellschaft Grant (NOS), Edmond J. Safra Bioinformatics program fellowship at Tel-Aviv University (IU), Converging Technologies Program of The Israel Science Foundation Grant No 1767.07 (RS), Raymond and Beverly Sackler Chair in Bioinformatics (RS), Reproductive Scientist Development Program Scholar Award K12 5K12HD000849-20 (LL), California Institute for Regenerative Medicine Clinical Scholar Award (LL), NIH P20 GM075059-01 (JFL), the Alzheimer’s Association (JFL), and anonymous donations in support of stem cell research.