The cell line transcriptome

The word transcriptome refers to the full set of transcribed RNA molecules within a cell at a given time point. In contrast to the genome, which is largely stable across the different cells of an organism, the transcriptome varies greatly. This plasticity has made the transcriptome appealing to study, owing to its potential to serve as a proxy for cellular identity and diversity. In the Cell Atlas, all 19613 protein-coding human genes are classified according to their expression across a large number of in vitro cultured cell lines (Figure 1) (Uhlén M et al, 2015). The cell lines were harvested during the log phase of growth, and extracted high-quality mRNA was used as input material for library construction and subsequent sequencing. The expression level of gene-specific transcripts is given as Transcripts Per Million (TPM) values. Genes with a TPM value ≥1 are considered detected. Altogether, the transcriptomes of 64 cell lines have been analyzed to form a basis for the different expression categories.
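As a rough illustration of the detection rule, the sketch below converts read counts to TPM using the standard definition (length-normalized counts rescaled to sum to one million) and then applies the TPM ≥1 cutoff. The gene names, counts and transcript lengths are made-up placeholders, and the function names `tpm` and `detected` are hypothetical; the actual quantification pipeline used for the Cell Atlas is not described here.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to Transcripts Per Million (TPM)."""
    # Reads per kilobase: normalize each gene's count by its transcript length.
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Rescale so that the values of one sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


def detected(tpm_values, cutoff=1.0):
    """Return the genes whose TPM value reaches the detection cutoff."""
    return {gene for gene, value in tpm_values.items() if value >= cutoff}


# Made-up example values for three placeholder genes in one sample.
counts = {"GENE_A": 900, "GENE_B": 100, "GENE_C": 0}
lengths_kb = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 2.0}

tpm_values = tpm(counts, lengths_kb)
detected_genes = detected(tpm_values)
```

Because TPM values always sum to one million within a sample, a fixed cutoff such as TPM ≥1 can be applied uniformly across cell lines sequenced to different depths.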

Approximately one third of all protein-coding genes (n=6693) were expressed in all cell lines, consistent with a "housekeeping" function for the corresponding proteins. 11% (n=2069) of all genes were not detected in any of the analyzed cell lines, suggesting that the corresponding proteins are only expressed in highly specialized cell types, during specific developmental stages or under specific conditions such as cell stress. 43% (n=8455) of the protein-coding genes show a more restricted pattern of expression across the analyzed cell lines, with some expressed in only a few or even just a single cell line. Table 1 shows the specific expression profile for each analyzed cell line, with clickable numbers for total detected genes, cell line enriched genes, group enriched genes and cell line enhanced genes.
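The coarse split between genes detected in all cell lines and genes detected in none can be sketched as follows. This is a simplified illustration only: the cell line names, gene names and TPM values are placeholders, the function `classify_detection` is hypothetical, and the sketch does not reproduce the enriched, group enriched or enhanced categories, whose criteria are not given in this text.

```python
def classify_detection(tpm_by_cell_line, cutoff=1.0):
    """Label each gene by how many cell lines detect it (TPM >= cutoff)."""
    genes = set()
    for profile in tpm_by_cell_line.values():
        genes.update(profile)

    n_lines = len(tpm_by_cell_line)
    categories = {}
    for gene in sorted(genes):
        # Count the cell lines in which this gene reaches the cutoff.
        n_detected = sum(
            1
            for profile in tpm_by_cell_line.values()
            if profile.get(gene, 0.0) >= cutoff
        )
        if n_detected == n_lines:
            categories[gene] = "detected in all"
        elif n_detected == 0:
            categories[gene] = "not detected"
        else:
            categories[gene] = "detected in some"
    return categories


# Made-up TPM values for three placeholder genes in three placeholder cell lines.
panel = {
    "line_1": {"GENE_A": 25.0, "GENE_B": 3.0, "GENE_C": 0.0},
    "line_2": {"GENE_A": 8.0, "GENE_B": 0.4, "GENE_C": 0.9},
    "line_3": {"GENE_A": 1.0, "GENE_B": 0.0, "GENE_C": 0.2},
}

categories = classify_detection(panel)
```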

The cell line transcriptome was compared with the transcriptome of 37 different normal tissues and organs. 255 genes were expressed only in cell lines and not in any of the analyzed normal tissue types. These genes serve as an interesting starting point for studying the function and role of the corresponding proteins in human biology. Furthermore, 1220 genes were found to be expressed only in normal human tissues and not in any of the analyzed cell lines. Several of the proteins corresponding to these genes have functions associated with differentiated cells in specialized tissues or subcompartments of tissues, exemplified by ACR (acrosin), the major proteinase present in the acrosome of mature spermatozoa in normal testis, and ABCB11, the major canalicular bile salt export pump in normal liver.
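A comparison of this kind reduces to set operations on the genes detected in each panel. The sketch below assumes a gene counts as expressed in a panel if it reaches TPM ≥1 in at least one sample of that panel; the sample names, gene names and values are placeholders, and `detected_anywhere` is a hypothetical helper.

```python
def detected_anywhere(tpm_by_sample, cutoff=1.0):
    """Genes with TPM >= cutoff in at least one sample of a panel."""
    return {
        gene
        for profile in tpm_by_sample.values()
        for gene, value in profile.items()
        if value >= cutoff
    }


# Made-up TPM values for placeholder genes in two small panels.
cell_lines = {
    "line_1": {"GENE_P": 12.0, "GENE_Q": 0.2, "GENE_R": 5.0},
    "line_2": {"GENE_P": 0.0, "GENE_Q": 0.0, "GENE_R": 3.0},
}
tissues = {
    "tissue_1": {"GENE_P": 0.1, "GENE_Q": 40.0, "GENE_R": 2.0},
    "tissue_2": {"GENE_P": 0.5, "GENE_Q": 0.0, "GENE_R": 0.0},
}

in_cell_lines = detected_anywhere(cell_lines)
in_tissues = detected_anywhere(tissues)

cell_line_only = in_cell_lines - in_tissues  # detected only in cell lines
tissue_only = in_tissues - in_cell_lines     # detected only in tissues
shared = in_cell_lines & in_tissues          # detected in both panels
```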

Figure 1. Pie chart showing the number of genes in the different RNA-based categories of gene expression in the panel of cell lines.

Table 1. Table showing the number of detected genes per cell line based on RNA sequencing (TPM ≥1), and the number of genes in the enriched and enhanced categories.

A diversity of cell lines

The 64 different cell lines used in the Human Protein Atlas have been selected to represent various cell populations in different tissue types and organs of the human body. A vast majority of the selected cell lines have been derived from human cancer and are thus best described as human cancer cell lines with limited resemblance to normal cell types. Cell lines are in general adapted to cultivation in vitro and can only approximate the lives of normal cells that perform their function in a complex tissue context. As cancer is a composite tissue with heterogeneous cancer cell populations in addition to the stromal component, it is not surprising that several features of a normal cell corresponding to the putative progenitor cell are lacking in the corresponding cancer-derived cell line. Despite the evident differences between primary cells in tissue and in vitro cultured cell lines, a global analysis based on unbiased hierarchical clustering (Figure 2) shows that cell lines do in fact cluster as expected from similarities in origin and phenotype of the cancer cells from which the respective cell line was derived. This can be exemplified by the derivatives of the isogenic BJ fibroblast model, which mimics the four stages of malignant transformation (normal, immortalized, transformed and metastasizing) by cumulative addition of defined genetic elements (Hahn WC et al, 1999). At the highest level of separation, cell lines that grow in suspension and represent hematopoietic and lymphoid cell systems cluster together and separate into two major clusters depending on myeloid or lymphoid origin/phenotype. Moreover, several related cell lines cluster together, such as the versions of immortalized and transformed fibroblastic cell lines (BJ derivatives), glioma (U-138 MG and U-251 MG), melanoma (WM-115 and SK-MEL-30), breast cancer (SK-BR-3, MCF7 and T47d) and endothelial cell lines (TIME and HUVEC).
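To make the idea of hierarchical clustering concrete, the following is a minimal sketch of agglomerative clustering on expression profiles. The distance measure (1 minus the Pearson correlation of log2(TPM + 1) values) and the average linkage are assumptions chosen for illustration; the text does not state which settings were used for Figure 2. The cell line names and TPM values are made-up placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster(profiles):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    names = list(profiles)
    # Assumed preprocessing: log2(TPM + 1) to damp very high values.
    logged = {n: [math.log2(v + 1) for v in profiles[n]] for n in names}
    # Assumed distance: 1 - Pearson correlation between two profiles.
    dist = {
        frozenset((a, b)): 1 - pearson(logged[a], logged[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (linkage(c1, c2), c1, c2)
            for i, c1 in enumerate(clusters)
            for c2 in clusters[i + 1:]
        ]
        d, c1, c2 = min(pairs, key=lambda p: p[0])
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, d))
    return merges


# Made-up TPM values for five genes in four placeholder cell lines.
profiles = {
    "line_A1": [100, 80, 0, 2, 10],
    "line_A2": [90, 85, 1, 0, 12],
    "line_B1": [1, 0, 70, 95, 11],
    "line_B2": [0, 2, 80, 100, 9],
}

merges = cluster(profiles)
```

In this toy example the two "A" lines and the two "B" lines are joined first, and the two resulting groups are joined last, which is the kind of grouping by shared origin that the clustering described above reveals on real data.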

The selection of human cancer cell lines for the Cell Atlas aimed to match the origin and phenotype of the solid cancer types represented in the Pathology Atlas of the Human Protein Atlas. Special emphasis has been placed on representing cells of the hematopoietic and immune system, as the corresponding tumor types are more sparsely represented in the Cancer Atlas. Data from altogether 7 and 8 cell lines representing different stages of myeloid and lymphoid differentiation, respectively, have been generated and analyzed. In addition to cancer-derived cell lines, there are also a number of cell lines that have been generated through in vitro protocols for immortalization of growing cells as well as stem cells. Details regarding the different cell lines can be found here.

Cell line enriched genes

A majority of the cell line enriched genes also belong to the tissue elevated gene expression categories (tissue enriched, group enriched and tissue enhanced). The expression pattern in normal tissues and the function of these proteins relate to the specific traits and functions of the corresponding normal tissue type and organ. Examples are presented in Figure 3 and include: the secreted proteins AHSG and ALB, which are only expressed in normal liver and the liver-derived cell line Hep-G2, where immunofluorescent analysis shows localization to the Golgi apparatus and vesicles, respectively; the transcription factor HOXB13, which is only expressed in the nuclei of prostate, colon and rectum tissue as well as in the prostate-derived cell line PC-3; the adhesion glycoprotein CDH15, which is enriched in skeletal muscle tissue and in the sarcoma cell line RH-30; the enzyme TYR, which is exclusively expressed in skin and in the melanoma-derived cell line SK-MEL-30; and the epidermal growth factor receptor EGFR, which is enriched in female tissues and skin, and in the skin-derived cell line A-431.

The RNA-seq data for all 64 cell lines, which together express 89% (n=17544) of all protein-coding human genes, are presented in the Cell Atlas. The data can be used as a tool for selecting suitable cell lines for an experiment involving a particular gene or pathway, or for further studies on the transcriptome of established human cell lines.

Figure 3. Examples of proteins with enriched expression in a cell line and the corresponding tissue of origin. The proteins are AHSG, ALB, HOXB13, CDH15, TYR, and EGFR. The immunohistochemical (IHC) staining shows the protein expression pattern in tissue in brown. The immunofluorescent (IF) staining shows the protein subcellular expression pattern in cell lines in green. The nucleus and microtubules are shown in blue and red, respectively, in the IF images.