Affiliations
Department of Clinical Biochemistry and Pharmacology, Ben-Gurion University of the Negev, Beer-Sheva, Israel,
National Institute for Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Figures

Abstract

An open question in human genetics is what underlies the tissue-specific manifestation of hereditary diseases, which are caused by genomic aberrations that are present in cells across the human body. Here we analyzed this phenomenon for over 300 hereditary diseases by using comparative network analysis. We created an extensive resource of protein expression and interactions in 16 main human tissues, by integrating recent data of gene and protein expression across tissues with data of protein-protein interactions (PPIs). The resulting tissue interaction networks (interactomes) shared a large fraction of their proteins and PPIs, and only a small fraction of them were tissue-specific. Applying this resource to hereditary diseases, we first show that most of the disease-causing genes are widely expressed across tissues, yet, enigmatically, cause disease phenotypes in few tissues only. Upon testing for factors that could lead to tissue-specific vulnerability, we find that disease-causing genes tend to have elevated transcript levels and increased number of tissue-specific PPIs in their disease tissues compared to unaffected tissues. We demonstrate through several examples that these tissue-specific PPIs can highlight disease mechanisms, and thus, owing to their small number, provide a powerful filter for interrogating disease etiologies. As two thirds of the hereditary diseases are associated with these factors, comparative tissue analysis offers a meaningful and efficient framework for enhancing the understanding of the molecular basis of hereditary diseases.

Author Summary

An open question in human genetics is what underlies the tissue-specific manifestation of hereditary diseases, which are caused by genomic aberrations that are present in cells across the entire human body. In order to answer this question, we created an extensive resource of protein expression and interactions across 16 main human tissues. Using this resource, we first show that the genes underlying hundreds of hereditary diseases are widely expressed across tissues, yet, enigmatically, cause disease phenotypes in few tissues only. We then identify two distinct, statistically-significant factors that could lead to tissue-specific vulnerability in the face of this broad expression: (i) many disease-causing genes have elevated expression levels in their disease tissues, and (ii) disease-causing genes have a significantly higher tendency for tissue-specific interactions in their disease tissues. As we show for several disease-causing genes, these tissue-specific interactions highlight disease mechanisms and provide an efficient filter for interrogating the molecular basis of diseases. Together the two factors we identified are relevant for as many as two thirds of the tissue-specific hereditary diseases. Our comparative tissue analysis therefore provides a meaningful and efficient framework for enhancing the understanding of the molecular basis of hereditary diseases.

Funding: This research was supported by the European Union Seventh Programme under the FP7-PEOPLE-MCA-IRG Funding scheme through grant agreement n° [256360]; by the Israel Cancer Association, Israel; and by the Israel Science Foundation administered by the Israel Academy of Sciences and Humanities through grant number 860/13 (to EYL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Hereditary diseases arise due to germline aberrations that are present across the human body. Enormous progress has been made over the years in mapping the genetic causes for a large variety of hereditary diseases. To date, causal germline aberrations for over 1,500 hereditary diseases can be retrieved from the OMIM database [1]. Additional disease-related factors have been mapped by genome-wide association studies, mRNA profiling and epigenetic marking. Yet, despite this wealth of data, the molecular basis of many hereditary diseases remains elusive. It is thus apparent that novel strategies are required to enhance our understanding of the molecular basis of these diseases.

While the genetic aberrations that cause hereditary diseases are global, the diseases are often manifested in specific organs or tissues (for simplicity we will use ‘tissue’ to denote ‘organ’ as well). The mechanism for this selectivity and vulnerability in certain tissues is unknown. Common explanations refer to a unique function of the disease tissue, as in the case of muscle and liver glycogen storage disease, or a unique feature of the disease tissue, such as long-lived neurons and age-related protein misfolding diseases [2]. It was also shown that expression levels of genes underlying genetic diseases tend to be elevated in their disease tissues [3], [4]. Yet a rigorous analysis of the tissue-specific manifestation of hereditary diseases and their respective disease genes has rarely been performed.

An important determinant of tissue features is the repertoire of proteins that are expressed in the tissue and their physical interactions, whose union defines the tissue interactome. Tissue interactomes were utilized to assess tissue-specific functions of proteins and interactions (e.g., [4]–[8]) and to illuminate general properties of disease genes (e.g., [3], [7]). Tissue interactomes were typically constructed by superimposing tissue expression data on a static protein-protein interaction (PPI) network composed of interactions that were frequently identified through in-vitro assays, without any tissue context. These PPIs were included in the tissue interactome if the interacting partners were found to be expressed in that tissue, otherwise the PPIs were down-weighted or excluded (e.g., [4]–[11]). Due to the scarcity of protein expression data across tissues, most studies exploited data of gene expression across tissues. The dataset of Su et al. [12], which used DNA microarrays to profile transcript levels across 77 tissues, has been a prominent quantitative resource in many of these studies (e.g., [3], [9], [11]).

Recently, our views of gene and protein expression across tissues were considerably amplified owing to the application of additional technologies to expression mapping. First, application of next-generation RNA sequencing (RNA-seq) revealed that many more transcripts were expressed per tissue than previously acknowledged [13], thereby affecting conclusions drawn from previous data [7]. Second, large-scale measurements of protein abundance have become available, providing direct evidence for the presence of a protein in a tissue. In particular, the Human Protein Atlas (HPA) initiative offers qualitative immunohistochemical measurements of protein abundance for thousands of proteins across tens of tissues [14].

Here we took advantage of the exciting new wealth of information regarding gene and protein expression across tissues and harnessed them to illuminate the tissue-specificity of hereditary diseases. First, we combined these data and integrated them with known PPIs to create the interactomes of 16 main human tissues (Figure 1). We observed considerable similarities between tissue interactomes in terms of expressed proteins and interactions, including significant correlations between transcript levels and the numbers of interacting partners per gene in each tissue. Focusing on genes causing hereditary diseases, we show that, in contrast to the tissue-specific manifestation of their respective diseases, the genes are often expressed in many tissues. However, they tend to have elevated transcript levels or tissue-specific PPIs preferentially in their disease tissues. We demonstrate that knowledge of the tissue-specific PPIs of genes causing hereditary diseases can be used to highlight disease-related mechanisms. Therefore, comparison between tissue interactomes can serve as an efficient strategy for illuminating the molecular basis of these diseases.

Figure 1. The construction of 16 human tissue interactomes by integrating data of tissue expression with data of PPIs.

Data of expression per tissue according to DNA microarray (GNF, [12]), protein abundance (HPA, [14]), and RNA-sequencing (RNA-seq, [15]) were consolidated into 16 main tissues. In parallel, experimentally detected PPIs were united from BIOGRID [16], DIP [17], IntAct [18] and MINT [19] to form a global human interactome. Tissue interactomes were then constructed by filtering the global interactome per tissue to contain only PPIs in which both pair-mates were found to be expressed within the tissue.

Results

We obtained extensive data of genes and proteins expression across tissues by integrating three major datasets: the dataset of the Genomics Institute of the Novartis Research Foundation (GNF) measured by using DNA microarrays [12], the Human Protein Atlas (HPA) dataset described above [14], and the Illumina Body Map 2.0 dataset measured by using RNA-seq [15]. From each dataset we extracted the set of proteins or protein-coding genes expressed per tissue, denoted tissue expressome, by using stringent thresholds (see Methods). Since the available PPI data were oblivious to alternatively-spliced isoforms we associated each protein-coding gene with a single protein product (see Methods). For simplicity we henceforth refer to protein-coding genes and proteins interchangeably.

Dataset comparison of the genes expressed in each tissue revealed that RNA-seq identified the largest number of genes per tissue, with a median increase of 1.5-fold relative to HPA and of 3.8-fold relative to GNF (Table 1). Still, RNA-seq did not fully contain data from the other datasets but covered between 58–80% of their expressomes per tissue. We therefore tested whether the different datasets were compatible and could be combined. Indeed, the overlaps between datasets in expressed genes per tissue were highly statistically significant in all cases (p-value<10−97, Fisher exact test). Moreover, corresponding tissues from the different datasets best correlated with each other in almost all cases (see Methods and Table S1). We also compared between the tissue distributions of genes per dataset and found them to be bi-modal, with most genes showing either tissue-specific or ubiquitous expression across tissues (Figure 2A). In particular, in the GNF dataset most genes were tissue-specific, but in the more recent RNA-seq and HPA datasets, in accordance with other expression datasets [4], most genes were ubiquitously expressed across all tissues. We observed similar bi-modal distributions upon using less stringent expression thresholds (Figure S1).

A. The distribution of proteins and PPIs by the number of tissues in which they are expressed is bi-modal, with most genes being globally expressed or tissue-specific. The distribution is shown per dataset and when combined. PPIs show a corresponding bi-modal distribution (the numbers of PPIs across 1–16 tissues appear in Table S11). B. A comparative view of the numbers of expressed proteins and PPIs across tissues. The core sub-network that is shared by all tissues (the right-most bar) is larger than the interactome of each tissue that remains after excluding the core. The numbers of genes and PPIs in the interactome of each tissue appear in Table S12. C. Most tissue hubs are widely expressed and retain their large PPI degree when expressed. The PPI degrees of the 451 tissue hubs (rows) in the 16 tissue interactomes (columns) are presented using a heat map, where each entry marks the PPI degree of the corresponding hub in that tissue. Entries are colored by the PPI degree from yellow (≤30 PPIs) to dark blue (≥150 PPIs); a white entry implies that the hub is not expressed in that tissue. Tissue acronyms: LV = Liver, WBC = White Blood Cells, BT = Breast, OV = Ovary, HT = Heart, AP = Adipose, SM = Skeletal Muscle, CL = Colon, LG = Lung, KY = Kidny, TR = Thyroid, PT = Prostate, LG = Lung, BN = Brain, TS = Testis, AD = Adrenal. D. A strong correlation between RPKM levels and PPI degree is observed in adipose tissue (Spearman r = 0.98, p = 4.7*10−7). The box-plot diagram shows the quartiles (25%, 50% and 75%) of the sorted PPI degree values in each RPKM bin. Similar correlations were observed in all 16 tissues (Figure S3).

To obtain an extensive view of the repertoire of expressed genes and their potential PPIs in each tissue we combined the three datasets. Specifically, we defined a gene as expressed in a tissue if it was found to be expressed in that tissue in at least one dataset. The resulting tissue expressomes maintained the bi-modal tissue distribution (Figure 2A): 61% of the genes were expressed in 14–16 tissues, henceforth denoted globally expressed genes, and 14% of the genes were expressed in 1–3 tissues, henceforth denoted tissue-specific genes. Testing for gene ontology (GO) enrichments we found that globally expressed genes were highly enriched for basic cellular processes common to living cells, such as RNA splicing (p<10−28) and protein transport (p<10−26, Table S2). Tissue-specific genes were enriched for tissue-specific processes such as spermatogenesis in testis (p = 7.3*10−6) and sensory perception in brain (p = 7.2*10−4, Table S3).

Common features of tissue interactomes

To construct tissue interactomes we first gathered recent data of experimentally-detected human PPIs from four major public databases [16]–[19]. These data amounted to a global interactome consisting of 67,439 interactions between 11,225 proteins, thus covering 52% of the human protein-coding genes. We then constructed tissue interactomes by filtering the global interactome according to tissue expressomes (Figure 1). Specifically, each tissue interactome contained only those PPIs in which both interacting partners were found to be expressed in that tissue (see Methods). Each resulting tissue interactome covered more than half of the global human interactome, with 58–75% of the proteins and 63–79% of the PPIs. The tissue interactomes are provided at http://netbio.bgu.ac.il/tissueinteractoms.

An overall view of the tissue interactomes appears in Figure 2B. All the tissue interactomes shared a common core sub-network that contained 4,989 proteins and 26,370 PPIs. This core sub-network dominated all tissue interactomes by containing half or more of their proteins and PPIs. To test the consistency in expression levels of core proteins across tissues we applied the DESeq method [20]. Only 555 of the 4,989 core genes (11%) showed a significant change in expression (p-value≤0.01) in at least one tissue, implying that most core proteins are expressed at similar levels across tissues. As can be expected, core proteins were highly enriched for basic cellular processes (Table S4).

Another common feature of the tissue interactomes was the scale-free like distribution of their PPI degrees (degree is the number of interacting partners per protein). In each tissue most proteins had at most five interacting partners, while a small subset of proteins, denoted hubs, had over 45 interacting partners each (Figure S2). These 451 tissue hubs generally retained their high degree across tissues (Figure 2C), and were highly enriched for a variety of regulatory processes, such as regulation of transcription (36%, p<10−15) and regulation of signal transduction (18%, p<10−12, see Table S5). 221 of these tissue hubs were also found in the core sub-network and showed a similar regulatory nature relative to other core proteins (Table S6).

The wide range of PPI degrees led us to hypothesize that proteins with many interacting partners may require a larger number of molecules in order to support these interactions, relative to proteins with only a few interacting partners, as was previously observed in budding yeast [21]. We therefore correlated between PPI degrees and transcript level per gene in each tissue (see Methods). In all tissues these correlations were statistically significant (Figure 2D and Figure S3). These correlations were maintained despite the diversity across tissues in transcript levels and in PPI partners (Figure S4).

We gathered 303 hereditary diseases that manifested clinically in at least one of the 16 tissues that we analyzed, and their 233 causal germline-aberrant genes (see Methods and Figure S5). As shown in Figure 3A, most hereditary diseases manifested in a single tissue, and yet over 80% of their causal genes were expressed in 10 tissues or more. Thus, causal genes tend to elicit a clear phenotype in only a small subset of their expressing tissues.

Figure 3. Tissue-related features of hereditary diseases and their causal genes.

A. The tissue-distribution of hereditary diseases and their causal genes shows that diseases are manifested in few tissues, while most of their germline-aberrant causal genes are expressed in 10 tissues or more. The numbers of expressed causal genes across 1–16 tissues appear in Table S13. B. Causal genes tend to be more highly expressed in their disease tissues relative to other tissues in which they are expressed. We observed higher median expression levels in disease tissues for 128 out of the 203 germline-aberrant causal genes for which RPKM values were available (p-value<10−4). The box-plot diagram shows the quartiles (25%, 50% and 75%) of the median RPKM levels of causal genes; for each gene only tissues expressing the gene were considered. C. Causal genes involved in TS-PPI tend to have more TS-PPI in their disease tissues relative to other tissues. Out of 126 genes with TS-PPI, 58 genes had higher median TS-PPI in the disease tissue relative to non-disease tissues in which they are expressed (p-value<10−4). The box-plot diagram shows the quartiles (25%, 50% and 75%) of the median number of TS-PPI of causal genes, where for each gene only tissues expressing the gene were considered. The first (25%) and second (50%) quartiles of non-disease tissues were zero and therefore overlap with the X axis. D. The majority of the 303 hereditary diseases are associated with elevated expression and/or TS-PPIs of their causal genes in their disease tissues.

We next tested whether the expression levels of causal genes differ between their disease- and unaffected tissues, as shown previously for the larger set of genetic diseases caused by somatic or germline aberrations [3]. To this end we compared between the median expression level of causal genes in their disease tissues and their median expression level in non-disease tissues (see Methods). We found that a significant fraction of these genes were expressed at elevated levels in their disease tissues (63%, randomization test p<10−4, Figure 3B), with almost a third of these having significantly high levels (28% with p≤0.01, DESeq analysis). Given the correlation between transcript levels and PPI degrees, we next tested whether causal genes also tend to have more PPIs in their disease tissue. This tendency too was significant (42%, randomization test p<10−4, Figure S6). Moreover, there was a significant overlap between causal genes with elevated expression levels and causal genes with higher PPI degrees in their disease tissue (Fisher exact test p = 0.02).

Given that causal genes tended to have more PPIs in their disease tissue, we tested whether they are also associated with PPIs that occur almost exclusively in that tissue. Such tissue-specific PPIs (TS-PPIs) can offer a clear molecular basis for the tissue-specific manifestation of hereditary diseases. Indeed, we found several examples where the TS-PPIs of causal genes in their disease tissues involved genes and interactions previously shown to be relevant for disease etiology (Table 2 and Figure 4, see Discussion). The full list of causal genes and their TS-PPIs appears in Table S8. We next turned to assess the prevalence of TS-PPIs among causal genes. We found that causal genes had a significantly higher tendency for TS-PPIs relative to interactome genes (Fisher exact test p = 8.8*10−5, Figure S7). Moreover, their TS-PPIs occurred preferentially in their disease tissues (randomization test p<10−4, Figure 3C). As could be expected, causal genes with more PPIs and causal genes with TS-PPIs in their disease tissue significantly overlapped (Fisher exact test p<10−9). However, there was no significant overlap between causal genes with TS-PPIs and causal genes with elevated transcript levels in their disease tissues (Fisher exact test p = 0.56). Thus, TS-PPIs and elevated transcript levels of causal genes in their disease tissues distinctly contribute to the emergence of tissue-specific phenotypes. As shown in Figure 3D, these two factors are related to 67% of the hereditary diseases in our dataset.

Discussion

The identification of germline-aberrant genes underlying many of the hereditary diseases provides an important step toward unraveling their molecular basis. Still, the remarkable tissue-specificity of hereditary diseases makes it clear that additional factors are governing disease manifestations. Relying on the utility of interactomes in understanding genotype-to-phenotype relationships [22], we applied here a comparative analysis of tissue interactomes to uncover determinants of the tissue-specificity of hereditary diseases.

We analyzed over 300 hereditary diseases and their causal genes. However, this set was limited by several factors: First, only diseases caused by mutations in protein-coding genes were included. Second, diseases had to be associated with at least one of the 16 tissues we analyzed. Given that the tissue associations were deduced by using a text-mining approach [3], these associations could be noisy or limited to the subset of hereditary diseases with clear tissue phenotypes. Third, the expression level of causal genes in their disease tissue had to reach a certain threshold, and thus diseases whose causal genes are lowly expressed might have been ignored. These limitations imply that our results may be more relevant for monogenic disorders with strong tissue phenotypes.

To construct tissue interactomes we combined three large-scale datasets of transcript or protein abundance across a multitude of tissues, which were obtained through three experimental techniques. We found relatively strong correlations between transcript levels in corresponding tissues, and statistically significant yet low correlations between transcript levels and protein abundance. The latter correlation was recently shown to be around 0.4 in simultaneous measurements from a common sample [23]. The lower correlations we observed likely stem from noisy estimates of transcript and protein abundance, and from correlating between measurements from different tissue samples.

Similarly to other studies of tissue interactomes (e.g., [9], [11]), we combined the datasets by associating a gene with a tissue if its expression in that tissue passed certain criteria in at least one dataset. This combination resulted in tissue expressomes that were unique in their extent (Table 1). At the same time, relying on no more than three sources allowed us to limit lab bias effects that would have been encountered upon analyzing a similar number of samples but from many different labs [24]. We then superimposed the subset of tissue-associated genes on the set of known PPIs, filtering out PPIs in which at least one interacting partner was not associated with the tissue. It is important to note that whether a PPI indeed occurs in the tissue depends on additional factors, such as the cellular localization of the interacting proteins and their posttranslational modifications. Nevertheless, expression of both partners is a necessary initial requirement, and therefore is often used as a filter for constructing tissue interactomes (e.g., [3], [7], [9]–[11]). The effectiveness of filtered tissue interactomes was demonstrated in two recent studies, which showed that they considerably improve the prioritization of disease genes relative to an unfiltered global interactome [10], [11].

The interactomes of the different tissues had common features. First, the majority of their genes were common to 14 or more interactomes (Figure 2). These globally expressed genes were enriched in basic cellular processes and formed a common core sub-network that dominated all tissue interactomes (Figure 3). Second, hubs in the different tissue interactomes were enriched in regulatory processes. Third, the different tissues shared significant correlations between transcript levels and PPI degrees (Figure 2D and Figure S3). Such correlation was previously observed in budding yeast [21] but not in human. One might assume that these correlations stem from the fact that PPIs between highly-expressed proteins are easier to detect. However, the detection of PPIs was often done outside of a human cell, through in-vitro assays or by using yeast cells (e.g. [25]). In such assays the transcript levels are unrelated to the in-vivo levels of these transcripts within human cells of different tissues, and therefore the bias toward genes with high transcript levels in-vivo is unlikely.

In view of these marked similarities between the different tissue interactomes, the hereditary diseases that we analyzed stood out as a critical manifestation of tissue-specificity. Contrary to what might be expected, only 7% of the tissue-specific hereditary diseases were associated with tissue-specific causal genes (Figure 3D). Instead, the large majority of causal genes were expressed in many tissues that, enigmatically, did not show marked disease phenotypes (Figure 3A). We next harnessed the tissue interactomes to identify features that distinguish causal genes in their disease tissues and could underlie the tissue-specific selectivity and vulnerability.

We found that causal genes tend to have elevated expression levels in their disease tissues relative to unaffected tissues in which they were expressed (Figure 3B). A similar tendency was observed previously among genes causal for genetic diseases excluding cancers [3]. The correlation we observed between transcript levels and PPI degrees (Figure 2D), and the law of mass-action that links gene dosage with probability of interactions [26], both suggest that causal genes will interact in a more promiscuous manner in their disease tissues. Indeed, we found that causal genes tend to have more PPIs in their disease tissues. Given that mutations leading to diseases were shown in some cases to disturb the physical interactions of disease proteins [27], [28], the higher tendency for potentially disturbed PPIs in disease tissues may underlie the increased vulnerability of these tissues.

The other feature that distinguishes causal genes in their disease tissues is their tendency for TS-PPIs, which is observed for 27% of the hereditary diseases (Figure 3D). Notably, such TS-PPIs can offer an explanation for the tissue-specific manifestation of a disease: while they may not comprise the entire disease mechanism, these interactions can enhance or propagate the aberrant phenotypes and thus contribute to the clinical manifestation (Table 2). An important implication of this observation relates to the interrogation of disease etiologies: Whereas current efforts to illuminate the molecular basis of diseases typically consider all the interactions involving causal genes in their disease tissues (e.g., [10], [11]), we suggest concentrating efforts on their TS-PPIs. We show in Table S8 that focusing on these TS-PPIs typically reduces the number of candidate PPIs by 8-fold, thus providing a powerful filter.

Below we demonstrate that TS-PPIs can highlight additional disease-related proteins and interactions effectively. Our first example relates to the widely-expressed tumor suppressor BRCA1 that causes predisposition to hereditary breast and ovarian cancer. We found that, in breast, BRCA1 is involved in a single TS-PPI, with the estrogen receptor ESR1 that activates cell proliferation (Figure 4A). Indeed, it was previously demonstrated that through this interaction BRCA1 inhibits ESR1 and its proliferative activity in breast, and that this effect is reduced in mutated forms of BRCA1 [29]. The second example of a disease-related TS-PPI is that of the widely-expressed epidermal growth factor receptor EGFR. Germline and somatic mutations in EGFR lead to lung cancer [30]. Notably, we found that a lung-specific PPI connects EGFR to its ligand-protein epiregulin (EREG) that was shown to confer invasive properties in an EGFR-dependent manner [31]. Thus, the EGFR-EREG PPI has the potential to enhance the effect of EGFR aberration in a lung-specific manner (Figure 4B). The third example relates to aberrations in three genes that separately cause different subtypes of muscular dystrophy: dystroglycan 1 (DAG1), dystrophin (DMD), and caveolin 3 (CAV3). DMD is a muscle-specific protein that anchors the extracellular matrix to the cytoskeleton. It is also the ligand of DAG1, a globally expressed trans-membrane cell adhesion receptor that interacts with DMD in muscle only [32]. This interaction is prevented by another muscle-specific PPI, in which CAV3 binds to DAG1 [33]. These muscle-specific PPIs explain the muscle-specific phenotypes of DAG1 and CAV3 aberrations (Figure 4C). The last example is a putative explanation for leukoencephalopathy with vanishing white matter, a brain disease that manifests during childhood. The progressive white matter deterioration is caused by germline mutations in any of the five genes encoding the subunits of the translation initiation factor EIF2B, and cells harboring any of these mutations show decreased translation activity [34]. While EIF2B proteins are globally expressed, we found that they exhibit brain- and testis-specific interactions with the netrin-1-receptor DCC, which mediates axon guidance (Figure 4D). Notably, Tcherkezian et al. [35] that identified the relationships between DCC and EIF2B also showed that the absence of netrin significantly lowers cellular translation. We therefore propose that this relationship enhances the effect of EIF2B mutations in a brain-specific manner.

The distinct features we identified provide a starting point for elucidating the molecular basis of many hereditary diseases and can be further applied to filter the wealth of data being generated by large-scale disease-associations studies. In the future, additional tissue features could be considered, such as protein isoform concentrations [36] and relationships other than PPIs [37]. The comparative tissue analysis, along with the extensive resource of human tissue interactomes that we put forward, should become a standard framework for interpreting the wealth of disease-related data and for enhancing our understanding of the etiologies of hereditary diseases.

Methods

Expression data sources

GNF data [12] were downloaded from BioGPS [38], and all genes with intensity value above 100 in a tissue were considered as expressed [39]. HPA data [14] included proteins that were identified as expressed in a tissue, i.e., assigned as ‘low’, ‘medium’ or ‘high’ abundance based on manual assessment of tissue staining by antibodies against the proteins of interest. Proteins were further filtered by imposing stringent thresholds on the reliability and validity of their antibodies: When available, a medium or high antibody-reliability score was required; otherwise we required at least one supportive and no negative validity scores. In case of multiple measurements per tissue per gene we chose the highest value. RNA-seq data from Illumina Body Map 2.0 [15] were filtered for genes with at least 1 read per kilobase per million reads (RPKM). Results for a threshold of 0.3 RPKM were similar and appear in Figure S1. Analysis was limited to proteins and protein-coding genes only, and these were mapped to their Ensembl gene identifiers using BioMart [40]. Table S9 presents the numbers of genes and tissues measured in each dataset.

Consolidation of expression and tissue data

Since RNA-seq data covered the largest number of genes per tissue we based our analysis on the 16 main human tissues profiled with RNA-seq. GNF and HPA each contained profiles for 15 and 14 of these tissues and their subparts, respectively. We manually consolidated the various tissue subparts according to the consolidation scheme given in Table S10. A gene was considered as expressed in a tissue if that gene or its protein product were found to be expressed in that tissue or the tissue subparts by at least one dataset. Compatibility among datasets was tested by (i) computing the overlaps in genes and interactions expressed per tissue using Fisher exact test, and (ii) computing the correlation in expression levels of commonly-expressed genes using Kendall's tau rank correlation, and ranking the correlations between matching tissues compared to correlations between non-matching tissues (Table S1). Figure S8 provides scatter plots comparing the expression levels of common genes in corresponding tissues measured by any two out of HPA, GNF and RNA-seq.

PPI data sources

We assembled PPI data from BIOGRID [16], DIP [17], IntAct [18] and MINT [19]. Only experimentally-detected physical interactions were included, and their union formed the global human interactome.

Construction of tissue interactomes

Tissue interactomes were constructed by filtering the PPIs in the global human interactome according to tissue expression data [41]. A PPI was included only if both pair-mates were found to be co-expressed in the same sample or in the same tissue subpart at the lowest hierarchy level (Table S10). PPIs from subparts of the same tissue were united to form the tissue interactome.

Hubs

Interactome hubs were defined as those nodes in the network where the number of interacting partners (PPI degree) ranked among the top 5%. This resulted in a threshold of over 45 PPI partners for each interactome we analyzed.

Statistical analysis

Pair-wise Kendall's tau rank correlations between datasets were computed for expression levels of commonly detected genes. GO enrichments were performed using DAVID [42]. The total number of human proteins was 21,450 according to BioMart [40]. Differential expression of genes in a tissue was computed using the DEseq method [20]. DESeq analysis was performed for 16 tissues, such that each run compared one tissue to all other 15 tissues. The statistical significance of the overlap between disease genes with elevated expression, higher PPI degree or TS-PPI was calculated using Fisher exact test while excluding tissue-specific disease genes.

Correlation between RPKM levels and PPI degrees

In each tissue we computed the Spearman correlation between RPKM levels and PPI degrees of genes with RPKM readout above 0 and PPIs in the tissue. We also binned genes based on their RPKM levels into 10 equally-sized bins, and computed correlations using the median of each bin. In all tissues both types of correlations were highly statistically significant. The correlation values and figures appear in Figure S3.

Disease-tissue-gene associations

Disease to tissue associations were taken from Lage et al. [3], and included manually curated associations or associations exceeding a cutoff of 15 yielding a precision of 85% as mentioned therein.

Using data from the OMIM database [1] we limited the set of diseases to include only non-somatic and allelic disorders, and extracted disease-genes that were causally associated with those diseases. Data of hereditary cancers were downloaded from the cancer gene census website [43], was limited to cancer germline mutation genes, and manually associated with disease tissues. Only diseases that were associated with at least one of the 16 main tissues and whose causal disease gene was expressed in that tissue were analyzed.

Comparative analysis of RPKM levels of disease genes

We compared the RPKM values of a disease gene between its disease and non-disease tissues in two ways. To identify elevated expression we compared between (i) the median RPKM of that gene in its disease tissues, and (ii) the median RPKM of that gene in its non-disease tissues. The permutation test used to assess the significance of the results is described in the following subsection. To identify significant over-expression we used the DEseq method as described in the ‘statistical analysis’ subsection above [20]. The subset of tissues considered for a specific gene was limited to tissues in which the gene was indeed expressed.

Statistical significance of features of disease genes in their disease tissues relative to non-disease tissues

We used a permutation test to assess the statistical significance of the number of genes whose median value (RPKM level, PPI degree or number of TS-PPI) in their disease tissues was higher than the corresponding median value in their non-disease tissues. Specifically, for each disease gene we randomly selected a set of x disease tissues out of the set of tissues expressing that gene, where x was set to the number of original disease tissues for that gene. The remaining, non-selected tissues expressing the gene were considered as the gene's random non-disease tissues. In all calculations we ignored tissue-specific disease genes as their set of non-disease tissues could be empty. We then computed the relevant median value for that gene in its randomly-selected disease tissues (median value denoted V_d) and the median value in its random non-disease tissues (median value denoted V_nd). If the median in disease tissues was higher (i.e., V_d>V_nd) the gene was considered as success in the permutation test. We applied this test to all disease genes and counted the total number of random successes in that run. We repeated this analysis 10,000 times. We computed the p-value as the fraction of runs out of the 10,000 runs in which the total number of random successes was at least as high as the number of successes in the original data.

Supporting Information

The distribution of genes by number of expressing tissues remains bi-modal when gene expression thresholds are relaxed. Most genes are either globally expressed or tissue-specific, yet the tendency for global expression is enhanced. GNF refers to the study of Su et al (1). HPA refers to data of the human protein atlas (2). RNA-seq refers to RNA-sequencing data (3). GNF genes were considered as expressed in a tissue if their intensity value was above 30, and RNA-seq genes were considered as expressed in a tissue if their RPKM was at least 0.3. Total is the combined expression.

The 16 tissue interactomes show similar distributions of PPI degrees. The PPI degree of a protein in a tissue is the number of its PPI partners in that tissue. Most genes have at most 5 PPI partners in a tissue. Also shown are the PPI degree distributions within the global interactome and within the backbone interactome shared by all tissues.

The diversity across tissues in transcript levels and in genes' PPI partners. A. A heat map showing the diversity in RPKM expression levels. Each row represents a gene (g), each column represents a tissue (t), and each entry (g,t) represents the rank of the RPKM level of the gene g in tissue t. The ranking is in decreasing order of expression, such that the most highly expressed gene is ranked 1. The ranking is from blue to gray, white entries represent non-expressing tissues. B. The diversity on PPI partners of genes across tissues. Each row represents a gene and the percentage of its PPI partners that are expressed in 1–3 tissues (orange), 4–13 tissues (blue) and 14–16 tissues (grey).

The PPI degrees of causal genes across tissues. A. The PPI degree distribution of causal genes across tissues is scale-free like. Similarly to other interactome genes, most causal genes have at most five PPI partners. B. Causal genes tend to have a higher PPI degree in their disease tissues relative to other tissues (42%, randomization test p<10−4). The box-plot diagram shows the quartiles (25%, 50% and 75%) of the median PPI degree of causal genes; for each gene only tissues expressing the gene were considered.

Causal genes are significantly enriched in genes involved in tissue-specific PPIs. Each pie chart describes the fraction of genes with 0 to 5 and above tissue-specific PPIs. Notably, 54% of the causal genes have at least one tissue-specific PPI, relative to only 42% of all interactome genes (p = 8.8*10−5, Fisher exact test).

Compatibility between corresponding tissues from different datasets.Table S1A. Overlap in edges between corresponding tissues shows significant overlap between different datasets. Table S1B. Correlations between corresponding tissues from different datasets as determined by the expression levels of commonly detected genes in each tissue.