Figures

Abstract

While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis—MCTA) permits immediate replication of eQTLs using co-twins (93%–98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%–20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits.

Author Summary

Regulation of gene expression is a fundamental cellular process determining a large proportion of the phenotypic variance. Previous studies have identified genetic loci influencing gene expression levels (eQTLs), but the complexity of their tissue-specific properties has not yet been well-characterized. In this study, we perform cis-eQTL analysis in a unique matched co-twin design for three human tissues derived simultaneously from the same set of individuals. The study design allows validation of the substantial discoveries we make in each tissue. We explore in depth the tissue-dependent features of regulatory variants and estimate the proportions of shared and specific effects. We use continuous measures of eQTL sharing to circumvent the statistical power limitations of comparing direct overlap of eQTLs in multiple tissues. In this framework, we demonstrate that 30% of eQTLs are shared among tissues, while 29% are exclusively tissue-specific. Furthermore, we show that the fold change in expression between eQTL genotypic classes differs between tissues. Even among shared eQTLs, we report a substantial proportion (10%–20%) of significant tissue differences in magnitude of these effects. The complexities we highlight here are essential for understanding the impact of regulatory variants on complex traits.

Funding: Wellcome Trust funded the study and also supported ACN. Additional support was provided by the Louis-Jeantet Foundation to ETD and ACN, the Wellcome Trust grant 077016/Z/05/Z, and the United Kingdom NIHR Cambridge Biomedical Research Centre to IB. JTB is a Wellcome Trust Henry Wellcome Fellow. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Gene expression is an essential cellular function whose regulation determines a significant proportion of the phenotypic variance. Using microarrays and recently second generation sequencing (RNA-seq) [1], [2], major progress has been made in understanding the genetics of human gene expression and identifying loci that drive differential expression across individuals [3], [4], populations [5]–[7] and tissues [7]–[11]. This development is especially valuable for the biological analysis of genome-wide association (GWAS) signals [12], which often map to non-genic regions and are thus hard to interpret in the absence of additional information [13].

Transcript abundance is a very proximal endophenotype affected by genetic variation and has already facilitated the identification of candidate susceptibility genes for metabolic disease traits [14], asthma [15] or Crohn's disease [16]. This has been mostly possible when the tissue of expression was relevant to the interrogated complex trait, as disease phenotypes manifest themselves only in certain tissues. eQTLs discovered in LCLs have primarily helped explain GWAS associations with immunity-related disorders [17], [18] while associations with obesity-related traits were mostly observed when gene expression was quantified in adipose tissue [9]. Nevertheless, our guess of tissue relevance is yet far from satisfactory [19], reinforcing thus the incontestable value of measuring expression in multiple cell-types (including primary tissues reflecting in vivo patterns).

Transcriptional regulatory networks are expected to dictate tissue-specificity of regulatory effects [20], but the extent of this is still under debate. Depending on the cell-types compared and the eQTL discovery methods used, current estimates for tissue-specificity of eQTLs range from ∼30% (liver, adipose tissue) [21] to 70–80% (LCL, fibroblasts, T cells) [7].

In this study we investigated various aspects of tissue-specificity and we emphasize the importance of accounting not only for statistical significance but also for continuous biological properties of regulatory variants, such as fold change in expression. We explored the complexity of the human cis-regulatory variation landscape in three tissues (LCL, skin and fat) derived from a subset of female Caucasian twins aged between 40 and 87 years old (mean 62 years) from the UK Adult Twin registry [22]. The present study represents the pilot phase of the MuTHER project (Multiple Tissue Human Expression Resource—http://www.muther.ac.uk/), a major resource initiated to enhance our knowledge about common trait susceptibility by providing genome-wide expression, methylation and eventually transcriptome sequencing information for 855 extensively phenotyped twins (clinical, anthropometric, life-style information as well as a wide range of biological measurements are available).

We tested for SNP-gene expression associations (eQTLs) separately in each tissue. We considered only unrelated individuals at a time by separating twins from the same pair and thus performing two independent eQTL analyses per tissue. This study design, hereafter named Matched Co-Twin Analysis (MCTA), permits immediate replication and validation of eQTL discoveries. We used Spearman Rank Correlation (SRC) to detect associations and restricted our search to cis effects located within 1Mb on either side of a gene's transcription start site (TSS). Statistical significance was assessed at different thresholds using permutations (10,000 per gene) [5]. We detected an abundance of cis eQTLs (Table S1A) per tissue at a comparable rate to other studies of similar sample size [5], [7]. The reported eQTLs appear robust as they replicate well between individuals of the two co-twin groups per tissue. We measured the eQTL overlap in a continuous fashion by taking the significant SNP-gene associations from one co-twin set and estimating the proportion of true associations (π1 statistic [23], see Materials and Methods) on the distribution of corresponding p-values in the reciprocal co-twin validation set. High levels of eQTL replication were observed across co-twins, with a mean π1 of 0.93 in skin and 0.98 in LCL and fat (Table 1). We also measured the estimated proportion of true positives among the subset of genes that did not replicate in the co-twin at the same threshold. This too is high (π1 = 0.84 for skin and 0.94 for LCL and fat), suggesting that exact overlap of genes at a given permutation threshold (PT) is an underestimate of eQTL replication due to winner's curse. In other words, we detected eQTLs in the co-twin that clearly replicated the initial findings, but at p-values that marginally missed the initial discovery threshold. To further confirm the robustness of our discoveries, we overlapped the MuTHER LCL results with available eQTL data from two recent independent studies. 40% of the genes for which we detect LCL eQTLs overlap with eQTLs detected in HapMap 3 samples of European ancestry (CEU) (Stranger et al. submitted). Likewise, 36% of the cis associations detected by Gibson et. al. in leukocytes derived from 194 southern Moroccan individuals [24] overlap with genes reported in our study. Given the differences in gender distribution, sample preparation or even cell-type tested (LCL versus leukocytes) across these studies, the gene overlap observed is reassuring.

The observed variation in gene expression is not entirely due to genetic effects. Experimental noise and environmental conditions also affect transcript levels in a global manner. Therefore, it is desirable to remove the effects of such random variables and thus increase the power to detect eQTLs. For this purpose, we employed factor analysis (FA) on each tissue separately and corrected for global latent effects on all individuals in each tissue [25]. We fitted various parameters such as number of learned factors and proportion of variance explained, in order to maximize for replication of eQTLs per tissue between twin sets. After performing standard SRC eQTL analysis on the factor-corrected expression data (SRC-FA), we obtained a substantial improvement in eQTL discovery at each of the standard permutation thresholds used (Table S1B). The improvement (twice as many eQTLs at 10−3 PT) is consistent in all tissues. The high eQTL replication between twin sets persists after FA, with an additional improvement of true positives detection in skin: π1 = 0.95 (Table 1). As expected, FA correction recovers the majority of the eQTLs discovered with the initial analysis (90% of LCL and fat and 80% of skin) ensuring that proximal genetic effects have not been corrected out. The FA correction enabled the discovery of additional signals (Table S2) likely representing real effects that could not be detected initially due to low power. This is supported by the significant overrepresentation of low association p-values (π1 = 0.99, Figure 1) estimated in the uncorrected data for eQTLs detected only after FA correction.

The significant overrepresentation of low p-values for the new eQTLs (π1 = 0.99) shows that the signal existed in the uncorrected data but wasn't called significant due to low power. In each tissue, the exact SNP-gene combinations (eQTLs) tested are presented for both co-twin sets (Twin 1—first column, Twin 2—second column).

Direct tissue overlap of significant eQTLs supports an extensive level of tissue-specificity for the three tissues, with very similar proportions in both the SRC and SRC-FA analyses (Figure 2). In the first co-twin set we discovered 858 eQTL genes (non-redundant union) at 10−3 PT in all three tissues (Table 2). Of these, 106 genes (12.35%) are shared across all tissues, 139 (16.2%) are shared in at least two tissues and 613 genes (71.44%) are detected in only one tissue. In skin we detect proportionally fewer tissue-specific effects (10.02% of skin eQTLs are specific to skin at 10−3 PT), an observation likely due to tissue heterogeneity and larger variety of present cell-types. SRC-FA results confirm the estimated ∼30% of eQTLs to be shared in at least two tissues based on threshold eQTL discovery (Table S3).

Tissue-specific effects are largely not due to tissue-specific expression of the underlying transcripts. We detected regulatory variants active only in one tissue for genes that are expressed at high levels in the other two tissues (Figure S1). The strength of tissue-specificity was investigated further by performing a joint repeated-measures ANOVA analysis with the tissue modelled as a categorical predictor variable (i.e. tissue type comprised the repeated measure). We assessed the relationship to the genotype by inspecting the SNP×tissue interaction p-value term. As expected, we detected a large enrichment of significant SNP×tissue interaction p-values for all associations (π1 = 0.56) with tissue-specific effects having higher enrichment (π1 = 0.6) than shared ones (π1 = 0.41) (Figure S2). The enrichment in the shared category suggests additional attributes of tissue-specificity beyond statistical significance, as presented in the succeeding fold change analysis.

The direction of allelic effects for shared eQTLs (10−3 and 10−2 PT) is consistent across the three given tissues (Figure S3). As expected, for eQTLs significant in one tissue only the SRC correlation coefficient rho (reflecting direction and magnitude of effects) explains a substantially higher fraction of gene expression variation in the tissue of discovery compared to the other two tissues (identical SNP-gene associations - Figure S4). On the other hand, the amount of expression variance explained by shared eQTLs (10−3 PT) is comparable across tissues.

To refine regulatory signals and describe independently acting variants, we mapped eQTLs to recombination hotspot intervals and filtered markers in high LD (Materials and Methods). We found that ∼7% of the genes tested are regulated by more than one independent cis eQTL, with similar estimates obtained from the standard and factor eQTL analysis (Figure S5). For finer comparison of eQTL effects, we conducted an analysis where sharing was required for both the gene and the genomic interval harboring the eQTL. This analysis yielded similar counts of tissue-shared and specific effects (Tables S4, S5), suggesting that the vast majority of shared genes also share regulatory variants across tissues. Furthermore, as shown previously [7], we observed that eQTLs cluster symmetrically around the TSS, with shared effects being distributed tightly around the TSS and tissue-specific effects spanning a greater range of distances (Figures S6, S7).

The results described so far are based on thresholds, which are driven by statistical significance. Overlaps at these levels are heavily dependent on power and affected by winner's curse. In addition, eQTLs sharing statistical significance may still have notable effect differences on gene expression levels across tissues, with potentially different biological consequences. Given these caveats, we examined tissue-specificity in a continuous manner by quantifying the proportion of true positives estimated from the enrichment of low p-values (π1). Specifically, the p-value distribution of significant SNP-probe pairs (10−3 PT) from a reference tissue was investigated in the other two tissues. The p-value distribution in the other tissues indicates a high degree of tissue sharing (53 to 80%) both with the SRC and SRC-FA, varying slightly depending on the reference tissue in the comparison (Table S6). This suggests that there are effect size differences (both fold change and amount of variance explained) among tissues for the same regulatory variants, which is the basis for the previously described higher eQTL tissue-specificity estimates [7]. Overall, 29% of eQTLs (1-mean π1) are estimated with the continuous approach to be tissue-specific, when comparing the three tissues studied.

As described above, tissue overlap of eQTLs should encompass not only sharing of a statistically significant regulatory effect, but also a similar effect size (fold change in expression) of that variant across tissues. In this respect, we report the fold change as the difference between the gene expression means of the heterozygous and major homozygous genotypic classes. Within the same tissue, the two co-twin sets are only slightly different in their fold change estimates. These minor differences reflect most probably the winner's curse effect (0.96 Pearson's correlation of fold change between Twin 1 and Twin 2 in LCL, 0.93 in skin and 0.93 in fat - Figure 3, Figures S8, S9). The difference in estimated effect size is much more apparent however across tissues (e.g. LCL eQTLs have a 0.69 and 0.77 fold change correlation with skin and fat eQTLs respectively, skin eQTLs have a 0.69 fold change correlation with fat eQTLs). This is largely a consequence of eQTL tissue-specificity, but a small effect of winner's curse is also expected (as observed in the comparison of co-twin sets). Furthermore, additional possible hidden tissue-specific effects are implied by the fact that shared eQTLs (at the same threshold of significance) don't always share the same effect size across tissues (LCL fold change correlation of 0.78 in skin and 0.84 in fat for shared eQTLs i.e. up to 20% difference in fold change magnitude between tissues compared to within-tissue difference). This suggests that even statistically tissue-shared eQTLs have additional dimensions of tissue-specificity and their mere discovery in multiple tissues does not guarantee similar magnitude of consequences.

The plotted fold change on the X-and Y-axes was calculated as the difference in mean expression of the heterozygous and major homozygous genotypic classes. For each pairwise tissue comparison, the Pearson's correlation coefficient between fold changes is shown.

Discussion

We have performed eQTL analysis in one cell-line (LCL) and two primary tissues of clinical importance (skin – previously uncharacterized and fat). For each tissue we report robust eQTLs replicating in independent samples with identical (MZ) or on average 50% similar (DZ) genetic background using a matched co-twin design (MCTA). To further increase our power to detect eQTLs and uncover smaller genetic effects, we applied factor analysis accounting for global variance components in the data. We refined our signals to detect independently acting cis eQTLs and for most genes we found single associated regulatory variants. When these variants are shared across tissues, they also share the same direction of allelic effects and map to the same recombination hotspot interval. Using threshold-based criteria, tissue overlap of eQTLs supports a large degree of tissue-specificity for the three tissues studied. However, this estimate is dependent on power and we therefore put forth a continuous measure of tissue-specificity that provides a refined view of the decay of statistical significance as well as fold change effect on gene expression. Using this approach we observed a significant overrepresentation of low p-values in all pairwise tissue comparisons, indicating larger proportions of shared statistically significant regulatory effects, some yet to be discovered with bigger sample sizes. However, we also observed significant eQTLs at the same threshold exhibiting differential fold changes in expression between genotypes across tissues. These cases represent tissue-specific effects as well, since differential fold change in expression is likely to have different biological consequences. Overall biological interpretation of regulatory effects - much like in the case of complex traits – is tissue-dependent, highlighting the value of multiple tissue expression datasets. Understanding such complexities and context-dependent effects in the genetic architecture of gene expression and other cellular phenotypes is essential for the interpretation of the biological properties of disease causing variants.

Materials and Methods

All samples and information were collected with written and signed informed consent. The project has been approved by the local ethics committees of all institutions involved.

Sample collection

All individuals recruited in this study were Caucasian female twins aged between 40 and 87 years old (mean age 62). Skin punch biopsies (N = 196) were taken from a relatively photo-protected area adjacent and inferior to the umbilicus. The fat sample was then carefully dissected from the same skin biopsy incision. A peripheral blood sample to generate lymphoblastoid cell lines (LCL) was taken contemporaneously. For a full description of the biopsy technique see Text S1.

Gene expression measurements and genotyping

RNA levels were measured in LCL, skin and fat using Illumina's whole-genome expression array HumanHT-12 version 3 as previously described [5]. Each sample had three technical replicates. Illumina's v3 probes were mapped to unique Ensembl gene IDs by combining and cross-checking two methods. The first approach used Illumina's probe annotation to RefSeq IDs. These were further queried with BioMart (Ensembl 54) for corresponding Ensembl genes. RefSeq IDs mapping to multiple EnsGenes were excluded. The second approach used BLAT to map the 50-mer probe sequences to Ensembl transcripts and to extract genomic locations matching for all 50 bases of the probe sequence. Probes with unique perfect match to the genome and corresponding transcripts matching to the same genes were kept. The union of the two mappings after excluding 196 conflictingly matching probes resulted in 27,499 probes corresponding to 18,170 autosomal genes available for association analysis.

Genotyping has been performed in parallel using Illumina's 1M-Duo and 1.2M-Duo custom chips on different subsets of individuals. Before further filtering, there were 106 samples with call rate (CR)≥0.90 on the 1.2M and 88 samples with CR≥0.90 on the 1M chip. Combined intensity files were created for Illuminus [26] by retaining on a per-chromosome basis only SNPs common to both chips. Additionally, any SNPs that moved position between the two chips were removed. Following further quality checks (Hardy-Weinberg p>10−4, MAF>1%), 865,544 SNPs were kept for analysis.

The overlapping set of successfully genotyped samples with available expression data amounted to 156 (LCL), 160 (skin) and 166 (fat) individuals.

Post-experimental normalization of gene expression data

Log2 - transformed expression signals were normalized separately per tissue as follows: quantile normalization was performed across the 3 replicates of each individual followed by quantile normalization across all individuals.

Genotype-gene expression associations and multiple testing correction

The eQTL analysis was done separately for each tissue. Within each tissue, twins from the same pair were separated by id in two samples analyzed independently. This separation resulted in the following sample size for LCL, skin and fat respectively: Twin 1 (74, 76, 79) and Twin 2 (82, 84, 87). Associations between SNP genotypes and normalized expression values were conducted using Spearman Rank Correlation (SRC). We considered only SNPs in cis, i.e. within a 1MB window from the TSS. We assess the statistical significance of the nominal associations using permutations as previously described [5]. We call an eQTL significant at 10−3 permutation threshold (PT) if the nominal association P-value is greater than the 0.001 tail of the minimal P-value distribution resulting from the SNP's associations with 10,000 permuted sets of expression values for each gene.

Factor analysis

We applied a Bayesian factor analysis model [25] to the expression data in each tissue. This approach uses an unsupervised linear model to account for global variance components in the data, and yields a residual expression dataset that can be used in further analysis.

We tested a wide range of parameter settings for the model, controlling the amount of variance explained by it. This was achieved by setting the parameters of the prior distributions for gene expression precision (inverse variance) and factor weight precision. These random variables are modelled using Gamma distributions, thus we varied their natural exponential family parameters - the prior mean and number of prior observations. We varied the prior mean from 10−6 to 10−2, and number of prior observations from N*10−3 to N, where N is the number of observations from data, and learned 120 latent factors. In the subsequent analysis, we used for each tissue the residual dataset that gave the best eQTL overlap between the two twin samples. The prior values used for each dataset are given in Table S7. The eQTL analysis on the corrected expression data was performed identically to the standard analysis: SRC followed by permutation testing.

Proportion of true positives from p-value distribution

For quantifying eQTL replication and tissue sharing in a continuous way, we used Storey's QVALUE software [23] (implemented in the R package qvalue 1.20.0, default recommended settings). The program takes a list of p-values and computes their estimated π0 - the proportion of features that are truly null - based on their distribution (the assumption used is that p-values of truly alternative cases tend to be close to zero, while p-values of null features will be uniformly distributed among [0,1]). The quantity π1 = 1−π0 estimates the lower bound of the proportion of truly alternative features, i.e. the proportion of true positives (TP). Replication and sharing between two samples is reported as the proportion of TP (π1) estimated from the p-value distribution of independent eQTLs discovered in sample 1 in the second sample (exact SNP-probe combinations are tested).

Recombination hotspot interval mapping and LD filtering

We refined the eQTL signals in order to characterize likely independent effects per gene. For this purpose, we mapped all common autosomal SNPs to recombination hotspot intervals as defined by McVean et.al [27]. We map significant eQTLs to recombination hotspot intervals and save the most significant SNP per gene. For each gene, SNPs resulting from this mapping are in addition filtered for LD in a pairwise manner (for each pair with D′>0.5 the least significant SNP is ignored). This filtering ensures that true shared effects (interval-gene combinations) are compared and not just genes.

Cumulative SRC rho distribution across tissues for tissue-specific and shared eQTLs (10−3 PT, Twin1). eQTLs discovered in one tissue only have distinctively higher variance in the tissue of discovery compared to shared effects.

Fold change within twins and across tissues for SKIN eQTLs (10−3 PT, SRC) discovered in Twin 1. Fold change was calculated as the difference in mean expression of the heterozygous and major homozygous genotypic classes. For each pairwise tissue comparison, the Pearson's correlation coefficient between fold changes is shown.

Fold change within twins and across tissues for FAT eQTLs (10−3 PT, SRC) discovered in Twin 1. Fold change was calculated as the difference in mean expression of the heterozygous and major homozygous genotypic classes. For each pairwise tissue comparison, the Pearson's correlation coefficient between fold changes is shown.