This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Formula display:

Abstract

Background

Gene expression is controlled by proximal promoters and distal regulatory elements
such as enhancers. While the activity of some promoters can be invariant across tissues,
enhancers tend to be highly tissue-specific.

Results

We compiled sets of tissue-specific promoters based on gene expression profiles of
79 human tissues and cell types. Putative transcription factor binding sites within
each set of sequences were used to train a support vector machine classifier capable
of distinguishing tissue-specific promoters from control sequences. We obtained reliable
classifiers for 92% of the tissues, with an area under the receiver operating characteristic
curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters).
We next used these classifiers to identify tissue-specific enhancers, scanning distal
non-coding sequences in the loci of the 200 most highly and lowly expressed genes.
Thirty percent of reliable classifiers produced consistent enhancer predictions, with
significantly higher densities in the loci of the most highly expressed compared to
lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions
yielded significant enhancer activity in the mouse liver, whereas a control set of
five sequences was completely negative.

Conclusions

We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific
signatures that can be learned and used for the de novo prediction of enhancers.

Background

A fundamental question in biology is how cells and tissues differentiate and maintain
their identity from essentially the same genome. Wide variation in spatial, temporal
and condition-dependent expression patterns of more than 20,000 genes in the human
genome [1] is required for the establishment and maintenance of different cell fates and environmental
responses. Tissue-specific genes are often implicated in distinct developmental and
metabolic pathways and therefore may constitute good candidates for biomarkers or
drug targets.

The control of gene transcription is mediated by transcription factors (TFs), which
interact in a sequence-specific manner with DNA motifs, known as TF binding sites.
The promoter is frequently divided into a basal core, covering approximately 100 bp
upstream of the transcription start site (TSS), and a proximal promoter, which extends
up to a few hundred base pairs and typically contains multiple TF binding sites [2,3]. In addition to promoters, other cis-regulatory sequences, such as enhancers, are specifically bound by TFs and are central
players in the control of transcription in multicellular eukaryotes. The regulation
of promoters by distal enhancers involves DNA looping or scanning and/or higher-order
conformation changes in chromatin [4-6], resulting in an increase in the local concentration of TFs in the vicinity of a
promoter and the initiation or enhancement of transcription. It has been long recognized
that proximal promoters and enhancers are functionally similar, and virtually undistinguishable
from each other (see, for example, [7,8]).

Both enhancers and promoters have been shown to contain DNA motifs for specific TFs,
depending on their tissue-specific activities (for example, [8-10]). In particular, CpG-depleted promoters are enriched with DNA motifs [11], suggesting a distinct regulatory mechanism from CpG-rich promoters. The transcription
complex LDB1, which involves GATA1, GATA2, TAL1, LMO2, and RUNX1, and has been extensively
studied in the context of the differentiation of erythroid cells, illuminates this
distinction. Whereas LDB1 binds mostly within CpG-depleted promoters, it only binds
downstream of CpG-rich promoters, often within the first intron of their target gene
[12]. Additional evidence suggests that such DNA motifs representing putative TF binding
sites are predictive of promoter activity, including tissue-specific expression of
their target gene (for example, [13,14]). In addition, DNA motif enrichment analyses have shown that DNA motifs are highly
predictive of enhancer activity [15-18].

Unlike promoters, enhancers can act over very long distances. Based on the relative
location of conserved non-coding elements (CNEs) in the human genome, early estimates
suggested that a large number of enhancers are more than 250 kilobases (kb) away from
their target gene [19]. For example, a conserved enhancer of Shh that is associated with polydactyly is located 1 megabase (Mb) upstream of Shh, within an intron of another gene [20]. Furthermore, similar approaches have determined that the regulatory elements controlling
the transcription of SOX9 are scattered over 1 Mb upstream of its TSS [21,22]. More recently, genome-wide chromatin interaction analyses have confirmed that such
long-range interactions are indeed widespread, providing evidence that the vast majority
of enhancers target genes other than their nearest genes [23,24]. Because of their genomic distribution and poorly characterized sequence features,
enhancers have been difficult to identify. Only the advent of high-throughput sequencing
technologies has led to large-scale screens for regulatory sequences that are now
starting to reveal complete regulatory networks and signal transduction pathways in
higher eukaryotes [25]. Such screens, however, represent a snapshot of a single cell type and set of conditions,
and conclusions cannot, therefore, be easily generalized.

Previous studies have focused on identifying sequence features in either promoters
or enhancers, and constructing models that describe these genomic elements, individually.
Here, we show how the presence and/or absence of motifs in the promoter regions of
genes with tissue-specific expression profiles can be used to reliably identify distal
enhancers with analogous tissue-specific activity. Predicted enhancers are highly
enriched in the loci of concordantly expressed genes (for instance, in the case of
predicted liver enhancers, they are five-fold more abundant in the loci of most highly
expressed liver genes than in the loci of lowly expressed liver genes), and overlap
significantly with chromatin signatures predictive of enhancer activity. Experimental
validation in mice supports the high accuracy of the presented method in predicting
tissue-specific enhancers. With the advent of new technologies and the resulting deluge
of expression data, approaches exploiting sequence features shared between promoters
and enhancers hold great promise to understanding the cis-regulatory code encrypted in the genomes of higher organisms.

Results and discussion

Promoters of tissue-specific genes contain tissue-specific signatures

Although most promoters drive basal levels of transcription ubiquitously, some promoters
are capable of controlling transcription in a tissue- and/or temporal-specific manner
[26-29]. Genes controlled by these types of promoters are expressed in specific tissues and
developmental stages, and may be induced by endogenous or exogenous factors. Here,
we set out to systematically test whether promoters of genes that exhibit a particular
expression profile in a given tissue contain sequence signatures that confer tissue-specificity
and separate them from ubiquitous promoters. For this purpose, we collected the promoter
regions (from 25 kb upstream to 0.5 kb downstream of the TSS; see Materials and methods)
of the top 200 highly expressed genes in 79 different tissues and cell types [30] (see Materials and methods). As negative controls, we selected the promoters of the
200 least expressed genes in the same set of tissues. Although expression breadth
was not considered in the construction of such gene sets, most of the genes in the
sets are only expressed at high (or low) levels in the tissue of interest. Even if
some genes are expressed across several tissues at high or low levels, the sets are
highly non-overlapping (Supplementary notes in Additional file 1). Thus, while the individual genes in a given set are not strictly tissue-specific,
the set itself is.

To assess the role of promoters in determining tissue-specific expression, we trained
a support vector machine (SVM) for each of the 79 tissues. More precisely, to discriminate
between promoters of most highly expressed and inhibited genes in each tissue, the
classifiers relied on in silico occurrences of TF binding sites within their evolutionary conserved regions (see
Materials and methods; Supplementary notes in Additional file 1). We evaluated the models’ ability to accurately predict expression using the area
under the receiver operating characteristic curve (AUC) in a five-fold cross-validation
framework. Most models (73/79) can reliably distinguish promoters of genes most highly
expressed in a given tissue from those of lowly expressed genes by identifying TFs
associated with these tissues with median AUC values between 0.60 and 1.00 (Figure 1A). To be specific, for half of the models (39/79), we obtained median AUC values
higher than 0.8. Moreover, when we tested a model trained on a particular tissue on
the promoters of genes expressed in another tissue, we obtained relatively high AUC
values mainly for related tissues (Figure S1 in Additional file 1), confirming that the models rely on tissue-specific motifs. We even obtained high
AUC values for models in which, at first glance, we could not detect any significantly
enriched motifs, such as for BM-CD71+ early erythroid cells. This result suggests
the existence of different subsets of promoters, with characteristic sequence features.
Modest performance is likely explained by lack of sequence features and/or relatively
high heterogeneity of the promoters in the training set of the model. Thus, our models
performed well even in the presence of a relatively large fraction of promoters overlapping
CpG islands, but yielded higher AUC values when trained on CpG-poor promoters (with
the mean fraction of promoters overlapping CpG islands being 0.58 for reliable models,
as compared with 0.67 for unreliable models; Figure S2 in Additional file 1; Pearson’s r2 = 0.1 with P-value = 0.001). Since genes expressed in the brain are strongly associated with CpG
islands [31,32], many of the models yielding low AUC values involved brain tissues. The performance
of the models is also negatively correlated with the fraction of promoters enriched
in TATA-box motifs (with the mean fraction of promoters containing TATA boxes being
0.49 for reliable models, as compared with 0.57 for unreliable models; Figure S3 in
Additional file 1; r2 = 0.4, P-value = 2.8 × 10-11). Additionally, promoters of most highly expressed genes in reliable models are less
conserved at the TSS compared to those in poor models (with average percentage of
sequence identity between human and mouse of 0.63 for reliable models, as compared
with 0.70 for unreliable models; Figure S4 in Additional file 1; r2 = 0.4, P-value = 4.4 × 10-10). The genes regulated by these promoters exhibit similar conservation trends. This
result suggests that extensive use of promoters with tissue-specific activity could
have arisen as a means to facilitate the acquisition of novel gene functions.

Figure 1.DNA motifs in human promoters predict tissue-specific expression. (A) Area under the receiver operating characteristic (ROC) curve for 79 models trained
and tested on promoters of genes highly expressed in 79 different tissues. The AUC
is an overall summary of diagnostic accuracy. AUC equals 0.5 when the ROC curve corresponds
to random chance and 1.0 for perfect accuracy. Reliable models (with median AUC ≥0.6)
are displayed in red, while unreliable models (with median AUC ≤0.6) are displayed
in gray. Models were evaluated in a five-fold cross-validation setting. (B) Motifs with the greatest predictive power for the liver model. The weights w of the
motifs (see Materials and methods) are given in red. Motif weights have been scaled
to [-1, 1], where 1 represents the scaled weight of the motif with highest predictive
power, and -1 the scaled weight of the motif with the lowest negative predictive power
(signs are preserved; see Materials and methods). The names of the features are listed
near the baseline of the graph. For comparison, we include weights w for the same
motif in the lung, caudate nucleus, thymus models (in different shades of gray). Similarities
among the genes that were used to train the models - which reflect functional relatedness
among tissues - explain similarities in the predictive power of the motif. Thus, 15%
of genes that are highly expressed in liver are also highly expressed in lung, while
less than 5% are in caudate nucleus and thymus.

We next observed that many of the most highly predictive motifs for tissue-specific
gene expression (that is, those with the largest positive weights; see Materials and
methods) for reliable models are known to be involved in the regulation of the corresponding
tissue. For instance, motifs with the highest predictive power (among the top 2%)
for the liver model included binding sites for HNF4A, PPARA, NR1H3, and NR2F2 (Additional
file 2), which are among TFs that have been experimentally shown to control hepatic function
and development [33-36]. This analysis is limited in that TFs may recognize similar binding sites and in
that motif databases are partially redundant. Thus, establishing the identity of the
TF that may be binding to particular motifs is not trivial. However, taken as a whole,
these observations suggest that our model specifically captures key regulators of
the liver transcriptional network (Figure 1B). Also, as expected, binding sites for the same TFs characterize models for tissues
with similar gene expression profiles. For example, most brain tissues share binding
sites for members of the Hox and Pax families of TFs (Additional file 3), confirming the correlation between motifs with high predictive power and tissue-specific
regulation.

In summary, the strong predictive value of the motifs identified in promoter regions
confirms that they are highly associated with tissue-specific gene expression, and
substantiates the involvement of promoters in the regulation of tissue-specific expression.

Promoter signatures identify tissue-specific enhancers

We next assessed whether the models describing tissue-specific promoter activity could
be exploited to discover enhancers. For this purpose, we applied each of the 73 reliable
models trained on promoter regions to predict enhancers in the loci of genes that
were among the 200 most highly or lowly expressed in the corresponding tissue (see
Materials and methods). We evaluated only evolutionarily conserved non-coding sequences
across the human and mouse genomes located at least 2.5 kb upstream and 0.5 kb downstream
of the nearest TSS (see Materials and methods). Recent studies suggest that only about
50% are conserved in mammals, the remainder constituting lineage-specific elements
(for example, [37-41]). This fraction, however, is expected to depend on the particular tissue where the
enhancers are active. On the other hand, only 10% of the genomic sequence is conserved
between mammals. This makes conservation an effective filter for enhancer identification.
Indeed, integrating sequence analysis with comparative genomics has been shown to
reveal important subsets of enhancers (for example, [17,18,42,43]). While restricting the analysis to conserved sequences implies a reduction in sensitivity,
we considered this filter essential to increase the specificity of our approach. While
tissue-specific enhancers often regulate gene expression over longer distances, they
tend to be enriched near genes that are expressed and functional in the tissue of
interest [40,44,45]. Hence, differences in the enrichment of candidate tissue-specific enhancers between
the loci of genes most highly and lowly expressed in the corresponding tissue could
be used as an indicator of whether the predicted enhancers do indeed drive tissue-specific
expression.

In 78% of tissues, our enhancer predictions are enriched in the loci of the 200 most
highly expressed genes as compared to lowly expressed genes (P-values ≤0.05, Fisher’s exact test; Figure 2A; see Materials and methods for details). The most pronounced enrichment in physiologically
normal tissue was observed in heart, lung, and liver, with fold differences of at
least 4.5. Moreover, for 44% of the tissues, we also found predictions in a significantly
larger fraction of loci of most highly expressed genes as compared to loci of lowly
expressed genes. For instance, we observed candidate liver enhancers in the 60% of
the loci of most highly expressed genes, but only in 43% of the loci of lowly expressed
genes (P-value = 0.01, computed with Fisher’s exact test; Figure S5 in Additional file 1). Finally, for 26% of the tissues the scores of the candidate enhancers were significantly
greater in the loci of most highly expressed genes as compared with those in the loci
of lowly expressed genes (P-value ≤0.05, Wilcoxon rank-sum test; Figure S6 in Additional file 1), suggesting that increasing the stringency of the prediction threshold would result
in even stronger associations. In total, enhancer predictions in the loci of most
highly expressed genes differed significantly from their counterparts in the loci
of lowly expressed genes for 85% of the tissues examined according to at least one
of the above-mentioned criteria, with 14% (10) of the tissues exhibiting significant
differences according to all of them (Table 1; Additional file 3).

Figure 2.Genome-wide enhancer predictions. (A) The number of enhancer predictions in the loci of highly expressed genes divided
by the total number of sequences scanned in the loci of highly expressed genes (in
red), as compared to the number of enhancer predictions in the loci of lowly expressed
genes divided by the total number of sequences scanned in the loci of lowly expressed
genes (in black), for 71 promoter-based models. Statistically significant differences
are indicated by asterisks (P-values ≤0.05, Fisher’s exact test). (B) Correlation between the fold enrichment between the proportions of enhancer predictions
in the loci of highly expressed and lowly expressed genes with the cross-validation
accuracy of the corresponding promoter-based models. (C) Overlap of liver enhancer predictions with strong enhancers predicted by ChromHMM
in HepG2 cell lines [51], compared to random sequences with similar length in the loci of genes highly expressed
in liver. (D) Overlap of liver enhancer predictions with DNase I hypersensitivity sites (DHS) in
HepG2 cell lines from the ENCODE project, compared to random sequences with similar
length in the loci of genes highly expressed in liver.

Table 1.Tissues for which the promoter models produce the most robust sets of predictions
for enhancers

Fold enrichment between the proportions of enhancer predictions in the loci of the
200 most highly and lowly expressed genes is strongly correlated with the accuracy
of the promoter models (Figure 2B). Promoter models that performed only modestly, such as those based on brain tissues
(AUC ≤0.70), had limited success in predicting enhancers locus-wide (with fold enrichments
reaching at most 1.3), while well-performing promoter models (AUC ≥0.90), such as
those for heart, liver, kidney, and lung, achieved greater fold enrichments of at
least 2.8 (for kidney). We also observed slightly higher fold enrichments between
the proportions of enhancer predictions in the loci of most highly and lowly expressed
genes when the difference in GC content between the former and the latter was relatively
large (log2 ratio of 0.13 as compared to -0.01 for lower fold enrichments between
the proportions of enhancer predictions in the loci of highly and lowly expressed
genes, P-value = 3.2 × 10-11, Wilcoxon rank-sum test), suggesting a role for the GC content in the control of
tissue-specific expression.

In general, enhancers predicted in the loci of the 200 most highly expressed genes
in a given tissue were found to overlap extensively with experimental and computational
enhancer marks characteristic of functional regulatory regions. For instance, candidate
tissue-specific enhancers were found to be significantly enriched (P-value ≤0.05, Fisher’s exact test) in binding sites for TFs within regulatory networks
that are known to be important in the respective tissues, such as MYC and NFKB1 in
heart, and HNF4A and SP1 in liver [46-49] (data not shown). The combined collection of enhancers predicted in the loci of the
200 most highly expressed genes in each of the tissues considered significantly overlap
with ORegAnno, a manually curated collection of regulatory sequences [50], featuring a two-fold enrichment (P-value <0.001, computed based on 1,000 randomized sequences genome-wide). Also, our
enhancer predictions are enriched for specific epigenetic histone marks generally
associated with distal transcriptional regulation, as suggested by 41% of predicted
enhancers overlapping ChromHMM predictions for strong and weak enhancers (1.5-fold
enrichment, P-value <0.001, computed based on 1,000 randomized sequences genome-wide [51]). In addition, our predictions are significantly associated with the enhancer chromatin
signature H3K4me1 (1.3-fold enrichment, P-value ≤0.001, computed based on 1,000 randomized sequences genome-wide) and DNase
I hypersensitive sites (DHSs) in different human cell lines, with a total of 42% of
predicted enhancers overlapping 1% of the DHSs (1.6-fold enrichment, P-value <0.001, computed based on 1,000 randomized sequences genome-wide). In particular,
liver enhancer predictions extensively overlap with different enhancer marks, such
as p300 binding, chromatin marks, and DHSs (Figure S7 in Additional file 1). For example, 29% of liver enhancer predictions overlap chromatin marks and ChromHMM
enhancer predictions for the HepG2 hepatocellular carcinoma cell line, providing additional
evidence for the tissue-specificity of the activity of the predicted enhancers (Figures 2C,D; Figure S8 in Additional file 1). Substantial overlap is also observed for other classifiers with DHSs (Figure S9
in Additional file 1). Finally, we found that enhancer predictions are significantly enriched in matching
p300 embryonic brain, limb, and heart enhancers (2.5-fold enrichment, P-value <0.001, computed based on 1,000 randomized sequences genome-wide, [45,52]).

Taken together, these observations are consistent with our promoter-based models being
able to predict enhancers that drive specific expression of neighboring genes in different
tissues.

The most reliable evidence for the accuracy of our promoter-based models in predicting
tissue-specific enhancers is the experimental verification of their regulatory activity
in vivo. Substantiated by the consistent results from the computational analysis, we chose
to validate a subset of liver enhancer predictions in the loci of highly expressed
liver genes using a mouse liver reporter assay [53,54]. We selected, as described in detail below, 12 out of the total of approximately
400 regions with predicted liver enhancer activity (Table 2) and 5 regions with no predicted activity as controls (Table 3) for functional testing. Importantly, we tried to ensure that the enhancer predictions
tested were not significantly different from the whole set of predictions, and chose
controls exclusively based on their score. Thus, differences between enhancer predictions
and controls observed for other sequence properties simply reflect an association
between high scores and the existence of functional constraints, rather than bias
in the selection of the sequences. Liver enhancer predictions selected for validation
had an average score of 1.79, and were distributed across the complete range of scores
(Figure S10A in Additional file 1). Additionally, liver enhancer predictions selected for validation are located at
an average distance to the nearest TSS of 7.3 kb, and are not significantly different
from the entire set of liver enhancer predictions (Figure S10B in Additional file
1). Also, liver predictions selected for validation did not exhibit statistically significant
differences in the level of evolutionary constraint compared to the entire set of
liver predictions, with an average phastCons score [55,56] of 0.38 (Figure S10C in Additional file 1). Finally, while half of the regions with predicted liver enhancer activity were
selected randomly, the remaining half was selected randomly among those predictions
overlapping with strong enhancer predictions by ChromHMM in HepG2 cell lines [51]. Controls were selected randomly among sequences that had scores in the bottom half
of the score distribution for the full set of scanned sequences, had an average score
of -1.70, were located 27.5 kb away from the nearest TSS, and had an average phastCons
score of 0.37. Each liver enhancer prediction and control was cloned upstream of a
minimal promoter element and the luciferase reporter gene (pGL4.23; Promega). Each
construct was then injected using the hydrodynamic tail vein injection assay into
at least three different mice, and liver enhancer activity was assayed after 24 h
by measuring luciferase levels (see Materials and methods).

Figure 3.Experimental validation of liver enhancer predictions using the hydrodynamic tail
vein enhancer assay. On each injection day, we also injected an empty pGL4.23[luc2] vector and a known
liver enhancer of the ApoE gene as negative and positive controls, respectively. At least three mice were injected
per construct. Statistical significance was tested using Student’s t-test followed by multiple testing adjustment with Benjamini-Hochberg’s method. The
asterisks indicate statistical significance to control at adjusted P-value ≤0.05.

Promoter-based models have the potential of shedding light on the human regulatory
landscape

We applied each of the 73 reliable models trained on promoter regions to the entire
sequence of the human genome. We scanned approximately 1,200,000 non-promoter CNEs
across the human and mouse genomes for enhancer signatures (see Materials and methods).
No model generated more than 160,000 enhancer predictions, with an average of approximately
51,000. We observed substantial overlap among enhancer predictions for related tissues
(Figure S11 in Additional file 1), in part reflecting the resemblance between promoter-based models of tissues with
similar gene expression profiles, but also indicating the existence of shared regulatory
pathways. Thus, from all sequences scanned, approximately 900,000 (73%) were considered
enhancer predictions for at least one of the models, with an average of approximately
12,000 non-redundant enhancer predictions per tissue, consistent with current ChIP-seq
findings [29]. Although we estimate the false positive rate at approximately 5% based on the number
of enhancer predictions in the loci of lowly expressed genes, a caveat of our approach
is that local differences in the composition of the human genome could result in overall
higher false positive rates. Also, consistent with the literature (for example, [57]), we found that most loci in the genome contain more than one enhancer. Indeed, without
considering redundancy among predictions, we predict an average of four enhancers
per locus per model, with the exact number depending on the tissue (Figure S12 in
Additional file 1).

We then analyzed the distribution of enhancer predictions across the genome relative
to genes. From all sequences that were classified as enhancer predictions by at least
one of the models, 55% mapped within intronic regions, 43% mapped within intergenic
regions, and the remaining 2% to UTRs. The trend is consistent for all tissues, in
that the proportion of intronic enhancer predictions is always greater than that of
intergenic predictions. Overall, tissue-specific enhancer predictions tend to be located
closer to TSSs, and in particular, near TSSs of highly expressed genes in matching
tissues. For example, there was more than 3-fold enrichment in liver enhancer predictions
within 100 kb of the TSS of the 200 most highly expressed genes in the liver (P-value <0.001, computed based on 1,000 randomized sequences genome-wide), a number
that increased to 4-fold enrichment within 10 kb of the TSS (P-value <0.001, computed based on 1,000 randomized sequences genome-wide). Furthermore,
stronger enhancer predictions are closer to TSSs than weaker predictions, with, for
instance, the strongest 1% of liver enhancer predictions being located 40 kb away
from the nearest TSS as compared to 73 kb for the complete set of liver enhancer predictions.
These results are in agreement with the literature, and suggest that the functional
relevance of a genomic region depends on its position relative to the TSS [58]. Our enhancer predictions are enriched near genes annotated with relevant gene ontology
terms. For example, we found more than five-fold enrichment in liver enhancer predictions
within the loci of genes associated with ‘positive regulation of hepatic stellate
cell activation’, ‘liver development’, and ‘positive regulation of hepatocyte differentiation’
(P-values <0.05, Fisher’s exact test), as well as enrichment for genes with critical
liver functions, such as ‘positive regulation of cholesterol metabolic process’ (P-value = 2.7 × 10-14, Fisher’s exact test), ‘triglyceride lipase activity’ (P-value = 6.0 × 10-8, Fisher’s exact test), and sucrose, maltose, and trehalose metabolic processes (all
P-values <0.05, Fisher’s exact test).

Although all our tissue-specific enhancer predictions were selected from conserved
non-coding sequences across the human and mouse genomes, they exhibit different levels
of conservation according to their phastCons scores (Figure S13 in Additional file
1). For example, liver and heart enhancer predictions in the loci of highly expressed
genes are significantly more conserved than the sequences used as basis for making
predictions (0.41 versus 0.34, and 0.43 versus 0.37, with P-values 1.1 × 10-32 and 5.6 × 10-33, respectively, calculated using the Wilcoxon rank-sum test). For models that did
not perform well in terms of their fold enrichment between the proportion of enhancer
predictions in the loci of highly and lowly expressed genes (for example, skin and
fetal brain), we observed significantly less constrained predictions. We observed
similar trends when we applied our promoter-based classifiers to investigate unconstrained
sequences (see Supplementary notes in Additional file 1).

In summary, our results indicate the existence of largely disjoint sets of tissue-specific
regulatory sequences located in the neighborhood of their potential target genes.
They also confirm an important role for evolutionarily constrained sequences, in that
73% of sequences conserved across mammals exhibit regulatory potential. Finally, consistent
with previous studies, they support a role for both promoters and enhancers in determining
spatiotemporal patterns of gene expression.

Conclusions

By analyzing the sequence of promoters of tissue-specific genes, we confirmed that
tissue-specific promoters and enhancers share TF binding motifs within the loci of
their cognate genes. Moreover, we observed that regulatory information in the promoters
of tissue-specific genes is predictive of the enhancers targeting these genes. For
73/79 tissues, we could reliably distinguish between highly and lowly expressed genes
based exclusively on the presence or absence of putative motifs (AUC ≥60%). Although
similar cut-offs have been recently employed (for example, [18,59]), we recognize that the half of the models exhibiting modest performances (AUC ≤80%)
might have limited predictive value. It is, however, important to note that the reported
AUCs represent the lower bound of the classifier accuracy due to the fact that the
strength of the tissue-specificity enhancer signal is expected to vary among the promoters
of tissue-specific genes. Promoters containing only weak signals will inherently deflate
the classification AUC estimates. To further address the performance of the classifiers
at predicting tissue-specific enhancers, we introduced a panel of independent computational
and experimental tests, which ultimately validated our analysis. Many of the TFs binding
to the motifs that are identified as relevant to each of these models are known to
play a fundamental role in the development or maintenance of normal function of the
corresponding tissues. We showed that the motifs found in promoter regions can be
used to predict enhancers with matching tissue-specificity. The accuracy of our tissue-specific
enhancer predictions by promoter-based models is supported by a highly significant
association of enhancer predictions with the genes most highly expressed in a given
tissue, and by a significant overlap of predictions with experimentally identified
tissue-specific enhancers.

More importantly, 58% (7/12) of liver enhancer predictions generated by the promoter-based
model drove luciferase expression in the liver following hydrodynamic tail vein injection
in mice, whereas none of the five negative controls did.

Six of the seven validated liver enhancers were located within introns (for the genes
TBX6, SERINC2, TF, ARAP1, STARD10, and GPRC5C), while the remaining prediction was in the immediate vicinity of GPRC5C. These genes have been previously reported as moderately to highly expressed in liver
and gallbladder [60]. For example, although the specific function of GPRC5C is unknown, the gene is highly expressed in the liver and has been suggested to play
a role in signaling events when induced by retinoic acid [61]. In addition to other motifs, liver enhancer predictions that exhibited luciferase
activity contained predicted binding sites for 5 to 11 out of 27 known liver TFs (Additional
file 5). Although each of the critical liver TFs PPARA, PPARG, NR2F2, and HNF4A had binding
sites in 6/7 (86%) sequences, no single TF had a binding site in all 7 sequences that
exhibited luciferase activity, highlighting the ability of our method to model flexible
regulatory sequence encryptions. In turn, each of the 7 sequences included binding
sites for at least 2 of these 4 TFs, and 4/7 (57%) for all 4. The 5 liver enhancer
predictions that exhibited no significant luciferase activity contained binding sites
for 4 to 8 out of 27 known liver TFs. HNF4A had binding sites in all 5 sequences,
PPARA and NR2F2 had binding sites in 4/5 (80%) sequences, and PPARG in 3/5 (60%).
With one exception, each of the sequences included binding sites for at least 2 of
these 4 TFs, and 3/5 (60%) for all 4. For comparison, 87% of all sequences scored
by the liver promoter-based model contained 1 to 18 out of 27 known liver TFs. HNF4A
had binding sites in 28% of the sequences. Only 11% of the sequences had binding sites
for at least two TFs among PPARA, PPARG, NR2F2, and HNF4A, and 7% for all four TFs.
Despite limitations in the accuracy of the TF binding site predictions and despite
the fact that many motifs may be nonfunctional [62], our results suggest that particular combinations of TFs, rather than single TFs,
are necessary to establish liver transcription. In addition, the function of assayed
sequences may be subject to activation or inhibition by additional cis- and trans-regulatory elements. For example, enhancer activity might be induced by hormones
or drugs under particular conditions [63] or depend on neighboring functional elements that are absent in the construct used
for the experiment. This and other phenomena could produce false-negative observations
in the reporter assays. In any case, the experimental data presented here provide
independent and robust validation of the enhancer predictions obtained with the promoter-based
models, and lend further support for the hypothesis that the specificity of interactions
between enhancers and promoters is at least partly due to the binding of tissue-specific
TFs.

Additional file 5: Table S4. A list of transcription factors known to be relevant for liver function.

Our models predict multiple tissue-specific enhancers per locus and per tissue, as
well as multiple tissues or domains of activity for most enhancers. This redundancy,
which has long been reported (for example, [64-66]), may serve to increase the robustness of the regulatory network [67]. Furthermore, it is likely that apparently redundant enhancers activate gene expression
in different cell types and/or under different developmental stages or conditions
[68-71]. The genomic distribution of enhancers is also likely to vary depending on the function
of their target genes. For example, the loci of transcription factors and developmental
genes are known to contain particularly high densities of CNEs, many of which act
as distal enhancers [72-77]. More recently, advances in technical approaches, such as chromosome conformation
capture and its derivatives, have confirmed these findings independent of sequence
conservation [24,78]. We observed that relatively long loci, such as those of genes expressed in brain
tissues, featured more enhancer predictions per locus compared to short loci. However,
some compact loci, such as those of genes highly expressed in liver, lung, and heart,
contained a relatively large number of enhancer predictions, providing evidence for
a particular need for fine-tuning the expression level in these tissues. Furthermore,
the level of conservation of enhancer sequences is likely to depend, as other studies
suggests (for example, [79,80]), on their particularly activity, although we found that, for all models, a large
proportion of the enhancer predictions is likely to be conserved across mammals.

Finally, our results add further evidence for a significant role of both promoters
and enhancers in determining tissue specificity. This role is supported by several
examples from the literature [14,81,82]. Different enhancer-promoter preferences would provide an additional level of transcriptional
control, assisting in establishing the favorable interactions, for instance, between
enhancers and their cognate promoters when they are distant, or between enhancers
and their cognate promoters within a gene cluster. The intimate coordination of promoters
and enhancers in regulating tissue-specific transcription has immediate practical
consequences. It makes it possible to describe the complex regulatory landscape of
higher eukaryotes, and eventually identify regulatory elements located hundreds of
kilobases away from their target gene, based solely on the analysis of proximal regulatory
elements. DNA microarrays and, more recently, RNA-seq are currently being used to
profile the transcriptomes of a diverse range of cell/tissue types, conditions, and
species. As more expression data become available, particularly in the context of
large projects such as ENCODE [83] and the 1000 Genome Project [84], it is our belief that the application of approaches such as the one we are proposing
here will result in important new insights and improve our understanding of transcriptional
regulation. Such projects are also generating a wealth of epigenetics information
that can be easily integrated with our models to reveal genomic signatures controlling
transcription.

Locus definition

In order to define gene loci, we first clustered together all overlapping transcripts
in the refGene.txt and knownGene.txt tables (available in the UCSC Genome Browser
database [86]), and then assigned the closest half of the intergenic sequence separating two genes
to each of the corresponding gene loci. Although the genes that are closest to the
enhancers are reasonable target genes, there are many known cases of enhancers located
in introns of genes that are not their targets, as well as enhancers several kilobases
away from their targets, with unrelated genes in between. Current integrative approaches
result only in modest improvement in enhancer-target gene associations (for example,
[87]), often requiring non-available data. Recently, a method based on Hi-C has been introduced
to identify genome-wide functional domains based on higher-order chromatin interactions
[5]. However, comparisons between alternative methods are limited because of the lack
of an appropriate reference or gold standard.

Promoter annotation and definition for promoter modeling

Promoter regions were defined as encompassing a 3 kb region (2.5 kb upstream and 0.5 kb
downstream of the TSS), relative to 5′ TSSs of all transcripts annotated in RefSeq
[85]. Although the total length is arbitrary, it intends to span both the core and proximal
promoter regions. In most cases, the signal that turned out to be relevant for the
models was detected within 500 bp of the TSS (Figure S14 in Additional file 1).

Gene expression values for each of the promoters of the most highly and least expressed
genes in each of the 79 tissues considered were extracted from [88]. Probe IDs were converted to UCSC Known Gene IDs using [89]. Subsequently, UCSC Known Gene IDs were converted to gene symbols and RefSeq IDs
using [90]. Expression values for transcripts with the same gene symbol were averaged together.
The 200 most highly and least expressed genes with different gene symbols were selected.
TSSs of all RefSeq IDs associated with those gene symbols were then used to define
3 kb promoter regions.

Sequence conservation of promoter regions

Sequence identity of promoter regions was determined based on genome-genome alignment
of human and mouse (from the net/chain track at UCSC [86]), using the hg18 and mm9 genome assembly, respectively.

Sequence conservation of coding regions

As an indicator of coding conservation across species we used the proportion of orthologs
of human genes found in other eukaryotic species (HomoloGene Build 64 [91]).

Motif occurrences

Presence or absence of putative motifs was determined scanning the sequence for 775
motifs in TRANSFAC [92] and JASPAR [93-95] using MAST [96] with default parameters.

Motif over-representation in promoter regions

Over-representation of 775 motifs representing TF binding sites in TRANSFAC and JASPAR
among promoter regions of the 200 most highly expressed genes in each of the 79 tissues
considered was determined by comparing the promoter regions of the 200 most highly
expressed genes to the promoters of the 200 least expressed genes in the corresponding
tissue. The entire length of the promoter region (-2.5 kb to +0.5 kb with respect
to the TSS) was searched for motif occurrences with MAST. The numbers of putative
TF binding site occurrences in each set of promoters were compared using the Wilcoxon
rank-sum test.

CpG islands and TATA-box motifs

Annotation for CpG islands was obtained from the 'cpgIslandExt' UCSC track of the
hg18 assembly of the human genome database [86]. Presence or absence of TATA-box motifs in promoter regions was determined by scanning
the sequence for TATA-box motifs in TRANSFAC [92] using MAST [96] with default parameters.

Separating promoters of most highly and lowly expressed genes

Training data

The promoter regions (-2.5 kb to +0.5 kb with respect to the TSS, based on RefSeq
annotation [85]) of the 200 most highly expressed genes (positive set) were compared to the promoters
of 200 genes with the lowest expression (negative set) in each of the 79 considered
tissues.

Sequence representation

Next, we converted the DNA sequence of each promoter into a set of TF binding site
feature vectors. We first identified all CNEs (at least 70% sequence identity between
human and mouse [98]) within each promoter sequence. Next, we ran the program MAST [96] with default parameters to identify motif occurrences in the CNEs matching 775 known
TF binding sites from the TRANSFAC [92] and JASPAR [93-95] databases. With this information, each CNE was then transformed into a 775-dimensional
TF binding site feature vector, where each feature corresponds to the number of the
corresponding TF binding site occurrences in the sequence of the CNE. There were 2.4
feature vectors (one per CNE) in a promoter, on average.

For a given classifier, the training set contained as many feature vectors as the
number of CNEs found in the promoters of the 200 most highly expressed genes (positive
set) plus the number of CNEs found in the promoters of the 200 genes with the lowest
expression (negative set). Because promoter regions may overlap, the sets included
only unique CNEs.

Classifiers

Linear SVMs [99] were used to find features relevant to distinguish between the CNEs in promoters
associated with highly expressed genes (positive class) and those in promoters associated
with lowly expressed genes (negative class). For each tissue, we trained a SVM on
an average of 553 feature vectors representing CNEs in the promoter regions of highly
expressed genes and 525 feature vectors representing CNEs in the promoter regions
of lowly expressed genes. We optimized the weight of the positive class w1 by performing a grid search. The optimal value was chosen from , where , n+ is the number of signal sequences, and n- is the number of control sequences.

A double-loop cross-validation was used to assess the accuracy of the classifier.
In each fold of the cross-validation, we used four-fifths of the members of the positive
and negative classes to identify a 'consistent' set within the positive class. This
strategy is aimed at identifying sequences that are consistent with each other, in
an effort to reduce the natural heterogeneity of the promoter sets. More precisely,
in each fold of the cross-validation, for each promoter P in the positive class in the four-fifths of the data that was used to identify a
consistent set, we trained a model excluding all sequences associated with P. Subsequently, we used that model to score each of the sequences associated with
P. Finally, among those sequences, we randomly selected two positive-scoring ones to
represent P in a 'consistent' positive set. After repeating this for all promoters in the positive
class, we obtained a 'consistent' positive set. This consistent positive set was used
together with the remaining one-fifth of the members of the negative class to train
a final classifier. The accuracy of this final classifier was evaluated using a standard
five-fold cross-validation. The entire procedure was repeated for each of the five
cross-validation folds, and the cross-validation was repeated five times. AUC was
used as criterion for optimality. This double-loop cross-validation has been successfully
applied to the enhancer prediction problem in the past (for example, [17]).

Figure S15 in Additional file 1 illustrates the variation of the size of the consistent positive set for the 79 tissues
considered. In our cross-validation framework, the consistent positive set contained
an average of 157 CNEs, representing 35% of the training data. However, the size of
the consistent positive set depends on the particular tissue, ranging from 39 (10%)
to 269 (41%) CNEs for PB-CD56+ NK cells and medulla oblongata, respectively. For PB-CD56+
NK cells the consistent positive set also contained the smallest fraction of CNEs,
while the largest fraction was obtained for uterus corpus (51%).

Linear SVMs

Training a linear SVM classifier is equivalent to solving the following constrained
optimization problem [100]:

Given the training samples , find the values of w, b and ξi that minimize

satisfying the constraints

and

The decision function of the classifier for an unknown sample x is given by:

The dual form of this problem can be described as follows: Given the training samples
, find the values that maximize

satisfying the constraints

and

Samples xi for which ai ≥ 0 are called support vectors.

The vector w can be computed in terms of αi as:

and, therefore, contains the weighted features of the support vectors.

SVM parameter selection

Linear SVMs have only one parameter, C, which controls the trade-off between errors on the training data and margin maximization.
We found that the performance of the Hb enhancer classifier was relatively stable
with respect to changes in C. We estimated C based on the training data as . Misclassifications are penalized differently depending on the class of sequences,
proportionally to the total number of sequences in each class.

Predictive power of the motifs

After obtaining a linear SVM model, the weight vector w can be used to decide the
relevance of each feature [101]. The larger |wj|, the more important role of feature j in the decision function. On these grounds, we used the weights wj to assess the predictive power of each motif.

Scaled SVM weights

To make motif weights comparable across different SVM classifiers, we scaled them
preserving their sign according to:

where

and

GC content of transcription factor binding sites

Sequence motifs representing motifs are usually encoded as PWMs. A PWM is a matrix
containing the relative frequency of each of the four possible nucleotides at each
position of a motif, which are estimates of the corresponding probabilities.

To obtain the GC content of a motif, we calculated and averaged the probability of
observing G or C at each position of the corresponding PWM.

In order to assess the contribution of the GC content to the performance of the promoter-based
enhancer models, we trained 5 models using the aforementioned strategy, each time
replacing the original 775 PWMs by an equally large collection of PWMs, in which the
nucleotide probabilities of each PWM have been randomly permuted.

Difference in GC content between two loci

Differences in GC content between loci of highly and lowly expressed genes were expressed
as the natural logarithm of the ratio between the GC content of the loci of highly
expressed genes and the GC content of the loci of lowly expressed genes.

Enhancer predictions

We applied our promoter-based models as genome-wide predictors of human enhancers
to both conserved and non-conserved sequences. In particular, for a given tissue,
when we refer to predictions in the loci of the (200) most highly and lowly expressed
genes, we imply predictions in the loci of the 200 genes with highest and lowest expression
levels whose promoters were used to train the corresponding classifier.

Prediction of conserved enhancers

First, we selected CNEs with at least 70% identity across the human and mouse genomes
[98] located at least 2.5 kb upstream and 0.5 kb downstream of TSSs annotated in refGene.txt
and knownGene.txt tables (available in the UCSC Genome Browser database [86]). Thus, we scored approximately 1,200,000 CNEs across the human genome, with an average
length of 249 bp. In particular, the loci of the 200 most highly expressed genes in
any of the 73 tissues considered comprised, on average, 85 CNEs, and comprised a total
of 500,000 CNEs, while the loci of the 200 genes with lowest expression in any of
the 73 tissues considered included an average of 108 and a total of 750,000, respectively.

Prediction of non-conserved enhancers

Second, we scanned the genome using a sliding window approach. Windows overlapping
the sequence 2.5 kb upstream and 0.5 kb downstream of the nearest TSS according to
the refGene.txt and knownGene.txt tables (available in the UCSC Genome Browser database
[86]) were excluded from further analysis. For the size of the window, we chose the average
length of the conserved region between human and mouse [98], namely 230 bps. The sliding window is shifted by 115 bps. A given sequence was considered
an enhancer prediction (or enhancer candidate) if its score was greater than s = min(0, δ), where δ is the lowest score of the top 5% sequences scored in the control loci.

Computational evaluation of genome-wide enhancer predictions

Functional analysis

To assess whether these elements disproportionally occur near genes with particular
functions, we obtained the Gene Ontology [102] (CVS version 1.2811, GOC Validation Date March 28, 2012) annotations of the closest
neighboring UCSC known genes [103] for all non-coding elements, and assigned those annotations to each element. Gene-to-GO
mapping was achieved by combining the UCSC refGene.txt and knownGene.txt tables and
GOA [104] association table using UniProt IDs. P-values were corrected for multiple testing using Bonferroni’s method [105].

Fold enrichment of enhancer predictions in the loci of the 200 most highly expressed
genes as compared to the loci of lowly expressed genes

In order to account for differences in the length of the loci, we did not directly
compare the number of enhancer predictions in the loci of the 200 most highly expressed
genes in a given tissue with the number of enhancer predictions in the loci of lowly
expressed genes in that same tissue, but the numbers of enhancer predictions divided
by the numbers of scanned sequences for loci of highly and lowly expressed genes.
Therefore, the fold enrichments in Table 1 and Additional file 3 were computed as the ratio of two proportions: (i) the total number of enhancers
predicted in the loci of the 200 most highly expressed genes divided by the total
number of sequences scanned in the loci of highly expressed genes; and (ii) the total
number of enhancers predicted in the loci of lowly expressed genes divided by the
total number of sequences scanned in the loci of lowly expressed genes. For the 73
tissues evaluated and focusing only on CNEs across the human and mouse genomes, these
proportions averaged 0.04 for loci of highly expressed genes, and 0.03 for loci of
lowly expressed genes. In the case of whole-loci predictions, these proportions averaged
0.03 for loci of the 200 most highly expressed genes, and 0.02 for loci of lowly expressed
genes.

Fraction of loci comprising enhancer predictions

The fraction of loci comprising enhancer predictions was defined as the number of
loci in which at least one of the scanned sequences was considered an enhancer prediction
divided by the total number of loci to which we applied the classifier. Therefore,
the fold enrichments in Table 1 and Additional file 3 were computed as the ratio of two ratios: (i) the total number of loci of highly
expressed genes comprising at least one enhancer prediction each divided by the total
number of loci of highly expressed genes comprising at least one scanned sequence
each; and (ii) the total number of loci of lowly expressed genes comprising at least
one enhancer prediction each divided by the total number of loci of lowly expressed
genes comprising at least one scanned sequence each. Each of the latter ratios ranges
between 0 (no loci comprising enhancer predictions) and 1 (all loci comprising scanned
sequences also comprise enhancer predictions). For the 73 tissues evaluated and focusing
only on CNEs across the human and mouse genomes, 59% of the loci of highly expressed
genes comprised at least one enhancer prediction, while 52% of the loci of lowly expressed
genes did.

Overlap between predictions and different enhancer marks

Predictions resulting from the 73 reliable promoter-based classifiers were combined
into a set of non-redundant predictions and overlapped with different enhancer marks.
Additionally, when specifically stated, we report overlaps with predictions for particular
promoter-based classifiers - for example, the classifier trained on liver promoters.

Overlap with p300

Genomic regions enriched for p300 in mouse forebrain, midbrain, limb, and heart tissues
were extracted from Additional files 3, 4 and 5[45], and mapped to the human genome (hg18) using LiftOver [106]. Genomic regions identified in forebrain, midbrain, limb, and heart were combined
into one dataset. Overlapping genomic regions were clustered together.

Overlap with DNase I hypersensitivity sites

DNase I hypersensitivity data ('narrow peaks') for 86 human cell lines from the ENCODE
project [25,107] were downloaded from the UCSC browser [108,109], converted to the hg18 assembly using LiftOver [106], and combined into one dataset. Overlapping genomic regions were clustered together.
This resulted in a total of 1,722,559 non-overlapping regions with an average length
of 253 bp. We then computed the intersection between the set of non-redundant enhancer
predictions identified by any of the 73 promoter-based models and this DNase I hypersensitivity
data dataset. Liver enhancer predictions, in particular, were also compared with DNase
I hypersensitivity data in HepG2. Predictions for enhancers in other tissues were
compared with DNase I hypersensitivity data in closely related ENCODE tissues and
cell lines (Figure S9 in Additional file 1).

Overlap with histone modification marks

Histone mark data (H3K4me1, H3K27ac) for 11 human cell lines from the ENCODE project
[25,107] were downloaded from the UCSC browser [108,110], converted to the hg18 assembly using LiftOver [106], and combined into one dataset. Overlapping genomic regions were clustered together.
This resulted in a total of 189,889 non-overlapping regions with an average length
of 6,275 bp. We then computed the intersection between the set of non-redundant enhancer
predictions identified by any of the 73 promoter-based models and this histone mark
dataset. Liver enhancer predictions, in particular, were also compared with histone
marks in HepG2.

Overlap with ChromHMM predictions

Weak and strong enhancers identified in nine human cell lines (HSMM, GM12878, HUVEC,
H1-hESC, K562, HepG2, NHEK, HMEC, NHLF) using ChromHMM [51] were downloaded from the UCSC browser [108,111], converted to the hg18 assembly using LiftOver [106], and combined into one dataset. Overlapping genomic regions were clustered together.
This resulted in a total of 399,500 non-overlapping regions with an average length
of 1,504 bp. We then computed the intersection between the set of non-redundant enhancer
predictions identified by any of the 73 promoter-based models and the ChromHMM dataset.
Liver enhancer predictions, in particular, were also compared with ChromHMM enhancers
in HepG2.

Conservation analysis

PhastCons conservation scores [56] were based on alignment of 28 vertebrate species and an 18 species placental mammal
subset, respectively [55].

In vivo validation of liver enhancer predictions

Sequences selected for in vivo validation were PCR-amplified using TopTaq (Qiagen, Hilden, Germany) from human genomic
DNA (Roche, Basel, Switzerland), purified using the QIAquick PCR purification kit
(Qiagen) and cloned into the pENTR-dTOPO vector (Life Technologies, Carlsbad, CA,
USA). Proper insertion and orientation was confirmed by colony PCR, after which positive
clones were transferred into the pGL4.23[luc2] vector (Promega) using the Gateway
system (Life Technologies). Sequence and orientation of the insert were re-verified
by Sanger sequencing, and approximately 200 μg of endotoxin-free plasmid DNA was isolated
using the EndoFree Plasmid Midi prep (Qiagen).

For the hydrodynamic tail vein assay, 10 μg of each assayed sequence in pGL4.23[luc2]
was injected along with 2 μg of pGL4.74[hRluc/TK] vector to correct for injection
efficiency, into at least three CD1 mice (Charles River Laboratories, Wilmington,
MA, USA) using the TransIT EE hydrodynamic gene delivery system (Mirus Bio LCC, Madison,
WI, USA) according to the manufacturer’s protocol. Negative (empty pGL4.23[luc2])
and positive (ApoE liver enhancer [110,112]) controls (n = 3 to 5) were also injected at each injection date/experiment. After
24 hours, livers were harvested and homogenized in passive lysis buffer (Promega),
followed by centrifugation at 4°C for 30 minutes at 14,000 rpm. Firefly and Renilla
luciferase activity in the supernatant (diluted 1:20) were measured on a Synergy 2
microplate reader (BioTek Instruments, Winooski, VT, USA) in technical replicates
of four for each liver, using the Dual-Luciferase reporter assay system (Promega).
The ratios for firefly luciferase:Renilla luciferase were determined and expressed
as relative luciferase activity. All mouse work was approved by the UCSF Institutional
Animal Care and Use Committee.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

IO conceived the work and supervised the analysis. LT carried out the analysis. RPS
and MJK conducted the mouse experiments. NA supervised the experiments. The manuscript
was written by LT and IO, with contributions from the other authors. All authors read
and approved the final manuscript.

Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH,
National Library of Medicine (IO). NA is supported in part by NIGMS award number GM61390,
NICHD grant number R01HD059862, NHGRI grant numbers 1R01HG005058 and 1R01HG006768,
NIDDK grant number 1R01DK090382, NINDS award number 1R01NS079231 and the Simons Foundation
(SFARI #256769). RPS is supported in part by a CIHR fellowship in hepatology. MJK
was supported in part by NIH training grant T32 GM007175, the Amgen Research Excellence
in Bioengineering and Therapeutic Sciences Fellowship and the QBC Fellowship. We are
grateful to Marcelo Nobrega and John Westlund from the University of Chicago for productive
discussions and assistance with the analysis of preliminary data.