Abstract

Deciphering the specific contribution of individual motifs within cis-regulatory modules (CRMs) is crucial to understanding how gene expression is regulated and how this process is affected by sequence variation. Despite vast improvements in the ability to identify where transcription factors (TFs) bind throughout the genome, we remain limited in our ability to relate motif occupancy to function from sequence alone. Here, we engineered 63 synthetic CRMs to systematically assess the relationship between variation in the content and spacing of motifs within CRMs and CRM activity during development, using transgenic Drosophila embryos. In over half the cases, very simple elements containing only one or two types of TF binding motifs were capable of driving specific spatio-temporal patterns during development. Different motif organizations provide different degrees of robustness to enhancer activity, ranging from binary on-off responses to more subtle effects including embryo-to-embryo and within-embryo variation. By quantifying the effects of subtle changes in motif organization, we were able to model biophysical rules that explain CRM behavior and may contribute to the spatial positioning of CRM activity in vivo. For the same enhancer, the effects of small differences in motif positions varied between developmentally related tissues, suggesting that gene expression may be more susceptible to sequence variation in one tissue than in another. This result has important implications for human eQTL studies, in which many associated mutations are found in cis-regulatory regions, though the mechanism by which they affect tissue-specific gene expression is often not understood.

(A) Schematic representation of the workflow, from design of synthetic CRMs through assessment of in vivo activity to tissue-specific modeling. Optimized TFBSs (from ChIP experiments) were separated by spacers with minimal affinity to known TF binding models, placed in front of a minimal promoter and lacZ reporter, and integrated into the Drosophila genome. (B–F) Homotypic synthetic CRMs containing six TFBSs from the represented sequence logo for each TF, separated by a 6 bp spacer. CRM activity was assessed by double fluorescent in situ hybridization of the lacZ reporter gene driven by the synthetic CRM (red) and the corresponding endogenous gene (green). Synthetic CRMs composed of GATA (B–B″) and Doc (C–C″) motifs drive expression in the presumptive amnioserosa (white box). The Tin synthetic CRM (D–D″) is expressed in the dorsal mesoderm (arrow), while the Twi CRM (E–E″) is expressed in the foregut (weakly; white arrowhead) and hindgut (white arrows) visceral mesoderm and in the ectoderm (asterisks). The Bin synthetic CRM (F–F″) is active in the foregut (arrowhead), midgut (asterisk) and hindgut (arrow) visceral mesoderm (VM). All embryos are shown laterally with anterior to the left, except for (E), which is a dorsal view.

(A) Automated image analysis workflow. The gene expression pattern was digitized to create a mask for the tissue of interest. Comparing the mask with CRM activity enabled rapid and reliable scoring of both penetrance and expressivity. Errors in penetrance and expressivity were estimated as described in Supplemental Methods. (B, C) Penetrance (red bars) and expressivity (blue bars) of pMad-Tin heterotypic CRMs in the VM (B) and the heart (C). Note that the P-element pMad-Tin A2 line was used to quantify heart activity.
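The mask-based scoring described in (A) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the intensity threshold and the `min_coverage` cutoff are hypothetical. The sketch assumes that penetrance is the fraction of embryos with reporter signal in the tissue of interest, and that expressivity is the mean fraction of that tissue covered by signal among the expressing embryos.

```python
import numpy as np

def tissue_coverage(tissue_mask, crm_signal, threshold):
    """Fraction of the tissue mask covered by above-threshold CRM signal.

    tissue_mask: boolean array derived from the endogenous gene pattern.
    crm_signal:  array of reporter intensities with the same shape.
    threshold:   hypothetical intensity cutoff for calling a pixel 'on'.
    """
    tissue_mask = np.asarray(tissue_mask, dtype=bool)
    if not tissue_mask.any():
        raise ValueError("tissue mask is empty")
    expressing = np.asarray(crm_signal) > threshold
    return float((expressing & tissue_mask).sum() / tissue_mask.sum())

def penetrance_and_expressivity(coverages, min_coverage=0.05):
    """Summarize per-embryo coverages (assumed definitions, see lead-in)."""
    coverages = np.asarray(coverages, dtype=float)
    positive = coverages >= min_coverage          # embryos scored as expressing
    penetrance = float(positive.mean())           # fraction of expressing embryos
    # Mean tissue coverage among expressing embryos; zero if none express.
    expressivity = float(coverages[positive].mean()) if positive.any() else 0.0
    return penetrance, expressivity

# Toy usage: four embryos, one with an explicit 2x2 mask and signal.
mask = [[True, True], [True, True]]
signal = [[0.9, 0.1], [0.8, 0.2]]
first = tissue_coverage(mask, signal, threshold=0.5)
summary = penetrance_and_expressivity([first, 0.0, 1.0, 0.0])
```

Separating the per-embryo coverage from the population summary keeps the two quantities independent, so a CRM can score high penetrance with low expressivity or the reverse.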

(A) Red and blue bars represent the experimentally measured penetrance and expressivity, respectively, of the six-motif pMad-Tin heterotypic CRMs in the VM. Gray bars represent the model's fitted results when pMad-Tin-pMad cooperative interactions were included. (B–E) CRM activity for short pMad-Tin heterotypic CRMs. Double in situ hybridization against the lacZ reporter gene driven by the synthetic CRMs (B–E, red) and the endogenous dpp gene (B′–E′, green), where arrowheads indicate the midgut visceral mesoderm (VM). Embryos are dorsally oriented, with anterior to the left, stage 13/14. (B″–E″) Schematic representations of the CRM composition, where purple and green triangles depict the number and orientation of pMad or Tin sites, respectively. Spacing between adjacent pMad-Tin sites (below) and pMad-pMad sites (above) is indicated. (F) Penetrance (red bars) and expressivity (blue bars) of the short pMad-Tin heterotypic CRMs in the VM with the model's prediction (black bars) when pMad-Tin-pMad cooperative interactions were included. (G) Model predictions for pMad-Tin enhancers of different motif number with antisense orientation of the Tin site and 4 bp separation between sites. The red curve corresponds to the model prediction with pMad-Tin-pMad interactions and the blue curve to the prediction without them. (H) As (A) but for the heart, where only non-zero results are shown. Note that we fit the heart data including the penetrance and expressivity results from the P-element pMad-Tin A2 line. Gray bars represent the model fit when considering Tin-pMad-Tin as the minimal cooperative TF configuration.
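To give an intuition for how a pMad-Tin-pMad cooperative term can change predictions as motif number grows, as in (G), the following is a generic equilibrium-occupancy toy. It is not the fitted model used here. It assumes alternating pMad and Tin sites, a single hypothetical binding weight `q` shared by all sites, and a hypothetical cooperativity factor `omega` applied to each fully bound pMad-Tin-pMad triplet. Orientation and spacing are ignored.

```python
from itertools import product

def triplet_bound_probability(n_sites, q, omega):
    """Probability that at least one pMad-Tin-pMad triplet is fully bound.

    Sites alternate pMad (even index) and Tin (odd index). Each bound site
    contributes the weight q, and each fully bound triplet contributes omega
    (omega = 1 means no cooperativity). All names and values are illustrative.
    """
    # A triplet starts at every pMad site that has a Tin and a pMad after it.
    triplet_starts = range(0, n_sites - 2, 2)
    z_total = 0.0   # partition function over all occupancy states
    z_active = 0.0  # weight of states with at least one bound triplet
    for state in product((0, 1), repeat=n_sites):
        bound_triplets = sum(all(state[i:i + 3]) for i in triplet_starts)
        weight = q ** sum(state) * omega ** bound_triplets
        z_total += weight
        if bound_triplets:
            z_active += weight
    return z_active / z_total

# Compare predictions with and without cooperativity for 3 to 6 sites.
without_coop = [triplet_bound_probability(n, 1.0, 1.0) for n in range(3, 7)]
with_coop = [triplet_bound_probability(n, 1.0, 9.0) for n in range(3, 7)]
```

Enumerating all occupancy states is only practical for short CRMs such as the three- to six-motif constructs considered here, since the number of states doubles with each added site.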