Abstract

The ChIP-seq technique enables genome-wide mapping of in vivo protein-DNA interactions and chromatin states. Current analytical approaches for ChIP-seq
analysis are largely geared towards single-sample investigations, and have limited
applicability in comparative settings that aim to identify combinatorial patterns
of enrichment across multiple datasets. We describe a novel probabilistic method,
jMOSAiCS, for jointly analyzing multiple ChIP-seq datasets. We demonstrate its usefulness
with a wide range of data-driven computational experiments and with a case study of
histone modifications on GATA1-occupied segments during erythroid differentiation.
jMOSAiCS is open source software and can be downloaded from Bioconductor [1].

Background

The advent of high-throughput next generation sequencing (NGS) technologies has revolutionized
the fields of genetics and genomics by allowing rapid and inexpensive sequencing of
billions of bases. Among the NGS applications, ChIP-seq (chromatin immunoprecipitation
followed by NGS) is perhaps the most successful to date. Initial ChIP-seq studies
largely focused on single-sample investigations. However, as we begin to understand
the role of epigenomics in biological variation, detailed comparisons of transcription
factor (TF) binding and epigenomic marks between different tissues and individuals
at single or multiple time points or developmental stages are becoming essential to
understand the etiology and progression of many diseases. Therefore, comparative analysis
of multiple ChIP-seq samples to identify combinatorial TF binding or epigenome profiles
are rapidly emerging. Some examples include: (i) identifying differential binding
of a TF or modification of a histone mark across multiple individuals, for example,
[2] studied variation in binding of NF-κB and RNA polymerase II (Pol II) across ten individuals;
[3] performed a genetic analysis of Ste12 binding in yeast by studying differential binding
across 43 segregants of a cross between two yeast strains; (ii) genome-wide binding
profiles of multiple TFs in a single tissue or cell line, for example, comparative
analysis of 22 Caenorhabditis elegans TFs [4]; (iii) time course or multiple developmental stage ChIP-seq experiments, for example,
Pol II binding at six developmental stages of C. elegans [4]; and (iv) comparative analysis of binding profiles of one or more TFs with Pol II
or modifications of histone marks, for example, [5,6].

Although there are already more than 30 algorithms and methods for ChIP-seq analysis
(reviewed in [7]), all of them are limited to single-sample analysis and lack the ability to simultaneously
compare multiple ChIP samples. The small number of available multi-sample ChIP-seq
analysis tools are either specific to ChIP-seq design (for example, [8] is specifically for identifying chromatin states from ChIP-seq of histone modifications;
[9] focuses on gene-centric analysis), exploratory [10] or difficult to generalize to more than two samples [11-13] due to computational reasons. This presents challenges for biological interpretation
since combining results from individual analysis of multiple experiments can be a
daunting task, especially for systematically enumerating combinatorial patterns of
enrichment, controlling the overall false discovery rate (FDR), and prioritizing candidate
regions for further experimental validation. A more recent genome segmentation algorithm,
Segway [14], designed for multiple ChIP-seq datasets, utilizes a dynamic Bayesian network method
and offers flexibility by enabling analysis at 1 bp resolution. However, the current
Segway implementation requires specialized cluster management systems and is not readily
available for running on standard desktops or computing clusters without a cluster
management system.

We introduce jMOSAiCS (joint model-based one- and two-sample analysis and inference
for ChIP-seq), which is a probabilistic model for integrating multiple ChIP-seq datasets
to identify combinatorial patterns of enrichment. The key components of jMOSAiCS are
base models for the sequencing reads of each individual ChIP-seq experiment and a
model that governs the relationship of enrichment among different samples. We chose
well-developed models from the ChIP literature for both of these key components. We
evaluate jMOSAiCS with extensive data-driven computational experiments and compare
it to both a separate analysis approach of multiple datasets and chromHMM [8]. We show that jMOSAiCS, which is applicable to both TF and histone ChIP-seq data,
has better power and provides better false discovery rate control than the separate
approach. We present an application of jMOSAiCS to multiple histone modifications
during erythroid differentiation [6]. This analysis identified a cluster of GATA1-occupied loci exhibiting a pattern of
enrichment that is different than that identified by chromHMM analysis of the same
datasets. We support our computational predictions by experimental validation of the
predicted patterns of histone modifications for a number of selected loci. These results
indicate that jMOSAiCS can reveal both global and local combinatorial enrichment patterns
with high sensitivity.

Results

Model description

The most commonly used NGS platform for ChIP-seq is the Illumina platform [5,15-17], which works by sequencing 25 to 100 bp from one or both ends of each DNA fragment
in the sample of interest and generates millions of short reads. Standard pre-processing
of reads involves mapping to a reference genome and summarizing total counts in each
small non-overlapping interval (referred to as bins). Statistical analysis to detect
enriched regions, that is, peaks, in a single ChIP-seq sample is based on these counts
and is carried out as a one- or two-sample analysis depending on the availability
of a control sample. In contrast, inference from multiple samples involves classifying
regions of genome into patterns of enrichment. For D samples, we can observe up to 2D different enrichment patterns across genomic regions. For example, for D = 2, {00, 01, 10, 11} is the set of possible patterns: 00 means not enriched in either
of the samples; 10 enriched only in sample 1; 01 enriched only in sample 2 and 11
enriched in both samples.

We consider I genomic regions of possibly different lengths across a reference genome. These initial
set of I regions can be obtained by analyzing each dataset separately with one of the many
available ChIP-seq analysis methods [7] and identifying regions of enrichment at a liberal FDR level. Let unobserved random
variable Eid∈{0,1} denote enrichment for region i in dataset d. The overall enrichment pattern Ei is defined as the vector (Ei1, ..., EiD). Our joint model has three layers, which are depicted in Figure 1. The first layer, called the E layer, concerns joint modeling of Eid for inferring combinatorial enrichment. This is enabled by defining a region-level
random variable Bi as described below. The second layer, called the Y layer, concerns observed read count data for region i across D samples: Yi = (Yi1, ..., YiD), where YiD=(Yid1,…,YidLi) and Li denotes the number of bins in region i. In the case of a two-sample problem, Yidj is vector-valued and denotes both the ChIP and control counts for the jth bin of the ith region in the dth sample. We assume that the counts from different samples are independent conditional
on the enrichment pattern:

Figure 1.Pictorial depiction of the jMOSAiCS model for a region across two ChIP-seq datasets. Region i consists of three bins. The B variable governs whether or not the region is enriched in any of the two samples.
The E variables denote sample-specific enrichments and are conditionally independent given
the B variable. The Z variables depict enrichment at the bin level and are conditionally independent given
the sample-specific E variables. When Eid = 1, one or more consecutive Z variables are set to 1 to capture enrichment. The observed read count Y can be scalar or vector-valued depending on the availability of a control input sample.
Data fits at the Y layer are obtained by MOSAiCS [19] on individual samples and evaluated by the goodness-of-fit (GOF) plots. ChIP: chromatin
immunoprecipitation; jMOSAiCS: joint model-based one- and two-sample analysis and
inference for ChIP-seq; MOSAiCS: model-based analysis and inference for ChIP-seq data

Yid⊥Yid′|Ei,∀d,d′=1,…,D,

and hence

Pr(Yi)=∑r=1R∏d=1DPr(Yid|Ei=r)Pr(Ei=r),

where r = 1, ..., R represents possible enrichment patterns. Note that Pr(Yid | Ei = r) = Pr(Yid | Eid = rd), rd = 0, 1, and only concerns data for the Li bins from the dth sample. Eid = 0 implies that all the bins in region i are from the background (unenriched) component in the dth sample. In contrast, if Eid = 1, one or more bins show enrichment. The third layer, called the Z layer, concerns Zidj, which we define as the bin-specific enrichment variable. If the jth bin in the ith region is enriched in dataset d, then Zidj = 1 and is 0 otherwise. We assume that Zidj,j=1,…,Li,∀d,i are independent and conditional on the region-specific enrichment indicator Eid and hence:

Pr(Zid1,…,ZidLi|Eid=rd)=∏j=1LiPr(Zidj|Eid=rd)

.

The key to our joint modeling approach are the models we utilize for the E and Y layers. For the E layer, we adopt the joint ChIP-chip model of JAMIE [18], which facilitates information sharing across experiments by capturing the correlation
among datasets. In this model, the broad dependencies among the D samples are captured via unobserved variable B, where Bi∈{0,1} denotes whether region i is potentially enriched and Eid is defined to be 1 if region i is enriched in sample d. We assume that Ei1, ..., EiD are conditionally independent given Bi. Let Pr(Bi = 1) = τ1, Pr(Eid = 1 | Bi = 1) = ηd, and Pr(Eid = 1 | Bi = 0) = 0, that is, the region cannot be enriched in any dataset if Bi = 0. Then, we have:

Pr(Eid=rd)=τ1ηdrd(1-ηd)1-rd+(1-τ1)I(rd=0).

The joint probability of (Ei1, ..., EiD) is given by:

Pr(Ei1=r1,…,EiD=rD)=τ1∏d=1Dηdrd(1-ηd)1-rd+(1-τ1)I(r1=0,…,rD=0).

For the Y layer, we adopt the model-based approach of MOSAiCS [19] since MOSAiCS provides parametric models for read counts from both the enriched and
unenriched regions in both the one- (without a control sample) and two-sample (with
a control sample) problems. At the bin level, Yidj | Zidj = 0 ~ Nidj, where Nidj ~ NegBin (a, a/µidj) are the background read counts. Its mean µidj is parameterized as logμidj=β0+β1Xidjc, where Xidj are the bin-level read counts in the control sample and c is a transformation parameter set data-adaptively. For one-sample analysis without
a control sample or for two-sample analysis with a shallow sequenced control sample,
MOSAiCS provides a parameterization of the bin-level counts that also depends on mappability
and guanine-cytosine (GC) content. For the enriched bins, Yidj | Zidj = 1 ~ Nidj + Sidj, where Sidj is the signal due to enrichment, that is, protein binding or epigenomic marker modification.
The signal Sidj is modeled either as a single negative binomial distribution or a mixture of two negative
binomial distributions. This choice is based on model fit and is determined through
Bayesian information criterion (BIC) [20] by MOSAiCS. For model fitting, we utilize the fact that MOSAiCS provides fast and
accurate estimates of the dataset-specific background and signal distributions. Therefore,
as a part of model fitting, jMOSAiCS only needs to infer parameters associated with
the B and E variables, namely τ1 and ηd, d = 1, ..., D. In addition, jMOSAiCS provides posterior probabilities of the B and E variables that facilitate identification of region-specific enrichment patterns across
the D datasets. We implemented jMOSAiCS as an R package and it is available from Bioconductor
[1]. Additional Files 1 and 2 provide a freeze of the R package and its vignette. These are included in the manuscript
for archival purposes only. We recommend that users download the most recent version
of the software from Bioconductor [1].

Data-driven computational experiments

We evaluated jMOSAiCS with data-driven computational experiments by simulating multiple
ChIP-seq datasets based on model fits on actual datasets. We utilized ChIP-Seq experiments
of STAT1 binding in interferon-γ-stimulated HeLa S3 cells by [21], H3K9me3 (repression mark) modification in peripheral blood mononuclear cells (PBMCs)
from two unrelated individuals (Bresnick Lab, UW Madison), and methyl CpG binding
protein (MeCP2) in mouse cortex (Chang Lab, UW Madison). The model fits were obtained
by MOSAiCS and the goodness-of-fit plots indicated satisfactory fits as discussed
in [19]. We simulated multiple ChIP-seq datasets using parameters that matched observed values
in the STAT1, H3K9me3 and MeCP2 ChIP-seq experiments. The density plots of the read
counts from the actual and sample simulated data are provided in Additional file 3, Figure S1, and indicate that the simulated data mimics the actual data well. In
what follows, we first compared jMOSAiCS with a commonly practiced separate analysis
scheme where each ChIP-seq dataset is analyzed individually and the enrichment patterns
are generated by post hoc analysis. Then, we compared jMOSAiCS to chromHMM [8], which is currently the state-of-the-art approach for discovering combinatorial patterns
of chromatin states from multiple ChIP-seq data.

Additional file 3.Supplementary materials. This file contains further details on and additional results from the computational
experiments presented as supplementary text and figures.

jMOSAiCS improves on a separate analysis of multiple ChIP-seq datasets

Comparisons based on data-driven STAT1 experiments: analysis of multiple ChIP-seq
datasets of two or more TFs under similar biological conditions

Data for this experiment uses the actual input experiment as the control sample and
emulates ChIP-seq of multiple transcription factors in a single biological condition.
Since we repeated each simulation experiment multiple times to assess variability,
we restricted our data generation process to chromosome 12 of the human genome to
reduce computational time. We considered two settings with D = 2 and D = 3 datasets. The actual parameter values for each setting are summarized in Additional
file 3, Table S1. For both settings, jMOSAiCS and the separate analysis approach, which
identified enrichment for each individual dataset separately by MOSAiCS, are employed.
Typical output from a ChIP-seq analysis is a ranked list of enriched regions. The
length of the list can be based on a FDR cutoff, other types of type-I error rate
control or the investigators may choose to consider a certain number of high ranking
regions. We evaluated the joint and the separate analysis approaches by taking this
variation in reporting of the results into consideration. Specifically, we considered:
(i) accuracy by plotting the proportion of correctly detected enriched regions obtained
by the B variable and also correctly detected enrichments obtained by dataset-specific E variables as a function of top ranking enrichment regions; (ii) sensitivity by plotting
the proportion of the true set of enrichments that are detected as a function of the
nominal false discovery rate (the total number of detected true enrichments identified
at different FDR cutoffs divided by the total number of true enrichments are reported);
(iii) false discovery rate control by plotting the observed FDR as a function of target
nominal FDR. Ranking of regions and FDR control for jMOSAiCS relied on the posterior
inference with the B variable, which captures whether or not any given region is enriched in any of the
datasets, and the E variables, which infer whether or not the regions are enriched in specific datasets.
We generated similar variables for the separate analysis in a post hoc fashion after individual samples were analyzed with MOSAiCS.

Figure 2 summarizes these results for the D = 2 setting across 20 simulation runs (results for D = 3 are available in Additional file 3, Figure S2). This setting, on average, has 85,000 enriched regions, that is, regions
with B = 1. Figure 2(a), which displays the proportion of top ranking enriched regions that are true positives,
indicates that jMOSAiCS and the separate analysis exhibit similar accuracy for the
top 36% of the enriched regions; however, jMOSAiCS outperforms the separate approach
significantly as we go down the list of top ranking regions. The differences in performances
are significant both at the region level (B level, based on the B variable) in detecting whether or not there is any enrichment in a region in any of
the D datasets and also at the individual dataset level (E1 and E2 levels, based on the E variables). Beyond the 68% of the top enrichment regions (≥58,000), the improvement
in accuracy due to the joint analysis is about 10% at the individual dataset level.
In addition, jMOSAiCS exhibits much smaller variation in accuracy compared to the
separate analysis as the number of top ranking regions considered increases. Since
this setting had similar signal strengths for both datasets, dataset-specific accuracy
improvements over the separate analysis captured by the E1 and E2 variables are similar.

Figure 2(b) evaluates the two approaches in terms of sensitivity and illustrates that jMOSAiCS
has better sensitivity than the separate approach at every nominal FDR level. Overall,
jMOSAiCS identifies a larger number of enriched regions and captures a significantly
higher proportion of the true set of enrichments compared to the separate approach
at the same FDR level. When FDR is 0.01, the improvement in sensitivity is 9% at the
B level and more than 15% at the E level. At the same FDR cutoff, jMOSAiCS identifies more true enrichments than the
separate analysis. Next, we show how well the FDR is controlled by the two approaches
in Figure 2(c), which depicts observed FDR across 20 simulations for different levels of nominal
FDR. Overall, we observe that jMOSAiCS provides better FDR control than the separate
approach and its FDR estimates at the E level are more accurate. For the B level, we observe some overestimation of FDR by jMOSAiCS; however, this still presents
significant improvement over the separate analysis. Overall conclusions based on the
H3K9me3 simulations, which emulate data for a single epigenetic mark in two different
conditions (two different individuals), agree with those of STAT1 results and the
detailed results are provided in Additional file 3, Figure S3.

ChIP-seq experiments are often carried out with at least two biological replicates
to allow an assessment of variability. Prior research suggests that non-specific biases
such as GC content can vary significantly between biological replicates [19,22]. As a result, it is not often clear whether or not data can be pooled at the biological
replicate level for the purpose of identifying enrichment. We studied a joint analysis
strategy of multiple replicates with jMOSAiCS for a computational experiment based
on MeCP2 binding in mice. The data consisted of two biological replicates with five
and six lanes of sequencing reads, respectively. The number of usable reads within
a lane varied between 6.8 and 19.7 million. MOSAiCS provided adequate fits on each
dataset and the simulation parameters were set according to estimates from the MOSAiCS
fits. Details on the parameter settings are available in Additional file 3, Table S1. Within this simulation, we varied the sequencing depth of one of the replicates
(replicate 2) at one, three, and six lanes while keeping the other replicate at five
lanes. One- and three-lane scenarios emulate the cases where one of the replicates
has much lower sequencing depth than the other. This setting can arise in a variety
of contexts, for example, when multiple samples are multiplexed together in one lane
or when replicates are generated at different times. Figures 3, 4, and 5 summarize the results for these experiments.

Figure 3(a) illustrates that, for lower depth scenarios of replicate 2, jMOSAiCS has significantly
higher accuracy than the separate analysis at the B level when inferring whether the regions are enriched in any of the replicate datasets.
E-level comparisons of accuracy for replicate 2 (Figure 5(a)) reveal a consistent 15% difference in accuracy between jMOSAiCS and the separate
approach. When both replicates have high sequencing depths, jMOSAiCS provides a small
but significant improvement over the separate analysis (jMOSAiCS (5-6) vs. Separate
(5-6) across Figures 3(a), 4(a), and 5(a)). The differences in the sensitivities of the two approaches vary significantly
with the number of lanes of replicate 2 (Figures 3(b), 4(b), and 5(b)). Overall, jMOSAiCS consistently identifies 10% to 15% more of the true enrichments
when replicate 2 has lower depth. In Figure 4, as expected, the sensitivity of enrichment detection in replicate 1 is not affected
by the number of lanes of replicate 2 in the separate analysis. However, jMOSAiCS
also improves on this replicate as the number of lanes for the other replicate increases
by sharing information across the two replicates through the B variable. The largest improvement due to jMOSAiCS is in the detection of enriched
regions in the low depth replicate when it has only one lane of data (Figure 5(b)). In this setting, jMOSAiCS identifies 50% more of the true enrichment regions across
all the nominal FDR levels. In Figures 3(c), 4(c), and 5(c), we observe that jMOSAiCS generally has more variable but accurate FDR estimation
for both the B and E levels. When replicate 1 has five lanes and replicate 2 only one lane, FDR controls
by jMOSAiCS for the B and E2 levels are less accurate; however, the overall accuracy of jMOSAiCS is significantly
better when a fixed number of top ranking regions are considered (Figures 3(a) and 5(a)).

We also carried out a variation of this experimental setting by lowering the sequencing
depths of both of the replicates to one and three lanes. The results are reported
in Additional file 3, Figures S4, S5, and S6, and agree well with the overall conclusions reported here.

Comparison with chromHMM

chromHMM [8] is a hidden Markov model-based approach for partitioning a reference genome into
multiple chromatin states based on multiple histone modification ChIP-seq datasets.
The software accepts as input either aligned read files or enrichment/peak calls for
each dataset. When provided with the aligned reads, it partitions the genome into
200 bp intervals and assigns each interval a 1 or 0 based on a local Poisson background
distribution to depict enrichment. chromHMM aims to identify global patterns of enrichment
and hence it approximates the space of two-dimensional enrichment patterns with a
much smaller number as it is computationally prohibitive to consider the full state
space with this model. As output, it reports the specific combination of epigenomic
marks (enrichment patterns) associated with each chromatin state and the frequencies
between 0 and 1 with which they occur. We compared jMOSAiCS and chromHMM in three
settings using the data-driven experiments of STAT1 ChIP-seq data in HeLa cells. Although
these initial parameters are derived from TF ChIP-seq data, they are able to generate
ChIP-seq data with marginal density similar to those of histone data. The specific
simulation settings are as follows:

SE1: Same as the STAT1 simulation described in the earlier section.

SE2: η2 lowered from 0.9 to 0.5 to increase the number of regions with ten patterns.

One of the major differences between chromHMM and jMOSAiCS is that chromHMM models
binary enrichment indicators as the observable data whereas jMOSAiCS models the actual
read counts (Y layer). In addition, jMOSAiCS can capture all possible enrichment patterns even for
a large number of datasets (D) because the joint distribution of the enrichment variables is governed by the univariate
B variable. To investigate the effect of the binarization in chromHMM, we considered
three versions of chromHMM: (i) original chromHMM; (ii) chromHMM coupled with true
binarization; (iii) chromHMM where bin-level binarization is based on peak calling
with MOSAiCS at nominal FDR levels of 0.05 and 0.2. Detailed results for setting SE2
are provided in Figure 6. Figure 6(a) summarizes enrichment pattern identification results for the 11 and 10 patterns
based on the genome annotations obtained by jMOSAiCS and variations of chromHMM. The
results for the 01 pattern are not displayed because there are very few regions with
this pattern and they are mostly misannotated by chromHMM. Overall, these results
illustrate that jMOSAiCS outperforms four-state chromHMM in this setting. When coupled
with true binary data, chromHMM annotated all chromatin states accurately. Using peaks
called by MOSAiCS increased the accuracy compared to the original four-state chromHMM
but identified fewer correct regions in the 11 state. Figure 6(b) provides detailed comparison of jMOSAiCS with the two-state chromHMM where chromHMM
approximates the full state space of dimension 4 by only two states. A similar comparison
between jMOSAiCS and four-state chromHMM is provided in Additional file 3, Figure S7. We observe that approximating the state space of dimension of 4 by two
dimensions leads to significant loss in accuracy for chromHMM. At the individual dataset
level, the difference in accuracy between two-state chromHMM and jMOSAiCS can be as
large as 20% (comparing jMOSAiCS-E2 with chromHMM-E2 in Figure 6(b)). The results for simulation settings SE1 and SE3 are similar and provided as Additional
file 3, Figures S8 and S9.

Figure 6.Comparisons between jMOSAiCS and chromHMM based on data simulated from ChIP-seq experiment
of STAT1 in HeLa3 cells (setting SE2). (a) Identification of combinatorial patterns: 11: enriched in both samples; 10: enriched
only in sample 1. True: number of enriched regions; chromHMM: results by original
four-state chromHMM; chromHMM-true: four-state chromHMM coupled with true binary data
for the bins; chromHMM-0.05: four-state chromHMM coupled with MOSAiCS binarization
of the bins at an FDR of 0.05; chromHMM-0.2: four-state chromHMM coupled with MOSAiCS
binarization of the bins at an FDR of 0.2. (b) Accuracy of enrichment detection at the region (B) and dataset-specific region (E1 and E2) levels by jMOSAiCS and two-state chromHMM. ChIP: chromatin immunoprecipitation;
FDR: false discovery rate; FP: false positives; jMOSAiCS: joint model-based one- and
two-sample analysis and inference for ChIP-seq TP: true positives

Scalability with large numbers of datasets

We evaluated how well jMOSAiCS scales up to large numbers of datasets by extending
our simulation setting SE3 with 5b1 and 5b2. We generated D = 20 datasets and assigned each region of size 250 bp to one of the 76 states out
of 220 possible states, including the 0...0, non-enrichment state. The number of regions
assigned to each individual enrichment state ranged between 1,048 and 1,212 and, on
average, each state had 1,129 assigned regions. These experiments revealed that because
the dataset-specific background and signal read distributions are estimated separately
in the jMOSAiCS framework, it scales up easily to large numbers of datasets. Thus,
for D = 20 datasets, jMOSAiCS runs in 2 hours on a 64-bit machine with an Intel Xeon 3.0
GHz processor after the background and signal read distributions are obtained for
each dataset. Dataset-specific estimation with MOSAiCS requires 30 minutes to an hour;
however, since each dataset can be handled separately, this process can be run in
parallel using multiple CPUs.

Although jMOSAiCS can generate any number of states for large D, it is important to summarize key states for any given collection of datasets. One
key advantage of jMOSAiCS is that the model fitting can be carried about without a priori setting the maximum number of allowed states. After the model is fit, each region
is initially assigned to the state for which it has the largest posterior probability.
However, if a much smaller number of states is desired, we consider the top K states with the largest number of region assignments and reassign each region to these
K states. From the perspective of jMOSAiCS, summarizing the set of possible states with
K states is analogous to choosing the number of clusters in a clustering problem. We
have implemented a penalized average silhouette based criterion [23], which is widely used in clustering and seems to work well for our large D simulations (Additional file 3, Figure S10).

We next evaluated the accuracy and sensitivity for both jMOSAiCS and chromHMM by varying
the maximum number of allowed states, Kmax, as 30, 76, and 100 (Figures 7(a) and 7(b)). Overall, jMOSAiCS has very good accuracy and sensitivity when the number of states
is chosen optimally or overestimated. However, when the number of states is grossly
underestimated, the accuracy is comparable to those of large numbers of states for
the top set of detected enriched regions (approximately 35%) and then it rapidly deteriorates
since the small number of states simply does not capture the state of many regions.
In order to evaluate chromHMM under similar conditions, we performed chromHMM analysis
by requiring 30, 76, and 100 states and thresholded the resulting emission probabilities
to generate distinct states. This resulted in a total of 27, 63, and 85 distinct states.

Figure 7.Computational experiments for evaluating scalability of jMOSAiCS to large numbers
of datasets using data simulated from ChIP-seq experiment STAT1 in HeLa3 cells (extension
of setting SE3). (a) Accuracy of enrichment detection at the combinatorial pattern (state) level for different
numbers of states. (b) Sensitivity at varying numbers of states.

We applied jMOSAiCS to ChIP-seq data with antibodies specific to the histone modifications
H3K4me3, H3K4me1, H3K27me3, and H3K9me3 in G1E and G1E-ER4+E2 cells [6]. These data were generated as part of the mouse ENCODE project and analyzed by chromHMM
to segment the mouse erythroid genome based on chromatin modifications in [6]. The original analysis by [6] focused on segmentation of GATA1-occupied segments since G1E cells are a GATA-(null)
cell line derived from targeted disruption of GATA1 in embryonic stem cells whereas
G1E-ER4 cells are G1E cells engineered to express a conditionally active estrogen
receptor (ER) ligand binding domain fusion to GATA1 (ER-GATA1). When estradiol is
added to the culture medium (G1E-ER4+E2), the ER-GATA1 fusion protein gets activated
and binds to GATA1-specific sites. chromHMM analysis approximated a 24 = 16 dimensional state space with only six states. Our jMOSAiCS application explored
the full state space and, in addition to the six states identified by chromHMM, identified
five more states to which a significant number of GATA1-occupied segments were assigned.
Figure 8(a) enumerates the state space for jMOSAiCS and Figure 8(b) lists the number of GATA1-occupied segments for each state in the G1E and G1E-ER4+E2
cells. Overall, we observe that chromHMM captures broad dominating patterns and jMOSAiCS
improves resolution for identifying local structures. In Figure 8(c), we provide normalized read data for the 311 GATA1-occupied peaks (with width less
than 1,400 bp out of a total of 366) identified to switch from state 1101 in G1E to
state 1111 in G1E-ER4+E2. We note that the chromHMM output does not include the 1101
or the 1111 pattern and distributes these loci over the six patterns it utilizes.
However, as evidenced from the heatmaps, these GATA1-occupied segments lack the repressive
mark H3K27me3 in G1E cells and exhibit the mark upon activation of GATA1 in G1E-ER4+E2.

Figure 8.Analysis of mouse ENCODE histone ChIP-seq datasets. (a) List of combinatorial patterns identified by jMOSAiCS. Patterns 1 to 6 are also identified
by chromHMM. (b) Changes in chromatin states between G1E and G1E-ER4+E2 cells for DNA segments occupied
by GATA1 in the latter cells. (c) Heatmap of normalized raw data for a group of 311 GATA1-occupied segments identified
to switch from 1101 in G1E cells to 1111 in G1E-ER4+E2 cells by jMOSAiCS. Enriched
regions (excluding segments longer than 1,400 bp in size) identified across different
marks are aligned and depicted in between the dashed lines.

We annotated these GATA1-occupied segments with respect to gene location and identified
that a large subset of them (48%) map to the immediate 5' or 3' end, or within introns
of known genes. We studied expression profiling data from GATA1-null erythroid precursor
cells that stably express a conditionally active allele of GATA1 fused to the estrogen
receptor ligand binding domain (G1E-ER-GATA-1). Differential expression analysis of
uninduced and beta-estradiol-induced G1E-ER-GATA-1 cells [24] identified Elf1, Atp6v1e1, Cmas, Ech1, Extl3, Rab4a, Casc3, and Lrrf1p2 as significantly
induced upon GATA1 activation with beta-estradiol treatment for 24 hours (FDR adjusted
P value = 0.05). Although H3K27me3 is conventionally viewed as inhibitory to transcription,
[25] recently identified an enrichment profile of H3K27me3 in the promoter of genes associated
with active transcription. The genes we identified constitute further examples of
this class. Several of these significantly expressed genes have established functions
in stem cell biology and hematopoiesis. For example, Elf1 is an Ets transcription
factor involved in the control of hematopoiesis through participating in the transcriptional
activation of the Stem Cell Leukemia (SCL)/T-cell Acute Lymphocytic Leukemia-1 (TAL1)
gene [26,27]. We performed quantitative ChIP analysis of these four loci and validated the H3K4me1,
H3K4me3, H3K27me3, and H3K9me3 marks at these loci in beta-estradiol-induced G1E-ER-GATA-1
cells (Additional file 3, Table S2). We provide detailed read coverage plots of these regions in Additional
file 3, Figures S11 to S14 along with their chromHMM annotations to further support their
jMOSAiCS annotation.

In addition to the above direct comparison of jMOSAiCS analysis with the results of
[6] for a six-state chromHMM, we repeated the chromHMM analysis by requiring 16 states.
We cross-tabulated the numbers of GATA1-occupied segments assigned to each of the
16 states by the two methods in Additional file 3, Figures S15(a) and (b) for the G1E and G1E-ER4+E2 cell lines, respectively. Overall,
59% of regions are assigned to the same chromatin state by both methods in both cell
lines (G1E: 6774/ 11485 and G1E-ER4+E2: 6752/11485). The largest discordances between
the two methods are due to the modification of the H4K3me3 mark. Most of the regions
that are identified as unenriched for the H4K3me3 in chromHMM are identified as enriched
by jMOSAiCS. In order to quantify this further, we marginally tabulated the regions
according to their enrichment classification by both chromHMM and jMOSAiCS (Additional
file 3, Table S4). Next, we investigated the raw data (both ChIP read counts and log base
2 ratios of ChIP over scaled input read counts) within each of these classes. Additional
file 3, Figures S16(a) and (b) display boxplots of log base 2 ratios for 11 (detected as
enriched by both jMOSAiCS and chromHMM), 10 (detected as enriched only by jMOSAiCS),
01 (detected as enriched only by chromHMM), and 00 (not detected as enriched by either
method) groups within each mark across the two cell lines. For each GATA1-occupied
segment, we used the bin with the maximum log base 2 ratio to generate these plots.
Overall, we observed that regions declared as enriched by both methods showed the
most enrichment of ChIP compared to input and the regions identified as unenriched
by both methods showed the least enrichment. Although the differences between 10 (jMOSAiCS
specific) and unenriched regions did not appear as pronounced as the differences between
11 (enriched according to both methods) versus 00 (unenriched according to both methods),
the observed differences were highly significant (P values < 1 × e-10 with both the t-test and Wilcoxon rank-sum test), indicating that these regions are
enriched for H4K3me3. In addition, we observed a relatively small number of regions
identified as enriched exclusively by chromHMM. Although some of these appear to be
true false negatives for jMOSAiCS, overall, they tend to be regions with relatively
low ChIP counts. Additional file 3, Figure S16(c) displays ChIP versus normalized input read counts across all the regions,
stratified with respect to their enrichment status by jMOSAiCS and chromHMM, for the
H3K4me1 mark in G1E-ER4+E2 cells. We also evaluated how the performance of chromHMM
changed for the segments that were identified as changing from 1101 in G1E to 1111
in G1E-ER4+E2 by jMOSAiCS. Although the 16-state chromHMM produced concordant results
with jMOSAiCS for 58.5% of these GATA1-occupied loci, including the validated Atp6v1e
and Cmas loci, it still mislabeled both Elf1 and Extl3 (Additional file 3, Table S5).

Discussion

Integrative analysis of multiple ChIP-seq datasets for enumerating enrichment patterns
is an emerging need. We introduced jMOSAiCS to enable efficient one- or two-sample
integrative analysis of multiple ChIP-seq datasets. jMOSAiCS capitalizes on the dataset-specific
accurate model fits by MOSAiCS and efficient encoding of the joint distribution of
the enrichment across multiple datasets by the JAMIE approach of [18]. Diagnostics is an important component of probabilistic model-based approaches. jMOSAiCS
inherited the goodness-of-fit plots provided by MOSAiCS for model checking and diagnostics.
In contrast to some of the few available joint analysis methods for multiple ChIP-seq
data (for example, [11]), jMOSAiCS can efficiently handle multiple datasets and is accurate at both obtaining
global and local structures. A comparison of jMOSAiCS with chromHMM revealed that
jMOSAiCS is better at identifying local structures since it can capture any specific
enrichment pattern and does not rely on approximating the number of states with a
smaller number of patterns. This observation is further supported by identification
of a considerable number of GATA1-occupied segments in a different state than was
identified by chromHMM. Our computational experiments indicated that jMOSAiCS scales
up well with large numbers of datasets and it can summarize key states with a penalized
average silhouette criterion [23].

Our analyses illustrated that jMOSAiCS is powerful in analyzing biological replicates
simultaneously when it is not appropriate to pool them due to non-specific sequencing
biases such as the GC content. When one or more of the replicates is shallowly sequenced
compared to others, jMOSAiCS boosts the power for these replicates. Another particularly
attractive use for jMOSAiCS is when the TF of interest interacts with the reference
genome through another DNA binding protein. For example, virus-host interactions are
typically facilitated by virus proteins interacting with the host DNA via host proteins.
Joint analysis of ChIP-seq data for host and virus proteins has the potential to boost
power for detecting regions enriched for the virus protein (for example, [28]).

Meta-analysis of multiple samples is another integrative approach to multiple ChIP-seq
samples. However, the focus of such meta approaches (for example, MM-ChIP [29] and ChIPMeta [30]) is the analysis of ChIP (-chip or -seq) data of the same protein under similar biological
conditions but by different platforms or laboratories for the purpose of boosting
the power of peak detection. The focus in jMOSAiCS is combinatorial pattern detection
across multiple datasets (same TF in different biological conditions or different
TFs or epigenomic marks in the same biological conditions). Therefore, our computational
experiments focused on comparing jMOSAiCS with chromHMM, which is suitable for the
latter task. jMOSAiCS can handle multiple ChIP-seq datasets with varying experimental
parameters such library size and read length because the marginal distributions of
read counts in each dataset are modeled in a dataset-specific manner.

jMOSAiCS currently implements a naive Bayes model for the joint distribution of the
dataset-specific enrichment indicators. This model captures broad dependencies among
the samples via an unobserved variable. A potential improvement is to consider how
enrichment of a region in a sample depends on its enrichment in other samples. A general
way to induce such a structure is by Bayesian networks, where a directed acyclic graph
represents the dependencies. Trees, which generalize first-order Markov chains, and
mixtures of trees for which efficient structure learning algorithms exist [31] are two appealing, flexible candidates that can encode for increasingly complex dependencies.
Furthermore, they can be tailored for specific characteristics of analyzed samples,
for example, a Markov structure for time course ChIP-seq experiments.

Conclusion

jMOSAiCS facilitates joint analysis of multiple ChIP-seq datasets for both identifying
enrichment patterns of a single TF across multiple conditions and characterizing enrichment
patterns of multiple epigenomic marks in one or more conditions. Given model fits
from the peak/enrichment caller MOSAiCS, a typical jMOSAiCS run takes about 30 minutes
(2 hours) to identify combinatorial patterns of four (twenty) datasets across the
whole mouse genome with a single CPU on a 64-bit machine with an Intel Xeon 3.0 GHz
processor.

Materials and methods

Model fitting and parameter estimation in jMOSAiCS

Let f0d and f1d denote read count distributions for unenriched and enriched bins in dataset d. We will denote estimates of these by MOSAiCS with f^0d and f^1d. When a region is not enriched in dataset d, data for all the bins within that region are generated from f0d . Hence:

p0id≡Pr(Yid|Eid=0)=∏l=1Lifod(Yidl).

If region i is enriched in dataset d, then read counts for one or more consecutive bins within region i are generated from f1d. This enforces local spatial coherence and is motivated by the wide range of enriched
region widths observed in ChIP-seq data of histone modifications. Note that this kind
of spatial dependence is also capture by the chromHMM model. Let Vid denote the number of enriched bins and Sid∈{1,…,Li} the starting position of the set of enriched bins in region i. Then, we have:

where we assume that the run of enriched bins can start anywhere within the region
with equal probability of 1/Li and the length of the run has a uniform discrete distribution, that is, Pr(Sid = s | Eid = 1, Bi = 1, Vid = v) = 1/(Li - v + 1), s = 1, ..., Li - v + 1. The likelihood of full data is a product over I regions:

We estimate f0d and f1d for each individual dataset separately using the MOSAiCS algorithm. Therefore, the
quantities p0id and p1id, i = 1, ..., I, d = 1, ..., D are fixed given f^0d and f^1d. Because B, E, S, and V are unobserved variables, we derive an expectation-maximization [32] algorithm to obtain maximum likelihood estimators of τ1 and η = (η1, ..., ηd) based on the likelihood in (1). The full data log likelihood can be written as:

Lτ1,η=∑i=1I1-Bilog1-τi+Bilogτi+∑i=1I∑d=1DBi1-Eidlog1-ηd+BiEidlogηd+C,

where C is a constant that does not contain the parameters to be estimated and can be computed
given f^0d and f^1d. Taking expectation of the full data likelihood conditional on observed read counts
Y, we obtain the following E and M steps, where τt1, ηt denote parameter estimates from the tth iteration:

This EM algorithm converged within 100 iterations in both the computational experiments
and the analysis of ChIP-seq data of histone modifications used in the case study.
We used the posterior probabilities PrBi|Yi,τ^1,η^ and PrEid|Yi,τ^1,η^ for false discovery rate control with a direct posterior probability approach [33] in the computational experiments.

Computational experiments

All the computational experiments were based on the following procedure. The reference
genome (human for STAT1 and H3K9me3 or mouse for MeCP2) was divided into bins (50
bp for STAT1, 250 bp for H3K9me3, and 200 bp for MeCP2) based on average fragment
size in the actual experiment. Consecutive n∈3,5 bins were organized into non-overlapping regions to facilitate B-level data generation. For each region i, i = 1..., I, the Bi variable was set to 1 with probability τ1. If Bi = 0, then all the Eid and Zidj variables were set to 0 for that region, indicating no enrichment for all the bins
in the region across all the datasets. For regions with Bi = 1, the E variable was simulated at the dataset level, for example, Eid was set to 1 with probability ηd. The bin-level Z variables were generated based on Eid. For Eid = 1, the region i should have at least one enriched bin in dataset d. To ensure this, we selected the bin that the enrichment starts within in a region
at random and allowed the number of consecutive bins with enrichment to vary within
each region. For non-enriched bins, Zidj was set to 0 and the corresponding Y layer data (read counts) were generated from the background distribution. For enriched
bins, Zidj was set to 1 or 2 with probabilities p1 and 1 - p1, and denoted the components of the mixture distribution for the signal. Specifically,
Zidj = 1 implied that Yidj ~ Nidj + NegBin(b1, c1 /(1 + c1)), whereas Zidj = 2 referred to Yidj ~ Nidj + NegBin(b2, c2 /(1 + c2)). We generated multiple ChIP-seq datasets by varying the signal component parameters
b1, b2, c1, c2, and p1 of this procedure according to the parameters estimated from the actual ChIP-seq studies
(Additional file 3, Table S1).

Separate analysis of multiple ChIP-seq datasets and annotation of genomes into combinatorial
patterns in the computational experiments

In the separate analysis, we analyzed each dataset by MOSAiCS [19]. This allowed us to quantify the gain due to the joint modeling approach rather than
differences in modeling the read count data by different ChIP-seq analysis methods.
MOSAiCS reports bin-level posterior probabilities of enrichment (posterior probabilities
at the Z layer). For the sensitivity and empirical FDR calculations, enriched bins were identified
at the various levels of nominal FDR using a direct posterior probability approach
[33]. Then, dataset-specific E variables were set to 1 if there was at least one enriched bin in a region. Similarly,
region-specific B variables were set to 1 if at least one of the E variables for a given region was set to 1. The accuracy calculations required ranking
of regions based on the B and E variables. For this purpose, we followed a meta-analytic approach and used the maximum
of bin-level posterior probabilities of enrichment within each region for inference
at the E level and the maximum within each region across D datasets for inference at the B level. Then, these posterior probabilities were used for ranking the regions in the
accuracy plots. We also considered FDR control over these meta-analytically defined
B and E variables as an alternative to the above approach for identifying the set of enriched
regions in the separate analysis; however, this modification yielded similar results
and did not change the overall conclusions. Ranking for the joint analysis in the
accuracy plots utilized posterior inferences for the B and E variables based on the jMOSAiCS model. Accuracy as a function of the top number of
detected enriched regions required ranking of regions by chromHMM. For each region,
we summed over chromHMM estimated pattern probability times the pattern-specific emission
probability of each bin within the region and generated pattern-specific posterior
probabilities for ranking.

Comparison of chromHMM and jMOSAiCS required annotation of the genome into TF binding/chromatin
states based on the jMOSAiCS fit. We calculated the joint posterior probability of
the E variables Pr(Ei1 = r1, ..., EiD = rD | Yi, τ1, η) for each combination of r1, ..., rD, where ri = 0, 1. The enrichment pattern (or state) of each region is assigned as the one with
the maximum joint posterior probability.

We partitioned the mouse genome into 200 bp intervals and applied jMOSAiCS to data
from the G1E and G1E-ER4+E2 cells separately. Enriched regions were identified by
controlling the FDR at 0.01 through the E variable. In the downstream analysis, we focused on 11,485 GATA1-occupied segments
defined by [6] and enumerated H3K4me3, H3K4me1, H3K27me3, and H3K9me3 modification patterns of these
regions across the two cell types. The median width of the GATA1-occupied segments
was 800 bp and only 0.75% of the segments were wider than 2,000 bp.

Quantitative ChIP assay

Quantitative ChIP analysis was conducted with two independent biological replicates
of beta-estradiol-induced G1E-ER-GATA-1 cells using control and specific antibodies
as described in [34]. The relative levels of the specific histone marks are indicated in the Additional
file 3, Table S2. The PCR primers used to analyze the four loci are provided in Additional
file 3, Table S3.

Authors' contributions

SK conceived and designed the method, computational experiments, and the data analysis
and wrote the paper. XZ designed and implemented the method, computational experiments,
and the data analysis, and wrote the paper. RS and EHB performed experimental validation
of the selected targets. HL and QC contributed data. All authors read and approved
the final manuscript.

Acknowledgements

H3K9me3 ChIP-seq was performed by Henriette O'Geen in the lab of Peggy Farnham (University
of Southern California), using PBMCs provided by Emery Bresnick (University of Wisconsin,
Madison) and Swee Lay Thein (King's College London). We thank Jason Ernst PhD (Department
of Biological Chemistry, UCLA) for discussions on the chromHMM software and Weisheng
Wu (Hardison Lab, Penn State University) for information on the ChIP-seq datasets.
We also thank Sushmita Roy PhD (Department of Biostatistics and Medical Informatics,
University of Wisconsin, Madison) for her insightful comments on an earlier version
of the manuscript. This research is supported by a National Institutes of Health Grant
(HG006716) to SK.