Detecting differential binding in ChIP-seq data with csaw

Motivation

I'll assume that everyone here knows what ChIP-seq is - but if not, then you can look at this cute picture.

Why look for differential binding (DB) between biological conditions?

The null hypothesis is easy to define, i.e., is binding equal between libraries? No need for a background model or negative controls.

Results may be more relevant, as DB sites might contribute to biological differences.

Why look for de novo sites?

Minimize assumptions about where the protein binds.

Discover new things!

Why use windows?

Because they're simple.

Improve spatial resolution for complex events.

Overall workflow

It's a fairly modular process:

Counting reads into windows

Computing normalization factors

Filtering out low-abundance windows

Testing for differential binding

Aggregating windows into clusters

Visualization, annotation and other stuff

But before we get into that, we have to set up some data.
We'll be using the libraries from the Tiwari et al. (2012) study.
This compares binding of the NF-YA transcription factor in embryonic stem cells (ESCs) and terminal neurons (TNs).

These are BAM files, so we've already mapped the reads to the mm10 build of the mouse genome.
For those who are interested, this was done using subread, with some help from SAMtools and Picard.
The full processing pipeline can be found at the location below.

system.file("doc", "sra2bam.sh", package="csaw")

## [1] "/home/ubuntu/R-libs/csaw/doc/sra2bam.sh"

Counting reads into windows

The first step is to count reads into windows.
Reads are directionally extended to the average fragment length, to represent the original fragments that were present during immunoprecipitation.
Windows of constant size (10 bp) are tiled across the genome at regular intervals (50 bp).

frag.len <- 110
window.width <- 10
spacing <- 50

The number of imputed fragments overlapping each window is then counted in each library.
We use the windowCounts function to do the actual counting.
We filter out low-quality alignments (mapping quality score less than 50) and we ignore everything except the standard chromosome set.

This gives us a RangedSummarizedExperiment object that we can play around with.
Check out rowRanges(data), for the genomic coordinates of the windows; assay(data), for the counts for each window; and colData(data), for library-specific information.
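As a sketch, the counting and inspection steps might look as follows; `bam.files` is assumed to be a character vector of BAM file paths, and `param` a readParam object as described below.

```r
# Count imputed fragments into sliding windows across the genome.
# 'bam.files' and 'param' are assumed to be defined as in the rest
# of this tutorial.
data <- windowCounts(bam.files, ext=frag.len, width=window.width,
                     spacing=spacing, param=param)
rowRanges(data)   # genomic coordinates of each window
head(assay(data)) # count matrix, one column per library
colData(data)     # totals and other library-specific information
```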

The idea is to define a single readParam object for use throughout the analysis, to ensure the same reads are being extracted each time (with some exceptions, e.g., correlateReads).
For example, look at the one that we previously constructed.
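A readParam object along those lines might be constructed as shown below; the exact chromosome set is an assumption based on the mm10 build.

```r
# Discard alignments with mapping quality below 50, and restrict
# counting to the standard mouse chromosomes (assumed set for mm10).
param <- readParam(minq=50, restrict=paste0("chr", c(1:19, "X", "Y")))
param
```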

Computing normalization factors

There are several different normalization options in csaw, but we'll just focus on removing composition bias.
This is introduced when DB regions take up a bigger slice of the sequencing pie in one library, suppressing the coverage of everything else (e.g., non-DB sites, background regions).
We don't want to detect differences due to this suppression, so we normalize based on the assumption that most of the genome is non-DB background.
First, we count reads into large 10 kbp bins to quantify the coverage of the background regions.

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)

Any systematic difference between libraries across these background bins must be artifactual and should be eliminated by scaling normalization.
The normalize function is an S4 method that wraps the normalizeCounts function, which in turn wraps the calcNormFactors function from the edgeR package.
The actual normalization is performed using the trimmed mean of M-values (TMM) method.

normfacs <- normalize(binned)
normfacs

## [1] 1.006 0.973 1.015 1.007

We can have a look at these factors on MA plots.
The code below compares the binned counts for libraries 1 and 2, i.e., the two ES replicates.
Note the big blob, which corresponds mostly to the background regions of the genome.
The location of the red line represents the log-scaling factor used to normalize one library against the other.
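A minimal version of that MA plot could be generated as follows, using edgeR's cpm function on the binned counts; the plotting details are illustrative rather than prescribed.

```r
library(edgeR)
# Compute log-CPM values for the binned counts.
adj.counts <- cpm(asDGEList(binned), log=TRUE)
cur.x <- adj.counts[,1] # first ES replicate
cur.y <- adj.counts[,2] # second ES replicate
smoothScatter((cur.x + cur.y)/2, cur.x - cur.y,
              xlab="A", ylab="M", main="Library 1 vs 2")
# The red line marks the log-ratio of the normalization factors.
abline(h=log2(normfacs[1]/normfacs[2]), col="red")
```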

Interesting questions

Why do we use bins instead of windows?

Does the bin size affect the normalization factor estimates?

Advanced usage

Another source of bias comes from differences in immunoprecipitation efficiency between libraries.
If such differences are present, we wouldn't want to detect them because they're not biologically interesting.
Thus, we can normalize by eliminating any systematic difference in abundance across high-abundance windows (i.e., bound sites).
This involves filtering (see next section) and normalizing on the remainders.

keep <- aveLogCPM(asDGEList(data)) > -1
normalize(data[keep,])

## [1] 0.610 0.799 1.533 1.338

Note: we can only pick one set of normalization factors - either to remove composition bias, or to remove efficiency bias.
The choice depends on the biological context of the experiment.
Specifically, do you expect wholesale changes in binding between conditions?

If not, then systematic changes at binding sites are likely to be artifactual and should be removed as efficiency bias.

If yes, then systematic changes are likely to be genuine and should not be removed (i.e., normalize for composition bias).

In this case, we'll go for composition bias, as there's no reason that NF-YA binding should be constant between cell types.

Trended biases may also be observed as trends in the MA plots.
These cannot be eliminated by scaling methods like TMM, but require the use of non-linear methods instead.
In csaw, this can be achieved by setting type="loess" in the call to normalize.
This performs loess-based normalization to remove the trend, and produces an offset matrix for use in later model fitting.

Filtering away low-abundance windows

Filtering is done using the average abundance of each window, as computed using aveLogCPM in edgeR.
This can be interpreted as the log-(average count)-per-million for each window.
In our statistical model, the average abundance is roughly independent of the DB status of each window, so we avoid data snooping when we select for high-abundance windows.

abundance <- aveLogCPM(asDGEList(data))

We then keep the windows that are considered to be high-abundance, according to some threshold.
The RangedSummarizedExperiment object can be easily subsetted to achieve this (we'll hold onto the original, just in case).
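For instance, the subsetting might look like the code below; the -1 cutoff matches the value used earlier in the normalization section, but is otherwise an arbitrary example.

```r
# Keep windows whose average abundance exceeds the chosen threshold;
# the -1 cutoff is an example value, not a prescribed one.
keep <- abundance > -1
original <- data  # hold onto the unfiltered object, just in case
data <- data[keep,]
summary(keep)
```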

A very large number of low-abundance windows is problematic, as these windows will mess up some of edgeR's statistical methods and assumptions.
Filtering gets rid of most of these windows and their associated problems, while also reducing computational work and the severity of the multiple testing correction.

We then use the filterWindows function to process these into enrichment values, for each window over its local background.
We keep those windows that have a 3-fold or higher increase in abundance over their neighbours.
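The local enrichment calculation might be sketched like so; the 2 kbp neighbourhood width is an assumption, chosen here for illustration.

```r
# Count reads into 2 kbp neighbourhoods centred on each window.
surrounds <- 2000
neighbours <- suppressWarnings(resize(rowRanges(data), surrounds,
    fix="center"))
wider <- regionCounts(bam.files, regions=neighbours, ext=frag.len,
    param=param)
# Compute the log2-enrichment of each window over its local
# background, and keep windows at least 3-fold enriched.
filter.stat <- filterWindows(data, wider, type="local")
keep <- filter.stat$filter > log2(3)
filtered.data <- data[keep,]
```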

We use the negative binomial (NB) framework in the edgeR package.
This accounts for low, discrete counts that exhibit overdispersion between biological replicates.
Specifically, variability between replicates is modelled using the NB dispersion parameter, as estimated through the estimateDisp function.
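Putting that together for this two-group comparison might look like the code below; the ordering of the ESC and TN libraries is an assumption about this particular dataset.

```r
library(edgeR)
# Two ESC libraries followed by two TN libraries (assumed ordering).
cell.type <- factor(c("esc", "esc", "tn", "tn"))
design <- model.matrix(~cell.type)
colnames(design) <- c("intercept", "cell.type")
# Convert to a DGEList, attach the normalization factors, and
# estimate the NB dispersions.
y <- asDGEList(data, norm.factors=normfacs)
y <- estimateDisp(y, design)
```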

We can augment this model with quasi-likelihood (QL) methods.
The QL dispersions account for window-specific variability, while the NB dispersions model biological variability between replicates.
We fit a generalized linear model (GLM) to the counts for each window, where the QL dispersion for that window is estimated from the GLM deviance.

fit <- glmQLFit(y, design, robust=TRUE)

Finally, we wrap things up by using the QL F-test to test for DB between our two cell types.
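Assuming the design matrix sketched earlier, the test might be run as follows; the coefficient name depends on how the design was constructed.

```r
# Test the cell type coefficient for differential binding.
results <- glmQLFTest(fit, coef="cell.type")
head(results$table)
```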

Advanced usage

This isn't really that advanced, but we can plot the NB dispersions with the plotBCV command.
In the QL framework, we focus on the trend as only the trended NB dispersions will be used later.
We usually see a decreasing trend with abundance, possibly plateauing off at very large abundances.

plotBCV(y)

Similarly, for the QL dispersions, we can use the plotQLDisp function.
This shows the effect of empirical Bayes shrinkage on the estimates, when the raw values are squeezed towards the trend.
We use the shrunken estimates for all downstream processing.
The idea is to reduce uncertainty in the presence of limited replicates, to improve detection power for DB testing.

plotQLDisp(fit)

If you have more complicated experiments (e.g., multiple groups, blocking factors), you might have to think more carefully about the design matrix and the contrasts.
Check out the edgeR user's guide for how to deal with them.

Aggregating windows into clusters

To correct for multiple testing across thousands of windows, we can think about controlling the false discovery rate (FDR).
However, controlling the FDR across all windows isn't that useful, as few people would interpret ChIP-seq results in terms of windows.
Instead, we want to control the FDR across all genomic regions, i.e., the region-level FDR.

To do that, we can cluster adjacent windows into genomic regions using the mergeWindows function.
Windows that are less than tol apart get thrown into the same cluster.
Merging of adjacent or overlapping windows reduces redundancy and simplifies interpretation of the results.
Each cluster now represents a distinct genomic region.
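A sketch of the clustering call; the 1 kbp tolerance is a typical choice rather than a prescribed value.

```r
# Merge windows less than 1 kbp apart into clusters.
merged <- mergeWindows(rowRanges(data), tol=1000)
merged$region # one GRanges entry per cluster
```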

We can combine p-values for all windows in each cluster using Simes' method.
This computes a single combined p-value for each cluster/region, representing the evidence against the global null.
We also get some other statistics, like the total number of windows in each cluster as well as the number that are changing substantially in each direction.
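The combining step itself is a single call, assuming `merged` and `results` from the earlier steps:

```r
# Combine window-level p-values into one p-value per cluster,
# using Simes' method.
tabcom <- combineTests(merged$id, results$table)
head(tabcom)
```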

Advanced usage

We can use many different clustering strategies, so long as they are blind to the DB status of the windows.
One strategy that is particularly useful is to cluster windows according to the promoters in which they are located.
This allows for direct coordination with, say, a differential expression analysis.

It's then a simple matter to combine the p-values, as we did before.
This is done using a wrapper function that directly accepts a Hits object.
We end up with a table, with one row for each entry of prom.
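Under the assumption that promoters are taken from the mm10 knownGene annotation, the overlap-based approach might be sketched as below; the flanking distances are illustrative.

```r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
# Define promoter regions (flanking distances are illustrative).
prom <- suppressWarnings(promoters(TxDb.Mmusculus.UCSC.mm10.knownGene,
    upstream=3000, downstream=1000))
# Match windows to promoters, then combine p-values per promoter.
olap <- findOverlaps(prom, rowRanges(data))
tabprom <- combineOverlaps(olap, results$table)
head(tabprom[!is.na(tabprom$PValue),])
```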

This means we can make a direct statement regarding DB in a promoter region.
Note that the NAs represent those entries of prom that have no overlapping windows; they are not shown here because they're not particularly exciting.

Other stuff

Simple gene-based annotation can be added to the regions using the detailRanges function.
This identifies annotated genomic features (exons, promoters, introns) that are overlapped by each region, as well as those flanking each region (plus the gap distance between each flanking feature and the region).
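For mouse data, the annotation call might look like this, using the standard mm10 annotation packages:

```r
library(org.Mm.eg.db)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
# Annotate each region with overlapping and flanking features.
anno <- detailRanges(merged$region, orgdb=org.Mm.eg.db,
    txdb=TxDb.Mmusculus.UCSC.mm10.knownGene)
head(anno$overlap) # features overlapped by each region
```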

Some explanation may be in order here.
Each string represents a comma-separated list, where each entry is of the form SYMBOL|EXONS|STRAND, i.e., the symbol of the relevant gene, the exons that are overlapped (I for introns, and 0 for promoters) and the strand of the gene.
For flanking regions, an additional [DISTANCE] field is present that indicates the distance between the annotated feature and each region.

head(anno$left)

## [1] "" "" "" ""
## [5] "" "Vcpip1|1|-[19]"

We can now put together a table to save to disk, for other people to read and whatnot.
Of course, you can save to other formats if you wish, using packages like rtracklayer.
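One way to assemble and save such a table; the column selection and file name are illustrative.

```r
# Combine coordinates, combined statistics and annotation, then
# write everything to a gzipped tab-separated file.
out <- data.frame(as.data.frame(merged$region)[,1:3],
    tabcom, anno=anno$overlap)
ofile <- gzfile("results.tsv.gz", open="w")
write.table(out, file=ofile, sep="\t", quote=FALSE, row.names=FALSE)
close(ofile)
```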

Simple visualization can be performed using the Gviz package.
We extract reads from the BAM file using the extractReads function, and we plot coverage in reads-per-million.
This is done separately for each strand in blue and red.
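A minimal sketch of such a plot, ignoring the strand separation for simplicity; the chosen region is purely illustrative and should be replaced with a DB region of interest.

```r
library(Gviz)
# An illustrative region; swap in a real DB region of interest.
cur.region <- GRanges("chr18", IRanges(77806807, 77807165))
collected <- list()
for (i in seq_along(bam.files)) {
    reads <- extractReads(bam.files[i], cur.region, param=param)
    # Coverage in reads-per-million for this library.
    cov <- as(coverage(reads)/data$totals[i]*1e6, "GRanges")
    collected[[i]] <- DataTrack(cov, type="histogram",
        name=basename(bam.files[i]))
}
plotTracks(c(GenomeAxisTrack(), collected),
    from=start(cur.region), to=end(cur.region))
```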