Sequencing Quality Control, statistics and bioinformatics blog

Main Menu

Your ChIP in “50 shades of grey”!

Since the movie ’50 shades of Grey’ is about to be released I thought this is the perfect opportunity to introduce everybody to the concept of “grey lists” and the recent R package developed by Gordon Brown at my institute: The GreyListChIP R package!

Blacklists

ChIP-seq and many other NextGen sequencing experiments (e.g. MNase-seq, DNase-seq, FAIRE-seq) often produce artifact signal in certain regions of the genome. These so called blacklisted regions are often found at repeat elements (such as satellite, centromeric and telomeric repeats), and show unstructured and high signal (excessive pile up of reads) independently of cell line and experiment type. The ENCODE project generated two sets of human blacklists (the DAC and DUKE regions, see here: http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability).

Blacklisted regions are known to present problems for fragment length estimation and signal normalisation between samples. Although often found at repeat elements, reads typically map uniquely to these regions and so simple mappability-based filters do not remove them. Therefore the exclusion of these regions of artifact signal (aka blacklisting) is highly advised (see Hoffman et al., 2013; Kundaje, 2013; and Carroll et al., 2014).

Greylists

However, even after blacklisting, Gord has noted that many cell lines and tumour samples show anomalous high signal in the input samples that are also replicated in the correspondent ChIP samples, and importantly, these are cell-type specific. He called these lists of high signal grey lists, to distinguish them from ENCODE’s black lists for two reasons:

Firstly, because they represent regions of high artifact signals that are specific to your cell-type or sample

Secondly, and quoting Gord directly on the package vignette: “Because they can be tuned depending on the stringency required, and so that we can make jokes about having 50 shades of grey lists”.

Greylisting is done in your input samples. You can use the R package GreyListChIP to identify these regions of artifact signal.

Creating custom greylists with GreyListChIP

Let’s put it to use:

In a previous post, we created the merged input files for MCF7 and A549 cell lines (a breast and a lung cancer cell line, respectively). Here is how one can compute the greylists using GreyListChIP R package:

What does GreyListChIP do and how does it work?

Essentially, what GreyListChIP is doing is estimating a threshold on the read counts by which one should label a particular region of the genome has having “artifact” signal. Firstly, it fits a negative binomial distribution to the input sample and estimates the mu and size parameters. Then, given a p-value cutoff it computes the read count threshold.

In this example, I just used the package defaults to create my greylists. I used the function greyListBS, which is a wrapper function for several steps of the implemented process for greylist construction. But, there are several parameters that you may want to use differently depending on the stringency criteria that you which to apply, so just have a look at the package vignette and the reference manual to explore it further.

Greylists vs. Blacklists – genome coverage

How cell-type specific are these generated greylists? And how different are they from the blacklists?

To compare between multiple BED files I’ll use the multiIntersectBed tool in BEDtools.

#This has to be run using BEDtools (outside R)
multiIntersectBed -i blacklist.ChIPQC.bed MCF7_input_grey.bed A549_input_grey.bed > intersect.bed

The output (intersect.bed) reports the presence/absence of each file at each interval that is covered by any one of the three files. Now we can use this file to do a venn diagram in R. I have made the following code below that works for this file in particular:

2 responses to “Your ChIP in “50 shades of grey”!”

Reblogged this on Scientific B-sides and commented:
Interested in quality control of sequencing data? You should be! And a new blog will answer all the questions you were too afraid to ask. Here is an example: