I'm working on a differential methylation analysis on WGBS data, using the bsseq package.

I'm following the vignette to identify differentially methylated regions (DMRs) between two groups but I'm having troubles in understanding this step:

Once t-statistics have been computed, we can compute differentially methylated regions (DMRs) by thresholding the t-statistics. Here we use a cutoff of 4.6, which was chosen by looking at the quantiles of the t-statistics (for the entire genome).
> dmrs0 <- dmrFinder(BS.cancer.ex.tstat, cutoff = c(-4.6, 4.6))

So, my questions are:

How was the 4.6 cutoff for the t-statitistic chosen and what does it mean to look at the quantiles of the t-statistic? Can you provide some code that shows how the choice was made?

Would it be possible to filter by p-value instead? I would find it more intuitive but I don't see a p-value column in the results object. How would I go about adding a p-value column based on the values of the t-statistic?

The 4.6 cutoff was picked based on a quantile of the cancer dataset (using all autosomes). I have since used the 4.6 cutoff routinely for other systems. The problem I have with a quantile cutoff is that you're implicitely assuming that a certain percentage of the genome is differentially methylation, which is often an open question. In my experience 4.6 gives you good looking DMRs and decreasing this cutoff substantially just gives a lot of bad looking regions. In systems with few differences I don't find many DMRs with a 4.6 cutoff, but that is because there is not much difference between the two groups, not because the cutoff is too stringent.

The best way to do this is to pair the choice of cutoff with a permutation analysis where you permute the sample labels. For most samples sizes using WGBS this sounds a bit insane, but we have done this successfully. For example, in our paper on EBV in Genome Research we do use a permutation approach despite the fact that we only have 3 samples in each group. But in this case the signal is so strong that we for almost all the DMRs, we do not find any DMR in any permutation which is better.

Thanks for clarifying Kasper - very helpful. I will check what results I get with both the 2.5/97.5 percentiles and your empirical 4.6 cutoff.

Probably a naive question (as you might have already realised I don't have a stats background) but wouldn't you solve this problem at least in part by selecting DMRs based on their p-value adjusted for multiple testing? In that way you would not make the assumption that a certain percentage of the genome is necessarily differentially methylated and you could use a p-value threshold that, albeit still arbitrary, is easier to explain/justify in most contexts.

Back to Peter's answer below, if I were to compute p-values based on the t-statistics with pt(), how would I calculate the degrees of freedom?

The cutoff is chosen as follows: Compute the t-statistic for each CpG in your study and then form the distribution of these t-statistics. Compute the (alpha/2)-level lower and upper quantiles of this distribution; this can be done using the quantile() function in R. The choice of alpha is up to you - from memory, the numbers referred to in the vignette correspond to alpha = 0.05, i.e., the lower quantile is at the lower 2.5% of the data and the upper quantile as at the upper 97.5% of the data. This means we are selecting empirically the 5% of the most extreme t-statistics in your data. To emphasise, these numbers should be based on the distribution of t-statistics from your data.

Sure, you could filter by p-value. You can easily map the t-statistic to a p-value; see the pt() function in R. However, you will need to also compute and supply the appropriate degrees of freedom as well as accounting for the sign of the t-statistic. To me, this makes it simpler to just use the t-statistics themselves for setting the cutoff.

I hope this helps.

By the way, there's no need to notify the bsseq developers via Twitter of your question on this forum. We receive email notifications of posts tagged by packages we are involved with.

WRT to Twitter mentions: It's good to know that you get notified through this forum, but I think that tweeting about it can stimulate engagement with the wider rstats/bioinformatics community, which is the primary reason why I do that. Having said that, if you found the tweet mention annoying I apologise.

It might be worth adding some extra info to the vignette as it's not immediately obvious how the cutoff is calculated and with what percentiles. From what I can see, 4.6 does not correspond to specific percentiles on the vignette data.

And you could exclude all but about 2% of the regions under contention by using the cutoffs that are used in the vignette

> sum(ps < -4.6 | ps > 4.6)/length(ps)
[1] 0.01981683

You could use the p-value for those t-statistics, but it wouldn't be materially different than using the t-statistics directly.

The idea here is pretty simple - you just want to get a relatively small number of CpG sites that look to be differentially methylated, and then see if any of them are all together in one or more regions. How you select the CpG sites will necessarily be pretty ad hoc, and you just want to make sure you don't have too many (don't want too many false positives creeping in) or too few (you do want some results, no?). In the end you need some criterion to say if a CpG is in or out, and using the t-stat or the p-value will result in pretty much the same thing.