I have a RNA-seq dataset with two treatments A and B (+/+, +/-, -/+, -/- design) and a batch effect. Both environments are equally represented in all batches. For differential expression analysis, I have added this batch effect as an element in the design matrix in EdgeR and it works perfect.

I also want to make a clustered heatmap from the genes that were found to be differentially expressed. When I do this, the batch effect is visible and samples form the same batch group together (but there is a stronger grouping from treatment A). I wanted to remove this batch effect and re-do the clustering and heat map generation. For this purpose, I tried the limma package function removeBatchEffect() and a few other similar tools.

Here is what I have done:

(1) Defined three experimental factors: A, B and batch.
(2) Made a design matrix that only includes A and B, but not batch.
(3) Put raw, non-normalized counts (non-integer expected counts output from RSEM) in a DGEList.
(4) Used TMM-normalization.
(5) Converted to log2 counts per million.
(6) Ran removeBatchEffect() on the result of (5) and specified the design matrix without batch and the batch factor.

The results look absolutely fantastic. Samples from the same batch no longer cluster at all, yet the basic appearance of the heatmap is the same in term of treatment effects.

Previous attempts at this some months ago produced horrible results (no heatmap patterning at all, not even small regions), presumably because I made mistakes, wrote wrong things etc. But when things goes from horrible to absolutely fantastic from an afternoon's work, I instantly become suspicious that it is too good to be true and that I have made some other error that makes it all invalid somehow. Sure, it is a bit paranoid perhaps, but better safe than sorry in these relatively complicated scientific matters.

So my two questions are:

(1) Is the usage above, both the direct application of removeBatchEffect() and the prior treatment of the data appropriate for the stated goal of making a "batch-free" clustered heatmap? Any red flags that pop up? I have browsed through several previous posts and looked at the manual for this function before asking.

(2) Is it crucial to include a length normalization before making a clustered heatmap for these kind of RNA-seq-based differential expression figures? Obviously, the gene length is exactly the same for all samples for a given gene, so it will not bias within-gene between-sample comparisons. What would I lose if i skipped it? Would it affect any technical aspect (such as gene hierarchical clustering) or what kind of conclusions that can be drawn from the heatmap? If I wanted to include it, is it enough to switch from cpm() to rpkm() and add a vector with gene lengths?

Looks fine to me. I would suggest that you mean-center your expression values before calling heatmap (i.e., by subtracting rowMeans(logCPM_no_batch) from logCPM_no_batch). This ensures that your colours aren't all-red or all-blue for highly or lowly expressed genes, respectively. Rather, the heatmap will focus on differences in relative expression within each gene. We also tend to use larger prior.count for visualisation, just to avoid being the range of colours being dominated by extreme log-fold differences due to zeroes; but that's a nitpicky point, so you'd probably be okay.

If you do the mean-centering like I suggested above, then length normalization won't matter because it'll just cancel out within each gene. By and large, most downstream analyses will compare within genes, so you needn't worry about length normalization in most situations. But if you did, then rpkm would be the way to go (provided you use a vector of gene lengths matching the annotation used for counting).

In the part of the custom code I plan to use for making a heat map (I added heatmap() for simplicity), I use the following code:

data <- t(scale(t(logCPM_no_batch), scale=F))

Does this accomplish mean-centering in the way you propose?

For the purposes of this heat map, it will just be an illustration in a paper. No downstream analysis will be done using the heatmap as such, but instead be done on the edgeR differential expression analysis. Am I understanding you correctly that it will then not really matter if I skip length normalization for the purposes of generating a heatmap?

Thank you for your valuable reply, particularly with regards to prior count and rpkm().

If I

data <- t(scale(t(logCPM_no_batch), scale=T))

instead of scale=F and set scale="none" in heatmap() would this produce a heat map that had both mean and sd scaled, thus being more in line with what is usually done? This is how I interpret the scale() documentation, but just want to make sure I am not going off the rails here.

The reason I ask is that if i comment out the data <- t(scale(t(logCPM_no_batch), scale=T)) and use scale defaults in heatmap, heatmap() complains that "`x' must be a numeric matrix".

It isn't possible for the code given in your post to give the error message you mention from heatmap() because removeBatchEffect() always produces a numeric matrix. You already told us in your original post that the code worked.

It would seem that you may have introduced a problem somewhere with other code that you haven't mentioned here. I suggest you solve that problem rather than using an ad hoc fix with scale().

I'm conflicted on whether or not to post as a new question, but since this is a directly related follow up I'll ask here.

Recently I wanted to present the results of how accounting for batch in a differential expression analysis from RNA-seq will impact the downstream results aside from just providing two topTable results (one each from a design with and without a batch covariate) for comparison, so I also wanted to show before / after heatmaps of raw / corrected data.

I initially provided these results using the suggested workflow here, which is to provide the results from cpm(dgelist, log=TRUE, prior.count=5) to removeBatchEffect to get your "batch corrected data".

Skimming through the removeBatchEffect documentation, however, I realized that it explicitly calls out that the function will utilize weights, if available, in estimation of the batch effects.

This made me think about voom, but passing in a voomed EList dictates that the prior.count=0.5

At the risk of beating a dead horse here, my question is what's the official party line here? What would the source of the weights be that removeBatchEffect is referencing if not from voom? I realize that the documentation in removeBatchEffect that points to weights may have been there since before voom, so perhaps it's another function that shoots weights along with the input object x, but not sure where that might come from.

There's also other weights like the array weights from, well, the arrayWeights function. Obviously, removeBatchEffects is not restricted to RNA-seq data, so the use of weights is more general than just those from voom. My view is that I only ever use removeBatchEffects for visualization, so I don't bother worrying about weights to improve accuracy - the human eye can't distinguish between "very blue" and "very very blue" anyway. However, I do worry when I get a solid red/blue block because of extreme log-fold changes from low counts. Hence, the use of a large prior count in cpm.

I had initially thought that it could be referencing weights fom arrayWeights, but arrayWeights doesn't "load" those results into the EList (presumably what you pass into removeBatchEffect as x), but rather returns a vector of weights, so wasn't sure if that was it.

But I now realize that you can just use the dots ... and pass in a weights argument which then passes them down to lmFit which then does the job, ie.

To address your point about extreme log fold changes blowing out the signal in your heatmap, I've found the circlize::colorRamp2 (which is what ComplexHeatmap uses by default to generate its colors) awesome here. Values outside of the specified breaks are simply set to the break value/color for plotting purposes, which is nice.

just add something potentially useful to you and maybe others: removeBatchEffect() basically substracts estimated the batch coefficients/means from the expression data.

I implemented the same strategy a while ago using simulated data and DESeq2, detailing the the steps involved. For what it's worth, here is the code, I use a PCA instead of a heatmap to visualize the differences, but it should give you an idea on how it works: