I have samples from two different conditions (3 biological replicates each) that I'm about to analyse for differential gene expression. I've obtained read counts over all exons for each gene, for all replicates. Now, I've always had trouble with filtering out the genes that are "not expressed (enough)" before passing the counts to DESeq or edgeR. For example, edgeR provides the function cpm (counts per million), which is commonly used to discard genes whose CPM falls below a threshold (usually set to 1 or 2).
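Since edgeR's cpm() is an R function, here is a language-neutral sketch in Python of what CPM-based filtering amounts to; the counts, library sizes, threshold, and gene names below are all invented for illustration:

```python
# Sketch of CPM-based filtering (a Python stand-in for edgeR's cpm();
# all numbers here are made up for illustration).

# Toy count matrix: rows = genes, columns = the 6 replicates.
counts = {
    "geneA": [500, 480, 510, 520, 495, 505],
    "geneB": [3, 1, 0, 2, 4, 1],
}
lib_sizes = [20_000_000] * 6  # assume 20M mapped reads per replicate

threshold = 1.0   # CPM cutoff
min_reps = 3      # must pass in at least this many replicates

kept = []
for gene, row in counts.items():
    # CPM: scale each raw count by its library size, per million reads.
    cpms = [c * 1e6 / n for c, n in zip(row, lib_sizes)]
    if sum(v >= threshold for v in cpms) >= min_reps:
        kept.append(gene)

print(kept)  # geneA passes (CPM ~25); geneB is filtered (CPM ~0-0.2)
```

Note that the length of the gene appears nowhere in this calculation, which is exactly the objection raised below.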

However, I am not convinced by this method because it doesn't take gene length into account when quantifying expression. That is, with CPM we are comparing different genes within the same sample, as opposed to comparing the same gene across samples as in a typical DGE analysis. The RNA-seq data I work with are ribosomal-RNA-depleted libraries, meaning they contain ncRNAs, snRNAs etc. in addition to mRNAs. If expression is measured as raw reads per million alone, then filtering by CPM will be biased towards genes that are simply longer. As an example, take an ncRNA with 500 reads and a transcript length of 400 base pairs versus an mRNA with 1000 reads and a transcript length of 3000 base pairs: the ncRNA clearly has more coverage per base, but that information is lost if one considers just 500 vs 1000 reads (which is what CPM does).
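To make the ncRNA vs mRNA example concrete, here are the numbers worked through in Python; the 50M library size is an assumption for illustration:

```python
# The ncRNA (500 reads, 400 bp) vs mRNA (1000 reads, 3000 bp) example,
# worked numerically. Library size of 50M reads is assumed.

def rpkm(reads, length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    return reads / ((length_bp / 1e3) * (library_size / 1e6))

lib = 50_000_000

ncRNA = rpkm(500, 400, lib)    # length-aware measure
mRNA = rpkm(1000, 3000, lib)

print(ncRNA)  # 25.0 -> higher per-base expression despite fewer reads
print(mRNA)   # ~6.67

# CPM, by contrast, sees only the raw counts and ranks the mRNA higher:
cpm_nc = 500 * 1e6 / lib    # 10.0
cpm_m = 1000 * 1e6 / lib    # 20.0
```

So the two measures rank the same pair of genes in opposite order, which is the crux of the bias being described.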

So, what I did was to compute the transcript length for each gene and calculate the RPKM for each replicate. I then retained only those genes whose RPKM was >= 1 in at least any 3 of the 6 replicates. The RPKM is computed just for filtering genes on "expression", not for the subsequent DGE analysis with DESeq or edgeR.
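A minimal Python sketch of that filter, assuming per-gene transcript lengths and per-replicate library sizes are already in hand (the counts and lengths below are invented):

```python
# Sketch of the filter described above: keep a gene only if its RPKM
# is >= 1 in at least 3 of the 6 replicates. All numbers are invented.

def rpkm(reads, length_bp, library_size):
    return reads / ((length_bp / 1e3) * (library_size / 1e6))

gene_lengths = {"geneA": 2000, "geneB": 400}          # bp, hypothetical
lib_sizes = [50e6, 48e6, 52e6, 49e6, 51e6, 50e6]      # mapped reads

counts = {
    "geneA": [150, 10, 140, 9, 160, 8],   # RPKM >= 1 in 3 replicates
    "geneB": [10, 12, 9, 11, 10, 8],      # RPKM < 1 everywhere
}

kept = []
for gene, row in counts.items():
    rpkms = [rpkm(c, gene_lengths[gene], n) for c, n in zip(row, lib_sizes)]
    if sum(v >= 1.0 for v in rpkms) >= 3:
        kept.append(gene)

print(kept)  # geneA survives; geneB is dropped
```

The RPKM values are discarded after this step; the raw counts of the surviving genes are what go into DESeq/edgeR.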

I insist on doing this because the negative binomial (NB) model tends to call a lot of genes with low read counts "significant", and I want to avoid this as much as possible.

Basically, I'd like to hear the Biostars community's thoughts on this. Is this a normal practice? I find it a better practice than, say, CPM-based filtering, but are there better approaches? Are there disadvantages to this mode of filtering (I can't seem to come up with any)?

The problem with filtering based on raw or normalized tag counts is the underlying assumption you are making about those reads: are you discarding lowly covered transcripts for a technical reason, or lowly expressed transcripts that reflect the biological condition? If it is the latter, you probably want to keep them.

One thing you can do before you filter is to generate a rarefaction curve, where you plot increasing mapped subsets of your reads against the number of transcripts with at least X tags. If you reach saturation pretty early, then I think the latter case (genuinely lowly expressed transcripts) is more likely, and I would filter based on raw tag count. If you haven't reached saturation, then low counts are probably due to low coverage, and I would filter based on normalized tag count (RPKM, DESeq, TMM, whatever normalization you want to use).
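The rarefaction idea can be sketched in Python as follows; the reads here are simulated (with real data you would subsample the BAM or count files instead), and the detection cutoff X is arbitrary:

```python
# Sketch of a rarefaction curve: subsample the mapped reads at
# increasing depths and count transcripts with at least X tags.
# The library below is simulated purely for illustration.

import random

random.seed(0)

# Simulate a library: each mapped read is a draw of a transcript id,
# with a few highly expressed transcripts plus a long tail of rare ones.
transcripts = list(range(200))
weights = [50] * 20 + [1] * 180
reads = random.choices(transcripts, weights=weights, k=100_000)

X = 5  # minimum tag count to call a transcript "detected"
results = []
for depth in (1_000, 10_000, 50_000, 100_000):
    subset = random.sample(reads, depth)
    tally = {}
    for t in subset:
        tally[t] = tally.get(t, 0) + 1
    detected = sum(1 for c in tally.values() if c >= X)
    results.append(detected)
    print(depth, detected)

# If 'detected' plateaus well before full depth, the remaining
# low-count transcripts are likely lowly expressed rather than
# under-sequenced.
```

Plotting depth against `results` gives the saturation curve described above.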

Damian, thanks for your reply. Regarding your question, my objective is to reduce the number of genes with really low counts being called DE. For example: if you were to do a glm.nb fit on reads from condition 1 = c(2,3,4) and condition 2 = c(30,22,25), and assume these reads come from a gene 2 kb long in a library sequenced to 50 million reads, is this gene really expressed? The fold change will come out around 8.6, and the p-value will be quite significant (and will therefore survive multiple-testing correction). About your suggestion, I think I understand what you mean. Let me try it out and see if it helps clarify my view; otherwise, I'll write back. Thanks again!
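For reference, the arithmetic on that example, worked through in Python (using the mean counts per condition; the exact fold change comes out around 8.6):

```python
# The worked example above: a 2 kb gene in a 50M-read library,
# counts (2, 3, 4) vs (30, 22, 25).

cond1 = [2, 3, 4]
cond2 = [30, 22, 25]
length_kb, lib_millions = 2.0, 50.0

mean1 = sum(cond1) / len(cond1)   # 3.0
mean2 = sum(cond2) / len(cond2)   # ~25.7
fold_change = mean2 / mean1       # ~8.6

rpkm1 = mean1 / (length_kb * lib_millions)   # 0.03
rpkm2 = mean2 / (length_kb * lib_millions)   # ~0.257

# Both conditions sit far below RPKM = 1, so the filter described in
# the question would drop this gene before any testing takes place.
print(round(fold_change, 1), rpkm1, round(rpkm2, 3))
```

So the proposed RPKM filter removes exactly the kind of large-fold-change, low-count call that motivated the question.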