First, let me apologize in advance for posting a problem which might prove to be out of the scope of this forum.

I am currently analysing data from a RNA-seq experiment in bacteria:

The experiment involves 60 RNA samples from two groups of bacterial strains: commensal (C) and disease-causing (D). Each group consists of 15 different strains, and each strain was either treated with a chemical (IND) or left untreated (CTR). I am interested in differences in expression between groups (D vs. C) and between conditions (IND vs. CTR) within each group.

Samples were stabilized with RNAlater before RNA isolation, treated with Ribo-Zero to deplete rRNA before library prep with a stranded TruSeq kit, and run on an Illumina MiSeq (50 bp single-end reads).

My pipeline involves aligning reads to the respective genome of each strain using bowtie2, counting reads mapped within CDSs using htseq-count (--stranded=reverse, minaqual=10), and using edgeR for the differential expression analysis. Furthermore, I have restricted the analysis to genes found exactly once in every strain (the "core transcriptome"), and I remove genes that do not reach 1 count per million (CPM) in at least 15 samples. Library sizes average 1,103,694 counts after TMM normalization, and the average normalization factor is 1.05.
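For reference, the filtering and normalization steps described above might look like this in edgeR. This is a sketch, not my exact script: the count matrix here is simulated as a stand-in for the real core-transcriptome counts, and the group labels are illustrative.

```r
library(edgeR)

# simulated stand-in for the core-transcriptome count matrix (genes x 60 samples)
set.seed(1)
counts <- matrix(rnbinom(2000 * 60, mu = 20, size = 10), nrow = 2000)
grp <- factor(rep(c("C.CTR", "C.IND", "D.CTR", "D.IND"), each = 15))

y <- DGEList(counts = counts, group = grp)

# keep genes with more than 1 count per million in at least 15 samples
keep <- rowSums(cpm(y) > 1) >= 15
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization; the norm.factors should hover around 1
y <- calcNormFactors(y, method = "TMM")
summary(y$samples$norm.factors)
```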

The issue becomes clear when I estimate the common, trended and tagwise GLM dispersions in edgeR, which produces unusual mean-variance and BCV plots like these:

I have tried looking at the groups and conditions separately, all three normalization methods (TMM, upper-quartile and relative log expression), changing prior.df, removing samples with low RNA quality, removing MDS outliers, removing samples with very high expression of known housekeeping genes, etc., to find an explanation. Basically, I am now scratching my head, wondering whether anyone else has seen similar dispersion trends, and whether the cause could be biological, technical, or down to the analysis itself.

As Ryan said, the design matrix is important here. I can see two obvious choices - use a one-way layout with four groups (C-IND, D-IND, C-CTR and D-CTR) where different strains are treated as replicates within each group; or use an additive model that has a blocking factor for each individual strain (assuming that the same strain is used in the control/treatment groups).
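Concretely, the two designs could be set up as follows. The factor construction here is an assumed layout for the 60 samples (30 strains, each with a CTR and an IND library); adjust to match the actual sample annotation.

```r
# assumed layout: 15 C strains + 15 D strains, each measured in CTR and IND
group     <- factor(rep(c("C", "D"), each = 30))
condition <- factor(rep(c("CTR", "IND"), times = 30))
strain    <- factor(rep(1:30, each = 2))

# Option 1: one-way layout with four groups, strains treated as replicates
grp <- factor(paste(group, condition, sep = "."))  # C.CTR, C.IND, D.CTR, D.IND
design.oneway <- model.matrix(~0 + grp)

# Option 2: additive model blocking on strain, with a common treatment effect
design.block <- model.matrix(~strain + condition)
```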

Both of these approaches involve assumptions about the pattern of expression across samples, and violations of those assumptions can lead to odd dispersion estimates. For example, the one-way layout assumes that different strains are replicates, whereas the blocking approach assumes that the treatment effect is the same for all strains. Systematic differences between strains will result in inflated dispersion estimates for both methods.

On an unrelated note, another check is to have a look at the MA plots between pairs of samples in the same group. Any systematic trends in the MA plot will not be removed by the normalization strategies you have mentioned. Preservation of systematic differences will then result in higher dispersion estimates, especially at high-abundance genes (as scaling methods remove the differences at the bulk of low-abundance genes). I see this type of BCV trend occasionally in ChIP-seq datasets, which I then resolve with non-linear normalization.
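A pairwise MA plot of this kind can be put together by hand from log-CPM values. This sketch simulates two same-group samples for illustration; in practice you would take two columns of your own count matrix.

```r
library(edgeR)

# simulated stand-in for two samples from the same group
set.seed(2)
counts <- matrix(rnbinom(2000 * 2, mu = 50, size = 5), nrow = 2000)
lcpm <- cpm(counts, log = TRUE, prior.count = 2)

M <- lcpm[, 1] - lcpm[, 2]          # log2 fold change between the pair
A <- (lcpm[, 1] + lcpm[, 2]) / 2    # average log2-abundance
plot(A, M, pch = 16, cex = 0.4)
abline(h = 0, col = "red")
lines(lowess(A, M), col = "blue")   # a systematic trend away from zero is a warning sign
```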

There's a function in the csaw package, called normalizeCounts. If you set type="loess", it'll use a modified version of fast loess normalization. The modification is designed to make it a bit more robust to spurious patterns at low counts.
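The call would look roughly like this. This is a sketch only: the normalizeCounts() interface has changed across csaw versions (it was later superseded by normOffsets()), so check the help page of your installed version before copying it.

```r
library(csaw)
library(edgeR)

# NOTE: sketch; 'y' is the filtered DGEList from earlier.
# normalizeCounts() may be called normOffsets() in newer csaw versions.
offsets <- normalizeCounts(y$counts, lib.sizes = y$samples$lib.size,
                           type = "loess")
y$offset <- offsets  # edgeR's GLM functions then use these offsets
                     # in place of the scaling normalization factors
```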

Sorry for the late reply, and thank you Aaron for explaining how these design matrices can affect the dispersion estimates. I have now tried using both the one-way layout with four groups and a nested approach (model.matrix(~group+group:strain+group:condition)), but in both cases I get the same BCV trend. I have also checked the MA plots, but it's not really obvious whether the pairs of samples display any systematic trend. However, when I use the normalizeCounts method and use GLM to estimate dispersions, I get these results (one-way, four groups):

And these for the nested design:

Being new to RNA-seq data analysis, I am still unsure what "acceptable" dispersions are for performing differential expression analysis.

You might want to try estimating dispersions with prior.df set to zero and then plotting them to see the un-squeezed genewise dispersions. This can tell you whether it's just a couple of very highly variable high-abundance genes dragging the trend upward. If that's the case, it might help to use the robust argument to estimateDisp to downweight those outliers' contribution to the trend.

(Don't actually use the dispersions from the prior.df=0 run, those are just for visualization.)
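A sketch of that diagnostic, using a simulated count matrix and design in place of the real ones:

```r
library(edgeR)

# simulated stand-in for the filtered DGEList and design matrix
set.seed(3)
counts <- matrix(rnbinom(1000 * 8, mu = 30, size = 5), nrow = 1000)
design <- model.matrix(~gl(2, 4))
y <- calcNormFactors(DGEList(counts))

# no squeezing: the plotted genewise values are the raw likelihood estimates
y0 <- estimateDisp(y, design, prior.df = 0)
plotBCV(y0)

# for the real analysis, re-estimate with the default amount of squeezing
y <- estimateDisp(y, design, robust = TRUE)
```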

The robust argument in estimateDisp is slightly misleading, as it actually doesn't affect the estimation of the mean-dispersion trend. Rather, it only affects the shrinkage of tagwise dispersions to the trend, in that likely outliers are not squeezed as strongly towards the trend. So, the actual value of the trend won't be affected by setting robust=TRUE.

That said, checking for genes with high (unshrunk) dispersions is still a good idea. If you're lucky, there might be a set of related genes (e.g., replication machinery, ribosomal components) that are driving the variability between samples. It's then a simple matter to remove the offending set prior to the DE analysis.
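One way to pull out those candidates from an un-squeezed fit, again sketched on simulated data with hypothetical gene names:

```r
library(edgeR)

# simulated stand-in for the filtered counts, with illustrative gene names
set.seed(4)
counts <- matrix(rnbinom(1000 * 8, mu = 30, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), NULL))
design <- model.matrix(~gl(2, 4))
y <- calcNormFactors(DGEList(counts))

# rank genes by their un-squeezed dispersion estimates
y0 <- estimateDisp(y, design, prior.df = 0)
top <- order(y0$tagwise.dispersion, decreasing = TRUE)[1:20]
rownames(y0)[top]   # inspect: do these form a related set (e.g., ribosomal genes)?

# if they do, drop the offending set before the DE analysis proper
y.clean <- y[-top, , keep.lib.sizes = FALSE]
```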

These trends look a lot better than what you had originally, prior to using normalizeCounts. There's no massive upward curve at high abundances, for starters. The trend itself hovers around 0.2 to 0.3, which is a bit on the high side for RNA-seq, but that's attributable to the differences between strains when they're treated as replicates. You could avoid this by setting up another experiment in which each strain/treatment combination is done twice, giving true biological replication. Right now, though, I reckon you could proceed with what you've got.