I'd like to pose a rather complicated question regarding a DE analysis I'm trying to do using DESeq2.

In brief, I have a dataset consisting of gene counts of d.melanogaster sampled at several time points and ranging all different development stages, from embryo to mature adult. Each stage has been sampled at uneven timepoints according to the organism development scheme (e.g.: every 2hrs for the embryo, at every larval stage for the larvae). For each timepoints I have a different number of replicates, ranging from 4 to 6.

I've been asked to find a number of significant DE genes between each stage (e.g. larva vs embryo, pupae vs larva and so on). I've first modeled my dataset by keeping into account only the stage when creating the DE object like this (an example of my design table is at the end of the post):

(design=~stage)

Using this approach and following the guidelines described in the tutorial, correcting with BH with an FDR=1%, I obtain a very high number of significant DE genes in each comparison (on average ~3K). This number is quite high, so I tried to rebuild my colData accounting for the peculiar "noise" of the time points, considering that many genes during a stage may turn on and off intermittently, though exhibiting a very high stage-specific variance.

I therefore first tried to model the DESeq object naming each timepoint and considering a model like

design = ~stage + timepoints + stage:timepoints,

then reducing the object for the timepoints and then applying the LRT test, as also described in the Bioconductor vignette. With this approach however the situation gets even worse, with the number of significant DE genes raising consistently (10K on average).

As final effort, I then decided to consider each timepoints separately by creating a "fake" column in my colData by pasting the stage and the timepoints, column so

and then try to model the gene counts in DEseq. At this point though, before computing the results(), I'd like to apply a contrasts matrix externally that considers the stages only, and then test for DE expression only between stages. Problem is, I don't know if this is possible and even more, if this procedure is correct. Another alternate approach I thought was to use sva or other packages to remove the "timepoints batch effect" in each stage, however I still haven't tried this. and I'd rather prefer to ask you if you had performed similar tests and what are the recommendations for this peculiar cases.

Here's a brief example of how my colData table is built,

<caption>

sample

stage

timepoint

replicate

paste

embryo0.2h.rep1

embryo

0-2hrs

1

embryo-0-2hrs

embryo0.2h.rep2

embryo

0-2hrs

2

embryo-0-2hrs

embryo2.4h.rep1

embryo

2-4hrs

1

embryo-2-4hrs

embryo2.4h.rep2

embryo

2-4hrs

2

embryo-2-4hrs

L1.rep1

larvae

L1

1

larvae-L1

L1.rep2

larvae

L1

2

larvae

I hope that this is clear enough, otherwise I'll try to be more intelligible. Of course if you need more code, or more info regarding the experiment, feel free to ask and I'll share what I can.

Have you made a PCA plot? I would imagine differential gene expression between stages is very high.

If you wanted to focus on the very large changes, instead of changing the design (so, just sticking to ~stage), you could look at the MA plot and choose a threshold to pull out only very large fold changes with lfcThreshold argument of results().

There seem to be a lot of modeling choices to think about seeing the different colors in the PCA plot (non-DESeq2 issues), but if you look at the MA plot, and you want to test for large changes, I don't see why the list of DE genes would not be reduced by choosing large lfcThreshold. You can just pick a log2 value at which you clearly separate small from large LFC.

Indeed, the modelling choices are many, as well as the biological assumptions. Here's an example of an MA plot of two stadium considering the LFC thrshold of 2 and showing fold change limits up to +/- 10:

so at 4 I loose a great majority of all the up-regulated genes and keep still a consistent number of under-expressed genes, which is quite strange, considering i was expecting "somewhat balanced" number of up and down DE genes.

Ah, I get it now. The numbers *are* reduced, but you are expecting less DE genes for some reason. I think the problem is just the expectation that there would only be, say, a couple hundred genes. I wouldn't go to far restricting the LFC to 4, but instead discuss with the collaborator.

I'm not surprised that there are thousands of genes which turn on when you compare e.g. embryo and larvae, or larvae and pupae.

I would meet with the biological collaborators and show them the MA plots and ask what effect size they are interested in, e.g. doubling (LFC=1), quadrupling (LFC=2), etc. This choice can be informed by the MA plots.