I have been using variancestabilizingtransformation for normalizing single cell RNA-seq data. I have assembled a large data set of >1500 cells (samples) that contain information on about 40,000 genes. variancestabilizingtransformation has been running for about 24 hours. Will this process finish? previously I have used variancestabilizingtransformation on a sample set of 406 cells/samples and 32000 genes, which took 6 hours. If the increase in processor time is not linear how bad is this? The DESEQ manual stated that variancestabilizingtransformation should be faster than rlog, but they test on a sample data set of 20 samples and 1000 genes.

Another question, vst does not run as a parallel process, is there a method to do this is this not possible?

Thanks for the reply. I saw the new vst() function now. It increases speed through sub-sampling. I was concerned about this for the performance as it is already an estimate and it was not clear if the sampling routine would also need to be optimized for the data set. A 1000 genes might be representative of a small data set with low diversity. How would the 1000 genes be picked based on variance?

The vst() function uses 1000 genes which are equally spaced on a grid of mean normalized count. This is sufficient for bulk RNA-seq datasets to estimate the dispersion trend line (which is all that is used by the variance stabilizing transformation).

However, depending on what input data you use, you may want to custom fit the dispersion trend line using some subset of rows you pick to be representative. By choosing say 100-1000 rows, you obtain a large speedup in performing the VST. Note that the varianceStabilizingTranformation() function itself should take no time once you've already estimated the trend line and specify blind=FALSE.

Thinking of Michael's answer above. The example in the DESeq2 manual stated that 20 samples by 1000 genes took on the order of 5 seconds. this is 20K data features or about 4K/sec. My smaller data set of 12 million features took about 4 hours to process, but if vst were linear it would take 0.8 hours. so four times longer than linear. I now have 60 million features that should take 4.1 hours if linear.

The dispersion estimation routine may not be optimized for the intercept-only case when blind=TRUE. I could potentially get some speed up there, but I've also got a number of other ideas for transformations for new kinds of experimental data that I'd like to work on.

When blind=FALSE, you need to form the design matrix and perform matrix multiplication to calculate Cox-Reid dispersion estimates, so you should not expect time to scale linearly in that case.

variancestabilizingtransformation() crashed and produced no object returne4d. I am trying with vst(). if I can find the time I will try some different sizes and setting to determine what the cost is to increase the sample and gene dimensions, as it could be good to know.