Abstract

In the genomic setting, most data sets have a relatively small sample size (n) compared with the large number of covariates (p). For this type of data structure, it is not appropriate to fit simple linear regression models, because the estimates would have large variance and the model could overfit. Methods that restrict the number of variables included in the model are therefore necessary.

In this study, constrained best subset (CBS) and LASSO methods were applied to select covariates and detect differentially expressed (DE) genes. For comparison, we evaluated each method under two simulation settings. Under the univariate setting, all methods controlled the type I error well, and the CBS methods were more powerful than LASSO. However, LASSO gave better prediction results than the CBS methods even though it selected more false positive covariates. Under the genome-wide simulation setting, the false discovery rate (FDR) was well controlled only for the larger sample sizes (n = 50, 100). The other results followed trends similar to those in the univariate setting.

Beyond simulations, eight transcriptomic studies of post-mortem brain tissue from major depressive disorder (MDD) patients were used as a real data application to further compare the CBS2 method and LASSO. In the meta-analysis combining all eight studies, the CBS2 method detected more DE genes and more significant pathways than LASSO. Our evaluations suggest that no method performs best universally in the small-n-large-p scenario, and that the choice of method depends on sample size, dimensionality, and the desired biological purpose. From a public health perspective, using the CBS2 method in small-sample genomic settings could help detect more DE genes as well as more meaningful pathways.