Bottom Line:
We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype.

btu674-F5: Evaluation of significance measures of associations between variables and their PCs by comparing true P values and the Uniform(0,1) distribution. (a) The conventional F test results in anti-conservative P values, as demonstrated by P values being skewed towards 0. (b) The proposed method produces P values distributed Uniform(0,1). The dashed line shows the Uniform(0,1) density function.

Mentions:
For a given simulated dataset, we tested for associations between the observed variables and the latent variables by forming association statistics between the observed variables and their collective PC (r = 1). We calculated P values using both the conventional F test and the proposed method with s = 50 synthetic variables (Fig. 5). Over 500 simulated datasets, the conventional F test resulted in 500 one-sided KS P values that exhibit a strong anti-conservative bias with a double KS P value of (Supplementary Fig. S4, black points). Conversely, the proposed method correctly calculates P values by accounting for the over-fitted measurement error in PCA, with a double KS P value of 0.502 (Supplementary Fig. S4, orange points). In addition, a comparison of estimated versus true FDR demonstrates an appropriate adjustment for over-fitting in the jackstraw method (Supplementary Fig. S5). Note that the classification of P values is based on the true association status from the population-level data generating distribution from model (1), not on model (2) or on the observed loadings from the PCA.
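The resampling idea behind this comparison can be sketched in a few lines. What follows is a minimal illustration and not the authors' implementation: the function names (`top_pcs`, `f_stats`, `jackstraw_pvalues`), the number of resampling iterations `B`, the empirical P value formula, and the toy simulation settings are all assumptions made for this example. Only r = 1 and the use of s synthetic (permuted) variables come from the text. In each iteration, s rows are permuted to break any association with the latent variable, PCA is recomputed on the modified data, and the F statistics of the permuted rows are collected as an empirical null that already includes the over-fitting of PCA.

```python
import numpy as np


def top_pcs(Y, r):
    """Top r principal components (r x n) of the row-centered m x n matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r]


def f_stats(Y, pcs):
    """F statistic of each row of Y regressed on the PCs (intercept included).

    The PCs are orthonormal and, because they come from row-centered data,
    orthogonal to the constant vector, so the fit is a simple projection.
    """
    n = Y.shape[1]
    r = pcs.shape[0]
    Yc = Y - Y.mean(axis=1, keepdims=True)
    rss0 = (Yc ** 2).sum(axis=1)             # intercept-only residuals
    fitted = (Yc @ pcs.T) @ pcs
    rss1 = ((Yc - fitted) ** 2).sum(axis=1)  # residuals after adding the PCs
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))


def jackstraw_pvalues(Y, r=1, s=50, B=100, seed=0):
    """Empirical P values for association between each row of Y and the top r PCs."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))
    null = []
    for _ in range(B):
        Ystar = Y.copy()
        rows = rng.choice(m, size=s, replace=False)
        # Permute each selected row independently to create synthetic null variables.
        Ystar[rows] = rng.permuted(Y[rows], axis=1)
        # PCA is recomputed on the modified data, so the null statistics
        # carry the same over-fitting as the observed ones.
        null.append(f_stats(Ystar[rows], top_pcs(Ystar, r)))
    null = np.sort(np.concatenate(null))
    # Fraction of null statistics at least as large as each observed statistic.
    exceed = null.size - np.searchsorted(null, observed, side="left")
    return exceed / null.size


# Toy simulation (assumed settings): only the first 50 rows depend on the latent variable.
rng = np.random.default_rng(1)
m, n, m_signal = 500, 20, 50
latent = rng.standard_normal(n)
Y = rng.standard_normal((m, n))
Y[:m_signal] += 1.5 * latent
p = jackstraw_pvalues(Y, r=1, s=25, B=40)
```

In this toy setting, `p[:50]` holds the P values of the truly associated rows and `p[50:]` those of the null rows; the latter are the ones that would be compared against Uniform(0,1), for instance with a KS test.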
