Power analysis for the standard design

The standard design randomly assigns subjects to either treatment or control with probability 0.5. We have to make assumptions about the size of the treatment effect and the standard deviation of the outcome variable. We then vary the total number of subjects to see how power changes with sample size.

This operation can be done with an analytically derived formula, but this simulation will provide the basis for more complicated designs.
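A minimal sketch of that simulation is below. The treatment effect, outcome standard deviation, and number of simulated experiments are placeholder assumptions, not values from the text:

```python
import numpy as np
from scipy import stats

def power_standard(n, ate=1.0, sd=3.0, alpha=0.05, sims=1000, seed=1):
    """Estimate power of a two-arm design by simulating many experiments.

    ate and sd are assumed values for the treatment effect and the
    outcome standard deviation (illustrative, not from the text).
    """
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(sims):
        y0 = rng.normal(0, sd, n)        # potential outcomes under control
        y1 = y0 + ate                    # constant additive treatment effect
        z = rng.permutation(n) < n // 2  # assign half to treatment at random
        y = np.where(z, y1, y0)          # observed outcomes
        p = stats.ttest_ind(y[z], y[~z]).pvalue
        significant += p < alpha
    return significant / sims
```

Calling `power_standard` over a grid of sample sizes traces out the power curve for this design.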

Power analysis for covariate control

Including covariates as regressors can improve the precision with which treatment effects are estimated. We include the covariates not because we are interested in their causal impact on outcomes, but because we want to reduce the noise in our experiment. By modeling the dependent variable as a function of the covariates plus error, we effectively reduce the standard deviation of the unexplained variation in our outcomes.

Suppose that from prior observational research, we know that income is correlated with both gender and age: men earn more than women, and older people earn more than younger people. We can build estimates of the strength of these correlations into the hypothetical potential outcomes we generate (line 16 of the code block below).
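One way to sketch this logic is below. The coefficients linking income to gender and age (-8 for female, 0.4 per year of age, residual noise sd of 5) and the treatment effect are hypothetical stand-ins for prior research, not values from the text:

```python
import numpy as np
from scipy import stats

def ols_pvalue(X, y, j):
    """Two-sided p-value for coefficient j of an OLS fit (classical SEs)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[j, j])
    return 2 * stats.t.sf(abs(beta[j] / se), n - k)

def power_income(n, ate=2.0, adjust=True, sims=500, seed=0):
    """Power for an income outcome generated from gender and age.

    With adjust=True the test comes from a regression that includes the
    covariates; with adjust=False only the treatment indicator is used.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        female = rng.binomial(1, 0.5, n)
        age = rng.uniform(18, 65, n)
        # income depends on gender and age plus noise (assumed strengths)
        y0 = 50 - 8 * female + 0.4 * age + rng.normal(0, 5, n)
        z = (rng.permutation(n) < n // 2).astype(float)
        y = y0 + ate * z
        cols = [np.ones(n), z, female, age] if adjust else [np.ones(n), z]
        hits += ols_pvalue(np.column_stack(cols), y, 1) < 0.05
    return hits / sims
```

Running both versions at the same sample size shows the adjusted model reaching a given power level with far fewer subjects than the unadjusted one.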

When we control for covariates, the predictive ability of our model increases, and this has strong implications for power. The graph below shows the output of this simulation: at any sample size, the covariate-adjusted model does better than the unadjusted model. In fact, the unadjusted model requires three times as many subjects to achieve 80% power as the covariate-adjusted model does.

Power analysis for multiple treatments

When you conduct an experiment with multiple treatment arms, you increase the number of opportunities to turn up a significant result, though the significance statistics may be misleading. As an extreme case, consider an experiment with 20 treatment arms: even if all the treatments are ineffective, we would expect about one of them to turn up significant at the 0.05 level just by chance. A common remedy is the Bonferroni correction, in which the significance level is divided by the total number of comparisons made with the same data.
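The arithmetic behind both claims can be checked directly. Under 20 independent null tests, the chance of at least one false positive is large, and dividing the threshold by the number of comparisons brings the familywise rate back near 0.05:

```python
m = 20        # number of treatment arms compared against control
alpha = 0.05

# Probability of at least one false positive across m independent null tests
uncorrected = 1 - (1 - alpha) ** m        # about 0.64 without correction
corrected = 1 - (1 - alpha / m) ** m      # about 0.049 with Bonferroni
```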

If you plan to test multiple treatments with the Bonferroni correction, power analysis becomes more complicated. What is the “power” of an experiment that has multiple arms? It could be one of at least three things, from least demanding to most demanding:

1. The probability that at least one of the treatments turns up significant.

2. The probability that all of the treatments turn up significant.

3. The probability that the treatment effects fall in the hypothesized ranking, and that all of the differences are significant.

The simulation we’re conducting here supposes that the first treatment has an ATE of 2.5 and the second treatment has an ATE of 5.0. For each subject pool size, what’s the probability that our suite of hypotheses holds up?
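A sketch of that simulation is below. The outcome standard deviation of 20 is an assumed placeholder (it is not given in the text), subjects are split evenly among control and the two treatments, and all tests use a Bonferroni-corrected threshold; reusing the same corrected threshold for the pairwise ranking test is one reasonable choice among several:

```python
import numpy as np
from scipy import stats

def multi_arm_power(n, ates=(2.5, 5.0), sd=20.0, alpha=0.05, sims=1000, seed=0):
    """Three notions of power for a control-plus-two-treatments design.

    Returns (any treatment significant, all treatments significant,
    full hypothesized ranking holds with all gaps significant).
    """
    rng = np.random.default_rng(seed)
    a = alpha / len(ates)               # Bonferroni over the two comparisons
    n_any = n_all = n_rank = 0
    for _ in range(sims):
        arm = rng.permutation(n) % 3    # 0 = control, 1 and 2 = treatments
        y = rng.normal(0, sd, n)
        y[arm == 1] += ates[0]
        y[arm == 2] += ates[1]
        p1 = stats.ttest_ind(y[arm == 1], y[arm == 0]).pvalue
        p2 = stats.ttest_ind(y[arm == 2], y[arm == 0]).pvalue
        p21 = stats.ttest_ind(y[arm == 2], y[arm == 1]).pvalue
        sig1, sig2 = p1 < a, p2 < a
        n_any += sig1 or sig2
        n_all += sig1 and sig2
        # hypothesized ranking: T2 > T1 > control, every gap significant
        n_rank += (sig1 and sig2 and p21 < a and
                   y[arm == 2].mean() > y[arm == 1].mean() > y[arm == 0].mean())
    return n_any / sims, n_all / sims, n_rank / sims
```

By construction the three numbers are ordered from least to most demanding, mirroring the three definitions of power above.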

The results of this simulation are graphed below: with 800 or so subjects equally allocated among the three conditions (control plus two treatments), this experiment is about 80% likely to recover at least one significant result. Achieving two significant results is much more demanding, not least because of the Bonferroni correction: having both treatments turn up significant 80% of the time would require 3000 subjects. Finally, demonstrating the hypothesized differences in terms of a complete ranking would require about 4000 subjects — a corollary is that testing fully specified theories is particularly resource-intensive!