Estimating Power for Multiple Hypothesis Tests

Project Overview

Current practice for ensuring that impact evaluations in education have adequate statistical power does not take the use of multiplicity adjustments into account. Multiplicity adjustments to p-values protect against spurious statistically significant findings when there are multiple statistical tests (for example, due to multiple outcomes, subgroups, or time points), but an important consequence of these adjustments is a change in statistical power. It is typically argued that multiplicity adjustments result in a loss of power, which can be substantial. This project therefore provides alternatives to current practice for studies that adjust for multiplicity. It will develop, implement, and test methods for estimating power, sample size requirements, and minimum detectable effect sizes (MDESs) while accounting for multiplicity adjustments made with one of three statistical procedures commonly used in education research: the Bonferroni, Benjamini-Hochberg, and Westfall-Young procedures.
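As a concrete illustration (not part of the project's materials), the following Python sketch shows how the Bonferroni and Benjamini-Hochberg procedures adjust a set of hypothetical p-values. The Westfall-Young procedure is resampling-based and is not shown here.

    import numpy as np

    def bonferroni_adjust(pvals):
        """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
        p = np.asarray(pvals, dtype=float)
        return np.minimum(p * p.size, 1.0)

    def benjamini_hochberg_adjust(pvals):
        """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)                                  # p-values from smallest to largest
        ranked = p[order] * m / np.arange(1, m + 1)
        adjusted = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
        adjusted = np.minimum(adjusted, 1.0)
        out = np.empty(m)
        out[order] = adjusted                                  # restore the original order
        return out

    # Hypothetical p-values for five outcomes from one study
    pvals = [0.003, 0.012, 0.021, 0.040, 0.300]
    print(bonferroni_adjust(pvals))          # e.g., 0.003 becomes 0.015; 0.300 becomes 1.0
    print(benjamini_hochberg_adjust(pvals))  # e.g., 0.003 becomes 0.015; 0.040 becomes 0.05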

This project will also investigate alternatives to standard practice for how power is defined in studies that adjust for multiplicity. Just as we account for multiplicity with respect to Type I errors, we may need to account for multiplicity with respect to Type II errors (whose rate is the complement of power), as the two types of errors are inextricably linked. This project will explore different ways to accomplish this, along with the implications for power, sample size requirements, and MDESs.
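For example, power could be defined for a single outcome (often called individual power), as the probability of detecting at least one true effect (1-minimal power), or as the probability of detecting all true effects (complete power), terms used in the multiple testing literature. The sketch below estimates these quantities by simulation under a Bonferroni adjustment; the effect sizes and standard error are illustrative, and the code is not the project's own.

    import numpy as np
    from scipy.stats import norm

    def power_definitions(effect_sizes, se, alpha=0.05, reps=100_000, seed=0):
        """Monte Carlo estimate of several power definitions under a Bonferroni adjustment.

        effect_sizes : true standardized effects for each outcome (illustrative values)
        se           : standard error of each impact estimate (assumed equal across outcomes)
        """
        rng = np.random.default_rng(seed)
        effects = np.asarray(effect_sizes, dtype=float)
        m = effects.size
        z_crit = norm.isf(alpha / (2 * m))                 # two-sided, Bonferroni-adjusted
        z = rng.normal(effects / se, 1.0, size=(reps, m))  # simulated test statistics
        reject = np.abs(z) > z_crit
        return {
            "individual power (outcome 1)": reject[:, 0].mean(),
            "1-minimal power (at least one rejection)": reject.any(axis=1).mean(),
            "complete power (all rejections)": reject.all(axis=1).mean(),
        }

    # Three outcomes, each with a true effect of 0.20 SD and a standard error of 0.06
    print(power_definitions([0.20, 0.20, 0.20], se=0.06))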

The project's contribution to future research is the potential for estimates of power (or of MDESs or of sample size requirements for a given power level) that are both more accurate and more appropriate than those currently used.

Conducting multiple statistical hypothesis tests can lead to spurious findings of effects. Multiple testing procedures (MTPs) counteract this problem but can substantially change statistical power. This paper presents methods for estimating multiple definitions of power and presents empirical findings on how power is affected by the use of MTPs.

Extensive literature, resources, and tools are available to help researchers determine power, sample size requirements, or the MDES for a single, unadjusted test and to design education studies with adequate sample sizes (for example, Dong, 2013; Spybrook et al., 2011; Raudenbush et al., 2011; Hedges and Rhoads, 2010; Bloom, Richburg-Hayes, and Black, 2007). However, MDRC has found no education or impact evaluation literature on estimating power, sample size, or MDES while accounting for multiplicity adjustments. The IES guidelines for multiple testing (Schochet, 2008) state that “statistical power calculations for confirmatory analysis must account for multiplicity,” but they offer no guidance on how to do so when multiple testing procedures are used to adjust p-values. This project will fill that gap by investigating alternatives to standard practice both for how power, sample size, and MDES are estimated and for how power is defined in studies that adjust p-values with multiple testing procedures.
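For reference, the single-test calculation that existing tools cover can be sketched as follows. This is a simple normal-approximation version for an individual-level randomized trial with no blocking or covariates, shown only to fix ideas; it is not the project's method.

    from scipy.stats import norm

    def mdes_single_test(n, p_treat=0.5, alpha=0.05, power=0.80):
        """MDES in standard deviation units for one two-sided test in a simple
        individual-level randomized trial (normal approximation, no covariates)."""
        multiplier = norm.isf(alpha / 2) + norm.ppf(power)          # about 2.80 for .05 / .80
        se_effect_size = (1.0 / (p_treat * (1 - p_treat) * n)) ** 0.5
        return multiplier * se_effect_size

    print(round(mdes_single_test(n=400), 2))   # roughly 0.28 SD with 400 individuals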

This project will develop, implement, and test methods for estimating power, sample size requirements, or MDESs while accounting for multiplicity adjustments made with one of three multiple testing procedures commonly used in education research: the Bonferroni, Benjamini-Hochberg, and Westfall-Young procedures. To develop these methods, the research team will draw on relevant literature in medicine, genomics, and biostatistics, identifying and adapting methods so that they address issues specific to education research and can feasibly be implemented by a wide range of applied education researchers. The final product will provide intuitive, step-by-step guides to implementing the recommended methods, along with sample computer code. The project will also use the proposed methods to illustrate and compare the power and sample size implications of the different multiple testing procedures and of the different power definitions under various realistic scenarios.
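As a rough illustration of the kind of comparison the project envisions, the sketch below simulates individual power for the first of several outcomes under no adjustment, Bonferroni, and Benjamini-Hochberg. The effect sizes and standard error are hypothetical, and the Westfall-Young procedure is omitted because it requires resampling the underlying data.

    import numpy as np
    from scipy.stats import norm

    def individual_power_by_procedure(effect_sizes, se, alpha=0.05, reps=20_000, seed=1):
        """Simulated power to detect the effect on the first outcome under no adjustment,
        Bonferroni, and Benjamini-Hochberg (all inputs are illustrative)."""
        rng = np.random.default_rng(seed)
        effects = np.asarray(effect_sizes, dtype=float)
        m = effects.size
        z = rng.normal(effects / se, 1.0, size=(reps, m))
        p = 2 * norm.sf(np.abs(z))                         # two-sided p-values
        reject_raw = p < alpha
        reject_bonf = p < alpha / m
        # Benjamini-Hochberg step-up, applied within each simulated study
        order = np.argsort(p, axis=1)
        p_sorted = np.take_along_axis(p, order, axis=1)
        below = p_sorted <= alpha * np.arange(1, m + 1) / m
        k = np.where(below.any(axis=1), m - np.argmax(below[:, ::-1], axis=1), 0)
        reject_bh_sorted = np.arange(1, m + 1) <= k[:, None]
        reject_bh = np.empty_like(reject_bh_sorted)
        np.put_along_axis(reject_bh, order, reject_bh_sorted, axis=1)
        return {
            "unadjusted": reject_raw[:, 0].mean(),
            "Bonferroni": reject_bonf[:, 0].mean(),
            "Benjamini-Hochberg": reject_bh[:, 0].mean(),
        }

    # Three outcomes with true effects of 0.20, 0.15, and 0.10 SD; standard error 0.06
    print(individual_power_by_procedure([0.20, 0.15, 0.10], se=0.06))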

To narrow the scope of the research, the research team is focusing on the simplest research design and analysis plan that education evaluations typically use in practice: a multisite randomized trial in which individuals are randomly assigned within sites. The team is also focusing on multiplicity that arises from testing for effects on multiple outcomes (which is very similar to testing for effects at multiple follow-up time points). If time and resources permit, the team will extend the analysis to address problems that arise from multiple subgroups or multiple treatments. The purpose is to establish, with the simplest cases, a research base on which future research can build.
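To make the design concrete, here is a minimal sketch of such a data-generating process: individuals randomized within sites, with multiple correlated outcomes. All parameter values, including the equal within-site allocation and the site variation, are hypothetical and not taken from the project.

    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_multisite_trial(n_sites=20, n_per_site=60, effects=(0.20, 0.10),
                                 outcome_corr=0.5, site_sd=0.25):
        """One simulated multisite trial: individuals randomized within sites and two
        correlated outcomes in standard deviation units (all values illustrative)."""
        effects = np.asarray(effects, dtype=float)
        m = effects.size
        cov = outcome_corr + (1 - outcome_corr) * np.eye(m)      # unit-variance residuals
        sites = []
        for _ in range(n_sites):
            treat = rng.permutation(np.repeat([0, 1], n_per_site // 2))   # blocked assignment
            site_shift = rng.normal(0.0, site_sd, size=m)                 # site-specific means
            resid = rng.multivariate_normal(np.zeros(m), cov, size=n_per_site)
            y = site_shift + treat[:, None] * effects + resid
            sites.append((treat, y))
        return sites

    def impact_estimates(sites):
        """Average of within-site treatment-control differences for each outcome
        (equivalent to a site fixed-effects estimator when sites are the same size)."""
        diffs = [y[t == 1].mean(axis=0) - y[t == 0].mean(axis=0) for t, y in sites]
        return np.mean(diffs, axis=0)

    print(impact_estimates(simulate_multisite_trial()))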

Bloom, MDRC’s Chief Social Scientist from 1999 to 2017, led the development of experimental and quasi-experimental methods for estimating program impacts, working closely with staff members to build these methods into their research.