Not whether but how much

Last week I was lucky enough to attend the Hewlett Foundation’s Quality Education in Developing Countries (QEDC) conference in Uganda, which brought together both Hewlett-funded organizations running education interventions and outside researchers tasked with evaluating the projects. (My advisor and I are working with Mango Tree Uganda to evaluate their Primary Literacy Project.) Evaluation was one of the central themes of the conference, with a particular focus on learning from randomized controlled trials (RCTs). While RCTs are clearly the gold standard for evaluations nowadays, we nevertheless had a healthy discussion of their limitations. One area that got a lot of discussion was that while randomized trials are great for measuring the impact of a program, they typically tell you less about why a program did or did not work well.

We didn’t get into a more fundamental reason that RCTs are seeing pushback, however: the fact that they are framed as answering yes/no questions. Consider the perspective of someone working at an NGO considering an RCT framed that way. In that case a randomized trial is a complicated endeavor that costs a lot of effort and money and has only two possible outcomes: either you (1) learn that your intervention works, which is no surprise and life goes on as usual, or you (2) get told that your program is ineffective. In the latter case, you’re probably inclined to distrust the results: what the hell do researchers know about your program? Are they even measuring it correctly? Moreover, the results aren’t even particularly useful: as noted above, learning that your program isn’t working doesn’t tell you how to fix it.

This yes/no way of thinking about randomized trials is deeply flawed – they usually aren’t even that valuable for yes/no questions. If your question is “does this program we’re running do anything?” and the RCT tells you “no”, what it’s really saying is that no effect can be detected given the size of the sample used for the analysis. That’s not the same as telling you that your program doesn’t work. It’s giving you the best possible estimate of the effect size given the data you collected, and telling you that this estimate is small enough, relative to its uncertainty, that we can’t rule out no effect at all.

It is true that running a randomized trial will get you an unbiased answer to the “yes” side of the yes/no does-this-work question: if you find a statistically significant effect, you can be fairly confident that it’s real. But it also tells you a whole lot more. First off, if properly done it will give you a quantitative answer to the question of what a given treatment does. Suppose you’re looking at raising vaccination rates, and the treatment group in your RCT has a rate that is 20 percentage points higher than the control group, significant at the 0.01 level. That’s not just “yes, it works”, it’s “it does about this much”. That point estimate would still be the best possible estimate of what the program is doing even if it weren’t statistically significant. Better yet, RCTs also give you a lower and an upper bound on what that “how much” figure is. If your 99% confidence interval is 5 percentage points on either side, then you know with very high confidence that your program’s effect is no less than 15 percentage points (but no more than 25).*
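To make the arithmetic concrete, here is a minimal sketch of how an estimate and interval like the ones in the vaccination example could be computed. The counts are hypothetical, chosen only so that the numbers land near 20 points plus or minus 5. The function name `diff_in_proportions_ci` is made up for illustration, and the sketch uses the simple normal-approximation interval for a difference in proportions, assuming individual-level randomization with no clustering. A real evaluation would likely need something more careful.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(treated_yes, treated_n, control_yes, control_n,
                           confidence=0.99):
    """Return (estimate, lower, upper) for treated rate minus control rate.

    Uses the normal-approximation (Wald) interval. This is an illustrative
    choice, not necessarily what any particular evaluation would use.
    """
    p_t = treated_yes / treated_n
    p_c = control_yes / control_n
    estimate = p_t - p_c
    # Standard error of a difference between two independent proportions.
    se = sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    # Two-sided critical value, e.g. about 2.58 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * se
    return estimate, estimate - margin, estimate + margin


# Hypothetical large trial: 70% vs 50% vaccinated, 1,220 children per arm.
estimate, lower, upper = diff_in_proportions_ci(854, 1220, 610, 1220)
print(f"large trial: {estimate:.3f} ({lower:.3f} to {upper:.3f})")

# Hypothetical small trial with the same rates but only 20 children per arm.
small_estimate, small_lower, small_upper = diff_in_proportions_ci(14, 20, 10, 20)
print(f"small trial: {small_estimate:.3f} ({small_lower:.3f} to {small_upper:.3f})")
```

With these made-up counts, both trials give the same best estimate of 20 percentage points. The large one pins it down to roughly 15 to 25, while the small one has an interval wide enough to include zero. The small trial has not shown that the program does nothing; it simply cannot say much about how big the effect is.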

I think a lot of implementers’ unease about RCTs would be mitigated if we focused more on the magnitudes of measured impacts instead of on significance stars. “We can’t rule out a zero effect” is uninformative, useless, and frankly a bit hostile – what we should be talking about is our best estimate of a program’s effect, given the way it was implemented during the RCT. That alone won’t tell us why a program had less of an impact than we hoped, but it’s a whole lot better than just a thumbs down.

*Many of my stats professors would want to strangle me for putting it this way. 99% refers to the share of identically constructed confidence intervals that would contain the true effect of the program, if you ran your experiment repeatedly. This is different from there being a 99% chance of the effect being in a certain range: the effect is a fixed value, so it’s either in the interval or not. It’s the confidence intervals that vary randomly, not the true value being estimated. The uncertainty is in whether the confidence interval contains the true value of the effect, rather than in whether the true value of the effect lies in the range. If that sounds like pure semantics to you, well, you’re not alone.
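For anyone who wants to see the "share of identically constructed confidence intervals" idea in action, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the true rates (50% and 70%), the 500 people per arm, the 1,000 repeated experiments, the hypothetical function name `coverage`, and the use of the simple normal-approximation interval.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_control=0.50, true_treated=0.70, n=500, reps=1000,
             confidence=0.99, seed=0):
    """Share of simulated experiments whose interval contains the true effect."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    true_effect = true_treated - true_control  # fixed; it never varies
    hits = 0
    for _ in range(reps):
        # Re-run the "experiment": draw fresh outcomes for each arm.
        p_c = sum(rng.random() < true_control for _ in range(n)) / n
        p_t = sum(rng.random() < true_treated for _ in range(n)) / n
        estimate = p_t - p_c
        se = sqrt(p_t * (1 - p_t) / n + p_c * (1 - p_c) / n)
        # The interval moves from run to run; the true effect does not.
        if estimate - z * se <= true_effect <= estimate + z * se:
            hits += 1
    return hits / reps


share = coverage()
print(f"share of intervals containing the true effect: {share:.3f}")
```

The share printed should come out close to 0.99. In the simulation the true effect stays put at 20 percentage points, and it is the intervals that jump around from one simulated experiment to the next.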

2 thoughts on “Not whether but how much”

Jason – is this only a different way of saying what researchers would be doing? Even though we could say to a potential partner in implementation that we’re going to measure “how much” impact their program will have, as scientists we know it’s altogether possible that the impact will be 0 (and in my experience running around Malawi, I’ve seen a lot of 0).

In fairness to potential partners, I always discuss how they measure impact and what their ideas of “success” are, and this usually involves asking for some additional resources to gather qualitative data alongside an RCT. I like this approach because we can also use that data to examine some of the mechanisms through which interventions can succeed or fail.