What is success, anyhow?

A fair number of the impact evaluations I work on are of programs designed by governments or NGOs, so I often have to have a tricky discussion when it comes time to do the power calculations for the evaluation design. The subject of this conversation is the anticipated effect size. This is a key parameter: if it’s too optimistic, you run the risk of an impact evaluation that detects no effect even when the program worked to some (lesser) degree; if it’s too pessimistic, you are wasting money and people’s time on a survey that is larger than it needs to be.
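To see how sensitive the survey size is to this one parameter, here is a minimal sketch of the textbook sample size formula for comparing two means. It assumes individual-level randomization, equal-sized arms, a two-sided test at 5% with 80% power, and an effect expressed in standard deviations of the outcome; the function name and the two example effect sizes are purely illustrative, and a real design would also need to handle clustering, covariates and attrition.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm to detect a difference in means.

    effect_sd is the anticipated effect in standard deviations of the outcome.
    Assumes individual randomization, equal arms, a two-sided test and a
    normal approximation (no clustering, covariates or attrition).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_sd ** 2)


# Hypothetical effect sizes: an optimistic guess and a more modest one.
for label, effect in [("optimistic", 0.20), ("more modest", 0.10)]:
    print(f"{label}: {effect:.2f} SD -> {n_per_arm(effect)} per arm")
```

Under these assumptions, halving the anticipated effect roughly quadruples the required sample, which is why the conversation about effect size matters so much.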

My experience is that the initial estimate of anticipated effect size from the program side of the team is too optimistic. One easy step is to switch the question from “What do you think the impact of this program on household consumption [or whatever the key outcome of interest is] will be?” to “Below what level of impact on consumption would you consider this program a failure and/or shut it down and do something else?” In my experience, the gap between these two answers is usually large, and the latter often takes more thought (not least because it doesn’t show up in most project proposals).

Another useful step is to put this question to multiple people on the implementation side, to see not only how much the answer varies but also what the most pessimistic voice has to say. It’s also helpful to look at monitoring data if the intervention to be evaluated, or something like it, has been running. If the quality of the monitoring data is bad, this should give you some pause about the reliability of the effect size estimates and suggest further digging around. And of course, if the intervention (or something close to it) is running somewhere, a field visit can be useful. It won’t give you the aggregate data you need, but it may help you understand a) where the monitoring data is coming from and b) what the range of potential outcome variables is (having more than one outcome variable to hang a power calculation on is useful).

The fundamental question, of course, is what effect size is meaningful. The shut-down-the-program threshold is one answer to this, but others can be at play, including other definitions of economically meaningful (particularly when looking at outcomes in multiple dimensions of welfare) or even simple statistical significance (in some behavioral experiments, for example).

Then there is the reality of the budget for the survey. While hopefully the power calculations shaped this, I’ve been in the situation more than once where a surprise on costs or on the size of the budget meant that the survey would be smaller than I had hoped. Then it’s back to iterating on what is meaningful and whether we feel comfortable with the new threshold.
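When the budget fixes the sample rather than the other way around, the same formula can be turned around to ask what the smallest detectable effect would be. The sketch below does this under the same simplifying assumptions as before (individual randomization, equal arms, two-sided 5% test, 80% power, effect in standard deviations); the function name and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_sd(n_per_arm, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect, in standard deviations.

    Assumes individual randomization, equal arms, a two-sided test and a
    normal approximation (no clustering, covariates or attrition).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


# Hypothetical survey sizes before and after a budget cut.
for n in (400, 300, 200):
    print(f"{n} per arm -> minimum detectable effect of {mde_sd(n):.3f} SD")
```

The question for the team is then whether the new, larger threshold is still an effect size they would consider meaningful.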

Finally, a lurking issue that may bear on some discussions of effect size is program uptake. Again, my experience here is that estimates from the program side tend toward the optimistic. With an existing intervention, any projection above the current rate of uptake has to be seriously questioned: what’s new, and why are more people now going to enroll, really? If it’s a new intervention, some qualitative work and/or questions in a baseline (if that is how things are structured) can help you get a sense of what the response will be.
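A rough way to see why uptake matters so much for power: if only a fraction of those offered the program take it up, the average effect across everyone offered is diluted by that fraction. The sketch below assumes nobody in the control group gets the program and that people who do not take it up are unaffected, on top of the same simplified two-arm setup used in textbook power formulas; all names and numbers are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_with_uptake(effect_on_participants_sd, uptake,
                          alpha=0.05, power=0.80):
    """Approximate sample size per arm when only a share of people take up.

    Assumes no uptake in the control group and no effect on those who do not
    take up, so the average effect among everyone offered the program is
    uptake * effect_on_participants_sd. Also assumes individual randomization,
    equal arms, a two-sided test and a normal approximation.
    """
    diluted_effect = uptake * effect_on_participants_sd
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / diluted_effect ** 2)


# Hypothetical: a 0.2 SD effect on participants under different uptake rates.
for uptake in (1.0, 0.8, 0.5):
    n = n_per_arm_with_uptake(0.2, uptake)
    print(f"uptake {uptake:.0%}: {n} per arm")
```

Under these assumptions the required sample grows with one over the square of the uptake rate, so an optimistic uptake estimate can quietly undermine an otherwise careful power calculation.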

Of course, once the evaluation and the program get rolling, you get a closer idea of what the truth might be – not only for uptake, but also the effect size. And then you may want to have some contingency plans for ways to boost power – along the lines of the discussion in Jed’s most recent post.