Taking Ethical Validity Seriously

More thought has been given to the validity of the conclusions drawn from development impact evaluations than to the ethical validity of how the evaluations were done. This is not an issue for all evaluations. Sometimes an impact evaluation is built into an existing program such that nothing changes about how the program works. The evaluation takes as given the way the program assigns its benefits. So if the program is deemed to be ethically acceptable then this can be presumed to also hold for the method of evaluation. (I leave aside ethical issues in how evaluations are reported and publication biases.) We can dub these “ethically benign evaluations.”

Another type of evaluation deliberately alters the program’s (known or likely) assignment mechanism—who gets the program and who does not—for the purpose of the evaluation. Then the ethical acceptability of the intervention does not imply that the evaluation is ethically acceptable. Call these “ethically contestable evaluations.” The main examples in practice are randomized control trials (RCTs). Scaled-up programs almost never use randomized assignment, so the RCT has a different assignment mechanism, and this may be contested ethically even when the full program is fine.

A debate has emerged about the ethical validity of RCTs. This has been brewing for some time but there has been a recent flurry of attention to the issue, stimulated by a New York Times post last week by Casey Mulligan and various comments including an extended reply by Jessica Goldberg. Mulligan essentially dismisses RCTs as ethically unacceptable on the grounds that some of those to whom a program is assigned for the purpose of evaluation (the “treatment group”) will almost certainly not need it, or benefit little, while some in the control group will. As an example, he endorses Jeff Sachs’s arguments as to why the Millennium Villages project was not set up as an RCT. Goldberg defends the ethical validity of RCTs against Mulligan’s critique. On the one hand, she argues that randomization can be defended as ethically fair given limited resources; on the other hand, even if one still objects, the gains from new knowledge can outweigh the objections.

I have worried about the ethical validity of some RCTs, and I don’t think development specialists have given the ethical issues enough attention. But nor do I think the issues are straightforward. So this post is my effort to make sense of the debate.

Ethics is a poor excuse for lack of evaluative effort. For one thing, there are ethically benign evaluations. But even focusing on RCTs, I doubt if there are many “deontological purists” out there who would argue that good ends can never justify bad means and so side with Mulligan, Sachs and others in rejecting all RCTs on ethical grounds. That is surely a rather extreme position (and not one often associated with economists). It is ethically defensible to judge processes in part by their outcomes; indeed, there is a long tradition of doing so in moral philosophy, with utilitarianism as the leading example. It is not inherently “unethical” to do a pilot intervention that knowingly withholds a treatment from some people in genuine need, and gives it to some people who are not, as long as this is deemed to be justified by the expected welfare benefits from new knowledge.

Far more problematic is either of the following:

Any presumption that an RCT is the only way we can reliably learn. That is plainly not the case, as anyone familiar with the full range of (quantitative and qualitative) tools available for evaluation will know.

Any evaluation for which the expected gains from new knowledge cannot reasonably justify an ethically contestable methodology.

The latter situation is clearly objectionable if it is seen to hold. But it is often hard to verify in development settings. Ethics has been much discussed in medical research. In that context, the principle of equipoise requires that there should be no decisive prior case for believing that the treatment has impact sufficient to justify its cost. (This is David McKenzie’s sensible modification to clinical equipoise to fit the types of programs in discussion here.) By this reasoning, only if we are sufficiently ignorant about the likely gains relative to costs should we evaluate further. Implementation of such an ethical principle may not be easy, however. In the context of antipoverty or other public programs, a priori (theoretical and/or empirical) arguments can often be made both for and against believing ex ante that impact is likely. A clever researcher can often create a convincing straw man to suggest that some form of equipoise holds and that the evaluation is worth doing. While this cannot be prevented, we should at least demand that the case is made, and that it stands up to scholarly public scrutiny. That is clearly not the norm at present.

It has often been argued that whenever rationing is required, that is, when there is not enough money to cover everyone, randomized assignment is a fair solution. (Goldberg makes this claim, though I have heard it often. Indeed, I have made this argument a few times with government counterparts in attempting to convince them of the merits of randomization.) In practice, this is clearly not the main reason that randomistas randomize. But should it convince the un-believers? It can be accepted when information is very poor, or allocative processes are skewed against those in need. In some development applications we may know very little ex ante about how best to assign participation to maximize impact. But when alternative allocations are feasible (and if randomization is possible then that condition is evidently met) and one does have information about who is likely to benefit, then surely it is fairer to use that information, and not randomize, at least unconditionally.

Conditional randomization can help relieve ethical concerns. One first selects eligible types of participants based on prior knowledge about likely gains, and only then randomly assigns the intervention, given that not all can be covered. For example, if one is evaluating a training program or a program that requires skills for maximum impact, one would reasonably assume (backed up by some evidence) that prior education and/or experience will enhance impact, and design the evaluation accordingly. This has ethical advantages over simple randomization when there are priors about likely impacts.
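
To make the two steps concrete, here is a minimal sketch in Python of how such a conditional assignment might be coded. Everything specific in it is an assumption made for illustration: the function name, the candidate records, and the eligibility rule (at least eight years of schooling) are hypothetical and not taken from any actual evaluation.

```python
import random


def conditional_randomization(candidates, is_eligible, n_slots, seed=0):
    """Screen candidates on prior knowledge, then randomize among the eligible.

    candidates  : list of dicts, each with a unique "id" (hypothetical records)
    is_eligible : function encoding the prior about who is likely to gain
    n_slots     : number of program places the budget can cover
    seed        : fixed seed so the draw can be reproduced and audited
    """
    rng = random.Random(seed)

    # Step 1: use prior knowledge to restrict the pool.
    eligible = [c for c in candidates if is_eligible(c)]
    ineligible = [c for c in candidates if not is_eligible(c)]

    # Step 2: ration the scarce places by lottery, but only among the eligible.
    n_treated = min(n_slots, len(eligible))
    treated = rng.sample(eligible, n_treated)
    treated_ids = {c["id"] for c in treated}
    control = [c for c in eligible if c["id"] not in treated_ids]

    return treated, control, ineligible


# Hypothetical applicants to a training program, with years of schooling.
schooling = [2, 5, 8, 9, 10, 12, 12, 6, 11, 3]
candidates = [{"id": i, "years_schooling": s} for i, s in enumerate(schooling)]

# Assumed prior: the training pays off mainly for those with 8+ years of schooling.
treated, control, ineligible = conditional_randomization(
    candidates,
    is_eligible=lambda c: c["years_schooling"] >= 8,
    n_slots=3,
)

print("treated:", sorted(c["id"] for c in treated))
print("control:", sorted(c["id"] for c in control))
print("not in the experiment:", sorted(c["id"] for c in ineligible))
```

Note that the comparison is then only between treated and control units within the eligible pool, so the estimated impact speaks to that pool and not to those screened out.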

But there is a catch. The set of things observable to the evaluator is typically only a subset of what is observable on the ground (such information asymmetry is, after all, the reason for randomizing in the first place). At local level, there will typically be more information—revealing that the program is being assigned to some who do not need it, and withheld from some who do. The RCT may be ethically unacceptable at (say) village level. But then whose information should decide the matter? It may be seen as quite lame for the evaluator to plead, “I did not know” when others do in fact know very well who is in need and who is not.

Goldberg reminds us of another defense often heard, namely that RCTs can use what are called “encouragement designs.” The idea here is that nobody is prevented from accessing the primary service of interest (such as schooling) but the experiment instead randomizes access to some form of incentive or information. This may help relieve ethical concerns for some observers, but it clearly does not remove them; it merely displaces them from the primary service of interest to a secondary space. Ethical validity still looms as a concern when any “encouragement” is being deliberately withheld from some people who would benefit and given to some who would not.

While ethical validity is a legitimate concern in its own right, it also holds implications for other aspects of evaluation validity. There is heterogeneity in the ethical acceptability of RCTs. That will vary from one setting to another. One can get away with an RCT more easily with NGOs than governments, and with small interventions, preferably in out-of-the-way places. (By contrast, imagine a government trying to justify why some of its under-served rural citizens were randomly chosen to not get new roads or grid connections on the grounds that this will allow it to figure out the benefits to those that do get them.) An exclusive reliance on randomization for identifying impacts will likely create a bias in our knowledge in favor of the settings and types of interventions for which randomization is feasible; we will know nothing about a wide range of development interventions for which randomization is not an option. (I discuss this bias for inferences about development impact further in “Should the Randomistas Rule?”.) Given that evaluations are supposed to fill our knowledge gaps, this must be a concern even for those who think that consequences trump concerns about processes.

If evaluators take ethical validity seriously there will be implications for RCTs. Some RCTs may have to be ruled out as simply unacceptable. For example, I surely cannot be the only person who is troubled on ethical grounds by the (innovative) study done in Delhi, India by Marianne Bertrand et al. that randomized an encouragement to obtain a driver’s license quickly, on the explicit presumption that this would entail the payment of a bribe to obtain a license without knowing how to drive. (This study was conducted and funded by the World Bank’s International Finance Corporation. And it was published in a prestigious economics journal.) The study confirmed that the process of testing and licensing was not working well even for the control group. But the RCT put even more drivers on Delhi roads who did not know how to drive, adding to the risk of accidents. The gain from doing so was a clean verification of the claim that corruption is possible in India and has real effects, though I was not aware of any prior doubt about the truth of that claim.

There may well be design changes to many RCTs that could assure their ethical validity, as judged by review boards. One might randomly withhold the option of treatment for some period of time, after which it would become available, but this would need to be known by all in advance, and one might reasonably argue that some form of compensation would be justified by the delay. Adaptive randomizations are getting serious attention in biomedical research; for example, one might adapt the assignment to treatment of new arrivals along the way, in the light of evidence collected on covariates of impact. (The U.S. Food and Drug Administration issued guidelines a few years ago.)

The experiment might not then be as clean as in the classic RCT—the prized internal validity of the RCT in large samples may be compromised. But if that is always judged to be too high a price then the evaluator is probably not taking ethical validity seriously.

Stay tuned for posts tomorrow, and later this week by Berk, David, and Markus in which we debate some of the points raised by Martin here.

Comments

Why do we hold such a high bar for RCTs and not for any other kind of intervention and/or evaluation? It's not cool to have imperfect targeting if there's an RCT involved, but it's totally fine to spend your money on whatever takes your fancy if there's no RCT involved?

In response to Martin's comment about the Bertrand et al. paper on drivers' licenses in India: the authors of the study point to footnote 3 of their paper, where they note that "To ensure that there were no social costs to the study, participants in the comparison and bonus groups were offered free driving lessons upon completion of the final survey and driving test."

Yes, the experimental group that was induced to bribe for a license without testing were offered "free lessons" after the study. But this hardly "ensures that there was no social cost." There is no mention of lesson take up amongst the bonus group but statements elsewhere in the paper suggest this was not high. 40% of the experimental group that was offered free lessons at the outset did not take up those lessons, and we would expect an even higher proportion for the bonus group. (At one point the paper says that the "lesson group" were the better drivers, implying that the bonus group had even lower take up of the free lessons.) So we are left to conclude that the experiment increased the number of unsafe drivers, as I claim.

If I am not misunderstanding the timeline discussed in the paper, to the extent that the 'bonus treatment' helped someone obtain a driver's license without the proper skills, they would be behind the wheel in cars on Indian roads from the time they obtained their licenses until at least the offer of the free driving lessons after the final survey. Perhaps the language of 'no social costs' is too strong...

One of the hallmarks of RCTs in medicine is that each participant is informed of the risks and benefits, and then freely gives informed consent to participate, knowing the risks. Do RCTs at the village level include informed consent, and if so, how is it obtained? Is everyone in the village capable of understanding the risks and benefits and agreeing to take part?

Research on humans (and other biological subjects) is difficult and messy. Pure randomization is not possible in the way that a chemist can achieve. Conditional randomization seems the best way to go, but interpretation of the results is more complicated.

Transparent assignment rather than random assignment.
I strongly agree that the road to better evaluations lies on the path of determining the most ethical way to allocate scarce resources. As Keynes pointed out in 1937 (1), and as everyone who has worked in government realizes, “There is nothing a government hates more than to be well-informed; for it makes the process of arriving at decisions much more complicated and difficult.” In addition, program administrators are naturally hostile to evaluation: the best outcome that they can hope for from an evaluation is that it confirms the rosy picture that they have painted of their program. The worst is that their program will be found to be useless and their budget is cut.

The strong forces that oppose (rigorous) evaluation will often employ ethical arguments against random assignment (2) and in the absence of random assignment the default is an unknown selection mechanism. By shifting the debate from “Random assignment or not?” to “What is the most ethical method of selection?”, we avoid the unknown selection mechanism, the alternative that precludes rigorous evaluation.

In some cases, as David McKenzie points out (March 19th), randomization will be found to be the most ethical method of allocating scarce resources. In other cases we may be advised to allocate resources first to those who most need them (with a Rawlsian justification). Some would argue that that outcome would be preferred by an evaluator, since the Regression Discontinuity Design estimates would give us a better understanding of the marginal benefit from expanding a program than we would have if we had used random assignment. But whatever method is found to be most ethical will allow us to produce unbiased estimates of impact as long as the selection method is transparent.

By definition, making the proper ethical choice is the right thing to do. IMHO paradoxically, failing to explicitly put the ethical argument first gives those who oppose evaluation for selfish purposes a powerful rhetorical tool, and in the final analysis reduces the probability of a successful evaluation.

Finally you tackle this issue. But let me be very blunt: does the World Bank have a working, compulsory Institutional Review Board for interventions it sponsors, or at least for those its researchers carry out? This is a real, not rhetorical question. If not, it would be interesting to know why it has not been set up.
I've conducted experiments from major US universities and we had a tough time but VERY good and relevant feedback from our IRB. I have been evaluating proposals for funding agencies and I've been shocked by some of these proposals by international organizations (not by WB teams to be fair) which would have had zero chance to be approved by an ethical board in any major US university. Perhaps the AEA could have an ad hoc committee for institutions lacking a board.