Taking Ethics Seriously: Response #1

Yesterday, Martin Ravallion wrote a piece titled ‘Taking Ethical Validity Seriously.’ It focused on ethically contestable evaluations and used RCTs as the main (only?) example of such evaluations. It is a good piece: researchers can always benefit from questioning themselves and their work in different ways. For full disclosure, all four Development Impact bloggers saw earlier drafts of Martin’s post and at least three of us sent comments, which led to revisions that culminated in yesterday’s piece. As our private comments to Martin could be of interest to our readers, we are publishing cleaned-up versions of them here throughout the week. I start today; David goes tomorrow, then Markus the day after. We hope you join in with your comments…

As this is a long post and only those who are really interested in this topic will get through the whole thing, I have moved my summary and my main comment to the top – as one would at an academic seminar, on the first slide: I think that many of the criticisms or worries Martin raises in his post are the exception rather than the norm in development economics. Many of us spend weeks if not months worrying about these issues and adjust and readjust our study designs accordingly (while the people and organizations who asked for our help wait patiently as these discussions drag on seemingly ad infinitum). But, if more checks and balances would help, no one should argue against them. If that is to happen, let’s hold everyone (including governments and the World Bank) to the same standards, not just researchers conducting RCTs.

Suppose that I have a colleague whose mother lived in a village near a millennium village in Malawi. When she asked her son why her village was not chosen to be a millennium village, what should the answer be? I would have been comfortable saying something along the lines of: “All villages fitting certain poverty-related criteria were identified and a lottery was conducted to determine which ones could access the limited resources.” Many of the concerns Martin raises about RCTs apply equally to project-level decisions: who holds governments, NGOs, and large donor organizations ethically to account on these types of decisions? One could argue that the presence of an RCT can make an intervention better (and more ethical) than it would have been without one…
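Such a lottery is also easy to implement transparently. Here is a minimal sketch in Python (the village names, the seed, and the number of slots are all hypothetical; the poverty-related eligibility screen is assumed to have been applied already):

```python
import random

# Hypothetical villages that met the poverty-related eligibility criteria
eligible_villages = ["Chilanga", "Dowa", "Mzimba", "Nsanje",
                     "Ntcheu", "Salima", "Thyolo", "Zomba"]

n_slots = 4  # resources suffice for only four villages

rng = random.Random(20140108)  # fixed seed so the draw can be re-run and audited
selected = rng.sample(eligible_villages, n_slots)
waitlisted = sorted(set(eligible_villages) - set(selected))
```

Because the seed is fixed, anyone can re-run the draw and verify that the selection was not manipulated – which is part of the answer one can give to the mother in the village that lost the lottery.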

Now that I have this mini rant out of the way, I want to address four issues that Martin raises:

1. RCTs are the only way to learn anything in economics (what?);

2. Ethically contestable RCTs;

3. Using local knowledge to inform treatment assignment rather than using random assignment; and

4. RCT design in light of important ethical considerations.

Let’s go through these one by one:

RCTs are the only way to learn anything in economics

Martin writes:

“Far more problematic is either of the following:

Any presumption that an RCT is the only way we can reliably learn. That is plainly not the case, as anyone familiar with the full range of (quantitative and qualitative) tools available for evaluation will know.”

This has unfortunately become a common criticism of so-called randomistas. Is there really anyone who believes this? If so, could someone please point out who they are and provide a quote from one of these scapegoats?

The reason I ask is that, when someone approaches me or one of my colleagues with an idea for evaluating a program, the modus operandi is to consider random assignment alongside other causal inference methods: sometimes interventions cannot be randomly assigned for a variety of reasons, while other times it is undesirable to do so. In such cases, I start going down the list: could an ‘as good as randomized’ regression discontinuity design be employed? If not, what are the prospects for a good diff-in-diff exercise? How about matching? In the case of a one-off policy change in one state/region, I might think of employing synthetic control methods. I am not going to apologize for starting by exploring the possibility (and sensibility) of random assignment, but if that option is ruled out, I don’t stop: if the question at hand is sufficiently interesting or important, you just keep exploring identification strategies and explain to your policy collaborator the conditions/assumptions under which something useful can be learned about her question given the chosen method.

Martin also singles out RCTs for changing how a program is implemented. I don’t think that this is exactly right. For example, if you were to evaluate a proxy-means-tested anti-poverty program at the eligibility cutoff, you might consider adjusting your score to ‘behave,’ i.e., to be smooth and continuous rather than lumpy. That could affect how many baseline characteristics you use to create the score, which might alter the beneficiary list (in an ethically defensible way). If you were thinking of using a diff-in-diff method, you’d want to make sure that there are (preferably multiple rounds of) pre-intervention data on the target population, so that the parallel-trends assumption can be assessed. But baseline data collection might delay your intervention. I’d say that whenever policymakers (or organizations piloting new interventions) plan to evaluate, how they implement their new program will inevitably change – even if only because they invite people with differing viewpoints on the research questions, methods, etc. to the table. That type of ex-ante questioning (akin to what Cartwright and Hardie call a pre-mortem) is bound to change intervention details. That’s a good thing…

2. Ethically contestable RCTs

It would have been useful if Martin had given more specific examples of ethically objectionable cases vs. ethically OK ones. As he himself states in his piece (and given that he also uses RCTs to answer certain questions), RCTs are not inherently unethical.

Another example might be RCTs for interventions that would never have happened without the specific research question. Suppose that you, as the researcher, raised funds for cash transfers as part of a research proposal. Martin argues that if you know who needs cash the most, it might be unethical to transfer funds by random assignment for your study’s sake. But what if the cash transfers would not have taken place in the first place if it were not for this research proposal? Is it still unethical to distribute this windfall by random assignment? If it is, then you should not have been granted those funds in the first place… It’d be hard to blame the researcher for thinking that her research grant will make some people significantly richer while leaving others no worse off. She might even generate some useful knowledge in the end…

[A side note here: I want to repeat something about the notion of equipoise in economics – even though others have made similar arguments recently, the most excellent of which was David’s post on the topic. As social scientists dealing with constrained optimization, we cannot be satisfied with the fact that something has been shown to work, in the sense that the null hypothesis of no effect has been rejected. As my colleague Winston Lin pointed out in a private exchange, RCTs aim to identify an average treatment effect with confidence intervals. It is that effect size, combined with intervention costs and compared with alternative interventions, that concerns the policymaker. So, yes, we have all rolled our eyes at plain vanilla RCTs that seem to state the obvious: giving money to poor people increased total household consumption, or giving eyeglasses to students with poor eyesight improved learning. Stated that way, they do sound silly, but we should give the researchers an opportunity to make the case that there is something more valuable to be learned from their proposed experiment. Martin states that such scrutiny of their studies ‘…is clearly not the norm at present.’ I am going to disagree: I estimate (without hard evidence to back me up) that such projects form a small share of all development RCTs and are the exception rather than the norm. Martin, along with many of us and many critics of randomistas, is a reviewer of funding proposals at various stages. If useless or unethical research proposals receive public funding, we are all to blame. And I agree with Martin that we should strive to do better…]

3. Using local knowledge to inform treatment assignment rather than using random assignment

Martin also states that the researcher can use ‘knowledge on the ground’ about who needs the intervention the most. He is implicitly referring to this paper by Alatas et al. (2012), where local targeting is shown to be as effective as top-down targeting using proxy means tests. Point taken (and David will elaborate more on this study tomorrow), but two counterpoints. First, many econ RCTs, like the Alatas et al. paper itself, are cluster RCTs, meaning that the ranking would have to be made across communities rather than individuals. That is harder to do with local knowledge and requires data at the community level, such as poverty maps. The cost of targeting is most certainly part of the cost-effectiveness calculations to follow, and Martin knows better than anyone in the world that, at some point, targeting can become more expensive than not targeting. Second, the typical operation of many programs, left to the devices of an NGO or the government (due to capacity or budgetary constraints), would miss not only the neediest communities but also the neediest individuals within program communities – when program enrollment is done via village meetings, for example. Such meetings, to which the target population is invited, can end up producing beneficiary lists composed of individuals who are less poor, better informed, and better connected than the average person in the target population (please see this paper for an example). RCTs will generally list the target population by going door to door, avoiding this type of selection bias.
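To make the cluster point concrete, here is a minimal sketch of stratified cluster-level assignment in Python (the community IDs, regions, and seed are all hypothetical): randomization happens across communities, and every household found in the door-to-door listing simply inherits its community’s arm:

```python
import random

# Hypothetical sampling frame: community ID -> region (used as a stratum)
communities = {f"C{i:02d}": ("North" if i < 10 else "South") for i in range(20)}

rng = random.Random(42)  # fixed seed so the assignment is reproducible

# Stratified cluster randomization: within each region, assign half of the
# communities to treatment and half to control.
assignment = {}
for region in ("North", "South"):
    cluster_ids = sorted(c for c, r in communities.items() if r == region)
    rng.shuffle(cluster_ids)
    half = len(cluster_ids) // 2
    for c in cluster_ids[:half]:
        assignment[c] = "treatment"
    for c in cluster_ids[half:]:
        assignment[c] = "control"

def household_arm(community_id):
    # A household listed door to door inherits its community's arm;
    # individuals within a community are never re-randomized.
    return assignment[community_id]
```

The point of the sketch is the unit of randomization: any ranking by need would have to operate on the community list at the top, not on individuals – which is exactly where local knowledge is weakest and community-level data like poverty maps are required.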

4. RCT design in light of important ethical considerations

Martin concludes by stating:

“There may well be design changes to many RCTs that could assure their ethical validity, such as judged by review boards. One might randomly withhold the option of treatment for some period of time, after which it would become available, but this would need to be known by all in advance, and one might reasonably argue that some form of compensation would be justified by the delay...

The experiment might not then be as clean as in the classic RCT—the prized internal validity of the RCT in large samples may be compromised. But if that is always judged to be too high a price then the evaluator is probably not taking ethical validity seriously.”

These are important points, some of which speak to issues at the core of why we do random assignment in the first place, so some discussion is in order.

First, delayed-treatment designs such as the one described by Martin above, complete with compensation, do exist: for example, I am involved in one such intervention where the delayed-treatment group gets additional resources as compensation for drawing the short straw. However, I would agree that they are not the norm. Governments and NGOs should consider this type of design more often, even if it means that the overall endeavor ends up being costlier. This is less of an issue for true pilot programs: if the pilot is successful, the control group will receive the intervention anyway. But what is an NGO, which might have raised limited funds from an interested donor, to do? Without the RCT, they were going to implement the intervention in, say, 100 villages – period. Can they afford to treat another 100 villages after endline? If they cannot, is this unethical? Does the fact of conducting an RCT now require the NGO to provide the intervention to another 100 communities at the end of the trial? If the answer is yes, donors and implementing agencies will have to budget accordingly ahead of time. Or they can choose another method of evaluation, but it is not clear that other methods absolve us of similar issues. Suppose that you’re using RDD to evaluate program impacts: should you not provide treatment to the ‘control’ group in your study (people/communities who were deemed barely ineligible but who were in the study sample and provided lots of data for the evaluation) after the intervention is over? At the threshold, they are equally deserving, just unlucky – hence Lee and Lemieux’s term ‘as good as randomized’…
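Mechanically, a delayed-treatment (phase-in) design of this kind amounts to randomizing the rollout order. A minimal sketch in Python (the unit names, seed, and compensation amount are all hypothetical):

```python
import random

units = [f"village_{i:03d}" for i in range(100)]  # the NGO's 100 villages

rng = random.Random(7)  # fixed seed so the rollout order is auditable
order = units[:]
rng.shuffle(order)

# Phase I is treated immediately; Phase II waits until endline and then
# receives the intervention plus a (hypothetical) top-up for the delay.
phase1, phase2 = order[:50], order[50:]
compensation = {u: 100 for u in phase2}
```

The Phase II group serves as the comparison group until endline and is treated afterwards, with the compensation announced up front so that the delay is known to all in advance, as Martin suggests.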

Second, the tradeoff between a clean RCT and providing delayed treatment with full information is all too real and can, at the extreme, override the choice to conduct an RCT to begin with. Many of us prefer RCTs precisely to get a clean answer to a particular question of policy interest. Here are a few examples of conflicts that arise between study design and the provision of delayed treatment with full information:

There are limits on how long you would like to (or can reasonably) delay treatment to the control group, and that window may be shorter than the time required to see full impacts. Often, we are interested not in immediate impacts but in sustained longer-term ones. If you delay treatment for 12 months for an intervention where the ideal endline is 48 months after baseline, you will not get an answer to your original question experimentally. You can answer the question of 12-month effects, as well as compare 36 vs. 48 months of exposure, but those are different questions than your original one. Thinking about this stuff ahead of time also allows you to put quasi- or non-experimental pre-analysis plans in place, but that’s a topic for another day…

The fact of waiting for the intervention can change behavior. For example, in a transfer program, the control group might (if they can) borrow against the impending transfers to smooth consumption (or investment).

How much information should be provided to the control group can be a grey area. Suppose that you have a cluster-randomized controlled trial in which towns are the unit of intervention, each containing the eligible target group. At the study enrollment stage (i.e., during informed consent), each individual is given information about the study and told of the possibility of immediate or delayed treatment via random assignment (for which they would provide separate consent if assigned to a treatment group). Does the researcher have an obligation to provide a detailed account of every aspect of the planned intervention, or can the consent documents provide a generic description of the study and intervention goals without further detail? In practice, such information can make a difference in findings, and there is a definite tension between providing the information required for informed consent and preserving the interpretation of the impact estimates. [Jed has written about the admissibility of misleading study subjects here.]

Delayed treatments with full information can still be unethical. Sometimes, any delay may be too much of a burden on the beneficiaries. That could be true for beneficiaries who are elderly, otherwise in poor health, or in immediate need. In such cases, as Martin suggests, random assignment into even a short-delay treatment group has to be abandoned…

Personally, this is an area where I initially, naïvely, thought that IRBs would provide much more guidance to applicants, but the amount of scrutiny and useful feedback received definitely varies by IRB. We currently take our guidance from ethical guidelines for medical studies, which, at this point, are clearly not sufficient: we cannot do placebo treatments (which is why this issue of information provision arises as a distinct point in economics, unlike in medicine); the concern about doing harm with our interventions is much smaller in development economics than in medicine; etc. That these guidelines exist and that there are principles we can follow is a great start, but we need more. I am sure someone is developing (or has developed), as I write, ethical guidelines for field experiments in development economics. If so, please comment below. If not, more scrutiny of such tradeoffs in the future, with proper guidance for investigators, would be most welcome.

Stay tuned for David and Markus’s comments tomorrow and Thursday, respectively. Jed has had multiple blog posts on the subject of ethics (see, in addition to the link above, here and here), so he may be taking a pass on this one…

Comments

Glad to read you highlighting the pivotal role that IRBs can and should play. A huge lacuna in the governance structure of medical/clinical trials in developing nations is the absence, or the totally dysfunctional and indeed highly collusive nature, of local research ethics committees. Medical trials are now a multi-billion dollar business.
A cursory search revealed that there is at present no IRB within the World Bank, despite its signal role in sponsoring and spearheading a whole raft of research, both experimental and non-experimental, that involves human subjects.
Of course, staffing these research ethics committees and having them run audits of data collection teams is really long overdue.
This needs in turn to be supported by better reporting on the process of recruitment and follow-up of study subjects, especially drop-outs. Journals could play a part by allowing video recordings of data collection, especially curated interviews such as focus groups, to be included on their websites once articles have been accepted for publication.

Thanks for the comments. Yes, the absence of an IRB at the Bank has been a sore spot. Those of us doing field experiments rely on IRBs at collaborators’ institutions and those in the country where the study is taking place. On non-experimental data collection, my LSMS colleagues can chime in, but I think the argument has been that WB teams work with the counterpart countries’ statistics offices, and things are handled through their procedures for human subjects.

Berk, thanks for kicking off this important discussion. I’d like to focus on your 4th point, about phased-in designs. Two things.
First, I feel I am missing something in the argument about whether delayed treatments with full information can still be unethical. Usually, the appealing premise of a phased-in design is that there is some resource constraint that would prevent simultaneous scale-up in any case. In this scenario, no matter how heavy the burden of waiting, there will have to be some rationing. In which case, why not randomization rather than something else, like patronage? Then things get odd. The suggestion seems to be that we may know, ex ante, that at least some types of people (elderly, immune-compromised) will benefit greatly from immediate receipt of the treatment. In which case, we are not in equipoise, and one wonders whether an RCT (or at least unconditional randomization) is appropriate at all. Things, of course, get trickier when a resource constraint does not bind simultaneous scale-up.

Second, I feel we should reflect on the purpose and ethics of a phased-in design, especially one with full information. I blabber on about it here as well: http://hlanthorn.com/2014/01/24/something-to-put-in-your-friday-pipeline-and-i-am-not-so-sure-about-phase-inpipeline-designs/

Again, a resource constraint may make it politically acceptable for a governor to say that she will roll-in health insurance randomly across the state, which can allow an opportunity to learn something about the impact of health insurance. So, she stands up and says everyone will get (this) health insurance at some point and here’s the roll-out schedule.

But the reason for making use of this randomization is to learn if something works (because we genuinely aren’t sure if it will, hence needing the experiment) and maybe to have ‘policy impact’. So what if what is learnt from comparing the Phase I and Phase II groups is that there is no impact, the program is rubbish or even harmful? Or, at a minimum, it doesn’t meet some pre-defined criterion of success. Is the governor in a position to renege on rolling out the treatment/policy because of these findings? Does the fine print for everyone other than those in Phase I say “you’ll either get health insurance, or, if the findings are null, a subscription to a jelly-of-the-month club”? In some ways, a full-disclosure phased roll-in seems to pre-empt and prevent policy learning and impact *in the case under study* because of the pre-commitment of the governor. I find that phased roll-in designs without a plan to pause, analyse, reassess and at least tweak the design between Phases I and II to be ethically troubling. I’d be interested in your thoughts.

Also, I would like to echo the point that the farther from a laboratory you get, the less helpful is the present IRB system.

I think we're in agreement on your first point. If you have a pretty good idea that someone needs your intervention now rather than later, you might just exclude that group from the study and treat them immediately. This is subject to the caveats about knowing need raised by David and Markus (Markus has a particularly nice point about his study on the effects of ARVs on productivity -- you should check it out). As a side note, I think of this less in terms of equipoise and more as 'do no harm.' If a program is an emergency aid or palliative program, its goals are different from those of programs trying to reduce future poverty: the former kind (akin to Markus' quadrant in the 2x2) should, in most cases, not be subject to randomization except under severe rollout constraints.

I think that the issue of program goals in economics is important and speaks to your second point about going through with phase II. In economics, unlike in medicine, many times the programs we have involve transferring something to individuals, households, or communities (assets, information, money, etc.). Without negative spillovers, we don't think of these as ever not increasing individual welfare, at least temporarily: If I give you a cow, this is great for you. If you don't like it, sell it: your individual welfare will increase (would have been even higher if I just gave you the cash). But, what if my program's goal is not a temporary jump in your welfare, but you escaping poverty as close to permanently as possible? The program could be deemed unsuccessful even though it raised welfare of its beneficiaries for a short period.

The point is, it does seem wrong to break your promise to give something (something people would like to have) to people who drew Phase II in the lottery because you deemed your program unsuccessful in reaching its goals. You promised people at the outset that you'd give them the treatment, so I'd argue that if you break your promise you have to give them something at least as good if not better. If you can come up with this (and the phase II group is happy with your decision), perhaps they can even become your phase I group in a new experiment -- in a process where you experiment, tweak, experiment again, … Kind of like what Pritchett et al. argue we should do: a lot more experiments, not fewer…

Thinking of your examples: with the Oregon healthcare reform, it would be hard to push a stop or pause button through legislation. Government action takes time, and the credibility of your policymakers is at stake. I don't think you could really argue for a stop/pause on the grounds that the impacts (even if unequivocal) are too small to justify treating the lottery losers. In the case of a project that is giving cows, I am more optimistic: it might be possible for the project to find an alternative treatment that is of equal or higher value, that is acceptable to the phase II group, and that is feasible to roll out quickly. In such cases, I could see a tweak of the intervention between the two phases.

Finally, this actually reminds me of an issue that I have been thinking about recently -- regarding data collection plans/phases rather than program phases. Many times, we have funds secured for multiple rounds of follow-up. But you may find yourself in a situation where your early rounds of data collection are so unpromising that you give up hope of finding any effects. First, and pretty obviously, in such cases it would probably be better to return the money to the funder or switch it to a more productive endeavor. Second, funders of research may benefit from encouraging more research proposals to apply for phased-in funding, conditional not on the production of data but on evidence of promising results. Sure, there may be 'sleeper effects' and the like, but it would be good for researchers to propose ahead of time what defines success (or at least findings promising enough to continue data collection) at each round of impact analysis. This is related to Epstein and Klerman's recent paper asking when a program is ready for evaluation. Both funding of data collection and program implementation -- conditional on program success -- are, I think, promising avenues to pursue…

Thanks much for engaging, Berk. The point about experiments such as the Oregon health care lottery leads me to believe that in these situations (where it will be difficult to pause or roll back the intervention because of entitlements and political reputation), we would do better to ask whether enrollment mechanism A or B is more (cost-)effective in delivering health insurance, rather than asking: health insurance, good or bad?

I do wish that more such endeavors would explicitly have analysis and tweaking periods between phases (along the lines of MeE).

The Epstein paper that you alluded to, on evaluability assessment, should be de rigueur for both researchers and grant officers. In fact, I would summarize it as an FPPO cycle: carry out detailed Formative research, or due diligence, to prospectively assess the need for the intervention, including the ethical issues around rollout; Pilot the intervention at small scale, not necessarily with a comparator; follow up with a detailed Process evaluation to get a handle on the policies, procedures, and key personnel implementing the program in the treatment areas only; and finally run a rigorous Outcome evaluation, through experimental, quasi-experimental, or case-control methods, to get net effects, etc.

Indeed, once an Outcome evaluation is green-lighted, protocol development -- including pre-analysis plans, examination of safeguards for human subject safety, and plans for communication/dissemination of study results, data distribution, archiving, etc. -- needs to be formalized. Standardized guidelines for this exist and provide a decent benchmark for assessing the readiness of program staff to undertake an RCT. There was the ASSERT statement checklist, which is now subsumed into the SPIRIT statement, which may be accessed from:

http://www.spirit-statement.org/spirit-statement/
Note the need for mandatory trial registration. If and when an IRB-type body is fully fleshed out within the World Bank, it would be important for it, and for the concerned public, to be able to access a ready, cross-referenced repository of all planned, on-going, completed, and abandoned studies, especially RCTs, for periodic audit checks. In point of fact, protocols of studies funded through a rigorous peer review process, such as NIH/NSF grants, should be published in top economics journals and be available open-access online, as is done in biomedical journals such as Trials. Accord these the same status and invite the same scrutiny as peer-reviewed articles.