Learn to live without external validity

We promised some time ago to review the recent working paper by Pritchett and Sandefur on external validity, and the title of this post is the main take-away for me: my name is Berk Özler and I agree with this specific message. However, while I’d like to say that there is much more here, I am afraid that I, personally, did not find more to write home about...

The basic premise of the paper is the following: variation in effect size across contexts is larger than the variation within a context. Variation across contexts is more or less defined as variation in well-identified effect sizes (ES), while variation within is the difference between a poorly identified estimate and the well-identified one within the same context. In other words, if you ran an OLS of vouchers on student performance using the baseline data in your experiment that provided vouchers and compared the biased ES with the experimental one, that would be your within variation. If you compared the experimental ES in Colombia with one from Tanzania, that would be your across variation.
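
To make the two definitions concrete, here is a minimal Python sketch. The effect sizes are made-up numbers chosen purely for illustration, not figures from the paper, and the mean absolute gap is a simple stand-in for whatever error measure the authors actually compute. The context labels and the function names `within_gaps` and `across_gaps` are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical effect sizes in standard deviations, invented for illustration only.
# "ols" is the biased, non-experimental estimate; "rct" is the well-identified one.
estimates = {
    "Context A": {"ols": 0.30, "rct": 0.20},
    "Context B": {"ols": 0.15, "rct": 0.05},
    "Context C": {"ols": 0.60, "rct": 0.50},
}


def within_gaps(estimates):
    """Gap between the biased and well-identified estimate inside each context."""
    return {ctx: abs(e["ols"] - e["rct"]) for ctx, e in estimates.items()}


def across_gaps(estimates):
    """Gap between well-identified estimates for every pair of contexts."""
    return {
        (a, b): abs(estimates[a]["rct"] - estimates[b]["rct"])
        for a, b in combinations(estimates, 2)
    }


mean_within = mean(within_gaps(estimates).values())
mean_across = mean(across_gaps(estimates).values())

print(f"mean within-context gap: {mean_within:.2f}")
print(f"mean across-context gap: {mean_across:.2f}")
```

With these invented numbers the well-identified estimates differ more across contexts than the biased and well-identified estimates differ within any one context, which is the kind of pattern the paper's argument rests on.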

The authors look at studies of class size effects, private/public school effects, and Mincerian returns to education to find that the across variation dwarfs the within one. This leads the authors to suggest that you’re better off choosing biased estimates from your own context (whatever that exact context is and provided that the study exists) than a well-identified estimate from another context. You’ll have to read the paper yourselves to decide whether you’re convinced by the evidence provided here, but here are four possible objections:

Some of this is based on findings of large across variation due to one study that found an effect of 3.5 SD (not a typo) for a reduction of 10 students in class size!

Some of the within variation is OLS vs. OLS with controls/FE (rather than OLS vs. RCT). You have to also start worrying about publication bias here, which the authors mention.

Your structural model of effect sizes specifies that the effects are on standard deviations of test scores (rather than, say, absolute changes) without any conditioning for baseline levels.

Your structural model also specifies that the effect of reducing class size from 60 to 50 is the same as 40 to 30.

The authors’ definition of what is well identified and what is not is also somewhat arbitrary, if not odd. As an economist, you can pick up any recent causal inference/quant methods course syllabus from a Master’s or PhD program, and you will find that we teach RCTs, natural experiments, RDD, IV, PSM, DD (and more recently synthetic controls) – all as part of our toolbox. It’s strange for two economists to separate the first three from the latter three. If the authors were following the public health model, then they’d limit the first group to RCTs only and treat everything else as not well identified.

Pritchett and Sandefur build a simple structure in which a raw correlation consists of two components: a causal effect and a selection bias. They then argue as follows: suppose I have two different biased estimates from two different contexts, and a well-identified ES from a third context falls between them. It is then logically incoherent for me to let that third ES shift my priors about the true relationship by even an iota, because treating it as the true effect implies selection biases of opposite signs in the first two contexts (see page 8). The only incoherence that I see here is the presumption that selection bias should work in one direction only. It doesn’t, and the authors give a perfectly good example of why not when they discuss the evidence from Cox and Jimenez (1991) on the returns to private/public schooling (page 19). Students who could not get into public secondary schools ‘negatively’ selected into private schools in Tanzania, while students from richer families ‘positively’ selected into elite private schools in Colombia. My first impact evaluation was about school autonomy in Nicaragua, where a set of schools was initially selected for the reforms: the selection could be positive if the administrators wanted a positive demonstration effect, or negative if they targeted the worst performing schools.
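
In symbols, the argument can be sketched as follows. The notation is introduced here for illustration and is not the paper's: \hat{\beta}_c is the naive estimate in context c, \beta_c is the true causal effect there, and \delta_c is the selection bias.

```latex
\begin{align*}
% A naive estimate is the causal effect plus selection bias.
\hat{\beta}_c &= \beta_c + \delta_c, \quad c \in \{1, 2\} \\
% A well-identified estimate from a third context falls between the two.
\hat{\beta}_1 &< \beta_3 < \hat{\beta}_2 \\
% Treating it as the true effect everywhere forces opposite-signed biases.
\beta_1 = \beta_2 = \beta_3 \;&\Longrightarrow\; \delta_1 = \hat{\beta}_1 - \beta_3 < 0, \quad \delta_2 = \hat{\beta}_2 - \beta_3 > 0
\end{align*}
```

Opposite-signed biases are a contradiction only if one insists that selection runs the same way in every context.
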

So, we circle back to the same old, and by now somewhat tired, ‘external validity’ debate: the context is never the same between any two programs (perfectly evidenced by two separate papers, one by each of the authors: Bold et al. (2013) and Pritchett et al. (2012)). At the extreme, any finding today is irrelevant for a project being designed for tomorrow. Soooo on and sooooo forth…

The authors claim that they are not building a straw man to then knock down. I was happy to see this statement at the outset of the paper, but was ultimately disappointed. When the chips are down, are people really making policy by taking the only experimental finding from Dunedin, Te Wai Pounamu, Aotearoa and using it in Regina, Saskatchewan, Canada – completely ignoring the local conditions and the less than perfect evidence from closer, more similar settings? Perhaps the authors are reacting to the paper by Banerjee and He that they discuss at some length. So, some people may have gotten carried away with certain recommendations; maybe some overly enthusiastic young researcher was guilty of uttering a careless claim at a conference or in a donor meeting about RCTs being the best and everything else being garbage. Is this really evidence of a big problem in our field? Annoying? Sure. Problem? Yeah, nah. I have been an employee of the World Bank for more than 12 years and I am not aware of anyone who advocated blindly reducing class sizes everywhere due to one finding from Israel or because of calls from arguably influential figures like Abhijit Banerjee. Lant had an even longer tenure at the WB than me – perhaps he has good reasons to think otherwise…

Ultimately, the pitfalls assigned to RCTs are more or less the same for non-RCTs. The experimental site was chosen for researcher convenience; the non-experimental one was chosen for data availability in that particular district. We know exactly how that small experiment was run, which may differ from how it would run at scale; you can barely describe the implementation of the large government program that you studied non-experimentally. The sample of that experiment is selected and not representative; by the time you finished your nearest-neighbor matching exercise and tossed out the observations outside the common support to obtain a ‘good match,’ your sample lost its representativeness. Marginal experiments are being published in good journals; important null results are more likely to not get published (Brodeur et al. 2012). I could keep going…

I find it more useful to think about what I can do with RCTs that is harder to do using other methods. I can design exactly the ideal experiment to answer a question of interest (if the question is not interesting, the funders won’t give me the money). I can control implementation, which allows me to not only describe the program exactly, but also to oversee implementation integrity. I can design the data collection modules. I can think of efficacy vs. effectiveness. Most importantly, I can generate random variation in things that are extremely unlikely to be found naturally: handing a voucher for contraceptives to a wife vs. a husband, or varying the intensity of citizens encouraged to vote within the exact network of interest to study spillovers.

What do I then do with the findings from an experiment I designed and ran? I try to be responsible: I don’t run to policymakers in Indonesia, who may or may not have asked me what I think of CCT design in their country, and regurgitate proven effects of conditionality. I try to stay humble: I make it clear that I know little about the background in Indonesia; I tell them what we found in Malawi, add the necessary caveats, summarize the remainder of the literature, and perhaps suggest what kind of data to look at for some back-of-the-envelope calculations in their context. Most importantly, I try to touch on the interesting things that we learned: the surprising findings that made us think harder about theory or the complex findings that revealed something that we suspected before but for which we didn’t have evidence.

That’s the reason I like experiments: they allow us to answer questions that are hard to examine otherwise. This is the flip side of the criticism that you normally hear about RCTs: that many questions of interest cannot be examined this way. There is no logical incoherence in this tradeoff. And, there is within-researcher variance: for our last funding proposal, we toyed around with a simple diff-in-diff methodology with which we would have been perfectly fine had we found pre-baseline data on the outcome variable justifying the common trends assumption.

Pritchett and Sandefur conclude their paper by advising governments, NGOs, etc. to make use of RCTs in a more nimble way to test, evaluate, and refine their practices. I fully agree. They also warn against systematic reviews, presumably of the kind that is restricted to RCTs only, and advise against a quest for a set of universally true parameters. I agree with this as well. But, having recently completed a meta-analysis of cash transfer programs, I have to say that I see straw men here as well. We included all studies that attempted to provide causal identification – there was never a thought that the eligibility criteria would rule out non-RCT studies. And, the bottom line from this systematic review, on one of the most studied topics in development economics, confirms that we should get used to living without external validity – regardless of the causal identification method used: even though all schooling effects of cash transfer programs are positive (no controversy there), if you asked me the confidence interval implied by our random effects model for a policymaker who is considering designing a new program tomorrow, I’d have to tell you that it encompasses zero. That’s because the heterogeneity of ES across studies is large, and many of the usual suspects that you think would explain the heterogeneity were not helpful.
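
A small sketch may help show how both statements can be true at once: every study positive, yet an interval for the next program that includes zero. It uses a DerSimonian-Laird random-effects estimator, which is a common choice but an assumption here rather than a description of the review's actual model. The effect sizes and standard errors are invented for illustration and are not taken from the review. For simplicity the interval for a new program uses a normal critical value of 1.96 instead of a t-distribution, and the function name `random_effects_summary` is hypothetical.

```python
import math


def random_effects_summary(effects, std_errors, z=1.96):
    """DerSimonian-Laird random-effects summary (illustrative sketch)."""
    k = len(effects)
    # Inverse-variance (fixed-effect) weights and pooled mean.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q and the method-of-moments between-study variance tau^2.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each study's sampling variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_mu = math.sqrt(1.0 / sum(w_star))
    # Interval for the average effect across studies.
    ci = (mu - z * se_mu, mu + z * se_mu)
    # Interval for the effect in a single new setting: adds tau^2 back in.
    half_width = z * math.sqrt(tau2 + se_mu ** 2)
    prediction_interval = (mu - half_width, mu + half_width)
    return {"mean": mu, "tau2": tau2, "ci": ci,
            "prediction_interval": prediction_interval}


# Hypothetical effect sizes (all positive) and standard errors, for illustration only.
effects = [0.05, 0.10, 0.40, 0.60, 0.15, 0.80]
std_errors = [0.04] * 6

summary = random_effects_summary(effects, std_errors)
print("average effect CI:", summary["ci"])
print("interval for a new program:", summary["prediction_interval"])
```

In this made-up example the interval for the average effect sits above zero, while the interval for a new program straddles it, because the between-study variance is large relative to the mean.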

At this point, I do sincerely believe that there is no substitute for experimenting on your own. Let studies from elsewhere and previous studies from your own setting be good guides, great food for thought, and a starting point. Think hard about the problem and scrutinize the details of the extant evidence carefully. Then, start experimenting. I hope that Lant and Justin agree…

Comments

The Pritchett and Sandefur paper is itself a bit of a straw man: the authors have obviously produced a rushed piece of work to try and get ahead of others doing more careful analysis. It's a weak and basically ignorable contribution so let's focus on the real issues.

In your reply you say that the debate on external validity is "same old, and by now feeling somewhat tired". Last I checked that debate hasn't actually happened. Did I miss the Mostly Harmless Econometrics section on that?

"at the extreme, any finding today is irrelevant for a project being designed for tomorrow. Soooo on and sooooo forth…". So your response is basically to yawn at the problem? What happened to 'rigor'? Or is that only for internal validity?

The basic fact is that you don't know how to rigorously obtain policy lessons. Isn't it about time you said so?

Thanks for taking the time to comment. On the issue of the debate on external validity, and whether it's getting tiring discussing it only in the context of RCTs (or the new empirical development economics), we can set aside all the stuff from the past two years (including repeatedly on this blog) and see this symposium from the EPW, which includes contributions from Banerjee, Bardhan, Basu, Kanbur, and Mookherjee: http://www.arts.cornell.edu/poverty/kanbur/NewDirectionsDevEcon.pdf

That I wrote a 1,500-plus word blog post isn't indicative of a yawn, but rather the disappointment of hearing the same arguments on external validity, almost exclusively made as a criticism of experiments in economics (glad to say that the paper covered above is not one of them). The more serious work addressing some of the important issues surrounding external validity (three of the papers cited above and some, like Aronow and Samii, cited within the Pritchett/Sandefur paper) is suggesting so far that there is nothing simple about external validity for policy making.

I believe rigor, and not the lack of it, is what makes (or should make) us humble about the limits to which we can give policy advice based on the extant evidence. Development practiced as cookie cutter advice to policymakers is potentially the worst kind. The evidence doesn't tell you what you should do, but rather gives you the frame within which you should consider your problem and the potential solutions.

We'll be looking for your contribution that is neither weak nor ignorable: perhaps it can even be a new entry into the next edition of Mostly Harmless Econometrics...

This debate seems to ignore what seems to me the most important application of RCTs. That is, comparing different policy options available to an organization/government in achieving a specific goal before rolling out programs. The issue of external validity should be far smaller when the RCT is designed with a specific domestic policy set within the population of interest. I am surprised this is not the direction development program RCTs are going. Or is it?