Should you care about impact heterogeneity?

If you want to know the average impact of being assigned the option of some “treatment”— the so-called “intent-to-treat” parameter—then you will get a good (unbiased) estimate by comparing the mean outcome for an experimental group that is randomly assigned the treatment with that for another group randomly denied that option.

However, it is often the case that we also want to know the impact of actually receiving the treatment. Participation in social experiments is voluntary and selective take-up is to be expected. This is a well-known source of bias. The experimenter’s standard fix is to use the randomized assignment as an instrumental variable (IV) for treatment status. Advocates of this approach argue that randomization is an indisputable source of exogenous and independent variation in treatment status—the “gold standard” for identifying causal impact.

One requirement for a valid IV can be readily granted: the randomized assignment to treatment will naturally be correlated with receiving the treatment. But let’s take a closer look at the other, no less important, requirement, namely that the IV only affects outcomes via treatment—the so-called “exclusion restriction.”

Any imaginable intervention will have diverse impacts, possibly with losers as well as gainers. Such impact heterogeneity can be safely ignored if the differences are uncorrelated with the actual placement of the intervention. But that is hardly plausible. People make rational choices about whether to participate in experiments. Some of the reasons for their choices can be observed as data and so we can control for them by adding interaction effects with the treatment status to the regression for outcomes. That is good practice for understanding impact heterogeneity.

However, people almost certainly base their choices on things they know but the analyst does not know. Take-up will depend on the latent gains from take-up. This gives rise to what Jim Heckman and his coauthors call “essential heterogeneity.” (The idea goes back a long way in Heckman’s writings, but for a good recent discussion see the 2006 paper by Heckman, Urzua and Vytlacil.) This is such an intuitively plausible idea that the onus should be on analysts to establish on a priori grounds why it does not exist. Yet it is still rare for experimenters to consider the implications of essential heterogeneity. (This is not, of course, the only problem faced in practice; I discuss a much wider range of issues here.)

It is not hard to see that essential heterogeneity invalidates randomized assignment as an instrumental variable for identifying mean impact. Amongst the units assigned the option for treatment, those with higher expected gains are more likely to participate. Thus the behavioral responses will entail that the assignment is correlated with the error term—specifically with the interaction between treatment and the latent gains from treatment. The randomized assignment is not then excludable from the main regression, and so it does not provide a valid IV—hardly the gold standard!

Just how much of a problem this is depends in part on what you want to learn from the impact evaluation. If you only want the mean impact for those actually attracted to the treatment in the randomized trial then the IV estimator will give you that number in sufficiently large samples. And this holds even though the randomized assignment is not a valid IV. In some applications this might be all you need, though it is rather limited information. You will not know how much the mean impact for those treated differs from the overall mean impact in the population. If you ignore the (likely) presence of essential heterogeneity, and assume that you have figured out the overall mean impact, then you could easily get it wrong in drawing inferences for scaling up the program based on the trial—which is after all the trial’s purpose.

A simple example will illustrate. First, let me describe the reality of a (highly) stylized world comprising 100 people. A policy intervention is introduced—access to an important new source of credit for financing investment. Counterfactual income (in the absence of the program) can take two possible values, namely an income of either $1 or $2 a day, with equal numbers of people having each income. For half those with $1, the causal impact of the credit scheme over some period is $1 (bringing their post-intervention income to $2), while it is zero for the other half. Similarly for those with the $2 income: half see their income rise to $3 while the rest see no gain. Then the mean impact is $0.50 and the total benefit when the credit scheme is available to all is $50. It can be assumed that only those who gain will participate, implying a take-up rate of 50% for the population as a whole when scaled up, or for any random sample.
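As a quick check on the arithmetic, here is a minimal sketch of this stylized world (the list layout and variable names are mine, not from the example itself):

```python
# Stylized world: 100 people with counterfactual incomes of $1 or $2
# a day (50 each); within each income group, half gain $1 from the
# credit scheme and half gain nothing.
counterfactual = [1] * 50 + [2] * 50
gain = ([1] * 25 + [0] * 25) * 2  # aligned so half of each income group gains

mean_impact = sum(gain) / len(gain)              # overall mean impact
total_benefit = sum(gain)                        # total benefit if offered to all
take_up = sum(g > 0 for g in gain) / len(gain)   # only gainers participate

print(mean_impact, total_benefit, take_up)  # 0.5 50 0.5
```

This reproduces the numbers in the text: a mean impact of $0.50, a total benefit of $50 when the scheme is available to everyone, and a take-up rate of 50%.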

Now suppose that the policy maker does not know any of this, and decides to enlist an evaluator to do a randomized controlled trial to assess the likely benefits from introducing the credit scheme as an option for everyone. Following common practice, the evaluator mistakenly assumes that the scheme has the same impact for everyone (or that the heterogeneity is ignorable). As usual, a random sample gets access to the extra credit, with another random sample retained as controls. It is readily verified that the IV estimate of mean impact will be $1 in a sufficiently large sample, which is also the mean impact on those treated. Ignoring the heterogeneity, the policy maker will infer that the aggregate benefit from scaling up is $100—twice the true value.
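The evaluator's calculation can be verified directly. Using assignment as the instrument amounts to the Wald estimate: the intent-to-treat effect divided by the difference in take-up rates between the assigned and control groups. A sketch under the example's assumptions (variable names are mine):

```python
# Wald (IV) estimate in the stylized trial: random assignment is the
# instrument for actual take-up, and only those who gain participate.
counterfactual = [1] * 50 + [2] * 50
gain = ([1] * 25 + [0] * 25) * 2

# Assigned group: gainers take up and realize their $1 gain.
assigned_outcomes = [c + g for c, g in zip(counterfactual, gain)]
assigned_take_up = sum(g > 0 for g in gain) / 100   # 0.5
# Control group: the scheme is not an option, so take-up is zero.
control_outcomes = counterfactual

itt = sum(assigned_outcomes) / 100 - sum(control_outcomes) / 100  # 0.5
iv_estimate = itt / (assigned_take_up - 0.0)                      # 1.0

print(itt, iv_estimate)  # 0.5 1.0
```

The IV estimate is $1: the mean impact on the treated, not the population mean impact of $0.50. Multiplying $1 by 100 people yields the policy maker's mistaken $100 projection, twice the true $50.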

You might still feel confident that using randomization as an IV does at least do better than simply ignoring the problem of endogenous take-up—by using the naïve ordinary least squares (OLS) method of simply comparing the mean outcome for those treated with that for the control group. But such confidence would be misplaced.

Indeed, if essential heterogeneity is the only econometric problem to worry about then the naïve OLS estimator also delivers the mean treatment effect on the treated; the IV and OLS estimates converge in large samples. I show this in a new paper, found here. (Notice that in the above numerical example, the OLS estimate is also $1.) There is no gain from using the IV method! Indeed, OLS requires less data since one does not need to know the randomized assignment and the control group need only represent those for whom treatment is not an option.
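In the numerical example, the naive OLS comparison can be sketched the same way: the mean outcome of participants minus the mean outcome of the controls (again a toy calculation of my own):

```python
# Naive OLS in the stylized example: mean outcome of those actually
# treated minus mean outcome of the controls.
counterfactual = [1] * 50 + [2] * 50
gain = ([1] * 25 + [0] * 25) * 2

treated_outcomes = [c + g for c, g in zip(counterfactual, gain) if g > 0]
control_outcomes = counterfactual  # the scheme is not an option for controls

ols_estimate = (sum(treated_outcomes) / len(treated_outcomes)
                - sum(control_outcomes) / len(control_outcomes))
print(ols_estimate)  # 1.0, the same as the IV estimate
```

Participants average $2.50 against $1.50 for the controls, so the naive estimate is also $1: exactly what the IV method delivered, with no use made of the randomized assignment.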

The two estimators only differ in large samples when there is some other source of bias. An extension to the standard formulation of the essential heterogeneity problem is to allow the same factors creating the heterogeneity to also matter to counterfactual outcomes. (I develop this extension in the new paper.) If the higher counterfactual outcomes due to these latent factors come hand-in-hand with higher returns to treatment then the IV estimator can still be trusted to reduce the OLS bias in mean impact. A training program providing complementary skills to latent ability is probably a good example.

But here’s the rub. There is no a priori reason to expect the two sources of bias to work in the same direction. That depends on the type of program and behavioral responses to the program. If the latent factors leading to higher returns to treatment are associated with lower outcomes in the absence of the intervention then the “IV cure” can be worse than the disease. The following are examples (which are described more fully here):

· A training program that provides skills that substitute for latent ability differences, so that the program attenuates the gains from higher ability.

· A public insurance scheme that compensates participants for losses stemming from some unobserved risky behavior on their part.

· A microfinance scheme that provides extra credit for some target group, such that participation attenuates the gains enjoyed by those with greater access to credit from other sources.

In such cases there is no reason to presume that using randomized assignment as the IV will reduce the bias implied by the naïve OLS estimate of aggregate impact. Indeed, there is even one special case in which the OLS estimator (unlike the IV one) is unbiased for mean impact (as described here)—the essential heterogeneity can be ignored but so too can the randomized assignment! Granted, this is a “knife-edge” result. But even when both estimators are biased, it can be shown that averaging the two can reduce the bias under certain conditions.

I draw two main lessons from all this. First, to learn about development impact, whether or not you use randomization, there is no substitute for thinking about the likely behavioral responses amongst those offered the specific intervention in the specific context. Second, once one considers plausible (rational) behavioral responses, past claims that randomized assignment is the “gold standard” for identifying causal impact are seen to be greatly overstated; valuable maybe, but certainly not gold!

Comments

It is true that randomized eligibility does not identify the mean impact on the whole population from which your sample is drawn. So if by "scaling up" you mean bringing more people into the program other than those who would participate when eligible, then you are right, the randomization of eligibility does not tell you what would happen to those folks on average. But why would this be the effect of interest? If you are talking about a program with voluntary participation, it seems to me that the average effects of greatest interest are 1) the option value (the ITT), and 2) the average effect on those who would take up the program were it offered to them.
However, if by scaling up you mean expanding eligibility, then the randomization of eligibility does tell you about the average impact on participants under expanded eligibility (ignoring other changes over time that might affect outcomes). If the intervention under expanded eligibility is identical to the program used during the randomized trial, then the same types of individuals who selected into your program during your pilot will continue to do so, and average impacts on participants will tend to be the same as what you estimated using your pilot. I may be mistaken but I believe this is why Heckman and Vytlacil call the effect estimated under randomized eligibility the "Policy Relevant Treatment Effect" in one of their HB of Econometrics chapters.

Scaling up was not a key issue for my paper, but I agree it is worth considering further, so thanks for your comment. Here are my views on the matter.
The issue is whether the determinants of participation rates can be assumed to be the same once a policy is scaled up. In general, I expect much will change. There are even certain policies that entail essentially mandatory take-up in some population. Then I think we agree that an estimation method that only delivers mean impact on those who choose to take up the treatment in a randomized trial could be very deceptive for scaling up.
But other things will surely change for policies or programs with optional participation, and those changes will depend on how the policy maker interprets the findings of the randomized trial. A policy maker who thinks there are limited benefits beyond the proportion of the sample who participated in the trial will not encourage any higher participation rate on scaling up. By contrast, a policy maker who thinks that there are lots of unexploited benefits will behave differently.
These two cases map into the difference between the standard theoretical model underlying the interpretation of randomized evaluations and the more general model based on essential heterogeneity (with implications for the counterfactual, as outlined in my paper). Prevailing practice is to assume common impact for all (or that the impact heterogeneity is ignorable). Then the policy maker will presume that the estimate obtained using randomized assignment as the instrument for treatment status is BOTH mean impact and mean impact on the treated (in large samples); there is no difference between the two—just as in a perfect experiment. Benefits from scaling up will be expected—benefits that will not be realized in practice, given essential heterogeneity. The policy maker will no doubt be encouraged by the evaluation to chase these benefits through active efforts to promote greater participation. But it will be an illusion.
That gives another perspective on the issue. Further thoughts welcome.
Martin

Thank you for your thoughtful response (I am the mysterious "CM" responsible for the above comment). These are excellent points. To me, this speaks to the importance of 1) Putting structure on the average treatment effects identified by different designs in order to model how program benefits change with the participation rate, and 2) Studying the participation decision as well as impacts when conducting evaluations, rather than treating it as a nuisance. One possible way to address both of these issues is through encouragement designs, allowing one to trace out the average response at different levels of participation, as well as see what types of people participate under different incentives. I realize this won't get at general equilibrium effects that might occur when scaling up a program, or deal with macro-level changes that might occur, but it is a start.

Thanks Martin for this terrific and thoughtful post.
Yes, many have called randomized trials a "gold standard", but which of these people meant, thereby, that randomized trials are flawless in all facets and in all settings? I would say: none. I have always interpreted the "gold standard" phrase as a claim about the relative merits of randomized treatment in the settings where it is feasible. This valuable post does not address the question of relative merits.
This is because every caveat discussed above applies to every evaluation method we have: Treatment effects can be heterogeneous conditional on unobservables in regression discontinuity designs, instrumental variables, propensity score matching --- always and everywhere. And concerns about treatment effect heterogeneity would be much stronger in other less rigorous research designs often employed in development work, such as ordinary least squares regressions and qualitative interviews.
The last paragraph's conclusion that randomized treatment is only maybe valuable at all---and certainly not a "gold standard"---is therefore unwarranted. If "gold standard" means unconditional flawlessness, then no one I know ever claimed that, so the point is not useful. If "gold standard" means relatively desirable when possible, then the conclusion is unjustified, because pointing out problems that afflict all methods does not address the question of relative merits.
That said, this post makes a strong and helpful case that theory and related nonexperimental results are critical in properly interpreting the results of randomized trials. That is most certainly true and the helpful examples you offer should be studied by everyone in this field.

Michael,
I think you are reading too much into my post. I did not make any comparison between experimental and non-experimental methods. My paper and post are about what to do in analyzing data from an experiment, as in fact I am doing in other work—which led me to write on the topic. We all agree that social experiments are typically imperfect, due to selective take up. This is not at issue. Rather the issue is the case for using randomized assignment as an instrumental variable for (endogenous) treatment status. Some of the received wisdom on this issue amongst practitioners is far from obvious on closer inspection. Under essential heterogeneity, randomization is not a valid IV for mean impact and nor is it necessary for mean impact on the treated if essential heterogeneity is the only problem. And if the essential heterogeneity comes with systematic (latent) differences in counterfactual outcomes then we can’t even claim that using randomization as the IV does any better than ignoring the endogeneity problem stemming from selective take-up entirely. The less rigorous naïve method may well do better.
That surely calls for skepticism about the “gold standard” claims for randomization as an IV that one still hears. It also calls for caution in using a term you use, and I also hear often: “rigorous research.” Along with most economists, you would no doubt think that the IV method is rigorous and OLS is not. Think again. It is not so clear once we consider behavior.
Martin

Martin, I agree with Michael that this is thoughtful and interesting. But in light of the first paragraph, which states the unbiasedness of OLS on randomized intent-to-treat, doesn't the last one overreach, at least in spirit? In a sense, if you regress on randomized ITT, you *can* ignore behavioral responses, right? (Granted, probably you'll gain more insight if you don't ignore them.)
I think a reader of the example could get the impression that essential heterogeneity is such a devilish problem that randomization cannot solve and the policymaker will almost inevitably go astray. But all the policymaker needs for rescue is a regression of the outcome on intent to treat, which would return the correct average impact of $0.50, right? Not coincidentally, it is this regression that seems most policy-relevant to me, because intent to treat is what the policymaker controls.
--David

Sorry, I forgot to add: In response to Michael, you stated that you are not comparing experimental and non-experimental methods. So in challenging the use of the term "gold standard," I gather you are comparing OLS-on-treatment and 2SLS-on-treatment-instrumenting-with-ITT *within the context of an experiment*. Who has called the latter the gold standard within the experimental context? I think it would be good to establish who, in order not to set up red herrings.
My impression is that OLS on ITT, rather than either of the above two, is promoted as most reliable.
(For me "gold standard" summons images of the Olympics. There, gold means better than the available alternatives, not perfect.)
--David

Thanks for the comments, David. We can agree that if you don’t care about behavior then you can ignore it. That is what focusing solely on ITT is like. But I don’t think many of us (including policy makers) would be happy only knowing ITT, and you hint as much yourself. Suppose we find that ITT is roughly zero. Is that because of the take-up process or did the treatment have low impact for those treated? Can’t say from ITT alone. Going a bit further we might want to ask: How would the impact change if I also changed some of the factors influencing take-up, such as accessibility to the intervention or the scale? And what if I changed design parameters of the intervention? All this is about behavior, and hence my emphasis. I don’t think I am overreaching.
On your second comment, I don’t think I am setting up a red herring either. There seems to be a quite widely held view that a randomized IV is the best you could ever have—the “gold standard” for IVs. Indeed, I have heard some prominent development economists argue that it is not a valid IV unless it is randomized. This is based on a view that only a randomized assignment is excludable from the main regression; that it is uncorrelated with the error term. But that is wrong under essential heterogeneity. Behavioral responses to the option of treatment, based on expected gains, are all one needs to invalidate the randomized IV. This is not saying that non-experimental methods are better, only that we need to apply the same standards in discussing the validity of IVs to the randomized ones. Ultimately identification is about whether one accepts the assumptions one makes about behavior.
Martin