Questioning the External Validity of Regression Estimates: Why they can be less representative than you think.

A common critique of many impact evaluations, including those using both experimental and quasi-experimental methods, is that of external validity – how well do findings from one setting export to another? This is especially the case for studies done on relatively small samples, although as I have ranted before, there appears to be a double standard in this critique when compared to both other disciplines in economics and to other development literature.
In contrast, there is a feeling that although one needs to be more concerned about internal validity, multiple regression estimates taken from large, representative samples should do much better in terms of external validity. For example, Pranab Bardhan writes: “RCTs face serious challenges to their generalizability or ‘external validity.’ Because an intervention is examined in a microcosm of a purposively selected population, and not usually in a randomly sampled population for any region, the results do not generalize beyond the boundaries of the study. For all their internal flaws, the older statistical studies, often based on regional samples, permitted more generalizable conclusions.”

Rethinking the representativeness of regression estimates
However, a new paper by Peter Aronow and Cyrus Samii shows that this need not be the case. Consider the following linear regression model:
Y = a + bT + cX + e
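Where our interest is in the effect b of some treatment T. Then they show that multiple regression yields an estimate that converges to a weighted average of the individual treatment effects bi:
b_hat → E[wi bi] / E[wi], where wi = (Ti - E[Ti | Xi])^2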

Implications:

This is not typically equal to the population average of the individual treatment effects bi. Instead it is a weighted estimate, with observations with less predictable treatment statuses given more weight.
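To make the weighting concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: a single binary X, treatment probabilities of 0.5 and 0.1, and individual effects of 1 and 3. It takes the weights to be the squared residuals from regressing T on X, (Ti - E[Ti | Xi])^2, which is the form they take in the Aronow and Samii result, and it obtains the OLS coefficient on T by partialling X out of T.

```python
import random

random.seed(0)
n = 20000

# Hypothetical data: treatment is a coin flip when x = 0 but rare when x = 1,
# and the individual effect b_i is larger in the x = 1 group.
rows = []
for _ in range(n):
    x = random.randint(0, 1)
    p_treat = 0.1 if x == 1 else 0.5
    t = 1 if random.random() < p_treat else 0
    b_i = 3.0 if x == 1 else 1.0
    y = 1.0 + b_i * t + 2.0 * x + random.gauss(0, 1)
    rows.append((x, t, b_i, y))

# With one binary X, the fitted values from regressing T on X are cell means.
t_hat = {}
for g in (0, 1):
    ts = [t for x, t, _, _ in rows if x == g]
    t_hat[g] = sum(ts) / len(ts)

resid = [t - t_hat[x] for x, t, _, _ in rows]
weights = [r * r for r in resid]

# OLS coefficient on T from regressing Y on T and X (Frisch-Waugh-Lovell).
b_ols = sum(r * row[3] for r, row in zip(resid, rows)) / sum(weights)

# Weighted and unweighted averages of the individual effects.
b_weighted = sum(w * row[2] for w, row in zip(weights, rows)) / sum(weights)
b_simple = sum(row[2] for row in rows) / n

print(f"OLS: {b_ols:.2f}  weighted avg: {b_weighted:.2f}  simple avg: {b_simple:.2f}")
```

In this setup the x = 1 group has the larger effects but the more predictable treatment status, so it gets less weight. The population value of the weighted average is 0.26/0.17, or about 1.5, against a simple average of 2, and the OLS estimate should land near the former.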

Observations with zero weight contribute nothing to the estimate of b – these are cases where the X’s perfectly predict their treatment status. As a silly example, imagine T = being pregnant, Y = labor productivity, and X = dummy variable for being female. Since X=0 perfectly predicts T = 0, only women contribute to the estimate of b, and we don’t average in any of the “if men were seahorses” impacts. More seriously, this is the well-known point that when we use a within-estimator, only groups which exhibit variation within group contribute towards the estimate.
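The silly example can be checked in a few lines. This is only a sketch with invented numbers (1,000 people, a 20% pregnancy rate among women), again using squared residuals of T on X as the weights.

```python
import random

random.seed(0)

# Hypothetical sample: x = 1 for women, 0 for men; t = 1 if pregnant.
people = []
for _ in range(1000):
    x = random.randint(0, 1)
    t = 1 if (x == 1 and random.random() < 0.2) else 0
    people.append((x, t))

def cell_mean(g):
    """Share treated within the X = g cell, i.e. the fitted E[T | X = g]."""
    ts = [t for x, t in people if x == g]
    return sum(ts) / len(ts)

t_hat = {0: cell_mean(0), 1: cell_mean(1)}
weights = [(t - t_hat[x]) ** 2 for x, t in people]

weight_men = sum(w for w, (x, _) in zip(weights, people) if x == 0)
weight_women = sum(w for w, (x, _) in zip(weights, people) if x == 1)
share_men = weight_men / (weight_men + weight_women)

print(f"share of total weight coming from men: {share_men}")
```

Because X = 0 perfectly predicts T = 0, every man has a residual of exactly zero and the whole estimate rests on the women in the sample.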

More generally, the weighting here means that the effective sample for OLS may not be representative of any naturally occurring population.

Examples
As an example, the authors revisit a study which uses data on 114 countries over 1970-97 to look at the impact of regime type on FDI flows. The specification incorporates country and decade fixed effects, lagged FDI, and a set of control variables including lagged values of market size, development level, growth, trade, budget deficit, government consumption, and democracy. Once you take out this variation, the remaining unexplained variation in regime type is unequally distributed across countries: of the 114 countries in the sample, twelve contribute over half (51%) of the weight used to construct the estimate of the effect of regime type on FDI, and 32 contribute 90% of the weight. This is seen in the Figure below, which contrasts the nominal sample (left) with the effective sample (right): a global sample is really relying on just a few countries for identifying the effect of interest.
The authors also look at a second example from political science in which the effective sample looks much more like the nominal sample, basically because the control variables (X) in this case don’t have much predictive power for the treatment of interest.
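The kind of tally behind the country weight shares in the first example can be sketched as follows. This is not the authors' data or specification: it is a made-up panel of 20 hypothetical countries over 30 periods, where only country fixed effects are partialled out of a binary treatment, and the squared residuals are then summed by country.

```python
import random

random.seed(1)
n_countries, n_periods = 20, 30

# Hypothetical panel: the first 4 countries switch treatment status often,
# the remaining 16 almost never do.
panel = {}
for c in range(n_countries):
    p_switchy = 0.5 if c < 4 else 0.03
    panel[c] = [1 if random.random() < p_switchy else 0 for _ in range(n_periods)]

# Partial out country fixed effects (demean within country), then add up
# the squared residuals by country.
weight_by_country = {}
for c, ts in panel.items():
    country_mean = sum(ts) / len(ts)
    weight_by_country[c] = sum((t - country_mean) ** 2 for t in ts)

total = sum(weight_by_country.values())
shares = sorted(((w / total, c) for c, w in weight_by_country.items()), reverse=True)

def countries_needed(target):
    """Smallest number of countries whose weight shares sum to at least target."""
    cumulative = 0.0
    for k, (share, _) in enumerate(shares, start=1):
        cumulative += share
        if cumulative >= target:
            return k
    return len(shares)

print("countries supplying half the weight:", countries_needed(0.5))
print("countries supplying 90% of the weight:", countries_needed(0.9))
```

Countries whose treatment status never changes drop out entirely, which is the within-estimator point again. A real application would residualize treatment on the full set of controls rather than on country means alone.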

Implications
There is an important irony here – typically in an OLS regression, the X variables are included to deal with internal validity concerns – the hope is that by adding more and more controls, we soak up a lot of the variation that is correlated with both treatment and the outcome of interest. But the more we do this, the more likely it is that the remaining variation left to identify the treatment effect differs substantially across units, leaving an estimate that is not representative of the population of interest. But if we don’t control for many variables, internal validity fears are greater.

This doesn’t happen when controlling for variables when treatment has been randomized: since T is independent of X, the X’s don’t help predict treatment, the weights w are unrelated to the individual treatment effects, and the OLS estimate on experimental data converges to a simple average of the individual treatment effects.
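A quick way to see the randomized case is to simulate it. The sketch below uses invented numbers (individual effects of 1 and 3 depending on a binary X) and assigns treatment by coin flip, so X has no predictive power for T.

```python
import random

random.seed(3)
n = 20000

# Hypothetical experiment: effects differ by x, but treatment is a coin flip.
sample = []
for _ in range(n):
    x = random.randint(0, 1)
    t = random.randint(0, 1)  # randomized, independent of x
    b_i = 3.0 if x == 1 else 1.0
    y = 1.0 + b_i * t + 2.0 * x + random.gauss(0, 1)
    sample.append((x, t, b_i, y))

# Residualize T on X (cell means, since X is a single dummy).
t_hat = {}
for g in (0, 1):
    ts = [t for x, t, _, _ in sample if x == g]
    t_hat[g] = sum(ts) / len(ts)
resid = [t - t_hat[x] for x, t, _, _ in sample]

# OLS coefficient on T from regressing Y on T and X (Frisch-Waugh-Lovell).
b_ols = sum(r * row[3] for r, row in zip(resid, sample)) / sum(r * r for r in resid)
b_simple = sum(row[2] for row in sample) / n

print(f"OLS: {b_ols:.2f}  simple average of individual effects: {b_simple:.2f}")
```

Here the squared residuals are close to 0.25 for everyone, so no group is up- or down-weighted and the OLS estimate should sit near the simple average of 2.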

Note: in correspondence about this point, Cyrus correctly notes that this does not mean that there is an "internal vs external" validity tradeoff when it comes to deciding what should be "controlled for." If a covariate is not related to the treatment, its inclusion will make no difference to the weights. If a covariate is related to the treatment, it needs to be included in the conditioning set (or else there's bias). If conditional independence does not hold, then the entire proof falls apart: there is no "reweighting" of causal effects because there is no well-defined causal quantity being estimated at all.

What can you do in practice?
Aronow and Samii outline several methods for reweighting OLS estimates to give average causal effects for populations. However, these essentially require that treatment status not be deterministic for particular X’s. If it is, the best you can do is get representative causal effects for a subpopulation, akin to estimates in the common support with propensity-score matching.
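As an illustration of the general idea, and not of any specific estimator from the paper, here is a sketch of standard inverse-probability weighting (the normalized, or Hajek, version) on made-up data with a binary X. The propensities are estimated as cell means, and the code refuses to proceed if treatment is deterministic in some cell, which is the overlap requirement just described.

```python
import random

random.seed(2)
n = 20000

# Hypothetical observational data: treatment is rare when x = 1,
# and the individual effect b_i is larger in that group.
data = []
for _ in range(n):
    x = random.randint(0, 1)
    p_treat = 0.1 if x == 1 else 0.5
    t = 1 if random.random() < p_treat else 0
    b_i = 3.0 if x == 1 else 1.0
    y = 1.0 + b_i * t + 2.0 * x + random.gauss(0, 1)
    data.append((x, t, y))

def propensity(g):
    """Estimated P(T = 1 | X = g); fails if there is no overlap in this cell."""
    ts = [t for x, t, _ in data if x == g]
    p_hat = sum(ts) / len(ts)
    if p_hat in (0.0, 1.0):
        raise ValueError(f"treatment is deterministic when X = {g}")
    return p_hat

p_hat = {g: propensity(g) for g in (0, 1)}

# Normalized inverse-probability-weighted means of Y for treated and controls.
treated_num = sum(y / p_hat[x] for x, t, y in data if t == 1)
treated_den = sum(1 / p_hat[x] for x, t, y in data if t == 1)
control_num = sum(y / (1 - p_hat[x]) for x, t, y in data if t == 0)
control_den = sum(1 / (1 - p_hat[x]) for x, t, y in data if t == 0)

ate_ipw = treated_num / treated_den - control_num / control_den
print(f"IPW estimate of the average effect: {ate_ipw:.2f}")
```

In this setup the population average effect is 2, and the reweighted estimate should be close to it, whereas plain OLS with X as a control would be pulled toward the group whose treatment status is less predictable.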

A second point is that you can use the methods and/or mapping approaches in their paper to help better characterize where your identification is coming from. In regression studies, standard summary statistics for your sample will not typically tell us what effective sample is being used to identify your effect.

Finally, you should be careful about arguing, as Pranab did, that a study is necessarily more generalizable just because it is done on a large regional sample.

Comments

This whole internal versus external (versus construct) validity issue is overdone. The larger point that should not be missed is that international development evaluation is not a white collar profession, yet applied economists/econometricians continue to delude themselves that it is so. Stata crunching is not going to get children to improve learning or teens to upgrade their labor market skills or improve labor standards in Bangladesh garment factories.

Thought I just submitted a comment, but it looks like it didn’t go through. Apologies if this is posted twice.

I was a bit confused by both this blog post and the Aronow and Samii paper. I haven’t read the Aronow and Samii paper, but it seems like they are just reiterating the often forgotten point that linear regression doesn’t consistently estimate average treatment effects in some situations where people think it does (namely, when there are heterogeneous treatment effects). This is a good point, but one that seems irrelevant to the debate over whether there is any potential advantage of quasi-experimental techniques in terms of increased external validity, since matching and other similar methods don’t suffer this same disadvantage.

I, and probably many other readers of this blog, have long assumed that even though there is a potential theoretical advantage of quasi-experimental techniques in terms of increased external validity, this doesn’t matter in practice due to the bias of quasi-experimental methods. This paper (which I also have read) calls this assumption into question: http://www.cgdev.org/publication/context-matters-size-why-external-validity-claims-and-development-practice-dont-mix