Browsing pp. 4 ff, it seems the authors are basically saying “hey the stats were challenging, the sample size tiny, other problems, but we solved them all—using innovative methods of our own devising!—and lo and behold, big positive results!”.

So this made me think (and tweet) basically that I hope the topic (which is pretty important) will happen to interest Andy Gelman enough to incline him to give us his take. If you happen to have time and interest…

The two key concerns seem to be: (1) very small sample size (thus, unless the effect is huge, it could get lost in the noise) and (2) correlation of the key outcome (earnings) with emigration.

The analysis generally seems reasonable (as one would expect given that Heckman is a coauthor) but what I’d really like to see are graphs of the individual observations. And, as always in such settings, I’d like to see the raw comparison—what are these earnings, which, when averaged, differ by 42%? I’d also like to see these data broken down by emigration status. That bit did worry me a bit. Once I have a handle on the raw comparisons, then I’d like to see how this fits into the regression analyses.

Overall I have no reason to doubt the direction of the effect—psychosocial stimulation should be good, right?—but I’m skeptical of the 42% claim, for the usual reasons of the statistical significance filter. An example where this might be happening is in the very last paragraph on page 6 that continues onto the top of page 7. There they are doing lots of hypothesizing based on some comparisons being statistically significant and others being non-significant (at least, that’s what I think they meant when they wrote of “strong and lasting effects” in one case and “no long-term effect” in the other). There’s nothing wrong with speculation but at some point you’re chasing noise and picking winners, which leads to overestimates of magnitudes of effects.
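To make the statistical significance filter concrete, here is a minimal simulation sketch. The true effect (0.1) and standard error (0.18) are made-up illustrative values, not numbers from the Jamaica study; the function name `significance_filter` is likewise hypothetical.

```python
import random
import statistics


def significance_filter(true_effect=0.1, se=0.18, n_sims=20000, seed=1):
    """Simulate noisy estimates of one true effect and keep only the
    'statistically significant' ones (|estimate| > 1.96 * se).
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    mean_abs_sig = statistics.fmean([abs(e) for e in significant])
    return {
        "mean_all": statistics.fmean(estimates),
        "share_significant": len(significant) / n_sims,
        "mean_abs_significant": mean_abs_sig,
        "exaggeration_ratio": mean_abs_sig / true_effect,
    }


result = significance_filter()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the average over all simulated studies sits near the true effect, but the average magnitude among the estimates that clear the significance threshold is several times larger, because any estimate that passes the filter must exceed 1.96 standard errors in absolute value.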

So those are my thoughts. My goal here is not to “debunk” but to understand and quantify.

I also talk a bit about the political context of those debates.

Here’s my conclusion (for now):

Where does that leave us, then? If we can’t really trust the headline number from a longitudinal randomized experiment, what can we do? We certainly can’t turn around and gather data on a few thousand more children. If we do, we’d have to wait another 20 years. What can we say right now?

My unsatisfactory answer: I’m not sure. The challenge is that earnings are highly variable. We could look at the subset of participants who did not emigrate, or, if there is a concern that the treatment could affect emigration, we could perform an analysis such as principal stratification that matches approximately equivalent children in the two groups to estimate the effect among the children who would not have emigrated under either condition. Given that there were four groups, I’d do some alternative analyses rather than simply pooling multiple conditions, as was done in the article. But I’m still a little bit stuck. On one hand, given the large variability in earnings, it’s going to be difficult to learn much from this sort of small-sample between-person study. On the other hand, there aren’t a lot of good experimental studies out there, so it does seem like this one should inform policy in some way. In short, we need to keep on thinking of ways to extract the useful information out of this study in a larger policy context.

You can read the Symposium article for my full story, and here’s the Gertler et al. article.

30 Comments

This is a very tough problem. The small sample size means a lot of variance, so the chances of over-estimating the scale of the effect in cases where an effect was found are definitely not zero. This being said, the direct comparisons were all in the correct direction.

So I think the effect size is difficult to interpret (due to factors like loss to follow-up and the dangers that a person could have a single huge salary — imagine if the local version of the Koch brothers happened to be randomized). It’s also the case that pilot projects rarely do as well when they are scaled up. But this is pretty decent evidence that the authors are on to something . . .

If it _is_ important, the study needs to be reviewed by say an expert panel who will get access to the data and analysis scripts.

Otherwise, no one will really know.

This _looked like_ a good study and had a big impact.
BART (Blood Conservation Using Antifibrinolytics in a Randomized Trial) changes heart surgery around world (Drs. Dean Fergusson and Paul Hébert, 2007).

I don’t have access to the data or their analysis scripts, so there is no way for me to rule out someone messing up (it happens all too often), and as a default I just do not believe published claims (that have not been audited, for example by the FDA).

With that access and a whole lot of time to re-analyse, I would likely be very convinced it’s much smaller than 42%.
Other comments here and there do suggest to me (with apologies to the researchers) that they may not understand the winner’s curse that well, but I might well convince myself it’s positive on average. I would do my best to get second opinions from colleagues, ideally sending them reproducible code.

There is only a tiny bit of _real_ information in the study for anyone to work with!

I saw this paper presented at Berkeley a year or two ago. I came away thinking 1) that is a pretty incredible result magnitude-wise, but I do think there was probably some effect; 2) emigrating to England is a pretty good plan, wage-wise, for poor Jamaicans; 3) I am now (as in from the minute I left the seminar) incredibly interested in permutation tests; and 4) there is so much good work to be done following up early life interventions that happened in the 1970s-90s (the other great example being the INCAP nutrition trial in Guatemala, where the inference problem isn’t small N but a small number of clusters: 2 pairs of matched villages!). Interestingly, the estimates from the INCAP trial were similarly large in terms of increased wages, but obviously the statistical inference strategy is even less convincing.

I also remember one faculty member excusing himself at the end of the seminar by saying “I’m going to go spend some time with my kids right now.”
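For readers unfamiliar with the permutation tests mentioned above, here is a generic two-sample sketch. It is not the procedure used in the paper; the function name and the tiny data set are made up for illustration.

```python
import random
import statistics


def permutation_test(treated, control, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.
    Group labels are reshuffled at random; the p-value is the share of
    reshuffles whose gap is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_treated]) - statistics.fmean(pooled[n_treated:])
        if abs(diff) >= abs(observed) - 1e-12:  # small tolerance for float ties
            count += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


# Hypothetical earnings-like numbers, chosen only to exercise the function.
gap, p_value = permutation_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
print(f"observed gap = {gap:.1f}, permutation p-value = {p_value:.4f}")
```

The appeal in a small-sample setting is that the reference distribution comes from the randomization itself rather than from a large-sample approximation. It does nothing, however, about the significance filter or selective attrition.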

I think permutation tests are a bit of a dead end but I agree with your other comments.

As I wrote, the topic is important enough, and the general effect is plausible enough, that mere uncertainty about the size, direction, mechanism, and variation of the effect should not cause us to dismiss the result.

And I’d say the same thing even if none of the findings in the paper were statistically significant.

There is always “uncertainty about the size, direction, mechanism, and variation of the effect” so it’s really a question of how we might decrease it (e.g. with other data) or deal with it for now.

Direction on average is what many hope to at least get right (do more good than harm), with folks like Sir Richard Peto suggesting analysis methods (post-stratification on estimated SDs!?!) based on _wishful_ thinking that variation is only in magnitude but not direction.

At least in clinical research for the last 30 years uncertainty in variation of the effect and size have been very neglected.

Not completely devoid. Consider this: n=1 is better than n=0. n=2 is better than n=1 (assuming we understand the data-collection process). Etc. Even if it takes n=2000 to get statistical significance (which itself is no guarantee of certainty), the steps along the way certainly can’t be useless.

The immigration point you’ve discussed at length, but the above comment on mechanisms reminded me of a point 5 I should have included above, namely that I was sort of amazed by one entry in Table 8. They show that, in effect, treated children were almost never “expelled” from school. I put “expelled” in quotes because I think they mean “suspended” – as in, you get kicked out for a bit but can come back. But regardless, I think this does point to a particular mechanism and does so better than the internalizing/externalizing scores in the same table (which are just sort of un-interpretable to me). To save you from looking: 17% of control group kids were suspended, and treatment reduced that by 12 percentage points (so a reduction in expulsions from about 17% to 5% or so).

That fact got buried in the talk I saw, but I found it really convincing as a mechanism given what we think we know about early life socialization and educational attainment, and the labor market returns to education. A really positive interpretation would be that getting played with in a constructive manner early in life teaches you how to get along well with others and control your emotions in more productive ways. (A less positive one might be some argument about how we are really working on a “technology of docility” so as to make more boring, more obedient people, but a smart person who occasionally gets just a little concerned about this every once in a while would bury such an argument deep in comments where no one would see it).

Now, I think your general point here is more about how we (as researchers) and the public ought to react to papers like this, and you want to say something like “Read carefully. Be skeptical. Update your beliefs conservatively.” And I totally agree with that. But in this case, maybe you are not giving the analysis of the mechanisms enough credit. Well-socialized children do better in school (at least on the extensive margin, since they aren’t kicked out of it). More educated people earn more. Also, some Jamaicans (likely selected based on potential earnings and ease of social adaptability) move to England to make more money. So this paper finds that increased sociability through childhood intervention increases wages, and two obvious candidate mechanisms – education and migration – seem to be part of the explanation in straight-forward, pre-specified, predictable ways.

There is evidence of selective attrition of the migrants. We were able to locate and interview 14 out of the 23 (60%) migrants, a substantially lower share than the share of non-migrants that we were able to find and interview

One possibility is those who emigrate have high expectations placed upon them by family, friends, etc… If their earnings are not sufficiently high, they may refuse to be found / interviewed. If treatment has an effect on income, then we would expect more attrition among controls.

In general I found the discussion of selection bias and attrition to be imprecise and ambiguous, as it relies on the language of statistics and correlation to talk about causation. A simple DAG would be a more direct way to do this. For example, the attrition mechanism I suggested above is simply written as

Treatment ➡ Migration ➡ (latent) Income ➡ Attrition

where income is only observed when Attrition is 0 and missing when attrition is 1.
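A rough simulation of this attrition story, restricted to migrants, might look like the sketch below. Every number in it is an assumption chosen for illustration (migration rates, the income scale, the 0.8 and 0.1 drop-out probabilities). It also assumes a direct effect of treatment on migrants' income, matching the "if treatment has an effect on income" case rather than the bare chain in the diagram.

```python
import random
import statistics


def simulate_attrition(n=100000, effect=0.5, seed=2):
    """Simulate migrants whose chance of being lost to follow-up depends on
    their latent income. All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    latent = {0: [], 1: []}    # latent income of all migrants, by arm
    observed = {0: [], 1: []}  # income of migrants who were interviewed
    for _ in range(n):
        t = rng.randint(0, 1)                             # Treatment
        if rng.random() >= 0.2 + 0.1 * t:                 # Treatment -> Migration
            continue                                      # non-migrants are skipped
        income = 1.0 + effect * t + rng.gauss(0.0, 1.0)   # (latent) Income
        p_lost = 0.8 if income < 1.0 else 0.1             # Income -> Attrition
        latent[t].append(income)
        if rng.random() >= p_lost:
            observed[t].append(income)
    return latent, observed


latent, observed = simulate_attrition()
for arm, label in ((0, "control"), (1, "treated")):
    rate = 1 - len(observed[arm]) / len(latent[arm])
    print(f"{label}: attrition among migrants = {rate:.2f}, "
          f"latent mean = {statistics.fmean(latent[arm]):.2f}, "
          f"observed mean = {statistics.fmean(observed[arm]):.2f}")
```

Under these assumptions attrition is higher among control migrants, and the incomes of the migrants who are found overstate the latent average in both arms, since low earners are the ones who go missing.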

Dag or no dag, I think this selection problem is huge here. In my linked article I suggested a principal stratification approach. But this sort of thing isn’t easy, it requires a model. In my own research, I can only think of one time where I went to the trouble of fitting a selection model; usually I’ll just control for everything I can control for, and hope for the best.

I think the reason selection models are not used that often, and why the default is to control for everything, share something in common: Inadequate language.

Dags enable us to build qualitative models quickly — literally sketch them out –, use them as mental props to have a dialog with ourselves, and elicit our priors. From finger counting, to abacuses, humans need mental props to make progress.

For example, I am almost certain migration and income have causes in common other than treatment. This means that conditioning on migration — principal stratification — simply opens a back-door path from treatment to the outcome.
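The back-door concern can be checked with a small simulation, sketched here under made-up assumptions: an unobserved common cause `u` drives both migration and income, treatment raises the chance of migrating, and treatment has no effect on income at all. The sketch conditions on observed migration status only; whether principal stratification amounts to that is the point disputed in this thread.

```python
import random
import statistics


def simulate_collider(n=100000, seed=3):
    """Return (treatment, migrated, income) rows. Income depends only on the
    unobserved cause u, never on treatment. Parameters are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.randint(0, 1)
        u = rng.gauss(0.0, 1.0)                               # unobserved common cause
        migrated = u + 1.0 * t + rng.gauss(0.0, 1.0) > 1.0    # t and u both cause migration
        income = u + rng.gauss(0.0, 1.0)                      # no treatment effect by construction
        rows.append((t, migrated, income))
    return rows


def mean_gap(rows):
    """Treated-minus-control difference in mean income."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


rows = simulate_collider()
gap_all = mean_gap(rows)
gap_migrants = mean_gap([r for r in rows if r[1]])
print(f"gap, everyone:      {gap_all:+.3f}")
print(f"gap, migrants only: {gap_migrants:+.3f}")
```

In this setup the comparison over everyone is close to zero, as it should be, while the comparison among observed migrants is clearly negative: control children needed a larger `u` to migrate, so the migrant subgroups are no longer comparable across arms.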

My point: There are colliders out there, so we should use our prior information/theories/insights/experiences to postulate likely models, perform sensitivity analyses, etc. Dags make this process of exploration considerably cheaper and more accessible. (On the importance of a graphical language see http://vimeo.com/66085662)

1. Here’s my paper with a selection model. It did not kill us to fit it, but it took a bit of work, and the only reason we did it was because our results made no sense without it. I can well imagine that setting it up using dags could make it clearer in some ways, but no matter how you slice it, it took some work. I don’t personally find terms such as “collider” helpful but I recognize that others do. In any case, I am hoping that Stan will allow us to set up selection models more easily than before. This is something worth looking into.

2. Jennifer and I briefly discuss principal stratification in our book (in chapter 10, I believe). Maybe this will help. Regarding your comment above, principal stratification would not not not condition on migration. It would condition on the migration status that would occur under treatment or under control. I agree that you don’t want to control for intermediate outcomes (Jennifer and I discuss this too); indeed, the whole point of principal stratification is that it does not control for intermediate outcomes. So in this case you’re misunderstanding what principal stratification is!
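As a small illustration of the distinction drawn in point 2, the sketch below labels each child by the pair of potential migration outcomes (under control, under treatment) and contrasts that with the single migration status actually observed. The stratum labels, function names, and the six example children are all hypothetical.

```python
def principal_stratum(m_control, m_treated):
    """Label a child by the pair of potential migration outcomes
    (migration if assigned to control, migration if assigned to treatment)."""
    labels = {
        (0, 0): "never-migrant",
        (1, 1): "always-migrant",
        (0, 1): "migrates only if treated",
        (1, 0): "migrates only if control",
    }
    return labels[(m_control, m_treated)]


def observed_migration(x, m_control, m_treated):
    """Only the potential outcome for the assigned arm x is ever observed."""
    return x * m_treated + (1 - x) * m_control


# Hypothetical children: (assigned arm x, migration under control, under treatment).
children = [(1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
for x, m0, m1 in children:
    print(f"arm={x} observed_migration={observed_migration(x, m0, m1)} "
          f"stratum={principal_stratum(m0, m1)}")
```

In this toy list, the treated children observed to migrate come from two different strata ("always-migrant" and "migrates only if treated"), which is why grouping by observed migration is not the same as grouping by stratum. In real data the strata are not directly observed, so estimating effects within them requires a model.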

1. I took a very quick look at the paper (3 minutes). I could not home in on the selection problem. The identification and estimation are all bound together in a parametric mathematical model, making a quick interpretation hard. A simple DAG would have highlighted the identification problem immediately, separate from the parametric specification, estimation, and computation. But, like coding style, this is a matter of preference.

2. The discussion in your book with Jennifer is on pages 192-93. I am still not clear what “principal strata” mean as described in that passage, but probably it’s just me. From my reading I guess principal strata simply divide the experimental group into never-takers, always-takers, etc. at the level of the mediating variable (see manuscript by Pearl I linked to above, and equation (3) P(Y=y|Z_x=z,Z_x’=z’)).

Now, Z is an intermediate outcome, and a post-treatment variable. The nuance in your comment — “the whole point of principal stratification is that it does not control for intermediate outcomes” — is that we are conditioning on potential outcomes which, under Rubin’s framework, are assumed given. That is, we are not conditioning on the observed post-treatment variable Z = (1-x’)Z_x’ + (x)Z_x. I find this a little contrived — forcing Nature through the eye of our syntax.

A more transparent approach — to me — is using canonical partitions (see Chickering and Pearl 1997), which removes the ambiguity by adding a kind of moderator variable to the graph (Pearl 2009 Chapter 8) That partitioning variable can be pre- or post-treatment.

So yes, I am still confused by principal stratification. In my defense I can sheepishly say that one person’s misunderstanding is relative to another person’s understanding.

No need to apologize for being confused. I remain confused about many aspects of variational Bayes approximations, for example—even after including the topic (and understanding it well enough to program it and write it up for a specific example in our book). I recognize the topic is important and useful but it’s tricky enough, and far enough away from what I usually do, that I remain confused by it.

One might say there are two potential levels of confusion about a topic. The first level is to misunderstand and think it’s something it’s not; the second level is to recognize what it is but to not see how to apply it. For principal stratification, your earlier comment (when you thought that principal stratification was equivalent to controlling for an intermediate outcome) was at the first level; your later comment (that you don’t see the point of the method) is at the second level.

Finally, in many areas of research I have noticed that different people find different methods to be useful. The differences must arise from many factors, including the different sorts of problems that different people work on, and the different goals that each of us has. So it may well be that you will always be comfortable with colliders etc., whereas I will always find such ideas to be confusing.

Here’s one idea that might be helpful (although perhaps not new to you, as I’ve mentioned it several times on the blog): When confused about causal inference, I will often step back and consider everything that flows from a treatment to be a multivariate outcome. So, for example, in the early-childhood-stimulation example, the treatments are clear, and the outcome is bivariate: emigration status and ultimate earnings. The treatment affects both of these. Some of the remarks in the linked article suggest to me that the authors do not think the treatment is acting entirely through emigration. To make such statements, I think some latent-variable models would be needed. Your taste in latent-variable models is different than mine, but I think we agree that explicit models are helpful in such a setting. But, in any case, such models are difficult to construct, in part because they require explicit introduction of prior assumptions. So I can see how the authors would try to avoid this effort by fitting a bunch of regressions and hoping for robustness. This is the sort of thing I often do myself.

No need to read between the lines. From the linked article: “migration might be an important pathway through which the intervention could have improved human capital and earnings outcomes,” and, a bit later, “The later analysis produces a low bound estimate of the treatment effect as it does not allow migration to be a pathway to improved education and earnings.” They said it all. Graphs would be fine but I think what really would be needed to go further is a full latent variable model, with all the work that entails.

Having only skimmed the article it may be that the information is all there. My point is not so much “reading between the lines” as “piecing together nuggets of information across the article” to figure out the underlying causal model they have in mind.

When you read an empirical article, you often say you turn to the figures first. In an ideal world you’ll find a nice figure telling the main result.

I look to do exactly the same. I look forward to the day when I can open an article and find a picture of the model(s) with all the assumptions laid out.

To wit, in an electronic format that model would be encoded in a JSON object or adjacency matrix so we can have a semantic scientific web, one where a computer could discern models used across articles.
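As one possible shape for such an encoding, here is a sketch that stores the attrition DAG from earlier in this thread (Treatment, Migration, latent Income, Attrition) as a JSON object and derives an adjacency matrix from it. The field names `nodes`, `edges`, and `latent` are an assumed ad hoc schema, not an existing standard.

```python
import json

# Assumed ad hoc schema; edges are [parent, child] pairs.
dag = {
    "nodes": ["Treatment", "Migration", "Income", "Attrition"],
    "edges": [["Treatment", "Migration"],
              ["Migration", "Income"],
              ["Income", "Attrition"]],
    "latent": ["Income"],
}


def adjacency_matrix(graph):
    """Row i, column j is 1 when there is an edge from node i to node j."""
    index = {name: i for i, name in enumerate(graph["nodes"])}
    matrix = [[0] * len(index) for _ in index]
    for parent, child in graph["edges"]:
        matrix[index[parent]][index[child]] = 1
    return matrix


def descendants(graph, node):
    """All nodes reachable from `node` by following edges forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in graph["edges"]:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


encoded = json.dumps(dag, indent=2)   # what an article could publish
decoded = json.loads(encoded)         # what a program would read back
print(encoded)
print(adjacency_matrix(decoded))
print(sorted(descendants(decoded, "Treatment")))
```

With the model in this form, a program could answer simple structural questions, such as which variables sit downstream of treatment, without parsing the prose of the article.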

Thanks for the informative reply. I would say my confusion is not about the use of the method – I think it can be useful – so much as the characterization and notation used in principal stratification (even the name is confusing/uninformative to me). As a result I was not sure what the method was doing.

Now I think I understand the underlying causal model. I found Pearl & Chickering’s presentation a whole lot more intuitive in this regard but this is a matter of taste.

Regarding the need for modelling, I see DAGs as a fast prototyping language.

[…] Last year I came across an article, “Labor Market Returns to Early Childhood Stimulation: a 20-year Followup to an Experimental Intervention in Jamaica,” by Paul Gertler, James Heckman, Rodrigo Pinto, Arianna Zanolini, Christel Vermeerch, Susan Walker, Susan M. Chang, and Sally Grantham-McGregor, that claimed that early childhood stimulation raised adult earnings by 42%. At the time, I wrote, […]
