“Edlin’s rule” for routinely scaling down published estimates

A few months ago I reacted (see further discussion in comments here) to a recent study on early childhood intervention, in which researchers Paul Gertler, James Heckman, Rodrigo Pinto, Arianna Zanolini, Christel Vermeersch, Susan Walker, Susan M. Chang, and Sally Grantham-McGregor estimated that a particular intervention on young children had raised their incomes as young adults by 42%. I wrote:

Major decisions on education policy can turn on the statistical interpretation of small, idiosyncratic data sets — in this case, a study of 129 Jamaican children. . . . Overall, I have no reason to doubt the direction of the effect, namely, that psychosocial stimulation should be good. But I’m skeptical of the claim that income differed by 42%, because of the statistical significance filter. In section 2.3, the authors are doing lots of hypothesizing based on some comparisons being statistically significant and others being non-significant. There’s nothing wrong with speculation, but at some point you’re chasing noise and picking winners, which leads to overestimates of the magnitudes of effects.

After seeing this, Aaron wrote to me:

It seems to me that there should be a standard rule of skepticism. It takes a point estimate and standard error for a significant effect (do you need sample size too?) and divides the point estimate by something and multiplies the standard error by something to get a posterior under the new principle of ignorance.

That is, suppose you don’t start with priors but I just tell you someone studied something and published a study saying something. What do you believe!

Then if you know something from past experience or think you do you can adjust further.

What is your posterior having read the study by Gertler and Heckman?

I replied:

Yes, this is basic Bayes. The prior for any effect is centered at 0 with some moderate standard deviation. For example, most educational interventions don’t do much, some have small effects, very few have effects as large as 42%. But maybe Gertler et al would not agree. And, from a political standpoint, it’s savvy for them to ignore me. [Here I was expressing frustration because I’d contacted Gertler, the first author of the aforementioned paper, several times regarding my concerns about his effect size estimate, and he did not respond.] If my article had been published in the New York Review of Books rather than Symposium magazine, maybe things would be different, but, then again, I doubt the New York Review of Books would be particularly interested in someone expressing skepticism on early childhood intervention. . . .

Aaron replied:

Ah. But I am suggesting an actual rule of thumb number that tells me how much to change them. (As a guess). Gelman’s contribution is the actual factors.

Bayes tells you how to do it in principle based on well specified priors. I want a rule of thumb I can carry with me and implement on the spot. Twenty years ago, I heard of a folk theorem called the “Iron Law of Econometrics.” It is that all estimated coefficients are biased toward 0 because of errors in variables. The name is possibly due to Hausman, though the idea is much older of course.

Indeed, there are several different reasons for effect size estimates to be overestimated. Above I mentioned the statistical significance filter and multiplicity (that is, researchers can choose the best among various possible comparisons to summarize their data); Aaron pointed out the ubiquity of measurement error; and there are a bunch of other concerns. For example, here’s Charles Murray:

The most famous evidence on behalf of early childhood intervention comes from the programs that Heckman describes, Perry Preschool and the Abecedarian Project. The samples were small. Perry Preschool had just 58 children in the treatment group and 65 in the control group, while Abecedarian had 57 children in the treatment group and 54 in the control group. In both cases the people who ran the program were also deeply involved in collecting and coding the evaluation data, and they were passionate advocates of early childhood intervention.

Murray continues with a description of an attempted replication, a larger study of 1000 children that reported minimal success, and concludes:

To me, the experience of early childhood intervention programs follows the familiar, discouraging pattern … small-scale experimental efforts staffed by highly motivated people show effects. When they are subject to well-designed large-scale replications, those promising signs attenuate and often evaporate altogether.

Heckman replies here. I’m not convinced by his reply on this particular issue, but of course he’s an expert in education research so his article is worth reading in any case.

In any case, I give Heckman credit for making predictions about effects of future programs. In a 2013 interview, he says:

They’re just now conducting evaluations of Educare, and I know because some of it is being conducted here in Chicago, by people I know and respect. Educare is based on a program that has been evaluated. It is an improvement on the Abecedarian program for which the results are highly favorable. Evidence from the Abecedarian program provides a solid “lower-bound” of what Educare’s results will probably be.

I’d guess the opposite. Given all the reasons above for suspecting that published results are overestimates, I’d guess that, in a fully controlled study with a preregistered analysis, the results for this new study would be less impressive than in the earlier published results. On the other hand, I don’t know anything about the details of any of these programs so in particular have no sense of how much Educare is an improvement on Abecedarian.

Heckman sees the earlier published results as a lower bound because he sees improvements in the interventions (which makes sense; people are working on these interventions and they should be getting better). I see the published results as an overestimate (but not an “upper bound,” because any estimate based on only 111 kids has got to be too noisy to be considered a “bound” in any sense) based on my generic understanding of open-ended statistical analyses.

What’s the Edlin factor?

To return to Aaron’s original question: is there a universal “Edlin factor” that we can use to scale down published estimates? For example, if we see “42%” should we read it as “21%”? That would be an Edlin factor of 1/2 (or, as the economists would say, “an elasticity of 0.5”).

But it doesn’t seem that any single scale-down factor would work. For example, when somebody-or-another published the claim that beautiful parents were 36% more likely to have girls, I’m pretty sure this was an overestimate by at least a factor of 100 (as well as being just about as likely as not to be in the wrong direction). Using an Edlin factor of 1/2 in that case would be taking the claim way too seriously. On the other hand, I’m pretty sure that, if we routinely scaled all published estimates by 1/2, we’d be a lot closer to the truth than we are now, using a default Edlin factor of 1.

Here’s another example where it would’ve helped to have an Edlin factor right from the start.

What do you all think?

P.S. We also need a better name than “Edlin factor.” I don’t like the word “shrinkage”; it sounds too jargony. “Scale-down factor” is kind of OK, but I think we could come up with something better. The point is for researchers and consumers of research to routinely scale down published estimates, rather than taking them at face value and then worrying about problems later.

P.P.S. I sent the above to Edlin himself, who wrote:

Your refusal to name an Edlin factor seems decidedly non-Bayesian and a bit classical to me. I understand that with more information and context, the factor changes dramatically. But shouldn’t a Bayesian be willing to name a number rather than just say <1? (Also the commenter is correct that the Iron Law moves the opposite direction, which is what I was trying to say, btw. you know that of course. Not sure if that was clear or not in the blog.)

I replied: Maybe. But I think it’s also ok for a Bayesian to say that his prior depends on the problems he might be working on. A universal prior is some sort of average over all possible problems, but in that case a lot of precision might be gained by adding some context. I think an Edlin factor of 1/5 to 1/2 is probably my best guess for that Jamaica intervention example. But in other cases I’d give an Edlin factor of something like 1/100. And there are other settings where something close to 1.0 would make sense. Here I’m thinking of various studies that are repeated year after year and keep coming up with consistent results, for example correlations between income and voting.

Also, yeah, that Iron Law thing sounds horribly misleading. I’d not heard that particular term before, but I was aware of the misconception. I’ll wait on posting more about this now, as a colleague and I are already in the middle of writing a paper on the topic.
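To make the “basic Bayes” arithmetic behind such a factor concrete: with a normal prior centered at zero with standard deviation tau, and a published estimate with standard error se, the posterior mean multiplies the estimate by tau²/(tau² + se²). Here is a minimal sketch; the example numbers in the usage note are invented for illustration, not taken from any of the studies discussed.

```python
from math import sqrt

def shrinkage_factor(se, prior_sd):
    """Posterior-mean multiplier under a N(0, prior_sd^2) prior
    for an unbiased estimate with standard error se."""
    return prior_sd ** 2 / (prior_sd ** 2 + se ** 2)

def posterior(estimate, se, prior_sd):
    """Posterior mean and sd of the effect under the normal-normal model."""
    k = shrinkage_factor(se, prior_sd)
    return k * estimate, sqrt(k) * se
```

For instance, a prior sd of 10 points against a standard error of 20 gives a multiplier of 0.2 — an Edlin factor of 1/5 — while a standard error equal to the prior sd gives exactly 1/2.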

39 Comments

The simple replacement for “Edlin factor” that comes to mind is “deflation.” Borrowing some terms from rhetoric for other (potentially humorous) possibilities could yield “hyperbolic balance factor,” “Deauxesisification,” or “Meiotic correction,” for starters.

I don’t understand? You mean instead of a bunch of regression coefficients, these just look like means? In fact, the treatment effects come from a linear regression, but since the interest isn’t in the effects of non-experimentally manipulated covariates (like age and gender), they just don’t report those regression coefficients.

From the text below Table 4: The treatment effects are estimated by linear regression and are interpreted as the differences in the means of employment outcomes between the stunted treatment and stunted control groups conditional on baseline values of child age, gender, weight-for-height z-score, maternal employment, and maternal education.

I actually greatly prefer this to tons of regression coefficients and tables. And since the statistical inference isn’t coming from the linear regression directly, we don’t need standard errors for them either (those are p-values coming from permutation tests – you may hate those (Andrew does) but they are a reasonable way to perform this kind of test in this context, better than analytic methods I think).

I mean I wanted to see histograms/boxplots/stripcharts/whatever of the uncorrected treatment and control group data. Would this be inappropriate for some reason?

I hoped to be able to see the data described in order to judge whether the assumptions of their approach are met and to consider whether the results made sense (or suggested future lines of investigation) by asking questions such as: “Do the members of the control group have higher paying first or second jobs?”

For example figure 1 has the density plots but there is no x-axis. It looks like the proportion of high earners dropped in the control group from first to last job (did they get into the booze or what?). I did not read the paper very carefully though.

I’ll add that a common problem with some types of medical data, such as protein expression, is that they are normalized to the levels of “housekeeping genes,” but the actual levels of housekeeping-gene expression are not shown. So you can get an apparent positive treatment effect on your protein of interest when instead the treatment was decreasing your housekeeping-gene expression.

I have seen this happen anecdotally but have no idea how pervasive the problem is, because the raw data are not reported (and possibly not even looked at by the original investigator). Since the analysis seems focused only on comparing treatment to control, I wished to compare the different levels of control-group results to see if they displayed any strange behavior. I would also want to compare control levels in future studies to this one to see if they are consistent.

The lack of labels on those figures is annoying, but at least you get the control group means (Tables 4,8) and since it isn’t a difference-in-difference, we don’t have to worry about a positive relative effect where the treatment group is still below (in levels) the control group.

But in response to the question – yeah, this is about as good of reporting as you get in the Social Sciences. I think there is no question that weird stuff is going on, sub-groups have differential impacts, and the mean is not the only thing one could be interested in. But that’s what they were doing (mean impacts, with some sub-groups), and I think they report that quite well.

Now – the real issue of clarity to me is whether they release the data when the paper comes out. Of course, that’s from a “replication” standpoint, and not a “reader who wants to be able to read a paper critically” perspective. I think the latter is a really difficult length/breadth trade-off in all papers.

You have a suggestion for one figure that you think “adding figure X would greatly improve my ability to read this paper critically”? I have some ideas, but nothing I think would help lots of people, just stuff that could answer little questions of mine.

JRC, if they don’t share the data at least at the treatment vs control level (assuming they do not share more due to privacy considerations), this report is equal to zero evidence until replicated by an independent lab. Yes it was a 20 yr study, no it is not ok to extrapolate from this group to elsewhere.

Regarding that “Iron Law of Econometrics”: I think this is pointing to under-estimation of true effect sizes, not over-estimation, when there are noisy (but unbiased) measurements of the explanatory variables.

I think this is an asymptotic result, and I could conceive of a situation where, in the presence of small sample sizes plus strong mechanisms selecting for headline-grabbing results, measurement error in explanatory variables could allow for over-estimation, but the general idea is about overly small estimates.

I saw this paper referenced on a blog recently: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2377290. Apologies if it was here. It seemed kind of clever because it supposedly obtains unbiased effect sizes from just the published sample sizes and p-values. This seems to meet Aaron’s requirement of something he could apply on the fly while reading a study.

As for some universal Edlin factor — I can’t see how it can exist, at least in general terms.

Having said that, how about the following for a 1st-order correction:
[1] Calculate the effect size at -2, -1, 0, +1, and +2 standard deviations. So now you have 5 effect estimates.
[2] For each, give your subjective probability (p_i) that you would hear about this result, had it turned up in an analysis with the same standard deviation as the original estimate.
[3] Average over these results with weights 1/p_i. This is your revised estimate. Note the estimate explodes if any p_i = 0, which is probably a good thing.
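That three-step recipe can be sketched in a few lines; the publication-probability function is entirely the reader’s subjective input, and nothing below goes beyond the weighting scheme described above.

```python
def publication_corrected(estimate, se, p_hear):
    """Heuristic correction: evaluate candidate effects at -2..+2
    standard errors around the published estimate, weight each
    candidate by 1 / p_i, where p_i = p_hear(candidate) is your
    subjective probability of ever hearing about that result, and
    return the weighted average.  Raises ZeroDivisionError if any
    p_i is 0 -- the 'explosion' the recipe mentions."""
    candidates = [estimate + j * se for j in (-2, -1, 0, 1, 2)]
    weights = [1.0 / p_hear(c) for c in candidates]
    return sum(w * c for w, c in zip(weights, candidates)) / sum(weights)
```

If every result is equally likely to be heard about, the correction leaves the estimate alone; if larger effects are more publishable, the corrected estimate is pulled below the published one.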

PS – Aaron’s comments seem to imply some kind of revised likelihood, rather than any incorporation of prior information. I am not sure this is possible, given that the newsworthiness of a new finding (which will affect the researcher-level and journal-level filters) will be determined by whatever is already out there. “Smoking causes cancer!” = not news. “Having Democrat parents causes cancer!” = now THAT is news.

It’s not universal, and it doesn’t address all the issues that lead to biased estimates, but Judy Zhong and Ross Prentice devised a simple correction (a deflation factor) based on the p-value and the significance level. Around the same time, Xiao and Boehnke did something similar.
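In the simplest normal-means setting, corrections of this general flavor can be framed as conditional maximum likelihood: given a z-score that passed the |z| > 1.96 filter, pick the underlying effect that maximizes the likelihood of the observed z conditional on significance. The sketch below illustrates that idea only; it is not the exact Zhong–Prentice or Xiao–Boehnke estimator.

```python
from math import erf, log, sqrt

def _Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def significance_filtered_mle(z, crit=1.96, step=0.001):
    """Deflate a 'significant' z-score by maximizing the likelihood
    of z conditional on having passed the filter |Z| > crit, i.e.
    treating z ~ N(mu, 1) truncated to the significant region.
    Grid search over mu in [0, |z|] (the conditional MLE never
    exceeds the observed z in magnitude)."""
    za = abs(z)
    best_mu, best_ll = 0.0, float("-inf")
    for i in range(int(za / step) + 1):
        mu = i * step
        p_sig = (1 - _Phi(crit - mu)) + _Phi(-crit - mu)
        ll = -0.5 * (za - mu) ** 2 - log(p_sig)
        if ll > best_ll:
            best_mu, best_ll = mu, ll
    return best_mu if z >= 0 else -best_mu
```

The behavior matches the intuition in this thread: a barely significant z = 2.0 gets deflated to well under half its published value, while a z of 5 is left essentially untouched.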

I find it a very interesting question too: we clearly should have strong priors for how big effects are (we’ve been doing statistically-oriented science for a century now), but where do we get them from and how do we adjust them?

So far, the best approach I’ve found seems to be to take an empirical Bayes approach: collate all meta-analytical effects across a field, and turn that into the distribution. Now you have an upper bound (since weak effects will generally not be replicated enough times to be meta-analyzed).

We’re doing something like this with the Archival Project at archivalproject.org and have been contacted by a few others with similar projects. Of course, basing it on average associations in the field would be very crude prior knowledge, and if a subfield has lots of publication bias your prior is an overestimate too. In sample-size calculation I often see a somewhat informal calibration; I think that’s actually a bit more realistic than basing your sample size on the first published study (winner’s curse, etc.).

This might take a lot of work, but what about looking at it empirically? That is, take studies that claim an effect size in some range, say 40-45%. Of those that have been well replicated with large samples, what is the average effect size in the replicated study?

Yes, but it wouldn’t even be as easy as that, given that not all replications are published, and replications themselves are rarely completely clean. When study B is done to replicate study A, there are often differences in the data protocols and analyses.

And even within what is published, I’m not sure you can ignore Rubin’s response-surface estimation (formalism), and, for the quality subcomponent of that, the problems with assessing and allowing for quality.

“It appears that ‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non‐additive and nonlinear, and that quality dimensions are highly application‐specific and hard to measure from published information.” From “On the bias produced by quality scores in meta‐analysis, and a hierarchical view of proposed solutions,” S. Greenland and K. O’Rourke, Biostatistics, 2001.

Ah, but statistics is the science of defaults. Even if we’re not going to elevate multiplication by some specific scalar value to the status of a rule, it does make sense (to me) to have a default rough-and-ready method to calibrate anticipations.

Well, it seems like this “Edlin’s rule” is more or less a “skeptic’s field/domain specific prior”. Personally, I’d call this “Skepticism” for short.

To really do this properly, it’d be great to have a meta-analysis comparing all predicted and subsequently confirmed effect sizes, so that one could get a handle on, say, childhood interventions, or the average rigor of personality studies. But this doesn’t really help for an on-the-fly, how-much-don’t-I-believe-this-study calculation. The best set of “rules of thumb” (or “rule of thumbs”? (just kidding)) likely involves all sorts of factors and might lead to a publication-generalization theory. But in the meantime, if we insist on believing the effects, and not just the direction, of un-replicated studies, and if the effect is large, why not just divide by the ‘surprising-ness’ of the claim?

The effect is between 0.5% and 5% (that’s the 95% confidence interval). In the social sciences it doesn’t matter what you’re measuring, if you’re bothering to do the test then the effect is always between 0.5% and 5%. And if you take several interventions, each of the order of 0.5% to 5%, and apply them all at once, the final effect is between 0.5% and 5%. If your answer is outside this range, it means you don’t have enough data.

Idk, but concerning the Boston Review of Books quote by Murray: there is actually a response by Heckman, strongly disputing that the alleged replication of the Abecedarian Project actually was one. I have no expertise whatsoever on this topic, and perhaps you feel comfortable enough to omit it because you decided Murray is right in this case. If so, I think you should say so. As it is, the omission, without any appraisal of this question from your side, creates the impression that you are quoting Murray as an authority over Heckman. I have an incredibly hard time believing that.

Fair enough. I added the link. In general of course I’d have to take Heckman, not Murray, as the authority in this area, given Heckman’s research experience both on the applied topic and on econometrics in general. However, on this particular issue I’d have to go with Murray’s belief that the true effect is smaller than the published point estimates, rather than with Heckman, who thinks the published estimates represent a “solid lower-bound” on what will probably happen in a new study. (I’m assuming here that the new study is being conducted in a preregistered way so that there are no degrees of freedom in the data processing and analysis. But I think Heckman is assuming that too; when he says he thinks this new study will show effects that are as large as or larger than the old study, I can’t imagine he’s saying this will occur just because the new study is also subject to statistical overestimation.)

In short, I think Heckman’s great, he’s hugely influential and has made lasting research contributions, but I think he wasn’t thinking about the Edlin factor and selection bias when performing and discussing these education studies. This oversight doesn’t shock me, as I myself have been doing applied social science research for decades but only recently have been thinking seriously about the problem.

I suspect a lot of successful (and unsuccessful) complex social science experiments depend upon a lot of idiosyncratic factors coming together in a good way. The few large scale intervention experiments that Heckman is so fond of are not just simple experiments, they were large organizational projects that depended upon good management, motivated staffs, cooperative parents, excited children, and so forth.

This doesn’t mean they can’t be replicated, just that it’s hard. Massive social science experiments like these are kind of like movies in scale. Right now, some executive in Hollywood is probably looking at the surprise box office success of The Lego Movie and wondering whether he should greenlight The Lincoln Log Movie. How hard could it be to replicate the interaction factors that made The Lego Movie a hit?

Well, movie history suggests: pretty hard. A lot of movie success is catching lightning in a bottle and is hard to replicate. On the other hand, over the decades Hollywood has gotten better at replication: look at the mediocre performance of the Jaws sequels in the 1970s versus the box office triumphs of the Pirates of the Caribbean sequels more recently. So, I don’t think it’s impossible to get better at replication, either.

I like to think of these things as dilations. When I used to do project management in a former life, I always calculated what I called the time-dilation factor: I took the initial estimate of the project timeline, doubled it, then added six more months. It was always more consistent with how long the project actually took than the optimistic timeline the project-management software spat out.

Actuaries have been doing this sort of rough-and-ready Bayesian shrinkage for over a century! The actuarial jargon for “Edlin factor” is “credibility factor”. Suppose BemCo sells auto insurance in both Minnesota and Wisconsin. In MN the average historical loss – based on a rich, reliable data sample – is $500. In contrast an analysis of BemCo’s more sparse historical data in WI yields an unrealistically high $1000 average loss. If you had to make a quick business decision, you might use your judgment based on professional experience to assign “20% credibility” to the WI indication and assign the “complement of credibility” to the MN indication. This results in a $600 indicated cost of selling a policy: (1000*20%)+(500*80%)=$600. The most widely discussed credibility set-up is essentially a random intercept hierarchical regression model of the sort described in Gelman/Hill. But the general notion of “credibility” is used in more intuitive, shotgun ways as well. Out of business necessity, the actuarial profession has long had a strong Bayesian streak. Sharon Bertsch McGrayne has a couple of chapters about this in “The Theory that Would Not Die”.
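The credibility blend in that example is a one-liner; here is a sketch using the names and numbers from the comment.

```python
def credibility_weighted(sparse_indication, credibility, complement):
    """Actuarial credibility blend: assign weight Z (in [0, 1]) to the
    indication from sparse data and the 'complement of credibility',
    1 - Z, to a stabler benchmark."""
    if not 0.0 <= credibility <= 1.0:
        raise ValueError("credibility must lie in [0, 1]")
    return credibility * sparse_indication + (1 - credibility) * complement
```

With 20% credibility on the noisy Wisconsin indication of $1000 and the complement on Minnesota’s $500, this reproduces the comment’s $600 blended cost.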

So maybe “credibility adjustment” would be a useful term here as well? (After all, results like Bem’s are “incredible” in one way or another…) Coming up with good rules of thumb for these types of adjustment factors would be tough… and for all I know maybe impossible in any rigorous sense. But simply having a catchy term that telegraphs the need for such adjustments would be a big step forward for society. Maybe some kind of pragmatic credibility theory could be a topic for the emerging field of data journalism. Reminds me of Krugman’s “World round, scientists say; others disagree”.

Well, sometimes the complement of credibility inflates as opposed to shrinks, but yes, we’ve been doing this for a while. Anyhow, if I recall my exams correctly, Buhlmann credibility is, in theory, the “best linear fit” to the Bayesian posterior expectation.

I am sure you could do some illuminating experiments in a few minutes of R. Make a model with a set of N “predictors” and N (true) linear coefficients and some scatter. Create M data points, and then “do research” to find the best predictor among the N. How much bigger is its estimate than its true coefficient? The answer should depend on N, M, and the scatter, but maybe in a simple way, no? If so, bingo: the saltiness coefficient.
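Here is such an experiment, sketched in Python rather than R (all parameter values are arbitrary): draw N small true coefficients, estimate each with sampling noise, “publish” the largest estimate, and compare it with its own true value.

```python
import random

def selection_exaggeration(n_predictors=20, n_obs=50, true_sd=0.1,
                           noise_sd=1.0, n_sims=2000, seed=1):
    """Simulate 'doing research': draw n_predictors small true
    coefficients, estimate each with sampling noise of standard
    deviation noise_sd / sqrt(n_obs), select the largest estimate
    as the 'finding', and return the ratio of the mean published
    estimate to the mean true value of the selected coefficient."""
    rng = random.Random(seed)
    se = noise_sd / n_obs ** 0.5
    sum_est = sum_true = 0.0
    for _ in range(n_sims):
        true = [rng.gauss(0, true_sd) for _ in range(n_predictors)]
        est = [b + rng.gauss(0, se) for b in true]
        winner = max(range(n_predictors), key=est.__getitem__)
        sum_est += est[winner]
        sum_true += true[winner]
    return sum_est / sum_true
```

With these settings the prior sd is 0.1 and the sampling sd about 0.14, so the normal-normal shrinkage factor is 1/3 and the published winner exaggerates its own true value roughly threefold on average — a simulated Edlin factor of about 1/3.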

I don’t see how that would work. You would have to introduce some sort of weirdness (heteroskedasticity, correlation of predictors and true error, omitted covariates that are correlated with included regressors) that would violate some regression assumption. Otherwise, your estimates would be consistent, and we would know (analytically) the distribution of coefficient estimates. I don’t think that is the kind of mental deflation factor Andrew has in mind.

I think conceptually what you’d have to do is experiment on experimenters – give them some small data set, which is a random sample of a larger, experimental dataset, and see what “results” they decide to “publish” (meaning send you back as “their results”). Then, with the known “true” experimental effect from the total sample, you might have the right kind of comparison between what people submit for publication and the real effect in the world.
