The time reversal heuristic (priming and voting edition)

Over the past decade, social psychologists have dazzled us with studies showing that huge social problems can seemingly be rectified through simple tricks. A small grammatical tweak in a survey delivered to people the day before an election greatly increases voter turnout. A 15-minute writing exercise narrows the achievement gap between black and white students—and the benefits last for years.

“Each statement may sound outlandish—more science fiction than science,” wrote Gregory Walton from Stanford University in 2014. But they reflect the science of what he calls “wise interventions” . . .

They seem to work, if the stream of papers in high-profile scientific journals is to be believed. But as with many branches of psychology, wise interventions are taking a battering. A new wave of studies that attempted to replicate the promising experiments have found discouraging results. At worst, they suggest that the original successes were mirages. At best, they reveal that pulling off these tricks at a large scale will be more challenging than commonly believed.

Well put.

Yong gives an example:

Consider a recent study by Christopher Bryan (then at Stanford, now at University of Chicago), along with Walton and others. During the 2008 U.S. presidential election, they sent a survey to 133 Californian voters. Some were asked: “How important is it to you to vote in the upcoming election?” Others received the same question but with a slight tweak: “How important is it to you to be a voter in the upcoming election?”

Once the ballots were cast, the team checked the official state records. They found that 96 percent of those who read the “be a voter” question showed up to vote, compared to just 82 percent of those who read the “to vote” version. A tiny linguistic tweak led to a huge 14 percentage point increase in turnout. The team repeated their experiment with 214 New Jersey voters in the 2009 gubernatorial elections, and found the same large effect: changing “vote” to “be a voter” raised turnout levels from 79 percent to 90 percent.

Wow! Sounds pretty impressive. But should we really trust it?

Yong continues:

When Alan Gerber heard about the results, he was surprised. As a political scientist at Yale University, he knew that previous experiments involving thousands of people had never mobilized voters to that degree. Mail-outs, for example, typically increase turnout by 0.5 percentage points, or 2.3 if especially persuasive. And yet changing a few words apparently did so by 11 to 14 percentage points. . . .

So he repeated Bryan’s experiment. His team delivered the same survey to 4,400 voters in the days leading up to the 2014 primary elections in Michigan, Missouri, and Tennessee. And they found that using the noun version instead of the verb one had no effect on voter turnout. None. Their much larger study, with 20 to 33 times the participants of Bryan’s two experiments, completely failed to replicate the original effects.

Melissa Michelson, a political scientist at Menlo College, isn’t surprised. She was never quite convinced about how robust Bryan’s results were, or how useful they would be. . . .

Jan Leighley from American University agrees. The small sample size of the original study “would have tanked the paper from consideration in a serious political science journal,” she says.

That last bit is funny because the paper in question appeared in . . . you guessed it, PNAS! It’s tough being a social scientist: work that’s not strong enough to appear in the field’s own journals gets into Science or Nature or PNAS, where it gets more publicity than anything in the APSR, AJPS, etc.

I looked at the Bryan et al. paper and it does have some issues. First, the estimated effect sizes are huge and strain plausibility, which implies high rates of type M and type S errors. Second, there are some forking paths. For example:

A significant Levene’s test indicated that there was less variance in the noun condition than in the verb condition [F(1,32) = 6.02, P = 0.020]. This appeared to be the case because of a ceiling effect in the noun condition, where 62.5% of participants were at the highest point on the scale (compared with 38.9% in the verb condition). Adjusting for this, the significance level of the condition effect strengthened slightly [t(29.40) = 2.15, P = 0.040]. In addition, a separate χ2 analysis, which does not rely on the assumption of the equality of variance, found that more participants indicated that they were “very” or “extremely” interested in registering to vote (as opposed to “not at all,” “a little,” or “somewhat” interested in registering to vote) in the noun condition (87.5%) than in the verb condition (55.6%) [χ2(1, n = 34) = 4.16, P = 0.041].

From one direction, this looks pretty good: they tried the analysis all sorts of ways and always got statistical significance! From another perspective, though, we see all these researcher degrees of freedom lurking around. For example, the significance level “strengthened” at one point: this was a meaningless change from 0.044 to 0.040. Actually, even p-values of 0.1 and 0.01 are not statistically significantly different from each other! The point is that there are so many different ways to slice this cake and they keep reporting those p-values of 0.03 or 0.04 or whatever.
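The point that p-values of 0.1 and 0.01 are not statistically significantly different from each other can be checked with a quick back-of-the-envelope calculation (a minimal sketch of the Gelman and Stern argument, using just those two p-values and assuming independent, normally distributed estimates):

```python
from statistics import NormalDist

def p_to_z(p):
    """Z-score corresponding to a two-sided p-value from a normal test."""
    return NormalDist().inv_cdf(1 - p / 2)

z1 = p_to_z(0.01)  # about 2.58
z2 = p_to_z(0.10)  # about 1.64

# Treating the two estimates as independent, the z-score of their
# difference is (z1 - z2) / sqrt(2); compute its own two-sided p-value.
z_diff = (z1 - z2) / 2 ** 0.5
p_diff = 2 * (1 - NormalDist().cdf(z_diff))
print(round(z_diff, 2), round(p_diff, 2))  # 0.66 0.51
```

A p-value of about 0.5 for the comparison: the apparent gap between a “highly significant” result and a “barely significant” one is itself nowhere near significant.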

There’s nothing wrong with data exploration—I’m the last person to insist on preregistration—and I find this paper more interesting than the ESP paper, or the ovulation-and-voting paper, or the fat arms paper, or the beauty-and-sex-ratio paper, or lots of other silly stuff we’ve posted on in the past. But I can’t take these p-values and effect-size estimates seriously.
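The type M and type S concern mentioned above can be made concrete with a small simulation in the spirit of Gelman and Carlin’s “retrodesign” idea. To be clear, the true effect (2 percentage points) and standard error (5 points) below are made-up illustrative numbers, not estimates from Bryan et al.’s data:

```python
import random

def retrodesign_sim(true_effect, se, n_sims=200_000, z_crit=1.96, seed=1):
    """Simulate what a study reports, conditional on statistical significance."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # noisy estimate of the effect
        if abs(estimate) > z_crit * se:        # "p < .05"
            significant.append(estimate)
    power = len(significant) / n_sims
    # Type S: share of significant estimates with the wrong sign.
    type_s = sum(e * true_effect < 0 for e in significant) / len(significant)
    # Type M (exaggeration ratio): average significant |estimate| / true effect.
    type_m = sum(abs(e) for e in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = retrodesign_sim(true_effect=2.0, se=5.0)
print(power, type_s, type_m)
```

With these assumed numbers, power is under 10%, over a tenth of the significant results have the wrong sign, and significant estimates overstate the true effect roughly sixfold. That is exactly why a huge published estimate from a small noisy study strains plausibility.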

16 Comments

“There’s nothing wrong with data exploration—I’m the last person to insist on preregistration—and I find this paper more interesting than the ESP paper, or the ovulation-and-voting paper, or the fat arms paper, or the beauty-and-sex-ratio paper, or lots of other silly stuff we’ve posted on in the past. But I can’t take these p-values and effect-size estimates seriously.”

I find this contradictory, or perhaps better phrased: confusing.

I understood that p-values are only legit/diagnostic/insert appropriate term here when they are not selectively reported and/or are corrected for multiple comparisons.

If this is correct, then I reason that “not taking these p-values seriously” is a direct consequence of the analyses not being pre-registered. Or in other words: if you don’t insist on pre-registration, you can’t take any p-values seriously, period.

Also see Neuroskeptic:

“A “no preregistration, no p-values” rule also ensures that p-values can be taken at face value.”

I discuss the matter here. Short answer: in the absence of preregistration, the interpretation of p-values requires more assumptions. Saying “no preregistration, no p-values” is like saying “no random sampling, no sampling inference” or “no randomized experimentation, no causal inference”: it has the advantage of rigor but it excludes a lot of important cases.

A voter’s civic responsibility should be instilled early on, without having to resort to short-term techniques for propping up the vote. I would expect the p-values or effect sizes to differ depending upon who, what, when, and where it is measured.

In a typical half-hour situation comedy (really about 18 minutes of material), seemingly intractable problems are resolved by the end of the show, usually with small interventions. The problem is resolved, and doesn’t reoccur for the remainder of the comedy series.

So I buy the idea that changing from a noun to a verb causing a 14 percentage point change in turnout is outlandish – and that the small sample size and researcher degrees of freedom should have doomed that study from the outset. On the other hand, my casual empiricism (which makes my gut-level feeling sound more scientific – and, admittedly, this is not an area of research I am at all familiar with) suggests that a few news headlines about a candidate’s emails within a couple of weeks of an election might plausibly induce a 5-10 percentage point shift in voting patterns. Both claims could be true, but might not be, so both should be subject to careful analysis. More to the point, the effect size is unknown and could vary from small to large.

My question is a Bayesian one – what is a reasonable prior for the effect of a news story on voting patterns, whether it be shark attacks, emails under investigation, or something else? I’m sure someone would want to look at prior research – but it seems to me that some of that research would show large effects (possibly to be discounted by the very problems alluded to in this post) and some might show no effects (with similar possible problems). My casual empiricism (again) makes me susceptible to believing that large impacts on voter turnout are unlikely while large swings in votes cast are much more believable. That is because my prior on human behavior is quite skeptical – I think it is easy to lead someone who shows up to vote to change their vote, but hard to get them to decide to take the time and trouble to show up and vote.

Not all 5-10% shifts are created equal (and these are over-10% shifts, BTW). So you can’t really have an accurate gut feeling about them, because they all mean different things.

If this were a 10-point shift in voter turnout from 50% up to 60%, that’s one thing. That’s substantially more plausible (though still not very) than a shift from 80% to 90%. There are lots of ways to think about this. As you get closer to the ceiling, it gets harder and harder to get large effects because there’s nowhere to go. Also consider the standard deviation: for a single voter, at 90% turnout the SD is 30%, while at 60% it is 49%. And what if the shift were from 89% to 99%? The SD at 99% is <10%. These shifts don’t mean the same thing.
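The standard deviations quoted above are just the Bernoulli SDs, sqrt(p(1-p)), for a single voter’s yes/no turnout outcome at each rate:

```python
import math

def bernoulli_sd(p):
    """SD of a single yes/no turnout outcome at turnout rate p."""
    return math.sqrt(p * (1 - p))

for p in (0.60, 0.80, 0.90, 0.99):
    print(f"turnout {p:.0%}: SD = {bernoulli_sd(p):.1%}")
# turnout 60%: SD = 49.0%
# turnout 80%: SD = 40.0%
# turnout 90%: SD = 30.0%
# turnout 99%: SD = 9.9%
```

The SD shrinks fast near the ceiling, which is one way to see why equal-sized shifts at different baseline rates are not comparable.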

As for a prior, without trying too hard one can look at the recent presidential election. The turnout was pretty much the amount predicted most everywhere, within a few percent. The vote preference was also within a few percent. You get huge swings in the electoral college with small shifts in percentage. Now consider the bombshells dropped during that election. Within a week of the vote, the FBI announced a federal investigation of a candidate for criminal, potentially treasonous, behaviour! Which would influence turnout more: that, or the difference between a verb and a noun?

I don’t think you need to know much about Poli Sci to see some of the issues.

Even though the effects are significant, the p-values are close to .05, which means the study couldn’t have detected much smaller effects. You can imagine a 95% CI running from near 2 to 22 percentage points. It could be anything. If you’ve watched election polling, turnout numbers, and results for any length of time, you almost never see shifts that huge. Even when a candidate openly admits to thinking it’s fine to sexually assault 50% of the population because he’s “special”, turnout doesn’t go up or down 10 points. I should think that would be more powerful than a noun-to-verb shift.
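A back-of-the-envelope CI like the one described above can be sketched with a standard Wald interval for a difference in proportions. The group sizes here are an assumption (Bryan et al.’s 133 California voters split roughly evenly between conditions); the turnout rates are the reported 96% and 82%:

```python
import math

def diff_ci(p1, n1, p2, n2, z=1.96):
    """Approximate 95% Wald CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# Assumed ~even split of the 133 voters; rates as reported.
lo, hi = diff_ci(0.96, 66, 0.82, 67)
print(f"{lo:.1%} to {hi:.1%}")  # roughly 3.7% to 24.3%
```

So even taking the published numbers at face value, the interval stretches from an effect in line with ordinary get-out-the-vote mailers up to something implausibly enormous.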