“Update: On Twitter, some researchers argued, reasonably in my view, that I wasn’t quite sceptical enough in relating these findings. See the update at the end of this post for more details.”

If you wanted a poster child for the replication crisis and the controversy it has unleashed within the field of psychology, it would be hard to do much better than Fritz Strack’s findings. In 1988, the German psychologist and his colleagues published research that appeared to show that if your mouth is forced into a smile, you become a bit happier, and if it’s forced into a frown, you become a bit sadder. He pulled this off by asking volunteers to view a set of cartoons (paper ones, not animated) while holding a pen in their mouth, either with their teeth (forcing their mouth into a smile), or with their lips (forcing a frown), and to then use the pen in this position to rate how amused they were by the cartoons. The smilers were more amused, and the frowners less so – and best of all, they mostly didn’t discern the true purpose of the experiment, eliminating potential placebo-effect explanations.

This basic idea, that our facial expressions can feed back into our psychological state and behavior, goes back at least as far as Darwin and William James, but “facial feedback”, as it is known, had never been demonstrated in such an elegant and rigorous-seeming manner. Over time, this style of experiment was replicated and expanded upon, and soon it came to be considered a true blockbuster, so famous it found its way into psychology textbooks, as well as popular books and articles citing it as an example of the unexpectedly subtle ways our bodies and environments can affect us psychologically. Often, facial feedback has been popularised along the lines of “Maybe you can smile your way to happiness!”, which added an irresistible self-help element that likely helped spread the idea. Either way, it seemed like a genuinely safe and solid psychological finding. That changed rather abruptly in 2016.

That was when a large, multinational replication attempt of the 1988 study, organised by E. J. Wagenmakers of the University of Amsterdam (Strack had bravely volunteered his study for such scrutiny), delivered some surprising results from 17 labs experimenting on almost 2,000 participants: There wasn’t much evidence to support the effect after all. Nine of the labs found the expected effect, albeit in a much weaker form – a difference of just .1 or .2 points, on average, on the nine-point cartoon-amusement rating scale, between the smilers and frowners, as compared to an average difference of about .8 in the original study – while the rest found an about equally weak effect pointing in the other direction. Summing up the whole episode in an enjoyable and comprehensive Slate article, Daniel Engber writes, “When Wagenmakers put all the findings in a giant pile, the effect averaged out and disappeared. The difference between the smilers and frowners had been reduced to three-hundredths of a rating point, a random blip, a distant echo in the noise.”

But a failed replication is rarely the end of the story, because failed replications often spark further controversy over what they mean – or don’t. This was no exception: Some observers agreed that Wagenmakers and his colleagues’ work really did call the idea of facial feedback into question. Others, including Strack himself, argued that because the researchers had altered certain aspects of the experimental setup, these weren’t “true” replications and thus couldn’t be counted as evidence against the original finding. “I don’t see what we’ve learned,” Strack told Engber. (Strack and Wolfgang Stroebe published a paper in 2014 making this replication-sceptical argument more generally.)

Now, a new paper adds a bit of evidence to the idea that the failure to replicate here might have more to do with methodological issues than with the absence of a real effect. For an article published in the Journal of Personality and Social Psychology, Tom Noah, Yaacov Schul, and Ruth Mayo of the Hebrew University of Jerusalem basically replicated the Wagenmakers teams’ replications, but with a newly introduced independent variable to toggle on or off: the presence of a video camera. As the authors explain, this was one of the key differences between the original studies and the replications: in the former, there was no video camera watching the participants, but in the latter there was (the footage was used to check whether the participants had followed the instructions correctly). Strack himself had cited the presence of the cameras as one reason he was sceptical of the failed replications.

Noah and her colleagues write that there are theoretical reasons to believe that the feeling of being observed – by a camera, in this case – could have certain effects that might disrupt facial feedback. Specifically, under such circumstances “people adopt an external perspective of themselves… [and] tend to neglect internal information.” In other words, the act of smiling might cause certain internal cues that in turn cause an uptick in happiness or amusement, but the feeling of being observed could short-circuit this connection.

So in half the Noah team’s (re-)replications, there was a video camera. In the other half, there wasn’t one. And sure enough, there was a good-sized statistically significant effect in the no-camera group – a .83 difference between the smiling and frowning groups on that nine-point scale, which was much larger than the average effect sizes of about .1 or .2 in the successful Wagenmakers replications and right in line with what Strack had found originally – but not in the camera group, where the difference was minuscule and statistically non-significant. This, they write, provides evidence that, as per the paper’s title, “Both the Original Study and Its Failed Replications Are Correct.” In other words, it could be that facial feedback is real, but if you feel like you’re being observed, the effect is stymied.

This could explain the whole sequence, from the original, exciting experiment to the dispiriting follow-ups from 2016: There were no cameras in the original paper (and the various replications of it that followed), so the effect was observed. Then Wagenmakers’ replicators introduced a new feature of the experiment – cameras – that disrupted the effect, so poof, the effect went away.

Toward the end of their paper Noah, Schul, and Mayo write of the importance of “cumulative science”, and their research is a good example of how that principle could be put to work to help resolve what has become something of a vexing controversy in social psychology. Now, researchers have a new theory to work with: being observed can disrupt the facial feedback effect. One study can’t prove this, of course – it’s time, as always, to run more of them, to get a few inches closer to the truth of the matter.

This sort of research also introduces an important cautionary note into the question of how science should be communicated to the public – as an ongoing process rather than as a generator of open-and-closed facts. It is often the case that fairly limited, difficult-to-generalise-from lab findings get popularised in overhyped ways – the reason the power posing controversy, for example, blew up the way it did has just as much to do with the way the findings were popularised and presented to the public (both by Amy Cuddy herself and by journalists and others) as with the core research itself. At the same time, when a famous finding fails to replicate, sceptics can be quick to label the entire topic as junk science. This new replication of a failed replication provides the latest reminder that psychological lab findings – particularly sexy, counterintuitive ones from social psych – can be quite context-dependent and can wither when even subtle changes to the experimental procedure are made. This should make us all the more sceptical about the big, bold claims made by popularisers in TED Talks and elsewhere – and all the more aware of the importance of careful, nuanced science reporting and communication.

Post written by Jesse Singal (@JesseSingal) for the BPS Research Digest. Jesse is a contributing writer at New York Magazine. He is working on a book about why shoddy behavioral-science claims sometimes go viral for Farrar, Straus and Giroux.

—

UPDATE (August 21, 2018): After my post went live, some researchers expressed scepticism about this finding on Twitter and elsewhere. There were some good, informative threads published by Malte Elson and Nick Michalak, among others. And Alexa Tullett, a psych researcher who was asked by Journal of Personality and Social Psychology to peer review the paper, posted that review on Simine Vazire’s website, and it makes important points as well. Check out these links if you want to better understand the full scope of the controversy surrounding these papers.

Overall, the critics make solid points, and after reading their thoughts I believe I may have not covered this study with a critical enough eye. Here I’ll briefly sum up what I view as the two strongest pieces of evidence that go against the idea that this paper significantly bolsters the hypothesis that the presence or absence of a camera can explain the disparity between the Strack study and the followups organised by Wagenmakers. I’m not going to go into some of the more technical, in-the-weeds critiques — readers interested in those can simply click the above links.

The two main issues:

1. In this newest study, the interaction effect between camera presence and facial expression was unimpressive. Noah et al write in their paper: “Although the test of the 2 x 2 interaction was greatly underpowered, the preregistered analysis concerning the interaction between the facial expression and camera presence was marginally significant in the expected direction,” with a p value of p = .051.

The “2 x 2” here simply refers to the four experimental conditions: smile with camera, smile without camera, frown with camera, and frown without camera. As for p = .051, to greatly oversimplify a technical subject, that means that the finding isn’t quite considered “statistically significant” by the (admittedly arbitrary) rules of statistical science: p = .05 is the benchmark at which researchers are “confident enough” that what they’re observing isn’t random noise, and the lower the p value, the better. It’s true, as I mentioned in my report, that facial expression made a statistically significant difference to amusement when the no-camera condition was considered on its own, but the fact that the interaction between conditions was technically non-significant suggests a certain statistical shakiness to the central-to-this-paper claim that the presence or absence of a camera moderates the relationship between facial expression and level of amusement elicited by the cartoons.
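To make the distinction between “the effect was significant in one group” and “the interaction was significant” a little more concrete, here is a minimal sketch of what a 2 x 2 interaction measures: a difference between two differences. The cell means below are hypothetical and chosen purely for illustration. Only the .83 gap in the no-camera condition comes from the figures reported above; the other numbers, and the function names, are my own assumptions rather than anything from Noah et al.’s data.

```python
# Hypothetical cell means on the nine-point amusement scale.
# Only the 0.83 no-camera gap is taken from the article; all other
# values are invented for illustration and are NOT the study's data.
cell_means = {
    ("no_camera", "smile"): 5.00,
    ("no_camera", "frown"): 4.17,
    ("camera", "smile"): 4.60,
    ("camera", "frown"): 4.55,
}


def simple_effect(means, camera):
    """Smile-minus-frown difference within one camera condition."""
    return means[(camera, "smile")] - means[(camera, "frown")]


def interaction_contrast(means):
    """Difference of differences: how much the facial-expression
    effect changes between the no-camera and camera conditions."""
    return simple_effect(means, "no_camera") - simple_effect(means, "camera")


print(f"no-camera effect: {simple_effect(cell_means, 'no_camera'):.2f}")
print(f"camera effect:    {simple_effect(cell_means, 'camera'):.2f}")
print(f"interaction:      {interaction_contrast(cell_means):.2f}")
```

The claim that a camera moderates facial feedback is a claim about that last number, not about either simple effect on its own. The interaction test asks whether the difference of differences is reliably distinguishable from zero, and that is the test that came out at p = .051.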

2. According to their preregistered experimental setup, the authors planned on running the study with 200 subjects, 50 in each of the four conditions, but they stopped well short of that.

Preregistration is a wonderful innovation when it comes to scientific transparency and replicability: It’s basically a system where researchers publicly lay out exactly how they’re going to run their experiments and what their hypotheses are, making it more difficult for them to engage in certain types of post-facto fiddling and statistical manipulation that have, in the past, likely contributed to the replication crisis. To take the simplest example imaginable: If I preregister a study in which I state I plan on tossing a coin 10,000 times to determine whether it’s fair, that makes it more difficult for me to then toss it only seven times and (laughably) declare that it’s unfair because I got five tails and two heads.
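For readers who want to see just how laughable the seven-toss conclusion would be, here is a rough sketch that works out the numbers. It assumes an exact two-sided binomial test, which is my choice for illustration rather than anything specified in the paper or its critiques, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(n, k, p=0.5):
    """Exact two-sided binomial p-value: the total probability, under a
    coin with P(tails) = p, of every outcome that is no more likely
    than the observed count of k tails in n tosses."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# Five tails in seven tosses, tested against a fair coin.
p_value = binomial_two_sided_p(n=7, k=5)
print(f"p = {p_value:.3f}")  # p = 0.453
```

With a p value of roughly .45, a perfectly fair coin would produce a split at least this lopsided almost half the time, so seven tosses tell you next to nothing.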

Noah and colleagues preregistered this study, and they deserve credit for having done so. But as Michalak pointed out on Twitter, they “preregistered 200 participants who met inclusion criteria [but, according to a footnote in their paper] decided to stop at the end of the semester,” leaving them with fewer subjects. Since the whole point of preregistration is for researchers to actually follow the procedures they have preregistered, this isn’t ideal. It could be that if they had run more subjects, their already-shaky “marginally significant” finding would have dipped into the realm of, well, not even marginally significant.

Again, there are other objections too, many of them more technical, so if you’re curious about this controversy make sure to click on the above links. But all this just goes to show, as I noted in the original post, that no one new study should be taken as “proving” anything. Whatever one thinks of the strength of this finding, and of the camera hypothesis, the answer is the same: Run more studies. Ideally bigger ones that are preregistered, and where everything in the preregistration is followed to the letter.

8 thoughts on “Updated: A re-replication of a psychological classic provides a cautionary tale about overhyped science”

The smiles generated this way are “not authentic”, using a pen to generate smile, ha ha..!
Folk dancing, celebrating festivals etc. make you smile as long as you enjoy them, but, if you want to be happy, it’s an inside-out thing.

Using a logic based independent variable to measure an emotion based dependent variable is complicating matters. Maybe better to stick to one or the other. Also the world has changed since 1988 and comic/cartoon strips are no longer limited to children’s comics and Sunday newspapers. They have been used as vehicles for cynicism and are accessible to all. An alarm may alert the modern brain to potential shock, changing the reaction from emotion to logic. Change the independent to an emotion provoking subject – a puppy or sweet kitten. Even then an excess of cuteness on-line may dull the emotional impact so need something new and surprising!

I remember learning about this experiment early on in my degree many years ago, but I seem to recollect that the pen went across the mouth and between the teeth – really forcing a (forced) smile and using the same muscles that are used in smiling, far more than in the photos shown above where the pen is coming forward and out of the mouth. Maybe my memory is incorrect! In the alternative condition the pen was held just with lips.

One issue here is the p value. .05 is arbitrary to start with, but widely accepted. To call .051 ‘marginal’ is bizarre. That means a ‘50 in 1000’ chance is really important while a ‘51 in 1000’ is not. Get a life. We need more lab experiments on failed arithmetic reasoning. As an ex-academic social scientist I find the non-scientific approaches that creep into everyday talk of (e.g.) psychologists depressing.