Rolf Zwaan

Friday, May 18, 2018

My former colleague Roy Baumeister famously said that replication is a “career niche for bad experimenters.”* I like to use this quote in my talks. Roy is wrong, of course. As anyone who has tried to conduct a replication study knows, it requires a great deal of skill to perform replications. This leads to the question: Is there a career niche for replicators?

I was asked this question yesterday when I gave a talk on
Making Replication Mainstream at the marvellous Donders Institute for Cognition, Brain, and Behaviour in Nijmegen. I get asked this question regularly. My standard
answer is that it is not a good career choice. Implicit in this answer is the idea that in order to become a tenured faculty member, one has to make a unique contribution to the literature. Promotion-and-tenure letter writers are always asked to comment on the uniqueness of a candidate’s work. Someone who only conducts replication studies would run the risk of not meeting the current requirements for becoming and remaining a faculty member.

During lunch, a group of us got to talking some more about
this issue, to which I hadn't given sufficient thought, as it soon turned out.

It was pointed out that there is a sizeable group of researchers who would like to remain in science and have excellent methodological skills, but don’t necessarily have the ambition/creativity/chutzpah/temerity to pursue a career as a faculty member.

These researchers, was the thinking at our lunch table, are
perfectly suited to conduct replication research. The field would benefit
greatly from their work. If we truly want to make replication mainstream, there
ought to be a career niche for them.

If a faculty position is not a viable option, then what would be a good career niche for replicators? It was suggested at our table that replicators could become staff members, much like lab managers. They would not be evaluated on the originality or uniqueness of their publications. In fact, maybe they would not even be on the publications, just as lab managers often are not. Faculty members would select studies for replication, and replicators would conduct them, thereby making a valuable contribution to our science.

I think this is a fair summary of our discussion. I have no strong opinions on this career niche for replicators yet, but I wonder what y’all’s thoughts on this are.

----
* The link is to a paywalled article but I'm sure you can scihub your way to it.

The authors start with two important observations. First, semantic priming experiments yield robust effects, whereas “social priming” (I’m following the authors’ convention of using quotation marks here) experiments do not. Second, semantic priming experiments use within-subjects designs, whereas “social priming” experiments use between-subjects designs. The authors are right in pointing out that this latter fact has not received sufficient attention.

The authors’ goal is to demonstrate that the second fact is the cause of the first. Here is how they summarize their results in the abstract: “These results indicate that the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect, rather than the content domain in which the effect has been demonstrated.”

This is not what the results are telling us. What the authors have done is take existing well-designed experiments (not all of which are priming experiments, by the way, as has already been pointed out on social media) and demolish them to create, I’m sorry to say, more train wrecks of experiments, in which only a single trial for each subject is retained. By thus getting rid of the vast majority of trials, the authors end up with an “experiment” that no one in their right mind would design. Unsurprisingly, they find that in each of these cases the effect is no longer significant.

Does this show that “the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect”? Of course not. The authors imply that having a within-subjects design is sufficient for finding robust priming effects, of whatever kind. But they have not even demonstrated that a within-subjects design is necessary for priming effects to occur. For example, based on the data in this manuscript, it cannot be ruled out that a sufficiently powered between-subjects semantic priming experiment would, in fact, yield a significant result. We already know from replication studies that between-subjects “social priming” experiments do not yield significant effects, even with high power.
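To see how much power is thrown away by this demolition, here is a small simulation of my own (all numbers are made up for illustration; this is not the authors' procedure or data). It generates a modest priming effect and analyzes it both ways: averaging over many trials within subjects, and keeping a single trial per subject in a between-subjects comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up numbers: 40 subjects, 100 trials per condition, a 20 ms priming
# effect, 100 ms trial noise, and 50 ms of between-subject variability.
n_subj, n_trials, effect = 40, 100, 20.0
subj_mean = rng.normal(500, 50, n_subj)[:, None]
related   = subj_mean + rng.normal(0, 100, (n_subj, n_trials))
unrelated = subj_mean + effect + rng.normal(0, 100, (n_subj, n_trials))

# Within-subjects analysis: average over trials, then a paired t-test.
t_within, p_within = stats.ttest_rel(unrelated.mean(axis=1), related.mean(axis=1))

# "Single-trial" analysis: keep one trial per subject and compare the two
# halves of the sample between subjects.
half = n_subj // 2
t_between, p_between = stats.ttest_ind(unrelated[:half, 0], related[half:, 0])

print(f"within-subjects (all trials): p = {p_within:.5f}")
print(f"between-subjects (one trial): p = {p_between:.5f}")
```

With these hypothetical numbers, averaging over trials detects the 20 ms effect easily, whereas the single-trial between-subjects comparison is grossly underpowered (d ≈ 0.18 with 20 subjects per cell). That a between-subjects single-trial test comes out nonsignificant is therefore entirely uninformative about the effect itself.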

More importantly, the crucial experiment showing that a within-subjects design is sufficient to yield “social priming” effects is absent from the paper. Without such an experiment, any claims about the design being the key difference between semantic and “social priming” are unsupported.

So where does this leave us? The authors have made an important initial step in identifying differences between semantic and “social priming” studies. However, to draw causal conclusions of the type the authors want to draw in this paper, two experiments are needed.

First, an appropriately powered single-trial between-subjects semantic priming experiment. To support the authors’ view, this experiment should yield a null result. This should of course be tested using the appropriate statistics. Rather than using response times, the authors might consider using a word-stem completion task. Contrary to what the authors would have to predict, I predict a significant effect here. If I’m correct, it would invalidate the authors’ claim about a causal relation between design and effect robustness.

Second, the authors should conduct a within-subjects “social priming” experiment (one that is close to the experiments they describe in the introduction). Whether or not this is possible, I cannot determine.

If the authors are willing to conduct these experiments, and to omit the uninformative ones they report in the current manuscript, then they would make a truly major contribution to the literature. As it stands, they merely add more train wrecks to the literature. I therefore sincerely hope they are willing to undertake the necessary work.

Smaller points

p. 8. “In this approach, each participant is randomized to one level of the experimental design based on the first experimental trial to which they are exposed. The effect of priming is then analyzed using fully between-subjects tests.” But the order in which the stimuli were presented was randomized, right? So this analysis actually compares different items across subjects. Given that there typically is variability in response times across items (see Herb Clark’s 1973 paper on the “language-as-fixed-effect fallacy”), this unnecessarily introduces noise into the analysis. And because there usually also is a serial position effect, the problem cannot be solved by taking the same item for every subject; one would have to take the same item in the same position. It is therefore impossible to take a single trial without losing experimental control over item and order effects. This is another reason why the “experiments” reported in this paper are uninformative.
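To make the item-variance point concrete, here is a small simulation (the variance components are hypothetical, chosen only for illustration, and serial position effects are left out). When each subject contributes a different, randomly chosen item, item variability is added to the error term of the between-subjects contrast; holding the item constant removes it.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical variance components (ms): items, subjects, residual noise.
n_subj, n_items = 2000, 40
item_sd, subj_sd, noise_sd = 60.0, 50.0, 80.0
item_means = 500 + rng.normal(0, item_sd, n_items)

# Single-trial analysis: each subject contributes one randomly chosen item,
# so item variability ends up in the between-subjects error term.
chosen = rng.integers(0, n_items, n_subj)
rt_random_item = (item_means[chosen]
                  + rng.normal(0, subj_sd, n_subj)
                  + rng.normal(0, noise_sd, n_subj))

# Same item for everyone: the item contributes a constant, not variance.
rt_same_item = (item_means[0]
                + rng.normal(0, subj_sd, n_subj)
                + rng.normal(0, noise_sd, n_subj))

print(f"SD, random item per subject: {rt_random_item.std():.1f} ms")
print(f"SD, same item for everyone:  {rt_same_item.std():.1f} ms")
```

The extra spread in the first case is pure item noise, which a proper within-subjects design averages away.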

p. 9. The Stroop task is not really a priming task, as the authors point out in a footnote. Why not use a real priming task?

p. 15. “It is not our intention to suggest that failures to replicate priming effects can be
solely attributed to research design.” Maybe not, but by stating that design is “the key difference,” the authors are claiming it has a causal role.

p. 16. “We anticipate that some critics will not be satisfied that we have examined ‘social
priming’.” I’m with the critics on this one.

p. 17. “We would note that there is nothing inherently “social” about either of these features of priming tasks. For example, it is not clear what is particularly “social” about walking down a hallway.” Agreed. Maybe call it behavioral priming then?

p. 18. “Unfortunately, it is not possible to ask subjects to walk down the same hallway 300 times after exposure to different primes.” Sure, but with a little flair, it should be possible to come up with a dependent measure that would allow for a within-subjects design.

p. 19. “We also hope that this research, for once and for all, eliminates content area as an explanation for the robustness of priming effects.” Without experiments such as the ones proposed in this review, this hope is futile.

Wednesday, January 31, 2018

A number of years ago, my colleagues Peter Verkoeijen, Katinka Dijkstra, several undergraduate students, and I conducted a replication of Experiment 5 of Kidd & Castano (2013). In that study, published in Science, participants were exposed to an excerpt from either literary fiction or from non-literary fiction.

Kidd and Castano hypothesized that brief exposure to literary fiction as opposed to non-literary fiction would enhance empathy in people because of the greater psychological finesse in literary novels than in non-literary novels. Anyone who has read, say, Proust as well as Michael Crichton will probably intuit what Kidd and Castano were getting at.

Their results indeed showed that people who had been briefly exposed to the literary excerpt showed more empathy on Theory of Mind (ToM) tests than participants who had been briefly exposed to the non-literary excerpt.

Because the study touches on several of our own interests (text comprehension, literature, empathy), and for a number of reasons detailed in the article, we decided to replicate one of Kidd and Castano’s experiments, namely their Experiment 5. Unlike Kidd and Castano, we found no significant effect of text condition on ToM. We wrote that study up for publication in the Dutch journal De Psycholoog, a journal targeted at a broad audience of practitioners and scientists.

Because researchers from other countries kept asking us about the results of our replication attempt, we decided to make them more broadly available by writing an English version of the article with a more detailed methods and results section than was possible in the Dutch journal. This work was spearheaded by first author Iris van Kuijk, who was an undergraduate student when the study was conducted. A preprint of the article can be found here. An attentive reader who is familiar with the Dutch version and now reads the English version will be surprised. In the Dutch version the effect was not replicated but in the English version it was. What gives?

And this brings us to the wrinkle mentioned in the title. The experiment relies on subjects having read the excerpt. However, as any psychologist knows, there are always people who don’t follow instructions. To pinpoint such subjects and later exclude their data, it is useful to know whether they’ve actually read the texts. In both experiments, reading times per excerpt were collected.

We originally reasoned that it would be impossible for someone to read and understand a page in under 30 seconds. So we excluded subjects who had one or more reading times of under 30 seconds per page. This ensured that our sample included only subjects who had spent at least a reasonable amount of time on each excerpt, which would give the manipulation (reading a literary vs. a non-literary excerpt) an optimal chance to work.

Upon reanalyzing the data for the English version, my co-authors noticed that Kidd and Castano had used a different, less stringent criterion for excluding outliers: they had excluded subjects whose average reading time was < 30 seconds. This criterion potentially retains subjects who had long reading times for one page but skimmed another.
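The difference between the two criteria is easy to illustrate. In this sketch (with made-up reading times), the second subject passes Kidd and Castano's average-based rule but fails our per-page rule:

```python
import numpy as np

# Hypothetical per-page reading times (seconds) for four subjects.
reading_times = np.array([
    [45, 50, 48],   # careful reader: kept under both rules
    [80, 10, 85],   # skimmed one page: mean is fine, one page is not
    [25, 20, 22],   # skimmed everything: excluded under both rules
    [35, 31, 40],
])

# Our original rule: every page must take at least 30 s.
keep_per_page = (reading_times >= 30).all(axis=1)

# Kidd and Castano's rule: the average over pages must be at least 30 s.
keep_average = reading_times.mean(axis=1) >= 30

print(keep_per_page)  # [ True False False  True]
print(keep_average)   # [ True  True False  True]
```

The per-page rule is strictly more stringent: every subject it keeps is also kept by the average-based rule, but not vice versa.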

Our original approach ensured that people had at least spent a sufficient amount of time on each page. This still does not guarantee that they actually comprehended the excerpts, of course. For this, it would have been better to include comprehension questions, such that subjects with low comprehension scores could have been excluded, as is common in text comprehension research.

Because we intended to conduct a direct replication, we decided to adopt the exclusion criterion used by Kidd and Castano, even though we thought our own was better. And then something surprising happened: the effect appeared!

What to make of this? On the one hand, you could say that our direct replication reproduced the original effect (very closely indeed). On the other hand, we cannot come up with a theoretically sound reason why the effect would appear with a less stringent exclusion criterion, which gives the manipulation less chance to impact ToM responses, and disappear with a more stringent one.

Nevertheless, if we want to be true to the doctrine of direct replication, which we do, then we should count this as a replication of the original effect but with a qualification. As we say in the paper:

“Taken together, it seems that replicating the results of Kidd and Castano (2013) hinges on choosing a particular set of exclusion criteria that a priori seem not better than alternatives. In fact, […] one could argue that a more stringent criterion regarding reading times (i.e., smaller than 30s per page rather than smaller than 30s per page on average) is to be preferred because participants who spent less than 30 seconds on a page did not adhere to the task instruction of reading the entire text carefully.”

The article also includes a mini meta-analysis of four studies, including the original study and our replication. The meta-analytic effect is not significant but there is significant heterogeneity among the studies.
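For readers unfamiliar with how such a mini meta-analysis works, here is a minimal inverse-variance sketch with Cochran's Q as the heterogeneity test. The effect sizes and standard errors below are invented for illustration; they are not the values from our paper.

```python
import numpy as np
from scipy import stats

# Invented standardized effect sizes (d) and standard errors for four
# studies; illustrative only, not the values from the paper.
d  = np.array([0.70, -0.10, 0.02, 0.05])
se = np.array([0.20, 0.15, 0.20, 0.12])

# Fixed-effect (inverse-variance) pooled estimate.
w = 1 / se**2
d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1 / np.sum(w))
p_pooled = 2 * stats.norm.sf(abs(d_pooled / se_pooled))

# Cochran's Q test for heterogeneity (df = k - 1).
Q = np.sum(w * (d - d_pooled)**2)
p_het = stats.chi2.sf(Q, df=len(d) - 1)

print(f"pooled d = {d_pooled:.3f}, p = {p_pooled:.3f}")  # not significant
print(f"Q = {Q:.2f}, heterogeneity p = {p_het:.3f}")     # significant
```

With these invented numbers the pattern mirrors the one reported in the paper: a nonsignificant pooled effect alongside significant heterogeneity, which is a signal that a single fixed effect does not describe all the studies well.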

Tuesday, December 19, 2017

A while back, Lorne Campbell wrote a blog post listing the preregistered publications from his lab. This is a great idea. It is easy to talk the talk, but it’s harder to walk the walk.

So under the notion that we don't want to be all hat and no cattle, I rounded up some replications and preregistered original papers that I co-authored.

First the replications.

I find performing replications very insightful. My role in two of the RRRs listed below (verbal overshadowing and facial feedback) was rather minor but the 2016 RRR and the issues surrounding it, on which I've blogged before, felt like an onslaught. The 2012 replication study was used to iron out an inconsistency in the literature. An additional replication study is close to getting accepted and will be added to the list in an update.

These days I use direct replications primarily when I want to build on work by others. As per Richard Feynman, before we move on we first need to attempt a direct replication of the effect we want to build on. We first need to know if we can reproduce it in our own lab.

I started preregistering experiments several years ago. All in all, I find it an extremely important practice, quite possibly the most important thing we can do to improve the field. After a while preregistration becomes second nature and it becomes odd not to do it.

I have no experience yet with reviewed preregistrations (other than the three RRRs that I’ve participated in). My co-authors and I submitted one over three months ago but we haven’t gotten the reviews yet.

I should add that I’ve co-authored quite a few additional empirical papers during this period that were not preregistered. This is mainly because the experiments in those papers were conducted years ago, before preregistration was a thing.

Thursday, December 7, 2017

A short blog post that strings together 8 tweets that I sent out today about our new paper.

Today our latest paper on grammatical aspect appeared in Collabra: Psychology. The article reflects the times we psychologists are living in. It does so not from the lofty perspective of the methodologist or statistician, but from the work floor on which the actual scientist (**ducks**) operates.

Our first two experiments were inspired by Hart & Albarracín (2011). That research was itself inspired by some of our own work but took it from cognition into the realm of social psychology, as I described in this blog post.

As the paper explains, these experiments were run in 2012, which is why they were not preregistered. Nobody was doing preregistration at the time. We were trying to build on Hart and Albarracín (H&A) in what some would call a conceptual replication but what is better thought of as an extension.

For the life of us, we couldn’t get an effect like that of H&A. Then we got down to business and started a registered replication project in which we performed a direct replication of H&A. Along with 11 other labs, we found no effect.

We were sidetracked by the replication project, especially because there were some troubling issues with the initial response to our RRR, as I describe here. We were sidetracked to the point that I’d completely forgotten about our 2012 experiments.

Luckily my co-authors had not and we decided to pick up the pieces of our study. It was clear that our research could no longer be driven by our H&A-inspired hypothesis, so we took a slightly different tack.

We conducted three more experiments, now all preregistered, which yielded some interesting new findings, which you can read about in our paper. As usual with Collabra, the data are available and the reviews are open.

Monday, August 7, 2017

Collabra: Psychology has a submission option called streamlined review. Authors can submit papers that were previously rejected by another
journal for reasons other than a lack of scientific, methodological, or ethical
rigor. Authors request permission from the original journal and then submit their revised manuscript with the original
action letters and reviews. Editors like me then make a decision about the revised
manuscript. This decision can be based on the ported reviews or we can solicit
further reviews.

One recent streamlined submission had previously been rejected
by an APA journal. It is a failed self-replication. In the original experiment,
the authors had found that a certain form of semantic priming, forward priming,
can be eliminated by working-memory load, which suggests that forward
semantic priming is not automatic. This is informative because it contradicts
theories of automatic semantic priming. When they tried to follow up on this
work for a new paper, however, the researchers were unable to obtain this elimination
effect in two experiments. Rather than relegating the study to the file drawer,
they decided to submit it to the journal that had also published their first
paper on the topic. Their submission was rejected. It is now out
in Collabra: Psychology. The reviews
can be found here.

[Side note: I recently conducted a little poll on Twitter asking
whether or not journals should publish self-nonreplications. A staggering 97% of
the respondents said journals should indeed publish self-nonreplications. However,
if anything, this is evidence of the Twitter bubble I’m in. Reality is more recalcitrant.]

I thought the other journal’s reviews were thoughtful. Nevertheless,
I reached a different conclusion than the original editor. A big criticism in
the reviews was the concern about “double-dipping.” If an author publishes a
paper with a significant finding, it is unfair to let that same author then
publish a paper that reports a nonsignificant finding, as this gives the researcher
two bites at the apple.

I understand the point. What drives this perception of
unfairness is our current incentive system.

People are (still) rewarded for the
number of articles they publish, so letting someone first publish a finding and
then a nonreplication of this finding is unfair. It is as if in football (the
real football, where you use your feet to propel the ball) you get a point for
scoring a goal and then an additional point for missing a shot from the same
position.

However understandable, this idea loses its persuasive power
once we take the scientific record into account. As scientists, we want to
understand the world and lay a foundation for further research. It is therefore
important to have good estimates of effect sizes and the confidence we should
have in them. A nonreplication serves to correct the scientific record. It
tells us that the effect is less robust than we initially thought. This is
useful information for meta-analysts, who can now include both findings in
their collection. And even more importantly, it is very useful for researchers
who want to build on this research. They now know that the finding is less
reliable than they previously thought. It might prevent them from wandering into
a potential blind alley.

As with anything in science, allowing the publication of
self-nonreplications opens the door to gaming the system. People could p-hack
their way to a significant finding, publish it and then fail to “replicate” the
finding in a second paper. As an added bonus, the self-nonreplication will also
give them the aura of earnest, self-critical, and ethical researchers.
Moreover, the self-nonreplication pretty much inoculates the finding from
“outside” replication efforts. Why try to replicate something that even the
authors themselves could not replicate?

That’s not two, not three, but four birds with one stone!
You might think that I’m making up the inoculation motive for dramatic effect.
I’m not. A researcher I know actually suspects another researcher of using the
inoculation strategy.

How worried should we be about the misuse of
self-nonreplications? I’m not sure. One potential safeguard is to have the
authors explain why they performed the replication. Did they think there was
something wrong with the original finding or were they just trying to build on
it and were surprised to discover they couldn’t reproduce the original finding?
And if a researcher makes a habit of publishing self-nonreplications, I’m sure
people would be on to them in no time and questions would be asked.

So I think we should publish self-nonreplications. (1) They
help to make the scientific record more accurate. (2) They are likely to
prevent other researchers from ending up in a cul-de-sac.

The concern about double-dipping is only a concern given our
current incentive system, which is one more indication that this system is detrimental
to good science. But that’s a topic for a different post.

Wednesday, July 26, 2017

Today another guest post. In this post, Fernanda Ferreira and John Henderson respond to the recent and instantly (in)famous multi-authored proposal to lower the level of statistical significance to .005. If you want to discuss this post, Twitter is the medium for you. The authors' handles are @fernandaedi and @JhendersonIMB.

Fernanda Ferreira

John M. Henderson

Department of Psychology and Center for Mind and Brain

University of California, Davis

The paper “Redefine Statistical Significance” (henceforth, the “.005 paper”), written by a consortium of 72 authors, has already made quite a splash even though it has yet to appear in Nature Human Behaviour. The call to redefine statistical significance from .05 to .005 would have profound consequences across psychology, and it is not clear to us that the broad implications for the field have been thoroughly considered. As cognitive psychologists, we have major concerns about the advice and the rationale for this severe prescription.

In cognitive psychology we test theories motivated by a body of established findings, and the hypotheses we test are derived from those theories. It is therefore rarely true that any experimental outcome will be treated as equally likely. Our field is not effects-driven—we’re in the business of building and testing functional theories of the mind and brain, and effects are always connected back to those theories.

Standard practice in our subfield of psychology has always been based on replication. This has been extensively discussed in the literature and in social media, but it seems helpful to repeat the point: All of us were trained to design and conduct a theoretically motivated experiment, then design and conduct follow-ups that replicate and extend the theoretically important findings, often using converging operations to show that the patterns are robust across measures. This is why the stereotype emerged that cognitive psychology papers were typically three experiments and a model, where “model” is the subpart of the theory tested and elaborated in this piece of research.

Standard practice is also to motivate new research projects from theory and existing literature; the idea for a study doesn’t come out of the blue. And the first step when starting a new project is to make sure the finding or phenomenon to be built upon replicates. Then the investigator goes on to tweak it, play with it, push it, etc., all in response to refined hypotheses and predictions that fall out of the theory under investigation.*

Now, at this point, even if you agree with us, you might be thinking, “Well what would be the harm in going to a more conservative statistical criterion? Requiring .005 would only have benefits, because then we guard against Type I error and we avoid cluttering up the literature with non-results.” Unfortunately, as many have pointed out in informal discussions concerning the .005 paper, and as the .005 paper acknowledges as well, there are tradeoffs.

First, if you do research on captive undergraduates or you use M-Turk samples, then Ns in the hundreds might be no big deal. In the article, the authors estimate that a shift to .005 will necessitate at least a 70% increase in sample sizes, and they suggest this is not too high a price to pay. But setting aside the issue of non-convenience samples, this estimate is for simple effects, and we’re rarely looking for simple effects. In our business it’s all about statistical interactions, and for those, this recommendation can lead to much larger increases in sample size.

And if your field requires you to test non-convenience samples such as heritage language learners, or people with any type of neuropsychological condition such as aphasia, or people with autism, dyslexia, or ADHD, or even just typically developing children, then these Ns might be unattainable. Testing such participants also requires trained, expensive staff. And yet the research might be theoretically and practically important. So if you work with these non-convenience samples, subject testing is costly. It probably requires real money to pay those subjects and the research assistants doing the testing, and the money is almost always going to come from research grants. And we all know what the situation is with respect to research funding—there’s very little of it.

But even if you had the money, and you didn’t care that it came at the expense of the funding of maybe some other scientist’s project, where would you find the large numbers of people that this shift in alpha level would require? What this means in practice is that some potentially important research will not get done.
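The roughly 70% figure for simple effects can be checked with a back-of-the-envelope power calculation. This is our own sketch, using a normal approximation for a two-sided two-sample comparison, not the authors' exact computation:

```python
from scipy import stats

def n_per_group(alpha, power, d):
    """Normal-approximation N per group for a two-sided two-sample
    comparison with standardized effect size d."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

n_05  = n_per_group(0.05,  0.80, d=0.4)
n_005 = n_per_group(0.005, 0.80, d=0.4)
print(f"N per group at alpha = .05:  {n_05:.0f}")
print(f"N per group at alpha = .005: {n_005:.0f}")
print(f"increase: {100 * (n_005 / n_05 - 1):.0f}%")
```

The increase comes out at about 70% for any assumed effect size, because d cancels in the ratio; the point in the text is that interaction contrasts, with their smaller effective effect sizes, push the absolute Ns up much further still.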

Let’s turn now to Type II error. The authors of the .005 piece, to their credit, discuss the tradeoff between settling for Type I versus Type II error, and they come down on the side that Type I is costlier. But this can’t be true as a blanket statement. Missing a potential effect because you’ve set the false positive rate so conservatively could have major implications for theory development as well as for practical interventions. A false positive is a thing that a researcher might follow up and discover to be illusory; but a false negative is not a thing and therefore is likely to be ignored and never followed up, which means that a potentially important discovery will be missed.

Some have noted that the negative reaction to the .005 article has been surprisingly strong. A response we’ve heard to the kinds of concerns we’ve expressed is that the advocates of the .005 paper are not urging .005 as a publication standard, but merely as the alpha level that permits the use of the word “significant” to describe results. However, it is easy to foresee a world in which (if these recommendations are adopted) editors and reviewers start demanding .005 for significance and use it as a publication standard. After all, the goal of the piece presumably isn’t just to fiddle with terminology.

We think the strong reaction against .005 is also in part because the nature of common practice in different areas of psychology is not well represented among those advocating for major changes to research practice, like the .005 proposal. Relatedly, we think it’s unfortunate that, today, in the popular media, one frequently sees references to “the crisis in psychology”, when those of us inside psychology know that the entire field is not in crisis. The response from these advocates might be to say that we’re in denial, but we’re not; as we outlined earlier, the approach to theory building, testing, replication, and cumulative evidence that’s standard in cognitive psychology (and other subareas of psychology) makes it unlikely that a cute but illusory effect will survive.

So our frustration is real. We would like to see the conversation in psychology about scientific integrity broadened to include other subfields, ours and many others.

-----
*When we say these are standard practices in cognitive psychology, we don’t intend to imply that these practices are not standard in other areas; we’re simply talking about cognitive psychology because it’s the area with which we’re most familiar.