Friday, May 18, 2018

My former colleague Roy Baumeister famously said that replication
is a “career niche for bad experimenters.”* I like to use this quote in my talks. Roy
is wrong, of course. As anyone who has tried to conduct a replication study
knows, it takes a great deal of skill to perform replications. This leads to
the question: is there a career niche for replicators?

I was asked this question yesterday when I gave a talk on
Making Replication Mainstream at the marvellous Donders Institute for Cognition, Brain, and Behaviour in Nijmegen. I get asked this question regularly. My standard
answer is that it is not a good career choice. Implicit in this answer is the
idea that in order to become a tenured faculty member, one has to make a unique
contribution to the literature. Writers of promotion-and-tenure letters are always asked
to comment on the uniqueness of a candidate’s work. Someone who only conducts
replication studies would run the risk of not meeting the current requirements to become and remain a faculty member.

During lunch, a group of us got to talking some more about
this issue, to which I hadn't given sufficient thought, as it soon turned out.

It was pointed out that there is a sizeable group of researchers who
would like to remain in science and who have excellent methodological skills, but who don’t
necessarily have the ambition/creativity/chutzpah/temerity to pursue a career
as faculty members.

These researchers, was the thinking at our lunch table, are
perfectly suited to conduct replication research. The field would benefit
greatly from their work. If we truly want to make replication mainstream, there
ought to be a career niche for them.

If a faculty position is not a viable option, then what would be a good career niche for replicators? It was suggested at our table that replicators could become staff members,
much like lab managers. They would not be evaluated on the originality or uniqueness of their
publications. In fact, maybe they would not even be on the
publications, just as lab managers often are not. Faculty
members would select studies for replication; replicators would conduct them, and by doing so make a valuable contribution to our science.

I think this is a fair summary of our discussion. I have no
strong opinions on this career niche for replicators yet, but I wonder what y’all’s thoughts on this are.

----
* The link is to a paywalled article but I'm sure you can scihub your way to it.

The authors start with two important observations. First, semantic priming experiments yield robust effects, whereas “social priming” (I’m following the authors’ convention of using quotation marks here) experiments do not. Second, semantic priming experiments use within-subjects designs, whereas “social priming” experiments use between-subjects designs. The authors are right in pointing out that this latter fact has not received sufficient attention.

The authors’ goal is to demonstrate that the second fact is the cause of the first. Here is how they summarize their results in the abstract: “These results indicate that the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect, rather than the content domain in which the effect has been demonstrated.”

This is not what the results are telling us. What the authors have done is take existing well-designed experiments (not all of which are priming experiments, by the way, as has already been pointed out on social media) and demolish them to create, I’m sorry to say, more train wrecks of experiments, in which only a single trial per subject is retained. By thus discarding the vast majority of trials, the authors end up with an “experiment” that no one in their right mind would design. Unsurprisingly, they find that in each case the effect is no longer significant.

Does this show that “the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect”? Of course not. The authors imply that having a within-subjects design is sufficient for finding robust priming effects, of whatever kind. But they have not even demonstrated that a within-subjects design is necessary for priming effects to occur. For example, based on the data in this manuscript, it cannot be ruled out that a sufficiently powered between-subjects semantic priming experiment would, in fact, yield a significant result. We already know from replication studies that between-subjects “social priming” experiments do not yield significant effects, even with large samples.
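The statistical intuition here is easy to make concrete. Here is a minimal simulation sketch (all effect sizes and variance components are assumptions picked for illustration, not estimates from the manuscript): a small priming effect that a within-subjects design with many trials detects almost every time essentially vanishes when each subject contributes a single trial and subjects are compared between groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power(design, effect=20, subj_sd=100, trial_sd=150,
          n_subj=40, n_trials=100, n_sims=500, alpha=0.05):
    """Estimated power to detect a priming effect (in ms) under two designs."""
    hits = 0
    for _ in range(n_sims):
        if design == "within":
            # Each subject contributes n_trials per condition; stable
            # subject-level differences cancel in the paired comparison.
            base = rng.normal(0, subj_sd, n_subj)[:, None]
            unprimed = base + effect + rng.normal(0, trial_sd, (n_subj, n_trials))
            primed = base + rng.normal(0, trial_sd, (n_subj, n_trials))
            _, p = stats.ttest_rel(unprimed.mean(axis=1), primed.mean(axis=1))
        else:
            # Single trial per subject, subjects split across two groups:
            # subject-level variability goes straight into the error term.
            half = n_subj // 2
            unprimed = rng.normal(effect, np.hypot(subj_sd, trial_sd), half)
            primed = rng.normal(0, np.hypot(subj_sd, trial_sd), half)
            _, p = stats.ttest_ind(unprimed, primed)
        hits += p < alpha
    return hits / n_sims

p_within = power("within")
p_between = power("between")
print("within-subjects, 100 trials/condition:", p_within)
print("between-subjects, single trial:       ", p_between)
```

Under these assumptions the within-subjects power is near ceiling while the single-trial between-subjects power hovers near the alpha level, which is exactly why a null result from the demolished designs is uninformative.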

More importantly, the crucial experiment demonstrating that a within-subjects design is sufficient to yield “social priming” effects is absent from the paper. Without such an experiment, any claims about the design being the key difference between semantic and “social priming” are unsupported.

So where does this leave us? The authors have made an important initial step in identifying differences between semantic and “social priming” studies. However, to draw causal conclusions of the type the authors want to draw in this paper, two experiments are needed.

First, an appropriately powered single-trial between-subjects semantic priming experiment. To support the authors’ view, this experiment should yield a null result. This should of course be tested using the appropriate statistics. Rather than using response times, the authors might consider using a word-stem completion task. Contrary to what the authors would have to predict, I predict a significant effect here. If I’m correct, it would invalidate the authors’ claim about a causal relation between design and effect robustness.

Second, the authors should conduct a within-subjects “social priming” experiment (one that is close to the ones they describe in the introduction). Whether or not this is possible, I cannot determine.

If the authors are willing to conduct these experiments, and omit the uninformative ones they report in the current manuscript, then they would make a truly major contribution to the literature. As it stands, they merely add more train wrecks to the literature. I therefore sincerely hope they are willing to undertake the necessary work.

Smaller points

p. 8. “In this approach, each participant is randomized to one level of the experimental design based on the first experimental trial to which they are exposed. The effect of priming is then analyzed using fully between-subjects tests.” But the order in which the stimuli were presented was randomized, right? So this means that this analysis actually compares different items. Given that there typically is variability in response times across items (see Herb Clark’s 1973 paper on the “language-as-fixed-effect fallacy”), this unnecessarily introduces noise into the analysis. Because there usually also is a serial position effect, this problem cannot be solved by taking the same item. One would have to take the same item in the same position. Therefore, it is impossible to take a single trial without losing experimental control over item and order effects. This is another reason why the “experiments” reported in this paper are uninformative.
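The item-noise point can be made concrete with a small simulation (the variance components below are assumptions chosen for illustration, not estimates from any dataset): when each “subject” contributes one trial on a randomly chosen item, the item variance is added on top of the residual noise, inflating the error term relative to a comparison on the same item.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed variance components (ms): differences between items, and
# residual trial-to-trial noise within an item.
n_items, item_sd, resid_sd = 40, 80, 120
item_means = rng.normal(600, item_sd, n_items)

# Same item for everyone: only residual noise contributes.
same_item = item_means[0] + rng.normal(0, resid_sd, 10_000)

# One randomly chosen item per "subject", as in a first-trial analysis
# with randomized stimulus order: item variance is added in.
chosen = rng.integers(0, n_items, 10_000)
random_item = item_means[chosen] + rng.normal(0, resid_sd, 10_000)

print("SD with a fixed item:  ", round(same_item.std(), 1))    # ~ resid_sd
print("SD with random items:  ", round(random_item.std(), 1))  # ~ sqrt(item_sd**2 + resid_sd**2)
```

The second standard deviation is reliably larger, which is the extra noise the single-trial analysis injects before any priming effect has a chance to show itself.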

p. 9. The Stroop task is not really a priming task, as the authors point out in a footnote. Why not use a real priming task?

p. 15. “It is not our intention to suggest that failures to replicate priming effects can be
solely attributed to research design.” Maybe not, but by stating that design is “the key difference,” the authors are claiming it has a causal role.

p. 16. “We anticipate that some critics will not be satisfied that we have examined ‘social
priming’.” I’m with the critics on this one.

p. 17. “We would note that there is nothing inherently “social” about either of these features of priming tasks. For example, it is not clear what is particularly “social” about walking down a hallway.” Agreed. Maybe call it behavioral priming then?

p. 18. “Unfortunately, it is not possible to ask subjects to walk down the same hallway 300 times after exposure to different primes.” Sure, but with a little flair, it should be possible to come up with a dependent measure that would allow for a within-subjects design.

p. 19. “We also hope that this research, for once and for all, eliminates content area as an explanation for the robustness of priming effects.” Without experiments such as the ones proposed in this review, this hope is futile.

Wednesday, January 31, 2018

A number of years ago, my colleagues Peter Verkoeijen, Katinka Dijkstra, several undergraduate students, and I conducted a replication of Experiment 5 of Kidd & Castano (2013). In that study, published in Science, participants were exposed to an excerpt from either literary fiction or non-literary fiction.

Kidd and Castano hypothesized that brief exposure to literary fiction as opposed to non-literary fiction would enhance empathy in people because of the greater psychological finesse in literary novels than in non-literary novels. Anyone who has read, say, Proust as well as Michael Crichton will probably intuit what Kidd and Castano were getting at.

Their results indeed showed that people who had been briefly exposed to the literary excerpt displayed more empathy on Theory of Mind (ToM) tests than participants who had been briefly exposed to the non-literary excerpt.

Because the study touches on several of our own interests (text comprehension, literature, empathy), and for a number of reasons detailed in the article, we decided to replicate one of Kidd and Castano’s experiments, namely their Experiment 5. Unlike Kidd and Castano, we found no significant effect of text condition on ToM. We wrote that study up for publication in the Dutch journal De Psycholoog, a journal targeted at a broad audience of practitioners and scientists.

Because researchers from other countries kept asking us about the results of our replication attempt, we decided to make them more broadly available by writing an English version of the article with a more detailed methods and results section than was possible in the Dutch journal. This work was spearheaded by first author Iris van Kuijk, who was an undergraduate student when the study was conducted. A preprint of the article can be found here. An attentive reader who is familiar with the Dutch version and now reads the English version will be surprised. In the Dutch version the effect was not replicated but in the English version it was. What gives?

And this brings us to the wrinkle mentioned in the title. The experiment relies on subjects having actually read the excerpt. However, as any psychologist knows, there are always people who don’t follow instructions. To pinpoint such subjects and later exclude their data, it is useful to know whether they actually read the texts. In both experiments, reading times per page were collected.

We originally reasoned that it would be impossible to read and understand a page in under 30 seconds. So we excluded subjects who had one or more reading times of less than 30 seconds per page. This ensured that our sample included only subjects who had spent at least a reasonable amount of time on each page. This would give the manipulation, reading a literary vs. a non-literary excerpt, an optimal chance to work.

Upon reanalyzing the data for the English version, my co-authors noticed that Kidd and Castano had used a different criterion for excluding outliers. They had used a criterion that was less stringent than ours. They had excluded subjects whose average reading times were < 30 seconds. This potentially includes subjects who may have had long reading times for one page but may have skimmed another.
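The difference between the two criteria is easy to see with a toy example (the reading times below are invented for illustration; they are not data from either study):

```python
import numpy as np

# Hypothetical reading times (seconds) for a six-page excerpt;
# the three rows are invented readers, not real subjects.
times = np.array([
    [45, 50, 40, 55, 48, 52],   # careful reader
    [90, 10, 85, 12, 95, 11],   # skims every other page, but high average
    [20, 22, 18, 25, 21, 19],   # fast throughout
])

per_page = (times >= 30).all(axis=1)  # our criterion: every page >= 30 s
average = times.mean(axis=1) >= 30    # Kidd and Castano: mean >= 30 s

print(per_page)  # [ True False False]
print(average)   # [ True  True False]
```

The middle reader is excluded under the per-page criterion but retained under the average criterion, which is exactly the kind of skimmer the less stringent rule lets through.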

Our original approach ensured that people had at least spent a sufficient amount of time on each page. This still does not guarantee that they actually comprehended the excerpts, of course. For this, it would have been better to include comprehension questions, such that subjects with low comprehension scores could have been excluded, as is common in text comprehension research.

Because we intended to conduct a direct replication, we decided to adopt the exclusion criterion used by Kidd and Castano, even though we thought our own was better. And then something surprising happened: the effect appeared!

What to make of this? On the one hand, you could say that our direct replication reproduced the original effect (very closely indeed). On the other hand, we cannot come up with a theoretically sound reason why the effect would appear with a less stringent exclusion criterion, which gives the manipulation less of a chance to impact ToM responses, and disappear with a more stringent one.

Nevertheless, if we want to be true to the doctrine of direct replication, which we do, then we should count this as a replication of the original effect but with a qualification. As we say in the paper:

“Taken together, it seems that replicating the results of Kidd and Castano (2013) hinges on choosing a particular set of exclusion criteria that a priori seem not better than alternatives. In fact, […] one could argue that a more stringent criterion regarding reading times (i.e., smaller than 30s per page rather than smaller than 30s per page on average) is to be preferred because participants who spent less than 30 seconds on a page did not adhere to the task instruction of reading the entire text carefully.”

The article also includes a mini meta-analysis of four studies, including the original study and our replication. The meta-analytic effect is not significant but there is significant heterogeneity among the studies.
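For readers unfamiliar with how such a mini meta-analysis works, here is a minimal sketch of a DerSimonian–Laird random-effects analysis with Cochran’s Q test for heterogeneity. The effect sizes and variances below are hypothetical numbers chosen for illustration; they are not the values from our paper.

```python
import numpy as np

# Hypothetical effect sizes (Hedges' g) and sampling variances for four
# studies; illustrative values only, not those reported in the article.
g = np.array([0.55, 0.00, -0.05, 0.20])
v = np.array([0.02, 0.01, 0.015, 0.01])

w = 1 / v                                  # fixed-effect (inverse-variance) weights
g_fixed = (w * g).sum() / w.sum()

# Cochran's Q test for heterogeneity (compare to chi-square with k - 1 df).
Q = (w * (g - g_fixed) ** 2).sum()
df = len(g) - 1

# DerSimonian-Laird estimate of the between-study variance tau^2.
tau2 = max(0.0, (Q - df) / (w.sum() - (w ** 2).sum() / w.sum()))

w_re = 1 / (v + tau2)                      # random-effects weights
g_re = (w_re * g).sum() / w_re.sum()
z = g_re * np.sqrt(w_re.sum())             # z = g_re / SE(g_re)

print(f"Q = {Q:.2f} on {df} df, tau^2 = {tau2:.3f}")
print(f"random-effects g = {g_re:.2f}, z = {z:.2f}")
```

With these made-up inputs, the heterogeneity test is significant (Q exceeds the chi-square criterion for 3 df) while the pooled effect is not (|z| < 1.96), the same qualitative pattern as in our mini meta-analysis.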