You could let them stand as two unrelated utterances. But
that’s not what you did, right? You inferred that the cigarette caused a fire,
which destroyed the forest. We interpret new information based on what we know
(that burning cigarettes can cause fires) to form a coherent
representation of a situation. Rather than leaving the sentences unconnected, we impose a causal connection between the events described by the sentences.

George W. Bush exploited this tendency to create coherence
by repeatedly juxtaposing
Saddam and 9-11, thus fooling three-quarters of the American public into believing that Saddam was behind the attacks without ever stating the connection explicitly.

Sir
Frederic Bartlett proposed that we are continuously engaged in an effort
after meaning. This is what remembering, imagining, thinking, reasoning, and understanding are: efforts to establish coherence. We try to forge connections between what we see and what we know. Often, we encounter obstacles to coherence and we strive
mightily to overcome them.

Take, for example, the last episode of Game of Thrones. One of the characters, Stannis Baratheon, barely
survives a battle and is shown wounded and slumped against a tree. Another
character strikes at him with a sword. But right before the sword hits, there
is a cut to a different scene. So is Stannis dead or not? The question is hotly debated in newsgroups (e.g., in this
thread). The vigor of the debate is a testament to people's intolerance of ambiguity and their effort after meaning.

Stannis Baratheon, will he make it or not?

The arguments pro or contra Stannis being dead are made at
different levels. Some people try to resolve the ambiguity at the level of the
scene. No, Stannis could not have been killed: the positioning of the characters and
the tree suggests that the sword would have struck the tree rather than
Stannis. Other people jump up to the level of the story world. No, Stannis cannot
be dead because his arc is not complete yet. Or: yes, he is dead because there
is nothing anymore for him to accomplish in the story—let’s face it, he even
sacrificed his own daughter, so what does he have left to live for! Yet other
people take the perspective of the show. No, he is not dead, because so far every
major character who has died on the show has been shown being killed; there are no off-screen deaths.
Finally, some people take a very practical view. No, Stannis
cannot be dead, because the actor, Stephen Dillane, is still under contract at
HBO.

The internet is replete with discussions of this type, on
countless topics, from interpretations of Beatles lyrics to conspiracy theories
about 9-11. All are manifestations of the effort after meaning.

Science is another case in point. In a recent interview in
the Chronicle of Higher Education,
Diederik Stapel tries to shed light on his own fraud by appealing to the effort
after meaning:

I think the problem
with some scientists […], is you’re really genuinely interested. You really
want to understand what’s going on. Understanding means I want to understand, I
want an answer. When reality gives you back something that’s chaos and is not
easy to understand, the idea of being a scientist is that you need to dig
deeper, you need to find an answer. Karl Popper says that’s what you need to be
happy with — uncertainty — maybe that’s the answer. Yet we’re trained, and
society expects us to give an answer.

You don’t have to sympathize with Stapel to see that he has
a point here. Questionable research practices are ways to establish coherence
between hypothesis and data, between different experiments, and between data
and hypothesis. Omitting nonsignificant findings is a way to establish coherence between
hypothesis and data and among experiments. You can also establish coherence
between data and hypothesis simply by inventing a new hypothesis in light of
the data and pretending it was your hypothesis all along (HARKing). And if you don’t do any of these things and submit a paper with data that don’t allow you to tell a completely coherent story, your manuscript is likely to get rejected.

So the effort after meaning is systemic in science. As
Stapel says, when nature does not cooperate, there is a perception that we have
failed as scientists. We have failed to come up with a coherent story and we
feel the need to rectify this. Because if we don't, our work may never see the light of day.

Granted, data fabrication takes the effort after meaning to the
extreme: it is the scientific equivalent of sacrificing your
own daughter. Nevertheless, we would do well to acknowledge that as scientists we
are beholden to the effort after meaning. The simple solution is to arrange our science such
that we let the effort after meaning roam free where it is needed—in theorizing and in
exploratory research—and curb it where it has no place, in confirmatory research. Preregistration is an important step toward accomplishing this.

Meanwhile, if you want to give your effort after meaning a workout, don’t hesitate to weigh in on the Stannis debate.

Perspectives on
Psychological Science’s first registered
replication project, RRR1, was targeted at verbal overshadowing,
the phenomenon that describing a visual stimulus, in this case a human face, is
detrimental to later recognition of this face compared to not describing the
stimulus. A meta-analysis of 31 direct
replications of the original finding provided evidence of verbal overshadowing.
Subjects who described the suspect were 16% less likely to make a correct
identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually
replicate) the verbal overshadowing effect for her master’s thesis by using
different stimuli and a different distractor task. I’m not going to talk about
the contents of the research here. I simply want to address the question that’s
posed in the title of this post: p=.20,
what now? Because p=.20 is what we found after having run 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1's 16%.**
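To make the situation concrete, here is a minimal sketch of a chi-square test on a 2x2 identification table. The counts below are hypothetical, invented purely for illustration (they are not the study's data); only the overall pattern, a difference of roughly 9 percentage points that fails to reach significance at this sample size, mirrors the situation described above.

```python
from scipy import stats

# Hypothetical counts, invented for illustration (NOT the actual study data):
# rows = condition, columns = correct vs. incorrect identification,
# 74 subjects per condition, a difference of roughly 9 percentage points.
table = [
    [33, 41],  # description condition: 33/74 (~45%) correct
    [40, 34],  # filler-task condition: 40/74 (~54%) correct
]

chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.2f}")
```

With counts like these, the p-value lands comfortably above .05 even though the descriptive difference looks sizable, which is exactly the ambiguous situation the options below are meant to address.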

Option 1. The effect is not significant, so this
conceptual replication “did not work,” let’s file drawer the sucker. This response
is probably still very common but it contributes to publication bias.

Option 2. We consider this a pilot study and now
perform a power analysis based on it and run a new (and much larger) batch of
subjects. The old data are now meaningless for hypothesis testing. This is
better than option 1 but is rather wasteful. Why throw away a perfectly good
data set?

Option 3. Our method wasn’t sensitive enough. Let’s
improve it and then run a new study. Probably a very common response. But it may be
premature and is not guaranteed to lead to a more decisive result. And you’re
still throwing away the old data (see option 1).

Liverpool FC, victorious in the 2005 Champions League final

in Istanbul after overcoming a 3-0 deficit against AC Milan

Option 4. The effect is not significant, but if we
also report the Bayes factor, we can at least say something meaningful about
the null hypothesis and maybe get it published. This approach seems to be becoming more common. It is not a bad idea
as such, but it is likely to be misinterpreted
as: H0 is true (even by the
researchers themselves). The Bayes factor tells us something about the support
for one hypothesis relative to another, given the data such as they are. And what the data are here is: too
few. We found BF10 = 0.21, which translates to about 5 times more
evidence for H0 than for H1, but this is about as meaningful as the score in a
soccer match after 30 minutes of play. Sure, H0 is ahead but H1 might well
score a come-from-behind victory. There are after all 60 more minutes to play!

Option 5. The
effect is not significant, but we’ll keep on testing until it is. Simmons et
al. have provided a memorable
illustration of how problematic optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate
the false positive rate from 5% to 20%. Still, the intuition that it would be
useful to test more subjects is a good one. And that leads us to…
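The inflation Sherman describes is easy to reproduce. Below is a minimal Monte Carlo sketch (not Sherman's code; the batch size, subject cap, and one-sample t-test are arbitrary illustrative choices): data are simulated under a true null, and we "peek" after every 10 subjects, stopping as soon as p < .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, batch, max_n, alpha = 2000, 10, 100, 0.05

false_positives = 0
for _ in range(n_sims):
    data = rng.normal(0.0, 1.0, max_n)  # H0 is true: the effect is zero
    for n in range(batch, max_n + 1, batch):
        # Peek at the accumulated data after every batch of 10 subjects
        if stats.ttest_1samp(data[:n], 0.0).pvalue < alpha:
            false_positives += 1  # stop at the first "significant" peek
            break

rate = false_positives / n_sims
print(f"False positive rate with optional stopping: {rate:.2f}")
```

With ten peeks per simulated experiment, the nominal 5% error rate roughly quadruples, in line with the inflation Sherman reports.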

Option 6. The
result is ambiguous, so let’s continue testing—in a way that does not inflate the Type I error rate—until we have decisive
information or we've run out of resources. Researchers have proposed several methods of sequential testing that
do preserve the nominal error rate. Eric-Jan
Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian
framework and Daniël
Lakens has described sequential testing as it is performed in the medical
sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998),
which to date has been cited only 17 times in Google Scholar. I will report
Bayes factors as well. The method described by Lakens could not be used in this
case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly
simple: if your p-value is >.01 and <.36, keep on testing until it
crosses one of these limits.*** Frick’s simulations show that this procedure keeps the overall alpha level
under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We
decided to add subjects in batches of 60 (barring exclusions) so as not to
overshoot and yet make our additions substantive. If DataPoint failed to reach shore before we'd reached 500 subjects, we would abandon ship.
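Frick's claim about the overall alpha level is easy to check by simulation. Below is a minimal sketch (not Frick's own code; the batch size of 60, the cap of 480, and the one-sample t-test are illustrative choices): data are generated under H0, tested after each batch, and testing stops when p crosses .01 or .36 or when the subject cap is reached.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, batch, max_n = 4000, 60, 480
lower, upper = 0.01, 0.36  # COAST's p-value limits (Frick, 1998)

outcomes = {"reject": 0, "accept": 0, "abandon": 0}
for _ in range(n_sims):
    data = rng.normal(0.0, 1.0, max_n)  # H0 is true
    for n in range(batch, max_n + 1, batch):
        p = stats.ttest_1samp(data[:n], 0.0).pvalue
        if p < lower:
            outcomes["reject"] += 1  # reached the "significant" coast
            break
        if p > upper:
            outcomes["accept"] += 1  # reached the "null" coast
            break
    else:
        outcomes["abandon"] += 1  # ran out of subjects while still at sea

overall_alpha = outcomes["reject"] / n_sims
print(outcomes, f"overall alpha = {overall_alpha:.3f}")
```

Despite the repeated testing, the proportion of null datasets that ever cross the lower limit stays below .05, consistent with Frick's simulations.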

Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability*

Batch 2: Ntotal=202,
p=.047. People who use optional
stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already
peeked.) We’re using COAST, however, and although the Good Ship DataPoint is
in the shallows of the Rectangular Sea of Probability, it has not reached the
coast. And BF10=0.6, still leaning toward H0.

Batch 3: Ntotal
= 258, p=.013, BF10=1.95. We’re
getting encouraging reports from the crow’s nest. The DataPoint crew will likely
not succumb to scurvy after all! And the BF10 now favors H1.

Batch 5: Ntotal
=359, p=.016, BF10=1.10. Heading back
in the right direction again.

Batch 6: Ntotal
=421, p=.015, BF10=1.17. Barely closer.
Will we reach shore before we all die? We have to ration the food.

Batch 7: Ntotal
=479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain
would have been keelhauled. The taverns will be busy tonight.

Some lessons from this nautical exercise:

(1) More data=better.

(2) We have now successfully extended the verbal overshadowing
effect, although our effect was smaller than RRR1's 16%: 9% after 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is
encouraging but not very strong. And who knows if it will hold up? Up to this
point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk.
Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature but that so few people are aware of them.

Coda
In this post, I have focused on the application of COAST and, for didactic purposes, largely ignored that this study was a conceptual replication. More about this in the next post.

Footnotes

Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.

* Starring in the role of DataPoint is the Batavia, a replica of a 17th-century Dutch East Indies ship, well worth a visit.

** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects; the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we considered these two different kinds of responses. Separating false alarms from misses in this way, however, precluded us from using one-sided tests. I computed Bayes factors using the BayesFactor package in R, with the contingencyTableBF function (sampleType = "indepMulti", fixedMargin = "rows", priorConcentration = 1).

*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.

Wednesday, March 11, 2015

The Many Labs enterprise is on a roll. This week, a
manuscript reporting Many Labs 3 materialized on the already invaluable Open Science Framework. The manuscript reports a large-scale
investigation, involving 20 American and Canadian research teams, into the “end-of-semester effect.”

The lore among researchers is that subjects run at the end
of the semester provide useless data. Effects that are found at the beginning
of the semester somehow disappear or become smaller at the end. Often this is
attributed to the notion that less-motivated/less-intelligent students procrastinate
and postpone participation in experiments until the very last moment. Many Labs 3 notes that there is very little empirical evidence pertaining to the
end-of-semester effect.

To address this shortcoming in the literature, Many Labs 3 set out to conduct 10 replications
of known effects to examine the end-of-semester effect. Each experiment was performed twice by each of the 20 participating teams: once at the beginning of the semester and once at the end of the semester, each time with different subjects, of course.

It must have been a
disappointment to the researchers involved that only 3 of the 10 effects
replicated (maybe more about this in a later post) but Many Labs 3 remained undeterred and went ahead to examine
the evidence for an end-of-semester effect. Long story short, there was none. Or
in the words of the researchers:

It is possible that
there are some conditions under which the time of semester impacts observed
effects. However, it is unknown whether that impact is ever big enough to be
meaningful

This made me wonder about the reasons for expecting an
end-of-semester effect in the first place. Isn’t this just a fallacy born out
of research practices that most of us now frown upon: running small samples, shelving
studies with null effects, and optional stopping?

New projects are usually started at the beginning of a
semester. Suppose the first (underpowered) study produces a significant effect.
This can have multiple reasons:

(1) the effect is genuine;

(2) the researchers stopped testing when the effect was significant;

(3) the researchers massaged the data such that the effect was significant;

(4) it was a lucky shot;

(5) any combination of the above.

How the end-of-semester effect might come about

With this shot in the arm, the researchers are motivated to conduct a second study, perhaps with the same N and the same exclusion and outlier-removal criteria as the first
study but with a somewhat different independent variable. Let’s call it a conceptual replication. If this study, for whatever
reason, yields a significant effect, the researchers might congratulate
themselves on a job well done and submit the manuscript.

But what if the first study does not produce a significant effect? The authors probably conclude that the idea is not worth pursuing after all,
shelve the study, and move on to a new idea. If it’s still early in the
semester, they could run a study to test the new idea and the process might repeat itself.

Now let’s assume the second study yields a null effect, certainly
not a remote possibility. At this juncture, the authors are the proud owners of a Study 1 with an
effect but are saddled with a Study 2
without an effect. How did they get this lemon? Well, of course because of those good-for-nothing numbskulled students who wait until the end of the semester before signing up for an experiment! And thus the “end-of-semester fallacy”
is born.
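The mechanism sketched above does not need any semester effect at all: small samples plus a modest true effect already make the "Study 1 works, Study 2 fails" pattern common. Here is a minimal simulation; the effect size, sample size, and two-sample t-test are arbitrary illustrative choices, not figures from Many Labs 3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_per_group, d = 5000, 20, 0.4  # small samples, modest true effect

pattern = 0
for _ in range(n_sims):
    p_values = []
    for study in (1, 2):  # two identically run studies, different subjects
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        p_values.append(stats.ttest_ind(control, treatment).pvalue)
    if p_values[0] < 0.05 <= p_values[1]:
        pattern += 1  # Study 1 "worked", Study 2 "failed"

rate = pattern / n_sims
print(f"P(significant Study 1, nonsignificant Study 2) = {rate:.2f}")
```

Under these assumptions the pattern occurs in roughly one in five study pairs, with no difference whatsoever between the "beginning-of-semester" and "end-of-semester" subjects: sampling error alone suffices.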

Thursday, February 26, 2015

The journal Basic and Applied Social Psychology (BASP) has taken a resolute and bold step. A recent editorial announces that it has
banned the reporting of inferential statistics. F-values, t-values, p-values and the like have all been declared personae non
gratae. And so have confidence intervals. Bayes factors are not exactly banned but
aren’t welcomed with open arms either; they are eyed with suspicion, like a mysterious traveler in a tavern.

There is a vigorous debate in the scientific literature and
in the social media about the pros and cons of Null Hypothesis Significance
Testing (NHST), confidence intervals, and Bayesian statistics (making researchers in some frontier towns quite nervous). The editors at BASP have seen enough of this debate and have decided to do away with inferential statistics altogether. Sure, you're allowed to submit a manuscript that’s loaded with
p-values and statements about significance or the lack thereof, but they will be
rigorously removed, like lice from a
schoolchild’s head.

The question is whether we can live with what remains. Can
we really conduct science without summary statements? What does the journal offer in their
place? It requires strong descriptive statistics, distributional information, and larger samples. These are all good things, but we need a way to summarize our results: not just
so we can comprehend and interpret them better ourselves, and not just because we need to communicate them, but also because we need to make decisions based on them as researchers, reviewers, editors, and users. Effect sizes are not banned and so will provide summary information that will be used to answer questions like:

--what will the next experiment
be?

--do the findings support the hypothesis?

--has or hasn’t the finding been
replicated?

--can I cite finding X as
support for theory Y?*

As to that last question, you can hardly cite a result by
saying “this finding supports (or does not support) the hypothesis, but here are the descriptives.” The reader will want more in
the way of a statistical argument or an intersubjective criterion to decide one way or the other. I have no idea how researchers, reviewers, and editors are
going to cope with the new freedoms (from inferential statistics) and
constraints (from not being able to use inferential statistics). But that’s actually
what I like about the BASP's ban. It gives rise to a very interesting real-world
experiment in meta-science.

Sneaky Bayes

There are a lot of unknowns at this point. Can we really live without inferential statistics? Will Bayes sneak in through the half-open door and occupy the premises? Will no one dare to submit to the journal? Will authors balk at having their manuscripts shorn of inferential statistics? Will the interactions among authors, reviewers, and editors yield novel and promising ways of interpreting and communicating scientific results? Will the editors in a few years be BASPing in the glory of their radical decision? And how will we measure the success of the ban on inferential statistics? The wrong way to go about this would be to see whether the policy will be
adopted by other journals or whether or not the impact factor of the journal
rises. So how will we determine whether
the ban will improve our science?

Questions, questions. But this is why we conduct experiments and this is why BASP's brave decision should be given the benefit of the doubt.

Footnotes

I thank Samantha Bouwmeester and Anita Eerland for feedback on a previous version and Dermot Lynott for the Strider picture.

* Note that I’m not saying: “will the paper be accepted?” or “does the researcher deserve tenure?”

Wednesday, January 28, 2015

What to do when the crops are failing because of a drought? Why, we persuade the Gods to send rain of course! I'll let the fourth Roman Emperor, Claudius, explain:

Derek Jacobi stuttering away as Claudius in the TV series I, Claudius

There is a black stone called the Dripping Stone, captured originally from the Etruscans and stored in a temple of Mars outside the city. We go in solemn procession and fetch it within the walls, where we pour water on it, singing incantations and sacrificing. Rain always follows--unless there has been a slight mistake in the ritual, as is frequently the case.*

It sounds an awful lot as if Claudius is weighing in on the replication debate, coming down squarely on the side of replication critics: researchers who raise the specter of hidden moderators as soon as a non-replication materializes. Obviously, when a replication attempt omits a component that is integral to the original study (and was explicitly mentioned in the original paper), that omission borders on scientific malpractice. But hidden moderators are only invoked after the fact; they are "hidden," after all, and so could by definition not have been omitted. Hidden moderators are slight mistakes or imperfections in the ritual that are only detected when the ritual does not produce the desired outcome.

As Claudius would have us believe, if the ritual is performed correctly, then rain always follows. Similarly, if there are no hidden moderators, then the effect will always occur; so if the effect does not occur, there must have been a hidden moderator.**

And of course nobody bothers to look for small errors in the ritual when it is raining cats and dogs, or for hidden moderators when p<.05.

I call this the Dripping Stone Fallacy.

Reviewers (and readers) of scientific manuscripts fall prey to a mild(er) version of the Dripping Stone Fallacy. They scrutinize the method and results sections of a paper if they disagree with its conclusions and tend to give these same sections a more cursory treatment if they agree with the conclusions. Someone surely must have investigated this already. If not, it would be rather straightforward to design an experiment and test the hypothesis. One could measure the amount of time spent reading the method section and memory for it in subjects who are known to agree or disagree with the conclusions of an empirical study.

Even the greatest minds fall prey to the Dripping Stone Fallacy. As Raymond Nickerson describes:

Louis Pasteur refused to accept or publish results of his experiments that seemed to tell against his position that life did not generate spontaneously, being sufficiently convinced of his hypothesis to consider any experiment that produced counterindicative evidence to be necessarily flawed.

Confirmation bias comes in many guises, and the Dripping Stone Fallacy is one of them. It makes a frequent appearance in the replication debate. Granted, the Dripping Stone Fallacy didn't prevent the Romans from conquering half the world, but it is likely to be more debilitating to the replication debate.

Footnotes

* Robert Graves, Claudius the God, Penguin Books, 2006, p. 172.

** This is an informal fallacy; it is formally correct (modus tollens) but is based on a false premise.