Tuesday, January 1, 2013

Coming out of the file drawer

My previous post was about the why
of replication studies. This one is about my first foray
into the replication business. That is, my first venture outside the file
drawer (where several nonreplications of other people’s work reside, as well as
nonreplications of studies of my own that were never submitted because we were
unable to replicate the initial finding). I’m coming out of the file drawer, so
to speak.

I’m not going to discuss the contents of the study here. I’m
just going to talk about a couple of things my co-author, Diane Pecher, and I
learned from our replication efforts.

I’ve got the power!

Psychology experiments are chronically underpowered. Simmons,
Nelson, and Simonsohn suggest you need at least 20 subjects per condition,
which is more than many psychology experiments have. At a recent symposium, a statistician
even said that to be informative, experiments should have at least 100 subjects;
otherwise they are merely exploratory (I’m paraphrasing). I have heard people
scoff at these suggestions (they may not be feasible for studies using special populations, and they may not be necessary for psychophysics experiments), but whatever the right number is, it is true that Ns
are too small in the vast majority of psychology experiments, including my own.

Running 100+ subjects is difficult
to accomplish in many labs given the size of the subject pools and the
availability of lab space. My guess is that it would take the better part of a
year to run a study of that size in our lab. But no need to worry; there are
alternatives. We ran our experiments on Mechanical Turk, an online crowdsourcing
marketplace maintained by Amazon. Turkers participate for
small amounts of money in HITs (Human Intelligence Tasks). Thousands of people
in the United States and India are registered in the system. We limited our
samples to people living in the United States, primarily because two of the
experiments we were trying to replicate were run in the United States (the
other two were run in England).

Keeping false
positives at Bayes

With a large N, classic null-hypothesis significance tests become overly
sensitive: an inconsequential difference might show up as significant even
though the effect is trivially small. An alternative is to compute the Bayes
factor, a likelihood ratio that quantifies the strength of the evidence for the
alternative hypothesis relative to the null hypothesis (or the other way
around). To be conclusive with a larger sample, the Bayes factor requires
stronger evidence for the alternative hypothesis than does, for example, a
t-test, but it also allows you to determine whether a small effect is
consequential.
Bayes factors can easily be computed using Jeffrey Rouder’s web site at the University of
Missouri. You just put in a t-value
and the sample size and it will return the Bayes factor—actually three of them;
we used the JZS Bayes factor.
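For the curious, the JZS Bayes factor the calculator returns can also be computed directly from the t-value and sample size. The sketch below follows the one-sample formula in Rouder, Speckman, Sun, Morey, and Iverson (2009); the Cauchy prior scale of 1 and the function name are my assumptions, not details taken from the website.

```python
# Sketch of a JZS Bayes factor from a t-value and sample size,
# following Rouder et al. (2009). Prior scale r = 1 is assumed here.
import numpy as np
from scipy.integrate import quad


def jzs_bf10(t, n):
    """One-sample JZS Bayes factor (alternative over null)."""
    nu = n - 1  # degrees of freedom

    # Marginal likelihood under the null (up to a shared constant).
    null = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Marginal likelihood under the alternative: integrate over the
    # Cauchy prior on effect size, expressed as a mixture over g.
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    alt, _ = quad(integrand, 0, np.inf)
    return alt / null


print(jzs_bf10(t=2.5, n=40))   # > 1: the data favor an effect
print(jzs_bf10(t=0.5, n=40))   # < 1: the data favor the null
```

A Bayes factor above 1 favors the alternative and one below 1 favors the null, which is what makes it possible to quantify evidence for "no effect" in a replication attempt.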

Unlike standard hypothesis-testing statistics, Bayesian
statistics don’t force you to define your sampling plan ahead of time.
According to a very insightful
paper by Wagenmakers and colleagues—in a must-read special issue of Perspectives on Psychological Science—you
can continue collecting data until the Bayes factor seems to stabilize (I must
admit the article is a bit hazy on this part, or maybe I am). In our case it meant
that we could compute a combined Bayes factor over two experiments that were
essentially identical, which gave us even more power. This move was suggested
to us by Eric-Jan Wagenmakers, an expert in Bayesian statistics (which I am
most definitely not).
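As a toy illustration of this open-ended sampling plan (my own sketch, not the procedure from the Wagenmakers paper), one can simulate batches of subjects arriving and recompute the JZS Bayes factor after each batch until it passes a strong-evidence threshold in either direction. The batch size, thresholds, and simulated effect size are all arbitrary choices of mine.

```python
# Toy simulation of optional stopping with a Bayes factor. The
# thresholds and the simulated effect (d = 0.5) are illustrative
# assumptions, not values from any actual experiment.
import numpy as np
from scipy.integrate import quad
from scipy.stats import ttest_1samp


def jzs_bf10(t, n):
    """One-sample JZS Bayes factor (Rouder et al., 2009), prior scale 1."""
    nu = n - 1
    null = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    alt, _ = quad(integrand, 0, np.inf)
    return alt / null


rng = np.random.default_rng(2013)
scores = rng.normal(0.5, 1.0, size=400)    # a true effect of d = 0.5

for n in range(20, 401, 10):               # add 10 subjects per batch
    t = ttest_1samp(scores[:n], 0.0).statistic
    bf = jzs_bf10(t, n)
    if bf > 10 or bf < 1 / 10:             # strong evidence either way
        break

print(f"stopped at n = {n} with BF10 = {bf:.1f}")
```

If two experiments are essentially identical, one way to combine them is simply to concatenate the data and compute a single Bayes factor over the pooled sample, which is presumably where the extra power of a combined analysis comes from.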

Two heads are better
than one

Armed with our large samples and Bayes factors, we were
ready to analyze the data. And here we did something that I think is highly
unusual in psychological research. We each performed our own analysis of the
data and then compared our results. We were humbled to see that on several
occasions we didn’t get the same outcome. True, we weren’t far apart and the
differences were inconsequential and easy to resolve, but it taught us a good
lesson. It is important to have multiple people analyze the data—an error is
easily made (my bet is that the literature is replete with them). The files that I
created to analyze the data (which include the raw data) can be found here.

Taking the
experimenter and the lab out of the loop

One big advantage of on-line experiments is that there is no
experimenter involved, so there cannot be any experimenter effects. Whatever
results you obtain, they cannot be caused by the professional demeanor, friendly
attitude, white lab coat, or short skirt of your research assistant.

There is another advantage. Turkers don’t go to the lab to
participate in experiments. They might be at home on the couch, in the office pretending
to do their regular job, on the train, in the airport, or in a coffee shop
(though preferably not in what the Dutch call a coffee shop). We ask subjects
about their environment and the noise level in it, and they generally tell us
that they work in quiet environments. We tend to believe them because they are
highly conscientious subjects. They often provide thoughtful feedback on our
experiments.

But how can lack of control be an advantage? It is an
advantage in terms of reproducibility. Evidently, results like ours were not
caused by the academic setting of the experiment, the color of the walls in the
experiment room, the close confines of a cubicle (though some Turkers probably
operate from cubicles), the red light on the door of the experiment room, and
so on.

This means that replication attempts of on-line studies are
relatively straightforward. For example, if you wanted to replicate us, you
could get our data-collection programs (contact me, as I still need to post them
online), create a link to them on Mechanical Turk, and with a couple hundred
dollars you’d be in business. You would have your data within a day or so.

6 comments:

Thanks for the link to the Bayesian calculator. Are there guidelines yet for what Bayesian ratios we should expect in most psychology experiments?

I'm running a study now involving motor priming and the initial results look good from the old school perspective: p = 0.05, Cohen's d = .55, N=50, between subjects design. But I plug the numbers into the Bayesian calculator and the ratio is only 0.28, far less than the ratio you cited in your Turk replication.

The 95% CIs for each condition still overlap, to the extent that I imagine I'd need at least 50 more subjects to show a clear separation, even though the p-value could be driven well below .05 much sooner than the CIs diverge.

These are the cleanest results I've ever obtained in a first attempt at a new concept. I'm not sure my study could be run through Turk at all, and I'm pretty sure I couldn't access participants with the required demographic backgrounds. I have a large subject pool, so I can add 50 more in February/March without difficulty, but as you indicate, I think most researchers face severe limitations with studies that can't be executed online. Effect sizes in social/cognitive psychology aren't usually high enough to be captured reliably with small samples (not that ES depends on N mathematically, but that there's a relationship between p values, sample sizes, and effect sizes in conducting studies).

And through all of this, I'm not sure I even understand how p values and Bayesian ratios systematically relate to each other, if they do. My impression is that Bayesian analysis tests the specific alternative that's being used in the study, as opposed to merely indicating the probability of getting these results when the conditional populations are theoretically identical. It shifts our statistical attention toward the likelihood of the alternative hypothesis rather than the probability of our data existing in a null world? This distinction seems so subtle if you're used to (mis-)using p values as statements about your hypothesis.

Let me preface this by repeating that I'm not at all an expert in Bayesian statistics. However, I believe this article, http://pps.sagepub.com/content/6/3/291.full.pdf, might be helpful to you. It addresses a lot of your questions, including the relation between p-values and Bayes factors and how to interpret the latter. I believe yours is already nothing to sneeze at.

Ah but what I wanted to ask is whether the new standards could put a lot of social/cog questions out of reach for many people in these fields using p-values now as sufficient evidence. How many people are getting effects that would stand up to Bayesian analysis? Or non-overlapping CIs? The standards for publication in regard to effects must, in a way, come down if the standards for analysis go up. A theoretically sound design should be publishable regardless of the results as long as the results are informative.

It's remarkable how often I've found people talking about non-replications in their labs (usually of other people's work) as if those efforts provided valuable information yet journals look with deep suspicion on null results. I asked one editor about this and he replied that there could be many reasons for not seeing a difference between conditions! Annnnnnnnd?

I think we must also abandon the Platonic mode of writing and publishing in which studies are reported in ways that fit schematic views of scientific research. At SPSP a couple years ago the editor of one of the top social journals was asked whether he liked to see a chronology of the work with various hypotheses considered or a clean story, and he dismissed chronologies as mere historical records. I kind of snorted and looked around expecting similar reactions, but if anyone else had a problem with his response, I didn't see it. If we constantly reframe the research process in publications with a fictional narrative that obscures the exploratory nature of much psychological research, the clean story that makes for easy bedtime reading only fuels the conceptual and statistical misunderstandings that have made so many "findings" in print highly doubtful now in retrospect. Not every notion and every lab debate needs to be reported, but somehow our publications should reflect the actual process more than they obscure it. This would reflect the humility of science so much discussed in introductory textbooks.

I don't think it's an either-or-situation. If the standards for analysis go up, the standards for publication in regard to effects don't necessarily have to go down. Maybe the publication pressure should go down instead so that researchers can take more time to run studies. However, I'm the first one to admit that I'm a very impatient guy, so I'm glad that I can run online experiments.

Everybody always says it's hard to get non-replications published, but I wonder how much of this is self-handicapping. In my three years as Editor-in-Chief, I've handled well over 1,000 manuscripts. Practically none of them featured non-replications.

There should be a place for both chronologically accurate and plot-based accounts. Aristotle (if I may throw a Greek philosopher back at you) argued long ago that historians should use the former and that dramatists and epic poets should use the latter. Maybe the chronological accounts should go into archival journals and the plot-based ones into blogs. The former would be part of the scientific record; the latter would be a way to inform a broader audience.

In my next post, I'll describe a chronologically organized 14-experiment behemoth we are writing at the moment.