
Saturday, November 28, 2015

Psychology is the study of relationships between intangible constructs as seen through the lens of our measures and manipulations. We use manipulation A to push on construct X, then look at the resulting changes in construct Y, as estimated by measurement B.

Sometimes it's not clear how one should manipulate construct X. How would we make participants feel self-affirmed? Or if we wanted participants to slow down and really think about a problem? Or conversely, how would we get them to think less and go with their gut feeling? While we have a whole subfield dedicated to measurement (psychometrics), methods and manipulations have historically received less attention and less journal space.

So what can we do when we don't know how to manipulate something? One lowest-common-denominator manipulation of these complicated constructs is to ask participants to think about (or, if we're feeling ambitious, to write about) a time when they exhibited Construct X. That, it's assumed, will lead them to feel more Construct X and lead them to exhibit behaviors consistent with greater levels of Construct X.

I wonder, though, at the statistical power of such experiments. Will remembering a time your hunch was correct lead you to substantially greater levels of intuition use for the next 15 minutes? Will writing about a time you felt good about yourself lead you to achieve a peaceful state of self-affirmation where you can accept evidence that conflicts with your views?

Effect-Size Trickle-Down
If we think about an experiment as a path diagram, it becomes clear that a strong manipulation is necessary. When we examine the relationship between constructs X and Y, what we're really looking at is the relationship between manipulation A and measurement B.

Rectangles represent the things we can measure, ovals represent latent constructs, and arrows represent paths of varying strengths. Path b1 is the strength of Manipulation A. Path b2 is the relationship of interest, the association between Constructs X and Y. Path b3 is the reliability of Measurement B. Path b4 is the reliability of the measurement of the manipulation check.

Although path b2 is what we want to test, we don't get to see it directly. X and Y are latent and not observable. Instead, the path that we see is the relationship between Manipulation A and Measurement B. This relationship has to go through all three paths, and so it has strength = b1 × b2 × b3. Since each path is a correlation between -1 and +1, the magnitude of b1 × b2 × b3 must be equal to or less than that of each individual path.

This means that your effect on the dependent variable is almost certain to be smaller than the effect on the manipulation check. Things start with the manipulation and trickle down from there. If the manipulation can only barely nudge the manipulated construct, then you're very unlikely to detect effects of the manipulation on the downstream outcome.
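To make the trickle-down concrete, here's a toy calculation. The path values are invented for illustration, not taken from any real study:

```python
# Hypothetical path strengths (correlations); all numbers are made up.
b1 = 0.5   # manipulation A -> construct X (a middling manipulation)
b2 = 0.4   # construct X -> construct Y (the relationship of interest)
b3 = 0.8   # construct Y -> measurement B (measurement reliability)

# The observable A -> B relationship is the product of the paths it travels,
# so it can never exceed the weakest link.
observed = b1 * b2 * b3
print(round(observed, 2))  # 0.16 -- smaller than any single path
```

Even with a respectable latent relationship (b2 = .4), the observable effect is much smaller once the manipulation and measurement paths take their cut.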

Minimal Manipulations in the Journals
I wonder if these writing manipulations are effective. One time I reviewed a paper using such a manipulation. Experimental assignment had only a marginally significant effect on the manipulation check. Nevertheless, the authors managed to find significant differences in the outcome across experimental conditions. Is that plausible?

I've since found another published (!) paper with such a manipulation. In Experiment 1, the manipulation check was not significant, but the anticipated effect was. In Experiment 2, the authors didn't bother to check the manipulation any further.

This might be another reason to be skeptical about social priming: manipulations such as briefly holding a warm cup of coffee are by nature minimal manipulations. Even if one expected a strong relationship between feelings of bodily warmth and feelings of interpersonal warmth, the brief exposure to warm coffee might not be enough to create strong feelings of bodily warmth.

(As an aside, it occurs to me that these minimal manipulations might be why, in part, college undergraduates think the mind is such a brittle thing. Their social psychology courses have taught them that the brief recounting of an unpleasant experience has pronounced effects on subsequent behavior.)

Sunday, November 22, 2015

One is often asked, it seems, to extend someone a p-value on credit. "The p-value would be lower if we'd had more subjects." "The p-value would have been lower if we'd had a stronger manipulation." "The p-value would have been lower with a cleaner measurement, a continuous instead of a dichotomous outcome, the absence of a ceiling effect..."

These claims could be true, or they could be false, conditional on one thing: Whether the null hypothesis is true or false. This is, of course, a tricky thing to condition on. The experiment itself should be telling us the evidence for or against the null hypothesis.

So now we see that these statements are very clearly begging the question. Perhaps the most accurate formulation would be, "I would have stronger evidence that the null were false if the null were false and I had stronger evidence." It is perfectly circular.

When I see a claim like this, I imagine a cockney ragamuffin pleading, "I'll have the p-value next week, bruv, sware on me mum." But one can't issue an IOU for evidence.

Sunday, October 4, 2015

Last week, I got to meet Andrew Gelman as he outlined what he saw as several of the threats to validity in social science research. Among these was the fallacious idea of "significance under duress." The claim in "significance under duress" is that, when statistical significance is reached under less-than-ideal conditions, it implies that the underlying effect must be very powerful. While this sounds like it makes sense, this claim does not follow.

Let's dissect the idea by considering the following scenario:

120 undergraduates participate in an experiment to examine the effect of mood on preferences for foods branded as "natural" relative to conventionally-branded foods. To manipulate mood, half of the participants write for 90 seconds about a time they felt bad, while the other half write for 90 seconds about a control topic. The outcome is a single dichotomous choice between two products. Even though a manipulation check reveals the writing manipulation had only a small effect on mood, and even though a single-item outcome provides less power than would rating several forced choices, statistical significance is nevertheless found when comparing the negative-writing group to the neutral-writing group, p = .030. The authors argue that the relationship between mood and preferences for "natural" must be very strong indeed to have yielded significance despite the weak manipulation and imprecise outcome measure.

Even though the sample size is better than most, I would still be concerned that a study like this is underpowered. But why?

Remember that statistical power depends on the expected effect size. Effect size involves both signal and noise. Cohen's d is the difference in means divided by the standard deviation of scores. Pearson correlation is the covariance of x and y divided by the standard deviations of x and y. Noisier measures will mean larger standard deviations and hence, a smaller effect size.

The effect size is not a platonic distillation of the relationship between the two constructs you have in mind (say, mood and preference for the natural). Instead, it is a ratio of signal to noise between your measures -- here, condition assignment and product choice.

Let's imagine this through the lens of a structural equation model. Italicized a and b represent the latent constructs of interest: mood and preference for the natural, respectively. Let's assume their relationship is rho = .4, a hearty effect. x and y are the condition assignment and the outcome, respectively. The path from x to a represents the effect of the manipulation. The path from b to y represents the measurement reliability of the outcome. To tell what the relationship will be between x and y, we multiply each path coefficient as we travel from x to a to b to y.

When the manipulation is strong and the measurement reliable, the relationship between x and y is strong, and power is good. When the manipulation is weak and the measurement unreliable, the relationship is small, and power falls dramatically.
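Here's a rough sketch of how much power rides on those paths, using a normal approximation to the correlation test. The path strengths (0.9 vs. 0.3 for the manipulation, 0.9 vs. 0.6 for reliability) are invented for illustration:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def power_r(r, n, alpha=0.05):
    """Approximate two-tailed power to detect correlation r with n
    observations, via the Fisher z transformation."""
    fz = math.atanh(r)            # Fisher z of the true correlation
    se = 1 / math.sqrt(n - 3)     # standard error of Fisher z
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - fz / se)) + Z.cdf(-z_crit - fz / se)

rho = 0.4                    # assumed latent relationship between a and b
strong = 0.9 * rho * 0.9     # strong manipulation, reliable measure
weak = 0.3 * rho * 0.6       # weak manipulation, noisy measure

print(round(power_r(strong, 120), 2))  # roughly .95: good power
print(round(power_r(weak, 120), 2))    # roughly .12: hopeless
```

Same latent effect, same n = 120, but the weak-manipulation, noisy-measurement study has almost no chance of detecting it.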

Because weak manipulations and noisy measurements decrease the anticipated effect size, thereby decreasing power, studies can still have decent sample sizes and poor statistical power. Such examples of "significance under duress" should be regarded with the same skepticism as other underpowered studies.

There's a lot of bullshit going around. The life cycle of the bullshit is extended by publication bias (running many trials and just reporting the ones that work) and p-hacking (torturing the data until it gives you significance).

Meta-analysis is often suggested as one solution to these problems. If you average together everybody's answers, maybe you get closer to the true answer. Maybe you can winnow out truth from bullshit when looking at all the data instead of the tally of X significant results and Y nonsignificant results.

That's a nice thought, but publication bias and p-hacking make it possible that the meta-analysis just reports the degree of bias in the literature rather than the true effect. So how do we account for bias in our estimates?

Bayesian Spike-and-Slab Shrinkage Estimates

One very simple approach would be to consider some sort of "bullshit factor". Suppose you believe, as John Ioannidis does, that half of published research findings are false. If that's all you know, then for any published result you believe that there's a 50% chance that there's an effect such as the authors report it (p(H1) = .5) and a 50% chance that the finding is false (p(H0) = .5). Just to be clear, I'm using H0 to refer to the null hypothesis, H1 to refer to the alternative hypothesis.

How might we summarize our beliefs if we wanted to estimate the effect with a single number? Let's say the authors report d = 0.60. We halfway believe in them, but we still halfway believe in the null. So on average, our belief in the true effect size delta is

delta = (d | H0) * p(H0) + (d | H1) * p(H1)

or

delta = (0) * (0.5) + (0.6) * (0.5) = 0.3

So we've applied some shrinkage or regularization to our estimate. Because we believe that half of everything is crap, we're able to improve our estimates by adjusting our estimates accordingly.
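The arithmetic above, as a two-line sketch (the numbers are illustrative):

```python
# Crude "bullshit factor" shrinkage; all numbers are made up.
p_h0, p_h1 = 0.5, 0.5   # Ioannidis-style prior: half of findings are false
d_reported = 0.60        # effect size claimed by the authors

# Model-averaged estimate: 0 if the null is true, the reported d otherwise.
delta_hat = 0.0 * p_h0 + d_reported * p_h1
print(delta_hat)  # 0.3
```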

This is roughly a Bayesian spike-and-slab regularization model: the spike refers to our belief that delta is exactly zero, while the slab is the diffuse alternative hypothesis describing likely non-zero effects. As we believe more in the null, the spike rises and the slab shrinks; as we believe more in the alternative, the spike lowers and the slab rises. By averaging across the spike and the slab, we get a single value that describes our belief.

Bayesian Spike-and-Slab system. As evidence accumulates for a positive effect, the "spike" of belief in the null diminishes and the "slab" of belief in the alternative soaks up more probability. Moreover, the "slab" begins to take shape around the true effect.

So that's one really crude way of adjusting for meta-analytic bias as a Bayesian: just assume half of everything is crap and shrink your effect sizes accordingly. Every time a psychologist comes to you claiming that he can make you 40% more productive, estimate instead that it's probably more like 20%.

But what if you wanted to be more specific? Wouldn't it be better to shrink preposterous claims more than sensible claims? And wouldn't it be better to shrink fishy findings with small sample sizes and a lot of p = .041s more than a strong finding with a good sample size and p < .001?

Bayesian Meta-Analytic Thinking by Guan & Vandekerckhove

This is exactly the approach given in a recent paper by Guan and Vandekerckhove. For each meta-analysis or paper, you do the following steps:

Ask yourself how plausible the null hypothesis is relative to a reasonable alternative hypothesis. For something like "violent media make people more aggressive," you might be on the fence and assign 1:1 odds. For something goofy like "wobbly chairs make people think their relationships are unstable" you might assign 20:1 odds in favor of the null.

Ask yourself how plausible the various forms of publication bias are. The models they present are:

M1: There is no publication bias. Every study is published.

M2: There is absolute publication bias. Null results are never published.

M3: There is flat probabilistic publication bias. All significant results are published, but only some percentage of null results are ever published.

M4: There is tapered probabilistic publication bias: everything p < .05 gets published, but the chances of publication get worse the farther you get from .05 (e.g., p = .07 gets published more often than p = .81).

Look at the results and see which models of publication bias look likely. If there's even a single null result, you can scratch off M2, which says null results are never published. Roughly speaking, if the p-curve looks good, M1 starts looking pretty likely. If the p-curve is flat or bent the wrong way, M3 and M4 start looking pretty likely.

Update your beliefs according to the evidence. If the evidence looks sound, belief in the unbiased model (M1) will rise and belief in the biased models (M2, M3, M4) will drop. If the evidence looks biased, belief in the publication bias models will rise and belief in the unbiased model will drop. If the evidence supports the hypothesis, belief in the alternative (H1) will rise and belief in the null (H0) will drop. Note that, under each publication bias model, you can still have evidence for or against the effect.

Average the effect size across all the scenarios, weighting by the probability of each scenario.

(d | Mx, H0) is "effect size d given that publication bias model X is true and there is no effect." We can go through and set all these to zero, because when the null is true, delta is zero.

(d | Mx, H1) is "effect size d given that publication bias model X is true and there is a true effect." Each bias model makes a different guess at the underlying true effect. (d | M1, H1) is just the naive estimate. It assumes there's no pub bias, so it doesn't adjust at all. However, M2, M3, and M4 say there is pub bias, so they estimate delta as being smaller. Thus, (d | M2, H1), (d | M3, H1), and (d | M4, H1) are shrunk-down effect size estimates.

p(M1, H1) through p(M4, H0) reflect our beliefs in each (pub-bias x H0/H1) combo. If the evidence is strong and unbiased, p(M1, H1) will be high. If the evidence is fishy, p(M1, H1) will be low and we'll assign more belief to skeptical models like p(M3, H1), which says the effect size is overestimated, or even p(M3, H0), which says that the null is true.

Then to get our estimate, we make our weighted average. If the evidence looks good, p(M1, H1) will be large, and we'll shrink d very little according to publication bias and remaining belief in the null hypothesis. If the evidence is suspect, values like p(M3, H0) will be large, so we'll end up giving more weight to the possibility that d is overestimated or even zero.
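To make the weighted average concrete, here's a toy version. The posterior probabilities and conditional estimates below are made up for illustration; in the actual Guan & Vandekerckhove approach they are computed from the data:

```python
# Hypothetical model-averaged effect estimate. Every number is invented.
models = {
    # (bias model, hypothesis): (posterior probability, estimate of delta)
    ("M1", "H1"): (0.30, 0.60),  # no bias, effect real: the naive estimate
    ("M2", "H1"): (0.05, 0.35),  # the bias models shrink the estimate
    ("M3", "H1"): (0.15, 0.40),
    ("M4", "H1"): (0.10, 0.45),
    ("M1", "H0"): (0.10, 0.00),  # under H0, delta is zero in every model
    ("M2", "H0"): (0.05, 0.00),
    ("M3", "H0"): (0.15, 0.00),
    ("M4", "H0"): (0.10, 0.00),
}
# Posterior probabilities across all eight scenarios must sum to 1.
assert abs(sum(p for p, _ in models.values()) - 1.0) < 1e-9

delta_hat = sum(p * d for p, d in models.values())
print(round(delta_hat, 3))  # a shrunken estimate, well below the naive 0.60
```

Notice that the H0 rows contribute nothing but probability mass, which is exactly how belief in the null drags the estimate toward zero.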

Summary

So at the end of the day, we have a process that:

Takes into account how believable the hypothesis is before seeing data, gaining strength from our priors. Extraordinary claims require extraordinary evidence, while less wild claims require less evidence.

Takes into account how likely publication bias is in psychology, gaining further strength from our priors. Data from a pre-registered prospective meta-analysis is more trustworthy than a look backwards at the prestige journals. We could take that into account by putting low probability in pub bias models in the pre-registered case, but higher probability in the latter case.

Uses the available data to update beliefs about the hypothesis and publication bias both, improving our beliefs through data. If the data look unbiased, we trust them more. If the data look like they've been through hell, we trust them less.

Provides a weighted average estimate of the effect size given our updated beliefs. It thereby shrinks estimates a lot when the data are flimsy and there's strong evidence of bias, but shrinks estimates less when the data are strong and there's little evidence of bias.

It's a very nuanced and rational system. Bayesian systems usually are.

That's enough for one post. I'll write a follow-up post explaining some of the implications of this method, as well as the challenges of implementing it.

Monday, June 29, 2015

The Problem with PET-PEESE?

Will Gervais has a very interesting criticism of PET-PEESE, a meta-analytic technique for correcting for publication bias, up at his blog. In it, he tests PET-PEESE's bias by simulating many meta-analyses, each of many studies, using historically-accurate effect sizes and sample sizes from social psychology. He finds that, under these conditions and assuming some true effect, PET-PEESE performs very poorly at detecting the true effect, underestimating it by a median 0.2 units of Cohen's d.

When I saw this, I was flattened. I knew PET-PEESE had its problems, but I also thought it represented a great deal of promise compared to other rotten old ways of inspecting for publication bias, such as trim-and-fill or (shudder) Fail-Safe N. In the spirit of full disclosure, I'll tell you that I'm 65 commits deep into a PET-PEESE manuscript with some provocative conclusions, so I may be a little bit motivated to defend PET-PEESE. But I saw some simulation parameters that could be tweaked to possibly give PET-PEESE a better shot at the true effect.

My Tweaks to Will's Simulation

One problem is that, in this simulation, the sample sizes are quite small. The sample sizes per cell are distributed according to a truncated normal, ~N(30, 50), bounded at 20 and 200. So the smallest experiment has just 40 subjects across two cells, the modal experiment has just 60 subjects across two cells, and no study will ever exceed 400 subjects across the two cells.

These small sample sizes, combined with the small true effect (delta = .275), mean that the studies meta-analyzed have miserable power. The median power is only 36%. The maximum power is 78%, but you'll see that in fewer than one in ten thousand studies.
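As a rough check on those numbers, here's a quick power calculation using a normal approximation to the two-sample test (it will run slightly optimistic relative to an exact t-test power analysis, but it roughly reproduces the figures above):

```python
import math
from statistics import NormalDist

Z = NormalDist()

def power_d(d, n_per_cell, alpha=0.05):
    """Approximate two-tailed power for a two-sample comparison of means,
    via the normal approximation."""
    se = math.sqrt(2 / n_per_cell)     # standard error of d, equal cells
    ncp = d / se                       # noncentrality in z units
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)

print(round(power_d(0.275, 30), 2))   # modal cell size: dismal power
print(round(power_d(0.275, 200), 2))  # largest allowed cell size: ~.78
```

At the largest permitted cell size, power tops out right around the 78% figure quoted above; at the modal n, it's below 20%.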

The problem, then, is one of signal and noise. The signal is weak: delta = .275 is a small effect by most standards. The noise is enormous: at n = 60-70, the sampling error is devastating. But what's worse, there's another signal superimposed on top of all this: publication bias! The effect is something like trying to hear your one friend whisper a secret in your ear, but the two of you are in a crowded bar, and your other friend is shouting in your other ear about the Entourage movie.

So as I saw it, the issue wasn't that PET-PEESE was cruelly biased in favor of the null or that it had terrible power to detect true effects. The issue was small effects, impotent sample sizes, and withering publication bias. In these cases, it's very hard to tell true effects from null effects. Does this situation sound familiar to you? It should -- Will's simulation uses distributions of sample sizes and effect sizes that are very representative of the norms in social psychology!

But social psychology is changing. The new generation of researchers is becoming acutely aware of the importance of sample size and of publishing null results. New journals like Frontiers or PLOS (and even Psych Science) are making it easier to publish null results. In this exciting new world of social psychology, might we have an easier time of arriving at the truth?

Simulations

To test my intuition, I made one tweak to Will's simulation: Suppose that, in each meta-analysis, there is one ambitious grad student who decides she's had enough talk. She wants some damn data, and when she gets it, she will publish it come hell or high water, regardless of the result.

In each simulated meta-analysis, I guarantee a single study with n = 209/cell (80% power, two-tailed, to detect the true homogeneous effect delta = 0.275). Moreover, this single well-powered study is made immune to publication bias. Could a single, well-powered study help PET-PEESE?

Well, it doesn't. One 80%-powered study isn't enough. You might be better off using the "Top Ten" estimator, which looks only at the 10 largest studies, or even just interpreting the single largest study.

What if the grad student runs her dissertation at 90% power, collecting n = 280 per cell?

Maybe we're getting somewhere now. The PEESE spike is coming up a little bit and the PET spike is going down. But maybe we're asking too much of our poor grad student. Nobody should have to determine the difference between delta = 0 and delta = 0.275 all by themselves. (Note that, even still, you're probably better off throwing all the other studies, meta-analyses, and meta-regressions in the garbage and just using this single pre-registered experiment as your estimate!)

Here's the next scenario: Suppose somebody looked at the funnel plot from the original n = ~20 studies and found it to be badly asymmetrical. Moreover, they saw the PET-PEESE estimate couldn't detect the effect as significantly different from zero. Rather than pronounce the PET-PEESE estimate as the true effect size, they instead suggested that the literature was badly biased and that a replication effort was needed. So three laboratories each agreed to rerun the experiment at 80% power and publish the results in a Registered Report. Afterwards, they reran the meta-analysis and PET-PEESE.

Even with these three unbiased, decently-powered studies, PET-PEESE is still flubbing it badly, going to PET more often than it should. Again, you might be better off just looking at the three trustworthy studies in the Registered Report than trying to fix the publication bias with meta-regression.

I'm feeling pretty exhausted by now, so let's just drop the hammer on this. The Center for Open Science decides to step in and run a Registered Report with 10 studies, each powered at 80%. Does this give PET-PEESE what it needs to perform well?

No dice. Again, you'd be better off just looking at the 10 preregistered studies and giving up on the rest of the literature. Even with these 10 healthy studies in the dataset, we're missing delta = .275 by quite a bit in one direction or the other: PET-PEESE is estimating delta = 0.10, while naive meta-analysis is estimating delta = .42.

Summary

I am reminded of a blog post by Michele Nuijten, in which she explains how more information can actually make your estimates worse. If your original estimates are contaminated by publication bias, and your replication estimates are also contaminated by publication bias, adding the replication data to your original data only makes things worse. In the cases above, we gain very little from meta-analysis and meta-regression. It would be better to look only at the large-sample Registered Reports and dump all the biased, underpowered studies in the garbage.

The simple lesson is this: There is no statistical replacement for good research practice. Publication bias is nothing short of toxic, particularly when sample sizes and effect sizes are small.

So what can we do? Maybe this is my bias as a young scientist with few publications to my name, but if we really want to know what is true and what is false, we might be better off disregarding the past literature of biased, small-sample studies entirely and only interpreting data we can trust.

The lesson I take is this: For both researchers and the journals that publish them, Registered Report or STFU.

(Now, how am I gonna salvage this meta-analysis???)

Code is available at my GitHub. The bulk of the original code was written by Will Gervais, with edits and tweaks by Evan Carter and Felix Schonbrodt. You can recreate my analyses by loading packages and the meta() function on lines 1-132, then skipping down to the section "Hilgard is going ham" on line 303.

Monday, May 4, 2015

There has recently been some discussion as to whether Bayes factor is biased in favor of the null. I am particularly sensitive to these concerns as somebody who sometimes uses Bayes factor to argue in favor of the null. I do not want Reviewer 2 to think that I am overstating my evidence.

I would like to address two specific criticisms of Bayes factor, each arguing that the choice of an alternative hypothesis makes it too easy for researchers to argue for the null.

Simonsohn

In a recent blog post, Dr. Simonsohn writes: “Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it.”

Of course, if one does not like one alternative hypothesis, one can choose another. Bayes factor is just the tool, and it's up to the analyst to make the tool answer a valuable question.

I asked Dr. Simonsohn for clarification on what he thought might make a good alternative hypothesis. He suggested a point-alternative hypothesis describing the minimum effect size of interest. That way, the Bayes factor yielded would not be too hasty to lean in favor of the null.

That smallest effect size will vary across contexts. For example, for gender discrimination I may have one standard of too small to care, for PSI a much lower standard, and for time travel a tiny standard (a few seconds of time travel would be a wonderful discovery).

Personally, I do not think this is a good alternative hypothesis. It makes the null and alternative hypothesis too similar so that their predictions are nigh-indiscriminable. It makes it nearly impossible to find evidence one way or the other.

Left panel: Depiction of null hypothesis and "minimum effect of interest" alternative. Null hypothesis: δ = 0. Alternative hypothesis: δ = 0.01. Right panel: Probability of data given each hypothesis and 200 observations, between-subjects design. The hypotheses are so similar as to be indistinguishable from each other.
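A quick sketch of why the point alternative δ = 0.01 is hopeless: compare the likelihood of the observed data under the two point hypotheses, using a normal approximation to the sampling distribution of d. The numbers are illustrative:

```python
import math
from statistics import NormalDist

# How well can 200 observations (two cells of 100, between-subjects)
# discriminate delta = 0 from a point alternative delta = 0.01?
n_per_cell = 100
se = math.sqrt(2 / n_per_cell)   # standard error of the observed d

def likelihood(d_obs, delta):
    """Likelihood of observed d under a point hypothesis about delta."""
    return NormalDist(mu=delta, sigma=se).pdf(d_obs)

# Even if the data land exactly on the alternative...
d_obs = 0.01
bf_10 = likelihood(d_obs, 0.01) / likelihood(d_obs, 0.0)
print(round(bf_10, 4))  # barely above 1: essentially no evidence either way
```

The two hypotheses make nearly identical predictions, so no realistic sample can pull the Bayes factor meaningfully away from 1.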

Imagine if we did a priori power analysis with this alternative hypothesis for conventional null hypothesis significance testing. Power analysis would tell us we would need hundreds of thousands of observations to have adequate power. With less than that, any significant result could be a fluke Type I error, and any nonsignificant result could be a Type II error. It's the Sisyphean Decimal Sequence from last post. At some point, you have to live with error. The conventional testing framework assumes an effect size and establishes Type I and Type II error rates from there. But what justifies your a priori power assumption? Dr. Simonsohn's newest paper suggests a negative replication should indicate that the previous study had less than 33% power to detect its effect. But why would we necessarily care about the effect as it was observed in a previous study?

Every choice of alternative hypothesis is, at some level, arbitrary. No effect can be measured to arbitrary precision. Of all the inferential techniques I know, however, Bayes factor states this alternative hypothesis most transparently and reports the evidence in the most finely-grained units. In practice, we don't power studies to the minimum interesting effect. We power studies to what we expect the effect size to be given the theory. The alternative hypothesis in Bayesian model comparison should be the same way, representing our best guess about the effect. Morey et al. (submitted) call this a "consensus prior," the prior a "reasonable, but somewhat-removed researcher would have [when trying to quantify evidence for or against the theory]."

Schimmack

Dr. Schimmack also thinks that Bayes factor is prejudiced against small effects and that it makes it too easy to land a prestigious JEP:G publication ruling in favor of the null. In his complaint, he examines an antagonistic collaboration among Matzke, Nieuwenhuis, and colleagues. Nieuwenhuis et al. argue that horizontal eye movements improve memory, while Matzke et al. argue that they have no such effect. Data are collected, and we ask questions of them: Whose hypothesis is supported, Nieuwenhuis’ or Matzke’s?

In the data, the effect of horizontal eye movements was actually negative. This is unusual given Matzke’s hypothesis, but very unusual given Nieuwenhuis’ hypothesis. Because the results are 10 times more likely given Matzke’s hypothesis than Nieuwenhuis’, we rule in favor of Matzke’s null hypothesis.

Dr. Schimmack is dissatisfied with the obtained result and wants more power:

“[T]his design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.

Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.”

Dr. Schimmack is concerned that the sample size is too small to distinguish the null from the alternative. The rules of the collaboration, however, were to collect data until the Bayes factor was 10 for one or the other hypothesis. The amount of data collected was indeed enough to distinguish between the two hypotheses, as the support is quite strong for the no-effect hypothesis relative to the improvement hypothesis. Everybody goes to the pub to celebrate, having increased their belief in the null relative to this alternative by a factor of 10.

But suppose we tried to interpret the results in terms of power and significance. What would we infer if the result was not significant? Dr. Schimmack’s unusual comment above that “for all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis” leads me to worry that he intends to interpret p > .05 as demonstrating the truth of the null – a definite faux pas in null-hypothesis significance testing.

But what can we infer from p > .05? That the results have no evidentiary value, being unable to reject the null hypothesis? That the obtained result is (1 – Power)% unlikely if the alternative hypothesis δ = 0.5 were true? But why would we care about the power based on the alternative hypothesis δ = 0.5, and not δ = 0.1, or δ = 1.0, or any other point-alternative hypothesis?

Dr. Nieuwenhuis understands his theory, formulated a fair hypothesis, and agreed that a test of that hypothesis would constitute a fair test of the theory. I can see no better or more judicious choice of alternative hypothesis. In a well-designed experiment with a fair hypothesis, the Bayesian test is fair.

Dr. Schimmack further argues that “[The] empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81). A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.” This is an interesting phenomenon, but beside the point of the experiment. Remember the question being asked: Is there a positive effect, or no effect? The obtained data support the hypothesis of no effect over the hypothesis of a positive effect.

If one wishes to pursue the new hypothesis of a negative effect in a future experiment, one can certainly do so. If one thinks that the negative effect indicates some failure of the experiment, then that is a methodological, not statistical, concern. Keep in mind that both researchers agreed to the validity of the method before the data were collected, so again, we expect that this is a fair test.

Summary

Bayes factor provides an effective summary of evidence. A Cauchy or half-Cauchy distribution on the effect size often makes for a fair and reasonable description of the alternative hypothesis. Scientists who routinely read papers with attention to effect size and sample size will quickly find themselves capable of describing a reasonable "consensus prior." Having to describe this alternative hypothesis sometimes makes researchers uneasy, but it is also necessary for the interpretation of results in conventional testing. If a test of a subtle effect is statistically significant in a sample of 20, we suspect a Type I error rather than a true effect. If that subtle effect is not statistically significant in a sample of 20, we suspect a Type II error rather than a true null. Specification of the alternative hypothesis makes these judgments transparent and explicit and yields the desired summary of evidence.
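For the curious, here's a toy version of such a test: a point null against a half-Cauchy alternative on delta, with a normal approximation to the sampling distribution of observed d and brute-force numeric integration. This is a sketch for intuition, not the implementation used in the collaboration above; a real analysis should use a proper tool such as the BayesFactor R package:

```python
import math

def bf01(d_obs, n_per_cell, r=0.707, grid=20000, upper=10.0):
    """Toy Bayes factor: point null (delta = 0) versus a half-Cauchy(0, r)
    alternative on delta > 0, using a normal approximation to the
    sampling distribution of observed d."""
    se = math.sqrt(2 / n_per_cell)

    def norm_pdf(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    def halfcauchy_pdf(x):
        return 2 / (math.pi * r * (1 + (x / r) ** 2))

    # Marginal likelihood under H1: average the likelihood over the prior.
    step = upper / grid
    m1 = sum(norm_pdf(d_obs, i * step, se) * halfcauchy_pdf(i * step) * step
             for i in range(grid))
    m0 = norm_pdf(d_obs, 0.0, se)   # likelihood under the point null
    return m0 / m1

# An observed effect in the wrong direction supports the null over the
# directional alternative, as in the eye-movement example:
print(round(bf01(-0.3, 30), 1))
```

Note how an observed d in the predicted direction flips the verdict: `bf01(0.8, 30)` comes out well below 1, favoring the alternative.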

Monday, April 20, 2015

I love Bayesian model comparison. It’s my opinion that null
hypothesis testing is not great because 1) it gives dichotomous accept/reject
outcomes when we all know that evidence is a continuous quantity and 2) it can
never provide evidence for the null, only fail to reject it. This latter point
is important because it’s my opinion that the null is often true, so we should
be able to provide evidence and assign belief to it. By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.

Despite my enthusiasm for Bayesian model comparison,
one criticism I see now and again is that the obtained Bayes factor varies as a
function of the alternative hypothesis being tested. See e.g. this Twitter thread or Simonsohn (2015):

When a default Bayesian test favors the null hypothesis,
the correct interpretation of the result is that the
data favor the null hypothesis more than that one specific
alternative hypothesis. The Bayesian test could conclude
against the same null hypothesis, using the same
data, if a different alternative hypothesis were used, say,
that the effect is distributed normal but with variance of
0.5 instead of 1, or that the distribution is skewed or has some
other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a
Bayesian, however, this is exactly the intended behavior. The Bayes factor is supposed to vary according to the
hypotheses tested. The answer should
depend on the question.

Asking the Right
Question

The
problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An
advanced civilization builds a massive supercomputer at great expense to run
for millions of years to provide an answer to life, the universe, and everything.

Eons
later, as the calculations finally complete, the computer pronounces its
answer: “Forty-two.”

Everyone
winces. They demand to know what the computer means by forty-two. The computer
explains that forty-two is the correct answer, but that the question is still
unknown. The programmers are mortified. In their haste to get an impressive
answer, they did not stop to consider that every answer is valuable only in the
context of its question.

Bayesian
model comparison is a way to ask questions. When you ask different questions of
your data, you get different answers. Any particular answer is only valuable
insofar as the corresponding question is worth asking.

An Example from PSI
Research

Let’s
suppose you’re running a study on ESP. You collect a pretty decently-sized
sample, and at the end of the day, you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?

The NHST inference is that you didn't learn
anything: you failed to reject the null, so the null stands for today, but maybe
in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can
never actually find evidence for the null so long as you use NHST. In the most
generous case, you might argue that you've rejected some other null hypothesis
such as δ ≥ .35.

The ESCI inference is that the true effect of ESP
is somewhere in the interval.** Zero is in the interval, and we don’t believe
that ESP exists, so we’re vaguely satisfied. But how narrow an interval around
zero do we need before we’re convinced that there’s no ESP? How much evidence
do we have for zero relative to some predicted effect?

Bayesian Inferences

Now you
consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ
= 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative
hypothesis makes no predictions. The effect could be anywhere from negative
infinity to positive infinity, or so close to zero as to be nearly equal to it. She
urges you to be more specific.

Figure 1. Ancient Roman depiction of a Bayesian.

To get
an answer, you will have to provide a more specific question. Bayesian model
comparison operates by comparing the predictions of two or more models and seeing which
is best supported by the data. Because it is a daunting task to try to
precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can assign
probability across a range of values.

Trying
again, you ask her whether there is a large effect of ESP. Maybe the effect of
ESP could be a standard deviation in either direction, and any nonzero effect
between d = -1 and d = 1 would be considered evidence of
the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you
that you have excellent evidence for the null relative to this hypothesis.

Figure 2. Competing statements of belief about the effect size delta.
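To see the mechanics of that comparison, here is a rough pure-Python sketch. Approximating the sampling distribution of d as normal, and back-computing its standard error from the reported 95% interval, are my simplifying assumptions; the resulting number is illustrative only, and the strength of evidence in a real study depends on its sample size and likelihood.

```python
import math

def norm_pdf(x, mu, sigma):
    """Normal density, used as an approximate likelihood for the observed d."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

d_obs = 0.15                         # observed effect size
se = (0.35 - (-0.05)) / (2 * 1.96)   # SE back-computed from the 95% CI, ~0.102

# Likelihood of the data under the point null H0: delta = 0
p_null = norm_pdf(d_obs, 0.0, se)

# Marginal likelihood under H1: delta ~ Uniform(-1, 1). Averaging the normal
# likelihood over a uniform prior has a closed form via the normal CDF.
p_alt = 0.5 * (norm_cdf(1, d_obs, se) - norm_cdf(-1, d_obs, se))

bf01 = p_null / p_alt                # Bayes factor favoring the null
print(round(bf01, 2))
```

The null is favored because H1 spreads its prior mass across many effect sizes the data have ruled out, diluting its marginal likelihood relative to the sharp prediction of the point null.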

Encouraged,
you ask her whether there is a medium effect of ESP. Maybe ESP would change
behavior by about half a standard deviation in either direction; small effects
are more likely than large effects, but large effects are possible too. That is,
H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty
good evidence for the null against this hypothesis, but not overwhelming
evidence.

Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.

Finally,
you ask her whether you have evidence against even the tiniest effect of ESP.
Between the null hypothesis H0: δ = 0 and the alternative H3:
δ ~ Cauchy(1×10^-3), which does she prefer? She shrugs. These two hypotheses
make nearly-identical predictions about what you might see in your experiment (see Figure 4).
Your data cannot distinguish between the two. You would need to spend several
lifetimes collecting data before you were able to measurably shift belief from this
alternative to the null.

Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.
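This shrinking-scale problem can be seen numerically. The sketch below reuses the same assumptions as before (a normal approximate likelihood with standard error back-computed from the interval, both my simplifications) and integrates the marginal likelihood under Cauchy alternatives of decreasing scale:

```python
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

d_obs = 0.15                         # observed effect size
se = (0.35 - (-0.05)) / (2 * 1.96)   # SE back-computed from the 95% CI

def bf01_cauchy(scale, n_points=100_000):
    """Bayes factor for H0: delta = 0 against H1: delta ~ Cauchy(0, scale).

    Substituting delta = scale * tan(theta) turns the Cauchy prior into a
    uniform density on theta, so a midpoint rule over theta integrates the
    marginal likelihood stably no matter how small the scale gets.
    """
    h = math.pi / n_points
    marginal = 0.0
    for k in range(n_points):
        theta = -math.pi / 2 + (k + 0.5) * h
        marginal += norm_pdf(d_obs, scale * math.tan(theta), se)
    marginal /= n_points
    return norm_pdf(d_obs, 0.0, se) / marginal

for scale in (0.5, 1e-2, 1e-3):
    print(scale, round(bf01_cauchy(scale), 2))
```

As the scale shrinks, the alternative makes nearly the same predictions as the null, and the Bayes factor drifts toward 1 regardless of the data: the Bayesian's shrug, in numerical form.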

And
after that, what’s next? Will you have to refute H4: δ ~
Cauchy(1×10^-4), H5: δ ~ Cauchy(1×10^-5), and so on? A chill falls
over you as you consider the possibilities. Each time you defeat one decimal
place, another will rise to take its place. The fate of Sisyphus seems pleasant
by comparison.

The
Bayesian assures you that this is not a specific weakness of Bayesian model
comparison. If you were a frequentist, your opponents could always complain
that your study did not have enough power to detect δ = 1×10^-4. If you were
into estimation, your opponents could complain that your ESCI did not exclude δ
= 1×10^-4. You wonder if this is any way to spend your life, chasing eternally
after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.

At some
point, you will have to draw a limit. You will have to make an alternative hypothesis and declare “Here is the
approximate effect size predicted by the theory.” You won’t have to select the
specific point, because you can spread the probability judiciously across a
range of plausible values. It may not be exactly the hypothesis every single
researcher would choose, but it will be reasonable and judicious, because you
will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask that meaningful question, Bayesian model comparison will give you a meaningful answer.

In Summary

Bayesian
model comparison is a reasonable and mathematically-consistent way to get
appropriate answers to whatever your question. As the question changes, so too
should the answer. This is a feature, not a bug. If every question got the same
answer, would we trust that answer?

We must
remember that no form of statistics or measurement can hope to measure an
effect to arbitrary precision, and so it is epistemically futile to try to
prove absolutely the null hypothesis δ = 0. However, in many cases, δ = 0 seems
appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha:
δ = 1×10^-10 is trivially true, but scientifically unreasonable and unfair.

Asking good questions
is a skill, and doing the appropriate mathematics and programming to model the
questions is often no small task. I suggest that we appreciate those who ask
good questions and help those who ask poor questions to try other, more
informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.

---------------------------------------

Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post -- he is more dissatisfied with the choice of a particular alternative hypothesis than he is alarmed by the Bayes factor's sensitivity to the alternative. Still, it is my impression that some readers may find this subjectivity scary and therefore unfortunately avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest posterior density interval (HPDI) does, but you need a prior. Even then, you still have to come to some sort of decision about whether that HPDI is narrow enough. So here we are again.