Sunday, December 18, 2016

After performing a study, you can correctly conclude there
is an effect, or correctly conclude there is no effect, but you can also incorrectly conclude there is an effect
(a false positive, alpha, or Type 1 error) or incorrectly conclude there is no
effect (a false negative, beta, or Type 2 error).

The goal of collecting data is to provide evidence for or against a hypothesis.
Take a moment to think about what ‘evidence’ is – most researchers I ask can’t
come up with a good answer. For example, researchers sometimes think p-values are evidence, but p-values are only correlated with
evidence.

Evidence in science is necessarily relative. When the data are more likely assuming one model is true
(e.g., the null model) compared to another model (e.g., the alternative model),
we can say the data provide evidence for the null compared to the alternative
hypothesis. P-values only give you the
probability of the data under one model – what you need for evidence is the relative
likelihood of the data under two models.

Bayesian and likelihood approaches should be used when you
want to talk about evidence, and here I’ll use a very simplistic likelihood
model where we compare the likelihood of a significant result when the
null hypothesis is true (i.e., making a Type 1 error) with the
likelihood of a significant result when the alternative hypothesis is true
(i.e., *not* making a Type 2 error).

Let’s assume we have a ‘methodological fetishist’ (Ellemers, 2013) who is adamant about controlling their alpha
level at 5%, and who observes a significant result. Let’s further assume this
person performed a study with 80% power, and that the null hypothesis and
alternative hypothesis are equally (50%) likely. The outcome of the study has a
2.5% probability of being a false positive (a 50% probability that the null
hypothesis is true, multiplied by a 5% probability of a Type 1 error), and a
40% probability of being a true positive (a 50% probability that the
alternative hypothesis is true, multiplied by an 80% probability of finding a
significant effect).

The relative evidence for H1 versus H0 is 0.40/0.025 = 16. In
other words, based on a model for the null and a model
for the alternative hypothesis, the observed significant result is 16 times more likely when the alternative
hypothesis is true than when the null hypothesis is true. For educational
purposes, this is fine – for statistical analyses, you would use formal
likelihood or Bayesian analyses.
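For those who want to play with these numbers, here is a minimal sketch in R of the toy calculation above (the function name and the prior_h1 argument are just illustrative choices of mine):

```r
# Relative probability that a significant result is a true positive
# rather than a false positive, given alpha, power, and the prior
# probability that H1 is true.
evidence_ratio <- function(alpha, power, prior_h1 = 0.5) {
  true_positive  <- prior_h1 * power        # P(H1) * P(significant | H1)
  false_positive <- (1 - prior_h1) * alpha  # P(H0) * P(significant | H0)
  true_positive / false_positive
}

evidence_ratio(alpha = 0.05, power = 0.80)  # 0.40 / 0.025 = 16
```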

Now let’s assume you agree that providing evidence is a very
important reason for collecting data in an empirical science (another goal of data
collection is estimation – but I’ll focus on hypothesis testing here). We can now ask
ourselves what the effect of changing the Type 1 error or the Type 2 error
(1-power) is on the strength of our evidence. And let’s agree that we will
conclude that whichever error impacts the strength of our evidence the most is
the most important error to control. Deal?

We can plot the relative likelihood (the probability a
significant result is a true positive, compared to a false positive) assuming
H0 and H1 are equally likely, for all levels of power, and for all alpha
levels. If we do this, we get the plot below:

Or for a rotating version (yeah, I know, I am an R nerd):
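If you want to reproduce a surface like this yourself, a minimal sketch (not the exact code behind the figures above) is to compute the ratio over a grid of alpha levels and power values, assuming H0 and H1 are equally likely, and draw it with base R's persp():

```r
alpha <- seq(0.01, 0.30, 0.01)  # Type 1 error rates
power <- seq(0.10, 1.00, 0.01)  # 1 - Type 2 error rates

# With H0 and H1 equally likely, the ratio simplifies to power / alpha
lr <- outer(alpha, power, function(a, p) p / a)

persp(x = alpha, y = power, z = lr,
      xlab = "alpha (Type 1 error)", ylab = "power (1 - Type 2 error)",
      zlab = "likelihood ratio", theta = 45, phi = 25, ticktype = "detailed")
```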

So when is the evidence in our data the strongest? Not
surprisingly, this happens when both types of errors are low: the alpha level
is low, and the power is high (or the Type 2 error rate is low). That is why
statisticians recommend low alpha levels and high power. Note that the shape of
the plot remains the same regardless of how likely H1 and H0 are a priori, but
when H1 and H0 are not equally likely (e.g., H0 is 90% likely to be
true, and H1 is 10% likely to be true) the scale on the likelihood ratio axis
increases or decreases accordingly.

Now for the main point in this blog post: we can see that an
increase in the Type 2 error rate (or a reduction in power) reduces the
evidence in our data, but it does so relatively slowly. However, we can also
see that an increase in the Type 1 error rate (e.g., as a consequence of
multiple comparisons without controlling for the Type 1 error rate) quickly
reduces the evidence in our data. Royall (1997)
recommends that likelihood ratios of 8 or higher provide moderate evidence, and
likelihood ratios of 32 or higher provide strong evidence. Below 8, the
evidence is weak and not very convincing.

If we calculate the likelihood ratio for alpha = 0.05, and
power from 1 to 0.1 in steps of 0.1, we get the following likelihood ratios: 20,
18, 16, 14, 12, 10, 8, 6, 4, 2. With 80% power, we get the likelihood ratio of
16 we calculated above, but even 40% power leaves us with a likelihood ratio of
8, or moderate evidence (see the figure above). If we calculate the likelihood ratio for power = 0.8
and alpha levels from 0.05 to 0.5 in steps of 0.05, we get the following
likelihood ratios: 16, 8, 5.3, 4, 3.2, 2.67, 2.29, 2, 1.78, 1.6. An alpha
level of 0.1 still yields moderate evidence (assuming power is high enough!)
but further inflation makes the evidence in the study very weak.
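You can verify these numbers yourself; under the equal-prior assumption the ratio reduces to power divided by alpha:

```r
# Likelihood ratios for alpha = 0.05, power from 1 down to 0.1
round(seq(1, 0.1, -0.1) / 0.05, 2)
# 20 18 16 14 12 10  8  6  4  2

# Likelihood ratios for power = 0.8, alpha from 0.05 to 0.5
round(0.8 / seq(0.05, 0.5, 0.05), 2)
# 16.00  8.00  5.33  4.00  3.20  2.67  2.29  2.00  1.78  1.60
```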

Type 1 error control is important if we care
about evidence. Although I agree with Fiedler, Kutzner,
and Krueger (2012) that a Type 2 error is also
very important to prevent, you simply cannot ignore Type 1 error control if
you care about evidence. Type 1 error control is more important than Type 2
error control, because inflating Type 1 errors will very quickly leave you with
evidence that is too weak to be convincing support for your hypothesis, while
inflating Type 2 errors will do so more slowly. By all means, control Type 2 errors - but not at the expense of Type 1 errors.

I want to end by pointing out that Type 1 and Type 2 error
control is not a matter of ‘either-or’. Mediocre statistics textbooks like to
point out that controlling the alpha level (or Type 1 error rate) comes at the expense of the beta (Type
2) error, and vice-versa, sometimes using the horrible seesaw metaphor below:

But this is only true if the sample size is fixed. If you
want to reduce both errors, you simply need to increase your sample size: you
can make Type 1 and Type 2 errors as small as you want, and collect data that
provide extremely strong evidence.
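As a quick illustration (a sketch using the pwr package, which is only one of several options): pick a strict alpha and a low Type 2 error rate, and simply solve for the sample size you need.

```r
library(pwr)

# Expecting d = 0.5: with alpha = 0.05 and 80% power you need about 64 per group
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")

# With a stricter alpha (0.01) and higher power (0.95) both error rates are
# small and the evidence is strong - you just need more participants
# (roughly 145 per group)
pwr.t.test(d = 0.5, sig.level = 0.01, power = 0.95, type = "two.sample")
```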

Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal
the big picture in social psychology (and why we should do this): The big
picture in social psychology. European Journal of Social Psychology, 43(1),
1–8. https://doi.org/10.1002/ejsp.1932

Friday, December 9, 2016

I’m happy to announce my first R package ‘TOSTER’ for
equivalence tests (but don’t worry, there is an old-fashioned spreadsheet as
well).

In an earlier blog post I talked about equivalence tests. Sometimes
you perform a study where you might expect the effect is zero or very small. So
how can we conclude an effect is ‘zero or very small’? One approach is to
specify effect sizes we consider ‘not small’. For example, we might decide that
effects larger than d = 0.3 (or smaller than d = -0.3 in a two-sided t-test) are ‘not small’. Now, if
we observe an effect that falls between the two equivalence bounds of d = -0.3
and d = 0.3 we can act (in good old-fashioned Neyman-Pearson approach to statistical inferences) as if the effect is ‘zero or very small’. It might not be
exactly zero, but it is small enough. You can check out a great interactive visualization of
equivalence testing by RPsychologist.

We can use two one-sided tests to statistically reject
effects ≤ -0.3 and ≥ 0.3. This is the basic idea of the TOST (two one-sided tests) equivalence
procedure. The idea is simple, and it is conceptually similar to the
traditional null-hypothesis test you probably use in your articles to reject an effect of
zero. But while almost all statistics programs will let you perform a normal t-test, it is not yet that easy to perform a TOST equivalence test (Minitab is one exception).
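To make the idea concrete, here is a conceptual sketch with made-up data, using nothing but base R's t.test(): test once against the lower bound and once against the upper bound, and conclude equivalence only if both one-sided tests are significant (with real data you would first convert the bounds of d = ±0.3 to raw score units).

```r
set.seed(42)
x <- rnorm(100, mean = 0.02, sd = 1)  # made-up data with a true effect near zero
y <- rnorm(100, mean = 0.00, sd = 1)

low_bound  <- -0.3  # equivalence bounds in raw units (sd = 1, so raw = d here)
high_bound <-  0.3

# One-sided test against the lower bound: is the difference larger than -0.3?
t.test(x, y, mu = low_bound, alternative = "greater")

# One-sided test against the upper bound: is the difference smaller than 0.3?
t.test(x, y, mu = high_bound, alternative = "less")

# If both one-sided p-values are below alpha, we reject effects outside the bounds
```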

Let’s try a practical example (this is one of the examples from
the vignette
that comes with the R package).

Eskine (2013) showed that participants who had been exposed to
organic food were substantially harsher in their moral judgments relative to
those in the control condition (Cohen’s d =
0.81, 95% CI: [0.19, 1.45]). A replication by Moery & Calin-Jageman (2016,
Study 2) did not observe a significant effect (Control: n = 95, M = 5.25, SD =
0.95, Organic Food: n = 89, M = 5.22, SD = 0.83). The authors have used
Simonsohn’s recommendation to power their study so that they have 80% power to
detect an effect the original study had 33% power to detect. This is the same
as saying: We consider an effect to be ‘small’ when it is smaller than the effect size the original study
had 33% power to detect.

With n = 21 in each condition, Eskine (2013) had 33% power to detect an
effect of d = 0.48. This is the effect the authors of the replication study designed their study to
detect. The original study had shown an effect of d = 0.81, and the authors
performing the replication decided that an effect size of d = 0.48 would be the
smallest effect size they would aim to detect with 80% power. So we can use this
effect size as the equivalence bound. We can use R to perform an equivalence
test:
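The call looks roughly like this; it is a sketch based on the TOSTtwo() function in the TOSTER package, and argument names may differ slightly between package versions:

```r
# install.packages("TOSTER")
library(TOSTER)

# The equivalence bound of d = 0.48 is the effect Eskine (2013) had 33% power
# to detect with n = 21 per group (e.g., pwr::pwr.t.test(n = 21, power = 0.33,
# sig.level = 0.05, type = "two.sample") solves for roughly this d).
TOSTtwo(m1 = 5.25, sd1 = 0.95, n1 = 95,   # control condition
        m2 = 5.22, sd2 = 0.83, n2 = 89,   # organic food condition
        low_eqbound_d = -0.48, high_eqbound_d = 0.48,
        alpha = 0.05, var.equal = FALSE)  # Welch's t-test
```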

You see, we are just using R like a fancy calculator, entering all
the numbers in a single function. But I can understand if you are a bit
intimidated by R. So, you can also fill in the same info in the spreadsheet:

Using a TOST equivalence procedure with alpha = 0.05, and without assuming
equal variances (because when sample sizes are unequal, you
should report Welch’s t-test by default), we can reject effects larger than
d = 0.48: t(182) = -3.03, p = 0.001.

The R package also gives a graph, where you see the observed
mean difference (in raw scale units), the equivalence bounds (also in raw
scores), and the 90% and 95% CI. If the 90% CI falls entirely within the equivalence
bounds, we can declare equivalence.

Moery and Calin-Jageman concluded from this study: “We again
found that food exposure has little to no effect on moral judgments.” But what
is ‘little to no’? The equivalence test tells us the authors successfully rejected
effects at least as large as the effect the original study had 33% power to detect. Instead of saying ‘little to no’ we can put a number on
the effect size we have rejected by performing an equivalence test.

If you want to read more about equivalence tests, including
how to perform them for one-sample t-tests,
dependent t-tests, correlations, or
meta-analyses, you can check out a practical primer on equivalence testing using the TOST procedure I've written. It's available as a pre-print on PsyArXiv. The R code is available on GitHub.

Saturday, November 12, 2016

One widely recommended approach to increase power is using a
within subject design. Indeed, you need fewer participants to detect a mean difference
between two conditions in a within-subjects design (in a dependent t-test) than in a between-subjects
design (in an independent t-test).
The reason is straightforward, but not always explained, and even less often expressed in the easy equation below. The sample size needed in within-designs (NW) relative to the sample
size needed in between-designs (NB), assuming normal distributions, is (from Maxwell &
Delaney, 2004, p. 561, formula 45):

NW = NB × (1 − ρ)/2

The “/2” part of the equation reflects the fact that in a two-condition
within design every participant provides two data points. The extent to which
this reduces the sample size compared to a between-subject design depends on
the correlation between the two dependent variables, as indicated by the (1-ρ)
part of the equation. If the correlation is 0, a within-subject design simply
needs half as many participants as a between-subject design (e.g., 64 instead of
128 participants). The higher the correlation, the larger the relative benefit
of within designs; the more negative the correlation (down to -1), the smaller the
relative benefit, until it disappears entirely. Note that when the correlation is -1, you need 128
participants in a within-design and 128 participants in a between-design, but
in a within-design you will need to collect two measurements from each
participant, making a within design more work than a between-design. However, negative correlations between dependent variables in psychology are rare, and perfectly negative correlations will probably never occur.
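A quick sketch of what the formula implies, using the 128-participant between-design example from above:

```r
# N_within = N_between * (1 - rho) / 2 (Maxwell & Delaney, 2004)
n_within <- function(n_between, rho) ceiling(n_between * (1 - rho) / 2)

rho <- c(-1, -0.5, 0, 0.5, 0.7, 0.9)
data.frame(rho, n_within = n_within(128, rho))
#    rho n_within
# 1 -1.0      128
# 2 -0.5       96
# 3  0.0       64
# 4  0.5       32
# 5  0.7       20
# 6  0.9        7
```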

So how does the correlation increase the power of
within designs, or reduce the number of participants you need? Let’s see what effect the correlation has on power by
simulating and plotting correlated data. In the R script below, I’m simulating two
measurements of IQ scores with a specific sample size (i.e., 10000), mean (i.e.,
100 vs 106), standard deviation (i.e., 15), and correlation between the two
measurements. The script generates three plots.

We will start with a simulation where the correlation
between measurements is 0. First, we see the two normally distributed IQ
measurements, with means of 100 and 106, and standard deviations of 15 (due to
the large sample size, the numbers equal the input in the simulation, although
small variation might still occur).

In the scatter plot, we can see that the correlation between
the measurements is indeed 0.

Now, let’s look at the distribution of the mean differences.
The mean difference is -6 (in line with the simulation settings), and the
standard deviation is 21. This is also as expected. The standard deviation of
the difference scores is √2 times as large as the standard deviation in each
measurement, and indeed, 15*√2 = 21.21, which is rounded to 21.
This situation where the correlation between measurements is zero equals the
situation in an independent t-test,
where the correlation between measurements is not taken into account.

Now let’s increase the correlation between dependent
variables to 0.7.

Nothing has changed when we plot the means:

The correlation between measurements is now strongly
positive:

The important difference lies in the standard deviation of
the difference scores. The SD = 11 instead of 21 in the simulation above.
Because the standardized effect size is the difference divided by the standard
deviation, the effect size (Cohen’s dz in within designs) is larger in this
test than in the test above.

We can make the correlation more extreme, by increasing the
correlation to 0.99, after which the standard deviation of the difference
scores is only 2.
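These values follow directly from the standard formula for the standard deviation of difference scores, SDdiff = √(SD1² + SD2² − 2·r·SD1·SD2). A quick check in R (the small discrepancies with the simulated values above are just sampling and rounding error):

```r
sd_diff <- function(sd1, sd2, r) sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)

sd_diff(15, 15, 0)     # 21.2: no correlation, the sqrt(2) case described above
sd_diff(15, 15, 0.7)   # 11.6: a strong positive correlation shrinks the SD
sd_diff(15, 15, 0.99)  #  2.1: an almost perfect correlation shrinks it further
sd_diff(15, 15, -0.5)  # 26.0: a negative correlation increases the SD
```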

If you run the R code below, you will see that if you set the
correlation to a negative value, the standard deviation of the difference scores actually increases.
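The full script is not reproduced here, but a minimal sketch of such a simulation (assuming the MASS package for mvrnorm(); the exact code I used may differ) looks like this:

```r
library(MASS)  # for mvrnorm()

set.seed(1)
n     <- 10000
means <- c(100, 106)  # means of the two IQ measurements
sd    <- 15
rho   <- 0            # change to 0.7, 0.99, or a negative value

# Covariance matrix implied by the SD and the correlation
sigma <- matrix(c(sd^2,       rho * sd^2,
                  rho * sd^2, sd^2), nrow = 2)

iq <- mvrnorm(n = n, mu = means, Sigma = sigma)
diff_scores <- iq[, 1] - iq[, 2]

# Plot 1: the two IQ distributions
hist(iq[, 1], breaks = 50, col = rgb(0, 0, 1, 0.5), main = "IQ measurements", xlab = "IQ")
hist(iq[, 2], breaks = 50, col = rgb(1, 0, 0, 0.5), add = TRUE)

# Plot 2: scatter plot showing the correlation between the two measurements
plot(iq[, 1], iq[, 2], pch = 16, cex = 0.3,
     xlab = "Measurement 1", ylab = "Measurement 2")

# Plot 3: distribution of the difference scores
hist(diff_scores, breaks = 50, main = "Difference scores", xlab = "Difference")

round(c(mean_diff = mean(diff_scores), sd_diff = sd(diff_scores)), 2)
```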

I like to think of dependent variables in within-designs as dance
partners. If they are well-coordinated (or highly correlated), one person steps
to the left, and the other person steps to the left the same distance. If there
is no coordination (or no correlation), when one dance partner steps to the
left, the other dance partner is just as likely to move in the wrong direction
as in the right direction. Such a dance couple will take up a lot more space on
the dance floor.

You see that the correlation between dependent variables is
an important aspect of within designs. I recommend explicitly reporting the
correlation between dependent variables in within designs (e.g., participants responded significantly slower (M = 390, SD = 44) when they used their feet than when they used their hands (M = 371, SD = 44, r = .953), t(17) = 5.98, p < 0.001, Hedges' g =
0.43, Mdiff = 19, 95% CI
[12; 26]).

Since most dependent variables in within designs in
psychology are positively correlated, within designs will greatly increase the
power you can achieve given the sample size you have available. Use within-designs when
possible, but weigh the benefits of higher power against the downsides of order
effects or carryover effects that might be problematic in a within-subject
design. Maxwell and Delaney's book (Chapter 11) has a good discussion of this topic.