Thursday, October 29, 2015

Nuijten et
al (2015) created statcheck, a free R package that you can set
to work on a pdf or html file, or a folder of files, to check the reported t-tests, F-tests, correlations, and some others tests. Like your
spellchecker, you will want to run statcheck when working as an editor,
reviewer, author, supervisor, or teacher on any empirical article that contains
t-tests, F-tests, correlations, or chi-square tests.

Here’s how
it works. First, you need to install open source software that will allow R to
convert PDF files to text. The steps are a bit long and tricky, but I made a step-by-step summary which should help you to get this to work.

Then, to
check a single article, run the following R code (changing the path to the PDF
you want to check):

# install and load statcheck

if(!require(statcheck)){install.packages('statcheck')}

library(statcheck)

checkPDF("C:/Users/Daniel/statcheck/Zhang2015.pdf")

That’s all.

You will get output (click on the screenshot below for a bigger version) where you can see a
column for the reported p-values, the re-computed p-values, a summary of each
test, and then a column called Error which will say FALSE if there is no error,
and TRUE if there is an error. I analyzed a recent paper my PhD student Chao Zhang wrote, and I was happy to see the
way we worked on this article (Chao doing the analyses, me double-checking them)
prevented us from making errors. I also looked at earlier papers, and I
regrettably did make a few rounding errors and copy-paste errors in my
publications. Even though nothing changed the conclusions (indicated by the
column ‘DecisionError’), usin Statcheck would have easily prevented these
errors. Statcheck can make some errors, so be sure to check where each tests is
identified correctly, especially when it flags something as an error.

Some Errors We Make

Nuijten and
colleagues applied Statcheck to a huge amount of articles, and report how often
people make errors when reporting statistical tests in a new paper. When reading the paper, I
immediately saw how useful Statcheck was. But I also felt some annoyance that
there was no clear analysis of the things we did wrong. I felt someone told me
I was doing things wrong, without telling me what it was I did wrong. But then a wise man
said I should not blame the authors for not writing the paper I would have
written. Which is especially true given that Nuijten et al have shared all
their data, and their beautiful and reproducible analysis script.

So I took a
look at what we did wrong (R script), and below I will give a recommendation on how to fix
a large majority of the problems.

Of the 258105 tests,
there were 24961 errors,
of which 3581 were decision errors (changing the conclusion of p > 0.05 to p < 0.05 or vice versa), but they are all caused mainly by the
same types of errors. First, people make copy-paste errors. Second, people
reported p = 0.000 1279 times, when they should have reported p < 0.001.
Three errors are worth looking into in some more detail.

Incorrect use of < instead of =

By far the
largest number of errors is the use of < instead of =. For example, F(1, 68) = 4.88, p < .03 is incorrect, because the p-value is actually 0.0305, which is not < 0.03. It happens
thousands and thousands of times. Indeed, if we look at the difference between
the reported and re-computed p-values
for all the errors, we see the difference in p-values is mostly tiny (smaller than 0.01). This is the main reason.
When you read the byline ‘One in eight articles contain data-reporting mistakes that affect their conclusions' you
might not think the solution is simply to replace ‘<’ by ‘=’. I believe it largely
is (but this deserves a closer look).

Use of one-sided tests

Using
one-sided tests, without saying so (or at least without Statcheck recognizing
the words ‘one-sided’, ‘one-tailed’, or ‘directional’ in the text) is another
source of errors. The frequency of one-tailed tests (as I assume, without
pre-registration of the analysis plan) is rather high. One-tailed tests are
fine, and perhaps even more in line with your prediction than a two-tailed
test, but I’d feel more comfortable if people pre-register one-sided
predictions if they have them, and report them if they are performed. Statcheck is great for finding non-disclosed
one-tailed tests.

Incorrect Rounding and Reporting

963 times,
people round a p-value between 0.05
and 0.06 to p < 0.05. The latter
is clearly wrong (but remember people make the same rounding error far removed
from the magical p = 0.05 threshold
as well, so this is just the incorrect use of < instead of = as noted
above). 241 times, researchers report a p >= 0.055 to p < 0.05, and 128 times, people round a p-value
between 0.055 and 0.06 to p = 0.05
(really using the = sign). This is just pathetic. When you hear ‘1.4% of p-values are grossly inconsistent’, this
is the kind of behavior you think about. It makes up approximately 10% of the 3581 decision errors, and even though it is just 0.14% of all reported p-values, I think it is depressingly high. Statcheck
can help reduce these errors.

Altogether,
the 3581 decision errors are made up mostly by incorrect rounding,
the use of one-sided tests without explicitly stating this through the words
‘one-tailed’, ‘one-sided’ or ‘directional’, the use of < instead of =, and
the approximately 350 (give or take a hundred) false positives (note there might also be false negatives, which would increase the number of errors).

These
errors are visible in the plot below. In the left of the graph, we see
differences of -1, where Statcheck often computes a p-value of 1 because it misunderstands the test. The large bar in
the center is mainly due to the use of < instead of =, and the slightly
larger slope on the left of this large bar is due to the use of one-sided
tests, and incorrect rounding.

My main
goal in looking at the data in detail was to be able to provide practical
recommendations to prevent the specific errors we make (even though Nuijten et al suggest co-authors double-check their analyses and share all data). The recommendation is surprisingly straightforward, and nicely with the theme of this blog on how 20% of the effort will fix 80% of the problems:

Report exact p-values, rounded to three decimals (e.g., p = 0.016), or use p <
0.001. Mention the use of one-tailed tests. Double-check all numbers (for
example by using Statcheck!).

I'd like to thanks Michele Nuijten for her help in correcting some of my assumptions and analyses, and for feedback on an earlier draft of this blog post.

Sunday, October 11, 2015

People find it difficult to think about
random variation. Our mind is more strongly geared towards recognizing patterns
than randomness. In this blogpost, you can learn what random variation looks
like, how to reduce it by running well-powered studies, and how to meta-analyze
multiple small studies. This is a long read, and most educational if you follow the assignments. You'll probably need about an hour.

We'll use R, and the R script at the bottom of this post (or download it from GitHub). Run the first section (sections are differentiated by # # # #) to install the required packages and change some setting.

IQ tests have been designed such that the mean IQ of the entire population of
adults is 100, with a standard deviation
of 15. This will not be true for every sample
we draw from the population. Let’s
get a feel for what the IQ scores from a sample look like. Which IQ scores will
people in our sample have?

Assignment
1

We will start by simulating a random sample of 10 individuals. Run the script in the section #Assignment 1. Both the mean, as the standard deviation, differ from the true mean in the population. Simulate some more samples of 10 individuals and look at the means and SD's. They differ quite a lot. This type of variation is perfectly normal in small samples of 10 participants. See below for one example of a simulated sample.

Let’s simulate a larger sample, of 100 participants by changing the n=10 in line 23 of the R script to n = 100 (remember R code is case-sensitive).

We
are slowly seeing what is known as the normal
distribution. This is the well known bell shaped curve that represents the
distribution of many variables in scientific research (although some other
types of distributions are quite common as well). The mean and standard
deviation are much closer to the true mean and standard deviation, and this is
true for most of the simulated samples. Simulate at least 10 samples
with n = 10, and 10 samples with n = 100. Look at the means and standard
deviations. Let’s
simulate one really large sample, of 1000 people (run the code, changing n=10
to n=1000). The picture shows one example.

Not every simulated study of 1000 people
will yield the true mean and standard deviation, but this one did. And although
the distribution is very close to a normal distribution, even with a 1000
people it is not perfect.

The accuracy with which you can measure the
IQ in a population is easy to calculate when you know the standard deviation,
and the percentage of long-run probability of being of making an error. If you
choose a 95% confidence interval, and want to estimate IQ within an error range
of 2 IQ points, you first convert the 95% confidence interval to a Z-score
(1.96), and use the formula:

N = (Z * SD/error)2

In this example, (1.96*15/2) 2 =
216 people (rounded down). Feel free to check by running the code with n = 216
(remember that this is a long term average!)

In addition to planning for
accuracy, you can plan for power. The power of a study is the probability of
observing a statistically significant effect, given that there is a true effect
to be found. It depends on the effect size, the sample size, and the alpha
level.

We can simulate experiments,
and count how many statistically significant results are observed, to see how
much power we have. For example, when we simulate 100.000 studies, and 50% of
the studies reveal a p-value smaller than 0.05, this means the power of our
study (given a specific effect size, sample size, and alpha-level) is 50%.

We
can use the code in the section of Assignment 2. Running this code will take a while. It
will simulate 100000 experiments, where 10 participants are drawn from a normal
distribution with the mean of 110, and a SD of 15. To continue our experiment,
let’s assume the numbers represent measured IQ, which is 110 in our samples. For
each simulated sample, we test whether the effect differs from an IQ of 100. In
other words, we are testing whether our sample is smarter than average.

The program returns all p-values, and it
will return the power, which will be somewhere around 47%. It will also yield a
plot of the p-values. The first bar
is the count of all p-values smaller than 0.05, so all statistically
significant p-values. The percentage
of p-values in this single bar
visualizes the power of the study.

Instead of simulating the power of the
study, you can also perform power calculations in R (see the code at the end of assignment 2). To calculate the power of a
study, we need the sample size (in our case, n = 10), the alpha level (in our
case, 0.05), and the effect size, which for a one-sample t-test is Cohen’s d,
which can be calculated as d = (X-μ)/σ, or
(110-100)/15 = 0.6667.

Assignment
2

Using the simulation and the pwr package, examine what happens with the power of the
experiments when the sample size is increased to 20. How does the p-value distribution change?

Using the simulation and the pwr package, examine what happens with the power of the
experiments when the mean in the sample changes from 110 to 105 (set the sample
size to 10 again). How does the p-value
distribution change?

Using the simulation and the pwr package, examine what happens with the power of the
experiments when the mean in the sample is set to 100 (set the sample size to
10 again). Now, there is no difference between the sample and the average IQ.
How does the p-value distribution
change? Can we formally speak of ‘power’ in this case? What is a better name in
this specific situation?

Variance in two groups, and their difference.

Now, assume we have a new
IQ training program that will increase peoples IQ score with 6 points. People
in condition 1 are in the control condition – they do not get IQ training.
People in condition 2 get IQ training. Let’s simulate 10 people in each group,
assuming the IQ in the control condition is 100, and in the experimental group
is 106 (the SD is still 15 in each group) by running the code for Assignment 3.

The graph you get will look like a version
of the one below. The means and SD for each sample drawn are provided in the
graph (control condition on the left, experimental condition on the right).

The
two groups differ in how close they are to their true means, and as a
consequence, the difference between groups varies as well. Note that this
difference is the main variable in statistical analyses when comparing two
groups. Run at least 10 more simulations to look at the data pattern.

Assignment 3

Compared
to the one-sample case above, we now have 2 variable group means, and two
variable standard deviations. If we perform a power analysis, how do you think
this additional variability will influence the power of our test? In other
words, for the exact same effect size (e.g., 0.6667), will the power of our
study remain the same, will it increase, or will it decrease?

Test
whether your intuition was correct or not by running this power analysis for an
independent samples t-test:

In dependent samples, the mean in one sample correlates with the
mean in the other sample. This reduced the amount of variability in the
difference scores. If we perform a power analysis, how do you think this will
influence the power of our test?

Effect
size calculations for dependent samples are influenced by the correlation
between the means. If this correlation is 0.5, the effect size calculation for
the dependent case and the independent case is identical. But the power for a
dependent t-test will be identical to the power in a one-sample t-test.

Verify
this by running the power analysis for a dependent samples t-test, with a true effect size of 0.6667, and compare the power
with the same power analysis for a one-sample t-test we performed above:

Up until know, we have talked about the
variation of data points within a single study. It is clear that the larger the
sample size, the more the observed difference (in the case of two means) or the
more the observed correlation (in the case or two related variables) mirrors
the true difference or correlation. We can calculate the variation in the effects we are interested in directly.
Both correlations are mean differences are effect sizes. Because mean
differences are difficult to compare across studies that use different types of
measures to examine an effect, or different scales to measure differences on,
whenever multiple effect sizes are compared researchers often use standardized
effect sizes. In this example, we will focus on Cohen’s d, which provides
the standardized mean difference.

As explained in Borenstein, Hedges,
Higgins, & Rothstein (2009) a very good approximation of the variance of d is provided by:

This formula shows that the variance of d depends only on the sample size and
the value of d itself.

Single study meta-analysis

Perhaps you remember that whenever the 95%
confidence interval around an effect size estimate excludes zero, the effect is
statistically significant. When you want to test whether effects sizes across a
number of studies differ from 0, you have to perform what is known as a meta-analysis. In essence, you perform
an analysis over analyses. You first analyze individual studies, and then
analyze the set of effect sizes you calculated from each individual study. To
perform a meta-analysis, all you need are the effect sizes and the sample size
of each individual study.

Let’s first begin with something you will
hardly ever do in real life: a meta-analysis of a single study. This is a
little silly, because a simple t-test
of correlation will tell you the same thing – but that’s educational to see.

We will simulate one study examining our IQ
training program. The IQ in the control condition has M = 100, SD = 15, and in
the experimental condition the average IQ has improved to M = 106, SD = 15. We
will randomly select the sample size, and draw between 20-50 participants in each condition.

Run
the code in assignment 6 (I'm skipping some parts I do use in teaching - feel free to run that code to explore variation in correlations) to see the data. Remove the # in front of the set.seed line to get
the same result as in this example.

Assignment
6

If we perform a
meta-analysis, we get almost the same result - the calculations used by the meta
package differ slightly (although it will often round to the same 2 digits
after the decimal point), because it uses a different (Wald) type of tests and
confidence interval – but that’s not something we need to worry about here.

Run the simulation a number
of times to see the variation in the results, and the similarity between the
meta-analytic result and the t-test.

The meta-analysis compares the
meta-analytic effect size estimate (which in this example is based on a single
study) to zero, and tests whether the difference from zero is statistically
significant. We see the estimate effect size g = 0.7144, a 95% CI, and a z-score (2.7178), which is the test
statistic for which a p-value can be
calculated. The p-value of 0.0066 is
very similar to that observed in the t-test.

95%-CI z
p-value

0.7143703 [0.1992018; 1.2295387] 2.7178 0.0066

Meta-analysis
are often visualized using forest plots. We see a forest plot summarizing our
single test below:

In this plot we see a number (1) for our
single study. The effect size (0.71), which is Hedges's g, the unbiased estimate of Cohen's d, and the confidence interval [0.2; 1.23]
are presented on the right. The effect size and confidence interval is also
visualized. The effect size by the orange square (the larger the sample size,
the bigger the square is) and the length of the line running through it is the
95% confidence interval.

A small-scale meta-analysis

Meta-analyses are made to analyze more than
one study. Let’s analyze 4 studies, with different effect sizes (0.44, 0.7, 0.28,
0.35) and sample sizes (60, 35, 23, 80 and 60, 36, 25, 80).

Researchers have to choose between a fixed effect model or a random effects model when performing a
meta-analysis.

Fixed effect models assume a single true effect size underlies all the
studies included in the meta-analysis. Fixed effect models are therefore
only appropriate when all studies in the meta-analysis are practically
identical (e.g., use the same manipulation) and when researchers do not want to
generalize to different populations (Borenstein, Hedges, Higgins, &
Rothstein, 2009).

By contrast, random effects models allow the true effect size to vary from study to
study (e.g., due to differences in the manipulations between studies). Note
the difference between fixed effect and random effects (plural, meaning multiple effects). Random effects models
therefore are appropriate when a wide range of different studies is examined
and there is substantial variance between studies in the effect sizes. Since
the assumption that all effect sizes are identical is implausible in most
meta-analyses random effects meta-analyses are generally recommended
(Borenstein et al., 2009).

The meta-analysis in this assignment, where
we have simulated studies based on exactly the same true effect size, and where
we don’t want to generalize to different populations, is one of the rare
examples where a fixed effect meta-analysis would be appropriate – but for
educational purposes, I will only show the random effects model. When variation
in effect sizes is small, both models will give the same results.

In a meta-analysis, a weighted mean is
computed. The reason studies are weighed when calculating the meta-analytic
effect size is that larger studies are considered to be more accurate estimates
of the true effect size (as we have seen above, this is true in general).
Instead of simply averaging over an effect size estimate from a study with 20
people in each condition, and an effect size estimate from a study with 200
people in each condition, the larger study is weighed more strongly when
calculating the meta-analytic effect size.

R makes it relatively easy to perform a
meta-analysis by using the meta or metafor package. Run the code related to Assignment 7. We get the following
output, where we see four rows (one for each study), the effect sizes and 95%
CI for each effect, and the %W (random), which is the relative weight for each
study in a random effects meta-analysis.

95%-CI %W(random)

1 0.44
[ 0.0802; 0.7998] 30.03

2 0.70
[ 0.2259; 1.1741] 17.30

3 0.28
[-0.2797; 0.8397] 12.41

4 0.35
[ 0.0392; 0.6608] 40.26

Number
of studies combined: k=4

95%-CI z p-value

Random
effects model 0.4289 [0.2317; 0.6261] 4.2631 < 0.0001

Quantifying
heterogeneity:

tau^2 =
0; H = 1 [1; 1.97]; I^2 = 0% [0%; 74.2%]

Test of
heterogeneity:

Q d.f.
p-value

1.78
3 0.6194

The line below the summary gives us the
statistics for the random effects model. First, the meta-analytic effect size
estimate (0.43) with the 95% CI [0.23; 0.63], and the associated z-score and p-value. Based on the set of studies we
simulated here, we would conclude it looks like there is a true effect.

The same information is visualized in a
forest plot:

The meta-analysis also provides statistics
for heterogeneity. Tests for
heterogeneity examine whether there is large enough variation in the effect
sizes included in the meta-analysis to assume their might be important
moderators of the effect. For example, assume studies examine how happy
receiving money makes people. Half of the studies gave people around 10 euros,
while the other half of the study gave people 100 euros. It would not be
surprising to find both these manipulations increase happiness, but 100 euro
does so more strongly that 10 euro. Many manipulations in psychological
research differ similarly in their strength. If there is substantial
heterogeneity, researchers should attempt to examine the underlying reason for
this heterogeneity, for example by identifying subsets of studies, and then examining
the effect in these subsets. In our example, there does not seem to be substantial
heterogeneity (the test for heterogeneity, the Q-statistic, is not
statistically significant).

Assignment
7

Play around with the effect
sizes and sample sizes in the 4 studies in our small meta-analysis. What
happens if you increase the sample sizes? What happens if you make the effect
sizes more diverse? What happens when the effect sizes become smaller (e.g.,
all effect sizes vary a little bit around d = 0.2). Look at the individual
studies. Look at the meta-analytic effect size.

Simulating small studies

Instead of typing in specific number for
every meta-analysis, we can also simulate a number of studies with a specific
true effect size. This is quite informative, because it will show how much
variability there is in small, underpowered, studies. Remember that many
studies in psychology are small and underpowered.

In this simulation, we will randomly draw
data from a normal distribution for two groups. There is a real difference in
means between the two groups. Like above, the IQ in the control condition has M
= 100, SD = 15, and in the experimental condition the average IQ has improved
to M = 106, SD = 15. We will simulate between 20 and 50 participants in each
condition (and thus create a ‘literature’ that consists primarily of small
studies).

You can run the code we have used above
(for a single meta-analysis) to simulate 8 studies, perform the meta-analysis,
and create a forest plot. The code for Assignment 8 is the same as earlier, we
just changed the nSims=1 to nSims=8.

The forest plot of one of the random
simulations looks like:

The studies show a great deal of
variability, even though the true difference
between both groups is exactly the same in every simulated study. Only 50% of
the studies reveal a statistically significant effect, but the meta-analysis
provides clear evidence for the presence of a true effect in the fixed-effect
model (p < 0.0001):

95%-CI %W(fixed) %W(random)

1 -0.0173 [-0.4461; 0.4116] 14.47 13.83

2 -0.0499 [-0.5577; 0.4580] 10.31 11.16

3 0.6581 [ 0.0979; 1.2183] 8.48 9.74

4 0.5806 [ 0.0439; 1.1172] 9.24 10.35

5 0.3104 [-0.1693; 0.7901] 11.56 12.04

6 0.4895 [ 0.0867; 0.8923] 16.40 14.87

7 0.7362 [ 0.3175; 1.1550] 15.17 14.22

8 0.2278 [-0.2024; 0.6580] 14.37 13.78

Number of studies combined: k=8

95%-CI z p-value

Fixed effect model 0.3624 [0.1993; 0.5255] 4.3544 < 0.0001

Assignment
8

Pretend these would be the
outcomes of studies you actually performed. Would you have continued to test
your hypothesis in this line of research after study 1 and 2 showed no results?

Simulate at least 10 small
meta-analyses. Look at the pattern of the studies, and how much they vary. Look
at the meta-analytic effect size estimate. Does it vary, or is it more
reliable? What happens if you increase the sample size? For example, instead of
choosing samples between 20 and 50 [SampleSize<-sample(20:50, 1)], choose
samples between 100 and 150 [SampleSize<-sample(100:150, 1)].

Meta-Analysis, not Miracles

Some
people are skeptical about the usefulness of meta-analysis. It is important to
realize what meta-analysis can and can’t do. Some researchers argue
meta-analyses are garbage-in, garbage-out.
If you calculate the meta-analytic effect size of a bunch of crappy studies,
the meta-analytic effect size estimate will also be meaningless. It is true
that a meta-analysis cannot turn bad data into a good effect size estimation.
Similarly, meta-analytic techniques that aim to address publication bias (not discussed in this blog post) can
never provide certainty about the unbiased effect size estimate.

However,
meta-analysis does more than just provide a meta-analytic effect size estimate
that is statistically different from zero or not. It allows researchers to
examine the presence of bias, and the presence of variability. These analyses
might allow researchers to identify different subsets of studies, some stronger
than others. Very often, a meta-analysis will provide good suggestions for
future research, such as large scale tests of the most promising effect under
investigation.

Meta-analyses are not always
performed with enough attention to
detail (e.g., Lakens, Hilgard, & Staaks, 2015). It is important to
realize that a meta-analysis has the potential to synthesize a large set of
studies, but the extent to which a meta-analysis succesfully achieves this is
open for discussion. For example, it is well-known that researchers on opposite
sides of a debate (e.g., concerning the question whether aggressive video games
do or do not lead to violence) can publish meta-analyses reaching opposite
conclusions. This is obviously undesirable, but points towards the large
degrees in freedom in choosing which articles to include in the meta-analysis,
as well as other choices that are made throughout the meta-analysis.

Nevertheless, meta-analyses
can be very useful. First of all, small scale-meta-analyses can actually
mitigate publication bias, by allowing researchers to publish individual
studies that show statistically significant effect and studies that do not show
statistically significant effect, while the overall meta-analytic effect size
provides clear support for a hypothesis. Second, meta-analyses provide us with
a best estimate (or a range of best estimate, given specific assumptions of
bias) of the size of effects, or the variation in effect sizes depending on
specific aspects of the performed studies, which can inspire future research.

That’s a lot of information
about variation in single studies, variation across studies, meta-analyzing
studies, and performing power analyses to design studies that have a high
probability of showing a true effect, if it’s there! I hopethis is helpful in designing studies and evaluating their
results.