A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, October 29, 2015

Checking your Stats, and Some Errors we Make

Nuijten et al (2015) created statcheck, a free R package that you can set
to work on a PDF or HTML file, or a folder of files, to check the reported t-tests, F-tests, correlations, and some other tests. Like your
spellchecker, you will want to run statcheck when working as an editor,
reviewer, author, supervisor, or teacher on any empirical article that contains
t-tests, F-tests, correlations, or chi-square tests.

Here’s how
it works. First, you need to install open source software that will allow R to
convert PDF files to text. The steps are a bit long and tricky, but I made a step-by-step summary which should help you to get this to work.

Then, to
check a single article, run the following R code (changing the path to the PDF
you want to check):

# install and load statcheck

if(!require(statcheck)){install.packages('statcheck')}

library(statcheck)

checkPDF("C:/Users/Daniel/statcheck/Zhang2015.pdf")

That’s all.

You will get output (click on the screenshot below for a bigger version) where you can see a
column for the reported p-values, the re-computed p-values, a summary of each
test, and then a column called Error which will say FALSE if there is no error,
and TRUE if there is an error. I analyzed a recent paper my PhD student Chao Zhang wrote, and I was happy to see the
way we worked on this article (Chao doing the analyses, me double-checking them)
prevented us from making errors. I also looked at earlier papers, and I
regrettably did make a few rounding errors and copy-paste errors in my
publications. Even though nothing changed the conclusions (indicated by the
column ‘DecisionError’), using Statcheck would have easily prevented these
errors. Statcheck can make some errors itself, so be sure to check whether each test is
identified correctly, especially when it flags something as an error.

Some Errors We Make

Nuijten and
colleagues applied Statcheck to a huge number of articles, and report in a new paper how often
people make errors when reporting statistical tests. When reading the paper, I
immediately saw how useful Statcheck was. But I also felt some annoyance that
there was no clear analysis of the things we did wrong. I felt someone told me
I was doing things wrong, without telling me what it was I did wrong. But then a wise man
said I should not blame the authors for not writing the paper I would have
written. Which is especially true given that Nuijten et al have shared all
their data, and their beautiful and reproducible analysis script.

So I took a
look at what we did wrong (R script), and below I will give a recommendation on how to fix
a large majority of the problems.

Of the 258105 tests,
there were 24961 errors,
of which 3581 were decision errors (changing the conclusion of p > 0.05 to p < 0.05 or vice versa), but most of them are caused by the
same few types of errors. First, people make copy-paste errors. Second, people
reported p = 0.000 1279 times, when they should have reported p < 0.001.
Three other errors are worth looking into in some more detail.

Incorrect use of < instead of =

By far the
largest number of errors is the use of < instead of =. For example, F(1, 68) = 4.88, p < .03 is incorrect, because the p-value is actually 0.0305, which is not < 0.03. It happens
thousands and thousands of times. Indeed, if we look at the difference between
the reported and re-computed p-values
for all the errors, we see the difference in p-values is mostly tiny (smaller than 0.01), and the misuse of < is the main reason.
When you read the byline ‘One in eight articles contain data-reporting mistakes that affect their conclusions’ you
might not think the solution is simply to replace ‘<’ by ‘=’. I believe it largely
is (but this deserves a closer look).
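
The example above is easy to re-check by hand in base R. This is only a minimal sketch of the kind of re-computation involved, not Statcheck's own code; the variable names are mine.

```r
# Recompute the p-value for the reported test F(1, 68) = 4.88
f_value <- 4.88
df1 <- 1
df2 <- 68

# Upper-tail probability of the F distribution
p_recomputed <- pf(f_value, df1, df2, lower.tail = FALSE)
round(p_recomputed, 4)

# The reported "p < .03" only holds if the recomputed p is really below .03
claim_holds <- p_recomputed < 0.03
claim_holds
```

If `claim_holds` is FALSE, as the post says it should be here, reporting the exact p-value with an = sign avoids the inconsistency.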

Use of one-sided tests

Using
one-sided tests, without saying so (or at least without Statcheck recognizing
the words ‘one-sided’, ‘one-tailed’, or ‘directional’ in the text) is another
source of errors. The frequency of one-tailed tests (as I assume, without
pre-registration of the analysis plan) is rather high. One-tailed tests are
fine, and perhaps even more in line with your prediction than a two-tailed
test, but I’d feel more comfortable if people pre-register one-sided
predictions if they have them, and report them if they are performed. Statcheck is great for finding non-disclosed
one-tailed tests.
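
To see why an undisclosed one-tailed test shows up as an error, here is a small base R sketch. The test result t(48) = 1.80 is a made-up example, not taken from any paper, and the sketch assumes the reported p-value is compared against a two-sided re-computation.

```r
# Hypothetical reported result (illustration only): t(48) = 1.80
t_value <- 1.80
df <- 48

# Two-sided p-value: the default a checker would assume without further information
p_two_sided <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

# One-sided p-value in the predicted direction: exactly half of the two-sided value
p_one_sided <- pt(t_value, df, lower.tail = FALSE)

c(two_sided = p_two_sided, one_sided = p_one_sided)

# The two versions fall on different sides of the 0.05 threshold
decision_differs <- (p_two_sided < 0.05) != (p_one_sided < 0.05)
decision_differs
```

Reporting the one-sided p-value without the words ‘one-tailed’, ‘one-sided’, or ‘directional’ would therefore look like a decision error.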

Incorrect Rounding and Reporting

963 times,
people round a p-value between 0.05
and 0.06 to p < 0.05. The latter
is clearly wrong (but remember people make the same rounding error far removed
from the magical p = 0.05 threshold
as well, so this is just the incorrect use of < instead of = as noted
above). 241 times, researchers report a p >= 0.055 as p < 0.05, and 128 times, people round a p-value
between 0.055 and 0.06 to p = 0.05
(really using the = sign). This is just pathetic. When you hear ‘1.4% of p-values are grossly inconsistent’, this
is the kind of behavior you think about. It makes up approximately 10% of the 3581 decision errors, and even though it is just 0.14% of all reported p-values, I think it is depressingly high. Statcheck
can help reduce these errors.
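
The percentages above can be reproduced from the counts given in this post. The sketch below assumes that the ‘approximately 10%’ refers to the 241 + 128 cases just mentioned; the helper `pct` is my own name.

```r
# Counts quoted in this post
n_tests           <- 258105  # all extracted tests
n_errors          <- 24961   # inconsistencies of any size
n_decision_errors <- 3581    # errors that change the p < 0.05 conclusion

# Assumption: the two grossly incorrect rounding behaviours taken together
n_gross_rounding <- 241 + 128

# Percentage helper, rounded to two decimals
pct <- function(x, total) round(100 * x / total, 2)

pct(n_errors, n_tests)                    # share of all tests with any error
pct(n_decision_errors, n_tests)           # share of all tests with a decision error
pct(n_gross_rounding, n_decision_errors)  # share of decision errors due to gross rounding
pct(n_gross_rounding, n_tests)            # share of all tests with gross rounding
```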

Altogether,
the 3581 decision errors are made up mostly of incorrect rounding,
the use of one-sided tests without explicitly stating this through the words
‘one-tailed’, ‘one-sided’ or ‘directional’, the use of < instead of =, and
the approximately 350 (give or take a hundred) false positives (note there might also be false negatives, which would increase the number of errors).

These
errors are visible in the plot below. On the left of the graph, we see
differences of -1, where Statcheck often computes a p-value of 1 because it misunderstands the test. The large bar in
the center is mainly due to the use of < instead of =, and the slightly
larger slope on the left of this large bar is due to the use of one-sided
tests and incorrect rounding.

My main
goal in looking at the data in detail was to be able to provide practical
recommendations to prevent the specific errors we make (even though Nuijten et al suggest co-authors double-check their analyses and share all data). The recommendation is surprisingly straightforward, and fits nicely with the theme of this blog on how 20% of the effort will fix 80% of the problems:

Report exact p-values, rounded to three decimals (e.g., p = 0.016), or use p <
0.001. Mention the use of one-tailed tests. Double-check all numbers (for
example by using Statcheck!).
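
If you compute your statistics in R, a small helper can apply this reporting rule for you. This is a minimal sketch; `format_p` is a hypothetical function name of my own, not part of Statcheck.

```r
# Format p-values following the recommendation above:
# exact values rounded to three decimals, or "p < 0.001" for very small values
format_p <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  ifelse(p < 0.001, "p < 0.001", sprintf("p = %.3f", p))
}

# Usage with a few illustrative values
format_p(c(0.0164, 0.00004, 0.2))
```

Because the function never produces ‘p = 0.000’ or a loose ‘p < .03’, it avoids two of the error types described above. Copy-paste errors still need a double-check.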

I'd like to thank Michele Nuijten for her help in correcting some of my assumptions and analyses, and for feedback on an earlier draft of this blog post.

11 comments:

Nice post! I agree - I also discovered errors in my own manuscripts when checking them (at least, before I started to use knitr).

For people who want to check their own manuscript and are less amazed by the idea of installing several command line tools (or even never started R): They can type the relevant test statistics into the p-checker app: http://shinyapps.org/apps/p-checker/

This is a bit more manual work, but probably easier for many. (And you get additional indices of evidential value as a free add-on!).

Excellent recommendation! Originally I wanted to show how to export the automatically retrieved test statistics to a txt file, and then plug it in to p-checker, but the post was getting too long. So let's correct it in the comments for the die-hard reader! The code:

will save the statcheck analysis, and write the identified test statistics to a txt file (report.txt). You can just open the txt file and copy-paste the test statistics into p-checker, and get a p-curve analysis, TIVA and other tests for publication bias, etc.

Felix, I think the idea of fully automated statistics checks using p-checker would make a worthwhile blog post!

# This is a retracted paper

download.file(url="http://www.communicationcache.com/uploads/1/0/8/8/10887248/money_and_mimicry-when_being_mimicked_makes_people_feel_threatened.pdf", destfile="check.pdf", method="curl")

This is fantastic! I have shared it with my professor, Jay Van Bavel, and we've shared it with the whole lab. We are making it a policy to run this program before submitting any manuscript. I suspect this may become common practice in our field in short order.

One question. I am on a Mac, and I have noticed that the instructions for adding xpdf to the path are geared for Windows users. I'm therefore at a loss as to how to install the script on my machine! Would you mind providing a little guidance about how to get the script set up for Mac users? (I suspect this will be useful to more than me, given the prevalence of Mac users in the field!)

Thanks!

Daniel Yudkin, Advanced Doctoral Candidate in Social Psychology, New York University

You wrote that it works with correlations, but that does not seem to be the case, at least for the few PDFs I've tried.

It worked with most papers I've tried, but APA papers like the JEPs and Emotion did not work for me. It seems the equal signs are coded as underscores or blank spaces in those papers. It can to some extent be fixed manually with the search-and-replace function in a text editor, I guess.

Great help for checking errors in one's own manuscripts, nevertheless!
