Coding errors can be avoided

An article in the American Journal of Political Science was corrected after it emerged that a political attitude variable had accidentally been coded the wrong way around. Pre-publication cross-checks by the authors and the journal, as well as publication of the original data and variable transformations, can help avoid such problems.

Interestingly, the replicators did not frame their work as a replication, nor did they make a big fuss about it. They merely stated that they were “demonstrating that the authors had miscoded their religious and ideological measures, mistaking the conservative and religious pole for the liberal and secular pole”. They made the original authors aware, moved the error discussion to an appendix, and extended their analysis based on corrected data. This procedure is commendable, because it works against fears of replication witch hunts.

The Erratum to the article reads: “The interpretation of the coding of the political attitude items in the descriptive and preliminary analyses portion of the manuscript was exactly reversed. Thus, where we indicated that higher scores in Table 1 (page 40) reflect a more conservative response, they actually reflect a more liberal response. Specifically, in the original manuscript, the descriptive analyses report that those higher in Eysenck’s psychoticism are more conservative, but they are actually more liberal; and where the original manuscript reports those higher in neuroticism and social desirability are more liberal, they are, in fact, more conservative.”

This error and the replication study were discussed by science writer Rolf Degen on Twitter and on Political Science Rumors. For example, one commenter asked: “Isn’t this the kind of mistake that actually requires a retraction? This isn’t a judgment call about model specification.” Another stated that the error could have been detected earlier by the journal “if they had submitted the raw (de-identified) data, and the R/Stata syntax they used for recoding those variables.”

Reading the articles, I had the exact same thoughts:

1. There are many corrections in political science, but not many retractions (LaCour was an exception). What determines the forgiveness in our field, and is it good or bad? [I’ll blog on this another time.]

2. How can we avoid simple coding errors?

Let’s talk about the latter.

First, authors should re-run all of their code on exactly the same data before submitting to a journal. For this purpose, the raw, untouched data need to be stored separately from the analysis files. The syntax or R code that cleans and transforms the variables must be accessible and well commented. Ask your research assistant, a co-author or someone in your lab to double-check not only the analysis, but also the very first steps of data preparation. “In-house” pre-publication checks can avoid disaster (read this story about why “It is good we spotted the problem before we submitted the paper.” in cancer research). By the way, technically this is a duplication (using the same data and code) rather than a replication.
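To make this kind of check concrete, here is a minimal sketch in Python (the same idea carries over to R or Stata). It is not the procedure used in the article discussed above. The 1 to 5 scale, the variable names and the data are all invented for illustration. The idea is to put the recoding in a small, commented function and then test that the recoded item points the same way as an anchor variable whose direction is known from the codebook.

```python
# Hypothetical 5-point item as delivered: 1 = conservative ... 5 = liberal.
# We want a "conservatism" score where HIGHER means MORE conservative.
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_code(value, lo=SCALE_MIN, hi=SCALE_MAX):
    """Flip a scale item so the former low pole becomes the high pole."""
    if not lo <= value <= hi:
        raise ValueError(f"value {value} is outside the scale {lo}-{hi}")
    return lo + hi - value


def covariance(xs, ys):
    """Sample covariance; only its sign matters for the direction check."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def assert_same_direction(item, anchor, item_name):
    """Fail loudly if the item does not move with the anchor variable."""
    if covariance(item, anchor) <= 0:
        raise AssertionError(f"{item_name} may be coded the wrong way around")


# Invented example data (six respondents).
raw_item = [1, 2, 2, 4, 5, 5]        # 1 = conservative ... 5 = liberal
self_placement = [7, 6, 5, 3, 2, 1]  # anchor: 1 = liberal ... 7 = conservative

conservatism = [reverse_code(v) for v in raw_item]
assert_same_direction(conservatism, self_placement, "conservatism")
```

If the reverse-coding step were forgotten, the check on the unrecoded item would raise an error instead of passing silently. A second person re-running the script sees the same check, which is the point of an in-house duplication.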

Second, journals should do the same cross-checks. Ideally, they would cover not only the analysis, but also the steps from the raw data to the transformed variables that enter the models. This is a lot of work, but simply re-running code on already cleaned data will only tell you so much. The American Journal of Political Science already conducts such pre-publication checks, but either this article was published before the checks were introduced, or the checks did not include the full re-analysis needed to detect the coding errors. So even though AJPS is a pioneer with its pre-publication checks, more can be done.

Third, in an ideal case authors should make sure they publish the original data (untouched, unchanged) and the code for the variable transformations. I know that this opens a can of worms. Where does “original data” start? How many raw data sets from openly available sources such as the World Bank do we have to store on our webpages? Doesn’t this go too far? If you cannot upload 20 data sets, one for each variable that you have collected in raw form, at least structure your data files accordingly on your computer. As I wrote earlier: all changes, e.g. variable transformations, should be made in the R code. Never touch the original raw data (for example by changing .csv files manually). The advantage is that anyone else (and you yourself) can start from scratch with the same original data file. You can then share this for cross-checks by co-authors and make it available to anyone who later has doubts about your results. Nothing establishes more credibility in your work than being transparent and providing all information. Even about the raw data. Even if it is time consuming.
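The “never touch the raw data” rule can be sketched in a few lines. This example is in Python, but the same structure works for an R or Stata script. The folder names (raw/, derived/), the file names and the gdp_usd column are assumptions made up for the example, not a required layout. The script only reads the raw file, writes every transformation to a separate derived file, and uses a checksum to confirm that the raw file is byte-for-byte unchanged afterwards.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file, used to show the raw data were not modified."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_analysis_file(raw_path, derived_path):
    """Read the raw CSV, transform in code, and write a SEPARATE file."""
    with open(raw_path, newline="") as src, open(derived_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["gdp_billions"])
        writer.writeheader()
        for row in reader:
            # Transformation lives in code, never in a hand-edited spreadsheet.
            row["gdp_billions"] = float(row["gdp_usd"]) / 1e9
            writer.writerow(row)


# Demo in a temporary directory with an invented two-row raw file.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw" / "gdp.csv"
derived = workdir / "derived" / "gdp_clean.csv"
raw.parent.mkdir()
derived.parent.mkdir()
raw.write_text("country,gdp_usd\nA,2500000000\nB,400000000\n")

checksum_before = sha256_of(raw)
build_analysis_file(raw, derived)

# The raw file must be identical before and after the pipeline runs.
assert sha256_of(raw) == checksum_before
```

Publishing the checksum alongside the raw file lets co-authors and later readers confirm that they are starting from the same original data.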

A note on the side: names

In this blog post I have deliberately discussed the papers without using the authors’ names. You will see the names when you click on the links. This may make for a more complicated read, but I have found that being less personal, and more about the work itself, can help reduce skepticism toward replicators. I will try to do this more often on this blog from now on.