Learning about Reproducible Research on Coursera: Recap Week 1

I’m doing the free Coursera course on reproducibility by Johns Hopkins University to improve my own teaching. Week 1 gave a great introduction to why reproducible research is important, what literate statistical programming means, and which software is worth learning for your career.

The course Reproducible Research, taught by Roger Peng from Johns Hopkins University, is divided into four weeks. Week 1 covers an introduction to reproducibility concepts and shows how to set up and structure your data analysis in general.

Here’s a selection of points I found really helpful. This is my personal interpretation, so I recommend having a look at the original videos and slides. The screenshots below, which are taken directly from Peng’s teaching materials, illustrate my points.

What is reproducibility? What is replication?

Peng makes a clear distinction between:

replicating a study, which is to independently collect new data, run the analysis, and (hopefully) get to the same results, and

reproducing a study, which is to use the provided analytical data to re-run the exact same models and check if you get the same result.

Since many studies are time-consuming or expensive, we cannot always replicate published work. Therefore, reproducing work is a practical, feasible way to cross-check existing research. If you insist on a full replication of all work with new data, you might end up with ‘nothing,’ as indicated in Peng’s slide.

My take on this:
What Peng, a biostatistician, describes works for all social and natural sciences. In my own field, political science, there’s a discussion about ‘replication’ versus ‘duplication,’ ‘reanalysis’ or ‘reproduction.’ For many scholars, a first and simple step in checking published work is to use the data set provided by the original author to see if the results can be duplicated or reproduced (King 2003, 98). This way, you can uncover errors in the data set, faulty coding procedures, or other issues with the variable construction, in order to achieve “reliability in research results” (King 2003, 99). Reanalyzing work based on the same data is therefore an important initial step.

Roger Peng calls this ‘reproduction,’ Gary King calls it ‘duplication,’ and others have called it ‘reanalysis.’

There is a second step that further advances knowledge.

You can replicate the results using different, independently collected data, which goes beyond duplication. “If different, independently collected data are used to study the same problem, the reanalysis is called a replication.” (Herrnson 1995, 452).

The above is clearer in theory than in actual research and teaching. When I see published studies that cross-check previous work, they do not always call the different steps in their analysis ‘duplication’ or ‘replication.’ Most often, an article questioning previous work is simply called a ‘replication study,’ whether it uses the same data and looks at different subsets (e.g. other time periods within the original data) or collects new data.

By the way, the distinction between using the original versus newly collected data resembles the difference between a ‘direct’ and a ‘conceptual’ replication in psychology (Schmidt 2009). Direct replication is the repetition of an experiment using exactly the same procedure, i.e. same set-up and statistical analysis. A conceptual replication would test the same hypothesis as the original work, but use a different experimental set-up, data and methods.

To sum up, what Peng describes as replication versus reproduction of a study is the same as replication versus duplication in political science, and conceptual versus direct replication in psychology.

Reproducibility in the research process

Another point Peng illustrates really well is an overview of what goes into a project, and how much of that we can see. The research process usually looks like the one in the slide below: we only see a small portion of the work in a published article, while data collection, analysis, statistical code, etc. are ‘invisible.’ When you want to reproduce a published article, you have to go backwards through the process (arrow at the bottom).

Literate statistical programming

To help others reproduce your work, you have to provide the analytical data, the code, and documentation for both. Analytical data are the data used for the models (not the raw data sets, which are usually not cleaned and much larger). Ideally, in literate statistical programming, you would provide all of that in one ‘go’ instead of in separate files. For example, it might not be very transparent to provide R code, tables, the paper, and documentation files separately. Literate statistical programming means providing a stream of text and code, ideally woven into one ‘human-readable document’ (this is a question in the quiz for week 1). You then provide a file that lets another researcher run your code and read the comments and interpretations at the same time.

Software to achieve that goal includes Sweave (whose output is a PDF) and knitr (which works with LaTeX, R Markdown, and HTML as output formats).
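To make this concrete, here is a minimal sketch of what such a ‘human-readable document’ could look like as an R Markdown file for knitr. The title, model, and data set choice are my own example, not taken from the course materials:

````markdown
---
title: "A minimal literate analysis"
output: html_document
---

Heavier cars tend to use more fuel. We fit a simple linear model
on the built-in `mtcars` data set to check this.

```{r model}
# Regress miles per gallon on car weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients
```

The coefficient table above is generated when the file is knitted,
so the text, the code, and the results travel together in one file.
````

Rendering this file (for instance with rmarkdown, which calls knitr) produces a single HTML page containing the prose, the code, and its output, so another researcher can read and re-run the analysis in one place.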

How the online course works

The videos for the first week were a nice introduction to reproducibility. It took me an hour or two to watch them on my phone, and another 20 minutes to go through the weekly quiz with multiple choice questions. I admit it took me 3 attempts to get the full score, and I blame the early hours for that. I should have had breakfast first.

I like that the course material is presented in several short videos rather than one long session. I went through it wherever I had a bit of time on my hands. Slides are online as well (on the desktop version; on the iPhone app I could only click on the videos).

The online discussion forum is great. Students who signed up for the course have already started sharing ideas and helping each other. It looks like a nice little community of people interested in reproducibility.