GitLab is a great platform for active, ongoing, collaborative research. It enables folks to work together easily and share that work in the open. This is especially poignant given the problems in sharing code in academia, across time and people.

It's no surprise that GitLab, a platform for collaborative coding and Git repository hosting, has features for reproducibility that researchers can leverage for their own and their communities’ benefit.

What exactly is reproducibility?

Reproducibility is a core component in a variety of work, from software engineering to research. For software engineers, the ability to reproduce errors or functionality is key to development. For researchers, reproducibility is about independent verification of results/methods, to build on top of previous work, and to increase the impact, visibility, and quality of research. Y’know. That Sir Isaac Newton quote in every reproducibility presentation ever: "If I have seen further, it is by standing on the shoulders of giants."

Like all things, reproducibility exists on a spectrum. I like Stodden et al’s definitions from the 2013 ICERM report, so I’ll use those:

ICERM Report Definitions

Potential Real-World Examples

Reviewable Research: Sufficient detail for peer review and assessment

The code and data are openly available

Replicable Research: Tools are available to duplicate the author’s results using their data

The tools (software) used in the analysis are freely available for others to confirm results

Confirmable Research: Main conclusions can be attained independently without author’s software

Others can reach the conclusion using similar tools, not necessarily the same as the author, or on a different operating system

Auditable Research: Process and tools archived such that it can be defended later if necessary

The tools, environment, data, and code are put into a preservation-ready format

Open/Reproducible Research: Auditable research made openly available

Everything above is made available in a repository for others to examine and use

The last bullet there is the goal – open and reproducible research. Releasing code and data are key to open research, but not necessarily enough for reproducibility. This is where the concept of computational reproducibility becomes important, where whole environments are captured. You could also look at it this way:

How can GitLab help?

There are a few solutions out there, including containers (such as Docker or Singularity) for active research, and o2r and ReproZip for capturing and reproducing completed research. For this post, I’m going to focus on active research and containers.

I like GitLab for research reproducibility because it makes working together simple, and seamless. There’s no hacking together 100 different third-party services. GitLab has hosting, LFS, and integrated Continuous Integration for free, for both public and private repositories! Everything is integrated in a single GitLab repository which, if made publicly available, can enable secondary users to reproduce results in a more streamlined fashion. You can also keep these private to a group – you control the visibility of everything in one repository in one place, as opposed to updating permissions across multiple services.

There are a few key features that set GitLab apart when it comes to containers and reproducibility. The first is that GitLab doesn’t use a third-party service for continuous integration. It’s shipped with CI runners which can use Docker images from GitLab’s registry. Basically, you can use the Docker Container Registry, a secure, private Docker registry, to choose a container that GitLab CI uses to run each job in a separate and isolated container.

If you don’t feel like using the GitLab registry, you can also use images from DockerHub or a custom Docker container you’re already using locally. These can be integrated with GitLab CI, and if made public, any secondary users can use it as well!

Let's look at an example

This process is set up in a single file, a .gitlab-ci.yml. Another feature that makes my life easier – GitLab can syntax-check the CI config files! The .gitlab-ci.yml file describes the pipelines and stages, each of which has a different function and can have its own tags, produce its own artifacts, and reuse artifacts from other stages. These stages can also run in parallel if needed. Here’s an example of what a basic config file looks like with R:

It’s also worth noting that you can use different containers per step in your workflow, if you outline it in your .gitlab-ci.yml. If your data collection script runs in one environment but your analysis script needs another, that’s perfectly fine using GitLab, and others have the information to reproduce it easily! Another feature that puts GitLab apart is that a build of one project can trigger a build of another – AKA, multi-project pipelines. For those of you working with big data, you can automatically spin up and down VMs to make sure your builds get processed immediately with GitLab’s CI as well.

Here are some other great resources and examples of using GitLab to make research more reproducible:

Beyond reproducibility, there are a lot of features that make GitLab an ideal place for me to work and organize my research. I’d urge folks to look at the feature list and see how they can get started!

About the Guest Author

Vicky Steeves is the Librarian for Research Data Management and Reproducibility at New York University, a dual appointment between the Division of Libraries and Center for Data Science. In this role, she works supporting researchers in creating well-managed, high quality, and reproducible research through facilitating use of tools such as ReproZip. Her research centers on integrating reproducible practices into the research workflow, advocating openness in all facets of scholarship, and building/contributing to open infrastructure.