Investigating Docker and R

Docker and R: How are they used and could they be used together?
That is the question that we regularly ask ourselves. And we try to keep up with other people’s work! In this post, we are going to share our insights with you.

Dockerizing R

Several implementations of R besides the one by R-core exist today, together with numerous integrations into open source and proprietary software (cf. English and German Wikipedia pages). In the following we present the existing efforts for using open source R implementations with Docker.

Rocker

The most prominent effort in this area is the Rocker project. It was initiated by Dirk Eddelbuettel and Carl Boettiger and containerizes the main R implementation. For an introduction, you may read their blog post here or follow this tutorial from rOpenSci.

With a large choice of pre-built Docker images, Rocker provides ready-made solutions for those who want to run R from Docker containers. Explore it on GitHub or Docker Hub, and you will soon find out that it takes just one single command to run instances of either base R, R-devel or RStudio Server. Moreover, you can run specific versions of R or use one of the many bundles with commonly used R packages and other software (namely tidyverse and rOpenSci).
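
To make the "one single command" concrete, here is a minimal sketch. The helper functions rocker_r_cmd and rocker_rstudio_cmd are hypothetical names introduced for this post, and they only print the docker command instead of running it, so you can inspect it first. The sketch assumes the rocker/r-ver image for versioned R, and the port 8787 and PASSWORD variable used by rocker/rstudio; the tag 3.4.0 and the password are just example values.

```shell
#!/bin/sh
# Helpers that print (not run) the docker command for a Rocker image.
# The function names are made up for this sketch.

# Interactive R session in a versioned image; the tag defaults to "latest".
rocker_r_cmd() {
  echo "docker run --rm -ti rocker/r-ver:${1:-latest}"
}

# RStudio Server in the background, published on a local port (default 8787).
# $1 is the login password, $2 the optional host port.
rocker_rstudio_cmd() {
  echo "docker run --rm -d -p ${2:-8787}:8787 -e PASSWORD=$1 rocker/rstudio"
}

rocker_r_cmd 3.4.0
rocker_rstudio_cmd changeme
```

Copy a printed line into your terminal to actually start the container. For RStudio Server, you would then open the chosen port on localhost in a browser.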

Images are built monthly on Docker Hub, except devel tags, which are built nightly. Automated builds are disabled; instead, builds are triggered by CRON jobs running on a third-party server (cf. GitHub comment).

Bioconductor

If you come from Bioinformatics or neighboring disciplines, you might be delighted that Bioconductor provides several images based on Rocker’s rocker/rstudio images. See the help page, GitHub, and Open Hub for more information. In short, the Bioconductor core team maintains release and devel images (e.g. bioconductor/release_base2), and contributors maintain images with different levels of pre-installed packages (each in release and devel variants), which are based on Bioconductor views (e.g. bioconductor/devel_proteomics2 installs the views Proteomics and MassSpectrometryData).

Image updates occur with each Bioconductor release, except for the devel images, which are built weekly with the latest versions of R and Bioconductor based on rocker/rstudio-daily.

MRO

Microsoft R Open (MRO) is an “enhanced R distribution”, formerly known as Revolution R Open (RRO) before Revolution Analytics was acquired by Microsoft. MRO is compatible with main R and its packages. “It includes additional capabilities for improved performance, reproducibility, and platform support.” (source); most notably these are the MRAN repository a.k.a. CRAN Time Machine, which is also used by versioned Rocker images, and the (optional) integration with Intel® Math Kernel Library (MKL) for multi-threaded performance in linear algebra operations (BLAS and LAPACK).

o2r team member Daniel created a Docker image for MRO including MKL. It is available on Docker Hub as nuest/mro, with Dockerfile on GitHub.
It is inspired by the Rocker images and can be used in the same fashion. Please note the extended licenses printed at every startup for MKL.

Renjin

Renjin is a JVM-based interpreter for the R language for statistical computing developed by BeDataDriven. It was developed for big data analysis using existing R code seamlessly in cloud infrastructures, and allows Java/Scala developers to easily combine R with all benefits of Java and the JVM.

While it is not primarily built for interactive use on the command line, this is possible. So o2r team member Daniel created a Docker image for Renjin for you to try it out. It is available on Docker Hub as nuest/renjin, with Dockerfile on GitHub.

pqR

pqR tries to create “a pretty quick version of R” and to fix some perceived issues in the R language. While this is a one-man project by Radford Neal, it’s worth trying out such contributions to the open source community and to the discussion on what R should look like in the future (cf. a recent presentation), even if things might get personal. As you might have guessed by now, Daniel created a Docker image for you to try out pqR: It is available on Docker Hub as nuest/pqr, with Dockerfile on GitHub.

[WIP] FastR

Also targeting performance, FastR “is an implementation of the R Language in Java atop Truffle, a framework for building self-optimizing AST interpreters.” FastR is planned as a drop-in replacement for R, but relevant limitations apply.

While GraalVM has a Docker Hub user, no images are published, probably because of licensing requirements. This can be seen in the GitHub repository oracle/docker-images, where users must manually download a GraalVM release, which requires an Oracle account. The current tests available in this GitHub repository therefore try to build FastR from source based on the newest OpenJDK Java 9.

Dockerizing Research and Development Environments

So why, apart from the incredibly easy usage, adoption and transfer of typical R environments, would you want to combine R with Docker?

Ben Marwick, Associate Professor at the University of Washington, explains in this presentation that it helps you manage dependencies. It gives you a computational environment that is isolated from the host, and at the same time transparent, portable, extendable and reusable. Marwick uses Docker and R for reproducible research and thus bundles his work into a kind of research compendium; an instance is available here, and a template here.

A new solution to ease the creation of Docker containers for specific research environments is containerit.
It creates Dockerfiles (using Rocker base images) from R sessions, R scripts, R Markdown files or R workspace directories, including the required system dependencies.
The package was presented at useR!2017 and can currently only be installed from GitHub.

While Docker is made for running tools and services, and providing user interfaces via web protocols (e.g. via a local port and a website opened in a browser, as with rocker/rstudio or Jupyter Notebook images), several efforts exist that try to package GUI applications in containers. Daniel explores some alternatives for running RStudio in this GitHub repository, just for the fun of it. In this particular case it may not be very sensible, because RStudio Desktop is already effectively a browser-based UI (unlike other GUI-based apps packaged this way), but for users who are reluctant to use a browser UI and/or command line interfaces, the “Desktop in a container” approach might be useful.

Running Tests

The package dockertest makes use of the isolated environment that Docker provides: R programmers can set up test environments for their R packages and R projects, in which they can rapidly test their work in Docker containers that only contain R and the relevant dependencies. All of this happens without cluttering the development environment.
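
The underlying idea can be sketched with plain Docker, without dockertest itself. The following is not how dockertest works internally; it is only a hypothetical helper (pkg_check_cmd is a made-up name) that prints a command for running R CMD check on a package directory inside a throwaway Rocker container. It assumes that the package’s dependencies are already installed in the chosen image and that the path contains no spaces.

```shell
#!/bin/sh
# Print (not run) a docker command that checks an R package in a container.
# pkg_check_cmd is a hypothetical helper, not part of dockertest.
# $1: absolute path to the package source directory on the host
# $2: optional image name (defaults to rocker/r-ver)
pkg_check_cmd() {
  pkg_dir="$1"
  image="${2:-rocker/r-ver}"
  # The source is mounted at /pkg; check output is written to /tmp inside
  # the container and discarded with it (--rm), so the host stays clean.
  echo "docker run --rm -v ${pkg_dir}:/pkg -w /tmp ${image} R CMD check --no-manual /pkg"
}

pkg_check_cmd "$PWD"
```

Because the check runs in the container’s /tmp, no .Rcheck directory ends up next to your sources on the host.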

The package gitlabr does not use Docker itself, but wraps the GitLab API in R functions for easy usage. This includes starting continuous integration (CI) tests (function gl_ci_job), which GitLab can do using Docker, so the function has an argument image to select the image used to run a CI task.

In a completely different vein but still in the testing context, sanitizers is an R package for testing the compiler setup across different compiler versions to detect code failures in sample code. This allows testing completely different environments on the same host, without touching the well-kept development environment on the host. The package’s images are now deprecated and superseded by Rocker images (rocker/r-devel-san and rocker/r-devel-ubsan-clang).

Dockerizing Documents and Workflows

Some works are dedicated to dockerizing R-based documents.

The package liftr (on CRAN) for R lets users enhance Rmd files with YAML-metadata (example), which enables rendering R Markdown documents in Docker containers. Unlike containerit, this metadata must be written by the author of the R Markdown document.

liftr is used in the DockFlow initiative to containerize a selection of Bioconductor workflows as presented in this poster at BioC 2017 conference.
liftr also supports Rabix, a Docker-based toolkit for portable bioinformatics workflows. That means that users can have Rabix workflows run inside the container and have the results integrated directly into the final document.

The Bioconductor package sevenbridges (see also above) has a vignette on creating reproducible reports with Docker. It recommends creating a reproducible script or report with docopt or R Markdown (parametrized reports), respectively.
The cloud-based Seven Bridges platform can fulfill requirements, such as required Docker images, within their internal JSON-based workflow and “Tool” description format (example), for which the package provides helper functions to create Tools and execute them, see this example in a vignette. Docker images are used for local testing of these workflows based on Rabix (see above), where images are started automatically in the background for a user, who only uses R functions. Automated builds for workflows on Docker Hub are also encouraged.

Control Docker Containers from R

Rather than running R inside Docker containers, it can be beneficial to call Docker containers from inside R. This is what the packages RSelenium and googleComputeEngineR do.

Selenium provides tools for browser automation, which are also available as Docker images. They can be used, among other things, for testing web applications or controlling a headless web browser from your favorite programming language. In this tutorial, you can see how and why you should use RSelenium to interact with your Selenium containers.

googleComputeEngineR provides an R interface to the Google Cloud Compute Engine API. It includes a function called docker_run that starts a Docker container in a Google Cloud VM and executes R code in it. Read this article for details and examples. There are similar ambitions to implement Docker capabilities in the analogsea package that interfaces the Digital Ocean API.

googleComputeEngineR and analogsea use functions from the harbor package for R (only available via GitHub). It may be used to control Docker containers that run either locally or remotely.

A more recent alternative to harbor is the package docker, also available on CRAN with source code on GitHub. Using a DRY approach, it provides a thin layer to the Docker API using the Docker SDK for Python via the package reticulate. The package is best suited for experienced Docker users, i.e. those who know the Docker commands and life cycle. Thanks to the abstraction layer provided by the Docker SDK for Python, docker also runs on various operating systems (including Windows).

R and Docker for Complex Web Applications

Docker, in general, may help you to build complex and scalable web applications with R.

Mark McCahill presented at an event at Duke University in North Carolina (USA) how he provided 300+ students each with private RStudio Server instances. In his presentation (PDF / MOV (398 MB)), he explains his RStudio farm in detail.

The platform R-hub helps R developers solve package issues before they submit their packages to CRAN. In particular, it provides services that build packages on all CRAN-supported platforms and check them against the latest R release. The services utilize backends that perform regular R builds inside of Docker containers. Read the project proposal for details.

The package plumber (website, repository) allows creating web services/HTTP APIs in pure R. The maintainer provides a ready-to-use Docker image, trestletech/plumber, to run/host these applications, along with excellent documentation that covers topics such as multiple images under one port and load balancing.
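
As a rough sketch of how hosting an API with this image might look: the hypothetical helper plumber_cmd below prints a docker command that mounts a local plumber file into the container and publishes the API on a local port. It assumes, following the pattern we recall from the plumber documentation, that the image serves on port 8000 and takes the path of the plumber file as its argument. Please check the current documentation before relying on it.

```shell
#!/bin/sh
# Print (not run) a docker command that serves a plumber API file.
# plumber_cmd is a hypothetical helper introduced for this sketch.
# $1: absolute path to the plumber R file on the host (no spaces assumed)
# $2: optional host port (defaults to 8000)
plumber_cmd() {
  api_file="$1"
  port="${2:-8000}"
  # Assumption: the container listens on port 8000 and expects the
  # path to the API definition as its command argument.
  echo "docker run --rm -p ${port}:8000 -v ${api_file}:/plumber.R trestletech/plumber /plumber.R"
}

plumber_cmd "$PWD/plumber.R"
```

After running the printed command, the endpoints defined in the file should be reachable on the chosen local port.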