
An R-chitecture for Reproducible Research/Reporting/Data Journalism

It’s all very well publishing a research paper that describes the method for, and results of, analysing a dataset in a particular way, or a news story that contains a visualisation of an open dataset, but how can you do so transparently and reproducibly? Wouldn’t it be handy if you could “View Source” on the report to see how the analysis was actually done, or how the visualisation was actually created from an original dataset? And furthermore, how about if the actual chart or analysis results were created directly as a result of executing the script that “documents” the process used?

As regular readers will know, I’ve been dabbling with R – and the RStudio environment – for some time, so here’s a quick review of how I think it might fit into a reproducible research, data journalism or even enterprise reporting process.

The first thing I want to introduce is one of my favourite apps at the moment, RStudio (source on GitHub). This cross-platform application provides a reasonably friendly environment for working with R. More interestingly, it integrates with several other applications:

RStudio offers support for the git version control system. This means you can save R projects and their associated files to a local, git-controlled directory, as well as managing the synchronisation of the local directory with a shared project on GitHub. Library support also makes it a breeze to load in R libraries directly from GitHub.

R/RStudio can pull in data from a wide variety of sources, mediated by a variety of community developed R libraries. So for example, CSV and XML files can be sourced from a local directory, or a URL; the RSQLite library provides an interface to SQLite; RJSONIO makes it easy to work with JSON files; wrappers also exist for many online APIs (twitteR for Twitter, for example, RGoogleAnalytics for Google Analytics, and so on).
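As a rough sketch of what that can look like in practice (the sample data, column names and table name are all made up for illustration, and the RJSONIO and RSQLite steps only run if those packages happen to be installed):

```r
# Illustrative data only: these names and values are not from a real dataset.
csv_text <- "name,score\nalice,3\nbob,5"

# read.csv() accepts a local file path, a URL or, as here, inline text.
scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# JSON via RJSONIO (skipped if the package is not installed).
if (requireNamespace("RJSONIO", quietly = TRUE)) {
  json_row <- RJSONIO::fromJSON('{"name": "carol", "score": 4}')
  print(json_row[["name"]])
}

# SQLite via DBI + RSQLite (skipped if either package is not installed).
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")  # throwaway in-memory database
  DBI::dbWriteTable(con, "scores", scores)               # hypothetical table name
  high <- DBI::dbGetQuery(con, "SELECT name FROM scores WHERE score > 4")
  DBI::dbDisconnect(con)
  print(high)
}
```

Whatever the source, the data tends to end up in a data frame, which is what lets the rest of the workflow stay the same.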

RStudio provides built-in support for two “literate programming” style workflows. Sweave allows you to embed R scripts in LaTeX documents and then compile the documents to a final PDF that includes the outputs from/results of executing the embedded scripts. (So if the script produces a table of statistical results based on an analysis of an imported data set, the results table will appear in the final document. If the script is used to generate a chart, the chart image will appear in the final document.) The raw script “source code” that is executed by Sweave can also be embedded explicitly in the final PDF, so you can see the exact script that was used to create the reported output (stats tables of results, or chart images, etc). If writing LaTeX is not really your thing, RMarkdown allows you to write Markdown documents and again embed executable R code, along with any outputs directly derived from executing that code. Using the knitr library, the RMarkdown+embedded R code can be processed to produce an HTML output bundle (an HTML page plus supporting files such as image files and javascript files). Note that if the R code uses something like the googleVis R library to generate interactive Google Visualisation Charts, knitr will package up the required code into the HTML bundle for you. And if you’d rather generate an HTML5 slidedeck from your RMarkdown, there’s always Slidify (eg check out Christopher Gandrud’s course “Introduction to Social Science Data Analysis” – Slidify: Things are coming together fast, example slides and “source code”).
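To make the two workflows a little more concrete, here’s a minimal sketch that writes out a tiny Sweave document and a tiny RMarkdown document and then processes each from R. The document contents, chunk names and numbers are invented for illustration. The Sweave step only gets as far as the intermediate .tex file (turning that into a PDF needs a LaTeX install), and the RMarkdown step only runs if knitr is installed, stopping at the intermediate Markdown file rather than the final HTML:

```r
# --- Sweave: LaTeX with an embedded R chunk (made-up example content) ---
rnw <- tempfile(fileext = ".Rnw")
tex <- tempfile(fileext = ".tex")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<summary, echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "The mean is \\Sexpr{mean(x)}.",
  "\\end{document}"
), rnw)

# Sweave() ships with R (utils package): it runs the chunk and writes a .tex
# file containing both the code (echo=TRUE) and its output.
utils::Sweave(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)

# --- RMarkdown: Markdown with an embedded R chunk ---
fence <- strrep("`", 3)  # chunk delimiter, built here to keep this listing tidy
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "A minimal report",
  "================",
  "",
  paste0(fence, "{r summary}"),
  "x <- c(2, 4, 6)",
  "mean(x)",
  fence,
  "",
  "The mean is `r mean(x)`."
), rmd)

# knit() executes the chunk and writes plain Markdown with the results
# inlined; converting that Markdown to HTML is a separate step.
if (requireNamespace("knitr", quietly = TRUE)) {
  md <- knitr::knit(rmd, output = tempfile(fileext = ".md"), quiet = TRUE)
  md_lines <- readLines(md)
}
```

In both cases the reported number is never typed into the document by hand: it is whatever the embedded code produces when the document is built, which is the whole point.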

In a recent addition, RStudio now integrates with RPubs.com, which means 1-click publishing of RMarkdown/knitr’d HTML to a hosted website is possible. Presumably, it wouldn’t be too hard to extend RStudio so that publication to other online environments could be supported. (Hmm, thinks… could RStudio support publication using GitHub Pages maybe, or something more general, such as SWORD/Atom Publishing?!) Other publication routes have also been demonstrated – for example, here’s a recipe for publishing to WordPress from R.

Oh, and did I mention that as well as running cross-platform on the desktop, RStudio can also be run as a service and accessed via a web browser? So for example, I can log into a version of RStudio running on one of OU/KMi’s servers and access it through my browser…

Here’s a quick doodle of how I see some of the pieces hanging together. I had intended to work on this a little more, but I’ve just noticed the day’s nearly over, and I’m starting to flag… But as I might not get a chance to work on this any more for a few days, here it is anyway…

PS I guess I should really have written and rendered the above diagram using R, and done a bit of dogfooding by writing this post in RMarkdown to demonstrate the process, but I didn’t… The graph was actually rendered from a .dot source file using Graphviz. Here’s the source, so if you want to change the model, you can… (I’ve also popped the script up as a gist):