Trying to find useful things to do with emerging technologies in open education and data journalism

How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay insofar as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it?

If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code, a means of importing the data used in the analysis, and the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can. (See here for an example.)

But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.

In the above linked example, the Rmd script links to a copy of a dataset I’d downloaded onto my own computer. But if I’d written a properly reusable, reproducible script, I should have done at least one of the following two things:

either added a local copy of the data to the repository and checked that the script correctly linked to it using a relative path;

and/or provided the original download link for the datafile (and the HTML web page on which the link could be found) and loaded the data in from that URL.

Where the license of a dataset allows sharing, the first option is always a possibility. But where the license does not allow onward sharing, the second approach provides a de facto way of sharing the data without actually sharing it directly yourself. I may not be giving you a copy of the data, but I am giving you some of the means by which you can obtain a copy of the data for yourself.
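The two options above can be combined in a single loader that prefers a copy stored alongside the script and otherwise fetches the file from its original download link. What follows is a minimal Python sketch rather than anything taken from the Rmd example: it assumes the dataset is a CSV file, and the function name load_dataset, the data/ directory layout and the example.org URL are all hypothetical.

```python
import csv
import io
import urllib.request
from pathlib import Path


def load_dataset(relative_path, source_url, base_dir=None):
    """Return the rows of a CSV dataset as a list of dicts.

    Prefers a local copy found relative to base_dir (option 1);
    otherwise downloads the file from source_url (option 2).
    """
    # Resolve the path relative to a known base directory rather than
    # hard-coding an absolute path from the author's own machine.
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    local_copy = base / relative_path

    if local_copy.exists():
        text = local_copy.read_text(encoding="utf-8")
    else:
        # No local copy was shipped (perhaps the license forbids it),
        # so fetch the data from the publisher's own download link.
        with urllib.request.urlopen(source_url) as response:
            text = response.read().decode("utf-8")

    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical usage; the URL below is a placeholder, not a real dataset:
# rows = load_dataset("data/example.csv", "https://example.org/data/example.csv")
```

Recording the web page on which the download link was found, for instance in a comment next to the URL, gives the reader somewhere to look if the link itself later moves.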

As well as getting round licensing requirements that limit sharing of a dataset but allow downloading of it for personal use, this approach can also be handy in other situations.

For example, where a dataset is available from a particular URL but authentication is required to access it. (This often needs a few more tweaks when trying to write the reusable downloader! A stop-gap is to provide the URL in the reproducible report document and explicitly instruct the reader to download the dataset locally using their own credentials, then load it in from the local copy.)
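The stop-gap can be written into the script itself: check for the reader's own download and, if it is missing, stop with instructions rather than a bare file error. A small Python sketch, in which the function name require_local_copy and the path and URL in the usage comment are placeholders:

```python
from pathlib import Path


def require_local_copy(local_path, download_page):
    """Return the path to a manually downloaded dataset, or explain how to get it."""
    path = Path(local_path)
    if not path.exists():
        # The script cannot log in on the reader's behalf, so tell them
        # where to go and where to save the file they download.
        raise FileNotFoundError(
            f"Dataset not found at '{path}'. Log in with your own credentials at "
            f"{download_page}, download the file, and save it as '{path}'."
        )
    return path


# Hypothetical usage; both arguments are placeholders:
# data_file = require_local_copy("data/secure.csv", "https://example.org/login")
```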

Or as Paul Bivand pointed out via Twitter, in situations “where data is secure like pupil database, so replication needs independent ethical clearance”. In a similar vein, we might add where data is commercial, and replication may be forbidden, or where additional costs may be incurred. And where the data includes personally identifiable information, such as data published under a DPA exemption as part of a public register, it may be easier all round not to publish your own copy or copies of data from such a register.

Sharing recipes also means you can share pathways to the inclusion of derived datasets, such as named entity tags extracted from a text using services that are free to use but restricted by a non-shareable (or at least, attributable) license key, such as the named entity extraction services operated by Thomson Reuters OpenCalais, Microsoft Cognitive Services, IBM Alchemy or Associated Press. That is, rather than tagging your dataset and then sharing and analysing the tagged data, publish a recipe that will allow a third party to tag the original dataset themselves and then analyse it.
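A recipe of this kind might look something like the following Python sketch. It deliberately does not call any real tagging service: extract_entities is a hypothetical stand-in for whichever service client the reader has signed up for, and the environment variable name NER_API_KEY is an assumption too. The point is that the key is supplied by whoever runs the recipe and is never published along with it.

```python
import os


def tag_documents(documents, extract_entities, key_variable="NER_API_KEY"):
    """Tag each text using the reader's own license key.

    extract_entities(text, api_key) is supplied by the reader and wraps
    whichever tagging service they are licensed to use; it should return
    a list of entity strings.
    """
    # The key is read from the reader's environment, so the published
    # recipe never contains anybody's credentials.
    api_key = os.environ.get(key_variable)
    if not api_key:
        raise RuntimeError(
            f"Set the {key_variable} environment variable to your own license key."
        )
    return [
        {"text": text, "entities": extract_entities(text, api_key)}
        for text in documents
    ]
```

The analysis code can then be published to run over the output of tag_documents, so a third party regenerates the tagged data for themselves rather than receiving a copy of it.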

“Reproducibility is more than just the scripts and data, it includes the underlying operating system, and software versions! However, it’s definitely a variable definition, because people might choose to operate along different points of the reproducibility dimension. For the way I like to define it, for an analysis or tool to be reproducible, I must provide the entire environment (operating system, software, correct versions, and if possible, data) inside some kind of container. That may be a Docker or Singularity (singularity.lbl.gov) container, or a virtual machine. It really should all work, seamlessly, at the push of a button / start of a container without needing to think about “Do I have R installed? Is this to be run on Windows or Linux or something else?” What really starts to make my head spin is thinking about the computer or power as a dependency, and that in even 20-30 years a lot of our current code and software will likely be un-usable (does your computer have a floppy disk drive still, anyone? A zip drive?) and most definitely a lot of the URLs that host data will be gone. Maybe we won’t even use that uri anymore. Anyway, lots to think about. thanks for the good post!”

Agreed that reproducibility extends to other issues – I like things like MyBinder and Docker containers for that reason, as well as Vagrant-constructed VMs. The road is long, though, to persuading folk to share reproducible stuff: data, code, environment info, container build scripts, etc.

It’s easier if they work in these contexts from the outset, which is why I think trying to get postgrad research students to adopt approaches like Docker is important – if they start their research career using reproducible environments, then it’s more likely they’ll continue with the practice.