Visual Perception, Data Visualization, and Science

Minimum Expectations for Open Data in Research

Open data allows people to independently check a paper’s analysis or perform an altogether new analysis. It’s also a way of allowing future work to perform meta-analyses and to ask questions that weren’t asked in the original paper. Therefore, it’s important to make experiment data public, complete, and accessible for it to be useful to others.

But many missteps can happen that reduce the value of open data. These tips should help ensure that your data is indeed open, useful, and accessible.

Provide at least the minimum information

Experiment data should include an entry for each trial, usually as a row in a table. It shouldn’t be aggregated by condition or by subject. At a minimum, most experiments should include columns that identify the subject, the condition, the trial, and the response.
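As an illustration, a tidy trial-level file might look like the following. The column names are only examples, not a required schema:

```
subject,trial,condition,response,correct,response_time_ms
s01,1,control,left,1,432
s01,2,treatment,right,0,387
s02,1,control,left,1,415
```

Each row is one trial, so anyone can aggregate by subject, by condition, or not at all.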

Keep it tidy

If your data format is complicated, provide a copy of the data that is formatted in a way that’s easy to process and analyze. The simplified format should still include all of the data, and you should also provide the raw original data for transparency. In other words, don’t make others untangle arrays nested inside of JSON objects nested inside of CSV cells.
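A minimal sketch of the kind of untangling this avoids, assuming a hypothetical raw file where each row stuffs a JSON array of trials into a single CSV cell. The tidy version expands it to one flat record per trial:

```python
import csv
import io
import json

# Hypothetical raw file: the "responses" cell holds a JSON array of trials.
raw = io.StringIO(
    'subject,responses\n'
    's01,"[{""trial"": 1, ""rt"": 432}, {""trial"": 2, ""rt"": 387}]"\n'
)

tidy = []
for row in csv.DictReader(raw):
    # Parse the nested JSON and flatten it into one row per trial.
    for trial in json.loads(row["responses"]):
        tidy.append({"subject": row["subject"], **trial})

print(tidy)
```

Publishing the flat version spares every reader from rediscovering this parsing step.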

Use an accessible format

Use free and open formats. Stick to CSV when possible. JSON is ok if necessary. If a project really needs some other format, make sure there are clear instructions for reading it. No Microsoft Excel, unless you plan on buying a copy of the software for everyone.
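Part of CSV’s appeal is that reading it requires nothing beyond a language’s standard library. A small sketch, with an in-memory string standing in for a data file:

```python
import csv
import io

# No proprietary software needed: the standard library handles CSV directly.
data = io.StringIO("subject,condition,response\ns01,control,left\n")
rows = list(csv.DictReader(data))

print(rows[0]["response"])  # each row is a plain dict keyed by column name
```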

Common mistakes

Aggregating the data – Some people post a single data point per subject or per condition. Many assumptions are made when aggregating, so it’s critical to provide the raw, trial-level data rather than locking people into a particular approach to aggregation.
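To see why the aggregation choice matters, here is a sketch with illustrative response times containing one outlier. The mean and the median summarize the same trials very differently, and publishing only one of them hides that:

```python
# Illustrative response times (ms) for one subject in one condition.
rts = [410, 425, 431, 446, 2980]  # one outlier trial

mean = sum(rts) / len(rts)
median = sorted(rts)[len(rts) // 2]

print(mean, median)  # the two summaries disagree wildly
```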

Skipping the response variable – While it’s useful to know whether a response is correct, recording the actual response is more important in case there are concerns about how “correctness” was calculated.

Skipping the data dictionary – I know you think your column names make perfect sense. Well, they don’t. Make a text file with a very brief description of every column in your data.
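A data dictionary can be as simple as one line per column. The names below are illustrative:

```
data_dictionary.txt
subject        anonymized participant ID
trial          trial number within the session (1-indexed)
condition      which stimulus variant was shown
response       the key the participant pressed
response_time  milliseconds from stimulus onset to keypress
```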

Not putting the data URL in the paper – How is anyone supposed to know how to get the data unless you put it in the paper? Don’t make anyone email you! I recommend putting it in the abstract.

Failing to check text entries for identifying information – You never know what information people will type into a textbox. One strategy is to drop that column from the open data and make it available on request.
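A rough first-pass screen (an assumption, not a complete de-identification solution) is to flag free-text entries that contain email addresses before publishing, then review the flagged entries by hand:

```python
import re

# Crude email pattern; it only flags candidates for manual review.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

comments = [
    "the dots were hard to see",
    "contact me at jane.doe@example.com if you need more",
]

flagged = [c for c in comments if EMAIL.search(c)]
print(flagged)  # entries that still need a human look before release
```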

Your results are not too big

Open data repositories can handle your data. People manage to share huge results, from astrophysical data to fMRI volumes that vary over time for dozens of subjects. A CSV or JSON file that’s under 5GB easily fits on an open science repository such as OSF or figshare. For larger datasets, you can break the data into multiple files or use repositories like Data Dryad.
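Breaking a large CSV into multiple files can be sketched as follows, assuming a hypothetical chunk size and repeating the header in every chunk so each file stands alone:

```python
import csv
import io

def split_csv(lines, rows_per_file):
    """Split CSV lines into chunks of at most rows_per_file data rows,
    repeating the header at the top of each chunk."""
    reader = csv.reader(lines)
    header = next(reader)
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == rows_per_file:
            chunks.append([header] + current)
            current = []
    if current:
        chunks.append([header] + current)
    return chunks

# In-memory stand-in for a large data file.
data = io.StringIO("subject,trial\ns01,1\ns01,2\ns02,1\n")
parts = split_csv(data, rows_per_file=2)
print(len(parts))  # number of files to upload
```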

Most experiments fit on a couple of floppy disks and could be downloaded over a dial-up modem, so it was silly not to have open data in 1998, let alone 2018. We have a multitude of free, fast, and reliable research repositories, so there’s no excuse anymore.