Managing research data right, from the start

I talk a lot on this blog about one of my big personal interests, data management, but I’m always excited to have an excuse to discuss another interest of mine, university data policy. Today’s excuse to delve into policy comes from one of my data-policy-research collaborators, who sent me a data story so thorny that I just had to discuss it here on the blog.

The case involves a prominent Alzheimer’s researcher, Paul Aisen, who ran the Alzheimer’s Disease Cooperative Study at UC San Diego and just took a job at a new Alzheimer’s center run by USC. Aisen is taking 8 staff members with him to the new center, plus his National Institute on Aging grant and its corresponding data. Unfortunately, UC San Diego says that Aisen does not have permission to transfer these grant resources to USC. The data is a particularly sticky issue here, as UC San Diego is alleging that the researcher transferred the data to an Amazon server and won’t share the password with UC San Diego administrators. The result is that UC San Diego – or more specifically, the UC System Regents – is now suing both Aisen and USC over the money and the data.

There are a few issues going on in this case that are worth discussing. First, can the researcher take the grant to another institution? Second, who owns the data? Third, can the researcher take the data to another institution?

The first issue involves grant administration. The news article about this lawsuit states that “university declined to let [Aisen] keep the associated government funding.” UC San Diego likely has some authority to do this as grants are usually given to universities to administer on behalf of the researcher and not directly to the researchers themselves. So while most institutions allow researchers to transfer grants when they move jobs, it’s not necessarily a given – especially where funding covers a whole center rather than a single research group.

The other key issue here is the fact that the university owns the research data. This is something that many researchers are uncomfortable with but is often a routine part of doing research at a university; it’s akin to the university claiming patent rights. That said, individual researchers usually get to make almost all decisions about the data (in their capacity as data stewards) and should expect something in return for this deal. Namely, universities should take their ownership claim seriously and devote enough university resources to the care and maintenance of “their” data.

However, just because the university owns the data doesn’t mean a researcher doesn’t have rights to the data when he/she leaves the university. PIs at UC schools are allowed to take a copy of the data with them but can’t take the original without written permission from their Vice Chancellor for Research (this presumes that the data is not “tangible research material”, which the researcher cannot remove at all without written permission). So at the very least, university system policy states that Aisen cannot prevent UC San Diego from accessing and maintaining the master copy of the data. On the flip side, Aisen should be able to take some data with him to USC but it would only be a copy of the data for which he was listed as PI on the grant and not the whole study dataset, which dates back to 1991.

So without knowing the specifics of the case, I would say that UC San Diego seems to have a good claim to the data. This directly results from having a clear data policy.

My own research has found that such university data policies are becoming more common but are far from ubiquitous. While these policies do provide important clarity, anecdotal evidence – like this story – suggests that universities are mainly leveraging these policies when significant amounts of money or prestige are involved. I think that’s a shame because such policies can be very helpful for data decision making.

I’m looking forward to hearing more details about the case and going beyond my personal speculation to see how things are resolved. In the meantime, it makes for another good story to share on the importance of clear data policy.

I’ve been so busy talking about documentation on the blog recently that I’ve forgotten to share an awesome project that I’ve been working on: the data management video series!

Over the course of the last semester, I worked with an intern to create a series of 10 data management videos. The videos cover a range of topics and are all available on YouTube, so not only can you watch them whenever but you are also free to embed them on other webpages. I’m all for sharing content and, while these videos were predominantly made for researchers at my university, the more researchers who learn this stuff the better.

I’ve been talking a lot about documentation on this blog over the last few months but there is definitely one more issue I need to address before we move on to other topics: taking better notes. Taking better notes is really at the heart of improving your documentation because this is the main way that researchers document their work.

To review, having sufficient documentation is central to making your data usable and reusable. If you don’t write things down, you’re likely to forget important details over time and not be able to interpret a dataset. This is most apparent for data that needs to be used a year or more after collection, but can also impact the usability of data you acquired last week. In short, you need to know the context of your research data – such as sample information, protocol used, collection method, etc. – in order to use it properly.

All of this context starts with the information you record while collecting data. And for most researchers, this means taking better notes.

Most scientists learn to take good notes in school, but it’s always worth having a refresher on this important skill. Good research notes are the following:

Clear and concise

Legible

Well organized

Easy to follow

Reproducible by someone “skilled in the art”

Transparent

Basically, someone should be able to pick up your notes and tell what you did without asking you for more information.

The problem a lot of people run into is not recording enough information. If you read laboratory notebook guidelines (which were established to help prove patents), they actually say that you should record any and all information relating to your research in your notebook. That includes research ideas, data, when and where you spoke about your research, references to the literature, etc. The more you record in your notebook, the easier it is to follow your train of thought.

I would also recommend employing headers, tables, and any other tool that helps you avoid having a solid block of text. These methods can not only help you better organize your information, but make it easier for you to scan through everything later. And don’t forget to record the units on any measurements!

Overall, there is no silver bullet to make your notes better. Rather, you should focus on taking thorough notes and practice good note-taking skills. It also helps to have another person look over your notes and give you feedback for clarity. Use whatever methods work best for you so long as you are taking complete notes.

Research notebooks have been used for hundreds of years. We can still refer to Michael Faraday’s meticulous notes or read Charles Darwin’s observations that led to the theory of evolution. These documents show that handwritten research notes have been and will continue to be useful. But to get the most out of your research notes, you need to start by taking better notes.

I challenge you this month to think about your research notes and work to take clearer, more consistent, and more thorough notes. Your ultimate goal is to make sure you have all of the documentation you need for whenever you use your data.

The likely retraction (currently an expression of concern) in question concerns a study published in Science last year looking at the effect of canvassing on changing people’s minds. Study participants took pre- and post-canvassing online surveys to judge the effect of canvassing on changing opinions. While the canvassing data appears to be real, it looks like the study’s first author, Michael LaCour, made up data for the online surveys.

The fact of the faked data is remarkable enough, but what particularly interests me is how it was discovered. Two graduate students at UC-Berkeley, David Broockman and Joshua Kalla, were interested in extending the study but had trouble reproducing the original study’s high response rate. Upon contacting the agency that supposedly conducted the surveys, they were told that the agency did not actually run or have knowledge of the pre- and post-tests. Evidence of misconduct mounted when Broockman and Kalla were able to access the original data from another researcher who posted it in compliance with a journal’s open data policy. They found anomalies once they started digging into the data.

In my work, I talk a lot about the Reinhart and Rogoff debacle from two years ago where a researcher gaining access to the article’s data led to the fall of one of the central papers supporting economic austerity practices. We’re seeing a similar result here with the LaCour study. But in this case, problems arose due to a common practice in research: using someone else’s study as a starting point for your own study. Building from previous work is a central part of research and bad studies have problematic downstream effects. Unfortunately, such studies aren’t easy to spot without digging into the data, which often isn’t available.

There’s an expression that goes “pictures or it didn’t happen,” suggesting that an event didn’t actually take place unless there is photographic proof. I think this expression needs to be coopted for research to be “data or it didn’t happen.” Unless you can show me the data, how do I know that you actually did the research and did it correctly?

I’m not saying that all research is bad, just that we need regular access to data if we’re going to be able to do research well. We can’t build a house on a shaky foundation and without examining the foundation (data) in more detail, how will we find the problems or build the house well?

So next time you publish an article, share the data that support that article. Because remember, data or it didn’t happen.

In my last post, I discussed my philosophy on documentation in that most researchers need to take better notes and augment them with a few key types of documentation, as needed. I’ve already blogged about a few of these special documentation types – data dictionaries, README.txt files, and e-lab notebooks – but one structure we haven’t examined here is templates. Let’s correct that now.

Templates are one of my favorite recommendations for adding structure to research notes and making sure that you’ve recorded all of the necessary information. They coopt the benefits of a formal metadata schema – making documentation easy to search across, helping you record all essential information, and providing consistency – without all of the fiddliness or rigidity. This makes templates much easier to adopt and use.

So how do templates work? Basically, you sit down at the start of data collection and make a list of all the information that you have to record each time you acquire a particular dataset. Then you use this as a checklist whenever you collect that type of data. That’s it.

You can use templates as a worksheet or just keep a print out by your computer or in the front of your research notebook, whatever works best for you. Basically, you just want to have the template around to remind you of what to record about your data.

Let’s look at an example. When I was a practicing chemist, there were a few critical pieces of information I needed to record every time I ran an experiment. This list included the following:

Date

Experiment

Scan number

Laser beam powers

Laser beam wavelengths

Sample concentration

Calibration factors, like timing and beam size

Using this list as a template, I would then record the necessary information every time I did an experiment. The result might look something like the following:

2010-06-05

UV pump/visible probe transient absorption spectroscopy

Scan #3

5 mW UV, visible beam is too weak to measure accurately

266 nm UV, ~400-1000 nm visible

5 mMol trans-stilbene in hexane

UV beam is 4 microns, visible beam is 3 microns

Basically, the list is a memory aid to make sure my notes include everything they should for any given experiment. And I could even use different templates for different types of experiments to be more thorough.
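If you keep your notes electronically, the same checklist idea can be sketched in a few lines of Python. Treat this as a rough illustration rather than a prescribed tool: the field names are just the list items above turned into labels, and the helper functions `missing_fields` and `blank_worksheet` are hypothetical names invented for this example.

```python
# Template: the checklist items from the example above, as field labels.
# These labels are illustrative assumptions, not a standard schema.
TEMPLATE_FIELDS = [
    "date",
    "experiment",
    "scan_number",
    "laser_beam_powers",
    "laser_beam_wavelengths",
    "sample_concentration",
    "calibration_factors",
]


def missing_fields(entry, template=TEMPLATE_FIELDS):
    """Return the template fields that are absent or blank in a notes entry."""
    missing = []
    for field in template:
        value = entry.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def blank_worksheet(template=TEMPLATE_FIELDS):
    """Render the template as a plain-text worksheet to print or paste into notes."""
    lines = [f"{field.replace('_', ' ').capitalize()}: " for field in template]
    return "\n".join(lines)


# The example entry from above, with the calibration factors
# deliberately left out to show what the check catches.
entry = {
    "date": "2010-06-05",
    "experiment": "UV pump/visible probe transient absorption spectroscopy",
    "scan_number": "3",
    "laser_beam_powers": "5 mW UV, visible beam is too weak to measure accurately",
    "laser_beam_wavelengths": "266 nm UV, ~400-1000 nm visible",
    "sample_concentration": "5 mMol trans-stilbene in hexane",
}

print(missing_fields(entry))  # ['calibration_factors']
```

The check only tells you that something was written for each item, not that it was written well, so it works as a reminder in the same way the paper checklist does. A different list of fields could be passed in for each type of experiment.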

Remembering to record the necessary details is the biggest benefit of using a template, as this is an easy mistake to make in documentation. Templates can also help you sort through handwritten notes if you always put the same information in the same place on a notebook page. Basically, templates are a way to add consistency to often chaotic research notes.

I challenge you to try out a template or two and see if they help you record better notes. Because, as I’ve said before, research data without documentation are useless and, honestly, having insufficient documentation can be just as frustrating. So make your data better by using a template!

The panel itself was entitled “Beyond Metadata” and I spoke about different methods for teaching documentation types other than metadata. I was particularly excited to be on this panel because I think that librarians’ love of metadata doesn’t always translate into what’s needed in the laboratory. So even though your funder may ask in a data management plan for the metadata schema you plan to use, most of the time that’s not the documentation type you really need.

My general philosophy on research documentation is as follows:

Most researchers don’t need formal metadata schemas unless they have a big (time/size/collaborative) project to organize or are actively sharing their data.

Your first strategy for documentation should be to improve the research notes/lab notebook that you are likely already using.

That said, you can augment your notes strategically with documentation structures such as README.txt files, data dictionaries, and templates.

It’s actually this latter category of documentation types that you find me talking about a lot, as these are the ones that can really help but that many researchers do not know about.

There are plenty of good reasons to improve your documentation (including giving you the ability to reuse your own data, making sure you don’t lose important details, and being transparent for the sake of reproducibility), but we often don’t teach documentation to researchers beyond the basics. So here are a few resources I’ve created so you can learn to improve your documentation:

Looking over this list, I realize that there are a few gaps in the content of this blog when it comes to documentation practices. So look for future posts on templates and good note taking practices!

Research may yet get to the point where metadata is commonplace but we have many useful documentation structures to employ in the meantime. Research notes in particular have been used effectively for hundreds of years and will continue to be useful. In the end, you should use whatever documentation type works well for you and ensures that you record the best information you can about your data.