Friday, March 4, 2016

On Open Data and its Benefits and Drawbacks

Open Access Button, an app designed to help people access scholarly research (and report when they're denied access), will be releasing a beta version of a browser add-on, the Open Data Button, tomorrow (March 5, as part of Open Data Day). The beta currently works with Chrome, and a Firefox add-on will be available soon. This handy button is designed to help people access the data behind scholarly research: if the data are available in the Open Science Framework, it will give you a link to them, and if they aren't available, it will start a request to the author to make them available.

This browser add-on is the latest development in the call for increased transparency in scholarly research. And secondary data analysis can make important contributions to the literature. Just as researchers shouldn't "reinvent the wheel" with data collection instruments - instead drawing on past work - it is a much better use of resources to reanalyze previously collected data that can answer your question than to go through the resource-intensive process of primary research. Secondary data can also be used in education, such as in statistics courses. When I took structural equation modeling in grad school, our final project was supposed to use secondary data (which could have been our own data if we had something that would work), so that we could jump right into practicing analyses with data, instead of spending time collecting the data first.

Despite all the potential benefits of open data, researchers are still hesitant to make their data freely available. When I was teaching research methods, I contacted the authors of recent articles I had assigned as course reading - my students had asked for more review and experience with statistics, and I thought analyzing the data in real time while they had the article with the aggregated results in front of them would help solidify these concepts. I was very clear in my emails that I was requesting the data purely for a classroom exercise, and had no intention to conduct additional analyses or publish anything I found in the data. I heard back from two researchers, who both said no and were quite defensive about it. One even claimed to no longer have the data, despite the fact that the study had been published that year (to be fair, there can be a large gap between completing the research and getting the work published). The rest never responded.

Why might these researchers have been so against the idea? The jaded response is the fear that we might not confirm the published results, which could mean anything from simple accidents (typos in data tables) to purposeful fraud. Of course, there are other perfectly understandable explanations:

The researchers have additional papers planned with the data that are in (or soon to be in) progress - There's nothing worse than getting scooped: coming up with a great idea to discover someone else got there first. Imagine how much worse it would be if they scooped you with your own data that you painstakingly collected.

Releasing the data could risk participant confidentiality - Obviously, freely available data should never contain identifiers, like name or address. In fact, for most of our research at work, anything considered identifiable information (based on HIPAA) is stored in a separate crosswalk file, which is linked to the data by a randomly assigned ID number. However, research has shown that even some basic demographic data can make participants identifiable. For instance, a study using 2000 census data found that 63% of the US population could be uniquely identified by gender, zip code, and full date of birth. Coarsening that information to year (or year and month) of birth, and/or county instead of zip code, drastically decreases the proportion of identifiable participants. Studies done on a much smaller scale than the census, and in more specific populations or organizations, could also introduce risks to participant confidentiality. While researchers can, and should, clean their datasets before making them public to remove any sensitive variables, it may not be initially clear how these variables, alone or in combination, make your sample potentially identifiable.

Making the data freely available eliminates the possibility of tracking usage/data use agreements - Even a researcher who is willing to share their data might want to know who is using it and how. Data use agreements could be used to prevent the potential for scooping mentioned above, by stipulating how the data can and cannot be used. And just as researchers are excited to see who is citing their work, they may also be excited to see who is using their data. Of course, this button would allow tracking to some extent, though only once the resulting research is published.
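
To make the confidentiality point above a bit more concrete, here is a minimal Python sketch of how one might check what share of a sample is unique on a set of demographic variables, before and after coarsening them. Everything in it is an assumption for illustration: the records are made up, and the names unique_fraction, coarsen, and ZIP_TO_COUNTY are hypothetical helpers, not part of any particular tool or of the census study mentioned above.

```python
from collections import Counter

# Toy, made-up records for illustration only (not real participants).
records = [
    {"gender": "F", "zip": "60601", "birth_date": "1980-03-14"},
    {"gender": "F", "zip": "60601", "birth_date": "1980-07-02"},
    {"gender": "M", "zip": "60602", "birth_date": "1975-11-30"},
    {"gender": "M", "zip": "60602", "birth_date": "1975-01-09"},
    {"gender": "F", "zip": "60614", "birth_date": "1980-12-25"},
]

# Hypothetical zip-to-county lookup; a real one would come from a reference file.
ZIP_TO_COUNTY = {"60601": "Cook", "60602": "Cook", "60614": "Cook"}


def unique_fraction(rows, keys):
    """Return the share of rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


def coarsen(row):
    """Replace zip code with county and full birth date with birth year."""
    return {
        "gender": row["gender"],
        "county": ZIP_TO_COUNTY[row["zip"]],
        "birth_year": row["birth_date"][:4],
    }


# Share of records that are unique on the detailed variables.
fine = unique_fraction(records, ["gender", "zip", "birth_date"])

# Share of records that are unique after coarsening.
coarse = unique_fraction(
    [coarsen(r) for r in records], ["gender", "county", "birth_year"]
)

print(f"Unique on gender + zip + birth date:    {fine:.0%}")
print(f"Unique on gender + county + birth year: {coarse:.0%}")
```

In this toy sample, every record is unique on the detailed variables and none is unique after coarsening. A real dataset will rarely be that clean, and a check like this only covers the variables you think to include, which is exactly why the risk can be hard to spot up front.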

Overall, I do want to be clear that I think open data is a good idea, though there are important situations where it simply isn't a possibility. I wonder if/how this could be used with qualitative data. Obviously, one possibility is simply to have redacted transcripts/fieldnotes available for download as a PDF or Word document. However, unlike quantitative research, where variables are clearly labeled as representing important constructs, qualitative data is by its nature unstructured, making it less obvious how a codebook was applied. And single lines of text can represent multiple constructs, depending on the particular coding approach a researcher used (that is, some approaches allow for simultaneous coding of the same text, while others do not). So alternatively, it might make sense to include reports from qualitative analysis software, where researchers can pull out all quotes/text they've classified within a particular code.
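
As a rough illustration of that last idea, here is a small Python sketch of a "quotes by code" report, where a single line of text can carry more than one code. This is not how any specific qualitative analysis package works; the segments, source labels, code names, and the report_by_code function are all invented for the example.

```python
from collections import defaultdict

# Hypothetical coded segments; sources, quotes, and code names are invented.
segments = [
    {
        "source": "interview_01",
        "text": "I never know who to ask for help.",
        "codes": ["barriers", "communication"],  # simultaneous coding
    },
    {
        "source": "interview_02",
        "text": "The new schedule works better for me.",
        "codes": ["facilitators"],
    },
    {
        "source": "interview_02",
        "text": "Nobody told us about the change.",
        "codes": ["communication"],
    },
]


def report_by_code(rows):
    """Group (source, text) pairs under every code applied to them."""
    report = defaultdict(list)
    for row in rows:
        for code in row["codes"]:
            report[code].append((row["source"], row["text"]))
    return dict(report)


report = report_by_code(segments)

# Print each code followed by the quotes classified under it.
for code in sorted(report):
    print(code)
    for source, text in report[code]:
        print(f"  [{source}] {text}")
```

Because the first quote is listed under both of its codes, someone reading the shared report can see how the codebook was applied, including where codes overlap.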

This is all just one researcher's opinion, and others will likely have very good reasons for a different opinion. What do you think about the open data movement?