Friday, 19 October 2012

Recently I've had a few emails from people expressing concern about data licensing, especially when it comes to assigning DOIs to datasets so they can be formally cited. The assumption seems to be that if a dataset has a DOI assigned to it, the data must therefore be open. This isn't the case.

Citation and open data seem to have got tangled together. Yes, citation is a mechanism for encouraging researchers to make their data open, but it doesn't follow that everything you cite has to be open.

Let's take an example of a journal paper. You can cite a journal paper whether it's open or not, and the citation simply gives information about the paper and where you can find it. The DOI for the paper will take you to a landing page, and the landing page then tells you what restrictions are on the paper (if any). It's commonplace to cite a paper that you have to pay to access - I know I've done it many a time.

Similarly, say you want to cite the Book of Kells (Trinity College Dublin MS 58). That's easy - in fact I've just done it. But to access it as a casual reader, you'd need to travel to Dublin, go to Trinity College Library, pay €9 and look at whatever page happens to be open on display at that particular time. (I'm sure there are more stringent restrictions on researchers who actually want to be able to flick through the pages!)

So, there's plenty of precedent for researchers citing things that aren't open, or are restricted in some way. Data will be no different.

DataCite themselves have accounted for some situations where access to data might be restricted (because of confidentiality issues, embargo periods, etc.). The DOI metadata schema handles this in two places: the publication year, which is one of the mandatory properties, and the date element, which is one of the optional properties.

Publication Year: “If an embargo period has been in effect, use the date when the embargo period ends.”
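To make the embargo rule concrete, here is a minimal sketch in Python of how a repository might fill in those two properties. It is only an illustration: the function names, the dictionary keys and the example DOI are my own assumptions rather than DataCite's official serialisation, and I'm assuming that "Available" is the appropriate date type for the end of an embargo.

```python
from datetime import date
from typing import Optional


def publication_year(release_date: date, embargo_end: Optional[date] = None) -> int:
    """Pick the year to record as the mandatory publication year.

    Follows the rule quoted above: if an embargo has been in effect,
    use the date when the embargo ends; otherwise use the release date.
    """
    if embargo_end is not None:
        return embargo_end.year
    return release_date.year


def build_metadata(doi: str, title: str, creator: str, publisher: str,
                   release_date: date,
                   embargo_end: Optional[date] = None) -> dict:
    """Build an illustrative metadata record (hypothetical key names)."""
    record = {
        "identifier": doi,
        "creator": creator,
        "title": title,
        "publisher": publisher,
        "publicationYear": publication_year(release_date, embargo_end),
    }
    if embargo_end is not None:
        # Optional date element: when the data is expected to become available.
        # "Available" as the date type is an assumption for this sketch.
        record["dates"] = [
            {"date": embargo_end.isoformat(), "dateType": "Available"}
        ]
    return record


# Hypothetical dataset: deposited in 2012, embargoed until 1 January 2014.
embargoed = build_metadata(
    doi="10.1234/example-dataset",  # made-up DOI, for illustration only
    title="Example restricted dataset",
    creator="Researcher, A.",
    publisher="Example Data Centre",
    release_date=date(2012, 10, 19),
    embargo_end=date(2014, 1, 1),
)
```

The point is that the dataset gets its DOI and its citation details at deposit time, while the metadata itself records when the restriction is expected to lift.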

The landing page for a DOI-ed dataset needs to be completely open, and it needs to give the relevant information about why access is restricted and/or what to do to get full access.

One of the planned JISC-British Library DataCite Workshops, focusing on managing and citing sensitive data, is taking place on Monday 29th October in the British Library Conference Centre, and will look in greater detail at exactly these sorts of issues. Registration is still open!

For me, I want to spread the word that you can cite data without having to make it open. Open data is, of course, something to be encouraged wherever possible. But scientists are nervous enough about open data and the possibility of getting scooped, or having legal or IPR issues causing problems. Going for the softly, softly approach of citing data whether it's open or not will allow researchers to get used to the idea of data citation. Once they get credit for their work in creating the datasets, that's when we can show them how much more credit they can get for making them open.

And in a lot of cases, data needs to be restricted for very good reasons (for example protecting patient confidentiality). Penalising the researchers who created those datasets by not allowing them citations because their data can't be made open just seems unfair.