Tuesday, 26 November 2013

Citing dynamic data is a topic that just keeps coming around, every time data citation is mentioned, usually as a way of pointing out that data citation is not like text citation, because people can and will want to get their hands on the most recent data in a dataset, and simply don't want to wait for a frozen version. There's also confusion about what makes something citeable or not (see "DOI != citeable" by Carl Boettiger), tied into the whole DOI for citation thing and the requirements for a dataset to have a DOI assigned.

As I've said many times before, citing data is all about using it to support the scholarly record. We have other methods of linking data to papers, or data to other data - that's what the Internet is all about after all. I maintain that citation is all about getting back to exactly the thing the author of the article was talking about when they put the citation in the article.

If you’re citing something so you can simply point to it ("the most recent version of the dataset can be found at blah"), and aren’t really that worried about whether it’s changed since the pointer was made, then you can do that easily with a citation with a http link in it. That way you go automatically to the most recent version of the dataset.

If however, you need to be sure that the user gets back to exactly the same data each time, because that's the data you used in your analysis, then that data becomes part of the scientific record and needs to be frozen. How you get back to that exact version is up to the dataset archive – it can be done via frozen snapshots, or by backing out changes on a database – whatever works.

(For a more in-depth discussion of frozen data versus active data, see the previous post here.)

Even if you’re using a DOI to get to a frozen version of the dataset, there should still be a link on the DOI landing page which points to the most recent version of the dataset. So if a scientist wants to get to the most recent version of the dataset, but only has a DOI to a frozen version, then they can still get to the most recent version in a couple of hops.

It is (theoretically) possible to record all changes to a dynamic dataset and guarantee (audited by someone) that, if needed, the data repository could back out all those changes to recreate the original dataset as it was on a certain date. However, the BODC did a few tests a while back, and discovered that backing out the changes made to their database would take weeks, depending on how long ago the requested version was. (This is a technical issue though, so I’m sure people are already working on solving it.)

You could instigate a system where citation is simply a unique reference based on a database identifier and the timestamp of extraction – as is already used in some cases. The main issue with this (in my opinion) is convincing users and journal editors that this is an appropriate way to cite the data. It’s been done in some fields (e.g. accession numbers) but hasn’t really gained world-wide traction. I know from our own experience at BADC that telling people to cite our data using our own (permanent) URLs didn’t get anywhere because people don’t trust urls. (To be fair, we were telling them this at a time when data citation was even less used than it is now, so that might change here and now.)

Frozen data is definitely the easiest and safest type to cite. But, we regularly manage datasets that are continually being updated, and for a long term time series, we can't afford to wait the twenty odd years for the series to be finished and frozen before we start using and citing it.

So we've got a few work-arounds.

For the long running dataset, we break the dataset up into appropriate chunks, and assign DOIs to those chunks. These chunks are generally defined on a time basis (yearly, monthly), and this works particularly well for datasets where new data is continually being appended, but the old data isn't being changed. (Using a dead-tree analogy, the chunks are volumes of the same work which is released in a series and at different times - think of the novels in the seriesA Song of Ice and Fire for example - now that's a long running dataset which is still being updated*)

A related method is the ONS (Office for National Statistics) model, where the database is cited with a DOI and an access date, on the understanding that the database is only changed by appending new data to it – hence any data from before the access date will not have changed between now and when the citation was made. As soon as old data is updated, the database is frozen and archived, and a new DOI is assigned to the new version.

For datasets where the data is continually being updated, and old measurements are being changed as well as new measurements appended, we take snapshots of the dataset at a given point in time, and those snapshots are frozen, and have the DOIs assigned to them. This is effectively what we do when we have a changing dataset, but the dataset is subject to version control. It also parallels the system used for software releases.

It's worth noting that we're not the only group thinking about these issues, there's a lot of clever people out there trying to come up with solutions. The key thing there is bringing them all together so that the different solutions can work together rather than against each other - one of the key tenets of the RDA.

DOIs aren’t suitable for everything, and citing dynamic data is a problem that we have to get our heads around. It may well turn out that citing frozen datasets is a special case, in which case we’ll need to come up with another solution. But we need to get people used to citing data first!

So, in summary – if all you want from a citation is a link to the current version of the data: use a url. If you want to get back to the exact version of the data used in the paper so that you can check and verify their results: that’s when you need a DOI.

_________________________________________

* Pushing the analogy a bit further - I'd bet there's hordes of "Game of Thrones" fans out there who'd dearly love to get their hands on the active version of the next book in "A Song of Ice and Fire", but I'm pretty sure George R.R. Martin would prefer they didn't!

I think there's a crucial distinction we need to draw between data that is "active" or "working" and data that is "finished" or "frozen"*, i.e. suitable for publication/consumption by others.

There's a lot of parallels that can be drawn between writing a novel (or a text book, or an article, or a blog post) and creating a dataset. When I sit down to write a blog post, sometimes I start at the beginning and write until I reach the end. In which case, if I was doing it interactively, then it might be useful for a reader to watch me type, and get access to the post as I'm adding to it. I'm not that disciplined a writer however - I reread and rewrite things. I go back, I shuffle text around, and to be honest, it'd get very confusing for someone watching the whole process. (Not to mention the fact that I don't really want people to watch while I'm writing - it'd feel a bit uncomfortable and odd.)

In fact, this post has just been created as a separate entity in its own right - it was originally part of the next post on citing dynamic data - so if the reader wanted to cite the above paragraph and was only accessing the working draft of the dynamic data post, well, when they came back to the dynamic data post, that paragraph wouldn't be there anymore.

It's only when the blog post is what I consider to be finished, and is spell-checked and proofread, that I hit the publish button.

Now, sometimes I write collaboratively. I recently put in a grant proposal which involved coordinating people from all around the world, and I wrote the proposal text openly on a Google document with the help of a lot of other people. That text was constantly in flux, with additions and changes being made all the time. But it was only finally nailed down and finished just before I hit the submit button and sent it in to the funders. Now that that's done, the text is frozen, and is the official version of record, as (if it gets funded) it will become part of the official project documentation.

The process of creating a dataset can be a lot like that. Researchers understandably want to check their data before making it available to other people, in case of others finding errors. They work collaboratively in group workspaces, where a dataset may be changed lots very quickly, without proper version control, and that's ok. There has to be a process that says "this dataset is now suitable for use by other people and is a version of record" - i.e. hitting the submit, or the publish button.

But at the same time, creating datasets can be more like writing a multi-volume epic than a blog post. They take time, and need to be released in stages (or versions, or volumes, if you'd prefer). But each of those volumes/versions is a "finished" thing in its own right.

I'm a firm believer that if you cite something, you're using it to support your argument. In that case, any reader who reads your argument needs to be able to get to the thing you've used to support it. If that thing doesn't exist anymore, or has changed since you cited it, then your argument immediately falls flat. And that is why it's dangerous to cite active datasets. If you're using data to support your argument, that data needs to be part of the record, and it needs to be frozen. Yes, it can be superseded, or flat out wrong, but the data still has to be there.

You don't have this issue when citing articles - an article is always frozen before it is published. The closest analogy in the text world for active data is things like wiki pages, but they're generally not accepted in scholarly publishing to be suitable citation sources, because they change.

But if you're not looking to use data to support your argument, you're just doing the equivalent of saying "the dataset can be found at blah", well, that's when a link to a working dataset might be more appropriate.

My main point here is that you need to know whether the dataset is active or frozen before you link/cite it, as that can determine how you do the linking/citing. The user of the link/citation needs to know whether the dataset is active or not as well.

In the text world, a reader can tell from the citation (usually the publisher info) whether the cited text is active or frozen. For example, a paper from the Journal of Really Important Stuff (probably linked with a DOI), will be frozen, whereas a Wikipedia page (linked with a URL) won't be. For datasets, the publishers are likely to be the same (the host repository) whether the data is frozen or not - hence ideally we need a method of determining the "frozen-ness" of the data from the citation string text.

In the NERC data centres, it's easy. If the text after the "Please cite this dataset as:" bit on the dataset catalogue page has a DOI in it, then the dataset is frozen, and won't be changed. If it's got a URL, the dataset is still active. Users can still cite it, but the caveat there is that it will change over time.

We'll always have active datasets and we'll want to link to them (and potentially even freeze bits of them to cite). We (and others) are still trying to figure out the best ways to do this, and we haven't figured it out completely yet, but we're getting there! Stay tuned for the next blog post, all about citing dynamic (i.e. active) data.

In the meantime, when you're thinking of citing data, just take a moment to think about whether it's active or not, and how that will affect your citing method. Active versus frozen is an important distinction!

____________________________
* I love analogies and terminology. Even in this situation, calling something frozen implies that you can de-frost it and refreeze it (but once that's done, is it still the same thing?) More to ponder...

On top of that I also managed to decide it'd be a good idea to apply for a COST Action on data publication. Thankfully 48 other people from 25 different countries decided that it'd be a good idea too, and the proposal got submitted last Friday (and now we wait...) Oh, and I put a few papers in for the International Digital Curation Conference being held in San Francisco in February next year.

Anyway, they're all my excuse for not having blogged for a while, despite the list I've been building up of things to blog about. This post is really by way of an update, and also to break the dry spell. Normal service (or whatever passes for it 'round these parts) will be resumed shortly.

And just to make it interesting, a couple of my presentations this year were videoed. So, you can hear me present about the CODATA TG on data citation's report "Out of Cite, Out of Mind"here. And the lecture I gave on data management for the OpenAIRE workshop May 28, Ghent Belgium can be found here.