Tuesday, 29 January 2013

During the PREPARDE project workshop at the International Digital Curation Conference, one of the presenters raised the thought that data journals may just be a temporary phenomenon pending better data organisation and credit. (I'm paraphrasing from memory here, so forgive me if I get it wrong!) Their thinking was that we want to make data a first-class scientific object and will do so through data citation, and then we will also want to enhance existing scientific publications with links back to the data they use and associated interactive gubbins. Therefore data journals, which publish the datasets and a brief paper describing them, won't be needed, because you'll either cite the data directly, or have links in an analysis-and-conclusions article to the data.

I'm not arguing with the need for proper data citations, or the benefits they'll give. I also agree that analysis-and-conclusions articles will and should have better links to the data that underlies them. I do think, though, that there's an awfully big jump between a dataset, in a repository, ready to be cited, and a full analysis-and-conclusions article.

(A brief digression - I know we're piggy-backing on article publication to provide data creators with the credit they deserve for creating the datasets, and this is nowhere near the ideal way of doing it! But that's a subject for another post, so, for today, let's go with the whole data publication thing as a given.)

Let's start with direct citations of datasets. OK, so you've created your dataset, you've put it in a repository somewhere, and you cite it using a permanent id (DOI/ARK/whatever). Using that citation, another researcher can go and find your dataset where it's stored, and will have at least the minimum level of metadata given in the citation (Authors, Title, Publisher, etc.). What the user of the dataset doesn't get is any indication of how useful the data is likely to be (apart from what they can guess through their knowledge of the authors' and repository's reputation), and they may not get any information at all about whether or not the dataset meets any community standards, is in appropriate formats, or has extra supporting metadata or documentation.

This isn't a particularly likely situation for most discipline-based repositories, which have a certain amount of domain knowledge to ensure that community standards are met. But institutional or general repositories, which may have to cover subject areas from art history to zoology, simply won't be able to provide this depth of knowledge. So a data citation can easily provide the who, where and maybe the what of a dataset (who created it? where is it stored? what is it - or at least what is it called?), but doesn't automatically provide any information on how or why the dataset was created, which is important when it comes to judging the quality and reuse potential of the dataset.
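To make that gap a bit more concrete, here is a minimal sketch in Python of what a bare data citation carries. The class, the field names and the example values (including the placeholder DOI) are all made up for illustration and don't follow any particular repository's metadata scheme; the point is only that the how and the why have nowhere to live in the citation itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataCitation:
    """Bare-minimum citation metadata (hypothetical field names)."""
    authors: List[str]
    title: str
    publisher: str
    year: int
    identifier: str  # e.g. a DOI or ARK


def format_citation(c: DataCitation) -> str:
    """Render the citation as a single human-readable string."""
    return f"{'; '.join(c.authors)} ({c.year}): {c.title}. {c.publisher}. {c.identifier}"


def unanswered(c: DataCitation) -> List[str]:
    """List the questions about the dataset that the citation cannot answer."""
    questions: Dict[str, bool] = {
        "who": bool(c.authors),
        "what": bool(c.title),
        "where": bool(c.publisher and c.identifier),
        # No citation field describes collection method or purpose.
        "how": False,
        "why": False,
    }
    return [q for q, answered in questions.items() if not answered]


# Placeholder values only; this is not a real dataset or a real DOI.
example = DataCitation(
    authors=["A. Researcher"],
    title="Example dataset",
    publisher="Example Repository",
    year=2013,
    identifier="doi:10.0000/example",
)
print(format_citation(example))
print(unanswered(example))
```

However complete the citation fields are, the last two questions stay unanswered unless the repository or an accompanying article supplies that information.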

Looking from the other end, analysis-and-conclusions papers tend to be pretty long things, and they often have to describe a lot about the methods used for the analysis. Having to explain the data collection and processing method before you even get to the analysis methods is a pain (even if you'd only have to do it once and would then cite that first paper), but it is still an essential part of the paper if the conclusions are to hold up.

Yes, it will be great to click on a graph and be taken to the raw data that created that plot, but you'd still need to provide metadata for that subset of the dataset (and most repositories only store and cite the full dataset, not subsets). Clicking through to a subset of the data doesn't give the whole picture of the dataset either: what if that particular data subset was cherry-picked to best support the conclusions drawn in the paper? There are technical issues there, which I'm sure will be solved, but they aren't yet.

It's also about the target audience. If I'm looking for datasets that might be useful to me, I don't want to be trawling through pages of analytical methods to find them. Ditto if I'm interested in new statistical techniques: all the stuff about how the data was collected is noise to me. Splitting the publication between a data article (which gives all the information about calibrations and instrument set-up and the like) and an analysis-and-conclusions article, and citing the former from the latter, seems sensible to me. Not to mention that it might work out quicker to publish two smaller papers than one large one (and they would certainly be easier to write and review!).

So I really do think there's a long-term place for data journals, between data citation and analysis-and-conclusions articles. Data articles allow for the publication of more information about a dataset (and in a more human-readable way) than can be captured in a simple metadata scheme and a repository catalogue. Data articles also provide a mechanism for the dataset's community to judge the scientific quality and potential reuse of the dataset through peer review (open or closed, pre- or post-publication).

I think a data article is also a sign that the data producer is proud of their data and is willing to publicise it and share it with the community. I know that if I had a rubbish dataset that I didn't want other people using, but had been told by someone important that it had to be in a repository, then I'd be sure to put it somewhere with the minimum amount of metadata. Yes, it could still be cited, but it wouldn't necessarily be easy to use!

There's only one way to find out if data journals are just a temporary stepping stone between data and analysis-and-conclusions articles until data citation becomes common practice and enhanced publications really get off the ground. And that's to keep working to raise the profile of data citation and data publication (whether in a data article, or as a first-class part of an analysis-and-conclusions article) so it becomes the norm that data is made available as part of any scientific publication.

In the meantime, let's keep talking about these issues, and raising these points. The more we talk about them and the more we try to make data citation and enhanced publications happen, the more we're raising consciousness about the importance of data in science. That's all to the good!