Tuesday, 7 February 2012

I was invited to give a talk to the British Geological Survey on the 25th January, on the topic of data citation and publishing, and why it's important. I've been doing this talk in a variety of guises in different places for a while now, but I thought it'd be good to put it up here too. Consider it an on-line lecture, if you will.

(Click on any of the slide images to see the larger versions)

The key point here is that science should be reproducible: different people running the same experiment at different times should get the same result. Unfortunately, until someone invents a working time machine, you can't just pop back to last week to collect some observational data, so that's why we have to archive it properly.

Often, the only part of the scientific process that gets published is the conclusions drawn from a dataset. And if the data's rubbish, the conclusions will be too. But we won't know that until we can look at the data.

This is a bit of blurb about the data citation project, and the NERC data centres, and why we care about data in the first place.

There's a nice picture drawn by Robert Hooke in the above slide - showing us that in the past it might have been tedious and time-consuming to collect data, but it was at least (relatively) easy to publish. Not so much anymore.

And we're only going to be getting more data... Lots of people call it "the data deluge". If we're going to be flooded with data, it's time to start building some arks!

Data sharing is often put forward as a way of dealing with the data deluge. It has its good points...

...but in this day and age of economic belt-tightening, hoarding data might be the only thing that gets you a grant.

Data producers put a lot of effort into creating their datasets, and at the moment there's no formal way of recognising this. Formal recognition would help the data producers when it comes to facing a promotion board.

There are lots of drivers for making data freely available, and for citing and publishing it. From a purely pragmatic view, and wearing my data centre hat, we want a carrot to encourage people to store their data with us in appropriate formats and with complete metadata.

The project aims can basically be summed up as us wanting a mechanism to give credit to the scientists who give us data, because we know how tricky a job it is. But it has to be done if the scientific record is to stand.

The figure in this slide is key here, especially when it comes to drawing the distinction between "published" with a small "p" and "Published" with a big "P". We want to get data out into the open, and at the same time have it "Published", providing guarantees as to its persistence and general quality. What we definitely don't want is to have the data locked away on a floppy disk in a filing cabinet in an office somewhere.

Data centres are fitting into the middle ground between open and closed, and "published" and "Published", and we're hoping to help move things in the right directions.

I'm far from an expert on cloud computing, but there are many questions to be answered before shoving datasets into the cloud or onto a webpage. Discoverability, permanence, trust and so on are all things that data centres can help with.

This is an example of thousand-year-old data that's been preserved very well indeed. Unfortunately we've lost the supporting information and the context that went with it, meaning we've got several different translations with different meanings.

It's not enough to simply store the bits and bytes; we need the context and metadata too.

It's easy enough to stick your dataset on a webpage, but it takes effort to ensure it's all properly documented, and that other people can use it without your input. There are also risks - someone might find errors, or use your work to win funding.

Data centres know that the work involved in preparing a dataset for use by others is needed, and that's why we want to help the data producers and ensure they get credit for it.

Of course, in some cases sharing data is mandatory but the data producer doesn't really want to do it. Then it's a simple matter of not doing the prep work, which leaves the data unusable to anyone but the creators.

(The example files in the pictures come from one of my own datasets, before they were put into the BADC with all their metadata and in netCDF. I know what they are, but no one else would...)

So, we're going to cite data using DOIs, and these are the reasons why. The main ones are that they're commonly used for papers, and scientists are familiar with them.
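To make that a bit more concrete, here's a minimal Python sketch of what a dataset citation built around a DOI could look like. It isn't the project's official format: the layout (creators, year, title, publisher, identifier) is an assumption that follows a commonly used pattern, the function names are my own, and the authors, title, data centre name and DOI in the example are made-up placeholders. The only real thing in it is the doi.org resolver address, which turns a DOI into a clickable link.

```python
def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link that goes through the doi.org resolver."""
    return "https://doi.org/" + doi.strip()


def format_citation(creators, year, title, publisher, doi):
    """Build a citation string: Creators (Year): Title. Publisher. DOI link.

    This layout is an assumption for illustration, not a mandated format.
    """
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {doi_url(doi)}"


# Everything below is a hypothetical placeholder, not a real dataset.
example = format_citation(
    creators=["Bloggs, J.", "Doe, A."],
    year=2012,
    title="Example rain gauge measurements",
    publisher="Example Data Centre",
    doi="10.1234/example-dataset",
)
print(example)
```

The point is that the DOI part stays the same even if the data centre reorganises its website, which is exactly what you want from something that ends up in a reference list.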

Now we're getting into the detail. These are our rules about what sort of data we can/will cite. Note that these are self-imposed rules, and we're being pretty strict about them. That's because we want a DOI-ed dataset to be something worth having.

As data centres, serving data is our day job - we take it in from scientists and we make it available to other interested parties.

The data citation project is working on a method of citing data using DOIs - which will give the dataset our "data centre stamp of approval", meaning we think it's of good technical quality and we commit to keeping it indefinitely.

The scientific quality of a dataset has to be evaluated by peer review by scientists in the same domain. That's going to be a tricky job, and we're partnering up with academic publishers to work further on this.

Data Publication, with associated scientific peer review, would be good for science as a whole, and also good for the data producers. It would allow us to test the conclusions published in the literature, and provide a more complete scientific record.

Of course, publishing data can't really be done in the traditional academic journal way. We need to take advantage of all these new technologies.

We're not the first to think of this - data journals already exist, and more are on the horizon. There does seem to be a groundswell of opinion that data is becoming more and more important, and citation and publication of data are key.

This pretty much sums up the situation with the project at the moment. At the end of this phase, all the NERC data centres will have at least one dataset in their archive with an associated DOI, and we'll have guideline documents published for the data centres and data producers about the requirements for a dataset to be assigned a DOI.

Users are coming to us and asking for DOIs, and we're hoping to get more scientists interested in them. We're also encouraging the journals that express an interest in data publication to mandate dataset citation in their papers too.