Tuesday, 22 November 2011

It's possible to spend a lot of time arguing about what a dataset actually is. (And believe me, plenty of people have, myself included!)

I don't have a definitive answer, but for myself, I tend to default to the idea of what's scientifically meaningful as a dataset. For example, a single peak flood measurement at a certain place for a given year could count as a dataset, but a single rain gauge measurement from a site where the gauge has been in place for years wouldn't. And of course, it all depends on the scientific domain as well.

Sometimes projects can act as a convenient guide - if a project was run for x years and provided a wodge of data, then that data can be packaged up as a project dataset. Sometimes a dataset can be all the data resulting from a given instrument for a given period of operation. The important thing is that common sense needs to be applied to how "thinly sliced" a dataset should be. I really don't want to see the concept of minimum publishable unit applied to data, thankyouverymuch!

An analogy that I tend to use a lot is a book. Like a book, a citeable (DOI-able) dataset should be easily identifiable, stable, complete and (hopefully) have enough information in it so that you can understand what it's all about, without having to refer to (too many) other sources of information. Yes, the dataset can be structured in such a way that you can refer to parts of it easily (chapter and verse analogy), but it doesn't mean that every single segment of the dataset should have its own DOI (or that each verse in a book should be published independently in its own cover).

My completely off-the-cuff and not entirely serious example of how you'd go about referencing segments of a particular dataset is:

[image: mock example of citing segments of a dataset]

Of course, datasets are more than books, and there's lots of different ways of slicing and dicing them to produce scientifically meaningful datasets. At the moment, because we're in the early stages of assigning DOIs to our hosted datasets, we're pretty much making a decision on a case by case basis, in the hopes that some general guidelines will surface along the way. (Thankfully, they do seem to be.)

One idea that quickly got assigned to the "not now - tricky" pile is the notion that users might want at some stage to effectively create a new derived dataset which is made up of smaller bits of other people's datasets, and would then want to cite this derived dataset as a whole. This "user-defined" citation would save space in the valuable real estate of a paper's references, and would provide a link to a list of the other sub-citations, in a format that was both human and machine readable. Provided that each of the sub-citations allowed you to easily and accurately get to the relevant sections of the other datasets, then the derived dataset would count as a citeable object.
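To make the idea concrete, here's a minimal sketch in Python of what such a machine-readable derived-dataset citation might look like. It's entirely hypothetical: the field names, DOI strings, and subset selectors are my own invention, not any agreed standard.

```python
import json

# A hypothetical derived-dataset citation: one citeable object that
# bundles pointers into slices of other people's datasets.
derived_dataset = {
    "doi": "10.0000/example.derived.1",  # placeholder DOI, not real
    "role": "compiler",                  # not "author" - see below
    "sub_citations": [
        # Each entry identifies a parent dataset and the slice used,
        # so a reader (or a machine) can get to the relevant sections.
        {"doi": "10.0000/example.raingauge.2010", "subset": "2010-06/2010-08"},
        {"doi": "10.0000/example.floodpeaks", "subset": "site=42"},
    ],
}

# Human-readable via pretty-printing, machine-readable via JSON.
as_json = json.dumps(derived_dataset, indent=2)
print(as_json)
```

The single top-level DOI is what saves space in a paper's reference list; the `sub_citations` list is what lets anyone resolve the individual slices afterwards.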

This is achievable now; the technologies are ready and mature. But it rapidly gets tricky when you start thinking about the roles involved and how to assign credit: the author of this new derived dataset is not so much an author as a compiler or editor, for example.

Hierarchies of dataset citations aren't so problematic. For example, we've already made the decision that for large datasets which are continually modified by appending new files (for example, the rain gauge measurements mentioned above, where files are created on a daily basis), we can assign a DOI to a given period's worth of data at a time. For the rain gauge measurements, it's convenient and sensible to assign a DOI to each year's worth of data once the year's complete, and then, when the rain gauge is moved or otherwise taken out of service, to give the entire time series one DOI.
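As a sketch of that policy in Python (with made-up filenames and DOI suffixes; the real scheme will obviously differ), grouping a continually-appended stream of daily files into yearly, DOI-able chunks might look like:

```python
from collections import defaultdict

def assign_yearly_dois(daily_files, base_doi, last_complete_year):
    """Group daily files (assumed to be named YYYY-MM-DD.dat) by year and
    assign a hypothetical DOI suffix to each *complete* year's data."""
    by_year = defaultdict(list)
    for name in daily_files:
        year = int(name[:4])  # filenames assumed to start with the year
        by_year[year].append(name)
    # Only completed years get a DOI; the current year is still growing.
    return {f"{base_doi}/{year}": sorted(files)
            for year, files in by_year.items()
            if year <= last_complete_year}

files = ["2010-01-01.dat", "2010-01-02.dat",
         "2011-01-01.dat", "2011-11-14.dat"]
dois = assign_yearly_dois(files, "10.0000/example.raingauge",
                          last_complete_year=2010)
print(dois)
# When the gauge is finally retired, the whole series would then get
# one further DOI of its own, sitting above the yearly ones.
```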

Citation is actually a really good prod for us, encouraging us to crystallize our thinking about what a dataset is and how to deal with it. It's all too easy for datasets to stay fuzzy: random piles of files, or entries in a database table, with no defined rules about where their edges are. I don't have the answers, but I do feel like we're getting close to at least some of them!
____________________________________

A lot of the thoughts in this post came about after conversations with the many people involved in various citation workshops, projects, etc., including, but not limited to, my co-workers in the NERC SIS data citation and publication project and the CODATA Task Group on Data Citation. Thanks are due to them all! (I'm sure I'll be repeating that lots in future posts too!)

Monday, 14 November 2011

Way back in the day, when I was a wet-behind-the-ears graduate student, my first proper science job was in pre-processing a large scientific dataset. My job was to convert signal levels received from a satellite (Italsat) radio beacon (at 20, 40 and 50 GHz) into attenuation levels. In other words, convert this:

[image: raw received beacon signal levels]

to this:

[image: derived attenuation levels]

with the eventual aim of producing something like this:

[image: the final processed product]
a process which involved 4 major steps, 4 different computer programmes, and 16 intermediate files for each day of measurements. Each month of preprocessed data represented somewhere between a couple of days and a week's worth of effort. It was a job where attention to detail was important, and you really had to know what you were looking at from a scientific perspective.
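The real pipeline involved those four programmes and a great deal of quality control, but the core conversion is easy to sketch in Python. Note this is a toy illustration, not the actual method: the clear-sky reference level is just handed in as a number here, whereas in practice deriving it was most of the work.

```python
def beacon_attenuation(received_dbm, clear_sky_dbm):
    """Attenuation (in dB) of a satellite beacon signal: the drop of the
    received level below the clear-sky reference level. Both inputs are
    in dBm, so the difference is already in dB. Flooring at zero is a
    simplifying choice (scintillation can push the signal above the
    reference in reality)."""
    return max(clear_sky_dbm - received_dbm, 0.0)

# Toy example: clear-sky reference of -60 dBm, rain fade down to -72 dBm.
print(beacon_attenuation(-72.0, -60.0))  # 12 dB of attenuation
```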

I started work on this project in 1999. In 2006 (five years after the dataset was finished) we finally got a publication out of it:

We shared our data with another group. They got a publication out of it in 2003, three years before we did. We weren't part of the author list, though I believe we got an acknowledgement.

A quick Google Scholar for "Italsat Sparsholt" gives 48 papers which mention Italsat (the satellite) and Sparsholt (the receive station where the data came from), 37 of which weren't written by members of the project team.

But of course, it's citations, not acknowledgements, that count when it comes to measuring how influential your work is.

And yes, I suppose we could have published quicker. But our job was to collect and quality-control our datasets and generally make them as good as they possibly could be. And they are good, and they are important, but unfortunately not in a way that's easily measured.

So, that's why I'm pushing so hard for datasets to be accepted as first-class scholarly outputs. I've spent years of my life making a dataset the best it can be, only to be pipped to the post when it comes to publishing, with no way of knowing whether that work has actually been worthwhile. (And no, I'm not bitter, honest!)

Data citation is something I believe in, because I've been there. I've also submitted data to a data centre (and got infuriated with the format requirements and metadata requests). But now, many years down the line, I'm on the data management side of the fence, and I can see how important it is to encourage scientists who produce data to put their data in archives/data centres where it can be properly looked after. Giving them credit through data citation has got to be part of it, at least until the point where science as a whole comes up with a better method for tracking scientific impact and importance!

As an aside - if anyone needs convincing of the importance of digital archiving and curation - the only way I could get the images above into this blog post was by taking a digital photo of the hard copy of the report. The original files were in a format (.ps) that Windows doesn't seem to like anymore...