The Meaning of Curation: Data Curation

Unlike the term archive, I think curation is a 'doing' word that describes the actions of an information professional working with content or resources. The term has been applied to many sectors recently, and I sense its meaning may have come adrift from its more traditional interpretation – for instance a museum curator, or an art curator. For this reason, this post will be the first of a series as we try to move towards a disambiguation of this term.

When it comes to 'digital', the curation term has been commonly applied in the field of research data, but it may also have some specialist uses in the field of digital preservation (for instance, the University of North Carolina at Chapel Hill offers a Certificate in Digital Curation). In this post, however, I will look at the term as it’s been applied to research data.

Research data is one of the hot topics in digital preservation just now. In the UK at least, universities are working hard at finding ways to make their datasets persist, for all kinds of reasons – compliance with research council requirements, funder requirements, conformance with the REF, and other drivers (legal, reputational, etc.). The re-use of data, repurposing datasets in the future, is the very essence of research data, and this need is what makes it distinct from other digital preservation projects. This is precisely where data curation has a big part to play. In no particular order, here's our partial list of what curation involves:

1. Curation provides meaningful access to data. This could be cataloguing, using structured hierarchies, standards, common terms, defined ontologies, vocabularies, thesauri. All of these could derive from library and archive standards, but the research community also has its own set of subject-specific and discipline-specific vocabularies, descriptive metadata standards and agreed thesauri. It could also involve expressing that catalogue in the form of metadata (standards, again); and operating that metadata in a system, such as a CMS or Institutional Repository software. The end result of this effort ought to be satisfied end users who can discover, find, use, and interpret the dataset.

If further unpicking is needed, I could regard those as three different (though related) skills; a skilled cataloguer doesn’t necessarily know how to recast their work into EAD or MARC XML, and may rely on a developer or other system to help them do that. On the other hand, those edges are always blurring; institutional repository software (such as EPrints) was designed to empower individual users to describe their own deposited materials, working from pre-defined metadata schemas and using drop-down menus for controlled vocabularies.
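To make the gap between cataloguing and metadata encoding concrete, here is a minimal sketch in Python. It is illustrative only: it uses simple Dublin Core elements rather than EAD or MARC XML, and the `record` wrapper element, the `SUBJECT_TERMS` vocabulary and the `to_dublin_core` function are my own assumptions, not part of any repository software.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace; "dc" is its conventional prefix.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical controlled vocabulary, standing in for a drop-down menu.
SUBJECT_TERMS = {"Hydrology", "Oceanography", "Soil science"}


def to_dublin_core(description: dict) -> str:
    """Recast a cataloguer's description as a small Dublin Core XML record."""
    if description["subject"] not in SUBJECT_TERMS:
        raise ValueError(f"Subject not in vocabulary: {description['subject']}")
    # "record" is an assumed wrapper element, not defined by Dublin Core.
    record = ET.Element("record")
    for field in ("title", "creator", "subject", "date"):
        element = ET.SubElement(record, f"{{{DC_NS}}}{field}")
        element.text = description[field]
    return ET.tostring(record, encoding="unicode")


xml_record = to_dublin_core({
    "title": "River gauge readings",
    "creator": "A. Researcher",
    "subject": "Hydrology",
    "date": "2015",
})
print(xml_record)
```

The cataloguer supplies the description and chooses the subject term; the function is the part that might fall to a developer or to the repository system itself.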

2. Curation provides enduring access to data. This implies that the access has to last for a long time. One way of increasing your chances of longevity is by working with other institutions, platforms, and collaborators. Curation may involve applying agreed interoperability standards, such as METS, a metadata standard which allows you to share your work with other systems (not just other human beings). Since it involves machines talking to each other, I’ve tended to regard interoperability as a step beyond cataloguing.

Another aspect of enduring access is the use of UUIDs – Universally Unique Identifiers. If I make a request through a portal or UI, I will get served something digital – a document, image, or data table. For that to happen, we need UIDs or UUIDs; it’s the only way a system can “retrieve” a digital object from the server. We could call that another part of curation, a skill that must involve a developer somewhere in the service, even if the process of UID creation ends up being automated. You could regard the UID as technical metadata, but the main thing is making the system work with machine-readable elements; it’s not the same as “meaningful access”. UUIDs do it for digital objects; there’s also the ORCID system, which does it for individual researchers. Other instances, which are even more complex, involve minting DOIs for papers and datasets, making citations “endure” on the web to some degree.
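As a rough illustration of identifier-based retrieval, the Python sketch below mints a random UUID for each deposited object and uses it as the lookup key. The in-memory `store` and the `deposit` and `retrieve` functions are hypothetical stand-ins for a real repository, which would keep objects on disk or in a database.

```python
import uuid

# Hypothetical in-memory "repository": maps an identifier to a stored object.
store = {}


def deposit(content: bytes) -> str:
    """Mint a random (version 4) UUID for the content and store it."""
    identifier = str(uuid.uuid4())
    store[identifier] = content
    return identifier


def retrieve(identifier: str) -> bytes:
    """Return the object for an identifier; raises KeyError if it is unknown."""
    return store[identifier]


dataset_id = deposit(b"site,reading\nA,4.2\n")
print(dataset_id)            # a fresh identifier on every run
print(retrieve(dataset_id))  # the same bytes that were deposited
```

Note that the identifier says nothing about what the object is or means; that is the job of the descriptive metadata discussed under point 1.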

3. Curation involves organisation of data. This one is close to my archivist heart. It implies constructing a structure that sorts the data into a meaningful retrieval system and gives us intellectual control over lots of content. An important part of organisation for data is associating the dataset or datasets with the relevant scholarly publications, and other supporting documentation such as research notes, wikis, and blogs.

In the old days I would have called this building a finding aid, and invoked accessioning skills such as archival arrangement – “sorting like with like” – so that the end user would have a concise and well organised finding aid to help them understand the collection. The difference is that now we might do it with tools such as directory trees, information packages, aggregated zip or tar files, and so on. We still need the metadata to complete the task (see above) but this type of “curation” is about sorting and parsing the research project into meaningful, accessible entities.
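A minimal sketch of this kind of organisation, in Python, might bundle a dataset and its supporting documentation into a single tar file along with a simple manifest. The `data/` and `docs/` directory layout, the `manifest.json` file and the function names are assumptions made for illustration; they do not follow any particular packaging standard.

```python
import io
import json
import tarfile


def build_package(files: dict, description: str) -> bytes:
    """Bundle files plus a manifest.json into one in-memory tar archive."""
    manifest = {"description": description, "files": sorted(files)}
    members = dict(files)
    members["manifest.json"] = json.dumps(manifest, indent=2).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        # Add members in sorted order so like sits with like in the listing.
        for name in sorted(members):
            data = members[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_package(package: bytes) -> list:
    """Return the member names of a package, in archive order."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r") as archive:
        return archive.getnames()


package = build_package(
    {
        "data/readings.csv": b"site,reading\nA,4.2\n",
        "docs/research-notes.txt": b"Gauge A was recalibrated in March.\n",
    },
    description="Hypothetical river gauge project",
)
print(list_package(package))
```

The point of the sketch is the arrangement rather than the file format: the data and the documentation that explains it travel together, and the manifest records what belongs to the package.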

If we get this part of curation right, we are helping future use and re-use of the dataset. If we can capture the outputs of any secondary research, they stand a better chance of being associated with the original dataset.

4. Curation is a form of lifecycle management. There is a valid interpretation of data curation that claims “Data curation is the management of data throughout its lifecycle, from creation and initial storage, to the time when it is archived for posterity or becomes obsolete and is deleted.” I would liken this to an advanced form of records management, a profession that already recognises how lifecycles work, and has workflows and tools for how to deal with them. It’s a question of working out how to intervene, and when to intervene; if this side of curation means speaking to a researcher about their record-keeping as soon as they get their grant, then I’m all for it.

5. Curation provides for re-use over time through activities including authentication, archiving, management, preservation, and representation. While this definition may seem to involve a large number of activities, in fact most of them are already defined as things we would do as part of “digital preservation”, especially as defined by the OAIS Reference Model. The main emphasis for this class of resource, however, is “re-use”. The definition of what this means, and the problems of creating a re-usable dataset (i.e. a dataset that could be repurposed by another system), are too deep for this blog post, but they go beyond the idea that we could merely create an access copy.

Authentication is another disputed area, but I would like to think that proper lifecycle management (see above) would go some way to building a reliable audit trail that helps authentication; likewise the correct organisation of data (see above) will add context, history and evidence that situates the research project in a certain place and date, with an identifiable owner, adding further authentication.

To conclude this brief overview I make two observations:

Though there is some commonality among the instances I looked at, there is apparently no single shared understanding of what “data curation” means within the HE community; some views will tend to promote one aspect over another, depending on their content type, collections, or user community.

All the definitions I looked at tend to roll all the diverse actions together in a single paragraph, as if they were related parts of the same thing. Are they? Does it imply that data curation could be done by a single professional, or do we need many skillsets contributing to this process for complete success?

In future posts on this theme, Steph Taylor will look at the matter of data discovery and define what curation means in that context. I will revisit the term “digital curation” and see if it requires disambiguation from “data curation”.