More than spilled milk: CREAM at the Research Data Spring workshop

The Jisc Research Data Spring workshop at the Warwick Conference Centre in Coventry had some welcome moments of blue sky before the mid-December dull grey set in. These included a breakout session from one of the projects [1]: Collaboration for Research Enhancement by Active Metadata (CREAM). The session explored the active use of metadata in the arts and sciences, a theme the project members have been exploring for some time [2].

The workshop was titled ‘Observations on Commonalities of Process’, and was led by a two-handed presentation from Iris Garrelfs and Graham Klyne on key parallels between the arts and sciences. Iris Garrelfs spoke as an artist who works “on the cusp of music, art and sociology” [3]. Graham Klyne, of Nine by Nine [4], spoke as a former University of Oxford bioinformatician and a contributor to many semantic web standards.

This seemed unaccustomedly philosophical territory for a Jisc programme workshop, in my experience anyway. And, despite any seasonal temptation, nobody made any rubbish puns about C.P. Snow, or made too much of his big theme. That theme was the rift between the ‘two cultures’ of the sciences and humanities, which the chemist-turned-novelist famously wrote about [5] and which is still with us today. Much of the research-driven impetus behind Research Data Management has come from the STEM disciplines. Perhaps understandably, given the impact of the EPSRC’s data policy on UK institutions, this has antagonised many humanities researchers who would rather deal with policy directions couched in their own terms. So I guess if Snow were still around he would have approved of this session.

Rather than becoming bogged down in the differences of terminology and epistemology, the session brought fresh thinking on common methods and tools for dealing with arts and humanities metadata. The four discussion themes were:

planning and agility

decision-making

workflow and lifecycle

dissemination

Each theme was introduced by Iris and Graham, based on the project’s effort to develop a model for ‘actively used metadata’. They included reflections on the research processes followed by artists at University of the Arts London, and by chemists and geoscientists in Southampton and Edinburgh. So there were many contributions from collaborators Athanasios Velios, Simon Coles, and others outside the project.

Planning and agility

Some of the fresh thinking mentioned earlier is in the shape of Iris Garrelfs' Procedural Blending model. This is an abstract framework for describing creative processes, set out in her PhD thesis [6], and based on her work in sound art.

If I picked up Iris’s quick introduction to the model correctly, the gist of it is that creative processes do not follow a stepwise linear process from input to output, but blend parallel strands of action (or ways of framing a problem) that become joined together at key points in the research process. The question is, how can this be recorded in useful ways?

Provenance metadata is part of the answer for CREAM. Reflecting on his involvement in the W3C PROV collaboration, Graham Klyne’s take on this standard for provenance metadata [7] was that it offers a very useful structure for encoding process, but has little to say about the less mechanical aspects of that process. The Procedural Blending model, he said, has offered a fascinating counterbalance to PROV. It may offer the provenance standards a broader framework for these less tangible aspects of data management. Of course, provenance is a retrospective record of action, whereas research planning and workflow design are prospective. Addressing the tacit and intangible seems key to working out how to apply the provenance metadata emerging from a project as a resource for planning-in-action.

When the project began working out how to take that forward, the arts-science differences were the most evident at first, but then parallels became clear. These include several aspects of the trade-off between planning and agility in research.

Amendments and changes in process. The project has considered research around chemical reactions, responses to planning that research, and the role of improvisation. At one extreme, improvisation can be thought of as ‘developing a plan in the moment’. At the other, it can refer to points where a researcher is responding to observations and adapting (say) a spreadsheet to record experiment outcomes.

Re-framing. Iris pointed out that artists are used to taking conceptual and physical objects and turning them on their head to look at them from a different perspective, whether literally or metaphorically. Science aims to nail down processes in a more definitive and reproducible way. But as Graham and others commented, the way that science research is reported suggests a design that is more planned than it is in practice. So CREAM has become focused on the messy aspects of research design: not just when the milk gets spilled, as it were, but acting on the smell of it. These are the points when arbitrary choices are made, or when the data that researchers are faced with suggest a new line of investigation. Here it is detailed background knowledge that enables decisions about which line to take.

These points resonated with challenges to reproducibility that the RDM community is trying to address. Simon Coles mentioned, for example, that only a minority, perhaps 20%, of chemical syntheses are reproducible, because tacit knowledge and arbitrary decisions do not get recorded.

Neil Jefferies made a further connection with the under-reporting of negative results, and the idea that capturing this vast body of knowledge of ‘what didn’t work’ could save time by identifying what won’t work in future. Simon Coles pointed out that accounting for negative results that don’t go according to plan is a very different thing from accounting for agility in planning. And Southampton ex-colleague Mathew Addis pointed out that the desire for this level of accounting varies by discipline, but isn’t restricted to academia. Chemists try to record everything, and so does the pharma industry.

The Southampton University Chemistry groups’ work with electronic lab notebooks (ELNs) has taught them a thing or two about what people actually record and do with paper and digital notebooks. The greater sense of ownership researchers have of paper notebooks affects their willingness to make the switch. ELN take-up is difficult where there are specific research values around data ownership, and the groups are investigating ways to encourage submission of ELNs.

Decision-making

Humanities scholars tend to deal with the subjectivity of research decision-making differently to scientists. Iris spoke about artists’ and scholars’ concern to understand motivations and influences. She pointed out that history and archaeology are concerned with problems similar to scientific reproducibility: doing forensic studies of ‘how people got there’. Generalising across the humanities, the view tends to be much more that provenance is debatable, while for scientists it’s a record of the path of their research that does not need or deserve debate.

These differences can be productive though; some of those present commented on the value of capturing motivations in science. There is also value in drawing out the role of decisions that can’t be planned for, e.g. apparently arbitrary decisions about what is looked at, or selected for analysis. Conventional recording processes rarely allow for this in the sciences, and appraisal processes emphasise deliberate and rational choice.

Having light shed on arbitrary decision-making from models of the creative process may help to incorporate into the scientific record metadata on how ideas are made, and how spontaneity is dealt with. As Simon Coles pointed out, research depends on the ability to deviate from a plan on the fly. And as Neil Jefferies remarked, collaborations often depend on people knowing that they share a similar feeling about the problem at hand that comes from their aggregated history. Exposing those aspects that influence how decisions are made could help with reproducibility, though they are challenging to record.

Workflow and lifecycle

If you have tried to apply research data lifecycle models in practice and thought ‘ok that’s fine for the fly-by overview, but life’s not like that’, you will probably appreciate the problem CREAM is trying to tackle. One of the marked similarities the project found was between the Procedural Blending model and models of the research lifecycle more common in the sciences.

There are well-known barriers to the practicality of documenting more fine-grained and realistic metadata, the prime one being justifying the expense of doing it. But there are nuances to the cost-benefit trade-offs. Obviously automating the metadata gathering helps, but only if the metadata is more meaningful and useful than that which is handcrafted but based on fallible memory and hindsight-based rationalisations about what happened. This is where the CREAM collaborators believe workflow models that allow for provenance metadata to be applied prospectively may help. So far, they said, they had been pleasantly surprised at how much the fluidity of the artistic view could shed light on scientific process and, from the artistic perspective, at the potential of scientific workflow techniques for recording process.

Dissemination

Ownership and attribution issues were the key ones highlighted at this point in the discussion. Copyright and plagiarism concerns drive a reluctance to record research processes. For some present, that pointed to the need to enable a hierarchy of access to data. Workflows for research data sharing must allow for much of the data to be kept to known collaborators for much of the time. The RDM community’s general exhortation to be as open as possible, as quickly as feasible, can drown out that message.

Fiona Murphy, who coupled her humanities background with her experience in scientific publishing, highlighted an important question: how much does it matter who actually makes the observations that create data? From a reproducibility perspective these observations should, in principle at least, be independent of the observer making them, but how is that actually viewed in practice? Some of the scientists present were happy to acknowledge that some people are better than others at making observations. Had there been more humanists or sociologists of science in the room this might have sparked further debate about epistemology, or about how researchers’ biographies and social networks actually affect what research gets done.

Other participants reiterated the earlier point that science reporting also tends to play down the creative elements of the research process. And from her arts background Iris Garrelfs mentioned that the convention of working within a genre, and following its rules of provenance (among other things), has similarities with the call for reproducibility in science.

Conclusions

Two main points wrapped up this session. The first was that scientific and humanistic datasets can be used by researchers on the other side of the divide for purposes neither imagined, so it makes sense to have common data management frameworks. The other was to encourage researchers, and others involved in the RDM field, to go beyond a mandate-fulfilling view of reproducibility. Records of process aren’t just useful for re-treading your own path; they can be a resource for doing things outside your own field.

From my own point of view I liked this workshop a lot, and was pleased to see there’s another workshop planned for IDCC, and it promises to be more interactive [8]. Many of the themes will be familiar to provenance researchers and also touch on sociology of science. I was also reminded of Arthur Koestler’s ‘bisociation’ theory of creative thinking [9]. I used that in my very first published journal article [1989, lost to digital rot, but papyrus still available!] so it had plenty of personal resonances.

CREAM are pursuing a novel approach, and more recent parallels struck me. On the sociological side of things there is some work by the Information Systems group at the LSE on ‘Collective agility, paradox and organizational improvisation’, based on a study of particle physics research processes in the GridPP collaboration [10]. A little more current than that, the Research Data Alliance has several groups addressing the ‘planning and agility’ theme. These are the interest group on Active Data Management Plans [11], plus another on ‘De-constructing the Data Lifecycle- Agile Curation’ [12].

CREAM is part of a flurry of tech development aiming for better record-making tools in research. The hope is they’ll offer metadata that’s actually useful for research before it’s done, as well as more accurate about how it’s done, all with less effort and higher usability than the traditional lab record or artist’s notepad. The results remain to be seen.