Reflections on research data management

10 November, 2011

Cameron Neylon

Reflections on research data management: RDM is on the up and up but data driven policy development seems a long way off.

The Research Data Management movement is moving on apace. Tools are working and adoption is growing. Policy development is starting to back up the use of those tools and there are some big ambitious goals set out for the next few years. But has the RDM movement taken the vision of data intensive research to its heart? Does the collection, sharing, and analysis of data about research data management meet our own standards? And is policy development based on and assessed against that data? Can we be credible if it is not?

Watching the discussion on research data management over the past few years has been an exciting experience. The tools, which have been possible for some years, now show real promise as the somewhat rough and ready products of initial development are used and tested.

Practice is gradually changing, if unevenly across disciplines, and there is a growing awareness of data and a sense that it might be important. All of this is increasingly being driven by the development of policies on data availability, data management, and data archiving that stress the importance of data as a core output of public research.

The vision of the potential of a data rich research environment is what is driving this change. It is not important whether individual researchers, or even whole communities, get how fundamental a change the capacity to share and re-use data really is. The change is driven by two forces fundamentally external to the community.

The first is political: the top down view from government that publicly funded research needs to gain the benefits they see in data rich commerce. Only a handful of people really understand how data works at these scales, but these people have the ear of government.

The second force is one of competition. In the short term, adopting new practices and developing new ways of doing research is a risk. In the longer term, those who adopt more effective and efficient approaches will simply outcompete those who do not or cannot. This is already starting to happen in those disciplines already rich in shared data, and the signs are there that other disciplines are approaching a tipping point.

Data intensive research enables new types of questions to be asked, and it allows us to answer questions that were previously difficult or impossible to get reliable answers on. Questions about weak effects, small correlations, and complex interactions. The kind of questions that bedevil strategic decision-making and evidence based policy.

So naturally you’d expect that the policy development in this area, being driven by people excited by the vision of data intensive research, would have deeply embedded data gathering, model building, and analysis of how research data is being collected, made available, and re-used.

I don’t mean opinion surveys, or dipstick tests, or case studies. These are important but they’re not the way data intensive research works. They don’t scale, they don’t integrate, and they can’t provide the insight into weak effects in complex systems that is needed to support decision making about policy.

Data intensive research is about tracking everything, logging every interaction, going through download logs, finding every mention of a specific thing wherever on the web it might be.

It’s about capturing large amounts of weakly structured data and figuring out how to structure it in a way that supports answering the question of interest. And it’s about letting the data guide you to the answers it suggests, rather than looking within it for what we “know” should be in there.

What I don’t see when I look at RDM policy development is the detailed analysis of download logs, the usage data, the click-throughs on websites. Where are the analyses of the IP ranges of users? Where are the automated reporting systems? And above all, when new policy directions are set, where is the guidance on data collection and on assessing performance against those policies?
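To make that concrete, here is a minimal sketch of the kind of analysis I have in mind: counting successful dataset downloads per dataset and per IP range from a web server access log. It is an illustration, not a description of any existing repository's tooling. It assumes the log is in the Apache Common Log Format, and the `/datasets/<id>/download` URL scheme, the function name `summarise_downloads`, and the sample lines (which use reserved documentation IP addresses) are all hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Assumption: the server writes Common Log Format lines such as
# 192.0.2.10 - - [10/Nov/2011:09:15:02 +0000] "GET /path HTTP/1.1" 200 5120
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical URL scheme for dataset downloads; adjust to the real repository.
DOWNLOAD_PATH = re.compile(r"^/datasets/(?P<dataset>[^/]+)/download$")


def summarise_downloads(lines, prefix_len=16):
    """Count successful downloads per dataset and per IPv4 network."""
    by_dataset = Counter()
    by_network = Counter()
    for line in lines:
        entry = LOG_LINE.match(line)
        if entry is None or entry.group("status") != "200":
            continue  # skip unparseable lines and failed requests
        download = DOWNLOAD_PATH.match(entry.group("path"))
        if download is None:
            continue  # not a dataset download
        try:
            address = ipaddress.ip_address(entry.group("ip"))
        except ValueError:
            continue  # the host field was not an IP address
        if address.version != 4:
            continue  # this sketch only groups IPv4 addresses
        # Collapse the address to its enclosing network, e.g. a /16 range.
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        by_dataset[download.group("dataset")] += 1
        by_network[str(network)] += 1
    return by_dataset, by_network


# Made-up sample lines for illustration only.
SAMPLE = [
    '192.0.2.10 - - [10/Nov/2011:09:15:02 +0000] "GET /datasets/abc123/download HTTP/1.1" 200 5120',
    '192.0.2.77 - - [10/Nov/2011:09:16:40 +0000] "GET /datasets/abc123/download HTTP/1.1" 200 5120',
    '198.51.100.4 - - [10/Nov/2011:09:20:11 +0000] "GET /datasets/xyz789/download HTTP/1.1" 200 2048',
    '198.51.100.4 - - [10/Nov/2011:09:21:03 +0000] "GET /about HTTP/1.1" 200 512',
    '203.0.113.9 - - [10/Nov/2011:09:25:59 +0000] "GET /datasets/xyz789/download HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    datasets, networks = summarise_downloads(SAMPLE)
    print(datasets.most_common())
    print(networks.most_common())
```

Counts like these are only a starting point. Mapping IP ranges to institutions or sectors, and running the report automatically rather than by hand, is where the policy-relevant signal would start to appear.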

Without this, the RDM community is arguably doing exactly the same things that we complain about in researcher communities. Not taking a data driven view of what we are doing.

I know this is hard. I know it involves changing systems, testing things in new ways, collecting data in ways we are not used to. Even imposing disciplinary approaches that are well outside the comfort zone of those involved.

I also know there are pockets of excellent practice and significant efforts to gather and integrate information. But they are pockets. And these are exactly the things that funders and RDM professionals and institutions are asking of researchers. They are the right things to be asking for, and we’re making real progress towards realizing the vision of what is possible with data intensive research.

But just imagine if we could support policy development with that same level of information. At a pragmatic and political level it makes a strong statement when we “eat our own dogfood”. And there is no better way to understand which systems and approaches are working and not working than by using them ourselves.