Supporting Output title: Recommendations for Implementing a Virtual Layer for Management of the Complete Life Cycle of Scientific Data

Editors: Tobias Weigel, Peter Wittenburg

Citation: Tobias Weigel, Peter Wittenburg (eds) (2017): Recommendations for Implementing a Virtual Layer for Management of the Complete Life Cycle of Scientific Data.

Summary

We are at a tipping point in the development of a common conceptual framework and a set of tools and components that will revolutionize the management of scientific data. It is widely acknowledged, as detailed below, that the volume and complexity of data now being collected, and even more so the inevitable and enormous increase in both, have reached the point where action is required. At the same time, and largely in response to this perceived crisis, a number of principles for the management of scientific data have arisen and been widely endorsed. The danger now is that agreement will stop at the level of principles, that multiple non-interoperable, domain- and technology-specific silos will continue to arise, all based on the abstract principles, and that we will lose the opportunity to leverage the current crisis to create a common set of tools and components based on an agreed conceptual approach.

What follows is our summary of the current agreed-upon principles, a more detailed analysis of the requirements implied by those principles, and the current state of work on those requirements, as reflected in the work of RDA, which we believe has the broadest base and most neutral view of the situation. This includes brief summaries of the requirements and the current state of work on repositories, registries, identifiers, metadata, types, licenses, and, in general, the whole ecosystem of interlinked digital objects needed for managing the life cycle of scientific data. We end with a more detailed view of the requirements for selected components, partly extending the FAIR principles (Findable, Accessible, Interoperable, Reusable; see Appendix A).

Action is now required to put in place operational infrastructural components based on this and similar analyses. Some of these components already exist at an operational level with wide experience across communities, while others are still at a prototype or concept stage. In other words, no design from scratch is intended, and we can build on extensive knowledge built up in various regions. We should now establish a systematic approach in which these components can mature and ultimately enable communities to build new services and prove that added value can be achieved by combining components. There will, of course, be a risk in doing this, and some of these components will surely fail or otherwise prove inadequate. In some cases, waiting another five or ten years would perhaps result in better designs and implementations based on technology advances between now and then, but by that time the non-interoperable silo problem will have gained ground and be difficult to displace. The real risk at the moment is in not building a common core infrastructure according to our best current information.

In addition to the recommendations already widely adopted by funders, the essence of the recommendations that follow can be summarized as:

1. Digital objects should be stored in trustworthy repositories that are assessed regularly using DSA/WDS¹ guidelines, and those repositories should be registered in open registries such as re3data².

2. Trustworthy repositories need to assign PIDs to all digital objects and register them with trustworthy PID service providers, such as the International DOI Foundation³ and the European Persistent ID Consortium for eResearch⁴, that guarantee their resolution to meaningful state information.

3. The digital objects referenced in points 1 and 2 above are not restricted to the data itself but also include schemas, queries, concepts and concept vocabularies, all of which need to be registered in open registries and assigned PIDs if they are cited or referenced.
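To make the relationship between repositories, PIDs and state information concrete, the following is a minimal in-memory sketch, not a description of any existing service. All class names, field names and the prefix "21.EXAMPLE" are hypothetical and chosen only for illustration; real PID providers define their own record formats and registration interfaces.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PidRecord:
    """State information a PID resolves to (illustrative fields only)."""
    pid: str
    location: str      # where the object can be retrieved
    checksum: str      # SHA-256 fingerprint of the content
    object_type: str   # e.g. "data", "schema", "query", "vocabulary"


class PidService:
    """Hypothetical stand-in for a PID service provider."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._records: Dict[str, PidRecord] = {}

    def register(self, suffix: str, location: str,
                 content: bytes, object_type: str) -> PidRecord:
        pid = f"{self.prefix}/{suffix}"
        if pid in self._records:
            # A PID must identify exactly one object.
            raise ValueError(f"PID already registered: {pid}")
        record = PidRecord(pid, location,
                           hashlib.sha256(content).hexdigest(), object_type)
        self._records[pid] = record
        return record

    def resolve(self, pid: str) -> PidRecord:
        """Return the state information for a PID (KeyError if unknown)."""
        return self._records[pid]


class Repository:
    """Hypothetical repository that registers every deposit with a PID service."""

    def __init__(self, name: str, pid_service: PidService) -> None:
        self.name = name
        self.pid_service = pid_service
        self._store: Dict[str, bytes] = {}

    def deposit(self, suffix: str, content: bytes, object_type: str) -> PidRecord:
        location = f"{self.name}/objects/{suffix}"
        record = self.pid_service.register(suffix, location, content, object_type)
        self._store[record.pid] = content
        return record


# Usage: data and a schema are both digital objects and both receive PIDs.
service = PidService(prefix="21.EXAMPLE")
repo = Repository(name="demo-repo", pid_service=service)
data_record = repo.deposit("obj-1", b"temperature,12.5\n", "data")
schema_record = repo.deposit("schema-1", b'{"type": "object"}', "schema")
resolved = service.resolve(data_record.pid)
```

The sketch treats schemas the same way as data, in line with point 3, and keeps the PID service separate from the repository so that resolution does not depend on any single repository's internals.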

There are still many issues to be explored and questions to be answered, but we believe that science would be well-served if future scientific data infrastructure projects accepted and followed these high level recommendations.

Thank you for doing this. I think it is a nice summation of many conversations across RDA in the last few years. And while not perfect, I think it does a good job of laying out the core components. My reservations about registries notwithstanding, I do not disagree with any of the suggested essential components. I don't think I have any components to add either, although #10 is extremely broad. Indeed, #10 may be a result of the broad adoption of the first nine components, not an essential component in its own right.

I think the document still needs some editing, and is still a draft, but it could be a very useful reference for navigating the current state of the data infrastructure. It should be updated periodically. Every year or so, perhaps.

Some more specific comments:

Overall, the document could use a little editing to enhance readability and to make the core recommendations clearer.

Section 4 and maybe 5 are the important bits. They need to come forward more. I think sections 2 and 3 could be shortened significantly, and then you don't need the summary section 1.

Please map your table in section 4 of the current state of things more clearly to the 10 recommended components, so we can better assess status.

The status table makes reference to various RDA "groups". It would be helpful if the specific groups were named and linked.

Comments on the table:

Trustworthy repositories -- what is a "DFT certified repository"?

Repository API -- This refers to a new RDA WG. Is this the RDA Repository Core Description WG? If so, it appears to be inactive.

PID registry and resolving system -- Yes, the Handle System works well. The key question is who will run the major handle services. Thoughts on that?

I don't fully understand Section 5. It's a bit scattershot. The principles are not very clear, and they use different terms in different ways, e.g. 'assigned' and 'registered'. It's not clear how these principles relate to the components in section 4, and some of the material is redundant with section 4.

Finally, I think you should incorporate some of the GDOC paper directly into the primary document rather than make it an appendix. Perhaps an intro that briefly discusses the state of principles (current section 2 and 3) and outlines the GDOC concept. Then go to the meat - the actual recommendations (current section 4 and 5).