Trusted repositories are at the core of the Data Fabric, so we need proper mechanisms to register repositories of different kinds. These registries must allow us to include machine-readable information of various sorts, such as the port for accessing certain services. There are three types of approaches to this component and its services:

There are a number of research-domain initiatives, often meant only for human usage, such as in the humanities (http://digitalhumanities.org/centernet/), which was started by a large number of people worldwide and is hosted at the Maryland Institute for Technology in the Humanities.

The "library" world has pushed registries of trusted repositories that host published data and support linking from DOIs, for example. The two relevant registries are re3data (www.re3data.org) and Databib (www.databib.org).

Those working on computational jobs running at several places, which goes back to the Grid initiative, need machine-processable registries with lots of detailed information on the services being offered by the various centers (and also some human-readable information, for example on security aspects). These requirements probably come close to what is needed for the Data Fabric. This community created the GOCDB specifications and software (http://goc.egi.eu/), which is maintained by EGI (the European Grid Initiative) and which, for example, is also being used in the EUDAT federation. There is additional information available about the GOCDB and its usage.

There may be more initiatives that are using such well-specified registries. I think that it is an urgent issue to exchange the schemas to see how flexible these approaches are and whether they can play a role in a Data Fabric landscape.

Maybe worth a note here: the CLARIN Centre Registry contains only the active CLARIN centres, but it was made with machine-readable access in mind and defines the metadata of a centre in the CMDI metadata format.

We uploaded a paper on data management principles, trends and components, created by a group of authors, to the DFIG wiki to open a broad discussion. Since this paper contains an unsorted and unprioritised list of components, it can be seen as a contribution to what we defined as a next step for the DFIG discussion: extracting components from use cases and our expertise and defining their functions. The authors assume that there will be some agreement, but also disagreement and certainly gaps. Therefore we would like to encourage everyone interested to comment on the content. We see it as important to carry out this discussion in the DFIG wiki so that there is one place for it.

The paper will be presented at various meetings to also promote discussions and motivate the participants to make their comments in the DFIG wiki and to get their colleagues interested in this. The paper got the file name "paris" to indicate that we would like to have a new state of discussion at the Paris plenary. So please motivate your colleagues to submit use case descriptions and comments. We, as chairs, will try to summarise the state of discussions at various moments.

Knowledge is derived from data, which are often derived from other data, which are derived from experiments, which use samples. Tracing forwards and backwards along this chain is an important piece of housekeeping with which IT could help.

There are good standards for this work, notably the W3C's PROV-O. Implementing them is not very hard and will deliver considerable benefit. We intend to do some such work in the West-Life project.
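The kind of derivation chain described above can be sketched without any RDF tooling; the following minimal Python example uses PROV-O's `prov:wasDerivedFrom` and `prov:used` relations as plain strings to trace a lineage from knowledge back to a sample. The entity names (`ex:knowledge`, `ex:sample-42`, etc.) are hypothetical, and a real implementation would use an RDF library rather than tuples.

```python
# A minimal sketch (plain Python, no RDF library) of the provenance
# chain PROV-O expresses: knowledge <- derived data <- raw data
# <- experiment <- sample. All entity names are invented.
PROV_DERIVED = "prov:wasDerivedFrom"
PROV_USED = "prov:used"

triples = [
    ("ex:knowledge", PROV_DERIVED, "ex:derived-data"),
    ("ex:derived-data", PROV_DERIVED, "ex:raw-data"),
    ("ex:raw-data", PROV_DERIVED, "ex:experiment-run-1"),
    ("ex:experiment-run-1", PROV_USED, "ex:sample-42"),
]

def trace_back(entity, triples):
    """Follow derivation/usage links from an entity back to its origins."""
    chain = [entity]
    current = entity
    found = True
    while found:
        found = False
        for s, p, o in triples:
            if s == current and p in (PROV_DERIVED, PROV_USED):
                chain.append(o)
                current = o
                found = True
                break
    return chain

print(trace_back("ex:knowledge", triples))
# -> ['ex:knowledge', 'ex:derived-data', 'ex:raw-data',
#     'ex:experiment-run-1', 'ex:sample-42']
```

Tracing forwards works the same way by matching on the object position instead of the subject.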

...and there is the BioSharing Information Resource, which also has a Registry of Databases in the life sciences, progressively being linked to metadata standards (part of the Registry of Standards). More exist in the ELIXIR programme, and similarly I imagine there are existing resources in other domains too.

In the listed principles (section 4), perhaps a recommendation could be included along the following lines: "Encourage the adoption and re-use of common schemas and schema-related standards for defining metadata models." Chris's PROV-O is a good generic example, and there are many domain-specific examples that could be cited, e.g. those from OGF (such as GLUE2 for describing common entities within an e-infrastructure).

To achieve unambiguous communication I suggest syncing terminology with computer science and common IT practice, in particular "logical" and "physical" stores. These terms come from the database world, where logical and physical representations are different views of the same data (cf. Wikipedia). Here, however, data and metadata are meant. This will lead to massive confusion, so I suggest finding a different wording.

Also, in another place I encountered "data type", which is used in a way very different from its established use in CS and IT. Perhaps a different naming, such as "information types" or "data categories" (neither used normatively, as far as I know), could help to disambiguate.

Further, a technical note on PIDs: in the discussion I find a mix of concerns:

One issue, that is according to my perception underrepresented in the document, is the technical and organizational support of data management plans. This has also been a topic in Karlsruhe and seems to be an issue that is becoming more and more prominent.

Stakeholders require/demand/recommend the creation of data management plans. However, there are a couple of issues here:
1. It is unclear whether the stakeholders hold the projects accountable for following the plan. It seems to be a matter of chance whether a reviewer includes data management as part of the final review and what kind of consequences will have to be enforced.

2. Data management is too complex and too time-consuming for the researcher in day-to-day business, but at the same time projects are too short to actually build up the required competence. Outsourcing to research infrastructures may solve the issue here. CLARIN-D, for example, launched a first version of a data management plan generator where a CLARIN centre will take care of the data management for a project.

3. If the data management is partly handled by an infrastructure, this also means that the infrastructure needs to maintain the required competence, i.e. in the research questions, in archiving and in following up with the projects. It also means that the organizational prerequisites at the infrastructure have to be created, for example tools for following up on deadlines, etc. Though this could be done with project management tools, ticketing systems, etc., at present I don't see that this has been done before or that appropriate workflows exist to support data management over a period of several years.
At present this relies on a "paper" describing the plan and an expert following up on it.

Some important data services are missing from your list: data curation, data searching/querying,…..

It is also important to emphasize the distinction between data discovery and data searching.

By data discovery is meant the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.

By data searching is meant traditional query processing.

Data Management Commons

In my understanding the two basic components of data are its structure and its semantics, and both are discipline-dependent, as each discipline has its own data modeling requirements.

In fact, conventional tabular (relational) database systems are adequate for analysing objects (galaxies, spectra, proteins, events, etc.), but their support for time-sequence, spatial, text and other data types is awkward. For some scientific disciplines (astronomy, oceanography, fusion and remote sensing) an array data model is more appropriate. Database systems have not traditionally supported science's core data type: the N-dimensional array. Some other disciplines, e.g. biology and genomics, consider graphs and sequences more appropriate for their needs. Lastly, solid modelling applications want a mesh data model. The net result is that "one size will not fit all", and science users will need a mix of specialized database management systems.
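The contrast between the tabular and the array view can be illustrated in a few lines of Python; the values are made up and serve only to show that a spatial "window" query is a natural slice on an array but a filtered scan over relational tuples.

```python
# Sketch: the same 2-D gridded measurement as relational tuples vs. an
# array. Values are invented. A "window" query is one slice on the
# array but a filtered scan over the tuples.
rows = [  # relational view: (x, y, value) tuples
    (x, y, 10 * x + y) for x in range(4) for y in range(4)
]
grid = [[10 * x + y for y in range(4)] for x in range(4)]  # array view

# 2x2 neighbourhood around cell (1, 1):
window_rows = [v for (x, y, v) in rows if 1 <= x <= 2 and 1 <= y <= 2]
window_grid = [row[1:3] for row in grid[1:3]]

print(window_rows)   # [11, 12, 21, 22]
print(window_grid)   # [[11, 12], [21, 22]]
```

Array databases such as SciDB make the second access pattern a first-class, indexed operation rather than a scan.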

Also the internal as well as the external data characteristics are discipline-dependent. By external characteristics I mean, for example, data provenance information and data context information, which are discipline-dependent.

By internal characteristics I mean data uncertainty, data accuracy and precision, which are also discipline-dependent!

In light of these considerations, the statement "we need to distinguish the external characteristics from the internal characteristics to ensure that we really can separate common data management tasks from discipline-specific heterogeneity ..." seems inappropriate.

In fact, this statement implies that the heterogeneity is due only to the external characteristics of the data, but unfortunately this is not true!

Central Role for PIDs

In my opinion, here there are some points that need to be stressed:

how is some part of a database to be identified/cited?

how should data stored in a repository that has complex internal structure and that is subject to change be identified/cited?

In essence, it is important to emphasize that PIDs should be assigned at the level of granularity (data sets) appropriate for the functional use that is envisaged.
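One common way to reconcile dataset-level PIDs with finer-grained citation is to mint one PID per dataset and qualify it with a fragment for sub-parts, instead of one PID per record. The sketch below assumes this pattern; the handle prefix and fragment syntax are hypothetical, not part of the original document.

```python
# Sketch of granularity via fragments: one PID per dataset, with
# fragment qualifiers for finer-grained citation instead of one PID
# per record. The prefix/suffix values are invented.
DATASET_PID = "21.T11111/trial-2015-007"   # hypothetical handle

def cite(pid, fragment=None):
    """Build a citable reference at dataset or sub-dataset granularity."""
    return pid if fragment is None else f"{pid}#{fragment}"

print(cite(DATASET_PID))                                  # whole dataset
print(cite(DATASET_PID, "table=adverse_events&rows=1-500"))  # one part
```

This keeps the number of registered identifiers proportional to the number of functional units of citation, not to the number of records.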

Registered Data and Trusted Repositories

Here, I think that it is important to mention the classification of data collections proposed by [National Science Board 2005, “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century”]:

Scientific data are stored in managed data collections/databases. Data collections fall into one of three functional categories:

Research Data Collections

Community Data Collections

Reference Data Collections

Each of these types of collections has its own curation, registration and access procedures and rules, standards, and set of users.

Physical and Logical Store

This section completely ignores the achievements of the database research community. The distinction between the physical and logical levels of data description is a founding concept of the database community. This distinction was introduced in 1970 (!) with the definition of a formal data model, the relational model, which permitted the creation of software companies (e.g. Oracle) building database management software. It allowed the development of data applications independent of the storage structure of the data. Actually, the database community introduced three levels: physical schema, logical schema and conceptual schema.

In data-intensive science a new level, the metadata level, has been introduced. However, this metadata level does not substitute for the logical level (the well-known logical schema) of database management systems.

We can say that, to a certain degree, the logical level describes the semantics of the internal characteristics of the data, while the metadata level describes the semantics of the external characteristics of data.

Federations

Here the term "data federation" seems inappropriate to me. We can speak about "federations of data repositories and data centers" or "federated data architectures", but not data federations; on the other hand, it makes sense to use the terms data integration and data fusion.

Change Data Culture

“Convince researchers to adhere to a simple high-level data model with digital objects being registered and metadata described”

It is not clear what you mean by "high-level data model": the logical level, the conceptual level? And what modeling techniques should be adopted to define this "simple high-level data model"?

It seems to me that it is not possible to have "a common high-level data model", as the various scientific disciplines have very different data modeling requirements.

Discoverability

When you are speaking about data discovery I suppose that you mean the capability to find data that support research requirements.

The use of PIDs allows a researcher to pinpoint the location of relevant datasets or (portions of) databases. I think that assigning PIDs to single data items is unfeasible. Therefore you need search and query capabilities to find the required data contained in the datasets/databases identified by the PIDs.

In the big data era traditional query processing is not adequate; other paradigms of information seeking have been proposed (data exploration, knowledge-based data retrieval, query intent, etc.) and I think it is worthwhile to mention them, as well as the technologies enabling these new paradigms, e.g. database cracking and adaptive indexing.

You need not only metadata registries but also discipline-specific ontologies, taxonomies, data dictionaries, data inventories, etc. to efficiently support data discovery.

Interpretation and Re-use

Here, the role of ontologies is not considered at all. Ontologies constitute a key enabling technology, as they provide the semantic underpinning that enables reuse of research data. Current research is exploring the use of formal ontologies for specifying content-specific agreements for a variety of data/knowledge reuse activities.

Data interpretability is a necessary but not a sufficient condition to guarantee data (re)usability. This is important and should be stressed.

Data Management/Stewardship

Honestly, this section is very poor. The efforts of the database community to build database management systems able to efficiently manage scientific data are completely ignored: the open-source SciDB, MonetDB, high-performance DBMSs, triplestores for the storage and retrieval of triples, etc.

Also, recent technologies for managing big data, like Hadoop, are not mentioned.

A Report on Data Management Trends cannot ignore all these projects, systems, and technologies that are very relevant for an efficient management of large scientific data collections.

Metadata System

My understanding of a metadata system is a system able to store and query metadata. If this assumption is correct, then what is the difference between a metadata system and a metadata registry?

Schema Registry System

My understanding (as a member of the database community) is that each local database system has its own schema that describes, from a logical point of view, the data stored locally. The registry service is an infrastructural service provided by a data infrastructure for the purpose of helping users to identify the local system where the desired data are stored and the conditions under which access to this system is permitted. In a certain way, we can say that the registry contains meta-information about the different local schemata.

It could also be very useful to have a data service registry that helps users look for a given data tool/service. To be able to create a data service registry it is necessary to describe formally the functionality of a data service/tool. Unfortunately, such metadata models are currently missing.
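To make the idea concrete, here is a minimal sketch of what such a service metadata model could look like: a service is described by its function and by the input and output types it handles, so the registry can answer "which tool converts X to Y?". All service entries and type names below are hypothetical illustrations, not an existing standard.

```python
# Sketch of a metadata model for a data service registry: describe
# each service by function plus accepted/produced data types, then
# query by desired input/output. Entries are invented examples.
from dataclasses import dataclass

@dataclass
class ServiceDescription:
    name: str
    function: str          # e.g. "format-conversion", "validation"
    inputs: frozenset      # accepted input media types
    outputs: frozenset     # produced output media types

registry = [
    ServiceDescription("csv2rdf", "format-conversion",
                       frozenset({"text/csv"}),
                       frozenset({"application/rdf+xml"})),
    ServiceDescription("schema-check", "validation",
                       frozenset({"application/xml"}),
                       frozenset({"text/plain"})),
]

def find_services(registry, wanted_input, wanted_output):
    """Return names of services that map the given input to the output."""
    return [s.name for s in registry
            if wanted_input in s.inputs and wanted_output in s.outputs]

print(find_services(registry, "text/csv", "application/rdf+xml"))
# -> ['csv2rdf']
```

A real model would of course need richer descriptions (parameters, preconditions, quality of service), which is precisely the gap the comment points out.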

I suggest that this Report should encourage the development of such metadata models.

Final General Comment

During the last 35 years, data management principles such as physical and logical independence, declarative querying and cost-based optimization have led to a multi-billion dollar industry. More importantly, these technical advances have enabled the first round of business intelligence applications and laid the foundation for managing and analyzing Big Data today. The many novel challenges and opportunities associated with Big Data necessitate rethinking many aspects of these data management platforms, while retaining other desirable aspects.

(Challenges and Opportunities with Big Data – A community white paper developed by leading researchers across the United States)

A Report on Data Management Trends without the active participation of members of the database community, who have spent many years conducting research on topics very relevant to the management of data, is destined to be incomplete and inadequate as a reference document.

Thanks for the incoming comments. We just requested a session for the Paris plenary meeting. The intention is to collect many comments and use cases so that we can have deep discussions at the Paris meeting. I will wait until July to go through all comments in detail.

- The term "open data" in section 4.1 needs to be clearly defined or referenced. There are differing interpretations, such as open data with or without authentication.

- Section 5.4 on the metadata system. In the earth sciences we are moving towards capturing metadata at the point of collection and preserving it through to data delivery, e.g. OGC Sensor Web Enablement. This will hopefully reduce the resources required for archival and ingestion of data by data repositories.

Data Management (2.3)

We agree with previous comments that this section needs further refinement.

If a huge number of digital objects receive identifiers that are persistent for at least the object's lifetime, we need adequate management tools and accepted processes that allow for a significant amount of automation of typical data management tasks. A first suggestion for a principle such tools should support is to move away from treating every object in singular form and to work towards repeatable actions on larger numbers of similar objects. The reasoning here is that the increasing number of objects must still be managed with the same limited amount of resources.

The document already proposes to use persistent identifiers as the primary tokens for object access. Establishing identifier services at large scale is, however, a costly and time-intensive effort; to some extent this is because such identifiers must be managed just like the objects themselves. Establishing management for identifiers only therefore seems wasteful; rather, the management of objects and identifiers should work through the same mechanisms as much as possible. To enable management of objects beyond a view focusing on single items, adequate mechanisms should, for example, be able to select objects by their most important characteristics or aggregate them at multiple levels of granularity, and provide basic CRUD operations on such object collections. This should be part of a step-by-step transition that leads from classic file systems to a PID-based data organization approach. Putting forward such a strategy should help with building acceptance for PID-based solutions and addressing scalability concerns.
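The collection-level operations described above can be sketched as a few lines of Python: select registered objects by a characteristic, then apply one repeatable action to the whole selection instead of touching objects one by one. The PIDs and attributes are invented for illustration.

```python
# Sketch of managing PID-registered objects in bulk: select by a
# characteristic, then apply one repeatable action to the selection.
# PIDs and attribute values are invented.
objects = {
    "pid/001": {"type": "image", "state": "raw"},
    "pid/002": {"type": "image", "state": "raw"},
    "pid/003": {"type": "table", "state": "raw"},
}

def select(objects, **criteria):
    """Return the PIDs of all objects matching every criterion."""
    return [pid for pid, attrs in objects.items()
            if all(attrs.get(k) == v for k, v in criteria.items())]

def bulk_update(objects, pids, **changes):
    """Apply the same attribute changes to every selected object."""
    for pid in pids:
        objects[pid].update(changes)

images = select(objects, type="image")
bulk_update(objects, images, state="curated")
print(images)                          # ['pid/001', 'pid/002']
print(objects["pid/003"]["state"])     # 'raw' (untouched)
```

The point of the sketch is the shape of the operations, not the storage: the same select-then-act pattern works whether the catalogue lives in a file system, a database, or a PID registry.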

Persistent Identifiers (2.4 / 5.1)

The document should include the viewpoint that, for PIDs to provide added value over other forms of identification, we will need smarter resolvers that offer additional services beyond getting from an identifier to an object location. Examples of such services include being able to retrieve an object's metadata or licensing information, or to learn about possible processing services or aggregation mechanisms. Registries (section 5.8) for such added-value services at the resolvers' level are also needed and should be maintained by recognized international organizations.
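The resolver behaviour described above can be sketched as a record with typed entries, where a plain lookup returns the location and additional queries return the added-value services. The record contents and type names below are hypothetical.

```python
# Sketch of a "smart" resolver record: beyond the object location, the
# resolver answers typed queries for metadata, licence, and processing
# services. Record contents and field names are invented.
records = {
    "pid/abc": {
        "URL": "https://repo.example.org/objects/abc",
        "METADATA": "https://repo.example.org/meta/abc",
        "LICENSE": "https://repo.example.org/license/abc",
        "SERVICES": ["https://compute.example.org/thumbnail"],
    }
}

def resolve(pid, record_type="URL"):
    """Return one typed value from a PID record, not just the location."""
    return records[pid].get(record_type)

print(resolve("pid/abc"))              # plain location lookup
print(resolve("pid/abc", "LICENSE"))   # added-value query
```

This mirrors how handle-style records already hold multiple typed values per identifier; the "smartness" lies in standardizing which types exist and registering them centrally.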

As a consequence of the high value that is placed on persistent identifiers and their relation to data management as explained above, we also need coherent organizations that support such approaches.

An operational setup for such organizations must find a compromise between two goals:

1. Different communities have varying requirements which cannot be ignored. Policies that are essential to one community (such as long-term availability of identifiers for well-curated objects) may be a hindrance for others (who want to manage large numbers of objects with a limited lifetime).

2. Nonetheless, it is inefficient to establish completely separated organizations for these purposes. The basis for agreement should be that digital objects need identifiers and the provisioning of such identifiers must be well managed. In addition, there should be an open set of operational service providers that offer operational policies and added-value services geared towards specific community needs. There can and should be some competition among these service providers; however, the general principles of object identification and value-added services should remain a common element. Already existing legacy systems must be integrated to reduce the impact on the scientific communities that currently use them and operate services depending on them.

Providing general identifier services for different kinds of entities is a commonality stretching across initiatives such as DONA/EPIC, THOR and EUDAT. These organizations should come to a coherent view and sustainable business model that acknowledges both points 1 and 2.

Thanks for your interesting response; the comments are really helpful. Feel free to put the points on the wiki.

With respect to the use cases, preparatory work is currently going on. The status is:

a) CORBEL

We have the project proposal. 28 September will be the kick-off in Paris and I will give a presentation at the BioMedBridges Meeting (17 November 2015). I hope that we are a step further in 2-3 months.

b) Meta-registry

Currently we are preparing a paper describing the state of the art, the model and the feasibility of the meta-registry. We hope to have a first pre-final version during the summer.

The "Paris" paper is fully in line with our thoughts and corresponds to the discussions we have had so far in ECRIN. We agree with the principles and proposals formulated. There are some comments:

a) There should be a strong point made to link data management principles to the actual workflow of generating data. Let me give you an example: a PID should be assigned to a dataset if it is useful and necessary. If you build up a clinical trial database you will continuously add and change data. No PID is necessary here, because the audit trail stores all actions. A PID should be assigned, for example, when the database is cleaned and frozen, which is a definite working step in the workflow of clinical trials. This frozen database should be automatically linked to a PID. So the idea should be to define steps in the workflow of data management which are clearly defined and have implications (e.g. the frozen database is used for analysis and the statistical report). In summary, the mapping between the workflow and the proposed data management actions should be explored more deeply in order to have reliable and efficient procedures. Although this is context- and discipline-specific, general principles could be formulated (PID for source data, PID for data used for final analysis, etc.).

** This is a huge issue in almost all communities. Almost all have somewhere "dynamic data", as we call it, and indeed the questions are a) how to make it quasi-static and citable and b) when it is really static. There is also the issue of granularity: at what level do you want to assign PIDs, and where do you use fragments added to PIDs? In RDA the group on (dynamic) data citation just finished, and I think that they are giving very clear guidelines. Please have a look, excellent people. In RDA EU we will have a team that is ready to write guidelines etc. from 1.9, and I could imagine that one of their jobs is to write guidelines on PID usage, for example together with the communities.

b) There are several approaches available to assign PIDs to digital objects. I assume that it will not be possible to harmonize these procedures, so a more federated approach relying on what is available should be preferred. Publishers, for example, will rely on the DOI system because there has been major investment in it. A worldwide, highly available and scalable PID system (see 5.1) is an excellent idea, but its feasibility has to be questioned. I would prefer to develop a strategy built upon what exists and what can be done for those cases where currently no PID is used.

** The DOI system is indeed worldwide, and one has a few service providers such as DataCite and CrossRef. For general handles we have set up the DONA Foundation, which is worldwide with an increasing number of so-called MPAs: one will be the DOI Foundation, one will be GWDG Göttingen, the Chinese already have one, the Russians want one, the French want one, etc. So I see that we will have under the DONA Foundation a worldwide network of service people, and already now one can use these services; you could register at GWDG, for example. We will also write more about this. There will be some cases where people want to stick with what they have. But if in the Internet of Things every device has a PID from a worldwide service, people will adapt. But let's see.

c) In the paper PID for digital objects and ID system for actors are explored. For me this is not enough because you may expand to PID of projects (e.g. related to funders) or to PID of institutions (e.g. legal sponsors in clinical Trials, University departments). It may be much more important to know that a specific dataset is linked to an institution than to an individual person.

Fully agreed. Our accelerator colleagues, for example, who deal with complex equipment where configurations (filters, etc.) change for each measurement, have already started to create configuration files automatically, give them a PID and cite them. So yes, you are right, and there is nothing preventing you from assigning PIDs to referable stuff.

d) The idea with trusted repositories is good. In ECRIN we have developed a data centre certification programme for clinical trial units. This also covers long-term preservation of clinical trial data. For your information, I have included a document related to our current standards. It could be interesting to initiate a discussion between RDA and ECRIN about the certification issue, because we also think about a trusted repository for clinical trial data.

Did you have a look at what the WDS-DSA group is doing? They bring together the certifications from the World Data System and the Data Seal of Approval towards one set of criteria, and at least to me it is obvious that we are at a starting point. Your community has much experience with even more fine-grained rule systems, and it would be great to arrange a meeting between your experts and this RDA group.

With respect to the use cases, it is not clear to us what the role of these use cases will be. Within ECRIN we are currently working on two major use cases:

a) Providing access to patient-level clinical trial data

This use case is part of the 4-year H2020-funded project CORBEL and will be led by ECRIN. The aim of this project is to build up a repository of patient-level clinical trial data to be used for reanalyses, secondary analyses, meta-analyses or subgroup analyses. It consists of different steps: establishment of a multi-stakeholder patient-level taskforce, building consensus on the procedure and IT solutions, and development of pilots/demonstrators.

b) Meta-registry for linkage and identification of clinical trial documents and data

This is a proposal currently discussed with ECRIN, ELIXIR, EUDAT, EGI and other partners. The idea is to provide a generic, standardized and federated approach to bring together all clinical trial data source providers, with the objective of allowing users to identify all documents and data belonging to one clinical trial and to return metadata about the data objects identified. As a first step, a publication describing the current status of clinical trial data sources and the data model is planned.

** All the use case work is about components that are needed for efficient management, access and re-use of data. So a description of the essentials of CORBEL seems to be of great interest, since you must have done some analysis of the essential components to be set up. The same probably holds for the Meta-registry. So if you have short summarizing descriptions with a kind of architecture diagram, that would help a lot in moving the discussion about common components ahead. Once I sat together with Wolfgang to describe the basic architecture in ECRIN, which was a very useful enterprise. The description is still being used.

It would be interesting to hear from you what kind of link between these use cases and the RDA use cases could be made.

I hope that I managed to look at all the comments which I received via this wiki channel and in personal emails. Here are a few responses from my viewpoint. I hope that by the Paris plenary the other co-authors will also have made up their minds. Anyhow, I would like to thank all the people who commented on the document, since it was quite useful, for me at least, to see where people have different views etc.

At the Paris plenary we will have two sessions about these issues, and I hope that we can address some of the issues that were raised. In our sessions we will also spend some time discussing the issues of testing/testbeds/adoption/experience aggregation etc. and the roles RDA, DFIG and others could play. We will come up with a first suggestion for an agenda very quickly.