Laying the firm foundations for a Jisc UK Research Data Discovery Service

The beta version of the Discovery Service is updated every fortnight and a list of changes is sent to the project mailing list and posted on this blog.

Here’s the latest update from our developer Mark Winterbottom.

Technical Update (30 June, 2017)

Key Improvements

This sprint, Mark has been working on creating the custom harvest extension. Here are the items that are completed and working:

Database table models setup.

Database setup command.

Creating new harvest endpoints.

Updating and deleting harvest endpoints in the system.

Adding custom field mappings.

Creating new harvest jobs in the job table.

Mark also made some progress on getting the asynchronous tasks working, although this is still in progress.
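To give a feel for what the items above cover, here is a minimal sketch of how endpoints, field mappings and jobs might relate to each other. It is not the extension's actual code: the class names (`HarvestEndpoint`, `HarvestJob`, `HarvestRegistry`), their attributes and the in-memory storage are all assumptions made for illustration, whereas the real extension uses database tables inside CKAN.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Dict, Optional


@dataclass
class HarvestEndpoint:
    """A remote source to harvest from (hypothetical model)."""
    id: int
    name: str
    url: str
    # Maps a source field name to a field name in the local schema.
    field_mappings: Dict[str, str] = field(default_factory=dict)


@dataclass
class HarvestJob:
    """One queued harvest run against an endpoint (hypothetical model)."""
    id: int
    endpoint_id: int
    status: str = "pending"
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class HarvestRegistry:
    """In-memory stand-in for the endpoint and job tables."""

    def __init__(self) -> None:
        self.endpoints: Dict[int, HarvestEndpoint] = {}
        self.jobs: Dict[int, HarvestJob] = {}
        self._endpoint_ids = count(1)
        self._job_ids = count(1)

    def create_endpoint(self, name: str, url: str) -> HarvestEndpoint:
        endpoint = HarvestEndpoint(next(self._endpoint_ids), name, url)
        self.endpoints[endpoint.id] = endpoint
        return endpoint

    def update_endpoint(self, endpoint_id: int, *, name: Optional[str] = None,
                        url: Optional[str] = None) -> HarvestEndpoint:
        endpoint = self.endpoints[endpoint_id]
        if name is not None:
            endpoint.name = name
        if url is not None:
            endpoint.url = url
        return endpoint

    def delete_endpoint(self, endpoint_id: int) -> None:
        del self.endpoints[endpoint_id]
        # Drop any jobs that belonged to the deleted endpoint.
        self.jobs = {job_id: job for job_id, job in self.jobs.items()
                     if job.endpoint_id != endpoint_id}

    def add_field_mapping(self, endpoint_id: int, source_field: str,
                          target_field: str) -> None:
        self.endpoints[endpoint_id].field_mappings[source_field] = target_field

    def create_job(self, endpoint_id: int) -> HarvestJob:
        if endpoint_id not in self.endpoints:
            raise KeyError(f"unknown endpoint {endpoint_id}")
        job = HarvestJob(next(self._job_ids), endpoint_id)
        self.jobs[job.id] = job
        return job
```

A new job starts in a "pending" state here so that a separate worker could pick it up later, which is roughly where asynchronous tasks would fit in.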

Focusing on next

Mark is moving to the Research Data Shared Service project for a few weeks, so the current development work in RDDS is on hold. The rest of the team will use the time to focus on testing and on ensuring the current set of requirements is updated and prioritised for the next phase of development. Unfortunately, there won’t be any technical updates during this time. The fortnightly cycle of technical updates will restart once Mark is back on the project.

Added the ability to create new harvest endpoints in the custom harvest extension (RDD-320)

Had a productive sprint retrospective and planning meeting with Chris and Dom where we decided on some process improvements for the bi-weekly sprints.

Focusing on next

The next sprint will focus on:

Deploy the CSW and bug fixes to live.

Add more harvest endpoints to live where applicable.

Continue work on the dynamic harvest endpoint.

Other notes

Just a heads up that Mark will be deploying changes to the live site on Monday morning. In order to add support for geo-spatial search, he needs to re-install the RDS Database Instance to upgrade PostgreSQL to a newer version. As a result, this will require some downtime (1 hour maximum).

That’s the end of this update. If you would like any further information about the project, or contact details, check the project page on the Jisc website.

Technical Update (2 June, 2017)

Key Improvements

The focus of this sprint has been to get the remaining institutions into the system. Here are the main items completed in this sprint:

Got the CSW Harvesting working on the Staging site.

Got the geo-spatial search working.

Enabled PostGIS Postgres extension in RDS.

Created an automated build for the Solr Docker image.

Harvested data from CCDC.

Harvested data from the remaining HEIs and Data Centres where an endpoint is available.
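For anyone wondering what geo-spatial search means in practice, the core idea is matching a dataset's geographic extent against the area a user is searching in. The sketch below shows a simple bounding box overlap test in plain Python. It is an illustration only and not how PostGIS or the site implements it; the `BBox` type, the function names and the example coordinates are all assumptions, and boxes crossing the antimeridian are deliberately ignored.

```python
from typing import Dict, List, NamedTuple


class BBox(NamedTuple):
    """A bounding box in degrees (hypothetical helper type)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersects(a: BBox, b: BBox) -> bool:
    """Return True when the two boxes overlap or touch."""
    return (a.min_lon <= b.max_lon and b.min_lon <= a.max_lon
            and a.min_lat <= b.max_lat and b.min_lat <= a.max_lat)


def spatial_search(datasets: Dict[str, BBox], query: BBox) -> List[str]:
    """Return the names of datasets whose extent overlaps the query box."""
    return sorted(name for name, box in datasets.items()
                  if intersects(box, query))


if __name__ == "__main__":
    # Made-up extents, roughly covering Scotland and Cornwall.
    datasets = {
        "scotland-survey": BBox(-7.5, 54.6, -0.7, 60.9),
        "cornwall-survey": BBox(-5.8, 49.9, -4.1, 50.9),
    }
    edinburgh_area = BBox(-3.4, 55.8, -3.0, 56.0)
    print(spatial_search(datasets, edinburgh_area))
```

A spatial database does the same kind of comparison with proper geometry types and indexes, which is why the extension needs to be enabled on the database side.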

The staging site has been rebuilt and I’ll be sending out some emails asking for feedback from specific sites early next week.

Focusing on next

In the next sprint I will focus on creating a dynamic harvester which gives us more control over field mappings in the web interface in CKAN. We’ll be publishing a blog post shortly to explain how this will work.

Technical Update (19 May, 2017)

Key Improvements

Fixed a bug where some harvesting processes were becoming unresponsive. The issue was related to a bug in the Terraform setup scripts and is now resolved (RDD-309).

Created a process for adding unit tests for CKAN extensions which are executed on the Strider Continuous Integration server (RDD-150).

Got the CSW harvester working in the Docker containers. However, there is still some work to do before I can deploy it to live (RDD-306).

Added Science and Technology Facilities Council (STFC) to the site and harvested metadata (RDD-61)

Fixed issue with the ‘format’ and ‘language’ fields being duplicated (RDD-302)

Attempted to harvest from the Natural Environment Research Council, Archaeology Data Service and UK Data Archive; however, I am having issues with the endpoints (I’m working with the site contacts to resolve these).

Worked on CKAN extension that allows us to dynamically configure the field mappings in the config (RDD-303).
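As a rough illustration of what configurable field mappings mean, the sketch below applies a mapping from source field names to local field names to one harvested record. This is an assumption-laden example, not the extension's code: the function name `apply_mapping`, the mapping format and the sample field names are all made up for this post.

```python
from typing import Any, Dict, Optional


def apply_mapping(record: Dict[str, Any],
                  mapping: Dict[str, str],
                  defaults: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Translate a harvested record into the local schema.

    `mapping` maps a source field name to a target field name. Source
    fields that are missing or empty are skipped, so any default value
    for the target field is kept.
    """
    result: Dict[str, Any] = dict(defaults or {})
    for source_field, target_field in mapping.items():
        value = record.get(source_field)
        if value in (None, "", []):
            continue
        result[target_field] = value
    return result


if __name__ == "__main__":
    # Hypothetical mapping, as it might be held in configuration.
    mapping = {"dc:title": "title", "dc:rights": "license_id"}
    harvested = {"dc:title": "Example dataset", "dc:rights": ""}
    print(apply_mapping(harvested, mapping, defaults={"license_id": "notspecified"}))
```

Keeping the mapping as data rather than code is the point: a change for one institution becomes a configuration edit instead of a new release.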

Focusing on next

The next sprint will focus on adding the remaining Phase 2 data centres (when the endpoints are accessible) and the Phase 3 participants.

If you attended the recent webinar for this phase of the project, or read the report on the webinar, you will know that we have launched the beta version of the Discovery Service and will be providing regular updates on the progress of the project.

We are now re-harvesting metadata from participants involved in phase 2 of the project, before harvesting from new participants who volunteered in phase 3. Following the re-harvesting, the focus will be on reviewing the prioritisation of existing and new requirements. The work is being done in two-week sprints on our test server with fortnightly updates to the beta system. When there’s an update to the beta, an email is sent to the project’s mailing list to tell participants what has changed or is new in this version.

From now on we’ll be posting the updates via the blog, as well as the mailing list. If you want to get more involved in the project you can subscribe to the mailing list (JISC-UKRDDS@JISCMAIL.AC.UK) or check this blog for regular updates.

Here’s the first of our updates from our developer Mark Winterbottom.

Technical Update (5 May, 2017)

The focus so far has been on making the system stable, reliable and scalable, as well as harvesting data on the new beta site.

Key Improvements

The beta site is now online and running as Docker containers on AWS (the site should now be noticeably faster and more responsive).

CKAN has been upgraded from 2.5 to the latest stable release 2.6.2.

I have initiated harvest jobs for the following institutions and asked for feedback from the respective contacts:

Introduction

The following post is a report from the first webinar (held on 27 April 2017) for the third phase of the Research Data Discovery Service project. The aims of the webinar were to welcome new participants, provide an update of the project, introduce the new beta version (http://researchdiscoveryservice.jisc.ac.uk), highlight progress and review requirements from phase 2 and 3.

Note: slide numbers are shown in red to show how the text corresponds with the following presentation.

Welcome and introductions

All new participants and existing participants were welcomed to the webinar (Slide 3). Participants from phase 2 were thanked for agreeing to continue to be part of the project. Some of the content from previous webinars and workshops is repeated for the benefit of new participants. This is the first in a series of webinars. Future ones will provide project updates and encourage open discussion. There are still plans for face-to-face workshops during the project, but only when there is a need and it is beneficial to both the project and participants.

The project team for phase 3 were introduced and all contributed to the webinar. They are as follows:

Christopher Brown – Project Manager

Catherine Grout – Project Director

Dom Fripp – Metadata Developer

Ade Stevenson – Technical Innovations Coordinator

Mark Winterbottom – Technical Developer

In phase 2 there were 9 HEIs and 6 Data Centres funded to participate in the project and a further 5 HEIs who volunteered later in the project. Since publicising the project in phase 3 and asking for further volunteers, a further 9 HEIs and two organisations have come forward (Slide 4/5). The aim is to include all HEIs in the UK with a research data collection; this will include all Shared Service pilots and IRUSdataUK pilots too.

Project update and overview

The project (Slide 6) is developing a platform that enables the discovery of research data from across UK higher education institutions and data centres, which will bring a number of benefits to these organisations (Slide 7). These benefits (Slide 8) include an increased visibility and transparency of research data. The project has been through a number of phases. Following the initial pilot (Slide 9), phase 2 funded a number of participants from HEIs and Data Centres to provide metadata for harvesting and work with the project to determine the requirements for a discovery service. There were a number of outputs from phase 2 (Slide 10), including the alpha test system with data harvested from participating HEIs and Data Centres. In phase 3 (Slide 11) the project will move from a test to a production ready, tested service, include metadata harvested from more data sources and implement further requirements. A beta version of the service is now available (http://researchdiscoveryservice.jisc.ac.uk). This will be used as the basis for further development and include a complete re-harvest from all data sources.

So far, within phase 3, the focus has been on promoting the project to expand the number of participants and a lot of technical work has been going on behind the scenes (Slide 12). The following technical update summarises the work that’s been going on to produce this latest beta version:

Mark has been back on the project for 2 months and working on improving Infrastructure.

Alpha site was running on a single server which worked fine for showing the concept but had a few issues:

More than 8 services squeezed onto one server.

Single point of failure.

Disk was filling up with logs and data.

Harvesting process was taking a long time.

Database continued to grow without regular backups.

Manual process for deploying changes (slow and painful to push new updates)

Needed a solution that was secure, scalable, reliable and backed-up.

Decided to split the service up into containers using Docker

Can spread the services across multiple servers.

Can expand services when doing heavy processing like harvesting and resource scanning.

Can shrink resources when not running process intensive services.

Implemented Continuous Integration

Automate the process of pushing new versions to live.

Automate unit testing.

Improve speed at which we can iterate through bugs and features.

Make use of AWS hosted services such as RDS:

More stable, optimised database with automated backups.

Offloads database maintenance to Amazon.

Since returning, Mark has been working on configuring infrastructure and creating container apps.

Continue with dev process from phase 2 where we work in 2 week sprints and bi-weekly updates are sent to the email distribution list.

System status – Review of latest updates to the service

All organisations from phase 2 will have their metadata re-harvested to the new beta system (Slide 13). This includes the volunteer HEIs. Once this is complete we will start harvesting new participants from phase 3. The endpoints for all participants are listed in a Google Doc (http://bit.ly/RDDS3_harvest_status). This includes the current status for harvesting from each endpoint. The objective is to have all these working as soon as possible. When there is an issue, the JIRA ticket listed will provide the relevant details. All participants are included in this document. The new participants are currently in the “backlog” and will be added ASAP. The tickets will be set to “Done” (closed) when complete. Further issues will result in new tickets or tickets could be reopened.

Requirements (Slide 14) are listed and tracked using JIRA (https://jiscdev.atlassian.net/projects/RDD/). The categories of requirements were defined early in phase 2 after requirements gathering at the first workshop. User stories were collected and MoSCoW prioritisation (Slide 15) was used. Requirements were extracted from these user stories and from the HEI/Data Centres requirements reports (Slide 16). Following the latest re-harvesting, the focus will be on reviewing the prioritisation of existing and new requirements and implementing them in two-week sprints. These will be implemented on the beta site with an email going out to the project mailing list showing which requirements have been implemented. The current issues are harvesting and metadata mapping (Slide 17) and we’ll look at other issues once these have been resolved.
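For readers new to MoSCoW, each requirement is labelled Must, Should, Could or Won't, and work is taken in that order. A minimal sketch of that ordering is shown below. The requirement keys and titles are invented for illustration and do not correspond to real JIRA tickets, and the `prioritise` function is a hypothetical helper rather than anything the project uses.

```python
from typing import Dict, List

# Lower number means higher priority; "wont" items sort last.
MOSCOW_ORDER = {"must": 0, "should": 1, "could": 2, "wont": 3}


def prioritise(requirements: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order requirements by MoSCoW category.

    The sort is stable, so requirements within the same category keep
    the order in which they were supplied.
    """
    return sorted(requirements, key=lambda req: MOSCOW_ORDER[req["priority"]])


if __name__ == "__main__":
    # Invented examples, not real tickets.
    backlog = [
        {"key": "REQ-3", "title": "Export results", "priority": "could"},
        {"key": "REQ-1", "title": "Search by keyword", "priority": "must"},
        {"key": "REQ-4", "title": "Social login", "priority": "wont"},
        {"key": "REQ-2", "title": "Filter by institution", "priority": "should"},
    ]
    for req in prioritise(backlog):
        print(req["priority"], req["key"], req["title"])
```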

Metadata

The two key aims for phase 3 centred around metadata concern the quality and representation of the harvested metadata within the CKAN client (Slide 18).

At the end of phase 2, we launched a vote on which metadata fields in the application profile would be of most benefit to a user of the service. The results of this vote are important for two reasons. Firstly, they give a broad consensus around which fields are considered most important for discovery and what the minimal metadata for a record should contain.

Secondly, the vote can be used to order the metadata on screen so that a user sees the important metadata first. This can help simplify a record at the point of discovery (good UX), enable accurate citation and, hopefully, encourage users to click through to the original repository record, which is desirable when there is additional metadata content at source that might be of use.
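As a small illustration of using votes to drive display order, the sketch below tallies ballots and then sorts a record's fields so the most voted fields come first. The ballot format, the function names and the sample data are assumptions made for this example, not the actual vote results or the site's code.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_fields(ballots: List[List[str]]) -> List[str]:
    """Rank field names by number of votes, breaking ties alphabetically."""
    tally = Counter(name for ballot in ballots for name in ballot)
    return sorted(tally, key=lambda name: (-tally[name], name))


def order_record(record: Dict[str, str], ranking: List[str]) -> List[Tuple[str, str]]:
    """Return the record's fields in ranked order; unranked fields go last."""
    position = {name: index for index, name in enumerate(ranking)}
    return sorted(record.items(),
                  key=lambda item: (position.get(item[0], len(position)), item[0]))


if __name__ == "__main__":
    # Invented ballots: each voter lists the fields they find most useful.
    ballots = [["title", "doi"], ["title", "licence"], ["title", "doi"]]
    ranking = rank_fields(ballots)
    record = {"version": "1", "doi": "10.1234/example", "title": "Example dataset"}
    for name, value in order_record(record, ranking):
        print(f"{name}: {value}")
```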

In addition to this, the University of Glasgow will be conducting a piece of work to develop clear information and guidance for service users about the complex area of dataset rights and licences. This work will broadly follow the work that has been done in the cultural heritage sector recently to solve a similar problem (see http://rightsstatements.org/en/).

There has also been discussion with CORE (https://core.ac.uk/) to compare the services and look at potential ways of working together, especially in connecting data to papers.

Phase 3 (next steps)

The next steps for phase 3 (Slide 19) include the implementation of requirements, listed in JIRA (Slide 20), via prioritisation and development sprints. The work still required (Slide 21) includes the following:

Regular releases of Beta with details sent via the JISC-UKRDDS mailing list

Improve usability

Improve search functionality

What are the aspirations for the future service (Slide 22)? The Discovery Service fits within the umbrella of the Research Data Shared Services project (Slide 23), which, under Research @ Risk, is developing a shared service (provided by Jisc) for effective Research Data Management. This offers a number of benefits:

Cost savings and efficiencies

Common approaches and practice

Research system standardisation and interoperability

The discovery service fits within this as a national aggregation service. We will be looking at integrating with the shared service further into phase 3. The “caterpillar” diagram (Slide 24) shows Jisc’s R&D process. Following the discovery and alpha stages, we’re now in the beta stage. The next step is to deliver this as a service. This is most likely to involve the Discovery Service being established (Slide 25) within the Jisc Digital Resources directorate’s set of services (https://www.jisc.ac.uk/content). This will involve consideration of a number of areas including:

Establishing a service team and how this fits with Research Data Discovery Service activities

Various other system admin tasks such as backup, disaster recovery, log config and rotation, DNS, proxying, caching, mail routing, system performance testing, system monitoring

Set up of any required service supporting applications, e.g. wiki for documentation, blogs etc.

Dealing with ongoing developments including necessary developments in response to essential new requirements or ongoing service enhancements

Community building for use of the service

Training / Workshops

Promotional events and social media.

An essential part of the project is ensuring participants provide feedback on how the system is developing, confirm the requirements are implemented and check their metadata (Slide 26). In phase 2 there were a number of advisory groups set up to support the project. Originally, there was going to be one advisory group in phase 3, but so far there hasn’t been a need as all communication is shared via the mailing list. However, we will set up groups as required, especially when we need a more focussed discussion on areas such as technical development or metadata. The JISC-UKRDDS mailing list will continue as the main communication outlet and there will be further webinars to update everyone on progress. Workshops will be held as required for feedback and face-to-face discussions.

Questions

Comments were made during the webinar and these were followed up via email by participants. However, a number of questions were asked and these are collated here.

What are the plans for working with Pure/Elsevier and new Pure API (v5.9), due out in June 2017?

There have been ongoing discussions with Elsevier as part of this project and the Shared Services. The service did work with a previous version of Pure via OAI-PMH. We will endeavour to use the new functionality within v5.9 to harvest into the Discovery Service.
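For context on what harvesting via OAI-PMH involves, the sketch below builds a `ListRecords` request URL and reads the titles and the resumption token from a response. The base URL and the sample XML are made up, the function names are hypothetical, and this is not the harvester the service uses; it only illustrates the standard request and paging pattern of the protocol.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up, heavily trimmed response used only for this example.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    When a resumption token is supplied it is sent on its own with the
    verb, because the protocol treats it as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text: str) -> Tuple[List[Optional[str]], Optional[str]]:
    """Return the record titles and the next resumption token, if any."""
    root = ET.fromstring(xml_text)
    titles = [element.text for element in root.iterfind(".//dc:title", NAMESPACES)]
    token_element = root.find(".//oai:resumptionToken", NAMESPACES)
    has_token = token_element is not None and token_element.text
    return titles, (token_element.text if has_token else None)


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A real harvester would fetch each URL in turn and keep requesting pages until no resumption token is returned.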

I note the service is still linking directly to individual files. As ever, still don’t think this appropriate! Are we retaining this model?

We will look into this functionality to see if it can be improved once the harvesting and mapping work is complete. We want the system to be as easy-to-use as possible and this includes accessing the underlying data. We’ve also been looking at how other data portals work, particularly those built using CKAN.

Do we know when next RD Shared Service pilots’ day is?

This is still to be determined but the Shared Service project will contact all the pilots to let them know.

In my previous post, I described the latest phase (three) of the Research Data Discovery Service. In this post I’d like to describe in more detail the plans for harvesting metadata from other UK HEIs and Data Centres with research data collections.

In phase 2 metadata was harvested from 9 HEIs (Hull, St Andrews, Glasgow, Oxford Brookes, Edinburgh, Oxford, Southampton, Leeds and Lincoln) and 6 Data Centres (Archaeology Data Service, Cambridge Crystallographic Data Centre, ISIS/ICAT – STFC, UK Data Service, Visual Arts Data Service and NERC), all funded to participate in the pilot. The participants also provided a set of requirements for a discovery service, provided harvestable endpoints and helped test the alpha system as it developed. As the project progressed, a further five HEIs (Sheffield, Bath, Nottingham, Lancaster and Bristol) volunteered to be involved and we started to incorporate their metadata into the system near the end of phase 2.

I’m glad to say that all of the participants from phase 2 are keen to continue to be involved in the project. In phase 3 we plan to add as many UK HEIs and Data Centres with research data collections as possible to the Discovery Service. There are a number of institutions that are also part of the Research Data Shared Service and the Research Data Metrics for Usage projects and, if they haven’t already been involved in this project, we will be looking to include them as well. However, at this point we would like to hear from any other institutions that have a research data collection and an endpoint that we can use to harvest the metadata into the Discovery Service. It was clear in phase 2 that some institutions have well established research data management policies and practices, while others are less well advanced. It doesn’t matter what stage of this process you have reached; we would still like to hear from you.

We will be working to enhance the current test service, adding functionality to match requirements, and ensuring there is a fully functional and tested system ready to transfer to a production service (provided it meets the relevant criteria and the business case is agreed within Jisc). Incorporating other participants’ metadata (potentially for all HEIs with research data collections) is an important objective of the project.

If you are interested in being involved in this latest phase of the project, or would like to discuss this further, please contact Christopher Brown.

The latest phase of the UKRDDS will run from October 2016 to September 2017 and follows on from the second phase of the project. This post summarises work from the second phase and what’s planned for this third phase.

Phase 2

This Jisc-led second phase of the project ran from March 2015 to September 2016 and included support from the Digital Curation Centre and the UK Data Service, on HEI and Data Centre engagement respectively. It built on the pilot work with the aim of running a test UK Research Data Discovery Service. The main aim of the second phase was to lay the firm foundations for the service by harvesting metadata from 9 HEIs and 6 Data Centres, each funded to participate in the project. These pilot organisations provided metadata of their research data collections for harvesting, provided a set of user requirements and helped to test the alpha system. The alpha service was made publicly available during development to ensure the research community had the opportunity to test its functionality. This phase came to an end with a final workshop for all participants where the alpha system was tested, requirements were reviewed and the plans for the next phase were presented.

Phase 3

The third phase of the project has the following objectives:

moving the test service from alpha to beta;

enhancing the service by adding further requirements;

incorporating other participants’ metadata (potentially for all HEIs with research data collections);

running as an enhanced beta service to allow for further testing;

at the end of the project have a fully functional system ready to operate as a service (provided it meets the relevant criteria and the business case is agreed within Jisc).

It’s hoped that all participants from phase 2 continue to be involved during phase 3 of the project. In phase 2 there were three Advisory Groups – User, Researcher and Technical & Metadata. However, in phase 3 there will be one Advisory Group with voluntary participation from all those HEIs and Data Centres having their metadata harvested into the service. It’s expected that sub-groups could form to discuss specific issues, for example metadata mapping. This structure will ensure the project continues to get input and feedback from participants to ensure the system satisfies the needs of its users.

At the final workshop (see previous post) valuable feedback was provided as to how to make sure the project is a success. This includes ensuring the project engages with researchers and other users as soon as possible to further test the system and make sure it is satisfying their needs and not just those of the participants. Also, requirements have been mainly coming from the data collection perspective, but these need to be gathered from the user perspective sooner rather than later. Phase 2 focussed on primary types of data but we should look at secondary types of data (see scope of datasets) in the context of researchers using the service. Other use cases need to be considered, such as those from a funder’s perspective. Other questions raised included: What about all the other data internationally in subject based data centres – do we want that or not? Is the distinction between UK and non UK data important? For now, the focus remains with UK datasets.

This work will allow us to move from a test service to a production ready one. We will be able to harvest from more data sources, do more formal and informal system testing, look at further requirements (refining and implementing them) and develop a business case for the service, with the ultimate aim of delivering a more mature and tested service to Digital Resources (the area of Jisc that runs and supports services, such as the Archives Hub).

In developing additional functionality we will review existing requirements set to “won’t” (from the MoSCoW prioritisation process performed early on in phase 2) and out of scope, gather further requirements from the final workshop, and potentially other requirements, and integrate more closely with the Research Data Shared Service work and the IRUSdataUK project.

How to get involved

In phase 2 metadata was harvested from 9 HEIs and 6 Data Centres funded to participate in the pilot. A further five HEIs volunteered to be involved in the project and their metadata was added to the system near the end of phase 2. In phase 3 the plan is to add more (if not all) HEIs and Data Centres that have research data collections. This will necessitate a set of requirements to join the service. These are currently being finalised, but the minimum requirements are:

Research data metadata can be provided

There is a harvestable endpoint

It’s a supported schema

A named contact is available for support to:

Check harvest and metadata

Report issues

Request manual harvest, if required

Liaise with the developers when adding metadata to the service

Jisc will provide a developer/admin to liaise with the support person

If you are interested in being involved in this latest phase of the project, or would like further information, please email Christopher Brown.

The project team and representatives from all participating pilot HEIs and Data Centres convened in London for the third workshop of the UK Research Data Discovery Service on 13 October 2016. This was the final workshop of what is now known as phase 2 of the project, which ran from March 2015 to September 2016 (extended from the original end date of July 2016).

The objectives of the workshop were to review the second phase of the project, discuss what still needs to be achieved in the next (third) phase of the project and how people can be involved and engaged.

Prior to the workshop, all the relevant sources of information were collated on the workshop’s padlet. This includes links to the shared notes, an online app for collecting sticky notes, all supporting documentation and slides.

To collect as much feedback as possible during the workshop, in addition to the exercises, posters were put up titled Questions, Issues, Ideas and a FLAP (Future considerations, Lessons learned, Accomplishment and Problem areas) board for phases 2 and 3. Any notes added to these posters have been transcribed into the shared spreadsheet mentioned in the group exercises.

Presentations

The day started with Catherine Grout describing what had changed in the landscape since phase 2. The research data discovery service sits within a suite of Jisc work called “Research at Risk”, which offers tools, services, advice and guidance to those involved with research data management in the UK. In particular, the Research Data Shared Service will offer a simple solution that meets the needs of institutions and the requirements of funders.

The project is managed by Christopher Brown and he summarised the work of phase 2. This phase had brought the pilot into alpha status, laying the firm foundations for a potential service. A further year of work will make this a more hardened system with further testing and user feedback, making the service more valuable and useful. Further HEIs have been brought into the project, in addition to the original participating pilots.

User stories, supported by a “MoSCoW” prioritisation process, have driven the development of a range of outputs resulting in the alpha system and associated research and documentation (links to the latter, and a list of participants, are available via the padlet).

Recent focus has been on system testing, with changes made on the staging server and using the live server as a benchmark for testing. Harvesting continues, alongside development on other requirements and specific issues (NERC, VADS) and the addition of other HEIs (Nottingham, Sheffield, Lancaster, Bath and Bristol). Feedback on the project and participating pilots’ involvement will be an important method of assessing phase 2 and directing phase 3.

Dom Fripp has worked on metadata mapping for the project. A new “metadata profile document” has been circulated and is open for comments and questions (currently on version 1.1), alongside a mapping document. These documents inform the work of our developer in building the metadata schema into CKAN. The mapping exercise is very important work and is of interest globally – comments are very welcome. This is still a live process, and issues that arise should be shared and reported to be addressed in future development (the example of issues related to migration between DataCite 3 and 4 was noted). In future this documentation will be migrated to github.

Group Exercises

The main focus of the workshop wasn’t to listen to presentations but for participants to engage in a number of group exercises.

The first exercise was to assess and test the current alpha system on the staging server. Delegates could work alone or in groups at their tables. There were four tables and reporting back was done one table at a time. The areas suggested for testing included – your organisation’s metadata; any fields missing; is the harvested data correct; search functionality; presentation of results; usability. These were suggestions and other areas could be tested.

Notes were added to a poster under the categories of Bug, Error and Feedback. These have been transcribed into the following shared spreadsheet (along with notes from the Requirements exercise). These will be reviewed and checked against existing JIRA tickets. For any new issues a new ticket will be created.

The second exercise was a follow on to the first and delegates were asked the following questions:

Does the service satisfy the requirements of your organisation?

What further requirements should be added?

What should be improved?

They were asked to write their answers down on sticky notes and put them on a poster under the following categories: Drop, Add, Keep or Improve.

As with the first exercise, the notes have been transcribed into the shared spreadsheet.

Both exercises provided valuable feedback on the current system and ideas for future requirements.

The Road Ahead

The day finished with Christopher Brown describing plans for the next phase of work and how participants could be involved.

In phase 2 we engaged with participants and gathered user stories, prioritised and implemented requirements based on these user stories, evaluated software and chose CKAN, developed an Alpha system, harvested metadata from participants into this system and are now moving to Beta.

Phase 3 will run from October 2016 to September 2017 and will allow us to move from a test service to a production ready one. We will be able to harvest from more data sources, do more formal and informal system testing, look at further requirements (refining and implementing them) and develop a business case for the service, with the ultimate aim of delivering a more mature and tested service to Digital Resources (the area of Jisc that runs and supports services, such as the Archives Hub).

In developing additional functionality we will review existing requirements set to “won’t” and out of scope, gather further requirements from this workshop, and potentially others, and integrate more closely with the Research Data Shared Service work and the IRUSdataUK project.

It’s hoped that all participants would continue to be involved during phase 3 of the project. This would be at a level expected from all new participants wishing to have their metadata harvested into the discovery service.

The day ended with a thank you to all the participants and the project team for the help and support in running an engaging and productive workshop, and to all the participants who have helped throughout phase 2 of the project.

The following post has been written by Dom Fripp – senior metadata curation developer at Jisc and part of the UKRDDS team.

Earlier this year, Torsten Reimer wrote a blog post entitled “Less is more? A metadata schema for discovery of research data”. In it, he considered what metadata schemas were in use in UK HEI repositories and catalogues to aid the discovery of research datasets. He also raised the possibility that, given the similar motivations acting on the HEIs (funder mandates, research integrity and the increasing awareness of the value of data), the metadata requirements might be very similar, maybe even the same.

His rationale was to compare various metadata schemas from institutional research data repositories and look at what was common between them. This conversation was extended into a Birds of a Feather session at the recent IDCC16 conference in Amsterdam. The outcomes are included as an addendum to his original blog post, from which he drew two conclusions (more on those later).

Torsten’s list compared the metadata fields currently used for research data at Imperial College and Cambridge University. The list of shared fields was:

Title

Author/contributor name(s)

Author/contributor ORCID iD(s)

Abstract

Keywords

Licence (e.g. CC BY)

Identifier (ideally DOI)

Publication date

Version

Institution(s) (of the authors/contributors)

Funder(s) (ideally with grant references; can also be “none/not externally funded”)

At the same time, and without either of us knowing about the other’s work, I was taking a similar approach in the preparation of a metadata schema for Jisc’s UK Research Data Discovery Service. This schema was based on Use Cases and User Requirements, provided by a mixture of HEIs and specialist data centres in the UK. I detailed the process in a recent blog post.

Due to the similarity of the work, it was straightforward to compare Torsten’s list with the newly minted schema. As the table below indicates, there is complete overlap.

Imperial & Cambridge | UK Research Data Discovery Service profile
Title | Title
Author/contributor name(s) | Creator
Author/contributor ORCID iD(s) | Creator identifier
Abstract | Description
Keywords | Keywords
Licence (e.g. CC BY) | License
Identifier (ideally DOI) | Unique Resource Identifier
Publication date | Date
Version | Relation type / related identifier*
Institution(s) (of the authors/contributors) | Publisher / creator affiliation**
Funder | Funder
Grant reference | Project number

* In the UK Research Data Discovery Service metadata profile, following DataCite, Version is handled within the related fields. Rather than being numbered, versions are linked by successor / previous identifiers, and a version number can be derived from a record’s position in that chain, e.g. (adapted from DataCite table 9)

<relatedIdentifier relatedIdentifierType="DOI" relationType="IsNewVersionOf">10.98765/4321</relatedIdentifier>
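To illustrate how linked versions can be turned back into an ordered history, here is a minimal Python sketch. The element and attribute names come from the example above; the three records, their DOIs beyond 10.98765/4321, and the helper names `previous_version` and `version_history` are invented for illustration and are not part of the Discovery Service code.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: dataset DOI -> the relatedIdentifier XML it carries.
# An empty string means the record has no predecessor (it is the first version).
RECORDS = {
    "10.98765/4321": "",
    "10.98765/4322": (
        '<relatedIdentifier relatedIdentifierType="DOI" '
        'relationType="IsNewVersionOf">10.98765/4321</relatedIdentifier>'
    ),
    "10.98765/4323": (
        '<relatedIdentifier relatedIdentifierType="DOI" '
        'relationType="IsNewVersionOf">10.98765/4322</relatedIdentifier>'
    ),
}


def previous_version(xml_fragment):
    """Return the DOI this record supersedes, or None if there is none."""
    if not xml_fragment:
        return None
    element = ET.fromstring(xml_fragment)
    if element.get("relationType") == "IsNewVersionOf" and element.text:
        return element.text.strip()
    return None


def version_history(doi, records):
    """Walk the IsNewVersionOf links back from doi; return oldest first."""
    chain = [doi]
    seen = {doi}
    while True:
        prev = previous_version(records.get(chain[-1], ""))
        if prev is None or prev in seen:  # stop at the start or on a cycle
            break
        chain.append(prev)
        seen.add(prev)
    return list(reversed(chain))


history = version_history("10.98765/4323", RECORDS)
for number, doi in enumerate(history, start=1):
    print(f"version {number}: {doi}")
```

The position of each DOI in the returned list gives its version number, which is the sense in which linked versions "can be numbered accordingly".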

** This is not a clear mapping as the publisher is not necessarily the creator or contributor institution. In the UK Research Data Discovery Service metadata profile, this can be handled in the creator affiliation field.
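The comparison in the table can also be written down as a simple crosswalk. The Python sketch below is only an illustration: the field names are taken from the table, while the `CROSSWALK` dictionary, the `to_profile` helper and the sample record are hypothetical and do not reflect how the Discovery Service actually stores or maps metadata.

```python
# Crosswalk from the Imperial & Cambridge field names (left column of the
# table) to the UK Research Data Discovery Service profile (right column).
CROSSWALK = {
    "Title": "Title",
    "Author/contributor name(s)": "Creator",
    "Author/contributor ORCID iD(s)": "Creator identifier",
    "Abstract": "Description",
    "Keywords": "Keywords",
    "Licence": "License",
    "Identifier": "Unique Resource Identifier",
    "Publication date": "Date",
    "Version": "Relation type / related identifier",
    "Institution(s)": "Publisher / creator affiliation",
    "Funder": "Funder",
    "Grant reference": "Project number",
}


def to_profile(record):
    """Rename the fields of a record; report any field with no mapping."""
    mapped = {}
    unmapped = []
    for field, value in record.items():
        target = CROSSWALK.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical sample record, invented for this example.
sample = {
    "Title": "Example dataset",
    "Abstract": "A short description of the data.",
    "Licence": "CC BY",
    "Embargo date": "2018-01-01",  # not in the shared list, so unmapped
}

mapped, unmapped = to_profile(sample)
print(mapped)
print("No mapping for:", unmapped)
```

Returning the unmapped fields separately, rather than silently dropping them, makes it easy to see where a source repository records something the shared list does not cover.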

This result gives credence to Torsten’s first conclusion in his blog post, that the minimum metadata requirement “…may be, at least partly, a UK-specific issue.”

If it is, then the good news is that there seems to be a lot in common between UK HEI and data centre repository metadata. It can be argued that this indicates a set of common experiences and requirements to which similar metadata fields have been applied. There’s a reason why the profile for the UK Research Data Discovery Service shares a lot with the DataCite and Dublin Core schemas: not only have those schemas been driven by user requirements, but they are also currently sufficient for most cases of discipline-neutral descriptive metadata for research data.

The second part of the preparatory work undertaken to develop the UK Research Data Discovery Service profile was to look at good metadata practice in the schemas that support other research data aggregators and discovery tools around the world.

In terms of establishing what is common between different metadata profiles, I took a similar route to Torsten by looking at a variety of profiles that supported internationally implemented discovery services. The approach is documented in more detail in my previous post. To simplify the task (some of the profiles are expansive) I listed only what was mandatory within the schemas, as a fair assessment of what was core. The results are shown in the following bar chart.

Analysis revealed that even mandatory fields aren’t always common and that very few fields outside of creator, title, type and the resource identifier are considered mandatory in most schemas.
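The kind of tally behind such a comparison can be sketched in a few lines of Python. Note that the schema names and field sets below are invented placeholders, not the data behind the bar chart, and the helper names `mandatory_counts` and `core_fields` are assumptions made for this example.

```python
from collections import Counter

# Placeholder data: schema name -> set of fields it treats as mandatory.
# These are NOT the schemas or results analysed in this post.
MANDATORY = {
    "schema_a": {"title", "creator", "identifier", "type", "publisher"},
    "schema_b": {"title", "identifier"},
    "schema_c": {"title", "creator", "identifier", "type", "date"},
}


def mandatory_counts(schemas):
    """Count how many schemas list each field as mandatory."""
    counts = Counter()
    for fields in schemas.values():
        counts.update(fields)
    return counts


def core_fields(schemas, threshold=0.5):
    """Fields that are mandatory in more than `threshold` of the schemas."""
    counts = mandatory_counts(schemas)
    needed = threshold * len(schemas)
    return sorted(field for field, n in counts.items() if n > needed)


counts = mandatory_counts(MANDATORY)
for field, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(f"{field}: mandatory in {n} of {len(MANDATORY)} schemas")

print("Core fields:", core_fields(MANDATORY))
```

The threshold is a design choice: requiring a field to be mandatory in every schema would leave a much smaller core than requiring it in a simple majority.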

Why is this? There are two reasons I’d like to mention (and many more I won’t have thought of so please feel free to comment below).

Firstly, the metadata requirements for a discovery service are likely to have grown up around the scope and requirements of the project. It is important to stress that this analysis is not comparing the quality or success of the schemas (they support different projects with different aims) but merely the shared fields, to see if the findings shed any light on the potential for an international metadata requirement.

Secondly, aggregators take different approaches to mandatory metadata requirements. ANDS uses the RIF-CS schema, which has many mandatory elements. This is because ANDS is addressing a national solution that runs from creation through to preservation. This includes discovery, value, access and re-use standards that require administrative and disciplinary metadata.

On the other hand, B2FIND (the discovery element of EUDAT) requires only a title and a uniform resource identifier, yet offers integration with Dublin Core, ISO 19115, MarcXML, CMDI and DDI. It too is part of a larger infrastructure, that of EUDAT.

So it seems that there is no consensus to be drawn from internationally implemented schemas. If Torsten’s hunch is correct, and the minimum metadata set is a UK problem, then there is potential to develop a stronger answer during the Jisc Research Data Shared Service pilot. This will bring more HEIs together and get them talking about the research data metadata standards that will play a key role in the resultant infrastructure.

Torsten’s second conclusion in his post was “When engaging in discussions with metadata experts there is no such thing as a pragmatic definition.”

This I agree with. If the ongoing work with the UK Research Data Discovery Service and the broader requirements of the Shared Services Pilot are anything to go by, the pragmatism required is not a matter of definition but of approach. The cumulative effect of use cases, requirements and behaviour in these projects could potentially result in a consensus on minimum metadata requirements in the UK.