Provide tools for current DSpace repositories to migrate to these schemas (i.e., edit their metadata registry and data) if desired, including tools for migrating elements not compliant with DCTERMS to the "local" registry.

Outstanding issues for committers and community

Is it possible to ultimately implement DCTERMS with full functionality (vocabularies, etc.)? What changes to the data model will be necessary?

How will this proposal integrate with other suggested changes to DSpace metadata, including Proposal for Metadata Enhancement? How might it affect integration with Fedora? How might it affect other desired changes to metadata in DSpace, including implementing functional structured metadata such as MODS, METS, and PREMIS?

What challenges will this proposal present—or solve—for harvesting?

To enable repositories to migrate existing metadata to the DCTERMS schema, we will need to develop robust tools for repositories to deploy. (Note: A curation task has been added to 4.0.)

Should DSpace admin/internal metadata (not including DIM) have its own schema ("dspace"), or use 'local' schema?

Recommendation background

Update current default 'dc' schema in DSpace metadata registry to current standards

Background:

The default DSpace metadata registry ships with the 'dc' schema (which is the default DSpace metadata schema). It was designed to comply with the Dublin Core Libraries Working Group Application Profile, modeled on flat, extensible Qualified Dublin Core.

[The default DSpace metadata registry schema has "namespace" = http://dublincore.org/documents/dcmi-terms/ and "name" = dc. This use of "dcmi-terms" should not be confused with DCMI Metadata Terms. It is a qualified 'dc' schema for DSpace based on an application profile, the Dublin Core Libraries Working Group Application Profile (LAP). DSpace used the LAP as the starting point for its application of Dublin Core, borrowing most of the qualifiers from it and adapting others to fit; some qualifiers were also added to suit DSpace needs. The 'namespace' it declares is not a DCMI namespace. The default DSpace schema is neither dc: (namespace http://purl.org/dc/elements/1.1/, the collection of legacy properties that make up the Dublin Core Metadata Element Set, Version 1.1 [DCMES]) nor dcterms: (namespace http://purl.org/dc/terms/, the collection of all DCMI properties, classes, and encoding schemes other than the properties in DCMES Version 1.1, the classes in the DCMI Type Vocabulary [DCMI-TYPE], and the terms used in the DCMI Abstract Model).]

DCMI has not updated its Qualified Dublin Core standard since 2005. The community standard has shifted towards DCMI Metadata Terms, which, unlike QDC, is not a flat schema based on the schema.element.qualifier format. DCTERMS includes range and domain values, and a particular term may link to another term that it refines or is refined by (for example, the DCTERMS term "hasPart" refines "relation", and "created" refines "date").

Rationale:

DCTERMS is the currently maintained DCMI standard.

As Sarah Shreeves recently commented: "I want to strongly urge the group to look at conforming with DCMI terms (http://dublincore.org/documents/dcmi-terms/) - even if we can't conform to the vocabulary, etc, this is the most up to date and current form of the namespace. If we use the dc qualifiers document we will be perpetuating the same problem, IMO. I think we can, as Tim suggests, have a graceful path forward. I will admit that a real part of my fear of just moving to DC Qualified is that DSpace--in terms of metadata--will continue to be seen as out of touch with where much of the metadata world is headed."

Also, from http://dublincore.org/documents/dces/: "Since 1998, when these fifteen elements [dc: namespace] entered into a standardization track, notions of best practice in the Semantic Web have evolved to include the assignment of formal domains and ranges in addition to definitions in natural language. Domains and ranges specify what kind of described resources and value resources are associated with a given property. Domains and ranges express the meanings implicit in natural-language definitions in an explicit form that is usable for the automatic processing of logical inferences. When a given property is encountered, an inferencing application may use information about the domains and ranges assigned to a property in order to make inferences about the resources described thereby. Since January 2008, therefore, DCMI includes formal domains and ranges in the definitions of its properties. So as not to affect the conformance of existing implementations of "simple Dublin Core" in RDF, domains and ranges have not been specified for the fifteen properties of the dc: namespace (http://purl.org/dc/elements/1.1/). Rather, fifteen new properties with "names" identical to those of the Dublin Core Metadata Element Set Version 1.1 have been created in the dcterms: namespace (http://purl.org/dc/terms/). These fifteen new properties have been defined as subproperties of the corresponding properties of DCMES Version 1.1 and assigned domains and ranges as specified in the more comprehensive document "DCMI Metadata Terms" [DCTERMS]. Implementers may freely choose to use these fifteen properties either in their legacy dc: variant (e.g., http://purl.org/dc/elements/1.1/creator) or in the dcterms: variant (e.g., http://purl.org/dc/terms/creator) depending on application requirements. The RDF schemas of the DCMI namespaces describe the subproperty relation of dcterms:creator to dc:creator for use by Semantic Web-aware applications. Over time, however, implementers are encouraged to use the semantically more precise dcterms: properties, as they more fully follow emerging notions of best practice for machine-processable metadata."

Lock down the schemas, offering migration tools to pull out local customizations and push them into a new 'local' schema. Make it possible but not easy to delete or edit elements in the DCTERMS schema. Continue to enable the addition of qualifiers in the 'dc' schema.

For staging purposes, we recommend that DSpace ship with 4 registries in Phase 2, to support ultimate migration to DCTERMS and to standardize namespaces by pushing local customizations not compliant with DC or DCTERMS into a local schema:

1) 'dcterms' (DCTERMS) - which will be the default metadata schema

2) 'dc' schema

3) 'dspace' schema for system/admin metadata

4) 'local' schema - which would ship with some elements migrated out of 'dc' because they are not compliant with QDC, and which would be enabled for local customizations
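As a sketch, a 'local' schema entry could be distributed in the same registry XML format that DSpace's registry loader already reads. The namespace URI and the citation field below are illustrative examples only, not proposed standards:

```xml
<dspace-dc-types>
  <!-- Declare the 'local' schema itself -->
  <dc-schema>
    <name>local</name>
    <namespace>http://example.org/local/</namespace>
  </dc-schema>
  <!-- One example field migrated out of 'dc': article start page -->
  <dc-type>
    <schema>local</schema>
    <element>citation</element>
    <qualifier>startpage</qualifier>
    <scope_note>Example local field: first page of a journal article.</scope_note>
  </dc-type>
</dspace-dc-types>
```

Shipping the 'local' schema as a registry file like this would let sites load it (or extend it) without touching the locked-down 'dcterms' and 'dc' registries.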

Relevant JIRA tickets

(please add any JIRA tickets that could be affected by this proposal!)

Areas/processes that will be affected by registry update

What areas and processes will be affected by these shifts? Is there any documentation of what features in DSpace are making use of certain fields? Where will the code be affected? Where are metadata elements hardcoded?

16 Comments

As a repository manager (I'm technically proficient but not a programmer), one of the issues we have with the existing DSpace metadata schema is that DC isn't sufficient for describing publication items (i.e., research outputs). We needed to provide a solution for a complex project that involved melding the existing repository content, which is mainly theses, into the institutional collection for research outputs. We implemented a metadata crosswalk between our DSpace repository and the research outputs management system to transfer data from one system to the other. Because Dublin Core doesn't support metadata at the article level (e.g., start-page, end-page, etc.), we had to create a local schema for the crosswalk to achieve interoperability with other research management systems, which is not ideal.

We need something more granular and beyond the idea of ‘core metadata’ for simple and generic resource description.

Can we please have some feedback/comments on the above from the committers?

I agree with Yanan. DSpace as it is does not support granular metadata. At the same time, the simple structure of element, qualifier and authority makes it easy to extend the metadata set and adapt it to our own needs.

The customization of the metadata format has been done by the community in different ways, in most cases with the same goal: the extension of the existing qualified DC, or the creation of a new schema with two levels (element and qualifier). This extension is necessary for defining granular information, like conference name, location, start and end page, etc. This granularity can then easily be used for exporting to other formats like MODS and MARC, while it is also available for import from existing databases or through reference managers.

The current metadata format is simple and flexible. However, it is based on old DC definitions, which do not work for harvestable standards beyond DC. Internally the simplicity should be preserved, but there is still a need to apply richer metadata standards. All the extra elements (see examples above), which have so far been defined in many different ways, should be standardized. Tools to rework the granular elements into different metadata formats (in the first place for harvesting) should be available, not only as a translation of qualified DC as now. All the existing values should be available for harvesting: not only the elements, but also the authority and language values. In my opinion, the implementation of authority values containing unique identifiers (ISSN, DOI and, surely, URI - related to linked open data) could turn out to be the most important metadata development in DSpace, but it is at the moment not translated into harvestable metadata.

The main functions of a repository should be the use of a submission module to collect content and the delivery of quality metadata that can be completely harvested with all the meaningful values. Type is an important structuring element for metadata, which should be better supported in the submission interface. There is generic metadata, but beyond that, different types (book, book section, journal contribution, interview, ...) have specific elements. That should help to define the necessary granularity. There is also metadata available from databases in different formats (e.g. RIS, BibTeX), which are more granular than the qualified DC used in DSpace. This should be resolved too.

These ideas are based on our experience with the development of OceanDocs and AgriOcean DSpace. Gradually, we became convinced that we needed better metadata handling than a basic DSpace can offer. We therefore worked on three levels:

Adaptation of the submission module, using a type-based submission interface. For every type only the relevant fields are shown

Creation of extra elements to refine the metadata, e.g. for a journal reference: journal name, volume, issue, start and end page. We simply extended the existing qualified DC, not bothering to create a new schema for our extra elements. For us, it is simply an internal representation which should be translatable to standards. For consistency, we concatenated some of the fields into existing qualified DC fields.

Extension of the crosswalk tools for exposing metadata formats in OAI. First, all the values in the metadata value table can be used. Reformatting tools make it possible to create rich metadata. AgriOcean DSpace supports metadata formats like MODS, VOA3R AP and AGRIS AP. We have also started to use authority values containing URIs for AGROVOC terms, as attributes in MODS but as element values in VOA3R AP. It is on this level that we have to follow standards which go beyond DC and DC translated to other formats.

For me, this proves that a simple model can work for internal use. I agree that updating is necessary. This update should provide a more standardized granular approach, where adaptation and extension are still possible. DSpace can only provide good quality metadata by using a good submission module. Finally, crosswalk tools are needed to translate internal metadata to rich standard formats (in the first place for OAI harvesting and, in a second stage, for exposing Linked Open Data).

During the discussion of this agenda at OR2013 at the DSpace 'committers' meeting on Monday, I volunteered to provide some tool assistance to facilitate the program. I have completed a draft of the first tool, but before I offer it as a patch to the codebase, I wanted to make sure it addressed the basic needs (mostly phase 1 stuff, but could be generally useful). Please let me know if there is functionality not described here that would be valuable. Here's a description of the 'MetadataMapper' tool:

Basics: it is written as a curation task, so it can be deployed to any DSpace version 1.8 or later, i.e. without waiting to upgrade the DSpace instance. It might make sense to bundle it with 4.0, however, so that it comes 'right out of the box'.

Functions: The user defines a set of desired metadata transformations in a simple map:

dc.contributor.author -> dc.creator

dc.embargo.terms -> dspace.embargo.terms

....

This map is placed in a config file read by the curation task, which will then take all the metadata values found (if any) in the left side and move them to the right side.
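A minimal sketch of how such a map file might be parsed into a lookup table (a hypothetical helper, not the actual MetadataMapper code; the handling of blank lines and '#' comments is my assumption about the config format):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parses lines of the form "source.field -> target.field" into an
// ordered map from source field name to target field name.
public class MapParser {
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments (assumed syntax)
            }
            String[] parts = trimmed.split("\\s*->\\s*");
            if (parts.length == 2) {
                map.put(parts[0], parts[1]); // left side -> right side
            }
        }
        return map;
    }
}
```

Keeping the map ordered means the transformations run in the order the administrator wrote them, which matters if one mapping's target is another's source.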

Move means that they are copied from the source to the target, then deleted from the source. As with all curation tasks, these move operations can be done to a single Item or Items (one by one), to all Items in a collection, to all Items in a community, or to the whole repository. You can run the task as many times as you like, either in the Admin UI (Manakin only) or from a command line.

The tool can add some special handling to these operations depending on how the metadata has been set up. There are 3 cases:

(1) Replacement - this means that whatever is on the right side is removed and replaced with what is on the left

(2) Merge - this means that the left side values are added to the right side, but any existing right side values are preserved

(3) Assignment - this means that the left side is moved to the right only if there is nothing on the right side
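The three cases above can be sketched as follows, assuming each metadata field holds an ordered list of values. This is an illustrative helper under those assumptions, not the actual MetadataMapper implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Models the three move modes: each decides how left-hand (source)
// values combine with any existing right-hand (target) values.
public class MoveModes {
    public enum Mode { REPLACE, MERGE, ASSIGN }

    public static List<String> apply(Mode mode, List<String> left, List<String> right) {
        switch (mode) {
            case REPLACE:
                // (1) Whatever is on the right is removed and replaced by the left.
                return new ArrayList<>(left);
            case MERGE: {
                // (2) Left-side values are added; existing right-side values are kept.
                List<String> merged = new ArrayList<>(right);
                merged.addAll(left);
                return merged;
            }
            case ASSIGN:
                // (3) Left moves to right only if the right side is empty;
                //     otherwise the existing right-side values win.
                return right.isEmpty() ? new ArrayList<>(left) : new ArrayList<>(right);
            default:
                throw new IllegalArgumentException("unknown mode");
        }
    }
}
```

Note that when the right-hand field starts out empty, all three modes produce the same result, which is the common case when mapping into a newly created field.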

Using these in combination, I think you can do most things you intuitively want to, like combining 2 fields into one new one, etc. As a safeguard, you can run the task in 'preview' mode, which will display what operations it would perform, but not update the Item. As with any task, you can (if run from the command line) capture all the specific changes to a file for later reference. The info provided looks like this (one line for each item):

1721.1/123 dc.contributor.author (3) merged with dc.creator (4)

This means the tool copied 3 values from contributor.author into creator, which previously already had 4 values.

Let me know if this sounds like it will cover what we need as far as Item metadata goes (I realize there are a lot of other issues, like input-forms, crosswalks, etc.)

Thank you, Richard. This sounds really well thought out to me, between the levels at which the curation task might be applied, the option of previewing, and the capture of changes.

Two questions:

Does this process assume that, prior to deployment, repository managers will add and enable any new metadata elements included in the mappings? Or is that somehow built into the curation task (I'm assuming not)?

Would it be possible to include, in addition to the number of values copied, the values themselves? i.e., 1721.1/123 dc.contributor.author (3) [x, y, z] merged with dc.creator (4) [a, b, c, d] ?

That's right - it expects the metadata fields to have been defined. I considered automatically creating them from the mappings, but thought that it would make it too easy for typos to accidentally create unintended fields. The task does, though, verify before running that the right-hand fields exist, and complains if it doesn't find them. (The left-hand fields are OK in the sense that if they contain typos, the task will never find data to move, so they are innocuous.) I'm imagining that as part of the schema migration(s) we will publish new registry XML files, and there is already a loader for them.

It would be possible, although the output could be rather sizable if we are moving abstracts, etc. in large collections. I thought that since the field values were being preserved, we didn't need to record them. However, come to think of it, there is one case where that isn't true: in cases of assignment where the right-hand field is already occupied (the left-hand data is then essentially discarded). I'll look into capturing those field values in that case.

Another reason one might want to log the values is if they are changed in any way. I didn't mention it before, but the task does also have a simple transformation capability, meaning that before adding to the right-hand field, one can twiddle with the value a bit. An example would be:

Curious what the use case might be for assignment, where left-hand data is discarded if the right-hand is occupied? Are you thinking of this as a way to run a check to ensure that the data has transferred?

I agree with your setup wherein the registry fields are already defined rather than somehow established or created within the migration tool.

In addition to a transformation capability, there are use cases around a validation capability for the tool: one that will alert users if they are transferring non-compliant data into a field.

You might consider the alternative of having a separate validation task, since you might want to run that by itself in other cases. If you happen to be mapping, you could separately validate the old MD beforehand, the new MD afterward, or both. This seems to me like a case in which two simple tools beat one more complex tool.

Assignment is meant to be a sort of 'safe replace' or 'choose best value' operation. If the right-hand field has been newly created, then merge and replace do the same thing - just copy values into it. (This will be the overwhelmingly most common case.) If there are values present, however, one has to decide what their relationship is to the left-hand values: should we combine them, since we are basically cross-cataloging in two fields? This may make sense sometimes, but typically only if the field is multivalued. Should we discard the right-hand side? If so, use replace. Suppose, though, we have begun cataloging into the right-side field but not bothered to remove any superseded values in the left-side field (past practice not cleaned up). In this case, neither merge nor replace seems right - thus assignment. It essentially means "if there is a value there, it's the one I want to keep".

Make sense?

BTW - I concur with Mark Wood on validation as an independent concern, thus meriting a different tool.

The need for additional metadata suited to specific uses of DSpace seems to me to be precisely the reason that DSpace was designed to support multiple namespaces. Sites which archive images of pottery will have different needs than sites which archive chemical research reports or musical performances. I think that DSpace could and should ship with additional namespaces which could be loaded by sites that need them. It won't ever have everything that everyone wants, because people are endlessly creative in identifying new wants.

There are several distinct needs in this area, I think:

DSpace needs some namespace it can rely on for basic operations without any customization. DC has filled that role and DCTERMS may continue to do so.

DSpace has some concepts of its own that have been shoved into "DC". But DCMI defines what DC means, not DSpace. These should move to an internal namespace which is not exposed, since they have no meaning to other systems.

Each site may have some concepts of its own as well. They should be collected into one or more local namespaces, to help prevent leaking information which is meaningless outside the site.

Types of materials may have unique needs. If one site has these needs, probably others do too. The first thing to do is to look around and see if there is already a suitable metadata standard, and if so use it. Otherwise ask around to see if it's feasible to hammer out a standard among sites with similar needs, to publish and share. Otherwise, it's probably really local concepts and should go in a local namespace.

One thing that is asked for often is article-level metadata. Almost as frequently, someone points to PRISM as an answer. I can't say whether it's a good answer, but it's an example showing that what you want might already have been standardized. Don't work more than you have to!

I feel that too much metadata customization for DSpace takes place in the dark rather than being discussed and shared. One of the things I hope for from this metadata renovation is that this will change. Oddly enough, DSpace arguably makes it entirely too easy to deal with (some) metadata issues by just tweaking the default namespace and moving on. We haven't done enough to encourage reliance on the community, not just of DSpace sites but the broader community of networked information resources.

Yes it should. It is already written as described, but I wanted to make sure it met the basic needs before committing to the codebase. As to testing, what environment do you have available? If you have a 1.8+ DSpace (has to be XMLUI if you want to run the task in the admin UI), I can probably send you code to test right now.

On a related note, I'm pondering another tool/service to assist metadata improvement. I think we are somewhat stymied by a basic lack of visibility into exactly how individual sites have customized (or not) their metadata. Without this, it's hard to devise automation tools that work for large numbers of sites. To address this knowledge gap, I'm thinking of providing a web service add-on to DSpace that one could remotely query, which would profile the metadata usage. What does profile mean? It means listing all the schemas that have been defined, and within each schema, listing all the defined metadata fields and how much they are used (= how many MD values in this field in the entire repo, regardless of item, etc.). You could 'harvest' these profiles from all participating sites, and combine the results (kind of like OAI-PMH harvesting) to get an aggregate picture of metadata usage. What do you think?
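Assuming the classic DSpace relational layout (metadataschemaregistry, metadatafieldregistry, and metadatavalue tables), such a profile could be gathered with a query along these lines. This is a sketch of the idea, not the proposed service itself, and table/column names may differ across DSpace versions:

```sql
-- Count stored values per defined field, grouped by schema.
-- LEFT JOIN keeps fields that are defined but unused (count = 0).
SELECT s.short_id  AS schema_name,
       f.element,
       f.qualifier,
       COUNT(v.metadata_value_id) AS value_count
FROM metadatafieldregistry f
JOIN metadataschemaregistry s
  ON f.metadata_schema_id = s.metadata_schema_id
LEFT JOIN metadatavalue v
  ON v.metadata_field_id = f.metadata_field_id
GROUP BY s.short_id, f.element, f.qualifier
ORDER BY s.short_id, f.element, f.qualifier;
```

A web service wrapping a query like this could emit the counts in a simple XML or JSON document for aggregation across sites.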

We had to customise our metadata (i.e., add an additional metadata schema) to integrate our DSpace repository with the University's Research Management system. As our repository grows bigger and bigger, it would be very useful to know which metadata fields have been used heavily and which ones have not, to understand the implications of changes to crosswalks, etc.

I am cheering from my desk at your suggestion of a web service to remotely query and profile metadata usage in existing DSpace repositories. Our lack of a comprehensive picture of the fields actually in use in DSpace repositories has been a stumbling block. And one that doesn't seem resolvable through intermittent self-reporting in the form of something like surveys. I love your idea.