Introduction: Scope and Definitions

This document, a deliverable from the W3C Library Linked Data Incubator Group, is an attempt to identify a set of useful resources for creating or consuming Linked Data in the library domain. It is intended both for novices seeking an overview of the Library Linked Data domain, and for experts in search of a quick look-up or refresher. The final report @@@CITE@@@ of the Incubator Group suggests that the success of Linked Data in any domain relies on the ability of its practitioners to identify, re-use, or connect to already available datasets and data models. Library Linked Data is not an exception. Such an identification effort is crucial given the complexity and variety of library data resources, many of them already available as Linked Data at the time of writing this report. We hope that this document will help those who undertake such tasks.

This document also tries to provide the Linked Data community with an opportunity to understand the specific viewpoint, resources, and terminology used by the library community for their data, while helping Library and Information Science professionals to get a grasp of the Linked Data concepts corresponding to their own traditions. In previous library terminology explanation efforts, we have identified the following types of resources of interest, which are not mutually exclusive (as shown later):

Datasets: In this report we focus on datasets as collections of structured metadata -- descriptions of things, such as books in a library. The equivalent of a dataset in the library world is a collection of library records. Library records consist of statements about things, where each statement consists of an element ("attribute" or "relationship") of the entity, and a "value" for that element. The elements that are used are usually selected from a set of standard elements, such as Dublin Core (DC). The values for the elements are either taken from value vocabularies such as the Library of Congress Subject Headings (LCSH), or are free text values. Similar concepts to "dataset" include "collection" or "metadata record set". Note that in the Linked Data context, datasets do not necessarily consist of clearly identifiable "records". They are merely a consistent set of triples that can be queried or downloaded from a specific point, without making a strict distinction between metadata and data. We expect this view to impact the way the library community conceives of its own data, as (i) it creates or re-uses Resource Description Framework (RDF) vocabularies with domain and range settings and documentation that conforms to best practices, and as (ii) more application cases emerge, in which "traditional" descriptive metadata is being used together with other types of data. Examples:

a record from a dataset for a given book could have a subject element drawn from Dublin Core, and a value for subject drawn from LCSH.

the same dataset may contain records for authors as first-class entities that are linked from their books, described with elements like "name" from Friend of a Friend (FOAF).

a dataset may be self-describing in that it contains information about itself as a distinct entity, for example by including date modified and maintainer/curator elements drawn from Dublin Core.
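The three examples above can be sketched as a handful of statements. The following is a minimal Python illustration under stated assumptions, not a prescribed serialization: the Dublin Core Terms and FOAF namespaces are the published ones, while the `http://example.org/` URIs, the LCSH identifier `sh00000000`, the literal values, and the use of `publisher` to stand in for a maintainer/curator element are placeholders invented for the example.

```python
# Published namespace URIs for Dublin Core Terms and FOAF.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical identifiers, invented for this illustration only.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/1"
DATASET = "http://example.org/dataset"
LCSH_HEADING = "http://id.loc.gov/authorities/subjects/sh00000000"  # placeholder ID


def literal(text):
    """Mark a value as free text rather than a reference."""
    return ("literal", text)


def uri(value):
    """Mark a value as a reference to another resource."""
    return ("uri", value)


# Each statement is (thing described, element, value).
statements = [
    # A book record: the element comes from Dublin Core, the value from LCSH.
    (BOOK, DCTERMS + "subject", uri(LCSH_HEADING)),
    (BOOK, DCTERMS + "title", literal("An Example Title")),
    # The author is a first-class entity linked from the book.
    (BOOK, DCTERMS + "creator", uri(AUTHOR)),
    (AUTHOR, FOAF + "name", literal("A. N. Author")),
    # The dataset describes itself as a distinct entity.
    (DATASET, DCTERMS + "modified", literal("2011-01-01")),
    # Assumption: "publisher" stands in for a maintainer/curator element.
    (DATASET, DCTERMS + "publisher", literal("Example Library")),
]


def describe(subject):
    """Return the (element, value) pairs stated about one subject."""
    return [(p, o) for s, p, o in statements if s == subject]


for element, (kind, value) in describe(BOOK):
    print(element, "->", kind, value)
```

Note that nothing in the list marks where one "record" ends and the next begins; grouping statements by subject, as `describe` does, is only one possible view of the same triples.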

Value vocabularies: A value vocabulary defines resources (such as instances of topics, art styles, or authors) that are used as values for elements in metadata records. Typically a value vocabulary does not define bibliographic resources such as books but rather concepts related to bibliographic resources (persons, languages, countries, etc.). They are "building blocks" with which metadata records can be populated. Many libraries mandate specific value vocabularies for selecting values for a particular metadata element. A value vocabulary thus represents a "controlled list" of allowed values for an element. Examples include: thesauri, code lists, term lists, classification schemes, subject heading lists, taxonomies, authority files, digital gazetteers, concept schemes, and other types of knowledge organization systems. To be useful for linking of data, value vocabularies should have Hypertext Transfer Protocol (HTTP) Uniform Resource Identifiers (URIs) assigned for each value; these URIs would then appear in a metadata record instead of or in addition to the literal value. Examples:

Metadata element sets or element sets: A metadata element set defines classes and attributes used to describe entities of interest. In Linked Data terminology, such element sets are generally made concrete through (RDF) schemas or (Web Ontology Language (OWL)) ontologies, the term "RDF vocabulary" being often used as an umbrella for these. Usually a metadata element set does not describe bibliographic entities; rather, it provides elements to be used by others to describe such entities. Examples:

Dublin Core defines elements such as Creator and Date (but DC does not define bibliographic records that use those elements).

Functional Requirements for Bibliographic Records (FRBR) defines entities such as Work and Manifestation, and elements that link and describe them. Resource Description and Access (RDA) defines elements for cataloging, based on the FRBR model.

FOAF defines elements to describe people that could be used for describing authors.

This report is intended as an entry point for practitioners to find, understand, and explore some exemplar metadata element sets, value vocabularies, and datasets. It is especially grounded by the cases our Incubator Group has gathered. We do not aim here to draw a complete list of the various resources related to the (library) Linked Data "cloud". We hope this report will prove an inspirational complement to more complete listing tools such as Semantic Web search engines (like Sindice or Falcons), other surveys such as the Linked Open Vocabularies survey, or registries such as the Metadata Registry, Schemapedia or the Comprehensive Knowledge Archive Network (CKAN). We of course encourage our readers to also use these, just as we did ourselves for the CKAN data registry we developed.

Library Linked Data at CKAN

CKAN is a registry for data. It is a tool for people to share information about data "packages" of all types and collaboratively describe them. Although the CKAN registry is not itself a Linked Data service, there is a Linked Data version of the information it contains. Much of the data described in CKAN is in Linked Data form.

CKAN organizes data packages as groups that are curated by a community. It is used to maintain information about membership in the wider LOD Cloud as well as the subset that pertains to Library Linked Data--which includes both library datasets and value vocabularies as defined above. The curators of these groups have arrived at a set of conventions for using the tagging facilities in CKAN to describe packages that are to be included. This documentation, below, includes information about size of data, example resources and access methods (e.g., SPARQL Protocol and RDF Query Language (SPARQL) endpoints) and, crucially, links to other data packages. See:

Adding a new package to CKAN aids its visibility: this is a frequently consulted list of packages. Following the conventions of the LOD Cloud and Library Linked Data groups ensures that its relationships to other packages are documented and that it will be counted amongst the growing number of Linked Data corpora and appear in diagrams and visualizations that are produced as part of the study of Linked Data. Having data documented consistently means that we can build tools to gain a greater understanding of their nature and how they fit together. Whilst interesting in itself, this process is important because this kind of understanding makes it easier to determine whether a particular data package is suitable or appropriate for a given task, thus making the data easier to use.

To illustrate an example of the results of this process, consider the diagram below:

The brightly colored circles represent the packages that are part of the CKAN Library Linked Data group. The grey circles represent packages that are connected to but are not members of this group (they typically are members of the CKAN LOD Cloud group). The size of the circles and the thickness of the lines are related to the size of the data and the number of outward links (scaled logarithmically) respectively.

This graphic is generated automatically, on the basis of an algorithm, and represents the state of the CKAN Library Linked Data group at the time this report is published. It has already changed significantly since the beginning of our work, and will surely look different in the near future. For instance, the LC Name Authority File @@@Link to section@@@ has just recently been published and appears to be unconnected on the periphery, but this is likely to change in the coming months.

The graphic shows in fact the difficulty of rendering a complex and evolving web of links, given its current explosive growth. However, it is immediately apparent, for example, that there are some densely connected clusters of packages in library Linked Data, and that many are actually connected through data that are not necessarily library data in themselves -- DBpedia and GeoNames @@@Links to sections@@@ figuring prominently. It is also apparent that linking to other data that does not have this central character is quite common: it's not only the hubs that are useful.

Published Datasets

@@TODO: Just before the final delivery of this document, we will add here a snapshot of the CKAN Library Linked Data group. I.e., a simple bullet list that sums up the packages available there, with direct pointers to these.

Value vocabularies

Published value vocabularies

This section describes value vocabularies, which have been made available as Linked Data and/or are mentioned as being relevant by one of the Incubator Group cases.

Every entry features a brief introduction to the vocabulary, as well as links to their locations. Cases collected by the Incubator Group that refer to the value vocabulary are also listed under each entry.

Classification systems

Dewey Summaries is a suitable data set containing the top classes of Dewey Decimal Classification (DDC) 22. It provides access to the top three levels of the DDC in eleven languages along with access to Abridged Edition 14 (assignable numbers and captions) in three languages.

The Universal Decimal Classification (UDC) is a multilingual classification scheme for all fields of knowledge. The UDC Summary represents a selection of around 2,000 classes extracted from the UDC scheme. [1]

RAMEAU is a subject heading vocabulary used by the French National Library (BnF). It was developed from the subject heading repository of the Quebec University, which was itself derived from the Library of Congress Subject Headings (LCSH). RAMEAU has been published as Linked Data by the TELplus project.

The Schlagwortnormdatei (SWD) is a controlled vocabulary system managed by the German National Library (DNB) in cooperation with various library networks. The inclusion of keywords in the SWD is defined by "Rules for the Keyword Catalogue" (RSWK). [2]

The National Diet Library List of Subject Headings (NDLSH) is a list of subject headings applied to the catalog of the National Diet Library (Japan), including mainly the topical headings and some proper name headings. [3]

Name authority data

VIAF is a joint project of multiple national libraries in the world which virtually combine the name authority files of participating institutions into a single name authority service. As of the date of this report, there are 21 authority files of personal, corporate, and conference names from 18 organizations participating in VIAF. [4]

ULAN is a structured vocabulary containing more than 225,000 names as well as biographical and bibliographic information about artists and architects, including a wealth of variant names, pseudonyms, and language variants.

Although ULAN is not yet published as Linked Data per se, it is included in VIAF as the Getty Research Institute's contribution.

LC/NAF provides authoritative data for names of persons, organizations, events, places, and titles, with over 8 million descriptions created over multiple decades, according to different cataloging policies. LC Names is officially called the Name Authority Component (NACO) Authority File and is a cooperative effort in which participants follow a common set of standards and guidelines.

AGROVOC is a multilingual structured and controlled vocabulary designed to cover the terminology of all subject fields in agriculture, forestry, fisheries, food, and related domains (e.g., environment). [7]

Eurovoc is a multilingual, multidisciplinary thesaurus covering the activities of the European Union, the European Parliament in particular. It contains terms in 24 languages (as of the date of this report).[8]

The Library of Congress' Thesaurus for Graphic Materials includes more than 7,000 subject terms to index topics shown or reflected in pictures, and 650 genre/format terms to index types of photographs, prints, design drawings, ephemera, and other categories. [9]

PRONOM is the online registry of technical information about the file formats, software products, and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical, or business value. [10]

Creative Commons provides an infrastructure that consists of a set of copyright licenses and tools to create a balance inside the traditional “all rights reserved” setting that copyright law creates. [11]

Additional sources

WordNet is a lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (called "synsets"). Each synset expresses a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. [12] WordNet has been published as Linked Data by the Vrije Universiteit Amsterdam.

Freebase is an open, Creative Commons licensed collection of structured data, and a platform for accessing and manipulating that data via the Freebase API. Freebase imports data from a wide variety of open data sources, such as Wikipedia, MusicBrainz, and others.[13] Note that Freebase is essentially a dataset, but its inclusion of many reference resources allows some parts of it to be used as value vocabularies in certain cases.

DBpedia extracts structured information from Wikipedia. The DBpedia data set features labels and abstracts for over three million things, with half of them classified in an ontology, and contains millions of links to images, external Web pages, and other RDF datasets. [14] Like Freebase, DBpedia can be seen as a general dataset, but some of the entities it describes--places, persons, "categories"--can be used as reference value vocabularies in some cases.

Work in progress, or relevant for cases but not officially in progress

This thesaurus is used for the subject indexing of the Aquatic Sciences and Fisheries Abstracts (ASFA), an abstracting and indexing service that covers the world's literature on the science, technology, management, and conservation of marine, brackish water, and freshwater resources and environments, including their socio-economic and legal aspects.

The Fisheries Reference Metadata system stores all the classification systems (for species, countries, water areas, commodities, fishing vessels, fishing gears, etc.) used by FAO to describe fisheries observations such as time-series data on fisheries capture and production and species fact sheets.

The Agricultural Thesaurus and Glossary are online vocabulary tools of agricultural terms in English and Spanish provided by the USDA National Agricultural Library (NAL). The subject scope of agriculture is broadly defined in the NAL Agricultural Thesaurus, and includes terminology in the supporting biological, physical, and social sciences. The definitions of terms in the thesaurus were separately published as the Glossary of Agricultural Terms.[15]

A multilingual controlled vocabulary for fine art, architecture, decorative arts, archival materials, and material culture, intended for indexing, cataloging, and searching, and as a research tool.

A structured, world-coverage vocabulary of over 1.3 million names, including vernacular and historical names, coordinates, place types, and descriptive notes, focusing on places important for the study of art and architecture.

Other value vocabularies relevant to the Library Linked Data field, not mentioned in the cases

The New York Times uses approximately 30,000 tags to power its Times Topics Pages. These tags (categorized into 'people', 'organization', 'place', and 'descriptor') are published as Linked Data and are mapped to Freebase, DBpedia, and GeoNames.

The MARC Countries list identifies current national entities, states of the United States, provinces and territories of Canada and Australia, divisions of the United Kingdom, and internationally recognized dependencies. The entries include references to their equivalent ISO 3166 codes. @@@CITE--link this @@@

The MARC List for Languages provides three-character lowercase alphabetic strings that serve as the identifiers of languages and language groups. It has been cross-referenced with ISO 639-1, 639-2, and 639-5, where appropriate.

The MARC List for Geographic Areas identifies separate countries, first order political divisions of some countries, regions, geographic features, areas in outer space, and celestial bodies. The list contains over 550 different codes.[16]

For each element set, we give a pointer to a human-readable website and indicate the corresponding RDF namespace, as well as a common abbreviation used for it. We also provide or re-use a short description, focused on the main scope or usage domain for the element set. We have sometimes emphasized important design decisions that characterize the element set, including indications on whether the element set is connected to another one, and its relation to traditional library usages. Finally, cases collected by the Incubator Group are also listed under each entry as relevant usage examples.

For illustration purposes, we include a tag cloud rendition of the element sets presented in this section, adapting a site created by Paul Walk:

Note that this tag cloud is a context-specific snapshot of the usage of metadata element sets. In particular, the size of each tag is directly related to the number of individual cases @@@Link to Use Case Report@@@ that use it, as gathered by the Library Linked Data Incubator Group. Beyond this analysis based on the Incubator Group cases, Library Linked Data community members should consider helping maintain precise and up-to-date listings of datasets and value vocabularies, such as the CKAN Library Linked Data group, so that the usage of element sets can be measured. A refined, domain-specific version of the usage statistics for the Linked Open Data Cloud would help the community to develop a clearer idea about which elements sets are widely used and which are less common.

Rendering links between metadata element sets would also be valuable for practitioners willing to re-use data across vocabularies, or to make their data more usable by a wide community. The Upper Mapping and Binding Exchange Layer (UMBEL) constellation has been the first to illustrate connections between classes from popular Linked Data vocabularies. The Linked Open Vocabulary effort generalizes and automates the gathering of such information. For a wide range of metadata element sets, e.g., Dublin Core, Linked Open Vocabulary offers a detailed view of the relationships with other element sets, based on the available machine-readable definitions (ontologies).

Metadata element sets published as RDF vocabularies

This sub-section lists the relevant ontologies (OWL or RDFS) available at the time of writing this report, extracted from the cases gathered by the Incubator Group. To help readers orient themselves in our selection, we first introduce metadata element sets that originate from the Libraries, Archives, Museums, and Information communities. We then present relevant element sets that are rooted in other communities. This categorization is often arbitrary, as many vocabularies already result from cross-community work. We believe, however, that this shows the great potential for the Linked Data approach, where easily sharing, re-using, or extending a diversity of element sets independently from their origin is the rule.

Originating from the Libraries, Archives, Museums, and Information communities

Dublin Core 1.1 is the legacy Dublin Core element set containing 15 basic property elements capable of describing anything. A critical aspect of these properties is the lack of an rdfs:range setting, which allows one to use them both with literal values and fully-fledged RDF resources.

The DCMI Metadata Terms /terms/ namespace refines the legacy /elements/1.1/ namespace with some rdfs:range restrictions and a variety of new properties. Note that interoperability with the /elements/1.1/ set is preserved via rdfs:subPropertyOf.
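The role of rdfs:subPropertyOf in preserving interoperability can be illustrated with a small inference sketch. This is a simplified Python illustration, assuming triples are plain tuples of strings and that literals and URIs need not be distinguished; the two sub-property declarations are taken from the DCMI schema, while the `http://example.org/` resources and the function name `apply_subproperty_rule` are invented for the example.

```python
DC11 = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# Excerpt of the schema: /terms/ properties refining legacy /elements/1.1/ ones.
schema = {
    (DCTERMS + "creator", SUBPROPERTY_OF, DC11 + "creator"),
    (DCTERMS + "title", SUBPROPERTY_OF, DC11 + "title"),
}

# Hypothetical data published with the newer /terms/ properties.
BOOK = "http://example.org/book/1"
AUTHOR = "http://example.org/person/1"
data = {
    (BOOK, DCTERMS + "creator", AUTHOR),
    (BOOK, DCTERMS + "title", "An Example Title"),
}


def apply_subproperty_rule(data, schema):
    """Add (s, q, o) for every (s, p, o) where p is a sub-property of q."""
    supers = {}
    for p, rel, q in schema:
        if rel == SUBPROPERTY_OF:
            supers.setdefault(p, set()).add(q)
    inferred = set(data)
    changed = True
    while changed:  # repeat until no new statement appears (handles chains)
        changed = False
        for s, p, o in list(inferred):
            for q in supers.get(p, ()):
                if (s, q, o) not in inferred:
                    inferred.add((s, q, o))
                    changed = True
    return inferred


result = apply_subproperty_rule(data, schema)
for triple in sorted(result):
    print(triple)
```

A consumer that only understands the legacy properties can thus still find a creator and a title in data published with the newer ones, provided the rule is applied somewhere along the way.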

The Open Archives Initiative Object Reuse and Exchange model defines elements to describe aggregations of Web resources, which together form complex digital objects, such as a journal article and its different digital variations and accompanying material. It also proposes a "resource map" mechanism for indicating and describing provenance of metadata on these aggregations, as well as "proxies" to describe any given resource from the perspective of a specific aggregation, when resources are included in different aggregations.

"SKOS provides a model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary."[17] SKOS deliberately avoids providing rdfs:domains with some of its properties (esp. labelling and note properties), enabling one to re-use them for any kind of resource.

SKOS-XL is a SKOS extension that provides support for describing lexical entities attached to concepts. It "reifies" the labels of skos:Concepts, treating them as fully-fledged RDF resources. This allows them to be annotated further, or linked to one another using, say, an "isTranslationOf" property.
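To make the idea of reified labels concrete, the following Python sketch contrasts SKOS-XL label resources with the plain SKOS labels that can be derived from them. It is an illustration under assumptions: the SKOS and SKOS-XL namespaces and the terms skosxl:prefLabel, skosxl:literalForm, and skosxl:Label are the published ones, whereas the `http://example.org/` URIs, the `isTranslationOf` property, and the helper `plain_pref_labels` are hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers, invented for this illustration only.
CONCEPT = "http://example.org/concept/cats"
LABEL_EN = "http://example.org/label/cats-en"
LABEL_FR = "http://example.org/label/cats-fr"
IS_TRANSLATION_OF = "http://example.org/vocab/isTranslationOf"

# Literals are modelled as (text, language) tuples; URIs as strings.
triples = [
    (CONCEPT, RDF_TYPE, SKOS + "Concept"),
    (CONCEPT, SKOSXL + "prefLabel", LABEL_EN),
    (CONCEPT, SKOSXL + "prefLabel", LABEL_FR),
    (LABEL_EN, RDF_TYPE, SKOSXL + "Label"),
    (LABEL_EN, SKOSXL + "literalForm", ("Cats", "en")),
    (LABEL_FR, RDF_TYPE, SKOSXL + "Label"),
    (LABEL_FR, SKOSXL + "literalForm", ("Chats", "fr")),
    # Because labels are resources, they can be linked to each other.
    (LABEL_FR, IS_TRANSLATION_OF, LABEL_EN),
]


def plain_pref_labels(triples):
    """Derive plain skos:prefLabel statements from reified SKOS-XL labels."""
    literal_forms = {s: o for s, p, o in triples if p == SKOSXL + "literalForm"}
    return [
        (s, SKOS + "prefLabel", literal_forms[o])
        for s, p, o in triples
        if p == SKOSXL + "prefLabel" and o in literal_forms
    ]


for statement in plain_pref_labels(triples):
    print(statement)
```

The derived statements lose the link between the French and English labels, which is exactly the kind of information that only the reified form can carry.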

The CIDOC object-oriented Conceptual Reference Model (CRM) is developed by the International Council of Museums (ICOM) to represent descriptions of objects from the cultural sector and make them interoperable. It makes intensive use of events to link objects, persons, places, and more conceptual notions together.

The IFLA "FRBR family" consists of three conceptual models each covering an aspect of the data recorded in bibliographic and authority records. The entities, attributes, and relationships defined by each of the models are included in the Metadata Registry:

MADS/RDF is designed for use with controlled values for names (personal, corporate, geographic, etc.), thesauri, taxonomies, subject heading systems, and other controlled value lists. The MADS/RDF ontology is mapped to SKOS.

For its Linked Data services, the German National Library has created a namespace that is devoted to detailed description of authority resources (Gemeinsame NormDatei (GND)). This set of classes and properties especially refines the possibilities offered by SKOS and the RDA vocabularies.

VoID is an RDF-based schema for describing linked datasets. With VoID the discovery and usage of linked datasets can be performed both effectively and efficiently. A VoID dataset is a collection of data, published and maintained by a single provider, available as RDF, and accessible, for example, through dereferenceable HTTP URIs or a SPARQL endpoint.
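A VoID description is itself a small set of triples about a dataset. The Python sketch below produces such a description as N-Triples lines. It is a rough illustration only: the VoID terms used (void:Dataset, void:sparqlEndpoint, void:triples, void:vocabulary) are part of the published vocabulary, but the dataset URI, endpoint, title, and triple count are invented, the function `void_description` is hypothetical, and no escaping of special characters in the title is attempted.

```python
VOID = "http://rdfs.org/ns/void#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def void_description(dataset, title, sparql_endpoint, triple_count, vocabularies):
    """Return N-Triples lines describing one dataset with VoID terms."""
    lines = [
        f"<{dataset}> <{RDF_TYPE}> <{VOID}Dataset> .",
        # Assumption: the title contains no quotes or backslashes.
        f'<{dataset}> <{DCTERMS}title> "{title}" .',
        f"<{dataset}> <{VOID}sparqlEndpoint> <{sparql_endpoint}> .",
        f'<{dataset}> <{VOID}triples> "{triple_count}"^^<{XSD_INTEGER}> .',
    ]
    # One statement per element set (RDF vocabulary) used in the dataset.
    lines += [f"<{dataset}> <{VOID}vocabulary> <{v}> ." for v in vocabularies]
    return lines


# Hypothetical values, invented for this illustration only.
description = void_description(
    dataset="http://example.org/dataset",
    title="Example Library Catalogue",
    sparql_endpoint="http://example.org/sparql",
    triple_count=120000,
    vocabularies=["http://purl.org/dc/terms/"],
)
print("\n".join(description))
```

Publishing such a description alongside the data gives registries and crawlers a machine-readable way to learn about size and access methods, similar in spirit to the information that package descriptions in a registry record by hand.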

The Upper Mapping and Binding Exchange Layer (UMBEL) Reference Concepts dataset is derived from the OpenCyc ontology. It includes thousands of coherently structured and linked concepts, and is broadly applicable to provide orienting nodes to any knowledge domain. The UMBEL vocabulary provides classes and properties to describe this conceptual knowledge. It is also intended to function as the basis for constructing domain ontologies. [18] It re-uses external vocabularies whenever possible.

The name Lexvo is derived from the Ancient Greek λεξικόν (lexicon) and the Latin vocabularium (vocabulary).[19] The ontology provides a vocabulary for defining global URIs for languages, words, characters, and other human language-related objects.

This is an RDF Schema for EXIF -- a standard for images that supports mainly technical metadata, usually embedded in an image file (e.g., JPEG file), where each key of the EXIF specification has been directly mapped to a corresponding property. In order to preserve the groupings of metadata keys that are provided in the original EXIF specification (e.g., pixel composition and geo location), other efforts have been reported, such as an EXIF OWL ontology [20].

"The Music Ontology Specification provides main concepts and properties for describing music (i.e., artists, albums, and tracks) on the Semantic Web". It applies the FRBR distinctions to the music domain.

Schema.org is a set of constructs that allow website designers to include structured metadata in their Web pages, to be consumed by the major search engines Bing, Google, and Yahoo! Schema.org is designed to represent resources from a great diversity of domains. It thus duplicates many elements from other element sets, and fails to capture the richness of library data. However, it can be used to exchange simple information about libraries and the resources they own, as demonstrated in this post.

Facebook's Open Graph "protocol" enables the description of resources (movies, books, etc.) that can be of interest to social network members. Its main purpose is to allow websites to include RDFa markup, which is used in combination with the "Like" button to communicate to the Facebook service data about the objects mentioned on web pages.

The Ontology for Media Resources defines a core set of metadata properties for media resources, along with their mappings to elements from a set of existing metadata formats. It is mainly targeted towards media resources available on the Web, as opposed to resources that are only accessible in local archives or museums.

The Europeana Data Model is a vocabulary aimed at representing metadata for cultural objects and giving access to digital representations of these objects. It is positioned in a data aggregation context, where objects can be complex, and several data providers may entertain different views on them. EDM re-uses, extends, or has been inspired by other element sets that address specific needs EDM has to fulfil, notably OAI-ORE, Dublin Core, SKOS, and CIDOC CRM.

EAC-CPF (Encoded Archival Context – Corporate bodies, Persons, and Families) is aimed at representing authoritative information about the context of archival materials, including "the identification and characteristics of the persons, organizations, and families (agents) who have been the creators, users, or subjects of records, as well as the relationships amongst them" [21]. It is a parallel effort to the Encoded Archival Description (EAD) standard for representation of archival finding aids.

A core concept in EAC-CPF is the distinction between agents and identities: the same agent can have different identities, and one identity can correspond to several agents.

Work relevant for EAD in RDF has been done in LOCAH (available here and documentation here) and EuropeanaConnect (see schema here)

Note that the LOCAH element set only handles a part of EAD, and introduces other elements that the LOCAH participants found useful to publish archival collection data as Linked Data. Readers may also be interested in the lightweight Archival vocabulary maintained by Aaron Rubinstein for describing archives and the named entities associated with them.

Metadata element sets from cases for which no RDF vocabulary is available

Categories for the Description of Works of Art (CDWA) includes 532 categories and subcategories for describing and accessing information about works of art, architecture, other material culture, groups and collections of works, and related images. A simpler subset of these elements, CDWA Lite, has been created.

PBCore is a metadata standard designed to describe media, both digital and analog. The PBCore XML Schema Definition (XSD) defines the structure and content of PBCore. The element set and related value vocabularies are available at Metadata Registry.