This deliverable provides an overview of current practice regarding knowledge representation in the cultural heritage domain. It does so by providing an overview of the metadata schemas and controlled vocabularies that are widely used in the cultural heritage sector.

An overview of current practice is gathered from:

- the cultural heritage partners in the project;
- cultural heritage institutes throughout Europe;
- work done within other (research) projects.

More generic knowledge representation standards and the use of the Semantic Web within the project are outlined. This overview provides insight into the metadata schemas and controlled vocabularies MultiMatch might have to deal with and build upon.

The deliverable concludes with a first analysis of the most important schemas and reference models together with a preliminary outline of their possible usability in the MultiMatch project.

Note: in the Description of Work, the title of this deliverable is listed as “First Analysis of Ontologies in the CH domain”. This title was too narrow to cover the work and thus was amended slightly.

This deliverable provides an overview of current practice regarding knowledge representation in the cultural heritage domain and defines the basis for the approach towards maximum interoperability that will be adopted within the MultiMatch project. The focus is thus on descriptive metadata; in other words, the metadata that identify and describe the object and what it expresses (see further section 1.3). This first analysis is intended to be general, with more specific analysis in later deliverables.

In Chapter 1, the cultural heritage domain is divided into the six sub-domains to be targeted in this study. The methodology used in gathering the information is explained, as well as the selection criteria used. A scheme or vocabulary is included only if the following criteria are met:

- it is constructed and maintained by a renowned institute in one of the sub-domains, and
- it is available in electronic form, and
- it is publicly available; in other words, there may be financial but no copyright hindrances to applying it in MultiMatch, and
- it is a proven international standard, or a local standard in use nationwide.

Chapters 2, 3 and 4 give an insight into the metadata schemas and controlled vocabularies MultiMatch might have to deal with. Chapter 2 provides a descriptive overview of the metadata schemas and the semantic resources (i.e. thesauri, controlled vocabularies) widely used within the organizations belonging to the specific sub-domains. Forty have been identified and analyzed in a structured fashion.

Sub-domain           Schemas   Controlled vocabularies
Archives             2         4
Libraries            3         7
Museums              3         5
Educational sector   2         -
Audiovisual sector   7         2
Geospatial sector    5         2

Chapter 3 provides information on the metadata used by some of the cultural heritage institutions within the consortium and the Advisory Board. It also lists seventeen European projects and initiatives that are closely related to MultiMatch, including the MICHAELplus and The European Library projects. Furthermore, it includes data from a relevant inventory on multilingualism conducted by the MINERVA Plus project and provides a summary of the use of controlled vocabularies in the cultural heritage domain.

D2.1 First Analysis of Metadata

in the Cultural Heritage Domain

Page5

of118

From this survey it became clear that the uptake of internationally established controlled vocabularies is quite limited. Locally and nationally established or managed vocabularies are therefore predominant. Part of the reason for this is that the available international controlled vocabularies are still not available in every European language (currently there are 20 official languages in the European Union).

We can note, however, that certain controlled vocabularies are particularly popular and have already been used in many European countries:

- Getty Arts and Architecture Thesaurus
- The UNESCO thesaurus
- Library of Congress Subject Headings (LCSH)
- The HEREIN thesaurus
- The NARCISSE vocabulary and the EROS project
- ICONCLASS (in the field of iconographic description).

Chapter 4 describes some generic knowledge representations and several metadata schemas, ontologies and reference models that are used in various contexts, not only within the cultural heritage domain.

Generic identification standards and reference models

- CIDOC Conceptual Reference Model
- Digital Object Identifier
- Functional Requirements for Bibliographic Records
- SKOS Simple Knowledge Organisation System
- RDF Resource Description Framework

Generic Metadata Schema

- Dublin Core Metadata Initiative
- MPEG-7
- MPEG-21

Chapter 4 concludes by explaining the relationship between the goals of MultiMatch and the Semantic Web (SW). Here it is noted how much of the technology examined in MultiMatch will consider issues relevant to the development of the Semantic Web. Thus the project should both add to and benefit from SW technologies and research, and provide tools and materials which are exploitable in the context of the Semantic Web.

As part of MultiMatch, documents within the Cultural Heritage domain will be marked up with semantic information (or metadata) from a common vocabulary. One criticism leveled at the SW is the cost associated with providing this markup; the project will examine the use of classification and information extraction techniques to alleviate this problem. The SW is also concerned with the interoperability between different vocabularies (and ontologies); an issue which will have to be addressed within MultiMatch as well. There are also issues which relate to the SW, such as "trust" and the provenance of information, privacy and censorship and the provision of Web services which, whilst not central, will be examined in the project.


The fifth and final chapter of this deliverable summarises the most relevant standard(s) for each sub-domain. It also gives a preliminary indication of the possible usability of these popular standards for MultiMatch. In the sections following, the most relevant generic schemas (Dublin Core, MPEG-7, MPEG-21) and reference models (FRBR, CIDOC-CRM) are analysed.

Next, the metadata schemas possibly relevant for MultiMatch are analysed according to a number of criteria (applying the analysis methodology from De Sutter et al. [Sutter, 2006]), to provide a first typology of these schemas in a tabular overview.

The concluding paragraph outlines further research issues concerning knowledge representation within the project. In D2.2 the approach for knowledge representation in MultiMatch will be defined and described in detail. This deliverable, D2.1, thus represents the starting point for the further research needed to decide on the knowledge representation within MultiMatch.


1. Introduction

This ‘First Analysis of Ontologies in the Cultural Heritage domain’ will feed into the specification of the first prototype. The final approach regarding content interoperability will be defined in conjunction with work in WP1 and WP3 and (after internal papers) will form the core of Deliverable 2.2 (Content interoperability: metadata and file formats), to be released at PM10. This first analysis is thus intended to be general, with more specific analyses to be provided in later deliverables.

- Work in WP1 defines the user requirements after conducting interviews, log analyses and desk research. These requirements will provide pivotal input to arrive at a definitive approach regarding interoperability.
- Initial work in WP3 deals with the detailed specifications of the first prototype. WP2, and more specifically task 2.1, will provide necessary input regarding issues connected with metadata, thesauri/ontologies, and semantic web encoding.

1.1 Outline of this Document

Deliverable 2.1 provides an overview of current practice regarding knowledge representation in the cultural heritage domain. As metadata standards enable interoperability between systems and organisations, so that information can be exchanged and shared, the overview in this deliverable provides the basis for the approach towards interoperability that will be adopted within the MultiMatch project.

The primary focus is on descriptive metadata, representing the conceptually meaningful aspects of an object, but some technical dimensions are also taken into account. Current practice in the diverse areas into which the cultural heritage domain can be broken down is investigated.



- In Chapter 1, the cultural heritage domain is divided into the six sub-domains to be targeted in this study. The methodology adopted and the terminology used are also explained in this introductory chapter.
- Chapter 2 provides a descriptive overview of the metadata schemas and the semantic resources (i.e. thesauri, controlled vocabularies) widely used within the organizations belonging to the specific sub-domains.
- Chapter 3 provides information on the metadata used by some of the cultural heritage institutions within the consortium and within related European projects. Chapter 3 also includes data from a relevant inventory on multilingualism conducted by the MINERVA Plus project and provides a summary of the use of controlled vocabularies in the cultural heritage domain.
- Chapter 4 describes some generic knowledge representations and several metadata schemas, ontologies and reference models that are used in various contexts, not only the cultural heritage domain. These knowledge representations can play a role within the MultiMatch project. This chapter also explains the relationship between the goals of MultiMatch and the Semantic Web.
- The fifth and final chapter of this deliverable summarises the most relevant standard(s) for each sub-domain. This is done by looking at the uptake of standards in section 5.1. This section also gives a preliminary indication of the possible usability of these popular standards for MultiMatch. The most relevant generic schemas (Dublin Core, MPEG-7, MPEG-21) and reference models (FRBR, CIDOC-CRM) are then analysed.


Furthermore, the metadata schemas possibly relevant for MultiMatch are analysed according to four criteria (the analysis methodology from De Sutter et al. [Sutter, 2006]), to provide a first typology of these schemas in a tabular overview. The concluding paragraph outlines further research issues concerning knowledge representation within MultiMatch. In D2.2 (PM 10), the approach that MultiMatch will adopt for knowledge representation will be defined and described in detail.

1.2 Methodology

The focus of this deliverable is the current practice of knowledge representation in the cultural heritage sector. This survey provides the technical partners of the MultiMatch project with a clear view of the dimensions of the data they will have to deal with. It will feed into several other tasks, notably the functional specification of the prototype.

Furthermore, this deliverable will provide input for the decision on knowledge representation in the MultiMatch project (to be reported in D2.2).

The methodological approach can be broken down into three parts:

- Defining cultural heritage
- Information gathering process
- Selection criteria

1.2.1 Defining Cultural Heritage

The concept Cultural Heritage can be defined in many ways. Here are just three examples.

“It is the legacy of physical artefacts and intangible attributes of a group or society that are inherited from past generations, maintained in the present and bestowed for the benefit of future generations. Physical or "tangible cultural heritage" includes buildings and historic places, monuments. Natural heritage is also an important part of a culture, encompassing the countryside and natural environment. Smaller objects that are considered part of our cultural heritage are stored in libraries, museums and galleries. Cultural heritage objects are studied by academics and enjoyed by tourists; making it hard to draw boundaries.” (Definition of Wikipedia)

The term cultural heritage collections is intended to cover all types of material collected and displayed by museums and related institutions, as defined by ICOM. This includes collections, sites and monuments relating to natural history, ethnography, archaeology, historic monuments, as well as collections of fine and applied arts. (Definition of the International Council of Museums, ICOM [2])

[1] http://www.digicult.info/pages/index.php

[2] http://icom.museum/


In order to systematically study current practice we use the sub-domain definition advocated by the DEN (Digital Heritage Netherlands) and ePSINet (the European Public Sector Information Network [3]):

1. Archives
2. Libraries
3. Museums
4. Educational sector
5. Audiovisual sector
6. Geospatial sector

Clearly, there is significant overlap between these domains. In those cases in which it was unclear in which category an activity should be placed, a judgement was made based on a close examination of the schemas and semantic elements. Those controlled vocabularies that are used across these domains are listed under the category ‘generic’ and described in Chapter 4.

1.2.2 Information gathering process

The methodology adopted for this first analysis of knowledge representation consisted of:

1. thorough desk research on special interest groups and organisations working on this topic which, together with personal contacts, provided the overview and insight presented below.
2. a questionnaire (see Appendix 1) sent to our target group: libraries, museums, archives and other cultural institutions participating in related European projects. The questionnaire was sent to:
   - 17 partners of the BRICKS community
   - 6 members of the steering board of the Culture Mondo network
   - 14 partners of the Digital Heritage Network
   - 31 partners or members of the MINERVA project
3. consultation, by telephone interview, with experts inside and outside the consortium.

1.2.3 Selection criteria

The selection of the knowledge representations in use is based on several criteria. A scheme or vocabulary is included if:

- it is constructed and maintained by a renowned institute in one of the sub-domains, and
- it is available in electronic form, and
- it is publicly available; in other words, there may be financial but no copyright hindrances to applying it in MultiMatch, and
- it is a proven international standard, or
- a local standard in use nationwide.
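The criteria above combine three mandatory conditions with a choice between two kinds of standard. A minimal sketch of that logic follows; the `Candidate` class, its field names and the example values are hypothetical and serve only to make the and/or structure explicit.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record describing a schema or vocabulary under review."""
    name: str
    renowned_maintainer: bool        # maintained by a renowned institute in a sub-domain
    electronic_form: bool            # available in electronic form
    copyright_free: bool             # no copyright hindrance (fees are allowed)
    international_standard: bool     # proven international standard
    nationwide_local_standard: bool  # local standard in use nationwide


def is_selected(c: Candidate) -> bool:
    """The first three criteria must all hold, plus at least one of the last two."""
    return (
        c.renowned_maintainer
        and c.electronic_form
        and c.copyright_free
        and (c.international_standard or c.nationwide_local_standard)
    )


# Illustrative values only; they are not taken from the survey.
example = Candidate("Example thesaurus", True, True, True, False, True)
print(is_selected(example))  # True
```

Note that a financial barrier does not exclude a candidate; only a copyright hindrance does, which is why the sketch has no field for cost.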

[3] http://www.epsigate.org/


1.3 Domain Terminology

Knowledge representation

This is a two-sided concept:

1. Knowledge on cultural heritage objects is represented in metadata schemas (mainly in the semantic description of a cultural heritage object, not in the technical or administrative part of a metadata schema). Synonym: metadata model.

2. Knowledge on cultural heritage objects is also represented in 'controlled vocabularies' or 'knowledge organization systems' of all kinds, thereby controlling the content of several metadata elements or attributes of a metadata schema.

Metadata

Information, compiled (automatically and/or manually) in the format of the metadata schema concerned, which captures the basic characteristics of a data or information resource (e.g. a cultural heritage object). Metadata refers to “data about data”, in other words, information that describes information sources or objects, e.g. a Dublin Core record or a record from the catalogue of an archive. The format and structure of metadata are often dictated by a set of rules, called a metadata schema.

Indirectly, the European Commission stressed the importance of metadata for online accessibility, in the 'Communication of 30 September 2005' on the Digital Libraries Initiative that deals with cultural heritage and its online preservation and accessibility.

"Questions of online accessibility are not limited to intellectual property rights. Putting material online does not mean it can be found easily by the user, still less that it can be searched and used. Appropriate services allowing the user to discover and work with the content are necessary. This implies structured and quality description of the content, both the collections and the items in them, and support for its use (e.g. annotation)." [4]

1. Descriptive metadata - mainly information to identify and describe the object or information source and what it expresses. These metadata include the author/title cataloguing as well as the subject indexing. In other words, the descriptive metadata include the subgroup of objective elements that formally describe the object (e.g. identification number, title, creation date, creator name, the language of the object, physical media), and the subgroup of semantic elements (also called analytical metadata) that contain information on the subject of the object to enhance access to the resource's contents (e.g. subject keywords, classification codes, abstract). Note that the descriptive metadata, and especially the semantic elements, are the scope of D2.1. Note also that descriptive metadata can be of a technical character, for instance the 'compression schema' (the algorithm used to compress the audiovisual essence), the number of pages (book), black and white/colour (photograph, film) or specific information on the storage medium or carrier.

2. Technical metadata - describe the technological characteristics of the related object (e.g. data that must be available to be able to use the material, file locations, authentication and security information, characteristics needed for computer programming and database management).

Metadata schema

"Full, logically organised structure of relations between defined (groups) of metadata and the information objects they describe." [5]

“a set of rules for encoding information that supports specific communities of users.” [6]

A metadata schema consists of several metadata elements. For some elements the input is free (e.g. Title), for other elements the input is guided by syntactical rules or guidelines or even restricted by controlled vocabularies of all kinds (e.g. thesaurus for subject keywords or closed term list for object type).
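The distinction between free and controlled elements can be sketched in a few lines. The schema, element names and term lists below are hypothetical; they only illustrate how a closed term list restricts input while a free element such as Title accepts any value.

```python
# Hypothetical schema: each element is either free text (None) or
# restricted to a closed term list (a set of allowed values).
SCHEMA = {
    "title": None,                                       # free input
    "object_type": {"painting", "photograph", "film"},   # closed term list
    "subject": {"portrait", "landscape", "still life"},  # stand-in for a thesaurus
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for element, value in record.items():
        if element not in SCHEMA:
            problems.append(f"unknown element: {element}")
            continue
        allowed = SCHEMA[element]
        if allowed is not None and value not in allowed:
            problems.append(f"{element}: '{value}' is not a controlled term")
    return problems


print(validate({"title": "Guernica", "object_type": "painting"}))
print(validate({"title": "Untitled", "object_type": "sculpture"}))
```

The first record conforms; the second is rejected because 'sculpture' is absent from the assumed term list for object type.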

Metadata element

A metadata element is an item, or an editorial part of metadata. A semantic metadata element is an element from the descriptive metadata that describes the cultural heritage object.

A metadata element name is given to a data element in, for example, a data dictionary or metadata schema or registry. In a formal data dictionary, there is often a requirement that no two data elements may have the same name, to allow the data element name to become an identifier, though some data dictionaries may provide ways to qualify the name in some way, for example by the application system or other context in which it occurs.

A data element definition is a human readable phrase or sentence associated with a data element within a data dictionary that describes the meaning or semantics of a data element.

Controlled vocabulary

A limited set of terms that must be used to index, represent or tag the subject matter or content of documents or objects (indexing tools in use to describe a cultural heritage object).

These examples illustrate that controlled vocabularies are largely applied for subject keywords or generic concept identification. However, controlled vocabularies or lists of preferred terms are also applied for other metadata elements, e.g. person names such as author or creator, names of historical people and corporate bodies appearing on the cultural heritage object or as its subject, geographic places (actual location of the cultural heritage object / place of creation / place where the cultural heritage object was found / place as subject of the cultural heritage object) and organisation names. See also: Authority files in this table.

[5] Metadata in the audiovisual production environment: an introduction / Annemieke de Jong. Hilversum: Nederlands Instituut voor Beeld en Geluid, 2003

[6] Murtha Baca, Getty Research Institute


Classification schemes, Taxonomies and Categorization schemes [7]

These terms are often used interchangeably. Although there may be subtle differences from example to example, in general these types of knowledge representation provide ways to separate entities into buckets or relatively broad topic levels. Some examples provide a hierarchical arrangement of numeric or alphabetic notation to represent broad topics. These types of knowledge representation may not follow the strict rules for hierarchy required in the ANSI NISO Thesaurus Standard (Z39.19) (NISO), and they lack the explicit relationships presented in a thesaurus.

Examples of classification schemes include the Library of Congress Classification Schedules (an open, expandable system), the Dewey Decimal Classification (a closed system of 10 numeric sections with decimal extensions), and the Universal Decimal Classification (based on Dewey but extended to include facets). Subject categories are often used to group thesaurus terms in broad topic sets, outside the hierarchical scheme of the thesaurus. Taxonomies are increasingly being used in object oriented design and knowledge management systems to indicate any grouping of objects based on a particular characteristic. "Taxonomy" may also refer to a scheme that presents subject elements in a hierarchical arrangement based on some characteristic.

Thesauri

These knowledge organization systems are based on concepts, and they show relationships between terms. Relationships commonly expressed in a thesaurus include hierarchy, equivalence, and associative (or related).

These relationships are generally represented by the notation BT (broader term), NT (narrower term), SY (synonym), and RT (associative or related). There are standards for the development of monolingual thesauri (NISO, 1998; ISO, 1986) and multi-lingual thesauri (ISO, 1985).

It should be noted that the definition of a thesaurus in these standards is often at variance with schemes that are actually called thesauri. There are many thesauri that do not follow all the rules of the standard, but are still generally thought of as thesauri. Many thesauri are very large (more than 50,000 terms). Most were developed for a specific discipline, or to support a specific product or family of products.
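The BT, NT, SY and RT notation maps naturally onto simple lookup tables. The following sketch uses an invented mini-thesaurus (the terms are illustrative, not drawn from any of the vocabularies surveyed) and shows that NT can be derived as the inverse of BT.

```python
from collections import defaultdict

# Hypothetical mini-thesaurus; the terms are illustrative only.
BT = {  # term -> broader term
    "oil painting": "painting",
    "watercolour": "painting",
    "painting": "visual works",
}
RT = {"painting": {"painter"}}        # associative (related) terms
SY = {"watercolour": {"aquarelle"}}   # synonyms (equivalence)

# NT is the inverse of BT, so it can be derived rather than stored.
NT = defaultdict(set)
for narrower, broader in BT.items():
    NT[broader].add(narrower)


def broader_chain(term: str) -> list:
    """Walk BT links upwards from a term to the top of the hierarchy."""
    chain = []
    while term in BT:
        term = BT[term]
        chain.append(term)
    return chain


print(sorted(NT["painting"]))         # ['oil painting', 'watercolour']
print(broader_chain("oil painting"))  # ['painting', 'visual works']
```

This sketch assumes a single broader term per entry; real thesauri may allow several, in which case BT would map to a set.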

Subject headings

This scheme provides a set of controlled terms to represent the subjects of items in a collection. Subject heading lists can be extensive, covering a broad range of subjects. However, the structure of subject heading lists is generally very shallow, with a limited hierarchical structure. In use, subject headings tend to be pre-coordinated, with rules for how subject headings can be joined to provide more specific concepts. Examples include the

Authority files

Authority files are lists of terms that are used to control the variant names for an entity or the domain value for a particular field. Examples include names for countries, individuals, and organizations. Non-preferred terms may be linked to the preferred versions. This type of knowledge organization generally does not include a deep organization or complex structure. The presentation may be alphabetical or organized by a shallow classification scheme.

There may be some limited hierarchy applied in order to allow for simple navigation, particularly when the authority file is being accessed manually or is extremely large.

[7] For the definitions of the several types of controlled vocabularies the following source is used: Taxonomy of Knowledge Organization Sources/Systems (1). Draft June 7, 2000 (revised July 31, 2000). http://nkos.slis.kent.edu/KOS_taxonomy.htm Last viewed 2006-09-14.


Specific examples of authority files include the Library of Congress Name Authority File and the Getty Geographic Authority File.

Semantic network

With the advent of natural language processing, there have been significant developments in the area of semantic networks. These knowledge organization systems structure concepts and terms not as hierarchies but as a network or a Web. Concepts are thought of as nodes with various relationships branching out from them.

The relationships generally go beyond the standard BT, NT and RT. They may include specific whole-part relationships, cause-effect, parent-child, etc. One of the most noted semantic networks is Princeton's WordNet, which is now used in a variety of search engines.

Ontology

An ontology is a data model that represents the existing knowledge within a domain and is used to reason about the objects in that domain and the relations between them. Ontologies are used as a form of knowledge representation about the world or some part of it. Ontologies generally describe: Individuals (the basic or "ground level" objects); Classes (sets, collections, or types of objects); Attributes (properties, features, characteristics, or parameters that objects can have and share); Relations (ways that objects can be related to one another). [8]

Therefore thesauri and classification schemes can be regarded as ontologies with a relatively small number of relationships.

Ontologies can represent complex relationships between objects, and include the rules and axioms missing from semantic networks. Ontologies that describe knowledge in a specific area are often connected with systems for data mining and knowledge management.

Upper Ontology (top-level ontology, or foundation ontology). An attempt to create an ontology which describes very general concepts that are the same across all domains. The aim is to have a large number of ontologies accessible under this upper ontology.

Semantic Web

The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

The intent of the Semantic Web is to enhance the usability and usefulness of the Web and its interconnected resources. Within MultiMatch the use of a Semantic Web-compatible markup will guarantee a rich use (mainly in retrieval functionality) of the metadata on cultural heritage objects provided by the partners in combination with several ontologies related to the cultural heritage domain. A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. An ontology on arts can be used to say, for instance, that “Picasso” is a “Painter”, and that a “Painter” is an “Artist”. The combination of such ontologies with the MultiMatch indexes automatically provides the end user with several extra ways to navigate through the MultiMatch collection. For example, this combination can present all cultural heritage objects from museums in Spain, without the need for the content providing partners to manually add extra metadata to the descriptions of their objects. See also section 4.3.

[8] Definition taken from: www.wikipedia.org
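The kind of inference described here can be sketched with plain subject-predicate-object triples. The predicate names (`is_a`, `subclass_of`, `held_by`, `located_in`) and the data are assumptions made for illustration; they are not MultiMatch vocabulary, and a real system would use RDF tooling rather than Python sets.

```python
# Triples are (subject, predicate, object). All names are illustrative.
TRIPLES = {
    ("Picasso", "is_a", "Painter"),
    ("Painter", "subclass_of", "Artist"),
    ("Guernica", "held_by", "Museo Reina Sofia"),
    ("Museo Reina Sofia", "located_in", "Madrid"),
    ("Madrid", "located_in", "Spain"),
}


def objects(subject, predicate):
    """All objects linked to a subject by the given predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def classes_of(individual):
    """Direct classes plus all superclasses reached via subclass_of."""
    found, todo = set(), list(objects(individual, "is_a"))
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(objects(cls, "subclass_of"))
    return found


def is_within(place, region):
    """Follow located_in links transitively."""
    seen, todo = set(), [place]
    while todo:
        current = todo.pop()
        if current == region:
            return True
        if current not in seen:
            seen.add(current)
            todo.extend(objects(current, "located_in"))
    return False


def held_in(region):
    """Objects whose holding institution lies within the region."""
    return sorted(s for s, p, o in TRIPLES if p == "held_by" and is_within(o, region))


print(sorted(classes_of("Picasso")))  # ['Artist', 'Painter']
print(held_in("Spain"))               # ['Guernica']
```

The object record only names its holding museum; the link to Spain comes from the geographic triples, which is the sense in which no extra metadata has to be added by the content provider.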

XML schema

An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntax constraints imposed by XML itself. An XML schema provides a view of the document type at a relatively high level of abstraction. There are languages developed specifically to express XML schemas. The Document Type Definition (DTD) language, which is native to the XML specification, is a schema language.

Data model

"A data model is a model that describes in an abstract way how data are represented in a business organization, an information system or a database management system. This term is ambiguously defined to mean:

1. how data generally are organized, e.g. as described in Database management system. This is sometimes also called "database model"

2. or how data of a specific business function are organized logically (e.g. the data model of some business)

While simple data models consisting of few tables or objects can be created "manually", large applications need a more systematic approach. Within the relational database modelling community, the entity-relationship model method is used to establish a domain-specific data model. In computer science, an entity-relationship model (ERM) is a model or conceptual data model, is a map of concepts and their relationships, for example, a conceptual schema for a karate studio would include abstractions such as student, belt, grading and tournament." [9]

In this deliverable, data models are referred to as reference models, see also section 4.1. A data model, especially the concepts or entities and relationships of the model, dictates the metadata elements that are needed in the metadata schema that goes along with the data model.
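As a rough illustration of how a data model dictates metadata elements, the sketch below takes two entities from the karate studio example and derives element names from their attributes. The attribute names and the `entity.attribute` naming pattern are assumptions made for this example only.

```python
from dataclasses import dataclass, fields


@dataclass
class Student:
    name: str
    belt: str  # relationship to the Belt entity, simplified here to a colour


@dataclass
class Tournament:
    title: str
    date: str


def metadata_elements(entity) -> list:
    """List the element names a schema would need to describe this entity."""
    return [f"{entity.__name__.lower()}.{f.name}" for f in fields(entity)]


print(metadata_elements(Student))     # ['student.name', 'student.belt']
print(metadata_elements(Tournament))  # ['tournament.title', 'tournament.date']
```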

[9] www.wikipedia.org


2. Knowledge Representation in the Cultural Heritage Domain

Knowledge representation in the cultural heritage domain includes metadata schemas on the one hand, and semantic element definitions (i.e. thesauri, controlled vocabularies) on the other. See also section 1.3 for further definitions.

In order to provide a descriptive overview of metadata in the cultural heritage domain, this chapter presents, for each sub-domain, a selection of metadata schemas and of controlled vocabularies. The selection of the knowledge representations in use is based on several criteria, listed in section 1.2.3. To start with, some generic standards are described. The subsequent descriptions of the selected knowledge representations appear in alphabetical order, for each sub-domain.

2.1 Generic Standards

The following tables provide an overview of generic metadata standards. The selection consists of: Friend Of A Friend, Wiktionary and WordNet.

Friend Of A Friend

Name: Friend of a Friend
Acronym: FOAF
Status / version: Not available
Type: Standard
Management: Edd Dumbill, editor and publisher, xmlhack.com
Short description: FOAF is a domain-specific vocabulary to support the social interactions of humans within the general Web. It provides a vocabulary for describing the kind of information that is found on people’s home pages in a machine-understandable fashion, e.g. “My name is”, “I am interested in” and “You can see me in this picture”. This allows queries to be made over communities of people, e.g. “Show me pictures of people who are interested in Marilyn Manson who live near me.”
URL(s) documentation: http://rdfweb.org/topic/FAQ (available at 2006-06-21); http://www.foaf-project.org/ (available at 2006-06-21)
URL guidelines for application: http://www-106.ibm.com/developerworks/xml/library/x-foaf.html (viewed 2006-09-26)
XML encoding available: Yes (also RDF, Semantic Web)
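To give an impression of the RDF/XML encoding, the sketch below builds a small FOAF description with Python's standard library. The person, the `example.org` URLs and the choice of the `foaf:name`, `foaf:interest` and `foaf:depiction` properties to mirror "My name is", "I am interested in" and "You can see me in this picture" are illustrative assumptions, not content taken from the FOAF documentation cited above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("foaf", FOAF)

# Build <rdf:RDF><foaf:Person>...</foaf:Person></rdf:RDF>
root = ET.Element(f"{{{RDF}}}RDF")
person = ET.SubElement(root, f"{{{FOAF}}}Person")

# "My name is ..."
ET.SubElement(person, f"{{{FOAF}}}name").text = "Jane Example"
# "I am interested in ..." (hypothetical topic page)
ET.SubElement(person, f"{{{FOAF}}}interest",
              {f"{{{RDF}}}resource": "http://example.org/topics/museums"})
# "You can see me in this picture" (hypothetical image URL)
ET.SubElement(person, f"{{{FOAF}}}depiction",
              {f"{{{RDF}}}resource": "http://example.org/photos/jane.jpg"})

foaf_xml = ET.tostring(root, encoding="unicode")
print(foaf_xml)
```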

Wiktionary

Name

The EnglishWiktionary

Acronym

Wiktionary

Status / version

20060704

Type

Standard

Management

Wikimedia

Short description

A collaborative project to produce a free, multilingual dictionary with definitions,etymologies,pronunciations, sample quotations,synonyms, antonyms andtranslations.Wiktionary is the lexical companion to the open-content encyclopediaWikipedia.

TheEnglish

Wiktionary aims to describe all words of all languages, with definitions andD2.1 First Analysis of Metadata

in the Cultural Heritage Domain

Page16

of118

descriptions in English only. For example, seeWörterbuch

(a German word). In order tofind a German definition of that word, visit the equivalent pagein the German Wiktionary.

Number of elements

290,688 entries

Available in language

124 languages

XML encoding available

No

URL(s) documentation

http://en.wiktionary.org/wiki/Main_Page

Viewed 2006-10-19

WordNet

Name

WordNet

Acronym

WordNet

Status / version

Version 2.1

Type

Semantic lexicon

Management

Princeton University

Short description

WordNet is not a controlled vocabulary in the sense of a set of preferred terms, but it is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.

WordNet is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas.

Number of elements

155,327 unique strings; 207,016 word-sense pairs

Available in language

English only. However, the Mimida Project [10], developed by Maurice Gittens, is a WordNet-based, mechanically generated multilingual semantic network for more than 20 languages, based on dictionaries found on the Web.

XML encoding available

No

Extra information on application

MultiWordNet [11], developed by Luisa Bentivogli and others at ITC-irst, is a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The current version includes around 44,400 Italian lemmas organized into 35,400 synsets, which are aligned, whenever possible, with their corresponding English Princeton synsets. The MultiWordNet database can be freely browsed through its on-line interface, and is distributed for both research and commercial use. Information on the distribution licence is available at the web site.

EuroWordNet [12] is a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). The wordnets are structured in the same way as the American WordNet for English (Princeton WordNet, Miller et al. 1990) in terms of synsets (sets of synonymous words) with basic semantic relations between them.

URL(s) documentation

http://wordnet.princeton.edu/

Viewed 2006-10-02

URL guidelines for application

http://wordnet.princeton.edu/man/wnintro.3WN (the API documentation)

http://wordnet.princeton.edu/doc (reference manual WordNet 2.1)

Viewed 2006-10-19

[10] http://www.gittens.nl/SemanticNetworks.html

[11] http://multiwordnet.itc.it/english/home.php

[12] http://www.illc.uva.nl/EuroWordNet/


2.2 Archives

An archive refers to a collection of records, and also refers to the location in which these records are kept. Archives are made up of records which have been created during the course of an individual or organization's life. In general an archive consists of records which have been selected for permanent or long-term preservation. Records, which may be in any media, are normally unpublished, unlike books and other publications.

2.2.1 Metadata schemas

The following tables provide an overview of the selected metadata schemas used by archives. The selection consists of: Encoded Archival Description and General International Standard Archival Description.

Encoded Archival Description

Name

Encoded Archival Description

Acronym

EAD

Status / version

Version 2002

Type

International standard

Management

The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress (LC) in partnership with the Society of American Archivists.

Short description

The EAD Document Type Definition (DTD) is a standard for encoding archival finding aids using Extensible Markup Language (XML). Finding aids are indexes used to catalogue detailed information about collections within an archive. They are used by researchers to determine whether information within a collection is relevant to their research. Finding aids often describe the scope of the collection, biographical and historical information related to the collection, and access details. Finding aids can be created in various electronic and print formats. The standard format for finding aids is Encoded Archival Description.

about primary research materials held in repositories worldwide. It provides tools for a detailed, multilevel description, structured display, navigation, and searching.

Archives and libraries can use EAD to XML-encode the information in their finding aids for greater online access.

Syntaxes

In principle, encoded finding aids consist of three parts: the first describes the finding aid itself (<eadheader>), the second contains the prefatory matter useful for the display or publication of the finding aid (<frontmatter>), and the third contains the description of the archival records or manuscript papers (<archdesc>). The Document Type Definition defines document structure, while elements constitute informational units. Elements can be modified with attributes. EAD presentation (display) is prescribed using style sheets.
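The three-part layout of an encoded finding aid can be sketched as follows. This is a minimal illustration built with the Python standard library, not a record validated against the EAD 2002 DTD. The function name ead_skeleton, the identifier and the titles are hypothetical, and only a handful of EAD elements (eadid, filedesc, titlestmt, titleproper, titlepage, did, unittitle) are used.

```python
import xml.etree.ElementTree as ET


def ead_skeleton(identifier, finding_aid_title, collection_title):
    """Build a bare <ead> tree with its three main parts (illustrative only)."""
    ead = ET.Element("ead")

    # Part 1: information about the finding aid itself.
    header = ET.SubElement(ead, "eadheader")
    ET.SubElement(header, "eadid").text = identifier
    titlestmt = ET.SubElement(ET.SubElement(header, "filedesc"), "titlestmt")
    ET.SubElement(titlestmt, "titleproper").text = finding_aid_title

    # Part 2: prefatory matter for display or publication.
    front = ET.SubElement(ead, "frontmatter")
    titlepage = ET.SubElement(front, "titlepage")
    ET.SubElement(titlepage, "titleproper").text = finding_aid_title

    # Part 3: the description of the archival material; the "level"
    # attribute shows how an element is modified with an attribute.
    archdesc = ET.SubElement(ead, "archdesc", {"level": "collection"})
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = collection_title
    return ead


# Invented example values.
finding_aid = ead_skeleton("example-0001", "Guide to the Example Papers", "Example Papers")
print(ET.tostring(finding_aid, encoding="unicode"))
```

A real finding aid would nest further descriptive components inside <archdesc> to express the multilevel description mentioned above.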

International Standard Archival Authority Record for Corporate Bodies, Persons and Families: ISAAR(CPF). This standard provides general rules for the construction of authority files for the metadata element 'archive builder' (a syntax for names of organisations, persons and families). See section 2.3.2 (ISAAR).

Extra information on application

ISAD(G) is mapped to EAD and vice versa. http://www.getty.edu/research/conducting_research/standards/intrometadata/crosswalks.html

The following tables provide an overview of the selected controlled vocabularies used by archives. The selection consists of: IPTC thesaurus, International Standard Archival Authority Record, Thésaurus architecture et patrimoine and UK Archival thesaurus.

IPTC thesaurus

Name

IPTC Newscodes – subject code

Acronym

IPTC thesaurus

Status / version

Version 17, 2006-08-21

Type

International standard

Management

The International Press Telecommunications Council

Short description

A tree-structured list of thematic keywords. The IPTC Subject Reference System was developed to give information providers access to a universal, language-independent coding system for indicating the subject content of news items.

Number of elements

Approximately 1,200 terms on all subject areas.

Available in language

Dutch, English, French, German

XML encoding available

Yes

Extra information on application

A three-level hierarchy where the top level is Subject, the second level is Subject Matter and the third level is Subject Detail.

There are 17 top-level Subjects, and the IPTC has developed secondary Subject Matter lists for each of these. To date, there are third-level Subject Detail lists for three Subjects: Economy, Business and Finance; Politics; and Sport.
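The three levels can be read directly from a subject code. The sketch below assumes the eight-digit code layout of the IPTC Subject Reference System (two digits for the Subject, three for the Subject Matter, three for the Subject Detail, with unused levels filled with zeros). That layout is not stated in this deliverable and should be checked against the IPTC documentation. The function name split_subject_code and the sample codes are illustrative.

```python
def split_subject_code(code):
    """Split an 8-digit subject code into its level and its broader codes.

    Assumed layout (not taken from this deliverable): SS MMM DDD, where
    unused lower levels are zero-filled.
    """
    if len(code) != 8 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected an 8-digit code, got {code!r}")

    subject, matter, detail = code[:2], code[2:5], code[5:]
    if detail != "000":
        level = "Subject Detail"
    elif matter != "000":
        level = "Subject Matter"
    else:
        level = "Subject"

    # Codes of the broader terms, from the top of the hierarchy downwards.
    broader = []
    if level != "Subject":
        broader.append(subject + "000000")
    if level == "Subject Detail":
        broader.append(subject + matter + "000")
    return {"level": level, "broader": broader}


# Sample codes used purely to exercise the function.
print(split_subject_code("15000000"))
print(split_subject_code("15054000"))
print(split_subject_code("15054001"))
```

Because the hierarchy is encoded in the code itself, a retrieval system can broaden a query to the parent Subject without consulting the thesaurus labels in any particular language.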

International Standard Archival Authority Record for Corporate Bodies, Persons and Families

Acronym

ISAAR (CPF)

Status / version

Second edition, 2004

Type

International standard

Management

International Council on Archives

Short description

This standard provides guidance for preparing archival authority records which provide descriptions of entities (corporate bodies, persons and families) associated with the creation and maintenance of archives.

The elements of description for an archival authority record are organized into four information areas:

1. Identity Area (where information is conveyed which uniquely identifies the entity being described and which defines standardized access points for the record)

2. Description Area (where relevant information is conveyed about the nature, context and activities of the entity being described)

3. Relationships Area (where relationships with other corporate bodies, persons and/or families are recorded and described)

4. Control Area (where the authority record is uniquely identified and information is recorded on how, when and by which agency the authority record was created and maintained).

Number of elements

Not available

Available in language

Dutch, English, French, Italian, Portuguese, Spanish, Welsh.

XML encoding available

No, but see below.

Extra information on application

This standard addresses only part of the conditions needed to support the exchange of archival authority information. Successful automated exchange of archival authority information over computer networks is dependent upon the adoption of a suitable communication format by the repositories involved in the exchange. Encoded Archival Context (EAC) is one such communications format which supports the exchange of ISAAR(CPF) compliant archival authority data over the World Wide Web.

EAC has been developed in the form of Document Type Definitions (DTDs) in XML (Extensible Markup Language) and SGML (Standard Generalized Markup Language).
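The four information areas of an ISAAR(CPF) authority record listed above map naturally onto a simple record structure. The following sketch is one possible in-memory representation, not part of the standard or of EAC. The class name AuthorityRecord, the method empty_areas and the keys used inside each area are illustrative assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AuthorityRecord:
    """One archival authority record, grouped by the four ISAAR(CPF) areas."""
    identity: dict = field(default_factory=dict)       # who or what the entity is
    description: dict = field(default_factory=dict)    # nature, context, activities
    relationships: list = field(default_factory=list)  # links to other entities
    control: dict = field(default_factory=dict)        # record id, creating agency, dates

    def empty_areas(self):
        """Return the names of areas that contain no information yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Invented example; the keys inside each area are assumptions for illustration.
record = AuthorityRecord(
    identity={"type_of_entity": "person", "authorized_name": "Example, Jane"},
    control={"record_id": "EX-0001", "maintained_by": "Example Archive"},
)
print(record.empty_areas())  # areas still to be filled in
```

Keeping the areas separate makes it straightforward to check a record for completeness before it is exchanged in a format such as EAC.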

A subject thesaurus which has been created for the archive sector in the United Kingdom. It is a controlled vocabulary which archives can use when indexing their collections and catalogues. The backbone of UKAT is the UNESCO Thesaurus (UNESCO), a high-level thesaurus with terminology covering education, science, culture, the social and human sciences, information and communication, politics, law and economics. The UNESCO Thesaurus is significantly enhanced to include terms of relevance to the archive community and its users.

In the scope of this document, a library is defined as a collection of books and periodicals. It can refer to an individual's private collection, but more often it is a collection of information resources and services that is funded and maintained by a city or institution.

2.3.1 Metadata schemas

The following tables provide an overview of the selected metadata schemas used by libraries. The selection consists of: Machine

The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form. They are widely used within the library domain, but rarely in other domains.

Number of elements

> 200 elements

Vocabularies proposed

The MARC Code List for Organizations contains short alphabetic codes used to represent names of libraries and other kinds of organizations that need to be identified in the bibliographic environment (27,719 elements).

The country code list is made up of three parts: Part I: Name Sequence, Part II: Code Sequence, and Part III: Regional Sequence (12 regions).

Furthermore the following controlled vocabularies are mentioned:

For names, one of the most widely used authority files is the Library of Congress Name Authority File (or LCNAF; http://authorities.loc.gov/).

For topics or geographic names, the most used subject authority file is the LCSH. There are many other subject heading lists, such as the Sears List of Subject Headings and the Art and Architecture Thesaurus.

Extra information on application

MARC 21 has been mapped to the following metadata standards, and vice versa: MODS; Dublin Core; MARC Character Sets to UCS/Unicode; Digital Geospatial Metadata (FGDC). UNIMARC is mapped to MARC 21.

The structure of MARC records is an implementation of national and international standards, e.g. Information Interchange Format (ANSI Z39.2) and Format for Information Exchange (ISO 2709).

Applied by the following organizations, e.g.

Libraries worldwide

URL(s) documentation

http://www.loc.gov/marc/

http://www.loc.gov/cds/marcdoc.html

URL guidelines for application

Understanding MARC Bibliographic: http://www.loc.gov/marc/umb/

XML encoding available

Yes: a framework for working with MARC data in an XML environment is being developed: http://www.loc.gov/marc/marcxml.html

A list of some tools that work with HTML, SGML and XML applications is at http://www.loc.gov/marc/marctools.html

Viewed 2006-10-19
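To make the ISO 2709 record structure mentioned above more concrete, the sketch below builds and then reads back a tiny record. It relies on the commonly documented layout of a MARC 21 record (a 24-character leader, 12-character directory entries, field terminator 0x1E and record terminator 0x1D); this layout is background knowledge, not something specified in this deliverable. The function names and sample field values are hypothetical, and the code is not a substitute for a full MARC library.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
LEADER_LENGTH = 24


def build_record(fields):
    """Serialize (tag, bytes) pairs into a minimal ISO 2709 style record."""
    directory = b""
    data = b""
    for tag, value in fields:
        body = value + FIELD_TERMINATOR
        # Directory entry: 3-char tag, 4-digit field length, 5-digit start offset.
        directory += f"{tag:0>3}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    directory += FIELD_TERMINATOR
    base_address = LEADER_LENGTH + len(directory)
    record_length = base_address + len(data) + len(RECORD_TERMINATOR)
    # Leader: record length (0-4), fixed codes, base address of data (12-16).
    leader = f"{record_length:05d}nam a22{base_address:05d} a 4500".encode("ascii")
    return leader + directory + data + RECORD_TERMINATOR


def parse_record(raw):
    """Recover the (tag, bytes) pairs by walking the directory."""
    base_address = int(raw[12:17])
    directory = raw[LEADER_LENGTH:base_address - 1]  # drop the field terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        value = raw[base_address + start:base_address + start + length - 1]
        fields.append((tag, value))
    return fields


# Invented sample: a control number and a title field with a subfield delimiter.
sample_fields = [("001", b"rec-1"), ("245", b"10\x1faAn example title")]
raw_record = build_record(sample_fields)
print(parse_record(raw_record))
```

The directory is what allows a reader to jump to a single field without scanning the whole record, which is the main practical consequence of the ISO 2709 layout.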

Metadata Object Description Schema

Name

Metadata Object Description Schema

Acronym

MODS

Status / version

Version 3.2

Type

Recommendation

Management

Library of Congress

Short description

An XML schema for descriptive metadata that is library-oriented and compatible with the MARC 21 bibliographic format; in other words, it is optimized for from-MARC conversion of legacy records.

Well-suited as a metadata format for OAI harvesting.

This schema may be used to carry selected data from a subset of existing MARC 21 records as well as to enable the creation of original resource description records.

Vocabularies proposed

Lists for use with MODS:



Sources



Authority File



Classification



Form



Genre



Subject



Organizations



Target Audience



Relators and Roles

Value lists



Relators and Roles (MARC)



Form (MARC)



Form (SMD)



Genre (MARC)



Target Audience (MARC)



Organization (MARC)

Extra information on application

There are crosswalks available to MARC and to Dublin Core, and vice versa.

Applied by the following organizations, e.g.

OpenOffice Bibliographic Project

Minerva project

University of Chicago Press

California Digital Library

The Library of Congress is planning to convert 100K American Memory records

URL(s) documentation

http://www.loc.gov/standards/mods

http://www.loc.gov/standards/mods/v3/mods-3-2.xsd

URL guidelines for application

http://www.loc.gov/standards/mods/v3/mods-userguide.html

Viewed 2006-10-19

XML encoding available

Yes
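The from-MARC orientation of MODS can be illustrated with a deliberately small conversion sketch. It maps three MARC subfields to MODS elements (245 $a to titleInfo/title, 100 $a to name/namePart, 260 $c to originInfo/dateIssued). This three-entry table is a simplification chosen for illustration and is not the official Library of Congress crosswalk; the names SIMPLIFIED_CROSSWALK and marc_to_mods and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# (MARC tag, subfield code) -> chain of MODS elements below <mods>.
# A tiny, simplified subset chosen for illustration only.
SIMPLIFIED_CROSSWALK = {
    ("245", "a"): ("titleInfo", "title"),
    ("100", "a"): ("name", "namePart"),
    ("260", "c"): ("originInfo", "dateIssued"),
}


def marc_to_mods(marc_subfields):
    """Convert a {(tag, code): value} mapping into a <mods> element."""
    mods = ET.Element(f"{{{MODS_NS}}}mods", {"version": "3.2"})
    for key, value in marc_subfields.items():
        path = SIMPLIFIED_CROSSWALK.get(key)
        if path is None:
            continue  # subfields without a mapping are skipped in this sketch
        parent = mods
        for element_name in path:
            parent = ET.SubElement(parent, f"{{{MODS_NS}}}{element_name}")
        parent.text = value
    return mods


# Invented legacy data.
legacy = {
    ("245", "a"): "An example title",
    ("100", "a"): "Example, Jane",
    ("260", "c"): "2006",
    ("999", "z"): "local note without a mapping",
}
mods_record = marc_to_mods(legacy)
print(ET.tostring(mods_record, encoding="unicode"))
```

The numeric MARC tags become named elements, which is why MODS records are easier to read than the MARC records they are derived from.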

Metadata Encoding and Transmission Standard

Name

Metadata Encoding and Transmission Standard

Acronym

METS

Status / version

Version 1.5, April 2005

Type

Encoding standard

Management

Library of Congress

Short description

An XML document format for encoding metadata necessary for both management of (compound) digital library objects within a repository and exchange of such objects between repositories (or between repositories and their users). Depending on its use, a METS document could be used in the role of Submission Information Package (SIP), Archival Information Package (AIP), or Dissemination Information Package (DIP) within the Open Archival Information System (OAIS) Reference Model.

METS is an XML Schema designed for the purpose of creating XML document instances that express the hierarchical structure of digital library objects, the names and locations of the files that comprise those objects, and the associated metadata. METS can, therefore, be used as a tool for modelling real world objects, such as particular document types.

METS is a standard “shell” for encoding data essential for retrieving, preserving, and serving up digital resources; it can be seen as a "wrapper", like MPEG-21.

The need for METS was identified at Digital Library Federation metadata experts meetings, as varied local approaches to non-descriptive metadata were not scaling well and offered little interoperability between agencies.

The value of METS is that it offers a standard mode for object “packaging” for preservation, institutional repositories and other activities. [15]

Applied by the following organizations, e.g.

British Library, OCLC DCPS, RLG, Harvard, Stanford, UC Berkeley and the National Library of Wales are exploring or using METS for a variety of projects.

The Library of Congress is planning to use METS with selected moving image, audio recording and folklife mixed-media collections.

The system is developed and maintained in the Library of Congress: the Dewey editorial office. Copyrights are owned by OCLC (mailto:DeweyLicensing@oclc.org).

Short description

A universal classification schema, i.e. describing all subject areas.

At the broadest level, the DDC is divided into ten main classes, which together cover the entire world of knowledge. Each main class is further divided into ten divisions, and each division into ten sections (not all the numbers for the divisions and sections have been used).

This general knowledge organisation tool has a structural hierarchy: all topics (aside from the ten main classes) are part of all the broader topics above them.

The DDC is the most widely used classification system in the world. Libraries in more than 135 countries use the DDC to organize and provide access to their collections, and DDC numbers are featured in the national bibliographies of more than sixty countries.

Libraries of every type (especially public libraries and small academic libraries in the U.S.) apply Dewey numbers on a daily basis and share these numbers through a variety of means (including WorldCat, the OCLC Online Union Catalogue).

Dewey is also used for other purposes, e.g., as a browsing mechanism for resources on the web. For instance, the subject gateway Renardus has assigned DDC for organizing and accessing electronic resources.

URL(s) documentation

http://www.oclc.org/dewey

http://www.oclc.org/dewey/versions/ddc22print/glossary.pdf

URL guidelines for application

http://www.oclc.org/dewey/versions/ddc22print/intro.pdf

Viewed 2006-09-15
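The structural hierarchy of the DDC (ten main classes, each with ten divisions, each with ten sections) means that the broader topics of a number can be derived from its first three digits. The sketch below shows this derivation; the function name ddc_hierarchy and the sample numbers are illustrative, and no captions are attached because the class names themselves are not reproduced in this deliverable.

```python
def ddc_hierarchy(number):
    """Return the main class, division and section numbers for a DDC number.

    Only the three digits before the decimal point determine these levels.
    """
    digits = number.split(".")[0]
    if len(digits) != 3 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected three digits before the point, got {number!r}")
    return {
        "main_class": digits[0] + "00",  # first digit selects one of ten main classes
        "division": digits[:2] + "0",    # second digit selects one of ten divisions
        "section": digits,               # third digit selects one of ten sections
    }


# Sample numbers used only to exercise the function.
print(ddc_hierarchy("025.3"))
print(ddc_hierarchy("512"))
```

This is the property that makes Dewey numbers usable as a browsing mechanism: truncating a number always yields a valid broader level.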


Functional Requirements for Authority Records

Name

Functional Requirements for Authority Records

Acronym

FRAR

Status / version

Draft, June 2005

Type

Conceptual model

Management

IFLA UBCIM Working Group on Functional Requirements and Numbering of Authority Records (FRANAR)

Short description

A conceptual model to set up authority records for metadata elements like person name, family name and organization name according to a predefined structure.

Like the rules for a thesaurus, there are 14 relationship types acknowledged, for instance pseudonym relationship and alternative linguistic form relationship.

XML encoding available

No

Applied by the following organizations, e.g.

Many

URL(s) documentation

http://www.ifla.org/VII/d4/wg-franar.htm

Viewed 2006-10-04

Library of Congress Authority Files

Name

Library of Congress Authority Files

Acronym

LCAF

Status / version

Updated weekly

Type

International standard

Management

Library of Congress

Short description

A set of controlled vocabularies (authority files) for the following metadata elements: subject (see: LCSH), names (person names, corporate names, meeting names and geographic names), series and uniform title and name/title.

is a classification system designed for the Library of Congress collection, covering all subject areas. It has been adopted by many large academic libraries in the U.S.

Number of elements

21 basic classes

Available in language

English

XML encoding available

Yes, the LCC records are available in MARCXML format.

Applied by the following organizations, e.g.

It is used by most research and academic libraries in the U.S. and several other countries.

Recommended by VRA.

URL(s) documentation

http://www.loc.gov/catdir/cpso/lcco/lcco.html

Viewed 2006-10-19

http://www.loc.gov/catdir/cpso/lcc.html

Viewed 2006-10-19

Library of Congress Subject Headings

Name

Library of Congress Subject Headings

Acronym

LCSH

Status / version

29th edition, 2006 (the online version is updated weekly)

Type

International standard

Management

Library of Congress

Short description

A thesaurus on all subject areas.

A structured vocabulary designed to represent the subject and form of the books, serials, and other materials in the Library of Congress collections, with the purpose of providing subject access points to the bibliographic records contained in the Library of Congress catalogues. More broadly, LCSH is used as a tool for subject indexing of library catalogues and other materials (including visual materials). Available in print (annual) and microfiche (updated quarterly). Also available on line from various vendors and bibliographic utilities, and as part of the Library of Congress CD-ROM product Classification Plus.