Improving Access to Recorded Language Data

Abstract

This article discusses the work of the Research Data Alliance (RDA) Language Codes Working Group, which is addressing the problem of how scientists can discover data from various research areas when that data is managed under different disciplinary approaches and standards. The WG is addressing standardisation of metadata elements in two areas: codes for identification of languages and language varieties, and categories for describing the content of resources. The goal is to deliver metadata components addressing these two areas which can be used by archives across disciplines. The components will use descriptors linked to a registry of data categories, to ensure transparency and consistency and to improve discovery and access for researchers across disciplines.

1. Introduction

Researchers in several disciplines, including musicologists, linguists, and scholars working in performance studies, collect and store data which includes human language recorded in real time. Discovery of such data should be easy across disciplines, but is currently impeded by different disciplinary approaches and standards. For example, a linguist may have collected recordings of songs performed by speakers of the language they are studying; these recordings are stored in an archive intended primarily for other linguists, and a musicologist may not easily discover the resource even though it might be very relevant to their research. The opposite situation might equally occur, with a musicologist collecting data which would be of interest to a linguist. This article discusses the work of the recently formed Language Codes Working Group of the Research Data Alliance (RDA), which aims to address this problem by working towards standardisation of metadata elements in two areas: codes for identification of languages and language varieties, and categories for describing the content of resources. The WG aims to deliver metadata components addressing these two areas for use by archives across disciplines, using descriptors linked to a registry of data categories to ensure transparency and consistency. We hope that the activities of the Working Group will lead to improved discovery and access for researchers across disciplines who work with recorded language data, as well as improved possibilities for inter-repository data exchange.

2. Identifying languages

It is difficult to provide a single characterisation of the range of data which is of concern for this Working Group, but the presence of some language content (within which we include sign languages) seems basic. A first step in effective discovery across disciplines and resources is therefore the ability to identify accurately the language or languages which are present in a resource. An international standard, ISO639-3, provides a set of three-letter codes to identify languages, but it is problematic for two reasons. Firstly, it is not adopted everywhere. Secondly, the standard implies a rather rigid view of what can be defined as a 'language', even though this is a notoriously difficult concept to pin down. We illustrate both problems with an example from Australia.

The digital collections of the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) use a set of identifiers different from those of ISO639-3. The divisions recognised by ISO639-3 do not always align with expert understanding, and this has been a particular issue for Australian languages, with a number of change requests filed with the registration authority for ISO639-3. A number of these changes relate to issues of granularity, that is, delineating languages from linguistic entities below that level (such as dialects) and above that level (such as macrolanguages and language families). But differences between ISO639-3 and the identifiers used by AIATSIS also reflect differences between insider and outsider views of the relevant distinctions. Table 1 illustrates some of this complexity by comparing ISO639-3 identifiers and AIATSIS identifiers for the Dhangu language group within the Yolngu/Yuulngu family of Australian languages. It also includes information from Glottolog, yet another source for language identifiers. Glottolog uses the term 'languoid' to refer to "any type of lingual entity: language, dialect, family, language area".[1]

Table 1: Comparison of ISO639-3, Glottolog and AIATSIS identifiers for the Dhangu language group (AIATSIS has two alternate spellings for Ngaymil which are omitted here.)
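A comparison across identifier schemes of the kind made in Table 1 can also be held as a simple data structure. The following is a minimal sketch in Python, offered as an illustration rather than as anything the Working Group has specified. The type name LanguoidRecord, its fields, and every identifier value are placeholders invented here; they are not the codes from Table 1. The only property of the standard relied on is that ISO639-3 identifiers consist of three lowercase letters.

```python
import re
from typing import List, NamedTuple, Optional

# ISO 639-3 identifiers consist of exactly three lowercase ASCII letters.
_ISO639_3_FORM = re.compile(r"[a-z]{3}")


def is_wellformed_iso639_3(code: str) -> bool:
    """Check syntactic form only; this does not confirm the code is assigned."""
    return _ISO639_3_FORM.fullmatch(code) is not None


class LanguoidRecord(NamedTuple):
    """Hypothetical record tying one linguistic entity to several schemes."""
    name: str
    level: str                       # e.g. "family", "language", "dialect"
    iso639_3: Optional[str] = None   # None: the scheme has no identifier here
    glottolog: Optional[str] = None
    aiatsis: Optional[str] = None

    def schemes_covering(self) -> List[str]:
        """Names of the schemes that assign this entity an identifier."""
        pairs = [("ISO639-3", self.iso639_3),
                 ("Glottolog", self.glottolog),
                 ("AIATSIS", self.aiatsis)]
        return [scheme for scheme, value in pairs if value is not None]


# Placeholder values only: "qaa" lies in the ISO 639 range reserved for
# local use, and the other identifiers are invented for this sketch.
language = LanguoidRecord("Example language", "language",
                          iso639_3="qaa", glottolog="exam1234", aiatsis="X0")
dialect = LanguoidRecord("Example dialect", "dialect",
                         glottolog="exam1235", aiatsis="X0.1")

for record in (language, dialect):
    print(record.name, "->", ", ".join(record.schemes_covering()))
```

Leaving a field empty, rather than forcing a nearest match, keeps visible the granularity problem described above: an entity recognised by one scheme may have no counterpart at the same level in another.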

The RDA Language Codes Working Group is starting from the position that ISO639-3 is sufficiently entrenched that it cannot be abandoned, but that improvements can be made both in the substance of the standard and in the processes around its maintenance. The various parts of ISO639 currently have different Registration Authorities; for example, ISO639-3 is administered by SIL International, while ISO639-2 (a partially superseded set of three-letter codes) is administered by the Library of Congress. The Working Group will participate in efforts which have already begun to unify all the parts of ISO639 under a single Registration Authority, which would allow for the construction of a single database documenting the complete standard. A single administration would also simplify the processes for seeking amendments to the standard. This will remain an important consideration: languages and linguistic scholarship are not static, and both what should be identifiable and what is recognised as identifiable will change over time.

As part of the ongoing work of ISO Technical Committee 37 (which has responsibility for ISO639), proposals for identification of entities at different levels of granularity are being considered within the ISO process. These efforts cover the identification of linguistic entities both above the level of 'language' (e.g. macrolanguages and language families) and below that level (e.g. dialects and varieties). The Working Group aims to ensure that expert input to these processes is maximised, that the principles underlying the ISO639 standard sets have a sound linguistic basis, and that registration and revision processes are consistent and transparent. These aims will be pursued by direct input to ISO TC37 (one member of the Working Group is also a member of the working group within ISO TC37 which deals with Language Coding), and by encouraging national standards bodies to be involved in the work of the Technical Committee, for example by seeking observer status in its meetings and by creating national mirror committees. Our assumption is that progress with these issues will lead to more consistent use of the resulting standard by archives and repositories.

3. Content description

Existing metadata schemas for language data (e.g. ISLE Metadata Initiative (IMDI), Open Language Archives Community (OLAC)) include vocabularies for describing the genres represented in linguistic resources, but these do not necessarily correspond to the usage or needs of different disciplines. The Working Group aims to develop a vocabulary for describing the relevant resources which will be sufficiently broad to cover the range of material represented, sufficiently accessible to be useful to researchers across a range of disciplines, and sufficiently precise to aid discovery. On this last point, we will work from the assumption that an optimal solution will be a high-level, coarse-grained vocabulary which can be extended by individual research communities to achieve the levels of precision in resource discovery which will best serve their needs.
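The idea of a coarse-grained core vocabulary which communities extend can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the class ContentVocabulary, its methods, and all of the terms ("song", "narrative", "lullaby" and so on) are hypothetical and are not drawn from IMDI, OLAC, or any vocabulary proposed by the Working Group.

```python
from typing import Dict, Iterable, Optional


class ContentVocabulary:
    """Hypothetical core vocabulary that communities extend with narrower terms."""

    def __init__(self, core_terms: Iterable[str]) -> None:
        # Each term maps to its broader term; core terms have none.
        self._broader: Dict[str, Optional[str]] = {t: None for t in core_terms}

    def extend(self, term: str, broader: str) -> None:
        """Register a community-specific term beneath an existing term."""
        if broader not in self._broader:
            raise KeyError(f"unknown broader term: {broader!r}")
        if term in self._broader:
            raise ValueError(f"term already defined: {term!r}")
        self._broader[term] = broader

    def matches(self, query: str, resource_term: str) -> bool:
        """True if resource_term equals query or is narrower than it."""
        current: Optional[str] = resource_term
        while current is not None:
            if current == query:
                return True
            current = self._broader[current]
        return False


# Invented core terms shared by all disciplines.
vocab = ContentVocabulary(["song", "narrative", "conversation"])

# Invented extensions made by individual research communities.
vocab.extend("lullaby", broader="song")
vocab.extend("procedural text", broader="narrative")

# A search on the shared core term still finds the finer-grained resource.
print(vocab.matches("song", "lullaby"))
print(vocab.matches("lullaby", "song"))
```

In this sketch a musicologist searching on the shared term would still retrieve a recording which a linguist had described with a narrower, community-specific term, while a search on the narrower term stays precise.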

The Working Group will consult across the different research communities to establish the range of resource types which need to be covered and vocabularies for describing that range. The Working Group will implement the results of this consultation by creating a set of metadata elements within the frameworks of the Component Metadata Initiative (CMDI) and the ISOCat data category registry. These technical solutions seem appropriate for the problem being tackled. CMDI is based on the idea that common metadata elements should be useable across different sites without the imposition of a rigid metadata scheme. Given that any solution resulting from the Working Group will have to be retrofitted to existing metadata catalogues, this approach is suitably flexible. The CMDI framework will also accommodate the type of extensibility mentioned above. The ISOCat framework (see the contribution of Broeder and Lannom, also in this issue[2]) seeks to provide explicit and easily accessible records of the semantic content of data categories. This is desirable in any case, but seems to us to be particularly desirable in work such as ours which crosses disciplinary boundaries. Although we anticipate that the content of some proposed data categories may be the subject of considerable debate between representatives of different disciplines, we see considerable advantages in the outcome of such discussions being treated as part of a data commons rather than being tied to any individual discipline.
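The benefit of linking descriptors to a registry of data categories can be illustrated with a small Python sketch. It is not CMDI or ISOCat syntax: the registry addresses under registry.example.org, the archive field names, and the record values are all invented for illustration. The point shown is only that two catalogues with different local field names become comparable once each field is tied to a shared, explicitly defined concept.

```python
from typing import Dict

# Hypothetical concept identifiers, standing in for registry entries.
LANGUAGE_ID = "https://registry.example.org/dc/language-id"
CONTENT_TYPE = "https://registry.example.org/dc/content-type"

# Hypothetical registry: concept identifier -> explicit definition.
REGISTRY: Dict[str, str] = {
    LANGUAGE_ID: "Identifier of a language present in the resource",
    CONTENT_TYPE: "Broad type of content recorded in the resource",
}

# Each archive keeps its own field names but links them to shared concepts.
ARCHIVE_A_FIELDS = {"lang": LANGUAGE_ID, "genre": CONTENT_TYPE}
ARCHIVE_B_FIELDS = {"subject_language": LANGUAGE_ID,
                    "performance_type": CONTENT_TYPE}


def to_concepts(record: Dict[str, str],
                field_map: Dict[str, str]) -> Dict[str, str]:
    """Re-key a local record by registry concept; unlinked fields are dropped."""
    unknown = set(field_map.values()) - set(REGISTRY)
    if unknown:
        raise ValueError(f"concepts missing from registry: {sorted(unknown)}")
    return {field_map[name]: value
            for name, value in record.items() if name in field_map}


# Invented records; "qaa" is a local-use placeholder, not a real language code.
record_a = {"lang": "qaa", "genre": "song", "shelfmark": "A-17"}
record_b = {"subject_language": "qaa", "performance_type": "song"}

shared_a = to_concepts(record_a, ARCHIVE_A_FIELDS)
shared_b = to_concepts(record_b, ARCHIVE_B_FIELDS)
print(shared_a == shared_b)  # prints True
```

Because neither archive has to rename its fields, a mapping of this kind can be retrofitted to an existing catalogue, which is the flexibility the component approach is intended to provide.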

4. Outcomes

As mentioned previously, the WG aims to deliver metadata components which address the problems of identifying languages and content types for archived recordings of human language use. Adoption of these deliverables would benefit data sharing, discovery, and the interoperability of repositories and archives; it would serve in-domain and cross-disciplinary researchers as well as the archiving institutions themselves. We believe that this work can lead to improved practice in identifying all aspects of human communication in resource descriptions, allowing improved access to and sharing of resources. Making the semantics of resource description more explicit will enable more informed re-use of resources and will facilitate automated re-use. We also see any increase in the influence which researchers have on the standards that they use as a positive outcome of our work.

About the Author

Simon Musgrave is a lecturer in the School of Languages, Literatures, Cultures and Linguistics at Monash University. His research interests include Austronesian languages, language endangerment, African languages in Australia, communication in medical encounters and linguistics as part of digital humanities. He is also a member of the Management Committee of the Australian National Corpus Project.