Annotation of Language Resources

Lecture IV.

TEI and other Language Encoding Recommendations

Abstract

This lecture presents the XML-based Text Encoding Initiative
Guidelines and other language encoding recommendations. TEI can be
used to annotate a wide variety of language resources. We present the
history, organisation and architecture of TEI and illustrate it with
applications to multilingual corpora, lexical databases and feature
structures. We also discuss other encoding recommendations, first some
language engineering standards that came about as a result of EU
projects, i.e. EAGLES/ISLE with (X)CES, and then a few lexicon
exchange initiatives, i.e. MARTIF, TMX and OLIF.

The Text Encoding Initiative was established in 1987 under
the joint sponsorship of:

ACH: Association for Computers and the Humanities

ACL: Association for Computational Linguistics

ALLC: Association for Literary and Linguistic Computing.

The impetus for the project came from the humanities computing
community, which sought a common encoding scheme for complex textual
structures in order to reduce the diversity of existing encoding
practices, simplify processing by machine, and encourage the sharing
of electronic texts. But it soon became apparent that a sufficiently
flexible scheme could provide solutions for text encoding problems
generally.

TEI became the only systematised attempt to develop a fully
general text encoding model and set of encoding conventions based upon
it, suitable for processing and analysis of any type of text, in any
language, and intended to serve the increasing range of existing (and
potential) applications and uses.

SGML was chosen as the underlying standard for the TEI
Guidelines.

The first draft of the TEI Guidelines for Electronic Text
Encoding and Interchange, TEI P1 was published in
1990.

These elements are used within the text and are, for the most part,
in-line elements with no consistent internal structure, e.g.
highlighting (<emph>), quotation (<q>), names (<name>), etc.
Also in this class are the paragraph (<p>), the list (<list>), etc.,
and some simple linkage, editorial, bibliographical, etc. elements.

TEI header

Describes an encoded work so that the text itself, its source, its
encoding, and its revisions are all thoroughly documented.

The TEI header gives the meta-data on the TEI document and
consists of four main parts (only the first is obligatory):

<fileDesc>

file description, containing a full
bibliographical description of the computer file itself; it
includes information about the source or sources
(<sourceDesc>) from which the
electronic text was derived.

<encodingDesc>

encoding description, which describes the
relationship between an electronic text and its source or sources:
it allows for detailed description of whether (or how) the text was
normalised during transcription, how the encoder resolved
ambiguities in the source, what levels of encoding or analysis were
applied, etc.

<profileDesc>

text profile, containing classificatory
and contextual information about the text, e.g. its subject
matter, the individuals
described by or participating in producing it, etc. It is
of particular use in structured composite
texts such as corpora, where it is often
desirable to enforce a controlled descriptive vocabulary or
to perform retrievals from a body of text in terms of text type or
origin.

<revisionDesc>

revision history, which allows the encoder
to provide a history of changes made during the development of the
electronic text. It is important for version
control and for resolving questions about the history of a
file.
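
A minimal header along these lines might look as follows (the content
values are invented for illustration; note that in TEI, <fileDesc>
itself requires a title, publication and source statement). The sketch
checks the four-part structure with Python's standard XML parser:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of a TEI header; only <fileDesc> is obligatory.
header = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample text</title></titleStmt>
    <publicationStmt><p>Unpublished demo.</p></publicationStmt>
    <sourceDesc><p>Born-digital; no print source.</p></sourceDesc>
  </fileDesc>
  <encodingDesc><p>Quotation marks normalised to double quotes.</p></encodingDesc>
  <profileDesc>
    <langUsage><language ident="en">English</language></langUsage>
  </profileDesc>
  <revisionDesc><change when="2002-01-15">First encoding.</change></revisionDesc>
</teiHeader>
"""

tei = ET.fromstring(header)
parts = [child.tag for child in tei]
print(parts)  # the four main parts, in order
```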

<div type="story">
<head rend="large underlined" type="sub">
President pledges safeguards ...</head>
<head rend="very large bold" type="main">
Major agrees to enforced no-fly zone</head>
<byline>
By George Jones, Political Editor, in Washington</byline>
<p>Greater Western intervention in the conflict in
former Yugoslavia was pledged by President Bush ...</p>
</div>

The EAGLES Recommendations on corpus encoding resulted in CES, the Corpus Encoding
Standard, an SGML DTD which is a particular parameterisation (and
modification) of TEI P3.

CES specifies a minimal encoding level that corpora must achieve
to be considered standardised in terms of descriptive representation
(marking of structural and typographic information) as well as general
architecture (so as to be maximally suited for use in a text
database). It also provides encoding specifications for linguistic
annotation, together with a data architecture for linguistic corpora.

CES has been used in a number of corpus projects, to a large
extent because it is simpler to use and understand than the full
TEI.

CES recommends stand-off annotation for linguistic analyses.
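
The idea can be sketched as follows: the primary text carries only
structural markup and identifiers, while a separate annotation document
points back into it. The element and attribute names below are
illustrative, not the exact CES tag set:

```python
import xml.etree.ElementTree as ET

# Primary data: structural markup and ids only, no linguistic analysis.
primary = ET.fromstring(
    '<p><s id="s1">Colourless green ideas sleep furiously.</s></p>'
)
# Stand-off annotation: a separate document pointing at "s1".
annotation = ET.fromstring(
    '<annot><seg target="s1" type="pos">JJ JJ NNS VBP RB</seg></annot>'
)

# Resolve the link: find the sentence the annotation points at.
targets = {s.get("id"): s for s in primary.iter("s")}
seg = annotation.find("seg")
linked = targets[seg.get("target")]
print(linked.text)
```

Keeping the analyses in separate documents means the primary data stays
stable while any number of (possibly conflicting) annotation layers can
be added or revised independently.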

For the encoding of primary data, CES identifies three levels of encoding:

Level 1

The minimum encoding level required for CES conformance,
requiring markup for gross document structure (major text divisions),
down to the level of the paragraph, conformant to the cesDoc
DTD.

Level 2

This level requires that paragraph-level elements are correctly
marked and that, where possible, the function of rendition information
at the sub-paragraph level is determined and the corresponding elements
are marked accordingly.

Level 3

This is the most restrictive and refined level of markup for
primary data. It places additional constraints on the encoding of
s-units and quoted dialogue, and demands more sub-paragraph level
tagging.
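
As an illustration of Level 1, the sketch below marks only gross
document structure down to the paragraph, in the spirit of the cesDoc
DTD (the element names and nesting are simplified for the example):

```python
import xml.etree.ElementTree as ET

# Level 1 sketch: major text divisions down to the paragraph, nothing
# below paragraph level is marked.
doc = ET.fromstring("""
<cesDoc>
  <text>
    <body>
      <div type="chapter">
        <p>First paragraph of the chapter.</p>
        <p>Second paragraph of the chapter.</p>
      </div>
    </body>
  </text>
</cesDoc>
""")

n_paras = len(doc.findall(".//p"))
print(n_paras)  # paragraphs are the deepest required level
```

Levels 2 and 3 would refine this same document, e.g. by marking
sub-paragraph rendition and s-units, without changing the gross
structure.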

ISLE,
the International Standards for Language Engineering, is a continuation
of EAGLES; it is at once a project and a set of co-ordinated
activities in the Human Language Technology (HLT) field.

The aim of ISLE is to develop HLT standards within an
international framework, in the context of the EU-US International
Research Cooperation initiative. Its objectives are to support
national projects, HLT RTD projects and the language technology
industry in general by developing, disseminating and promoting de
facto HLT standards and guidelines for language resources, tools and
products.

ISLE Working Groups:

Computational Lexicons

Natural Interaction and Multimodality

Evaluation

The EAGLES/ISLE metadata
initiative has as its goal a proposal for a standard for the
meta-data description of Multi-Media/Multi-Modal language
resources. Using such a standard, it should become possible to create a
browsable and searchable universe of such resources on the Internet.

An abstract XML-based framework (defined via XML Schema and linking
mechanisms) that provides a “definition of underlying structures and
mechanisms needed for the computer representation of terminological
data” and “independence with regard to any specific format”.

LISA
(Localisation Industry Standards Association) was
founded in 1990 as a non-profit association joining the globalisation,
internationalisation, localisation, and translation business
communities.

TMX
(Translation Memory eXchange) is a specification (XML DTD)
to allow easier exchange of translation memory data between tools
and/or translation vendors with little or no loss of critical data
during the process.

TMX is defined in two parts:

container format specification: for the higher-level
elements that provide information about the file as a whole and about
entries (a multilingual entry is a translation unit,
<tu>, composed of monolingual segments);

a specification of the low-level meta-markup format for the content of
a segment of translation-memory text.
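
A minimal container along these lines (the header attributes and text
content are invented for illustration) can be sketched and parsed as
follows:

```python
import xml.etree.ElementTree as ET

# TMX container sketch: a <tu> (translation unit) groups
# language-specific <tuv> variants, each holding one <seg>.
tmx = ET.fromstring("""
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          creationtool="demo" creationtoolversion="0.1"
          adminlang="en" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="de"><seg>Hallo Welt</seg></tuv>
    </tu>
  </body>
</tmx>
""")

segs = [seg.text for seg in tmx.iter("seg")]
print(segs)  # the monolingual segments of the translation unit
```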

TMX offers two levels of implementation:

Level 1 (Plain Text Only) - Support for the container only. The
data inside each <seg> element is plain text.
This level is sufficient when the data does not have inline codes.

Level 2 (Content Markup) - Support for both container and
content. Tools supporting TMX Level 2 can re-create the translated
version of an original document by using only the TMX document.
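
A sketch of a Level 2 segment, assuming TMX's paired <bpt>/<ept> inline
elements (the sentence and formatting codes are invented): the original
formatting codes travel inside <seg>, so the formatted text can be
re-created from the TMX data alone.

```python
import xml.etree.ElementTree as ET

# A <seg> with Level 2 inline codes: <bpt>/<ept> hold the paired
# begin/end formatting codes of the original document.
seg = ET.fromstring(
    '<seg>Click <bpt i="1">&lt;b&gt;</bpt>OK'
    '<ept i="1">&lt;/b&gt;</ept> to continue.</seg>'
)

# Re-create the formatted original by splicing the stored codes back
# in between the runs of plain text.
formatted = (seg.text or "") + "".join(
    (el.text or "") + (el.tail or "") for el in seg
)
print(formatted)  # Click <b>OK</b> to continue.
```

A Level 1 tool would see only the plain text; a Level 2 tool can use
the matching i="1" indices to keep the code pair intact across the
translation.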

An EU project (1996-1999) whose objective was to ease the integration
of tools and functions that help with translation. One subtask was
“specifying a common format for text and lexical resources, with
mechanisms for handling other current document formats, and adapting a
range of NLP systems to accept these formats”.

OLIF is intended to be a user-friendly vehicle for exchanging
terminological and lexical data: it is XML-compliant and offers
support for NLP systems, such as machine translation, by providing
coverage of a wide and detailed range of linguistic features.

The current official version of OLIF is V2.0, published in February 2002.

A lexical entry is supposed to mimic a feature-value
representation.

The content model has been kept very flat, with almost no use
made of attributes.

There is some support for user extensions à la TEI, i.e. the “%x.”
parameter entity.

One of the main achievements of OLIF seems to be that it provides an
extensive list of inflectional classes and grammatical categories for
5 EU languages.

A lexicon is divided into a header and a body; the body is
composed of lexical entries.
The element classes of an OLIF v.2 entry are:

monolingual: defines monolingual data; each OLIF entry may contain only one monolingual group

cross-reference: defines cross-reference relations between
the given entry and other entries in the lexicon in the same language

transfer: defines transfer relations between the given entry and other entries in different languages
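
Putting the pieces together, an OLIF-style entry might be sketched as
below; the element names are illustrative stand-ins rather than the
exact OLIF v.2 tag set, chosen to show the flat feature-value content
model of the monolingual group and a transfer link:

```python
import xml.etree.ElementTree as ET

# Sketch of an OLIF-style entry: one monolingual group of flat
# feature-value elements, plus a transfer link to another language.
entry = ET.fromstring("""
<entry>
  <mono>
    <canForm>bank</canForm>
    <language>en</language>
    <ptOfSpeech>noun</ptOfSpeech>
    <subjField>finance</subjField>
  </mono>
  <transfer>
    <trCanForm>Bank</trCanForm>
    <trLanguage>de</trLanguage>
  </transfer>
</entry>
""")

# The flat content model maps directly onto feature-value pairs.
features = {el.tag: el.text for el in entry.find("mono")}
print(features)
```

Because attributes are hardly used, each feature is simply an element
whose tag is the feature name and whose text is the value, which is what
makes the mapping to a feature-value representation so direct.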