This document forms the chief deliverable for Work Package 3 of the
ELRA contract for validation of language corpora. It discusses the
theoretical basis underlying our approach to the formal validation of
language corpora, and makes some recommendations about relevant
techniques and practices which may be of assistance in performing such
evaluations, and documenting their results.
Particular attention is paid to the specific case of
morpho-syntactically annotated corpora.

Some confusion exists about the terminology associated with
linguistically annotated corpora. This is partly because the term
tagset is used differently by two different communities.
For the traditional corpus linguist, a tagset is the set of possible
values used to annotate a text explicitly with a linguistic analysis;
for example, the CLAWS tagset comprises a set of values such as
NN1, VVD etc., each of which has a specific significance
(singular common noun, past-tense verb, etc.). For the markup
specialist, however, the term tagset refers to any kind of annotation,
in particular the collection of SGML tags corresponding to the
elements defined in a particular DTD: for example, the TEI defines a
number of tagsets, each containing definitions for specific SGML
elements and attributes.
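The two senses can be set side by side on a single sentence. The
following sketch uses illustrative CLAWS-style values and a
hypothetical TEI-like element; neither is drawn from an actual tagset
definition:

```python
# The same analysis under the two senses of "tagset" (illustrative
# values only, not a real tagset definition).

# Corpus-linguistic sense: the tagset is the set of values (AT0, NN1, VVD, ...)
# attached to word tokens.
claws_style = "the_AT0 dog_NN1 barked_VVD"
tokens = [t.rsplit("_", 1) for t in claws_style.split()]

# Markup-specialist sense: the tagset is the set of SGML elements
# (<s>, <w>, ...) defined by a DTD, here a hypothetical TEI-like one.
sgml_style = ('<s><w type="AT0">the</w> '
              '<w type="NN1">dog</w> '
              '<w type="VVD">barked</w></s>')
```

Both representations record the same analysis; only the carrier of the
annotation differs.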

Both usages reflect the fact that all markup introduced into a text
is identical, at some level of analysis, in the sense that it serves
to record or assert an association between stretches of text and
values taken from some externally defined set of
interpretations. However most people seem to categorize an analysis
such as ``this is a paragraph'' differently from the formally
equivalent judgement ``this is a noun''. The former judgement is said to
be `structural' and the latter
`interpretative'. This kind of categorization also
underlies the notion of `level' of annotation as exemplified by
(inter alia) the Corpus Encoding Specification (Ide 1998), where the distinction is further justified by the observation
that the addition of so-called `structural' markup
is generally easier to automate than that of
`interpretative' markup, since the latter (almost)
invariably requires human judgement and knowledge, while the former
rarely does. Particularly in the case of textual markup,
interpretative judgements tend to be more controversial than
structural ones, if only because the latter relate to aspects of a
text which are accepted as intrinsic to its substance by the community
of text readers. Structural interpretations form part of the
`contracts of literacy' (Snow
and Ninio, 1986) which form the precondition of a text's
recognition as meaningful by the members of a particular community of
readers.

For purposes of validation, however, the distinction seems
unhelpful. All markup introduced into a corpus should be validated in
the same way, and the validity of the corpus overall is equally
affected by each type of markup used. Nevertheless, we have subdivided
our discussion into two parts, reflecting the division currently made
by most practitioners between structural and interpretative markup, a
division which is consequently reflected in actual practice. Structural
markup is most generally to be validated with reference to an abstract
model of textual components and features which is either entirely
intuitive and `common sense' based, or defined in
terms of some consensus-based model such as that of the TEI, restated
as an SGML DTD. Interpretative markup may be similarly theory-free
(see, for example, Leech 1993), but it is
more customary to define it with reference to some explicitly stated
analytic model, and hence to facilitate both automatic validation of
the corpus itself (to check that it is valid in its own terms) and
comparison of two corpora using different markup schemes derived from
a common abstract model.

In section 2 we discuss the process by which the
structural markup defined for a given corpus may be validated. The
formal mechanism used for this purpose is an SGML document type
definition. In section 3 we discuss in more detail
one particular kind of interpretative markup: that which seeks to make
explicit morpho-syntactic analysis of a text. We present here an SGML
scheme for the formal expression of an abstract model that may be used
to validate such analyses both internally and externally. Finally, in
section 4 we suggest some ways in which the result of
either validation exercise may be formally documented. We begin,
however, by describing the model of formal validation which underlies
both descriptions. (For a more detailed discussion of the
principles adumbrated here, see Sperberg-McQueen and
Burnard 1995).

We begin by positing the existence of textual features
or abstractions, instances of which are predicated at various
positions within a document. The function of markup is to indicate
unambiguously the presence of instances of such features. For example,
a document may contain instances of the feature
`segment', whose presence might be signalled by
such markup conventions as:

the start of a new input record;

the presence of some distinguishing code or sign such as a star,
not otherwise present in the text;

the presence of some predefined symbol such as the tag <s>.

As noted above, the presence and scope of a feature such as
`singular noun' may be predicated in exactly
the same way.
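The three signalling conventions just listed can be contrasted in a
small sketch. The conventions, the function name, and the choice of a
star as the distinguishing sign are illustrative assumptions, not part
of any actual encoding scheme:

```python
import re

def segment_starts(text, convention):
    """Return the character offsets at which a new `segment' begins,
    under three hypothetical signalling conventions."""
    if convention == "record":   # the start of a new input record (line)
        return [m.start() for m in re.finditer(r"^", text, re.M)]
    if convention == "star":     # a sign (here *) not otherwise in the text
        return [m.start() for m in re.finditer(r"\*", text)]
    if convention == "tag":      # a predefined symbol such as the tag <s>
        return [m.start() for m in re.finditer(r"<s>", text)]
    raise ValueError(f"unknown convention: {convention}")
```

Whatever the convention, the result is the same: a set of positions at
which an instance of the feature is predicated.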

We further assume that it is possible to define a
grammar for such markup symbols: that is, a grammar which
defines which combinations of such symbols in a document are to be
regarded as legal. Such grammars generally have regard
only to the markup language itself, rather than its extension to the
underlying feature set. A markup grammar may simply enumerate all
legal markup tokens, or merely specify an algorithm for the
identification of markup tokens with no consideration of which markup
tokens might be permitted. A more complex grammar (such as SGML) may
also be used, enabling the formulation of contextual
rules such as ``the tag X is only legal within the scope of the
component identified by tag Y'' in addition to these kinds of rules.
Note however that legality is still defined here in terms of syntax:
only informal legislation can determine whether the content of an SGML
element is `correct' with reference to some
semantic model. Publications such as the TEI Guidelines
typically extend the syntactic definitions embodied in their DTDs by
more or less detailed discussion of the intended semantics of
elements, but rarely provide a formally verifiable abstract model of
such semantics, nor is it entirely clear what such a model might
resemble. Nevertheless, throughout our discussion we will use the
term feature (and derivatives) to refer to components of
such a model, and the term tag (and derivatives) to refer
to components of the markup system used to assert their existence.
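A contextual rule of the kind described above can be checked
mechanically once the markup tokens are identifiable. The following is
a minimal sketch over a toy, well-nested SGML-like tag stream; the tag
syntax and the `rules` mapping are illustrative assumptions, and a
real SGML system would derive such rules from a DTD instead:

```python
import re

def check_context(markup, rules):
    """Check rules of the form 'tag X is only legal within the scope
    of tag Y' over a toy <x>...</x> tag stream (not real SGML).
    `rules` maps a tag name to the tag it must appear inside."""
    stack, errors = [], []
    for m in re.finditer(r"</?(\w+)>", markup):
        name = m.group(1)
        if m.group(0).startswith("</"):
            if stack and stack[-1] == name:
                stack.pop()                      # well-nested close
            else:
                errors.append(f"unexpected </{name}>")
        else:
            required = rules.get(name)
            if required is not None and required not in stack:
                errors.append(f"<{name}> outside <{required}>")
            stack.append(name)
    return errors
```

Note that, as the text observes, such a check establishes only
syntactic legality; it says nothing about whether the element content
is correct with respect to a semantic model.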

This distinction seems to us crucial to the feasibility of
validation: ``A corpus is a collection of utterances, and therefore a
sample of actual linguistic behaviour. However, even if we do not believe
that the distinction between competence and performance is valid,
a corpus is not itself the behaviour, but a record of this behaviour'' (Stubbs, 1996). The function of the markup in the
corpus is to make explicit, and hence accessible to comparative study,
this recording process, for both structural and interpretative encoding in a corpus text.
Without this, neither comparative studies of
different corpora, nor any assessment of the validity of the corpus
`record' with respect to what it
`records' will be possible.

We define the process of validation as follows:

1. for each feature of interest, does the document contain any
tagging?

2. is the tagging of the document syntactically correct?

3. is the tagging of the document consistently applied (i.e. is
every occurrence of a given feature tagged in the same way)?

4. is the tagging of a document correctly applied, with reference to
some externally (or internally) defined abstract model?

5. if correct, is the tagging of a document complete, with reference
to some externally (or internally) defined list of mandatory features?
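The first question can be answered only partially by machine: software
can report whether any tagging asserted to encode a feature is present
at all, though not whether every instance of the feature has in fact
been tagged. A minimal sketch, assuming a hypothetical mapping from
feature names to the markup tokens used for them:

```python
def coverage_report(document, feature_tags):
    """For each feature of interest, report whether the document
    contains any of the markup tokens asserted to encode it.
    `feature_tags` maps a feature name to its markup tokens; the
    mapping used below is purely illustrative."""
    return {feature: any(token in document for token in tokens)
            for feature, tokens in feature_tags.items()}

report = coverage_report(
    "<s>the <w>dog</w></s>",
    {"segment": ["<s>"], "word": ["<w>"], "noun": ["NN1"]},
)
```

A negative answer here is decisive (no tagging for the feature exists);
a positive answer still leaves open the questions of consistency and
correctness discussed below.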

Taking these in reverse order, it is clear that, in the general
case, the last two of these stages are automatable only to the extent that
an abstract model can be formally specified for both the feature
system itself and for the intended correspondence between that and the
tagging employed. We present in section 5.1 below one
such abstract model, the EAGLES Guidelines for morpho-syntactic
annotation ( Leech and Wilson, 1994),
re-expressed as a TEI-conformant feature system, against which any
other set of morpho-syntactic annotations using the same
representation may be validated, without necessarily having to conform
to the EAGLES model. We also discuss the somewhat simpler abstract
model proposed by EAGLES itself in section 3.2 below.

Equally clearly, however, neither the third nor the first of the
stages above can in principle be automated, since both depend
on a human judgement to the effect that such and such a feature is in
fact present, whether or not it is signalled by the tagging in a
text. Such text-comprehension abilities still seem to be somewhat
beyond the state of the art in NLP, despite some advances.

The second of the five stages above is, however, automatable, to
the extent that the tagging syntax of the document is fully specified.
In an SGML context, this implies the existence of a DTD against which
candidate documents can be verified using an SGML parser. For other
forms of markup, validation may involve other forms of verification,
some of which may be intimately tied to the behaviour of particular
application software. For example, a document marked up in RTF or
LaTeX may be considered valid so long as Microsoft Word or LaTeX does
not reject it, irrespective of its output. Technical documentation
will often specify what markup should be found in a document: where
the markup syntax is arbitrary or application-specific, special-purpose
software must clearly be developed to validate it.
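In the XML analogue of this situation, the purely syntactic check can
be sketched with a standard-library parser. Note the limitation: the
sketch below checks only well-formedness, whereas validation proper,
against a DTD, requires a validating SGML or XML parser:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Syntactic check only: well-formedness in the XML sense.
    This does NOT validate against a DTD; that requires a
    validating parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A document rejected by this check necessarily fails validation; a
document accepted by it may still be invalid with respect to its DTD,
and remains entirely unexamined with respect to any semantic model.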