Encoding the British National Corpus

Gavin Burnage and Dominic Dunlop
Oxford University Computing Services

Published in English Language Corpora: Design, Analysis and Exploitation, Papers from the 13th international conference on English Language research on
computerized corpora, Nijmegen 1992, edited by Jan Aarts, Pieter de Haan
and Nelleke Oostdijk.

The British National Corpus (BNC) project is currently constructing a
100 million word corpus of modern British English for use in
linguistic research. It is a collaborative, pre-competitive initiative
carried out by Oxford University Press (OUP), Longman Group UK Ltd.,
Chambers, Lancaster University's Unit for Computer Research in the
English Language (UCREL), Oxford University Computing Services (OUCS),
and the British Library. The project receives funding from the UK
Department of Trade and Industry and the Science and Engineering
Research Council within their Joint Framework for Information
Technology.

OUCS's main role in the project is to encode all corpus texts in a
standard format, and to act as a central clearing-house for the
exchange and storage of corpus texts for all parties involved in BNC
construction work. The common encoding scheme agreed within the
project is called the `Corpus Document Interchange Format', or CDIF
(Burnard, 1992c). It is an application of the Standard Generalized
Markup Language (SGML) (Goldfarb, 1990; ISO 1986), and conforms in
large measure to the recommendations of the Text Encoding Initiative
(TEI) for the encoding of linguistic corpora (Sperberg-McQueen and
Burnard, 1992a). CDIF is the format in which the BNC will be
published at the end of the project.

OUCS's work is one part of a production line which involves most of
the project's participants directly. The starting point is the
creation of electronic versions of a wide range of texts in British
English; the finishing point is those same texts encoded in CDIF to
show both the structure of each text and the syntactic analysis of
each sentence. This production line is illustrated in figure 1.

The commercial publishers in the project are responsible for the
initial stages of the process, namely supplying electronic versions of
texts selected for inclusion in the corpus in accordance with the
design criteria (British National Corpus, 1991a, 1991b). There are
three ways of obtaining electronic text. The first is to scan printed
material with devices such as the Kurzweil Data Entry Machine or the
Microtek 600; the second is to key in text directly; the third is to
use material which is already in electronic form, usually from
publishers or existing archives. In
practice, the amount of existing electronic text which fulfils the
corpus design criteria has been far smaller than envisaged, which
means that the bulk of the material received at OUCS has been scanned
or typed. Naturally the transcribed spoken material for which Longman
is responsible can only be typed. Both OUP and Longman use their own
internal mark-up schemes for the encoding of the data they supply for
the BNC (Davis 1992) --- though a pre-condition for a text's inclusion
in the corpus is that its automatic conversion to CDIF must be easy to
implement (Burnage 1992a, Clear 1992). This illustrates one important
reason for the use of TEI-conformant SGML in distributing the corpus:
researchers are free to convert to and from the encoding systems and
software they are happy using in their local set-up, but for the
exchange of data between different computational set-ups, a single,
standardized encoding scheme is to be preferred.

The range of texts received at OUCS is very broad, not only in terms
of subject matter and linguistic register, but also in terms of
textual structure. The headings, sections, and paragraphs of an
academic article are usually well marked, and its logical structure is
easy to follow; in contrast, feature articles from colour magazines
often contain short snippets of text which are hard to identify ---
they could be paragraphs, or headings, or captions --- in no particular
order. Moreover, transcription of conversation presents another set of
encoding problems (Crowdy 1991). CDIF has been designed to accommodate
the encoding of texts whose structures differ widely. It is a single
SGML document type definition (DTD) which sets out a formal
description that every BNC text must match in order to become a part
of the corpus. This formal description is broad enough to encompass
the many different types of text intended for the corpus, and rigorous
enough to show up some of the errors which occur in the mark-up of
these texts.

The first task which OUCS performs when new electronic text arrives,
therefore, is to convert the mark-up to the various conventions set out
in the CDIF DTD. The success or otherwise of this conversion can be
gauged on one level by using an SGML parser to check the mark-up of the
text against the formal description in the DTD. The parser reports any
errors in the text, and these can be corrected by hand. If the error
is one likely to recur frequently, small programs can be used to speed
up the correction process. When the document conforms to the DTD,
the parser finds no further errors.
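
By way of illustration, the following much-simplified fragment shows
the kind of declaration a DTD contains, together with a document which
violates it. The element names follow CDIF usage described later; the
content models are invented for the example:

   <!ELEMENT div  - -  (head?, p+) >
   <!ELEMENT head - O  (#PCDATA) >
   <!ELEMENT p    - O  (#PCDATA) >

   <div>
   <head>A sample heading</head>
   <p>This paragraph satisfies the content model of div.
   <head>A second heading here is an error: the model permits
   only one head, and only before the first paragraph.</head>
   </div>

On reaching the second <head>, the parser reports that the element is
not permitted at that point, and identifies the line at which the
problem occurs.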

Before this work is carried out, each incoming text is assigned a
unique code name, and stored on disk in accordance with agreed file
storage procedures (Burnage 1992b). As well as identifying each text
in the file system, the code name is used in a database which stores
details about each text and its progress along the production
line; these records are updated regularly (Dunlop 1992c).

The fact that a text conforms to the DTD does not necessarily mean
that it is faultless or perfectly encoded. Tags can conform to the
expectations of the DTD and the parser, yet still be misapplied
or misused. The caption to a photograph might mistakenly be labelled a
heading, for example. Certain significant textual features such as
chapters and paragraphs might not have been tagged at all. There may
also be more fundamental problems: portions of text present in the
original book, conversation, or whatever may have been omitted from
the electronic version. Scanning software or transcribers may
inadvertently have introduced typographical errors. For these reasons,
a `semantic' check follows each successful CDIF parse. A portion of
each text is examined for errors such as those described above, almost
always against the original printed version (in the case of written
texts). Each error found is corrected, as are any similar
errors which can easily be identified in the rest of the text. If this
examination shows that extensive manual correction would be required to
bring the text up to standard, the text may be `bounced' --- that is,
returned to its original sender, who may correct it or simply
provide other appropriate texts in its place. Such constraints
are designed to ensure that the production line keeps moving at a
reasonable rate. Full correction of every badly-encoded text would,
unfortunately, cost too much time and effort.

Another task carried out at OUCS is the addition of a CDIF `header'
which supplies bibliographic and other information at the beginning of
every text (Dunlop 1992b). This includes the title, the publisher,
the date and place of publication, the age, sex, and regional origin
of the author, information about the sample size, and so on. For
spoken material, details are given about the people who participated
in the conversations and activities recorded. Much of this information
comes to OUCS as part of the electronic texts prepared by the
publishers, and it is also stored in a database. Including it in a
header for each text allows researchers to find out more about each
text; including it in a database means that researchers can extract
sub-corpora from the main 100-million word corpus. These sub-corpora
can be designed according to the researcher's own needs. The database
also means that while the corpus is being constructed, a continuous
check can be made on the way the stipulated design criteria are being
met. If, for example, too few books by female writers from North-east
England have been added to the corpus, then the publishers who supply
text to OUCS can be alerted and take steps to remedy the imbalance in
the corpus.

After a text has passed the CDIF and semantic checks satisfactorily,
it is sent to UCREL in Lancaster. There syntactic tagging is carried
out before the texts are returned to OUCS for one last CDIF conformance
check. Once this has been done, the text becomes an official part of the
corpus.

There are three full-time staff working for the BNC at OUCS, with a
wide variety of skills and interests. There is therefore a
correspondingly wide range of software tools in use to carry out the
work described above.

Processing power comes from two Sun Microsystems Sparcstation 2
machines running Sun's UNIX operating system. Given the large memory
(32 megabytes) and processing speed (28 mips) of these machines, long
texts can be processed quickly. Hard disk storage space currently
amounts to two gigabytes; this will shortly be doubled.

For processing text, standard UNIX tools such as awk (Aho 1988)
and sed (Dougherty 1991) are in frequent use, along with
perl (Wall 1991) when required.

Also used extensively is the Icon programming language, developed
at the University of Arizona by Ralph and Madge Griswold. It is
particularly suited to the manipulation of character strings ---
making it an ideal tool for the re-formatting and encoding of text
corpora (Griswold & Griswold 1990). It is available for a wide range
of machines and operating systems, and is in the public domain.

The main SGML tool used is a public-domain parser called SGMLS (Clark
1992). From within the emacs text editor, the parser can be invoked
to analyse a text while it is being edited. This speeds
up the checking and correction process considerably.

The bibliographic database is implemented with the INGRES database
management system, which is available to OUCS under a local site licence
agreement (Ingres 1989).

As a three-year project with budget constraints and ambitious data
collection targets, the BNC cannot be over-ambitious in the amount or
complexity of the mark-up that it applies to captured text. The
mark-up which is applied relates to the content of written and spoken
texts at a variety of levels:

-- Character level: The corpus is held as plain ASCII text (strictly,
it uses the International Reference Version of ISO 646:1990 (ISO 1990)).
Characters outside the limited set permitted by this standard are
represented by mark-up.

-- Word level: Word-class tagging is applied to each word in the
corpus.

-- Phrase level: A small selection of the texts in the BNC (the `core
corpus') is tagged at the phrase level, with parse-tree analysis.

-- Sentence level: The word-class tagging process divides all texts in
the corpus into segments, which correspond closely to sentences in
running text. Segments are also used in a reference system which
allows a unique reference to be generated for any segment in the
corpus.

-- Structural level: Where appropriate and possible, the structure of
each document --- consisting of chapters, sections, paragraphs, or
similar elements --- is marked.

-- Text level: Each text in the corpus is accompanied by a
comprehensive header giving bibliographic information, and listing the
criteria by which the text was selected for inclusion in the corpus.

As has been stated, it was decided at the outset to use SGML in order
that consistent mark-up could be applied throughout the corpus.
Further, the recommendations of the TEI were to be adhered to where
possible.

Strictly speaking, SGML is not itself a mark-up language; rather, it is
a language in which mark-up languages may be defined. Consequently,
it is possible using SGML to express two functionally identical mark-up
languages which are nevertheless incompatible because, for example,
they use different names for the same element, or because they use
different character sets. Such incompatibilities would make it
difficult for researchers using the two schemes to exchange data sets,
so the use of SGML alone does not provide a solution to the problem
caused in the past by lack of mark-up standardization. (It does,
however, address the problem of a lack of common tools: subject only
to capacity limitations, and to the ability to handle optional
extensions to the base standard, any SGML-aware tool can process any
document marked up using SGML.)

The TEI sets out to define an application of SGML which minimizes
incompatibilities between the mark-up used by different researchers,
while allowing both subsetting and extension. Its recommendations try
to describe a spectrum of SGML document type definitions (DTDs) which
may be applied to a wide variety of text types, defining mark-up which
will facilitate the use and, importantly, the exchange of marked-up
text for a wide variety of scholarly and didactic activities.
Sperberg-McQueen and Burnard (1992a) divides the features that
particular researchers might want to address into a number of subsets,
recommending the manner in which tagging should be applied, and giving
names which should be used for tags marking particular types of
element. Those following the recommendations are free to implement as
much or as little of each subset as is required for their application,
and may use tags of their own devising to mark elements not described
in the recommendations. Thus, a TEI-conformant mark-up may be
characterized by the extent to which it implements each subset of the
recommendations.

Broadly, CDIF provides a relatively sparse implementation of the text
body tagging described by the recommendations; a complete (and,
indeed, extended) implementation of the text and corpus header
recommendations; and a medium level of word-class tagging. The
subsections which follow give more detail:

While the language of the BNC is modern British English, which can
generally be represented in ISO 646 IRV, there is a need to represent
accented Roman letters, the Greek alphabet, and a variety of printers'
marks, such as em-dashes, degree signs, and bullets. An annex to SGML
(ISO 1986) provides `public entity sets' which address almost all of
the needs of the BNC, with marks such as &aacute; (small letter a with
acute accent), &mdash; (em dash), and &degree; (degree sign). Only a
few additional marks have been introduced. These include &ft; and
&inch; (prime and double prime used to indicate measurements in feet
and inches respectively); and &bquo; and &equo; for normalized
beginning and ending quotation marks, replacing the variety of marks
used for this purpose in the original texts. Dunlop (1992a) lists the
marks (entities) used in the BNC.
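
In running text, then, a fragment of a corpus document might read as
follows (the sentence itself is invented for the example; the entities
are those listed above):

   <p>&bquo;The room measures 12&ft; 6&inch; by 9&ft;,&equo; said the
   agent &mdash; the caf&eacute; next door seemed larger.</p>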

The works in the BNC inevitably contain words, phrases or passages in
languages other than modern British English. These may be in
non-British varieties of English, in other modern languages, in
archaic forms of English or other languages, or in dead languages.
Some written works also contain
representations of modern British dialects, or of English spoken with
a non-British accent. Additionally, some of those whose speech is
transcribed in the spoken part of the corpus speak with regional or
ethnic accents. The project does not have the resources to undertake
the very difficult task of marking each departure from standard
British English in its texts. Besides, in many cases such judgements
would inevitably be subjective. Consequently, although CDIF provides
a means by which shifts in language may be marked, this mark-up is not
applied in practice. Instead, languages seen during the process of
semantic checking of a written text are noted in its header.
(Language which cannot be represented using the marks available ---
for example, Hebrew or Japanese --- is deleted.) For spoken texts, no
attempt is made to provide a phonetic or prosodic representation of
the transcribed speech; words are regularized to standard British
spelling. (An exception is made in the case of words which appear in
the project's control lists of vocalized pauses, and regional and
dialectal usages.)

The TEI P2 recommendations (Sperberg-McQueen and Burnard, 1992a)
require any conformant text to have a header which, at a minimum,
gives brief bibliographic information about the electronic text and
its source. Where many texts are assembled to make up a corpus, TEI
P2 describes a separate corpus header, which gives bibliographic
information about the corpus, and information which is common to all
the texts it embodies. In the BNC, both corpus header and text
headers approach the maximum level of detail provided for in TEI P2,
and in some respects exceed it. Reasonably comprehensive
bibliographic information about text titles and authors is provided,
along with the detailed information needed to define and enforce the
selection criteria used in deciding which texts should be included in
the corpus. (See British National Corpus 1991a, 1991b.) Headers also
describe the processing undergone by each text, state the restrictions
on the use of each text, and identify the holders of copyrights around
the world. For spoken texts, headers provide as much demographic
information about participants as possible.
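
In outline, the header of each text takes a form something like the
following sketch, using the element names of the TEI header with the
content of each element elided; the full BNC structure is described by
Dunlop (1992b):

   <teiHeader>
   <fileDesc>
    <titleStmt> <title>...</title> <author>...</author> </titleStmt>
    <publicationStmt>...</publicationStmt>
    <sourceDesc>...</sourceDesc>
   </fileDesc>
   <encodingDesc>...</encodingDesc>
   <profileDesc>...</profileDesc>
   <revisionDesc>...</revisionDesc>
   </teiHeader>

The BNC-specific detail (selection criteria, demographic information,
and so on) is carried within these elements.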

TEI P2 provides a rich set of tags which are expected to be
applicable to most conformant texts. Examples are <p> to mark
paragraphs, and a variety of <div>s to mark higher-level structure.
CDIF provides for the use of many of these `common tags', although,
as described in Tag Classification below, there is no requirement that
all features for which CDIF defines a tag be identified in any given
text.
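
The body of a typical written text therefore has a familiar overall
shape, sketched below (the attribute usage shown is illustrative):

   <div type="chapter" n="1">
   <head>Chapter heading</head>
   <p>First paragraph ...</p>
   <p>Second paragraph ...</p>
   </div>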

Each word in the BNC has a class assigned to it by CLAWS, a
probabilistic tagger (see Garside et al 1987). A companion paper in
this volume (Eyes 1992) discusses this process. TEI P2 describes a
general mechanism which uses `feature structures' to mark parts of
speech and other features of any language. Used directly, feature
structures can be extremely verbose, and provide for the encoding of
far more information than is necessary to characterize modern English,
or can be captured by a mainly-automatic tagging process.
Consequently, the BNC uses a relatively small set of short `tags'
(actually SGML entity references), which either expand to or point to
`canned' feature structure definitions. (The exact mechanism to be
used in the final corpus has not been determined at the time of
writing.) A list of the entities may be found in Burnard (1992c) or
Leech (1992a); examples of corresponding feature structures are given in
Langendoen (1992). Leech (1992b) lists the entities used in the
million-word core corpus, a selection of BNC texts subjected to a more
detailed analysis.
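
Purely by way of illustration (the final inline form being, as noted,
undecided), a tagged segment might appear as follows, where AT0, AJ0,
and NN1 are the CLAWS tags for article, adjective, and singular common
noun, and the segment number is invented:

   <s n="0042">&AT0;The &AJ0;national &NN1;corpus ...</s>

Each entity reference then expands to, or points to, the canned
feature structure describing that word class in full.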

It is the BNC's intention to mark many varieties of text ---
transcribed speech, books, plays, periodicals, letters, handbills...
--- using a common tag set. Consequently, CDIF embodies few of the
provisions for specific text types described in TEI P2. Areas for
which special-purpose tagging is defined include poetry,
drama, and, importantly, the transcription of spoken material. In the
latter, tags exist to handle overlap, truncation, manner of delivery,
and a variety of vocal and non-vocal events.

TEI P2 describes a text header applicable to any conformant text, and
a corpus header which is similar in structure, but applicable only to
conformant corpora. The BNC provides a full implementation of both,
with the intention of allowing researchers to build sub-corpora
reflecting some feature or combination of features of the texts
represented in the corpus. Dunlop (1992b) describes the BNC headers
in detail.

As material has been collected for the BNC, it has become apparent
that there are many situations where it would be desirable to link a
number of texts --- or subsections of a single text --- together
because they have some characteristic in common. To give two
examples, the same principal speaker will appear in many spoken corpus
texts; a single reporter may contribute more than one article to a
given edition of a newspaper. While such common features may be
established by examination of text and corpus headers, CDIF provides
no means of making such links explicit. At the time of writing, the
TEI is considering proposals as to how this might be done; however,
resolution will come too late for CDIF.

Tag Classification

In discussions with those responsible for data collection for the BNC,
it became apparent that it would not be possible to provide a uniform
level of mark-up across the whole corpus. For example, when text is
captured using optical character recognition, it is cheap and easy to
capture changes in type style, but manual intervention is required to
mark poetry, and to insert footnote text at its point of reference.
Where text is rekeyed, changes in type style may go unnoticed, but the
transcriber can handle poetry and notes accurately and with relative
ease.

Consequently, it was decided to divide CDIF text tags into three
categories:

Required tags, which must be used to mark particular types of feature
if those features appear in a text. (In some cases, as an alternative
to tagging, the content of the feature may be silently deleted from
the electronic transcription: footnotes are a case in point. The
editorial practices declaration in each text header describes the
treatment of such features.) Examples of required tags are <p>, to
mark paragraphs in written text; <u> to mark spoken utterances; and
<note> to mark foot-, end-, or side-notes, or editorial comments
inserted during BNC processing.

Recommended tags, which are not mandatory, but highly desirable.
Often these mark text features which could cause anomalous results in
corpus-based research if their presence were not noted. Examples are
lists (marked with <list>); poetry (<poem>); and material written to be
spoken (<sp>).

Optional tags, which may appear if sufficient information has been
captured from the original text, or if their use resolves some problem
identified during syntactic or semantic checking. Examples include
<hi> to describe text rendition (no attempt is made to interpret the
semantic reason for changes in rendition); <quote> to show quotation
of material written by some person other than the main author of a
text; and <cite> to enclose the citation within a text of another
work.

Burnard (1992b) summarizes the division of CDIF tags into these three
categories. The text header (Dunlop 1992b) lists the tags used in a
particular text.

1. A sample text
<div>s contain paragraphs; they may also contain
a) Lists
b) Poems --- such as
"It's only words,
and words are all I have..."
c) Lower-level <div>s
1.1 A sub-section
Contents of the sub-section*.

Figure 2. Written example as it might appear on the printed page

Figure 3. Written example with required CDIF mark-up

Figure 2 shows a sample of text as it might appear on the printed
page. In figure 3, the same text appears with only required CDIF
mark-up added. Note that information about rendition and structure
below the top level is not recorded, and the footnote has been
silently deleted.
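
A sketch of the result follows; the exact treatment of the headings,
shown here as plain paragraphs, is an assumption:

   <p>1. A sample text</p>
   <p>&lt;div>s contain paragraphs; they may also contain a) Lists
   b) Poems --- such as "It's only words, and words are all I
   have..." c) Lower-level &lt;div>s</p>
   <p>1.1 A sub-section</p>
   <p>Contents of the sub-section.</p>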

While the text of figure 3 would be acceptable for inclusion in the
corpus in this form, most corpus texts show more complete tagging, as
shown in figure 4. Here some recommended tags are added, fully
describing the text structure, and identifying the list and poem
fragment that it contains. The footnote also appears, tagged at its
point of reference.
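
In sketch form, such a version might read as follows; the <item>, <l>,
and <head> element names follow general TEI usage and are assumptions
here, as are the attribute values and the elided footnote text:

   <div type="section" n="1">
   <head>A sample text</head>
   <p>&lt;div>s contain paragraphs; they may also contain
   <list>
   <item>Lists</item>
   <item>Poems --- such as
   <poem><l>"It's only words,</l>
   <l>and words are all I have..."</l></poem></item>
   <item>Lower-level &lt;div>s</item>
   </list></p>
   <div type="subsection" n="1.1">
   <head>A sub-section</head>
   <p>Contents of the sub-section<note place="foot">...</note>.</p>
   </div>
   </div>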

The addition of optional mark-up, shown in figure 5, provides
information about text rendition, and indicates that the poem fragment
is a quotation.
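
Again in sketch form, only the affected fragments change (the rend
attribute shown is an assumption):

   <head><hi rend="bold">A sample text</hi></head>
   ...
   <quote><poem><l>"It's only words,</l>
   <l>and words are all I have..."</l></poem></quote>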

If the example were an actual corpus text, it would also be marked up
with part-of-speech and segmentation information, independent of the
level of other tagging applied. See Eyes (1992) for further details.

Figure 6 shows an example spoken text, set out as if in the printed
script of a play. The main feature of the example is the overlap
between the utterances of the two speakers. As figure 7 shows, this
is handled by marking the start and end of the period of overlap in
each utterance. An `alignment map' at the start of the text shows
the ordering in time of the starting and ending marks, and so
indicates which utterances overlap which others. The transcription
method used for spoken material (see Crowdy 1991) correctly captures
up to three simultaneous utterances. (In an actual corpus text, the
identifiers used for each alignment map location would be longer than
those in the example, as they must be unique across the whole corpus.)
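
In sketch form, an overlap of this kind might be transcribed as
follows; the element and attribute names here are illustrative only,
and Crowdy (1991) describes the scheme actually used:

   <align map="T1 T2">
   <u who="A">I saw her at the <ptr target="T1">market on
   Saturday<ptr target="T2"> morning.</u>
   <u who="B"><ptr target="T1">on Saturday?<ptr target="T2"> Right.</u>

The alignment map records that point T1 precedes point T2, so the
words between the two pointers in each utterance were spoken
simultaneously.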

As one of the first attempts to build a large, balanced corpus with
uniform, SGML-based mark-up, the BNC project was bound to encounter
unforeseen difficulties and tasks which took longer than had been
anticipated.

Involvement with the TEI has many benefits, but, in these relatively
early days, has often necessitated waiting for recommendations to
appear before CDIF mark-up specifications can be frozen. This is a
two-way process: experience with issues raised by the BNC has informed
several aspects of the TEI's work, and the close attention of several
of the technical experts who contribute to the TEI recommendations has
been of great assistance to the BNC project.

Future TEI-conformant corpus-building projects will not be burdened by
many of the issues which the BNC, as an early user, has encountered.
However, any builder of a large corpus faces the problem of converting
texts from a variety of source formats into a uniform electronic
format prior to accession to the corpus. Experience on the BNC
project indicates that the earlier in this process some
mechanically verifiable form of quality control can be introduced, the
better.

British National Corpus working papers are available on request in
printed or electronic form from the authors.

The sgmls program and published TEI papers are available by anonymous
FTP from archives at sgml1.exeter.ac.uk. The Icon language processors
are available by anonymous ftp from cs.arizona.edu. The perl language
is available by anonymous ftp from many sites, including ftp.uu.net
and doc.ic.ac.uk. In addition to UNIX, all of these language
processors run under MS-DOS, VMS, and a number of other operating
environments.