Encoding of Texts

To be usable by computer, an electronic text must include some kind of mark-up. The
mark-up introduced into the BNC texts indicates explicitly a wide range of important
information, including:

the boundary and part of speech of each word

the sentence structure identified by CLAWS (the Constituent Likelihood Automatic Word-tagging System)

paragraphs, sections, headings and similar features in written texts

speech turns, pauses, and paralinguistic features such as laughter in spoken
texts

meta-textual information about the source or encoding of individual texts

These textual features, and others, are all encoded in a standardized way, to help
ensure that the corpus will be usable no matter what the local computational set-up may
be.
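
As a rough illustration of why explicit mark-up makes a text usable by computer, the sketch below reads word and sentence boundaries, with part-of-speech labels, from a small tagged fragment. The element names (text, s, w), the pos attribute, and the label values are assumptions made for this example only; they are not the actual CDIF tag set, which is defined in the BNC Users Reference Guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, attribute names, and part-of-speech
# labels are illustrative only and are NOT the actual CDIF tag set.
SAMPLE = """
<text>
  <s n="1"><w pos="DET">The</w> <w pos="NOUN">corpus</w> <w pos="VERB">grows</w></s>
  <s n="2"><w pos="PRON">It</w> <w pos="VERB">helps</w></s>
</text>
"""


def tagged_sentences(markup):
    """Return one list of (word, part_of_speech) pairs per sentence."""
    root = ET.fromstring(markup)
    return [
        [(w.text, w.get("pos")) for w in s.iter("w")]
        for s in root.iter("s")
    ]


sentences = tagged_sentences(SAMPLE)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

Because every boundary and label is stated explicitly in the mark-up, the same few lines of code recover the words and sentences on any system with a standard parser, without guessing from punctuation or spacing.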

The format used by the BNC is called the Corpus Document Interchange Format
(CDIF for short) and is fully documented in the BNC Users Reference Guide. An
article by Gavin Burnage and Dominic Dunlop titled "Encoding the British
National Corpus", written while the BNC was being developed, describes the
scheme and its use within the project in some detail.