What is the BNC?

The British National Corpus (BNC) is a 100-million-word collection of samples of written
and spoken language from a wide range of sources, designed to represent a broad
cross-section of British English from the later part of the 20th century. The latest
edition is the BNC XML Edition, released in 2007.

The written part of the BNC (90%) includes, for example, extracts from regional
and national newspapers, specialist periodicals and journals for all ages and interests,
academic books and popular fiction, published and unpublished letters and memoranda,
school and university essays, among many other kinds of text. The spoken part
(10%) consists of orthographic transcriptions of unscripted informal conversations
(recorded by volunteers selected from different age groups, regions and social classes in a
demographically balanced way) and spoken language collected in different contexts, ranging
from formal business or government meetings to radio shows and phone-ins.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI)
to represent both the output from CLAWS (an automatic part-of-speech tagger) and a variety
of other structural properties of texts (e.g. headings, paragraphs and lists). Full
classification, contextual and bibliographic information is also
included with each text in the form of a TEI-conformant header.
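
As a rough illustration of what word-level part-of-speech markup of this kind looks like
to a program, the sketch below parses a small XML fragment with Python's standard library.
The fragment is invented for this example and is not taken from the corpus; the element
and attribute names (s, w, c, c5, hw) and the tag values are assumptions about the markup
and should be checked against the corpus documentation before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written for illustration only (not corpus data).
# Assumed markup: <s> = sentence, <w> = word, <c> = punctuation,
# c5 = part-of-speech tag, hw = headword.
FRAGMENT = """
<p>
  <s n="1"><w c5="AT0" hw="the">The </w><w c5="NN1" hw="corpus">corpus </w><w c5="VBZ" hw="be">is </w><w c5="AJ0" hw="large">large</w><c c5="PUN">.</c></s>
</p>
"""


def tagged_words(xml_text):
    """Return (word, tag) pairs for every <w> element in the fragment."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for w in root.iter("w"):
        # Word text carries trailing whitespace in this sketch, so strip it.
        pairs.append(((w.text or "").strip(), w.get("c5")))
    return pairs


if __name__ == "__main__":
    for word, tag in tagged_words(FRAGMENT):
        print(word, tag)
```

Punctuation is held in separate elements in this sketch, so iterating over w alone yields
only the words.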

Work on building the corpus began in 1991 and was completed in 1994. No new texts have
been added since the completion of the project, but the corpus was slightly revised prior
to the release of the second edition, BNC World (2001), and the third edition,
BNC XML Edition (2007). Since the completion of the project, two
sub-corpora with material from the BNC have been released separately: the BNC Sampler (a
general collection of one million written words and one million spoken words) and the BNC
Baby (four one-million-word samples from four different genres).

What sort of corpus is the BNC?

Monolingual: It deals with modern British English, not other languages used in
Britain. However, non-British English and foreign-language words do occur in the
corpus.

Synchronic: It covers British English of the late twentieth century, rather
than the historical development which produced it.

General: It includes many different styles and varieties, and is not limited to
any particular subject field, genre or register. In particular, it contains examples of
both spoken and written language.

Sample: For written sources, samples of 45,000 words are taken from various
parts of single-author texts. Shorter texts up to a maximum of 45,000 words, or
multi-author texts such as magazines and newspapers, are included in full. Sampling
allows for wider coverage of texts within the 100-million-word limit, and avoids
over-representing idiosyncratic texts.
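
The sampling rule above can be sketched in a few lines of Python. This is only a minimal
illustration under stated assumptions: the function name, the is_single_author flag and
the choice of a random starting point for the sample are hypothetical, since the text says
only that samples come from various parts of a text.

```python
import random

SAMPLE_LIMIT = 45_000  # maximum sample size in words, as stated above


def sample_text(words, is_single_author, rng=None):
    """Return the words that would be included for one source text.

    Multi-author texts and texts of up to SAMPLE_LIMIT words are kept whole.
    Longer single-author texts contribute one contiguous SAMPLE_LIMIT-word
    sample. The random start offset is an assumption made for this sketch.
    """
    rng = rng or random.Random()
    if not is_single_author or len(words) <= SAMPLE_LIMIT:
        return list(words)
    start = rng.randrange(len(words) - SAMPLE_LIMIT + 1)
    return words[start:start + SAMPLE_LIMIT]


if __name__ == "__main__":
    novel = ["word"] * 120_000   # stand-in for a long single-author text
    letter = ["word"] * 800      # stand-in for a short text
    print(len(sample_text(novel, is_single_author=True)))
    print(len(sample_text(letter, is_single_author=True)))
```

Capping long texts in this way is what lets more distinct sources fit into a fixed word
budget.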