The BNC in numbers

The XML Edition of the BNC contains 4049 texts and occupies
(including all markup) 5,228,040 Kb, or about 5.2 Gb. In
total, it comprises just under 100 million orthographic words
(specifically, 96,986,707), but the number of w-units
(POS-tagged items) is slightly higher at 98,363,783. The
tagging distinguishes a further 13,614,425 punctuation
strings, giving a total content count of 110,691,482
tokens. The total number of s-units tagged is over 6 million
(6,026,284). Counts for these and all the other XML elements tagged in the corpus are provided in the corpus header.

To put these numbers into perspective, the average paperback book has about
250 pages per centimetre of thickness; assuming 400 words a page and a round 100 million
words, the whole corpus printed in small type on thin paper would take up about ten metres
of shelf space. Reading the whole corpus aloud at a fairly rapid 150 words a minute, eight
hours a day, 365 days a year, would take just under four years.
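
The arithmetic behind these two estimates can be checked with a short sketch. It assumes a round figure of 100 million words rather than the exact count; the constant and function names are illustrative only and do not come from the corpus documentation.

```python
# Rough check of the shelf-space and reading-time estimates.
# Assumption: a round 100 million words, not the exact corpus count.
WORDS = 100_000_000
WORDS_PER_PAGE = 400
PAGES_PER_CM = 250
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365


def shelf_metres(words):
    """Shelf space in metres for a given number of printed words."""
    pages = words / WORDS_PER_PAGE
    centimetres = pages / PAGES_PER_CM
    return centimetres / 100


def reading_years(words):
    """Years needed to read the words aloud at the stated daily pace."""
    minutes = words / WORDS_PER_MINUTE
    days = minutes / 60 / HOURS_PER_DAY
    return days / DAYS_PER_YEAR


print(f"{shelf_metres(WORDS):.1f} m")        # 10.0 m
print(f"{reading_years(WORDS):.2f} years")   # 3.81 years
```

With these inputs the shelf estimate comes out at ten metres and the reading time at a little under four years.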

As the following summary table shows, most (about 90 percent) of the words making up the
corpus are taken from written texts of many different kinds, but about 10
percent (some 10 million words in total) are taken from transcribed
speech, recorded in both formal and informal contexts.

Table 1. Composition of the BNC World Edition

Text type                       Texts   W-units      %  S-units      %
Spoken demographic                153   4206058   4.30   610563  10.08
Spoken context-governed           757   6135671   6.28   428558   7.07
All Spoken                        910  10341729  10.58  1039121  17.15
Written books and periodicals    2688  78580018  80.49  4403803  72.75
Written-to-be-spoken               35   1324480   1.35   120153   1.98
Written miscellaneous             421   7373707   7.55   490016   8.09
All Written                      3144  87278205  89.39  5013972  82.82
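
As a rough cross-check, the W-unit percentages in Table 1 can be recomputed from the W-unit counts. The sketch below is illustrative only: the variable names are hypothetical, and the shares are taken relative to the sum of the five component rows, which is an assumption about how the table was derived.

```python
# W-unit counts copied from the component rows of Table 1.
w_units = {
    "Spoken demographic": 4_206_058,
    "Spoken context-governed": 6_135_671,
    "Written books and periodicals": 78_580_018,
    "Written-to-be-spoken": 1_324_480,
    "Written miscellaneous": 7_373_707,
}

# Assumption: percentages are shares of the sum of these five rows.
total = sum(w_units.values())
shares = {name: 100 * count / total for name, count in w_units.items()}

for name, share in shares.items():
    print(f"{name:<30} {share:6.2f}")
```

The recomputed shares agree with the tabulated ones to within 0.01. The small differences suggest that the table truncates rather than rounds, though that is an inference from the figures.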

More detailed frequency information for the various kinds
of text included in the corpus is available in the BNC User
Reference Guide.