Making electronic texts

Three ways of creating electronic versions of the texts were envisaged at the start of
the BNC project: scanning, keyboarding, and re-use of existing electronic texts.

Scanning

Optical character readers are becoming increasingly sophisticated, and many BNC
books were "scanned" in this manner. High-quality original texts were required to
ensure the error rate was low. Hand editing was still required, though, to correct
scanning errors and insert textual markup.

Keyboarding

For leaflets, hand-written items, and of course recorded speech, keyboarding --
manually typing in texts -- was the only viable option, as it was for surprisingly
many magazines and newspapers. Scanners were not reliable enough at recognizing small
typefaces, lower-quality typography, or handwriting. It would have taken longer to
correct scanned output in such cases than it did for a trained typist simply to
re-type the documents in full.

Existing electronic texts

The corpus designers believed that many texts would already exist in electronic
form -- publishers' and typesetters' versions of newspapers, magazines and some books
-- and that converting such texts to the standard format required for the corpus would
be reasonably straightforward. In the event, texts in electronic form which fitted the
corpus design were far fewer than had been supposed. Newspaper text frequently came in
machine-readable form, but often required programs to reformat it.
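
As an illustration of the kind of reformatting program mentioned above, the following is a minimal sketch only, not the software the BNC project actually used. The input codes "@H:" and "@P:", the output tags "head" and "p", and the function name reformat are all hypothetical; real typesetting feeds each had their own conventions.

```python
import html

# Hypothetical typesetter codes mapped to hypothetical output tags.
CODE_TO_TAG = {"@H:": "head", "@P:": "p"}


def reformat(lines):
    """Convert coded typesetter lines into simple tagged text.

    Returns a pair: the converted lines, and the lines whose code was
    not recognized (kept aside for hand checking, not silently dropped).
    """
    converted = []
    unrecognized = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        for code, tag in CODE_TO_TAG.items():
            if line.startswith(code):
                # Escape &, < and > so the text cannot be mistaken for markup.
                text = html.escape(line[len(code):].strip(), quote=False)
                converted.append(f"<{tag}>{text}</{tag}>")
                break
        else:
            unrecognized.append(line)
    return converted, unrecognized


if __name__ == "__main__":
    sample = ["@H: Storm hits coast", "", "@P: Rain & wind", "??junk"]
    tagged, leftover = reformat(sample)
    for item in tagged:
        print(item)
    print("needs hand checking:", leftover)
```

Setting unrecognized lines aside reflects the point made throughout this section: automatic conversion still left material that had to be checked by hand.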