Corpora

UCREL members have been involved in the compilation and annotation
of many electronic corpora, often in collaboration with other institutions.
Some corpora are held only as plain orthographic
text, whilst others are held with several kinds of
annotation.

Some of the corpora listed below are available via
ICAME in Bergen, Norway, and
information on how to obtain some of the others is available at the
same site.
A selection of the corpus manuals
is on-line too. Yet more corpora are made available via
ELDA or OTA.

Speech, Thought, and Writing Presentation

Two corpora - one of spoken data and one of written texts - tagged
using categories of speech, writing and thought presentation
outlined initially in Leech and Short (1981) and developed in the work
of Short, Semino and Wynne (see for example Short, Wynne and Semino
1998). See the
project homepage.

Corpora of South Asian languages

Generated by the EMILLE (Enabling Minority Language Engineering) project
at Lancaster University and Sheffield University.
EMILLE collected a 97-million-word electronic corpus of South Asian
languages, especially those spoken in the UK.
See http://www.emille.lancs.ac.uk/.

Lancaster Corpus of Mandarin Chinese

The Lancaster Corpus of Mandarin Chinese (LCMC)
was designed as a Chinese match for the Freiburg-LOB Corpus of
British English (FLOB), and, as such, provides a valuable resource
for contrastive studies between English and Chinese as well as a sound
basis for monolingual investigations of Chinese. The LCMC corpus is
distributed by the European Language Resources Association (Cat. No
ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).