About the Corpus project

The aim of the project is to create a large corpus of learner (and examiner) speech which will be used in a wide range of research contexts, including Second Language Acquisition, language testing, and L2 pedagogy and materials development.

The corpus currently stands at 3.5 million words. It has been created from recordings of Trinity’s Graded Exams in Spoken English (GESE) across a range of grades from B1 to C2 on the CEFR scale. It represents language used in a variety of speaking tasks which reflect speech events in the world outside the test, and it covers many different language backgrounds.

What is a language corpus?

A language corpus is a collection of texts, either written or spoken, which is compiled digitally for the purpose of language analysis. Advances in computer technology mean that it is now possible to create very large corpora (millions of words), store them in digital form, and analyse them automatically or semi-automatically.

The recorded speech is entered and coded with a variety of tags so that users can examine all the texts in the corpus, or a sample of them, in order to determine how language is used in particular contexts (eg in formal or informal situations), by specific groups of people (eg different ages, different mother tongues), for specific purposes (eg for academic purposes, for social purposes), etc. The findings of such analyses can be used for many real-world purposes such as devising teaching materials, constructing tests and other assessment procedures, compiling accurate dictionaries or improving communication amongst different social or cultural groups.
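As a rough illustration of how tagged transcripts can be sampled and analysed, the sketch below filters a tiny made-up set of transcripts by their tags and counts word frequencies in the selected sample. The tag names (l1, cefr, task), the example texts and the helper functions are all hypothetical; they do not reflect the actual annotation scheme or tools of the Trinity Lancaster Corpus.

```python
from collections import Counter

# Hypothetical mini-corpus: each transcript carries metadata tags.
# Field names and values are illustrative assumptions only.
TRANSCRIPTS = [
    {"l1": "Spanish", "cefr": "B1", "task": "conversation",
     "text": "i think that the city is very nice"},
    {"l1": "Italian", "cefr": "C1", "task": "presentation",
     "text": "i would argue that the city has changed"},
    {"l1": "Spanish", "cefr": "C1", "task": "conversation",
     "text": "well i think it depends on the situation"},
]


def select(transcripts, **criteria):
    """Return the transcripts whose tags match every given criterion."""
    return [
        t for t in transcripts
        if all(t.get(tag) == value for tag, value in criteria.items())
    ]


def word_frequencies(transcripts):
    """Count whitespace-separated word forms across the given transcripts."""
    counts = Counter()
    for t in transcripts:
        counts.update(t["text"].lower().split())
    return counts


# Example: how often do the Spanish-L1 speakers in the sample say "think"?
spanish = select(TRANSCRIPTS, l1="Spanish")
freq = word_frequencies(spanish)
print(len(spanish), freq["think"])
```

The same selection step could be repeated with other tags (for example proficiency level or task type) to compare how different groups of speakers use a given word or phrase.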

The nature of the GESE test, which focuses on communicative skills and allows test takers choice in their contributions, means that the Trinity Lancaster Corpus can offer unique insights into how learners choose to manage interaction and build meaning based on their own identity rather than being overly constrained by the test task.

How can we use the corpus?

As a unique research resource, the Trinity Lancaster Corpus allows researchers to investigate learner speech at different proficiency levels (advanced, intermediate and lower intermediate/threshold) and to analyse spoken learner production across different tasks (both monologic and interactive). The corpus samples the language of learners with a variety of L1 backgrounds, representing English speakers from Italy, Spain, Mexico, Argentina, Brazil, China, India, Sri Lanka and Russia, which will allow us to report back to those learners on their specific proficiencies and needs for development. It will also allow the development of locally focused teaching materials and test support activities. See our current range of Corpus-informed teaching resources.

Corpus analysis is likely to become more sophisticated in the future, especially with multiple layers of corpus annotation that allow searching according to different linguistic and background criteria. The Trinity Lancaster Corpus aspires to become a leading research tool in this respect.

Further information

Download a factsheet about the Trinity Lancaster Corpus, which summarises its features and shows examples of research findings.