Corpus Queries

Development of an effective grammatical query methodology in the
context of a parsed corpus

funded by

Ref: R 000 22 2598
Institution: University College London
Department: Department of English (Survey of English Usage)
Investigator: Sean Wallis
Period: 1 March 1998 to 31 January 1999

Original aims and objectives

In recent years, corpus linguistics has developed dramatically,
due to increased computing power and improvements in annotation
software. This has precipitated a growth in the scale and complexity
of corpora, including the new grammatically annotated ICE-GB corpus.
Text corpora have been used both to improve software tools, such
as grammatical parsers, and to improve our understanding of language.

The aim of the research is to develop a linguistically plausible and
transparent method of forming queries for grammatical corpora.

The proposal is to use fragments of grammatical trees as the main
representation for queries. These "fuzzy tree fragments"
appeal because of the obvious parallel with familiar grammatical
structure. The difference is that a query must capture both what
is known and what is unknown: some components and relations may
be omitted or "fuzzy". Developing this notion of "fuzziness"
is a major part of the research.
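
The idea of a tree fragment with omitted components can be illustrated
with a minimal sketch. This is not the ICECUP implementation; the class
names, the `immediate` flag and the node labels below are all hypothetical
and chosen only for illustration. A query component left as `None` stands
for "unknown", and a relation marked as non-immediate may be satisfied by
any descendant rather than only by a direct child.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A node in a parsed tree (hypothetical structure, for illustration)."""
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Fragment:
    """A query node. None means that component is left unspecified."""
    function: Optional[str] = None
    category: Optional[str] = None
    children: List["Fragment"] = field(default_factory=list)
    # Assumption: True requires a direct child; False accepts any descendant.
    immediate: bool = True


def node_matches(frag: Fragment, node: Node) -> bool:
    """Compare only the components the query actually specifies."""
    return ((frag.function is None or frag.function == node.function)
            and (frag.category is None or frag.category == node.category))


def descendants(node: Node) -> Iterator[Node]:
    for child in node.children:
        yield child
        yield from descendants(child)


def matches_at(frag: Fragment, node: Node) -> bool:
    """True if the fragment can be laid over the tree with its root at node.

    Simplification: word order and distinctness of matched children
    are not enforced here.
    """
    if not node_matches(frag, node):
        return False
    for sub in frag.children:
        candidates = node.children if sub.immediate else descendants(node)
        if not any(matches_at(sub, c) for c in candidates):
            return False
    return True


def find_all(frag: Fragment, root: Node) -> List[Node]:
    """Collect every node in the tree at which the fragment matches."""
    hits = [root] if matches_at(frag, root) else []
    for child in root.children:
        hits.extend(find_all(frag, child))
    return hits


# Illustrative tree; the labels are placeholders, not corpus data.
tree = Node("PU", "CL", [
    Node("SU", "NP", [Node("NPHD", "N")]),
    Node("VB", "VP", [Node("MVB", "V")]),
    Node("OD", "NP", [Node("DT", "DTP"), Node("NPHD", "N")]),
])

# "Any NP, whatever its function, with a child functioning as NPHD."
np_with_head = Fragment(category="NP", children=[Fragment(function="NPHD")])

# "A clause containing an N somewhere below it" (non-immediate relation).
clause_with_noun = Fragment(
    category="CL", children=[Fragment(category="N", immediate=False)])
```

In this sketch, `find_all(np_with_head, tree)` returns both noun phrases
because the function of the NP is left unspecified, while tightening the
second query to an immediate relation would make it fail on this tree.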

Complex queries may then be constructed by combining these tree
fragments with sociolinguistic variables using a logical language.
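
One way to picture such a combination is to treat every query as a
predicate over a corpus unit and to join predicates with logical
operators. The sketch below assumes this design; it does not reproduce the
project's query language. The variable names ("gender", "mode") and the
`contains_category` test, which stands in for a full tree fragment match,
are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Unit:
    """A corpus unit: sociolinguistic variables plus a stand-in for its tree."""
    variables: Dict[str, str]
    categories: Tuple[str, ...]  # hypothetical: categories found in the tree


Query = Callable[[Unit], bool]


def variable_is(name: str, value: str) -> Query:
    """Query on a sociolinguistic variable of the unit."""
    return lambda unit: unit.variables.get(name) == value


def contains_category(category: str) -> Query:
    """Placeholder for a tree fragment match against the unit's tree."""
    return lambda unit: category in unit.categories


def and_(*queries: Query) -> Query:
    return lambda unit: all(q(unit) for q in queries)


def or_(*queries: Query) -> Query:
    return lambda unit: any(q(unit) for q in queries)


def not_(query: Query) -> Query:
    return lambda unit: not query(unit)


# Illustrative data only; not drawn from any corpus.
corpus = [
    Unit({"gender": "f", "mode": "spoken"}, ("CL", "NP", "VP")),
    Unit({"gender": "m", "mode": "spoken"}, ("CL", "NP")),
    Unit({"gender": "f", "mode": "written"}, ("CL", "VP", "PP")),
]

# Grammatical condition AND (written text OR speaker not male).
query = and_(
    contains_category("VP"),
    or_(variable_is("mode", "written"), not_(variable_is("gender", "m"))),
)

hits = [i for i, unit in enumerate(corpus) if query(unit)]
```

Because grammatical and sociolinguistic conditions share one predicate
type in this sketch, they can be nested to any depth without special cases.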

This project will run concurrently with the first release of the
ICE-GB corpus, and an early prototype of the system will be provided
with that release. Feedback from end users will be used to aid
further development.

Comment

Although this project was very modest in duration and scope, the
results proved to be extremely important and influential. The Corpus
Query project permitted the development of Fuzzy Tree Fragments
and ICECUP 3.0. The software was indeed published alongside ICE-GB
Release 1 in 1998, and has continued to improve ever since.