Current research

I'm currently working on the Teaching
English Grammar in Schools research project, developing its
software platform. I believe it is important that our research is put to
effective use to benefit society as much as possible. We have recently
released a mobile app for iPhone, iPad and Android devices
called the interactive Grammar of English.

I was the joint PI and lead researcher on the Next
Generation Tools research project. This project aimed to develop
a research platform for corpus linguistics that supports the whole
experimental research cycle. You can download the ICECUP
IV beta from here.

ICECUP IV is built on top of the ICECUP III architecture. It shares
components but extends them in new ways to support experimental
research. The idea is that exploratory steps can lead to insights
that help a linguist formulate hypotheses. They can then investigate
these thoroughly by formalising a research question and carrying
out experiments on the same platform. Conversely, if experiments
produce unexpected results, these can be thoroughly explored.

DCPSE

We completed the Diachronic Corpus of Present-Day Spoken English
(DCPSE) in 2006, a
parsed corpus of spoken English. The main task was to parse some
400,000 words taken from the spoken part of the London-Lund
Corpus (LLC), collected in the 1960s and 1970s, consistently
with the British Component of the International Corpus of English
(ICE-GB).

ICE-GB was collected in the 1990s and was partially parsed automatically,
and then manually corrected by a team of linguists. The complexity
of the parsing task is high, especially when you consider the level
of detail of the analysis scheme and the fact that all the material
is spoken. A description of the project is here.
My role was primarily to manage the project and to support it
through software.

ICECUP

ICECUP 3.1 can be downloaded from here
with sample corpora from ICE-GB R2 (click here
for the DCPSE version). The Survey webpages now include an extended
description of ICECUP 3.1. The latest
version in beta, ported to Visual C++, is compatible with 64-bit
platforms.

ICECUP, or to give
it its full (and slightly misleading) title, the International
Corpus of English Corpus Utility Program Version 3, is a corpus
exploration platform. It is a tool designed to make it easy
for researchers to explore a parsed corpus, and was initially distributed
with the parsed ICE-GB corpus in 1998 as version 3.0.

Research perspectives

My background is in artificial intelligence (AI). I was
a researcher in the (now sadly defunct) AI Group at the University
of Nottingham for six years before joining the Survey
of English Usage in 1995.

Broadly, my perspective in AI is that ‘artificial intelligence’
is not a substitute for human reasoning, knowledge and culture,
but a potential adjunct of it: ‘intelligent assistance’
if you will. AI includes a range of powerful techniques which can
be used to support human endeavour, particularly in the “understanding
of complex data”, which is my pat answer to the question:
what do I carry out research into?

My central research area is methodological:
that is, I am concerned with the way we think, learn about, process,
evaluate and communicate our research. This work requires me to
develop software tools and platforms to help researchers carry out
their research, but these tools are a means to an end, not an end
in themselves. To my mind, knowledge does not reside in the algorithm,
but in the sense we make of it.

Much of my current research work borders on mathematics and
statistics, at least insofar as these allow us to maximise the value
we can get out of corpus data! I'm increasingly of the view that
corpus linguists, ourselves included, are really only just beginning
to glimpse the value of the data that we have collected.

From Artificial Intelligence to Corpus Linguistics

Since joining the Survey, I have applied AI algorithms and approaches
to corpus linguistics in a number of ways. The following is not
an exhaustive list.

Knowledge acquisition principles were employed in the
design, development and evaluation of tree editors (Wallis and
Nelson 1997) and other browsers in ICECUP.

Deductive reasoning was used for matching Fuzzy Tree
Fragments to corpus trees (Wallis and Nelson 2000) and the axiomatic
reduction of logical propositions (Nelson, Wallis and Aarts 2002),
both used in ICECUP.
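The fragment-matching idea can be sketched as a recursive deduction over labelled trees. This is a loose illustration only: the real FTF model distinguishes several edge and link types (as well as focus nodes and word-level links), so the node labels and the simple order-preserving rule below are assumptions made for the sketch.

```python
def matches(fragment, tree):
    """True if `fragment` matches rooted at `tree`.
    Both are (label, children) tuples."""
    flabel, fchildren = fragment
    tlabel, tchildren = tree
    return flabel == tlabel and covers(fchildren, tchildren)

def covers(fchildren, tchildren):
    """Each fragment child must match some tree child, preserving
    left-to-right order; unmatched tree children are skipped."""
    if not fchildren:
        return True
    head, *rest = fchildren
    for i, t in enumerate(tchildren):
        if matches(head, t) and covers(rest, tchildren[i + 1:]):
            return True
    return False

def find_all(fragment, tree):
    """Collect every subtree of `tree` at which `fragment` matches."""
    hits = [tree] if matches(fragment, tree) else []
    for child in tree[1]:
        hits.extend(find_all(fragment, child))
    return hits
```

A query along the lines of 'an NP containing an N' then becomes `find_all(("NP", [("N", [])]), tree)`.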

Machine learning techniques were employed in developing
and refining the phrasal parser used to parse DCPSE, in developing
a POS-tagger, and in knowledge discovery.

I have written on knowledge representation issues, including
recently in relation to corpus annotation (Wallis 2007) and corpus
query (Wallis 2008).

It should be noted that many of the algorithms described are not
simply 'proof-of-concept' systems. ICECUP has been used for corpus
linguistics research for almost a decade, and its tree editor has
been applied to several hundred thousand trees.

There are two critical requirements for AI technologies embedded
in end user tools (Wallis, Cottam and Shadbolt 1994): they must
operate reliably (robustness, scalability), and both
the specification and results of processing must be meaningful (transparency).
Perhaps most of all, embedded AI should be seamless: the
'artificial intelligence' aspects are justified insofar as they
solve real problems in the application area. Consequently, on occasion
the AI implications of this research may be downplayed.

For example, our use of simulated annealing for aligning ICE-GB
tree and text files (Wallis and Nelson 1997) solved a significant
problem and permitted the corpus to be reconstructed from over 100,000
separate files.
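The annealing idea itself is simple enough to sketch. The loop below is a generic minimiser, assuming a caller-supplied cost function and neighbour move; the actual alignment costs and moves used in the 1997 work are not reproduced here.

```python
import math
import random

def anneal(state, cost, neighbour, t0=1.0, cooling=0.995, steps=5000, seed=0):
    """Generic simulated annealing: minimise cost(state) by repeatedly
    proposing neighbouring states while the temperature cools."""
    rng = random.Random(seed)
    best = current = state
    best_cost = current_cost = cost(state)
    t = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        c = cost(candidate)
        # Always accept improvements; accept a worse move with
        # probability exp(-delta/t), so early (hot) iterations can
        # escape local minima that would trap a greedy search.
        if c < current_cost or rng.random() < math.exp((current_cost - c) / t):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate, c
        t *= cooling
    return best, best_cost
```

For file alignment, `state` would be a candidate pairing of tree and text files and `cost` a mismatch score over that pairing; those details are left abstract here.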

The critique of longitudinal corpus correction in the same paper
paved the way for a more efficient cross-sectional correction approach
which was based around the corpus exploration platform, ICECUP,
itself embodying a number of AI algorithms (see above).

I refer to this perspective as the '3A
perspective' - annotation, abstraction and analysis. Without
regular annotation, reliable abstraction of concepts is impossible.
Hence most of the work in corpus linguistics to date has been in
ensuring reliable corpus annotation, with rather less research
effort in corpus query methods, and even less in applying analysis
methods.

Similarly, the reliable abstraction of concepts is a necessary
precondition for analysis. The ability to capture a concept (e.g.
noun phrase post-modification) in its many forms is a precondition
for the definition of an experiment to investigate the factors that
lead a speaker to prefer one form over another.
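Such an experiment typically ends in a contingency test. The sketch below is a plain Pearson chi-square for a 2×2 table; the counts are invented for illustration, and it stands in for whichever test a particular design actually calls for.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]],
    e.g. rows = two contexts, columns = two competing forms."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: form 1 chosen 10/30 times in context A, 20/30 in B.
score = chi_square_2x2(10, 20, 20, 10)
# With 1 degree of freedom, a score above 3.841 is significant at p < 0.05.
```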

Finally, this applied AI research has brought me back full-circle
to questions of experimental design and statistics. In part
this is because we are now discovering exciting and novel research
possibilities using our parsed corpora. We just have to ask the
right questions! In part this is also because corpus linguists,
like most of us, are not trained in experimental statistics.

So in recent years I have been attempting to humanise and make
relevant some key tools found in statistics articles and textbooks.
This started with conference papers and pages on our website discussing
methodology in broad terms, but a more rigorous and in-depth reading
and evaluation can be found on my corp.ling.stats
blog.
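One example of the kind of tool discussed there is the Wilson score interval: a confidence interval for an observed proportion that remains sensible at small sample sizes and skewed proportions, where the textbook 'Wald' interval misbehaves. The figures below are purely illustrative.

```python
import math

def wilson(p, n, z=1.959964):
    """Wilson score interval for a proportion p observed in n cases;
    the default z gives a 95% interval."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    halfwidth = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - halfwidth, centre + halfwidth

# e.g. a construction observed in 5 of 10 cases:
lo, hi = wilson(0.5, 10)  # roughly (0.237, 0.763)
```

Unlike the Wald interval, the bounds never stray outside [0, 1], even at p = 0 or p = 1.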

Selected publications

(with G. Nelson and B. Aarts) Exploring Natural Language:
Working with the British Component of the International Corpus
of English. G29, Varieties of English Around the World series.
Amsterdam: John Benjamins. More...

(with J. Close and B. Aarts) Recent changes in the use of
the progressive construction in English. In Cappelle, B. and
Wada, N. (eds.) Distinctions in English Grammar. Tokyo:
Kaitakusha. 148-168. »pre-published.