Word sense selection in texts: an integrated model

This technical report is based on a dissertation submitted in May 2000
by the author for the degree of Doctor of Philosophy to the University
of Cambridge, Downing College.

Abstract

Early systems for word sense disambiguation (WSD) often depended on
individual tailor-made lexical resources, hand-coded with as much
lexical information as needed, but of severely limited vocabulary size.
Recent studies tend to extract lexical information from a variety of
existing resources (e.g. machine-readable dictionaries, corpora) for
broad coverage. However, this raises the issue of how to combine the
information from different resources.

Although different types of resource could make different contributions
to WSD, studies to date have not shown what contribution each makes, how
they should be combined, and whether they are equally relevant to all
words to be disambiguated. This thesis proposes an Integrated Model as a
framework to study the inter-relatedness of three major parameters in
WSD: Lexical Resource, Contextual Information, and Nature of Target
Words. We argue that it is their interaction which shapes the
effectiveness of any WSD system.

A generalised, structurally-based sense-mapping algorithm was designed
to combine various types of lexical resource. This enables information
from these resources to be used simultaneously and compatibly, while
respecting their distinctive structures. In studying the effect of
context on WSD, different semantic relations available from the combined
resources were used, and a recursive filtering algorithm was designed to
overcome combinatorial explosion. We then investigated, from two
directions, how the target words themselves could affect the usefulness
of different types of knowledge. In particular, we modelled WSD with the
cloze test format, i.e. as texts with blanks and all senses for one
specific word as alternative choices for filling the blank.
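
As an illustration of the cloze formulation only, the following minimal
Python sketch turns one target word in a tokenised text into a blank and
offers all of its senses as the alternative choices. The names ClozeItem,
make_cloze_item and TOY_SENSES, and the sense labels themselves, are
hypothetical and are not taken from the thesis or from any lexical
resource; the sketch shows the data format, not a disambiguation method.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BLANK = "_____"

# Hypothetical toy sense inventory; the labels are invented for illustration.
TOY_SENSES: Dict[str, Tuple[str, ...]] = {
    "bank": ("bank/financial-institution", "bank/sloping-land"),
}


@dataclass(frozen=True)
class ClozeItem:
    text: str                 # context with the target word replaced by a blank
    choices: Tuple[str, ...]  # all candidate senses of the target word


def make_cloze_item(tokens: List[str], target_index: int,
                    inventory: Dict[str, Tuple[str, ...]]) -> ClozeItem:
    """Blank out tokens[target_index] and list its senses as choices."""
    target = tokens[target_index]
    # A word missing from the inventory simply yields no choices.
    choices = tuple(inventory.get(target, ()))
    blanked = tokens[:target_index] + [BLANK] + tokens[target_index + 1:]
    return ClozeItem(text=" ".join(blanked), choices=choices)


item = make_cloze_item("she sat on the bank of the river".split(), 4, TOY_SENSES)
print(item.text)
for choice in item.choices:
    print("  choice:", choice)
```

Under this framing, disambiguating a word amounts to choosing which of
the listed alternatives best fills the blank given the rest of the text.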

A full-scale combination of WordNet and Roget’s Thesaurus was carried
out, linking more than 30,000 senses. With these two resources in
combination, a range of disambiguation tests was run on more than
60,000 noun instances from corpus texts of different types, and on 60
blanks from real cloze texts. Results show that combining resources is
useful for enriching lexical information, and hence makes WSD more
effective, though not completely so. Also, different target words make
different demands on contextual information, and this interaction is
closely related to text type. Future work is suggested on extending the
analysis of the nature of target words and on making the combination of
disambiguation evidence sensitive to the requirements of the word being
disambiguated.