LREC 2002 Workshop on
LINGUISTIC KNOWLEDGE ACQUISITION AND REPRESENTATION:
BOOTSTRAPPING ANNOTATED LANGUAGE DATA
Las Palmas, Canary Islands, Spain
2nd June 2002
_____________________________
MOTIVATION AND AIMS
Provision of large-scale labelled language resources, such as tagged
corpora or repositories of pre-classified text documents, is crucial
to steady progress in an extremely wide spectrum of research,
technological and business areas in the HLT sector. The continuously
changing demands for language-specific and application-dependent
annotated data (e.g. at the syntactic or at the semantic level),
indispensable for design validation and efficient software prototyping,
however, are daily confronted by the labelled-data bottleneck.
Hand-crafted resources are often too costly and time-consuming to be
produced at a sustainable pace, and, in some cases, they even exceed the
limits of human conscious awareness and descriptive capability.
Possible ways to circumvent, or at least minimise, this problem come
from the literature on automatic knowledge acquisition and, more
generally, from the machine-learning community. Annotated data are
bootstrapped by training a machine-learning classifier with a small
sample of pre-annotated data and by using the induced classifier to
annotate more data. Co-learning provides an alternative methodology,
which essentially consists in iterative cooperation of two or more
independent learning systems. Another promising route consists in
automatically tracking down recurrent knowledge patterns in unstructured
or implicit information sources (such as free texts or machine-readable
dictionaries) for this information to be moulded into explicit
representation structures (e.g. subcategorisation frames,
syntactic-semantic templates, ontology hierarchies etc.).
We believe that all these attempts at bootstrapping labelled data are
not only of practical interest (for continuous updating, management and
validation of dynamic resources), but also point to a number of related
theoretical issues. In particular, the workshop intends to focus on the
issue of interaction between techniques for inducing structured
knowledge from raw data and formal methods of linguistic knowledge
representation. Gaining insights into this issue is an essential
requirement for explaining the effective use of linguistic knowledge by
cognitive agents. Although the cognitive and engineering views of the
form and acquisition of linguistic knowledge need not be related, data
from neuroscience and psychology are indeed relevant when evaluating
different ways of representing information in artificial systems, and
different models for linguistic knowledge acquisition.
We encourage in-depth analysis of underlying assumptions of the proposed
bootstrapping methods and discussion of possible relevant connections
with existing annotation and representation schemes. This investigation
is likely to have significant repercussions on the way linguistic
resources will be designed, developed and used for applications in the
years to come. As the two aspects of knowledge representation and
acquisition are profoundly interrelated, progress on both fronts can
only be achieved, in our view, through a full appreciation of
this deep interdependency.
TOPICS OF INTEREST
Possible themes for contributions are:
* development of 'data-driven' annotation/representation schemes
* dynamic update, customisation and tuning of labelled resources through
acquired data
* 'hybrid models' of linguistic knowledge extraction, whereby machine
learning methods are integrated with formal structures of knowledge
representation
* incremental linguistic knowledge-bases
* formal representation and structuring of information flow
automatically acquired from texts
* knowledge acquisition and the lifecycle of linguistic resources
* linguistic knowledge acquisition and representation in cognitive tasks
IMPORTANT DATES
Deadline for workshop abstract submission:
15th of February 2002
Notification of acceptance:
15th of March 2002
Final version of paper for workshop proceedings:
15th of April 2002
Workshop:
2nd June 2002 (afternoon session)
SUBMISSIONS
The organisers welcome contributions describing existing research
related to the topics of the workshop. Each presentation will be 25
minutes long (20 minutes for presentation and 5 minutes for questions
and discussion). Submissions should include: title; author(s);
affiliation(s); and contact author's e-mail address, postal address,
telephone and fax numbers. Abstracts (maximum 500 words, plain-text
format) must be sent to: simo@ilc.pi.cnr.it
The final version of the accepted papers should not be longer than 4,000
words or 10 A4 pages. Instructions for formatting and presentation of
the final version will be sent to authors upon notification of
acceptance.
ORGANISING COMMITTEE
Alessandro Lenci (Università di Pisa, Italy)
Simonetta Montemagni (Istituto di Linguistica Computazionale - CNR,
Italy)
Vito Pirrelli (Istituto di Linguistica Computazionale - CNR, Italy)
PROGRAM COMMITTEE
Harald Baayen (Max Planck Institute for Psycholinguistics - Nijmegen,
The Netherlands)
Rens Bod (University of Amsterdam, The Netherlands)
Michael R. Brent (Washington University, USA)
Nicoletta Calzolari (Istituto di Linguistica Computazionale - CNR,
Italy)
Jean-Pierre Chanod (Xerox Research Centre Europe, Grenoble, France)
Walter Daelemans (University of Antwerp, Belgium)
Dekang Lin (University of Alberta, Edmonton, Canada)
Horacio Rodriguez (Universidad Politecnica de Catalunya)
Fabrizio Sebastiani (Istituto per l'Elaborazione dell'Informazione -
CNR, Italy)
Lucy Vanderwende (Microsoft Research, Redmond, USA)
François Yvon (Ecole Nationale Superieure des Telecommunications, Paris,
France)
Menno van Zaanen (University of Amsterdam, The Netherlands)
CONTACT PERSON
Simonetta Montemagni
Istituto di Linguistica Computazionale (ILC) - CNR
Area della Ricerca di Pisa
Via Moruzzi 1, 56124 Pisa, ITALY
e-mail: simo@ilc.pi.cnr.it