LINGUIST List 9.705

Tue May 12 1998

Confs: Distributing and Accessing Linguistic Resources

Editor for this issue: Martin Jacobsen <martylinguistlist.org>

Please keep your conference announcement as short as you can; LINGUIST
will not post conference announcements which in our opinion are
excessively long. Also, please remember that, once posted, your
announcement will be permanently available at our website:
http://www.linguistlist.org/issues/indices/Confs1997r.html
For this reason, we discourage multiple submissions of the same
conference announcement. Thank you for your cooperation.

***********************************************
Distributing and Accessing Linguistic Resources
***********************************************
May 27th, 1998
This workshop is part of the First International Conference on Language
Resources and Evaluation at the University of Granada, May 26th to
30th 1998 (see
http://ceres.ugr.es/~rubio/elra.html
for details and how to register).
The workshop will discuss ways to increase the efficacy of linguistic
resource distribution and programmatic access, and work towards the
definition of a new method for these tasks based on distributed
processing and object-oriented modelling with deployment on the WWW.
Organizers: Yorick Wilks, Wim Peters, Hamish Cunningham, Remi Zajac

PAPERS

The following papers will be presented in order of enumeration. After
each 15 minute presentation there will be 5 minutes for discussion.

Distributed Thesaurus Storage and Access in a Cultural Domain
Application
S. Boutsis, B. Georgantopoulos, S. Piperidis
Institute for Language and Speech Processing, Athens

A New Model for Language Resource Access and Distribution
W. Peters, H. Cunningham, Y. Wilks, C. McCauley
University of Sheffield

Reuse and Integration of NLP Components in the Calypso Architecture
R. Zajac
New Mexico State University

Corpus-based Research using the Internet
H. Brugman, A. Russel, P. Wittenburg
Max Planck Institute for Psycholinguistics, Nijmegen

The CUE Corpus Access Tool
O. Mason
University of Birmingham

Linguistic Research Utilizing the EDR Electronic Dictionary as a
Linguistic Resource
T. Ogino
EDR, Japan

POSTERS

The following posters will be on display during the workshop, and
presentations are planned during the breaks:

TRACTOR: TELRI Research Archive of Computational Tools and Resources
R. Krishnamurthy
University of Birmingham

Web-Surfing the Lexicon
D. Cabrero, M. Vilares, L. Docampo, S. Sotelo
Ramon Pineiro Research Centre/Universities of Coruna and Santiago

Exploring Distributed MT
O. Streiter, A. Schmidt-Wigger, U. Reuther, C. Pease
IAI Saarbruecken

A Proposal for an On-line Lexical Database
P. Cassidy
Micra, Inc.

PANEL DISCUSSION:
The final part of the workshop will consist of a panel discussion on:
Distributing and Accessing Linguistic Resources
The panel participants are:
Khalid Choukri, Eduard Hovy, Judith Klavans, Yorick Wilks, and
Antonio Zampolli.
Workshop scope and aims
-----------------------
In general the reuse of NLP data resources (such as lexicons or
corpora) has exceeded that of algorithmic resources (such as
lemmatisers or parsers). However, there are still two barriers to
data resource reuse:
1) each resource has its own representation syntax and corresponding
programmatic access mode (e.g. SQL for CELEX, C or Prolog for WordNet,
SGML for the BNC);
2) resources must generally be installed locally to be usable (and of
course precisely how this happens, what operating systems are
supported etc. varies from case to case).
The consequences of 1) are that although resources share some
structure in common (lexicons are organised around words, for example)
this commonality is wasted when it comes to using a new resource (the
developer has to learn everything afresh each time) and that work
which seeks to investigate or exploit commonalities between resources
(e.g. to link several lexicons to an ontology) has to first build a
layer of access routines on top of each resource. So, for example, if
we wish to do task-based evaluation of lexicons by measuring the
relative performance of an information extraction system with
different instantiations of lexical resource, we might end up writing
code to translate several different resources into SQL or SGML.
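As a rough illustration of the access layer described above, the
following Java sketch wraps two differently represented lexicons behind
one interface, so that a single evaluation routine runs unchanged over
either. All names here (SimpleLexicon, TabSeparatedLexicon, MapLexicon,
coverage) are hypothetical and the toy data are invented for the
example; nothing below reflects the actual formats or APIs of CELEX,
WordNet, the BNC, or any other resource.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical common interface: the only contract client code relies on.
interface SimpleLexicon {
    boolean contains(String word);
}

// Adapter for a resource whose native form is "word<TAB>part-of-speech" lines.
final class TabSeparatedLexicon implements SimpleLexicon {
    private final Map<String, String> entries = new HashMap<>();

    TabSeparatedLexicon(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length == 2) { // skip malformed lines
                entries.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    public boolean contains(String word) {
        return entries.containsKey(word);
    }
}

// Adapter for a resource that is already an in-memory map of word to gloss.
final class MapLexicon implements SimpleLexicon {
    private final Map<String, String> glosses;

    MapLexicon(Map<String, String> glosses) {
        this.glosses = glosses;
    }

    @Override
    public boolean contains(String word) {
        return glosses.containsKey(word);
    }
}

final class LexiconEvaluation {
    // Fraction of tokens found in the lexicon; a crude stand-in for a
    // task-based score such as information extraction performance.
    static double coverage(SimpleLexicon lexicon, List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int found = 0;
        for (String token : tokens) {
            if (lexicon.contains(token)) {
                found++;
            }
        }
        return (double) found / tokens.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("dog", "runs", "fast", "today");

        SimpleLexicon first =
            new TabSeparatedLexicon(Arrays.asList("dog\tN", "fast\tADJ"));

        Map<String, String> glosses = new HashMap<>();
        glosses.put("dog", "a domesticated canine");
        SimpleLexicon second = new MapLexicon(glosses);

        // The same evaluation code runs over both resources.
        System.out.println(coverage(first, tokens));
        System.out.println(coverage(second, tokens));
    }
}
```

The cost noted in the text is visible here: each new resource needs its
own adapter class before any comparison can begin.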
The consequence of 2) is that there is no way to "try before you buy":
no way to examine a data resource for its suitability for your needs
before licensing it. Correspondingly there is no way for a resource
provider to expose limited access to their products for advertising
purposes, or gain revenue through piecemeal supply of sections of a
resource.
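One way a provider might expose limited access can be sketched as a
wrapper that forwards lookups to the full resource until a fixed quota
is used up. This is only an illustration under stated assumptions: the
names EntryLookup, FullResource and PreviewAccess are hypothetical, the
entries are toy data, and a real deployment would presumably enforce
the limit on the server rather than in client-side code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup contract shared by the full resource and its preview.
interface EntryLookup {
    Optional<String> lookup(String word);
}

// Stand-in for a provider's complete resource (toy data only).
final class FullResource implements EntryLookup {
    private final Map<String, String> entries = new HashMap<>();

    FullResource() {
        entries.put("corpus", "a collection of texts");
        entries.put("lexicon", "a list of words with associated information");
    }

    @Override
    public Optional<String> lookup(String word) {
        return Optional.ofNullable(entries.get(word));
    }
}

// Preview wrapper: forwards lookups until a fixed quota is exhausted.
final class PreviewAccess implements EntryLookup {
    private final EntryLookup delegate;
    private int remaining;

    PreviewAccess(EntryLookup delegate, int maxLookups) {
        this.delegate = delegate;
        this.remaining = maxLookups;
    }

    @Override
    public Optional<String> lookup(String word) {
        if (remaining <= 0) {
            throw new IllegalStateException("preview quota exhausted");
        }
        remaining--; // every request counts, whether or not an entry exists
        return delegate.lookup(word);
    }

    int remainingLookups() {
        return remaining;
    }
}

final class PreviewDemo {
    public static void main(String[] args) {
        EntryLookup preview = new PreviewAccess(new FullResource(), 2);
        System.out.println(preview.lookup("corpus").orElse("(no entry)"));
        System.out.println(preview.lookup("parser").orElse("(no entry)"));
        try {
            preview.lookup("lexicon"); // third request exceeds the quota
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Because the preview and the full resource share one interface, client
code written against the preview would need no changes after licensing.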
This workshop will discuss ways to overcome these barriers. The
proposers will discuss a new method for distributing and accessing
language resources involving the development of a common programmatic
model of the various resource types, implemented in CORBA IDL and/or
Java, along with a distributed server for non-local access. This model
is being designed as part of the GATE project (General Architecture
for Text Engineering:
http://www.dcs.shef.ac.uk/research/groups/nlp/gate/) and goes under
the provisional title of an Active CREOLE Server. (CREOLE: Collection
of REusable Objects for Language Engineering. Currently CREOLE
supports only algorithmic objects, but will be extended to data
objects.)
A common model of language data resources would be a set of
inheritance hierarchies making up a forest or set of graphs. At the
top of the hierarchies would be very general abstractions from
resources (e.g. lexicons are about words); at the leaves would be data
items that were specific to individual resources. Programmatic access
would be available at all levels, allowing the developer to select an
appropriate level of commonality for each application.
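The layered model described above might look something like the
following Java sketch, which shows one tree of such a forest. It is not
the GATE or CREOLE design: the interface names (LanguageResource,
Lexicon, SemanticLexicon), the leaf class ToyThesaurus and its data are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Top of the hierarchy: what every resource has in common.
interface LanguageResource {
    String name();
}

// One level down: lexicons are organised around words.
interface Lexicon extends LanguageResource {
    List<String> headwords();
}

// Further down: lexicons that also record synonymy (hypothetical level).
interface SemanticLexicon extends Lexicon {
    List<String> synonyms(String headword);
}

// Leaf: one specific resource, with an item that only it provides.
final class ToyThesaurus implements SemanticLexicon {
    private final Map<String, List<String>> synsets = new TreeMap<>();

    ToyThesaurus() {
        synsets.put("big", Arrays.asList("large", "great"));
        synsets.put("small", Arrays.asList("little"));
    }

    @Override
    public String name() {
        return "ToyThesaurus";
    }

    @Override
    public List<String> headwords() {
        return new ArrayList<>(synsets.keySet());
    }

    @Override
    public List<String> synonyms(String headword) {
        return synsets.getOrDefault(headword, Collections.<String>emptyList());
    }

    // Resource-specific accessor, available only at the leaf.
    int synsetCount() {
        return synsets.size();
    }
}

final class HierarchyDemo {
    // Written against the most general level: accepts any resource.
    static String describe(LanguageResource resource) {
        return "resource: " + resource.name();
    }

    // Written against the lexicon level: accepts any lexicon.
    static int size(Lexicon lexicon) {
        return lexicon.headwords().size();
    }

    public static void main(String[] args) {
        ToyThesaurus thesaurus = new ToyThesaurus();
        System.out.println(describe(thesaurus));
        System.out.println(size(thesaurus));
        System.out.println(thesaurus.synonyms("big"));
        System.out.println(thesaurus.synsetCount());
    }
}
```

Each method declares the most general type it needs, which is how a
developer would select the level of commonality for an application. A
second tree, for corpora say, would sit beside this one.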
Note that although an exciting element of the work could be to provide
algorithms to dynamically merge common resources, what we're suggesting
initially is not to develop anything substantively new, but simply to
improve access to existing resources. This is NOT a new standards
initiative, but a way to build on previous initiatives.
Of course, the production of a common model that fully expressed all
the subtleties of all resources would be a large undertaking, but we
believe that it can be done incrementally, with useful results at each
stage. Early versions will stop decomposing the object structure of
resources at a fairly high level, leaving the developer to handle the
data structures native to the resources at the leaves of the
forest. There should still be a substantial benefit in uniform access
to higher level structures.
Program Committee
-----------------
Yorick Wilks
Hamish Cunningham
Wim Peters
Remi Zajac
Roberta Catizone
Paola Velardi
Maria Teresa Pazienza
Roberto Basili
Bran Boguraev
Sergei Nirenburg
James Pustejovsky
Ralph Grishman
Christiane Fellbaum