The Internet is rapidly bringing to the foreground people's need to access and manage information in many different languages. Even those lucky enough to have learned several languages will still need help to participate effectively in the global information society. There are simply too many languages, and all of them are important to somebody.

While machine translation has a long history (over 50 years), computer technology now appears ready for the next great push in multilingual information access and management, particularly over the World Wide Web. The European Commission and several US agencies are taking bold steps to encourage research and development in multilingual information technologies. The EC and the US National Science Foundation, for example, have recently issued a joint call for Multilingual Information Access and Management research. The US Defense Advanced Research Projects Agency is supporting a new effort in Translingual Information Detection, Extraction, and Summarization research. Both of these efforts are direct results of international planning, and of this Granada effort in particular.

No one was more surprised than the Granada workshop participants at the rapid uptake of interest in Multilingual Information Management research. Attendees had hardly unpacked their bags after the workshop in Granada, Spain, when they were asked to present the results in Washington, DC, at a National Academy of Sciences workshop on international research cooperation. The US White House expressed interest in the topic as a groundbreaking effort for a new US-EU Science Cooperation Agreement. Now, DARPA has decided to invest in a multi-year, large-scale effort to push the envelope on rapid development of multilingual capability for new language pairs.

The world is surely shrinking as advances in communication and computation proceed at a breathtaking pace. On the other hand, there is no doubt that people will continue to hold on to the values and beliefs of their native cultures, including the languages of their families and ancestors. This is a treasure, a cultural knowledge base that must not be weakened even as the pressure to speak common languages increases. Efforts in multilingual technology therefore not only allow us to share the knowledge and resources of the world; they also allow us to preserve the individual human qualities that have enabled us to progress and to solve the problems we all share.

I thank all whose efforts have gone into this workshop report and the resource it represents for future work in the field. Those who proceed to carry on the research and development being called for from around the world will surely find this report of great value.

Introduction: The Goals of the Report

Over the past 50 years, a variety of language-related capabilities have been developed in machine translation, information retrieval, speech recognition, text summarization, and so on. These applications rest upon a set of core techniques such as language modeling, information extraction, parsing, generation, and multimedia planning and integration; and they involve methods using statistics, rules, grammars, lexicons, ontologies, training techniques, and so on.

It is a puzzling fact that although all of this work deals with language in some form or other, the major applications have each developed a separate research field. For example, there is no reason why speech recognition techniques involving n-grams and hidden Markov models could not have been used in machine translation 15 years earlier than they were, or why some of the lexical and semantic insights from the subarea called Computational Linguistics are still not used in information retrieval.

This picture will rapidly change. The twin challenges of massive information overload via the web and ubiquitous computing present us with an unavoidable task: developing techniques to handle multilingual and multi-modal information robustly, efficiently, and with the highest possible quality.

The most effective way for us to address such a mammoth task, and to ensure that our various techniques and applications fit together, is to start talking across the artificial research boundaries. Extending the current technologies will require integrating the various capabilities into multi-functional and multi-lingual natural language systems.

However, at this time there is no clear vision of how these technologies could or should be assembled into a coherent framework. What would be involved in connecting a speech recognition system to an information retrieval engine, and then using machine translation and summarization software to process the retrieved text? How can traditional parsing and generation be enhanced with statistical techniques? What would be the effect of carefully crafted lexicons on traditional information retrieval? At which points should machine translation be interleaved within information retrieval systems to enable multilingual processing?

The purpose of this study is to address these questions, in an attempt to identify the most effective future directions of computational linguistics research, and in particular to determine how to handle multilingual and multi-modal information. To gather information, a workshop was held in Granada, Spain, immediately following the First International Conference on Language Resources and Evaluation (LREC) at the end of May 1998. Experts in various subfields from Europe, Asia, and North America were invited to present their views regarding the following fundamental questions:

What is the current level of capability in each of the major areas of the field dealing with language and related media of human communication?

How can (some of) these functions be integrated in the near future, and what kind of systems will result?

What are the major considerations for extending these functions to handle multi-lingual and multi-modal information, particularly in integrated systems of the type envisioned?

In a series of ten sessions, one session per topic, the experts explained their perspectives and participated in panel discussions that attempted to structure the material and hypothesize about where we can expect to be in a few years' time. Their presentations, comments, and notes were collected and synthesized into ten chapters by a team of chapter editors.

A second workshop, this one open to the general computational linguistics public, was held immediately after the COLING-ACL conference in Montreal in August 1998. This workshop provided a forum for public discussion and critique of the material gathered at the first meeting. Subsequently, the chapter editors updated and refined the ten chapters.

This report is formed from the presentations and discussions of a wide range of experts in computational linguistics research, both at the workshops and afterward. We are proud and happy to present it to representatives and funders of the US and European governments and other relevant associations and agencies.

We hope that this study will be useful to anyone interested in assessing the future of multilingual language processing.

We would like to thank the US National Science Foundation and the Language Engineering division of the European Commission for their generous support of this study.