A team of researchers from the Information Sciences
Institute at USC Viterbi has received a $16.7 million
grant from the Intelligence Advanced Research Projects Activity
(IARPA) to develop an automated information translation and
summarization tool to quickly translate obscure languages.

Principal investigator and ISI research team leader Scott
Miller, ISI computer scientist Jonathan May, and ISI research
lead Elizabeth Boschee are leading a team of about 30
researchers, including academics from the University of
Massachusetts, Northeastern University, MIT, RPI, and the
University of Notre Dame. Their senior advisors are Prem
Natarajan, ISI’s Michael Keston executive director and research
professor of computer science, and Kevin Knight, ISI research
director and Dean’s professor of computer science.

The ISI team’s project is called SARAL, which stands for
Summarization and domain-Adaptive Retrieval. Saral is also a
Hindi word whose translations include “simple” and “ingenious.”
The team includes experts in machine translation, speech
recognition, morphology, information retrieval, representation,
and summarization.

“The overall objective is to provide a Google-like capability,
except the queries are in English but the retrieved
documents are in a low-resource foreign language,” says
Miller, who is based at ISI’s newest office in Boston, MA.

“The aim is to retrieve relevant foreign-language
documents and to provide English summaries explaining how
each document is relevant to the English query.”

In this project, the ISI team will initially test their systems
using Tagalog and Swahili, two low-resource languages selected
by IARPA for the task. Over the course of the project, the team
will receive additional languages to translate using the systems.

Doing more with less

Although so-called “low-resource” languages
are often spoken by millions of people worldwide, relatively little
written material exists in these languages. This creates a
challenge for current translation systems, which typically
“learn” from seeing millions of written examples.

“Since we don’t have a lot of written data in
these languages, we have to do more with less,” says
May, who also holds an appointment as a research assistant
professor in computer science at USC Viterbi.

“Ideally, we would use about 300 million words to
train a machine translation system—and in this case, we
have around 800,000 words. There are about 100,000 words per
novel, so we have only eight novels’ worth of words to
work from.”

The researchers will begin the project by compiling
material in the test languages, including speech, online
documents, and video clips, that has previously been
translated into English.

They will then develop algorithms to analyze the language
patterns, such as sentence structure—subject, verb and
object position, for example—and morphology, the
structure of words and their relation to other words in the same
language.

The system will be designed to respond to domain-specific
queries, for example, environmental protection in the
“government and politics” domain or primary education in the
“lifestyle” domain, and will produce a summarized response of
about 100 words describing how the result is relevant to the
search.

“You can think of the summary as something like
CliffsNotes, but with the added feature that it is indexed to the
precise part you want to write your essay about,” says
May.

In addition to ISI, a number of universities and research
institutions will work towards the same goal: Johns Hopkins University,
Columbia University, and Raytheon BBN Technologies are also
taking part in the IARPA program, called MATERIAL, which stands
for Machine Translation for English Retrieval of Information in
Any Language.

“IARPA’s MATERIAL program is the first
organized attempt at synthesizing recent advances in machine
translation, speech recognition, cross-lingual retrieval and
summarization into a powerful new capability that allows users to
accurately access all relevant information, across languages and
modalities,” says Natarajan. “We are
tremendously grateful for the opportunity to contribute to this
nationally important effort.”