Generalized Example-Based Machine Translation

2. Our Implementation of Example-Based Machine Translation

Other EBMT systems operate on parse trees, or find the most similar
complete sentence and modify its translation based on the differences
between the sentence to be translated and the matched example.
Because our system was designed from the outset to be one engine in a
multi-engine translation system, we were able to use a different
approach. Since the EBMT engine is not required to produce output for
every word of the input -- one of the other engines can cover
for it -- it can produce partial translations.

The basic EBMT system that we are generalizing
performs partial exact matches against the examples it has been given,
and relies on the multi-engine architecture to assemble the partial
translations.
Its training consists of indexing the source-language half of each
translation example in an inverted index -- a listing, for
every possible word, of all the locations at which the word occurs.
When asked to perform a translation, it first finds every
exactly-matching phrase in its database, regardless of any overlap
with any other matches (it always selects the longest match in any
particular translation example). To find the matching phrases using
an inverted index, one simply retrieves the occurrence lists for the
first two words, and determines which of the occurrences of the first
word are adjacent to an occurrence of the second word. Then one
retrieves the occurrence list for the third word, and either extends
the match or creates a new one where the third word appears adjacent
to the second one in the example base. This process repeats until the
entire input text has been processed.
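The indexing and matching procedure just described can be sketched in Python. This is a simplified illustration rather than the actual engine; the `min_len` threshold, the tuple formats, and the function names are our own choices, and examples are represented as pre-tokenized word lists:

```python
from collections import defaultdict

def build_index(examples):
    """Inverted index over the source half of each translation example:
    word -> list of (example_id, position) occurrences."""
    index = defaultdict(list)
    for eid, words in enumerate(examples):
        for pos, word in enumerate(words):
            index[word].append((eid, pos))
    return index

def find_phrase_matches(index, text, min_len=2):
    """Find every maximal exact phrase match of `text` against the examples,
    keeping only the longest match at each location, as the engine does.
    Returns (input_start, length, example_id, example_start) tuples."""
    results = []
    # live maps (example_id, end_position_in_example) -> input position
    # where the currently-growing match began
    live = {}
    for i, word in enumerate(text):
        new_live = {}
        for eid, pos in index.get(word, ()):
            # extend a match that ended at the previous position, else start anew
            new_live[(eid, pos)] = live.get((eid, pos - 1), i)
        # any live match that was not extended is maximal: record it
        for (eid, pos), start in live.items():
            if (eid, pos + 1) not in new_live:
                length = i - start
                if length >= min_len:
                    results.append((start, length, eid, pos - length + 1))
        live = new_live
    # matches still live at the end of the input are also maximal
    for (eid, pos), start in live.items():
        length = len(text) - start
        if length >= min_len:
            results.append((start, length, eid, pos - length + 1))
    return results
```

Because each input word triggers only a lookup of its occurrence list, the work grows with the size of those lists, which is why run time scales with the example base.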

Figure 1: Our Approach to EBMT

Once all matching phrases have been found, the next step is to
determine what portion of the target-language half of each matched
example corresponds to the matched part of that example. If the
entire example sentence has been matched, this correspondence is
trivial to determine: it is the entire target-language half of the
example. If only a portion of the example has been matched, the
correspondence becomes much trickier -- finding the translation is
then the equivalent of performing a word-level alignment between the
two halves of the example, a problem which is still being actively
researched by many people.

In our EBMT engine, we use a simple heuristic word-level
alignment algorithm which, while far from perfect, is nevertheless
adequate for the task. The alignment algorithm starts by constructing
a table of possible correspondences between individual words, based
on a bilingual dictionary. Then, it examines the correspondence
table to determine the shortest and longest possible translations
of the matched chunk. For every contiguous substring of the longest
possible translation that includes the shortest one, a set of simple
scoring functions is applied, and the substring with the best score
is declared the proper translation of the chunk. More details on
the alignment algorithm will be given below.
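The shortest-to-longest candidate search might be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the engine's actual code: the bilingual dictionary is modeled as a word-to-translation-set mapping, and the single density-based scoring function stands in for the set of scoring functions used in practice:

```python
def align_chunk(src_chunk, src_rest, tgt_words, dictionary):
    """Heuristic alignment of a matched source chunk to a target substring.

    src_chunk:  the matched source-language words
    src_rest:   the unmatched source words of the same example
    tgt_words:  the target-language half of the example
    dictionary: bilingual dictionary, word -> set of possible translations
    """
    def corresponds(t, srcs):
        return any(t in dictionary.get(s, ()) for s in srcs)

    # correspondence table: target positions linked to some word of the chunk
    inside = [j for j, t in enumerate(tgt_words) if corresponds(t, src_chunk)]
    if not inside:
        return None
    lo, hi = min(inside), max(inside)       # shortest possible translation
    left, right = lo, hi                    # grow toward longest possible one,
    # absorbing neighbors that do not correspond to unmatched source words
    while left > 0 and not corresponds(tgt_words[left - 1], src_rest):
        left -= 1
    while right < len(tgt_words) - 1 and not corresponds(tgt_words[right + 1], src_rest):
        right += 1

    def score(a, b):
        # toy scoring function: density of chunk correspondences in the span
        span = tgt_words[a:b + 1]
        return sum(corresponds(t, src_chunk) for t in span) / len(span)

    # score every contiguous substring of the longest candidate that
    # contains the shortest one; keep the best-scoring span
    best = max(((a, b) for a in range(left, lo + 1)
                for b in range(hi, right + 1)),
               key=lambda ab: score(*ab))
    return tgt_words[best[0]:best[1] + 1]
```

For example, aligning the chunk "the cat" against "le chat était assis" (with "sat" unmatched) would pick "le chat", since extending the span past the correspondences only dilutes the score.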

For run-time efficiency, the EBMT engine does not actually process
every single match -- some short common phrases such as "of the" can
occur thousands of times in the example base. First, it looks at a
maximum of eight matches for any particular phrase in the input, and
second, it stops as soon as it finds one example for which the scoring
metric indicates a perfect alignment between the source-language and
target-language chunks (both of these limits are set in the program's
configuration file). The candidate matches are processed starting
with the most recently added one, proceeding backwards to the
candidate which was the earliest to be added to the example base.
This ordering was selected to permit later additions to have priority
over older examples, and supports the correction of translation errors
by incremental addition of corrections to the indexed example base
(including on-the-fly updates at run-time).

How efficient is the EBMT engine? Its run-time speed obviously
depends on the speed of the computer on which it is run, and (less
obviously) on the size of the example base -- run time increases
approximately in proportion to the amount of text in the example base,
since the index grows proportionally and additional phrases can be
matched as the example base grows. Our typical configuration has an
example base of 10 to 20 megabytes of text, which is sufficient for
reasonably good coverage of unrestricted input texts. With example
bases of such sizes, the engine can translate upwards of 500 words per
minute on even the slowest Pentium processor, and 5000 or more words
per minute on a 300 MHz UltraSPARC or Pentium II. This is easily
fast enough for real-time translation of a conversation between
two people who do not speak the same language.