This document describes an experimental system for translating between
two languages. The system has two parts: one for translating from one
language into another, and a second for extracting transfer rules from
pairs of phrases, described below.

NB: The system is very slow for sentences more than a few words in
length.

rules.pl is a file of transfer rules. Transfer rules can
be hand-written, or they can be extracted from phrase
pairs. The transfer rules are applied in parallel
-- there is no feeding and bleeding of transfer rules.

The translator works by enumerating the parses, transferring each parse,
enumerating the transfers of each parse, and then generating from each
transfer. This can be time-consuming if there are a lot of parses. You
can limit the number of parses, transfers, and generations enumerated using
the following Tcl variables:

set enumerate_parses 50
set enumerate_transfers 10
set enumerate_generations 10
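
The way the three limits interact can be pictured as nested loops. The
following Python sketch is only an illustration of that enumeration order,
not XLE code: the parse, transfer, and generate arguments are hypothetical
caller-supplied functions that each return an iterable of candidates.

```python
from itertools import islice

def translate(sentence, parse, transfer, generate,
              enumerate_parses=50, enumerate_transfers=10,
              enumerate_generations=10):
    """Collect generated strings, visiting at most the given number of
    parses, transfers per parse, and generations per transfer.

    parse, transfer, and generate are hypothetical stand-ins for the
    three components; a limit of None means "no limit" in this sketch.
    """
    outputs = []
    for p in islice(parse(sentence), enumerate_parses):
        for t in islice(transfer(p), enumerate_transfers):
            outputs.extend(islice(generate(t), enumerate_generations))
    return outputs
```

With the values shown above, the worst case is 50 x 10 x 10 = 5000
generations for a single sentence, which is why lowering any one of the
limits can make a large difference.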

If the translator has property weights associated with it, then
you can speed things up even further by only generating from the best
transfers:

set max_transfers 10
set score_diff_cutoff 5

Setting max_transfers to N tells the translator to only
generate from the N best transfers (if there are more transfers with the
same score as the Nth transfer, then these will be included too). Setting
score_diff_cutoff to M tells the translator to ignore
transfers whose score is M less than the score of the transfer with the
highest score.
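
The selection described above can be sketched in Python as follows. This
is a minimal illustration, not the translator's implementation. It assumes
that higher scores are better, that a transfer is dropped when its score is
at least M below the best score (the exact boundary behavior is not stated
here), and that both settings apply together when both are given.

```python
def select_transfers(scores, max_transfers=None, score_diff_cutoff=None):
    """Return the scores of the transfers that would be generated from.

    Assumptions of this sketch: higher is better, and a transfer is
    dropped when it is score_diff_cutoff or more below the best score.
    """
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return []
    if score_diff_cutoff is not None:
        best = ranked[0]
        ranked = [s for s in ranked if best - s < score_diff_cutoff]
    if max_transfers is not None and 0 < max_transfers < len(ranked):
        # Keep everything that ties with the Nth best transfer.
        nth_score = ranked[max_transfers - 1]
        ranked = [s for s in ranked if s >= nth_score]
    return ranked
```

For example, with scores 10, 9, 9, 7, and 3, setting max_transfers to 2
keeps 10, 9, and 9, because the third transfer ties with the second.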

The latter two commands only show the result of transferring the last
parse and the results of generating from the last transfer. If you want to
look at earlier attempts, you can set enumerate_parses and
enumerate_transfers to smaller values.

Transfer rules can be extracted automatically from phrase pairs or from
aligned sentences. In the first case, a single transfer rule is extracted
that represents the transfer from the source phrase to the target phrase,
taking into account special words that represent arguments. In the second
case, atomic transfer rules are first extracted based on word alignments.
Pairs and triples of adjacent atomic transfer rules are then combined
to make composite transfer rules. This process is similar to how transfer
rules are extracted from aligned sentences in the Pharaoh system.

The first step in extracting transfer rules is to choose the best
analyses on the source and target sides. XLE chooses the pair of analyses
that align the best so that the transfer rules will be as simple as
possible. Thus, the source and target sentences disambiguate each
other.
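
Choosing the best-aligned pair amounts to a search over all combinations of
source and target analyses. The Python sketch below shows only that search.
How XLE actually measures alignment is not described here, so
alignment_score is a hypothetical function supplied by the caller.

```python
from itertools import product

def best_pair(source_analyses, target_analyses, alignment_score):
    """Return the (source, target) pair with the highest alignment score.

    alignment_score is a hypothetical stand-in; this document does not
    say how XLE scores the alignment of two analyses.
    """
    return max(product(source_analyses, target_analyses),
               key=lambda pair: alignment_score(*pair))
```

As a toy usage, each analysis could be represented as a set of grammatical
functions and scored by the size of the overlap, so that an analysis with
SUBJ and OBL on one side is paired with the SUBJ and OBL analysis on the
other side rather than with a SUBJ and OBJ analysis.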

After an f-structure is chosen for each side, XLE simplifies the SUBJs.
SUBJs that are in a predicative noun or adjective are removed so that a
single transfer rule covers both predicative and attributive uses.
Predicative SUBJs are also removed from the source sentence when
translating so that the simplified transfer rules will match. Since the
target language is expecting the predicative SUBJs to be there, you must
make SUBJ addable in the generator. If SUBJ is addable, the generator will
add it even if it isn't governed.

XLE also simplifies SUBJs in verbs that are controlled in one way or
another. If the SUBJ of a verb is functionally controlled, or is a null
pronoun (representing anaphoric control; e.g. PRON-TYPE = null), or creates
a cycle (as in (^ ADJUNCT $ SUBJ) = ^), then the SUBJ is replaced with an
empty SUBJ (an f-structure with no content). This makes it easier to
translate between constructions that use one form of control and
constructions that use another form of control. Controlled SUBJs are also
simplified in the source sentence when translating so that the simplified
transfer rules will match. If the input to the generator has an empty
SUBJ, then the generator will add whatever form of control is required by
the target grammar.

After the SUBJs have been simplified, XLE aligns the individual
f-structures using user-specified words or using word alignments. Then XLE
extracts transfer rules and prints them out.

Suppressing Features

Often there are features in the f-structures that you don't want in the
transfer rules. For instance, you probably don't want the tense feature in your
transfer rule unless the tense feature has different values on the two sides.
Otherwise, you have to produce phrase pairs for every tense of
every verb. There are two ways to suppress such features. One is to use
set-gen-adds remove in the standard way in the translator:

Features that take f-structures instead of constant values (such as
SUBJ) are only removed if their values are paraphrase variables that are
aligned. This allows you to extract transfer rules that work for both
active and passive forms.

The second command, extract-paraphrase-rules, extracts transfer
rules from the phrase pairs (it can also be used to extract paraphrase
rules if the source grammar and target grammar are the same). Its first
argument is the file of phrase pairs described above. Its second argument
is the name of the output file.

If either phrase in a phrase pair is ambiguous, then
extract-paraphrase-rules will choose the analyses that are most
parallel to each other. It skips phrase pairs that have a fragment parse
on one side or the other.

Normally, extract-paraphrase-rules extracts a single
transfer rule for each phrase pair that represents all that is in the
phrase pair. If it cannot find a single rule that covers both phrases,
then it will print all of the sub-transfer rules that it found. If you
also want very simple back-off transfer rules for each pair of aligned
words in the phrase pair, do the following in Tcl before calling
extract-paraphrase-rules:

setx extractParaphraseBackoffs 1

You can also see the results of extracting one transfer rule with the
following:

extract-paraphrase-rules phrase-pairs.txt 7

The command test-extract-paraphrase-rules tests extract-paraphrase-rules
by extracting rules one paraphrase at a time and using the rule for each
paraphrase to translate the left-hand side of the paraphrase.

Paraphrase Variables

If a word to be translated takes arguments, you can specify those
arguments using dummy lexical entries:

The dummy lexical entries must have the same f-structures in the
source and target languages.

The semantic forms of the dummy lexical entries also need to be added to
the translator's performance vars file:

setx paraphrase_variables "NP V"

Normally, features in a paraphrase variable that are equal on both sides
are excluded from the resulting transfer rule. However, you can tell the
system to preserve specific features by doing the following:

setx preserve_dummy_features "MEDICINE"

This is useful if there are selectional restrictions that you want to
include in the paraphrase variables. If you want argument types to be
included in the transfer rules, use the following:

setx include_argument_types 1
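
The exclusion of equal features, and the effect of preserving some of them,
can be illustrated with a small Python sketch. It is a simplification and
not the extraction code: it assumes that the features of a paraphrase
variable on each side can be viewed as a flat mapping from feature names to
constant values, and the function name is made up for the example.

```python
def rule_features(source, target, preserve=()):
    """Return the (source, target) features kept in the transfer rule.

    Features with equal values on both sides are dropped unless they
    are listed in preserve. Flat dictionaries are an assumption of
    this sketch; real f-structures are nested.
    """
    keep = {f for f in set(source) | set(target)
            if source.get(f) != target.get(f) or f in preserve}
    return ({f: v for f, v in source.items() if f in keep},
            {f: v for f, v in target.items() if f in keep})
```

If both sides have MEDICINE = + and the same tense, then by default
neither feature appears in the rule. Passing preserve=("MEDICINE",) keeps
MEDICINE on both sides while still dropping the tense.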

If all of the words in a phrase pair are paraphrase variables,
then extract-paraphrase-rules will extract transfer rules for the
paraphrase variables. This is useful when you want to translate the left
hand side of a phrase pair file as a sanity check but you need transfer
rules for the paraphrase variables.

-sourceDir should be a directory that contains the
f-structures for the source sentences, where S1.pl is the f-structure for
the first sentence. The sentences can have mixed upper and lower case,
even if the alignment file lowercases all of the words.
-targetDir should contain the f-structures for the target
sentences. extract_transfer_rules will print the transfer
rules in -outRules (or stdout if -outRules is not
specified).

extract_transfer_rules extracts transfer rules for each of
the well-formed f-structures that are aligned by -alignments.
It also extracts transfer rules for pairs and triples of adjacent
f-structures that are aligned. This is similar to the phrase-based
translation used by the Pharaoh system, only applied to f-structures
instead of strings.
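
The grouping into singles, pairs, and triples can be sketched in Python as
follows. This is a simplified illustration: it assumes the aligned units
can be laid out in a linear order so that "adjacent" means neighboring in
that order, which may not match how adjacency is determined for
f-structures. The function name is hypothetical.

```python
def adjacent_groups(units, max_size=3):
    """Return every run of 1..max_size neighboring units, shortest first.

    Linear adjacency is an assumption of this sketch.
    """
    groups = []
    for size in range(1, max_size + 1):
        for start in range(len(units) - size + 1):
            groups.append(tuple(units[start:start + size]))
    return groups
```

For four aligned units this yields four singles, three pairs, and two
triples, so the number of candidate rules grows roughly linearly with the
number of aligned units.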

The output of extract_transfer_rules is not suitable as
input to load_translation_rules, since there is a fair amount of
repetition in the transfer rules. In order to eliminate repetition and
make it easier to find rules, you must collate the rules into a rule
directory using the following command:

collate_transfer_rules transfer-rules.pl rule_directory

You can collate as many rule files as you want into the same rule
directory. Then you can use the rule directory as input to
load_translation_rules:

The translation system uses property weights to choose the best translation
from the set of all translations produced for a sentence. Each component
of the translation system has its own property weights: the parser, the
transfer system, and the generator.

Most likely, the parser for the source language already has property
weights. These weights are used to choose the N best parses when
enumerate_parses is set to some value other than zero. They
are also used as input to the later components.

Transfer Property Weights

The transfer system has its own set of weights to pick the best
transfers. These weights are used by enumerate_transfers,
max_transfers, and score_diff_cutoff. XLE
recognizes the following property weights:

INPUT_SCORE is the score of the input f-structure based on
the parser property weights. DOMINANCE_SCORE is the score
given by the language model of dominance relations on the f-structure. It
is only useful if dominance_db_file has been set in the
performance vars file of the translator. DOMINANCE_COUNT is
the count of dominance relations. It is necessary in order to normalize
over f-structures with different numbers of dominance relations.
DEFAULT counts the number of default rules applied (where a
feature is translated as itself). Fewer is better here.
DEFAULTPRED measures the number of PRED values
that were translated as themselves. Fewer is better here.
rule_trace counts the number of rule applications. In
general, larger rules produce fewer rule applications, so fewer rule
applications is better. The statistics for absolute frequency and relative
frequency are only useful if the rules were collated from a corpus of
aligned sentences.

You will need the following in the performance variables file for the
translator:

Generator Property Weights

The generator also has its own set of property weights. The translation
system needs a specialized set of property weights since each system that
provides input to the generator has its own conventions about which features
are left to be filled in by the generator. Any of the standard properties
used to disambiguate parses can be used to disambiguate generations. There
are also some special property weights that are useful in translation:

INPUT_SCORE is the score given to the input f-structure
provided by the transfer system (this score includes the parser's score as
well). GEN_NGRAM_SCORE is the score given to the output of
the generator by the language model. GEN_WORD_COUNT is the
number of space-delimited words in the output of the generator. It is
needed to normalize GEN_NGRAM_SCORE, since longer sentences
tend to have a lower language model score.
GEN_CONSTITUENT_MOVES is the number of constituents that were
moved in going from the source language sentence to the target language
sentence. Finally, GEN_STARRED is the number of ungrammatical
OT marks that were required in order to generate a particular string.
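
To see why GEN_WORD_COUNT helps, consider the Python sketch below. It
assumes that the property values are combined as a weighted sum, which is
an assumption of the example rather than something stated above, and the
weights and property values are invented for illustration.

```python
def weighted_score(properties, weights):
    """Combine property values as a weighted sum (assumed for this sketch).

    Properties without a weight contribute nothing.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in properties.items())

# Invented values: the longer output has a lower language model score.
short_output = {"GEN_NGRAM_SCORE": -12.0, "GEN_WORD_COUNT": 4}
long_output = {"GEN_NGRAM_SCORE": -20.0, "GEN_WORD_COUNT": 8}
```

With a weight on GEN_NGRAM_SCORE alone, the short output always wins. Adding
a positive weight on GEN_WORD_COUNT gives each word a bonus that offsets the
per-word language model penalty, so the longer output can be preferred when
its score per word is better.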

You will need to add statements like the following to the performance
variables file for the generator:

For language modeling, you will need a license for the SRILM language
modeling software. Then you will need libxle-lm.so so that XLE can access
the SRILM package.

Generating Training Data

If you don't want to set property weights by hand, then you will need to
generate training data to train the property weights using cometc. You can
generate training data by specifying that the output file is a directory
using the slash character:

translate-testfile german-testfile.lfg trainingdata/

translate-testfile will write out the training data in a
directory structure rooted in the specified directory. The top level has
directories for each sentence. The next level down has directories for
each parse for the current sentence. Each parse directory has directories
for each transfer structure for the current parse, as well as the parse
f-structure (parse.pl). Each transfer directory has the
transfer f-structure (transfer.pl), the features of the
transfer f-structure (transfer-features.txt), and three files
for each generation: the generated string (genN.txt), the
generation f-structure (genN.pl), and the features of the
generated f-structure (genN-features.txt):

The transfer features in transfer-features.txt are the
features in the translator's property weights file. The generation
features in genN-features.txt are the features in the
generator's property weights file.
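
If you post-process the training data with your own scripts, the feature
files can be gathered by depth alone. The Python sketch below assumes only
the three-level nesting described above (sentence, parse, transfer) and the
file names transfer-features.txt and genN-features.txt. It makes no
assumption about how the directories themselves are named, and the function
name is hypothetical.

```python
from pathlib import Path

def collect_feature_files(root):
    """Return (transfer feature files, generation feature files) under root.

    Relies only on the sentence/parse/transfer nesting, not on the
    directory names.
    """
    root = Path(root)
    transfer_files = sorted(root.glob("*/*/*/transfer-features.txt"))
    generation_files = sorted(root.glob("*/*/*/gen*-features.txt"))
    return transfer_files, generation_files
```

Calling collect_feature_files("trainingdata") returns two sorted lists of
paths that can then be read and converted into whatever input format the
training step requires.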

To train with cometc, you need to create a set of files that has
unlabeled feature weights for each sentence, and a set of files that has
labeled feature weights for each sentence. The unlabeled feature weights
for a sentence are just the disjunction of all of the feature weights
for that sentence. The labeled feature weights are the disjunction of the
feature weights whose generated strings are correct.
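
The split into unlabeled and labeled sets can be sketched in Python as
follows. The sketch covers only the selection step for one sentence. It
assumes each candidate is available as a pair of generated string and
feature record, and it does not attempt to reproduce the file format that
cometc expects, which is not described here. The function name is
hypothetical.

```python
def split_training_candidates(candidates, correct_strings):
    """Return (unlabeled, labeled) feature records for one sentence.

    candidates is a list of (generated_string, features) pairs.
    unlabeled holds every candidate's features; labeled holds only the
    features of candidates whose generated string is correct.
    """
    unlabeled = [features for _, features in candidates]
    labeled = [features for string, features in candidates
               if string in correct_strings]
    return unlabeled, labeled
```

A sentence for which no candidate is correct yields an empty labeled set.
Whether such sentences should be kept in the training data is a decision
this sketch leaves open.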

You should first train the transfer feature weights (all of the
transfer-features.txt), and then train the generation feature
weights (all of the genN-features.txt) using the transfer
feature weights that you just obtained.