Stanford Phrasal User Guide

This guide explains how to set up and train a phrase-based Statistical Machine
Translation system using Phrasal. It offers step-by-step
instructions to download, install, configure, and run the Phrasal decoder and
its related support tools.

Phrasal is designed for fast training of both traditional Moses-style machine translation (MT) models and large-scale, discriminative translation models with:

An intuitive feature engineering API

Large-scale learning with AdaGrad+FOBOS

Fast search with cube pruning

Unpruned language modeling with KenLM

This guide assumes that you are building an MT system on a Unix-like operating system and that you have some familiarity with Unix-like command-line interpreters (shells), such as bash on Linux and Mac OS X or Cygwin on Microsoft Windows. The commands in this tutorial are written for bash, but they are relatively easy to adapt to other shells. While the core Phrasal MT decoder will run anywhere that Java runs, the same is not true of many of the support tools and scripts used to train MT systems.

Installation

Set the CORENLP_HOME environment variable to the path of your local CoreNLP git repo. Suppose that the repo is in $HOME. You would execute:

export CORENLP_HOME=$HOME/CoreNLP

You might want to add the Phrasal scripts directory to your shell PATH:

export PATH=$PATH:$HOME/phrasal/scripts

Compiling Phrasal

See the build instructions in the README.md file in the root directory of the Phrasal git repo.

Language Modeling

Phrasal comes with a Java language model query implementation, but it does not include a tool for estimating language models. To build language models, we recommend KenLM. You can download the latest version from the KenLM site, or use the copy of KenLM included in the src-cc folder of the Phrasal download. Both sources contain installation instructions. (KenLM is written in C++ and requires Boost.)

Later on in this tutorial we will use KenLM's lmplz tool to build language models.
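
Once lmplz is built (the build steps appear below), you can estimate a language model from tokenized text. A minimal example, with illustrative filenames, builds a 5-gram model from the tokenized target side of the bitext:

lmplz -o 5 < train.en.tok > train.en.arpa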

The Phrasal Java-based language model loader can load the ARPA-format files created by lmplz, but it is quite inefficient for large language models, in terms of both speed and memory use. Phrasal therefore includes a JNI loader for the efficient C++-based KenLM, which can be compiled separately. First, install a JDK from Oracle. Next, make sure that the JAVA_HOME environment variable is set. Finally, compile the loader:

gradle compileKenLM

If you are using Mac OS X and get a compilation error caused by a missing jni.h file, update the JNI include path in compile_JNI.sh so that it points at your JDK's headers (on OS X, jni.h lives under $JAVA_HOME/include, with the platform-specific jni_md.h in $JAVA_HOME/include/darwin).

To activate this loader, add the "kenlm:" prefix to the language model path in the Phrasal ini file.
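
For example, assuming the language model is specified under the [lmodel-file] section of the ini file (the path below is illustrative):

[lmodel-file]
kenlm:/path/to/train.en.arpa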

compile_JNI.sh builds only the components needed for querying language models. If you also want to estimate language models, you must build lmplz as well:

cd $HOME/phrasal.ver/src-cc/kenlm
./bjam

Word Alignments

For word alignment, we recommend the Berkeley Aligner. The Berkeley download contains installation and usage instructions. You may also use another compatible aligner, such as GIZA++.

To run symmetrization heuristics like grow-diag during phrase extraction, you'll need to configure the Berkeley Aligner to produce A3 files. To do so, add the writeGIZA parameter to the Berkeley configuration file, as sketched below.
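
The following is a minimal sketch of such a configuration. Apart from writeGIZA, the property names and values are illustrative; check them against the example configuration bundled with the Berkeley Aligner download:

forwardModels   MODEL1 HMM
reverseModels   MODEL1 HMM
mode            JOINT JOINT
iters           2 2
execDir         aligner-output
saveParams      true
writeGIZA       true
trainSources    corpus/
foreignSuffix   fr
englishSuffix   en
sentences       MAX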

Tuning and Evaluation

Open the vars file and read the comments. If you've followed the instructions carefully up to this point, then all of the filenames and paths should match. Nevertheless, verify the files and paths before proceeding.

Tuning and evaluation consist of eight stages, which for convenience are configured in a single script. To see the stages, run:

phrasal.sh

If your PATH is configured correctly, the script will print a usage message that lists the stages.

To train and evaluate a system, which we will call "baseline," run this command:

phrasal.sh fr-en.vars 1-6 fr-en.ini baseline

An explanation of each step along with associated parameters in the vars file follows.

Extract phrases from dev set

Purpose: Extract translation rules from the parallel data for the development set.

Phrase extraction parameters:

EXTRACT_SET -- Specifies the bitext files and symmetrization heuristic (the default heuristic is grow-diag).
THREADS_EXTRACT -- Number of threads to use for phrase extraction.
MAX_PHRASE_LEN -- Maximum source phrase length.
OTHER_EXTRACT_OPTS -- Other options, described in edu.stanford.nlp.mt.train.PhraseExtract.
LO_ARGS -- Parameters for the lexicalized re-ordering model.
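
For illustration, these settings might appear in the vars file as follows. THREADS_EXTRACT matches the 16-thread run described later in this section; the MAX_PHRASE_LEN value is an assumption (a commonly used maximum), not a recommendation:

THREADS_EXTRACT=16
MAX_PHRASE_LEN=7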

Dev set parameters

TUNE_SET_NAME -- The name of the tuning/dev set (e.g., newstest2011).
TUNE_SET -- The actual filename of the dev set (e.g., newstest2011.fr.tok).
TUNE_REF -- The reference file of the dev set (e.g., newstest2011.en.tok).
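
Using the example filenames above, these lines would read:

TUNE_SET_NAME=newstest2011
TUNE_SET=newstest2011.fr.tok
TUNE_REF=newstest2011.en.tok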

The system is very stable on held-out data, achieving a maximum BLEU score of 24.18 at the end of iteration 1. This result compares fairly well to the WMT results, given that we only used a fraction of the data. We ran with 16 threads, and each of the eight learning iterations lasted about four minutes.

The first log contains output from step 2. You can see the tuning objective function score by searching for "BLEU", e.g.:

grep BLEU newstest2011.baseline.online.log

BLEU scores should increase from one epoch to the next.

The other two "stdout" logs capture all system output to the console for the dev and test steps. You should look at the "stdout" logs for Java exceptions. In particular, if the paths in your vars file are not configured properly, then you will see Java FileNotFoundException information in these logs.

You can also inspect the intermediate weight files generated by the learning algorithm.

Advanced/Additional Features

This section describes features for advanced users who wish to maximize translation quality.

Word Classes

Phrasal 3.4 comes with several featurizers that use word classes. To use these featurizers, you need mappings from words to classes for both the source and the target language.
Phrasal includes an implementation of a very fast word clustering algorithm that allows you to train word classes on corpora containing billions of tokens within a few hours, up to three orders of magnitude faster than other popular tools.

To use the word classes in any featurizer you have to add the mappings to your ini file:

[target-class-map]
en.cls
[source-class-map]
fr.cls

Feature Engineering API

Phrasal contains an intuitive feature API that will be familiar to anyone who has written discriminative features for SVMs or logistic regression. As in cdec, feature functions can be loaded dynamically without recompiling the whole system; in Phrasal, this is accomplished via reflection.

The Feature API Tutorial describes the interfaces in the API, walks through example features, and provides tips for writing feature templates. MT features are tricky: bad features can significantly reduce translation quality due to interactions with the approximate search algorithm.

Baseline "dense" features are located in edu.stanford.nlp.mt.decoder.feat.base.

Examples of discriminative "sparse" features can be found in edu.stanford.nlp.mt.decoder.feat.sparse.
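
As a sketch, a sparse featurizer is typically enabled by listing its fully qualified class name in the ini file. The section name and featurizer below are illustrative assumptions; check the Feature API Tutorial for the exact syntax:

[additional-featurizers]
edu.stanford.nlp.mt.decoder.feat.sparse.DiscriminativePhraseTable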

Feature Augmentation

For domain specification, you need to generate an input properties file (see edu.stanford.nlp.mt.util.{InputProperty,InputProperties}). The input properties file contains a set of key/value pairs for each segment in the source input file. For domain splitting, you would have:

Domain=tech
Domain=legal
Domain=nw

Add the input properties file to the *.ini file as follows:

[input-properties]
filename

Statistical Significance Testing

Phrasal includes an implementation of the permutation test described by Riezler and Maxwell (2005). To obtain p-values for a pair of system outputs, run:

where reference_prefix is a common filename prefix for multiple references.

Advanced Parameters

Additional phrase extraction options are passed via the OTHER_EXTRACT_OPTS in the vars file to edu.stanford.nlp.mt.train.PhraseExtract. See the usage and javadocs in that package for a description of the options.

Learning options are passed via ONLINE_OPTS in the vars file to edu.stanford.nlp.mt.tune.OnlineTuner. See the usage and javadocs in that package for a description of the options.

Phrasal decoder options are specified in the .ini file. The decoder supports popular functions such as forced decoding, dropping unknown words, larger beam sizes, etc. For the full list of options, see the javadocs in edu.stanford.nlp.mt.Phrasal.
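
For example, a distortion limit and an n-best list might be requested as follows. The section names follow the Moses-style conventions on which Phrasal's ini format is based, and the values are illustrative; verify both against the javadocs before use:

[distortion-limit]
5

[n-best-list]
baseline.nbest 200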

MERT (Batch) Training

Phrasal contains an efficient MERT implementation. MERT does not scale to the large feature sets supported by the API, but it is an effective algorithm for the baseline dense features. To enable it, comment out the online tuning parameters in your vars file and uncomment the corresponding batch (MERT) parameters.

Arabic-English Translation

Arabic-English is a common research language pair in the United States. The steps in this tutorial can be used to build an Ar-En system with the exception of Arabic pre-processing and segmentation. The Stanford NLP group provides a free tokenizer/segmenter for Arabic. Run it on the Arabic side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.

Chinese-English Translation

Chinese-English is also a common research language pair in the United States. Chinese, like Arabic, requires tokenization and segmentation prior to system tuning. The Stanford NLP group provides a free tokenizer/segmenter for Chinese. Run it on the Chinese side of the parallel data and you should be ready to go.

A high-quality English tokenizer (i.e., better than the WMT perl script in this document) is included in Stanford CoreNLP.