Statistical machine translation (SMT) should benefit from linguistic information,
but current state-of-the-art systems rely on purely data-driven
models.
There are several reasons why prior efforts to build linguistically annotated models
have failed or not even been attempted. Firstly, the practical implementation often
requires too much work to be cost-effective. Secondly, where ad-hoc implementations have
been created, they impose constraints that are too strict to be of general use. Lastly, many
linguistically motivated approaches are language-dependent, tackling peculiarities of
certain languages that do not apply to others.
This thesis successfully integrates linguistic information, namely part-of-speech tags,
lemmas and phrase structure, into SMT models to improve translation quality.
The major contributions of this thesis are:
1. We enhance the phrase-based model to incorporate linguistic information as additional
factors in the word representation. The factored phrase-based model
allows us to make use of different types of linguistic information in a systematic
way within the predefined framework. We show how this model improves translation
by as much as 0.9 BLEU for small German-English training corpora, and
0.2 BLEU for larger corpora.
2. We extend the factored model to the factored template model to focus on improving
reordering. We show that by generalising translation with part-of-speech
tags, we can improve performance by as much as 1.1 BLEU on a small French-
English system.
3. Finally, we move from the phrase-based model to a syntax-based model, the
mixed syntax model. This allows us to transition from the word-level approaches
using factors to multiword linguistic information such as syntactic labels
and shallow tags. The mixed syntax model uses source language syntactic
information to inform translation. We show that the model is able to explain
translation better, leading to a 0.8 BLEU improvement over the baseline hierarchical
phrase-based model for a small German-English task. Moreover, because the model
requires only labels on continuous source spans and does not depend on a tree
structure, other types of syntactic information can be integrated into
the model. We experiment with a shallow parser and see a gain of 0.5 BLEU
for the same dataset. With more training data, we improve translation
by 0.6 BLEU (1.3 BLEU out-of-domain) over the hierarchical baseline.
During the development of these three models, we discover that attempting to
rigidly model translation as a linguistic transfer process results in degraded performance.
However, by combining the advantages of standard SMT models with linguistically motivated
models, we are able to achieve better translation performance. Our work
shows the importance of balancing the specificity of linguistic information with the
robustness of simpler models.
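
As a rough illustration of the factored word representation in contribution 1, a word can be treated as a small tuple of annotations rather than a bare surface string. The sketch below is not the thesis implementation: the `FactoredWord` type, the `|` separator, the helper names and the example German tags are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class FactoredWord(NamedTuple):
    """One token with its linguistic factors (hypothetical layout)."""
    surface: str
    lemma: str
    pos: str


def parse_factored(sentence: str, sep: str = "|") -> List[FactoredWord]:
    """Parse tokens written as surface|lemma|pos (assumed format)."""
    words = []
    for token in sentence.split():
        surface, lemma, pos = token.split(sep)
        words.append(FactoredWord(surface, lemma, pos))
    return words


def pos_sequence(words: List[FactoredWord]) -> List[str]:
    """Generalise a word sequence to its part-of-speech tags."""
    return [w.pos for w in words]


# Illustrative input only; the tags are example annotations.
sentence = parse_factored("die|die|ART häuser|haus|NN sind|sein|VAFIN alt|alt|ADJD")
tags = pos_sequence(sentence)
```

Keeping the surface form alongside the other factors reflects the balance described above: a model can fall back on the plain word when the linguistic annotation is too sparse or too specific to help.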

Hoang, H. and Koehn, P. (2009). Improving mid-range re-ordering using templates of factors. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 372–379, Athens, Greece. Association for Computational Linguistics.