Statistical Machine Translation systems are designed to translate text from a source language into a target language. In most benchmark translation systems, the basic unit of textual analysis is the observed surface form of a word. While such a design performs well when translating between two morphologically poor languages, it does not when translating into or from a morphologically rich (or complex) language.
The purpose of our work is to develop a Statistical Machine Translation (SMT) system as an alternative solution to the many challenges raised by morphological complexity. Our system has the potential to capture morphological diversity and hence to produce effective translations from a morphologically poor language into a morphologically rich one. Several methods have been designed to accomplish such a task, with pre-processing and post-processing techniques built into them so that morphological information can improve translation quality. In this thesis, we first examine several methods of extending traditional SMT models and assess their ability to produce better output by comparing them on English-Inuktitut and English-Finnish translation tasks. In a second step, we develop a new morphologically aware segmentation algorithm that uses information from both languages to segment the morphologically rich language. The aim is to improve the quality of the alignments and, consequently, of the translation itself. This bilingual segmentation algorithm is then incorporated into the phrase-based translation model “PBM” to form our segmentation-based system. Finally, we combine the resulting segmentation-based system with post-processing algorithms to obtain our complete translation system. Our experiments show that the proposed segmentation-based system slightly outperforms the baseline translation system, which does not use any preprocessing techniques. We also find that our segmentation approach significantly surpasses the preprocessing baseline techniques used in this thesis.