Machine translation (MT) is a task in natural language processing (NLP) in which automatic systems translate text from a source language into a target language while preserving the meaning of the source text. Documents need to be translated between Tigrigna and English, so a mechanism for doing so is required. This study therefore explored the possibility of developing a Tigrigna-English statistical machine translation system and of improving translation quality by applying linguistic information. An experimental, quantitative research method is used. To achieve the objective of the research, corpora are collected from different domains, classified into five sets, and prepared in a format suitable for use in the development process. Three sets of experiments are conducted: baseline (a phrase-based machine translation system), morph-based (based on morphemes obtained using an unsupervised method), and post-processed segmented (based on morphemes obtained by post-processing the output of the unsupervised segmenter). The systems are built with Moses, a free statistical machine translation framework that allows translation models to be trained automatically from a parallel corpus. Since the system is bidirectional, four language models are developed: one for English and three for Tigrigna (one each for the baseline, morph-based, and post-processed experiments). Translation models, which assign a probability that a given source-language text generates a target-language text, are built, and a decoder that searches for the shortest path is used. The BLEU score is used to evaluate the performance of each set of experiments. The post-processed experiment using corpus II outperformed the others, with a BLEU score of 53.35% for Tigrigna-English and 22.46% for English-Tigrigna translation. In addition, for each corpus, the post-processed experiment outperforms the baseline and morph-based experiments. This shows that the post-processed segmented system outperforms the other systems. Future research should therefore focus on further improving the BLEU score by applying preprocessing and postprocessing techniques.

Description:

A Thesis Submitted to the School of Information Science in Partial Fulfillment for the Degree of Master of Science in Information Science