Thoughts on Software Engineering

RANLP 2009: Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus

Today I presented a research paper at the prestigious scientific conference RANLP’2009. It describes new methods for extracting false friends from parallel corpora, which form a major part of my PhD thesis. The paper is titled “Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus” and was accepted after a thorough anonymous review by two distinguished scientists in the areas of Natural Language Processing (NLP) and Information Retrieval (IR).

Abstract

False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as “bridges”. Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-known algorithms.
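To give a feel for the first family of measures mentioned in the abstract (statistics about monolingual occurrences and cross-lingual co-occurrences in a sentence-aligned bi-text), here is a minimal sketch in Python. It is an illustration only, not the formulas from the paper: the Dice coefficient is a stand-in similarity measure chosen for simplicity, the function names cooccurrence_stats and dice_similarity are hypothetical, and the tiny German-English bi-text is invented toy data built around the Gift/poison example.

```python
from collections import Counter
from itertools import product


def cooccurrence_stats(bitext):
    """Count, per sentence pair, how often each word occurs on its own side
    and how often each cross-lingual word pair occurs together."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in bitext:
        # Use sets so that a word is counted at most once per sentence.
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_counts.update(src_set)
        tgt_counts.update(tgt_set)
        pair_counts.update(product(src_set, tgt_set))
    return src_counts, tgt_counts, pair_counts


def dice_similarity(src_word, tgt_word, stats):
    """Dice coefficient over sentence-level counts (an assumed stand-in
    measure, not necessarily the one used in the paper)."""
    src_counts, tgt_counts, pair_counts = stats
    total = src_counts[src_word] + tgt_counts[tgt_word]
    if total == 0:
        return 0.0
    return 2.0 * pair_counts[(src_word, tgt_word)] / total


# Invented toy bi-text: (German tokens, English tokens), sentence-aligned.
toy_bitext = [
    (["das", "gift", "ist", "tödlich"], ["the", "poison", "is", "deadly"]),
    (["er", "kaufte", "ein", "geschenk"], ["he", "bought", "a", "gift"]),
    (["das", "geschenk", "ist", "schön"], ["the", "gift", "is", "nice"]),
]

stats = cooccurrence_stats(toy_bitext)

# Orthographically identical words that never co-occur in aligned sentences
# get a low score, which marks them as a false-friend candidate.
false_friend_score = dice_similarity("gift", "gift", stats)
translation_score = dice_similarity("gift", "poison", stats)
print(false_friend_score, translation_score)
```

On this toy data the look-alike pair gift/gift never appears in the same aligned sentence pair, while gift/poison always does, so the first pair scores low and the second scores high. In a real setting one would restrict the candidates to orthographically similar word pairs and rank them by such a score, with lower values suggesting likelier false friends.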