The principal focus of the Natural Language Processing group is to build a machine translation system that automatically learns translation mappings from bilingual corpora.

Overview

The Machine Translation (MT) project at Microsoft Research is focused on creating MT systems and technologies that cater to the multitude of translation scenarios today. Data-driven systems, in particular those with a statistical core engine, have proven to be the most efficient, because they adapt to a wide range of domains and can be trained on new language pairs within a matter of weeks. This team works closely with research and development partners worldwide, making the system accessible to a variety of products and services.

Research:

Machine Translation has been a major focus of the NLP group since 1999. Our approach to MT has always been “data-driven”. Rather than writing explicit rules to translate natural language, we train our algorithms on human-translated parallel texts, which allows them to automatically learn how to translate. Our first-generation system, based on Logical Forms, learned translation patterns at the level of abstract parsed structures, and was used to translate the entire Microsoft support knowledge base into several languages. Our recent research has focused on Statistical Machine Translation (SMT).

Syntax-Based SMT. Translating content from English into as many foreign languages as possible is a high priority both for Microsoft and for the billions of people around the world who do not read English. The Treelet Translation System leverages an English natural language parser to help guide this process. This technology is currently used in several places across Microsoft, including the Live translation system for computer-related texts and the Microsoft Support site. Ongoing research has produced major improvements in the choice of word inflections and word ordering in this system.

Phrase-Based SMT. Many leading SMT systems do not use any linguistic resources, such as dictionaries, grammars, or parsers. These so-called “phrase-based” systems try to learn translations of arbitrary sequences of words directly from parallel texts. By improving the methods used to prune the search for the best translation in this type of system, we have shown how to find better translations in less time than previous systems.
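
To make the idea of pruning the search concrete, here is a minimal sketch of monotone phrase-based decoding with simple beam (histogram) pruning. It is a generic textbook-style illustration, not the improved pruning method referred to above; the function name, the toy phrase table, and its log-probability scores are all hypothetical.

```python
import heapq

def monotone_beam_decode(source, phrase_table, beam_size=3, max_phrase_len=3):
    """Translate `source` left to right, keeping at most `beam_size`
    partial hypotheses for each number of source words covered.

    phrase_table maps a tuple of source words to a list of
    (target_word_tuple, log_probability) options. All values are illustrative.
    """
    # stacks[i] holds hypotheses that cover the first i source words,
    # each stored as (score, target_words_so_far).
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(0.0, ())]
    for i in range(len(source)):
        # Beam pruning: expand only the best few hypotheses in this stack.
        stacks[i] = heapq.nlargest(beam_size, stacks[i])
        for score, output in stacks[i]:
            for n in range(1, min(max_phrase_len, len(source) - i) + 1):
                src_phrase = tuple(source[i:i + n])
                for tgt_phrase, logp in phrase_table.get(src_phrase, []):
                    stacks[i + n].append((score + logp, output + tgt_phrase))
    final = stacks[len(source)]
    return max(final) if final else None

# Toy phrase table with made-up scores (assumption, for illustration only).
toy_table = {
    ("das",): [(("the",), -0.1), (("that",), -1.0)],
    ("Haus",): [(("house",), -0.2)],
    ("das", "Haus"): [(("the", "house"), -0.2)],
}
best = monotone_beam_decode(["das", "Haus"], toy_table)
```

A smaller `beam_size` makes decoding faster but risks discarding the hypothesis that would have led to the best full translation, which is the trade-off that better pruning methods aim to improve.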

Word Alignment. SMT systems learn translations from existing bodies of translated data. For most modern systems, identifying the word correspondences or word alignments in this translated data is a crucial step in training systems. Our group has produced pioneering work in both discriminative and generative approaches to word alignment, resulting in faster alignment algorithms with state-of-the-art quality.
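
As an illustration of what a generative approach to word alignment looks like, the following is a minimal sketch of the classic IBM Model 1 trained with expectation-maximization. It is a standard textbook model, not the alignment algorithms developed by the group; the function names and the three-sentence toy corpus are assumptions made for the example.

```python
def train_ibm_model1(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    This sketch omits the NULL source word used in the full model.
    """
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    # Start from a uniform distribution over co-occurring word pairs.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0 / len(tgt_vocab)
    for _ in range(iterations):
        count = {}
        total = {}
        # E-step: collect expected (fractional) alignment counts.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + frac
                    total[sw] = total.get(sw, 0.0) + frac
        # M-step: re-normalize the expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

def align(src, tgt, t):
    """Link each target position to its most probable source position."""
    return [
        (j, max(range(len(src)), key=lambda i: t.get((tgt[j], src[i]), 0.0)))
        for j in range(len(tgt))
    ]

# Toy parallel corpus (assumption, for illustration only).
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
table = train_ibm_model1(corpus)
links = align(["the", "house"], ["das", "Haus"], table)
```

Even on this tiny corpus, the repeated co-occurrence of "the" with "das" and of "book" with "Buch" pulls probability mass toward the correct word correspondences over the EM iterations.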

Language Modeling. Large n-gram language models are a crucial component in high-quality SMT systems. Trained only on target-language data, they help translation systems select fluent and readable output. MSRLM is a publicly available language modeling toolkit developed at MSR. The toolkit is both fast and scalable, training a 5-gram model from more than one billion pre-tokenized words in about 3 hours on a single machine.
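
To show how an n-gram model can prefer fluent output, here is a minimal sketch of a bigram language model with add-one smoothing. It does not use MSRLM or reflect its interface; the class name, the smoothing choice, and the two-sentence training set are assumptions chosen to keep the example small.

```python
import math
from collections import Counter

class BigramLM:
    """Toy bigram model with add-one smoothing over the training vocabulary.
    Out-of-vocabulary words are not handled specially in this sketch."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            # Every token except the last serves as a bigram context.
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # Words that can be predicted: training words plus end-of-sentence.
        self.vocab = {w for s in sentences for w in s} | {"</s>"}

    def logprob(self, sentence):
        """Natural-log probability of a tokenized sentence."""
        tokens = ["<s>"] + sentence + ["</s>"]
        v = len(self.vocab)
        return sum(
            math.log((self.bigram_counts[(a, b)] + 1) / (self.context_counts[a] + v))
            for a, b in zip(tokens, tokens[1:])
        )

# Target-language training data (assumption, for illustration only).
lm = BigramLM([["the", "cat", "sat"], ["the", "dog", "sat"]])
fluent = lm.logprob(["the", "cat", "sat"])
scrambled = lm.logprob(["sat", "cat", "the"])
```

Here `fluent` is larger than `scrambled` because the first word order matches bigrams seen in training; a translation system can use such scores to choose among candidate outputs.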

MSR MT System

Other research areas:

Some languages have their own special challenges; for instance, word boundaries are not indicated in normal Chinese texts. MSRSeg can both segment Chinese words and identify names of entities such as people and organizations, capabilities that are very useful in machine translation. More detail on our Japanese MT work can be found here.
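
The segmentation problem for Chinese can be illustrated with forward maximum matching, a classic dictionary-based baseline. This sketch is not how MSRSeg works and does not identify named entities; the function name and the tiny lexicon are assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedily take the longest lexicon word starting at each position,
    falling back to a single character when nothing in the lexicon matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

# Tiny illustrative lexicon (assumption, for illustration only).
lexicon = {"机器", "翻译", "机器翻译", "系统"}
segments = forward_max_match("机器翻译系统", lexicon)
```

Greedy matching of this kind fails on ambiguous character sequences and unknown names, which is one reason dedicated segmenters are valuable for machine translation.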

Currently our systems are trained on parallel texts that supply sentence-for-sentence translations of the original information. We have developed accurate methods of finding parallel sentences among mostly parallel documents. We have also begun research in extracting parallel data from pairs of “comparable” documents, which contain some information in common, but are not direct translations of each other.

Products and Integration Scenarios:

Microsoft Translator, a free translation portal and a web service that powers many other translation scenarios, is the latest result of the work done by our research and product teams. The goal is to create the simplest, most intuitively integrated and useful translation services available to end users, while making ongoing improvements to translation quality. This service allows Live Search users to translate foreign language search results by clicking on “Translate this Page”. Users can also translate words, search queries, paragraphs or entire web pages through the Microsoft Translator portal. The Bilingual Viewer interface features a unique, side-by-side web page viewer that translates entire web pages with blinding speed across 25 language pairs. In addition, there is a Windows Live Toolbar Button, an add-in that puts a button on users’ websites, allowing their visitors to translate their web page using our service, and a Windows Live Messenger Translator Bot prototype that lets users translate IM conversations in a number of popular languages.

Portions of the technology behind MSR-MT, including parsing, Logical Forms (LFs), and MindNet, have been used in the grammar checkers in Word, in the natural language query function of Encarta, and in other Microsoft products.

The system has already proven its value within Microsoft, having been used in 2003 to translate nearly 140,000 customer-support Knowledge Base articles into Spanish. (If you go to the web site, click on International Support, and choose Spain as your country, you can then enter Spanish queries for the KB and receive back machine-translated hits.) The effort was extended to Japanese the next year and to French and German in 2005. Microsoft’s Knowledge Base materials have now been translated into nine languages by MSR-MT. This approach lowered the cost barrier to obtaining customized, higher-quality MT, and Microsoft's support group is now able to provide usable translations for its entire online KB. It can also keep current with updates and additions on a weekly basis, something that was previously unthinkable in terms of both time and expense.