Natural Language Computing

Overview

The Natural Language Computing (NLC) Group focuses on machine translation, question answering, chatbots, and language gaming. Since its founding in 1998, the group has worked with partners on significant innovations including IME, Chinese couplets, Bing Dictionary, Bing Translator, Spoken Translator, search engine, sign language translation, and most recently Xiaoice, Rinna, and Tay, as well as general topics in Conversation As A Platform (CAAP).

The information era has brought vast amounts of digitized text that is generated, propagated, exchanged, stored, and accessed through the Internet every day around the world. The accumulation of this data makes information acquisition increasingly difficult, and language becomes a critical obstacle to growth. To overcome these difficulties, the Natural Language Computing (NLC) Group focuses on a variety of research topics, including multi-language text analysis, machine translation, cross-language information retrieval, text mining of large-scale web, social, and enterprise data, question answering over web, knowledge base, and social repositories, and applications of NLP technology to search engines, office, and cloud computing. Recently, we have carried out a series of explorations applying deep learning to typical NLP tasks such as machine translation, sentiment analysis, question answering, chatbots, and Conversation As A Platform (CAAP), with the aim of rebuilding the methodological foundation of NLP.

Over the years, the group has made significant contributions to Microsoft products, including a Japanese and Chinese Input Method Editor (IME) for Office in 2001, an English writing assistant for Office in 2007, a Chinese couplet game for Windows Live, a Chinese word breaker, pinyin search, and a search speller for the MSN search engine, text mining for SQL Server and SharePoint, metadata extraction for MSN, Bing Dictionary, Bing IME, sentiment analysis, and Light Question-Answering. Our research has been published at the most prestigious NLP conferences, including 50 papers at ACL, 17 papers at COLING, and 9 papers at SIGIR from 2000 to 2013. In recognition of these achievements, the group received the MSRA stamina award in 2006, the MSRA collaboration award (both group and individual) in 2012 and 2013, the MSRA best demo award, social impact award, and technical transfer award in 2012, and the MSRA Best Research Team award in 2015. The Engkoo Dictionary, later rebranded as Bing Dictionary, is an important innovation in English learning and online dictionaries that integrates machine translation and speech synthesis. It resulted from a collaboration between this group, the MSRA Speech Group, and the MSRA IEG Group, and won many awards, including the Wall Street Journal’s 2010 Asian Innovation Readers’ Choice Award. The Chinese-English translation engine has been deployed in Bing Translator. Our translation system has strongly supported the well-known MS spoken translator, which was successfully demonstrated live by Rick Rashid at the 21st Century Computing Conference in Tianjin in 2012. The Question-Answering platform Engkoo Answers (aka Project Light) has incubated a set of key technologies for web search, entity search, and social search.
The chatbot technologies, ranging from query understanding, response generation, context-aware ranker, topical chat, and sentiment analysis to knowledge-based chat and text-based chat, have been used to power various Microsoft AI products, including popular chatbots such as Xiaoice (China), Rinna (Japan), and Tay (US).

This group collaborates broadly with dozens of universities in China, Japan, Korea, Singapore, Taiwan, and Hong Kong on topics ranging from machine translation, web mining, sentiment analysis, and question answering to SNS text mining, summarization, and search. The group has mentored 500+ interns since 1999, and many of them are now active in the NLP field at universities and companies. Through the MS-University Joint PhD program, the group has supervised 16 PhD students from 8 universities, 8 of whom have obtained their PhD degrees. Among many successful collaborations, this group and the Internet Graphics group worked with the Institute of Computing Technology, Chinese Academy of Sciences, and Beijing Union University to develop a sign language recognition system using Kinect, which has generated great impact. We also recruit research interns from over 20 universities worldwide to work with researchers on important topics. The group has actively contributed to the NLP research community. Notable contributions include working with Harbin Institute of Technology (since 2004) and the Chinese Information Processing Society (CIPS) (since 2013) to run a summer school on NLP and Internet innovations, held since 2001; helping the China Computer Federation (CCF) establish the NLPCC conference; and promoting MS joint labs with Harbin Institute of Technology and Tsinghua University. Ming Zhou, as chair of the Chinese Information Processing Committee of CCF and an executive member of CIPS, has been deeply involved in the above-mentioned NLP research activities in China.

Areas of Focus

Our research strategy is data-driven and based on statistical learning: we collect large-scale monolingual and bilingual corpora from the web and third parties, and use machine learning approaches to acquire linguistic and translation knowledge. This knowledge is then used to support our research projects. Below is an introduction to our main research areas.

Corpus Collection, Classification, and Annotation

This is a continuous effort to build a large text corpus as the infrastructure for statistical learning. Text can be acquired from various documents and from the Web. Text classification by topic and writing style is useful for constructing a balanced corpus as well as various domain-specific corpora. Corpus annotation is a challenging task. It includes word segmentation, named entity identification, part-of-speech tagging, syntactic parsing, word sense tagging, and anaphora tagging. The resulting tagging tools can be used directly in a number of natural language applications, and the annotated corpora can serve as supervised training data for statistical language modeling for different purposes.
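
As a rough illustration of how a segmented corpus can feed statistical language modeling, the following minimal sketch trains a bigram model with add-one smoothing. The toy corpus, the function names, and the choice of add-one smoothing are assumptions made for this example and are not taken from the group's tools.

```python
from collections import Counter
import math

def train_bigram_model(sentences):
    """Count unigrams and bigrams over pre-segmented sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed log-probability of a segmented sentence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

# Toy, hand-segmented corpus (hypothetical data for illustration only).
corpus = [
    ["we", "translate", "text"],
    ["we", "search", "text"],
    ["we", "translate", "queries"],
]
unigrams, bigrams = train_bigram_model(corpus)
seen = log_prob(["we", "translate", "text"], unigrams, bigrams)
shuffled = log_prob(["text", "translate", "we"], unigrams, bigrams)
```

In this sketch a word order observed in the corpus scores higher than a shuffled one, which is the property that makes such models useful for ranking candidate outputs.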

Asian Language Natural Language Processing

Text Information Mining and Extraction (TIME) is a platform used to extract key information from a variety of documents, such as web pages, Word documents, and PowerPoint presentations, in different languages. The extracted information can be used to support information retrieval and search engines, machine translation, summarization, and question answering. This innovation covers a variety of technologies such as tokenization, named entity identification, semantic labeling or skeleton information extraction, key term extraction, and summarization.

Statistical Machine Translation and Neural Machine Translation

The focus of the Machine Translation project is on helping non-native English users, such as Chinese, Japanese, and Korean speakers, to search, read, and write in English more fluently. To this end, the NLC Group has applied statistical machine translation to provide meaningful translation solutions at the word, phrase or collocation, and sentence levels.

Recently, this group has started intensive studies on neural machine translation. Research focuses include distributed representations of words, phrases, and knowledge, improved attention modeling and training, efficient training and decoding algorithms, and a deep learning platform that can facilitate research on both neural machine translation and other natural language tasks using deep neural networks.
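
To make the idea of attention modeling concrete, here is a minimal sketch of one decoder step of scaled dot-product attention in plain Python. This formulation is chosen only for brevity and is not necessarily the variant studied by the group; the vectors and function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    """Return (context vector, attention weights) for one decoder step."""
    scale = math.sqrt(len(query))
    # Similarity between the decoder state and each encoder state.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context is the attention-weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Hypothetical 2-dimensional encoder states for a three-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = dot_product_attention(decoder_state, encoder_states, encoder_states)
```

The weights sum to one, and source positions whose states align with the decoder state receive more weight, so the context vector emphasizes the source words most relevant to the word being generated.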

Supported by translation technologies, the group is conducting research into new applications for search engines, such as multilingual search. This application works at the word level for input queries and at the sentence level for translating returned snippets.

Our goal is to explore the use of natural language processing (NLP) technologies to improve the performance of classical information retrieval (IR), including indexing, query suggestion, spelling, and relevance ranking. We try these approaches in a vertical domain first and gradually extend them to open domains. We have explored the best indexing terms for Chinese, new approaches to query expansion, mining word association and similarity from a text corpus, methods for fusing retrieval results from different IR systems, base NP identification, and accurate query translation using statistical and example-based approaches. We participated in the cross-lingual tracks of TREC-9 and NTCIR-III and achieved the best results on cross-language information retrieval, focusing on query translation and on optimizing indexing for a Chinese IR system. We also participated in the Web track of TREC-10.
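
Fusing the results of several IR systems can be sketched as follows. The example uses reciprocal rank fusion, a commonly used rank-combination method that stands in here for whatever fusion method the group actually explored; the document ids and the constant k=60 are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked document-id lists; a higher fused score means a better document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of two retrieval systems for the same query.
system_a = ["d1", "d2", "d3"]
system_b = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([system_a, system_b])
```

Documents ranked highly by both systems rise to the top of the fused list, while documents returned by only one system are kept but ranked lower.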

Based on the above-mentioned technologies, we have built a successful NLP-based search engine (lingo) that performs deep NLP analysis to build its index and allows complicated queries to search a database. This search engine was used in Engkoo Dictionary, later rebranded as Bing Dictionary, to search the huge collection of bilingual example sentences mined from the web. It was also used in our semantic tweet search (QuickView) in 2010.

Question Answering

Question answering is a key technology being developed for the next generation of search engines. Given a question, a search engine user hopes to get an exact answer rather than face a huge number of query results. The NLC Group is developing question reformulation, paraphrasing, and various answer extraction techniques for factoid and non-factoid questions. Based on this work, the group also hopes to build domain-specific chatbots with question answering technologies that mine text forums, web blogs, and other web resources.

Since 2011, we have been building a QA research platform called Light (now called Engkoo Answers), designed to provide fundamental tools and benchmarks that support long-term, sustained research on the key elements of QA, including question understanding, question paraphrasing, query rewriting and correction, query expansion, entity extraction from queries, documents, webpages, and search snippets, answer extraction, ranking, confidence assignment for candidate answers, sentiment analysis, and opinion summarization. We built web-QA, which uses web search results to support question answering; KB-QA, which uses large-scale knowledge bases such as Freebase and MS knowledge bases; and social-QA, which uses large repositories of community QA pairs, tweets, and forums. The platform is capable of answering factoid questions, non-factoid questions such as definitional, yes/no, and subjective questions, and even Jeopardy! quiz questions.
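
One way to picture how candidate answers from the web, knowledge base, and social sources could be merged and ranked by confidence is the sketch below. It is not the platform's actual method: the noisy-OR combination, which assumes the sources err independently, and all names and data are hypothetical.

```python
def rank_answers(candidates):
    """Merge duplicate answers across sources and rank by combined confidence.

    Each candidate is a tuple (answer, confidence, source), with confidence
    assumed to lie in [0, 1].
    """
    miss_probability = {}
    for answer, confidence, _source in candidates:
        key = answer.strip().lower()
        # Noisy-OR: the answer is wrong only if every supporting source is wrong.
        previous = miss_probability.get(key, 1.0)
        miss_probability[key] = previous * (1.0 - confidence)
    ranked = [(answer, 1.0 - miss) for answer, miss in miss_probability.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates for a single factoid question.
candidates = [
    ("Paris", 0.6, "web"),
    ("paris", 0.5, "kb"),
    ("Lyon", 0.7, "social"),
]
ranked = rank_answers(candidates)
```

Here an answer supported by two moderately confident sources outranks an answer supported by a single, more confident source, which is one reason to combine several QA back ends.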

Semantic Analysis and Search for Big Text Data (Project QuickView)

This project started in 2000. We aim to build semantic analysis and search for large volumes of text data, covering unstructured, structured, and semi-structured data. The semantic analysis is a pipeline of text data processing, information extraction, search, summarization, question answering, and visualization. In addition to helping search engines evolve from the current search-only function toward decision making and task completion, and perhaps more importantly, we would like to help enterprise users distill information and knowledge from enormous data sources via cloud computing services, in order to support business intelligence, information access, and document generation. We hope to develop unified, standard methods that cover different genres of data, starting with standard text, then noisy tweets, and then structured databases.

Currently, we have developed a semantic analysis and search engine for tweets (QuickView) with a full set of semantic analysis functions, including tweet categorization, clustering, NER, semantic role labelling, sentiment analysis, opinion mining, keyword search, and simple question answering.

Language Gaming

Can you imagine a computer capable of generating Chinese couplets? The NLC Group has made this a reality for the first time in the world by creating Chinese Couplet Generation software as part of its language gaming project for the Internet and mobile games (http://duilian.msra.cn). The software works by accepting a sentence provided by a user and then extrapolating a couplet sentence. This technology can be used to further Chinese language learning by entertaining and engaging users.

At the Microsoft Build 2016 event, Microsoft CEO Satya Nadella said that chatbots, as the next big thing, will have “as profound an impact as previous shifts we’ve had.” Past paradigm shifts include the graphical user interface, the web browser, and the touchscreen. Conversation As A Platform (CAAP) promises to make booking a flight or buying a new shirt as easy as sending a text message, with the potential to make computing more accessible to users on mobile devices.

This group has been working on social chat, information and answer, and dialogue systems, the three key layers of technology that build the CAAP pyramid. Our work includes research on query understanding, response or answer generation, context-aware ranker, and knowledge-based, text-based, and webpage-based response and answer. Personalization of responses and answers is also an important research topic. In the last two years, this group has worked with MS product teams on various chatbots such as Xiaoice (in China), Rinna (in Japan), and Tay (in the US).