Exploiting Web Data for NLP Research: from Multilingual Text to Social Media


The Web is a gold mine of data of diverse categories and characteristics, including kinds that were hardly available in the past, such as large amounts of text in different languages and data from social media. Such data helps accelerate research progress and, at the same time, brings new challenges. In this talk, I introduce our recent research efforts that exploit such novel data on the Web. First, we exploit Twitter data to classify spiking queries into topical categories. Spiking queries are queries whose frequency in search engines rises suddenly, which indicates strong user attention to them. Accurate classification of spiking queries is therefore important for search engines. Next, I introduce our effort to extract Japanese-English parallel sentence pairs from the Web. We took three approaches to mining such data and carefully developed a data-cleaning framework to extract only the high-quality portion. Finally, I briefly introduce our approach to Japanese-English statistical machine translation.