Research on Advanced Natural Language Processing and Text Mining: aNT

Our research group at the University of Tokyo has been granted a five-year project (Grant-in-Aid for Specially Promoted Research) on advanced NLP by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) in Japan.
The project, aNT (advanced NLP and Text Mining), started in April 2006, and its technological focus is on deep parsing and knowledge-based processing with a strong emphasis on combining these with machine learning techniques. From an application standpoint, we aim to develop intelligent text management systems (information extraction, semantically enriched text, information retrieval, etc.), particularly for biomedical domains, along with a general set of NLP tools that are both adaptable for other domains and interoperable with tools developed by other groups.

Objectives

Significant progress has been made in Natural Language Processing and Text Mining over the last ten years. In particular, although “deep” syntactic parsing based on linguistic theories such as HPSG, CCG, LFG, and LTAG had been too slow and fragile for real-world applications, recent research in the field has greatly improved its efficiency and robustness. This rapid progress results from the integration of two streams of research: research on efficient algorithms for unification-based grammar formalisms, and research on probabilistic language modeling and machine learning using large corpora.

The aims of the project are to apply this successful research framework to the fields of semantic and pragmatic processing, and to bring breakthroughs to these fields. Despite the progress in efficiency and robustness, current “deep” parsing is still not accurate enough for many applications. Furthermore, the output of current “deep” parsing is not deep enough for applications such as information extraction, intelligent IR, and Q/A. These applications require deeper layers of representation beyond “deep” syntax. In short, they need layers of representation for the interpretation of text.

Straightforward application of the methods that succeeded in syntax will not work for semantics and pragmatics, for several reasons. (1) While different schools of computational linguistics have yet to agree on a representation even of “deep” syntax, deeper semantics, or the human interpretation of text, is far less tangible than syntax. Before applying probabilistic modeling, we have to construct a semantically (or pragmatically) annotated corpus, which in turn requires building empirically attested theories of semantics and pragmatics. (2) Knowledge-based processing or interpretation of text needs knowledge. To avoid repeating the mistakes of early AI research, the knowledge provided should have non-trivial coverage, enough to enable significant computational experiments. (3) Unlike syntactic theories, which are comparatively monolithic and local, knowledge-based semantic processing is heterogeneous and global, because it integrates various cues in text. We will develop extended frameworks in both algorithmic theory and probabilistic modeling. (4) We showed in the previous project that computational resources, including the GRID and distributed computer systems, are crucial to the successful application of deep parsing to real-world problems. Since knowledge-based semantic processing inherently requires huge data resources, including ontologies and large semantic lexicons, such an environment will play a more central role in future advanced NLP and Text Mining.

The project has the following four research topics.

Building a Semantically Annotated Corpus and an Annotation Ontology for Biology: We are enriching the corpus we built in the GENIA project with semantic annotation. We are also developing a flexible tool for semantic annotation.

Processing Frameworks for Semantics and Pragmatics: Modules that relate text fragments to entities in semantic domains, such as named entity recognizers, event recognizers, and inference engines for textual entailment, are to be developed. These modules are to use the results of the deep parser and, at the same time, to be incorporated into the deep parser as components.

Computer Infrastructure for Advanced NLP and Text Mining: Workflow software will be developed to allow users not only to combine software modules freely but also to make effective use of PC clusters that consist of 1,000 PCs and share common data resources.

Real-world Application: To show the feasibility of our technology in real-world applications, we are developing end-user systems for biologists as well as software components that can be freely combined. We believe that demonstrating the real impact of advanced NLP and Text Mining on actual IT systems is an essential part of the project.