CHIME Text Seminars

This long-standing seminar series brings together faculty and students
to discuss issues in the general field of text processing, as it
applies to machine learning, natural language processing, information
retrieval and digital libraries. Most of the meetings will be held in
MR6 (AS6 05-10).
You can get announcements of the CHIME Text Processing
Seminar by joining our mailing list:
ChimeText.

Upcoming Seminars

The purpose of a grammatical theory is to specify the mechanisms and principles that can characterize the relations of acceptable sentences in particular languages to the meanings that they express. It is sometimes proposed that the simplest and most explanatory way of arranging the formal mechanisms of grammatical description is to allow them to produce unacceptable representations or derivations for some meanings and then to appeal to a global principle of economy to control this overgeneration. I will explore the conceptual and formal issues of Economy as it has been discussed within the theory of Lexical Functional Grammar, present a framework within which alternative explicit definitions of Economy can be formulated, and examine some phenomena for which Economy has been offered as an explanation. In fact, for the cases to be examined, descriptive devices already available and independently motivated within the traditional LFG formalism can also account for these phenomena directly, without relying on cross-derivational comparisons to compensate for overgeneration. This leads to the question of whether Economy is necessary or even useful as a separate principle of grammatical explanation.

In this talk I introduce our attempts to model language using HPSG, a monostratal theory of grammar that integrates syntax and semantics into a single sign. I will focus on a new grammar for Chinese. Chinese refers to a family of languages including Mandarin Chinese, Cantonese, Min, and others. These languages share a great deal of structure, though they may differ in orthography, syntax and lexicon. We are attempting to build a grammar family, Zhong, which contains instantiations of the various Chinese languages, sharing descriptions where possible. Currently we have prototype grammars for Cantonese and for Mandarin in both simplified and traditional script, all based on a common core. The grammars can both parse and generate, and have facilities for handling unknown words, although they are still incomplete in their coverage of linguistic phenomena.

ABSTRACT:
I will describe some of our research in developing learning and inference methods in pursuit of natural language understanding. In particular, I will address what I view as some of the key challenges, including (i) learning models from natural interactions, without direct supervision, (ii) knowledge acquisition and the development of inference models capable of incorporating knowledge and reasoning, and (iii) scalability and adaptation: learning to accelerate inference during the lifetime of a learning system.

A lot of this work is done within the unified computational framework of Constrained Conditional Models (CCMs), an Integer Linear Programming formulation that augments statistically learned models with declarative constraints as a way to support learning and reasoning. Within this framework, I will discuss old and new results pertaining to learning and inference and how they are used to push forward our ability to understand natural language.

BIODATA:
Dan Roth is a Professor in the Department of Computer Science and the Beckman Institute at the University of Illinois at Urbana-Champaign and a University of Illinois Scholar. Roth is a Fellow of the American Association for the Advancement of Science (AAAS), the Association for Computing Machinery (ACM), the Association for the Advancement of Artificial Intelligence (AAAI), and the Association for Computational Linguistics (ACL), for his contributions to machine learning and natural language processing. He has published broadly in machine learning, natural language processing, knowledge representation and reasoning, and learning theory, and has developed advanced machine-learning-based tools for natural language applications that are widely used by the research community and commercially. Roth is the Associate Editor-in-Chief of the Journal of Artificial Intelligence Research (JAIR) and will serve as Editor-in-Chief for a two-year term beginning in 2015. He was the program chair of AAAI'11, ACL'03 and CoNLL'02. Prof. Roth received his B.A. summa cum laude in Mathematics from the Technion, Israel, and his Ph.D. in Computer Science from Harvard University in 1995.

The Web increasingly provides a vehicle for diverse elements of modern society, ranging from gaming to business, government to education, and crime to policing. The data generated grows not only in volume but also in variety and velocity. These key traits create big data – both ON the Web and ABOUT the Web. This is an unprecedented opportunity to gain insight into the ways in which the Web drives new social models which are changing the way we publish and consume information, how we behave online, and what it means to have privacy. With this opportunity comes the challenge of distributed big data analytics, trust and provenance, and building systems which can interoperate across technical, legal and cultural borders. Web Science brings a strong interdisciplinary approach to understanding the technical and social elements of an evolving Web ecosystem, and has proposed tools such as the Web Observatory to gather, analyse and curate web data individually, while also forming part of a growing global collective network of observatories through which new insights and new approaches to analytics are being developed. This workshop continues our programme of best practice in Web Science and Big Data Analytics. We are delighted to welcome a programme of speakers who will complement our hands-on work with Web Observatories during the event with insights into emerging trends and the future direction of Web Science.

When does the programme take place?

The Summer School will take place on 8-15 Dec 2014 and will run for one week. The detailed programme can be found under Our Programme.

What is Summer School programme?

The Summer School will feature a mix of keynote lectures and tutorials, as well as group projects by students. Our classes are rigorous and intense, and address challenges in big data social analytics that require multi-disciplinary approaches. An overview of our programme is as follows:

We welcome academic staff, MSc and PhD students, industry and government employees who are keen to learn more about big data analytics on and about the Web.

Registration & Expenses

There are two registration categories: Student and Non-Student. Both categories will be given access to all events organised under the Summer School. More specifically, participants who register under the Student category will have the opportunity to participate in our group projects, dealing with the analysis and application of selected Big Datasets.

Registration for the Summer School is free of charge, but attendees will be expected to cover their own travel, accommodation and subsistence expenses. Student accommodation has been reserved at NUS for students participating in group projects on a first come, first served basis. Please refer to Logistics Details for more information.

Deadline for Registration

For general participation - 30th Nov 2014

For students - 22nd Nov 2014
Note: Students must state why they want to participate in the Summer School.

ABSTRACT:
In this talk, I will discuss a particular approach to topic detection which identifies the important topics of a document not by extracting part of the text verbatim, as in TextRank, but by mapping the document into Wikipedia and making use of page titles. One challenge that this approach, which I call memory-based topic detection (MBT), faces is that it cannot handle documents that talk about events not covered by Wikipedia. I address the issue by combining it with sentence compression, which permits the creation of novel labels, varying in kind and granularity, that are not available in Wikipedia.
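As a toy illustration of the title-matching step, one could scan a document's n-grams against a set of Wikipedia page titles and rank the hits by frequency. This is only a sketch: the title list and document below are made up, and the actual MBT system is considerably more sophisticated.

```python
# Sketch: find candidate topics by matching n-grams against a (toy) set of
# Wikipedia page titles, ranked by how often they occur in the document.
from collections import Counter

def candidate_topics(text, titles, max_ngram=3):
    """Return Wikipedia titles found as n-grams in the text, by frequency."""
    words = text.lower().split()
    title_set = {t.lower() for t in titles}
    hits = Counter()
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - n + 1):
            ngram = " ".join(words[i:i + n])
            if ngram in title_set:
                hits[ngram] += 1
    return hits.most_common()

titles = ["Angry Birds", "Climate change", "New York Times"]
doc = "climate change dominated the talks as climate change policy stalled"
print(candidate_topics(doc, titles))
```

Note that this simple matcher already exposes the limitation the talk addresses: a document about an event with no corresponding page title yields no labels at all.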

Experiments with the New York Times and TDT Pilot Study corpora find that the present approach works significantly better than baselines, including TextRank, the state of the art in keyword extraction. The talk also includes a short presentation of Media Meter, a live demo system, which leverages MBT to detect and track trending topics in the US online news media.

BIODATA:
Currently, I am an associate professor at National Institute of Japanese Literature, with a joint appointment to Sokendai Graduate University of Advanced Studies, both of which are located in Tokyo, Japan. My research interests include natural language processing, text mining and quantitative studies of print and online media. Most of what I do at present can be found at http://www.quantmedia.org/meter/ .

ABSTRACT:
Idioms constitute a subclass of multi-word units that exhibit strong collocational preferences and whose meanings are at least partially non-compositional. The classic view of idioms as "long words" admits of little or no variation of a canonical form. Fixedness is thought to reflect semantic non-compositionality: the non-availability of semantic interpretation for some or all idiom constituents and the impossibility of parsing syntactically ill-formed idioms block regular grammatical operations. We argue that corpus data showing a wide range of discourse-sensitive morphosyntactic flexibility and lexical variation, even in cases where the constituents cannot be semantically interpreted, refute this simplistic view of idioms. Such data weaken the categorical distinction between idioms and freely composed phrases and pose a challenge to the representation of idioms and their constituents in lexical resources designed for Natural Language Processing. We discuss one possible solution, illustrated by the treatment of idioms in the large lexical database WordNet.

BIODATA:
Christiane Fellbaum is a Senior Research Scholar in the Computer Science Department. Her Ph.D. is in Linguistics and her research focuses on computational and corpus linguistics and lexical semantics. She teaches a course on Bilingualism and enjoys exploring new languages and faraway places.
She is Co-Founder and Co-President of the Global WordNet Association, and was awarded the Wolfgang Paul Prize and the Antonio Zampolli Prize. She is a partner in the European projects KYOTO and SIERA, a Permanent Fellow and Member of the Center for Language at the Berlin-Brandenburg Academy of Sciences, and a Member of the Board of Directors of the American Friends of the Humboldt Foundation. Her current work is supported by the U.S. National Science Foundation, the European Union (Seventh Framework), the Frank Moss Foundation and the Tim Gill Foundation.

ABSTRACT:
Grammar implementations which are guided by linguistic theory will normally lack coverage of even some well-formed utterances, since no current theory exhaustively characterizes all of the phenomena in any language. For many uses of a grammar, approximate or robust analyses of the out-of-grammar utterances would be better than nothing, and a variety of approaches have been developed for such robust parsing. In this paper I present an implemented method which adds two simple "bridging" rules to an existing broad-coverage grammar, the English Resource Grammar, allowing any two constituents to combine. This method relies on a parser which can efficiently pack the full parse forest for an utterance, and then selectively unpack the most likely N analyses guided by a statistical model trained on a manually constructed treebank.
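The bridging idea can be caricatured with a toy CKY parser. This is an illustration only, not the ERG or its statistical ranking model: the tiny grammar below deliberately lacks a rule for verb phrases, and a generic BRIDGE rule lets any two adjacent constituents combine at a penalty, so the least-penalised analysis survives.

```python
# Toy CKY parser with a "bridging" fallback: when no grammar rule licenses a
# pair of adjacent constituents, a generic BRIDGE rule combines them anyway,
# at a penalty of 1 per bridge. The lowest-penalty analysis per category wins.
RULES = {("Det", "N"): "NP", ("NP", "VP"): "S"}   # deliberately no V + NP rule
LEXICON = {"the": "Det", "dog": "N", "saw": "V", "cat": "N"}

def parse(words):
    n = len(words)
    # chart[i][j] maps category -> (penalty, tree) for the span words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1][LEXICON[w]] = (0, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lc, (lp, lt) in chart[i][k].items():
                    for rc, (rp, rt) in chart[k][j].items():
                        cat = RULES.get((lc, rc))
                        pen = lp + rp if cat else lp + rp + 1
                        cat = cat or "BRIDGE"
                        old = chart[i][j].get(cat)
                        if old is None or pen < old[0]:
                            chart[i][j][cat] = (pen, (cat, lt, rt))
    return chart[0][n]

print(parse("the dog saw the cat".split()))
```

Because the grammar cannot build a VP, every full analysis of the sentence requires at least two bridges; a statistical model over a treebank, as in the talk, would then choose among such robust analyses.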

BIODATA:
Dan Flickinger is a Senior Research Associate at the Center for the Study of Language and Information (CSLI) and Project Manager of the LinGO Laboratory at CSLI, Stanford University.
Flickinger is the principal developer of the English Resource Grammar (ERG), a precise broad-coverage implementation of Head-driven Phrase Structure Grammar (HPSG). His current research is focused on two broad areas: parsing text for improved information retrieval, and applying the ERG to improved educational software. Flickinger's central research interests are in wide-coverage grammar engineering for both parsing and generation, lexical representation, and the syntax-semantics interface.

ABSTRACT:
Mobile apps have become commonplace in society. But with millions of apps flooding the app stores, recommender systems have become indispensable tools as they help consumers overcome the problem of information overload. By sifting through the ocean of apps, they allow consumers to discover new and compelling apps through personalized recommendations. Yet, conventional recommender systems have their own set of problems — particularly the problem of data sparsity, which is the result of insufficient ratings per app. Furthermore, conventional recommender systems do not account for the singularity of the app domain that, if properly utilized, could potentially provide significant improvements to current app recommender systems.

In this thesis, we investigate the singularity of the app domain for the purpose of improving app recommendations. By exploiting the app domain’s unique characteristics, we come up with novel recommendation techniques that take advantage of information from social networks, version updates, and a slew of app metadata that is typically underused.

First, we describe an approach that accounts for nascent information culled from Twitter to provide relevant recommendations in cold-start situations. By exploiting an app’s Twitter handle (e.g., @angrybirds), we extract its Twitter-followers and show how these Twitter-followers can act as an alternative source of information to overcome the cold-start problem.
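The intuition behind the follower-based signal can be sketched with a simple set-overlap measure: apps whose Twitter accounts share many followers are plausibly similar even before they have ratings. The handles and follower IDs below are invented, and the thesis's actual techniques go well beyond raw Jaccard overlap.

```python
# Sketch: Jaccard overlap between apps' Twitter-follower sets as a
# cold-start similarity signal (handles and follower IDs are made up).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

followers = {
    "@angrybirds":   {"u1", "u2", "u3", "u4"},
    "@cuttherope":   {"u2", "u3", "u4", "u5"},
    "@stocktracker": {"u8", "u9"},
}

def most_similar(app):
    """Find the app whose follower set overlaps most with the given app's."""
    others = [a for a in followers if a != app]
    return max(others, key=lambda a: jaccard(followers[app], followers[a]))

print(most_similar("@angrybirds"))
```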

Second, we observe that in the domain of mobile apps, a version update may provide substantial changes to an app, which may revive a consumer’s interest for a previously unappealing version. We leverage version features for the purpose of improving app recommendations, and show that incorporating version information into conventional techniques significantly improves the recommendation quality.

Finally, given a diverse set of app recommendation techniques, we propose a unifying framework that marries the strengths of the various individual techniques while overcoming their respective weaknesses. We present a hybrid app recommender system that utilizes both conventional and novel app recommendation techniques — as well as the assimilation of user and app metadata features — for the purpose of generating a personalized ranked list of recommended apps.

BIODATA:
Jovian joins the NExT Search Centre at NUS as a Research Assistant. He received his B.Sc. degree from the School of Computing, National University of Singapore in 2009, where his final year project (FYP), titled "Human Computation through Game Playing", won the NUS School of Computing's FYP Innovation Award in the same year. He went on to pursue his Ph.D. at NUS under the supervision of Prof. Chua Tat Seng and A/P Kan Min Yen. Jovian's research interests include app recommendation under cold-start and sparse conditions, and using social data to boost recommendation in app stores. More information is available at: http://jovianlin.com

ABSTRACT:
Users' online activities and their participation in social networks have increased significantly in recent years. As a result of this massive participation, the amount of user-generated content (UGC) has also expanded rapidly. There are many types of UGC; it can be as short as a comment, tweet or Facebook post, or as long as a review or blog post. In line with this rapid development of social networks, online discussions are growing as a popular and effective source of information for users because of their timely, lively and flexible content. In recent years, this popularity and user demand have resulted in substantial growth in both the number of users and the diversity of topics in online discussions. Usually, online discussions are developed and advanced incrementally by groups of users with various backgrounds and intents. However, the flexibility, fast updating, informal language and dynamic structure of online discussions make them a challenge for new users and automated systems seeking to understand and learn the underlying semantics of these long threaded discussions.

Recently, many approaches have been proposed to model the relations of topics and learn the semantics of user-generated content. These approaches are usually based on Bayesian models, clustering methods and different types of topic models. However, learning and modeling online discussions is a complex and multi-faceted problem which requires a mixture of these methods to analyze and model the complex and dynamic relations between discussions, topics and users. Therefore, for accurate and comprehensive modeling of online discussions, we need a unified framework that can learn the hidden relations between all three aspects: users, topics and discussions. To learn all these relations and hidden structures, in this thesis we propose a unified framework as follows: (a) We propose a novel unsupervised Aspect-Action topic model (AS-AC) that enables us to identify primary topics and their dependencies from a sequence of user posts. In particular, we jointly model aspects with their associated actions to boost the precision of our generative process, where actions play the role of defining the functionalities for a group of aspects. (b) We extend the AS-AC model to learn and generate the underlying hierarchical structure of a discussion, utilizing a fast binary approach that is able to capture the evolution of topics and subtopics accurately over the duration of a discussion. (c) We present a model for the automatic identification of users' objectives and intents within a discussion, which can be used for extensive evaluation and comparison between discussions. This enables us to calculate the semantic similarity between any two sub-topics by generating the aspect-action relationship graph and finding the groups of highly connected topics. In this thesis, we conduct experiments on Apple discussion forums covering various products, with over 3.3 million user posts from 300k discussion threads.
Our evaluation indicates that the joint aspect-action model yields substantial improvements in the accuracy of discussion modeling and in capturing latent relations between users, posts and topics.

BIODATA:
Ghasem Nobari received his bachelor's degree from Azad University of Iran (IAU) in 2008. For one year, he worked as a research assistant in the NUS database lab with Prof. Bressan and Prof. Kian Lee on a cloud-based Privatization and Anonymization Service (PASS). He started his PhD in computer science in 2010 under the supervision of Prof. Chua Tat Seng in the Lab for Media Search (LMS) at NUS. During his PhD he received the following awards: the A*STAR SINGA Award 2010, Semi-finalist in the Microsoft Imagine Cup 2010, the Audience Choice Award and 3rd Place Award in the Elsevier SgCodeJam24 2011, Winner of Startup Weekend Singapore 2012 (SWSG12), and the Extra Chapter Challenge Award (ECC14) from NUS Enterprise.

ABSTRACT:
We have developed a single-document text summariser that uses Random Indexing and PageRank. The summariser can be used as-is, but performance and robustness are improved if the word vectors are trained on an external corpus. I will present the techniques, especially Random Indexing, as well as recent work on making the summariser more accessible and some evaluations of the cohesion and readability of summarised texts.
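The graph-ranking half of such a summariser can be sketched as follows. This is not the presented system: real Random Indexing vectors are replaced here by plain word overlap, so the sketch illustrates only the PageRank step over a sentence-similarity graph.

```python
# Sketch of PageRank-based extractive summarisation: build a sentence
# similarity graph, run power iteration, and pick the top-scoring sentence.
def similarity(s1, s2):
    """Word-overlap stand-in for a real vector-space similarity."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / (len(w1) + len(w2))

def pagerank_scores(sents, d=0.85, iters=50):
    n = len(sents)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = similarity(sents[i], sents[j])
    scores = [1.0 / n] * n
    for _ in range(iters):                      # power iteration
        new = []
        for i in range(n):
            rank = sum(sim[j][i] / (sum(sim[j]) or 1.0) * scores[j]
                       for j in range(n) if j != i)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

sents = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock prices rose sharply today",
]
scores = pagerank_scores(sents)
print(sents[max(range(len(sents)), key=scores.__getitem__)])
```

Swapping `similarity` for a cosine over trained word vectors is exactly where an external corpus, as mentioned in the abstract, would improve robustness.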

BIODATA:
Arne Jönsson is a Professor in Computer Science at Linköping University, Sweden, and Head of the Division of Human-Centered Systems at Linköping University. His current research interests include digital inclusion, especially techniques for the automatic summarisation of texts, automatic FAQs, and the readability of texts. For more than 25 years he has conducted research on natural language interfaces, especially dialogue management and empirical studies of human-computer interaction. He is still partly active in that area, but now with a focus on conversational systems and social interaction for educational systems. He has also conducted research on the use of Augmented Reality for collaboration in command and control situations. He was co-founder of a master's programme in cognitive science, and has been responsible for the programme for almost twenty years.

ABSTRACT:
Reordering models play a central role in machine translation. While lexicalized reordering models have been widely used in phrase-based translation systems, they suffer from three drawbacks: context insensitivity, ambiguity, and sparsity. We propose a neural reordering model that conditions reordering probabilities on the words of both the current and previous phrase pairs. Including the words of previous phrase pairs significantly improves context sensitivity and reduces reordering ambiguity. To alleviate the data sparsity problem, we build one classifier for all phrase pairs, which are represented as continuous space vectors. Experiments on the NIST Chinese-English datasets show that our neural reordering model achieves significant improvements over state-of-the-art lexicalized reordering models.
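The flavour of such a model can be sketched as a single softmax classifier over orientation classes (e.g. monotone, swap, discontinuous), conditioned on averaged word embeddings of the current and previous phrase pairs. The sketch below uses random, untrained weights and a made-up vocabulary, and shows only the forward pass, not the paper's actual architecture or training procedure.

```python
# Sketch: one shared classifier over reordering orientations, conditioned on
# continuous representations of the previous and current phrase pairs.
# Weights are random and untrained; this illustrates the forward pass only.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {w: i for i, w in enumerate("the cat sat on mat le chat".split())}
DIM, CLASSES = 8, 3          # classes: monotone, swap, discontinuous
E = rng.normal(size=(len(VOCAB), DIM))   # word embeddings
W = rng.normal(size=(4 * DIM, CLASSES))  # one classifier for ALL phrase pairs
b = np.zeros(CLASSES)

def embed(words):
    """Average the embeddings of known words in a phrase."""
    vecs = [E[VOCAB[w]] for w in words if w in VOCAB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def reorder_probs(prev_src, prev_tgt, cur_src, cur_tgt):
    x = np.concatenate([embed(prev_src), embed(prev_tgt),
                        embed(cur_src), embed(cur_tgt)])
    z = x @ W + b
    z -= z.max()             # numerical stability for softmax
    p = np.exp(z)
    return p / p.sum()

p = reorder_probs(["the"], ["le"], ["cat"], ["chat"])
print(p)
```

Because phrase pairs are embedded in a continuous space, a single classifier serves every pair, which is how the abstract's approach sidesteps the sparsity of per-phrase-pair statistics.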

BIODATA:
Yang Liu is an Associate Professor in the Department of Computer Science and Technology at Tsinghua University. He graduated in 2007 from the Institute of Computing Technology, Chinese Academy of Sciences. His work focuses on natural language processing and machine translation. He has published over 20 papers in leading NLP/AI journals and conferences and received a COLING/ACL 2006 Meritorious Asian NLP Paper Award. He served as Tutorial Co-Chair for ACL 2014 and will be a Local Arrangements Co-Chair for ACL 2015.

ABSTRACT:
"Time" is a well-known concept to many; yet it has also been the subject of a raging debate among philosophers and scientists. One line of thought, attributed to Isaac Newton, sees "time" as a dimension along which events occur in sequence. However, while time is sequential, the manner in which it is presented need not be. As such, it is not trivial to properly understand the temporal relations presented in a piece of text.

I will explain the work that I have done in the interpretation of time from text documents. Specifically, I focus on identifying the relationships between two basic temporal units of text, i.e. 1) time expressions and 2) events. Instead of the surface lexical features typically employed in the state-of-the-art, I show that the use of more semantically motivated approaches can help lead to better performance.

With these improvements, we are still some distance away from a perfect temporal processing system. However, this does not mean that temporal interpretation is not useful. I will explain how it can be used effectively downstream to benefit automatic text summarization. The key idea is that knowledge of these temporal relations helps us identify more accurately the important sentences to include in an automatically generated summary.

BIODATA:
Jun Ping is currently a PhD candidate at the National University of Singapore, working together with A/P Min-Yen Kan.

His research interests include improving natural language applications, such as question answering and summarization, through the use of semantic approaches; he has published several papers in renowned international conferences on these subjects.
Currently he is working on the interpretation of temporal information from text and its application to summarization.

Outside of academia, Jun Ping is keen on advancing the state of the art and the commercialization of technology. With this in mind, he has started several companies dealing with software development and data analysis.

ABSTRACT:
Surprisingly few naturalistic studies exist of how people select books and music 'in the wild', that is, in a physical library, bookstore, or music shop. This presentation reports on a series of observational studies, spanning 10 years, of people interacting with large public collections of books and music. The insights gained into how people prefer to interact with the collections suggest directions for research and development in providing access to digital collections.

BIODATA:
Sally Jo Cunningham is an Associate Professor in the Computer Science Department at Waikato University (New Zealand). She is a founding member of the New Zealand Digital Libraries Research Group, who are the developers of the Greenstone software to support the development and management of digital document collections. Her research primarily focuses on digital library users and their information behaviour, over text, image, video, and music documents; she is particularly interested in how information behaviour changes as people move to digital documents, and in how we can support the 'non-native' behaviour seen with physical collections, in the digital library. Her work is primarily qualitative and ethnographic, though she does indulge in more technically oriented research projects on occasion. She is also an active researcher in the Computer-Human Interaction and Music Information Retrieval communities.

ABSTRACT:
Automatic evaluations form an important part of Natural Language Processing (NLP) research. Designing automatic evaluation metrics is not only an interesting research problem in itself; such metrics also help guide and evaluate algorithms in the underlying NLP task. More interestingly, one approach to tackling an NLP task is to maximize the automatic evaluation score of the output, further strengthening the link between the evaluation metric and the solver for the underlying NLP problem.

In this talk, I introduce TESLA, a very general and versatile linear programming-based framework for various automatic evaluation tasks.

TESLA builds on the basic n-gram matching method of the dominant machine translation evaluation metric BLEU, with several features that target the semantics of natural languages. In particular, we use synonym dictionaries to model word level semantics and bitext phrase tables to model phrase level semantics. We also differentiate function words from content words by giving them different weights.

Variants of TESLA have been devised for many different evaluation tasks: TESLA-M, TESLA-B, and TESLA-F for the machine translation evaluation of European languages; TESLA-CELAB for the machine translation evaluation of languages with ambiguous word boundaries, such as Chinese; TESLA-PEM for paraphrase evaluation; and TESLA-S for summarization evaluation. Experiments show that they are very competitive on the standard test sets in their respective tasks.
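The word-level matching idea (unigram case only) can be sketched as a weighted F-measure in which function words are down-weighted and a toy synonym dictionary stands in for the real lexical resources. This is an illustration of the general technique, not the actual TESLA metric.

```python
# Sketch of weighted unigram matching: function words get low weight and
# synonyms (via a toy canonicalising dictionary) count as matches.
from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is"}
SYNONYMS = {"quick": "fast", "fast": "fast"}   # map words to a canonical form

def normalise(word):
    return SYNONYMS.get(word, word)

def weight(word):
    return 0.1 if word in FUNCTION_WORDS else 1.0

def weighted_match(hyp, ref):
    """Weighted unigram F-measure between a hypothesis and a reference."""
    h = Counter(normalise(w) for w in hyp.lower().split())
    r = Counter(normalise(w) for w in ref.lower().split())
    matched = sum(weight(w) * min(h[w], r[w]) for w in h)
    hyp_mass = sum(weight(w) * c for w, c in h.items())
    ref_mass = sum(weight(w) * c for w, c in r.items())
    prec, rec = matched / hyp_mass, matched / ref_mass
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(weighted_match("the quick dog", "the fast dog"))
```

Here "quick" matches "fast" via the synonym table, so the pair scores a perfect match, while disagreeing on a function word would cost far less than disagreeing on a content word.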

BIODATA:
Liu Chang joined the NUS SoC Computational Linguistics Lab (now Media Research Lab 2) in 2006 as a Final Year Project student, and later continued his research as a PhD student under the supervision of Dr. Ng Hwee Tou. He has published on a variety of Natural Language Processing topics, including Semantic Role Labeling, Machine Translation, Machine Translation Evaluation, and Summary Evaluation. His main research interests lie in the robust statistical processing of natural language semantics.

ABSTRACT:
Latent Dirichlet Allocation (LDA) provides a means of learning the latent structure of documents and document collections by modelling each document as a mixture of topics, and each topic as a mixture of words. It was originally proposed in the machine learning community, largely independently of NLP, but has seen strong adoption within NLP circles. In this talk, I will discuss the interaction between LDA-based topic modelling and NLP in two parts. First, I will describe work where we apply topic modelling to the NLP tasks of word sense induction and novel word sense detection, achieving state-of-the-art results on both tasks. Second, I will describe attempts to improve topic modelling evaluation through the automatic rating of topics and topic models based on notions such as topic coherence and topic interpretability.

BIODATA:
Prof Timothy Baldwin is a Professor in the Department of Computing and Information Systems, The University of Melbourne, an Australian Research Council Future Fellow, and a contributed research staff member of the NICTA Victoria Research Laboratories. He has previously held visiting positions at the University of Washington, University of Tokyo, Saarland University, and NTT Communication Science Laboratories. His research interests include text mining of social media, computational lexical semantics, information extraction and web mining, with a particular interest in the interface between computational and theoretical linguistics. Current projects include web user forum mining, text mining of Twitter, and intelligent interfaces for Japanese language learners. He is currently Secretary of the Australasian Language Technology Association and a member of the Executive Committee of the Asian Federation of Natural Language Processing, and was PC Chair of EMNLP 2013.

Tim completed a BSc(CS/Maths) and BA(Linguistics/Japanese) at The University of Melbourne in 1995, and an MEng(CS) and PhD(CS) at the Tokyo Institute of Technology in 1998 and 2001, respectively. Prior to joining The University of Melbourne in 2004, he was a Senior Research Engineer at the Center for the Study of Language and Information, Stanford University (2001-2004).

ABSTRACT:
In a standard laboratory visual search experiment, observers look for a target among some number of distractor items in a search that lasts for, perhaps, one second. In the real world, however, search tasks can be quite different. There may be many targets (think of a shopping list). There may be multiple searches through the same scene (think of finding all the birds in a tree). The target may be ill-defined (think of looking for "threats" at the airport). In all of these cases, the search is likely to extend over a longer period than the classic "trial" in a laboratory study. This talk will describe some of the rules of these extended search tasks. NOTE: This talk does not assume that you have attended my first talk.

BIODATA:
Jeremy Wolfe graduated summa cum laude from Princeton in 1977 with a degree in Psychology and went on to obtain his PhD in 1981 from MIT, studying with Richard Held. His PhD thesis was entitled "On Binocular Single Vision". Wolfe remained at MIT until 1991. During that period, he published papers on binocular rivalry, visual aftereffects, and accommodation. In the late 1980s, the focus of the lab shifted to visual attention. Since that time, his research has focused on visual search and visual attention, with a particular interest in socially important search tasks in areas such as medical image perception (e.g. cancer screening) and security (e.g. baggage screening). In 1991, Wolfe moved to Brigham and Women's Hospital, where he is Director of the Visual Attention Lab and the Center for Advanced Medical Imaging. At Harvard Medical School, he is Professor of Ophthalmology and Professor of Radiology. His work is currently funded by the US National Institutes of Health, the Office of Naval Research, Toshiba Corporation, the National Geospatial Agency, and Google. He has published over 130 peer-reviewed papers, 1 textbook, and 31 book chapters. Wolfe has taught Psychology courses at MIT & Harvard.

Jeremy Wolfe is Past-President of the Eastern Psychological Association, President of Division 3 of the American Psychological Association, and editor of the journal "Attention, Perception and Psychophysics". He was Chair of the NRC Panel on Soldier Systems (Army Research Lab Technical Assessment Board). He is chair-elect of the Psychonomic Society. He won the Baker Memorial Prize for teaching at MIT in 1989. He is a fellow of the AAAS, the American Psychological Association (Div. 3 & 6), and the American Psychological Society, and a member of the Society for Experimental Psychologists. He lives in Newton, Mass.

ABSTRACT:
This presentation discusses the key insights from a 4-year project in the area of multi-document summarization and presents a framework for the multi-document summarization of research papers which emulates human summarizing methods. It addresses the gap identified between the structure and readability of human-written summaries and other automatic multi-document summaries, which focus only on selecting the most important information from the set of documents but neglect its readability. In the context of this overall goal, the first part of the study developed a literature review framework from a discourse analysis of the structural, rhetorical, conceptual and content characteristics of human-written literature reviews. The corpus for a 4-level manual content analysis comprised 120 literature review sections published as part of research papers in top international peer-reviewed information science journals over the years 2000-2008. The macro-level analysis identified 9 types of discourse elements within a literature review and conducted a sequence analysis to develop document-level templates. The sentence-level analysis identified 22 rhetorical functions employed in literature reviews and 153 linguistic devices that frame information within sentences. The conceptual analysis developed a semantic paradigm for representing research information which can be instantiated for specific research domains. The information analysis identified significant associations between the source sections of selected sentences and the transformations performed on them. Results showed that literature reviews are written in two main styles: integrative literature reviews and descriptive literature reviews. Integrative literature reviews present information from several studies in a condensed form as a critical summary, possibly complemented with a comparison, evaluation or comment on the research gap.
They focus on highlighting relationships amongst concepts or comparing studies against each other. Descriptive reviews present more experimental detail about previous studies, such as their approach, results and evaluation. These findings are incorporated into a multi-level literature review framework, comprising document templates, sentence templates and information selection and summarization strategies.

The second part of the study partially implemented and evaluated this framework to generate multi-document summaries of research papers that emulate some characteristics of human-written literature reviews. The result was an integrative summary that emulated characteristics of human literature reviews: it combined research objective information across the papers and highlighted the similarities and differences among them. Selected information was organized in a topic tree, and sentences were instantiated using templates which fulfilled rhetorical functions.

Automatic content evaluation showed no significant difference between the summaries generated by the automatic method and the baseline sentence extraction system; however, the quality characteristics of the automatic summaries were a significant improvement over the baseline, as they were perceived as significantly more useful for obtaining a research overview or seeing comparisons across studies. The automatic summaries were also considered more readable in the way they relate topics and sentences to each other. Finally, this presentation identifies the delimitations of the study and suggests areas for future research and application.

BIODATA:
Kokil is a Senior Computer Scientist at Adobe Research, Bangalore, India.
She obtained her B.Engg (Information Technology) degree from India in 2008 and submitted her PhD (Information Studies) dissertation in 2013 at Nanyang Technological University, Singapore. She specializes in summarization and text generation frameworks, applied linguistics and discourse, and the interactive effects of communication and information behaviour. Besides summarization, she has conducted research in gamification systems and solving the motion-planning problem for robots in multi-dimensional space. In her current research, she is developing frameworks for social media analysis, to explore the roles and behavior of users during important national or global events.

ABSTRACT:
The visual system takes in far more information than the brain can process. As a consequence, an observer may need to search for a desired item, even if it is clearly visible in the current visual scene. How do we perform such searches? Our attention is guided to items that have basic features of the target of our search. What are those features and how are they used in search? This talk will show that the rules of the human search engine are quite different from the capabilities of the visual system as a whole. Guidance is based on a limited set of attributes that are coarsely coded and that must be combined in some quite restricted ways. Nevertheless, this search engine enables us to guide our attention quite efficiently.

BIODATA:
Jeremy Wolfe graduated summa cum laude from Princeton in 1977 with a degree in Psychology and went on to obtain his PhD in 1981 from MIT, studying with Richard Held. His PhD thesis was entitled "On Binocular Single Vision". Wolfe remained at MIT until 1991. During that period, he published papers on binocular rivalry, visual aftereffects, and accommodation. In the late 1980s, the focus of the lab shifted to visual attention. Since that time, his research has focused on visual search and visual attention with a particular interest in socially important search tasks in areas such as medical image perception (e.g. cancer screening) and security (e.g. baggage screening). In 1991, Wolfe moved to Brigham and Women's Hospital where he is Director of the Visual Attention Lab and the Center for Advanced Medical Imaging. At Harvard Medical School, he is Professor of Ophthalmology and Professor of Radiology. His work is currently funded by the US National Institutes of Health, the Office of Naval Research, Toshiba Corporation, the National Geospatial-Intelligence Agency, and Google. He has published over 130 peer-reviewed papers, 1 textbook, and 31 book chapters. Wolfe has taught Psychology courses at MIT & Harvard.

Jeremy Wolfe is Past-President of the Eastern Psychological Association, President of Division 3 of the American Psychological Association, and editor of the journal "Attention, Perception and Psychophysics". He was Chair of the NRC Panel on Soldier Systems (Army Research Lab Technical Assessment Board). He is chair-elect of the Psychonomic Society. He won the Baker Memorial Prize for teaching at MIT in 1989. He is a fellow of the AAAS, the American Psychological Association (Div. 3 & 6), the American Psychological Society, and a member of the Society for Experimental Psychologists. He lives in Newton, Mass.

2013

ABSTRACT:
Search result ranking is one of the major concerns in search engine research, and click model construction, which aims to improve ranking performance with the help of the implicit relevance feedback contained in click-through logs, has received much attention. However, most existing click models assume that all search results are homogeneous and are therefore unable to deal with search results in rich media formats or containing interaction functions. In contrast to the prevailing approaches, we propose a different click model construction framework that takes differences in result presentation, user preference and query information need into consideration. By collecting and analyzing both eye-tracking data and large-scale user behavior data, we plan to look into the practical information acquisition process of search users and extract behavior features that describe the heterogeneous nature of click-through behavior. This talk will introduce some of our recent work and also share a few new ideas on modeling users' interaction process with search engines.

BIODATA:
Dr. Yiqun Liu obtained his Ph.D. degree from Tsinghua University in Beijing, China, where he is now an associate professor and vice dean for student affairs. His research focuses on information retrieval, data mining and natural language processing, especially Web search technology and Web user behavior analysis. He has published a number of high-quality papers in TWeb, JASIST, JIR, SIGIR, IJCAI, WWW, and other important journals and conferences (Google Scholar citations: 700+). He has filed over 20 Chinese patents and achieved a number of promising results through cooperation with Sogou, Baidu and other popular search engines in China. He is also a principal investigator of the Tsinghua-NUS NExT research center and a senior member of the CCF (China Computer Federation).

ABSTRACT:
Texts are the basis of human knowledge representation, including data analysis results and interpretation by experts, criticism and opinions, and procedures and instructions. We have been working on the realization of knowledge-intensive structural NLP, which can extract truly valuable knowledge for human beings from an ever-growing volume of texts, known recently as Big Data. This talk introduces several of our ongoing projects concerning knowledge-intensive structural NLP: synonymous expression, case frame and event relation acquisition from 15G parsed sentences, ellipsis resolution considering exophora and author/reader information, an open search engine infrastructure TSUBAKI, and an information analysis system WISDOM.

BIODATA:
Sadao Kurohashi received the B.S., M.S., and PhD in Electrical Engineering from Kyoto University in 1989, 1991 and 1994, respectively. He was a visiting researcher at IRCS, University of Pennsylvania, in 1994. He is currently a professor of the Graduate School of Informatics at Kyoto University. His research interests include natural language processing, knowledge acquisition/representation, and information retrieval. He received the 10th anniversary best paper award from the Journal of Natural Language Processing in 2004 and an IBM Faculty Award in 2009.

ABSTRACT:
The main aim of this thesis is to propose a text rewriting decoder, and then apply it to two applications: social media text normalization for machine translation, and source language adaptation for resource-poor machine translation.
In the first part of this thesis, we propose a text rewriting decoder based on beam search. The decoder can be used to rewrite texts from one form to another. In contrast to the beam-search decoders widely used in statistical machine translation (SMT) and automatic speech recognition (ASR), the text rewriting decoder works on the sentence level, so it can use sentence-level features, e.g., the language model score of the whole sentence.
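
The sentence-level decoding idea can be sketched as follows; this is a minimal illustration with hypothetical expand (candidate rewrites) and score (whole-sentence scoring) functions, not the thesis's actual models:

```python
import heapq

def beam_rewrite(sentence, expand, score, beam_size=3, steps=2):
    """Sketch of a sentence-level beam-search rewriter.

    expand(sent) -> iterable of candidate rewrites (one edit applied);
    score(sent)  -> sentence-level score (e.g. a whole-sentence LM score).
    """
    beam = [sentence]
    for _ in range(steps):
        candidates = set(beam)
        for sent in beam:
            candidates.update(expand(sent))
        # prune with a sentence-level score, not word-by-word as in SMT/ASR decoders
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return beam[0]

# toy normalization example: rewrite "u" -> "you"; score counts formal words
FORMAL = {"you", "are", "late"}

def expand(s):
    words = s.split()
    for i, w in enumerate(words):
        if w == "u":
            yield " ".join(words[:i] + ["you"] + words[i + 1:])

def score(s):
    return sum(w in FORMAL for w in s.split())

print(beam_rewrite("u are late", expand, score))  # -> "you are late"
```

Because scoring happens on whole candidate sentences, features like a full-sentence language model score drop in naturally, which is the point of contrast with word-synchronous SMT/ASR decoders.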
We then apply the proposed text rewriting decoder to social media text normalization for machine translation in the second part of this thesis. Social media texts are written in an informal style, which hinders other natural language processing (NLP) applications such as machine translation. Text normalization is thus important for processing social media text. Previous work mostly focused on normalizing words by replacing an informal word with its formal form. To further improve downstream NLP applications, we argue that other normalization operations should also be performed, e.g., punctuation correction and missing word recovery. The proposed text rewriting decoder is adopted to effectively integrate the various normalization operations. In our experiments, we achieved statistically significant improvements over two strong baselines on both social media text normalization and translation tasks, for both Chinese and English.
In the third part of this thesis, our text rewriting decoder is applied to source language adaptation for resource-poor machine translation. As most of the world's languages remain resource-poor for machine translation, and many resource-poor languages are related to resource-rich languages, we propose to apply the text rewriting decoder to source language adaptation for resource-poor machine translation. Specifically, the text rewriting decoder attempts to improve machine translation from a resource-poor language POOR to a target language TGT by adapting a large bi-text for a related resource-rich language RICH and the same target language TGT. We assume a small POOR-TGT bi-text, which is used to learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and resource-poor languages. Our work is important for resource-poor machine translation, since it provides a useful guideline for building machine translation systems for resource-poor languages.

BIODATA:
Wang Pidong is a PhD candidate in the Department of Computer Science at the National University of Singapore. His main research area is natural language processing (NLP), with a focus on machine translation and social media text normalization. He is also the owner of NLP CONSULTANT, a Singapore company which provides NLP consultancy.

ABSTRACT:
Web user forums (or simply "forums") are online platforms for people to discuss and obtain information via a text-based threaded discourse, generally in a pre-determined domain (e.g. IT support or DSLR cameras). Due to the sheer scale of the data and the complex thread structure, it is often hard to extract and access relevant information from forums. To address this problem, we propose the task of automatically parsing the discourse structure of forum threads, for the purpose of enhancing information access and solution sharing over web user forums.

The discourse structure of a forum thread is modelled as a rooted directed acyclic graph (DAG), and each post in the thread is represented as a node in this DAG. The reply-to relations between posts are then denoted as directed edges (LINKs) between nodes in the DAG, and the type of a reply-to relation is defined as a dialogue act (DA). To parse the discourse structure of threads, we take several approaches. The first method uses conditional random fields (CRFs) to either classify the LINK and DA separately and compose them afterwards, or classify the combined LINK and DA directly. Another technique we adopt is to treat this discourse structure parsing as a dependency parsing problem, which is the task of automatically predicting the dependency structure of a token sequence, in the form of binary asymmetric dependency relations with dependency types. We obtain high discourse structure parsing F-scores with the proposed methods.
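
As a minimal illustration of this representation, a thread can be encoded as posts with reply-to links, each labelled with a dialogue act; the post texts, DA labels, and the is_dag check below are invented for illustration, not the thesis's actual annotation scheme:

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    pid: int
    text: str
    links: list = field(default_factory=list)  # (parent_pid, dialogue_act) pairs

# a toy thread: post 0 asks a question, posts 1-2 reply, post 3 confirms a fix
thread = [
    Post(0, "My router keeps dropping the connection.", []),
    Post(1, "Try updating the firmware.", [(0, "Answer")]),
    Post(2, "Does it happen on wifi only?", [(0, "Question-Add")]),
    Post(3, "Firmware update fixed it, thanks!", [(1, "Answer-Confirm")]),
]

def is_dag(posts):
    """Reply-to links must point at earlier posts, so the graph is acyclic."""
    return all(parent < p.pid for p in posts for parent, _ in p.links)

print(is_dag(thread))  # True
```

A LINK classifier then predicts the edges of this graph, a DA classifier predicts the edge labels, and the dependency-parsing view treats the post sequence as tokens whose typed head attachments are exactly these labelled edges.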

Furthermore, we investigate ways of using thread discourse structure information to improve information access and solution sharing over web user forums. In particular, we explore the tasks of thread Solvedness classification (i.e. whether the problem asked in a thread is solved or not), and thread-level information retrieval over forums. Our experiments show that using the discourse structure information of forum threads can benefit both tasks significantly.

BIODATA:
Li Wang is a final year PhD student from the Language Technology Group of The University of Melbourne. His research interests lie in knowledge discovery and extraction from social media data. Currently, his research mainly focuses on improving information access over web user forums. Li obtained a BE from Wuhan University and a Master of Information Technology from The University of Melbourne. He has published papers at top-tier NLP conferences including EMNLP, COLING, IJCNLP and CoNLL, and was awarded a Google Plenary Highlight Paper Award. To learn more about his research, please visit http://research.liwang.info

ABSTRACT:
To achieve the aims of this thesis, we first investigate how semantic similarity measures differ from sentiment similarity measures. We find that although semantic similarities are good measures for relating semantically related words, they are less effective in relating words with similar sentiment. This result points to the need for a sentiment similarity measure. We therefore model words in an emotional space, employing the association between the semantic and emotional spaces of word senses to infer their emotional vectors. These emotional vectors are used to predict the sense sentiment similarity of the words. To map words into emotional vectors, we first employ the set of basic human emotions that are central to other emotions: anger, disgust, sadness, fear, guilt, interest, joy, shame and surprise. We then assume that the number and types of the emotions are hidden, and propose hidden emotional models for predicting the emotional vectors of words and interpreting the hidden emotions so as to infer sense sentiment similarity.
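
As a toy sketch of the idea, assume each word already has a 9-dimensional emotional vector over the basic emotions listed above; sentiment similarity can then be computed as vector similarity (the vectors here are invented for illustration, not inferred from data as in the thesis):

```python
import math

# hypothetical emotion axes, following the basic-emotion set named above
EMOTIONS = ["anger", "disgust", "sadness", "fear", "guilt",
            "interest", "joy", "shame", "surprise"]

# toy emotional vectors: one weight per emotion axis
vectors = {
    "delighted": [0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.9, 0.0, 0.2],
    "cheerful":  [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.8, 0.0, 0.1],
    "furious":   [0.9, 0.3, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1],
}

def sentiment_similarity(w1, w2):
    """Cosine similarity of two words' emotional vectors."""
    a, b = vectors[w1], vectors[w2]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(sentiment_similarity("delighted", "cheerful") >
      sentiment_similarity("delighted", "furious"))  # True
```

Comparing full emotional vectors in this way distinguishes words that a coarse positive/negative score would conflate, which is the motivation for going beyond overall sentiment.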

Experimental results on the IQAP inference and SO prediction tasks show that sense sentiment similarity is more effective than semantic similarity measures. The experiments indicate that utilizing the emotional vectors of words is more accurate than comparing their overall sentiments in IQAP inference. In addition, in SO prediction, we obtain results comparable to the state-of-the-art approach when we employ sense sentiment similarity along with a simple algorithm to predict the sentiment orientation.

BIODATA:
Mitra Mohtarami is a Ph.D. candidate under the supervision of Prof Chew-Lim
Tan at the School of Computing, Department of Computer Science, National
University of Singapore. She is part of the Artificial Intelligence Research
Lab: Centre for Information Mining and Extraction. Her research interests
are in the areas of Sentiment and Emotion Analysis, Text Mining, Social
Media Analysis and Natural Language Processing. Her Ph.D. research focuses
on inferring sense sentiment similarity through emotion analysis and
studying its effectiveness in natural language processing.

ABSTRACT:
With millions of apps in app stores, recommender systems that help a user discover new and interesting apps matching his or her interests become increasingly important. However, traditional recommender systems do not account for the unique characteristics of the app domain, which, if properly utilized, could provide significant improvements over current state-of-the-art recommendation techniques.

In this proposal, we look into the issue of improving the recommendation accuracy of apps, with the goal of exploiting unique features of the app domain such as the presence of nascent information in microblogs and version updates of apps. Specifically, we tackle the cold-start problem that plagues newly-released apps deprived of ratings, and explore the use of version updates to provide more precise recommendations, which we term “version-sensitive” recommendation. Finally, as these two works apply in different situations (one to the cold start, the other to apps with more than one version update), we propose a unified framework that marries the techniques of the aforementioned works and optimizes overall recommendation accuracy by considering all factors pertinent to a given situation.

BIODATA:
Jovian joined the NExT Search Centre at NUS as a Research Assistant. He received his B.Sc. degree from the School of Computing, National University of Singapore, in 2009, where his final year project (FYP), titled "Human Computation through Game Playing", won the NUS School of Computing’s FYP-Innovation Award in the same year. He went on to pursue his Ph.D. at NUS under the supervision of Prof. Chua Tat Seng and A/P Kan Min Yen. Jovian’s research interests include app recommendation under cold-start and sparse conditions, and using social data to boost recommendation in app stores. More info can be found at: http://jovianlin.com

ABSTRACT:
I study Chinese informal language text, with the goal of building natural language tools to process it effectively. By informal text (commonly referred to as "microtext" given its prevalence in microblogging environments) I refer to language used in today's digital environments, such as blogging, instant messaging and chat. In particular, I tackle the problems of Chinese word segmentation, informal word recognition and normalization, and named entity recognition. By devising novel approaches for these individual components and integrating them, I aim to provide an end-to-end solution to the challenges raised by the informality of microtext, and to propose a framework that processes Chinese microtext at the lexical level.

The key contributions of my work thus far are: 1) an enhanced word segmenter which outperforms both research and commercial state-of-the-art on segmenting Chinese microtext; 2) the first (to the best of
my knowledge) informal word recognizer, and an automatically generated lexicon of Chinese informal words that resulted from running the recognizer; and 3) a general solution to effectively normalize Chinese informal words originating from three major channels.

BIODATA:
Aobo Wang is a Ph.D. candidate under the supervision of Prof Min-Yen Kan in School of Computing, Department of Computer Science, National University of Singapore.
His research interests include crowdsourcing, natural language processing and information retrieval.
Specifically, he is focusing on enhancing NLP technology to process the informal text in social media.

ABSTRACT:
In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature, which focuses only on event pairs found within the same or adjacent sentences. To achieve this, we leverage discourse analysis, as we believe it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVMs) paired with convolution kernels. Experiments show that our proposal improves significantly on the state-of-the-art, by as much as 16% in terms of F1, even when we adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%.

BIODATA:
Jun Ping is a PhD candidate under the supervision of A/P Min-Yen Kan at the National University of Singapore. He has a keen interest in improving and applying natural language processing (NLP) applications such as text summarization and question-answering. In earlier work he had worked on and released open-source systems for these applications. Currently his research focus is on the interpretation of temporal information within text.

ABSTRACT:
Fuji Xerox Communication Technology Laboratory (CTL) aims to develop technologies to extract valuable information from large-scale text data. The NLP team in CTL works with three kinds of text data. The first is daily sales reports entered by salespeople into an SFA system. The second is microblog text, and the third is clinical text. This talk introduces NLP technologies and applications for these texts.

BIODATA:
Tomoko Ohkuma is a research leader of the Communication Technology Laboratory at Fuji Xerox. Her main research interest is information extraction based on natural language processing (NLP). She is one of the organizers of NTCIR-11 MedNLP-2. She was a part-time lecturer at Tokyo Woman's Christian University for four years.

She earned her Ph.D. and M.A. from Keio University, where she majored in NLP. She also earned a B.A. in linguistics from Tokyo Woman's Christian University.

ABSTRACT:
In this talk I give an overview of efforts on discourse parsing to produce complete discourse structures for texts. Discourse structures are viewed as graphs, with nodes as discourse units and edges as relations between units; such structures have been shown to have an important impact on semantic interpretation. Discourse parsing can be thought of as involving three tasks: the segmentation of a text into elementary discourse units (EDUs), the attachment of EDUs to form unlabelled graphs in which EDUs can function together as larger discourse units, and the labeling of edges in those structures with discourse relations. I will compare various efforts on these tasks at both the theoretical and practical level with our own work, and sketch a method for exploiting corpora with different discourse annotation schemes in discourse parsing.

BIODATA:
Nicholas Asher did his Ph.D. at Yale University and then spent almost a quarter of a century at the University of Texas at Austin, where he was professor of Philosophy and of Linguistics, before moving to France to take a director of research position at the CNRS. He works on topics at the intersection of linguistics, computer science and philosophy, in particular discourse structure and interpretation, but also formal semantics and pragmatics. Currently, he is directing a project on Strategic Conversation funded by the ERC, which brings together work on games and linguistics.

ABSTRACT:
In this talk, we introduce several attempts to mine social media for analyzing (and sometimes predicting) real-world events such as elections, markets, and natural disasters. For example, in our studies, mining Twitter enables us to detect earthquakes promptly. It also gives us insight into which manga or anime are popular among Asian countries. The latter half of the talk will cover an overview of our technology transfer trials in operating web services such as SPYSEE, the most popular people search engine in Japan, and READYFOR, the largest crowdfunding platform in Japan. The talk concludes with our findings from these attempts and potential research topics for the future.

BIODATA:
Yutaka Matsuo is an associate professor at the University of Tokyo (UT), working on Web technology and artificial intelligence. He received his Ph.D. degree from the University of Tokyo in 2002 and was a visiting scholar at Stanford from 2005 to 2007. He has served as editor-in-chief of the Japanese Society for Artificial Intelligence (JSAI) since 2012. He has been a program committee member of WWW, WSDM, and IJCAI for several years, and is a track chair of the Web mining track at WWW 2014 (International World Wide Web Conference).

ABSTRACT:
Traditional supervised learning methods rely on manually labeled structures for training. Unfortunately, manual annotation is often expensive and time-consuming for large amounts of rich text. Inducing structures automatically from unannotated sentences is therefore of great value for NLP research.

In this thesis, I first introduce and analyze existing methods for structure induction, then present our explorations of three unsupervised structure induction tasks: transliteration equivalence learning, constituency grammar induction and dependency grammar induction.

In transliteration equivalence learning, transliterated bilingual word pairs are given without internal syllable alignments. The task is to automatically infer the mapping between syllables in source and target languages. This dissertation addresses problems of the state-of-the-art grapheme-based joint source-channel model, and proposes Synchronous Adaptor Grammar (SAG), a novel nonparametric Bayesian learning approach for machine transliteration. This model provides a general framework to automatically learn syllable equivalents without heuristics or restrictions.

Constituency grammar induction is useful since annotated treebanks are available for only a few languages. This dissertation focuses on the effective Constituent-Context Model (CCM) and proposes to enrich this model with linguistic features. The features are defined in log-linear form with local normalization, so the efficient Expectation-Maximization (EM) algorithm remains applicable. Moreover, we advocate using a separate development set (a.k.a. the validation set) to perform model selection, and measuring the trained model on an additional test set. Under this framework, we can automatically select a suitable model and parameters without setting them manually. Empirical results demonstrate that the feature-based model overcomes the data sparsity problem of the original CCM and achieves better performance using compact representations.
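
The model-selection protocol advocated above can be sketched in a few lines; the settings and scores below are invented placeholders, not results from the dissertation:

```python
# toy records: model setting -> (dev_score, test_score); in practice these
# would come from training CCM variants and evaluating bracketing accuracy
runs = {
    "ccm-baseline":      (61.2, 60.8),
    "ccm+features-l2=1": (64.5, 63.9),
    "ccm+features-l2=5": (63.1, 63.5),
}

def select_model(runs):
    """Pick the setting with the best dev score; report only its test score.

    The test set is never used for selection, which avoids overfitting the
    choice of model and hyperparameters to the evaluation data.
    """
    best = max(runs, key=lambda name: runs[name][0])
    return best, runs[best][1]

print(select_model(runs))  # ('ccm+features-l2=1', 63.9)
```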

Dependency grammars model word-word dependencies, which are useful for high-level tasks such as relation extraction and coreference resolution. This dissertation investigates Combinatory Categorial Grammar (CCG), an expressive lexicalized grammar formalism which is able to capture long-range dependencies. We introduce boundary part-of-speech (POS) tags into the baseline model (Bisk and Hockenmaier, 2012) to capture lexical information.

For learning, we propose a Bayesian model to learn CCG grammars; the full EM and k-best EM algorithms are also implemented and compared. Experiments show that the boundary model improves dependency accuracy for all three learning algorithms. The proposed Bayesian model outperforms the full EM algorithm, but underperforms the k-best EM algorithm.

In summary, this dissertation investigates unsupervised learning methods, including Bayesian learning models and feature-based models, and provides novel ideas for unsupervised structure induction in natural language processing. The automatically induced structures may help subsequent NLP applications.

BIODATA:
Huang Yun is currently a PhD student jointly supervised by Prof. Tan
Chew Lim from School of Computing NUS, and Dr. Zhang Min from
Institute for Infocomm Research (I2R). His research interests include
unsupervised grammar induction, natural language processing, and machine
learning.

ABSTRACT:
We address the problem of informal word recognition in Chinese microblogs. A key problem is the lack of word delimiters in Chinese. We exploit this challenge as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.

The paper will be presented at ACL, the Annual Meeting of the Association for Computational Linguistics.

BIODATA:
Wang Aobo has been a Ph.D. student under the supervision of A/P Min-Yen Kan in the School of Computing, Department of Computer Science, National University of Singapore since August 2009.
His research interests include natural language processing and information retrieval, specifically focusing on studying and enhancing NLP technology to process the language in social media.
He is also interested in human computation and crowdsourcing.

He is affiliated with the Web, Information Retrieval / Natural Language Processing Group (WING) and the China-Singapore Institute of Digital Media (CSIDM).

ABSTRACT:
As massive repositories of User Generated Content (UGC), social media platforms are arguably the most active networks of interaction, content sharing, and news propagation, and best represent the everyday thoughts, opinions and experiences of their users. Rapid analysis of such content is thus critical for user-centric organizations and businesses, as relevant social media content provides actionable insights for them.

This thesis develops effective algorithms to address several challenging problems in the online analysis of social media content for organizations. In particular, the main focus of this thesis is on: (a) effective sentiment analysis by mining new opinion words from UGC, and (b) online learning of the context of organizations (including topics, users, and communities) in social media.

A unified framework is proposed to tackle the above issues. In particular, a semantic similarity measure based on the interchangeability of words in context is proposed to mine new opinion words from UGC. Furthermore, we propose algorithms to model the context of organizations (characterized by content and user information) in order to accurately identify relevant micro-posts about organizations, mine their topics, and identify user communities of organizations. In particular, an optimization algorithm is proposed to mine the smooth evolution of emerging and evolving topics about organizations. We also propose an effective approach to community detection for organizations in social media, based on the topical and social relationships among users. We show that such relations are highly effective in mining user communities and ranking users for organizations.

Extensive experiments on different kinds of UGC show the effectiveness of the proposed approaches. In particular, we show that mining slang and urban opinion words and phrases significantly improves the performance of sentiment classification on UGC. Furthermore, we show that user and content information are key factors in effectively modeling social media content for organizations.
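As a toy illustration of the interchangeability idea described above (not the thesis's actual measure; the function names and the Jaccard simplification are ours), two words can be scored by the overlap of the contexts in which they occur:

```python
def context_profile(corpus, target, window=2):
    """Collect the set of (left, right) context windows in which
    `target` occurs. `corpus` is a list of tokenized sentences."""
    contexts = set()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == target:
                left = tuple(sent[max(0, i - window):i])
                right = tuple(sent[i + 1:i + 1 + window])
                contexts.add((left, right))
    return contexts

def interchangeability(corpus, w1, w2, window=2):
    """Jaccard overlap of the contexts of two words: words that can
    be swapped into the same contexts score high. A slang spelling
    like 'awsum' thus aligns with known opinion words."""
    c1 = context_profile(corpus, w1, window)
    c2 = context_profile(corpus, w2, window)
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / len(c1 | c2)
```

Note that a purely distributional measure like this also scores antonyms as interchangeable, which is one reason a real opinion-word miner needs additional polarity signals.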

BIODATA:
Hadi is a PhD candidate at the NUS Graduate School for Interactive Science and Engineering (NGS), National University of Singapore. His research interests are in the areas of Social Media Analysis, Information Retrieval, and Sentiment Analysis. He is part of the Lab for Media Search (LMS) and his adviser is Prof. Tat-Seng Chua.

ABSTRACT:
The hallmark of Web 2.0 is the incorporation of the role of readers in generating content such as comments, votes, and other social actions. This characteristic of the new Web has led to a much more dynamic and immediate sense of popularity. To best satisfy a search query in Web 2.0, a ranking engine must properly account for these temporal dynamics.

Taking page views as a surrogate for item popularity, we adopt the goal of ranking items by their predicted future view count. We propose to predict popularity by exploiting user comments, as visit histories are expensive to obtain externally.

In this presentation, we first give a comprehensive review of popularity prediction work, which we categorize into three types: statistics-based, classification-based, and model-based approaches. We further review the literature on ranking items using the comment signal, to shed light on the nature of comments and how to use them for prediction. Then, we propose a novel ranking algorithm for popularity prediction based on user-generated comments. We validate our approach on three real-world datasets crawled from YouTube, Flickr, and Last.fm. The experiments show that our approach yields promising results, consistently outperforming state-of-the-art baselines. Our approach is simple, may complement content-based ranking algorithms, and opens up a new avenue of research possibilities.
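To illustrate the general idea of comment-based popularity prediction (a minimal sketch, not the talk's actual algorithm; the log-linear form and function names are our assumptions), one can fit a predictor of future view count from early comment counts:

```python
import math

def fit_log_linear(early_comments, future_views):
    """Fit log(views) ~ a + b * log(1 + comments) by ordinary
    least squares over a set of training items."""
    xs = [math.log1p(c) for c in early_comments]
    ys = [math.log(v) for v in future_views]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def predict_views(a, b, comments):
    """Predicted future view count for an item, given its early
    comment count; items are then ranked by this value."""
    return math.exp(a + b * math.log1p(comments))
```

Ranking by `predict_views` rather than by current view count is what lets the comment signal stand in for the (externally unavailable) visit histories.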

BIODATA:
Xiangnan He is a PhD student in the School of Computing, National University of Singapore, under the supervision of A/P Min-Yen Kan. His research interests lie predominantly in improving information retrieval applications such as ranking and recommendation through the mining of rich user-generated content. Currently, he is focusing on discovering knowledge from user comments for information retrieval.

ABSTRACT:
Being able to interpret temporal information conveyed in text is essential to text reading and comprehension. In this work, I study the problem of interpreting temporal information within text, with the goal of exploiting this knowledge to enhance applications such as multi-document summarization.

In particular, I tackle the problems of temporal relationship classification between an event and a time expression, as well as between two event expressions. Putting the results of these two together allows us to build a logical representation of the temporal information within a piece of text. With this representation, I hope to propose a framework which makes use of such temporal information to improve multi-document summarization.

BIODATA:
Jun Ping is a PhD candidate in the School of Computing, National University of Singapore, under the supervision of A/P Min-Yen Kan. His research interests lie predominantly in improving natural language applications such as question-answering and summarization through the use of semantic approaches. Currently he is focusing on the interpretation of temporal information within text.

ABSTRACT:
We examine two prominent problems in domain-specific
information retrieval: Resource Categorization and Text-to-Construct
Linking.
The first problem refers to the categorization of domain-specific
resources at multiple granularities. This helps a search engine
better meet specific user needs by directing task-relevant materials
to the user and organizing its presentation of search results by more
pertinent metadata criteria.
The second problem refers to the resolution of domain-specific
concepts to their related domain-specific constructs. This allows
constructs to properly influence relevance ranking in search results,
without troubling the user to input them in the potentially awkward
construct syntax.
We propose to model the domain-specific resources with a multi-layered
graphical framework. The nodes in the model represent the
characteristics associated with domain-specific resources, while the
edges represent the correlations among the characteristics. Our model
is not only expressive in providing a unified view of the
characteristics and correlations but also flexible in the choice of
computational mechanisms.
Guided by the model, we carry out our research on the two
aforementioned problems as follows. For Resource Categorization, we
use the key information extraction problem in healthcare as a case
study on the categorization of correlated nominal facets. We exploit
the correlation between two categorizations at different granularities
(i.e., sentence-level and word-level) by propagating information from
one to the other sequentially or simultaneously. In addition, we use
the readability measurement problem as a case study on the
categorization of ordinal facets. We exploit the correlation between
the readability of domain-specific resources and the difficulty of the
domain-specific concepts through an iterative computation algorithm.
For Text-to-Construct Linking, we tackle the linking of math concepts
to their representations in math expressions. We exploit the
correlation between the observable characteristics of a
concept-expression pair and its relation type using supervised
learning.
To demonstrate the applicability and usefulness of our research, we
have also implemented two domain-specific search systems, one in math
and the other in healthcare. The math system incorporates the
categorization of resource type and readability as well as
text-to-expression linking, while the healthcare system categorizes
healthcare resources at multiple granularities to pull out relevant
information for applicability and validity assessment, and has
additional features, such as dual interface for active/passive search,
for better workflow integration.

BIODATA:
Jin Zhao is a PhD candidate in the School of Computing, National
University of Singapore (NUS), under the supervision of A/P Min-Yen
Kan. His research interest is in domain-specific information retrieval,
with a focus on math and nursing.

ABSTRACT:
Semantic parsing is the process of mapping a natural language sentence into its formal meaning
representation. It can be helpful for machine translation, question answering, and other tasks, and has
received increasing attention in recent years. In this talk, I will give a brief introduction to existing
semantic parsing algorithms and present a new grammar induction algorithm for supervised
semantic parsing.

Synchronous context-free grammar augmented with lambda calculus (lambda-SCFG)
provides an effective mechanism for semantic parsing; however, learning such
lambda-SCFG rules remains a challenge because of the difficulty in determining
the correspondence between NL sentences and logical forms. To alleviate this
structural divergence problem, we extend the GHKM algorithm, a state-of-the-art
algorithm for learning synchronous grammars in statistical machine translation,
to induce lambda-SCFG rules from pairs of natural language sentences and
logical forms. By treating logical forms as trees, we reformulate the theory
behind GHKM to give formal semantics to the alignment between NL words and
logical form tokens.
Experiments on the GEOQUERY dataset show that our semantic parser achieves the best result
published to date.

BIODATA:
Peng Li is a PhD student at Tsinghua University, China. His research interests are semantic
parsing and machine translation. Currently he is working on supervised semantic parsing. He is
also very interested in semi-supervised and unsupervised semantic parsing algorithms for
open-domain texts, in the hope that semantic parsing can be used in machine translation and
other tasks in the future.

ABSTRACT:
Automatic evaluations form an important part of Natural Language Processing (NLP) research. Designing automatic evaluation metrics is not only an interesting research problem in itself; such metrics also help guide and evaluate algorithms for the underlying NLP task. More interestingly, one approach to tackling an NLP task is to maximize the automatic evaluation score of the output, further strengthening the link between the evaluation metric and the solver for the underlying NLP problem.

In this talk, I introduce TESLA, a very general and versatile linear programming-based framework for various automatic evaluation tasks.

TESLA builds on the basic n-gram matching method of the dominant machine translation evaluation metric BLEU, with several features that target the semantics of natural languages. In particular, we use synonym dictionaries to model word level semantics and bitext phrase tables to model phrase level semantics. We also differentiate function words from content words by giving them different weights.
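The weighted matching idea can be sketched as follows. This is a much-simplified, hypothetical stand-in for TESLA (which solves a linear program over n-gram matchings): a weighted unigram F-measure in which function words are down-weighted and a small synonym dictionary licenses word-level semantic matches.

```python
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}

def weight(tok, fw_discount=0.2):
    """Content words get weight 1.0; function words are down-weighted."""
    return fw_discount if tok in FUNCTION_WORDS else 1.0

def match_score(hyp, ref, synonyms=None):
    """Weighted unigram F-measure with optional synonym matching.

    `synonyms` maps a word to a set of acceptable alternatives,
    a toy stand-in for the synonym dictionaries TESLA uses."""
    synonyms = synonyms or {}
    ref_pool = list(ref)
    matched = 0.0
    for h in hyp:
        for i, r in enumerate(ref_pool):
            if h == r or r in synonyms.get(h, ()):
                matched += weight(h)
                del ref_pool[i]   # each reference token matches once
                break
    p = matched / sum(weight(t) for t in hyp)
    r = matched / sum(weight(t) for t in ref)
    return 2 * p * r / (p + r) if p + r else 0.0
```

With `synonyms={"feline": {"cat"}}`, the hypothesis "the feline sat" scores a perfect match against the reference "the cat sat", while a metric without synonym support would penalize it.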

Variants of TESLA are devised for many different evaluation tasks: TESLA-M, TESLA-B, and TESLA-F for the machine translation evaluation of European languages; TESLA-CELAB for the machine translation evaluation of languages with ambiguous word boundaries, such as Chinese; TESLA-PEM for paraphrase evaluation; and TESLA-S for summarization evaluation. Experiments show that they are very competitive on the standard test sets of their respective tasks.

BIODATA:
Liu Chang joined the NUS SoC Computational Linguistics Lab (now Media
Research Lab 2) in 2006 as a Final Year Project student, and later
continued his research as a PhD student under the supervision of Dr.
Ng Hwee Tou. He has published on a variety of Natural Language
Processing topics, including Semantic Role Labeling, Machine
Translation, Machine Translation Evaluation, and Summary Evaluation.
His main research interests lie in the robust statistical processing
of natural language semantics.

ABSTRACT:
Word Sense Disambiguation (WSD) is the process of identifying the meaning of an ambiguous word in context. It is considered a fundamental task in Natural Language Processing (NLP).
Previous research shows that supervised approaches achieve state-of-the-art accuracy for WSD. However, the performance of the supervised approaches is affected by several factors, such as domain mismatch and the lack of sense-annotated training examples. As an intermediate component, WSD has the potential of benefiting many other NLP tasks, such as machine translation and information retrieval (IR). But few WSD systems are integrated as a component of other applications.
We release an open-source supervised WSD system, IMS (It Makes Sense). In evaluations on the lexical-sample tasks of several languages and the English all-words tasks of the SensEval workshops, IMS achieves state-of-the-art results. It provides a flexible platform to integrate various feature types and different machine learning methods, and can be used as an all-words WSD component with good performance in other applications.
To address the domain adaptation problem in WSD, we apply the feature augmentation technique to WSD. By further combining the feature augmentation technique with active learning, we greatly reduce the annotation effort required when adapting a WSD system to a new domain.
One bottleneck of supervised WSD systems is the lack of sense-annotated training examples. We propose an approach to extract sense-annotated examples from parallel corpora without extra human effort. Our evaluation shows that incorporating the extracted examples achieves better results than using only the manually annotated examples.
Previous research arrives at conflicting conclusions on whether WSD systems can improve information retrieval performance. We propose a novel method to estimate the sense distribution of words in short queries. Together with the senses predicted for words in documents, we propose a novel approach to incorporate word senses into the language modeling approach to IR and also exploit the integration of synonym relations. Our experimental results on standard TREC collections show that using the word senses tagged by our supervised WSD system, we obtain statistically significant improvements over a state-of-the-art IR system.
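A minimal sketch of folding word senses into the language modeling approach to IR (an illustration only; the function names and the Dirichlet smoothing choice are our assumptions, not the thesis's exact model): score documents by query likelihood over sense IDs rather than surface words, so that synonyms sharing a sense ID can match.

```python
import math
from collections import Counter

def sense_lm_score(query_senses, doc_senses, collection_senses, mu=2000):
    """Dirichlet-smoothed query log-likelihood over sense IDs.

    All token sequences are assumed to have been run through a WSD
    system, replacing each word with its predicted sense ID."""
    dc = Counter(doc_senses)
    cc = Counter(collection_senses)
    dlen, clen = len(doc_senses), len(collection_senses)
    score = 0.0
    for s in query_senses:
        p_c = cc[s] / clen                      # collection model
        p = (dc[s] + mu * p_c) / (dlen + mu)    # smoothed doc model
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Because scoring happens in sense space, a document using a synonym of a query word (mapped to the same sense ID) contributes to the match, which plain surface-word language models miss.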

BIODATA:
Zhong Zhi is currently a senior software engineer at Elance Inc., a global online employment platform. Prior to joining Elance, he was a PhD Student in School of Computing, National University of Singapore, supervised by A/P Ng Hwee Tou. He is interested in natural language processing, especially word sense disambiguation, and information retrieval.

ABSTRACT:
A large portion of user-generated content accessible from the Web is in the form of short text. Examples are Twitter messages or tweets, comments, and status updates. On the one hand, such short texts reflect what netizens are interested in in a timely manner. On the other hand, short texts are sparse, noisy, and cover diverse and fast-changing topics, making many information retrieval and mining tasks extremely challenging. In this talk, I will present our work on event detection from tweets and on reader comment summarization for news articles. Our key contribution to the first task is the notion of a tweet segment, a meaningful keyphrase extracted from tweets. Events are detected by grouping relevant segments together. For reader comment summarization, we apply a topic model to capture both explicit and implicit relationships between comments and news articles. The comments associated with a news article are then grouped into topical clusters. In the last part of my talk, I will briefly introduce our work on tag-based social image retrieval.

BIODATA:
Dr. SUN Aixin is an Assistant Professor with the School of Computer Engineering (SCE), Nanyang Technological University (NTU), Singapore. He received his B.A.Sc (1st class honours) and PhD from the same school in 2001 and 2004 respectively. Aixin's research interests include Information Retrieval, Text/Web Mining, Social Computing, and Multimedia. His papers have appeared in ACM SIGIR, ACM WSDM, ACM Multimedia, ACM CIKM, JASIST, TKDE, DSS, and IP&M. Aixin has been a PC member of many conferences in related areas, including ACM SIGIR, ACM Multimedia, and ACM WSDM, and a reviewer for several journals and IEEE/ACM transactions. He is a member of ACM.

2012

ABSTRACT:
In this thesis, we investigate the natural language problem of parsing free text into its discourse structure. Specifically, we look at how to parse free text in the Penn Discourse Treebank (PDTB) representation in a fully data-driven approach. A difficult component of the parser is recognizing implicit discourse relations. We first propose a classifier to tackle this using contextual features, word pairs, and constituent and dependency parse features. We then design a parsing algorithm and implement it as a full parser in a pipeline. We present a comprehensive evaluation of the parser from both component-wise and error-cascading perspectives. To the best of our knowledge, this is the first parser that performs end-to-end discourse parsing in the PDTB style.

Textual coherence is strongly connected to a text's discourse structure. We present a novel model to represent and assess the discourse coherence of a text using our discourse parser. Our model assumes that coherent text implicitly favors certain types of discourse relation transitions. We implement this model and apply it to the text ordering ranking task, which aims to discern an original text from a permuted ordering of its sentences. To the best of our knowledge, this is also the first study to show that output from an automatic discourse parser helps in coherence modeling.
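The intuition that coherent text favors certain discourse relation transitions can be sketched as a simple bigram model over relation labels (an illustration under our own simplifying assumptions, not the thesis's actual model):

```python
import math
from collections import Counter

def train_transitions(texts):
    """Estimate relation-transition counts from coherent texts.
    Each text is a sequence of discourse relation labels, e.g.
    ["Contrast", "Cause", "Expansion"]."""
    bigrams, unigrams = Counter(), Counter()
    for rels in texts:
        for a, b in zip(rels, rels[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def coherence(rels, bigrams, unigrams, alpha=0.1, vocab=4):
    """Add-alpha-smoothed log-probability of a relation sequence;
    higher scores mean more typical (coherent) transitions."""
    lp = 0.0
    for a, b in zip(rels, rels[1:]):
        lp += math.log((bigrams[(a, b)] + alpha) /
                       (unigrams[a] + alpha * vocab))
    return lp
```

On the text ordering task, the original sentence order yields relation transitions seen often in training, so it scores higher than a random permutation of the same sentences.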

Besides modeling coherence, discourse parsing can also improve downstream applications in natural language processing (NLP). In this thesis, we demonstrate that incorporating discourse features can significantly improve two NLP tasks -- argumentative zoning and summarization -- in the scholarly domain. We also show that output from these two tasks can improve each other in an iterative model.

BIODATA:
Lin Ziheng is a PhD student under the supervision of Prof. Kan Min-Yen and Prof. Ng Hwee Tou in School of Computing, National University of Singapore. His research interests include natural language processing and information retrieval. Specifically, he has been working on discourse analysis, text coherence, text summarization, and opinion mining. He is now a researcher in SAP Research.

ABSTRACT:
Structural knowledge enables many efficient, effective, and flexible applications in both business intelligence and public knowledge services. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to knowledge distributed on the Web, but automatically acquiring cross-lingual structured knowledge on the Web is a difficult and challenging problem. This talk will describe a framework for building a cross-lingual knowledge graph by combining ontology matching, knowledge extraction, and machine learning. We will discuss some of the key issues in this framework and present our current work on cross-lingual knowledge linking and cross-lingual knowledge extraction, along with their related systems.

BIODATA:
Juanzi Li is a full professor in the Department of Computer Science and Technology, Tsinghua University, China. She is the vice director of the Chinese Information Processing Society at the China Computer Federation (CCF) and a member of the National Technical Committee on Chinese Press Information Technology of the Standardization Administration of China. She received her PhD from Tsinghua University in 2000. Prof. Li's main research interest is the study of semantic technologies combining key techniques from Natural Language Processing, the Semantic Web, and Data Mining. She has published about 90 papers in international journals and conferences such as WWW, TKDE, SIGIR, SIGMOD, SIGKDD, ISWC, and JoWS. She played an important role in defining the Chinese News Markup Language (CNML) and developed the CNML specification management system, for which she received "Wang Xuan" News Science and Technology awards in 2009 and 2011.

ABSTRACT:
We study the problem of classifying the temporal relationship between events and time expressions in text. In contrast to previous methods that require extensive feature engineering, our approach is simple, relying only on a measure of parse tree similarity. Our method generates such tree similarity values using dependency parses as input to a convolution kernel. The resulting system outperforms the current state-of-the-art.
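The parse tree similarity idea can be illustrated with a simplified Collins-Duffy-style convolution kernel that counts (decayed) common subtrees. The talk's system computes such values over dependency parses; the sketch below, with our own toy tree encoding as nested tuples, shows the recursion:

```python
def tree_kernel(t1, t2, decay=0.5):
    """Simplified convolution kernel: sum of decayed common-subtree
    counts over all node pairs. Trees are nested tuples, e.g.
    ("S", ("NP", ("she",)), ("VP", ("ran",)))."""
    return sum(_delta(a, b, decay)
               for a in _nodes(t1) for b in _nodes(t2))

def _nodes(t):
    out = [t]
    for child in t[1:]:
        out.extend(_nodes(child))
    return out

def _delta(a, b, decay):
    # Zero unless labels and productions (child labels) agree.
    if a[0] != b[0] or len(a) != len(b):
        return 0.0
    if len(a) == 1:                       # both are leaves
        return decay
    if any(c1[0] != c2[0] for c1, c2 in zip(a[1:], b[1:])):
        return 0.0
    prod = decay
    for c1, c2 in zip(a[1:], b[1:]):
        prod *= 1.0 + _delta(c1, c2, decay)
    return prod
```

Plugging such a kernel into an SVM lets the learner compare whole tree fragments directly, which is what removes the need for hand-engineered features.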
To further improve classifier performance, we can obtain more annotated data. Rather than rely on expert annotation, we assess the feasibility of acquiring annotations through crowdsourcing. We show that quality temporal relationship annotations can be crowdsourced from novices. By leveraging the problem structure of temporal relation classification, we can selectively acquire annotations on problem instances that we assess as more difficult. Employing this annotation strategy allows us to achieve a classification accuracy of 73.2%, a statistically significant improvement of 8.6% over the previous state-of-the-art, while trimming annotation effort by up to 37%.
Finally, as we believe that access to sufficient training data is a significant barrier to current temporal relationship classification, we plan to share our collected data with the research community to promote benchmarking and comparative studies.

BIODATA:
Jun Ping Ng is currently a PhD student at the School of Computing, National University of Singapore. His research interests lie predominantly in improving natural language applications such as question-answering and summarization through the use of semantic approaches.

ABSTRACT:
We show that by making use of information common to document sets belonging to a common category, we can improve the quality of automatically extracted content in multi-document summaries. This simple property is widely applicable in multi-document summarization tasks and can be encapsulated by the concept of category-specific importance (CSI). Our experiments show that CSI is a valuable metric to aid sentence selection in extractive summarization tasks. We operationalize the computation of the CSI of sentences through the introduction of two new features that can be computed without any external knowledge. We also generalize this approach, showing that when manually curated document-to-category mappings are unavailable, performing automatic categorization of document sets also improves summarization performance. We have incorporated these features into a simple, freely available, open-source extractive summarization system called SWING. In the recent TAC-2011 guided summarization task, SWING outperformed all other participant summarization systems as measured by automated ROUGE measures.
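The notion of category-specific importance can be illustrated with a toy weighting (our own simplification, not SWING's actual features): score each word by the log-ratio of its frequency in the category's document set versus a background collection, then score candidate sentences by their average weight.

```python
import math
from collections import Counter

def csi_weights(category_docs, background_docs, alpha=1.0):
    """Add-alpha-smoothed log-ratio of a word's relative frequency
    in the category's documents versus a background collection.
    Words typical of the category receive high weights."""
    cat = Counter(w for d in category_docs for w in d)
    bg = Counter(w for d in background_docs for w in d)
    vocab = set(cat) | set(bg)
    ctot, btot = sum(cat.values()), sum(bg.values())
    return {w: math.log(((cat[w] + alpha) / (ctot + alpha * len(vocab))) /
                        ((bg[w] + alpha) / (btot + alpha * len(vocab))))
            for w in vocab}

def sentence_score(sentence, weights):
    """Average CSI weight of a sentence's words; an extractive
    summarizer can prefer high-scoring sentences."""
    return sum(weights.get(w, 0.0) for w in sentence) / max(len(sentence), 1)
```

Because the weights are derived purely from counts over the document sets themselves, no external knowledge source is needed, matching the spirit of the abstract's claim.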

BIODATA:
Jun Ping is currently a PhD student at the School of Computing, National University of Singapore. His research interests lie predominantly in improving natural language applications such as question-answering and summarization through the use of semantic approaches.

ABSTRACT:
Language understanding is a key component of multimodal conversational
agents as well as conversation analytics. Meaning structures are usually
designed ad-hoc and generalize poorly over systems, tasks and domains.
Linguistically motivated syntactic or semantic parsers are feature-rich but
perform poorly on conversational and noisy speech. In this talk we will
report on discriminative parsing of conversational speech and recent results
on linguistically motivated parsing models. In the first part we present
the discriminative reranking models (DRM) based on kernel methods operating
over a very large set of meaning (sub)structures (e.g. strings, trees). We
show that DRMs outperform Conditional Random Fields in an extensive evaluation
on multilingual conversational corpora. In the second part of the talk we
will present our work on linguistically motivated meaning structures using
the FrameNet resource. This resource has been used to annotate human-machine
and human-human corpora of spoken conversations. We have investigated the grounding of such meaning
structures into word sequences as well as speech signal features such as
pitch and formant trajectories.

BIODATA:
Prof. Giuseppe Riccardi is founder and director of the Signal and
Interactive Systems Lab at University of Trento, Italy. He received his
Laurea degree in Electrical Engineering and Master in Information
Technology, in 1991, from the University of Padua and CEFRIEL/Politechnic of
Milan (Italy), respectively. From 1990 to 1993 he collaborated with
Alcatel-Telettra Research Laboratories (Italy). In 1995 he received his PhD
in Electrical Engineering from the Department of Electrical Engineering at
the University of Padua, Italy. From 1993 to 2005, he was at AT&T Bell
Laboratories (USA) and then AT&T Labs-Research (USA) where he worked in the
Speech and Language Processing Lab. In 2005, he joined the faculty of
University of Trento (Italy). He is affiliated with Engineering School, the
Department of Information Engineering and Computer Science and Center for
Mind/Brain Sciences. His current research interests are language modeling,
language understanding, spoken/multimodal dialog, affective computing, machine learning and machine translation.

ABSTRACT:
Interacting with machines may be an opportunity for delegating tasks,
cooperatively empowering humans in decision making, and for companionship in
our daily life. Current state-of-the-art conversational agents are far from
being persistent in our social networks, due to their limited understanding
and limited interactive and social skills. In the ADAMACH project we have addressed some
of the outstanding challenges in training Adaptive and Meaning machines
(ADAMACH). We have explored spoken language models for grounding FrameNet
semantics into lexical and speech features. We have investigated hybrid
models for dialog management combining Partially Observable Markov Decision
Processes and lightweight task structures. In the last part of the talk we
will address the issue of understanding human interaction from a behavioral
perspective. We will report on the analysis and extraction of personality
traits from human-human spoken interactions.

BIODATA:
Prof. Giuseppe Riccardi is founder and director of the Signal and
Interactive Systems Lab at University of Trento, Italy. He received his
Laurea degree in Electrical Engineering and Master in Information
Technology, in 1991, from the University of Padua and CEFRIEL/Politechnic of
Milan (Italy), respectively. From 1990 to 1993 he collaborated with
Alcatel-Telettra Research Laboratories (Italy). In 1995 he received his PhD
in Electrical Engineering from the Department of Electrical Engineering at
the University of Padua, Italy. From 1993 to 2005, he was at AT&T Bell
Laboratories (USA) and then AT&T Labs-Research (USA) where he worked in the
Speech and Language Processing Lab. In 2005, he joined the faculty of
University of Trento (Italy). He is affiliated with Engineering School, the
Department of Information Engineering and Computer Science and Center for
Mind/Brain Sciences. His current research interests are language modeling,
language understanding, spoken/multimodal dialog, affective computing, machine learning and machine translation.

ABSTRACT:
Semantic Web technologies have great potential for improving search results, especially in the biomedical domain where biologists have been intensively developing community-curated ontologies. We can apply the technologies to biomedical literature to represent the information expressed in biomedical documents with the concepts and relations of the biomedical ontologies. When successfully represented, the information stored into the ontologies will allow us to perform fine-tuned semantic searches over the literature.

In this talk, we show a feasibility test toward this goal in two aspects: 1) an ontology-based text mining system that extracts the information expressed in biomedical documents using logical inference based on domain knowledge, and 2) ontology alignment methods that identify equivalence and subsumption relations for automatic ontological corpus annotation and cross-ontology semantic querying. The first system identifies textual semantics and represents them with an ontology called GRO. One of its advanced features is the ability to deduce implicit information from explicitly expressed information using inference rules that encode domain knowledge. The resultant GRO-based semantics, both explicit and implicit, are stored in the ontology and can be retrieved by a semantic search engine. The second part assumes that we can extract such textual semantics based on multiple ontologies individually, and enables querying across the integrated ontologies populated with the textual semantics. This requires integrating the ontologies through cross-ontology correspondences such as equivalence and subsumption relations. We introduce novel methods for these tasks.

BIODATA:
Jung-jae Kim is an Assistant Professor of the School of Computer Engineering at Nanyang Technological University (NTU) in Singapore. He received his BSc, MS, and PhD in 1998, 2000, and 2006, respectively, from KAIST, South Korea. He has worked as a post-doctoral researcher for the European Bioinformatics Institute (EBI) from 2006 to 2009. He is an editor of the Journal of Biomedical Semantics and served as Publicity Chair of ACL 2012, Program Chair of LBM 2011, and PC members of such conferences as ISMB, ECCB, IHI, and LREC.

ABSTRACT:
Most successful computational approaches are supervised, relying on manually labeled corpora for training. Unfortunately, manual annotation of structures is often expensive and time-consuming. It is therefore of great value to NLP research to induce structures automatically from unannotated sentences.

In this proposal, I first introduce and analyze the existing methods in structure induction, then present some of our initial explorations on three unsupervised structure induction tasks.

Finally, I address some remaining issues and propose possible solutions as future work.

BIODATA:
Huang Yun obtained his BSc degree in computer science and technology from the Peking University, and MEng degree in computer science and technology from the Graduate University of Chinese Academy of Sciences. He is currently a PhD student jointly supervised by Prof. Tan Chew Lim from School of Computing NUS, and Dr. Zhang Min from Institute for Infocomm Research (I2R). His research interests include unsupervised grammar induction, natural language processing, and machine learning.

ABSTRACT:
As online multimedia information grows rapidly, multimedia information retrieval has become increasingly important for helping users retrieve the most relevant multimedia documents (e.g., music tracks, videos, web pages). Most multimedia documents are multi-faceted and contain heterogeneous types of information, such as textual metadata, audio content, and video content. How to effectively combine multiple facets to better apprehend and retrieve relevant multimedia documents is therefore essential in information retrieval systems. In my research, I study different multimodal fusion approaches to improve the performance of music information retrieval systems. The proposed approaches are based on two principles: different query contexts require different fusion approaches to combine modalities effectively; and since the relative significance of modalities varies among documents, different documents under the same query context also need to be fused differently.

Based on these guidelines, I first studied a domain-specific music retrieval system in which query features were restricted to certain aspects (e.g., tempo) and were not obviously relevant to some modalities (e.g., textual metadata). The focus was therefore on improving unimodal retrieval performance. Experiments and user studies validated the effectiveness of the proposed approach and the utility of the system.

I also investigated how fusion strategies influence information retrieval performance. The document-dependent fusion framework was proposed first, and it confirmed the efficacy of document content in deriving fusion strategies. A general multimodal fusion framework, query-document-dependent fusion, was then proposed to extend existing work by deriving a different fusion strategy for each query-document pair. This enables each document to combine its modalities using the optimal fusion strategy and unleashes the power of different modalities in the retrieval process. Experimental results on a multimodal music retrieval task revealed both the effectiveness and the efficiency of the proposed framework. Future plans are to explore fusion approaches at earlier stages to make better use of different modalities in semantic music retrieval.
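The query-document-dependent fusion idea, i.e. a potentially different set of modality weights for each query-document pair, can be sketched as follows (a toy illustration; the weight function used in the usage example is hypothetical, not the thesis's learned strategy):

```python
def fuse(scores, weights):
    """Weighted late fusion of per-modality scores for one document."""
    return sum(weights[m] * s for m, s in scores.items())

def rank(query_doc_scores, weight_fn):
    """Rank documents where the fusion weights may differ per
    query-document pair: `weight_fn(doc, scores)` returns the
    modality weights to use for that particular pair."""
    fused = {doc: fuse(scores, weight_fn(doc, scores))
             for doc, scores in query_doc_scores.items()}
    return sorted(fused, key=fused.get, reverse=True)
```

A fixed-weight scheme is the special case where `weight_fn` ignores its arguments; allowing the weights to vary per pair is precisely the extension the abstract describes.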

BIODATA:
Li Zhonghua is a research assistant and a part-time PhD student in the School of Computing, National University of Singapore, supervised by Prof. Wang Ye. She received her bachelor's degree in Computer Science from Northwestern Polytechnical University in 2008. Her research interests include music content analysis, information retrieval, and multimodal fusion.

ABSTRACT:
The emergence of social media offers online users opportunities to perform various social activities, such as sharing, commenting, tagging and contacting. These activities not only contribute abundant user-generated content, but also tie users together through direct or indirect relations. The users and relations naturally form a large social network, whose structure reflects users' social interaction in the context of exchanging information about the shared content. The social network structure, in turn, significantly influences social activities and user decisions. Social structure mining in social media is therefore an interesting and important issue, opening up research problems such as link prediction, community detection and leader identification.

In this talk, I will introduce one of my past projects, on prestigious member prediction, as an application of social structure mining. In this work, we propose an action network framework for predicting prestigious members in Flickr groups through the evolution of their prestige. The framework will be illustrated with a case study. Finally, I will share some thoughts on potential research directions in social structure mining.

BIODATA:
Dongyuan Lu recently received her PhD degree from the Institute of Automation, Chinese Academy of Sciences, under the supervision of Prof. Ruwei Dai. The title of her PhD thesis is “Key Problems in Information Browsing and Retrieval in Social Media Websites”. Her current research interests include social network mining, social media analysis and information retrieval.

Dr. Lu has published several conference and journal papers in these fields, including in the journal Decision Support Systems (DSS). She is the recipient of several academic awards, including a WWW female student travel grant.

ABSTRACT:
With the advent of Web 2.0, user-generated content (UGC) has become a mainstay of social media platforms. Microblogging services (e.g., Twitter, Sina Weibo) are currently among the most popular ways for users to generate creative content, broadcast themselves, share ideas, and interact with others. Our research interest lies in uncovering the relationship between tweet content and user behaviors.

This study focuses on retweet behavior, and asks: (1) What kinds of content features trigger users to share (retweet) a text tweet? (2) Can we build a predictive model of retweeting behavior based on content? In this talk, I will review related literature on retweet prediction and tweet classification, describe our preliminary research on this topic, and discuss potential future work.
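
As a toy illustration of what "content features" might mean here (the feature set below is assumed for illustration, not the speaker's actual one), a tweet can be mapped to a small feature dictionary that a downstream classifier would consume:

```python
def content_features(tweet):
    """Map a tweet's text to simple content features that a retweet
    classifier could use (illustrative feature set, not the study's)."""
    return {
        "length":       len(tweet),
        "has_url":      int("http" in tweet),
        "num_hashtags": tweet.count("#"),
        "num_mentions": tweet.count("@"),
        "is_question":  int("?" in tweet),
    }

feats = content_features("Check out our new #NLP demo http://example.com")
```

A logistic regression or decision tree trained over such features would then estimate the probability that a tweet gets retweeted.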

BIODATA:
Tao Chen is a PhD candidate from the School of Computing at the National University of Singapore, and is under the supervision of Associate Professor Min-Yen Kan. Her research interests include Natural Language Processing and Information Retrieval. Specifically, she is working on analyzing user-generated content (UGC) in social media platforms.

The first problem concerns the categorization of domain-specific resources at multiple granularities. This helps a search engine better meet specific user needs by directing task-relevant materials to the user and by organizing its presentation of search results around more pertinent metadata criteria.

The second problem concerns the resolution of domain-specific concepts to their related domain-specific constructs. This allows constructs to properly influence relevance ranking in search results, without requiring the user to input them in the potentially awkward construct syntax.

We propose to model the domain-specific resources with a multi-layered graphical framework. The nodes in the model represent the characteristics associated with domain-specific resources, while the edges represent the correlations among the characteristics. Our model is not only expressive in providing a unified view of the characteristics and correlations but also flexible in the choice of computational mechanisms.

Guided by the model, we carry out our research on the two aforementioned problems as follows. For Resource Categorization, we use the key information extraction problem in healthcare as a case study on the categorization of correlated nominal facets. We exploit the correlation between two categorizations at different granularities (i.e., sentence-level and word-level) by propagating information from one to the other sequentially or simultaneously. In addition, we use the readability measurement problem as a case study on the categorization of ordinal facets. We exploit the correlation between the readability of domain-specific resources and the difficulty of the domain-specific concepts through an iterative computation algorithm. For Text-to-Construct Linking, we tackle the linking of math concepts to their representations in math expressions. We exploit the correlation between the observable characteristics of a concept-expression pair and its relation type using supervised learning.
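
The iterative computation for the readability case can be pictured as a mutual-reinforcement loop between document readability and concept difficulty. The sketch below is a heavily simplified assumption (uniform averaging, a few seeded difficulties), not the actual algorithm: a document is as hard as the average of its concepts, and an unseeded concept is as hard as the average of the documents it appears in.

```python
def iterate_difficulty(doc_concepts, seed, steps=20):
    """doc_concepts: doc -> set of concepts it mentions.
    seed: concept -> fixed difficulty for concepts known a priori.
    Returns (document difficulties, concept difficulties)."""
    concepts = {c for cs in doc_concepts.values() for c in cs}
    c_diff = {c: seed.get(c, 0.5) for c in concepts}
    d_diff = {}
    for _ in range(steps):
        # A document is as hard as the average of its concepts...
        d_diff = {d: sum(c_diff[c] for c in cs) / len(cs)
                  for d, cs in doc_concepts.items()}
        # ...and an unseeded concept is as hard as the average of
        # the documents that contain it.
        for c in concepts:
            if c in seed:
                continue  # known difficulties stay fixed
            docs = [d for d, cs in doc_concepts.items() if c in cs]
            c_diff[c] = sum(d_diff[d] for d in docs) / len(docs)
    return d_diff, c_diff

# Hypothetical corpus: "apoptosis" has no seeded difficulty and
# inherits it from the documents it co-occurs in.
docs = {"intro.txt": {"cell"}, "paper.txt": {"cell", "apoptosis"}}
d, c = iterate_difficulty(docs, seed={"cell": 0.1})
```

On this toy input the loop converges: "apoptosis" is pulled toward the difficulty of the easy seeded concept it always co-occurs with.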

To demonstrate the applicability and usefulness of our research, we have also implemented two domain-specific search systems, one in math and the other in healthcare. The math system incorporates the categorization of resource type and readability as well as text-to-expression linking, while the healthcare system categorizes healthcare resources at multiple granularities to pull out relevant information for applicability and validity assessment, and has additional features, such as dual interface for active/passive search, for better workflow integration.

BIODATA:
Jin Zhao is a PhD candidate in the School of Computing, National University of Singapore (NUS), under the supervision of A/P Min-Yen Kan. His research interest is in domain-specific information retrieval, with a focus on math and nursing.

ABSTRACT:
As humans, we can often infer a person's hometown from how
he speaks. Computers have also been trained to automatically recognize
people's accents or dialects. However, it is non-trivial for either a
human or an automatic system to explicitly articulate the underlying
rules that define such dialect differences. In this talk, I will
discuss some of my past work on automatically characterizing phonetic
and acoustic patterns across dialects.

We propose a framework that refines the traditional network of hidden
Markov Models to explicitly characterize dialect differences. These
differences can be phonetic or acoustic, and are interpreted as
dialect-specific rules. The automatic nature of the proposed system
makes characterizing large corpora efficient; the explicit modeling of
rules makes the system output informative to human analysts (e.g.,
forensic phoneticians). This combination of being automatic yet
linguistically informative has not been previously possible; this new
approach is therefore termed Informative Dialect Recognition.

Experiments have been conducted on multiple English and Arabic corpora
in dialect recognition, pronunciation conversion, and rule retrieval
tasks. All results showed that the proposed system learns
dialect-specific rules well.

The proposed approach can also be potentially applied to other fields
such as computer-assisted language learning, speech and language
disorders diagnosis, speaker identification, and forensic analytics.

BIODATA:
Nancy Chen is a scientist at the Human Language Technology
Department, Institute for Infocomm Research (I2R), Singapore. Dr. Chen
received her PhD from MIT and Harvard in 2011. Her research interests
are in spoken language processing and understanding, with a slant
toward integrating speech science and linguistic knowledge into
computational and statistical models. She has been working on areas
such as accent/dialect/language recognition, computer-assisted
language learning, and automatic speech summarization and semantic
analysis.

Dr. Chen is a recipient of multiple awards, including a Best Paper
Award, the IEEE Spoken Language Processing Student Travel Grant
(sponsored by X.D. Huang, A. Acero and H.-W. Hon of Microsoft), and
the United States National Institutes of Health Ruth L. Kirschstein
National Research Service Award.

ABSTRACT:
Even though research on recommender systems has been around for more
than a decade, there is still considerable room for improvement. Most
recommender systems in the past have focused on movies and the
MovieLens dataset. However, as times change, other domains such as
mobile apps and real-time social news are on the rise, and the
characteristics of these new domains differ greatly from those of the
movie domain. This allows us to develop new techniques better suited
to such prospective domains.

As real-time social media gradually changes the way our society
functions, and as ordinary people become “content providers”
themselves, we now have a new “data source” to harness, one that was
not around a few years ago. By leveraging the content freely available
on social networking sites like Twitter and Facebook, we can further
empower recommender systems for mobile apps. In other words, we use
the proliferation of real-time social data to cope with the
proliferation of mobile apps.

In this talk, we will provide a comprehensive review of
state-of-the-art technologies related to recommender systems and the
social web, describe our preliminary research on this topic, and
discuss potential future work.

BIODATA:
Jovian Lin is a PhD candidate in the School of Computing at the
National University of Singapore, under the supervision of Professor
Chua Tat-Seng. His research deals with incorporating real-time social
data into recommender systems, particularly those related to the
domain of mobile apps.

ABSTRACT:
Coreference resolution is one of the central tasks in
natural language processing. Successful coreference resolution
benefits many other natural language processing and information
extraction tasks. This thesis explores three important research issues
in coreference resolution.

A large body of prior research on coreference resolution recasts the
problem as a two-class classification problem. However, standard
supervised machine learning algorithms that minimize classification
error on the training instances do not necessarily maximize the
F-measure of the chosen evaluation metric for coreference resolution.
We propose a novel approach combining instance weighting and beam
search to maximize the evaluation metric score on the training corpus
during training. Experimental results show that this approach achieves
significant improvement over the state of the art. We report results
on standard benchmark corpora (two MUC corpora and three ACE corpora),
evaluated using the link-based MUC metric and the mention-based
B-CUBED metric.
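
A generic beam search over coreference clusterings, with a stand-in scoring function (the real system scores weighted instances against a coreference evaluation metric, and uses richer features than string match), might look like this sketch:

```python
# Toy beam search over coreference clusterings. Each mention either
# starts a new cluster or merges into an existing one; a beam keeps
# the k highest-scoring partial clusterings at each step.

def beam_search(mentions, score, beam_size=3):
    """score(clustering) -> float; a clustering is a tuple of frozensets."""
    beam = [()]
    for m in mentions:
        candidates = []
        for clustering in beam:
            # Option 1: start a new cluster containing just this mention.
            candidates.append(clustering + (frozenset([m]),))
            # Option 2: merge the mention into each existing cluster.
            for i, cluster in enumerate(clustering):
                merged = clustering[:i] + (cluster | {m},) + clustering[i+1:]
                candidates.append(merged)
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    return beam[0]

# Hypothetical scorer: rewards clustering mentions with matching heads.
def toy_score(clustering):
    return sum(1 for cl in clustering for a in cl for b in cl
               if a != b and a.split()[-1] == b.split()[-1])

best = beam_search(["Barack Obama", "Obama", "the president"], toy_score)
```

The beam lets the search recover from locally bad merge decisions, which a purely greedy pairwise classifier cannot.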

In the literature, most prior work on coreference resolution has
focused on the newswire domain. Although a coreference resolution
system trained on the newswire domain performs well on that domain,
there is a huge performance drop when it is applied to the biomedical
domain. Annotating coreferential relations in a new domain is very
time-consuming. This raises the question of how we can adapt a
coreference resolution system trained on a resource-rich domain to a
new domain with minimal data annotation. We present an approach
integrating domain adaptation with active learning to adapt
coreference resolution from the newswire domain to the biomedical
domain, and explore the effects of domain adaptation, active learning,
and target-domain instance weighting on coreference resolution.
Experimental results show that domain adaptation with active learning
and the weighting scheme achieves performance on MEDLINE abstracts
similar to that of a system trained on full coreference annotation,
but with a greatly reduced number of training instances to annotate.

Lastly, we present a machine learning approach to the identification
and resolution of Chinese anaphoric zero pronouns. We perform both
identification and resolution automatically, with two sets of easily
computable features. Experimental results show that our proposed
learning approach achieves anaphoric zero pronoun resolution accuracy
comparable to a previous state-of-the-art, heuristic rule-based
approach. To our knowledge, our work is the first to perform both
identification and resolution of Chinese anaphoric zero pronouns using
a machine learning approach.

BIODATA:
Shanheng Zhao is currently a senior software engineer at Elance Inc.,
a world-leading online employment platform headquartered in Mountain
View. Prior to joining Elance, he was a PhD student in the School of
Computing, National University of Singapore, supervised by A/P Hwee
Tou Ng. His research interests are in the areas of natural language
processing and information retrieval. At Elance, he serves as a core
developer on search-related products, turning his knowledge of NLP/IR
into industry applications.

ABSTRACT:
Machine translation (MT) is a task that investigates the use
of computer software to translate human languages. The past several
years have witnessed the rapid development of statistical MT, shifting
from word-based and phrase-based to syntax-based methods. This talk
surveys tree-to-string translation, which exploits the syntax of the
source language to guide translation. Two fundamental problems will be
discussed: how to learn tree-to-string rules from real-world data
automatically and how to generate translation accurately and
efficiently using tree-to-string rules. The talk closes with recent
advances and future directions.
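
To illustrate what a tree-to-string rule does, here is a toy derivation; the source words are invented romanizations and the rules are hand-written for the example, whereas real systems learn thousands of such rules automatically from word-aligned tree-string pairs.

```python
# Each rule maps a source-side production to a target-side template,
# where x0, x1, ... stand for the translations of the corresponding
# child subtrees. The VP rule reorders a head-final verb phrase.

rules = {
    ("S", "NP", "VP"):  "x0 x1",
    ("VP", "NP", "V"):  "x1 x0",      # object-verb -> verb-object
    ("NP", "kare"):     "he",
    ("NP", "hon"):      "the book",
    ("V", "yomu"):      "reads",
}

def translate(tree):
    """Recursively apply tree-to-string rules to a source parse tree.
    Trees are tuples ("LABEL", child, ...); leaves are plain strings."""
    if isinstance(tree, str):
        return tree
    prod = (tree[0],) + tuple(c if isinstance(c, str) else c[0]
                              for c in tree[1:])
    template = rules[prod]
    children = [c for c in tree[1:] if not isinstance(c, str)]
    out = template
    for i, child in enumerate(children):
        out = out.replace(f"x{i}", translate(child))
    return out

src = ("S", ("NP", "kare"), ("VP", ("NP", "hon"), ("V", "yomu")))
result = translate(src)   # "he reads the book"
```

The two fundamental problems the talk names correspond to learning the `rules` table from data and searching efficiently over the many ways such rules can combine.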

BIODATA:
Yang Liu is an Associate Professor in the Department of Computer
Science and Technology at Tsinghua University. He obtained his PhD
degree from the Institute of Computing Technology, Chinese Academy of
Sciences (CAS/ICT) in 2007 and worked at ICT until 2011. His research
focuses on machine translation, natural language processing, and
Chinese information processing. He has published six ACL full papers
in the machine translation area as first author. His work on
tree-to-string translation received a Meritorious Asian NLP Paper
Award at ACL 2006. He has served as a reviewer for leading journals
and conferences, including TALIP, NLE, Neurocomputing, ACL, EMNLP and
COLING.

ABSTRACT:
Language identification is the task of determining the natural language a document is written in. In this talk, I will introduce some commonly used methods for language identification as well as their limitations. I will discuss transfer learning in the context of building a cross-domain language identification system, and demonstrate langid.py, an off-the-shelf language identifier.
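
langid.py ships as a pre-trained, off-the-shelf classifier (a naive Bayes model over byte n-grams selected for cross-domain stability). The self-contained sketch below shows the underlying idea with character bigrams and two toy "training corpora"; it is an illustration of the method, not langid.py's actual model.

```python
# Minimal character-bigram language identifier in the spirit of (but
# far simpler than) langid.py's naive Bayes approach.

import math
from collections import Counter

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """samples: dict lang -> training text. Returns per-language counts."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def classify(models, text):
    def loglik(counts):
        total = sum(counts.values())
        vocab = len(counts) + 1
        # Add-one smoothed log-likelihood of the text's bigrams.
        return sum(math.log((counts[g] + 1) / (total + vocab))
                   for g in bigrams(text))
    return max(models, key=lambda lang: loglik(models[lang]))

models = train({
    "en": "the quick brown fox jumps over the lazy dog and then the cat",
    "nl": "de snelle bruine vos springt over de luie hond en dan de kat",
})
```

A real identifier differs mainly in scale: many languages, byte n-grams of several orders, and feature selection to keep the model robust across domains.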

BIODATA:
Marco Lui is a PhD candidate in the Department of Computing and Information Systems at the University of Melbourne in Australia. His main research area is language identification, and his research interests include text classification, natural language processing, machine learning, web-scale data and user-generated content. He grew up in Singapore, and speaks English, Italian, Cantonese and Mandarin.

ABSTRACT:
In this presentation I will discuss the compilation of a social media corpus of chats, tweets and SMS as part of the SoNaR corpus, a 500-million-word reference corpus of written Dutch comprising all text categories. I will briefly outline how resources for speech and language technology are currently organized in Europe and the Netherlands. In compiling the social media corpus, special attention was paid to IPR issues and to the collection of metadata. IPR clearance was obtained partly through licenses with platform owners, but largely through the consent of individual contributors, recruited with the help of free publicity. The texts were converted to the FoLiA format and processed with a tokeniser adapted to social media.

Secondly, I will introduce the current projects of the Centre for Language and Speech Technology (CLST) at the Radboud University Nijmegen, the Netherlands. CLST research focuses on the following domains: data mining, language learning and teaching, communication in health care, forensics, and resources and research infrastructure.

BIODATA:
Maaske Treurniet MA is working as a researcher in the Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands. She received her bachelor’s degree in Speech and Language Therapy in 2007 and a master’s degree in Speech and Language Pathology in 2011. Her research interests include corpus collection and data curation.

ABSTRACT:
Knowledge bases (KBs), which contain rich knowledge about the world's entities, have been shown to be a valuable component of many natural language processing (NLP) tasks. However, to populate KBs with information mined from unstructured text, or to utilize KB resources for NLP tasks, we need to link mentions of entities in text to their corresponding entries in the KB, a task called entity linking.

The major challenges of entity linking are name variation and name ambiguity. Name variation refers to the case where multiple name variants, such as aliases, misspellings and acronyms, refer to the same entity. Name ambiguity refers to the case where more than one entity shares the same name.

Most studies on entity linking use annotated data to learn a classifier or ranker with supervised learning algorithms. As manually creating a training corpus for entity linking is labor-intensive and costly, our research initially focuses on automatically labeling a large-scale training corpus for entity linking, in which we label ambiguous mentions by leveraging their unambiguous synonyms. We also propose an instance selection strategy to effectively utilize the automatically generated annotations.

Moreover, traditional approaches treat the context of a mention as a bag of words, n-grams, noun phrases and/or co-occurring named entities, and measure context similarity by comparing weighted literal term vectors. Such literal matching suffers from two problems: lack of semantic information and sparsity. We therefore introduce a topic model into entity linking to discover the underlying topics of documents and KB entries.
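
That literal-matching baseline can be sketched in a few lines (toy KB entries and uniform term weights assumed; real systems use tf-idf or similar weighting over much richer context):

```python
# Rank KB candidates for a mention by cosine similarity between the
# term vector of the mention's context and each entry's description.

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def link(context, candidates):
    """candidates: dict of KB entry -> description text."""
    ctx = Counter(context.lower().split())
    return max(candidates,
               key=lambda e: cosine(ctx, Counter(candidates[e].lower().split())))

# Toy KB for the ambiguous mention "AZ":
kb = {
    "Arizona":     "arizona is a state in the southwestern united states",
    "AZ_(rapper)": "az is an american rapper from brooklyn new york",
}
best = link("the mention AZ here refers to the state in the united states", kb)
```

The failure modes the talk targets are visible even here: a context that paraphrases the description without sharing literal terms scores zero, which is what the topic model is meant to fix.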

Another problem in previous studies is that they disambiguate a mention of a name (e.g., “AZ”) based on distribution knowledge learned from labeled instances in the training set that relate to other names (e.g., “Hoffman”, “Chad Johnson”, etc.). The gaps among the distributions of instances related to different names hinder further improvement of these approaches. To further improve entity linking, we propose a lazy learning model, which allows us to improve the learning process with distribution information specific to the queried name (e.g., “AZ”).

BIODATA:
Zhang Wei is a PhD student in the School of Computing, National
University of Singapore, supervised by Prof. Tan Chew Lim and Dr. Su
Jian. He received his bachelor's degree in Computer Science from
Harbin Institute of Technology in 2008. His research interests include
information extraction and entity linking.

ABSTRACT:
A great deal of textual information is embedded in images. More and more documents are digitized every day via cameras, scanners and other equipment; many digital images contain text, and much textual information is embedded in web images. It would be very useful to convert these characters from image format to textual format using optical character recognition (OCR), as this information is important for document mining, document image retrieval, and so on. However, in many cases, document images cannot be sent directly to an OCR engine for recognition due to document degradation.

Document image enhancement is a technique that improves the quality of a document image to enhance human perception or to facilitate subsequent automated image processing; it is widely used in the preprocessing stage of different document analysis tasks. We try to improve the accessibility of the textual information embedded in document images for different applications. Document image enhancement is essentially an ill-posed problem, because many different kinds of distortion can lead to the same degraded image. Furthermore, the quality of an enhancement technique is mainly judged by human perception, which makes quantitative measures hard to apply.

There are many kinds of document enhancement techniques, each handling differently distorted document images. In this proposal, we focus on two aspects of document enhancement: document image binarization and document image deblurring. In addition, we try to enhance web images to increase the recognition performance of OCR engines on those images. Moreover, just as we enhance text document images to improve OCR performance, we extend the work to enhancing Optical Music Recognition (OMR) performance by detecting and removing staff lines in musical document images, in order to improve the accessibility of musical scores to OMR systems.
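
As a point of reference, the classic global baseline for binarization is Otsu's method, which picks the grey-level threshold maximizing between-class variance of the histogram. The work discussed here goes beyond such global thresholding, so this sketch only illustrates the basic idea:

```python
def otsu_threshold(pixels):
    """Return the grey level (0..255) that best separates pixels into a
    dark (ink) class and a bright (paper) class by maximizing the
    between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = sum(hist[:t])          # pixels below the threshold
        w1 = total - w0             # pixels at or above it
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum(i * hist[i] for i in range(t)) / w0
        mu1 = sum(i * hist[i] for i in range(t, 256)) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

# Bimodal toy "image": dark ink at grey level 20, paper at 230.
page = [20] * 50 + [230] * 200
t = otsu_threshold(page)
binary = [int(p > t) for p in page]   # 1 = background, 0 = ink
```

Global thresholds like this break down under uneven illumination and bleed-through, which is exactly where the adaptive binarization techniques in this proposal are needed.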

BIODATA:
Bolan Su is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by Prof. Chew Lim Tan and Dr. Shijian Lu. He received his bachelor's degree from the Department of Computer Science at Fudan University in 2008. His research interests are in document image analysis and enhancement.

ABSTRACT:
In social tagging systems, users have different purposes when they annotate items. Tags not only depict the content of the annotated items, for example by listing the objects that appear in a photo, or express contextual information about the items, for example the location or the time at which a photo was taken, but can also describe subjective qualities and opinions about the items, or relate to organisational aspects, such as self-references and personal tasks.

Current folksonomy-based search and recommendation models exploit the social tag space as a whole to retrieve those items relevant to a tag-based query or user profile, and do not take into consideration the purposes of tags. We hypothesise that a significant percentage of tags are noisy for content retrieval, and believe that the distinction of the personal intentions underlying the tags may be beneficial to improve the accuracy of search and recommendation processes.

We present a mechanism to automatically filter and classify raw tags in a set of purpose-oriented categories. Our approach finds the underlying meanings (concepts) of the tags, mapping them to semantic entities belonging to external knowledge bases, namely WordNet and Wikipedia, through the exploitation of ontologies created within the W3C Linking Open Data initiative. The obtained concepts are then transformed into semantic classes that can be uniquely assigned to content- and context-based categories. The identification of subjective and organisational tags is based on natural language processing heuristics.

We collected a representative dataset from the Flickr social tagging system and conducted an empirical study to categorise real tagging data and to evaluate whether the resulting tag categories benefit a recommendation model based on the Random Walk with Restarts method. The results show that content- and context-based tags are superior to subjective and organisational tags, achieving performance equivalent to using the whole tag space.
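
Random Walk with Restarts itself is compact: with probability alpha the walker jumps back to the query node, and otherwise it follows a random edge; the stationary distribution scores every node's relevance to the query. A minimal sketch on a made-up photo/tag graph (the evaluated model operates on a much larger folksonomy graph):

```python
def rwr(adj, query, alpha=0.15, steps=100):
    """Random Walk with Restarts by power iteration.
    adj: dict node -> list of neighbours (toy undirected graph)."""
    nodes = list(adj)
    p = {n: float(n == query) for n in nodes}
    for _ in range(steps):
        # Restart mass goes back to the query node...
        nxt = {n: alpha * (n == query) for n in nodes}
        # ...and the rest of the probability mass follows the edges.
        for n in nodes:
            share = (1 - alpha) * p[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p

graph = {
    "photo1": ["beach", "sunset"],
    "photo2": ["beach"],
    "beach":  ["photo1", "photo2"],
    "sunset": ["photo1"],
}
scores = rwr(graph, query="beach")
```

Nodes closer to the query accumulate more stationary probability, which is what makes RWR usable as a relevance ranker over tag graphs.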

BIODATA:
Joemon Jose is a Professor at the School of Computing Science, University of Glasgow, Scotland, and leader of the Information Retrieval group. He is interested in all aspects of information retrieval (theory, experimentation, evaluation and applications) in the textual and multimedia domains. His research centres on three themes: (i) adaptive and personalized search systems; (ii) multimodal interaction for information retrieval; (iii) multimedia mining and search. He is a fellow of the BCS and the IET.

ABSTRACT:
This proposal focuses on the structuralization of product consumer reviews. The aim is to discover a structure within consumer reviews and organize them accordingly, so as to help users understand the knowledge inherent in the reviews. Since a hierarchy can usually improve information dissemination and accessibility, we propose to generate a hierarchical structure for organizing consumer reviews using a domain-assisted approach. The structure categorizes the product aspects and consumer opinions mentioned in the reviews, and captures the parent-child relations among the aspects. Such a structure provides a well-visualized way to browse consumer reviews at different granularities to meet various users' needs, and a platform on which users can easily grasp an overview of the reviews and conveniently seek the information they desire (i.e., product aspects and consumer opinions).

These valuable characteristics of the structure can support multiple real-world applications, and we show two sample applications in this proposal: aspect ranking and product opinion question answering (opinion-QA). Aspect ranking aims to identify the important product aspects that greatly influence consumers' decision making and firms' product development strategies; we leverage the structure to acquire all aspects and opinions commented on in the reviews, and develop a probabilistic regression algorithm to identify the important aspects. For opinion-QA, we exploit the structure to boost the performance of its question analysis and answer generation components.

BIODATA:
Jianxing Yu is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by Prof. Chua Tat-Seng. He received his bachelor's degree from the School of Information, Renmin University of China (RUC) in 2008. His research interests are in sentiment analysis and text mining.

ABSTRACT:
Arabic is among the world's most widely-spoken languages, and incidentally
one of the most computationally difficult to process. Being a Semitic
language, it is usually written without vowels or diacritics. This creates
substantial ambiguities in both phonetic pronunciation and semantic
meaning, so that the "diacritization" of the text becomes a necessary step
for several Arabic applications. In the absence of accurately annotated
corpora, the diacritization of syntax-specific "case endings" (also known
as inflectional diacritics) is a particularly complex problem, because
Arabic syntax is flexible.

Existing research focuses on textually extracted information. In this talk
we introduce speech as a valuable component of the diacritization process,
and explore the extent to which acoustic information can complement a
text-based system. We describe a speech-and-text combined system that
successfully predicts diacritics with an error rate of only 1% when
inflectional diacritics are excluded, and 1.6% when they are included. We
also evaluate two of the most widely used tools for Arabic Natural
Language Processing (NLP) in the context of our system, as well as the
performance of Support Vector Machines and Conditional Random Fields for
text-based diacritization.

BIODATA:
Aisha Siddiqa Azim is an MSc student at the School of Computing (NUS),
supervised by Dr. Sim Khe Chai. She received her Bachelor degree from the
Lahore University of Management Sciences in Pakistan in 2007. Her research
interests in Natural Language Processing focus on language learning and
information extraction.

ABSTRACT:
With the advent of computers and the web, people working with language
can explore it as never before. We can now ask our computers to find
lots of examples of a word, phrase, or grammatical construction, and
with advanced tools we can also find the core terms for a domain, or
the words most often occurring in a construction, or automatically
prepare draft dictionary entries. The baseline approach is through
Google, but specialist tools and language corpora - prepared to cut
out the problems associated with using Google - allow us much more
precision. I will introduce two tools - Sketch Engine, for exploring
corpora, and WebBootCaT, for building them - and show how they can be
used to address a range of tasks and research questions in linguistics
and lexicography. The session will be largely demo-ing corpora and
functions of the tools, and lends itself well to exploring many
research questions where audience members may want a 'corpus
perspective', so come prepared!

BIODATA:
Adam Kilgarriff is Director of Lexical Computing Ltd. He has led the
development of the Sketch Engine http://www.sketchengine.co.uk, a
leading tool for corpus research used for dictionary-making at Oxford
University Press, Cambridge University Press, HarperCollins, Le Robert
and elsewhere. His scientific interests lie at the intersection of
computational linguistics, corpus linguistics, and dictionary-making.
Following a PhD on "Polysemy" from Sussex University, he has worked at
Longman Dictionaries, Oxford University Press, and the University of
Brighton. He is a Visiting Research Fellow at the University of Leeds.
He is active in moves to make the web available as a linguists' corpus
and was the founding chair of ACL-SIGWAC (Association for
Computational Linguistics Special Interest Group on Web as Corpus).
He has been chair of the ACL-SIG on the lexicon, co-organiser of the
first two SENSEVAL competitions, and Board member of EURALEX (European
Association for Lexicography). See also http://www.kilgarriff.co.uk

2011

ABSTRACT:
Due to advances in medical imaging technology and the wider adoption of electronic medical record systems in recent years, medical imaging has become a major tool in clinical trials, and a huge number of medical images are produced in hospitals and medical institutions every day.

Current research mainly focuses on modality/anatomy classification or simple abnormality detection, yet the need to efficiently retrieve images by pathology class is great. The lack of large training datasets makes pathology-based image classification difficult. To address these problems, we propose two approaches that use both the medical images and their associated radiology reports to automatically generate a large training corpus and classify brain CT images into different pathological classes.

In the first approach, we extract pathology terms from the text and annotate the associated images with the extracted terms; the resulting annotated images are used as the training data set. We use probabilistic models to derive the correlations between hematoma regions and annotations, and propose a semantic similarity language model that uses intra-annotation correlation to enhance performance. At test time, we use both the trained probabilistic model and the language model to automatically assign pathological annotations to new cases.

In the second approach, we use deeper semantics from both images and text, mapping the hematoma regions in the images to the pathology terms in the text explicitly by extracting and matching anatomical information from both resources. We explore hematoma visual features in both 3D and 2D and classify the images into different classes of pathological change, so that the images can be searched and retrieved by pathological annotation.

BIODATA:
Gong Tianxia is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by A/P Tan Chew Lim. She received her bachelor's degree in Computer Engineering at SOC in 2006. Her research interests are in information retrieval and medical text processing.

ABSTRACT:
Recently, Natural Language Processing (NLP) has benefited greatly from progress in machine learning methods for large-scale data-driven applications. Some NLP tasks require complex data representations to deeply analyze syntactic and semantic features. In many cases the input data is represented as sequences, trees, or even graphs. Traditional feature-based methods transform these structured inputs into vectorial representations through sophisticated feature engineering, which arguably cannot fully exploit the structural features. Alternatively, kernel methods can explore a very high-dimensional feature space for these complex input structures without explicitly representing the input as a feature vector. For tree structures, tree kernels can explore the subtree features in parse trees without explicitly enumerating each type of subtree.
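
As a sketch of the underlying idea, a minimal convolution tree kernel in the style of Collins and Duffy counts the tree fragments two parse trees share without ever enumerating them. The Node class and the toy trees below are illustrative assumptions, not the dissertation's own kernels:

```python
from itertools import product

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def nodes(tree):
    """All nodes of the tree in pre-order."""
    out = [tree]
    for child in tree.children:
        out.extend(nodes(child))
    return out

def production(node):
    return (node.label, tuple(c.label for c in node.children))

def delta(a, b, lam=1.0):
    """Weighted count of common fragments rooted at nodes a and b."""
    if not a.children or not b.children or production(a) != production(b):
        return 0.0
    if all(not c.children for c in a.children):  # matching pre-terminals
        return lam
    score = lam
    for ca, cb in zip(a.children, b.children):
        score *= 1.0 + delta(ca, cb, lam)
    return score

def tree_kernel(t1, t2, lam=1.0):
    """Sum delta over all node pairs: an implicit subtree-feature dot product."""
    return sum(delta(a, b, lam) for a, b in product(nodes(t1), nodes(t2)))

t1 = Node("S", [Node("NP", [Node("John")]), Node("VP", [Node("runs")])])
t2 = Node("S", [Node("NP", [Node("Mary")]), Node("VP", [Node("runs")])])
tree_kernel(t1, t2)  # 3.0: VP->runs, S->NP VP, and S with its VP expanded
```

The recursion explores an exponential fragment space in quadratic time, which is exactly the property the dissertation builds on when moving from single subtrees to subtree sequences.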

However, previous tree kernels explore structure features through a single-subtree representation. Large single subtrees may be sparse in the data set, which prevents large structures from being effectively utilized. Sometimes only certain parts of a large subtree are beneficial rather than the entire subtree; in this case, using the entire structure may introduce noise. To address these deficiencies, this dissertation systematically investigates the phrase parse tree and designs more sophisticated kernels to deeply explore the structure features embedded in phrase parse trees beyond the single-subtree representation.

Specifically, this dissertation proposes tree sequence based kernels that adopt the subtree sequence structure as the basic feature type for exploring structure features in phrase parse trees. A variety of kernels are built on the subtree sequence structure, and its advantages are demonstrated on various NLP applications. By means of tree (sequence) kernels over multiple parse trees, a kernel-based alignment model is proposed for the task of bilingual subtree alignment, with which translation performance can be effectively improved. From a more general perspective, this dissertation systematically explores disconnected structure features in parse trees by means of kernels, and on this point it may provide novel views of structure features for NLP applications.

BIODATA:
Mr Jun Sun is a Ph.D. student in the School of Computing (SoC), National University of Singapore (NUS), under the supervision of Prof. Chew Lim Tan and Dr. Min Zhang since 2007. He received his B.Sc. degree in Computer Science from Harbin Institute of Technology in 2006. His research interests include Machine Translation, Statistical Parsing and Statistical Machine Learning.

ABSTRACT:
In many record matching problems, the input data is either ambiguous or incomplete, making the record matching task difficult. However, for some domains, evidence for record matching decisions is readily available in large quantities on the Web. These resources may be retrieved by querying a search engine, making the Web a valuable resource. On the other hand, Web resources are slow to acquire compared to data that is already available in the input, and some Web resources must be acquired before others. Hence, it is necessary to acquire Web resources selectively and judiciously, while satisfying the acquisition dependencies between these resources.

This thesis has two major parts. In the first part, I propose methods for using information from the Web for record matching, establishing that acquiring web based resources can improve record matching tasks. In the second and larger part, I propose approaches for selective acquisition of web based resources for record matching tasks, with the aim of balancing acquisition costs and acquisition benefits. These approaches start from the more task-specific and move towards the more general, culminating in a framework for solving cost-sensitive resource acquisition problems with hierarchical dependencies. This graphical framework is versatile and can be applied to a large variety of problems. In the context of this framework, I propose an effective resource acquisition algorithm for record matching problems, taking particular characteristics of such problems into account.

BIODATA:
Yee Fan Tan presently holds the position of Chief System Architect in KAI Square Pte Ltd, a company started by alumni of School of Computing, National University of Singapore. Prior to joining KAI Square, he performed research work leading to a Ph.D. in the School of Computing, and was a member of the Web Information Retrieval / Natural Language Processing Group (WING).

ABSTRACT:
The literature on digital archives tends to place great emphasis on the "virtual" (i.e. intangible) nature of electronic resources. However, digital objects are created and perpetuated through physical things (e.g. charged magnetic particles, pulses of light, holes in disks). This materiality brings challenges, because data must be read from specific artifacts, which can become damaged or obsolete. The materiality of digital objects also brings unprecedented opportunities for description, interpretation and use. There is a substantial body of information within the underlying data structures of computer systems that can often be discovered or recovered. Because of the possibility of interacting with digital information at different levels, there is no single, canonical representation of digital data. To ensure integrity and future use, archivists and other information professionals must make decisions regarding treatment of materials at multiple levels of representation. In this talk, I will report on several projects that involve treatment of data - both computational methods and decision making processes - at multiple levels of representation.

BIODATA:
Christopher (Cal) Lee is an Associate Professor at the School of Information and Library Science at the University of North Carolina, Chapel Hill. He teaches archival administration, records management, digital curation, and information technology for managing digital collections. His research focuses on long-term curation of digital collections and stewardship (by individuals and information professionals) of personal digital archives. He is particularly interested in the practical and ethical implications of diffusing tools and methods into professional practice. Two major streams of his current research are personal digital archives and the application of digital forensics methods to the curation of digital collections.

ABSTRACT:
I will present the diverse research activities on human language technology in the Philippines, considering its context: geography, history and people. These projects include the formal representation of human languages and the processes involving these languages, cutting across various forms such as text, speech and video files. Both rule-based and example-based approaches have been used in various experiments and have been shown to be complementary in their computational behaviors. Applications on the languages that we have worked on include machine translation, natural language generation, information extraction, and audio and video processing.

BIODATA:
Rachel Edita O. Roxas is Associate Professor at the College of Computer Studies (CCS) of De La Salle University-Manila, a knowledge center of the Brothers of the Christian Schools. She is Dean of the College of Computer Studies, former Chair of the Natural Language Processing and Programming Languages academic areas, and holder of the Jonathan Dy Distinguished Professorial Chair of Computer Science. Her research interests lie in understanding (1) how new technologies can be used to represent and process natural languages, and (2) how daily use of computers can be enhanced through the use of natural language interfaces.

Prof. Roxas obtained her Ph.D. in Computer Science from the Australian National University under the supervision of Professor Malcolm Newey, with a dissertation on Proving the Correctness of Program Transformations using Higher Order Logic. She also obtained her Diploma in Computer Science from the Australian National University, and her Bachelor of Science in Computer Science cum laude from the University of the Philippines at Los Banos.

ABSTRACT:
The rapid growth of social media in recent years, exemplified by Facebook and Twitter, has led to a massive volume of user-generated informal text. This in turn has sparked a great deal of research interest in the social and linguistic properties of Twitter and other micro-blogs, and has brought significant new challenges for their utilization in natural language processing (NLP). To understand the inherent laws of conversion between informal and formal language, and further to enable better mutual understanding and communication among speakers of different languages in social media, we believe the tasks of informal language normalization and translation are indispensable. In this talk, I will review related literature on informal language normalization, OOV word translation, and corpus collection by crowdsourcing for machine translation. In addition, I will introduce the social network application named FLashMob (currently hosted on Facebook), which aims to crowdsource informal/formal pairs and bilingual pairs, thus building training corpora for the two tasks. This work is supported by the China-Singapore Institute of Digital Media (CSIDM).

BIODATA:
Aobo Wang has been a Ph.D. student under the supervision of Prof Min-Yen Kan in School of Computing, National University of Singapore since August 2009. His research interests include natural language processing and information retrieval. Specifically, he is working on Human Computation and Machine Translation.

ABSTRACT:
This paper describes a novel probabilistic approach for generating natural language sentences from their underlying semantics in the form of typed lambda calculus. The approach is built on top of a novel reduction-based weighted synchronous context free grammar formalism, which facilitates the transformation process from typed lambda calculus into natural language sentences. Sentences can then be generated based on such grammar rules with a log-linear model. To acquire such grammar rules automatically in an unsupervised manner, we also propose a novel approach with a generative model, which maps from sub-expressions of logical forms to word sequences in natural language sentences. Experiments on benchmark datasets for both English and Chinese generation tasks yield significant improvements over results obtained by two state-of-the-art machine translation models, in terms of both automatic metrics and human evaluation.

BIODATA:
Dr. LU Wei obtained his PhD from the Singapore-MIT Alliance (SMA, Computer Science Programme) of the National University of Singapore in 2009. He obtained his Bachelor of Computing (Computer Science, Honours I) from School of Computing, NUS in 2005. He subsequently obtained his MSc in SMA in 2006. Currently he is working as a research fellow in the National University of Singapore.

ABSTRACT:
Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, as demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

BIODATA:
Daniel Dahlmeier is a Ph.D. candidate at the NUS Graduate School for Integrative Sciences and Engineering at the National University of Singapore. He received his diploma in Informatics from the University of Karlsruhe in 2008. Daniel's research interests are in the area of computational linguistics and natural language processing. In particular, he is interested in automated grammatical error correction for language learners. He is supervised in his research by Assoc. Prof Hwee Tou Ng.

ABSTRACT:
Word ambiguity and vocabulary mismatch are critical problems in information retrieval. To deal with these problems, this paper proposes the use of translated words to enrich document representation, going beyond the words in the original source language to represent a document. In our approach, each original document is automatically translated into an auxiliary language, and the resulting translated document serves as a semantically enhanced representation supplementing the original bag of words. The core of our translation representation is the expected term frequency of a word in a translated document, which is calculated by averaging the term frequencies over all possible translations rather than focusing on the 1-best translation only. To improve translation efficiency, we do not rely on full-fledged machine translation, but instead use monotonic translation, removing the time-consuming reordering component. Experiments carried out on standard TREC test collections show that our proposed translation representation leads to statistically significant improvements over using only the original language of the document collection.
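
The core quantity is easy to state: the expected term frequency of a word is its probability-weighted average count over the candidate translations. A minimal sketch, where the n-best list and its weights are invented for illustration:

```python
def expected_tf(word, translations):
    """E[tf(word)] over a weighted n-best list of translations.

    `translations` is a list of (probability, token_list) pairs;
    the probabilities are assumed to be normalized over the list.
    """
    return sum(p * tokens.count(word) for p, tokens in translations)

nbest = [(0.6, ["bank", "account"]),            # 1-best translation
         (0.3, ["bank", "bank", "deposit"]),
         (0.1, ["finance", "account"])]
expected_tf("bank", nbest)  # 0.6*1 + 0.3*2 + 0.1*0 = 1.2
```

Averaging over the whole list rather than the 1-best translation is what lets lower-ranked but plausible translations still contribute evidence to the enriched document representation.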

BIODATA:
Dr Na is a Research Fellow at the National University of Singapore (NUS). He received his PhD in Computer Science from POSTECH in South Korea. His main research interests are information retrieval and natural language processing.

ABSTRACT:
Current commercial search engines face great challenges from the rapid growth of WWW data and user requests. In order to improve their performance, we believe that the behavior information of search users (the wisdom of the crowds) can be utilized. With the support of the National Natural Science Foundation and the Tsinghua Joint Lab on Web Search Technology, tangible progress has been made in this direction. In this talk, I will introduce recent progress in our work covering Web page quality estimation, Web spam page identification, search performance evaluation and search engine query recommendation. Algorithms and systems developed in these projects have been adopted by Sogou.com, one of the most popular search engines in China.

BIODATA:
Dr. Yiqun Liu obtained his Ph.D. degree in 2007 from Tsinghua University, Beijing, China. Since then he has been working as an assistant professor in the Department of Computer Science and Technology at Tsinghua. He is generally interested in Web search technology and Web user behavior analysis, with a particular focus on improving search performance with the help of user behavior analysis. He has published a number of high quality papers in journals and conferences including JIR, JASIST, IJCAI, WWW, CIKM and WSDM. He was also the winner of the Youth Innovation Prize of the "Qian Weichang" Chinese Information Processing Award in 2010.

ABSTRACT:
Speech recognition machines are in use in more and more devices and services. Airlines, banks, and telephone companies provide information to customers via spoken queries. You can buy hand-held devices, appliances, and PCs that are operated by spoken commands.

And, for around $100, you can buy a program for your laptop that will transcribe speech into text. Unfortunately, automatic speech recognition systems are quite error prone and do not understand the meanings of spoken messages in any significant way. I argue that to do so, speech recognition machines would have to possess the same kinds of cognitive abilities that humans display. Engineers have been trying to build machines with human-like abilities to think and use language for nearly 60 years without much success. Are all such efforts doomed to failure? Maybe not. I suggest that if we take a radically different approach, we might succeed.

If, instead of trying to program machines to behave intelligently, we design them to learn by experiencing the real world in the same way a child does, we might solve the speech recognition problem in the process. This is the ambitious goal of the research now being conducted in my laboratory. To date, we have constructed three robots that have attained some rudimentary visual navigation and object manipulation abilities which they can perform under spoken command. Our new experiments are being conducted with a fully anthropomorphic robot and we expect to use its much improved sensorimotor function to greatly advance automatic language acquisition.

BIODATA:
Stephen E. Levinson was born in New York City on September 27, 1944. He received the B. A. degree in Engineering Sciences from Harvard in 1966, and the M. S. and Ph.D. degrees in Electrical Engineering from the University of Rhode Island, Kingston, Rhode Island in 1972 and 1974, respectively. From 1966-1969 he was a design engineer at Electric Boat Division of General Dynamics in Groton, Connecticut. From 1974-1976 he held a J. Willard Gibbs Instructorship in Computer Science at Yale University. In 1976, he joined the technical staff of Bell Laboratories in Murray Hill, NJ where he conducted research in the areas of speech recognition and understanding. In 1979 he was a visiting researcher at the NTT Musashino Electrical Communication Laboratory in Tokyo, Japan. In 1984, he held a visiting fellowship in the Engineering Department at Cambridge University. In 1990, Dr. Levinson became head of the Linguistics Research Department at AT&T Bell Laboratories where he directed research in Speech Synthesis, Speech Recognition and Spoken Language Translation.

In 1997, he joined the Department of Electrical and Computer Engineering of the University of Illinois at Urbana-Champaign where he teaches courses in Speech and Language Processing and leads research projects in speech synthesis and automatic language acquisition. Dr. Levinson is a member of the Association for Computing Machinery, a fellow of the Institute of Electrical and Electronic Engineers and a fellow of the Acoustical Society of America. He is a founding editor of the journal Computer Speech and Language and a former member and chair of the Industrial Advisory Board of the CAIP Center at Rutgers University. He is the author of more than 80 technical papers and holds seven patents. His new book is entitled "Mathematical Models for Speech Technology".

ABSTRACT:
In this talk, I will introduce the machine learning technologies for query document matching in search, which we have developed at MSRA. I will start my talk by pointing out that query-document mismatch is one of the major challenges for web search. I will explain how we think that the challenge can be addressed. Specifically, it is necessary to perform better query understanding and document understanding, and conduct query-document matching at sense, topic, and structure levels. Next I will describe the machine learning techniques which we have developed for query document matching, including Learning of Query Document Matching Model and Regularized Latent Semantic Indexing. I will also discuss the relationships between our methods and existing methods. Finally, I will outline future research directions for query-document matching.

BIODATA:
Hang Li is a senior researcher and research manager at Microsoft Research Asia. He is also an adjunct professor at Peking University, Nanjing University, Xi'an Jiaotong University, and Nankai University. His research areas include information retrieval, natural language processing, statistical machine learning, and data mining. He graduated from Kyoto University in 1988 and earned his PhD from the University of Tokyo in 1998. He worked at the NEC lab in Japan from 1991 to 2001, and joined Microsoft Research Asia in 2001, where he has been working ever since.

In this paper, we address the topic of aspect ranking, which aims to automatically identify important product aspects from online consumer reviews. Important aspects are identified according to two observations: (a) the important aspects of a product are usually commented on by a large number of consumers; and (b) consumers' opinions on the important aspects greatly influence their overall opinions on the product. We develop an approach for this task and conduct experiments on 11 popular products in four domains. The results demonstrate the effectiveness of our approach. We further apply the aspect ranking results to document-level sentiment classification and improve its performance significantly.
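
As a rough illustration of how the two observations could be combined, the toy scorer below weights each aspect by mention frequency (observation a) and by how often opinions on it agree with the overall opinion (a crude proxy for observation b). This scoring is a simplification invented here, not the paper's actual model:

```python
from collections import defaultdict

def rank_aspects(reviews):
    """reviews: (aspect, opinion_on_aspect, overall_opinion) triples.
    Score = mention frequency * agreement rate with the overall opinion."""
    freq, agree = defaultdict(int), defaultdict(int)
    for aspect, opinion, overall in reviews:
        freq[aspect] += 1
        agree[aspect] += int(opinion == overall)
    total = sum(freq.values())
    scores = {a: (freq[a] / total) * (agree[a] / freq[a]) for a in freq}
    return sorted(scores, key=scores.get, reverse=True)

reviews = [("battery", "+", "+"), ("battery", "-", "-"),
           ("battery", "+", "+"), ("case", "+", "-")]
rank_aspects(reviews)  # ['battery', 'case']
```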

In this paper, with a belief that a language model that embraces a larger context provides better prediction ability, we present two extensions to standard n-gram language models in statistical machine translation: a backward language model that augments the conventional forward language model, and a mutual information trigger model which captures long-distance dependencies that go beyond the scope of standard n-gram language models. We integrate the two proposed models into phrase-based statistical machine translation and conduct experiments on large-scale training data to investigate their effectiveness. Our experimental results show that both models are able to significantly improve translation quality and collectively achieve up to 1 BLEU point over a competitive baseline.
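
A backward language model is simply a forward model trained on reversed sentences, so each word is predicted from the word that follows it. The toy maximum-likelihood bigram sketch below (no smoothing, invented corpus) illustrates the idea:

```python
import math
from collections import Counter

def train_bigram(corpus):
    """MLE bigram and history counts over sentences with boundary markers."""
    big, uni = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks[:-1])
        big.update(zip(toks[:-1], toks[1:]))
    return big, uni

def logprob(sent, model):
    big, uni = model
    toks = ["<s>"] + sent + ["</s>"]
    return sum(math.log(big[(h, w)] / uni[h])
               for h, w in zip(toks[:-1], toks[1:]))

def backward_logprob(sent, model):
    """Score right-to-left: each word is predicted from its successor."""
    return logprob(sent[::-1], model)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bwd = train_bigram([s[::-1] for s in corpus])
backward_logprob(["the", "cat", "sat"], bwd)  # log(0.5): only "cat" is uncertain
```

In a real decoder the forward and backward scores would enter the log-linear model as separate features, so their weights can be tuned independently.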

A semantic framework for machine translation evaluation is presented, which is able to operate without the need for reference translations. It is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. A comparative analysis with conventional evaluation metrics is conducted over a large collection of human evaluations involving five European languages.

Machine transliteration is defined as automatic phonetic translation of names across languages. In this paper, we propose synchronous adaptor grammar, a novel nonparametric Bayesian learning approach, for machine transliteration. This model provides a general framework without heuristic or restriction to automatically learn syllable equivalents between languages. The proposed model outperforms the state-of-the-art EM-based model in the English to Chinese transliteration task.

A semantic feature for statistical machine translation is proposed and evaluated. This model aims at favouring those translation units that were extracted from training sentences that are semantically related to the current input sentence being translated. Experimental results on a Spanish-to-English translation task on the Bible corpus demonstrate a significant improvement on translation quality with respect to a baseline system.

This paper presents a procedure for extracting transfer rules for multi-word expressions from parallel corpora for use in a rule-based Japanese-English MT system. We show that adding the multi-word rules improves translation quality, and we sketch ideas for learning more such rules.

ABSTRACT:
While traditional question answering (QA) systems tailored to the TREC QA task work relatively well for simple questions, they do not suffice to answer real world questions. Community-based QA (cQA) systems offer this service well, as they contain large archives of such questions for which manually crafted answers are directly available. However, question and answer retrieval in a cQA archive is not trivial. In particular, I identify three major challenges in building up such a QA system: (1) matching of complex online questions; (2) multi-sentence questions mixed with context sentences; and (3) a mixture of sub-answers corresponding to sub-questions.

To tackle these challenges, I focus my research in developing advanced techniques to deal with complicated matching issues and segmentation problems for cQA questions and answers, including: 1) A Syntactic Tree Matching model based on a comprehensive tree weighing scheme to flexibly match cQA questions at the lexical and syntactic levels to find similar questions; 2) A question segmentation model to differentiate sub-questions of different topics and align them to the corresponding context sentences; and 3) An answer segmentation model to model the question-answer relationships to segment multi-topic answer sentences and perform question-answer alignment.

The main contributions of this thesis are in developing the syntactic tree matching model to flexibly match online questions coupled with various grammatical errors, and the segmentation models to sort out different sub-questions and sub-answers for better and more precise cQA retrieval. These models are generic in the sense that they can be applied to other related applications.

BIODATA:
Wang Kai is a Ph.D. candidate in SoC supervised by Prof. Chua Tat-Seng. He received his Bachelor's degree in Computer Engineering from Nanyang Technological University. His research interests lie in Information Retrieval/Extraction, Natural Language Processing, Question Answering and Sentiment Analysis.

ABSTRACT:
We believe that VOC (voice of customer) analysis for CRM (customer relationship management) is quite important for most businesses and is one of the killer applications for NLP technologies. NEC runs a billion-dollar CRM business in Japan and is working to take it into the global market. In this talk, we will present two of our research activities related to VOC analysis. One is speech dialogue summarization with differentiation from call memos; this technique can produce summaries that include the distinctive parts of the spoken words in a dialogue. The other is a technique for Recognizing Textual Entailment (RTE). We implemented our original method in our RTE system and won the 3rd prize in the main task and the 2nd prize in the sub task at TAC 2010 (Text Analysis Conference 2010).

BIODATA:
Itaru Hosomi and Kai Ishikawa are principal researchers in the Information and Media Processing Laboratories, NEC Corporation, Japan. Mr. Hosomi received his master's degree in systems engineering from the Graduate School of Engineering, Kobe University. He is a member of the Semantic Web committee in Japan, which cooperates with W3C. His research focuses on NLP and semantic technology, including ontology modeling and construction for information extraction.

Mr. Ishikawa received his master's degree in physics from the Graduate School of Science, The University of Tokyo. He won the best presentation award from the Association for Natural Language Processing, Japan in 1997 and the AAMT Nagao award in 2006. He has been working on research and development in NLP, including speech translation, machine translation, document summarization, text mining, and recognizing textual entailment.

Discourse parsing is a fundamental task in natural language processing (NLP) that aims to study the discourse relations between text units and to automatically construct the internal discourse structure formed by these relations. Over the last couple of decades, a number of discourse frameworks have been proposed and studied, and researchers have proposed various algorithms to automatically label and construct discourse structures. Recently, many discourse studies have been based on machine learning approaches, which require large annotated data sets. The Penn Discourse Treebank (PDTB) is a newly released discourse-level annotated corpus that aims to fill this need.

The purpose of this thesis proposal is three-fold. First, I will present my current research on building an implicit discourse relation classifier for the PDTB and an end-to-end discourse parser that parses free text into a PDTB-style representation. Discourse has a tight connection with text coherence in the way it organizes and presents the flow of argumentation of a text. I want to study this connection by proposing a coherence model that utilizes discourse relations and can evaluate the coherence quality of a text. Lastly, I want to show that recovering the discourse structure of a text will improve downstream NLP applications such as summarization, argumentative zoning, and question answering. I have completed the first part of my research and am currently working on the second part, coherence modeling. At the end of this proposal, I describe user applications that I propose to implement as the third part of my thesis, and I conclude with my proposed research timeline.

2010

ABSTRACT:
In information exchange networks such as email and blog networks, most processes are carried out through the exchange of messages. Behavioral analysis of such networks leads to interesting insights that are quite valuable for organizational or social analysis.

In this talk, I will discuss user engagingness and responsiveness as two interaction behaviors that help us understand information exchange networks: engaging actors are those who can effectively solicit responses from other actors, while responsive actors are those who are willing to respond to other actors. By modeling such behaviors, we are able to measure them and to identify engaging or responsive actors. We systematically propose a variety of novel models to quantify the engagingness and responsiveness of actors in the Enron email network. We also study the email reply order prediction problem, for which we present a learning model with behavior features defined by the various engagingness and responsiveness models. Furthermore, we study user behaviors in mobile messaging and friendship linking using data collected from a large mobile social network service known as myGamma (m.mygamma.com).
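
As a minimal sketch of the simplest such measures, the reply-rate definitions below capture the intuition; they are a simplification invented for illustration, and the talk's models are more sophisticated:

```python
def interaction_scores(messages):
    """messages: (sender, recipient, got_reply) triples.
    Engagingness of u: fraction of u's sent messages that drew a reply.
    Responsiveness of u: fraction of messages u received that u answered."""
    sent, drew, recv, gave = {}, {}, {}, {}
    for s, r, replied in messages:
        sent[s] = sent.get(s, 0) + 1
        drew[s] = drew.get(s, 0) + int(replied)
        recv[r] = recv.get(r, 0) + 1
        gave[r] = gave.get(r, 0) + int(replied)
    engaging = {u: drew[u] / sent[u] for u in sent}
    responsive = {u: gave[u] / recv[u] for u in recv}
    return engaging, responsive

msgs = [("alice", "bob", True), ("alice", "bob", False), ("carol", "bob", True)]
engaging, responsive = interaction_scores(msgs)
```

Note the asymmetry: engagingness is a property of senders, responsiveness of recipients, which is why the two scores are computed over different sides of each message.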

BIODATA:
Dr. Byung-Won On has worked as a research staff in Advanced Digital Sciences Center of Illinois at Singapore Pte Ltd. In 2007, he received his Ph.D. degree in the Department of Computer Science and Engineering at Pennsylvania State University. Previously, he worked as a post-doctoral fellow at AT&T Labs, University of British Columbia, and Singapore Management University. His research interests are entity resolution and data quality, social network analysis and mining, and unsupervised learning methods.

This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
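
A toy sketch of the PU learning idea referenced above: a two-step learner that first mines "reliable negatives" from the unlabeled set, then classifies against both profiles. The word-overlap scorer, the fixed negative quantile, and the example documents are invented for illustration and are much simpler than real PU learners:

```python
from collections import Counter

def word_profile(docs):
    """Fraction of docs containing each word."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return {w: c / len(docs) for w, c in counts.items()}

def score(doc, profile):
    words = set(doc)
    return sum(profile.get(w, 0.0) for w in words) / max(len(words), 1)

def pu_learn(pos, unlabeled, neg_quantile=0.5):
    """Step 1: take the unlabeled docs least similar to the positive
    profile as reliable negatives.  Step 2: label each unlabeled doc
    by whichever profile it scores higher against (Rocchio-style)."""
    p_prof = word_profile(pos)
    ranked = sorted(unlabeled, key=lambda d: score(d, p_prof))
    reliable_neg = ranked[: int(len(ranked) * neg_quantile)] or ranked[:1]
    n_prof = word_profile(reliable_neg)
    return [(d, "pos" if score(d, p_prof) >= score(d, n_prof) else "neg")
            for d in unlabeled]

pos = [["win", "prize", "money"], ["free", "money", "win"]]
unlabeled = [["win", "money", "now"], ["meeting", "agenda", "notes"],
             ["free", "prize"], ["project", "notes"]]
results = pu_learn(pos, unlabeled)
```

The key point matching the paper's argument: no labeled negative examples are ever required, only positives and an unlabeled pool.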

We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.

This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.

This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree-to-tree-sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypotheses using a tree-to-string model and evaluate their syntactic goodness using a tree-to-tree/tree-sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experimental results on the benchmark data show that our two proposed solutions are very effective, achieving significant performance improvements over baselines when applied to different translation models.

Event anaphora resolution is an important task for cascaded event template extraction and other NLP studies. Previous work has only touched on event pronoun resolution. In this paper, we provide the first systematic study of resolving event noun phrases to their verbal mentions across long distances. Our study shows that various lexical, syntactic and positional features are needed for event noun phrase resolution, and that most of them, such as morphological relations and synonyms, differ from the features used for conventional noun phrase resolution. Syntactic structural information in the parse tree, modeled with a tree kernel, is combined with the above diverse flat features using a composite kernel, which yields more than a 10% F-score improvement over the flat-feature baseline. In addition, we employ a twin-candidate based model to capture pair-wise candidate preference knowledge, which further demonstrates a statistically significant improvement. All of the above contributes to an encouraging performance of 61.36% F-score on the OntoNotes corpus.

We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

ABSTRACT:
In this work, we study the task of personalized tag recommendation in social tagging systems. To reach out to tags beyond the existing vocabularies of the query resource and of the query user, we examine recommendation methods based on personomy translation, and propose a probabilistic framework for incorporating translations by similar users (neighbors). We propose to use distributional divergence to measure the similarity between users in the context of personomy translation, and examine two variations of such similarity measures. We evaluate the proposed framework on a benchmark dataset collected from BibSonomy, and compare it with personomy translation methods based solely on the query user and with collaborative filtering. Our experimental results show that our neighbor-based translation methods significantly outperform these baseline methods. Moreover, we show that the translations borrowed from neighbors indeed help rank relevant tags higher than translation based solely on the query user. This is a practice talk for the Second IEEE International Conference on Social Computing (SocialCom2010) in August 2010.

BIODATA:
HU Meiqun is a Ph.D. candidate in the School of Information Systems, Singapore Management University. She is advised by Professor LIM Ee-Peng and Assistant Professor JIANG Jing. Her research focuses on data mining and knowledge management in general. Currently, her primary research interests include predictive modeling for information processing in social media. Prior to her research on social tagging systems as one type of social media, she worked on modeling interaction in Web 2.0 communities and assessing the quality of collaborative user-generated content on the Web. She also investigated the use of quality assessment in improving Web 2.0 applications for content search and recommendation. Meiqun obtained her Bachelor of Engineering degree in Computer Engineering from Nanyang Technological University, Singapore in 2006.

A large body of prior research on coreference resolution recasts the problem as a two-class classification problem. However, standard supervised machine learning algorithms that minimize classification errors on the training instances do not always lead to maximizing the F-measure of the chosen evaluation metric for coreference resolution. In this paper, we propose a novel approach comprising the use of instance weighting and beam search to maximize the evaluation metric score on the training corpus during training. Experimental results show that this approach achieves significant improvement over the state-of-the-art. We report results on standard benchmark corpora (two MUC corpora and three ACE corpora), when evaluated using the link-based MUC metric and the mention-based B-CUBED metric.

Question detection serves important purposes in many research fields. While detecting questions in standard language corpora is relatively easy, it becomes a great challenge for online content. Question threads posted online are usually long, containing multiple sub-questions, and question sentences may be written in an informal style where standard features such as question marks or 5W1H words are likely to be absent. In this paper, we explore question characteristics in community-based question answering services, and propose an automated approach to detect question sentences based on lexical and syntactic features. Our model is capable of handling informal online language. The empirical evaluation results further demonstrate that our model significantly outperforms traditional methods in detecting online question sentences, and it considerably boosts question retrieval performance in cQA.

Entity linking maps entity mentions in a document to their representations in a knowledge base (KB). In this paper, we propose to use additional information sources from Wikipedia to find more name variations for the entity linking task. In addition, as manually creating a training corpus for entity linking is labor-intensive and costly, we present a novel method to automatically generate a large-scale annotated corpus for ambiguous mentions, leveraging their unambiguous synonyms in the document collection. A binary classifier is then trained to filter out KB entities that are not similar to the current mentions. This classifier not only effectively reduces the ambiguity with respect to existing entities in the KB, but is also very useful for highlighting new entities for further KB population. Furthermore, we leverage Wikipedia documents to provide additional information that is not available in our generated corpus, through a domain adaptation approach that yields further performance improvements. The experimental results show that our proposed method outperforms the state-of-the-art approaches.

In this paper, we present an unsupervised hybrid model which combines statistical, lexical, linguistic, contextual, and temporal features in a generic EM-based framework to harvest bilingual terminology from comparable corpora under a comparable document alignment constraint. The model is configurable for any language and is extensible with additional features. Overall, it produces a considerable improvement in performance over the baseline method. On top of that, our model shows a promising capability to discover new bilingual terminology with limited use of dictionaries.

We demonstrate the use of context features, namely names of places, and unlabelled data for detecting the language of origin of personal names. While some early work used either rule-based methods or n-gram statistical models to determine a name's language of origin, we use a discriminative maximum entropy model and view the task as a classification task. We bootstrap the learning using a list of names out of context but with known origin, and then use the expectation-maximisation algorithm to further train the model on a large corpus of names of unknown origin but with context features. Using a relatively small unlabelled corpus, we improve the accuracy of name origin recognition for names written in Chinese from 82.7% to 85.8%, a significant reduction in the error rate. The improvement in F-score for infrequent Japanese names is even greater: from 77.4% without context features to 82.8% with context features.

This paper presents two pivot strategies for statistical machine transliteration, namely a system-based pivot strategy and a model-based pivot strategy. Given two independent source-pivot and pivot-target name pair corpora, the model-based strategy learns a direct source-target transliteration model, while the system-based strategy learns a source-pivot model and a pivot-target model. Experimental results on benchmark data show that the system-based pivot strategy is effective in reducing the high resource requirement of training corpora for low-density language pairs, while the model-based pivot strategy performs worse than the system-based one.

We introduce the novel problem of automatic related work summarization. Given multiple articles (e.g., conference/journal papers) as input, a related work summarization system creates a topic-biased summary of related work specific to the target paper. Our prototype Related Work Summarization system, ReWoS, takes in a set of keywords, arranged in a hierarchical fashion, that describes a target paper's topics, to drive the creation of an extractive summary, using two different strategies for locating appropriate sentences for general topics as well as detailed ones. Our initial results show an improvement over generic multi-document summarization baselines in a human evaluation.

2009

ABSTRACT:
Word Sense Disambiguation (WSD) is a fundamental task in Natural Language Processing (NLP). According to the results of the SensEval workshops, supervised methods achieve state-of-the-art accuracy for WSD. The performance of supervised methods is affected by the fine-grained sense inventory, the domain adaptation problem, and the lack of training examples. Moreover, in the SensEval workshops, WSD is evaluated as an isolated task. The word senses are predefined without considering real applications. As a result, few WSD systems are integrated as components of other applications that are supposed to benefit from WSD, such as machine translation and information retrieval.

In this proposal, I build a WSD system which achieves a high accuracy of 89.1% on the OntoNotes data set with a coarse-grained sense inventory. With this system, I propose a method combining the feature augmentation domain adaptation technique with active learning to address the domain adaptation problem. To overcome the lack of sense-annotated data, I propose an approach to extract training examples from parallel corpora without extra human effort. Finally, with the knowledge of WSD, I plan to exploit the potential of incorporating WSD into other tasks.

BIODATA:
Zhong Zhi is currently a Ph.D. candidate in the School of Computing, NUS, under the supervision of Prof. Ng Hwee Tou. He received his Bachelor's degree from Fudan University in 2006. His primary research interest is Natural Language Processing, especially Word Sense Disambiguation.

ABSTRACT:
In a linguistically-motivated syntax-based translation system, the entire translation process is normally carried out in two steps: translation rule matching and target sentence decoding using the matched rules. Both steps are very time-consuming due to the exponential number of translation rules, the exhaustive search in translation rule matching, and the complex nature of the translation task itself. In this talk, we will introduce a hyper-tree-based fast algorithm for translation rule matching. Experimental results on the NIST MT-2003 Chinese-English translation task show that this algorithm is at least 19 times faster in rule matching and helps save 57% of overall translation time over previous methods when using large fragment translation rules.

BIODATA:
Zhang Hui is a Research Assistant at CHIME Lab, co-supervised by Prof. Tan Chew Lim (NUS) and Dr. Zhang Min (I2R). He received his Bachelor's and Master's degrees from Xiamen University in 2004 and 2007 respectively. His research interests are Statistical Machine Translation and Natural Language Syntax Parsing.

ABSTRACT:
Bootstrapping is the process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier. It is often used in situations where labeled training data is scarce but unlabeled data is abundant. In this paper, we consider the problem of domain adaptation: the situation where training data may not be scarce, but belongs to a different domain from the target application domain. As the distribution of unlabeled data is different from the training data, standard bootstrapping often has difficulty selecting informative data to add to the training set. We propose an effective domain adaptive bootstrapping algorithm that makes use of the unlabeled target domain data that are informative about the target domain and easy to automatically label correctly. We call these instances bridges, as they are used to bridge the source domain to the target domain. We show that the method outperforms supervised, transductive and bootstrapping algorithms on the named entity recognition task.
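The iterate-and-add loop at the heart of bootstrapping can be sketched as follows. This is a minimal illustration using a toy 1-D nearest-centroid classifier and a simple confidence margin; the talk's bridge-selection criterion is more involved, and all data here is invented.

```python
def centroids(labeled):
    """Per-class mean of 1-D feature values."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict_with_margin(x, cents):
    """Return (predicted label, margin); a larger margin means more confident."""
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    (d0, y0), (d1, _) = dists[0], dists[1]
    return y0, d1 - d0

def bootstrap(labeled, unlabeled, rounds=3, per_round=1):
    """Each round, self-label the most confident unlabeled instance (a "bridge"),
    add it to the training set, and retrain (recompute the centroids)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        cents = centroids(labeled)
        scored = sorted(((predict_with_margin(x, cents), x) for x in pool),
                        key=lambda t: -t[0][1])
        for (label, _margin), x in scored[:per_round]:
            labeled.append((x, label))
            pool.remove(x)
    return labeled

source = [(0.0, "A"), (1.0, "A"), (9.0, "B"), (10.0, "B")]   # source-domain data
target_unlabeled = [0.5, 9.5, 5.2]                           # target-domain pool
result = bootstrap(source, target_unlabeled)
```

Note how the easy, high-confidence target instances (0.5 and 9.5) get absorbed first, shifting the centroids toward the target domain before the harder instance (5.2) is labeled.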

BIODATA:
Wu Dan is currently a PhD candidate at the Singapore-MIT Alliance of National University of Singapore. Her thesis advisers are Assoc. Prof. Lee Wee Sun (NUS) and Prof. Tomas Lozano-Perez (MIT). She received her Bachelor's degree in Computer Engineering from Nanyang Technological University. Her research interests include machine learning, domain adaptation and natural language processing.

ABSTRACT:
Lexical Association Measures (AMs) have been employed by past work in extracting multiword expressions. Our work contributes to the understanding of these AMs by categorizing them into two groups and suggesting the use of rank equivalence to group AMs with the same ranking performance. In this work, we also examine how existing AMs can be adapted to better rank English verb particle constructions and light verb constructions. Specifically, we suggest normalizing (Pointwise) Mutual Information and using marginal frequencies to construct penalization terms. We empirically validate the effectiveness of these modified AMs in detection tasks in English on the Penn Treebank, showing significant improvements over the original AMs.
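For reference, plain PMI and one common normalization of it (NPMI, which divides by -log2 of the joint probability) can be computed from corpus counts as below; the counts are invented, and the authors' exact normalization and penalization terms may differ.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise Mutual Information: log2( p(x,y) / (p(x) * p(y)) )."""
    p_xy = n_xy / n_total
    return math.log2(p_xy / ((n_x / n_total) * (n_y / n_total)))

def npmi(n_xy, n_x, n_y, n_total):
    """PMI normalized into [-1, 1] by dividing by -log2 p(x,y)."""
    p_xy = n_xy / n_total
    return pmi(n_xy, n_x, n_y, n_total) / -math.log2(p_xy)

# Invented counts: a word pair occurring 50 times in a 1M-token corpus,
# whose component words occur 2000 and 1000 times respectively.
score = pmi(50, 2000, 1000, 1_000_000)
norm = npmi(50, 2000, 1000, 1_000_000)
```

Normalization matters for ranking because raw PMI grows without bound for rare pairs, so low-frequency collocation candidates can dominate the ranked list.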

BIODATA:
Hoang Huu Hung is a former HYP student from the School of Computing, NUS. His interests center on language phenomena and the automation of language processing.

ABSTRACT:
We present an implicit discourse relation classifier for the Penn Discourse Treebank (PDTB). Our classifier considers the context of the two arguments, word pair information, as well as the arguments' internal constituent and dependency parses. Our results on the PDTB yield a significant 14.1% improvement over the baseline. In our error analysis, we discuss four challenges in recognizing implicit relations in the PDTB.

BIODATA:
Lin Ziheng is a PhD student under the supervision of Prof. Kan Min-Yen and Prof. Ng Hwee Tou in School of Computing, National University of Singapore. His research interests include natural language processing and information retrieval. Specifically, he is working on discourse analysis and automatic text summarization.

ABSTRACT:
In this paper, we propose and implement a completely automatic approach to scale up word sense disambiguation to all words of English. Our approach relies on English-Chinese parallel corpora, English-Chinese bilingual dictionaries, and automatic methods of finding synonyms of Chinese words. No additional human sense annotations or word translations are needed. The evaluation results on OntoNotes 2.0 show that our approach is able to achieve high accuracy, outperforming the first-sense baseline and coming close to a prior reported approach that requires manual human efforts to provide Chinese translations of English senses.

BIODATA:
Zhong Zhi is currently a Ph.D. candidate in the School of Computing, NUS, under the supervision of Prof. Ng Hwee Tou. He received his Bachelor's degree from Fudan University in 2006. His primary research interest is Natural Language Processing, especially Word Sense Disambiguation.

Talk 2

"A 2-Poisson Model for Probabilistic Coreference of Named Entities for Improved Text Retrieval"
by
Na Seung-Hoon

ABSTRACT:
Text retrieval queries frequently contain named entities. The standard approach of term frequency weighting does not work well when estimating the term frequency of a named entity, since anaphoric expressions (like he, she, the movie, etc.) are frequently used to refer to named entities in a document, and the use of anaphoric expressions causes the term frequency of named entities to be underestimated. In this paper, we propose a novel 2-Poisson model to estimate the frequency of anaphoric expressions of a named entity, without explicitly resolving the anaphoric expressions. Our key assumption is that the frequency of anaphoric expressions is distributed over named entities in a document according to the probabilities of whether the document is elite for the named entities. This assumption leads us to formulate our proposed Co-referentially Enhanced Entity Frequency (CEEF). Experimental results on the text collection of the TREC Blog Track show that CEEF achieves significant and consistent improvements over state-of-the-art retrieval methods using standard term frequency estimation. In particular, we achieve a 3% increase in MAP over the best performing run of the TREC 2008 Blog Track.

BIODATA:
Dr. Na Seung-Hoon is a Research Fellow at the National University of Singapore (NUS). He received his PhD in Computer Science from POSTECH in South Korea. Dr. Seung-Hoon's main research interests are in the areas of information retrieval and natural language processing.

ABSTRACT:
The sense of a preposition is related to the semantics of its dominating prepositional phrase. Knowing the sense of a preposition could help to correctly classify the semantic role of the dominating prepositional phrase and vice versa. In this paper, we propose a joint probabilistic model for word sense disambiguation of prepositions and semantic role labeling of prepositional phrases. Our experiments on the PropBank corpus show that jointly learning the word sense and the semantic role leads to an improvement over state-of-the-art individual classifier models on the two tasks.

BIODATA:
Daniel is a PhD student in the NUS Graduate School for Integrative Sciences and Engineering (NGS) at the National University of Singapore (NUS). He is working under the supervision of Assoc. Prof. Ng Hwee Tou. He obtained his Diploma (Master) in Computer Science from the University of Karlsruhe in Germany in 2008. His current research interest focuses on the application of natural language processing for educational purposes.

ABSTRACT:
This paper presents an effective method for generating natural language sentences from their underlying meaning representations. The method is built on top of a hybrid tree representation that jointly encodes both the meaning representation as well as the natural language in a tree structure. By using a tree conditional random field on top of the hybrid tree representation, we are able to explicitly model phrase-level dependencies amongst neighboring natural language phrases and meaning representation components in a simple and natural way. We show that the additional dependencies captured by the tree conditional random field allow it to perform better than directly inverting a previously developed hybrid tree semantic parser. Furthermore, we demonstrate that the model performs better than a previous state-of-the-art natural language generation model. Experiments are performed on two benchmark corpora with standard automatic evaluation metrics.

BIODATA:
Lu Wei is currently a final year PhD student from the Singapore-MIT Alliance (Computer Science Program) of the National University of Singapore. His thesis advisers are Assoc. Prof. Ng Hwee Tou (NUS), Assoc. Prof. Lee Wee Sun (NUS), and Prof. Leslie Pack Kaelbling (MIT). He obtained his Bachelor of Computing (Computer Science, first class honors) from School of Computing, NUS in 2005, where he worked with Assoc. Prof. Kan Min-Yen. He later obtained his Msc from SMA in 2006. His research interests include semantic parsing, language generation and statistical machine translation.

ABSTRACT:
We propose a novel language-independent approach for improving statistical machine translation for resource-poor languages by exploiting their similarity to resource-rich ones. More precisely, we improve the translation from a resource-poor source language X1 into a resource-rich language Y given a bi-text containing a limited number of parallel sentences for X1-Y and a larger bi-text for X2-Y for some resource-rich language X2 that is closely related to X1. The evaluation for Indonesian->English (using Malay) and Spanish->English (using Portuguese and pretending Spanish is resource-poor) shows an absolute gain of up to 1.35 and 3.37 BLEU points, respectively, an improvement over rival approaches while using much less additional data.

BIODATA:
Dr. Preslav Nakov is a Research Fellow at the National University of Singapore (NUS). He received his PhD in Computer Science from the University of California at Berkeley. Dr. Nakov's main research interests are in the areas of machine translation and lexical semantics.

ABSTRACT:
While traditional question answering (QA) systems tailored to the TREC QA task work relatively well for simple questions, they do not suffice to answer real-world questions. Community-based QA systems serve this need well, as they contain large archives of such questions for which manually crafted answers are directly available. However, finding similar questions in the QA archive is not trivial. In this paper, we propose a new retrieval framework based on syntactic tree structure to tackle the similar question matching problem. We build a ground-truth set from Yahoo! Answers, and experimental results show that our method outperforms traditional bag-of-words or tree kernel based methods by 8.3% in mean average precision. It further achieves up to a 50% improvement by incorporating semantic features as well as matching of potential answers. Our model does not rely on training, and it is demonstrated to be robust against grammatical errors as well.

ABSTRACT:
Wikipedia provides a wealth of knowledge, where the first sentence, the infobox (and relevant sentences), and even the entire document of a wiki article can be considered diverse versions of summaries (definitions) of the target topic. We explore how to generate a series of summaries of various lengths based on them. To obtain more reliable associations between sentences, we introduce wiki concepts according to the internal links in Wikipedia. In addition, we develop an extended document concept lattice model to combine wiki concepts and non-textual features such as the outline and infobox. The model can concatenate representative sentences from non-overlapping salient local topics for summary generation. We test our model on our annotated wiki articles whose topics come from the TREC-QA 2004-2006 evaluations. The results show that the model is effective in summarization and definition QA.

BIODATA:
Shiren Ye received the PhD degree in computer software from the Institute of Computing, the Academy of Sciences in 2001. He is a senior research fellow in School of Computing, National University of Singapore. His research interests include Web and text mining, question answering, and natural language processing.

ABSTRACT:
We present FireCite, a Mozilla Firefox browser extension that helps scholars assess and manage scholarly references on the web by automatically detecting and parsing such reference strings in real-time. FireCite has two main components: 1) a reference string recognizer that has a high recall of 96%, and 2) a reference string parser that can process HTML web pages with an overall F1 of .878 and plaintext reference strings with an overall F1 of .97. In our preliminary evaluation, we presented our FireCite prototype to four academics in separate unstructured interviews. Their positive feedback gives evidence to the desirability of FireCite's citation management capabilities.

BIODATA:
Andy is a graduating final year student at NUS under Prof. Kan Min-Yen. This work was part of his final year project.

ABSTRACT:
This presentation will first give an overview of current state-of-the-art forest-based translation technologies and tree-sequence-based models; we will then discuss how to integrate the two categories of technologies into a single framework, including the theoretical and engineering challenges we face, the solution we propose, and the promising results we have obtained.

BIODATA:
Zhang Hui is a Research Assistant at CHIME Lab, co-supervised by Prof. Tan Chew Lim (NUS) and Dr. Zhang Min (I2R). He received his Bachelor's and Master's degrees from Xiamen University in 2004 and 2007 respectively. His research interests are Statistical Machine Translation and Natural Language Syntax Parsing.

ABSTRACT:
Although previously proposed syntax-based translation models have a satisfactory capability to handle the grammatical structure divergence and global reordering problems between source and target languages, some linguistic evidence is still not covered by these models: non-contiguous translational equivalences cannot be modeled. In this talk, I will present a more powerful synchronous grammar based model named Synchronous non-contiguous Tree Sequence Substitution Grammar (SncTSSG) to handle this issue. The proposed model is based on non-contiguous tree sequence alignment, where a non-contiguous tree sequence is a sequence of sub-trees and gaps. Compared with the contiguous tree sequence based model, the proposed model can handle non-contiguous phrases with arbitrarily large gaps. An algorithm targeting non-contiguous constituent decoding will also be presented.

BIODATA:
Sun Jun is a Ph.D. candidate in SoC, NUS, co-supervised by Prof. Tan Chew Lim at NUS and Dr. Zhang Min at I2R. He received his Bachelor's degree from Harbin Institute of Technology in 2006. His research interests include Statistical Machine Translation and Statistical Learning.

ABSTRACT:
This presentation will give an overview of current projects involving the use of semantic technologies and data mining in the study of research processes in biomedical research at Duke-NUS School of Medicine, with the explicit goal of opening a discussion regarding potential future collaborations with the Department of Computer Science at NUS. Specifically, the presentation will demonstrate (1) the use of semantic technologies for streamlining the aggregation of published articles (randomized controlled trials, diagnostic studies, and cohort studies) into meta-analyses; (2) the use of semantic technologies for the aggregation of different biomedical open data sources; (3) the use of data mining techniques to predict enrollment in clinical research studies; and (4) the use of information extraction methods for research resource mapping in Southeast Asia.

BIODATA:
Ricardo Pietrobon is an Associate Professor at Duke University (Durham, NC, US), with a fractional appointment at Duke-NUS School of Medicine (Singapore). Dr. Pietrobon, who previously served as the Director of Biomedical Informatics for the Duke Translational Medicine Institute, is currently the Associate Vice Chair for the Department of Surgery at the Duke University Medical Center. His research interests are focused on the study of processes in biomedical research, including the use of biomedical informatics in combination with other interdisciplinary methods.

ABSTRACT:
Tone and intonation play a crucial role across many languages. However, the use and structure of tone varies widely, ranging from lexical tone which determines word identity to pitch accent signalling information status. In this work, we employ a uniform representation of acoustic features for recognition of Mandarin tone, isiZulu tone, and English pitch accent. The representation captures both local tone height and shape as well as contextual coarticulatory and phrasal influences.

By exploiting multiclass Support Vector Machines as a discriminative classifier, we achieve competitive rates of tone and pitch accent recognition. We demonstrate the greater importance of modeling preceding local context, which yields up to a 24% reduction in error over modeling the following context, and further demonstrate that alternate acoustic features such as band energy can improve tone recognition for challenging cases such as neutral tone.

While prior approaches to this recognition task have relied upon fully supervised learning methods employing extensive collections of manually tagged data obtained at substantial time and financial cost, we next explore two approaches to tone learning with substantial reductions in training data. We employ both unsupervised clustering and semi-supervised learning to recognize pitch accent and tone, based on the intrinsic structure of the tones in acoustic space. In unsupervised tone and pitch accent clustering experiments, we achieve 75% to 96% of the accuracy rates achieved with large training data sets. For semi-supervised training with only small numbers of labeled examples, accuracies reach 90-98% of the levels obtained with hundreds or thousands of labeled examples. These results indicate that the intrinsic structure of tone and pitch accent acoustics can be exploited to reduce the need for costly labeled training data for tone learning and recognition.
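The cluster-then-label idea behind the semi-supervised experiments can be sketched minimally as below, using toy 1-D "pitch height" features and two labeled seed examples; the talk's acoustic features and clustering algorithms are considerably richer.

```python
def kmeans_1d(points, iters=20):
    """Two-cluster k-means on 1-D features, initialized at the extremes."""
    cents = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            i = 0 if abs(p - cents[0]) <= abs(p - cents[1]) else 1
            clusters[i].append(p)
        cents = [sum(c) / len(c) if c else cents[i]
                 for i, c in enumerate(clusters)]
    return cents

def label_clusters(cents, labeled_seeds):
    """Attach a tone label to each centroid using a few labeled examples."""
    return {i: min(labeled_seeds, key=lambda s: abs(s[0] - c))[1]
            for i, c in enumerate(cents)}

pitches = [0.1, 0.2, 0.15, 0.9, 0.95, 1.0]   # unlabeled syllable pitch heights
seeds = [(0.18, "low"), (0.92, "high")]      # two labeled seed examples
cents = kmeans_1d(pitches)
labels = label_clusters(cents, seeds)
```

The clustering itself needs no labels; the handful of labeled seeds is only used afterwards to name the clusters, which is why such methods can approach supervised accuracy with far less annotation.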

BIODATA:
Gina-Anne Levow is a Research Fellow at the National Centre for Text Mining in the School of Computer Science at the University of Manchester. From 2001 to 2008, she served as an Assistant Professor in the Computer Science Department at the University of Chicago, where she still holds an appointment as Research Associate (Assistant Professor). She received undergraduate degrees in Computer Science and Oriental Studies from the University of Pennsylvania and her Master's and Ph.D. from Massachusetts Institute of Technology. Her research is strongly multi-lingual and spans natural language processing, spoken language processing, and information retrieval. Specific areas of interest include prosody in discourse and dialogue, spoken and multi-lingual document retrieval, multi-modal interaction, and entity and event extraction.

ABSTRACT:
Recent research efforts on spoken document retrieval (SDR) have tried to overcome the low quality of 1-best automatic speech recognition transcripts -- especially for conversational speech -- by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. However, these efforts have invariably used the classical vector space retrieval model.

In this thesis, I present a lattice-based SDR method based on a statistical approach to information retrieval. I formulate a way to estimate statistical models for documents from expected word counts derived from lattices; query-document relevance is computed as a log probability under such models. Experiments show that my method outperforms statistical retrieval using 1-best transcripts, a recent lattice-based vector space method, and BM25 using lattice statistics.
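The statistical scoring step described above can be sketched as follows: a Dirichlet-smoothed unigram model is estimated from expected word counts, and a query is scored by its log probability under that model. The toy counts, vocabulary, and smoothing parameter below are invented stand-ins for what a real lattice and collection would provide, not values from the thesis.

```python
import math

def score_document(query, exp_counts, collection_lm, mu=2000.0):
    """Log probability of a query under a Dirichlet-smoothed unigram model
    estimated from a document lattice's expected word counts."""
    doc_len = sum(exp_counts.values())
    score = 0.0
    for w in query:
        p = (exp_counts.get(w, 0.0) + mu * collection_lm.get(w, 1e-9)) / (doc_len + mu)
        score += math.log(p)
    return score

# Toy expected counts, as if read off two document lattices; the collection
# model probabilities are likewise invented.
doc_a = {"stock": 2.4, "market": 1.7, "weather": 0.1}
doc_b = {"weather": 2.1, "rain": 1.5, "stock": 0.2}
collection = {"stock": 0.01, "market": 0.01, "weather": 0.01, "rain": 0.01}

query = ["stock", "market"]
ranked = sorted([("A", doc_a), ("B", doc_b)],
                key=lambda d: score_document(query, d[1], collection),
                reverse=True)
print([name for name, _ in ranked])  # document A matches the query better
```

Because expected counts are fractional, the same estimator applies whether the counts come from a 1-best transcript (integer counts) or a lattice, which is what makes the lattice statistics a drop-in replacement.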

I also extend my proposed SDR method to the task of query-by-example SDR -- retrieving documents from a speech corpus, where the queries are themselves full-fledged spoken documents (query exemplars).

BIODATA:
Chia Tee Kiah is a PhD candidate (2003-present) in the School of Computing, National University of Singapore. He is supervised by A/P Ng Hwee Tou from the School of Computing and Dr. Li Haizhou from the Institute for Infocomm Research. He received his B.Comp (Hons) from the National University of Singapore in 2003. He is currently working on methods for improving spoken document retrieval.

BIODATA:
David has drawn on his background in computational linguistics (research in academia and industry) to help build a community-driven translation tool. This technology allows Facebook to continue to grow at an incredible pace worldwide, and is available to developers on Platform, including external sites via Connect. David earlier worked on the News Feed team, and continues to dabble in natural language processing, engineering recruiting puzzles, and a variety of other efforts. He holds a BA with honors in linguistics from Brown University, and has published widely in journals, conferences, and book chapters, along with a pending patent.

ABSTRACT:
The Institute of Automation, Chinese Academy of Sciences (CASIA) has been working on spoken language translation (SLT) for over a decade. In recent years, the CASIA SLT system has performed well in international evaluations on spoken language translation, and the system has since been successfully developed into a commonly used mobile phone-based SLT application. In this talk I will introduce some of our work on CASIA SLT, including the translation models, system implementation, and experiments. In particular, a sentence type-based SLT model will be introduced in detail.

BIODATA:
Chengqing Zong is a professor of natural language processing and the deputy director of the National Laboratory of Pattern Recognition (NLPR) at the Chinese Academy of Sciences' Institute of Automation. NLPR is a national key laboratory of China, with over 300 people working in pattern recognition, multimedia, and machine intelligence. His research interests include machine translation, document classification, and human-computer dialogue systems, and he has over 60 publications in recent years. He is also a director of the Chinese Association of Artificial Intelligence and the Society of Chinese Information Processing, and an executive member of the Asian Federation of Natural Language Processing. He serves on the editorial boards of the Journal of Chinese Language and Computing and the journal Intelligent Technology (in Chinese). He was a program co-chair of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE) in 2003, 2005, 2007, and 2008, and has served on the program committees of the Annual Meeting of the Association for Computational Linguistics (ACL) and the International Conference on Computational Linguistics (COLING). Zong received his PhD from the Chinese Academy of Sciences' Institute of Computing Technology.

ABSTRACT:
Disambiguation is one of the important topics in the field of natural language processing. In this seminar, we will talk about the following two topics in disambiguation.

The first topic is "Personal Name Disambiguation in Web Search Results." Personal names are often submitted to search engines as query keywords. However, in response to a personal name query, search engines currently return a long list of search results about different namesakes. Most previous work on disambiguating personal names in Web search results employs some type of unsupervised agglomerative clustering. However, these approaches face a challenge in guiding the clustering process appropriately. Therefore, we employ semi-supervised clustering, introducing as seeds some Web pages that describe the entity of a person.

The second topic is "Word Sense Disambiguation in Japanese Texts". Most work on word sense disambiguation (WSD) employs a supervised approach and improves accuracy by adding features beyond the original features, such as POS tags, local collocations, and syntactic relations. However, these features are not very effective for WSD. We believe that the accuracy of WSD can be improved by directly computing features from word instances clustered with sense-tagged instances (seed instances). Therefore, we apply semi-supervised clustering by introducing seed instances into the supervised WSD process.

In both topics, we employ a semi-supervised clustering approach that controls the fluctuation of the centroid of a cluster. This particular point has been overlooked in other existing semi-supervised clustering approaches.
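The abstract does not spell out the clustering mechanics, but seed-guided clustering with a restrained centroid update can be illustrated with a minimal seeded k-means sketch. The damping factor `alpha`, which limits how far each centroid may drift from its seed mean, is an invented stand-in for the speaker's centroid-fluctuation control, not their algorithm.

```python
def seeded_kmeans(points, seeds, alpha=0.5, iters=10):
    """1-D seeded k-means: seeds initialize the centroids, and each update
    blends the cluster's data mean with the fixed seed mean."""
    seed_means = [sum(s) / len(s) for s in seeds]   # one seed set per cluster
    centroids = seed_means[:]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in points:
            k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[k].append(x)
        # Damped update: the seed mean anchors the centroid against outliers.
        centroids = [
            alpha * (sum(c) / len(c) if c else m) + (1 - alpha) * m
            for c, m in zip(clusters, seed_means)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0]   # toy data with one outlier
seeds = [[1.0, 1.1], [5.0, 5.2]]               # a few labeled examples per class
centroids, clusters = seeded_kmeans(points, seeds)
print(centroids)
```

Note how the outlier at 9.0 pulls the second centroid only halfway towards the raw data mean; without the damping term, a single stray point could drag the centroid away from the sense the seeds define.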

BIODATA:
Kazunari Sugiyama is currently a postdoctoral research fellow at the National University of Singapore. He received his B.Eng and M.Eng degrees from Yokohama National University, Japan, in 1998 and 2000, respectively, and his Ph.D. degree from the Nara Institute of Science and Technology, Japan, in 2004. He then worked at Hitachi, Ltd., Software Division, Japan, from 2004 to 2006, after which he was a postdoctoral researcher at the Tokyo Institute of Technology, Japan, from 2006 to 2009. His research interests include information retrieval (especially how to characterize documents better and how to adapt to Web search users' information needs) and natural language processing (especially personal name disambiguation in Web search results and word sense disambiguation in Japanese texts).

ABSTRACT:
With the advances of medical technology, large amounts of medical data are produced in hospitals every day, and medical data processing has gained increasing interest in recent years. Radiology reports contain rich information about the corresponding medical images, but are often merely archived and not utilized to build useful medical applications. In this thesis proposal, we focus our research on brain CT radiology reports and propose methods for their information extraction, retrieval, and automatic generation, as well as for aiding CT image processing. We survey research in related areas and report our ongoing research and planned future work in this proposal. The systems we have already built have started to benefit medical professionals in the radiology field, and we will contribute more to the medical community with our continuing work.

BIODATA:
Gong Tianxia is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by A/P Tan Chew Lim. She received her bachelor's degree in Computer Engineering at SOC in 2006. Her research interests are in information retrieval and medical text processing.

ABSTRACT:
Domain-specific information on the web is currently not well served by general search engines employing standard keyword search strategies and criteria for relevance. Rather than construct domain-specific rules for each domain, we observe two broad properties of domain-specific knowledge -- that such knowledge 1) may be easily categorized into specific types/genres of knowledge; and 2) may employ a domain-specific language construct to transmit domain-specific knowledge. To leverage these two properties, we propose a corresponding two-pronged strategy to enhance the support of general search in dealing with domain-specific materials.

- Resource Categorization refers to the problem of categorizing domain-specific resources with respect to different labels and metrics both at the larger webpage level (e.g., page type, readability, etc.) and at a smaller segment level (e.g., content type). Properly categorized resources provide the search engine with additional important metadata that can help it better serve domain-specific user needs by routing resources properly and organizing search results in a more domain sensitive manner.

- Text-to-Construct Linking refers to the problem of resolving text keywords to relevant, domain-specific constructs. Constructs are defined as symbolic representations of domain-specific knowledge, such as mathematical expressions in the domain of math. Such text-to-construct linking enables searching and navigation of such constructs by letting users refer to constructs using familiar keyword search. This saves the user the hassle of inputting constructs in potentially awkward syntax.

We propose to model the domain-specific resources and concepts under a Bayesian network probabilistic framework. This multi-layer framework provides a unified view of the influential factors in a domain, offering flexibility in modeling the dependence relationships among factors as well as well-founded mechanisms for estimating their influence via standard probability theory.

In this proposal we codify this framework and put forth how we plan to solve both of the aforementioned problems by applying various information retrieval, natural language processing and machine learning (IR/NLP/ML) techniques. We also tap expert knowledge to derive the structure of our Bayesian network model and to justify our estimates of the probability distributions.

We demonstrate this framework for domain-specific information retrieval by realizing an instance of the framework on the domain of mathematics. We implement a domain-specific, mathematical IR system based on our concept that serves both as a research tool and as a practical system for math searchers. We plan to carry out both objective and subjective evaluations to assess the accuracy of our system and its usefulness to potential users.

BIODATA:
Zhao Jin is a Ph.D. candidate in Computer Science at the School of Computing in National University of Singapore (NUS). He obtained his Bachelor degree in Computer Science from NUS in 2006. He is currently working on Domain-Specific Information Retrieval for mathematical and medical documents under the supervision of Dr. Kan Min-Yen.

ABSTRACT:
The last decade has seen explosive growth in the amount of textual information, both in general and in specific domains such as biomedicine. There is a need for effective and efficient text-mining systems to gather and utilize the knowledge encoded in each specific domain. An intelligent text-mining system should be able to perform tasks such as discourse analysis, entailment, and inference. For correct discourse analysis, a text-mining system must be able to understand the reference relations among different expressions in texts. Hence, anaphora resolution, the task of resolving a given text expression to its referred expression in prior text, is important for an intelligent text processing system. Bridging anaphora, the most complex form of natural anaphoric phenomena, remains a challenging and underexplored area in the natural language processing field.

In this survey, we systematically explore the previous work on bridging anaphora resolution, from both linguistic and computational points of view. We define and classify the bridging phenomenon, and categorize previous approaches in both linguistic studies and computational linguistics. Furthermore, we have conducted exploratory experimental work on other-anaphora, a special and simplified type of bridging, which was published as a full paper at COLING 2008. Based on the findings of this survey, we have selected several potential research directions for future work.

BIODATA:
Chen Bin received his Bachelor of Computing with honours from the National University of Singapore. He is currently a second-year Ph.D. student under Prof. Tan Chew Lim and Dr. Su Jian, and his research interest is in bridging anaphora resolution.

ABSTRACT:
Scenario Template Creation (STC) is a Natural Language Processing (NLP) task to detect the commonalities among articles on similar events and summarize them into an abstract representation -- a scenario template (ST). For this task, the estimation of verb-centric text span similarity is the key. Various approaches have been proposed, ranging from bag-of-words to more sophisticated ones involving thesauri and features at different linguistic levels. However, there are still opportunities for further improvement. Contextual information, for instance, by intuition would enhance text span similarity estimation. But it has yet to be well exploited.

In this talk, I first discuss an intrinsic similarity measure for predicate-argument tuples (PATs). It is applied to a Paraphrase Recognition (PR) task, demonstrating its feasibility. Then with different contextual relations defined, I hypothesize that two PATs' semantic similarity can also be reflected by their extrinsic similarity, i.e., whether they are contextually similarly connected to similar contexts. Experimental results confirmed the correlation between such an extrinsic similarity and the semantic similarity of PATs. To integrate both intrinsic and extrinsic similarities for PAT clustering, I propose a graphical framework, using a novel core algorithm called Context Sensitive Clustering (CSC). According to the widely-used purity and inverse purity metrics, the proposed framework outperforms the standard K-means algorithm over all the scenarios tested, thanks to its ability to use the contexts selectively.

BIODATA:
Long Qiu is a doctoral student with SoC, NUS, co-supervised by Dr. Min-Yen Kan and Professor Tat-Seng Chua. He got his Master of Science (SM) in Computer Science from Singapore-MIT Alliance in 2002. He is interested in NLP and the related machine learning techniques.

ABSTRACT:
Since the late 1980s, machine translation has been moving towards statistics-based approaches trained on large corpora, widely identified as statistical machine translation (SMT). Generally, SMT treats natural language translation as a machine learning task: via rich translational equivalences extracted from bilingual sentences and their extensions, SMT systems automatically transform a source sentence into the target output. SMT has strengthened rapidly in recent years, developing from word-based models to phrase-based models, and currently syntax-based models hold sway.

After reviewing the state of the art in these SMT approaches, this work focuses the discussion on syntax-based models and presents a research proposal on syntactic structure alignment for SMT. The proposed research consists of: the construction of a phrase structure tree-based translation model with large expressive power and flexibility, which can explicitly model the syntax of the source and target languages and thereby improve the grammaticality of the generated target language; an initial framework for syntactic structure alignment designed to benefit the proposed syntax-based translation model; and further investigation of the relationship between alignment results and translation performance.

BIODATA:
Sun Jun is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by Prof. Tan Chew Lim and Dr. Zhang Min. He received his bachelor's degree in Computer Science at Harbin Institute of Technology in 2006. His research interests include Machine Learning, Data Mining and Natural Language Processing.

ABSTRACT:
The lack of large-scale corpora annotated with semantic information has been a serious bottleneck for computational semantics, slowing down not only the development of more advanced statistical methods, but also our empirical understanding of the phenomena. The creation of the OntoNotes corpus will finally bring computational semantics to the point where computational syntax was in 1993 - but in the meantime, we have come to appreciate the limitations of that methodology, both theoretically and as a way of gathering judgments. In this talk, I will discuss an ongoing effort to use the 'Games with a Purpose' methodology to create a large-scale anaphorically annotated corpus in which multiple judgments are maintained about the interpretation of each anaphoric expression - in particular, the Phrase Detectives game: http://www.phrasedetectives.org. This is joint work with Jon Chamberlain and Udo Kruschwitz (University of Essex).

BIODATA:
Massimo Poesio obtained his PhD in Computer Science from the University of Rochester (USA) in 1994. He was an EPSRC Advanced Research Fellow at the University of Edinburgh from 1994 to 2001, and a Reader at the University of Essex from 2001; he is currently also Chair of Humanities Computing at the University of Trento (Italy), where he is Director of the Language, Interaction and Computation Lab (clic.cimec.unitn.it). His research interests are in Cognitive Science and Artificial Intelligence, and include computational models of semantic interpretation, particularly anaphora resolution; relation extraction and the acquisition and use of commonsense knowledge (www.clsp.jhu.edu/ws07/groups/elerfed/, www.livememories.org); the creation of large corpora of semantically annotated data to evaluate NLP models (cswww.essex.ac.uk/Research/nle/arrau/, anawiki.essex.ac.uk); and spoken dialogue systems.

2008

ABSTRACT:
The work addresses the task of spoken document retrieval (SDR) - the retrieval of speech recordings from speech databases in response to user queries.

In the SDR task, we are faced with the problem that automatic transcripts as generated by a speech recognizer are far from perfect. This is especially the case for conversational speech, where the transcripts are often not of sufficient quality to be useful on their own for SDR - due to environment and channel effects, as well as intra-speaker and inter-speaker pronunciation variability. Recent research efforts in SDR have tried to overcome the low quality of 1-best transcripts by using statistics derived from multiple transcription hypotheses, represented in the form of lattices; however, these efforts have invariably used the classical vector space retrieval model.

In this thesis, we present a method for lattice-based spoken document retrieval based on a statistical approach to information retrieval. In this method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a log probability of the query under such a model. We investigate the efficacy of our method as compared to two previous SDR methods - statistical retrieval using only 1-best transcripts, and a recently proposed lattice-based vector space retrieval method - as well as a lattice-based BM25 method which we implemented. Experimental results obtained on Mandarin and English conversational speech corpora show that our method consistently achieves better retrieval performance than all three methods.

We also extend our statistical lattice-based SDR method to the task of query-by-example SDR - retrieving documents from a speech corpus, where the queries are themselves in the form of complete spoken documents (query exemplars). In our query-by-example SDR method, we compute expected word counts from document and query lattices, estimate statistical models from these counts, and compute relevance scores as Kullback-Leibler divergences between these models. Experiments on English conversational speech show that the use of statistics from lattices for both documents and query exemplars results in better retrieval accuracy than using only 1-best transcripts for either documents, or queries, or both. Finally, we investigate the effect of stop word removal: it further improves retrieval accuracy, and lattice-based retrieval still yields an improvement over 1-best retrieval even in the presence of stop word removal.
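The KL-divergence scoring step for query-by-example retrieval can be sketched as follows. Both the query exemplar and each document get a smoothed unigram model from their lattice's expected counts, and the document whose model diverges least from the query model ranks first; the vocabulary, counts, and smoothing parameter are invented for illustration.

```python
import math

def smoothed_lm(exp_counts, background, mu=2000.0):
    """Dirichlet-smoothed unigram model from lattice expected word counts."""
    total = sum(exp_counts.values())
    return {w: (exp_counts.get(w, 0.0) + mu * bg) / (total + mu)
            for w, bg in background.items()}

def kl_divergence(p, q):
    """KL(p || q); a lower value means q is closer to the query model p."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# A toy closed vocabulary with a uniform background model.
background = {"open": 0.25, "source": 0.25, "speech": 0.25, "corpus": 0.25}
query_lm = smoothed_lm({"speech": 3.0, "corpus": 1.0}, background)
doc1_lm  = smoothed_lm({"speech": 2.5, "corpus": 1.5}, background)
doc2_lm  = smoothed_lm({"open": 3.0, "source": 2.0}, background)

# doc1's model is closer to the query exemplar's model than doc2's.
print(kl_divergence(query_lm, doc1_lm) < kl_divergence(query_lm, doc2_lm))  # True
```

Smoothing against the same background model keeps every probability nonzero, so the divergence is always finite even when a word appears in only one of the two lattices.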

BIODATA:
Chia Tee Kiah is a PhD candidate (2003-present) in the School of Computing, National University of Singapore. He is supervised by A/P Ng Hwee Tou from the School of Computing and Dr. Li Haizhou from the Institute for Infocomm Research. He received his B.Comp (Hons) from the National University of Singapore in 2003. He is currently working on methods for improving spoken document retrieval.

ABSTRACT:
Collocation, i.e. sequences of certain words which habitually co-occur, plays an essential part in human language. For example, in English you say strong wind but heavy rain; it would not be normal to say *heavy wind or *strong rain. For students, choosing the right collocation makes their speech and writing sound much more natural, more native-speaker-like. For linguists, collocation is often used to distinguish word senses. For computational purposes, collocation is useful for various Natural Language Processing (NLP) applications. As in English, collocation runs through the whole of the Chinese language: no piece of natural spoken or written Chinese is totally free of collocation.

However, no matter how convinced learners and experts are in principle of the importance of collocation, it is difficult for them to put these principles into practice without the benefit of a large scale dictionary of collocations.

This presentation introduces a combined dictionary-, rule-, and statistics-based method for automatically extracting collocations from a huge corpus. The specific aim is to build a large collocation knowledge base for the 10,000 most frequently used words in Singapore Mandarin. The investigation is based on the Singapore Chinese Corpus (SCC), the largest (and only) such corpus, of which 20 million words have been POS tagged.
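The statistical component of such a pipeline typically scores how strongly a word pair co-occurs relative to chance. One common choice (an assumption here, not necessarily the measure used in this work) is pointwise mutual information over adjacent word pairs, sketched on a toy English token stream:

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), c in bigrams.items():
        if c >= min_count:            # frequency cut-off against noise
            p_xy = c / (n - 1)
            p_x, p_y = unigrams[x] / n, unigrams[y] / n
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = ("strong wind blew today , strong wind again , heavy rain fell , "
          "heavy rain stopped , wind rain").split()
scores = pmi_bigrams(tokens)
print(sorted(scores, key=scores.get, reverse=True))
```

In a dictionary-rule-statistics combination, such scores would be filtered by POS-tag rules (e.g. keeping adjective-noun pairs) and checked against dictionary entries before entering the knowledge base.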

BIODATA:
Dr Wang Hui is an Assistant Professor at the Department of Chinese Studies, National University of Singapore (NUS). Prior to taking up her appointment at NUS, Dr Wang was a lecturer in the Institute of Computational Linguistics at Peking University, then a visiting scientist at the Free University, Berlin, Germany. She has a PhD in Chinese Linguistics from Peking University. She is the author of A Syntagmatic Study on Chinese Noun Senses (Beijing: Peking University Press, 2004), The Structure of Chinese Language: Characters, Words and Sentences (co-author) (New Jersey, Singapore: Global Publishing Co., 2004), and The Grammatical Knowledge Base of Contemporary Chinese -- A Complete Specification (co-author) (Beijing: Tsinghua University Press, 1998; 2nd edition, 2003).

ABSTRACT:
I will present Web-based approaches to the syntax and semantics of noun compounds (NCs), which can be used in query parsing, technical term understanding, etc. I will also describe an application to machine translation.

First, I will present a highly accurate lightly supervised method based on surface features and paraphrases for making bracketing decisions for three-word noun compounds, e.g. "[[liver cell] antibody]" is left-bracketed, while "[liver [cell line]]" is right-bracketed. The enormous size of the Web makes such features frequent enough to be useful.

Second, I will introduce an unsupervised method for discovering the implicit predicates characterizing the semantic relations that hold in noun-noun compounds. For example, "malaria mosquito" is a "mosquito that carries/spreads/causes/transmits/brings/infects with/... malaria".

Finally, I will present a method for improving statistical machine translation (SMT). Most modern SMT systems rely on aligned sentences of bilingual corpora for training. I will describe a method for expanding the training set with conceptually similar but syntactically differing paraphrases at the NP level which involve NCs. The English-to-Spanish evaluation on the Europarl corpus shows an improvement equivalent to 33%-50% of that of doubling the amount of training data.

BIODATA:
Dr. Preslav Nakov is a Research Fellow at the National University of Singapore (NUS). He received his PhD in Computer Science from the University of California at Berkeley in 2007. Dr. Nakov's main research interests are in the areas of lexical semantics and machine translation.

ABSTRACT:
This paper proposes an other-anaphora resolution approach for biomedical texts. It utilizes automatically mined patterns to discover the semantic relation between an anaphor and a candidate antecedent. The knowledge from lexical patterns is incorporated into a machine learning framework to perform anaphora resolution. The experiments show that a machine learning approach combined with the auto-mined knowledge is effective for other-anaphora resolution in the biomedical domain. Our system with auto-mined patterns gives an accuracy of 56.5%, yielding a 16.2% improvement over the baseline system without pattern features, and a 9% improvement over the system using manually designed patterns.

BIODATA:
Chen Bin received his Bachelor's degree with honours from NUS. He is currently a second-year Ph.D. candidate.

ABSTRACT:
Recent research in Statistical Machine Translation (SMT) tends to incorporate more grammatical structure information into the translation model, producing what are known as linguistically motivated syntax-based models. The phrase structure parse tree is a commonly used representation for bilingual sentence pairs in developing such a model. By nature, syntax-based SMT regards machine translation as a structure transfer process; therefore, structure alignment is a critical step in training syntax-based SMT models. In this talk, we propose a statistical structure-based model to induce the sub-tree alignment between bilingual phrase-structure trees. To strengthen the discriminative capability of the model, we design novel syntactic features combined with common lexical features. The task is achieved by developing an unsupervised framework via agreement-based learning.

BIODATA:
Sun Jun is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by Prof. Tan Chew Lim and Dr. Zhang Min. He received his bachelor's degree in Computer Science at Harbin Institute of Technology in 2006. His research interests include Machine Learning, Data Mining and Natural Language Processing.

ABSTRACT:
This is the presentation for my Graduate Research Paper (GRP). Discourse relation recognition identifies and classifies the relation between two text units (words, phrases or sentences), and it is widely agreed that this is an important step in discourse parsing and other NLP applications. The newly released version 2 of the Penn Discourse Treebank corpus provides a clean linguistic resource for understanding discourse relations and a common platform for researchers to develop discourse-centric systems. Recently, several statistical coherence models have been proposed to derive local and global discourse coherence using cohesion-level features such as reference and lexical repetition. While the tasks of discourse relation recognition and coherence modeling have a strong implicit connection to each other, to our knowledge there have thus far been no publications that attempt to make this connection explicit.

I propose to focus my research on designing a discourse relation recognition system, and a coherence evaluation metric that is based on discourse relations recognized. I give a detailed literature review on discourse relation recognition and automatic coherence evaluation, with an introduction to the Penn Discourse Treebank (PDTB). I give a description of a case study on PDTB and two baseline systems on discourse relation recognition. I then propose future work on the research goals, including the discourse relation identification and classification phases and the coherence evaluation metric.

BIODATA:
Lin Ziheng is a PhD student under the supervision of Dr. Kan Min-Yen and Prof. Ng Hwee Tou in School of Computing, National University of Singapore. His research interests include natural language processing and information retrieval. Specifically, he is working on discourse analysis and automatic text summarization.

ABSTRACT:
In this paper, we present an algorithm for learning a generative model of natural language sentences together with their formal meaning representations with hierarchical structures. The model is applied to the task of mapping sentences to hierarchical representations of their underlying meaning. We introduce dynamic programming techniques for efficient training and decoding. In experiments, we demonstrate that the model, when coupled with a discriminative reranking technique, achieves state-of-the-art performance when tested on two publicly available corpora. The generative model degrades robustly when presented with instances that are different from those seen in training. This allows a notable improvement in recall compared to previous models.

BIODATA:
Lu Wei is currently a PhD student in the Singapore-MIT Alliance (Computer Science Program) of the National University of Singapore. His thesis advisers are Assoc. Prof. Ng Hwee Tou (NUS), Assoc. Prof. Lee Wee Sun (NUS), and Prof. Leslie Pack Kaelbling (MIT). He obtained his Bachelor of Computing (Computer Science, first class honors) from the School of Computing, NUS in 2005, where he worked with Assist. Prof. Kan Min-Yen. He also obtained his M.Sc. from SMA in 2006. His current research interest is on grounding natural language to formal language.

ABSTRACT:
The accuracy of current word sense disambiguation (WSD) systems is affected by the fine-grained sense inventory of WordNet as well as a lack of training examples. Using the WSD examples provided through OntoNotes, we conduct the first large-scale WSD evaluation involving hundreds of word types and tens of thousands of sense-tagged examples, while adopting a coarse-grained sense inventory. We show that though WSD systems trained with a large number of examples can obtain a high level of accuracy, they nevertheless suffer a substantial drop in accuracy when applied to a different domain. To address this issue, we propose combining a domain adaptation technique using feature augmentation with active learning. Our results show that this approach is effective in reducing the annotation effort required to adapt a WSD system to a new domain. Finally, we propose that one can maximize the dual benefits of reducing the annotation effort while ensuring an increase in WSD accuracy, by only performing active learning on the set of most frequently occurring word types.
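Feature augmentation for domain adaptation is commonly realized as the "frustratingly easy" mapping that triples the feature space into shared, source-only, and target-only copies, so a classifier can learn which feature weights transfer across domains and which are domain-specific. Assuming that reading of the abstract (the abstract itself does not give details), a minimal sketch:

```python
def augment(features, domain):
    """'Frustratingly easy' feature augmentation: each instance carries three
    copies of its feature vector -- shared, source-only, target-only -- with
    the copy belonging to the other domain zeroed out."""
    zeros = [0.0] * len(features)
    if domain == "source":
        return features + features + zeros
    return features + zeros + features

# A toy 2-dimensional instance from each domain.
src = augment([1.0, 2.0], "source")
tgt = augment([3.0, 4.0], "target")
print(src)  # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
print(tgt)  # [3.0, 4.0, 0.0, 0.0, 3.0, 4.0]
```

Because source and target instances only overlap in the shared copy, features that behave the same in both domains accumulate weight there, while domain-specific behaviour is absorbed by the per-domain copies.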

BIODATA:
Zhong Zhi is a Ph.D. candidate in the Department of Computer Science at the National University of Singapore. He obtained his B.Comp. from Fudan University in 2006. His research interest lies in natural language processing, with a focus on word sense disambiguation.

ABSTRACT:
The process of identifying the correct meaning, or sense, of a word in context is known as word sense disambiguation (WSD). We explore three important research issues for WSD.

Current WSD systems suffer from a lack of training examples. In our work, we describe an approach of gathering training examples for WSD from parallel texts. We show that incorporating parallel text examples improves performance over just using manually annotated examples. Using parallel text examples as part of our training data, we developed systems for the SemEval-2007 coarse-grained and fine-grained English all-words tasks, obtaining excellent results for both tasks.

In training and applying WSD systems on different domains, an issue that affects accuracy is that instances of a word drawn from different domains have different sense priors (the proportions of the different senses of a word). To address this issue, we estimate the sense priors of words drawn from a new domain using an algorithm based on expectation maximization (EM). We show that the estimated sense priors help to improve WSD accuracy. We also use this EM-based algorithm to detect a change in predominant sense between domains. Together with the use of count-merging and active learning, we are able to perform effective domain adaptation to port a WSD system to new domains.
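As a rough illustration of the EM-based re-estimation of sense priors described above, one can iteratively rescale the source-trained classifier's posteriors on new-domain instances by the ratio of the evolving prior estimate to the training prior, then re-estimate the priors as the average adjusted posterior. This is a sketch in the style of standard prior-adjustment EM; the function and variable names are ours, not from the thesis:

```python
def estimate_sense_priors(posteriors, train_priors, iters=50):
    """EM re-estimation of sense priors on a new domain.

    posteriors  : list of [p(sense_j | x_i)] from the source-domain classifier
    train_priors: sense priors observed in the training data (assumed nonzero)
    """
    k = len(train_priors)
    priors = list(train_priors)              # start from the training priors
    for _ in range(iters):
        # E-step: rescale each posterior by the ratio of new to old priors
        adjusted = []
        for p in posteriors:
            w = [p[j] * priors[j] / train_priors[j] for j in range(k)]
            z = sum(w)
            adjusted.append([x / z for x in w])
        # M-step: the new priors are the average adjusted posteriors
        priors = [sum(a[j] for a in adjusted) / len(adjusted) for j in range(k)]
    return priors
```

On instances whose posteriors lean heavily toward one sense, the estimated prior for that sense rises well above the training prior, which is the behavior that helps WSD accuracy on the new domain.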

Finally, recent research presents conflicting evidence on whether WSD systems can help to improve the performance of statistical machine translation (MT) systems. In our work, we show for the first time that integrating a WSD system achieves a statistically significant improvement on the translation performance of Hiero, a state-of-the-art statistical MT system.

BIODATA:
Yee Seng Chan is currently both a final-year PhD student and a Research Fellow in the School of Computing, National University of Singapore (NUS). His PhD supervisor is A/P Hwee Tou Ng of the School of Computing, NUS. Yee Seng's main research is on word sense disambiguation, including its integration into machine translation systems. His other research includes automatic metrics for machine translation evaluation.

ABSTRACT:
Imagine a world where every object that can be represented on the Web has a means of identification - a uniform resource identifier (URI) - that can be accessed by processes which understand the "world" that object is part of - as represented by an ontology - so that the object becomes accessible as part of the Web infrastructure. That is the new and exciting world offered by the Semantic Web. This presentation will take a deeper look at the components that make up the Semantic Web. We will review a number of examples of Semantic Web applications. We will also look at the basic components required to build Semantic Web applications and a Web of data.

The Semantic Web is about where the Web was in 1992. The necessary standards and protocols have been defined and we understand how to build Semantic Web applications. But we are in the boot-strapping phase where we need to encourage organisations to make their data available in the appropriate formats to enable the full potential of the Semantic Web to be realised.

There is a growing realization among many researchers that if we want to model the Web and understand its future trajectory; if we want to understand the architectural principles that have provided for its growth; and if we want to be sure that it supports the basic social values of trustworthiness, privacy, and respect for social boundaries, then we must chart out a research agenda that targets the Web as a primary focus of attention. This is Web Science and the presentation will finish with a discussion of the emergence of this exciting new discipline.

BIODATA:
Professor Wendy Hall: Wendy Hall is a Professor of Computer Science at the University of Southampton in the United Kingdom and was Head of the School of Electronics and Computer Science (ECS) from 2002-2007. She was the founding Head of the Intelligence, Agents, Multimedia (IAM) Research Group in ECS. She has published over 350 papers in areas such as hypermedia, multimedia, digital libraries, and Web technologies.

Her current research includes applications of the Semantic Web and exploring the interface between the life sciences and the physical sciences. She is a Founding Director, along with Professor Sir Tim Berners-Lee, Professor Nigel Shadbolt and Daniel J. Weitzner, of the Web Science Research Initiative.

She has recently been elected as President of the Association for Computing Machinery (ACM), and is the first person from outside North America to hold this position. She was Senior Vice President of the Royal Academy of Engineering (2005-2008) and is a Past President of the British Computer Society (2003-2004). She is a member of the Prime Minister's Council for Science and Technology, a member of the Executive Committee of UKCRC, and Chair of the newly formed BCS Women's Forum. She is the Chair of the Advisory Board of the company Garlik Limited, and is a founding member of the Scientific Council of the European Research Council. She was awarded a CBE in the Queen's Birthday honours list in 2000, and became a Fellow of the Royal Academy of Engineering in the same year. In 2006 she was awarded the Anita Borg Award for Technical Leadership. A longer biography is available at http://users.ecs.soton.ac.uk/wh/

Nigel Shadbolt is a Professor of Artificial Intelligence in the School of Electronics and Computer Science (ECS) at the University of Southampton. In its 50th Anniversary year, Nigel was President of the British Computer Society. He is a Fellow of both the Royal Academy of Engineering and the British Computer Society.

Between 2000 and 2007, he was the Director of the £7.5m EPSRC Interdisciplinary Research Collaboration in Advanced Knowledge Technologies (AKT). He is the Chief Technology Officer of Garlik, a company formed to exploit semantic web technology to enhance consumers' and citizens' privacy. He is also a founding director of the Web Science Research Initiative (WSRI).

ABSTRACT:
In this thesis, we investigate a specific area within Statistical Machine Translation (SMT): the reordering task -- the task of arranging translated words from source-language to target-language order. This task is as crucial as it is challenging: failure to order words correctly leads to disfluent output, and ordering them correctly may require in-depth knowledge of source and target syntax, which is often not available to SMT systems.

In this thesis, we propose to address the reordering task by using knowledge of function words. In many languages, function words -- which include prepositions, determiners, articles, etc. -- are important in explaining the grammatical relationships among phrases within a sentence. Projecting them and their dependent arguments into another language often results in structural changes in the target sentence. Furthermore, function words have desirable empirical properties: they are enumerable and appear frequently in text, making them highly amenable to statistical modeling.

We demonstrate the utility of this function-word idea in the syntax-based approach, following the recent trend of using syntactic formalisms to model reordering. We also believe the idea brought forward and developed in this thesis is applicable to other SMT approaches. We implement this idea in a specific syntax-based approach: the formally syntax-based approach, which assumes a knowledge-poor environment where no linguistic annotation is available to the model. In particular, we demonstrate the benefit of our function-word idea by proposing several statistical models that address the suboptimalities of current formally syntax-based models.

We first argue that the current formally syntax-based models are still problematic, although they achieve state-of-the-art performance. More specifically, without access to linguistic knowledge, these models typically come with only one type of nonterminal symbol, which unfortunately introduces many structural ambiguities. In contrast, our idea, which is implemented as a Head-driven Synchronous Context Free Grammar, is better at addressing this problem since it introduces two types of nonterminals: one for function words, and one for their arguments. With this richer set of nonterminals, we develop novel statistical models to better resolve the structural ambiguities. Our experimental results suggest that our syntax-based approach performs well in the reordering task in perfect lexical choice scenarios, thanks to its stronger structural modeling with the advantage of being more compact. We also validate this approach in the full translation task where the training data contains noise, confirming the merit of our idea to both the reordering and the translation task.

BIODATA:
Hendra Setiawan is a Doctoral Student at SoC, NUS, co-supervised by Dr. Min-Yen Kan and Dr. Haizhou Li. His main research interest is Statistical Machine Translation and Natural Language Processing (NLP) in general.

ABSTRACT:
Scenario Template Creation (STC) is a Natural Language Processing (NLP) task to detect the commonalities among articles on similar events and generalize them into an abstract representation -- a scenario template (ST). For this task, the estimation of verb-centric text span similarity is key. Since text span similarity calculation plays an important role in many NLP applications, various approaches have been proposed, ranging from bag-of-words models to more complicated ones involving thesauri and features at different linguistic levels. However, there is still room for improvement. Contextual information, for instance, is intuitively a promising source for enhancing text span similarity estimation, but it has yet to be exploited as thoroughly as internal features have been.

In this talk, I first discuss an intrinsic similarity measure for predicate-argument tuples (PATs). It is applied to a Paraphrase Recognition (PR) task, demonstrating its feasibility. Then I show a context model to capture contexts that could be more informative compared to other surrounding tokens. With different contextual relations defined, I hypothesize that two PATs' semantic similarity can also be reflected by their extrinsic similarity, i.e., whether they are contextually similarly connected to similar contexts. I show experimental results that confirm the correlation between such an extrinsic similarity and the semantic similarity of PATs. To integrate intrinsic and extrinsic similarities for PAT clustering, I propose a graphical framework, using a novel core algorithm called Context Sensitive Clustering (CSC). This clustering process is guided by the Expectation-Maximization (EM) algorithm. I conduct experiments comparing this EM-based CSC algorithm with the standard K-means algorithm. Under the widely-used purity and inverse purity metrics, the proposed algorithm outperforms K-means over all the scenarios tested.
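The purity and inverse purity metrics used in the evaluation above are simple to compute: purity asks how homogeneous each predicted cluster is with respect to the gold scenario labels, and inverse purity asks the converse. A minimal sketch (the helper names are ours):

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: cluster_id -> list of item ids
    labels  : item id -> gold label.
    Each cluster votes for its majority gold label."""
    n = sum(len(items) for items in clusters.values())
    hit = sum(Counter(labels[i] for i in items).most_common(1)[0][1]
              for items in clusters.values())
    return hit / n

def inverse_purity(clusters, labels):
    """Swap the roles of predicted clusters and gold classes:
    how well does each gold class map onto a single cluster?"""
    gold = {}
    for item, lab in labels.items():
        gold.setdefault(lab, []).append(item)
    assign = {i: cid for cid, items in clusters.items() for i in items}
    return purity(gold, assign)
```

Note that purity alone rewards over-splitting (singleton clusters are perfectly pure), which is why the two metrics are reported together.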

BIODATA:
Long Qiu is a Doctoral Student at SoC, NUS, co-supervised by Professor Chua Tat-Seng and Dr. Min-Yen Kan. He got his Master of Science (SM) in Computer Science from Singapore-MIT Alliance in 2002. He is interested in Natural Language Processing (NLP) and the related machine learning techniques.

ABSTRACT:
China has become the world's biggest online market in terms of users. What continues to drive this growth? What are its challenges and opportunities? In this survey we will outline the social and economic background, the key business models and competitive advantages, how media and multimedia interact, and how people use the Internet in their daily lives. The second part of this talk will present an overview and forward-looking synopsis of the principles and applications of search, from the perspective of a long-time search engineer.

BIODATA:
Dr. William Chang has been the Chief Scientist at Baidu since January 2007. Prior to joining Baidu, Dr. Chang served as the CTO of Infoseek and the VP of Strategy of Go Network. He is also the creator of the highly successful Infoseek natural language search engine and Ultraseek enterprise search engine. Dr. Chang has extensive expertise in search technology, online community building and advertising business models. Dr. Chang earned an undergraduate degree in mathematics from Harvard and a PhD in computer science from the University of California, Berkeley for his breakthrough work in text search. At the renowned Cold Spring Harbor Laboratory, Dr. Chang mapped a genome and invented a protein sequence search methodology. More recently, he created a contextual advertising product at Sentius Corporation, and founded Affini, Inc., a social network technology company.

ABSTRACT:
Distributed search engines are often more complex to implement than centralized engines. Distributing a search engine across multiple sites, however, has several advantages: in particular, it uses fewer computing resources and exploits data and user locality. In this presentation we show the feasibility of distributed Web search engines by proposing a model for assessing the total cost of a distributed Web-search engine, including the computational costs as well as the communication cost among all distributed sites. Using examples, we show that a distributed Web search engine can be more cost-effective than a centralized one if there is a large percentage of local queries, which is usually the case. We then present a query-processing algorithm that maximizes the number of queries answered locally, without sacrificing result quality, by using caching and partial replication. We simulate our algorithm on real document collections and real query workloads to measure the actual parameters needed for our cost model, and we show that a distributed search engine can be competitive with a centralized architecture with respect to cost. This is joint work with Aris Gionis, Flavio Junqueira, Vassilis Plachouras and Luca Telloli.
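To see why a high fraction of local queries favors the distributed design, here is a deliberately crude caricature of such a cost model (not the authors' model; all names and unit costs below are made up): each site indexes a fraction of the collection, so per-query processing is assumed to shrink with the number of sites, while only non-local queries pay a cross-site communication cost.

```python
def total_cost(queries, sites, local_fraction, cpu_per_query, net_per_query):
    """Toy total cost of answering a query stream.
    sites == 1 with local_fraction == 1.0 models the centralized case."""
    # Each site indexes 1/sites of the collection, so per-query
    # processing is assumed to shrink proportionally (a simplification).
    processing = queries * cpu_per_query / sites
    # Only queries that cannot be answered locally are forwarded.
    forwarding = queries * (1.0 - local_fraction) * net_per_query
    return processing + forwarding
```

Under these made-up unit costs, a 10-site deployment with 80% local queries comes out far cheaper than the centralized configuration, which is the qualitative point of the talk.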

BIODATA:
Ricardo Baeza-Yates is VP of Yahoo! Research for Europe and Latin America, leading the labs at Barcelona, Spain and Santiago, Chile. Until 2005 he was the director of the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile; and ICREA Professor at the Dept. of Technology of Univ. Pompeu Fabra in Barcelona, Spain. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 150 other publications. He has received the Organization of American States award for young researchers in exact sciences (1993) and with two Brazilian colleagues obtained the COMPAQ prize for the best CS Brazilian research article (1997). In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences. During 2007 he was awarded the Graham Medal for innovation in computing, given by the University of Waterloo to distinguished alumni.

ABSTRACT:
Web advertising is the primary driving force behind many Web activities, including Internet search as well as the publishing of online content by third-party providers. A new discipline - Computational Advertising - has recently emerged, which studies the process of advertising on the Internet from a variety of angles. A successful advertising campaign should be relevant to the user's immediate information need as well as, more generally, to the user's background; it should be economically worthwhile to the advertiser and the intermediaries (e.g., the search engine), and not detrimental to the user experience. To a first approximation, the process of obtaining relevant ads can be reduced to conventional information retrieval, where one constructs a query that describes the user's context, and then executes this query against a large inverted index of ads. We show how to augment the standard IR approach using query expansion and text classification techniques. We demonstrate how to employ a relevance feedback assumption and use Web search results retrieved by the query. We will also survey the numerous challenges and open research problems posed by computational advertising, such as text summarization, natural language generation, named entity extraction, handling geographic names, and others.

BIODATA:
Evgeniy Gabrilovich is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research. His research interests include information retrieval, machine learning, and computational linguistics. Recently, he co-organized a workshop on the synergy between Wikipedia and research in AI at AAAI 2008, as well as co-presented a tutorial on computational advertising at ACL 2008 and EC 2008. He served on the program committees of ACL-08:HLT, AAAI 2008, WWW 2008, CIKM 2008, JCDL 2008, AAAI 2007, EMNLP-CoNLL 2007, and COLING-ACL 2006. Evgeniy earned his MSc and PhD degrees in Computer Science from the Technion - Israel Institute of Technology. In his Ph.D. thesis, Evgeniy developed a methodology for using large scale repositories of world knowledge (e.g., all the knowledge available in Wikipedia) in order to enhance text representation beyond the bag of words.

ABSTRACT:
Web search results are typically based on the user's search query, without taking other contextual information into account. However, we can see from user search behavior that for some search topics the user may prefer results which are geographically close to home. We will show topics which have a geographical dependence, as well as others which appear to be geographically independent. Based on these findings, we propose a more flexible approach to web search, in which we prefer a ranking with results close to the user's location when this will best satisfy the user's information need.

BIODATA:
Rosie Jones is a Senior Research Scientist at Yahoo!. Her research interests include web search, geographic information retrieval and natural language processing. She received her PhD from the School of Computer Science at Carnegie Mellon University. In 2005 she co-organized the SIGIR workshop on lexical cohesion and information retrieval, and in 2003 she co-organized the ICML workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining. She served as a Senior PC member for SIGIR in 2007 and 2008.

ABSTRACT:
In this talk we discuss the problem of whether or not to show online advertisements. We propose two methods for addressing this problem: a simple thresholding approach, and a machine learning approach that collectively analyzes the set of candidate ads augmented with external knowledge. Our experimental evaluation, based on over 28,000 editorial judgments, shows that we are able to predict, with high accuracy, when to show ads for both content match and sponsored search advertising tasks.
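The two approaches can be sketched in a few lines (a sketch only; the abstract does not specify the actual features or thresholds, so everything below is illustrative): the thresholding baseline shows ads when the best candidate clears a tuned score, while the learning approach would feed whole-set statistics to a classifier.

```python
def should_show_ads(candidate_scores, threshold=0.5):
    """Thresholding baseline: show ads only when the best candidate
    ad's relevance score clears a tuned threshold."""
    return bool(candidate_scores) and max(candidate_scores) >= threshold

def set_features(candidate_scores):
    """Whole-set features a classifier could use instead of the bare
    threshold, reflecting the idea of analyzing the candidate set
    collectively (the feature choice here is illustrative)."""
    return {
        "max": max(candidate_scores),
        "mean": sum(candidate_scores) / len(candidate_scores),
        "spread": max(candidate_scores) - min(candidate_scores),
    }
```

The advantage of the set-level view is that a query with many mediocre candidates and a query with one strong candidate look different to the classifier, even if their best scores are similar.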

BIODATA:
Donald Metzler is a Research Scientist at Yahoo! Research in Santa Clara, CA. He obtained his Ph.D. degree in Computer Science from the University of Massachusetts Amherst in 2007. His research interests include information retrieval, machine learning, and their intersection. He is the co-author of Search Engines: Information Retrieval in Practice, which will be published in the early part of 2009.

ABSTRACT:
Large-scale image retrieval on the Web relies on the availability of short snippets of text associated with the image. This user-generated content is a primary source of information about the content and context of an image. While traditional information retrieval models focus on finding the most relevant document without consideration for diversity, image search requires results that are both diverse and relevant. This is problematic for images because they are represented very sparsely by text, and as with all user-generated content the text for a given image can be extremely noisy.

The contribution of this paper is twofold. We show that it is possible to minimize the trade-off between precision and diversity: relevance models offer a unified framework that affords the greatest diversity without harming precision. Furthermore, we show that estimating the query model from the distribution of tags favors the dominant sense of a query. Relevance models operating only on tags offer the highest level of diversity with no significant decrease in precision.
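For background, a relevance model in the sense of Lavrenko and Croft estimates P(w|R) by summing each document's smoothed term distribution weighted by that document's query likelihood. A minimal sketch over tag "documents" (the function name and the smoothing choice are ours, not the paper's):

```python
from collections import Counter

def relevance_model(query, tag_docs, mu=0.5):
    """P(w|R) ∝ Σ_D P(w|D) · P(Q|D), estimated over the tag sets of
    ranked images.  mu mixes in the collection model for smoothing."""
    coll = Counter(t for doc in tag_docs for t in doc)
    total = sum(coll.values())
    vocab = set(coll)

    def p_w_d(w, doc):
        # Jelinek-Mercer smoothed unigram probability of tag w in doc
        return (1 - mu) * doc.count(w) / len(doc) + mu * coll[w] / total

    scores = dict.fromkeys(vocab, 0.0)
    for doc in tag_docs:
        q_lik = 1.0
        for q in query:                      # query likelihood P(Q|D)
            q_lik *= p_w_d(q, doc)
        for w in vocab:
            scores[w] += p_w_d(w, doc) * q_lik
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

Because the query likelihood weights each tag document, terms co-occurring with the dominant sense of the query receive most of the probability mass, which matches the paper's observation about tag distributions.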

BIODATA:
Vanessa Murdock currently holds a postdoctoral position at Yahoo! Research Barcelona. Her current work focuses on retrieval of short texts, such as advertisements, and user-generated content for images and video. She completed her PhD in 2006 at the University of Massachusetts, working with W. Bruce Croft. Her thesis, focusing on sentence retrieval for applications such as Question Answering, novelty detection, and information provenance, was recently published as the book "Exploring Sentence Retrieval".

ABSTRACT:
Part of the unique cultural heritage of China is the game of Chinese couplets (duìlián). One person challenges the other with a sentence (the first sentence). The other person then replies with a sentence (the second sentence), such that corresponding words in the two sentences match each other, obeying certain constraints on semantic, syntactic, and lexical relatedness. This task is viewed as a difficult problem in AI and has not been explored in the research community.

In this paper, we regard this task as a kind of machine translation process and present a phrase-based SMT approach to generating the second sentence. First, the system takes the first sentence as input and generates an N-best list of proposed second sentences using a phrase-based SMT decoder. Then, a set of filters is used to remove candidates violating linguistic constraints. Finally, a Ranking SVM is applied to rerank the candidates. A comprehensive evaluation, using both human judgments and BLEU scores, has been conducted, and the results demonstrate that this approach is very successful.
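The three-stage decode/filter/rerank pipeline can be sketched as follows; `decode_nbest` and `rerank` are placeholders standing in for the phrase-based SMT decoder and the Ranking SVM, not reproductions of the authors' components:

```python
def generate_second_sentence(first, decode_nbest, filters, rerank):
    """Pipeline sketch: N-best generation, constraint filtering, reranking.
    decode_nbest(first, n) -> candidate second sentences (assumed decoder)
    filters : predicates encoding linguistic constraints (e.g. equal length)
    rerank(first, cand)   -> score (assumed Ranking-SVM scorer)
    """
    candidates = decode_nbest(first, n=100)
    for keep in filters:
        candidates = [c for c in candidates if keep(first, c)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: rerank(first, c))
```

Keeping the constraints as hard filters before reranking means the SVM only has to order linguistically admissible couplets, rather than learn the constraints themselves.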

BIODATA:
Ming Zhou is research manager of the Natural Language Computing Group at Microsoft Research Asia (MSRA). As one of the first groups at MSRA, this group has been working on machine translation, information retrieval, question answering and language gaming, and has contributed many technologies to Microsoft products such as the Chinese/Japanese IME, Chinese word breaker, English writing assistant, search engine speller, multi-language search, keyword bidding, and text mining.

Ming developed China's first Chinese-English machine translation system, CEMT-I, in 1988, which laid the foundation of machine translation research at Harbin Institute of Technology. He is the inventor of the J-Beijing Chinese-Japanese machine translation system, a well-known MT product in Japan which has held a 62% market share in the 10 years since it was launched in 1998. Ming Zhou received his PhD from Harbin Institute of Technology in 1991, and was a post-doctoral researcher at Tsinghua University from 1991 to 1993. He then became an associate professor at the same university, until he joined MSRA in 1999.

ABSTRACT:
Question answering has been a very active research field in information retrieval and natural language processing. Despite the success of the TREC QA track, large-scale robust QA systems have yet to appear in the real world. In this talk, I will briefly introduce recent progress on SQuAD -- a question answering project aiming to crawl, index, and serve all question-answer pairs existing on the web. I will address six main challenges of the project and then focus on the topic of question search and recommendation. Three demos will be shown to highlight how SQuAD technologies can be used in different scenarios.

BIODATA:
Dr. Chin-Yew LIN is a lead researcher and research manager at Microsoft Research Asia. Before joining Microsoft in 2006, he was a senior research scientist at the Information Sciences Institute at University of Southern California (USC/ISI) where he worked in the Natural Language Processing and Machine Translation group since 1997. His research interests are automated summarization, opinion analysis, question answering, computational advertising, community intelligence, machine translation, and machine learning.

Recently, his main focus has been developing SQuAD, a scalable automatic question answering and distillation system. He has also developed automatic evaluation technologies for summarization, QA, and MT. In particular, he created the ROUGE automatic summarization evaluation package, which has become the de facto standard in summarization evaluations; more than 200 research sites worldwide have downloaded it.

ABSTRACT:
Babbie defines content analysis as "the study of recorded human communications such as books, Web sites, paintings and laws." We all practice what we might call "first generation" content analysis every time we read a paper. What we might call "second generation" content analysis involves social scientists who develop coding frames appropriate to their research question and then meticulously annotate a collection of moderate size in order to support their analysis. Third-generation content analysis leverages extensive automation in fairly straightforward ways, such as by counting words or preparing a concordance. We now find ourselves on the verge of a fourth generation of content analysis techniques in which computational linguistics holds promise for automated population of complex coding frames. This could enable sophisticated Web-scale studies, potentially fostering emergence of research methods that go well beyond content to encompass many forms of evidence from human interaction with information. In this talk, I will describe some challenges that we must overcome as these two communities learn to work together. I'll illustrate my talk with examples from the PopIT project, a collaboration between social scientists and computational linguists at the University of Maryland, in which we are developing automated tools for computational analysis of trends in the popularity of information technology innovations. I'll start with a sketch of our research design for working at the intersection of these two fields, and then I'll describe a few specific pieces of that puzzle that we have already started to build.

Finally, I'll conclude with a few remarks about where we see potential for collaboration with others who share similar interests.

BIODATA:
Douglas Oard is Associate Dean for Research at the College of Information Studies of the University of Maryland, College Park, where he holds joint appointments as Associate Professor in the College of Information Studies and in the Institute for Advanced Computer Studies. He earned his Ph.D. in Electrical Engineering from the University of Maryland. Dr. Oard's research interests center around the use of emerging technologies to support information seeking by end users, with recent work focusing on interactive techniques for cross-language information retrieval, searching conversational media, and leveraging observable behavior to improve user modeling. Together with Ping Wang and Ken Fleischmann, he helps to lead the NSF-funded PopIT project. Additional information is available at http://www.glue.umd.edu/~oard/

ABSTRACT:
Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this talk, we introduce a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated data, we learn annotated BTG rules and train a linguistically motivated phrase translation model and reordering model. We also present an annotation algorithm that captures syntactic information for BTG nodes. The experiments show that the LABTG approach significantly outperforms a baseline BTG-based system and a state-of-the-art phrase-based system on the NIST MT-05 Chinese-to-English translation task. Moreover, we empirically demonstrate that the proposed method achieves better translation selection and phrase reordering.

BIODATA:
Xiong Deyi received his Ph.D. from the Institute of Computing Technology of the Chinese Academy of Sciences. His research interests include statistical machine translation, Chinese language processing, information extraction, and statistical parsing. He is currently a research fellow at the Institute for Infocomm Research of the Agency for Science, Technology and Research (I2R, A*STAR).

ABSTRACT:
Information Extraction (IE) is the task of identifying information (e.g. entities, relations or events) in free text. Numerous context-, ontology-, rule- and classification-based methods have been explored over decades of research on this task. However, the challenging open question of effectively handling the flexibility of natural language remains unresolved. In IE, this flexibility manifests as sparseness of data instances, which in turn causes the problems of paraphrasing and misalignment of context features of the extracted information. In this thesis, we hypothesize that such problems can be alleviated by combining relations between entities at the phrasal, dependency, semantic and inter-clausal discourse levels. To validate our hypothesis, we develop a two-level multi-resolution framework, ARE (Anchors and Relations). The first level of ARE extracts candidate phrases (anchors), while the second level evaluates the relations among the anchors and composes possible candidate templates.

The relations between the anchors are combined in several ways. First, we evaluate dependency relations between anchors. We classify dependency relation paths between the anchors into Simple, Average and Hard categories according to path length, and develop different techniques to handle each. These category-specific strategies resulted in improvements of 3% and 4% on the MUC4 (Terrorism) and MUC6 (Management Succession) domains, respectively. The improved performance demonstrates that dependency relations are important for handling paraphrases at the syntactic level. Second, we incorporate discourse relation analysis into a multi-resolution framework for IE to handle long-distance dependency relations and possible paraphrasings at the intra-clausal level. This leads to further improvements of 3%, 7%, 3% and 4% on the MUC4, MUC6 and ACE RDC 2003 (general and specific types) domains, respectively. Third, we explore two supplementary strategies for combining relation paths between anchors. Since negative paths between anchors are many times more numerous than positive paths, we apply a filtering strategy to eliminate negative paths. We also support the learning process of our dependency relation classifier by cascading in the features from the discourse classifier. These two strategies further improve IE performance on the MUC4, MUC6 and ACE RDC 2003 (general and specific types) corpora.

Overall, our results affirm the hypothesis that the extraction of candidate phrases (anchors) and the combination of different relation types between anchors in a multi-resolution framework is important to tackle the key problems of paraphrasing and misalignment in Information Extraction.

BIODATA:
Mr. Maslennikov Mstislav is a Doctoral Student at SOC, NUS. He received his 5-year diploma (equivalent to M.Sc.) degree from the Moscow State University, Russia. Since 2002, he has been studying in the internship and PhD programs under the supervision of Prof. Chua Tat-Seng and Dr. Tian Qi. His research is on the theme of improving Information Extraction through relation-based analysis of free text.

We report on the user requirements study and preliminary implementation phases in creating a digital library that indexes and retrieves educational materials on math. We first review the current approaches and resources for math retrieval, then report on interviews with a small group of potential users to properly ascertain their needs. While preliminary, the results suggest that meta-search and resource categorization are two basic requirements for a math search engine. In addition, we implement a prototype categorization system and show that the generic features work well in identifying math content in web pages but perform less well at categorizing it. We discuss our long-term goals, where we plan to investigate how math expression and text search may be best integrated.

We consider the task of automatic slide image retrieval, in which slide images are ranked for relevance against a textual query. Our implemented system, SLIDIR, caters specifically to this task, using features designed for synthetic images embedded within slide presentations. We show promising results on both the ranking and binary relevance tasks, and analyze the contribution of different features to task performance.

We describe a user-assisted framework for correcting ink-bleed in old handwritten documents housed at the National Archives of Singapore (NAS). Our approach departs from traditional correction techniques that strive for full automation. Fully automated approaches make assumptions about ink-bleed characteristics that are not valid for all inputs; furthermore, they often require algorithmic parameters that have no meaning for the end-user. In our system, the user needs only to provide simple examples of ink-bleed, foreground ink, and background. These training examples are used to classify the remaining pixels in the document, producing a computer-generated result that is equal to or better than those of existing fully automated approaches.

To offer a complete system, we provide additional tools that allow any remaining errors to be easily cleaned up by the user. The initial training markup, computer-generated results, and manual edits are all recorded with the final output, allowing subsequent viewers to see how a corrected document was created and to make changes or updates. While this is an on-going project, feedback from the NAS staff has been overwhelmingly positive that this user-assisted approach is a practical and useful way to address the ink-bleed problem.

[11:20 - 11:40]

"The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics (LREC)"
by
Kan Min-Yen

The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.

We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings in a plain text file and to retrieve citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We evaluate ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

ABSTRACT:
The ILIAD (Improved Linux Information Access by Data Mining) Project is an attempt to apply language technology to the task of Linux troubleshooting by analysing the underlying information structure of a multi-document text discourse and improving information delivery through a combination of filtering, term identification and information extraction techniques. In this talk, I will outline the overall project design and present results for a variety of thread-level filtering tasks.

BIODATA:
Timothy Baldwin is a Senior Lecturer in the Department of Computer Science and Software Engineering, University of Melbourne. Since completing his PhD at the Tokyo Institute of Technology in 2001, he has been involved with research grants from funders including the NSF, NTT, ARC, NICTA and Google. His research interests include web mining, information extraction, deep linguistic processing, multiword expressions, deep lexical acquisition, and biomedical text mining. He is the author of over 130 journal and conference publications, and has held visiting appointments at NTT Communication Science Laboratories and Saarland University. He is the recipient of a number of awards for both teaching and research in the areas of computer science and natural language processing. He is currently on the editorial board of Computational Linguistics, a series editor for CSLI Publications, and a member of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN).

We propose an automatic machine translation (MT) evaluation metric that calculates a similarity score (based on precision and recall) of a pair of sentences. Unlike most metrics, we compute a similarity score between items across the two sentences. We then find a maximum weight matching between the items such that each item in one sentence is mapped to at most one item in the other sentence. This general framework allows us to use arbitrary similarity functions between items, and to incorporate different information in our comparison, such as n-grams, dependency relations, etc. When evaluated on data from the ACL-07 MT workshop, our proposed metric achieves higher correlation with human judgements than all 11 automatic MT evaluation metrics that were evaluated during the workshop.
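The matching step above can be sketched as follows. This is a toy rendering under stated assumptions: the `sim` function, the brute-force search (feasible only for short sentences; the actual metric presumably uses a polynomial-time matching algorithm), and the harmonic-mean combination of precision and recall are all illustrative choices, not the authors' exact method:

```python
from itertools import permutations

def sim(a, b):
    # Toy item similarity: exact match scores 1.0, case-insensitive
    # match 0.5, otherwise 0. The real metric supports arbitrary
    # similarity functions over n-grams, dependency relations, etc.
    if a == b:
        return 1.0
    if a.lower() == b.lower():
        return 0.5
    return 0.0

def best_matching_weight(hyp, ref, sim=sim):
    """Maximum-weight one-to-one matching between the items of two
    sentences, so each item maps to at most one item on the other
    side. Brute force over permutations, for illustration only."""
    if len(hyp) > len(ref):
        hyp, ref = ref, hyp
    best = 0.0
    for perm in permutations(range(len(ref)), len(hyp)):
        best = max(best, sum(sim(h, ref[j]) for h, j in zip(hyp, perm)))
    return best

def metric(hyp, ref):
    """Similarity score combining precision and recall of the
    matched weight (harmonic mean assumed here)."""
    w = best_matching_weight(hyp, ref)
    p, r = w / len(hyp), w / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

Identical sentences score 1.0; partial matches are discounted in proportion to the unmatched or weakly matched items on either side.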

Recent efforts on the task of spoken document retrieval (SDR) have made use of speech lattices: speech lattices contain information about alternative speech transcription hypotheses other than the 1-best transcripts, and this information can improve retrieval accuracy by overcoming recognition errors present in the 1-best transcription. In this paper, we look at using lattices for the query-by-example spoken document retrieval task -- retrieving documents from a speech corpus, where the queries are themselves in the form of complete spoken documents (query exemplars). We extend a previously proposed method for SDR with short queries to the query-by-example task. Specifically, we use a retrieval method based on statistical modeling: we compute expected word counts from document and query lattices, estimate statistical models from these counts, and compute relevance scores as divergences between these models. Experimental results on a speech corpus of conversational English show that the use of statistics from lattices for both documents and query exemplars results in better retrieval accuracy than using only 1-best transcripts for either documents, or queries, or both. In addition, we investigate the effect of stop word removal which further improves retrieval accuracy. To our knowledge, our work is the first to have used a lattice-based approach to query-by-example spoken document retrieval.
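The statistical-modeling step can be sketched as below, assuming expected word counts have already been computed from the document and query lattices. The additive smoothing constant and the exact scoring form are assumptions of this sketch; the actual estimator may differ:

```python
import math

def unigram_model(expected_counts, vocab, alpha=0.1):
    """Estimate a smoothed unigram distribution from expected word
    counts (which, for lattices, are fractional counts summed over
    transcription hypotheses rather than 1-best counts).

    alpha is an assumed additive-smoothing constant."""
    total = sum(expected_counts.get(w, 0.0) for w in vocab)
    denom = total + alpha * len(vocab)
    return {w: (expected_counts.get(w, 0.0) + alpha) / denom
            for w in vocab}

def relevance(query_model, doc_model):
    """Relevance score as negative KL divergence KL(query || doc):
    a higher (less negative) score means the document's model is
    closer to the query exemplar's model."""
    return -sum(q * math.log(q / doc_model[w])
                for w, q in query_model.items() if q > 0)
```

Documents would then be ranked by this score against each query exemplar; stop word removal would correspond to dropping stop words from `vocab` before estimation.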

Abbreviations, acronyms, initialisms, and shortenings frequently occur in many texts found on the Web, such as publication metadata, stock ticker codes, and biological articles. To connect these disparate forms together for knowledge discovery, short forms must be properly linked to their canonical long forms. In this paper, we demonstrate how a search engine can be efficiently utilized in mining the required contextual information, so that short forms can be effectively linked to long forms. We show that a count-based method consistently outperforms other methods, and that using the snippets is better than using the full web pages. We also consider adaptively combining a query probing algorithm together with our count-based method. This reduces running time and network bandwidth, while maintaining the strong linkage performance.
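A minimal sketch of the count-based idea: among candidate long forms, pick the one whose co-occurrence with the short form yields the most search hits. The `count_hits` callback here stands in for the search engine and is a hypothetical interface assumed for this sketch:

```python
def link_short_form(short_form, candidates, count_hits):
    """Count-based short-form/long-form linking: choose the
    candidate long form with the highest co-occurrence hit count.

    count_hits(short_form, long_form) abstracts a search engine
    query (e.g. over snippets); it is an assumed interface, not
    part of the described system."""
    return max(candidates,
               key=lambda lf: count_hits(short_form, lf))
```

A query probing step, as mentioned above, could wrap this function to skip the counting when an early query already resolves the short form confidently, saving queries and bandwidth.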

ABSTRACT:
In this work, we propose a novel method based on ellipsed predicates to automatically interpret compound nouns with a predefined set of semantic relations. First, we map verb tokens in sentential contexts to a fixed set of seed verbs using WordNet::Similarity and Moby's Thesaurus. We then match the sentences with semantic relations based on the semantics of the seed verbs and the grammatical roles of the head noun and modifier. Based on the semantics of the matched sentences, we then build a classifier using a memory-based classification tool, Timbl 5.1. The performance of our final system at interpreting noun compounds (NCs) is 52.6%. We also compared our method with previous methods and confirmed better performance on the same dataset.

BIODATA:
Su Nam Kim is a postdoctoral research fellow at NUS. She received her BS and MS degrees from Pusan National University, South Korea, and an MS degree from the State University of New York at Stony Brook, USA. She recently completed her Ph.D. study at the University of Melbourne, Australia. She has a broad research interest in AI but primarily focuses on lexical semantics, including multiword expressions, word sense disambiguation and cross-lingual lexical acquisition. She is also interested in multi-document/multilingual summarization and question-answering systems.

ABSTRACT:
With the advances in medical techniques, large amounts of medical data are produced in hospitals every day. Radiology reports contain rich information about the corresponding medical images but are often left unmined. Our research therefore focuses on information extraction from brain CT radiology reports, radiology-report-assisted medical image content retrieval, and automatic generation of brain CT reports based on domain knowledge and associated images. Current medical record search systems will benefit from our research, making searching for information more efficient and convenient. Doctors and radiologists can also conduct their research in the area more efficiently using the improved system. The automatic generation of reports can serve as a reference for radiologists. Our research will also help facilitate an education system for junior doctors and researchers in the area.

BIODATA:
Gong Tianxia is a PhD candidate in computer science at the School of Computing (SOC), National University of Singapore (NUS), supervised by A/P Tan Chew Lim. She received her bachelor's degree in Computer Engineering at SOC in 2006. Her research interests are in information retrieval and medical text processing.

ABSTRACT:
This research focuses on multiword expressions (MWEs), that is, lexical items made up of two or more simplex words, such as "dog pound", "call up" or "red herring". My goals are: to shed light on the underlying linguistic processes giving rise to MWEs; to generalize techniques for identifying, extracting and analyzing MWEs; to compare pre-existing MWE classifications; and finally, to exemplify the utility of MWE interpretation within NLP tasks. This is aimed at improving the fluency, robustness and understanding of natural language. The first of the three-part presentation, on Feb. 26th, will provide a brief background on MWEs, including different research perspectives and the linguistic foundations of MWEs. It will also cover the basic statistical approaches broadly used in MWE studies and present a summary of recent advances. The second and third talks will present a more technical and detailed discussion of work done in the past two years. The schedule for the second and third talks will be announced later.

BIODATA:
Su Nam Kim is a postdoctoral research fellow at NUS. She received her BS and MS degrees from Pusan National University, South Korea, and an MS degree from the State University of New York at Stony Brook, USA. She recently completed her Ph.D. study at the University of Melbourne, Australia. She has a broad research interest in AI but primarily focuses on lexical semantics, including multiword expressions, word sense disambiguation and cross-lingual lexical acquisition. She is also interested in multi-document/multilingual summarization and question-answering systems.

ABSTRACT:
Hierarchical clustering of data is one of the most widely used machine learning techniques. Traditional hierarchical clustering techniques construct a single tree in a greedy fashion, either in a top-down or a bottom-up agglomerative fashion. Sometimes we are interested in how reliable the constructed tree is, i.e. how much we believe that the structure of the tree reflects true underlying structure in the data rather than spurious effects due to noise. Such a question can be answered using a Bayesian approach where we define a prior over trees and compute a posterior distribution over trees which captures the uncertainty in the learned tree structure.

However, past Bayesian models for hierarchical clustering either do not give a posterior over trees (Heller and Ghahramani 2005, Friedman 2003), are not infinitely exchangeable (Williams 2000), or are simply too complex to have widespread appeal (Neal 2003). In this talk we present a model that 1) gives a posterior distribution over trees, 2) is easy to implement, and 3) has the additional nice property of being infinitely exchangeable.

Our model is based upon a standard model in population genetics called Kingman's coalescent. We propose both greedy and sequential Monte Carlo inference algorithms for the model. We show that our model performs well compared to previous approaches on a number of small datasets, and apply it to document clustering and phylolinguistics.

BIODATA:
Dr Teh Yee Whye is a lecturer at the Gatsby Computational Neuroscience Unit, University College London in the United Kingdom. Prior to this appointment he worked with Prof Lee Wee Sun as a Lee Kuan Yew Postdoctoral Fellow at the National University of Singapore, and with Prof. Michael I. Jordan as a postdoc at the University of California at Berkeley. He obtained his PhD from the University of Toronto under Prof. Geoffrey E. Hinton. His research interests are in Bayesian machine learning and probabilistic graphical models.

ABSTRACT:
With the explosion of the amount of textual data in the information age, natural language processing (NLP) has become increasingly important, with direct applications in areas such as Web mining and biomedical literature mining. Currently, the most effective approach to solving most NLP problems is supervised learning coupled with linguistic knowledge. However, standard supervised learning requires the training and the test corpora to be similar, and therefore falls apart in real NLP applications because obtaining labeled data for every new domain is expensive and thus infeasible. In this talk, I will present the major line of my PhD research on domain adaptation in NLP, which aims at adapting classifiers trained on one domain to another domain. We have proposed two frameworks to achieve domain adaptation, both having been evaluated on real NLP problems and outperformed standard learning methods. I will also briefly mention the future plan to incorporate knowledge bases and expert interactions into the domain adaptation process, with applications in large-scale information extraction from biomedical literature.

BIODATA:
Ms Jing Jiang is a final year PhD student in the Text Information Management Group in the Computer Science Department at the University of Illinois at Urbana-Champaign, working with Professor ChengXiang Zhai. Her research interests include natural language processing, information retrieval, machine learning, and biomedical literature mining. She received her B.S. degree and her M.S. degree in Computer Science from Stanford University in 2002 and 2003, respectively.

2007

ABSTRACT:
I will describe how one useful aspect of the structure of scientific articles can be discovered with reasonably shallow means, namely the prototypical argumentation for the validity of the current research. Reference to other people's work, and reasonably standardised statements about this work, are a staple part of the argumentation, and citation analysis can exploit this fact. AZ-discourse analysis is the robust machine-learning of this structure, based on the extraction of correlated, and often linguistically interesting, features. I will show results of AZ on two domains (computational linguistics and chemistry), and discuss several search and summarisation applications using AZ. I will also speculate on more sequence-based methods for recognising AZ-type structures in text.

BIODATA:
Simone Teufel is a senior lecturer in the Computer Laboratory at Cambridge University, where she has worked since 2001. Her main research interests are in corpus-linguistic approaches to discourse theory, and in the application of such information to summarisation, information retrieval and citation analysis. She has a background in computer science (1994 Diploma from the University of Stuttgart) and in cognitive science (2000 PhD from Edinburgh University), and also has experience in medical information processing and search, from a postdoctoral stay at Columbia University, and in collocation extraction, from a research post at Xerox Europe. Her latest research interests include lexical acquisition, and the visualisation and language generation of the analysis results of scientific articles.

ABSTRACT:
While bioinformatics has advanced far in the past years, and recognisers for gene and protein names and interactions have been built, biochemistry is a new field for computational linguistics to move into. I will be talking about the recognition strategy for scientific papers in general which the NLIP group at Cambridge University is developing, while concentrating on the research done in the SciBorg project on chemical name parsing, ontology discovery, and discourse-related search. I will also talk a bit about the role of citations in this recognition effort, and about the quite unusual infrastructure that our project is built on -- robust semantic representations, encoded as XML standoff.

BIODATA:
Simone Teufel is a senior lecturer in the Computer Laboratory at Cambridge University, where she has worked since 2001. Her main research interests are in corpus-linguistic approaches to discourse theory, and in the application of such information to summarisation, information retrieval and citation analysis. She has a background in computer science (1994 Diploma from the University of Stuttgart) and in cognitive science (2000 PhD from Edinburgh University), and also has experience in medical information processing and search, from a postdoctoral stay at Columbia University, and in collocation extraction, from a research post at Xerox Europe. Her latest research interests include lexical acquisition, and the visualisation and language generation of the analysis results of scientific articles.

Coreference resolution is the task of finding different mentions of the same entity in a text. In the past decade, knowledge-lean approaches have been widely adopted, in which only simple morpho-syntactic cues are employed as knowledge sources in the resolution process. Although these approaches have achieved reasonable success, researchers have found that deeper syntactic or semantic knowledge is necessary in order to reach the next level of performance. In this talk, we will introduce our knowledge-rich approaches to coreference resolution, including a tree-kernel-based method for syntactic knowledge and web-based methods for semantic knowledge. These sources of enriched knowledge are acquired automatically without much human effort, and have proved effective for the coreference resolution task.

ABSTRACT:
A Scenario Template is a data structure that reflects the salient aspects shared by a set of similar events, which are considered as belonging to the same scenario. These salient aspects are typically the scenario's characteristic actions, the entities involved in these actions, and their related attributes.

In this talk, I will first give a brief overview of our approach to scenario template creation and report the latest evaluation results. Then I will discuss one possible application of scenario templates, namely open-domain question answering. For Q&A systems, query expansion is a common strategy, while sentence selection is an important process. I will show how scenario templates might help in these two aspects.

BIODATA:
Qiu Long is a Doctoral Student at SoC, NUS, co-supervised by Professor Chua Tat-Seng and Dr. Min-Yen Kan. He received his Master of Science (SM) in CS from the Singapore-MIT Alliance. He is interested in Natural Language Processing (NLP) and the related machine learning techniques.

ABSTRACT:
In recent years, speech processing products have been widely distributed all over the world, reflecting a general belief that speech technologies have a huge potential to overcome language barriers and let everyone participate in today's information revolution. However, in spite of vast improvements in speech and language technologies, the development of speech processing systems still requires significant skills and resources. Consequently, with more than 6500 languages in the world, the current costs and effort of building speech support are prohibitive for all but the most economically viable languages.

In this talk I will discuss the challenges and limitations of rapidly developing automatic speech processing systems for a large number of languages and dialects. I will describe solutions to system development based on sharing data and system components across languages. Practical implementations and recent results are presented in the light of our SPICE project, which aims to bridge the gap between language and technology experts by providing innovative strategies and tools for non-expert users. These tools enable the user to easily collect appropriate text and speech data, to quickly develop acoustic models, pronunciation dictionaries, and language models based on very limited resources, and to monitor progress and performance allowing for iterative improvements with the user in the loop.

BIODATA:
Tanja Schultz received her Ph.D. and Masters in Computer Science from the University of Karlsruhe, Germany, in 2000 and 1995 respectively, and a German Masters in Mathematics, Sports, and Education Science from the University of Heidelberg, Germany, in 1990. She joined Carnegie Mellon University in 2000 and is a faculty member of the Language Technologies Institute as a Research Computer Scientist. Since 2007 she has also held a full professorship at Karlsruhe University, Germany.

Her research activities center around language-independent and language-adaptive speech recognition, but also include large vocabulary continuous speech recognition systems, human-machine interfaces using speech and various biosignals, speech translation, as well as language and speaker identification approaches. With a particular area of expertise in multilingual approaches, she performs research on the portability of speech processing systems to many different languages. In 2001 Tanja Schultz was awarded the FZI prize for her outstanding Ph.D. thesis on language-independent and language-adaptive speech recognition. In 2002 she received the Allen Newell Medal for Research Excellence from Carnegie Mellon for her contribution to Speech-to-Speech Translation, and the ISCA best paper award for her publication on language-independent acoustic modeling. In 2005 she was awarded the Carnegie Mellon Language Technologies Institute Junior Faculty Chair. Tanja Schultz is the author of more than 100 articles published in books, journals, and proceedings.

She is a member of the IEEE Computer Society, the European Language Resource Association, the Society of Computer Science (GI) in Germany, and currently serves on the ISCA board and several program and review panels.

ABSTRACT:
The idea of using humans to teach computers is not a new one, but it has been largely impractical and largely ignored. Modern-day computers tend to "learn" by either sifting through large amounts of data or by being programmed/endowed with expert knowledge. Typically there is little interaction between man and machine. Our recent project, called "Wubble World", capitalizes on the availability of free hands-on human teaching as a means for machine learning of language and concepts.

The basic premise of this work begins with an online game situated in a virtual 3D environment. Language is generated as children interact with their personal creature, called a wubble, or with other children. By virtue of the virtual environment, this language is situated and forms a rich corpus of matched scenes and sentences upon which to learn language and concepts. In one part of the environment, children interact with their wubble by teaching it to accomplish certain given tasks. The wubble, like a toddler, initially knows little about the world, and must acquire concepts and labels by interacting with the child. I'll describe this environment and the basic concept learning that happens inside the wubble. In another part of the world, children play a competitive team game against other children. The game is designed to require cooperation among team members, typically using spoken language. This language, combined with a log of the game state, generates a rich sentence-scene corpus. This richness could potentially enable natural language processing to move beyond current statistical techniques by incorporating data that reveals underlying meaning. I'll demonstrate the game, describe the data we have collected so far, and discuss some of the possible approaches for learning from this data.

BIODATA:
Dr. Yu-Han Chang is a Computer Scientist at the Information Sciences Institute of the University of Southern California (USC ISI). His current research interests span reinforcement learning, game theory, natural language understanding, interactive technologies, and traditional AI. Recent and ongoing projects include harnessing the power of the Internet to train intelligent agents via human teaching, transfer learning, and the development of efficient no-regret algorithms for non-cooperative learning domains. Dr. Chang holds undergraduate degrees in Mathematics and Economics, as well as an S.M. in Computer Science, from Harvard University. He received his Ph.D. in Electrical Engineering and Computer Science from MIT, focusing his efforts on developing algorithms for multi-agent learning in the context of machine learning and game theory.

ABSTRACT:
The task of referring expression generation is concerned with determining what semantic content should be used in a reference to an intended referent so that the hearer will be able to identify that referent. The task has been a focus of interest within natural language generation at least since the early 1980s, in part because the problem appears relatively well-defined. Over the last 25 years, a range of algorithms and approaches have been proposed and explored, making this the most intensely studied problem in natural language generation; and yet, even a casual analysis of real human-authored texts suggests that we have a long way to go in terms of providing an explanation for the range of real linguistic behaviour that we find. In this talk, I'll review research in the area to date, try to characterise where we are now, and point to directions for future research in the area.

BIODATA:
Robert Dale received his PhD in Computational Linguistics from the University of Edinburgh in 1989. His research interests include low-cost approaches to intelligent text processing tasks; practical natural language generation; the engineering of habitable spoken language dialog systems; and computational, philosophical and linguistic issues in reference and anaphora. He is Director of the Centre for Language Technology at Macquarie University, Convenor of the Australian Research Council's Human Communication Science Network, and editor-in-chief of the Journal of Computational Linguistics.

ABSTRACT:
This work deals with the problem of event annotation in social networks. The problem is difficult due to the variability of semantics and the scarcity of labeled data. Events refer to real-world phenomena that occur at a specific time and place, and media and text tags are treated as facets of the event metadata. We propose a novel mechanism for event annotation by leveraging related sources (other annotators) in a social network. Our approach exploits event concept similarity, concept co-occurrence and annotator trust. We compute concept similarity measures across all facets. These measures are then used to compute event-event and user-user activity correlation. We compute inter-facet concept co-occurrence statistics from the annotations by each user. The annotator trust is determined by first requesting the trusted annotators (seeds) from each user and then propagating the trust amongst the social network using the biased PageRank algorithm. For a specific media instance to be annotated, we start the process from an initial query vector, and the optimal recommendations are determined by a coupling strategy between the global similarity matrix and the trust-weighted global co-occurrence matrix. The coupling links the common shared knowledge (similarity between concepts) that exists within the social network with trusted and personalized observations (concept co-occurrences). Our initial experiments on annotated everyday events are promising and show substantial gains against traditional SVM-based techniques.
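The trust-propagation step is a personalized (biased) PageRank. The sketch below assumes a directed trust graph among users, with the restart mass placed only on the seed annotators; the damping factor and dangling-node treatment are conventional choices, not details from the abstract:

```python
def biased_pagerank(neighbors, seeds, damping=0.85, iters=50):
    """Personalized (biased) PageRank over a social network.

    neighbors maps each user to the users they link to; seeds are
    the trusted annotators a user nominated. On each restart, mass
    is distributed only over the seeds, biasing trust toward them."""
    users = list(neighbors)
    rank = {u: 1.0 / len(users) for u in users}
    restart = {u: (1.0 / len(seeds) if u in seeds else 0.0)
               for u in users}
    for _ in range(iters):
        new = {u: (1 - damping) * restart[u] for u in users}
        for u in users:
            outs = neighbors[u]
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling user: return their mass to the seeds.
                for v in users:
                    new[v] += damping * rank[u] * restart[v]
        rank = new
    return rank
```

The resulting scores would then weight each annotator's co-occurrence statistics when building the trust-weighted global co-occurrence matrix.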

Co-authors: Amit Zunjarwad (AME), Lexing Xie (IBM)

BIODATA:
Hari Sundaram is currently an assistant professor at Arizona State University, in a joint appointment with the Department of Computer Science and the Arts Media and Engineering program. He received his Ph.D. from the Department of Electrical Engineering at Columbia University in 2002, his MS degree in Electrical Engineering from SUNY Stony Brook in 1995, and a B.Tech in Electrical Engineering from the Indian Institute of Technology, Delhi, in 1993.

This talk will examine some assumptions in media semantics under three broad categories: (a) aspects of meaning, (b) rethinking semantic construction, and (c) learnability contradictions. A re-examination of the assumptions behind media semantics is useful, as the mechanisms by which people create and consume media have changed significantly in the last decade. These changes offer fresh insight into the familiar problem of the semantic gap: how to go from sensory data to meaning. There are three aspects of meaning of interest: context, approximations, and variability. We need to examine the construction of meaning in a manner very different from the familiar Marr model; specifically, we shall examine embodiment and networked construction. A significant challenge to the learnability of semantics lies in re-examining, within the multimedia context, what Chomsky calls the "poverty of input" problem. How is it possible to learn a large number of concepts with very few, or even non-existent, training examples? We will examine the role of context and semantic approximation with an application to media retrieval. The issues of embodiment and its relation to semantics will be discussed with respect to an educational application. We hope to provide a partial answer to the issue of semantic construction and learnability in an application related to social networks.


His research group works on developing computational models and systems for situated communication. There are two complementary (but coupled) directions: (a) designing intelligent multimedia environments that exist as part of our physical world (e.g. an intelligent room), and (b) developing new algorithms and systems to understand the media artifacts resulting from human activity (e.g. emails, photos/video). Specific projects include context models for action, resource adaptation, interaction architectures, communication patterns in media-sharing social networks, collaborative annotation, as well as analysis of online communities.

Prof. Sundaram's research has won several awards: the best student paper award at JCDL 2007, the best ACM Multimedia demo award in 2006, the best student paper award at the ACM conference on Multimedia 2002, and the 2002 Eliahu I. Jury Award for best Ph.D. dissertation. He also received a best paper award on video retrieval from the IEEE Transactions on Circuits and Systems for Video Technology for the year 1998. He is an active participant in the multimedia community: he is an associate editor for ACM Transactions on Multimedia Computing, Communications and Applications (TOMCCAP), as well as the IEEE Signal Processing Magazine. He has co-organized workshops at ACM Multimedia on experiential telepresence (ETP 2003, ETP 2004) and archival of personal experiences (CARPE 2004, CARPE 2005), and a conference on image and video retrieval (CIVR 2006).

ABSTRACT:
The problem of domain adaptation for statistical classifiers arises when our labeled training examples and unlabeled test examples come from different domains. This problem is commonly encountered in natural language processing (NLP) tasks. For example, we may train a named entity recognition (NER) system on news articles but apply the system to blog or email text. It is generally observed that the performance of a classifier tends to drop significantly when it is applied to a different domain.

In this talk, I will present our recent work addressing the domain adaptation problem. We have proposed two frameworks, corresponding to two different perspectives on this problem: feature selection and instance weighting. In the feature selection framework, we seek to identify "generalizable features" that behave similarly across domains; in the instance weighting framework, our idea is to re-weight the examples in order to minimize the expected loss on the test domain. In both frameworks, we have also incorporated semi-supervised learning to make use of the unlabeled test domain examples. Experimental results on a number of NLP tasks, including NER, part-of-speech (POS) tagging, and spam filtering, show the effectiveness of both frameworks. At the end of the talk, I will briefly mention our current effort to unify the two perspectives, as well as some future directions to pursue.

BIODATA:
Jing Jiang is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. She is a member of the Information Retrieval Group led by Professor ChengXiang Zhai. Her research interests include information extraction, information retrieval, biomedical text mining, and machine learning. She received her B.S. degree and her M.S. degree in Computer Science from Stanford University in 2002 and 2003, respectively.

The JAVELIN question answering architecture has been used to build QA systems for monolingual English, Japanese and Chinese, as well as cross-lingual QA systems for English-Japanese and English-Chinese. This talk will present and discuss recent research results in structured retrieval, answer extraction and answer selection for QA, and summarize end-to-end system performance as evaluated in the recent NTCIR-6 competition.

Speech recognition transcripts are far from perfect; they are not of sufficient quality to be useful on their own for spoken document retrieval. This is especially the case for conversational speech. Recent efforts have tried to overcome this issue by using statistics from speech lattices instead of only the 1-best transcripts; however, these efforts have invariably used the classical vector space retrieval model. This paper presents a novel approach to lattice-based spoken document retrieval using statistical language models: a statistical model is estimated for each document, and probabilities derived from the document models are directly used to measure relevance. Experimental results show that the lattice-based language modeling method outperforms both the language modeling retrieval method using only the 1-best transcripts, as well as a recently proposed lattice-based vector space retrieval method.
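The core of the lattice-based language modeling method can be sketched as a query-likelihood score computed from expected lattice counts rather than 0/1 counts from the 1-best transcript. This is a simplified illustration: the Jelinek-Mercer smoothing scheme and its parameter value are assumptions, not necessarily what the paper uses:

```python
import math

def score(query, doc_counts, coll_counts, lam=0.8):
    """Query-likelihood score of one spoken document.

    doc_counts: expected word counts derived from the recognition
    lattice (each word weighted by its posterior probability);
    coll_counts: word counts over the whole collection, used for
    smoothing.  Assumes each query term occurs somewhere in the
    collection, so the smoothed probability is never zero.
    """
    dlen = sum(doc_counts.values())
    clen = sum(coll_counts.values())
    logp = 0.0
    for w in query:
        p_doc = doc_counts.get(w, 0.0) / dlen if dlen else 0.0
        p_coll = coll_counts.get(w, 0.0) / clen
        # Jelinek-Mercer smoothing of the document language model
        logp += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return logp
```

Documents are then ranked by this log-probability; a document whose lattice gives the query word a high expected count scores above one where the word appears only with low posterior.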

In this paper, we present a machine learning approach to the identification and resolution of Chinese anaphoric zero pronouns. We perform both identification and resolution automatically, with two sets of easily computable features. Experimental results show that our proposed learning approach achieves anaphoric zero pronoun resolution accuracy comparable to a previous state-of-the-art, heuristic rule-based approach. To our knowledge, our work is the first to perform both identification and resolution of Chinese anaphoric zero pronouns using a machine learning approach.

Function words are a class of words with little intrinsic meaning, but they are vital in expressing grammatical relationships among phrases within a sentence. Such encoded grammatical information, often implicit, makes function words pivotal in modeling structural divergences, as projecting them into different languages often results in long-range structural changes to the realized sentences. This distinctive feature has not been fully utilized to address the phrase ordering problem in the context of statistical machine translation (SMT). We observe that, just as foreign language learners often make mistakes in using function words, current SMT systems often perform poorly in ordering function words' arguments; lexically correct translations often end up reordered incorrectly. In this talk, I will present a Function Words centered, Syntax-based (FWS) solution to the phrase ordering problem, including its statistical formalism, its implementation and experimental results.

Semantic relatedness is a very important factor for the coreference resolution task. To obtain this semantic information, corpus-based approaches commonly leverage patterns that can express a specific semantic relation. The patterns, however, are designed manually and thus are not necessarily the most effective ones in terms of accuracy and breadth. To deal with this problem, in this paper we propose an approach that can automatically find the effective patterns for coreference resolution. We explore how to automatically discover and evaluate patterns, and how to exploit the patterns to obtain the semantic relatedness information. The evaluation on ACE data set shows that the pattern based semantic information is helpful for coreference resolution.

[15:20 - 15:35]

"PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features"
by
Tan Yee Fan

We describe the system of the PSNUS team for the SemEval-2007 Web People Search Task. The system is based on clustering the web pages using a variety of features extracted and generated from the data provided. It achieves F_alpha=0.5 = 0.75 and F_alpha=0.2 = 0.78 on the final test data set of the task.
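The F_alpha scores quoted above are instances of van Rijsbergen's weighted harmonic mean; a one-line sketch, with purity and inverse purity standing in for the two component scores as in the WePS evaluation:

```python
def f_alpha(p, r, alpha=0.5):
    """Weighted harmonic mean of two scores (van Rijsbergen's F).

    alpha weights the first score: alpha = 0.5 is the plain harmonic
    mean, while alpha = 0.2 emphasises the second score (in the WePS
    task, purity and inverse purity play the roles of p and r).
    """
    return 1.0 / (alpha / p + (1.0 - alpha) / r)
```

So a system with high inverse purity but modest purity scores better under F_alpha=0.2 than under F_alpha=0.5, which matches the two figures reported.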

Recent research presents conflicting evidence on whether word sense disambiguation (WSD) systems can help to improve the performance of statistical machine translation (MT) systems. In this paper, we successfully integrate a state-of-the-art WSD system into a state-of-the-art hierarchical phrase-based MT system, Hiero. We show for the first time that integrating a WSD system improves the performance of a state-of-the-art statistical MT system on an actual translation task. Furthermore, the improvement is statistically significant.

When a word sense disambiguation (WSD) system is trained on one domain but applied to a different domain, a drop in accuracy is frequently observed. This highlights the importance of domain adaptation for word sense disambiguation. In this paper, we first show that an active learning approach can be successfully used to perform domain adaptation of WSD systems. Then, by using the predominant sense predicted by expectation-maximization (EM) and adopting a count-merging technique, we improve the effectiveness of the original adaptation process achieved by the basic active learning approach.

Words of foreign origin are referred to as borrowed words or loanwords. A loanword is usually imported to Chinese by phonetic transliteration if a translation is not easily available. Semantic transliteration is seen as a good tradition in introducing foreign words to Chinese. Not only does it preserve how a word sounds in the source language, it also carries forward the word's original semantic attributes. This paper attempts to automate the semantic transliteration process for the first time. We conduct an inquiry into the feasibility of semantic transliteration and propose a probabilistic model for transliterating personal names in Latin script into Chinese. The results show that semantic transliteration substantially and consistently improves accuracy over phonetic transliteration in all the experiments.

The convolution tree kernel has shown very promising results in semantic role classification. However, this method considers little linguistic knowledge and only carries out hard matching between substructures, which may lead to over-fitting and a less accurate similarity measure. To remove these constraints, this paper proposes a grammar-driven convolution tree kernel for semantic role classification, introducing more linguistic grammar information into the standard convolution tree kernel. The proposed grammar-driven convolution tree kernel has two advantages over the previous one: 1) grammar-driven approximate substructure matching, and 2) grammar-driven approximate tree node matching. The two improvements enable the proposed grammar-driven tree kernel to explore more linguistically motivated substructure features than the previous one. Experiments on the CoNLL-2005 SRL shared task show that the proposed grammar-driven tree kernel significantly outperforms the previous non-grammar-driven one in semantic role classification. Moreover, we present a composite kernel to integrate feature-based and tree kernel-based methods. Experimental results show that the composite kernel outperforms the previous best-reported methods.

Extraction of relations between entities is an important part of Information Extraction on free text. Previous methods are mostly based on statistical correlation and dependency relations between entities. This paper re-examines the problem at the multi-resolution layers of phrase, clause and sentence using dependency and discourse relations. Our multi-resolution framework ARE&D (Anchor and Relation and Discourse analysis) uses clausal relations in two ways: 1) to filter noisy dependency paths, and 2) to increase the reliability of dependency path extraction. The resulting system outperforms previous approaches by 3%, 7% and 4% on the MUC4, MUC6 and ACE RDC domains respectively.

Function words are a class of words with little intrinsic meaning, but they are vital in expressing grammatical relationships among phrases within a sentence. Such encoded grammatical information, often implicit, makes function words pivotal in modeling structural divergences, as projecting them into different languages often results in long-range structural changes to the realized sentences. This distinctive feature has not been fully utilized to address the phrase ordering problem in the context of statistical machine translation (SMT). We observe that, just as foreign language learners often make mistakes in using function words, current SMT systems often perform poorly in ordering function words' arguments; lexically correct translations often end up reordered incorrectly.

In this talk, I will present a Function Words centered, Syntax-based (FWS) solution to address the phrase ordering problem, including its statistical formalism, its implementation and experimental results.

We propose a faceted classification scheme for web queries. Unlike previous work, our functional scheme ties its classification to actionable strategies for search engines to take. Our scheme consists of four facets: ambiguity, authority sensitivity, temporal sensitivity and spatial sensitivity. We hypothesize that the classification of queries into such facets yields insight on user intent and information needs. To validate our classification scheme, we asked users to annotate queries with respect to our facets and obtained high agreement. We also assess the coverage of our faceted classification on a random sample of queries from logs. Finally, we discuss the algorithmic approaches we take in our current work to automate such faceted classification.

In this talk, I will present a new graph-based approach to text understanding and summarization. Current graph-based approaches to automatic text summarization, such as LexRank and TextRank, assume a static graph which does not model how the input texts emerge. A suitable evolutionary text graph model may impart a better understanding of the texts and improve the summarization process. We give simplified assumptions of human writing and reading processes, and then propose a timestamped graph (TSG) model that is motivated by these processes and show how text units in this model emerge over time. This model not only captures the evolving process of text within a document, but also the evolving process across documents. In our model, the graphs used by LexRank and TextRank are specific instances of our timestamped graph with particular parameter settings.
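A minimal sketch of the timestamped-graph idea: text units are added in (modelled) writing order, each linking only to units already present, and a PageRank-style iteration then ranks them. The similarity predicate, damping factor and undirected linking are illustrative assumptions, not the paper's exact parameterization:

```python
def timestamped_rank(sentences, similar, d=0.85, iters=30):
    """Rank text units in a timestamped graph.

    sentences: unit ids in the order they emerge (the timestamps);
    similar(a, b): True if two units should be linked.
    Each unit, when it appears, links only to units already present,
    so the graph grows with the modelled writing process.
    """
    edges = {s: [] for s in sentences}
    seen = []
    for s in sentences:
        for t in seen:
            if similar(s, t):      # new unit connects to earlier units
                edges[s].append(t)
                edges[t].append(s)  # undirected, as in LexRank
        seen.append(s)
    rank = {s: 1.0 / len(sentences) for s in sentences}
    for _ in range(iters):
        nxt = {}
        for s in sentences:
            incoming = sum(rank[t] / len(edges[t])
                           for t in edges[s] if edges[t])
            nxt[s] = (1 - d) / len(sentences) + d * incoming
        rank = nxt
    return rank
```

With a similarity predicate that links everything, this degenerates to the static graph of LexRank/TextRank, consistent with the claim that those graphs are special cases of the timestamped model.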

Text representation is the task of transforming the content of a textual document into a compact representation so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on the development of an effective and efficient term weighting method for the text categorization task. We selected the single token as the unit of feature because previous research showed that this simple type of feature outperformed more complicated ones.

We have investigated several widely-used unsupervised and supervised term weighting methods on several popular data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection and an analysis of each term's discriminating power, we have proposed a new term weighting scheme, namely tf.rf. The controlled experiments showed that term weighting methods give mixed performance across data sets with different category distributions and across learning algorithms. Most of the supervised term weighting methods based on information theory have not shown satisfactory performance in our experiments. However, the newly proposed tf.rf method shows consistently better performance than other term weighting methods, while the popularly used tf.idf method has not shown uniformly good performance with respect to different category distributions.
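A sketch of the tf.rf weight, assuming the published definition rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive- and negative-category training documents containing the term; treat the exact formula as an assumption rather than a quotation from the thesis:

```python
import math

def tf_rf(tf, pos_df, neg_df):
    """tf.rf weight of a term for one category.

    tf:     term frequency in the document;
    pos_df: number of positive-category training documents containing
            the term (a);
    neg_df: the same count for the negative category (c).
    rf = log2(2 + a / max(1, c)), so a term concentrated in the
    positive category gets a boosted weight, unlike idf, which ignores
    category membership entirely.
    """
    rf = math.log2(2.0 + pos_df / max(1.0, neg_df))
    return tf * rf
```

Note the asymmetry: a term seen in 8 positive and 1 negative documents is weighted far more heavily than one seen in 1 positive and 8 negative documents, even though idf would treat them identically.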

A Scenario Template is a data structure that reflects the salient aspects shared by a set of events, which are similar enough to be considered as belonging to the same scenario. The salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes. Such a scenario template, once populated with respect to a particular event, serves as a concise overview of the event. It also provides valuable information for applications such as information extraction (IE), text summarization, etc.

Manually defining a scenario template is expensive, and we aim to automate this template generation process. We argue that context is valuable for identifying semantically similar text spans, from which template slots can be generalized. To leverage context, we convert news articles into a graphical representation and then apply a generic context-sensitive clustering (CSC) framework to obtain meaningful clusters of text spans by examining the intrinsic and extrinsic similarities between them. We use the Expectation-Maximization algorithm to guide the clustering process. The experiments show that: 1) our approach generates high quality clusters, and 2) information extracted from the clusters is adequate to build high coverage templates.

This thesis studies the task of Relation Extraction, which has received more and more attention in recent years. The task of relation extraction is to identify various semantic relations between named entities from text contents. With the rapid increase of various textual data, relation extraction will play an important role in many areas, such as Question Answering, Ontology Construction, and Bioinformatics.

The goal of our research is to reduce the manual effort and automate the process of relation extraction. To this end, we investigate semi-supervised and unsupervised learning solutions that can rival supervised learning methods, aiming to resolve the problem of relation extraction with minimal human cost while still achieving performance comparable to supervised learning methods.

First, we presented a Label Propagation (LP) based semi-supervised learning algorithm for the relation extraction problem, learning from both labeled and unlabeled data. It represents labeled and unlabeled examples and their distances as the nodes and edge weights of a graph, then propagates the label information from any vertex to nearby vertices through the weighted edges iteratively, and finally infers the labels of unlabeled examples after the propagation process converges.
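The LP algorithm just described can be sketched as follows; this is a simplified dense-loop version with illustrative names and a fixed iteration count, not the thesis implementation:

```python
def label_propagation(weights, labels, iters=100):
    """Graph-based label propagation for relation classification.

    weights: symmetric {(i, j): w} similarity between examples;
    labels:  {node: class} for the few labeled examples.
    Each unlabeled node repeatedly takes the weighted average of its
    neighbours' class distributions; labeled nodes stay clamped.
    """
    nodes = sorted({n for edge in weights for n in edge})
    classes = sorted(set(labels.values()))
    # class distribution per node: one-hot if labeled, uniform otherwise
    f = {}
    for n in nodes:
        if n in labels:
            f[n] = [1.0 if c == labels[n] else 0.0 for c in classes]
        else:
            f[n] = [1.0 / len(classes)] * len(classes)
    for _ in range(iters):
        for n in nodes:
            if n in labels:          # labeled nodes stay clamped
                continue
            num = [0.0] * len(classes)
            total = 0.0
            for (i, j), w in weights.items():
                if n in (i, j):
                    other = j if i == n else i
                    for k in range(len(classes)):
                        num[k] += w * f[other][k]
                    total += w
            if total:
                f[n] = [x / total for x in num]
    return {n: classes[f[n].index(max(f[n]))] for n in nodes}
```

On a small chain a-b-c-d with a strong a-b edge, a weak b-c edge and a strong c-d edge, labeling only a and d lets b inherit a's relation type and c inherit d's.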

Secondly, we introduced an unsupervised learning algorithm based on model order identification for automatic relation extraction. The model order identification is achieved by resampling-based stability analysis and is used to infer the number of relation types between entity pairs automatically.

Thirdly, we further investigated an unsupervised learning solution for relation disambiguation using a graph-based strategy. We defined the unsupervised relation disambiguation task for entity mention pairs as a partition of a graph, so that entity pairs that are more similar to each other belong to the same cluster. We apply spectral clustering, a relaxation of this NP-hard discrete graph partitioning problem. It works by calculating the eigenvectors of an adjacency graph's Laplacian to recover a submanifold of the data from a high-dimensional space, and then performing cluster number estimation on this spectral information.
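For the two-cluster case, the spectral step reduces to splitting on the sign of the Fiedler vector. The sketch below uses the unnormalised Laplacian and a sign-based assignment, both simplifying assumptions (the full method also estimates the number of clusters):

```python
import numpy as np

def spectral_bipartition(adj):
    """Two-way spectral clustering of an entity-pair similarity graph.

    adj: symmetric similarity (affinity) matrix.  The sign pattern of
    the eigenvector for the second-smallest eigenvalue of the graph
    Laplacian (the Fiedler vector) gives a relaxed solution of the
    discrete graph cut.
    """
    adj = np.asarray(adj, dtype=float)
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                   # unnormalised graph Laplacian
    vals, vecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]              # second-smallest eigenvalue's vector
    return (fiedler > 0).astype(int)  # cluster label per node
```

On a graph with two tightly connected blocks joined by weak edges, the sign split recovers the blocks exactly, which is the intuition behind using spectral information for relation disambiguation.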

The thesis evaluates the proposed methods for extracting relations among named entities automatically, using the ACE corpus. The experimental results indicate that our methods can overcome the problem of not having enough manually labeled relation instances for supervised relation extraction methods. The results show that when only a few labeled examples are available, our LP based relation extraction can achieve better performance than SVM and another bootstrapping method. Moreover, our unsupervised approaches achieve order identification and outperform previous unsupervised methods. The results also suggest that all four categories of lexical and syntactic features used in the study are useful for the relation extraction task.

As a kind of shallow semantic parsing, Semantic Role Labeling (SRL) is receiving growing attention and shows good prospects for application to a wide range of natural language processing problems. I will first show a demo to explain what semantic role labeling is. Feature-based methods using flat feature vectors are the state-of-the-art approach to semantic role labeling. However, these methods, which are widely used in the natural language processing field, have difficulty modeling structured features, e.g. the useful Path features for semantic role labeling. As an extension of the feature-based methods, kernel-based methods are able to do this efficiently in a much higher-dimensional space. The convolution tree kernel, a special kind of kernel, has been used in semantic role labeling. The conventional convolution tree kernel, which selects the tree portion spanning a predicate and one of its arguments as the feature space, is known as the predicate-argument feature (PAF) kernel. However, the integral view of PAF is not well suited to semantic role labeling. We propose a hybrid convolution tree kernel to model syntactic tree structure features more effectively. The hybrid kernel consists of two individual convolution kernels: a Path kernel, which captures predicate-argument link features, and a Constituent Structure kernel, which captures the syntactic structure features of arguments. Evaluation on the data sets of the CoNLL-2005 SRL shared task shows that our novel hybrid convolution tree kernel significantly outperforms previous tree kernels. We further provide a composite kernel combining our hybrid tree kernel with a polynomial kernel over the standard flat feature vector. The experimental results show that the composite kernel achieves better performance than each individual method.

Talk 2

"Using Maximum Entropy to Recognize Name Origin in Machine Transliteration"
by
Sun Chengjie
(Harbin Institute of Technology, Institute for Infocomm Research I2R)

Name origin recognition is the task of identifying the original source of a name. It is a necessary step for name translation/transliteration because different origins require different translation strategies. It is even more important when translating across languages with different alphabets and sound inventories. Previous work used rule-based or statistics-based methods to solve this problem. In this work, we cast name origin recognition as a multi-class classification task and propose using a Maximum Entropy model to solve it. Experiments show that our approach achieves an overall accuracy of 98.35% for names written in English and 98.10% for names written in Chinese, which is much better than previous methods.
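For two origins and binary features, a Maximum Entropy model reduces to logistic regression. The sketch below uses character bigram features and toy training names; the feature set, training scheme and data are all illustrative assumptions rather than details of the actual system:

```python
import math
from collections import defaultdict

def bigrams(name):
    """Character-bigram features of a name (padded at the ends)."""
    s = "^" + name.lower() + "$"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def train(data, epochs=200, lr=0.5):
    """Binary maximum-entropy model (logistic regression over binary
    features) trained by stochastic gradient ascent.

    data: [(name, label)] with label 1 for one origin, 0 for the other.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for name, y in data:
            feats = bigrams(name)
            z = sum(w[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            for f in feats:          # gradient of the log-likelihood
                w[f] += lr * (y - p)
    return w

def predict(w, name):
    z = sum(w[f] for f in bigrams(name))
    return 1 if z > 0 else 0
```

Even this toy model picks up origin-specific bigrams (e.g. "zh", "ng" for pinyin-style names), which is the signal a real multi-class MaxEnt classifier would exploit at scale.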

Extraction of relations between entities is an important part of IE on free text. Previous methods are mostly based on statistical correlation and dependency relations between entities. This paper re-examines the problem at the multi-resolution layers of phrase, clause and sentence using dependency and discourse relations. Our multi-resolution framework uses clausal relations in two ways: 1) to filter noisy dependency paths, and 2) to increase the reliability of dependency path extraction. The resulting system outperforms previous approaches by 3%, 7% and 4% on the MUC4, MUC6 and ACE RDC domains respectively.

Writing or speaking requires making choices from words and syntactic constructions that have similar but not identical meanings. Are two parties "foes" or "enemies"? Did John meet Mary or was Mary met by John? An important component of language understanding is recognizing the implications of the nuances in the speaker's or writer's choices. I will describe our research on computational aspects of linguistic nuance, focusing on the differentiation of near-synonyms and on the consequences that arise for knowledge representation formalisms. In addition, I will discuss how contemporary views of meaning in computational linguistics need to be broadened to take into account the choices that the speaker or writer makes.

Accessing web pages from a mobile device is becoming very valuable, especially for people constantly on the move. However, the small screen, limited memory, and the slow wireless connection make the surfing experience on mobile devices unacceptable to most people. In this thesis, we aim to solve three fundamental challenges in the mobile Internet: web page content ranking, web content classification, and web article summarization. We propose a new method to solve these three fundamental challenges. As a web page is too complex to analyze as a whole, we first divide the entire web page into basic elements such as text blocks, pictures, etc. Next, based on the relationships between the elements, we connect the elements with edges to form a graph. Finally, we use random walk methods to provide solutions to the three challenges. The main contribution of this thesis is a graph and random walk based framework for Internet information processing. It is shown to be very simple and effective. For example, our experiments on web page ranking show that, for randomly selected websites, the system need only deliver 39% of the objects in a web page in order to fulfill 85% of a viewer's desired viewing content. In the experiments on web content classification, the system generates good performance, with F values for main content and advertisement as high as 0.93 and 0.82 respectively. In the experiments on text summarization, using the well-accepted dataset for single document summarization, the graph and random walk based text summarization system outperformed the results of all participants of the conference.

Word Sense Disambiguation (WSD) is a problem in Natural Language Processing concerned with identifying the correct meaning of a word used in a given context. Over time, supervised machine learning has consistently shown better performance in WSD compared to unsupervised learning. However, the supervised approach to WSD faces the serious problem of the knowledge acquisition bottleneck: the difficulty of acquiring enough labeled training data for learning classifiers. This problem is exacerbated by several facts, including the large number of fine-grained senses in contemporary lexicons, the need for training data for each individual polysemous word, and the high cost of manually sense-labeling training examples. Our research focuses on a workaround to this problem that exploits the usage similarities of different words. We propose using a generalized, coarse-grained set of senses at the classifier level, and then using lexicon-induced heuristics to convert the resulting classes into fine-grained senses. The generic nature of the sense classes allows labeled training examples from different words to be used for learning the classes, effectively increasing the amount of available training data. We discuss how the noise due to generalization can be reduced by using a semantic similarity based weighting strategy, and show, using WordNet lexicographer files as generic classes, that this approach can yield state-of-the-art WSD performance with sparse training data. Further, we argue that human-created, taxonomy-based class schemas such as the WordNet lexicographer files are not ideal for supervised learning, as they are not necessarily coherent with the contextual usage patterns that are available to the classifier as features. In addition, they have undesirable properties that result in high losses during the class to fine-grained sense conversion.
We propose using clustering techniques to automatically create generic sense classes aimed at better performance of WSD as an end-task, and show that such classes can improve WSD performance over manually created classes.

2006

In this talk, I will present a simple yet novel method of exploiting unlabeled text to further improve the accuracy of a high-performance state-of-the-art named entity recognition (NER) system. The method utilizes the empirical property that many named entities occur in one name class only. Using only unlabeled text as the additional resource, our improved NER system achieves an F1 score of 87.13%, an improvement of 1.17% in F1 score and an 8.3% error reduction on the CoNLL 2003 English NER official test set. This accuracy places our NER system among the top 3 systems in the CoNLL 2003 English shared task. This work was done jointly with Wong Yingchuan.

When the names of people are used as unique identifiers, problems often arise: different people may share the same name spelling, or one person may use several name spellings. As searching by a person's name is one of the most common query types in Digital Libraries and the WWW (about 30%), it becomes increasingly important to have clean name data in such systems. In this talk, I will first present various types of ambiguous names drawn from real Digital Libraries. Then, I will discuss various approaches for identifying and fixing such ambiguous names: syntactic, semantic, and Google-based approaches.

This talk borrows material from my recent work in IQIS'05, JCDL'06, ICDM'06, and ICDE'07, which is the result of joint work with several students and collaborators.

Spoken Document Retrieval involves finding, from within a collection of spoken documents (e.g. voice mails, news broadcasts), the documents which satisfy a given information need. One way to represent a spoken document for this task is the lattice, a directed acyclic graph in which each path corresponds to a hypothesis of the words spoken in the document. In this talk I present a method for using word statistics derived from lattices in a probabilistic retrieval algorithm to perform spoken document retrieval. Results comparing the performance of this approach with using only the 1-best speech recognizer transcription are also presented.

I will present the architecture of an XML-based Chinese processing platform for web applications, named the Language Technology Platform (LTP). It has five main components: a suite of DLL modules for the DOM tree, the Language Technology Markup Language (LTML), a suite of visualization tools, language corpora based on LTML, and a web service for LTP. The current LTP integrates ten key Chinese processing modules covering morphology, word sense, syntax and document analysis. A suite of tools is supplied for beginners in natural language processing and information retrieval, with which they can study the relationships between all levels as well as some advanced topics. Currently, the platform has been shared with more than 60 research labs around the world. The second topic of my talk is WSD. I will present a new approach based on Equivalent Pseudowords (EPs) to tackle Word Sense Disambiguation (WSD) in the Chinese language. EPs are particular artificial ambiguous words which can be used to realize unsupervised WSD. A Bayesian classifier is implemented to test the efficacy of the EP solution on the Senseval-3 Chinese test set. The performance is better than state-of-the-art results, with an average F-measure of 0.80. The experiment verifies the value of EPs for unsupervised WSD.

Web 2.0 is the latest trend in the World Wide Web. In the first part of my seminar, I shall review the social characteristics of this paradigm and how suitable it is for the Asian community. In the second part, I shall focus on a particular means of communication on Web 2.0, namely chatting, e.g. via ICQ, chat rooms, etc. A unique dialect is commonly used for chatting, which I refer to as the Chat Language (CL). CL is different from natural languages due to its anomalous and dynamic nature. These properties render conventional NLP tools inapplicable for analyzing CL, and the language changes so frequently that contemporary chat language corpora quickly become out-dated. To address this dynamic language problem in Chinese, we propose a phonetic language model to map between chat terms and standard words via phonetic transcription, i.e. Chinese Pinyin in our case. Different from grapheme mapping, phonetic mapping can be constructed from an available standard Chinese corpus. For term normalization, i.e. translating a chat term to its natural language counterpart, we extend the source channel model by incorporating the phonetic mapping model. Experimental results show that this method is effective and robust.

Synchronous grammars are rapidly gaining importance for modeling machine translation and other complex language transformations. It has therefore become useful to understand their basic formal properties. Many advances in NLP in the 1990s exploited basic algorithms for probabilistic finite-state transducers, whose theory is well understood and widely taught. The analogous theory for trees is less widely known but well developed, with roots going back to the 1960s. In this tutorial, we aim to (1) cover the literature of synchronous grammars, (2) describe how they relate to current NLP applications, such as machine translation, and (3) discuss some new theoretical and algorithmic problems raised by these applications, and some recent solutions.

If we define a QA system as a system which takes a natural-language question, searches a text corpus and returns a ranked list of answers, then we can broadly discern two ways in which accuracy can be increased: intrinsically, by generating better candidate lists (by e.g. more accurate entity recognition, deeper parsing, better pattern-matching and/or more judicious choice of keywords in search), or extrinsically, by re-evaluating and re-shaping such answer lists by reference to other QA methods or other data sources. This talk is about approaches of each kind that we are using at IBM Research to improve the accuracy of our QA system. I will first describe the semantic information we build into the search-engine index from running text analytics on the corpus. In addition to text tokens, we index types, typed tokens and relations. I will present the results of several evaluations demonstrating how such "Semantic Search" can increase precision.

As far as extrinsic methods go, leading QA systems employ a variety of means to boost accuracy. Such methods include redundancy (getting the same answer from multiple documents/sources), inferencing (proving the answer from information in texts plus background knowledge) and sanity-checking (verifying that answers are consistent with known facts). To our knowledge, however, no other QA system deliberately asks additional questions in order to derive constraints on the answers to the original questions. We present two variations on this idea. The first is the method of QA-by-Dossier-with-Constraints (QDC), which is an extension of the simpler method of QA-by-Dossier, in which definitional questions ("Who/what is X?") are addressed by asking a set of questions about anticipated properties of X. In QDC, the collection of Dossier candidate answers is required to satisfy a set of naturally-arising constraints. For example, for a "Who is X?" question, the system will ask about birth, accomplishment and death dates, which, if they exist, must occur in that order, and also obey other constraints such as lifespan. Temporal, spatial and kinship relationships seem to be particularly amenable to this treatment.

The second variation takes an arbitrary question and "inverts" it, using top candidate answers. By requiring that the answers to the inverted questions be consistent with entities in the original questions, we demonstrate that precision is increased. Finally, we show how the use of constraints provides a natural way to assert "no answer", when that condition is true.

In text categorization, term weighting methods assign appropriate weights to terms to improve classification performance. In this study, we propose an effective term weighting scheme, tf.rf, and investigate several widely-used unsupervised and supervised term weighting methods on two popular data collections, Reuters-21578 and 20 Newsgroups, in combination with SVM and kNN algorithms. Our controlled experimental results show that not all supervised term weighting methods have a consistent superiority over unsupervised ones. Specifically, the three supervised methods based on information theory, i.e. tf.chi2 (chi-square), tf.ig (information gain) and tf.or (odds ratio), perform rather poorly in all experiments. On the other hand, our proposed tf.rf achieves the best performance consistently and outperforms other methods substantially and significantly. The widely-used tf.idf method does not show uniformly good performance across data corpora.
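A rough sketch of the proposed scheme, using the published definition of rf (relevance frequency), where a is the number of positive-category documents containing the term and c the number of negative-category documents containing it:

```python
import math

def tf_rf(tf, pos_docs_with_term, neg_docs_with_term):
    """tf.rf weight: term frequency scaled by relevance frequency,
    rf = log2(2 + a / max(1, c)).  A term concentrated in the positive
    category gets a boost; a term spread evenly does not."""
    rf = math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))
    return tf * rf

# A discriminative term outweighs a non-discriminative one at equal tf.
print(tf_rf(3, 90, 10))   # term mostly in positive documents
print(tf_rf(3, 50, 50))   # term spread evenly across categories
```

Unlike idf, the rf factor uses category membership, so two terms with the same document frequency can receive very different weights depending on how they distribute across the positive and negative classes.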

Solving computationally hard problems, such as those commonly encountered in natural language processing and computational biology, often requires that approximate search methods be used to produce a structured output (e.g., machine translation, speech recognition, protein folding). Unfortunately, this fact is rarely taken into account when machine learning methods are conceived and employed. This leads to complex algorithms with few theoretical guarantees about performance on unseen test data.

I present a machine learning approach that directly solves "structured prediction" problems by considering formal techniques that reduce structured prediction to simple binary classification, within the context of search. This reduction is error-limiting: it provides theoretical guarantees about the performance of the structured prediction model on unseen test data. It also lends itself to novel training methods for structured prediction models, yielding efficient learning algorithms that perform well in practice. I empirically evaluate this approach in the context of two tasks: entity detection and tracking and automatic document summarization.

Shortage of manually labeled data is an obstacle to supervised relation extraction methods. In this paper we investigate a graph based semi-supervised learning algorithm, a label propagation (LP) algorithm, for relation extraction. It represents labeled and unlabeled examples and their distances as the nodes and the weights of edges of a graph, and tries to obtain a labeling function that satisfies two constraints: 1) it should be fixed on the labeled nodes, 2) it should be smooth on the whole graph. Experimental results on the ACE corpus show that this LP algorithm achieves better performance than SVM when only very few labeled examples are available, and it also performs better than bootstrapping for the relation extraction task.
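The two constraints can be seen in a minimal sketch of label propagation on a toy graph, using the clamped-iteration form (the paper's exact graph construction and convergence criteria may differ):

```python
import numpy as np

def label_propagation(W, Y0, labeled, iters=100):
    """Propagate labels over a weighted graph.
    W: (n, n) symmetric edge weights; Y0: (n, k) one-hot rows for labeled
    nodes, zeros elsewhere; labeled: boolean mask of labeled nodes.
    Labeled rows are clamped after every step, so the solution stays
    fixed on labeled nodes (constraint 1) and smooth elsewhere (constraint 2)."""
    T = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    Y = Y0.copy()
    for _ in range(iters):
        Y = T @ Y                          # diffuse label mass over edges
        Y[labeled] = Y0[labeled]           # clamp the labeled examples
    return Y.argmax(axis=1)

# Two clusters on a line; one labeled point per cluster.
X = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
W = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1)
np.fill_diagonal(W, 0.0)
Y0 = np.zeros((6, 2))
Y0[0, 0] = 1.0   # leftmost point labeled class 0
Y0[5, 1] = 1.0   # rightmost point labeled class 1
mask = np.array([True, False, False, False, False, True])
print(label_propagation(W, Y0, mask))
```

With only two labeled points, the cluster structure of the graph carries the labels to the four unlabeled points, which is exactly the regime (very few labeled examples) where the abstract reports LP beating SVM.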

We propose a supervised, two-phase framework to address the problem of paraphrase recognition (PR). Unlike most PR systems that focus on sentence similarity, our framework detects dissimilarities between sentences and makes its paraphrase judgment based on the significance of such dissimilarities. The ability to differentiate significant dissimilarities not only reveals what makes two sentences a non-paraphrase, but also helps to recall additional paraphrases that contain extra but insignificant information. Experimental results show that while being accurate at discerning non-paraphrasing dissimilarities, our implemented system is able to achieve higher paraphrase recall (93%), at an overall performance comparable to the alternatives.

In my talk, I will start by presenting some general considerations about discourse and what is expected from a theory of discourse. Then, I will introduce the veins theory (VT), formulating first intuitions, then giving some definitions and stating the theory's claims with respect to the cohesion and coherence of the discourse. I will then show to what degree the predictions made by the theory are proved experimentally. In the third part I will present how VT helps to do discourse parsing and focused summarisation. Finally, I will show why we need a graph representation instead of a tree to represent certain types of discourse.

Instances of a word drawn from different domains may have different sense priors (the proportions of the different senses of a word). This in turn affects the accuracy of word sense disambiguation (WSD) systems trained and applied on different domains. This paper presents a method to estimate the sense priors of words drawn from a new domain, and highlights the importance of using well calibrated probabilities when performing these estimations. By using well calibrated probabilities, we are able to estimate the sense priors effectively to achieve significant improvements in WSD accuracy.

In this talk, I will introduce two statistical learning methods for information retrieval. One is about "expert search", and the other "learning to rank". Expert search is a search task where the user types a query representing a topic and the search system returns a ranked list of people who are considered experts on the topic. Previous studies employed profile-based methods, where the expert ranking is based only on co-occurrence between people and terms in documents. We propose a new approach capable of employing many types of association relationships among query terms, documents and people (experts). These include relevance between query terms and documents, co-occurrence between people and terms in documents, co-occurrence between people and terms in authors and title fields, and co-occurrence between people and people. We employ a new statistical model, referred to as the two-stage expert search model, to combine all the association information in a unified and theoretically sound way. We used the data in TREC 2005 expert search task and the data from an industrial research lab to verify the effectiveness of our proposal. Our experimental results show that the two-stage model can significantly outperform the profile-based method.

Learning to rank is an important topic in document retrieval. One approach to the task is to formalize the problem as ordinal regression. "Ranking SVM" is such a method. We point out that there are two factors one must consider when applying ordinal regression to document retrieval. First, correctly ranking documents on the top is crucial for an IR system. One must conduct training in a way such that the top ranked results are very accurate. Second, the numbers of relevant documents can vary from query to query. One must avoid training a model biased toward queries with many relevant documents. Previously, when existing methods including Ranking SVM were applied to document retrieval, neither of the two factors was taken into consideration. In our work, we demonstrate that it is possible to define a new loss function for document retrieval. The loss function is a natural extension of the conventional "Hinge Loss" used in Ranking SVM. With the new loss function, we can overcome the drawbacks which plague Ranking SVM. We employ two optimization methods to minimize the loss function: gradient descent and quadratic programming. Experimental results show that our method can outperform Ranking SVM and other existing methods for document retrieval in two data sets.
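The shape of such a loss can be sketched as follows. The names `rank_weight` and `query_weight` are illustrative stand-ins for the two correction factors described above (stressing top-rank pairs, and normalizing per query), not the paper's exact definitions:

```python
def weighted_pairwise_hinge(scores, labels, rank_weight, query_weight):
    """A standard Ranking SVM pays max(0, 1 - (s_i - s_j)) for every pair
    where document i is more relevant than document j.  Here each pair is
    additionally scaled by (a) a rank-position weight that emphasizes pairs
    involving the top relevance grades, and (b) a per-query weight that
    keeps queries with many relevant documents from dominating training."""
    loss = 0.0
    for si, yi in zip(scores, labels):
        for sj, yj in zip(scores, labels):
            if yi > yj:  # document i should rank above document j
                loss += (rank_weight(yi, yj) * query_weight
                         * max(0.0, 1.0 - (si - sj)))
    return loss

flat = lambda a, b: 1.0   # plain hinge when both weights are constant
print(weighted_pairwise_hinge([2.0, 0.0], [1, 0], flat, 1.0))  # correct order
print(weighted_pairwise_hinge([0.0, 2.0], [1, 0], flat, 1.0))  # misranked pair
```

Setting both weights to constants recovers the conventional pairwise hinge loss, which makes clear that the proposal is an extension rather than a replacement of Ranking SVM's objective.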

The introduction of data-driven methods into machine translation (MT) in the 1990s created a whole new way of doing MT, and the recent move from the word-based models developed at IBM to the phrase-based models developed by Och and others has led to a breakthrough in MT performance. The next breakthrough, the move to syntax-based models that deal with the hierarchical, meaning-bearing, structures of sentences, is still waiting to happen. Several approaches have been tried, but none yet have been able to outperform phrase-based models in large-scale evaluations. Hiero is a first step towards that breakthrough. It deals with hierarchical structures, similarly to syntax-based models, but also draws on ideas from phrase-based translation, including the ability to be trained from parallel bilingual text without any syntactic annotation, manual or automatic. In the recent NIST MT Evaluation, it outperformed several state-of-the-art systems, both phrase-based and syntax-based, on both Chinese-English and Arabic-English translation. I will present Hiero's underlying model, its implementation, and experimental results, including some recent investigations into how syntactic information does and does not improve translation quality.

In recent years, the output of Statistical Machine Translation (SMT) systems has achieved quality comparable to, and sometimes better than, the output of traditional rule-based systems. In this talk, we will discuss SMT as an alignment model and organize the talk along the two recurring themes in SMT: the lexical mapping and the reordering mechanism. Lexical mapping is responsible for maximizing the adequacy of the translation output, while the reordering mechanism is responsible for maximizing its fluency. The talk will give an overview of previous work, from the word-based and phrase-based approaches to the recent hierarchical phrase-based approach, and propose a novel constituency-based approach. We will show how we extend the state-of-the-art phrase-based approach to include a notion of discontinuous constituency and how it helps to improve output quality. We will also outline the challenges in incorporating discontinuous constituency into an SMT alignment model.

This talk gives an overview of the history, recent advances and current trends of natural language processing (NLP). The speaker outlines the key achievements and identifies the key problems and challenges of NLP research in China. The emphasis is then on future directions, especially the research and development program of NLP, Chinese Information Processing and other language-related research and development supported by NSFC in China's 11th five-year plan period. The talk also covers the research effort and progress of the MOE-MS (Ministry of Education and Microsoft) Language and Speech Lab over the past 10 years, including machine translation, information retrieval, integrated information processing and a preliminary study on bioinformatics. The speaker addresses all the above topics from industry, education and research viewpoints.

Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between two proteins can reveal what similarities, divergences, and relations exist between them, leading to invaluable insights for a better understanding of the proteins. Such contrastive information is reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work to systematically extract and present such useful contrastive information from the literature for knowledge discovery.

We have developed a BioContrasts system that extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web application for knowledge discovery. Contrastive information is identified in the text abstracts with contrastive negation patterns such as ``A but not B''. A total of 799,169 pairs of contrastive expressions were successfully extracted from a 2.5 million-abstract corpus. By grounding contrastive protein names to Swiss-Prot entries, we were able to produce 41,471 contrasts between Swiss-Prot protein entries. This contrastive information is then presented via a user-friendly interactive web portal for knowledge discovery tasks, such as refining the functional roles of similar proteins in biological pathways.

The growing volume and variety of text has demanded techniques involving semantics for effective information retrieval and extraction. Although many useful syntactic and semantic structures are available in various domains, approaches utilizing structures face common critical problems such as paraphrasing and alignment. In this work, we propose a relation-based model, ARE (Anchor and Relation), to tackle these problems. ARE performs IE by utilizing anchor extraction and optimal relation path matching between anchors. In particular, we devise three strategies to combine anchor features and dependency relations to perform target extraction in IE. These include: relation path scoring, anchor scoring and their combination. Experiments conducted on two IE domains (MUC3-4 and MUC6) demonstrate improvement over previous approaches.

Interpolated Kneser-Ney is one of the best smoothing methods for n-gram models. It involves three ideas: to use absolute discounting of counts, to interpolate between high and low order n-grams, and to use the number of contexts rather than counts to estimate lower order n-grams. We will review previous work on explaining interpolated Kneser-Ney and propose a novel Bayesian interpretation. We will show that Interpolated Kneser-Ney can be recovered exactly as approximate inference in a hierarchical Bayesian model. We describe a number of properties of this interpretation and show how we can extend Interpolated Kneser-Ney in novel ways based on this new view.
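The three ideas above can be seen in a textbook-style sketch of interpolated Kneser-Ney for bigrams (absolute discounting, interpolation with a lower-order model, and continuation counts in place of raw counts):

```python
from collections import Counter

def kneser_ney_bigram(corpus, discount=0.75):
    """Return p(w | u) under interpolated Kneser-Ney for bigrams."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    # Continuation counts: in how many distinct contexts does w appear?
    contexts_of = Counter(w for (_, w) in bigrams)
    total_bigram_types = len(bigrams)
    followers = Counter(u for (u, _) in bigrams)  # distinct continuations of u

    def p(w, u):
        # Lower-order model: continuation probability, not raw frequency.
        cont = contexts_of[w] / total_bigram_types
        if unigrams[u] == 0:
            return cont
        # Absolute discounting of the bigram count...
        discounted = max(bigrams[(u, w)] - discount, 0.0) / unigrams[u]
        # ...with the discounted mass redistributed via interpolation.
        lam = discount * followers[u] / unigrams[u]
        return discounted + lam * cont
    return p

p = kneser_ney_bigram("a b a b a c".split())
print(p("b", "a"), p("c", "a"))
```

Because the discounted mass is exactly the interpolation weight, the conditional distribution p(· | u) still sums to one; the hierarchical Bayesian view discussed in the talk recovers this same estimator as approximate posterior inference.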

We propose a document re-ranking method for Chinese information retrieval in which a query is a short natural language description. The method is based on term distribution: each term is weighted by its local and global distribution, including document frequency, document position and term length. The weighting scheme alleviates the concern that very few relevant documents appear among the top retrieved documents, and allows a larger portion of the retrieved documents to be randomly set as relevance feedback. It also helps to improve the performance of the MMR model in document re-ranking. Experiments show our method achieves significant improvements over standard baselines, and consistently outperforms related methods.

Talk 2

"Supervised Categorization of Javascript using Program Analysis Features"
by
Wei Lu
(NUS, SMA)

Web pages often embed scripts for a variety of purposes, including advertising and dynamic interaction. Understanding embedded scripts and their purpose can often help to interpret or provide crucial information about the web page. We have developed a functionality-based categorization of JavaScript, the most widely used web page scripting language. We then view understanding embedded scripts as a text categorization problem. We show how traditional information retrieval methods can be augmented with the features distilled from the domain knowledge of JavaScript and software analysis to improve classification performance. We perform experiments on the standard WT10G web page corpus, and show that our techniques eliminate over 50% of errors over a standard text classification baseline.

Our aim is to explore a translation-free technique for multilingual information retrieval. This technique is based on an ontological representation of documents and queries. We use a multilingual ontology for documents/queries representation. For each language, we use the multilingual ontology to map a term to its corresponding concept. The same mapping is applied to each document and each query. Then, we use a classic vector space model for the indexing and the querying. The main advantages of our approach are: no merging phase is required, no dependency on automatic translators between all pairs of languages exists, and adding a new language only requires a new mapping dictionary to the multilingual ontology.

Scaling Up Word Sense Disambiguation via Parallel Texts

State-of-the-art question answering (QA) systems employ term-density ranking to retrieve answer passages. Such methods often retrieve incorrect passages as relationships among question terms are not considered. Previous work attempted to address this problem by matching dependency relations between questions and answers, using strict matching, which fails when semantically equivalent relationships are phrased differently. We propose fuzzy relation matching based on statistical models. We present two methods based on mutual information and expectation maximization to learn relation mapping scores from past QA pairs. Experimental results show that our method significantly outperforms state-of-the-art density-based passage retrieval methods by up to 78% in mean reciprocal rank. Integrated with query expansion, relation matching also brings about a further 50% improvement.

This paper explores probabilistic lexico-syntactic pattern matching, also known as soft pattern matching. While previous methods in soft pattern matching are ad-hoc in computing the degree of match, we propose two formal matching models based on bigrams and on Profile Hidden Markov Models. Both models provide a theoretically sound method to model pattern matching as a probabilistic generation process that generates token sequences. We demonstrate the effectiveness of these models on definition sentence retrieval for definitional question answering. We show that both models significantly outperform the state-of-the-art manually constructed patterns. A critical difference between the two models is that the PHMM technique handles language variations more effectively but requires more training data to converge. We believe that both models can be extended to other areas to which lexico-syntactic pattern matching can be applied.
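The bigram variant can be sketched as follows. The probability tables here are toy values, and a real model would be estimated from training pattern instances (the PHMM variant additionally models match/insert/delete states):

```python
def soft_pattern_score(tokens, bigram_prob, unigram_prob, alpha=0.5):
    """Score a token window against a soft pattern with a bigram model.
    Instead of a hard regex match, the pattern assigns each candidate a
    generation probability, interpolating bigram and unigram estimates so
    unseen transitions degrade the score gracefully instead of zeroing it."""
    score = unigram_prob.get(tokens[0], 1e-6)
    for prev, cur in zip(tokens, tokens[1:]):
        score *= (alpha * bigram_prob.get((prev, cur), 0.0)
                  + (1 - alpha) * unigram_prob.get(cur, 1e-6))
    return score

# Toy definition pattern around slot tokens <TERM> and <NP>.
bigram_prob = {("<TERM>", "is"): 0.8, ("is", "a"): 0.7, ("a", "<NP>"): 0.6}
unigram_prob = {"<TERM>": 0.2, "is": 0.3, "a": 0.3, "<NP>": 0.2}
print(soft_pattern_score(["<TERM>", "is", "a", "<NP>"], bigram_prob, unigram_prob))
print(soft_pattern_score(["is", "<TERM>", "a", "<NP>"], bigram_prob, unigram_prob))
```

A window that follows the learned token order ("X is a Y") scores higher than a permuted one, which is how ranking definition sentences by generation probability replaces a binary pattern match.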

A critical problem faced by current supervised WSD systems is the lack of manually annotated training data. Tackling this data acquisition bottleneck is crucial, in order to build high-accuracy and wide-coverage WSD systems. In this paper, we show that the approach of automatically gathering training examples from parallel texts is scalable to a large set of nouns. We conducted evaluation on the nouns of SENSEVAL-2 English-all-words task, using fine-grained sense scoring. Our evaluation shows that training on examples gathered from 680MB of parallel texts achieves accuracy comparable to the best system of SENSEVAL-2 English-all-words task, and significantly outperforms the baseline of always choosing sense 1 of WordNet.

This paper describes our research on automatic semantic argument classification, using the PropBank. Previous research employed features that were based either on a full parse or shallow parse of a sentence. These features were mostly based on an individual semantic argument and the relation between the predicate and a semantic argument, but they did not capture the interdependence among all arguments of a predicate. In this paper, we propose the use of the neighboring semantic arguments as additional features in determining the class of the current semantic argument. Our experimental results show significant improvement in the accuracy of semantic argument classification after exploiting argument interdependence. Argument classification accuracy on the standard test set improves to 90.50%, representing a relative error reduction of 18%.

A word sense disambiguation (WSD) system trained on one domain and applied to a different domain will show a decrease in performance. One major reason is the difference in sense distributions between domains. We present a novel application of two distribution estimation algorithms to provide estimates of the sense distribution of the new domain data set. Even though our training examples are automatically gathered from parallel corpora, the sense distributions estimated are good enough to achieve a relative improvement of 56% when incorporated into our WSD system.
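One standard member of this family of distribution estimation algorithms is EM re-estimation of class priors from a calibrated classifier's outputs (in the style of Saerens, Latinne and Decaestecker). The sketch below illustrates that general procedure on invented data, not necessarily the exact variant used in this work:

```python
import numpy as np

def estimate_priors(posteriors, train_priors, iters=50):
    """EM re-estimation of class priors on a new domain.
    posteriors: (n, k) calibrated p_train(class | x) from the old-domain
    model on new-domain examples.  Each E-step rescales the posteriors by
    the ratio of current to training priors; the M-step averages the
    adjusted posteriors into an updated prior estimate."""
    priors = train_priors.copy()
    for _ in range(iters):
        adjusted = posteriors * (priors / train_priors)   # E-step: rescale
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        priors = adjusted.mean(axis=0)                    # M-step: average
    return priors

# Toy target domain skewed toward sense 0 (an 80/20 mix of posteriors),
# while the classifier was trained with balanced priors.
posteriors = np.vstack([np.tile([0.9, 0.1], (80, 1)),
                        np.tile([0.2, 0.8], (20, 1))])
est = estimate_priors(posteriors, np.array([0.5, 0.5]))
print(est)
```

The estimated priors shift toward the majority sense of the new domain even though no new-domain labels are used, which is why calibration of the posteriors matters: the rescaling step is only meaningful if the probabilities are well calibrated.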

The talk will present initial work at NTU School of Communication & Information to develop an automatic classifier for sentiment classification of text -- automatically classifying documents as expressing positive or negative sentiments/opinions. Classifiers were developed using Support Vector Machines based on various text features to classify product reviews into recommended (positive sentiment) and not recommended (negative sentiment). Compared with traditional topic classification, it was hypothesized that syntactic and semantic processing of text would be more important for non-topical classification such as sentiment classification. In the study, different text features -- unigrams (individual words), words belonging to selected grammatical categories (verbs, adjectives, and adverbs), words labeled with part-of-speech tags, and negation phrases -- were investigated. A prototype search agent was developed to apply sentiment classification to Web pages retrieved from Web search engines. To improve the effectiveness of the system and make it more portable to other domains, a general list of emotion words was compiled, with an indication of their affective orientation (positive or negative) and the intensity of the sentiment. Three assessors coded more than 3000 emotion words taken from Roget's Thesaurus. The work is being extended to other types of sentiments. Investigators in the project are Dr Na Jin Cheon, Dr Paul Wu, Dr Diderich Joachim, and Dr Chris Khoo.

The multinomial manifold is the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information metric. It makes more sense to view documents as points on the multinomial manifold, rather than in the much larger ambient Euclidean space. I will show that Support Vector Machines can achieve better text classification performance using kernels on the multinomial manifold instead of standard kernels.
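A sketch of one such kernel, assuming the negative-geodesic-distance construction (documents represented as L1-normalized term distributions; `gamma` is a free bandwidth parameter):

```python
import math

def geodesic_kernel(p, q, gamma=1.0):
    """Kernel from geodesic distance on the multinomial manifold.
    Under the Fisher information metric, the geodesic distance between two
    multinomials is d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)); one simple
    kernel is exp(-gamma * d), used here in place of a Euclidean RBF kernel."""
    cos_sim = sum(math.sqrt(a * b) for a, b in zip(p, q))
    d = 2.0 * math.acos(min(1.0, cos_sim))   # clamp for numerical safety
    return math.exp(-gamma * d)

# Identical distributions are maximally similar; disjoint ones are not.
print(geodesic_kernel([0.5, 0.5], [0.5, 0.5]))
print(geodesic_kernel([1.0, 0.0], [0.0, 1.0]))
```

The key contrast with a Euclidean kernel is that distances are measured along the probability simplex, so two documents differing only in length normalization are treated as identical points on the manifold.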

G-Portal is a Web-based digital library that manages metadata of geography related resources on the Web and provides digital library services to access them. It adopts a map-based interface as its primary point of access to visualize and manipulate the distributed geography content. A classification-based interface is also provided to classify and visualize all resources. In this talk, we will describe the design principles of G-Portal and its main functions. The latest extensions to integrate G-Portal functions with learning will also be presented.

Word Sense Disambiguation suffers from the long-standing problem of the knowledge acquisition bottleneck. Although state-of-the-art supervised systems report good accuracies for selected words, they have not been shown to be promising in terms of scalability. In this paper, we present an approach for learning a coarser and more general set of concepts from a sense-tagged corpus in order to alleviate the knowledge acquisition bottleneck. We show that these general concepts can be transformed to fine-grained word senses using simple heuristics, and applying the technique to recent Senseval data sets shows that our approach can yield state-of-the-art performance.

Shortage of labeled data is an obstacle to supervised learning methods for word sense disambiguation (WSD). In this paper we investigate a label propagation based semi-supervised learning algorithm for WSD, which can effectively combine unlabeled data with labeled data in the learning process by constructing a connected graph to represent the manifold structure (or cluster structure) of the data, propagating label information from labeled examples to unlabeled examples through weighted edges, and finally inferring the labels of unlabeled examples after the propagation process converges. We evaluated this label propagation algorithm on the interest, line and Senseval-3 corpora. It consistently outperforms SVM on the Senseval-3 corpus when only very few labeled examples are available, and its performance on the interest and line corpora is also better than monolingual bootstrapping and comparable to bilingual bootstrapping.

This paper presents an unsupervised method to discover the significant relations embedded between named entities in documents, which is capable of discriminating relation types and assigning labels to them automatically. With the context vectors for all entity pairs acquired in advance, relation type discrimination is achieved by model order selection, which estimates the "natural" number of relation types so that relations can be interpreted as groups (or clusters) of similar context vectors of entity pairs. For relation labelling, the discriminative strategies that we use enable us to identify a distinctive label for each relation type. Experimental results on the ACE corpus show that our algorithm can estimate the model order (cluster number) and identify discriminative features for labelling each relation type.

2004

Coreferential information of a candidate, such as the properties of its antecedents, is important for pronoun resolution because it reflects the salience of the candidate in the local discourse. Such information, however, is usually ignored in previous learning-based systems. In this paper we present a trainable model which incorporates coreferential information of candidates into pronoun resolution. Preliminary experiments show that our model will boost the resolution performance given the right antecedents of the candidates. We further discuss how to apply our model in real resolution where the antecedents of the candidate are found by a separate noun phrase resolution module. The experimental results show that our model still achieves better performance than the baseline.

We present a framework and a system that extracts events relevant to a query from a collection C of documents, and places such events along a timeline. Each event is represented by a sentence extracted from C, based on the assumption that "important" events are widely cited in many documents for a period of time within which these events are of interest. Evaluation was performed using G8 leader names as queries: comparison made by human evaluators between manually and system generated timelines showed that although manually generated timelines are on average more preferable, system generated timelines are sometimes judged to be better than manually constructed ones. This paper was presented at SIGIR 2004.

Term weighting, which converts documents into vectors in the term space, is a vital step in automatic text categorization. In the text categorization field, support vector machines (SVMs) have been shown to perform better than other traditional machine learning algorithms. Previous studies show that term weighting schemes, rather than the kernel functions of SVMs, dominate performance when SVMs are applied to text categorization tasks. In this paper, we conducted comprehensive experiments comparing various term weighting schemes with SVM on two widely-used benchmark data sets. Based on an analysis of discriminating power, we also present a new term weighting scheme, tf.rf, to improve discriminating power. Finally, a cross-scheme comparison is performed using statistical significance tests (McNemar's test) based on the micro-averaged break-even point. The controlled experimental results show that the newly proposed tf.rf scheme is significantly better than other widely-used term weighting schemes on the two data sets, which have different category distributions. The idf factor does not improve, and may even decrease, a term's discriminating power for text categorization. Schemes based on the tf (term frequency) factor alone show good performance. The binary and tf.chi2 representations significantly underperform the other term weighting schemes.

Structured objects, such as entities about product and service specifications, member listings, and facility introductions, are among the most valuable information on the Web. With our available techniques, we can automatically extract the related object data and the corresponding object models from published Web sites. However, these object models might not be compatible, since their underlying backend databases use different schemas and the datasets on the Web vary as the original data meet different business logics. Our research attempts to disambiguate the variations and differences among different object models of the same kind of entity (for example, model a about notebooks at site A and model b about notebooks at site B) and to construct a universal ontology (about notebooks) based on these models. The variations and differences lie in structure, size and vocabulary. We use available language ontologies and domain ontologies to explore the attributes in the object descriptions based on lattice theory, and build a general model (ontology) that covers the detailed models and serves as a bridge among the different layers in the special-general problem space. This becomes critical when users perform calculations between objects from different sites (e.g. comparing the performance and price of two products).

We will present work in progress on two separate systems that aim to classify web page elements. We will first give the background to the problem and discuss work based on Brin and Page's random walk algorithm as applied to web page element classification. Then we will discuss another approach that uses separate lexical and stylistic views of learning, integrated under the co-training framework; this system is open source and heading towards public release and deployment in other projects.
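The co-training setup described above can be sketched as follows: two classifiers, each trained on one feature view (lexical vs. stylistic), take turns confidently labeling unlabeled examples for each other. The toy one-dimensional classifier and the concrete data are illustrative assumptions, not the talk's system.

```python
class MeanThresholdClassifier:
    """Toy 1-D classifier: threshold at the midpoint of the class means."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def predict(self, x):
        return 1 if x > self.t else 0
    def confidence(self, x):
        return abs(x - self.t)

def co_train(c1, c2, labeled, unlabeled, rounds=5, k=2):
    """Co-training loop in the style of Blum and Mitchell (1998).

    labeled:   list of ((view1, view2), label) pairs
    unlabeled: list of (view1, view2) pairs
    Each round, both views are retrained, then the k most confidently
    classified unlabeled examples are pseudo-labeled and added."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        c1.fit([v1 for (v1, _), _ in labeled], [y for _, y in labeled])
        c2.fit([v2 for (_, v2), _ in labeled], [y for _, y in labeled])
        if not pool:
            break
        # Most confident examples first, judged by the surer view.
        pool.sort(key=lambda v: max(c1.confidence(v[0]), c2.confidence(v[1])),
                  reverse=True)
        for v in pool[:k]:
            y = (c1.predict(v[0]) if c1.confidence(v[0]) >= c2.confidence(v[1])
                 else c2.predict(v[1]))
            labeled.append((v, y))
        pool = pool[k:]
    return c1, c2

# Hypothetical data: both views are correlated with the true label.
labeled = [((0.9, 0.8), 1), ((0.1, 0.2), 0), ((0.8, 0.9), 1), ((0.2, 0.1), 0)]
unlabeled = [(0.95, 0.85), (0.05, 0.15), (0.7, 0.75), (0.3, 0.25)]
c1, c2 = co_train(MeanThresholdClassifier(), MeanThresholdClassifier(),
                  labeled, unlabeled)
```

The point of the two-view split is that each classifier's confident mistakes look like noise to the other view, which is what makes bootstrapping from unlabeled pages plausible.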

Text processing can be roughly divided into two levels, namely language processing and data-driven analysis. Clustering is one of the major topics in data-driven text analysis. With an appropriate language model, a text collection can be approximated by a Gaussian mixture distribution, on which clustering is essentially a density estimation task. Under this model, various clustering methods follow an iterative refinement process, and the initialization of that process largely determines the clustering solution. Hence the study of initialization methods, though neglected in some of the literature, is of great value for cluster analysis. This talk reports the speaker's recent typological review and quantitative evaluation of several initialization methods for clustering. Experiments are conducted on the Reuters text collection as well as a number of synthetic and real-life non-text data sets that follow a Gaussian mixture distribution. The extended discussion also covers methods for refining the initial state of a clustering process towards a near-optimal solution.
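As one concrete instance of the initialization strategies such an evaluation might compare, here is a sketch of k-means++-style seeding over one-dimensional points: later centers are drawn with probability proportional to their squared distance from the nearest chosen center. Picking k-means++ here is an illustrative assumption, not the talk's prescription.

```python
import random

def kmeanspp_init(points, k, rng=None):
    """k-means++-style seeding for 1-D points.

    The first center is chosen uniformly; each subsequent center is
    sampled with probability proportional to its squared distance from
    the nearest already-chosen center, spreading seeds across clusters."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

# Two well-separated hypothetical clusters.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers = kmeanspp_init(points, k=2)
```

Compared with uniformly random seeding, distance-weighted seeding makes it far less likely that two seeds land in the same cluster, which is exactly the kind of effect a quantitative evaluation of initialization methods would measure.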

I'll focus on three papers from COLING 2004. The first is about cross-lingual information extraction evaluation, and shows that a cross-lingual query-driven information extraction system outperforms a translation-based query-driven one. The second is about automatically detecting question-answer pairs in email threads. The last paper generates overview summaries of ongoing email thread discussions.

I will discuss the common, installed resources used by my group of students: how to find suitable tools and corpora, install new ones, and work with the project directories. I will also discuss how to go about group-based evaluation, and outline future infrastructure plans for common text processing work. If time permits, I will also briefly highlight HYP student research in my group over the past year.

New words such as names and technical terms appear frequently, so the bilingual lexicon of a machine translation system has to be constantly updated with their translations. Comparable corpora, such as news documents of the same period from different news agencies, are readily available. In this paper, we present a new approach to mining new word translations from comparable corpora, using context information to complement transliteration information. We evaluated our approach on six months of Chinese and English Gigaword corpora, with encouraging results.
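The context-information component of such an approach can be sketched as follows: map a candidate's Chinese context vector into English through a seed bilingual lexicon, then compare it with the English candidate's context vector by cosine similarity. The seed lexicon, the example vectors, and how this score is combined with a transliteration score (e.g., a weighted sum) are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse vectors given as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def context_score(zh_ctx, en_ctx, seed_lexicon):
    """Translate the Chinese context vector word-by-word through a seed
    bilingual lexicon, then compare it with the English candidate's
    context vector. Words missing from the seed lexicon are dropped."""
    mapped = {}
    for zh_word, w in zh_ctx.items():
        en_word = seed_lexicon.get(zh_word)
        if en_word:
            mapped[en_word] = mapped.get(en_word, 0.0) + w
    return cosine(mapped, en_ctx)

# Hypothetical seed lexicon and co-occurrence counts.
seed_lexicon = {"经济": "economy", "增长": "growth"}
zh_ctx = {"经济": 2.0, "增长": 1.0}
good = context_score(zh_ctx, {"economy": 2.0, "growth": 1.0}, seed_lexicon)
bad = context_score(zh_ctx, {"weather": 3.0}, seed_lexicon)
```

Context similarity of this kind is what catches translation pairs that share little surface form, complementing transliteration cues that only help for phonetically borrowed names.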