Thursday, October 7, 2010

Please join us for two upcoming IR series talks on Wednesday, Oct 13, 2010.

Lunch will be provided by Yahoo!.

Date/Time: Wednesday, Oct 13, 2010, noon
Place: GHC 4405

First Speaker: Le Zhao
Title: Term Necessity Prediction
Abstract:

The probability that a term appears in relevant documents (P(t|R)) is a fundamental quantity in several probabilistic retrieval models; however, it is difficult to estimate without relevance judgments or a relevance model. We call this value term necessity because it measures the percentage of relevant documents retrieved by the term, that is, how necessary a term’s occurrence is to document relevance. Prior research typically either set this probability to a constant or estimated it from the term's inverse document frequency, neither of which was very effective.
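To make the quantity concrete, here is a minimal sketch of P(t|R) computed as the fraction of relevant documents that contain a term. It assumes relevance judgments are already available, which is exactly the situation the abstract says is usually missing, so it illustrates the target value rather than the prediction method. The function name `term_necessity` and the toy documents are hypothetical.

```python
def term_necessity(term, relevant_docs):
    """Return the fraction of relevant documents that contain `term`.

    `relevant_docs` is a list of sets of terms, one set per document
    judged relevant to the query (an assumed, simplified representation).
    """
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)


# Toy, made-up relevant set for a query such as "cancer prognosis".
relevant_docs = [
    {"prognosis", "cancer", "treatment"},
    {"cancer", "outlook", "survival"},
    {"cancer", "prognosis", "survival"},
    {"tumor", "outlook"},
]

for term in ("cancer", "prognosis"):
    print(term, term_necessity(term, relevant_docs))
```

In this toy set, "prognosis" appears in only half of the relevant documents because some of them use "outlook" instead, which is the kind of synonymy effect the abstract lists among the factors affecting term necessity.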

This paper identifies several factors that affect term necessity, for example, a term’s topic centrality, synonymy and abstractness. It develops term- and query-dependent features for each factor that enable supervised learning of a predictive model of term necessity from training data. Experiments with two popular retrieval models and 6 standard datasets demonstrate that using predicted term necessity estimates as user term weights for the original query terms leads to significant improvements in retrieval accuracy.

This work presents a general rank-learning framework for leveraging deep linguistic and semantic features for passage ranking within Question Answering (QA) systems. The passage ranking framework enables query-time checking of these complex and long-distance constraints among question features such as keywords and named entities. These constraints can include keyword ordering, annotation type-checking, verb-argument attachment and arbitrary long-distance paths through an annotation graph. We show that a trained ranking model using this rich feature set achieves greater than a 20% improvement in Mean Average Precision over baseline keyword retrieval models. We also show that for questions expressing the most complex linguistic semantic constraints, further gains in MAP are realized, yielding a 40% improvement over the baseline.

Vertical aggregation is the task of incorporating results from specialized search engines or verticals (e.g., images, video, news) into Web search results. Vertical selection is the subtask of deciding, given a query, which verticals, if any, are relevant. State-of-the-art approaches use machine-learned models to predict which verticals are relevant to a query. When trained using a large set of labeled data, a machine-learned vertical selection model outperforms baselines that require no training data. Unfortunately, whenever a new vertical is introduced, a costly new set of editorial data must be gathered. In this paper, we propose methods for reusing training data from a set of existing (source) verticals to learn a predictive model for a new (target) vertical. We study methods for learning robust, portable, and adaptive cross-vertical models. Experiments show the need to focus on different types of features when maximizing portability (the ability for a single model to make accurate predictions across multiple verticals) than when maximizing adaptability (the ability for a single model to make accurate predictions for a specific vertical). We demonstrate the efficacy of our methods through extensive experimentation for 11 verticals.

This is joint work with Fernando Diaz and Jean-Francois Paiement from Yahoo! Labs and will be presented at SIGIR 2010.

Abstract: This paper studies the quality of human labels used to train search engines’ rankers. Our specific focus is the performance improvement obtained by using overlapping relevance labels, that is, by collecting multiple human judgments for each training sample. The paper explores whether, when, and for which samples one should obtain overlapping training labels, as well as how many labels per sample are needed. The proposed selective labeling scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge. Our experiments show that this labeling scheme improves the NDCG of two Web search rankers on several real-world test sets, with a low labeling overhead of around 1.4 labels per sample. This labeling scheme also outperforms several other methods of using overlapping labels, such as simple k-overlap, majority vote, and taking the highest label. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in retrieval accuracy.
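The selective labeling idea can be sketched as follows: every sample gets one label, and extra judgments are requested only when that first label says "relevant". This is a rough illustration under stated assumptions, not the paper's procedure: the relevance test (`label > 0`), the single extra judge, and all names and toy data are hypothetical, and the abstract does not say how the collected labels are combined, so no aggregation step is shown.

```python
def selective_labels(first_labels, extra_judges, is_relevant=lambda label: label > 0):
    """Collect extra labels only for samples whose first label is relevant.

    first_labels: dict mapping sample id -> label from the first judge.
    extra_judges: list of callables, each mapping a sample id to a label
                  (stand-ins for additional human judges).
    Returns a dict mapping sample id -> list of all labels collected.
    """
    collected = {}
    for sample_id, label in first_labels.items():
        labels = [label]
        if is_relevant(label):
            # Only samples judged relevant trigger additional judgments.
            labels.extend(judge(sample_id) for judge in extra_judges)
        collected[sample_id] = labels
    return collected


def labels_per_sample(collected):
    """Average number of labels gathered per sample (the labeling overhead)."""
    return sum(len(labels) for labels in collected.values()) / len(collected)


# Made-up data: five samples, two of which the first judge marks relevant.
first_labels = {"d1": 0, "d2": 2, "d3": 0, "d4": 1, "d5": 0}
extra_judges = [lambda sample_id: 1]  # one hypothetical extra judge

collected = selective_labels(first_labels, extra_judges)
print(collected)
print(labels_per_sample(collected))
```

Because most training samples are typically not relevant, the overhead stays well below that of labeling everything twice; the toy data above are invented only to show how an average of this kind is computed.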

This paper is published in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, July 19-23, 2010.

Abstract: Information hidden in sequences of related data provides a significant exploitable source for information extraction. In this work we demonstrate techniques for increasing the signal of classifications from sequences of dependent data. These include using algorithms, as well as generating features, that take advantage of the sequence-oriented nature of the data. The efficacy of these techniques is evaluated on the classification of familial relationships within United States census data. The census data proves an interesting corpus for this work because of the highly sequential nature of the instances and the explicit classification relationship between one instance and an instance previously found in the sequence.

In this talk, I will briefly explain cross-language information retrieval (CLIR) research for Indian languages. There is a nationwide mission-mode project on cross-language information access, sponsored by the Ministry of Communications and Information Technology, Government of India, involving ten universities. I will talk about the major research issues and a possible roadmap for CLIR research in the Indian setting. I will also share our experiences with multilingual summarization.

Vasudeva Varma has been a faculty member at the International Institute of Information Technology, Hyderabad, since 2002. His research interests include search (information retrieval), information extraction, information access, knowledge management, cloud computing and software engineering. He heads the Search and Information Extraction Lab and the Software Engineering Research Lab at IIIT Hyderabad. He has also been the chair of Post Graduate Programs since 2009. He has published a book on Software Architecture (Pearson Education) and over sixty technical papers in journals and conferences. In 2004, he received a young scientist award and grant from the Department of Science and Technology, Government of India, for his proposal on personalized search engines. In 2007, he was given a Research Faculty Award by AOL Labs.

Web search providers often include search services for domain-specific subcollections, called verticals, such as news, images, videos, job postings, company summaries, and artist profiles. We address the problem of vertical selection, predicting relevant verticals (if any) for queries issued to a search engine's main web search page. In contrast to prior collection selection tasks, vertical selection is associated with unique resources that can inform the classification decision. We focus on three sources of evidence: (1) the query string, from which features are derived independent of external resources, (2) logs of queries previously issued to the vertical directly by users, and (3) corpora representative of vertical content. These sources of evidence are integrated as features in a classification-based approach. We make use of and compare against prior work in federated search and retrieval effectiveness prediction. Our evaluation focuses on 18 different verticals, which differ in terms of semantics, media type, size, and level of query traffic. An in-depth error analysis reveals unique challenges across different verticals and provides insight into vertical selection for future work.

Based on work conducted at Yahoo! Labs Montreal, to be presented at SIGIR 2009.

Title: Quantitative modeling of the neural representation of adjective-noun phrases to account for fMRI activation

Abstract: Recent advances in functional Magnetic Resonance Imaging (fMRI) offer a significant new approach to studying semantic representations in humans by making it possible to directly observe brain activity while people comprehend words and sentences. In this study, we investigate how humans comprehend adjective-noun phrases (e.g. strong dog) while their neural activity is recorded. Classification analysis shows that the distributed pattern of neural activity contains sufficient signal to decode differences among phrases. Furthermore, vector-based semantic models can explain a significant portion of systematic variance in the observed neural activity. Multiplicative composition models of the two-word phrase outperform additive models, consistent with the assumption that people use adjectives to modify the meaning of the noun, rather than conjoining the meaning of the adjective and noun.
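The contrast between additive and multiplicative composition can be illustrated with a small sketch. It assumes the common vector-space formulation in which the phrase vector is the element-wise sum or element-wise product of the two word vectors; the abstract does not spell out the exact formulation used in the study, and the three-dimensional vectors below are made up for illustration.

```python
def additive(adjective, noun):
    """Phrase vector as the element-wise sum of the two word vectors."""
    return [a + n for a, n in zip(adjective, noun)]


def multiplicative(adjective, noun):
    """Phrase vector as the element-wise product of the two word vectors."""
    return [a * n for a, n in zip(adjective, noun)]


# Hypothetical semantic feature vectors for the phrase "strong dog".
strong = [0.5, 0.0, 2.0]
dog = [1.0, 3.0, 0.5]

print(additive(strong, dog))        # every noun feature is kept and shifted
print(multiplicative(strong, dog))  # noun features are rescaled by the adjective
```

Under the product, a feature on which the adjective is zero drops out of the phrase and the others are scaled up or down, so the adjective acts as a modifier of the noun's representation. Under the sum, the two meanings are simply conjoined, which matches the distinction drawn in the abstract.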