Thursday, October 7, 2010

Please join us for two upcoming IR series talks on Wednesday, Oct 13, 2010.

Lunch will be provided by Yahoo!.

Date/Time: Wednesday, Oct 13, 2010, noon
Place: GHC 4405

First Speaker: Le Zhao
Title: Term Necessity Prediction
Abstract:

The probability that a term appears in relevant documents (P(t|R)) is a fundamental quantity in several probabilistic retrieval models; however, it is difficult to estimate without relevance judgments or a relevance model. We call this value term necessity because it measures the percentage of relevant documents retrieved by the term – how necessary a term’s occurrence is to document relevance. Prior research typically either set this probability to a constant or estimated it from the term's inverse document frequency, and neither approach was very effective.
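As a rough illustration of the definition above (not the method from the talk), term necessity can be computed directly when relevance judgments are available: it is the fraction of relevant documents that contain the term. The function name `term_necessity`, the whitespace tokenization, and the toy documents below are assumptions made for this sketch.

```python
def term_necessity(term, relevant_docs):
    """Estimate P(t|R): the fraction of relevant documents containing `term`.

    Assumes `relevant_docs` is a non-empty list of strings already judged
    relevant, and uses naive lowercase whitespace tokenization.
    """
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    term = term.lower()
    hits = sum(1 for doc in relevant_docs if term in set(doc.lower().split()))
    return hits / len(relevant_docs)


# Hypothetical relevant set for a single query (toy data, not from the talk).
relevant_docs = [
    "term weighting for probabilistic retrieval",
    "retrieval models and relevance feedback",
    "estimating term weights without judgments",
    "query expansion for document retrieval",
]

necessity = {t: term_necessity(t, relevant_docs)
             for t in ["retrieval", "term", "feedback"]}
print(necessity)
```

The talk addresses the harder case in which no such judgments exist at query time, so the value has to be predicted rather than counted.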

This paper identifies several factors that affect term necessity, for example, a term’s topic centrality, synonymy and abstractness. It develops term- and query-dependent features for each factor that enable supervised learning of a predictive model of term necessity from training data. Experiments with two popular retrieval models and 6 standard datasets demonstrate that using predicted term necessity estimates as user term weights for the original query terms leads to significant improvements in retrieval accuracy.

This work presents a general rank-learning framework for leveraging deep linguistic and semantic features for passage ranking within Question Answering (QA) systems. The passage ranking framework enables query-time checking of complex, long-distance constraints among question features such as keywords and named entities. These constraints can include keyword ordering, annotation type-checking, verb-argument attachment and arbitrary long-distance paths through an annotation graph. We show that a trained ranking model using this rich feature set achieves greater than a 20% improvement in Mean Average Precision (MAP) over baseline keyword retrieval models. We also show that for questions expressing the most complex linguistic and semantic constraints, further gains in MAP are realized, yielding a 40% improvement over the baseline.