Utilizing passage-based language models for ad hoc document retrieval

Abstract

In the ad hoc retrieval setting, a document relevant to a query may contain only a few short passages that carry query-related information; passage-based document ranking approaches were proposed to cope with this fact. We show that several of these retrieval methods can be understood, and new ones derived, within the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model, both for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits of using the document-homogeneity measures to integrate document-query and passage-query similarity information for document retrieval.

Notes

Acknowledgments

This paper is based upon work done in part while the first author was at the Technion and the second author was at Cornell University. The work presented here was supported in part by Google’s and IBM’s faculty research awards, by the Center for Intelligent Information Retrieval, and by the National Science Foundation under grant no. IIS-0329064. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsoring institutions.

Performance numbers of passage-based relevance models (Liu and Croft 2002). We use either the originally suggested basic passage language model (RelPsg) (Liu and Croft 2002) or our homogeneity-based passage language model (RelPsg[\({\mathcal{M}}\)]). Document-based relevance-model (Lavrenko and Croft 2001) performance is presented for reference (RelDoc). Boldface: best result in a column; underline: best result in a corpus table (per evaluation measure). Statistically significant differences with RelDoc and RelPsg are marked with d and p, respectively
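The passage-based relevance models evaluated in this table follow the RM1-style construction of Lavrenko and Croft (2001), estimated over top-retrieved passages rather than documents. A hedged sketch, assuming each retrieved passage comes with a unigram language model and a query-likelihood score (all names here are illustrative):

```python
# Sketch of an RM1-style relevance model built from top-retrieved passages:
# p(w | R) is proportional to sum over passages g of p(w | g) * p(q | g).
# `scored_passages` is a list of (unigram_lm_dict, query_likelihood) pairs.


def relevance_model(scored_passages):
    """Weighted mixture of passage language models, weights normalized."""
    z = sum(score for _, score in scored_passages)
    rel = {}
    for p_lm, score in scored_passages:
        for w, p in p_lm.items():
            rel[w] = rel.get(w, 0.0) + p * score / z
    return rel


scored = [({"retrieval": 0.6, "passage": 0.4}, 3.0),
          ({"retrieval": 0.2, "model": 0.8}, 1.0)]
rel = relevance_model(scored)
```

Swapping the basic passage model for the homogeneity-based one changes only the per-passage language models fed into this mixture, which is how the RelPsg and RelPsg[\({\mathcal{M}}\)] variants differ.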

Performance numbers of the InterMaxPsg algorithm when implemented with standard document and passage language models. Document-based (Doc[base]) and passage-based (MaxPsg[base]) retrieval performance is presented for reference. Boldface: best result per column; underline: best performance in a corpus table per evaluation measure and smoothing technique; d and p mark statistically significant differences with Doc[base] and MaxPsg[base], respectively