Summary: An Empirical Study of Query Specificity
Avi Arampatzis¹ and Jaap Kamps²
¹ Electrical and Computer Engineering, Democritus University of Thrace, Greece
² Media Studies, University of Amsterdam, The Netherlands
Abstract. We analyze the statistical behavior of query-associated quantities in
query logs, namely the sum and the mean of the IDF of query terms, otherwise known
as query specificity and query mean specificity. We narrow down the possibilities
for modeling their distributions to gamma, log-normal, or log-logistic, depending
on query length and on whether the sum or the mean is considered. The results
have applications in query performance prediction and artificial query generation.
1 Introduction and Definitions
Inverse document frequency (IDF) is a widely used and robust term weighting function
capturing term specificity [1]. Analogously, query specificity (QS) or query IDF can be
seen as a measure of the discriminative power of a query over a collection of documents.
A query's IDF is a log estimate of the inverse probability that a random document from
a collection of N documents would contain all query terms, assuming that terms occur
independently. The mean IDF of query terms, which we call query mean specificity