Abstract

The profusion of information retrieval effectiveness metrics has
inspired the development of meta-evaluative criteria for choosing
between them.
One such criterion is discriminative power: the proportion of system
pairs whose difference in effectiveness is found to be statistically
significant.
Studies of discriminative power frequently find normalized discounted
cumulative gain (nDCG) to be the most discriminative metric, but
there has been no satisfactory explanation of which feature makes it
so discriminative.
In this paper, we examine the discriminative power of nDCG and
several other metrics under different evaluation and pooling depths,
and with different forms of score normalization.
We find that evaluation depth is more important to metric behaviour
and discriminative power than metric type; that evaluating beyond
pooling depth does not seem to lead to a misleading system
reinforcement effect; and that nDCG does seem to have a genuine,
albeit slight, edge in discriminative power under a range of
conditions.
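
As a minimal illustration of the definition of discriminative power given above, the sketch below computes the proportion of system pairs whose per-topic score difference is significant. The abstract does not say which significance test or threshold is used, so the paired sign-flip randomization test, the 0.05 significance level, and all function and variable names here are assumptions made only for illustration.

```python
import itertools
import random
from statistics import mean


def paired_randomization_p(a, b, trials=2000, seed=0):
    """Two-sided sign-flip randomization test on paired per-topic scores.

    Assumption: the choice of test is illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly swap the two systems' scores on each topic.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / trials


def discriminative_power(scores, alpha=0.05):
    """Proportion of system pairs found significantly different.

    `scores` maps a system name to its list of per-topic metric scores;
    all lists are assumed to cover the same topics in the same order.
    """
    pairs = list(itertools.combinations(sorted(scores), 2))
    significant = sum(
        1 for s, t in pairs
        if paired_randomization_p(scores[s], scores[t]) < alpha
    )
    return significant / len(pairs)


if __name__ == "__main__":
    # Hypothetical per-topic scores for three systems over ten topics.
    demo = {"A": [0.9] * 10, "B": [0.1] * 10, "C": [0.1] * 10}
    print(discriminative_power(demo))
```

In this toy input, systems B and C have identical scores, so only the pairs involving A can be separated. Comparing metrics would mean running the same computation once per metric on that metric's per-topic scores.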