Evaluation in information retrieval

We have seen in the preceding chapters many alternatives in designing an
IR system. How do we know which of these techniques are effective in
which applications? Should we use stop lists? Should we stem? Should
we use inverse document frequency weighting?
Information retrieval has developed as a highly empirical
discipline, requiring careful and thorough evaluation
to demonstrate the superior performance of novel techniques on
representative document collections.

In this chapter we begin with a discussion of measuring the
effectiveness of IR systems (Section 8.1 ) and the test collections
that are most often used for this purpose (Section 8.2 ).
We then present the straightforward notion of relevant and
nonrelevant documents and the formal evaluation methodology
that has been
developed for evaluating unranked retrieval results
(Section 8.3 ). This includes explaining the kinds of
evaluation measures that are standardly used for document retrieval and related
tasks like text classification and why they are appropriate. We then extend
these notions and develop further measures for evaluating ranked
retrieval results (Section 8.4 ) and discuss developing
reliable and informative test collections (Section 8.5 ).

We then step back to introduce the notion of user utility, and how it is
approximated by the use of document relevance (Section 8.6 ).
The key utility measure is user happiness. Speed of
response and the size of the index are factors in user happiness. It seems
reasonable to assume that
relevance of results is the most important factor:
blindingly fast, useless answers do not make a user happy. However,
user perceptions do not always coincide with system designers' notions
of quality. For example, user happiness commonly depends very strongly
on user interface design issues, including the layout, clarity, and
responsiveness of the user interface, which are independent of the
quality of the results returned. We touch on
other measures of the quality of a system, in particular
the generation of high-quality
result summary snippets, which strongly influence
user utility, but are not measured in the basic relevance ranking
paradigm (Section 8.7 ).