14.
Evaluation Challenges on the Web
• Collection is dynamic
– 10-20% of URLs change every month
• Queries are time-sensitive
– Topics are hot, then they are not
• Spam methods evolve
– Algorithms evaluated against last month's web may not work today
• But we have a lot of users… you can use clicks as supervision
SIGIR'05 keynote given by Amit Singhal from Google

25.
Precision at 10
• P@10 is the fraction of relevant documents among the top 10 documents in the ranked list returned for a topic
• E.g.
– 3 of the top 10 documents are relevant
– P@10 = 3/10 = 0.3
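
A minimal sketch of the computation, assuming the ranked list is a list of document IDs and the judgments are a set of relevant IDs (names are illustrative):

    def precision_at_k(ranked_ids, relevant_ids, k=10):
        """Fraction of the top-k retrieved documents that are relevant."""
        hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
        return hits / k

    # Example from the slide: 3 relevant documents in the top 10
    ranked = ["d%d" % i for i in range(1, 11)]
    print(precision_at_k(ranked, {"d2", "d5", "d9"}))  # 0.3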

26.
Mean Reciprocal Rank
• MRR is the mean, over topics, of the reciprocal rank: one over the rank of the first relevant document in the ranked list returned for a topic
• E.g.
– the first relevant document is ranked No. 4
– reciprocal rank = 1/4 = 0.25 (the MRR for this single topic)
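
A sketch under the same document-ID assumptions as above; reciprocal_rank scores one topic, and MRR averages it over topics:

    def reciprocal_rank(ranked_ids, relevant_ids):
        """1/rank of the first relevant document; 0 if none is retrieved."""
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(topics):
        """Mean of reciprocal ranks over (ranked_ids, relevant_ids) pairs."""
        return sum(reciprocal_rank(r, rel) for r, rel in topics) / len(topics)

    # Example from the slide: first relevant document at rank 4
    print(reciprocal_rank(["d1", "d2", "d3", "d4"], {"d4"}))  # 0.25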

27.
bpref
• bpref stands for Binary Preference
• It considers only judged documents in the result list
• The basic idea is to count the number of times judged non-relevant documents are retrieved before judged relevant documents
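
Concretely, for a topic with R judged relevant and N judged non-relevant documents (Buckley & Voorhees, 2004):

    bpref = (1/R) * sum over retrieved relevant docs r of
            (1 - min(n_r, min(R, N)) / min(R, N))

where n_r is the number of judged non-relevant documents ranked above r. A sketch under the same document-ID assumptions as the earlier examples:

    def bpref(ranked_ids, relevant_ids, nonrelevant_ids):
        """Binary preference: each retrieved relevant document is penalized
        by the fraction of judged non-relevant documents ranked above it.
        Unjudged documents are ignored entirely."""
        R, N = len(relevant_ids), len(nonrelevant_ids)
        if R == 0:
            return 0.0
        denom = min(R, N)
        score, nonrel_above = 0.0, 0
        for doc_id in ranked_ids:
            if doc_id in nonrelevant_ids:
                nonrel_above += 1
            elif doc_id in relevant_ids:
                penalty = min(nonrel_above, denom) / denom if denom else 0.0
                score += 1.0 - penalty
        return score / R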

52.
Introduction
• The user study differs in at least two respects from previous work
– It provides detailed insight into the users' decision-making process through the use of eye tracking
– It evaluates relative preference signals derived from user behavior
• Clicking decisions are biased in at least two ways: trust bias and quality bias
• Clicks have to be interpreted relative to the order of
presentation and relative to the other abstracts

53.
User Study
• The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action
• This is achieved by recording users' eye movements with an eye tracker

63.
Does relevance influence user decisions?
• Yes
• Use the “reversed” condition
– Controllably decreases the quality of the retrieval function and the relevance of highly ranked abstracts
• Users react in two ways
– They view lower-ranked links more frequently and scan significantly more abstracts
– Subjects are much less likely to click on the first link and more likely to click on a lower-ranked link

64.
Are clicks absolute relevance judgments?
• Interpretation is problematic
• Trust Bias
– The abstract ranked first receives more clicks than the second; two possible explanations:
• The first link is more relevant (not influenced by order of presentation), or
• Users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)

65.
Trust Bias
• The hypothesis that users are not influenced by presentation order can be rejected
• Users have substantial trust in the search engine's ability to estimate relevance

66.
Quality Bias
• Quality of the ranking influences the user’s
clicking behavior
– If the relevance of the retrieved results decreases, users click on abstracts that are on average less relevant
– Confirmed by the "reversed" condition

67.
Are clicks relative relevance judgments?
• An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly
– The user's trust in the quality of the search engine
– The quality of the retrieval function itself
• How about interpreting clicks as pairwise preference statements?
• An example: if the user skips the abstract at rank 2 but clicks on the abstract at rank 3, infer that document 3 is preferred over document 2 ("Click > Skip Above"); see the sketch below
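
A minimal sketch of the "Click > Skip Above" strategy: each click generates a preference for the clicked result over every higher-ranked result the user skipped (function and variable names are illustrative):

    def click_skip_above(ranking, clicked_ranks):
        """Emit (preferred, not_preferred) document pairs: a clicked
        document is preferred over every skipped document ranked above it."""
        clicked = set(clicked_ranks)
        prefs = []
        for c in clicked_ranks:
            for above in range(1, c):
                if above not in clicked:  # abstract seen but skipped
                    prefs.append((ranking[c - 1], ranking[above - 1]))
        return prefs

    # User clicks results 3 and 5, skipping 1, 2, and 4:
    print(click_skip_above(["d1", "d2", "d3", "d4", "d5"], [3, 5]))
    # [('d3', 'd1'), ('d3', 'd2'), ('d5', 'd1'), ('d5', 'd2'), ('d5', 'd4')]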

68.
Comments on the example
• Takes trust and quality bias into consideration
• Substantially and significantly better than random
• Close in accuracy to inter-judge agreement

78.
Conclusion
• Users' clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute relevance feedback
• Strategies for generating relative relevance feedback signals are shown to correspond well with explicit judgments
• While implicit relevance signals are less consistent with explicit judgments than the explicit judgments are with each other, the difference is encouragingly small