On the Reliability and Intuitiveness of Aggregated Search Metrics

Aggregating search results from a variety of diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) the topical relevance assessments of results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e., various aspects of a user’s perception of AS result pages). Our study shows that the AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.

35. Methodology
Discriminative Power (Reliability)
• Discriminative power
  – reflect metrics’ robustness to variation across topics.
  – measure by conducting a statistical significance test for different pairs of systems, and counting the number of significantly different pairs.
• Randomized Tukey’s Honestly Significant Difference (HSD) test [Carterette TOIS’12]
  – use the observed data and computational power to estimate the distributions.
  – conservative nature
Main idea: if the largest mean difference of systems observed is not significant, then none of the other differences should be significant either.
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30(1), 2012.
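
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration of a randomized Tukey HSD test as it is commonly described, not the implementation used in the study. The function names (`randomized_tukey_hsd`, `discriminative_power`), the layout of `scores` (one row per topic, one column per system), and the defaults of 1000 trials and a significance level of 0.05 are all assumptions made for this example.

```python
import random
from itertools import combinations


def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """Return {(i, j): p_value} for every pair of systems.

    scores[t][s] is the metric score of system s on topic t
    (hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    n_topics = len(scores)
    n_systems = len(scores[0])

    def column_means(matrix):
        return [sum(row[s] for row in matrix) / n_topics
                for s in range(n_systems)]

    observed = column_means(scores)
    pairs = list(combinations(range(n_systems), 2))
    exceed = {pair: 0 for pair in pairs}

    for _ in range(trials):
        # Under the null hypothesis the system labels are exchangeable
        # within a topic, so each topic's row is shuffled independently.
        permuted = [rng.sample(row, n_systems) for row in scores]
        means = column_means(permuted)
        # Every pair is compared against the LARGEST mean difference of
        # the permuted data, which is what makes the test conservative.
        largest_gap = max(means) - min(means)
        for i, j in pairs:
            if largest_gap >= abs(observed[i] - observed[j]):
                exceed[(i, j)] += 1

    return {pair: count / trials for pair, count in exceed.items()}


def discriminative_power(scores, alpha=0.05, trials=1000, seed=0):
    """Fraction of system pairs found significantly different."""
    p_values = randomized_tukey_hsd(scores, trials, seed)
    significant = sum(1 for p in p_values.values() if p < alpha)
    return significant / len(p_values)
```

Calling `discriminative_power(scores)` once per metric, on the same set of systems and topics, gives one number per metric that can be compared directly: a metric that separates more system pairs at the same significance level is the more discriminative one.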

67. Conclusions
Final take-out
• In terms of discriminative power,
  – RP is the most discriminative feature (metric) for evaluation among the four AS components.
  – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
• In terms of intuitiveness,
  – the Tolerance-based AS Metric and the diversity-emphasized metric are the most intuitive metrics to emphasize all AS components.
• Overall, the Tolerance-based AS Metric is the most discriminative and intuitive metric.
• We propose a comprehensive approach for evaluating intuitiveness of metrics that takes special aspects of aggregated search into account.
