Summary

Image retrieval systems are generally based on the notion of image similarity:
they compute similarity scores between the images in the database and a query (image or text), and organize the images according to these scores.
However, this notion is ill-defined, and the collections used to train and evaluate
image retrieval systems are based on similarity judgments that rely on simplistic, unrealistic assumptions. This paper addresses the issue of
defining image similarity, and more precisely the following two questions: do humans assess image similarity in the same way? Is it possible to define
reference similarity judgments that correspond to the perception of most users? An experiment is proposed in which human subjects are assigned two
tasks that would, in principle, fall to the system: rating the similarity of images and ranking images according to their similarity to a reference image. The data provided by the
subjects are analyzed quantitatively in the light of the two questions above. Results show that the subjects do not follow collective strategies of
similarity assessment, but that a satisfactory consensus can be reached on each of the data
samples used in the experiments. Based on this, methods to define reference similarity scores and rankings are proposed, which
can be applied on a larger scale to produce realistic ground truths for the evaluation of image retrieval systems.
This study is a first step towards a general, realistic definition of the notion of image similarity in the context
of image retrieval.