SIGIR Reviews as Pseudo-Relevance Feedback

Some ACM conferences, such as CHI, offer authors an opportunity to flag material misconceptions in reviewers’ readings of submitted papers before a final accept/reject decision is rendered. SIGIR is not one of them. Its reviewers are free from any checks on their accuracy by the authors and, to judge by the reviews of our submission, by the program committee as well.

Consider this: We wrote a paper on a novel IR framework which we believe has the potential to greatly increase the efficacy of interactive Information Retrieval systems. The topic we tackled is (not surprisingly) related to issues we often discuss on this and on the IRGupf blog, including HCIR, Interactive IR, Exploratory Search, and Collaborative Search. In short, these are all areas that could be well served by an algorithmic framework that supports greater interactivity.

So in our paper, we chose to evaluate our framework through experiments that involved relevance feedback. Relevance feedback is a long-studied, traditionally well-accepted interaction paradigm. The user runs a query, judges a few documents for relevance, and any relevant documents that are found during this process are saved or marked and fed back into the system to produce even better results on subsequent queries. Our results showed not only that the proposed framework is more effective than a robust, well-understood baseline, but also that the algorithms involved are up to an order of magnitude more efficient than traditional baselines. And speed is of utmost importance to interactive IR systems!
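For readers who have not seen the feedback loop written down, here is a minimal sketch of one round of it using the textbook Rocchio update. This is not the framework from the paper. The function name `rocchio_expand`, the term-count document representation, and the default weights are all assumptions chosen for illustration.

```python
from collections import Counter


def rocchio_expand(query, judged, alpha=1.0, beta=0.75, gamma=0.15):
    """Return re-weighted query terms after one round of relevance feedback.

    query  -- dict of term -> weight for the original query
    judged -- list of (term_counts, is_relevant) pairs, one per judged document
    alpha, beta, gamma -- illustrative weights, not values from the paper
    """
    relevant = [doc for doc, is_rel in judged if is_rel]
    non_relevant = [doc for doc, is_rel in judged if not is_rel]

    weights = Counter()
    # Start from the original query.
    for term, w in query.items():
        weights[term] += alpha * w
    # Move toward the centroid of the documents judged relevant.
    for doc in relevant:
        for term, count in doc.items():
            weights[term] += beta * count / len(relevant)
    # Move away from the centroid of the documents judged non-relevant.
    for doc in non_relevant:
        for term, count in doc.items():
            weights[term] -= gamma * count / len(non_relevant)

    # Keep only positively weighted terms.
    return {term: w for term, w in weights.items() if w > 0}


if __name__ == "__main__":
    query = {"jaguar": 1.0}
    judged = [
        ({"jaguar": 2, "cat": 1}, True),   # user marked this one relevant
        ({"jaguar": 1, "car": 3}, False),  # user marked this one non-relevant
    ]
    print(rocchio_expand(query, judged))
```

Pseudo-relevance feedback could reuse the same function by marking the top-ranked documents relevant without asking anyone, which is exactly why the quality of the `judged` input matters so much.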

We received three reviews…

Reviewer #1

The first review, after summarizing our contribution, read in its entirety:

The paper is well written and both the idea and the experimental part are sound.

This was accompanied by a 4/6 recommendation score. Not much help.

Reviewer #2

The second review’s worst criticism of the work was that the evaluation was incomplete:

The idea is new & interesting, especially that it can make use of non-text query logs. One drawback of the paper, in this reviewer’s opinion, is incompleteness: why pseudo-relevance-feedback not considered as well, which is easy to do? Asking a user to judge documents until one gets 5 relevant may not be realistic. Even if PRF does not work, paper should present the results.

…

The idea … is new & interesting, especially that it can make use of non-text query logs. One drawback of the paper, in this reviewer’s opinion, is incompleteness: no study of employing pseudo-relevant docs. The impact would be small if one requires judged relevant docs.

This criticism is flawed on three counts:

First, it was flat out wrong. We were not asking people to find five relevant documents, we were asking them to make five judgments of relevance. This was made very clear in the paper. Furthermore, even if making explicit judgments is difficult, there are many techniques for eliciting implicit (but not pseudo!) judgments of relevance (e.g., see Kelly and Belkin, 2001).

The second reason for not doing pseudo-relevance feedback is that it is more likely to introduce noise and topic drift.

The third reason for avoiding PRF entirely is that it is unnecessary for interactive systems. Indeed, if a user is reading or saving (marking) documents, i.e. if a user is giving explicit judgments of relevance, decades of research have already shown those judgments to be much more effective than pseudo-judgments.
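The distinction in the first point, five judgments versus five relevant documents, is easy to state in code. The sketch below is illustrative only: the function names, the toy ranking, and the set of relevant documents are assumptions, not data or procedures from the paper.

```python
def judge_first_n(ranking, is_relevant, n=5):
    """Make exactly n judgments; return whichever relevant documents turn up."""
    return [doc for doc in ranking[:n] if is_relevant(doc)]


def judge_until_n_relevant(ranking, is_relevant, n=5):
    """Keep judging until n relevant documents are found.

    Returns (relevant_documents, number_of_judgments_made).
    """
    found = []
    for effort, doc in enumerate(ranking, start=1):
        if is_relevant(doc):
            found.append(doc)
            if len(found) == n:
                return found, effort
    return found, len(ranking)


if __name__ == "__main__":
    ranking = list(range(1, 21))       # toy ranked list of document ids
    relevant_ids = {2, 5, 9, 14, 18}   # toy ground truth

    def is_relevant(doc):
        return doc in relevant_ids

    print(judge_first_n(ranking, is_relevant))           # five judgments
    print(judge_until_n_relevant(ranking, is_relevant))  # five relevant documents
```

On this toy ranking the first procedure costs the user five judgments and yields two relevant documents, while the second costs eighteen judgments. Only under pseudo-relevance feedback, where the top five are simply assumed relevant, do the two notions coincide.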

The argument that somehow an evaluation is incomplete or meaningless — or that the impact is small — if it does not involve pseudo-relevance feedback is offensive in its narrow-mindedness. What it reflects, we believe, is the current bias of the field as a whole toward non-interactive web search-like experiences. In the web IR world, the commonly held understanding is that users are too lazy to engage in explicit relevance feedback, or else are engaged in a type of information seeking activity, such as navigation, that does not require any feedback, pseudo-relevant or otherwise. But web information retrieval is not all of information retrieval.

Reviewer #3

The third reviewer, while stating that the work is novel, had two main concerns:

Our chosen query expansion technique (selection and weighting of terms) was not convincing because many others were possible, using our framework.

We did not evaluate using pseudo-relevance feedback.

For the first point, we intentionally chose the simplest implementation of our framework, both to show its strength and to make the fairest comparison possible. If our naive approach beats the quite reasonable baseline (20-30% increase in effectiveness, 10x speedup in efficiency) that should be enough; it is beyond the scope of a conference paper to exhaustively demonstrate its effectiveness for arbitrary schemes a reviewer might dream up. The naive approach worked. That’s a publishable result.

That brings us again to the second point, which actually sounded like the stronger reviewer criticism: pseudo-relevance feedback. Reading between the lines of the reviews, one gets the impression that the reviewers are well-versed in traditional web search: they mention log mining (a minor aside in our paper), and are obsessed with pseudo-relevance feedback. Those of us who were doing IR research before the late 1990s remember a time when intellectual efforts were not judged by standards applicable only to web search. The diversity of approaches, of metrics, and of applications of that era seems to have been reduced to the bleak outlines of precision-oriented page-at-a-time results lists, where interactivity is looked upon as a burden rather than an opportunity.

It is ironic, then, to note that just two days ago, on the same day that our rejection reviews arrived, Google rolled out an interface that allows people to make explicit relevance judgments through bookmarking, which Google’s algorithms then use as a form of relevance feedback! The old web maxim of users being too lazy or unwilling or unengaged to mark documents for relevance — thus necessitating pseudo-relevance feedback at the expense of real relevance feedback — was busted by a major web search engine!

The third reviewer’s score was 4/6.

So what? What are we going to do about it?

Given the discussion on Twitter and in e-mail in the aftermath of this round of rejection decisions, I think it is safe to say that we are not alone in our dissatisfaction with the reviewing process. What will happen is what always happens: the paper will be resubmitted elsewhere and life will move on.

But what about the SIGIR conference, and the community it represents? Are we unhappy with the misreadings and misunderstandings, with the reviewer who could not tell the difference between “5 judgments of relevance” and “5 relevant documents”? Yes, of course. And a conference review system that allows for anonymous feedback from the author to the reviewers, as does CHI, could go a long way in rectifying these misunderstandings. Misconceptions of one’s work are, to a certain extent, completely understandable. Even if a paper is written clearly, the reviewers have not grappled with the ideas anywhere near as much as the author(s) have.

But what about the more basic problem, the one of narrow thinking in the reviews themselves? The idea that a paper on interactivity and relevance feedback is not acceptable unless it also includes experiments on and evaluations of the non-interactive pseudo-relevance feedback approach is one that we have a difficult time accepting. Non-interactive approaches, and the web search world which thrives on them, are popular right now. Pseudo-relevance feedback epitomizes that non-interactivity, yet multiple reviewers suggested that the lack of PRF was the paper’s biggest weakness. We feel that it isn’t; PRF is a different problem, solving a different kind of need, in a different kind of scenario. Not all of information retrieval is web search. So what is one to do when a review not only mis-perceives a paper, but actively tries to impose its own values onto the paper, asking it to solve a different kind of problem than the one it is trying to solve?

Given the non-interactivity in the SIGIR review process, the inability to discuss and correct mis-perceptions and biases, one is tempted to label reviewer comments themselves as a form of pseudo-relevance feedback. That is, contrary to appearances, no explicit judgments of relevance to the conference have actually been made. ;-)

I don’t know if you already did this in your submission because I’ve never seen it. A few comments.

PRF was around before web search and provides a “robust, well-understood baseline”*. You can even combine RF with PRF so there’s no comment on interactivity inherent in asking for a PRF baseline. How does your method compare to these approaches?**

*robustness claims subject to argument.
**I was not a reviewer on this paper

Our paper was about Algorithm Z, done in the context of RF. It was a comparison of how well various Stage 2 mechanisms worked for RF.

For the reviewers to heavily ding the paper for failing to implement Algorithm Y as a baseline would be one thing. But that’s not what they said. What they said is that we didn’t do PRF.

PRF is a different interaction mechanism, is used in a different way and for different reasons than RF. I never said PRF wasn’t around before web search; I only said that it is representative of the web search interaction paradigm. Our paper wasn’t about (Stage 2) algorithms for (Stage 1) PRF; it was about (Stage 2) algorithms for (Stage 1) RF.

It seems to me that the program committee could have caught this, and probably should have done so.

This, incidentally, is an area I feel has deteriorated in a number of conferences in recent years. With only a small PC meeting, or none at all, the readings of reviewers go unchecked and unexamined. Many people aren’t trained to review effectively, and without good PC experience I find they do not learn.

Yours is also a paper with a difficult profile: a technical subject and three positive but unenthusiastic reviewers, all in broad agreement (or with the same bee in their headgear). As a chair, it’s not the 6-5-1 papers I’ve learned to dread: they inspire discussion and debate. The headaches are the 4-4-3s, especially when you don’t know the reviewers very well.

My sense is that 3s and 4s were easy rejects for the committee, and that papers in this range didn’t receive much discussion, considering that they accepted 87 of 520 submissions (just under 17%). But I am just guessing.

FD.. just one more thought on the differing mechanisms that support PRF vs. RF.

The documents selected by PRF in Stage 1 are not necessarily relevant, and have a much higher potential for adding noise and increasing the likelihood of topic drift. Therefore, even though a single technical mechanism (algorithm) is logically capable of supporting both (after all, input documents are just input documents), in reality the algorithms in Stage 2 that are developed for PRF are very different from the ones developed for RF.

In PRF, the primary guiding principle is risk minimization. Average performance of many algorithms is good, but there is wild per-topic variability (expanded queries that do much worse than the original unexpanded query) that can severely impact overall system perception. Algorithms that support PRF try to minimize those negative queries. See for example papers by Yun Zhou, Iadh Ounis, and Kevyn Collins-Thompson.

With RF, on the other hand, the primary guiding principle is effectiveness maximization. There are many fewer queries that perform worse than the original, so much more emphasis is placed on making the good queries even better.

Of course PRF is also concerned with relevance. Obviously. And RF also wants to minimize risk. It is just that the *primary* challenge or obstacle that one runs into when non-relevant documents are automatically assumed to be relevant (PRF) is very different from the one that arises when the user has explicitly (or even implicitly!) given some sort of indication that a document is relevant.
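One way to see the difference in emphasis is to look at per-topic changes rather than only the mean. The sketch below is a hypothetical helper, `feedback_report`, that summarizes how many topics were hurt or helped by expansion; the average-precision numbers in the usage example are made up for illustration and are not results from any experiment.

```python
def feedback_report(baseline_ap, expanded_ap):
    """Summarize per-topic change in average precision after query expansion.

    baseline_ap -- per-topic scores for the original, unexpanded queries
    expanded_ap -- per-topic scores for the expanded queries, in the same order
    """
    deltas = [e - b for b, e in zip(baseline_ap, expanded_ap)]
    return {
        "mean_gain": sum(deltas) / len(deltas),  # what an averaged metric rewards
        "hurt": sum(d < 0 for d in deltas),      # topics worse than the original
        "helped": sum(d > 0 for d in deltas),    # topics better than the original
        "worst_drop": min(deltas),               # largest single-topic loss
    }


if __name__ == "__main__":
    # Made-up scores for four topics, purely for illustration.
    baseline = [0.20, 0.30, 0.40, 0.10]
    expanded = [0.35, 0.45, 0.10, 0.25]
    print(feedback_report(baseline, expanded))
```

A risk-minimizing method in the PRF style would aim to shrink `hurt` and `worst_drop`, while an effectiveness-maximizing method in the RF style would mostly aim to raise `mean_gain`.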

Our paper was about an algorithm for RF. Not for PRF. And so when multiple reviewers ding us for not doing PRF, and the metareviewer doesn’t call them on it, I’ve got a bit of a problem with that. Maybe a combined RF+PRF paper is a good topic for a journal, but for a conference paper one can only do so many conceptually different things. We did RF, and I feel we were not reviewed as an RF paper. We were reviewed as a PRF paper.

..and because I believe we were reviewed as a PRF paper, it makes a lot more sense that one of the reviewers would confuse “5 relevant documents” with “5 relevance judgments”. In RF, those two concepts are not (necessarily) equivalent; in PRF they are!

But then one has to ask: When we say that the paper is about RF, why is it being reviewed as a PRF paper?

You bring up a good point, but I wonder: Does pre-printing and having this discussion now basically destroy the possibility of anonymity in reviews, for the next conference to which I submit this work?

Your question presupposes that anonymity is a good thing. In the good old days, referees would be aware, not only of your work, but of the consensus as to its validity and impact. Then they would decide whether or not it was archival quality.

The net effect of high stakes conferences, whose results are used as an ersatz measure of scholarship, is that both results and discussion are suppressed. I don’t see how this is a good thing.

gvc: I did not wish to explicitly presuppose this. In fact, I would be fine with non-anonymous submissions, and non-anonymous reviews, especially if what we were gaining in the process were more interactivity and feedback. (Real feedback, not pseudo-feedback :-)

Until that changes, however, I’m still concerned about how “correct” it is to de-anonymize my paper. Right this very minute, I mean.
