Wednesday, April 30, 2008

STOC 2009: "Impending Doom"

Another surprise announcement: I was asked to be the PC chair for STOC 2009, and I accepted.

I have to admit, I was surprised to be asked, as I am, after all, somewhat vocal in my opinions, which are not always popular. (I was going to label myself a crank, with examples, here, here, here, and here, before someone else did, but I prefer "vocal".) Once asked, I found it hard to decline, despite the warnings that the job is a lot of work (and arguably little reward). It is an honor to be asked, I do believe in service for the community, and, most importantly, I felt it might lead to interesting fodder for the blog. (Obviously nothing confidential will be discussed, but the challenges of the process -- an inside view -- might be interesting.) It was something I was hopefully going to do once in my life. And (as Cynthia Dwork nicely pointed out to me, when I asked her about the job), I'm only getting older, and will have less energy.

Some people might be worried, given my noted worldview, that I might set out to change things drastically. I was thinking it would make a great joke to take "competitive analysis" off the list of topics of the call for papers, only to find it wasn't actually there. Rest assured that things will probably change incrementally; I respect the traditions, and I view the main part of this job to be selecting and empowering a solid, talented PC to do their best job.

The one change I'd really like to implement, so much so that I have to let people object already, is to introduce the rating system that conferences like SIGCOMM use:

5: top 5%
4: top 10%, not top 5%
3: top 25%, not top 10%
2: top 50%, not top 25%
1: bottom 50%

I like this approach much better than trying to guess what everyone thinks a 7 means, or differentiating between a 6.3 and a 6.6. (Though, depending on the projected acceptance ratio, I could see changing the numbers a bit -- to top 10%, 10-20%, 20-33%.) I also think this approach makes it easier to focus attention on the controversial papers that genuinely need discussion.
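To make the banded scale concrete, here is a toy sketch (paper names, scores, and the "spread" threshold are all invented for illustration) of how such scores could surface the controversial papers -- those whose reviews span widely separated bands:

```python
# Toy sketch of the banded SIGCOMM-style scale (thresholds illustrative).
BANDS = {5: "top 5%", 4: "top 10%", 3: "top 25%", 2: "top 50%", 1: "bottom 50%"}

def controversial(scores, spread=2):
    """Call a paper 'controversial' if its reviews span `spread` or more bands."""
    return max(scores) - min(scores) >= spread

reviews = {
    "paper-A": [5, 4, 5],   # consensus accept territory
    "paper-B": [5, 1, 3],   # reviewers disagree sharply -> needs discussion
    "paper-C": [1, 2, 1],   # consensus reject
}

needs_discussion = [p for p, s in reviews.items() if controversial(s)]
print(needs_discussion)  # ['paper-B']
```

The point is only that coarse bands make disagreement easy to spot mechanically, which is harder when everyone is arguing over a 6.3 versus a 6.6.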

20 comments:

I think the real measure should not be relative to the pool of submissions, but absolute, and I think just 4 real levels are needed:

"I will fight hard to get this paper in, because it will change our field."
"This is worthwhile, and won't lower the standard of the conference, but others might also be worthy; I won't spend effort fighting for it."
"Not as good as usual for the conference, but it is a minor contribution which should appear somewhere."
"This is unclear, wrong, silly, already known, or trivial; it would be an embarrassment to be associated with a conference that publishes this as submitted."

[There may also be a fifth category for "something is unclear/wrong/silly/etc., but there is also an idea worth pursuing that could be published here if the paper were rewritten substantially." Unless the conference has a shepherding process allowing revision, this should be treated like "embarrassing if it appears," except for the feedback it sends to authors.]

With such an absolute system, it's easy to focus attention on the real debates that matter for the program. One can simply accept every paper where someone will fight for it, and no-one thinks it below the usual standard; reject every paper where no-one will fight for it and someone has doubts. One must resolve the cases of real debate (when someone will fight for a paper, and someone else thinks it is below the usual standard for the conference, or even bad). Finally, one can fill up the program with a random (or better, spread by topic among submissions from authors new to the community) selection from the papers that everyone thinks are worthy to appear but not important enough to fight for.
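As a sketch only -- the rules below paraphrase the comment's accept/reject logic, and the numeric encoding of the four levels is my own invention -- the triage it describes might look like:

```python
# Hypothetical numeric encoding of the commenter's four absolute levels.
FIGHT, WORTHY, MINOR, EMBARRASSING = 4, 3, 2, 1

def triage(scores):
    """Route a paper based on its reviews, per the rules described above."""
    champion = FIGHT in scores        # someone will fight to get it in
    doubts = min(scores) <= MINOR     # someone thinks it below the usual standard
    if champion and not doubts:
        return "accept"
    if champion and doubts:
        return "debate at PC meeting"
    if doubts:
        return "reject"
    return "filler pool"              # everyone says 'worthy'; fill the program from these

print(triage([4, 4, 3]))  # accept
print(triage([4, 1, 3]))  # debate at PC meeting
print(triage([3, 3, 3]))  # filler pool
print(triage([3, 2, 3]))  # reject
```

Only the "debate" bucket consumes PC-meeting time, which is exactly the focusing effect the comment is after.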

Something like this would be an interesting experiment. I wonder how the size of the PC will impact things. Even with the large number of submissions that individual committee members are responsible for, they see only a small fraction of the papers. How should a committee member produce their ratings? Should their ratings be relative only to the papers they are assigned? If so, this is a bit of a quota system on ratings divided into subareas associated with committee members. This issue becomes more complicated because external reviewers will not have access to the pool of submissions to compare.

If there is no such quota, which seems more reasonable, how do they judge the pool as a whole? (This is particularly difficult for someone who has not served previously.) The only unambiguous standard an inexperienced PC member can use in this case is how a paper rates relative to accepted papers at previous conferences. What will likely happen is that people will not actually abide by the percentages: they will use "top 25%" or "top 33%" as code for "up to the standards of previous STOC/FOCS" (since that is roughly what the acceptance rates have been), use "top 50%" or something similar for the middle tier of good papers, and save time by not worrying precisely about grading the rest. (Committee members now probably spend too much time overall trying to tease out whether a paper is publishable -- a 4/10 rating, say -- versus garbage, 0-2/10. Letting them save that time for more important things seems to be the biggest win of the proposal.)

The software will tend to want to produce lists of papers ranked by "averages" of these scores, but with this changed scale those "averages" will probably have even less meaning. (I've never much liked these averages anyway.)

Why might this be any different from SIGCOMM? SIGCOMM has a significantly larger committee, and a much higher proportion of its reviewers will likely have prior SIGCOMM reviewing experience. On the other hand, I bet every SIGCOMM reviewer knows the typical acceptance percentages, so actual behavior might not be so different.

Harry Lewis reminded me to go back and look at Excellence Without a Soul, page 120, discussing the number of categories used for grading. The main point is the first sentence, "A scale with more categories allows more precise comparisons, but the value assigned to any individual piece of work is more arbitrary." He suggests that even 5 categories is too many. The SIGCOMM scale is designed so that the arbitrariness is focused at the top (differentiating between top 5 and top 10%), but perhaps for STOC 4 categories -- given either as percentages, or nominally in the style of Alan -- would be sufficient.

Michael, I like your idea of cutting down on the number of different rating categories. I also agree with Paul that categories like "top 33% but not top 10%" are difficult to assign without looking at the whole pool. It must be even harder to get an opinion from a subreferee not on the committee and arrive at a grade based on their remarks.

Anyway, however you end up doing it, good luck! Hope to see a great program for STOC '09.

1) The Program Chair (in this case, me) with advice from others picks the PC. 2) The PC is responsible for reviews; sometimes PC members request others (experts, or students) to do some of the reviewing for them, to ease the load, give students experience, and/or get better reviews.

This doesn't seem completely accurate. According to SIGACT guidelines, to "give students experience" is not a legitimate reason for them to be sub-referees. This guideline was instituted way back at the STOC 1999 business meeting (see Sept 1999 SIGACT News):

"Because of concerns that members of the community raised [about the STOC 1999 process], in the future non-expert sub-referees will not be allowed." The exact implementation of the guidelines is left to the interpretation of each PC chair.

Generally, students are appropriate sub-referees when they know the area. They can also be appropriate sub-referees when the object is to check correctness of a paper.

"Because of concerns that members of the community raised, in the future non-expert sub-referees will not be allowed. Student sub-referees must be experts."

And I agree with that...except that there's obviously some room for interpretation. (When is a student -- or anyone -- qualified as an expert? It seems not to be spelled out in the guidelines...)

I would never give a paper to a student who I felt didn't know enough to give me significant useful information. Indeed, whenever I use a subreferee as a PC member, I feel it's incumbent on me to take their report and use it as information for my own decision as a PC member. If someone has much greater expertise than me on a subject (for example, pretty much whenever I'm stuck with a quantum paper), I may not have anything further worthwhile to add, but usually that's not the case, and even then, in the end I'm responsible for the final opinion and judgment. If I use a student as a subreferee, by definition it's because I think they qualify as expert enough to provide me with a useful opinion; if a student has less experience, I would be responsible as the PC member for taking that into account.

In short, I read (and recall) that situation and the resulting guideline as a clear statement about how PC members need to view their responsibility -- which should inherently limit the amount of subrefereeing given to students. I see it less as a dictum drawing a bar meant to prevent students from reviewing -- which is an important experience for graduate students to undertake as they become more advanced.

Michael, please consider expanding the size of the PC. I just don't see the point of having each member of the PC responsible for 40+ papers. Plus with more people on the PC you can get better coverage of more areas.

A big PC sometimes becomes unfair to students who write papers with a co-author on the PC. As much as I like the idea of having a reasonably big PC, I wonder how this concern will be addressed. I don't have a good answer...

hi michael, what is the homepage for stoc 2009? When I tried googling for stoc 2009 all i got was your blog :-) . Though I enjoyed reading your blog entry and the comments which followed, I was wondering where the actual home page for stoc 2009 was :-) Please let me know.

So, having used (suffered through?) the SIGCOMM system on several PCs, a few comments.

The goal of the SIGCOMM system is to figure out which 40 or 50 papers (out of the hundreds submitted) are to be discussed at the PC meeting.

In that light, there's no point in differentiating between papers that are clearly below the bar -- so a single grade (1) covers papers that are in the bottom 50%. Indeed, one of the most important innovations in SIGCOMM reviewing has been the quick reject process -- in the first round, every paper goes to two reviewers -- if both rate it in the bottom 50% it is rejected. Only papers with at least one rating in the upper 50% get additional reviews. So the additional reviews are concentrated on the papers that have a chance at acceptance.
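A minimal sketch of that two-round flow (paper names and scores are invented; a score of 1 means "bottom 50%," as in the scale above):

```python
# Sketch of the SIGCOMM-style quick-reject first round described above.
def first_round(paper_scores):
    """Each paper gets two first-round reviews; if both reviewers rate it
    in the bottom 50% (score 1), it is rejected without further reviews.
    Everything else survives to get additional reviews in round two."""
    survivors, rejected = [], []
    for paper, (r1, r2) in paper_scores.items():
        if r1 == 1 and r2 == 1:
            rejected.append(paper)
        else:
            survivors.append(paper)
    return survivors, rejected

scores = {"p1": (1, 1), "p2": (1, 3), "p3": (4, 5)}
survivors, rejected = first_round(scores)
print(survivors, rejected)  # ['p2', 'p3'] ['p1']
```

The design choice is that a single above-median review is enough to keep a paper alive, so reviewing effort concentrates on papers with some chance of acceptance.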

The remaining grades distinguish among papers that might be discussed (most in the second 25% are not discussed, but some are) and help focus the PC discussion a bit.

Also, someone asked if the ratings fit the percentages (i.e. is the top rating given in only 5% of all reviews?). Some PC chairs have looked at this issue and the answer, roughly, appears to be yes. But it is only rough.

The issue is that it takes a while for a reviewer to calibrate. Consider that the typical PC member (on a big PC) sees only about 20 papers out of 300. The likelihood that this sample of 20 has a quality distribution matching that of the 300 is pretty small. (This is compounded by the fact that their selections aren't random, but rather match their reviewing expertise -- and some years are more fertile/innovative than others in a particular sub-field.)
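That calibration point is easy to check with a quick simulation (all numbers invented): repeatedly draw a load of 20 papers from a pool of 300 in which exactly 15 (5%) are "true top 5%" papers, and count how many top papers each load contains.

```python
import random

random.seed(0)
POOL, LOAD, TRIALS = 300, 20, 10_000
TOP5 = set(range(15))  # pretend papers 0-14 are the true top 5% of the pool

counts = [sum(p in TOP5 for p in random.sample(range(POOL), LOAD))
          for _ in range(TRIALS)]

# On average a load contains 20 * 15/300 = 1 top-5% paper, but the
# spread is wide: many loads contain none, and some contain several.
print(sum(counts) / TRIALS, min(counts), max(counts))
```

And this simulation samples uniformly at random; assignments correlated with reviewer expertise would make the mismatch worse, as the comment notes.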