Math, Madison, food, the Orioles, books, my kids.

Ranking mathematicians by hinge loss

As I mentioned, I’m reading Ph.D. admission files. Each file is read by two committee members and thus each file has two numerical scores.

How to put all this information together into a preliminary ranking?

The traditional way is to assign to each applicant their mean score. But there’s a problem: different raters have different scales. My 7 might be your 5.

You could just normalize the scores by subtracting that rater’s overall mean. But that’s problematic too. What if one rater actually happens to have looked at stronger files? Or even if not: what if the relation between rater A’s scale and rater B’s scale isn’t linear? Maybe, for instance, rater A gives everyone she doesn’t think should get in a 0, while rater A uses a range of low scores to express the same opinion, depending on just how unsuitable the candidate seems.

Here’s what I did last year. If (r,a,a’) is a triple with r is a rater and a and a’ are two applicants, such that r rated a higher than a’, you can think of that as a judgment that a is more admittable than a’. And you can put all those judgments from all the raters in a big bag, and then see if you can find a ranking of the applicants (or, if you like, a real-valued function f on the applicants) such that, for every judgment a > a’, we have f(a) > f(a’).

Of course, this might not be possible — two raters might disagree! Or there might be more complicated incompatibilities generated by multiple raters. Still, you can ask: what if I tried to minimize the number of “mistakes”, i.e. the number of judgments in your bag that your choice of ranking contradicts?

Well, you can ask that, but you may not get an answer, because that’s a highly non-convex minimization problem, and is as far as we know completely intractable.

But here’s a way out, or at least a way part of the way out — we can use a convex relaxation. Set it up this way. Let V be the space of real-valued functions on applicants. For each judgment j, let mistake_j(f) be the step function

mistake_j(f) = 1 if f(a) < f(a’) + 1

mistake_j(f) = 0 if f(a) >= f(a’) + 1

Then “minimize total number of mistakes” is the problem of minimizing

M = sum_j mistake_j(f)

over V. And M is terribly nonconvex. If you try to gradient-descend (e.g. start with a random ranking and then switch two adjacent applicants whenever doing so reduces the total number of mistakes) you are likely to get caught in a local minimum that’s far from optimal. (Or at least that can happen; whether this typically actually happens in practice, I haven’t checked!)

So here’s the move: replace mistake_j(f) with a function that’s “close enough,” but is convex. It acts as a sort of tractable proxy for the optimization you’re actually after. The customary choice here is the hinge loss:

hinge_j(f) = min(0, f(a)-f(a’) -1).

Then H := sum_j hinge_j(f) is a convex function on f, which you can easily minimize in Matlab or python. If you can actually find an f with H(f) = 0, you’ve found a ranking which agrees with every judgment in your bag. Usually you can’t, but that’s OK! You’ve very quickly found a function H which does a decent job aggregating the committee scores. and which you can use as your starting point.

Now here’s a paper by Nihal Shah and Martin Wainwright commenter Dustin Mixon linked in my last ranking post. It suggests doing something much simpler: using a linear function as a proxy for mistake_j. What this amounts to is: score each applicant by the number of times they were placed above another applicant. Should I be doing this instead? My first instinct is no. It looks like Shah and Wainwright assume that each pair of applicants is equally likely to be compared; I think I don’t want to assume that, and I think (but correct me if I’m wrong!) the optimality they get may not be robust to that?

Anyway, all thoughts on this question — or suggestions as to something totally different I could be doing — welcome, of course.

Like this:

Related

5 thoughts on “Ranking mathematicians by hinge loss”

I’m intrigued by the problem. It seems like it might have applicability to a pinball league I’m in.

… The way that league works is everybody plays the same table, and they collect points based on the number of other players they beat. If ten people are at the match that night, the person with the highest score gets 10 points, the second-highest 9 points, the third-highest 8 points, and so on. Season standings are based on the total number of points accumulated.

The problem is if someone has to miss a league meeting (and can’t make it up; the format, obviously, makes it practical to play at a different time, if sanctioned), how can the absence be fairly calculated? Losing all points for the week is likely to leave them under-rated, since they’ll get zero points instead of the 5 per table (or whatever) they would likely get. But you can’t just average their standings from previous weeks, because the number of people playing will vary; a great performance might be 10 point one week, 14 points another, 8 points the next.

So I feel like there’s a similarity in problem to projecting a fair ‘average’ score.

This problem is the same asb(or, at least, quite similar) as the one psychometricians call test-equating. There’s been a lot of work on this, and it’s been a long time since I looked at it (my PhD is in psychometrics, but my coursework was in the early ’90s and I haven’t done this kind of stuff since). But you can look it up.

One simple idea is to give each rater a score based on how he/she rates people compared to how other raters rate the same people, then use that rater-score to weight the ratings.

Is this significantly different from sports-ranking problems? I don’t know how many applicants you get, but would it be much worse than trying to rank 130 college football teams that mostly don’t play each other? For the latter, there are several algorithms out there; the Colley, Massey, and Wolfe systems are among those that give actual mathematical explanation of how they work. Massey and Wolfe are both based on a maximum likelihood estimate; Colley’s is harder to summarize in that many words.

You could probably even improve on the accuracy of rankings over the same system in sports, by assigning the raters in such a way that the applicants aren’t divided into clusters (i.e. sports conferences).

Could you do something like matchpoint scoring in bridge? That would be fairly easy and sort of straightforward. Only two scores might not provide enough depth. Are the same two people reviewing all of the files, or are their different reviewers?