Wednesday, June 22, 2011

An Idle Experiment

I exchanged an email with Dan Baker (aka Spartan Dan) in which we discussed how to measure performance for predictors. He pointed me towards this posting where Ken Pomeroy looks at performance broken down into "bands" of predicted win percentages. That's an interesting way of looking at the data, although it's only applicable to rating systems that predict a winning percentage (like Bradley-Terry). I've done similar sorts of analysis in the past, and this spurred me to repeat some of those experiments.

I generated the usual test data for the TrueSkill rating system, but instead of testing against the whole data set, I filtered down to certain subsets. (I also changed the parameters of the cross-validation slightly because some of the subsets are relatively small -- that's why the performance on the entire data set doesn't match the previous version.) In the first experiment, I created subsets where the difference between the TrueSkill ratings for the home team and the away team was either greater than 40, less than -40, or in-between. (This roughly corresponds to predicting a ten point win by the home team, a ten point win by the away team, or somewhere in-between.) Here are the results for those subsets (as well as the results for the whole data set for comparison):

Predictor                 % Correct    MOV Error
TrueSkill                   73.4%        10.96
Delta TS > 40               95.7%        11.60
Delta TS < 40 && > -40      68.9%        10.85
Delta TS < -40              79.3%        10.71
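To make the banding concrete, here is a minimal sketch of splitting games into the three rating-difference bands and tallying prediction accuracy per band. The games, ratings, and the 40-point cutoff mirror the description above, but the data rows are invented for illustration:

```python
# A minimal sketch of splitting games into the three rating-difference
# bands; the games and ratings below are invented for illustration.
from collections import defaultdict

# (home_rating, away_rating, home_won) -- hypothetical games.
games = [
    (1550, 1500, True),   # home favored by 50: delta > 40 band
    (1450, 1500, False),  # away favored by 50: delta < -40 band
    (1510, 1500, True),   # near-even matchup, home win
    (1505, 1515, True),   # near-even matchup, home upset win
]

def band(home_rating, away_rating):
    """Assign a game to a band by the rating difference."""
    delta = home_rating - away_rating
    if delta > 40:
        return "delta > 40"
    if delta < -40:
        return "delta < -40"
    return "-40 <= delta <= 40"

correct = defaultdict(int)
total = defaultdict(int)
for home, away, home_won in games:
    b = band(home, away)
    total[b] += 1
    # Naive prediction: the higher-rated team wins.
    if (home >= away) == home_won:
        correct[b] += 1

for b in total:
    print(b, correct[b], "of", total[b])
```

With real data, each band's `correct[b] / total[b]` would correspond to the "% Correct" column above.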

There are a number of interesting results here. First, home teams with a big ratings advantage rarely lose, so we have a very good winning percentage for those games. However, our MOV Error is at its worst -- perhaps because these games include more blowouts, or simply more volatility. Second, our prediction rate in "close" games is very poor -- only slightly better than our naive 1-bit predictor (although that predictor would presumably do worse on this subset). However, our MOV Error is pretty good -- probably because it's an absolute measure, and close games are likely to have a tighter MOV spread. Finally, we do a lot worse predicting games where the away team has the big advantage. We (essentially) predict all wins by the away team, but in reality the home team manages to win more than 20% of those games.

Recall that we use linear regression to create an equation that predicts the MOV. That equation looks something like this:

    Predicted MOV = Hc * (Home Rating) + Ac * (Away Rating) + C

We can give some intuitive (if perhaps not completely correct) meanings to these coefficients. C represents the "Home Court Advantage" -- it is the bonus (if positive) that the home team gets for playing at home. The ratio of Hc to Ac represents the relative advantage (or disadvantage) the home team has over the away team.
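A sketch of how such coefficients can be recovered by least squares, using numpy. The four training rows are invented: they lie exactly on a plane with Hc = 0.25, Ac = -0.24, C = 3.5 (numbers chosen to resemble the TrueSkill row below), so the fit recovers those values:

```python
import numpy as np

# Hypothetical training rows (home rating, away rating, observed MOV).
# These numbers are invented: they lie exactly on the plane
# MOV = 0.25*H - 0.24*A + 3.5, so the fit recovers those coefficients.
rows = np.array([
    [100.0,  80.0,  9.30],
    [ 90.0,  95.0,  3.20],
    [ 70.0, 110.0, -5.40],
    [105.0,  60.0, 15.35],
])

H, A, mov = rows[:, 0], rows[:, 1], rows[:, 2]
# Design matrix with a constant column for the intercept C.
X = np.column_stack([H, A, np.ones_like(H)])
(hc, ac, c), *_ = np.linalg.lstsq(X, mov, rcond=None)

def predict_mov(home_rating, away_rating):
    """Predicted margin of victory for the home team."""
    return hc * home_rating + ac * away_rating + c
```

On real game data the rows would not sit exactly on a plane, and the coefficients would come out like the table below.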

So let's compare the coefficients produced for each of the subsets:

Predictor                     Hc        Ac        C
TrueSkill                   0.249    -0.238     3.48
Delta TS > 40               0.249    -0.251     3.77
Delta TS < 40 && > -40      0.250    -0.235     3.29
Delta TS < -40              0.148    -0.164    -0.373

The first interesting thing that jumps out is the equation for big home underdogs, which is quite different from the other three. In particular, if we think of C as the Home Court Advantage, it completely disappears (in fact, turns slightly negative) when the home team is a big underdog. Even worse for the home team, this is the subset where its relative advantage (the ratio of Hc to Ac) is at its worst. But somehow the home team manages to win more than 20% of these games! There's probably something interesting to be found in that anomaly... Another interesting thing to note is that when home teams are big favorites, they actually slightly underplay their strength (relative to the away team), although this is somewhat compensated for by a greater HCA.

This is far from a rigorous experiment, but it does illustrate that the overall performance of our model can obscure radically different performance on included subsets. It also suggests that we might want to build our predictor to use different models (or entirely different approaches) for different subsets of the data. (But interestingly, the "sum" of the three subset models actually slightly under-performs the single overall model. So maybe this intuition is wrong!) In particular, it might be worthwhile to start tracking the performance of our candidate models for "close games", since that's clearly the place where significant performance gains can be had.
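A toy version of that "one overall model vs. per-band models" comparison can be sketched as follows. Everything here is synthetic: the ratings are uniform draws, the true MOV relationship (0.25*H - 0.24*A + 3.5 plus noise) is an assumption, and in-sample mean absolute error stands in for the cross-validated MOV Error used above:

```python
import numpy as np

# Synthetic data: hypothetical ratings and a made-up true MOV plane.
rng = np.random.default_rng(0)
n = 400
H = rng.uniform(0.0, 200.0, n)
A = rng.uniform(0.0, 200.0, n)
mov = 0.25 * H - 0.24 * A + 3.5 + rng.normal(0.0, 8.0, n)

X = np.column_stack([H, A, np.ones(n)])

def fit_mae(mask):
    """Least-squares fit on the masked games; return (MAE, game count)."""
    coef, *_ = np.linalg.lstsq(X[mask], mov[mask], rcond=None)
    return np.mean(np.abs(X[mask] @ coef - mov[mask])), int(mask.sum())

delta = H - A
bands = [delta > 40, (delta >= -40) & (delta <= 40), delta < -40]

overall_mae, _ = fit_mae(np.ones(n, dtype=bool))
# The "sum" of the subset models: a count-weighted average of the
# per-band errors, comparable to the single overall model's error.
per_band = [fit_mae(m) for m in bands]
combined_mae = sum(mae * cnt for mae, cnt in per_band) / n
print(f"overall MAE: {overall_mae:.2f}  banded MAE: {combined_mae:.2f}")
```

In-sample the banded fit can only look better or equal; the interesting comparison (and the one where the subset models under-performed above) is on held-out games.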