Friday, April 29, 2011

Trueskill

We'll now turn our attention to the first of the "RPI Alternatives" we will evaluate. Like RPI, all of these approaches make use of only the won-loss record in rating teams.

The first approach is Microsoft's "Trueskill" rating system. The Trueskill system was developed at Microsoft Research to rank players on XBox Live, a task that poses several unusual challenges. First, there are many players, so most players have never played against each other. Second, many of the games on XBox Live are multiplayer or team games, and players may join or drop out at different points during play. Finally, Trueskill is in some sense intended to be a predictive system. Its main use is to predict how competitive a game will be between two players.

As a rating system, Trueskill also has a couple of unique features. For each player it calculates not only a rating, but also an explicit uncertainty in that rating. After one game, Trueskill will provide a rating for the teams involved, but the uncertainty in those ratings will be high. As more games are played, the uncertainties shrink.

What happens when a player performs better or worse than expected? Trueskill can either move the player's rating, or change the uncertainty, or some of both. There is explicit provision in the Trueskill system to tune this tradeoff.

Finally, Trueskill also accommodates games that can end in a draw.

The Trueskill system is based upon Bayesian inference. The fundamental ideas are not hard to grasp, but the details can be daunting. Fortunately, Jeff Moser has provided a very clear tutorial on Trueskill, which you can find here. Jeff also provides an implementation of Trueskill in C#, and was instrumental in helping me create the Trueskill implementation in Lisp, which you can download here.

College basketball is simpler than XBox Live in that we don't have to worry about multiplayer games or players dropping out before a game is finished. So to test Trueskill for predicting college basketball games, I was able to implement the simplest Trueskill algorithm: one that deals with just two player games.

As mentioned above, Trueskill accommodates draws. This is nice, since we showed earlier with RPI that treating some games as draws improved prediction accuracy. It's worth noting that Trueskill treats draws somewhat differently than we did with RPI. In the RPI tweak, we set an MOV cutoff and ignored games that fell below that cutoff. In Trueskill, there is a similar cutoff that identifies drawn games, and these games are ignored when updating a team's rating. However, Trueskill also uses the likelihood of a draw to control the impact a non-drawn game has on a team's rating. For example, if draws are very likely, then a non-drawn game has a big impact upon ratings. This makes intuitive sense -- if it's very hard to get a decisive win, then a decisive win should be strong evidence that the winning team is better than the losing team.
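To make the effect of draw likelihood concrete, here is a minimal Python sketch of the commonly published two-player Trueskill update for a decisive (non-drawn) game. It is not the Lisp implementation mentioned above. The function names, the default beta of 25/6, and the starting rating of 25 with uncertainty 25/3 are assumptions taken from the usual Trueskill defaults, and the dynamics term (tau) is left out for brevity.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def draw_margin(draw_probability, beta):
    """Performance-difference margin implied by a draw probability
    for a game between two single players (here, two teams)."""
    return STD_NORMAL.inv_cdf((draw_probability + 1) / 2) * sqrt(2) * beta


def update_after_win(winner, loser, draw_probability, beta=25 / 6):
    """Return updated (rating, uncertainty) pairs for winner and loser.

    winner and loser are (mu, sigma) tuples. beta=25/6 is an assumed
    default, not a value taken from the post.
    """
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)
    eps = draw_margin(draw_probability, beta) / c
    t = (mu_w - mu_l) / c
    # v scales the shift in the ratings, w the shrinkage of the uncertainties.
    v = STD_NORMAL.pdf(t - eps) / STD_NORMAL.cdf(t - eps)
    w = v * (v + t - eps)
    new_winner = (mu_w + sigma_w**2 / c * v,
                  sigma_w * sqrt(1 - sigma_w**2 / c**2 * w))
    new_loser = (mu_l - sigma_l**2 / c * v,
                 sigma_l * sqrt(1 - sigma_l**2 / c**2 * w))
    return new_winner, new_loser


if __name__ == "__main__":
    fresh = (25.0, 25 / 3)  # assumed starting rating and uncertainty
    for p in (0.10, 0.44):
        (mu_w, sigma_w), _ = update_after_win(fresh, fresh, p)
        print(f"draw probability {p:.2f}: winner mu={mu_w:.2f}, sigma={sigma_w:.2f}")
```

For two fresh teams, raising the draw probability makes the same win move the winner's rating further, which is the behavior described above.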

To make use of draws, we need to tell Trueskill what point differential counts as a draw, as well as how likely draws are to occur. I used the last three seasons of games to determine the likelihood that a game will be decided by "N" or fewer points:

| Point Spread (N or fewer) | Likelihood |
|---|---|
| 1 Point | 4.5% |
| 2 Points | 10.5% |
| 3 Points | 17.0% |
| 4 Points | 22.3% |
| 5 Points | 28.7% |
| 6 Points | 34.0% |
| 7 Points | 39.1% |
| 8 Points | 44.0% |
| 9 Points | 49.3% |
| 10 Points | 54.2% |
| 11 Points | 58.6% |
| 12 Points | 62.7% |
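A table like this one can be computed directly from a list of final margins. The sketch below is one way to do it in Python. The function name and the sample margins are made up for illustration and are not the three seasons of data used for the table.

```python
def close_game_likelihood(margins, max_spread):
    """Map each spread N (1..max_spread) to the fraction of games
    decided by N or fewer points."""
    total = len(margins)
    return {
        n: sum(1 for m in margins if abs(m) <= n) / total
        for n in range(1, max_spread + 1)
    }


if __name__ == "__main__":
    # Hypothetical margins (home score minus away score), for illustration only.
    sample_margins = [1, -2, 3, 10, -12, 5, 8, 2]
    for spread, share in close_game_likelihood(sample_margins, 3).items():
        print(f"{spread} point(s) or fewer: {share:.1%}")
```

The sign of each margin is ignored, so it does not matter whether the home or the away team won the close game.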

We can then test with draws set at various levels to find the best performance. We test here just as we did with RPI -- our main measure of performance is the error in predicted Margin of Victory (MOV):

| Draw Cutoff | MOV Error |
|---|---|
| 2 Points | 11.29 |
| 3 Points | 11.24 |
| 4 Points | 11.19 |
| 5 Points | 11.16 |
| 6 Points | 11.13 |
| 7 Points | 11.13 |
| 8 Points | 11.09 |
| 9 Points | 11.10 |
| 10 Points | 11.13 |

As this table shows, the best performance is achieved with an amazingly high level of draws -- 8 points, which drops 44% of the played games from consideration. The performance of the Trueskill algorithm at this setting is also significantly better than our best RPI algorithm:

| Predictor | % Correct | MOV Error |
|---|---|---|
| 1-Bit | 62.6% | 14.17 |
| RPI (infinite depth, mov-cutoff=1) | 72.1% | 11.30 |
| Trueskill (draw=8 points) | 72.8% | 11.09 |

To this point, we've made no use of the uncertainty measure that the Trueskill algorithm provides. And it isn't clear how to make use of it. We're using the Trueskill ratings for each team as inputs to a linear regression, resulting in an equation that looks something like this:

MOV = 1.514*HTrueskill - 1.462*ATrueskill + 2.449

Adding the uncertainty measures to this regression doesn't seem like it will add any useful information for determining the MOV, and indeed, when they are added they get optimized out of the regression. Another approach is exemplified by Microsoft's leaderboard, which uses a "conservative" strength estimate calculated by subtracting three times the uncertainty measure from the rating. Using this estimate as the input to our predictor also fails to improve our predictive accuracy. Similar experiments, such as adding the uncertainty to the home team's rating and subtracting it from the away team's, all fail to provide any improvement. So while the explicit uncertainty might prove useful for a more sophisticated predictor, it doesn't appear to provide any value for a simple linear regression.
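For readers who want to see the two calculations side by side, here is a small Python sketch. The coefficients are copied from the approximate regression equation above. The function names, and the example rating of 25 with uncertainty 25/3, are assumptions for illustration.

```python
def predict_mov(home_rating, away_rating):
    """Predicted margin of victory (home minus away), using the
    approximate regression coefficients quoted in the post."""
    return 1.514 * home_rating - 1.462 * away_rating + 2.449


def conservative_rating(rating, uncertainty):
    """Leaderboard-style conservative estimate: rating minus three
    times the uncertainty."""
    return rating - 3 * uncertainty


if __name__ == "__main__":
    # Two hypothetical teams with identical ratings: the predicted
    # margin is then mostly the regression's built-in home edge.
    print(round(predict_mov(25.0, 25.0), 3))
    print(round(conservative_rating(25.0, 25 / 3), 3))
```

Feeding conservative_rating values into predict_mov in place of the raw ratings would require refitting the regression, since the coefficients above were fit to the plain ratings.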

Another area of tweaking we can look at for Trueskill is home court advantage (HCA). As with RPI, the linear regression effectively adjusts for the home court advantage, but in this case manual adjusting might provide some additional benefits. If we adjust for HCA by (say) subtracting points from the home team's score, it will affect which games are considered "draws" by the Trueskill algorithm, and this may lead to improved accuracy. Here is the performance with the HCA set to various values, including the Dick Vitale Approach:

| Predictor | % Correct | MOV Error |
|---|---|---|
| 1-Bit | 62.6% | 14.17 |
| Trueskill (draw=8 points) | 72.8% | 11.09 |
| Trueskill (draw=8 points, HCA=2.5) | 72.5% | 11.13 |
| Trueskill (draw=8 points, HCA=3.5) | 72.4% | 11.13 |
| Trueskill (draw=1 point, Vitale) | 70.0% | 12.09 |
| Trueskill (draw=8 points, Vitale) | 70.8% | 11.80 |

I could find no adjustment for HCA which improved the predictive performance.
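The way a fixed HCA changes which games count as draws can be sketched in a few lines of Python. This is only an illustration of the idea: the function name is made up, and it assumes a game is a draw when the adjusted margin is at or below the cutoff.

```python
def classify_game(home_score, away_score, draw_cutoff, hca=0.0):
    """Classify a game as 'home', 'away', or 'draw' after docking
    hca points from the home team's score."""
    adjusted_margin = (home_score - hca) - away_score
    if abs(adjusted_margin) <= draw_cutoff:
        return "draw"
    return "home" if adjusted_margin > 0 else "away"


if __name__ == "__main__":
    # A hypothetical 70-60 home win with an 8-point draw cutoff.
    print(classify_game(70, 60, draw_cutoff=8))           # decisive home win
    print(classify_game(70, 60, draw_cutoff=8, hca=2.5))  # now inside the cutoff
```

The adjustment works in both directions: a 7-point away win is a draw without the HCA but becomes a decisive away win once 2.5 points are docked from the home team.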

The other tweaks we tried for RPI (such as weighting recent games more heavily) do not easily apply to the Trueskill algorithm.

2 comments:

I'd been toying around with an implementation of Trueskill in ruby which allows 'partial play adjustment' -- perhaps that might be a more elegant adjustment to account for HCA than docking points? Saying "away teams perform at 90%" might account better for low-scoring games than docking a (fixed) number of points.