Thursday, April 14, 2011

The Ratings Percentage Index (RPI)

As mentioned previously, one of the simplest and most accessible pieces of information that we can use for prediction is a team's won-loss record. Naively, we might suppose that when Team A with a 6-2 record plays Team B with a 2-6 record, Team A will likely beat Team B.

But there's (at least) one significant problem with this supposition: we don't have any idea how Team A compiled its winning record, or Team B its losing record. It could well be that Team A is a Big East team that played 8 patsies at home, while Team B is a mid-major team that has played 8 road games against the best teams in the country, and managed wins against both Duke and Kansas. In that case we wouldn't be so certain that Team A could beat Team B.

The most widely known rating based on won-loss records is the Ratings Percentage Index (RPI). RPI tries to address the shortcoming of using won-loss records by rating each team not only by its winning percentage, but also by the winning percentages of its opponents. The assumption here is that opponents with good won-loss records are tougher opposition than those with poor records, so we should value wins (and losses!) against those opponents more highly.

Of course, you can extend this reasoning another level. A team's opponents have winning records -- so what? Again, we don't know if they compiled those records by playing good teams or bad teams.

And, in fact, the RPI addresses this concern by extending the rating another level, so that the RPI for a team is based upon:

The team's winning percentage (WP)

The team's opponents' winning percentage (OWP), and

The team's opponents' opponents' winning percentage (OOWP)

The RPI stops at this level, possibly because the NCAA had run out of the letter 'O'.

Previously we noted the significant impact of the home court advantage (HCA) on college basketball games. The RPI accounts for this, too, by weighting a team's home wins less than its road wins, and its road losses less than its home losses. The exact calculation of RPI is complicated, and I refer the interested reader to the Wikipedia article for a more detailed explanation. Studying that explanation for several days should lead to total enlightenment -- regarding RPI, anyway.
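For readers who would rather read code than an encyclopedia entry, here is a rough Python sketch of the calculation as it is commonly described. Treat it as an illustration under stated assumptions, not as the code behind the numbers in this post. The 25/50/25 blend of WP, OWP and OOWP and the 0.6/1.4 home/road weights are the usual published values for men's basketball, but they are assumptions here, as are the (home, away, winner) game tuples, the function names, and the choice to ignore neutral-site games.

```python
from statistics import mean

# ASSUMPTION: commonly cited RPI weights; none of these values come from this post.
W_WP, W_OWP, W_OOWP = 0.25, 0.50, 0.25
HOME_WIN, ROAD_WIN = 0.6, 1.4     # a home win counts less than a road win
HOME_LOSS, ROAD_LOSS = 1.4, 0.6   # a road loss counts less than a home loss

# A game is a hypothetical (home, away, winner) tuple. Neutral sites are ignored.

def weighted_wp(team, games):
    """Winning percentage with home/road weighting (used for the WP term only)."""
    wins = losses = 0.0
    for home, away, winner in games:
        if team not in (home, away):
            continue
        at_home = team == home
        if winner == team:
            wins += HOME_WIN if at_home else ROAD_WIN
        else:
            losses += HOME_LOSS if at_home else ROAD_LOSS
    total = wins + losses
    return wins / total if total else 0.0

def plain_wp(team, games, exclude=None):
    """Unweighted winning percentage, optionally skipping games against `exclude`."""
    wins = played = 0
    for home, away, winner in games:
        if team not in (home, away):
            continue
        if exclude is not None and exclude in (home, away):
            continue
        played += 1
        wins += winner == team
    return wins / played if played else 0.0

def opponents(team, games):
    """One entry per game played, so repeat opponents count more than once."""
    return [away if home == team else home
            for home, away, _ in games if team in (home, away)]

def owp(team, games):
    """Average opponent winning percentage, leaving out games against `team`."""
    opps = opponents(team, games)
    return mean([plain_wp(o, games, exclude=team) for o in opps]) if opps else 0.0

def oowp(team, games):
    """Average of the opponents' own OWP values."""
    opps = opponents(team, games)
    return mean([owp(o, games) for o in opps]) if opps else 0.0

def rpi(team, games):
    return (W_WP * weighted_wp(team, games)
            + W_OWP * owp(team, games)
            + W_OOWP * oowp(team, games))

if __name__ == "__main__":
    # Made-up three-team schedule, purely for illustration.
    schedule = [("A", "B", "A"), ("B", "C", "B"), ("C", "A", "A")]
    for t in "ABC":
        print(t, round(rpi(t, schedule), 3))
```

Note that in this sketch the home/road weighting touches only the team's own WP; the opponent terms use plain records, with games against the team being rated left out of OWP so that a team is not penalized for beating its own opponents.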

So how effective is RPI as a predictor? Using my standard methodology, I get this performance:

In this plot, each point represents a game. The Y axis is the RPI of the home team, and the X axis is the RPI of the away team. The color of each point indicates the winner of the game -- red for a home win, blue for an away win. The diagonal line splits the field into games where the home team had the higher RPI (above the line) and games where the away team had the higher RPI (below the line). While there are more blue points below the line and more red points above the line, the correlation is not overwhelming.
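The diagonal split amounts to the simplest possible use of a rating: pick whichever team has the higher RPI and count how often that pick is right. Here is a minimal Python sketch of that scoring. It is not necessarily the methodology used for the numbers in this post; the function name, the tuple layout and the sample ratings are all hypothetical.

```python
def pick_accuracy(games):
    """Fraction of games won by the higher-rated team.

    `games` is an iterable of (home_rating, away_rating, home_won) tuples.
    Games with equal ratings sit on the diagonal and are skipped (an assumption).
    """
    correct = total = 0
    for home_rating, away_rating, home_won in games:
        if home_rating == away_rating:
            continue
        total += 1
        picked_home = home_rating > away_rating
        correct += picked_home == home_won
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Made-up ratings and results, for illustration only.
    sample = [
        (0.60, 0.50, True),   # home rated higher, home won: correct pick
        (0.55, 0.60, True),   # away rated higher, home won: wrong pick
        (0.40, 0.62, False),  # away rated higher, away won: correct pick
        (0.50, 0.50, True),   # tie in ratings: no pick
    ]
    print(pick_accuracy(sample))
```

A scoring like this ignores where the game is played, so it cannot capture the home court advantage except through whatever the rating itself encodes.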

But wait. I tricked you a bit back in the second paragraph of this posting, when I claimed that the won-loss record of a team isn't a good predictor because it doesn't take into account the quality of opponents. Is that true? I've always heard that claim, and it seems reasonable. But perhaps we should take a minute to check it. If we use winning percentages as inputs for a predictor, we get the following performance:

Predictor    % Correct    MOV Error
Naive        50%          14.5
1-Bit        62.6%        14.17
WP           72.4%        11.65
RPI          73.2%        11.62

Interesting! RPI is a better predictor than just the winning percentage, but not by a huge margin.

So RPI provides a significant improvement in prediction over the 1-Bit Predictor. But there are several obvious shortcomings in RPI. Can it be improved? In the next few postings I'll examine those shortcomings and run experiments to see if addressing them improves performance. Eventually, we'll also consider other schemes that make use of only won-loss records and see if those provide any significant advantage over RPI.