Saturday, January 31, 2015

I've removed my previous posts on Strength of Schedule and adjusted statistics from the blog because I now believe that the method I was proposing does not always converge -- in some cases it appears that the adjusted statistics will end up with one team at 1.0 and all the others at 0.0 -- not a particularly useful approach!

That will teach me to post before I've implemented -- although it also goes to show that the process of experimentation is bound to be fraught with problems. I usually avoid describing all the pitfalls, but the process of discovery is rarely as clean as the final product.

UPDATE

After some more work, I think I understand the conditions under which the approach converges, and I will go back, fix the original postings and put them up again.

So to some extent this measure is biased by the quality of a team's opponents' opponents' opponents (if you can follow that!). This bias may not be significant -- certainly by the middle of the season, the third-level opponents for a given team are a pretty big set, so there's going to be overlap between most teams. But each team's opponents will be dominated by their conference teams, and there are sizable differences in strength between conferences, so it may be significant.

The obvious way to eliminate this bias is to use the adjusted statistic in the SoS calculation. The whole purpose of the adjusted statistic is to normalize away these biases.

However, we've now created a circular definition! The adjusted statistic for a team depends upon its opponents' opponents' adjusted statistics -- and their adjusted statistics depend upon other teams, and so on.

But just because the definition is circular doesn't mean we can't compute its value. One approach is to guess some initial value for the adjusted statistic (say, the unadjusted value) and then recalculate all the adjusted statistics. The values will change, and then we can repeat until we converge on an answer.

NOTE: A system of non-linear equations is not guaranteed to have a solution. For this and other reasons, the iterative approach described here is not guaranteed to converge for every system. In this case, I believe that the solution will converge if the system has "full support", which for NCAA basketball should be true after about 500 games. Testing seems to bear this out, because it works for me for data from the past 5 seasons, but I haven't proven that it converges. It also turns out that the normalization step shown below isn't necessary for convergence, but I'm leaving it in below because the example doesn't work otherwise.

Let's take a look at how this works. For this experiment, we have a small league of three teams. They've all played each other once, and here are their unadjusted statistics for 3 point shooting:

| Team | Unadj |
| --- | --- |
| Blue | 34% |
| Gold | 24% |
| Silver | 30% |

Now we will calculate the Strength of Schedule for each team. Since each team has played each other once, their opponents' opponents are each of the other two teams. (I'll wait while you confirm that.) We'll use the Unadjusted statistic as our first guess for the Adjusted statistic, so the SoS is just the average of the other two teams.

| Team | Unadj | SoS(1) |
| --- | --- | --- |
| Blue | 34% | 0.27 |
| Gold | 24% | 0.32 |
| Silver | 30% | 0.29 |

The Adjusted Statistic is then the Unadjusted Statistic divided by the SoS.

| Team | Unadj | SoS(1) | Adj(1) |
| --- | --- | --- | --- |
| Blue | 34% | 0.27 | 1.26 |
| Gold | 24% | 0.32 | 0.75 |
| Silver | 30% | 0.29 | 1.03 |

Finally, we take the Adjusted Statistic and normalize it so that it sums to one to create the Normalized Adjusted Statistic:

| Team | Unadj | SoS(1) | Adj(1) | Nadj(1) |
| --- | --- | --- | --- | --- |
| Blue | 0.34 | 0.27 | 1.26 | 0.41 |
| Gold | 0.24 | 0.32 | 0.75 | 0.25 |
| Silver | 0.30 | 0.29 | 1.03 | 0.34 |

Now let's repeat that calculation again for a second iteration.

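| Team | Unadj | SoS(1) | Adj(1) | Nadj(1) | SoS(2) | Adj(2) | Nadj(2) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Blue | 0.34 | 0.27 | 1.26 | 0.41 | 0.29 | 1.16 | 0.43 |
| Gold | 0.24 | 0.32 | 0.75 | 0.25 | 0.38 | 0.64 | 0.24 |
| Silver | 0.30 | 0.29 | 1.03 | 0.34 | 0.33 | 0.91 | 0.34 |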

The Normalized Adjusted Statistics still changed a little bit, so let's do a third iteration.

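| Team | Unadj | SoS(1) | Adj(1) | Nadj(1) | SoS(2) | Adj(2) | Nadj(2) | SoS(3) | Adj(3) | Nadj(3) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Blue | 0.34 | 0.27 | 1.26 | 0.41 | 0.29 | 1.16 | 0.43 | 0.29 | 1.19 | 0.44 |
| Gold | 0.24 | 0.32 | 0.75 | 0.25 | 0.38 | 0.64 | 0.24 | 0.38 | 0.63 | 0.23 |
| Silver | 0.30 | 0.29 | 1.03 | 0.34 | 0.33 | 0.91 | 0.34 | 0.33 | 0.90 | 0.33 |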

Still a little bit of change, so we do another iteration.

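| Team | Unadj | SoS(1) | Adj(1) | Nadj(1) | SoS(2) | Adj(2) | Nadj(2) | SoS(3) | Adj(3) | Nadj(3) | SoS(4) | Adj(4) | Nadj(4) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Blue | 0.34 | 0.27 | 1.26 | 0.41 | 0.29 | 1.16 | 0.43 | 0.29 | 1.19 | 0.44 | 0.28 | 1.21 | 0.44 |
| Gold | 0.24 | 0.32 | 0.75 | 0.25 | 0.38 | 0.64 | 0.24 | 0.38 | 0.63 | 0.23 | 0.38 | 0.62 | 0.23 |
| Silver | 0.30 | 0.29 | 1.03 | 0.34 | 0.33 | 0.91 | 0.34 | 0.33 | 0.90 | 0.33 | 0.33 | 0.90 | 0.33 |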

And voila! The Normalized Adjusted Statistics (and the SoS) have converged (at least to two decimal places). And a good thing, too, because that table was getting very wide. If you compare the adjusted statistics to the unadjusted statistics, you see that Blue is a much better 3 point shooting team than the unadjusted statistic shows, Gold is slightly worse, and Silver slightly better.
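The whole walk-through can be sketched in a few lines of Python. This is a minimal sketch rather than my actual implementation: the names `iterate_adjusted` and `pools` are made up for illustration, and `pools` simply hard-codes the fact that in this three-team league each team's SoS is taken over the other two teams.

```python
def iterate_adjusted(unadj, pools, iterations=4):
    """Repeat the SoS -> Adj -> Nadj cycle a fixed number of times."""
    # First guess for the adjusted statistic is the unadjusted value.
    current = dict(unadj)
    for _ in range(iterations):
        # SoS: average the current values over each team's pool.
        sos = {team: sum(current[o] for o in pool) / len(pool)
               for team, pool in pools.items()}
        # Adjusted statistic: unadjusted value divided by SoS.
        adj = {team: unadj[team] / sos[team] for team in unadj}
        # Normalize so the adjusted values sum to one.
        total = sum(adj.values())
        current = {team: value / total for team, value in adj.items()}
    return current


# Unadjusted 3 point percentages from the example league.
unadj = {"Blue": 0.34, "Gold": 0.24, "Silver": 0.30}

# Hypothetical structure: the teams that feed each team's SoS.
pools = {
    "Blue": ["Gold", "Silver"],
    "Gold": ["Blue", "Silver"],
    "Silver": ["Blue", "Gold"],
}

result = iterate_adjusted(unadj, pools, iterations=4)
for team, value in result.items():
    print(team, round(value, 2))
```

Four iterations reproduce the last column of the table: roughly 0.44 for Blue, 0.23 for Gold and 0.33 for Silver. A real implementation would presumably loop until the values stop changing rather than for a fixed count.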

So there you have it -- a method for adjusting a non-symmetric statistic to account for strength of schedule, and a way to compute it. In the next post, I'll share some thoughts on how to create an efficient implementation of this method.

Wednesday, January 28, 2015

In a previous post, I talked about why Strength of Schedule (SoS) is important to interpreting a team's statistical performance, and I briefly described the standard SoS approach taken by Ken Pomeroy, SevenOvertimes, and others.

To briefly review, the standard approach for calculating SoS for a statistic like winning percentage (WP) for a team (T) is to average the winning percentage of all of a team's opponents:

`SoS(T) = (1/n) sum_(i="opponents"(T))^n WP(i)`

To use this SoS measure to interpret the original statistic, we could create an adjusted statistic:

`WP_"adj"(T) = WP(T) * SoS(T)`
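As a small illustration of those two formulas, here is a sketch in Python. The three teams, their winning percentages and the `opponents` structure are all invented for the example; only the averaging and the multiplication come from the formulas above.

```python
def simple_sos(team, opponents, wp):
    """Average winning percentage of a team's opponents."""
    opps = opponents[team]
    return sum(wp[o] for o in opps) / len(opps)


def adjusted_wp(team, opponents, wp):
    """Winning percentage scaled by the simple Strength of Schedule."""
    return wp[team] * simple_sos(team, opponents, wp)


# Hypothetical data: winning percentages and who each team has played.
# An opponent played twice would be listed twice.
wp = {"A": 0.75, "B": 0.50, "C": 0.25}
opponents = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

print(simple_sos("A", opponents, wp))   # (0.50 + 0.25) / 2 = 0.375
print(adjusted_wp("A", opponents, wp))  # 0.75 * 0.375 = 0.28125
```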

This works fine for symmetric statistics like winning percentage, where a win for you means a loss for your opponent. Unfortunately, winning percentage (along with other won-loss stats) is about the only statistic with this property. Most statistics are like three point percentage, where a team's performance is mostly unrelated to how well its opponent does the same thing. Instead, there's an offense-defense aspect to the statistic, and to interpret the statistic you want to know how well the opponent does at defending the statistic. However, there's not usually a corresponding defense statistic (e.g., "3 PT defense"), so we have to derive the opponent's defensive strength by looking at how well the opponent has done in stopping other teams. So in the case of three point percentage, we want to know how well a team's opponents have done at stopping the three pointer.

We calculate the Strength of Schedule by averaging the performance of the team's opponents' opponents.
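Written as a formula, one plausible reading of that average looks like the following. Two things here are my assumptions for the sketch: the notation O(T) for the set of teams that T has played, and the choice to weight each opponent equally rather than pooling all the opponents' opponents into one big average.

```latex
\[
\mathrm{SoS}(T, S) \;=\; \frac{1}{|O(T)|} \sum_{i \in O(T)} \; \frac{1}{|O(i)|} \sum_{j \in O(i)} S(j)
\]
```

Here S(j) is the statistic for team j, so the inner average is what opponent i has allowed the teams it played, and the outer average runs over T's opponents.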

For example, suppose that Louisville is shooting 32% from the arc. If the teams Louisville has played have held all their opponents to an average three point percentage of only 24%, then Louisville's 32% is more impressive. Conversely, if Louisville's opponents have allowed the teams they played to average 48%, then Louisville's 32% looks less impressive.

Note that this SoS measure is backwards from the typical one used for symmetric statistics. In this case, a smaller SoS indicates tougher competition. (This all assumes that "bigger is better" for our statistic. If we have a statistic where you want to have a low number, such as turnovers, everything flips.)

We can capture this analytically as an adjusted statistic (using S for a generic statistic, and assuming that for S bigger is better):

`S_"adj"(T) = (S(T))/(SoS(T,S))`

To return to the Louisville example, if the strength of schedule is 24%, then Louisville's adjusted 3PT% is 1.33. But if the strength of schedule is 48%, then Louisville's adjusted 3PT% is only 0.67.
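The Louisville arithmetic is easy to check with a couple of lines of Python. The function name `adjusted_stat` is just a label for this sketch, and it assumes a "bigger is better" statistic as above.

```python
def adjusted_stat(stat, sos):
    """Adjusted statistic for a 'bigger is better' stat: S(T) / SoS(T, S)."""
    return stat / sos


louisville_3pt = 0.32

# Opponents have held their other opponents to 24%: tougher competition.
tough = adjusted_stat(louisville_3pt, 0.24)

# Opponents have allowed 48%: weaker competition.
weak = adjusted_stat(louisville_3pt, 0.48)

print(round(tough, 2), round(weak, 2))
```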

As should be obvious from that example, these adjusted statistics don't have any meaning. They're just a number, where bigger is better. But they can be used to compare teams, and may be more useful than the original statistic for prediction because they provide a common measure even when teams haven't faced the same opponents.

One problem that we haven't yet addressed is that this SoS measure only goes one level deep. Maybe Louisville's opponents held teams to 24% three point shooting because they played a bunch of teams that were terrible three point shooters. I'll address that in Part 3.

Tuesday, January 27, 2015

The usual method of predicting games is to look at a team's past performance -- usually expressed as a statistical value such as "winning percentage" -- and use that to estimate future performance. But this approach is problematic, because not all statistics are created equal. In the case of winning percentage, two teams with a winning percentage of 84% are not necessarily equivalent. Louisville's 16-3 record, with losses to #1 Kentucky, #4 Duke and #18 UNC, is not the same as Dayton's 16-3 record with losses to #17 Connecticut, Arkansas and Davidson.

This problem arises because college basketball is a case of incomplete pairwise comparison. If every team played every other team twenty times, by the end of the (admittedly long) season, winning percentage would be a pretty good measure of team strength. But that's never going to happen, so we need other ways to compensate for this weakness. One of the simplest is to calculate a "Strength of Schedule" and use that to interpret the statistic.

In its simplest form, Strength of Schedule (SoS) is calculated as the average of all a team's opponents in the same statistic. So if we were looking at "winning percentage", SoS would be calculated by averaging the winning percentage of all of a team's opponents. So Louisville might have a SoS of .57 (meaning its opponents have overall won 57% of their games) while Dayton has a SoS of .51 (meaning its opponents have only won 51% of their games). In light of this, we could then say that Louisville's 84% winning percentage is better than Dayton's 84% winning percentage.

There are several shortcomings with this definition of SoS.

First, it doesn't always make sense to measure Strength of Schedule using the same statistic. For example, suppose we're looking at "3 Pt Shooting Percentage". In this case, SoS would tell us how well our opponents shot the three-pointer. That doesn't make a lot of sense. How well our opponent shot the three doesn't affect how well we shot the three. In this case, we really want to know how well our opponent's opponents shot the three (if you can follow that thought). The simplistic form of SoS only makes sense for symmetric statistics -- where a plus for one team is automatically a minus for the other team -- such as winning percentage, where a win for you necessarily means a loss for the other team.

Even for symmetric statistics, there are problems with this view of SoS. One is that we've only pushed off the problem one level by looking at a team's opponents. To return to the previous example, Louisville's opponents seem better because they have better records than Dayton's opponents. But maybe that's just because Louisville's opponents themselves played weak teams. This is the problem that RPI tries to address by looking at opponents and opponents' opponents. Two layers is pretty good, and RPI is a much better metric than straight Winning Percentage. Of course, you can take it deeper.

In general, many of the more sophisticated rating systems (e.g., Massey) can be viewed as different approaches to extending Strength of Schedule as deep as possible. I'm not sure there's a "right" answer to measuring Strength of Schedule, but it seems clear that the general idea -- to adjust or interpret statistics based upon a team's opponents -- is valuable.

I've got a couple of topics I'm trying to write up, but in the meantime a quick Top Twenty:

| Rank | Team | Rating |
| --- | --- | --- |
| 1 | Kentucky | 31.67 |
| 2 | Ohio State | 30.29 |
| 3 | Virginia | 30.27 |
| 4 | Wisconsin | 30.05 |
| 5 | Duke | 30.04 |
| 6 | Utah | 29.80 |
| 7 | Gonzaga | 29.77 |
| 8 | Notre Dame | 29.72 |
| 9 | North Carolina | 29.58 |
| 10 | Louisville | 29.35 |
| 11 | Arizona | 29.35 |
| 12 | Villanova | 29.11 |
| 13 | Texas | 29.01 |
| 14 | West Virginia | 28.88 |
| 15 | Butler | 28.60 |
| 16 | Oklahoma | 28.52 |
| 17 | Michigan State | 28.44 |
| 18 | Baylor | 28.39 |
| 19 | Wichita State | 28.33 |
| 20 | Indiana | 28.29 |

The amazing thing here is that the gap between #1 and #2 is almost as big as the gap between #2 and #20. Kentucky's really way better than the rest of the field. And if not for Kentucky, we'd have a dead heat for the four #1 seeds for the tournament.

Overall, the Top Twenty has been pretty stable for the last three weeks. Some shuffling around, but only Wichita State is new. (Pushing off Maryland.)

Friday, January 23, 2015

For my basketball predictor, I could double the number of training examples by creating new examples with the Home and Away teams swapped. Would it make sense to do that?

That's an interesting thought.

The first concern is that you can't just swap Home and Away. No one is entirely sure why, but home teams play differently (and better) than away teams. In NCAA basketball, the Home Court Advantage (HCA) is between 4 and 5 points. What's more, all of a team's statistics are typically different for their home games. There's sometimes even a home court effect for neutral site games. In particular, the "home" team in NCAA Tournament games does better than you would expect -- even when you adjust for them being the better team. (In the Tournament, the better-seeded team in the matchup is the home team.) If you just blindly swap Home and Away, the newly created games are going to be bad training examples because they won't reflect what would happen in a real game with a swapped matchup.

But let's assume that there is no difference between Home and Away. In that case would it be valuable to swap Home and Away to create new training examples?

Let's look at a simple case. Suppose that our training examples consist of a strength rating for each team (S1, S2) and the margin of victory (MOV). Here's our very small training set:

S1, S2, MOV
20, 18, 2
14, 28, -14

Now let's augment the training set by adding new games where S1 and S2 are swapped and MOV is negated.

S1, S2, MOV
20, 18, 2
14, 28, -14
18, 20, -2
28, 14, 14

Based upon the new training data, we'll build the exact same model!

That happens because we didn't add any new information to our training set. We added new examples, but they didn't contain any new information. They just repeated existing information transformed in a way that was irrelevant to our model.
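Here's a quick way to check that claim. This is only a sketch under an assumption of mine: the model is a plain least-squares fit of MOV = a*S1 + b*S2 with no intercept, which may not be the model the original poster has in mind. The helper name `fit_no_intercept` is made up for the example.

```python
def fit_no_intercept(rows):
    """Least-squares fit of MOV = a*S1 + b*S2 via the 2x2 normal equations."""
    s11 = sum(s1 * s1 for s1, s2, mov in rows)
    s12 = sum(s1 * s2 for s1, s2, mov in rows)
    s22 = sum(s2 * s2 for s1, s2, mov in rows)
    t1 = sum(s1 * mov for s1, s2, mov in rows)
    t2 = sum(s2 * mov for s1, s2, mov in rows)
    # Solve [[s11, s12], [s12, s22]] @ [a, b] = [t1, t2] by Cramer's rule.
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - s12 * t2) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


# The two original games as (S1, S2, MOV).
original = [(20, 18, 2), (14, 28, -14)]

# Augment by swapping the teams and negating the margin.
augmented = original + [(s2, s1, -mov) for s1, s2, mov in original]

model_original = fit_no_intercept(original)
model_augmented = fit_no_intercept(augmented)
print(model_original, model_augmented)
```

Both fits come out as a = 1 and b = -1, that is, MOV = S1 - S2. Doubling the examples didn't move the coefficients at all.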

I suspect this will always happen if (1) the data transformation is perfect, and (2) the machine learning model can "see through" the transformation.

An example of modifying the training set where the data transformation is not perfect is boosting. In boosting, we duplicate SOME of the training examples, and that leads to a different model. If we duplicate all of the training examples, the transformation is perfect and there's no change to the learned model. It's only duplicating some of the examples that is key to boosting.

A situation where the machine learning model cannot "see through" the transformation is using non-linear transformations with a linear model. For example, if we have a data set where the data is arranged in a circle, a linear regression will not model the data well because it always produces a straight line. However, if we apply a polar coordinate transformation on the data the circle will become a line, and the linear regression will be more effective.
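To make the circle example concrete, here's a small sketch. The eight points, the radius of 2 and the helper names `fit_line` and `rmse` are all invented for illustration. It fits a straight line to the same points twice, once in (x, y) and once in (angle, radius).

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on (xs, ys)."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))


# Eight points evenly spaced on a circle of radius 2.
angles = [k * 2 * math.pi / 8 for k in range(8)]
xs = [2 * math.cos(a) for a in angles]
ys = [2 * math.sin(a) for a in angles]

# Straight-line fit in the original coordinates.
cartesian_error = rmse(xs, ys, *fit_line(xs, ys))

# Same points after a polar transformation: (theta, r).
thetas = [math.atan2(y, x) for x, y in zip(xs, ys)]
radii = [math.hypot(x, y) for x, y in zip(xs, ys)]
polar_slope, polar_intercept = fit_line(thetas, radii)
polar_error = rmse(thetas, radii, polar_slope, polar_intercept)

print(cartesian_error, polar_error)
```

In the original coordinates the line misses badly, while in polar coordinates the radius is constant, so the fitted line is flat at the radius and the error is essentially zero.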

My guess is that swapping Home & Away to create new training examples isn't an effective technique for NCAA Basketball. However, I encouraged the original poster to go ahead and give it a try. It's a fairly easy experiment, and who knows, my intuitions may be wrong!

Monday, January 5, 2015

I had an email today asking about my Top Twenty. I've been lax about doing that this year, but here are the ratings as of today:

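| Position | AP | Team | Rating |
| --- | --- | --- | --- |
| 1 | 1 | Kentucky | 31.70 |
| 2 | 22 | Ohio State | 30.72 |
| 3 | 2 | Duke | 30.62 |
| 4 | 3 | Virginia | 30.18 |
| 5 | 4 | Wisconsin | 30.00 |
| 6 | 18 | North Carolina | 29.80 |
| 7 | 13 | Notre Dame | 29.78 |
| 8 | 5 | Louisville | 29.61 |
| 9 | 6 | Gonzaga | 29.51 |
| 10 | 9 | Utah | 29.24 |
| 11 | 10 | Texas | 29.23 |
| 12 | 8 | Villanova | 29.06 |
| 13 | 14 | West Virginia | 29.05 |
| 14 | NR | Butler | 28.62 |
| 15 | 7 | Arizona | 28.61 |
| 16 | 16 | Oklahoma | 28.60 |
| 17 | NR | Illinois | 28.45 |
| 18 | NR | Michigan State | 28.42 |
| 19 | NR | Baylor | 28.40 |
| 20 | 11 | Maryland | 28.30 |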

I've included the AP rankings for comparison. Kentucky is very strong -- as much stronger than #2 as #2 is than #7. The PM likes Ohio State, North Carolina and Notre Dame much more than the poll. Conversely, it doesn't think as much of Arizona, Villanova or Maryland. Butler is probably the biggest dark horse in this list, although they did receive votes in the AP poll.

I've been wondering whether having more training data (i.e., additional seasons of games) would further improve my predictor. This is problematic, because I already have data back to when the 3 point shot was introduced in the 2009-2010 season, so I can't actually get any more usable data. But the question persisted, so I did a quick and dirty experiment to try to characterize how much improvement I'll see with additional data.

I trained a model on differing amounts of training data and tested it on the entire training set. Ideally, I'd do this as some sort of cross-fold validation, picking different slices of the data for training, but I didn't want to spend the time that would require, so I just did each trial once. So there's necessarily a lot of fuzziness in these results, but I still think the result is instructive. The plot of error versus amount of training data looks like this:

That's error along the Y axis and number of training examples along the X. You can see that error falls fairly steeply for the first 10K or so training examples and then begins to level off. (Although it continues to slowly decrease.) Eyeballing this chart suggests that additional data isn't likely to provide any big improvement.
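If you want to run the same kind of experiment on your own predictor, the shape of it is roughly as follows. This is a sketch on synthetic data, not my basketball data or my model: a one-variable line fit stands in for the predictor, and the helper names, the sizes and the noise level are all made up.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares line ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on (xs, ys)."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))


def learning_curve(xs, ys, sizes):
    """Train on the first n examples, then test on the entire set."""
    curve = []
    for n in sizes:
        slope, intercept = fit_line(xs[:n], ys[:n])
        curve.append((n, rmse(xs, ys, slope, intercept)))
    return curve


# Synthetic stand-in for game data: a noisy linear relationship.
rng = random.Random(0)
xs = [rng.uniform(-10, 10) for _ in range(2000)]
ys = [1.5 * x + rng.gauss(0, 5) for x in xs]

sizes = [10, 50, 250, 1000, 2000]
curve = learning_curve(xs, ys, sizes)
for n, error in curve:
    print(n, round(error, 3))
```

As in my experiment, each size is tried only once and the test set includes the training examples, so the numbers are fuzzy. Averaging over several random slices would smooth the curve.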

If you're building your own predictor, this suggests that you should try to get at least 15K games for training data. Depending upon how many games you throw out from the early season, that's around 3 full seasons of games.

This also shows the folly of trying to build a Tournament predictor based upon past Tournament games. At 63 games a year, you'd need about 238 years of Tournament results to get a decent error rate :-).
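The arithmetic behind that number, taking the 15K-game target from above:

```latex
\[
\frac{15{,}000 \text{ games}}{63 \text{ games per Tournament}} \approx 238 \text{ Tournaments}
\]
```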

This is just a quick note to say that this season I've been submitting predictions to The Prediction Tracker. This site tracks the performance of a number of rating / prediction sites. Current standings as of today look like this:

I optimize my predictor on RMSE, so I'm pleased to see that I have the best performance in that metric of the tracked predictors. I'm also doing the best of the predictors against the spread, although that performance is a little higher than I'd expect from my own testing so I won't be surprised if that trends down. It's interesting to note that my predictor is about middle-of-the-pack for predicting the winner straight up and also not very good on Mean Error.

I only submit picks once a week on Monday, so that hurts my performance a little bit. The predictions for the Saturday games are five or six days stale, which probably costs me 0.1 or so in RMSE. (But for all I know the other predictors have the same problem.)