On Luck, Talent, Parity and Predicting the Upper Bounds with Machine Learning

Introduction

I spent this past week at the European Conference on Machine Learning, where there was a workshop on Sports Analytics. I presented some of my previous #fancystats work on using machine learning to predict single games. The more I look into this, the more I realize just how much random chance (or "luck") plays a role in determining the outcome of an NHL game. But when I talk to people about this, they just cannot wrap their heads around the idea.

Often they will jump to mantras like "the best athletes make their own luck," but it is precisely because of the elite nature of the game, and how close in talent the teams are, that the little things outside a team's control often decide who wins: pucks bouncing off players into the net, or the puck going over the glass for a penalty. Luck also often comes in the form of a high shooting percentage sustained for extended periods. With so few events (goals), and games often decided by a single goal, it becomes easy to see how luck plays a large role in the results.

The amount of luck, due to parity, imposes an upper bound on how accurately machine learning can predict games in sports, and that is what I want to explore.

Background

In my first experiment I looked at predicting the outcome of NHL games: who will win and who will lose. I trained on 72% of the 2012 season games, using 14 features for each team, including advanced statistics such as PDO, Fenwick Close, and 5v5 Goals For/Against, as well as traditional statistics such as goals for, goals against, goal differential, and location. Regardless of what I tried, I was not able to achieve higher than ~60% accuracy in predicting NHL games. Further analysis showed that the traditional statistics were helping the prediction more than anything else. Specifically: location, cumulative goals against, and cumulative goal differential.

Then I looked at the observed standard deviation of win percentage in the NHL since the last lockout, which turns out to be 0.09. I then ran a Monte Carlo simulation, varying the proportion of luck versus skill required to win a game, until I found a simulated league as similar to the NHL as possible. The observed NHL turned out to resemble a league where 24% of results are determined by skill and 76% by random chance. This implies a theoretical upper limit of approximately 62% accuracy for the NHL.
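A minimal sketch of that Monte Carlo idea is below. The pairing scheme and the uniform skill ratings are my assumptions (the simulation details aren't spelled out above), but the ceiling arithmetic follows directly: a perfect model gets every skill-decided game right and half of the coin flips, so 0.24 + 0.76 / 2 = 0.62.

```python
import random
import statistics

def simulate_league(skill_pct, n_teams=30, n_games=82, seed=0):
    """Simulate one season: each game is decided by skill with
    probability skill_pct, otherwise by a coin flip. Returns the
    standard deviation of team win percentages."""
    rng = random.Random(seed)
    # Hypothetical choice: uniform latent skill ratings per team.
    skills = [rng.random() for _ in range(n_teams)]
    wins = [0] * n_teams
    games = [0] * n_teams
    for _ in range(n_games * n_teams // 2):
        a, b = rng.sample(range(n_teams), 2)
        if rng.random() < skill_pct:          # skill decides
            winner = a if skills[a] > skills[b] else b
        else:                                 # luck decides
            winner = a if rng.random() < 0.5 else b
        wins[winner] += 1
        games[a] += 1
        games[b] += 1
    return statistics.stdev(w / g for w, g in zip(wins, games))

def upper_bound(skill_pct):
    """A perfect model wins the skill-decided games and half of the
    luck-decided ones: p + (1 - p) / 2."""
    return skill_pct + (1 - skill_pct) / 2

print(upper_bound(0.24))  # 0.62, the NHL's approximate ceiling
```

You would then sweep `skill_pct` until `simulate_league` reproduces the observed 0.09 spread of win percentages.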

Experiment

My hypothesis in this experiment is that I can create a machine-learning classifier for any hockey league using those three traditional statistics to predict who will win a game with an accuracy that is better than the baseline and near the upper limits.

The first step is looking at a number of different leagues.

I wanted to capture different talent levels, ages and areas of the world so I went with the National Hockey League (NHL), American Hockey League (AHL), East Coast Hockey League (ECHL), Western Hockey League (WHL), Ontario Hockey League (OHL), Quebec Major Junior Hockey League (QMJHL), BC Hockey League (BCHL), Swedish Hockey League (SHL), Kontinental Hockey League (KHL), Czech Hockey League (ELH), and Australian Ice Hockey League (AIHL).

For each league I first looked at its schedule to calculate the cumulative statistics: location, goals for, goals against, and goal differential. I then formatted this into files readable by Weka for machine learning, and tried different machine learning algorithms to see the best classification rate I could achieve, using 10-fold cross-validation on a single season of data for each league. I also looked at each season's home-win rate, the percentage of games won by the home team, and used this as a simple baseline classifier to compare against. In almost all leagues, across many seasons, this regresses to about 55%.
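The feature construction can be sketched roughly as follows. The original work used Weka; this is an illustrative Python version, and the exact encoding (home-minus-away differences) is my assumption, since the post only names the statistics.

```python
def build_features(schedule):
    """Turn a list of (home, away, home_goals, away_goals) results into
    one feature row per game: the home team's pre-game cumulative edge
    in goals against and goal differential, plus the label. Location is
    implicit because every row is taken from the home side."""
    ga = {}    # cumulative goals against, per team
    diff = {}  # cumulative goal differential, per team
    rows = []
    for home, away, hg, ag in schedule:
        for t in (home, away):
            ga.setdefault(t, 0)
            diff.setdefault(t, 0)
        rows.append({
            "ga_edge": ga[away] - ga[home],   # fewer goals against is good
            "diff_edge": diff[home] - diff[away],
            "home_win": hg > ag,
        })
        # Update running totals *after* emitting the row, so features
        # only use information available before puck drop.
        ga[home] += ag
        ga[away] += hg
        diff[home] += hg - ag
        diff[away] += ag - hg
    return rows

rows = build_features([
    ("TOR", "MTL", 4, 1),
    ("MTL", "TOR", 2, 3),
])
print(rows[1])  # second game's features come from the first game's totals
```

From here the rows can be written out as a Weka-readable ARFF/CSV file.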

Since a number of classification algorithms were surveyed, no tuning was done. Initial feedback suggests the reported classifier numbers could be increased by 1-2% by tuning the SMO, NN, or Logistic Regression.

I then looked at the parity of each league: the observed win percentage of every team, going back as far as reasonably possible. Some leagues only go back a few years, such as the KHL, which has only existed since 2008; the AIHL has the same issue. For leagues that have existed for a while (WHL, OHL, QMJHL) I tried to go back to around 1996-1997 (200-300 observed team seasons). This gave me the observed win percentage of every team in that period, from which I could calculate the standard deviation. I then ran a Monte Carlo simulation to calculate the approximate upper limit for machine learning, so I could compare my classifier against that limit.

There are two things we have to acknowledge.

The first is that when looking at the number of team seasons used to calculate the standard deviation of win percentages, we only have 200-300 at most. This is a small sample size.

The second is that, because we have so few observations, we can't confirm the distribution, so in each league we have to assume it is a binomial distribution (it could be a beta distribution for all we know; Nick Emptage will be exploring this more on PuckPrediction).
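One thing the binomial assumption does buy us is a quick sanity check, which I'll add here: in an all-luck league every game is a coin flip, so a team's win percentage over n games has a standard deviation of sqrt(0.25 / n). For an 82-game NHL season that is about 0.055, well below the observed 0.09, so skill must account for part of the spread.

```python
import math

def luck_stdev(n_games):
    """Std dev of win% for a team whose games are pure coin flips:
    binomial with p = 0.5, so the sd of the proportion is
    sqrt(p * (1 - p) / n) = sqrt(0.25 / n)."""
    return math.sqrt(0.25 / n_games)

# An 82-game all-luck NHL season:
print(round(luck_stdev(82), 3))  # 0.055, vs. the observed 0.09
```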

Results

| League | NHL | AHL | ECHL | WHL | OHL | QMJHL |
|---|---|---|---|---|---|---|
| Trained Season stdev | 0.109 | 0.07 | 0.102 | 0.15 | 0.134 | 0.166 |
| Obs win stdev | 0.09 | 0.086 | 0.11 | 0.141 | 0.143 | 0.146 |
| # Teams in MC | 30 | 30 | 30 | 22 | 20 | 18 |
| Seasons in Obs Data | 240 | 231 | 238 | 324 | 298 | 283 |
| Gms/Team in MC | 82 | 76 | 76 | 72 | 68 | 68 |
| Upper Bound | 62% | 60.50% | 65% | 72% | 71.50% | 72.50% |
| Classifier Trained # Gms | 512 | 1140 | 720 | 792 | 680 | 612 |
| Classifier % | 59.80% | 52.58% | 58.70% | 63.07% | 63.60% | 65.52% |
| Limit Differential | -2.20% | -7.92% | -6.30% | -8.93% | -7.90% | -6.98% |
| Home Win% | 56.80% | 52.50% | 55.90% | 55.50% | 55.50% | 53.80% |
| Baseline Differential | 3.00% | 0.08% | 2.80% | 7.57% | 8.10% | 11.72% |
| Classifier | Voting | Simple Log | Simple Log | Logistic | Logistic w/ Bagging | NaiveBayes |

| League | BCHL | SHL | KHL | ELH | AIHL |
|---|---|---|---|---|---|
| Trained Season stdev | 0.178 | 0.132 | 0.143 | 0.089 | 0.15 |
| Obs win stdev | 0.155 | 0.115 | 0.137 | 0.119 | 0.191 |
| # Teams in MC | 16 | 12 | 26 | 14 | 9 |
| Seasons in Obs Data | 165 | 204 | 120 | 238 | 47 |
| Gms/Team in MC | 56 | 55 | 52 | 52 | 24 |
| Upper Bound | 73% | 69% | 70.50% | 66% | 76.50% |
| Classifier Trained # Gms | 480 | 330 | 676 | 364 | 108 |
| Classifier % | 66.88% | 60.61% | 61.02% | 61.53% | 64.81% |
| Limit Differential | -6.13% | -8.39% | -9.48% | -4.47% | -11.69% |
| Home Win% | 52.90% | 52% | 56% | 61.50% | 48% |
| Baseline Differential | 13.98% | 8.61% | 5.02% | 0.03% | 16.81% |
| Classifier | SMO | Neural Network | Voting | SMO | Simple Log |

I graphed the year-to-year parity levels of each league. The horizontal axis is the season, from the current 2013-2014 back as far as 1996-1997. The vertical axis is the standard deviation of win percentages that year. A win in regulation, OT, or the shootout all counted as wins, and likewise for losses.

Discussion

I find this year-to-year graph quite interesting. First of all, a lot of leagues are not very steady in their parity levels from year to year. One reason could be that the less parity a league has, the more volatile it becomes. The other is that each of those points comes from only 12-30 team seasons, which is a small sample size.

It does still show which leagues have more parity than others. League changes can shift a league's direction in terms of parity, such as the introduction of the salary cap in the NHL (which is why I didn't look past 2005 for that league). Other factors I am not aware of might affect other leagues, and might explain the ELH's change in direction towards more parity. Also interesting is that all the leagues seem to progress from around the same point in 2004 (the same year as the NHL lockout that brought in the salary cap; I am not sure if there is causation here).

I was surprised by the difference in parity between the KHL and the NHL, the two leagues generally considered the best in the world. It makes sense when you look at how they are financed and how much is spent on talent across teams. It was interesting to see that the AHL has more parity than the NHL; my assumption is that its talent level is more uniform, players who are just below NHL calibre or younger players still developing. You don't see the superstars the NHL has. Also interesting is how stable the OHL is versus the QMJHL, but that could be small sample size.

Another thing I was surprised to see, although it makes sense on inspection, is that talent does not equal parity. This goes hand in hand with the NHL vs. the KHL. The KHL is the league closest to the NHL (at least in terms of NHL equivalency points, assuming the NHL is the top league), yet leagues with much lower talent, e.g. the ECHL, have more parity than the KHL. So while parity != talent, since parity reflects the gap between your best and worst teams, the less parity your league has, the easier its games are to predict. The more parity, the closer games move towards a coin flip.

Interestingly, and again it seems obvious in hindsight, the amount of parity in the season of data you train on matters. If your training data has more parity than the observed history, the classifier accuracy moves closer to the baseline, the home win % (and never seems to drop below it).

This suggests we should be training on multiple seasons of data. I did this for the NHL, and the results are below. Also interesting: the best classifiers never seem to perform worse than home-ice advantage. In most leagues this is around 55-56% (except in Australia, where it's 48-49%; I guess everything is backwards there).

I am sure this has been explored before, but it is interesting that, despite parity, the home team always has an advantage. It can come from a number of factors: lack of travel, more rest, sleeping in their own beds, the home crowd pumping them up, and officiating bias. I will leave that for someone else to explore, though I am curious how games at a neutral site would affect classifier rates (we don't have many of those in the NHL).

| Trg | 2005 | 2005-06 | 2005-07 | 2005-08 | 2005-09 | 2005-10 | 2005-11 |
|---|---|---|---|---|---|---|---|
| Test | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
| NB | 58.1774 | 54.2718 | 57.6892 | 56.0618 | 52.8413 | 55.8991 | 56.0501 |
| J48 | 58.1364 | 52.5631 | 58.4215 | 57.8519 | 55.4109 | 55.2075 | 56.3282 |
| RF | 52.7258 | 51.2205 | 51.874 | 82.0993 | 52.319 | 53.214 | 49.235 |
| NN | 58.2994 | 52.441 | 56.55 | 52.0342 | 53.4988 | 52.0749 | 51.3213 |
| SVM | 55.0854 | 53.7022 | 55.8991 | 56.0618 | 51.8308 | 55.8991 | 56.8846 |

We can see how some of these algorithms are better than others, and we can also see some concept drift, suggesting that training on too many seasons too far back isn't very helpful to our predictions. Some work with Logistic Regression suggests it still achieves about 58% when training on 2005-2011 data. Redoing this experiment with a league with less parity (and thus a higher ceiling) would likely exaggerate the differences in accuracy between years.
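The multi-season runs in the table amount to an expanding-window (walk-forward) evaluation: train on 2005 through season N, test on season N+1. A generic sketch of that loop, with placeholder functions and toy data of my own (the real experiments used Weka):

```python
def expanding_window_eval(seasons, train_fn, score_fn, start=1):
    """Walk-forward evaluation: for each split point, train on every
    season up to it and test on the next one. `seasons` is an ordered
    list of per-season datasets; train_fn/score_fn stand in for
    whatever learner and metric you use."""
    results = []
    for i in range(start, len(seasons)):
        model = train_fn(seasons[:i])                 # e.g. 2005..2010
        results.append(score_fn(model, seasons[i]))   # test on 2011
    return results

# Toy usage with a majority-class "home team wins" baseline (1 = home win):
toy = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
train = lambda hist: sum(sum(s) for s in hist) / sum(len(s) for s in hist) >= 0.5
score = lambda m, s: sum((1 if m else 0) == y for y in s) / len(s)
print(expanding_window_eval(toy, train, score))  # per-season accuracies
```

Scanning the resulting per-season accuracies is one simple way to spot the concept drift described above.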

So going back to the hypothesis: I feel confident that yes, there is an upper bound, a glass ceiling, on how well our classifiers can do. It is correlated with the parity of the league, and also with the number of teams and the number of games played in a year. Across all observed seasons, there is an r-squared of 0.9269 between the parity of a league and its upper limit for prediction.
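That r-squared figure can be reproduced from the two results tables above, correlating each league's observed win-percentage standard deviation with its estimated upper bound:

```python
import math

# (observed win% stdev, upper bound) per league, from the results tables:
# NHL, AHL, ECHL, WHL, OHL, QMJHL, BCHL, SHL, KHL, ELH, AIHL
data = [(0.090, 0.620), (0.086, 0.605), (0.110, 0.650), (0.141, 0.720),
        (0.143, 0.715), (0.146, 0.725), (0.155, 0.730), (0.115, 0.690),
        (0.137, 0.705), (0.119, 0.660), (0.191, 0.765)]

def r_squared(pairs):
    """Square of the Pearson correlation coefficient."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return (sxy / math.sqrt(sxx * syy)) ** 2

print(round(r_squared(data), 4))  # close to the 0.9269 reported above
```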

I also feel strongly that the Monte Carlo method can calculate the approximate limit, and it seems fairly intuitive: as your league approaches parity, your teams become more even and each game becomes more of a coin flip. The second part of the hypothesis has been addressed as well: yes, we can create classifiers for hockey leagues using those three features, and they perform better than the baseline. This is dependent on the parity of the data you train on, so training on multiple years is key. I also tried adding new features, such as cumulative goals for, and that did not statistically change the results.

I was also surprised by the accuracy I was getting from Logistic Regression and Simple Logistic Regression, when my initial experiments had shown the success of SMO and NN, algorithms that are typically great with noisy data such as this. Tuning them might boost overall accuracy, but I doubt the result would be statistically different.

Conclusion

In this article I looked at the glass ceiling on prediction limits across a number of hockey leagues and then tried to create classifiers that reach it. I feel both of these were achieved. It is interesting to see that the amount of parity in a league affects the overall prediction level, and that talent in a league does not equal parity. The question then becomes: would you rather watch a talented league full of parity, where anyone can win? Or a talented league with little parity, where your team either dominates or gets dominated?

As always, if you want to discuss any of this, feel free to drop me a line.
