Predicting the Stanley Cup Champion

When I was at the Recurse Center, I wanted to try the caret package for R. It provides a unified interface for training various types of classification and regression models, and parameter tuning through resampling. I needed a project to work on, and since I love hockey and the Stanley Cup playoffs were just starting, it was a natural choice.

The source code is all on GitHub, and is split into four R Markdown documents: scrape raw data, process data, train models, and make predictions. I’ll present a short summary here, and more details can be found behind those links. The repository also contains a Makefile to replicate the analysis. Random seeds are specified in the code to make it fully reproducible.

First, I used the nhlscrapr package to scrape play-by-play data from NHL.com starting from the 2002-2003 season. Then, I used dplyr to calculate some summary statistics. For each game, I calculated the following statistics for both the home and away teams:

power play, i.e. “power play goals scored / penalties for the other team”

penalty kill, i.e. “power play goals against / penalties for own team”
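
As a rough illustration of the two definitions above, here is a minimal sketch in Python rather than the R used in the project. The function name, its arguments, and the fallback of 0.0 when no penalties were called are all assumptions made for this example, not details taken from the actual code.

```python
def special_teams_rates(pp_goals_for, pp_goals_against,
                        penalties_own, penalties_opponent):
    """Return (power play, penalty kill) for one team in one game.

    power play   = power play goals scored  / penalties for the other team
    penalty kill = power play goals against / penalties for own team

    Returning 0.0 when the denominator is zero is an assumption of this
    sketch; the post does not say how that case is handled.
    """
    pp = pp_goals_for / penalties_opponent if penalties_opponent else 0.0
    pk = pp_goals_against / penalties_own if penalties_own else 0.0
    return pp, pk


# Hypothetical game: 1 power play goal on 4 opponent penalties,
# no power play goals conceded on 3 own penalties.
pp, pk = special_teams_rates(pp_goals_for=1, pp_goals_against=0,
                             penalties_own=3, penalties_opponent=4)
print(pp, pk)
```

Note that with this definition a lower penalty kill value is better, since it counts goals conceded.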

I’m sure many more useful predictor variables could be derived from the play-by-play data, which in turn would result in more accurate predictions. But since this was mainly an exercise to try out caret, these variables will suffice for now.

For each season, I then calculated the average performance of each team, separately for when they were playing at home and on the road. Here’s an example of away performance for six teams from the 2002-2003 season:

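| season   | team | goals | shots | faceoffs | penalties | pp    | pk    |
|----------|------|-------|-------|----------|-----------|-------|-------|
| 20022003 | ANA  | 0.480 | 0.477 | 0.544    | 0.517     | 0.159 | 0.095 |
| 20022003 | ATL  | 0.433 | 0.434 | 0.467    | 0.504     | 0.147 | 0.139 |
| 20022003 | BOS  | 0.422 | 0.503 | 0.481    | 0.555     | 0.137 | 0.103 |
| 20022003 | BUF  | 0.416 | 0.482 | 0.493    | 0.526     | 0.118 | 0.119 |
| 20022003 | CAR  | 0.384 | 0.492 | 0.512    | 0.514     | 0.105 | 0.169 |
| 20022003 | CBJ  | 0.376 | 0.432 | 0.472    | 0.511     | 0.110 | 0.104 |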
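
The per-season averaging described above amounts to a group-and-summarise step. The project does this with dplyr; the sketch below shows the same idea in plain Python. The record layout (a `venue` field with the values `"home"` and `"away"`) and the per-game numbers are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def average_by_team(games):
    """Average every statistic per (season, team, venue) group."""
    id_fields = ("season", "team", "venue")
    grouped = defaultdict(list)
    for game in games:
        grouped[tuple(game[f] for f in id_fields)].append(game)

    averages = {}
    for key, rows in grouped.items():
        stat_names = [name for name in rows[0] if name not in id_fields]
        averages[key] = {s: mean(r[s] for r in rows) for s in stat_names}
    return averages


# Made-up per-game rows, only to show the shape of the computation.
games = [
    {"season": "20022003", "team": "ANA", "venue": "away", "shots": 0.50, "pp": 0.25},
    {"season": "20022003", "team": "ANA", "venue": "away", "shots": 0.25, "pp": 0.00},
    {"season": "20022003", "team": "ANA", "venue": "home", "shots": 0.75, "pp": 0.50},
]
avg = average_by_team(games)
print(avg[("20022003", "ANA", "away")])
```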

Next, I took the outcomes of all playoff series from the past 11 seasons, and calculated two deltas to be used as explanatory variables. I calculated the difference between the home team’s home performance and the away team’s away performance, and also the home team’s away performance and the away team’s home performance. This was to capture how the two teams would perform at the two arenas for the series.
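
To make the two deltas concrete, here is a small Python sketch for a single statistic. The dictionary layout and the numbers are hypothetical; only the two subtractions follow the description above.

```python
def series_deltas(home_team, away_team, stat):
    """Return the two deltas for one statistic in one playoff series.

    home_team and away_team are dicts with "home" and "away" keys,
    each holding that team's season averages at that kind of venue.
    """
    # Games played at the home team's arena.
    delta_home_arena = home_team["home"][stat] - away_team["away"][stat]
    # Games played at the away team's arena.
    delta_away_arena = home_team["away"][stat] - away_team["home"][stat]
    return delta_home_arena, delta_away_arena


# Hypothetical season averages for the "goals" statistic.
home_team = {"home": {"goals": 0.625}, "away": {"goals": 0.5}}
away_team = {"home": {"goals": 0.5625}, "away": {"goals": 0.375}}

deltas = series_deltas(home_team, away_team, "goals")
print(deltas)
```

A positive first delta means the home team has tended to do better at home than the visitor has on the road; the second delta covers the games where the roles are reversed.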

I then used caret to train five different types of statistical models on this training data. The methods I included were generalized linear model, linear discriminant analysis, neural network, random forest, and support vector machine with a linear kernel. For each, model parameters were tuned with 10-fold cross-validation, which was repeated 10 times. Parameter values with the best overall accuracy were used to fit the final model with all of the training data.
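
The resampling scheme itself, 10-fold cross-validation repeated 10 times, can be sketched independently of any model. The Python below only generates the train/test index splits; it is a plain, unstratified version written for illustration and makes no attempt to reproduce how caret builds its folds. The sample count of 50 and the seed are arbitrary.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=1):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        for fold in range(k):
            test_idx = indices[fold::k]  # every k-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


splits = list(repeated_kfold(n_samples=50))
print(len(splits))  # 10 folds x 10 repeats
```

Each candidate parameter value would be scored on all of these splits, and the value with the best average accuracy kept for the final fit.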

For my predictions, instead of picking just one of the five fitted models, I used all of them. For each playoff series, I used a majority vote from all five models to pick the winner. (That’s why I fitted an odd number of models.) The predictions are below, with the predicted winner in bold:

Round 1

Pittsburgh Penguins at New York Rangers

Ottawa Senators at Montreal Canadiens

Detroit Red Wings at Tampa Bay Lightning

New York Islanders at Washington Capitals

Winnipeg Jets at Anaheim Ducks

Minnesota Wild at St. Louis Blues

Chicago Blackhawks at Nashville Predators

Calgary Flames at Vancouver Canucks

Round 2

New York Islanders at New York Rangers

Tampa Bay Lightning at Montreal Canadiens

Calgary Flames at Anaheim Ducks

Chicago Blackhawks at St. Louis Blues

Round 3 – Conference Finals

Montreal Canadiens at New York Rangers

Chicago Blackhawks at Anaheim Ducks

Round 4 – Stanley Cup Finals

To be clear, this blog entry was posted after the playoffs were already over. The explanatory text in the R Markdown documents was also written during the playoffs. But the same prediction as presented above can be seen in this GitHub commit (and the same HTML document on RawGit) from April 23rd. This was not before the playoffs started (April 15th), but when the first round was 3-4 games in, depending on the series.
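
The majority vote used for the predictions above is simple enough to sketch in a few lines of Python. The per-model picks below are placeholders, since the individual model outputs are not listed here; the model labels are the method names caret uses for the five model types.

```python
from collections import Counter


def majority_vote(picks):
    """Return the team picked by the most models.

    With an odd number of models and two teams per series,
    there is always a strict majority, so ties cannot occur.
    """
    winner, _votes = Counter(picks).most_common(1)[0]
    return winner


# Placeholder picks for one series, one per fitted model.
picks = {
    "glm": "Team A",
    "lda": "Team B",
    "nnet": "Team A",
    "rf": "Team A",
    "svmLinear": "Team B",
}
series_winner = majority_vote(picks.values())
print(series_winner)
```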

Validation Set

Since the playoffs are in fact already over, the natural validation set is also available. The Chicago Blackhawks did end up winning the Cup, but how did I do otherwise? Below are the predictions again, now together with the real outcomes. Because an incorrect prediction in one round leads to wrong pairings in the subsequent rounds, I have added in the series that actually ended up happening. (I made a prediction for every possible series, but only presented the resulting bracket here.) These added ones are in italics.

Round 1

Pittsburgh Penguins at New York Rangers – correct

Ottawa Senators at Montreal Canadiens – correct

Detroit Red Wings at Tampa Bay Lightning – correct

New York Islanders at Washington Capitals – INCORRECT

Winnipeg Jets at Anaheim Ducks – correct

Minnesota Wild at St. Louis Blues – INCORRECT

Chicago Blackhawks at Nashville Predators – correct

Calgary Flames at Vancouver Canucks – correct

Round 2

Washington Capitals at New York Rangers – correct

Tampa Bay Lightning at Montreal Canadiens – INCORRECT

Calgary Flames at Anaheim Ducks – correct

Minnesota Wild at Chicago Blackhawks – correct

Round 3 – Conference Finals

Tampa Bay Lightning at New York Rangers – INCORRECT

Chicago Blackhawks at Anaheim Ducks – correct

Round 4 – Stanley Cup Finals

Chicago Blackhawks at Tampa Bay Lightning – correct

Overall, my accuracy was 11 out of 15, which is 73%.
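
The tally can be checked with a few lines of Python. The True/False values below are transcribed from the correct/INCORRECT labels in the list above, in the same order; the variable names are just for this sketch.

```python
# True = correct prediction, False = incorrect, in the order listed above.
results = {
    "Round 1": [True, True, True, False, True, False, True, True],
    "Round 2": [True, False, True, True],
    "Round 3": [False, True],
    "Round 4": [True],
}

correct = sum(sum(outcomes) for outcomes in results.values())
total = sum(len(outcomes) for outcomes in results.values())
accuracy = correct / total

print(f"{correct}/{total} = {accuracy:.0%}")  # 11/15 = 73%
```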

An obvious follow-up from here could be to look at each of the five different models (generalized linear model, linear discriminant analysis, neural network, random forest, and support vector machine with a linear kernel) and compare their accuracies against each other.