Entering training camp, teams share a common goal: win the Stanley Cup. The gruelling 82-game regular season separates those with legitimate title hopes from those whose rosters are insufficient, leaving only the sixteen most eligible teams. The attrition of playoff hockey gradually whittles down this number until a single champion emerges victorious, battle-tested from the path they took to win hockey’s top prize. Two months off, then we do it all again.

Teams that have won the Stanley Cup share certain traits. Anecdotally, it’s been helpful to have a dominant 1st line centre akin to Sidney Crosby, Jonathan Toews or Anze Kopitar. Elite puck-moving defensemen don’t hurt either, nor does a hot goalie. Delving deeper, though, what do championship teams have in common?

I decided to answer this question systematically with the help of some machine learning.

Some Background on Classification

Classification is a popular branch of supervised machine learning where one attempts to create a model capable of making predictions on new data points. We do this by building up, or ‘training’, the model using historical data, explicitly telling the model whether each past data point achieved the target class that we’re trying to predict. In the context of hockey, this data point could be some number of team statistics produced by the 2015 Chicago Blackhawks. The target here would be whether they won the Stanley Cup, which they did.

Sufficiently robust classification models can identify a number of statistical trends that underpin the phenomenon that they’re observing. The models can then learn from these trends to make reasonably intelligent predictions on the outcome of future data points by comparing them to the data that the classifier has already seen.
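To make the idea concrete, here is a minimal sketch of a classifier learning from labelled examples. It is a bare-bones logistic regression fitted by gradient descent, not the code behind this article; the two features and every number in the toy data are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this example (predicted probability minus label).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Made-up, already-standardized toy data: [offence_stat, defence_stat].
train_rows = [[1.5, 0.2], [1.1, -0.3], [-0.8, 0.4], [-1.2, -0.1], [0.1, 0.0], [-0.5, -0.6]]
train_labels = [1, 1, 0, 0, 0, 0]  # 1 = reached the target class, 0 = did not

weights, bias = train_logistic(train_rows, train_labels)
strong = predict_proba(weights, bias, [1.3, 0.0])   # unseen team with a strong offence
weak = predict_proba(weights, bias, [-1.0, 0.0])    # unseen team with a weak offence
print(f"strong offence: {strong:.2f}, weak offence: {weak:.2f}")
```

The model never sees the two new teams during training; it scores them by comparing their stats to the labelled examples it has already seen.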

Building a Hockey Classifier

We can apply these techniques to hockey. We have the tools to train a model to learn which team statistics are most predictive of playoff success. To do this, we must first decide which stats to include in our dataset. To create the most intelligent classifier, we decided to include as many meaningful team statistics as possible. Here’s what we came up with:

It’s worth noting that we engineered the ‘Div Avg Point’ feature by calculating the average number of points earned by all teams in a given team’s division. The remaining statistics were sourced from Corsica and Natural Stat Trick. An explanation of each of these stats can be found in the glossaries of the two websites.
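One way the ‘Div Avg Point’ feature could be computed is sketched below. The field names and records are placeholders, and two details are assumptions on my part: that the average is taken per season, and that it includes the team’s own points.

```python
from collections import defaultdict

def add_division_avg_points(teams):
    """Return a copy of each team record with a 'div_avg_points' field added."""
    points_by_group = defaultdict(list)
    for team in teams:
        # Group by season AND division so different years are never mixed.
        points_by_group[(team["season"], team["division"])].append(team["points"])
    group_avg = {key: sum(pts) / len(pts) for key, pts in points_by_group.items()}
    return [
        {**team, "div_avg_points": group_avg[(team["season"], team["division"])]}
        for team in teams
    ]

# Placeholder records, not real standings.
teams = [
    {"team": "Team A", "season": "S1", "division": "North", "points": 100},
    {"team": "Team B", "season": "S1", "division": "North", "points": 90},
    {"team": "Team C", "season": "S1", "division": "South", "points": 80},
    {"team": "Team D", "season": "S1", "division": "South", "points": 70},
]
enriched = add_division_avg_points(teams)
for row in enriched:
    print(row["team"], row["div_avg_points"])
```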

Our dataset included 210 data points: 30 teams per season over the 7 seasons between 2010-11 and 2016-17. Each data point included team name, the above 53 team stats, and a binary variable to indicate whether the team in question won the Cup. Using this data, we trained nine different models to recognize the statistical commonalities between the 7 teams whose seasons ended with a Stanley Cup championship. The best-performing model was a Logistic Regression model trained on even-strength data, and so all further analysis was conducted using this model.
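The arithmetic behind the dataset is worth spelling out, because it shows how rare the target class is. The short sketch below uses only the figures quoted above; the always-"no" baseline is my own illustration and not something the models were scored against in this article.

```python
SEASONS = ["2010-11", "2011-12", "2012-13", "2013-14", "2014-15", "2015-16", "2016-17"]
TEAMS_PER_SEASON = 30

n_rows = TEAMS_PER_SEASON * len(SEASONS)  # one row per team-season
n_champions = len(SEASONS)                # exactly one Cup winner per season
positive_rate = n_champions / n_rows

# A "model" that never predicts a champion would still be right this often,
# which is why raw accuracy says little on data this imbalanced.
always_no_accuracy = 1 - positive_rate

print(f"{n_rows} rows, {n_champions} champions, {positive_rate:.1%} positive")
print(f"always-'no' baseline accuracy: {always_no_accuracy:.1%}")
```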

Results: Team Stats that Matter Most

To evaluate which team stats were most strongly linked to winning a Cup, we created a z-score standardized version of our team data. We then calculated the estimated coefficients that our logistic regression model assigned to each team stat. The size of these coefficients indicates the relative importance of different team stats in predicting Stanley Cup champions. The five highest-ranking team stats can be seen below:
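For readers who want to see the mechanics, the two steps described above (standardize, then rank by coefficient) can be sketched as follows. The stat names, raw values, and coefficients are placeholders, and ranking by absolute size is an assumption; ranking by signed value would be an equally plausible reading.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

def rank_features(names, coefficients, top=5):
    """Sort features by absolute coefficient size, largest first."""
    ranked = sorted(zip(names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:top]

# Placeholder raw stats: one row per team, one column per stat.
raw = [[2.9, 55.0], [2.4, 48.0], [2.1, 50.0], [2.6, 47.0]]
standardized = zscore_columns(raw)

# Placeholder coefficients, as if read off a fitted logistic regression.
names = ["stat_a", "stat_b", "stat_c"]
coefficients = [0.8, -1.2, 0.1]
top_two = rank_features(names, coefficients, top=2)
print(top_two)
```

Standardizing first matters because it puts every stat on the same scale; without it, a coefficient's size would partly reflect the stat's units rather than its importance.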

Of all team statistics, ‘Goals For Per 60 Minutes’, or GF/60, is most predictive of winning a Stanley Cup. Of the 7 champions in the dataset, 4 ranked within the top 5 league-wide in GF/60 in their respective season, with 2016-17 Pittsburgh most notably leading the league in the statistic. Impressive results in ‘High Danger Chances For’ and ‘Team Wins’ both strongly correlate to playoff success, while ‘Scoring Chance For Percentage’ and ‘Shots on Goal For Percentage’ round out the top 5.

What Does It Mean?

Generating a list of commonalities among past champions allows us to comment on what factors impact a team’s likelihood of going all the way. Most apparent is the importance of offense. It is more important to generate goals and high-danger chances than it is to prevent them, as GA/60 and HDCA rank 36th and 13th among all statistics, respectively (their offensive counterparts, GF/60 and HDCF, rank 1st and 2nd). In the playoffs, the best team offense tends to trump the best team defense, which we saw anecdotally in last year’s Pittsburgh v Nashville Final. If you want to win a Stanley Cup, the best defense is a good offense.

We can see that a team’s ability to generate scoring chances, both high-danger and otherwise, is more predictive of playoff success than their ability to generate shots. Although hockey analytics pioneers championed the use of shot metrics as a proxy for puck possession, recent industry sentiment has shifted towards the belief that shot quality matters more than shot volume. The thinking here, which is supported by the above results, is that not all shots have an equal chance of beating a goalie, and so it is more important to generate a shot with a high chance of going in than it is to generate a shot of any kind. Between a team who can consistently out-chance opponents and a team who can consistently out-shoot opponents, the former is more likely to win a hockey game, and therefore playoff series.

Application: The 2017-18 Season

A predictive model isn’t very helpful unless it can make predictions. So let’s make some predictions.

By feeding our model the team stats produced by the recently-completed 2017-18 regular season, we can output predictions of each team’s likelihood of winning the 2018 Stanley Cup. Since this is the fun part, let’s get right to the probability estimates for all 31 NHL teams:
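Mechanically, turning a fitted logistic regression into a ranking looks something like the sketch below. The weights, bias, team names, and stats are all hypothetical stand-ins, not the values from our model.

```python
import math

def win_probability(weights, bias, stats):
    """Logistic regression estimate for one team's standardized stats."""
    z = bias + sum(w * s for w, s in zip(weights, stats))
    return 1.0 / (1.0 + math.exp(-z))

def rank_teams(weights, bias, season_stats):
    """Return (team, probability) pairs sorted from most to least likely."""
    probs = {team: win_probability(weights, bias, stats)
             for team, stats in season_stats.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical fitted parameters and standardized team stats.
weights, bias = [1.0, 0.5], -3.0
season_stats = {
    "Team A": [1.2, 0.4],
    "Team B": [-0.3, 0.9],
    "Team C": [0.1, -1.0],
}
ranking = rank_teams(weights, bias, season_stats)
for team, prob in ranking:
    print(f"{team}: {prob:.3f}")
```

Each team is scored independently, so the estimates produced this way need not sum to 1 across the league; they are best read as a ranking rather than as exact odds.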

The rankings above essentially indicate how similar each team’s season was to the regular season of teams that went on to win it all. In doing so, they aim to identify the teams most likely to replicate this success. The model favours the Boston Bruins to win the 2018 Stanley Cup, predicting a victory over the Nashville Predators in the Final.

The above data highlights a few curiosities. Notably, we can see that some non-playoff teams had 5-on-5 numbers that were relatively comparable to past Cup champions. Specifically, the Blues, Stars, and Flames played 5-on-5 hockey well enough this season to qualify for the playoffs. The Blues and Flames can attribute their disappointingly long off-seasons to the 30th and 29th-ranked power plays, respectively. The Stars’ implosion is more of a statistical anomaly; an autopsy would be interesting, but it is better left as the subject of another article.

The lowest-ranked teams to have made the playoffs in the real world are the New Jersey Devils and the Washington Capitals. While their offensive star power might have been enough to get these squads to the dance, the model predicts a quick exit for them both.

A Computer-Generated Bracket:

For fun, I’ve filled out the above bracket using the class probability rankings generated by our model. Of the 8 teams who have won or are winning their first-round playoff series, the model picked 7 of them as the winner, with Philadelphia being the exception. While it’s far too early to comment on the model’s accuracy, as only a single playoff series has been completed, it’s an encouraging start.
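Filling out a bracket from a probability ranking is simple enough to sketch in a few lines. The version below assumes a fixed bracket with no reseeding and advances whichever team has the higher estimate in each pairing; the team labels and probabilities are placeholders.

```python
def fill_bracket(teams, probability):
    """Advance the higher-probability team in each pairing until one remains.

    `teams` is ordered so that adjacent entries meet in the first round,
    and its length is assumed to be a power of two.
    """
    rounds = [list(teams)]
    current = list(teams)
    while len(current) > 1:
        current = [
            max(current[i], current[i + 1], key=probability.get)
            for i in range(0, len(current), 2)
        ]
        rounds.append(current)
    return rounds

# Placeholder teams and model estimates.
probability = {"A": 0.20, "B": 0.05, "C": 0.10, "D": 0.12}
rounds = fill_bracket(["A", "B", "C", "D"], probability)
for number, survivors in enumerate(rounds):
    print(f"after round {number}: {survivors}")
```

This treats the higher-ranked team as a certain winner of every series, which is exactly why the resulting bracket should be taken as fun rather than as a forecast.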

Limitations of the Analysis

The above results must be considered in the appropriate context. The model was trained and tested using only 5-on-5 data, which would explain the lack of love for teams with strong special teams like Pittsburgh and Toronto. The model is also blind to the NHL’s playoff format. Due to the NHL’s decision to have teams play against their divisional foes during the first two playoff rounds, teams in strong divisions have a much harder road to winning a Cup. Consider that Minnesota’s path to the conference final would likely involve Winnipeg and Nashville in the first two rounds, who finished 2nd and 1st in NHL standings in the regular season. Divisional difficulty is not reflected in the probabilities listed above, though incorporating divisional difficulty either probabilistically or through a strength of schedule modifier could be areas of further analysis.

A final limitation of the model is that it is trained using only 7 champions. In an ideal world, we would have access to dozens or hundreds of Stanley Cup positive instances, but due to the nature of the game there can only be one champion per year. We considered extending the dataset backwards past 2011 but ultimately decided against doing so. The NHL is different today than it was in the past. Training a model on a champion from 2000 tells us little about what it takes to have success in 2018. Using 2010-11 onwards represented a happy medium in the trade-off between data relevance and quantity.

What next?

Winning a Stanley Cup remains an inexact science. While it’s valuable to identify trends among past winners, there is no guarantee that what’s worked before will work again. It’s a game of educated guesses.

I believe that the most legitimate way to build a Stanley Cup winner is a combination of the past and the future. Analyzing historical data to identify team traits that are predictive of a championship is half the battle. The rest is anticipating what the future of the NHL will look like. The champions of the next few years will be led by managers who are best able to identify what it’ll take to win in the modern NHL. While the above framework approaches the first half in a systematic way, the latter remains much harder to crystallize.

In the meantime, let’s turn to what’s in front of our eyes. The playoffs have been tremendously entertaining thus far, and that’ll only pick up as teams are threatened by elimination. Let’s enjoy some playoff hockey. Let’s see which playing styles, tactics, and matchups seem to work. Let’s learn.

Even if your team gets eliminated, remember that this season’s playoffs are just a couple of months away from being data points to train next season’s model.
