I’ve previously discussed potential applications of association rules learning in the second half of a post from last year. The arulesViz package for R is broken for me right now, so I won’t be able to recreate the graph at the bottom without considerable effort. For now, we’ll use graphics I’ve generated on my own. While I’ve created some interesting results, I would like to strongly emphasize that this is essentially a prototype, so going forward one must remember the assumptions being made and the contexts of the statistics being analyzed.

Last weekend I presented some of my findings at Ottawa Hockey Analytics, and what I’m writing about today is largely the same. My data comes entirely from the NHL’s Play By Play (PBP) and Player TOI tables. For this work it suffices to use PBP data on its own; I find Corsi events reliable at correctly recording the players present on the ice. However, I originally intended this dataset to have other purposes, so I stripped the PBP of its player information and added it using TOI tables. There may be some errors as a consequence (e.g. if shifts were not properly recorded). I organized the data into binary columns for every event and player. If a player is on the ice during an event, he is marked as a 1; those off the ice are marked as 0s. Events are recorded similarly. This is a sample of what I end up with:

In association rule learning, a binary dataset like this one can be thought of as a big list of itemsets. For our purposes, our items are players and events, and we’d like to measure how often they appear together. Our motivation is to highlight players’ presences during “good” or “bad” events; in this example, we’ll use Corsi For and Corsi Against, respectively. This technique is not limited to Corsi events (in fact, I’d like to expand it to many other events), but for now they’re the easiest to record and they have predictive value.
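To make the layout concrete, here is a minimal Python sketch of the binary encoding and the itemsets it implies. The three records and the helper names (`to_binary_row`, `to_itemset`) are made up for illustration; this is not the actual pipeline, just the same idea on a toy scale.

```python
# Hypothetical records: who was on the ice and which event happened.
records = [
    {"players": ["ANZE_KOPITAR", "DWIGHT_KING"], "event": "CORSI_FOR"},
    {"players": ["ANZE_KOPITAR"], "event": "CORSI_AGAINST"},
    {"players": ["DWIGHT_KING"], "event": "CORSI_FOR"},
]

# One binary column per player and per event, in a fixed (sorted) order.
columns = sorted(
    {p for r in records for p in r["players"]} | {r["event"] for r in records}
)

def to_binary_row(record):
    """Mark each column 1 if the player/event is present in the record, else 0."""
    present = set(record["players"]) | {record["event"]}
    return [1 if c in present else 0 for c in columns]

def to_itemset(row):
    """Recover the itemset (the set of column names marked 1) from a binary row."""
    return {c for c, flag in zip(columns, row) if flag == 1}

binary_rows = [to_binary_row(r) for r in records]
itemsets = [to_itemset(row) for row in binary_rows]

print(columns)
print(binary_rows)
print(itemsets)
```

Each binary row and its itemset carry the same information; the binary form is simply easier to store as a table.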

Before showing the results, I’ll provide a quick primer:

An itemset X is a collection of items. It’s basically a row in our dataset. In the screenshot, our first itemset is {ANZE_KOPITAR, JONATHAN_QUICK, DWIGHT_KING, CORSI_FOR}. Eventually I removed the goaltenders; I believe they have analytic value, but for now I’m keeping it simple. Well, simple-ish.

A rule is an implication between two itemsets X and Y, written as X => Y.

Rules are split into the left-hand side (LHS, or “antecedent”) and right-hand side (RHS, or “consequent”)

The following metrics are interest measures. As the name implies, they are meant to highlight interesting relationships between variables.

The support of an itemset X is defined as the proportion of the database in which X appears

For us, a higher support means that the combination of players and events happens more often. Players with more ice time will necessarily have higher support simply because they’re on the ice when more things happen

The confidence of a rule X => Y is the ratio of the support for X and Y to the support of X alone.

CONF(X => Y) = SUPP(X, Y)/SUPP(X)

The lift of a rule X => Y is the ratio of the support for X and Y to the product of the supports of X and Y individually

LIFT(X => Y) = SUPP(X, Y)/[SUPP(X)*SUPP(Y)]
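As a small worked example of support, confidence, and lift, here is a Python sketch on five made-up itemsets. The rows and the function names are illustrative assumptions; only the formulas come from the definitions above.

```python
# Toy itemsets (rows of the binary dataset); the values are made up.
rows = [
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "CORSI_AGAINST"},
    {"KING", "CORSI_FOR"},
    {"CORSI_AGAINST"},
]

def supp(items):
    """SUPP(X): proportion of rows that contain every item in X."""
    items = set(items)
    return sum(items <= row for row in rows) / len(rows)

def conf(x, y):
    """CONF(X => Y) = SUPP(X, Y) / SUPP(X)."""
    return supp(set(x) | set(y)) / supp(x)

def lift(x, y):
    """LIFT(X => Y) = SUPP(X, Y) / [SUPP(X) * SUPP(Y)]."""
    return supp(set(x) | set(y)) / (supp(x) * supp(y))

x, y = {"KOPITAR", "KING"}, {"CORSI_FOR"}
print("support:", supp(x | y))
print("confidence:", conf(x, y))
print("lift:", lift(x, y))
```

In this toy data the pair appears in 2 of 5 rows and both of those rows are Corsi For events, so the confidence is 1 and the lift is 0.4 / (0.4 × 0.6), or about 1.67.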

The difference of confidence of a rule X => Y is the difference of confidence between X => Y and ¬X => Y

DOC(X => Y) = CONF(X => Y) – CONF(¬X => Y)

Interest measures can be interpreted in a probability context. If we restrict our sample space to all Corsi events over an entire season, we are measuring the probability that Corsi events occur with respect to the players on the ice. Letting X = {Players on ice} and Y = {Event}, we get:

CONF(X => Y) = P(X, Y)/P(X) = P(Y|X) = Probability that Y occurs, given that player combination X is on the ice

LIFT(X => Y) = P(X, Y)/[P(X)P(Y)]. Statistical independence between events X and Y is defined as P(X)P(Y) = P(X, Y), so the lift’s closeness to 1 could be used to indicate independence.

DOC(X => Y) = P(Y|X) – P(Y|¬X) = Probability that an event occurs given player combination X is on the ice – Probability that an event occurs given player combination X is not on the ice

Note that X is the entire combination of players. If X = {KOPITAR, KING}, then ¬X = {All sets without both of KOPITAR, KING}. Thus, ¬X includes all events where: (1) Kopitar is on the ice but King is not; (2) King is on the ice but Kopitar is not; and (3) neither Kopitar nor King is on the ice.
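The handling of ¬X is the part that is easiest to get wrong, so here is a minimal Python sketch of the difference of confidence on six made-up rows. The data and the `doc` helper are hypothetical; the key point is that ¬X is every row that does not contain the whole combination.

```python
# Toy itemsets; the values are made up.
rows = [
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "KING", "CORSI_AGAINST"},
    {"KOPITAR", "KING", "CORSI_FOR"},
    {"KOPITAR", "CORSI_AGAINST"},   # Kopitar without King: counts as not-X
    {"KING", "CORSI_FOR"},          # King without Kopitar: counts as not-X
    {"CORSI_AGAINST"},              # neither on the ice: counts as not-X
]

def doc(x, y):
    """DOC(X => Y) = CONF(X => Y) - CONF(not-X => Y)."""
    x, y = set(x), set(y)
    with_x = [r for r in rows if x <= r]         # all of X is on the ice
    without_x = [r for r in rows if not x <= r]  # at least one member of X is missing
    conf_x = sum(y <= r for r in with_x) / len(with_x)
    conf_not_x = sum(y <= r for r in without_x) / len(without_x)
    return conf_x - conf_not_x

print(doc({"KOPITAR", "KING"}, {"CORSI_FOR"}))
```

Here the pair sees a Corsi For on 2 of its 3 events and everyone else on 1 of 3, so the difference of confidence is about 0.33. The sketch would divide by zero if X appeared in every row or in none, which is one practical reason to drop rare rules.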

With all that in mind, here is a description of what I’m working with:

All Corsi events are treated equally. I haven’t made adjustments for quality of competition,

We are limited to how often player-event combinations occur within a team. There are two sides to this: the first is how often a situation exists; the second is how often a player is in the situation to begin with

All analysis was done within teams. Comparing between teams may not be useful.

I believe the best application of this analysis is to compare how players are faring in their current roles on their teams. We can highlight player chemistry, their performances relative to teammates in the same position, and (hopefully) hidden potential. Over time, we may also be able to use multiple seasons to see how a player has performed with different teammates and team strengths.

I’ve generated six graphs for each team:

All individual performances

Defensemen, as individuals

Defensemen, as pairs

Forwards, as individuals

Forwards, as pairs

Forwards, as trios

Support for Corsi Against is on the left; support for Corsi For is on the right. More green means that the Difference of Confidence is higher for that statistic, implying that the event is more likely given that the specific player combination is on the ice. More purple is the opposite (an event is relatively less likely given that player combination). Rules with fewer than 25 occurrences were dropped, so if a column is missing on one side it’s probably because it missed that threshold.

Important note: The initial batch of images I uploaded on Feb 13th had some errors based on when players were on the ice. I have since fixed the error in my code but I won’t be immediately redoing the analysis for each team. Here is a direct link to the album.

Examples and discussion

Toronto Maple Leafs, 2013-14:

Possession numbers are terrible across the board

Phaneuf is on the ice for about a fifth of even-strength tied shot attempts against with a very strong difference of confidence against him

Jake Gardiner and Morgan Rielly have the strongest indication in favour of possession among defensive pairings

Forward possession is driven by Kessel and Van Riemsdyk. The forward pairs suggest that Kessel and JVR have stronger chemistry with each other than either has with Bozak

Kadri’s supports for Corsi For/Against are better when he is paired with either winger, and with a stronger difference of confidence, suggesting he could be a better first line centre

Lupul and McClement had awful years

Los Angeles Kings, 2013-14:

The top pairing of Doughty-Muzzin is strong. The difference of confidence at an individual level is much higher in Muzzin’s favour, though he and Doughty don’t seem to have shared defence partners at any point

One way to gauge shot metrics is to measure their relationship to wins. We’ll use data from the last four full seasons: 2009-2010 to 2013-14, excluding the lockout-shortened season. Running the analysis and including the shortened season does not change the overall conclusions, but the relationships between all variables come out slightly weaker. Two games are missing from the dataset (an OTT-BUF game from 2009-10 and a WSH-CAR game from 2010-11).

Variables used

The data considered is during regulation time only. Additionally, we will be measuring regulation win percentage instead of the usual win percentage; teams that win in overtime or the shootout are not considered to have a regulation win. Corsi, Fenwick, and Shot Percentage are defined as usual: (Shot attempts for) / (Shot attempts for + Shot attempts against). When using Home/Away as a factor (dummy/indicator variable), the dataset is split and the win percentages refer to home and away regulation win percentages. Because of the two missing games, statistics for the teams involved will be slightly different from calculations using the NHL’s official results.

In all cases, the best predictor by far was the percentage of goals scored. This should not be surprising as winning is defined by outscoring your opponent. However, since goals are fairly rare, we would like to use more common events in analysis; goals are included for the sake of comparison, but we won’t dwell on their predictive value.

First case: All situations

We use all game data and don’t differentiate between home and away. The model being proposed is REGULATION WIN % = SHOT METRIC % + ε

| Metric(s) | R² | Adjusted R² |
|---|---|---|
| Corsi % | 0.2652 | 0.259 |
| Fenwick % | 0.2818 | 0.2757 |
| Shot % | 0.2897 | 0.2836 |
| Goal % | 0.8253 | 0.8238 |
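For readers who want to see how the two columns relate, here is a short Python sketch of the usual adjusted R² formula. The sample size is an assumption: the post doesn't state it, so the sketch takes one observation per team-season (30 teams over four seasons, n = 120) with a single predictor.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Assumption: n = 30 teams x 4 seasons = 120 team-seasons, p = 1 predictor.
N_OBS, N_PREDICTORS = 120, 1

reported_r2 = {"Corsi %": 0.2652, "Fenwick %": 0.2818, "Shot %": 0.2897, "Goal %": 0.8253}
for metric, r2 in reported_r2.items():
    print(metric, round(adjusted_r2(r2, N_OBS, N_PREDICTORS), 4))
```

Under that assumption the recomputed values agree with the table to about three decimal places; small differences in the last digit are expected because the R² inputs are already rounded.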

Second case: All situations, split by home and away

We use all game data, but split the season into home and away games. The model being proposed is REGULATION WIN % = SHOT METRIC% + HOME AWAY STATUS + ε.

| Metric(s) | R² | Adjusted R² |
|---|---|---|
| Corsi % + Home/Away | 0.298 | 0.2921 |
| Fenwick % + Home/Away | 0.3124 | 0.3066 |
| Shot % + Home/Away | 0.31 | 0.3042 |
| Goal % + Home/Away | 0.8056 | 0.804 |
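To show what adding the home/away indicator does mechanically, here is a hedged Python sketch that fits both models by ordinary least squares with NumPy. The data are synthetic (random numbers with an invented home advantage), and n = 240 assumes each of 120 team-seasons is split into a home row and an away row; none of the numbers it produces correspond to the tables in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 240  # assumption: 120 team-seasons, each split into home and away

corsi_pct = rng.normal(0.50, 0.03, n)   # synthetic shot metric
home = np.tile([1.0, 0.0], n // 2)      # dummy variable: 1 = home split, 0 = away split
# Synthetic response with an invented slope and home advantage, plus noise.
win_pct = 0.45 + 2.0 * (corsi_pct - 0.50) + 0.05 * home + rng.normal(0, 0.08, n)

def r_squared(X, y):
    """Fit y = X @ beta by least squares and return R^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)  # intercept column
r2_metric_only = r_squared(np.column_stack([ones, corsi_pct]), win_pct)
r2_with_home = r_squared(np.column_stack([ones, corsi_pct, home]), win_pct)

print("metric only:", round(r2_metric_only, 4))
print("metric + home/away:", round(r2_with_home, 4))
```

Adding a regressor can never lower R² on the same data, which is why the adjusted R² is the fairer column to compare across models.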

Third case: Even-strength 5v5 with the score tied, split by home and away

We only use game data where the score is tied and both teams are playing at full-strength. The model being proposed is REGULATION WIN % = SHOT METRIC % + HOME AWAY STATUS + ε.

| Metric(s) | R² | Adjusted R² |
|---|---|---|
| Corsi % + Home/Away | 0.3493 | 0.3438 |
| Fenwick % + Home/Away | 0.3369 | 0.3313 |
| Shot % + Home/Away | 0.3345 | 0.3289 |
| Goal % + Home/Away | 0.5 | 0.4958 |

Discussion

Without considering a team’s score differential or strength on the ice, the best shot metric is the actual shot percentage, explaining just under 29% of the variance in regulation win percentage. Splitting the dataset by home and away gives a better fit, with Fenwick percentage slightly more predictive than shot percentage. Reducing our data to even-strength 5v5 play with the score tied, we find that Corsi percentage becomes the strongest predictor of regulation win percentage with an adjusted R² of 0.3438.

This model doesn’t consider save percentage, special teams, or the myriad other aspects that compose a winning team. Considering how much happens in a game, Corsi percentage and home/away status alone act as very useful predictors, lending evidence that even-strength, tied shot attempts make a good metric for analysis.

Regression diagnostics

These are the diagnostics for the full-strength tied regression. The residuals appear to satisfy conditions for normality and homoscedasticity centred about mean zero.

It’s well-established that a team’s shooting numbers vary based on the period and the state of the game. As a team falls behind, it plays more aggressively, and vice versa. You can see it clearly in these graphs from the last five seasons. Each graph measures the league average even-strength Corsi percentage for the entire season, by score difference (down by two or more, down by one, tied, up by one, up by two or more)

2009-2010

2010-2011

2011-2012

2012-2013

2013-2014

Two consistent patterns appear in every season:

Possession declines with a lead and increases with a deficit.

The first and second periods are very similar, but in the third period there is a remarkable change in the magnitude of the differences in possession

Note: The 2012-2013 season was shortened by a lockout; while the overall conclusions are the same, the differences in the third period are much starker.

Lots of people ask the basic question: are the Corsi and Fenwick statistics any good? The logic behind them (that teams that control the puck more are more successful) is sound, but often unexamined. One way to judge their usefulness is to check how they relate to other variables, like goals for/against, or how they correlate to winningness relative to another statistic like save percentage.

Pairwise correlations compare two variables at a time. The diagonal (with the histograms) shows the distribution of a particular variable as well as its name. Each panel below the diagonal has a scatterplot of the two variables above it and to its right; each panel above the diagonal prints the correlation coefficient for the corresponding pair of variables, with the font size increasing for larger correlations. An obvious example is the comparison between the Fenwick% and Corsi%: the correlation coefficient is 0.97, and the scatterplot forms a nearly perfect line, telling us that the Corsi% and Fenwick% are very, very closely related.
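The numbers printed in such a plot are just the entries of a correlation matrix. Here is a minimal Python sketch using NumPy on synthetic data; the variable names and distributions are invented for illustration, with the Fenwick-like column deliberately built from the Corsi-like one so the two come out closely related.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams = 150  # arbitrary number of synthetic team-seasons

corsi = rng.normal(0.50, 0.03, n_teams)
fenwick = corsi + rng.normal(0, 0.005, n_teams)  # built to track Corsi closely
save_pct = rng.normal(0.920, 0.006, n_teams)     # generated independently

names = ["Corsi%", "Fenwick%", "Save%"]
data = np.column_stack([corsi, fenwick, save_pct])

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.2f}")
```

The matrix is symmetric with ones on the diagonal, which is why the plot can show scatterplots on one side and coefficients on the other without losing anything.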

Looking at the rightmost column tells you how each of the variables relates to the percentage of possible points a team could earn. The Corsi% has a coefficient of 0.51, and the Fenwick 0.54. Save percentage is also at 0.51, meaning that a high save percentage correlates to winningness about as strongly as puck possession does. The top row relates to outhitting your opponent. I’ve discussed that a bit in my first post, so it suffices to say that if it’s a useful metric at all, it tends to be negatively related to winning.

The two remaining variables are Goals Against Per Game (GAPG) and Goals For Per Game (GFPG). The obvious conclusions are evident: GAPG is very negatively correlated with earning more points (-0.74), whereas GFPG is very positively correlated (0.62). The less obvious conclusions are there too, telling us that puck possession is positively correlated with GFPG (0.32 and 0.31 for Fenwick% and Corsi%, respectively) and negatively correlated with GAPG (-0.50 and -0.49 for Fenwick% and -0.5 for Corsi%, respectively). All of this demonstrates that puck possession stats look to be good predictors of success.

Controlling for save percentage

In my first post I uploaded graphs that showed a strong link between puck possession and success in the regular season and playoffs. The issue with that, though, is that some teams had very good possession numbers, yet didn’t qualify for the playoffs, while others achieved the opposite. The 2010-2011 Bruins had an even-strength Corsi of 50.73% — fairly average — and yet they won the Stanley Cup. Last year’s Toronto Maple Leafs, however, managed to have possession numbers among the worst in the last five years, but still qualified for the playoffs. One explanation that gets thrown around is save percentage — and it turns out it’s a decent one.

This graph plots the percentage of possible points that teams earned in a regular season against the Corsi%. The graphs are split into six groups based on a team’s save percentage. The teams with the lowest save percentages (roughly below 0.910) are in the bottom left, and the teams with the highest (roughly above 0.928) are at the top right. The main things to notice here:

The teams with the lowest save percentages necessarily need a good Corsi to qualify for the playoffs. As the save percentages get lower, teams with lower Corsi percentages have a harder time making the playoffs; you can see this by the positions of the black dots (non-qualifying teams) in the bottom row.

Stanley Cup champions have had middling goaltending in the regular season (Chicago, 2009-2010), but they need damn good possession stats

Teams with consistently elite goaltending (Boston, 2010-2011) can win the Stanley Cup with a fairly average Corsi

What’s the moral of the story then? Good puck possession and a solid goalie are keys to winning (duh). If a team is weak in one of the two areas, though, then they must counterbalance with strength in the other, especially if the weakness is possession.[1] A team with terrible possession stats might sneak into the playoffs, but don’t expect them to go anywhere without their goalie stealing the Cup.

[1] If your Fenwick % is floating around 0.45, you should probably work on that instead of hunting down Dominik Hasek for his blood.

Last week, the Hart Memorial Trophy candidates were announced. According to the infallible internet, it’s likely that Crosby will win, but let’s figure out if that’s true. To get the obvious stats out of the way, Crosby (36G, 68A, 80GP), Getzlaf (31G, 56A, 77GP), and Giroux (28G, 58A, 82GP) finished first, second, and third in scoring this year, with points per game of 1.30, 1.13, and 1.05, respectively. Pittsburgh, Anaheim, and Philadelphia finished with 242, 263, and 233 goals, respectively. Since the Hart looks at a player’s value to his team, it makes sense to look at his contributions to the team’s overall scoring.

Candidate’s points (in teal) as a proportion of a team’s overall goals.
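The arithmetic behind these figures is simple enough to sketch in Python. The inputs are the goals, assists, games played, and team goal totals quoted above; the "share" here is an assumed reading of the chart, namely the candidate's points divided by his team's goals.

```python
# Figures quoted in this post: goals, assists, games played, team goals.
candidates = {
    "Crosby":  {"goals": 36, "assists": 68, "games": 80, "team_goals": 242},
    "Getzlaf": {"goals": 31, "assists": 56, "games": 77, "team_goals": 263},
    "Giroux":  {"goals": 28, "assists": 58, "games": 82, "team_goals": 233},
}

summary = {}
for name, s in candidates.items():
    points = s["goals"] + s["assists"]
    summary[name] = {
        "points": points,
        "points_per_game": round(points / s["games"], 2),
        # Assumed definition: points as a proportion of the team's goals.
        "share_of_team_goals": round(points / s["team_goals"], 3),
    }

for name, row in summary.items():
    print(name, row)
```

By this measure Crosby had a point on roughly 43% of Pittsburgh's goals, compared with about 37% for Giroux and 33% for Getzlaf.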

Looking at points alone, Crosby has a pretty huge head start over the other two. Now, the Hart (allegedly) isn’t the Art Ross 2.0, so it makes sense for us to look at possession statistics and some frequencies from association rule mining.

| Team | Strength | Candidate | Corsi % | Fenwick % |
|---|---|---|---|---|
| Pittsburgh | All | On | 0.6036322 | 0.6095969 |
| Pittsburgh | All | Off | 0.4225558 | 0.4267751 |
| Anaheim | All | On | 0.5423348 | 0.5500000 |
| Anaheim | All | Off | 0.4812510 | 0.4914966 |
| Philadelphia | All | On | 0.5998837 | 0.5955325 |
| Philadelphia | All | Off | 0.4552777 | 0.4539683 |
| Pittsburgh | Even | On | 0.5308595 | 0.5372152 |
| Pittsburgh | Even | Off | 0.4638907 | 0.4685562 |
| Anaheim | Even | On | 0.5193694 | 0.5249392 |
| Anaheim | Even | Off | 0.4928212 | 0.4995057 |
| Philadelphia | Even | On | 0.5437117 | 0.5356383 |
| Philadelphia | Even | Off | 0.4811052 | 0.4763085 |

You may have noticed that Crosby and Giroux have a larger impact on the ice for their team than Getzlaf. These differences become much clearer when they’re visualized.

At this point it becomes clear that if there’s any competition, it’s between Crosby and Giroux. While all of the players improve their teams’ performances, it’s obvious that Getzlaf’s relative contribution is not as strong as either of the other two.
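One simple way to put a number on "relative contribution" is the on-ice minus off-ice difference for each row pair in the table. The Python sketch below does that with the table's values; using the raw difference as the comparison is my own simplification, not necessarily how the graphs were built.

```python
# (team, strength) -> on-ice and off-ice (Corsi %, Fenwick %), copied from the table.
on_off = {
    ("Pittsburgh", "All"):    {"on": (0.6036322, 0.6095969), "off": (0.4225558, 0.4267751)},
    ("Anaheim", "All"):       {"on": (0.5423348, 0.5500000), "off": (0.4812510, 0.4914966)},
    ("Philadelphia", "All"):  {"on": (0.5998837, 0.5955325), "off": (0.4552777, 0.4539683)},
    ("Pittsburgh", "Even"):   {"on": (0.5308595, 0.5372152), "off": (0.4638907, 0.4685562)},
    ("Anaheim", "Even"):      {"on": (0.5193694, 0.5249392), "off": (0.4928212, 0.4995057)},
    ("Philadelphia", "Even"): {"on": (0.5437117, 0.5356383), "off": (0.4811052, 0.4763085)},
}

# Difference (on minus off) for Corsi % and Fenwick %, rounded to four decimals.
differentials = {
    key: tuple(round(on - off, 4) for on, off in zip(v["on"], v["off"]))
    for key, v in on_off.items()
}

for (team, strength), (corsi_diff, fenwick_diff) in differentials.items():
    print(f"{team:13s} {strength:5s} Corsi {corsi_diff:+.4f}  Fenwick {fenwick_diff:+.4f}")
```

In all situations the gap is about 18 points of Corsi % for Pittsburgh with and without Crosby, about 14 for Philadelphia with and without Giroux, and about 6 for Anaheim with and without Getzlaf.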

Looking deeper, the Fenwick percentage at even-strength tilts the odds further towards Crosby and quite a bit farther away from Getzlaf. The next combination of graphs compares the game-by-game Fenwick.

Blue and red lines represent season averages with and without the player on the ice, respectively. Black lines are the team average.

Two things to notice here: (1) Pittsburgh’s possession stats with Crosby are higher than Philadelphia’s with Giroux; and (2) Pittsburgh’s possession stats without Crosby are lower than Philadelphia’s without Giroux. This is especially evident when you look at the gaps between points: Giroux is very good, but Crosby absolutely lifts his team. This caught me a bit off guard, since until I wrote this post I hadn’t noticed that Pittsburgh finished the regular season with Corsi and Fenwick percentages below 0.500.

When we look at association rules, we see the same story being told, albeit in a different manner.

| Rank | Player | Event | Support | Confidence |
|---|---|---|---|---|
| 1 | Any | Shot for | 0.155 | 0.155 |
| 2 | Any | Shot against | 0.149 | 0.149 |
| 3 | Any | Hit for | 0.137 | 0.137 |
| 4 | Any | Hit against | 0.133 | 0.133 |
| 5 | Any | Block for | 0.075 | 0.075 |
| 6 | Sidney Crosby | Shot for | 0.073 | 0.201 |
| 7 | Any | Block against | 0.069 | 0.069 |
| 8 | Any | They miss | 0.062 | 0.062 |
| 9 | Chris Kunitz | Shot for | 0.062 | 0.201 |
| 10 | Matt Niskanen | Shot for | 0.061 | 0.178 |

| Rank | Player | Event | Support | Confidence |
|---|---|---|---|---|
| 1 | Any | Shot against | 0.150 | 0.150 |
| 2 | Any | Shot for | 0.149 | 0.149 |
| 3 | Any | Hit for | 0.130 | 0.130 |
| 4 | Any | Hit against | 0.125 | 0.125 |
| 5 | Any | Block against | 0.077 | 0.077 |
| 6 | Any | Block for | 0.072 | 0.072 |
| 7 | Claude Giroux | Shot for | 0.064 | 0.188 |
| 8 | Braydon Coburn | Shot against | 0.061 | 0.175 |
| 9 | Any | We miss | 0.060 | 0.060 |
| 10 | Jakub Voracek | Shot for | 0.057 | 0.200 |

The main takeaway from Crosby’s table is how high up his generation of offense is: about 7.3% of all active events in the game are a Pittsburgh shot on goal while he’s on the ice, and when he’s on the ice, there’s a 20.1% chance that the active event will be a Pittsburgh shot hitting the net. In fact, Crosby was on the ice for a Pittsburgh shot on goal more often than any player was on the ice for an opponent having their shot blocked. Crosby’s linemate Kunitz is on the ice for a Pittsburgh shot in only 6.2% of events, so that suggests that Crosby is doing quite a bit on his own. (As a note, you’ll see similar stuff for players like Erik Karlsson, who tend to be head and shoulders above their teammates, even if their teammates are very skilled on their own.)
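Two more quantities can be backed out of Crosby's row using the definitions of confidence and lift. The Python sketch below does this with the rounded values from his table, and it assumes the "Any / Shot for" row is the overall support of the event; because the inputs are rounded to three decimals, the results are approximate.

```python
# Rounded values read off Crosby's table.
supp_shot_for = 0.155         # SUPP(Y): the "Any / Shot for" row (assumed to be P(Y))
supp_crosby_shot_for = 0.073  # SUPP(X, Y): Crosby on the ice and a Pittsburgh shot
conf_crosby_shot_for = 0.201  # CONF(X => Y) = P(Y | X)

# CONF = SUPP(X, Y) / SUPP(X), so SUPP(X) = SUPP(X, Y) / CONF.
crosby_ice_share = supp_crosby_shot_for / conf_crosby_shot_for

# LIFT = SUPP(X, Y) / [SUPP(X) * SUPP(Y)] = CONF / SUPP(Y).
crosby_lift = conf_crosby_shot_for / supp_shot_for

print("share of events with Crosby on the ice:", round(crosby_ice_share, 3))
print("lift of Crosby => Shot for:", round(crosby_lift, 3))
```

On these rounded inputs Crosby is on the ice for a bit over a third of all recorded events, and a Pittsburgh shot is roughly 1.3 times as likely with him on the ice as it is overall.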

What’s the conclusion here? In terms of relative contributions to their teams, this is a race between Crosby and Giroux — one that Crosby will very probably win.

As part of a personal project, I started scraping regular season NHL play-by-play data from 2009/10 to 2013/14. I took pages like this one and made them readable to a computer, meaning I could work with the data at a really low level. It’s a detailed dataset to work with; any time an event (e.g. faceoff, hit, shot, etc.) happens, you get a list of the players from each team, their strength, and other game-related data. Mining the data was only a minor pain in the ass thanks to the Python library Beautiful Soup. After formatting the data and running it through R, I got some pretty graphs out of it. Oh, I also found some neat statistical relations.

First, I looked at the Corsi and Fenwick, since they’re new and exciting and discussed frequently. They’re the difference of the number of attempted shots between your team and the opposing team. The Corsi counts all attempts, and the Fenwick excludes blocked shots. Full details are available at Pension Plan Puppets (Corsi, Fenwick).
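As a quick illustration of the two definitions, here is a Python sketch with made-up attempt counts for one game. The function names and the numbers are hypothetical; `blocked` means a team's own shot attempts that the opponent blocked.

```python
def corsi_fenwick(shots_on_goal, missed, blocked):
    """Return (Corsi, Fenwick) shot-attempt counts for one team."""
    fenwick = shots_on_goal + missed  # unblocked attempts only
    corsi = fenwick + blocked         # all attempts, including blocked ones
    return corsi, fenwick

def pct(count_for, count_against):
    """Share of attempts taken by our team: for / (for + against)."""
    return count_for / (count_for + count_against)

# Hypothetical counts: (shots on goal, missed shots, attempts that were blocked).
corsi_for, fenwick_for = corsi_fenwick(30, 12, 14)
corsi_against, fenwick_against = corsi_fenwick(25, 10, 10)

print("Corsi differential:", corsi_for - corsi_against)
print("Fenwick differential:", fenwick_for - fenwick_against)
print("Corsi %:", round(pct(corsi_for, corsi_against), 4))
print("Fenwick %:", round(pct(fenwick_for, fenwick_against), 4))
```

The differential is the raw count difference described above, while the percentage form rescales it so that 0.500 means both teams attempted the same number of shots.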

The results are pretty cool (see if you can spot the Edmonton Oilers!):

The line of best fit shows the linear relationship between the variables — generally, a higher Corsi/Fenwick means a team is more successful at earning points.

I did something similar for hits percentage. A team with a hit percentage of 0.500 hits exactly as often as its opponents, whereas a team below 0.500 gets out-hit. There’s a strong negative relationship between hitting and the possession stats. This makes sense intuitively, since if you’re hitting then you probably don’t have the puck.

The surprising part for me is that three of the last four Stanley Cup winners were out-hit in the regular season. Of course, this comes with all sorts of pitfalls – the definition of a hit is loose, and it’s possible that there’s a lot of bias working its way in. For all we know, crappy teams might have scorekeepers who count too many hits for the home team, and good teams might have the opposite situation. Regardless, it’s interesting.

Association rules

All of this is still game-level or season-level data though. The real fun of play-by-play data is that we can do stuff like association rule learning on it. One of its most well-known applications is market basket analysis. Let’s say you own a grocery store and want to know what people buy together. A basket would have something like milk, eggs, and bread in it. You can then create a rule: {milk,eggs}=>{bread}, meaning that the presence of milk and eggs is associated with the presence of bread. If you do this for every basket, you’ll see some rules come up more often than others. Rules like {milk,eggs}=>{bread} and {chips,cola}=>{salsa} would appear more often than, say, {sausage,bacon}=>{halal chicken}. You can use different measures to answer questions like: “Would higher sales of nachos increase sales of salsa?” or “Is someone more likely to buy Advil if they’re buying diapers?”. With play-by-play data we can do the same for players and events, like {Phil Kessel, James van Riemsdyk}=>{Shot taken}, or {Marc-Andre Fleury}=>{Comically reckless puck play}.
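To show how rules "come up more often than others", here is a minimal Python sketch that enumerates every one-item-to-one-item rule in six made-up baskets and ranks them. The baskets are invented, and real tools such as the Apriori algorithm prune the search far more cleverly; this brute-force version is only meant to show the counting.

```python
from collections import Counter
from itertools import permutations

# Hypothetical grocery baskets.
baskets = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"chips", "cola", "salsa"},
    {"chips", "salsa"},
    {"bread"},
]

# How many baskets contain each item, and each ordered pair (lhs, rhs) of items.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in permutations(sorted(b), 2))

rules = []
for (lhs, rhs), together in pair_counts.items():
    support = together / len(baskets)         # how often lhs and rhs appear together
    confidence = together / item_counts[lhs]  # P(rhs in basket | lhs in basket)
    rules.append((lhs, rhs, support, confidence))

# Rank by confidence, then support, then alphabetically for a stable order.
rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))

for lhs, rhs, support, confidence in rules[:5]:
    print(f"{{{lhs}}} => {{{rhs}}}  support={support:.2f}  confidence={confidence:.2f}")
```

Swap the grocery items for player names and event labels and the same counting gives the player-event rules discussed here.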

For an example, we’ve got the 2013-14 Toronto Maple Leafs. The support is how often an event occurs, out of all events in the dataset; the confidence is the chance that an event happens if a particular set of players is on the ice; and the lift compares how often the players and the event appear together with how often they would if they were independent, so a lift close to 1 suggests independence. For the Leafs, the most frequent player-event combination is Dion Phaneuf being on the ice during an opponent’s shot. Next, it’s Phil Kessel being on the ice when the Leafs have a shot.

| Rank | Player(s) | Event | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 1 | DION_PHANEUF | SHOT_AGAINST | 0.06218755 | 0.1680772 | 1.0388827 |
| 2 | PHIL_KESSEL | SHOT_FOR | 0.05619953 | 0.1645753 | 1.3070524 |
| 3 | JAMES_VAN_RIEMSDYK | SHOT_FOR | 0.05378234 | 0.1624896 | 1.2904881 |
| 4 | CARL_GUNNARSSON | SHOT_AGAINST | 0.05361754 | 0.1782648 | 1.1018523 |
| 5 | JAMES_VAN_RIEMSDYK | SHOT_AGAINST | 0.05356260 | 0.1618257 | 1.0002423 |
| 6 | PHIL_KESSEL | SHOT_AGAINST | 0.05092567 | 0.1491313 | 0.9217781 |
| 7 | CODY_FRANSON | SHOT_AGAINST | 0.05004670 | 0.1556467 | 0.9620497 |
| 8 | DION_PHANEUF | SHOT_FOR | 0.04916772 | 0.1328879 | 1.0553920 |
| 9 | JAKE_GARDINER | SHOT_AGAINST | 0.04905785 | 0.1444750 | 0.8929978 |
| 10 | PHIL_KESSEL, JAMES_VAN_RIEMSDYK | SHOT_FOR | 0.04823381 | 0.1760931 | 1.3985262 |

The table below ranks player-events by confidence. If Kessel, JVR, Phaneuf, and Franson are all on the ice, there’s a 28.53% chance that the event is the Leafs taking a shot. The lift is 2.27, which means that the combination of players and the event is probably not a coincidence. If you want a simpler example, look at Colton Orr: when he’s on the ice, there’s a 24.79% chance that an event will be a Leaf (probably him) making a hit. The lift is 1.74, so the fact that the Leafs are hitting is likely because he’s on the ice. Looking at the table above, Dion Phaneuf is often on the ice when there’s a shot against, but the lift is close to 1; my interpretation is that he’s on the ice so often that he’s bound to be around when they’re stuck in their own end.

| Number | Player(s) | Event | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 1 | PHIL_KESSEL, JAMES_VAN_RIEMSDYK, DION_PHANEUF, CODY_FRANSON | SHOT_FOR | 0.01032797 | 0.2852807 | 2.265692 |
| 2 | PHIL_KESSEL, DION_PHANEUF, CODY_FRANSON | SHOT_FOR | 0.01098720 | 0.2824859 | 2.243495 |
| 3 | JAMES_VAN_RIEMSDYK, DION_PHANEUF, CODY_FRANSON | SHOT_FOR | 0.01043784 | 0.2749638 | 2.183755 |
| 4 | COLTON_ORR | HIT_FOR | 0.01587650 | 0.2478559 | 1.740633 |
| 5 | DION_PHANEUF, CODY_FRANSON | SHOT_FOR | 0.01241554 | 0.2364017 | 1.877495 |
| 6 | TYLER_BOZAK, PHIL_KESSEL, JAMES_VAN_RIEMSDYK, CODY_FRANSON | SHOT_FOR | 0.01285502 | 0.2186916 | 1.736842 |
| 7 | TYLER_BOZAK, JAMES_VAN_RIEMSDYK, CODY_FRANSON | SHOT_FOR | 0.01345932 | 0.2156690 | 1.712837 |
| 8 | PHIL_KESSEL, JAMES_VAN_RIEMSDYK, CODY_FRANSON | SHOT_FOR | 0.02126023 | 0.2086253 | 1.656897 |
| 9 | DION_PHANEUF, JAY_MCCLEMENT | SHOT_AGAINST | 0.02010658 | 0.2067797 | 1.278102 |
| 10 | TYLER_BOZAK, PHIL_KESSEL, CODY_FRANSON | SHOT_FOR | 0.01395374 | 0.2055016 | 1.632088 |

Looking at the events alone can give a decent overview of a team’s overall strategy. For the Leafs, it was some combination of getting outshot and hoping your goalie keeps you in the game, while outhitting your opponents and hoping that generates offense somehow. One thing to note is just how badly the Leafs get outshot. Out of all events considered, about 30.25% are the other team taking a shot, versus 23.41% for the Leafs taking a shot. This suggests that the Leafs spend a lot of time playing in their own defensive zone.

| Rank | Event | Support |
|---|---|---|
| 1 | SHOT_AGAINST | 0.16178652 |
| 2 | HIT_FOR | 0.14239411 |
| 3 | SHOT_FOR | 0.12591331 |
| 4 | HIT_AGAINST | 0.12036478 |
| 5 | THEY_MISS | 0.07059276 |
| 6 | BLOCK_FOR | 0.07020821 |
| 7 | BLOCK_AGAINST | 0.06086909 |
| 8 | WE_GIVE | 0.04900291 |
| 9 | WE_MISS | 0.04751964 |
| 10 | THEY_GIVE | 0.04482778 |

One last thing you can do is visualize these rules with the R package arulesViz. This organizes the rules we’ve created by lift and then groups them together. You can get a rough idea of what happens on the ice given that certain players are present. Darker circles mean it’s more likely that a player’s presence is causing an event, and larger circles mean that the event is more common. A large, dark circle (like SHOT_FOR under Phil Kessel) suggests that the player-event combination is frequent and likely caused by the player’s presence. A small, darker circle (like HIT_FOR under Colton Orr) suggests that a player has a very focused purpose — in this case, Colton Orr isn’t on the ice much, but when he is, he’s out to hit somebody.

Of course, everything here needs to be taken in context. Team strategy impacts individual players, so in some cases it can increase or decrease a player’s performance in some areas. Regardless, there’s a lot to look at here — we’re just scratching the surface.

Technical notes

Some games may be incomplete because not all events past a certain point were recorded. There were around 5600 games in the last five seasons, so I didn’t have time to verify all of them. The aggregate statistics (e.g. shots over a season) seem to match up though, so this is accurate enough for our purposes.

Some games are missing from the NHL website, so they were scraped from the Internet Archive (archive.org). Game 0836 from 2009-10 is missing entirely. If you have any conspiracy theories on why Bettman doesn’t want us to know what really happened in Buffalo that night, let me know.

The events I considered were hits, shots, blocked shots, missed shots, takeaways, giveaways, penalties, and goals. I consider these “active” events, since they happen during play. The others have their own analytical value and I plan to look at them another time.

Association rule learning can be written in terms of probabilities:

Support(X) = P(X)

Confidence(X => Y) = P(Y|X)

Lift(X => Y) = P(XY)/[P(X)P(Y)]

Two events X and Y are independent iff P(XY)=P(X)P(Y). A lift of 1 is evidence that two events are independent — a player-event combination with a lift of 1 probably has no significance. You can find more on lift here.
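Since Lift(X => Y) = P(XY)/[P(X)P(Y)] = P(Y|X)/P(Y), the lift of any rule can be recomputed from its confidence and the support of the event alone. The Python sketch below checks this against three rows of the Leafs tables in this post; the variable names are mine, and the comparison is only approximate because the published figures are rounded.

```python
# Event supports, copied from the Leafs event table.
event_support = {
    "SHOT_AGAINST": 0.16178652,
    "SHOT_FOR": 0.12591331,
    "HIT_FOR": 0.14239411,
}

# (players, event, confidence, reported lift), copied from the Leafs rule tables.
rules = [
    ("DION_PHANEUF", "SHOT_AGAINST", 0.1680772, 1.0388827),
    ("PHIL_KESSEL", "SHOT_FOR", 0.1645753, 1.3070524),
    ("COLTON_ORR", "HIT_FOR", 0.2478559, 1.740633),
]

recomputed = {}
for players, event, confidence, reported in rules:
    # Lift = P(Y | X) / P(Y)
    recomputed[(players, event)] = confidence / event_support[event]
    print(players, event, round(recomputed[(players, event)], 4), "reported:", reported)
```

The recomputed lifts agree with the reported ones to about four decimal places, which is a handy sanity check that the support, confidence, and lift columns are consistent with each other.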

The plot from arulesViz defaults to k-means clustering on lift to group rules together. It’s a novel technique and is fun to play around with.