Introducing a new stat: Location Adjusted Expected Goals Percentage

August 28, 2013 8 minute read

If you have been looking into statistics not recorded by the NHL officially, you probably know what Corsi is. From Hockey Prospectus,

“Corsi is essentially a plus-minus statistic that measures shot attempts. A player receives a plus for any shot attempt (on net, missed, or blocked) that his team directs at the opponent’s net, and a minus for any shot attempt against his own net. A proxy for possession.”

The most cited drawback of Corsi is that it gives a point-blank shot on goal and a missed shot from the center line the same weight. There has even been talk of players purposely taking more low quality shots to game the system once they hear that their coach or management uses Corsi as a performance metric. I created a new statistic, Location Adjusted Expected Goals Percentage, to fix that.

I will explain what I did to derive this statistic in layman’s terms, without going into too much boring detail, because I know math is not everyone’s cup of tea. If you are only looking for the super technical stuff, read the Methodology article.

Summary

Before we get into Location Adjusted Expected GoAls Percentage (LAEGAP), we will first explore what I call Expected Goals For (EGF) and Expected Goals Against (EGA). I collected all available play-by-play shot data from NHL.com (out of 720 regular season games, only 634 were intact) and calculated the recording bias of shot distances for each arena. Then I went through all even strength, non-empty net shots and goals and calculated the average shot percentage for each point in the rink. All of the data was then flipped to the east (right) half of the rink, and points that overlapped were added together. After that, each point was regressed with its neighbors, which basically means it takes a weighted average of its own shot percentage and its neighbors’. Finally, every available even strength, non-empty net shot and goal was processed, and the shot percentage at its location was added to the EGF or EGA of each player on the ice, depending on whether they were on the shooting team or the team being shot at.
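The location step above can be sketched roughly as follows. This is only an illustration, not the code behind the article: the coordinate convention (x < 0 is the west half), the 180 degree rotation used for flipping, the 3x3 neighborhood and the weights are all assumptions, since the exact choices are left to the Methodology article.

```python
from collections import defaultdict

def mirror_to_east(x, y):
    """Map a west-half shot onto the east half.

    Assumption: flipping is a 180 degree rotation about center ice, so a
    shot from the shooter's left stays on the shooter's left.
    """
    return (-x, -y) if x < 0 else (x, y)

def count_shots(shots):
    """shots: iterable of (x, y, is_goal). Returns {(x, y): [goals, attempts]}."""
    counts = defaultdict(lambda: [0, 0])
    for x, y, is_goal in shots:
        cell = mirror_to_east(x, y)
        counts[cell][0] += int(is_goal)
        counts[cell][1] += 1
    return counts

def smoothed_pct(counts, cell, self_weight=2.0, neighbor_weight=1.0):
    """Weighted average of a cell's shot percentage and its neighbors'.

    The weights are hypothetical; cells with no recorded shots are skipped.
    """
    cx, cy = cell
    total = weight_sum = 0.0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            goals, attempts = counts.get((cx + dx, cy + dy), (0, 0))
            if attempts == 0:
                continue
            weight = self_weight if (dx, dy) == (0, 0) else neighbor_weight
            total += weight * goals / attempts
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Calling `count_shots` on the full shot list and then `smoothed_pct` for a given cell would give the league shot percentage that each later shot from that cell is credited with.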

In one sentence:

“Expected Goals For (EGF) is the number of goals a player would be on the ice for if each shot had the league average shot percentage (chance of the puck going in) for the position where it was taken.”

If you are still confused, think of EGF as shot quality multiplied by the number of shots taken by your team when you are on the ice, and EGA as the same thing for the shots taken against your team.
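As a rough sketch of that bookkeeping, assuming each shot record lists its location and the players on the ice for each side (the field names `location`, `for_players` and `against_players` are made up for this example):

```python
from collections import defaultdict

def accumulate_expected_goals(shots, pct_at):
    """Credit each shot's league shot percentage to everyone on the ice.

    shots:  iterable of dicts with hypothetical keys 'location',
            'for_players' and 'against_players'.
    pct_at: function mapping a location to its league shot percentage.
    """
    egf = defaultdict(float)
    ega = defaultdict(float)
    for shot in shots:
        p = pct_at(shot["location"])
        for player in shot["for_players"]:
            egf[player] += p   # shooting team's skaters gain EGF
        for player in shot["against_players"]:
            ega[player] += p   # defending team's skaters gain EGA
    return egf, ega
```

A player who is on the ice for 100 shots for from a 10% location ends up with an EGF of about 10, no matter how many of those shots actually went in.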

A heat map of even strength shooting percentage distribution in the 2012-13 season.

Now that we have EGF, a measure of on-ice offensive events, and EGA, a measure of on-ice defensive events (or lack thereof), we can compare the two, because, ultimately, hockey is a game you win by scoring more than your opponent. Simply subtracting EGA from EGF is not good enough, because it gives high event players an advantage or a disadvantage, depending on whether they are positive possession players. For example, if our imaginary player, John, was on the ice for one shot for at a 10% shooting percentage location and one shot against at a 5% shooting percentage location every shift, for 100 shifts, he would have an EGF of 10 and an EGA of 5, a difference of 5 goals. Now imagine a second player, Ethan, who had the same percentages every shift but played 1000 shifts. His expected goals difference (expected +/-) would be 50 goals, even though he and John performed equally, possession-wise, every time they were on the ice.

The solution is simple: calculate the share of EGF in the total expected goals events (EGF + EGA). Now both John and Ethan have the same EG%: 66.7% (10/(10+5) = 100/(100+50)). This raises another problem. What if a third imaginary player, Jacob, had one shift that logged a shot for at a 5% location before he suffered a season-ending pinky toe strain? He would have an EG% of 100% (0.05/(0.05+0)). Clearly that’s not sustainable once the sample size increases (more shifts), so how do we tell small sample noise apart from actual performance?
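The three imaginary players can be checked with a few lines of Python. The function name `eg_pct` is just a label for this sketch:

```python
def eg_pct(egf, ega):
    """Share of expected goals that went in the player's favor."""
    total = egf + ega
    if total == 0:
        raise ValueError("no expected goal events recorded")
    return egf / total

john = eg_pct(10, 5)      # 100 shifts
ethan = eg_pct(100, 50)   # 1000 shifts, same per-shift results
jacob = eg_pct(0.05, 0)   # one shot for at a 5% location contributes 0.05
```

John and Ethan come out identical at two thirds, while Jacob's single shift gives a perfect 1.0, which is exactly the small sample problem described above.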

Thankfully, the math was already worked out for us in 1927 by mathematician Edwin Bidwell Wilson. It’s called a binomial proportion confidence interval (referred to as BPCI in this article). You might actually have seen it before. It’s fairly complicated, so instead of explaining how it works and how to calculate it, I am just going to explain what it does (if you are interested in how it works, read the Methodology article). BPCI basically gives the margin of sampling error that an estimate of a probability (EG%) will have. You might have seen something like “40% of surveyed voters will vote for Mr. Lincoln as President” in your local newspaper, with a note at the bottom saying something like “All results have a margin of error of +/- 5%”. This means that the surveyor is confident that the actual share of votes for Lincoln would be anywhere from 35% to 45%. As the sample size increases, this margin shrinks. The level of confidence you want in the estimate also affects the margin (the technical term for this level of confidence is coverage probability). The lower the confidence you are willing to accept, the smaller the margin. Generally a 95% confidence level is used, and it is what we will be using. To allow for easy sorting, we will take the lowest value in the interval. By taking the lowest value we will undervalue a player, particularly a low event one, much more often than we will overvalue them, which I think is the better of the two errors.

In one sentence:

“With the data I have, what is the lowest possible true EG% I will get, 95% of the time?”
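For readers who want to see it in code, here is a minimal sketch of the lower end of Wilson's interval. It assumes that EGF + EGA is treated as the number of trials and that z = 1.96 stands in for 95% confidence; the function name is made up for this example. Plugging in Devin Setoguchi's EGF of 24.4251 and EGA of 20.5732 from later in this article returns roughly 0.3997, in line with his listed LAEGAP, which suggests the assumptions are close to what was actually used.

```python
import math

def wilson_lower_bound(egf, ega, z=1.96):
    """Lower end of the Wilson score interval for EG%.

    Assumptions: the (fractional) trial count is EGF + EGA, and z = 1.96
    approximates a 95% confidence level.
    """
    n = egf + ega
    if n <= 0:
        return 0.0
    p = egf / n
    center = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - spread) / (1 + z * z / n)

john = wilson_lower_bound(10, 5)      # 100 shifts
ethan = wilson_lower_bound(100, 50)   # 1000 shifts
jacob = wilson_lower_bound(0.05, 0)   # a single shot for
```

All three imaginary players now get different values: Ethan's larger sample keeps him closest to his raw 66.7%, John is pulled down further, and Jacob's perfect but tiny sample collapses to almost nothing.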

Phew. Now that we’ve got all the math out of the way, here comes the fun part. Who excels the most in this new performance metric? Here are the top 30:

LAEGAP to GFPUB is the interval in which a player's true GF% lies, 95% of the time.

Does this mean the Hart and Norris trophies should’ve gone to Dan Boyle? Probably not. (The culture of giving the Norris to the best offensive defenseman, instead of it being “awarded to the defenseman who demonstrates throughout the season the greatest all-round ability in the position”, is something I disagree with, but that’s for another article.) We must remember that, just as with any other attribute that defines a great hockey player, you can’t evaluate players on one attribute alone. We must also take context, the most important factor in evaluating a player, into account. Having the best LAEGAP in the league doesn’t mean anything if you start in the offensive zone 80% of the time against the opponent’s fourth line, or if you spend more time in the box than on the ice. It is also important to remember that this system rewards those who played through a higher number of events. Defensemen will naturally be on the ice for more events, as they have more ice time than forwards on average, so they have an advantage when compared to forwards.

For the record, Crosby is 37th on the list.

Interesting Finds

Let’s look at the players who surprise us in this top 30. Mark Fayne, Brandon Saad, Anton Stralman, Andy Greene, and Matt Irwin are among the traditionally undervalued candidates on the list. Although we always knew from his Corsi this season that Gallagher, runner-up for this year’s Calder Trophy, was a great possession player, after adjusting for shot location it turns out he is most likely the best of all the forwards (he does have a 66% offensive zone start though). I say most likely because players who played less could still increase their LAEGAP as the sample size grows with more games. If all these players had the same arbitrary EGF and their EGA kept in proportion, Mark Fayne would actually be on top. (Basically, Mark Fayne would be the best possession player in our calculations if the difference in sample size were not an issue and everyone were able to maintain the exact same performance.)

LAEGAP vs Corsi

The point of LAEGAP is to fix Corsi’s lack of accounting for shot quality, so let’s see how we did. Unfortunately, the play-by-play data I collected doesn’t include missed shots, but I think shot differential is a good enough proxy for Corsi for our purposes.

Let’s first look at those who are unfairly discredited by Corsi. Among those with a negative shot differential (more shots against than for), Devin Setoguchi has the highest LAEGAP, 0.3997 (EGF of 24.43 and EGA of 20.57), even though he had a -25 shot differential. This means that although his team was outshot by 25 while he was on the ice, his teammates tended to shoot from more dangerous locations than the opposing team, going for quality over quantity. Here’s the rest of that list (only players with more than 15 games played this season are included):

| name | position | team | LAEGAP | games | EGF | EGA | shots for | shots against | shot diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Devin Setoguchi | Right Wing | Minnesota Wild | 0.399679 | 43 | 24.4251 | 20.5732 | 255 | 280 | -25 |
| David Legwand | Center | Nashville Predators | 0.395207 | 41 | 26.2646 | 23.1492 | 300 | 304 | -4 |
| Eric Brewer | Defenseman | Tampa Bay Lightning | 0.385438 | 41 | 31.2601 | 30.4075 | 338 | 377 | -39 |
| Jason Garrison | Defenseman | Vancouver Canucks | 0.385014 | 40 | 24.9074 | 22.6941 | 320 | 321 | -1 |
| Matt Cullen | Center | Minnesota Wild | 0.384695 | 37 | 19.7985 | 16.6929 | 210 | 232 | -22 |
| Mike Cammalleri | Center | Calgary Flames | 0.383323 | 40 | 22.8856 | 20.4519 | 273 | 281 | -8 |
| Bryan Allen | Defenseman | Anaheim Ducks | 0.383293 | 36 | 21.6961 | 19.0404 | 241 | 277 | -36 |
| Marc Methot | Defenseman | Ottawa Senators | 0.383155 | 40 | 29.6553 | 28.7386 | 369 | 370 | -1 |
| Martin St Louis | Right Wing | Tampa Bay Lightning | 0.382137 | 41 | 29.7007 | 28.9374 | 342 | 347 | -5 |
| Brad Stuart | Defenseman | San Jose Sharks | 0.380803 | 43 | 30.1785 | 29.7269 | 365 | 379 | -14 |
| Wayne Simmonds | Right Wing | Philadelphia Flyers | 0.379129 | 38 | 19.5821 | 16.9153 | 228 | 238 | -10 |
| Craig Smith | Center | Nashville Predators | 0.379099 | 37 | 17.5183 | 14.4988 | 186 | 198 | -12 |
| Nicklas Backstrom | Center | Washington Capitals | 0.376915 | 44 | 25.418 | 24.2575 | 315 | 328 | -13 |
| Nazem Kadri | Center | Toronto Maple Leafs | 0.376833 | 43 | 22.8493 | 21.0802 | 258 | 296 | -38 |
| Matt Stajan | Center | Calgary Flames | 0.374182 | 39 | 22.6381 | 21.0987 | 250 | 258 | -8 |
| Cam Fowler | Defenseman | Anaheim Ducks | 0.370258 | 32 | 19.7708 | 17.9313 | 227 | 259 | -32 |
| Keith Aulie | Defenseman | Tampa Bay Lightning | 0.368474 | 38 | 20.4526 | 18.9473 | 237 | 249 | -12 |
| Tommy Wingels | Center | San Jose Sharks | 0.367168 | 37 | 20.1443 | 18.6854 | 222 | 236 | -14 |
| Nick Foligno | Left Wing | Columbus Blue Jackets | 0.364859 | 41 | 23.6026 | 23.3645 | 267 | 288 | -21 |
| Alex Killorn | Center | Tampa Bay Lightning | 0.360014 | 34 | 18.5661 | 17.3444 | 201 | 226 | -25 |
| Matt Halischuk | Right Wing | Nashville Predators | 0.357274 | 32 | 14.4134 | 12.3335 | 147 | 156 | -9 |
| Jaromir Jagr | Right Wing | Dallas Stars | 0.35502 | 32 | 19.9777 | 19.6615 | 226 | 229 | -3 |
| Eric Fehr | Right Wing | Washington Capitals | 0.354895 | 38 | 17.7587 | 16.7559 | 221 | 227 | -6 |
| Joel Ward | Right Wing | Washington Capitals | 0.352883 | 35 | 16.3444 | 15.091 | 202 | 205 | -3 |
| Mike Richards | Center | Los Angeles Kings | 0.351278 | 42 | 19.6898 | 19.6549 | 234 | 236 | -2 |
| Clarke MacArthur | Left Wing | Toronto Maple Leafs | 0.350016 | 35 | 17.5774 | 16.9476 | 212 | 224 | -12 |
| Deryk Engelland | Defenseman | Pittsburgh Penguins | 0.349942 | 35 | 17.6935 | 17.1085 | 223 | 228 | -5 |
| Rich Clune | Left Wing | Nashville Predators | 0.346042 | 40 | 15.9891 | 15.175 | 171 | 184 | -13 |
| Cody McLeod | Left Wing | Colorado Avalanche | 0.342992 | 42 | 16.6384 | 16.3021 | 207 | 208 | -1 |
| Emerson Etem | Right Wing | Anaheim Ducks | 0.338121 | 33 | 11.4053 | 9.7216 | 144 | 147 | -3 |
| Matt Beleskey | Left Wing | Anaheim Ducks | 0.337689 | 36 | 14.5212 | 13.8578 | 170 | 185 | -15 |
| Daniel Paille | Left Wing | Boston Bruins | 0.336772 | 40 | 15.1781 | 14.8251 | 195 | 214 | -19 |
| Rene Bourque | Right Wing | Montréal Canadiens | 0.33275 | 23 | 11.1159 | 9.6442 | 131 | 135 | -4 |
| Martin Erat | Right Wing | Nashville Predators | 0.332706 | 29 | 14.3807 | 14.0438 | 180 | 186 | -6 |
| Hal Gill | Defenseman | Nashville Predators | 0.328599 | 28 | 11.2298 | 10.0277 | 115 | 136 | -21 |
| Richard Panik | Right Wing | Tampa Bay Lightning | 0.328197 | 23 | 11.0786 | 9.849 | 119 | 122 | -3 |
| Antoine Roussel | Left Wing | Dallas Stars | 0.326697 | 37 | 12.3511 | 11.6655 | 133 | 156 | -23 |
| Cory Sarich | Defenseman | Calgary Flames | 0.32366 | 27 | 13.2168 | 13.0873 | 169 | 174 | -5 |
| Sven Baertschi | Left Wing | Calgary Flames | 0.321896 | 19 | 10.3705 | 9.2466 | 113 | 123 | -10 |
| Shawn Thornton | Left Wing | Boston Bruins | 0.317794 | 40 | 12.0868 | 11.8861 | 152 | 170 | -18 |
| Nick Bonino | Center | Anaheim Ducks | 0.312019 | 23 | 10.492 | 9.9692 | 119 | 130 | -11 |
| Joffrey Lupul | Left Wing | Toronto Maple Leafs | 0.30614 | 16 | 10.0747 | 9.7094 | 111 | 130 | -19 |
| Peter Holland | Center | Anaheim Ducks | 0.302854 | 18 | 7.9056 | 6.796 | 83 | 90 | -7 |
| Beau Bennett | Right Wing | Pittsburgh Penguins | 0.301209 | 22 | 9.1646 | 8.6582 | 100 | 112 | -12 |
| Jason Zucker | Left Wing | Minnesota Wild | 0.300736 | 18 | 7.5318 | 6.3695 | 88 | 95 | -7 |
| David Steckel | Center | Anaheim Ducks | 0.293259 | 20 | 6.346 | 5.0307 | 84 | 85 | -1 |
| Mike Rupp | Left Wing | Minnesota Wild | 0.273492 | 28 | 7.3535 | 7.3038 | 90 | 104 | -14 |
| Jeff Halpern | Center | Montréal Canadiens | 0.260309 | 16 | 5.7545 | 5.3446 | 65 | 69 | -4 |
| Jim Slater | Center | Winnipeg Jets | 0.224588 | 16 | 4.233 | 4.1011 | 49 | 53 | -4 |
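A list like the one above could be produced with a simple filter and sort. The sketch below uses three real rows from the table plus two hypothetical skaters (labeled as such) to show the two exclusion rules; the tuple layout is an assumption for this example.

```python
rows = [
    # (name, LAEGAP, games, shots_for, shots_against)
    ("Devin Setoguchi", 0.399679, 43, 255, 280),
    ("Eric Brewer", 0.385438, 41, 338, 377),
    ("David Legwand", 0.395207, 41, 300, 304),
    ("Hypothetical Skater A", 0.410000, 40, 300, 250),  # positive shot diff
    ("Hypothetical Skater B", 0.390000, 10, 60, 70),    # too few games
]

def discredited_by_shot_diff(rows, min_games=15):
    """Keep players with a negative shot differential and more than
    min_games games played, best LAEGAP first."""
    kept = [r for r in rows if r[3] - r[4] < 0 and r[2] > min_games]
    return sorted(kept, key=lambda r: r[1], reverse=True)

names = [r[0] for r in discredited_by_shot_diff(rows)]
```

Here `names` keeps Setoguchi, Legwand and Brewer in that order and drops both hypothetical skaters.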

Predictability

Statistics are useful because they allow us to make educated predictions of future outcomes. To verify that LAEGAP is indeed useful, we must confirm that players have a high probability of repeating similar numbers season after season. After calculating the correlation between the LAEGAP of players who played the same proportion of games (to account for long term injuries in either season) in the 2011-12 and 2012-13 seasons, I found the Pearson correlation coefficient to be 0.72, with 1 being perfect correlation and 0 being no correlation at all. A correlation coefficient of 0.72 indicates a moderate to strong correlation, meaning that LAEGAP is a repeatable statistic.

Location adjusted shot data should make for at least a couple more interesting studies and spawn a couple more stats that mirror their Corsi counterparts, such as relative LAEGAP and LAEGAP quality of competition. I am excited for what I will find and I hope you are too. Interestingly, it took me exactly 1111 lines of code to mine and calculate the data for this project. I hope you enjoyed the article :)
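For reference, the Pearson coefficient itself is short to compute. The sketch below is a plain Python version, and the two lists in the usage lines are made-up LAEGAP values for illustration only, not the data behind the 0.72 figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical LAEGAP values for the same four players in two seasons.
season_1 = [0.40, 0.35, 0.45, 0.30]
season_2 = [0.38, 0.36, 0.43, 0.33]
r = pearson_r(season_1, season_2)
```

Each position in the two lists must refer to the same player, which is why only players present in both seasons can be included.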