Some goaltenders are thought of as being particularly streaky or particularly consistent, but are those labels fair? In this article, we compare Marc-Andre Fleury, Ilya Bryzgalov, Henrik Lundqvist, Pekka Rinne, Jaroslav Halak, and Carey Price and find two things: they all exhibit about the same amount of variability as each other, and they all exhibit about the same amount of variability as would be expected from simple random chance.

Introduction

Heading into the Flyers-Penguins series, one common defense of Marc-Andre Fleury from those who argued he was better than his stats was that he was an extremely consistent goalie, whereas Ilya Bryzgalov was unreliable and streaky. (I always feel compelled to prove that I’m not fighting a straw man, so here are a few examples from Darren Dreger, the Sporting News, Pro Hockey Talk, and the Penguins-focused blog HighHeelsAndHockey.)

Fleury’s play in the series may have done more to address that claim than I ever could, but at the time it started me down the road of trying to answer two questions:

Can I find evidence that goalies are more streaky or more consistent than would be expected from random variance?

Can I find evidence that some goalies are more streaky or more consistent than others?

To answer these questions, I simulated a random career for each goalie and compared it to his actual career to see which was streakier. I’ll save the details of the simulation methodology for the appendix, but the output was a 10,000 game simulated career which assumed that the goalie’s odds of stopping any given shot were exactly his career average save percentage, with no streakiness at all except for random chance.

Once we have that, we can plot how often a goalie posts a given save percentage over a stretch of games, and compare the results to his perfectly steady simulated counterpart.

If the goalie is truly streaky, we would expect him to run hot or cold more often than the coinflip model does, which would mean the distribution of save percentages in his actual career would be broader than in the simulation.

The Results

Here’s what we see for Fleury’s results over any given 3-game stretch compared to the simulated Fleury’s 3-game stretches:

However, the decision to look at 3-game stretches was arbitrary; maybe streaks last longer than that, so let’s look at some other cutoffs to be sure we aren’t missing something.

It could be argued that Fleury’s distribution over 5-game stretches is just the slightest bit broader than the simulated Fleury, but the difference is not large. The standard deviation – a measure of the spread of the results – is 0.026 for Fleury’s actual results and 0.025 for simulated Fleury, a difference that is virtually imperceptible in reality and as likely as not due to imperfections in the model (see appendix).

To a first approximation, it’s fair to say that Fleury’s consistency is what you’d see from a robot goalie unaffected by injury, confidence, focus, or whatever else might cause a goaltender to appear more dialed in at some times than at others.

Perhaps that’s evidence that Fleury is indeed unusually consistent. Do other goalies fluctuate more than we’d expect by random chance, with streaks of running hot and cold beyond what the coinflip model achieves? Let’s take a look at Bryzgalov.

Again, that’s pretty darn similar. Bryzgalov had three bad starts in a row a little more often than the model did, but other than that his career is virtually identical to that of a .915-save-percentage puck-stopping robot. (Question: could we build such a device for less than $51 million?)

I looked at six goalies using this method, evaluating players who were suggested to me on Twitter as being either particularly consistent or particularly streaky. I’ll spare you the rest of the plots and summarize with a table showing the standard deviation of the distribution of results for each goalie and his simulated counterpart:

Goalie / Simulation         3-game stretches   5-game stretches   10-game stretches
Fleury / SimFleury          .033 / .033        .026 / .025        .018 / .018
Bryzgalov / SimBryzgalov    .034 / .032        .026 / .024        .017 / .017
Lundqvist / SimLundqvist    .032 / .031        .024 / .024        .018 / .017
Rinne / SimRinne            .033 / .032        .025 / .024        .018 / .018
Halak / SimHalak            .033 / .031        .026 / .024        .018 / .017
Price / SimPrice            .031 / .030        .025 / .023        .018 / .017

Each goalie is just a tiny bit less consistent than the random variance model, with differences pretty comparable to those plotted above. All of the factors that might make a real goalie less consistent (imperfections in the model, injury to the goalie, psychological factors, change in talent level over the years) together add up to increasing the standard deviation by only about 0.001, or one-tenth of a percentage point of save percentage.
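As a rough sanity check on the table, the spread of a plain binomial proportion alone predicts standard deviations very close to these. If a goalie stops each shot with probability p and faces n shots over a stretch, the standard deviation of his save percentage over that stretch is sqrt(p(1-p)/n). A minimal sketch, assuming a .909 save percentage and an illustrative 28 shots per game (both figures are assumptions for the example, not pulled from the table):

```python
import math

p = 0.909            # assumed career save percentage
shots_per_game = 28  # assumed average shots faced per game (illustrative)

for games in (3, 5, 10):
    n = games * shots_per_game
    sd = math.sqrt(p * (1 - p) / n)  # SD of a binomial proportion over n shots
    print(f"{games}-game stretches: predicted SD ~ {sd:.3f}")
```

This back-of-the-envelope calculation gives roughly .031, .024, and .017 for the three window sizes, right in line with the table; the real numbers come out slightly larger because shots faced per game also varies. That is exactly the article’s point: almost all of the observed spread is what binomial chance alone produces.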

Conclusion

I have written previously about people’s tendency to underestimate how streaky random chance is, and I think that is what has happened here. There is very little difference between goalies and perfectly consistent robots, and certainly nowhere near enough difference between the goalies to label one of them streaky and another consistent.

Goalies should be evaluated based on how much skill they have demonstrated, not how often we remember them going on a hot streak.

Appendix: How the model works

For each goalie, I went through the following process:

For each of his starts since the lockout, note how many shots he faced

Produce a histogram, a distribution of how often he faced a given number of shots (e.g. Fleury faced 23 shots in a game 11 times, 24 shots in a game 21 times, etc)

Simulate 10,000 games by the following method:

Select the number of shots faced randomly, using the histogram from step 2

Simulate each shot, assuming that the likelihood of stopping any given shot is exactly the goalie's career save percentage since the lockout

Record the number of shots faced and saves made in each simulated game

That gave me a simulated 10,000 game career in which the distribution of shots faced mirrors his real life distribution of shots faced and he had the exact same chance of stopping each shot. From there, the distribution of results in a 3-game (or 5-game, or 10-game) moving average could be compared to the distribution of results from his actual career.
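The steps above can be sketched in a few lines of Python. Everything here is illustrative: the shot-count histogram below is a made-up stand-in for the real step-2 histogram, and the .909 save percentage is Fleury’s figure quoted in the article.

```python
import random
import statistics

random.seed(42)

SAVE_PCT = 0.909  # career save percentage since the lockout (from the article)

# Step 2 stand-in: hypothetical histogram of shots faced per game,
# {shots_faced: number_of_games_it_occurred}
shot_histogram = {23: 11, 24: 21, 26: 40, 28: 55, 30: 40, 33: 20, 36: 8}
shot_values = list(shot_histogram)
shot_weights = list(shot_histogram.values())

# Step 3: simulate 10,000 games
games = []
for _ in range(10_000):
    # 3a: draw the number of shots faced from the histogram
    shots = random.choices(shot_values, weights=shot_weights)[0]
    # 3b: simulate each shot as an independent coin flip at SAVE_PCT
    saves = sum(random.random() < SAVE_PCT for _ in range(shots))
    # 3c: record shots faced and saves made
    games.append((shots, saves))

# Moving-window save percentage, as in the article's plots
def rolling_sv_pct(games, window):
    out = []
    for i in range(len(games) - window + 1):
        chunk = games[i:i + window]
        out.append(sum(sv for _, sv in chunk) / sum(s for s, _ in chunk))
    return out

sv3 = rolling_sv_pct(games, 3)
print(f"3-game windows: mean {statistics.mean(sv3):.3f}, SD {statistics.stdev(sv3):.3f}")
```

With this toy histogram, the 3-game standard deviation comes out around .03, consistent with the SimFleury column of the table; swapping in a goalie’s real shot histogram is the only change needed to reproduce the simulated careers described above.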

I mentioned that the model is not perfect and might be expected to give a slightly tighter distribution than reality. Here are some examples of why:

I did not separate out even strength shots faced and power play shots faced. That adds a random factor that might cause a greater spread than would be predicted from this simpler model. In real life, sometimes Fleury went three games without seeing many power play shots, and sometimes he was under siege for three games, but in the model every shot came with the same .909 save percentage.

In real life, most of the time that a goalie faces only a few shots in a game, it is because he let in multiple goals and got pulled, and those short games can have a big impact on the goalie’s save percentage over the three-game stretch. In the simulation, the number of shots faced and goals scored are determined independently, so the goalie who lets in three goals on the first five shots will usually have another 20-30 shots to regress to the mean.

This study makes no effort to account for change in skill over time. Over the seven years in question, Fleury has gone from a 21-year-old rookie to a 27-year-old in his prime, so we might expect that he had more bad stretches in 2005-06 and more good stretches in 2011-12. This would look exactly the same in the plots above as a goalie who had hot and cold stretches throughout his career, but would not normally be considered streakiness.

Similarly, over those seven years the goalies have had a variety of coaches, teammates, and in some cases have switched teams altogether. If any of those things impact save percentage, they would have the same effect as aging, making the goalie appear more variable than he really is.

Goalies sometimes play through an injury that hampers their performance for a stretch of time. Simulated goalies never have to do that.

My hunch is that all of those factors put together easily account for the small differences between the simulated and actual distributions, and that all of the psychological factors commonly cited to explain variability (confidence, focus, etc) collectively add up to virtually zero effect. I haven’t proven that, however; all I can say with confidence right now is that all of the model imperfections and psychological factors put together collectively add up to something very small, and that goalie streakiness is mostly just random chance.

Excellent work here, Eric. I love to see someone acknowledging random chance as a reason why a particular outcome happened. It's a pitifully underused, albeit unsatisfying, explanation for events.

Awesome work as always EricT. To take things further, how close do you feel we are to stating that, in effect, NHL results are 'all luck'? In other words, Bryz did not just have a bad playoffs; it was just a blip we expect by random chance. And is it time to become more radical in rule changes to reward skill, i.e. make the nets larger, reduce the number of blocked shots, etc., so that skill takes more of a role? Personally, I would like to get hockey to a 50/50 split. I believe it's around 60/40 luck right now.

Bryzgalov did have a bad playoffs. There are any number of reasons for that, some of which is skill, some of which is mental, some physical, some luck.

Just because goalies don't show a tendency to be any more consistent than expected doesn't mean everything is luck. It means that every goalie has variance, "hot streaks" and "cold streaks", but that the frequency is what should be expected based on skill.

Apparently Pittsburgh had an imposter in goal against Philadelphia in the 2012 playoffs. In games 2, 3, and 4 of that series, nhl.com reports that Fleury saved 67 of 83 shots, for a save percentage of 80.7%. This three-game total is completely outside the distribution that you present for Fleury; i.e. your graph implies that 0% of his three-game stretches had save percentages that low. What gives? The five-game stretch does not fit with your graphs much better. In the last five games of that series Fleury saved 109 / 131 shots, or 83.2%. Right at the edge of your 5-game distribution.

Nevertheless that bad streak does speak against your thesis. A robot goalie would not have gone stone cold as he did.

Your phrase "surprising since he's such a clutch big-game goalie" suggests that you don't believe yourself that these human athletes are repeating robots that only express pure statistical variability.

I was being sarcastic. The clutch big-game goalie thing is a load of crap; he's only achieved his career .909 save percentage in one of his six playoff appearances.

What exactly is it that you think speaks against my thesis? That something that only comes up roughly once every 200 3-game stretches occurred for one goalie during one of his 63 career 3-game playoff stretches?

I wonder if we should consider this evidence that shot quality is not a major factor in goaltender performance. If actual goaltender performance lines up very closely with what would be expected if all shots had the same probability of going in, then that seems like evidence that shot quality is not significantly affecting the real life save percentage. Does that sound accurate?

I'm not sure where you got 1 in 200 from. The blue curve - actual Fleury three-game stretches - hits zero before that, indicating that it never happened in real life. If you have omitted blue values below that level, that omission would hide a fat tail that may exist. Whether or not the actual tail is much fatter than a robot's is central to your thesis. My earlier comment wondered how often a robot goalie would be as cold as Fleury was in the recent playoffs. Looking at your red simFleury three-game curve, and imagining its extension to lower values, it seems to me that the area under the curve of that extension is smaller than 1 / 200 as big as the total area under the curve. While I don't have access to your data, a simple estimate indicates that the robot's probability of performing that poorly is lower. (If a robot faces 83 shots, with p=0.909 binomial distribution, the probability of saving 67 or fewer is p=0.00308, or 1 in 325).
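The 1-in-325 figure in this comment is easy to check with an exact binomial CDF, using only the numbers quoted in the thread (83 shots, p = .909, 67 or fewer saves). A quick sketch:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Probability a .909 goalie saves 67 or fewer of 83 shots
prob = binom_cdf(67, 83, 0.909)
print(f"P(saves <= 67) = {prob:.5f}  (about 1 in {1 / prob:.0f})")
# prints about 0.00308 (1 in 325), matching the commenter's figure
```

This reproduces the p = 0.00308 (about 1 in 325) quoted above for the robot goalie's odds of a stretch that cold.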

I'm also not sure about your claim of 1 in 63 playoff stretches. It's not as if I did an exhaustive search. I looked at exactly one instance - the most recent playoff series - guided by my anecdotal prejudice (which you claim is totally unreliable). The one case I looked at was outside your distribution. Imagining the vast array of potential checks like I did, the probability of finding such an outlier so easily should be very low.

Sorry about missing your sarcasm re clutch goalie. However a choker is also not a robot. One could presumably test whether the career playoff record has a distribution that is distinct from the regular season distribution that you have looked at. That would be non-robotic.

Another issue is that when coaches pull a cold goalie, they are trying to minimize the tail fatness - ie avoid the horror show performances. So that strategy is intended to keep the distribution narrower.

I got one-in-200 from eyeballing the red curve, but you're right -- it's probably more like one-in-325. Either way, the point is that the random model suggests that he'll have a result of .807 every once in a while and we shouldn't be shocked that he did.

When you say the blue curve hits zero, what you mean is that there were specifically zero stretches of between .805 and .8149 in his regular season career. That's it, it's just a single data point at zero. Don't read too much into it -- notice for example that the 10-game curve for Fleury and the 3-game curve for Bryzgalov both have a data point that's zero and then come back up; at these tails where the totals are very small numbers, that'll happen. A single point at zero doesn't mean that there is absolutely nothing below that point, and it certainly doesn't mean that the goalie should not be expected to ever have an event in that area.

Fleury has never had a 3-game stretch in the playoffs anything like what he did this year. Last year, his worst stretch was .848. Before that, he was never even close. The fact that you didn't do an exhaustive search is exactly the point -- you just said "well wait, I remember he was awful this year" and used that as evidence, when in fact the frequency with which he has been awful is not at all inconsistent with the random model.

The whole point of the article is that we tend to remember the stretches where a goalie was great or terrible and think of them as streaky, when in fact those stretches do not seem to occur any more often than would be expected by random chance. The point was to look at whether stretches like that occur much more often than we would expect, and we found that no, they occur about once every 325 times -- the simulation looks very much like the actual results. So the argument "what about that one time that one goalie had the worst performance in 17 years (http://hkref.com/tiny/NV7a3)" is exactly the thinking I'm trying to change; those extreme results seem more improbable than they are to people who aren't aware of how streaky random chance is, and they actually aren't true outliers but just simple tail-of-the-curve results.

The importance of tail shape is one area where we disagree. A three-game "result of .807 every once in a while" is very imprecise. Is it 1 / 325 or 1 / 63 or 1 / 20? These are all once in a while, but the differences among them can carry large impacts on an NHL team's success. I think Pittsburgh's coach and GM did indeed expect never to have 8 goals against in two playoff games in a row. I would have liked to see the full version of your 3-game graph that included all of the points for the real goalie. Anything popping up way out there is interesting.

Apart from the issue of a tail shape that does not change in time is the issue of shifting distribution come playoff time. Your statement "he's only achieved his career .909 save percentage in one of his six playoff appearances" suggests that there may be a distinct distribution, even before the most recent debacle. That shifting distribution would be another individual trait that coaches and GMs are interested in. People have nerves, and some individuals handle them better than others; robots do not. And some athletes overcome earlier difficulties later in their career (such as Tom Watson in golf). These individual human struggles are part and parcel of spectator sports, and are the main reason why so many people want to watch, as compared to near-zero spectators for robot hockey.

The extreme tails of the curve are included in the standard deviation calculation, even if they aren't included in the plot. In fact, the shape there has a large impact on the standard deviation. And yet the standard deviations are virtually identical for the actual and simulated curves.

The issue of a shifting distribution come playoff time would be a separate question. I've done some work on playoff and clutch performances and remain unconvinced that what we see is anything other than small sample size random variance, but the subject certainly hasn't been exhausted. If you want to dig into it further, feel free. But I think the fact that Fleury was widely praised as a clutch big-game goalie heading into this season only underscores the likelihood that it is simple variance.

I first want to say this is an interesting topic and would like to see the results of all the goaltenders in the league.

Secondly, I just want to let you know what I am concluding from this so that you may let me know of something I could be missing to help me further understand.

From what has been shown, it looks like these goalies have produced results (namely their SV%) just as consistent as what a "robotic" goalie of their skill would produce. Also, we can conclude that this does not show whether these goalies' actual play (movement, focus, rhythm, decision making, etc.) is consistent or not, only that their measurable results are. This tells me that if one of these goaltenders does show inconsistencies in his play, it does not affect the consistency of his SV% over a three-, five-, or ten-game period.

Again, I think it would be interesting to see the results of all the goaltenders (mostly because I am curious what Steve Mason's looks like) and also maybe the effects of a two-game window as well (not sure what that would do or whether you would consider that long enough to be relevant to consistency).

I am tending to agree with lj21 about Fleury's playoff performance being way off his statistical norm. The bell curve is supposed to represent over 99 percent of his save percentage over three-game sets. If Fleury's historical save percentage is .909, then his playoff save percentage of .807 should represent about the same chance as having three shutouts in a row, or a 100% save percentage. The statement that it happens once in a while is, I think, a bit of a stretch.

Yeah, it's something that only happens once every couple hundred games.

That doesn't mean it never happens. It happens once in a while.

The guy has played 75 playoff games. It happened once. That's not shocking. It's very unlikely to happen to any particular goalie in any particular series, but with 30 monkeys banging on 30 typewriters every year, it was going to happen.