Great Streaks: A Response to Trent McCotter

In an article in the 2008 issue of The Baseball Research Journal, Trent McCotter argued that hitting streaks are achieved more frequently than if there were no “hot hand” effect.1 Here, the author acknowledges that finding, but argues that the effect is so small that it can be ignored for practical purposes. In addition, he uses the original researcher’s technique to identify the most seemingly-unlikely (although not necessarily longest) streaks of the past several baseball seasons.

Introduction

I recently wrote an article for the Journal of Quantitative Analysis in Sports in which I looked carefully at the streaky patterns of hitters during the 2005 season.2 After I wrote this, I vowed never again to write about streakiness. But after reading the recent article by Trent McCotter in The Baseball Research Journal, I had to break my vow. McCotter’s interesting look at streaky hitting, and the statements made in the article, deserve some explanation and comment. He also describes an attractive method for assessing streakiness, and it is straightforward to apply his statistical approach to identify extreme hitting streaks in recent seasons. Using this methodology, I find some “great streaks” during the 2004 through 2008 baseball seasons.

Comments on BRJ Article by McCotter

In the BRJ article, McCotter wishes to construct a test of the common hypothesis that the individual batting outcomes of a particular player during a season represent independent, identically distributed trials. (We’ll call this the “IID assumption” or the “IID model.”) Essentially this hypothesis says that the batting outcomes are similar to flips of a coin where the chance of a hit on a single at-bat is equal to the batter’s “true” batting average.
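As a worked illustration of what the IID model implies at the game level, suppose (a simplifying assumption made here only for the sake of the example, not one taken from the BRJ article) that a batter with “true” average p gets exactly n official at-bats in every game. The coin-tossing formulas then give the chance of a hit in a game and the chance of hitting in k specified consecutive games. The values p = 0.300, n = 4, and k = 10 are hypothetical.

```latex
\[
P(\text{at least one hit in a game}) = 1 - (1 - p)^{n},
\qquad
P(\text{a hit in each of } k \text{ given consecutive games}) = \left[ 1 - (1 - p)^{n} \right]^{k}.
\]
% Hypothetical example: p = 0.300, n = 4, k = 10
\[
1 - (0.7)^{4} = 1 - 0.2401 = 0.7599,
\qquad
0.7599^{10} \approx 0.064.
\]
```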

To test this hypothesis, the author looks at the pattern of game hitting streaks of all players for the seasons 1957 through 2006. Suppose we collect the game-to-game hitting records of Mickey Mantle during the 1961 season. We record all his hitting streaks for this season—maybe he started with a hitting streak of one game, a second hitting streak of three games, a third hitting streak of four games, and so on. If the IID assumption is true, then the pattern of hitting streaks for Mantle is simply a by-product of chance variation. If we randomly rearranged his game-to-game hitting statistics, then that wouldn’t change the pattern of streaks. The general question is whether Mickey Mantle’s observed pattern of hitting streaks (and the streak patterns for other players) is consistent with the random patterns from a model with the IID assumption.

The author performs a computationally-intensive simulation of the patterns of hitting streaks for all players and seasons for the period 1957–2006. For each player’s game log for a season (for all 50 seasons), he randomly arranges the batting lines. Then he computes the lengths of all hitting streaks for all players for all seasons. Then he repeats this simulation process for a total of 10,000 iterations. When he is done, one has an empirical distribution of the lengths of hitting streaks under the IID assumption, and one can see if the actual lengths of streaks are consistent with this distribution.

The conclusions of the paper can be summarized by two tables showing that the actual numbers of long hitting streaks (of length 5 and greater) are consistently larger than the mean numbers of long streaks predicted by the IID model. Moreover, the differences are highly statistically significant. The author concludes by saying:

This study seems to provide some strong evidence that players’ games are not independent, identically distributed trials, as statisticians have assumed all these years, and it may even provide evidence that things like hot hands are a part of baseball streaks. . . .

From the overwhelming evidence of the permutations, it appears that, when the same math formulas used for coin tosses are used for hitting streaks, the probabilities they yield are incorrect.

Much of the article is devoted to a discussion of this conclusion, giving some possible explanations for the presence of long streaks.

Generally I’m fine with the statistical methodology used in the paper. As I’ll illustrate later, the permutation test procedure is an attractive method for testing the IID assumption and the results described in the article are interesting. But I am concerned about some of the author’s statements and conclusions about this analysis.

First, the author seems to make the implicit assumption that all statisticians believe in the IID assumption. The IID assumption is an example of a statistical model that we may use to fit baseball data. Any model we apply is actually wrong—that is, the real process behaves in a much more sophisticated manner than the model suggests. For example, take the standard IID assumption that individual at-bats are coin-tossing outcomes with a constant probability of hitting success p. Do I believe this is true? Of course not. I believe that the hitting talent of a player goes through many changes during the season and it depends on many other variables such as the quality of the pitcher, the game situation, whether the game is at home, etc. So if the IID model is wrong, why do we use it? Well, the IID model has been shown to be useful in understanding the variation of baseball data. One thing that I have found remarkable in my baseball research is that good simple models (like the IID model) are really helpful in predicting future baseball outcomes.

Second, the author gives the impression that this statistical analysis gives evidence for the hot hand effect in baseball. Suppose you reject the IID assumption. What does this mean? It could mean that there is a dependence structure in the batting sequence. That is, one’s performance in one at-bat is helpful in predicting the performance in successive at-bats. But there is a second possible explanation. Maybe the outcomes are independent, but the chance of getting a hit changes across the season. Either explanation, a dependence pattern or a change in hitting probability, would explain the presence of long streaks. Also, these two characteristics are confounded, and it is difficult statistically to isolate their effects. So it is wrong to say that long streaks imply a dependence pattern in the hitting sequences. People love to believe in the hot hand and I’m concerned that this paper adds fuel to their hot-hand belief.
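The second explanation can be illustrated with a small simulation. The sketch below is not part of the original analysis: the season length of 160 games and the per-game hit probabilities (0.6 throughout, versus 0.8 in the first half and 0.4 in the second) are hypothetical values chosen only to make the point. In both scenarios every game is generated independently of the others, yet the scenario with a changing probability produces more long streaks.

```python
import random

def count_long_streaks(outcomes, min_length):
    """Count hitting streaks (runs of True) lasting at least min_length games."""
    count, run = 0, 0
    for hit in outcomes:
        if hit:
            run += 1
        else:
            if run >= min_length:
                count += 1
            run = 0
    if run >= min_length:  # a streak may run through the final game
        count += 1
    return count

def simulate_season(game_probs, rng):
    """Independent games; game_probs[i] is the chance of a hit in game i."""
    return [rng.random() < p for p in game_probs]

def mean_long_streaks(game_probs, min_length=10, n_seasons=2000, seed=1):
    """Average number of long streaks per simulated season."""
    rng = random.Random(seed)
    total = sum(
        count_long_streaks(simulate_season(game_probs, rng), min_length)
        for _ in range(n_seasons)
    )
    return total / n_seasons

# Hypothetical probabilities, chosen only for illustration.
constant = [0.6] * 160               # same chance of a hit in every game
changing = [0.8] * 80 + [0.4] * 80   # same season average, but it shifts midseason

constant_mean = mean_long_streaks(constant)
changing_mean = mean_long_streaks(changing)
print(constant_mean, changing_mean)
```

Because no game depends on any other in either scenario, an excess of long streaks relative to a shuffled season cannot by itself distinguish a hot hand from a hit probability that drifts over time.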

Last, what is the practical significance of the results? To find these streaky effects, the author had to consider all hitting sequences in 50 seasons of baseball data. This is a ton of data—the author was likely considering streaks present in over 30,000 player seasons! But we live in the context of a single season and these results really don’t say that the IID assumption is inappropriate for understanding the lengths of hitting streaks for a single season. I suspect that in the context of a single season, these streaky effects are relatively small and can safely be ignored. The author is concerned about the difficulty of devising a more accurate modeling method since he has shown that the IID assumption is incorrect. But that’s okay, since statisticians don’t need “exact” models. If the streak effect is “real” but small in size, then I’ll continue to use the IID model since I believe it is an attractive approximate model that works.

Is There Evidence of Long Streaks for the Past Five Seasons?

After reading this paper, it seemed natural to explore the presence of long streaks in the context of a single season. McCotter demonstrated that there was a streakiness effect, but didn’t measure the size of this effect. If the streakiness effect was substantial, then I would think it should manifest itself in a single season.

So I replicated the author’s analysis for each of the five recent baseball seasons from 2004 through 2008. I’ll carefully outline what I did for the 2004 season which may help explain the author’s method in the BRJ article.

-Using play-by-play files from Retrosheet, I collected the game-to-game hitting data (number of hits and number of at-bats) for all 959 players who had at least one official at-bat in the 2004 season.

-For each player’s game log, I collected the lengths of all hitting streaks. For example, for John McDonald in 2004, I record whether he got a hit (Y) or not (N) in each of the 40 games in which he had an official at-bat that season. (See table 1.)

Table 1.

Game     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Hit?     N  N  N  N  Y  N  N  N  Y  N  N  N  N  N  Y  N  Y  Y  N  N
Streak   -  -  -  -  1  -  -  -  1  -  -  -  -  -  1  -  1  2  -  -

Game    21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Hit?     N  N  N  N  N  N  Y  Y  Y  Y  Y  Y  Y  N  N  N  N  N  N  N
Streak   -  -  -  -  -  -  1  2  3  4  5  6  7  -  -  -  -  -  -  -

I collect the hitting streak lengths 1, 1, 1, 2, 7. Likewise, I collect the streak lengths for all other 958 players.

-Next, I wish to simulate batting logs for all players under the assumption that the order in which the games appear in each player’s batting log is not important. For each player’s batting log, I randomly permute the Y’s and N’s. For the simulated batting logs, I again collect all of the streak lengths for all players. When I am done, I record the number of streaks of length 1, the number of streaks of length 2, and so on.

-Last, I compare the distribution of simulated streak lengths with the actual streak lengths observed in the 2004 season. A sample of results is displayed in the following table. Suppose we are interested in the number of “long” streaks that are five or longer. Under the “Actual” column, we see that we observed 1707 streaks of length 5 or higher in the 2004 season. In the simulation, the mean number of streaks that were 5 or higher was 1690 and the standard deviation of the number of 5+ streaks was 24.2—these numbers are placed in the “Mean” and “Stand Dev” columns. We notice that we observed more streaks than one would expect under the IID model. Is this significant? To answer this question, we compute the p-value which is the probability that the simulated number of 5+ streaks is at least as large as the observed number of 5+ streaks. If the p-value is small (say, under 0.05), then we reject the IID model. Here we compute that the p-value is 0.25—the conclusion is that we have insufficient evidence to say that the data rejects the IID hypothesis. (By the way, the BRJ article didn’t contain p-values and I think the inclusion of those numbers would help the exposition.)
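The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used for the article: the function names are hypothetical, each player’s log is assumed to be a string of Y’s and N’s, and for self-containment the usage example runs on a single log (John McDonald’s 2004 log from table 1) rather than on the logs of all 959 players.

```python
import random
from statistics import mean, stdev

def streak_lengths(log):
    """Return the lengths of all hitting streaks in a 'Y'/'N' game log."""
    lengths, run = [], 0
    for game in log:
        if game == "Y":
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    if run > 0:  # streak still active in the last game
        lengths.append(run)
    return lengths

def count_streaks(logs, min_length):
    """Number of streaks of at least min_length across all players' logs."""
    return sum(1 for log in logs for s in streak_lengths(log) if s >= min_length)

def permutation_test(logs, min_length=5, n_sims=1000, seed=0):
    """Compare the actual count of long streaks with counts from shuffled logs."""
    rng = random.Random(seed)
    actual = count_streaks(logs, min_length)
    simulated = []
    for _ in range(n_sims):
        shuffled = []
        for log in logs:
            games = list(log)
            rng.shuffle(games)  # permute the games within one player's log
            shuffled.append(games)
        simulated.append(count_streaks(shuffled, min_length))
    # p-value: share of simulations with at least as many long streaks as observed
    p_value = sum(s >= actual for s in simulated) / n_sims
    return actual, mean(simulated), stdev(simulated), p_value

# John McDonald's 2004 log from table 1 (40 games, 12 with a hit).
mcdonald_2004 = ("NNNNY" + "NNNY" + "NNNNNY" + "NYY"
                 + "NNNNNNNN" + "YYYYYYY" + "NNNNNNN")

print(streak_lengths(mcdonald_2004))
actual, sim_mean, sim_sd, p_value = permutation_test([mcdonald_2004], min_length=5)
print(actual, sim_mean, sim_sd, p_value)
```

For a full season one would pass the logs of every player with at least one official at-bat; the returned values correspond to the “Actual,” “Mean,” “Stand Dev,” and “P-value” columns.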

The above procedure was repeated for each of the five seasons and the results are displayed in the following five tables. In each table, we look at the number of streaks of length 5 or more, the number of streaks of length 10 or more, the number of length 15 or more, and the number of length 20 or more. The p-values indicate the consistency of the streak lengths with the IID model: small p-values indicate that the observed streak lengths are longer than one would expect under the IID model.

2004 Season

Streak Length   Actual   Mean      Stand Dev   P-value
5 or more       1,707    1,690     24.2        0.25
10 or more      235      227.4     12          0.28
15 or more      35       35.8      5.6         0.58
20 or more      7        6.2       2.4         0.42

2005 Season

Streak Length   Actual   Mean      Stand Dev   P-value
5 or more       1,707    1,665.3   23.7        0.04
10 or more      228      214.5     11.6        0.14
15 or more      29       32.7      5.2         0.79
20 or more      9        5.8       2.3         0.13

2006 Season

Streak Length   Actual   Mean      Stand Dev   P-value
5 or more       1,729    1,689.8   24.4        0.07
10 or more      231      227.6     11.9        0.40
15 or more      34       35.6      5.4         0.65
20 or more      9        6.2       2.3         0.16

2007 Season

Streak Length   Actual   Mean      Stand Dev   P-value
5 or more       1,712    1,691.6   23.9        0.20
10 or more      238      226.3     11.8        0.16
15 or more      46       35.4      5.5         0.04
20 or more      11       6.2       2.4         0.04

2008 Season

Streak Length   Actual   Mean      Stand Dev   P-value
5 or more       1,663    1,688     24          0.87
10 or more      236      227.1     12          0.24
15 or more      38       35.5      5.4         0.35
20 or more      4        6.2       2.3         0.87

What do we learn from this analysis? The p-values for the 2004 and 2008 seasons are large, indicating that for these seasons the streaks were consistent with the IID model. In contrast, the 2006 and 2007 p-values are small, suggesting that the streakiness is significant for these seasons, and the 2005 p-values are less conclusive. From this brief analysis, the IID model appears useful in explaining the variation in streak lengths for some seasons. The size of the streakiness effect is small enough that it is not detectable statistically for particular seasons. McCotter did find significant streakiness in his study of 50 seasons of data, but the practical significance of his result is questionable by this analysis.

Using a Permutation Test to Identify Great Streaks

In baseball, the length of a hitting streak is simply the number of consecutive official games in which a player gets at least one base hit. The web page www.baseball-reference.com/bullpen/Longest_Hitting_Streaks lists all of the hitting streaks in baseball history of length 30 or greater. In my JQAS article, I explain that the length of a hitting streak is confounded with two variables. Better hitters are more likely to have long streaks since they are more likely to get a hit in a game. Also, regular players who play all the games in a season are more likely to have long streaks than utility players who have fewer opportunities to hit. It is desirable to have a measure of streakiness that is not related to hitting success or number of games played.

The permutation test described in the BRJ article provides a simple method of assessing the size of a particular hitting streak that adjusts for player ability and number of games played. We illustrate the calculation using John McDonald’s data for the 2004 season.

We show again his game data. We see that his longest hitting streak was 7 games. Was this a noteworthy streak? (On the surface, you probably would say no, since 7 doesn’t sound very large.) (See table 2.)

As in the previous analysis, we simulate hitting sequences assuming the IID model. For each of the ten lines below, we randomly arrange the sequence of 12 Y’s (games with a hit) and 28 N’s (games with no hit), and then compute the length of the longest hitting sequence in each of the random permutations. (See table 3.)

If we repeat this exercise for 10,000 simulations, we obtain the empirical distribution of the longest hitting streak for McDonald if the game results were truly a random sequence. To see if McDonald’s streak of 7 is extreme, we compute the p-value, the probability that the longest streak length in the random sequence is 7 or higher. We see from the output below that the p-value is 0.0017, a pretty small number. We conclude that McDonald’s hitting streak of 7 is pretty impressive, since the probability of a streak this long arising by chance is so small. (See table 4.)
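The longest-streak calculation can be sketched as follows. This is an illustrative sketch rather than the code behind the tables: the function names and the random seed are assumptions, and the log is McDonald’s 2004 sequence of 12 Y’s and 28 N’s as recorded in table 1.

```python
import random

def longest_streak(log):
    """Length of the longest run of 'Y' (games with a hit) in a game log."""
    best = run = 0
    for game in log:
        run = run + 1 if game == "Y" else 0
        best = max(best, run)
    return best

def longest_streak_p_value(log, n_sims=10_000, seed=0):
    """Estimate P(longest streak in a random ordering >= observed longest streak)."""
    rng = random.Random(seed)
    observed = longest_streak(log)
    games = list(log)
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(games)  # random rearrangement of the same Y's and N's
        if longest_streak(games) >= observed:
            extreme += 1
    return observed, extreme / n_sims

# John McDonald's 2004 log from table 1 (40 games, 12 with a hit).
mcdonald_2004 = ("NNNNY" + "NNNY" + "NNNNNY" + "NYY"
                 + "NNNNNNNN" + "YYYYYYY" + "NNNNNNN")

observed, p_value = longest_streak_p_value(mcdonald_2004)
print(observed, p_value)
```

Because the p-value is estimated by simulation, the result changes somewhat with the seed and the number of simulations, but it should be of the same order as the 0.0017 reported above.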

I used this procedure to assess the greatness of the longest streak of hits for every player in the five seasons 2004 through 2008. To pick out a relatively small number of streaks, I arbitrarily decide that a streak is “great” if the p-value is smaller than 0.0032 (that is, if –log10(p-value) exceeds 2.5).
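The two forms of the cutoff are equivalent, as the short calculation below shows. The second line applies the criterion to the p-value of 0.0017 computed for McDonald’s 2004 streak.

```latex
\[
-\log_{10}(p) > 2.5 \iff p < 10^{-2.5} \approx 0.0032
\]
\[
-\log_{10}(0.0017) \approx 2.77 > 2.5
\]
```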

Table 1 displays 13 great streaks in this five-season period that satisfy this criterion. There are some obvious great streaks listed, such as Jimmy Rollins’s streak of 36 games in 2005, Chase Utley’s streak of 35 in 2006, and Willy Taveras’s streak of 30 in 2006. There are also several surprising names on this list, including John McDonald, Mike Napoli, and So Taguchi. Remember, though, that this measure of streakiness automatically adjusts for the hitting ability and number of games of the player. In essence, the measure lists the most surprising hitting sequences as identified by the permutation test.

Closing Comments

Have we learned anything new about streakiness in baseball? McCotter proposes an interesting method of detecting streakiness using a large dataset (50 years of baseball) and he did show that “true” streakiness existed. But I believe his conclusions are similar in spirit to the conclusions in my JQAS paper. We see much streaky behavior in baseball data, but most of the variability in this behavior can be explained using simple probability models such as the IID model here. Although simple models explain most of the behavior, I concluded in my article that some players exhibited more streakiness than the models would predict. Moreover, it seems hard to find statistical evidence for players who are consistently streaky across seasons.

One interesting by-product of this work was the use of the permutation test to identify unusually long hitting streaks. By looking at all players instead of only the regulars, one can identify players such as John McDonald, who exhibit strong streaky performances despite hitting for a poor average.

Notes

A version of this article appeared in By The Numbers 18, no. 4 (November 2008): 9–13.

JIM ALBERT is professor of mathematical statistics at Bowling Green State University.