Friday, August 18, 2017

Generations of Baseball Players

"A generation, like an individual, merges many different qualities, no one of which is definitive standing alone. But once all the evidence is assembled, we can build a persuasive case for identifying (by birthyear) eighteen generations over the course of American history. All Americans born over the past four centuries have belonged to one or another of these generations" (Generations page 68).

William Strauss and Neil Howe wrote the book on generations (literally). The aim of this post - and, indeed, this entire blog - is to apply their theory to baseball. You can call it my manifesto.

Therefore, by the end of this post, I hope to have built a persuasive case for identifying (by birthyear) NINE generations over the course of baseball history. I will claim that all ballplayers born over the past two centuries belong to one or another of these generations.

Why do we need definitive baseball generations? For one, to provide context for "best of their generation" conversations and arguments (best player, best pitcher, best third baseman, best leadoff batter, etc.). Beyond that, having definitive generations restores meaning to baseball's hallowed leaderboards. For example, Roger Connor hit 138 home runs in his career - a modest total by today's standards, but the most ever hit by a player born before 1887. Similarly, Miguel Cabrera's career batting average of .318 ranks 55th all-time (as of this writing), but it ranks first among players born after 1960.

So if we can agree that sorting baseball players by generation is useful, how exactly should we go about doing it? I'll start with Strauss & Howe's definition:

"A GENERATION is a cohort-group whose length approximates the span of a phase of life and whose boundaries are fixed by peer personality" (page 60).

Earlier in the book (page 44), Strauss & Howe defined a "cohort" as "any set of persons born in the same year" and a "cohort-group" as "any wider set of persons born in a limited set of consecutive years."

The authors laid out (on page 56) four "phases of life," each 22 years long: youth (age 0 to 21), rising adulthood (age 22 to 43), midlife (age 44 to 65), and elderhood (age 66 to 87). Obviously, most major league careers fall almost entirely within the second phase, rising adulthood. And since the longest careers tend to last about 22 years, we can say then that the length of a baseball generation approximates the span of a very long major league career.

And that brings us to "peer personality" - "the element in our definition that distinguishes a generation as a cohesive cohort-group" (page 63). Strauss & Howe measured the similarity of cohorts by the similarity of their peer personality. In a pair of articles published to his website in July 2015 (and placed behind a subscription paywall), Bill James "measured the Similarity of Seasons...by the similarity of their statistical image."

Strauss & Howe "use peer personality to identify a generation and find the boundaries separating it from its neighbors" (page 64). Bill James used similarity scores to identify "natural groups of seasons" and find the "fault lines" separating them.

And while Strauss & Howe could "apply no reductive rules for comparing the beliefs and behavior of one cohort-group with those of its neighbors" (page 67), we have baseball's rich statistical record at our disposal for comparing the "behavior" of baseball cohort-groups.

So here then is my modified definition:

A BASEBALL GENERATION is a cohort-group whose length approximates the span of a very long major league career and whose boundaries are fixed by statistical image.

Bill James' similarity scores for seasons used 30 statistical categories, including both counting stats (hits, home runs, strikeouts) and rate stats (batting average, on-base percentage, earned run average). But, as Kerry Whisnant explained, using counting stats to compare cohorts, like using them to compare players or seasons, will mean that only cohorts with "similar numbers of plate appearances" will be similar.

And the traditional rate stats "confound" talents, as Jim Albert explained on page 24 of an article in By the Numbers. "A batting average confounds three batter talents: the talent not to strikeout, the talent to hit a home run, and the talent to hit an in-play ball for a hit."

Peer personality has nothing to do with the raw numbers of a generation, but rather the collective behavior of its members. Strauss & Howe elaborated (on page 63 of Generations):

"The peer personality of a generation is essentially a caricature of its prototypical member. It is, in its sum of attributes, a distinctly personlike creation. A generation...can be safe or reckless, calm or aggressive, self-absorbed or outer-driven, generous or selfish, spiritual or secular, interested in culture or interested in politics."

Likewise, the "statistical image" of a baseball generation (like the statistical image of a team or a league) is essentially a caricature of its average player. It is, in its sum of attributes, like an individual player. A baseball generation can be patient or free-swinging, adept at making contact or prone to striking out, powerful or light-hitting, good at hitting the ball "where they ain't" or bad at avoiding defenders, aggressive on the base-paths or station-to-station.

So I'll use eight rate statistics - what I'll call the "attribute rates" - to measure the similarity of baseball cohorts. ("Each rate describes something specific," Tom Tango wrote, describing four of the rates.) These eight rates - representing eight different skills, or tools - taken together, reveal the "statistical image" or "peer personality" of a baseball generation; how its members collectively played the game.

The first rate, BF/G, uses pitching stats only. The next three, $BB, $SO, and $HR - the "three true outcomes" - draw from both batting and pitching stats (for the formulas listed below, I differentiate between batting and pitching with a small 'b' or 'p' in the variables). The last four use batting stats only.

BF/G - the number of batters a pitcher faces per game.
= BF / G

$BB - the percentage of plate appearances that end in a walk or a hit by pitch.
= (bBB + bHBP + pBB + pHBP) / (PA + BF)

So, now all we need are the career totals for every batter and every pitcher in MLB history. Then, on a separate sheet in Excel, it's just a matter of using SUMIF formulas to add up the necessary batting and pitching totals for each cohort. From the firstborn player (Nate Berkenstock, 1831) to the lastborn (Julio Urias, 1996), there are 166 MLB cohort birthyears (through 2016; I'm typing this just days after Ozzie Albies became the first 1997-born major-leaguer). The table below shows the career totals for 1980s cohorts:

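Batting Totals

| Born | PA | H | 2B | 3B | HR | BB | SO | HBP | SB |
|---|---|---|---|---|---|---|---|---|---|
| 1980 | 151,352 | 34,688 | 7,083 | 671 | 4,575 | 13,621 | 28,195 | 1,570 | 2,243 |
| 1981 | 138,495 | 32,250 | 6,336 | 834 | 3,409 | 10,444 | 25,592 | 1,068 | 2,766 |
| 1982 | 167,402 | 39,289 | 7,957 | 813 | 4,302 | 13,465 | 30,397 | 1,597 | 2,514 |
| 1983 | 210,540 | 49,437 | 10,073 | 981 | 5,721 | 18,420 | 38,815 | 1,721 | 3,422 |
| 1984 | 144,812 | 33,278 | 6,555 | 716 | 3,553 | 11,529 | 28,456 | 1,187 | 2,414 |
| 1985 | 109,480 | 25,513 | 5,128 | 610 | 2,836 | 8,077 | 22,168 | 1,007 | 1,882 |
| 1986 | 124,542 | 27,408 | 5,397 | 595 | 3,345 | 10,219 | 27,121 | 1,122 | 1,777 |
| 1987 | 121,873 | 27,901 | 5,613 | 699 | 3,340 | 9,785 | 25,865 | 976 | 1,872 |
| 1988 | 62,593 | 14,235 | 2,607 | 404 | 1,251 | 4,442 | 12,814 | 510 | 1,392 |
| 1989 | 69,496 | 15,655 | 3,024 | 348 | 1,752 | 5,221 | 14,455 | 623 | 859 |

Pitching Totals

| Born | G | BB | SO | HR | BF | HBP |
|---|---|---|---|---|---|---|
| 1980 | 13,657 | 11,496 | 24,982 | 3,630 | 137,580 | 1,321 |
| 1981 | 14,427 | 14,339 | 29,760 | 4,482 | 162,821 | 1,476 |
| 1982 | 16,642 | 13,586 | 28,085 | 4,128 | 153,448 | 1,434 |
| 1983 | 19,557 | 16,301 | 36,786 | 5,174 | 197,787 | 1,708 |
| 1984 | 16,438 | 14,779 | 34,708 | 4,425 | 175,098 | 1,496 |
| 1985 | 16,635 | 11,104 | 26,232 | 3,256 | 128,041 | 1,199 |
| 1986 | 12,899 | 11,477 | 29,364 | 3,770 | 147,602 | 1,205 |
| 1987 | 12,667 | 9,868 | 23,021 | 3,000 | 115,816 | 1,004 |
| 1988 | 10,079 | 8,432 | 22,006 | 2,518 | 102,863 | 801 |
| 1989 | 7,291 | 6,304 | 16,419 | 2,066 | 79,846 | 717 |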

All the batters born in 1980 combined for 151,352 plate appearances, 34,688 hits, and 4,575 home runs. All the pitchers born that year combined for 11,496 bases on balls, 24,982 strikeouts, etc.
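For readers who would rather not use Excel, the SUMIF step can be sketched in Python. This is only an illustration of the idea, not the spreadsheet used for this post: the career lines are made-up numbers, and the names `careers` and `cohort_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical career lines: (birth_year, PA, H, HR). The numbers are invented.
careers = [
    (1980, 6000, 1500, 200),
    (1980, 450, 100, 5),
    (1981, 3000, 800, 90),
]

def cohort_totals(rows):
    """Sum every stat column by birth year, like one SUMIF per table cell."""
    totals = defaultdict(lambda: [0, 0, 0])
    for born, *stats in rows:
        for i, value in enumerate(stats):
            totals[born][i] += value
    return {born: tuple(values) for born, values in totals.items()}

totals = cohort_totals(careers)
print(totals[1980])  # combined PA, H, HR for the two made-up 1980 players
```

The same pass would run once over batting careers and once over pitching careers, with one column per stat in the totals table.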

Then I can calculate the attribute rates for each cohort:

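| Born | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|
| 1980 | 10.1 | .097 | .204 | .039 | .291 | .257 | .087 | .060 |
| 1981 | 11.3 | .091 | .202 | .036 | .294 | .249 | .116 | .083 |
| 1982 | 9.2 | .094 | .201 | .036 | .297 | .251 | .093 | .061 |
| 1983 | 10.1 | .093 | .204 | .037 | .300 | .253 | .089 | .065 |
| 1984 | 10.7 | .091 | .217 | .035 | .297 | .245 | .098 | .069 |
| 1985 | 7.7 | .090 | .224 | .036 | .301 | .253 | .106 | .072 |
| 1986 | 11.4 | .088 | .228 | .037 | .291 | .249 | .099 | .060 |
| 1987 | 9.1 | .091 | .226 | .038 | .300 | .257 | .111 | .065 |
| 1988 | 10.2 | .086 | .230 | .032 | .298 | .232 | .134 | .093 |
| 1989 | 11.0 | .086 | .226 | .036 | .293 | .243 | .103 | .052 |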
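Here is a sketch of how the rates can be computed from the totals. Only the BF/G and $BB formulas appear in this post; the other six formulas in the code are assumptions, inferred because they reproduce the 1980 and 1981 rows of the rates table from the totals table. The function name `attribute_rates` and the dictionary layout are hypothetical.

```python
def attribute_rates(b, p):
    """b = batting totals, p = pitching totals for one cohort (dicts by stat)."""
    walks = b["BB"] + b["HBP"] + p["BB"] + p["HBP"]
    strikeouts = b["SO"] + p["SO"]
    homers = b["HR"] + p["HR"]
    chances = b["PA"] + p["BF"]
    non_hr_hits = b["H"] - b["HR"]
    xbh = b["2B"] + b["3B"]
    # Batting-only balls in play: PA that did not end in BB, HBP, SO or HR.
    in_play = b["PA"] - b["BB"] - b["HBP"] - b["SO"] - b["HR"]
    # Times a batter reached first base: singles, walks, hit by pitch.
    on_first = non_hr_hits - xbh + b["BB"] + b["HBP"]
    return {
        "BF/G": p["BF"] / p["G"],                        # given in the post
        "$BB": walks / chances,                          # given in the post
        "$SO": strikeouts / (chances - walks),           # inferred
        "$HR": homers / (chances - walks - strikeouts),  # inferred
        "$H": non_hr_hits / in_play,                     # inferred
        "$XBH": xbh / non_hr_hits,                       # inferred
        "$3B": b["3B"] / xbh,                            # inferred
        "$SB": b["SB"] / on_first,                       # inferred
    }

# 1980 cohort totals, copied from the totals table.
bat_1980 = {"PA": 151352, "H": 34688, "2B": 7083, "3B": 671, "HR": 4575,
            "BB": 13621, "SO": 28195, "HBP": 1570, "SB": 2243}
pit_1980 = {"G": 13657, "BB": 11496, "SO": 24982, "HR": 3630,
            "BF": 137580, "HBP": 1321}

rates_1980 = attribute_rates(bat_1980, pit_1980)
for name, value in rates_1980.items():
    print(name, round(value, 3))
```

Each inferred rate removes the outcomes already accounted for by the earlier rates from its denominator, which is why the order of the rates matters.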

Next I need the standard deviations of each rate. I have 166 cohort birthyears, but many of the very early and very recent cohorts were not (or aren't yet) well-represented in the major leagues. So I'll set minimum requirements of 10,000 total plate appearances and 10,000 total batters faced, and therefore only include the 143 cohorts from 1850 (Al Spalding) through 1992 (Bryce Harper) in the population for my standard deviations.
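The filtering step might look like the sketch below. It assumes a population standard deviation (`statistics.pstdev`), which the post does not state explicitly, and the sample cohorts are invented purely to show the minimums at work; `qualifying` and `rate_stdev` are hypothetical names.

```python
from statistics import pstdev

MIN_PA = 10_000  # minimum total plate appearances for a cohort
MIN_BF = 10_000  # minimum total batters faced for a cohort

def qualifying(cohorts):
    """Keep only cohorts that meet both playing-time minimums."""
    return [c for c in cohorts if c["PA"] >= MIN_PA and c["BF"] >= MIN_BF]

def rate_stdev(cohorts, rate):
    """Population standard deviation of one rate across qualifying cohorts."""
    return pstdev([c[rate] for c in qualifying(cohorts)])

# Invented cohorts: "A" is too thinly represented and gets excluded.
sample = [
    {"cohort": "A", "PA": 2_500, "BF": 900, "$BB": 0.020},
    {"cohort": "B", "PA": 40_000, "BF": 35_000, "$BB": 0.090},
    {"cohort": "C", "PA": 55_000, "BF": 50_000, "$BB": 0.100},
    {"cohort": "D", "PA": 60_000, "BF": 58_000, "$BB": 0.110},
]

print(len(qualifying(sample)), rate_stdev(sample, "$BB"))
```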

Also, I'll need to assign weights to each rate. I wanted the "three true outcomes" rates ($BB, $SO, and $HR) to weigh double the other rates, because they use both hitting and pitching statistics, and I wanted the $XBH and $3B rates to weigh half the other rates, because they both deal with breakdowns of base hits. Finally, I wanted the weights to add up to 1,000, so that if two groups are exactly four standard deviations apart in every category, their similarity score will be zero.

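| | BF/G | $BB | $SO | $HR | $H | $XBH | $3B | $SB |
|---|---|---|---|---|---|---|---|---|
| St. Dev. | 8.3 | .013 | .047 | .012 | .011 | .019 | .071 | .040 |
| Weight | 100 | 200 | 200 | 200 | 100 | 50 | 50 | 100 |
| Multiplier | 3.0 | 3870 | 1061 | 4229 | 2241 | 648 | 175 | 633 |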

To find the similarity score between two groups, start at 1,000 and subtract a penalty for each attribute rate. The penalty is the difference between the two groups, times a multiplier. The multiplier is the rate's weight divided by (4 times its standard deviation).

For example, the 1980 cohort has a $BB rate of .097 and the 1981 cohort has a $BB rate of .091, a difference of .006. So the $BB penalty for 1980 and 1981 would be the difference (.006) times the multiplier (3,870), which is about 23. Add up the penalties for all eight rates and subtract from 1,000, and that is the similarity score.
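The scoring rule can be sketched in a few lines of Python. The multipliers are copied from the table rather than recomputed, since the table's values appear to come from unrounded standard deviations; the cohort rates are the rounded table values, so the result is only approximate. The names `penalties` and `similarity` are hypothetical.

```python
RATES = ["BF/G", "$BB", "$SO", "$HR", "$H", "$XBH", "$3B", "$SB"]
WEIGHTS = dict(zip(RATES, [100, 200, 200, 200, 100, 50, 50, 100]))
# Multipliers as published in the table above.
MULTIPLIERS = dict(zip(RATES, [3.0, 3870, 1061, 4229, 2241, 648, 175, 633]))

def multiplier(weight, stdev):
    """A rate's weight divided by four times its standard deviation."""
    return weight / (4 * stdev)

def penalties(a, b, multipliers=MULTIPLIERS):
    """Per-rate penalty: absolute difference times that rate's multiplier."""
    return {r: multipliers[r] * abs(a[r] - b[r]) for r in RATES}

def similarity(a, b, multipliers=MULTIPLIERS):
    """Start at 1,000 and subtract the penalty for each attribute rate."""
    return 1000 - sum(penalties(a, b, multipliers).values())

# Rounded attribute rates for the 1980 and 1981 cohorts, from the rates table.
c1980 = dict(zip(RATES, [10.1, .097, .204, .039, .291, .257, .087, .060]))
c1981 = dict(zip(RATES, [11.3, .091, .202, .036, .294, .249, .116, .083]))

print(round(penalties(c1980, c1981)["$BB"]))  # the $BB penalty worked out above
print(round(similarity(c1980, c1981)))
```

With these rounded inputs the 1980-to-1981 score comes out to roughly 927; a score computed from unrounded rates could differ by a point or so.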

To find "Epochs and Eras," Bill James asked of "every season in baseball history: Is it more like the season before it, or more like the season after it?" He then made two-year comparisons, three-year comparisons, four-year comparisons... comparing "each season to every other season in baseball history within 15 years before or after."

Instead of comparing each baseball cohort to other neighboring cohorts, I'm comparing them to the 15-year cohort-groups before and after. To find Baseball Generations, I'm asking of every baseball cohort: Is it more like the 15-year cohort-group before it, or more like the 15-year cohort-group after it?

Is the 1980 cohort more similar to the 1965-1979 cohort-group, or more similar to the 1981-1995 cohort-group?

1980 to 1965-1979 - 943
1980 to 1981-1995 - 916

The 1980 cohort is backward-looking, more similar to the cohort-group before it (943) than the cohort-group after it (916). What about 1981?

1981 to 1966-1980 - 938
1981 to 1982-1996 - 954

The 1981 cohort is forward-looking, more similar to the cohort-group after it (954) than the cohort-group before it (938).

To get previous and next cohort-groups for all 166 cohorts, I calculated attribute rates for every possible 15-year group, from the group before the first cohort (1816-1830) to the group after the last cohort (1997-2011), and all 180 groups in between. Attribute rates are calculated from batting and pitching totals. Cohort batting and pitching totals are found by adding up the career totals of the individual batters and pitchers belonging to each cohort; cohort-group batting and pitching totals are found by adding up the cohort totals of the 15 individual cohorts belonging to each cohort-group. (The 1816-1830 and 1997-2011 groups both have totals and rates of zero across the board, of course.)
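Building the neighboring groups is just a windowed sum over cohort totals. The sketch below uses the 1980s plate-appearance totals from the earlier table and a 5-year window only so the example fits the data shown here; the real window is 15 years. The function names are hypothetical, and cohorts with no data count as zero, matching the all-zero 1816-1830 and 1997-2011 groups.

```python
def group_total(cohort_totals, first_year, span=15):
    """Sum one stat over `span` consecutive birth years; missing years add 0."""
    return sum(cohort_totals.get(year, 0)
               for year in range(first_year, first_year + span))

def prev_and_next(cohort_totals, born, span=15):
    """Totals for the groups just before and just after a cohort."""
    previous = group_total(cohort_totals, born - span, span)
    following = group_total(cohort_totals, born + 1, span)
    return previous, following

# Plate appearances by birth year, from the 1980s totals table.
pa_by_cohort = {1980: 151352, 1981: 138495, 1982: 167402, 1983: 210540,
                1984: 144812, 1985: 109480, 1986: 124542, 1987: 121873,
                1988: 62593, 1989: 69496}

# 5-year window for illustration only: 1980-1984 before, 1986-1990 after.
print(prev_and_next(pa_by_cohort, 1985, span=5))
```

Repeating this for every batting and pitching column gives group totals, and the attribute rates for a group are then calculated from those totals in the same way as for a single cohort.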

The table below shows the similarity scores of the 1973-1993 cohorts to their respective previous and next 15-year groups. I also calculated a "forward score" for each cohort, which is simply its next-group similarity score MINUS its previous-group similarity score. The forward score shows just HOW forward- or backward-looking a cohort is. A positive forward score indicates a cohort is forward-looking and a negative score indicates it is backward-looking, and a score above +50 (or below -50) means that the cohort is VERY forward- (or backward-) looking.

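| Born | Prev. | Next | Forward |
|---|---|---|---|
| 1973 | 957 | 950 | -7 |
| 1974 | 953 | 919 | -34 |
| 1975 | 950 | 941 | -9 |
| 1976 | 951 | 945 | -7 |
| 1977 | 955 | 944 | -12 |
| 1978 | 940 | 948 | 8 |
| 1979 | 939 | 931 | -8 |
| 1980 | 943 | 916 | -27 |
| 1981 | 938 | 954 | 16 |
| 1982 | 958 | 948 | -10 |
| 1983 | 956 | 941 | -15 |
| 1984 | 932 | 961 | 28 |
| 1985 | 915 | 965 | 50 |
| 1986 | 927 | 955 | 27 |
| 1987 | 934 | 941 | 7 |
| 1988 | 878 | 942 | 65 |
| 1989 | 926 | 933 | 7 |
| 1990 | 884 | 892 | 7 |
| 1991 | 886 | 938 | 52 |
| 1992 | 891 | 943 | 52 |
| 1993 | 881 | 906 | 25 |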

I've shaded the positive forward scores green and the negative forward scores red. Every cohort from 1973 to 1983, except for 1978 and 1981, is backward-looking. Every cohort from 1984 to 1993 is forward-looking.
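The forward score and its labels can be written as a short Python sketch. The inputs are the rounded scores from the table, so a result can be off by one from the published forward score (1988 gives 64 here against 65 in the table), presumably because the published figures use unrounded similarity scores. The function names and label strings are hypothetical.

```python
def forward_score(prev_score, next_score):
    """Next-group similarity minus previous-group similarity."""
    return next_score - prev_score

def describe(score):
    """Label a cohort using the thresholds of 0 and +/-50."""
    direction = "forward-looking" if score > 0 else "backward-looking"
    return ("very " + direction) if abs(score) > 50 else direction

# (previous, next) similarity scores for a few cohorts, from the table.
scores = {1980: (943, 916), 1981: (938, 954), 1985: (915, 965), 1988: (878, 942)}

for born, (prev_score, next_score) in scores.items():
    f = forward_score(prev_score, next_score)
    print(born, f, describe(f))
```

A score of exactly zero is treated as backward-looking here, which is an arbitrary choice; the post does not say how ties are handled.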

While Bill James declined to develop a "specific protocol...based on this method," he did state, as a general rule, that "an 'epoch' is formed by a series of forward-looking seasons, followed by a series of backward-looking seasons." But what he was really looking for was the "hard break" between epochs - a series of backward-looking seasons (the end of one epoch) followed by a series of forward-looking seasons (the beginning of a new epoch). He was looking for "fault lines" separating "natural groups of seasons," just as Strauss & Howe looked for boundaries separating cohesive cohort-groups.

I showed the 1973-1993 cohorts in the table above, not because those cohorts form a cohesive group, but because they're halves of two different groups; the second half of one group and the first half of the next group, with the boundary between the two groups appearing to fall between 1983 and 1984. But it's not a clean break; not all of the 1973-1983 cohorts are backward-looking, and the 1981-1983 cohorts are all fairly similar to both their respective groups.

I know if a cohort is forward- or backward-looking, and how forward- or backward-looking it is; now I need a way to determine if a cohort is part of a forward- or backward-looking trend. And since I do want a specific protocol for defining generations by an objective process, I'm adding what I'll call a "trend score" for each cohort. The trend score - as its name implies - checks each cohort's forward score to see if it is part of a trend. If a cohort's forward score is positive (forward-looking), the trend score adds to it the forward scores of the next two cohorts. If its forward score is negative (backward-looking), the trend score adds to it the forward scores of the previous two cohorts.

When at least three backward-trending cohorts are followed by at least three forward-trending cohorts, I draw a generational boundary between the last backward-trending cohort and the first forward-trending cohort.
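The trend score and the boundary rule can be sketched in Python as follows. The forward scores are the rounded values published for the 1973-1993 cohorts, so trend scores may differ by a point from published ones. Two details are assumptions: a forward score of exactly zero is treated as backward-looking, and neighbors that fall outside the list are simply left out of the sum.

```python
def trend_scores(forward):
    """Add the next two forward scores if positive, else the previous two."""
    trends = []
    for i, score in enumerate(forward):
        if score > 0:
            neighbors = forward[i + 1:i + 3]
        else:
            neighbors = forward[max(0, i - 2):i]
        trends.append(score + sum(neighbors))
    return trends

def boundaries(years, trends, run=3):
    """Find spots where `run` backward-trending cohorts meet `run` forward ones."""
    found = []
    for i in range(run, len(trends) - run + 1):
        backward = all(t < 0 for t in trends[i - run:i])
        forward = all(t > 0 for t in trends[i:i + run])
        if backward and forward:
            found.append((years[i - 1], years[i]))
    return found

years = list(range(1973, 1994))
forward = [-7, -34, -9, -7, -12, 8, -8, -27, 16, -10, -15,
           28, 50, 27, 7, 65, 7, 7, 52, 52, 25]

trends = trend_scores(forward)
print(boundaries(years, trends))
```

Because a boundary needs a backward-trending cohort immediately followed by a forward-trending one, checking the three cohorts on each side of every gap is enough to enforce the "at least three" rule.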

Rather than trying to explain any further how or why trend scores work now, I'll go ahead and start locating generational boundaries and explain them as I go. Here are the first ten baseball cohorts, 1831 to 1840:

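| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1831 | 204 | -788 | -992 | -992 |
| 1832 | 204 | -71 | -275 | -1,268 |
| 1833 | 545 | -59 | -604 | -1,872 |
| 1834 | 545 | -82 | -627 | -1,507 |
| 1835 | -424 | 806 | 1,230 | 929 |
| 1836 | 821 | 732 | -89 | 514 |
| 1837 | 97 | -115 | -211 | 929 |
| 1838 | 834 | 732 | -101 | -402 |
| 1839 | 109 | -70 | -179 | -492 |
| 1840 | 736 | 665 | -71 | -352 |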

Bill James gave a couple of caveats to his rule for defining epochs:

"1) Sometimes it is not a series of backward-looking years that ends an epoch, but just one year, and 2) Sometimes what ends an epoch is not a backward-looking phase, but rather a large difference between two adjacent seasons."

I take it to also be true that sometimes it is just one cohort, or a large difference between two adjacent cohorts, that STARTS a generation, and I tried to build these caveats into my trend scores. Even though the 1835 cohort is the only forward-looking cohort in its group, it is SO different from the cohorts that came before that it should be the start of a new generation. (The 1835 cohort consists of Harry Wright, the firstborn player to have a real major league career; the two players older than him appeared in just one game each as forty-somethings.) So even though the 1836 and 1837 cohorts are backward-looking, they're forward-trending because 1835's forward score is so high it overwhelms their negative scores.

The next several generational boundaries are easy to spot, without trend scores. We can draw one between 1856 and 1857:

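| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1852 | 877 | 756 | -121 | -576 |
| 1853 | 878 | 624 | -254 | -578 |
| 1854 | 799 | 795 | -4 | -380 |
| 1855 | 893 | 735 | -158 | -417 |
| 1856 | 861 | 758 | -103 | -265 |
| 1857 | 783 | 842 | 59 | 206 |
| 1858 | 826 | 844 | 18 | 174 |
| 1859 | 787 | 916 | 129 | 206 |
| 1860 | 847 | 875 | 28 | 165 |
| 1861 | 842 | 891 | 49 | 238 |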

And between 1873 and 1874:

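| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1869 | 940 | 933 | -7 | 27 |
| 1870 | 900 | 884 | -16 | -3 |
| 1871 | 909 | 886 | -22 | -45 |
| 1872 | 923 | 885 | -38 | -76 |
| 1873 | 943 | 917 | -26 | -86 |
| 1874 | 919 | 929 | 10 | 44 |
| 1875 | 853 | 885 | 32 | 82 |
| 1876 | 920 | 921 | 1 | 58 |
| 1877 | 877 | 925 | 48 | 76 |
| 1878 | 911 | 920 | 9 | 42 |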

And between 1892 and 1893:

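| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1888 | 897 | 909 | 12 | -67 |
| 1889 | 935 | 894 | -42 | -97 |
| 1890 | 930 | 892 | -38 | -67 |
| 1891 | 926 | 863 | -63 | -142 |
| 1892 | 922 | 862 | -61 | -161 |
| 1893 | 884 | 920 | 36 | 132 |
| 1894 | 905 | 917 | 11 | 185 |
| 1895 | 850 | 934 | 84 | 225 |
| 1896 | 864 | 954 | 89 | 171 |
| 1897 | 871 | 922 | 51 | 110 |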

And between 1911 and 1912:

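| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1907 | 904 | 952 | 48 | -83 |
| 1908 | 956 | 903 | -53 | -49 |
| 1909 | 959 | 880 | -79 | -83 |
| 1910 | 963 | 879 | -84 | -216 |
| 1911 | 944 | 882 | -61 | -224 |
| 1912 | 886 | 944 | 58 | 212 |
| 1913 | 875 | 947 | 72 | 177 |
| 1914 | 881 | 963 | 82 | 116 |
| 1915 | 904 | 927 | 23 | 47 |
| 1916 | 912 | 924 | 11 | -27 |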

It looks like there might be a boundary between 1922 and 1923:

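| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1918 | 927 | 876 | -51 | -27 |
| 1919 | 941 | 867 | -74 | -112 |
| 1920 | 915 | 921 | 6 | -26 |
| 1921 | 926 | 896 | -30 | -98 |
| 1922 | 898 | 896 | -2 | -26 |
| 1923 | 902 | 920 | 18 | 52 |
| 1924 | 882 | 911 | 29 | 46 |
| 1925 | 919 | 924 | 5 | -13 |
| 1926 | 913 | 925 | 12 | -28 |
| 1927 | 936 | 905 | -31 | -13 |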

There are several mostly backward-looking cohorts followed by several forward-looking cohorts. But the 1925 and 1926 cohorts aren't forward-looking enough to be forward-trending, so the forward trend fizzles after the 1923 and 1924 cohorts, which means it doesn't meet my standard of at least three forward-trending cohorts. Besides, a boundary here would mean a generation of just 11 cohort birthyears (1912-1922), which is too short to be a true generation.

The actual boundary is six years later, between 1928 and 1929:

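| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1924 | 882 | 911 | 29 | 46 |
| 1925 | 919 | 924 | 5 | -13 |
| 1926 | 913 | 925 | 12 | -28 |
| 1927 | 936 | 905 | -31 | -13 |
| 1928 | 932 | 922 | -10 | -28 |
| 1929 | 882 | 943 | 61 | 94 |
| 1930 | 884 | 912 | 28 | 90 |
| 1931 | 885 | 889 | 4 | 59 |
| 1932 | 878 | 935 | 57 | 69 |
| 1933 | 931 | 929 | -2 | 59 |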

This time the backward-trending cohorts (1925-1928) are followed by a sustained forward trend. The 1929 cohort is the MOST forward-looking cohort since 1914, in the first wave of the previous generation.

The forward trend lasts through the 1941 cohort, and is then followed by 20 consecutive backward-trending cohorts. The next generational boundary isn't until 1961/62, 33 birthyears after the last one.

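| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1957 | 959 | 863 | -95 | -185 |
| 1958 | 921 | 896 | -24 | -154 |
| 1959 | 928 | 877 | -51 | -170 |
| 1960 | 938 | 906 | -32 | -107 |
| 1961 | 934 | 875 | -59 | -142 |
| 1962 | 905 | 938 | 33 | 144 |
| 1963 | 864 | 975 | 110 | 108 |
| 1964 | 906 | 907 | 1 | 48 |
| 1965 | 926 | 924 | -3 | 108 |
| 1966 | 895 | 945 | 50 | 137 |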

And finally, we're back to the boundary between the two currently-active generations:

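| Born | Prev. | Next | Forward | Trend |
|---|---|---|---|---|
| 1979 | 939 | 931 | -8 | -12 |
| 1980 | 943 | 916 | -27 | -27 |
| 1981 | 938 | 954 | 16 | -9 |
| 1982 | 958 | 948 | -10 | -21 |
| 1983 | 956 | 941 | -15 | -9 |
| 1984 | 932 | 961 | 28 | 106 |
| 1985 | 915 | 965 | 50 | 85 |
| 1986 | 927 | 955 | 27 | 99 |
| 1987 | 934 | 941 | 7 | 79 |
| 1988 | 878 | 942 | 65 | 79 |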

And it looks like the boundary is indeed between 1983 and 1984, at least for now. These cohorts are still adding to their batting and pitching totals. The 1981-1983 group could possibly slip into the younger generation (I hope it does, anyway; it's hard to imagine the baby-faced Miguel Cabrera of the 1983 cohort being in the same generation as Clemens and Bonds).

So that's eight boundaries, which divides every MLB player in history into nine generations. Leaving out the first and last (partial) generations, the cohort lengths of the middle seven baseball generations range from 17 to 33 years and average 21.3 years. This average nearly matches Strauss & Howe's 22-year "phase of life", or the length of a very long major league career.
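The generation lengths can be checked with a few lines of Python. The list holds the last birth year of each generation that ends at a boundary. It assumes the first boundary falls between 1834 and 1835, as the Harry Wright discussion implies; the other seven are the boundaries located above. The variable names are hypothetical.

```python
# Last birth year before each of the eight boundaries.
# 1834 is an assumption read off the 1831-1840 table and the Harry Wright note.
last_years = [1834, 1856, 1873, 1892, 1911, 1928, 1961, 1983]

# Each middle generation runs from the year after one boundary to the next.
lengths = [later - earlier for earlier, later in zip(last_years, last_years[1:])]
average = sum(lengths) / len(lengths)

print(lengths)
print(min(lengths), max(lengths), round(average, 1))
```

The seven lengths correspond to the generations born 1835-1856, 1857-1873, 1874-1892, 1893-1911, 1912-1928, 1929-1961, and 1962-1983.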