Sabermetric Research

Monday, November 26, 2007

The "K" study for real

[UPDATE: originally, this post had that the study used only players whose initials were both K. Commenters told me that I misread the study, that it was players where EITHER initial was K. Sorry for my screw-up. The original (incorrect) post can be obtained from me if you want it. This is now new.]

The biggest difference was in the 1920s -- only 1.2%. The biggest modern difference is in the 60s and 80s, at 0.6%. From 2000-2003, the effect went the other way -- 15.4% for K players, 16.4% for the others.

So I don't see where the authors' 1.6% difference comes from. I'll try a significance test simulation later and update this post.

UPDATE: The "K" players in real life had 566,374 PA. I ran a simulation that had an average of 568,558 PA. The SD of strikeout rate was 0.43 percentage points, which is higher than the 0.3 points that I observed. So the big question remains: why did the authors get such a high strikeout rate difference?

FURTHER UPDATE: Mystery solved by Tango! It looks like the authors weighted every player equally, instead of weighting every PA equally. See comments.

With that, I was able to duplicate the difference in strikeout rates. The "K" players did indeed strike out 1.5 percentage points more than the non-Ks:

14.7% for K players (37096/252439)
12.8% for the others (1365946/10607440)
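The player-weighted vs. PA-weighted distinction can be sketched in a few lines, using a made-up three-player dataset (the numbers here are mine, not from the study):

```python
# Sketch of the player-weighted vs. PA-weighted distinction, using made-up numbers.
# Each tuple is (strikeouts, plate_appearances) for one hypothetical "K" player.
players = [(30, 100), (20, 200), (10, 400)]

# Weighting every PA equally: pool all strikeouts and all PA, then divide.
pa_weighted = sum(so for so, pa in players) / sum(pa for so, pa in players)

# Weighting every player equally: average each player's individual rate.
player_weighted = sum(so / pa for so, pa in players) / len(players)

print(f"PA-weighted:     {pa_weighted:.3f}")      # 60/700 ≈ 0.086
print(f"Player-weighted: {player_weighted:.3f}")  # (0.300 + 0.100 + 0.025)/3 ≈ 0.142
```

The player-weighted figure comes out higher whenever the low-PA players (who tend to be the worse hitters) have the higher strikeout rates -- which is exactly how it could inflate the K/non-K gap.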

I didn't recheck significance levels, but my guess is that the difference is about 3 SDs.

However, the difference can be fully explained by the fact that first names starting with K are more popular now than they were 50 years ago. So proportionally more of the "K" hitters played in high-strikeout eras.

Go to the "Baby Name Voyager," choose "boys" only, and enter "K". You'll see a consistent rise in K names from the 19th century to the end of WWII -- about 10 times as many "K" boys at the end than at the beginning. But then, they accelerate upward even faster, doubling again by the late 1960s before dropping a little bit after that. (Most of the post-war effect, by the way, seems to be concentrated in "Kevin." Which makes sense; I can't think of any really old guys named Kevin. Or Kyle, for that matter.)

If you average out the calendar seasons played by Ks, you get 1977. If you average the seasons played by non-Ks, it's 1963.

I think this accounts for the entire effect. Here are the strikeout rates for Ks vs. the non-Ks by decade (players with 100+ AB):

Once you normalize by decade, the effect all but disappears. From 1960 to 2003, the rates are exactly the same. There does appear to be a small "K" effect from 1914 to 1959, but it almost certainly is not statistically significant.

But maybe the authors did correct for this, or did something different. We can check for sure when the study comes out.

First, the numbers are very high. The authors found at least a 17.2% strikeout rate. But between 1913 and 2003, the years in the study, only four seasons had an overall strikeout rate at least that high. And the highest overall was only 17.8%, while some of the early years had rates below 10%, and the middle years are in the low teens. (And all those numbers come from considering only AB and BB in the definition of PA.)

The authors did limit their dataset to players with 100 PA or more, but that should *lower* the strikeout rates, by eliminating lots of pitchers. So how did they get 17.2%?

Maybe they used AB as the denominator instead of PA. That gives an overall rate of only about 14% up to 2003.

Second, the difference between the K players and everyone else isn't as big as the authors say. The authors found a 1.6 percentage-point difference. David Gassko's study (using a different dataset) found a difference of only 0.5 percentage points (15.5% versus 15.0%). I checked all players from 1913 to 2003, and found a difference of only 0.2 points:

13.1% for "K" players (50761 out of 387611)
12.9% for the others (1352281 out of 10472268)

(I used all players, though, not just players with enough PA.)

Finally, to check statistical significance, I ran a simulation. I pulled random players who were born after 1934 (to roughly match Gassko's dataset), and arbitrarily decided their last names started with "K". I kept pulling until the total PA went past 464664, to come close to Gassko's number (although they would all be a bit higher than 464664). Then I computed their overall strikeout rate.

I repeated that 100 times. The results:

Mean 14.94%, SD 0.44%

That means that Gassko's result -- a 0.5-point difference -- is only about 1 SD higher than the mean. And my result is only half an SD higher than the mean. So, no statistical significance.
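For what it's worth, here's a minimal sketch of that kind of draw-until-the-PA-target simulation. The player pool is fabricated (a beta distribution roughly shaped like strikeout rates), so the SD won't match my 0.44% figure exactly:

```python
import random

# Minimal sketch of the check described above: repeatedly draw random players
# until their combined PA passes a target, then compute the pooled strikeout
# rate of the draw. The player pool here is made up, not real data.
random.seed(1)
pool = [(random.betavariate(5, 30), random.randint(100, 600)) for _ in range(5000)]
# each tuple: (true strikeout rate, plate appearances) for one fake player

TARGET_PA = 464_664  # roughly Gassko's "K"-player PA total

def one_draw():
    so = pa = 0
    while pa < TARGET_PA:
        rate, n = random.choice(pool)
        so += rate * n
        pa += n
    return so / pa

draws = [one_draw() for _ in range(100)]
mean = sum(draws) / len(draws)
sd = (sum((d - mean) ** 2 for d in draws) / len(draws)) ** 0.5
print(f"mean {mean:.4f}, SD {sd:.4f}")
```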

So what's going on? I guess we have to wait for the original study to be released before we find out.

P.S. From a quick glance, it looks like only one letter in Gassko's study is statistically significant -- the "O". And exactly one significant result out of 26 is itself not significant.

Suppose you want to get from Meville to Youtown. The cities are 11.66 miles apart, as the crow flies. But you can't get there in a straight line. Instead, you have to go east 6.32 miles, then north 9.72 miles.

Now, ask yourself: what percentage of the 11.66 mile difference is "explained" by the 6.32-mile eastbound leg? And what percentage is explained by the 9.72 mile northbound leg?

One way to answer the question is to figure, hey, the trip is 16.04 miles total. So the 6.32-mile leg is 39.4% of the total, and the 9.72-mile leg is 60.6% of the total:

39.4% short leg
60.6% long leg

Alternatively, you might figure that distance doesn't matter, just time. And it turns out the short leg takes 20 minutes to ride on your bike, but the second leg also takes 20 minutes, even though it's longer, because it's on smoother terrain. So it's

50% short leg
50% long leg

If you don't like those choices, then, intuitively, you might do it by the amount of gas used by each leg if you drive. Or how much you have to pay for a bus ticket for each leg. Or any other measure like that.

Or, you could do it this way. You could say: by the pythagorean theorem and the little map above, we know that 6.32 squared plus 9.72 squared = 11.66 squared. Put into numbers, the short leg squared is 39.94, the long leg squared is 94.48, and the total distance squared is 135.96. So if we divide the squares, we can say that the short leg is 39.94/135.96, and the long leg is 94.48/135.96:

29.4% short leg
70.6% long leg

To me, among all these alternatives, this last one makes the *least* amount of sense. Why use the squares? The numbers 39.94 and 94.48 don't really mean anything. There is no human-related description of this trip that could make the 29.4% figure mean something intuitive. Squaring the distances squeezes the meaning out of them.

But this last method is exactly what r-squared is doing!
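The apportionments above can be checked in a couple of lines, using the post's 6.32 and 9.72. (One caveat: because 6.32, 9.72, and 11.66 are rounded, the exact squared split comes out 29.7/70.3 rather than the rounded 29.4/70.6.)

```python
import math

short, long_ = 6.32, 9.72
total = math.hypot(short, long_)  # exact crow-flies distance, just under 11.66

# Method 1: each leg's share of total distance traveled.
by_distance = short / (short + long_)            # ≈ 39.4%

# The r-squared method: each leg's share of the squared distances.
by_square = short**2 / (short**2 + long_**2)     # ≈ 29.7%

print(f"crow-flies distance:    {total:.2f} miles")
print(f"short leg, by distance: {by_distance:.1%}")
print(f"short leg, by squares:  {by_square:.1%}")
```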

Take a baseball example. The numbers in the above example are exactly the standard deviations that apply to team wins. For a certain group of seasons (that I don't have in front of me right now), the standard deviation of team wins is 11.66 wins. Decomposing that, and making a few simplifying assumptions, you can figure out that it's comprised of

Luck, with a standard deviation of 6.32 wins;
Talent, with a standard deviation of 9.72 wins.

If you assume luck and talent are independent, then, by the mathematical properties of standard deviations, the relationship between the three SDs (luck, talent, and total) is the same as the sides of a right triangle: 6.32 squared plus 9.72 squared equals 11.66 squared. That's the same right triangle we had for the Meville to Youtown trip.

Ask yourself now: how much of team wins is explained by luck, and how much by talent? An intuitive answer (which I'm not advocating, just noting that it's intuitive) would be to notice that the talent number is about 50% higher than the luck number.

But statisticians wouldn't do that. They'd square everything, like in the Meville to Youtown example. Since 6.32 squared divided by 11.66 squared equals .294, statisticians would say that the r-squared between luck and wins is .294. Or, they'd say

29.4% of the variance in wins is explained by luck;
70.6% of the variance in wins is explained by talent.

That's true, as a mathematical statement. Notice the words "of the variance in wins". The variance is the standard deviation squared – about 136 square wins. And that number doesn't mean anything. Squaring the standard deviation is arbitrary, as arbitrary as squaring the 6.32 mile distance travelled on the eastbound leg of the trip. If you're going to square it, why not cube it? Or take the square root, or the fifth power? Well, actually, there are very good reasons to take the square – but those reasons are related to statistical properties, not real life.
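As a sanity check on that mathematical statement, a quick simulation (fabricated seasons where wins = talent + luck, with the SDs above) shows the squared correlation of luck with wins landing near the variance ratio:

```python
import random

# Simulate seasons as wins = talent + luck, with the post's SDs
# (luck 6.32, talent 9.72), and check that the squared correlation of luck
# with wins lands near the variance ratio -- not the ratio of the SDs.
random.seed(0)
n = 100_000
luck   = [random.gauss(0, 6.32) for _ in range(n)]
talent = [random.gauss(0, 9.72) for _ in range(n)]
wins   = [l + t for l, t in zip(luck, talent)]

def corr(xs, ys):
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

r2 = corr(luck, wins) ** 2
print(f"r-squared of luck vs. wins: {r2:.3f}")  # near 6.32^2 / (6.32^2 + 9.72^2)
```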

And if you're talking about real life concepts, those are what you should be measuring. If a $20 shirt is on sale for $10, it's 50% off. The fact that if a $20-squared shirt were on sale for $10-squared, it would be 75% off ... well, that just doesn't matter.

I still argue that when a statistician says "XX% of the variance in Y is explained by Z," that is something that is only meaningful to statisticians. It is not something that should have a lot of meaning to a casual fan, or in real life.

Tuesday, November 20, 2007

But what about Steve Balboni and Harold Reynolds?

The study, by Leif Nelson and Joseph Simmons, says that batters whose names started with "K" struck out 18.8% of the time, as compared to only 17.2% for everyone else. That result is statistically significant, the authors say.

Also, students whose names begin with "C" and "D" are more likely to get Cs and Ds, especially if they say they like their initials. And the November 19 issue of Sports Illustrated also points out (page 30) the connection between Barry Bonds' initials and one of his all-time records. The paper will appear in "Psychological Science." Any sabermetrically-inclined psychologists with access to this paper should feel free to send it along when it comes out.

------

UPDATE: in the comments, Tango points to a post by David Gassko, who ran a similar study for all the letters of the alphabet. Gassko found that some letters are even more strikeout-prone than K. (By the way, the reason the results are statistically significant is that the null hypothesis -- that players whose names start with J have exactly the same skills as players whose names don't start with J -- is false.)

Wednesday, November 14, 2007

Rany Jazayerli's MLB draft study

My last post was partly about the 1984 Bill James draft study, which found that college players proved to be better choices than high-school players, and that hitters provided a better return on investment than pitchers.

In the comments, a couple of readers pointed me to separate studies by Philly and Rany Jazayerli. I haven't gone through Philly's analysis yet, but I did read through Jazayerli's.

It's a great 12-part series that appeared at Baseball Prospectus over the course of a year or so. (Here's a link, as provided by commenter VKW -- look for part 1 to part 12 in the article listing.) It's so thorough that I'm surprised it hasn't had more publicity (although maybe it has; I don't keep up as well as I used to).

If you don't have patience for all 12 articles, Part 11 is a summary of the main findings.

Basically, what Jazayerli found is that much of the gap between college players and high school players has disappeared. Why? Perhaps teams learned from experience, and from these types of studies. Over the past several years, the availability of high-school talent has increased (Jazayerli speculates it's that large signing bonuses are convincing young players to sign instead of going on to college). But the proportion of high-schoolers being drafted has stayed the same, or even dropped a bit. That means that teams are actually less likely to draft a given high-schooler than they were before. So they're concentrating more on the better ones, which increases the returns.

In the period covered by the study, college hitters are still the best drafting bet, but not as much as they used to be. Moreover, in the years since Moneyball revealed the "draft college players" strategy, teams have drafted so few high-schoolers that Jazayerli argues that they might now be worth 40% *more* than college players!

Lots of other good stuff in these studies ... I'm not sure what to make of the breakdowns by position. Some positions seem like better choices than others, but it seems to me that the sample sizes are pretty small compared to the variances between players.

In any case, I think this series joins the Bill James study as a must-read for anyone doing research in this area.

Monday, November 12, 2007

The value of an MLB draft choice

Back in 1984, Bill James examined the value of an MLB draft pick. In a definitive 36-page issue of his newsletter, he looked at the draft history from 1965 to 1983, and found, among many other things, that:

-- players out of college make much better selections than high-school players, producing 84% more value after adjusting for draft position.

-- pitchers make slightly worse choices (about 12% worse than average). They were 44% worse than average in the first round (relative to all first-round choices), but about average in subsequent rounds.

-- players from California and the northern states delivered about 20% higher return than expected. Players from the South were poor choices, producing 35% less than average.

The study was done as a special issue of James' newsletter, and, to my knowledge, has not been reprinted since. That's unfortunate; it probably went out, originally, to only several hundred people, and it's on a topic of great interest to economists these days. And its regression-free methodology is completely understandable and convincing.

Anyway, the reason I bring this up is because I came across a 2006 study (probably from a mention somewhere on The Wages of Wins) that touches on some of the same issues that James did. It's called "Initial Public Offerings of Ballplayers," by John D. Burger, Richard D. Grayson, and Steven J. K. Walters (I'll call them BGW).

One of the most interesting things in the study is actually the authors quoting from another study by Jim Callis (of Baseball America). Callis looked at the first ten rounds of all the drafts from 1990 to 1997. Breaking down the data in various ways, he counted the number of players who became "stars," "good players," or "regular players."

Callis found that high-school draftees have closed much of the gap as compared to college players, at least in the first round:

(BGW find that none of the differences were statistically significant.)

The results appear to be different from Bill James' results back in 1984. Why? The most obvious explanation is that general managers read Bill's study, and adjusted their drafting accordingly.

It's important to note that Bill's study didn’t show that high-school players are worse than college players – it showed, rather, that the high-schoolers underperform *relative to teams' expectations,* where the expectations are measured by how high they drafted the player. So, between 1984 and 1990, perhaps teams just learned to lower their expectations.

BGW write,

"In 1971, for example, *every* first-round draftee was a high-school player, but by 1981 the majority of first-round picks were collegians for the first time. Over the 1990-'97 sample period ... the proportions from each pool were roughly equal, though recent years have seen a greater predilection for college players."

The James study also shows some evidence of teams moving more towards college players over time. 1967 apparently saw NO college players drafted in the first few rounds, and 1971 was second-worst, with high-school picks forming over 96% of the expected value of the draft. But if you remove 1967-71, years when very few college players were drafted, the numbers from 1965 on show a level trend of about 25% of draft value going to collegians.

BGW took Callis' numbers one step further, and looked at first-round pitchers specifically. They found that pitchers were moderately less likely to become regulars (as compared to position players), but FAR less likely to become above-average players:

I couldn't find the sample sizes, but BGW report that only the difference in the "good" column is statistically significant.

----

These results are very interesting, but they're not really what the paper is about. What BGW are trying to do is look at bonus payments made to draftees, and to see if they match what the player actually did. That is, they're trying to do for baseball what Massey and Thaler did for football.

Here's how they did it. First, they ran a regression to estimate how much money every additional win is worth to a team. They then went through every draft choice to see how many wins each player contributed (WARP). This let them figure out how much a particular player contributed to the team's bottom line.

Then, they ran a second regression to estimate, on average, how much a particular draft choice, at a specific draft position, will expect as a bonus. Comparing actual performance dollars to bonus dollars paid allowed BGW to see which draft choice positions are the best financially. Is it the early choices, which are expensive but produce the best players? Or do the later choices, which are much cheaper, provide the best returns?

It's a good idea, but I think there's one fatal flaw. The authors compute "best buys" based on "internal rate of return." That's basically the effective annual interest rate the owners are getting on that player. (To use a simplified example: suppose the team pays $1 million in bonus to player X. Three years later he earns the team $4 million by his performance, and then suffers a career-ending injury. The team turned its $1MM into $4MM in three years, which is an internal rate of return of 59% -- $1, compounded at 59% over three years, returns $4.)
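The simplified example works out like this -- with a single payoff, the internal rate of return is just the compound annual growth rate:

```python
# The simplified example above: a $1MM bonus returns $4MM after three years,
# then nothing. With one cash flow, the IRR is the compound annual growth rate.
bonus, payoff, years = 1.0, 4.0, 3
irr = (payoff / bonus) ** (1 / years) - 1
print(f"IRR: {irr:.0%}")  # 59%
```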

But I don't think that's the right way to do it.

To see why, look at one of BGW's findings, which is that drafted college players earn 43%, while drafted high-school players earn only 27%. The authors conclude that college players are better choices than high-schoolers.

But that's not necessarily right. High-school players take longer to mature than collegians (if only because they're younger). So the rates of return are based on different time periods. Which would you rather have as a return on your investments: 43% for one year, or 27% a year for five years? I'd rather have the 27%. After five years at 27%, my $1 will have grown into $3.30. But after one year at 43%, I'll have only $1.43. Sure, I'll have four more years to invest my $1.43, but now the 27% option is no longer available, so I'm stuck investing at 7% or something, which brings my total to $1.87 -- still much less than the $3.30 I got with the high-schooler.

Put another way: would you rather pay $1 to get "Rance," a decent but unspectacular player now (worth $2 next year), or pay $1 to get "Junior," who in five years will be an MVP candidate (eventually worth $20)? The first way gets you a 100% return on your investment. The second way gets you "only" 82% a year for five years. But no team would choose Rance over Junior under those circumstances.

In the economics of the draft, it's not capital that's scarce: it's the draft choices themselves. The important thing is not how much you make per dollar of bonuses, it's how much you make per player drafted. Rance is worth $1, while Junior is worth $19. To account for the fact that Junior's value doesn't surface for an additional four years, you might bump up the four years by an appropriate rate on the bonus money you spent – maybe 10%. That means that, after five years, Rance is worth $1 plus four years' interest on the $1 – for a total of $1.46. That's still much less than the $19 Junior is worth.
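The compounding arithmetic in the last few paragraphs checks out:

```python
# Checking the compounding arithmetic from the Rance/Junior argument.
high_schooler = 1.27 ** 5        # 27%/yr for five years ≈ $3.30
collegian = 1.43 * 1.07 ** 4     # 43% for one year, then ~7%/yr ≈ $1.87
junior_rate = 20 ** (1 / 5) - 1  # $1 -> $20 over five years ≈ 82%/yr
rance_in_5 = 1.00 * 1.10 ** 4    # Rance's $1, plus four years' interest at 10% ≈ $1.46

print(f"five years at 27%:  ${high_schooler:.2f}")
print(f"43%, then 7%/yr:    ${collegian:.2f}")
print(f"Junior's annual return: {junior_rate:.0%}")
print(f"Rance after five years: ${rance_in_5:.2f}")
```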

So for this reason, I think the authors' conclusions – that teams irrationally overvalue high-school players compared to collegians – are not supported by their methodology.

----

As for how much money a draft choice is actually worth, the authors do have a bit of evidence that casts some light on the question:

"In 1996, four first-round draft picks took advantage of a loophole in the draft's rules, becoming free agents because their drafting teams had missed a specified deadline for making contract offers. In the frantic bidding that resulted, the four players [Travis Lee, John Patterson, Matt White, and Bobby Seay] received bonuses that *averaged* over $5 million, more than two-and-one-half times the amount paid to the first (and otherwise-highest-paid) selection in that year's draft."

Assuming what was paid to these four players is typical, the signing bonuses that draftees actually receive are about 40% of their actual value as free agents. Additionally, BGW calculated that the actual performance value of a first round choice is $3.25 million; in that case, the $5 million paid is actually about 50% too high, but the implied $2 million signing bonus is not – it's only 60% of the actual value.
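The back-of-envelope arithmetic here is just:

```python
# The back-of-envelope arithmetic in the passage above.
free_agent_avg = 5.0                   # $MM: average bonus for the four loophole free agents
top_pick_bonus = free_agent_avg / 2.5  # "two-and-one-half times" implies ~$2MM for the top pick
share_of_market = top_pick_bonus / free_agent_avg  # ≈ 40% of free-agent price

perf_value = 3.25                      # $MM: BGW's estimated value of a first-round pick
share_of_value = top_pick_bonus / perf_value       # ≈ 60% of performance value

print(f"bonus as share of free-agent price:  {share_of_market:.0%}")
print(f"bonus as share of performance value: {share_of_value:.0%}")
```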

Still, paying a draft choice 60% of what he'll eventually be worth is a pretty good chunk of money. So, similar to what Massey and Thaler found for football, draft choices in MLB aren't as lucrative as one might think.

Tuesday, November 06, 2007

"Homegrown players" -- a viable strategy?

In 2007, the four teams in the league championship series had 187 of their wins (as measured by Win Shares) contributed by players "homegrown" by the respective team. That's up 68% from last year, and 43% more than the "recent average."

Adams doesn't make an explicit argument, but the implication is that teams are focusing on player development, rather than on signing free agents (who are getting very expensive), and that the strategy is working.

I'd argue that the strategy is not so much to concentrate on homegrown players, but perhaps to concentrate on *cheaper* players. After all, if you have a star in his "slave" years earning only $380,000, it doesn't matter whether he came from your farm system or someone else's. Either way, he's going to help you win equally.

The Indians, Diamondbacks and Rockies were all in the bottom eight payrolls for 2007. They won because their low-priced players performed well, not necessarily because their homegrown players performed well.

However, there is an argument that homegrown players are a better investment:

"Executives say promoting your own players makes sense not only because they are familiar, but because everyone in the organization knows how they've been trained. Instructors in the Phillies' farm system, for instance, follow a manual that describes the "Phillies' way" of doing everything from warming up a pitcher's arm to defending a bunt. Promoting from within is "a safer way to go," says the team's assistant general manager Mike Arbuckle."

Even if you don't accept that the "Phillies' way" is better than the "Brewers' way" or the "White Sox' way," it's still possible that bringing the player up yourself can benefit the team. You'd expect that the team that knows the player best would be the best judge of his major-league expectation. By watching the player carefully, perhaps the Phillies can avoid the mistake of bringing a player up too early. But if they got the guy in trade from the Astros, they might not know enough to make a proper judgment. (I don't know of any evidence either way.)

But still, there's nothing to stop other teams, even free-spending ones, from also developing homegrown players. Even high-spending teams have a budget. If the Red Sox, for instance, find a gem in their minor-league system, they can trade away the expensive free agent at his position, and use the money for someone else.

For teams with little money to spend on salaries, there is an obvious strategy, one that's also used in Rotisserie. You trade your expensive players for young minor-league talent. Eventually the acquired players are ready for the big leagues, and you get three years of free service out of them (and a couple of still reasonably-priced arbitration years). If that's what these teams are doing, then, again, it's not the "homegrown" factor at work – it's the "cheap" factor.

Monday, November 05, 2007

Evidence that NBA teams play better when rested

Do NBA teams play better if they've had more days of rest? Conventional wisdom says they do, and so does a study by Oliver Entine and Dylan Small.

(The study was presented at NESSIS, the New England Symposium on Statistics In Sports. Thanks to Paul Wendt of SABR, who pointed out that presentation slides from several studies are online. This particular study can be found here (.pdf).)

The subject of Entine and Small's study was actually home field advantage (HFA), but the results on rest are more interesting, so I'll start with them.

The authors ran a regression on points scored minus points allowed (UPDATE: this used to just say "points scored"), using indicator variables for each team, the visiting team, which team was at home, and four additional indicators per team -- whether it was playing on 0 days' rest (back-to-back games), 1 day, 2 days, or 3+ days.

Only the 2.26 figure is statistically significant (at exactly .05). Only one season's worth of data was used. It would be nice to re-run this using a decade or so (for hockey and baseball too).

2.26 points doesn't seem like a lot, but it is. Home field advantage was only 3.62 points, and resulted in a home winning percentage of .608. This is about 60% of that.

The study re-ran a (logistic) regression for wins, rather than points, and got similar results. The odds of winning are only .62 as big in back-to-back games as after 3+ days' rest. So a team that's .500 after 3+ days' rest would be 1 win per 1 loss; but on 0 days' rest, it would be 0.62 wins per loss. That works out to only a .383 winning percentage. (Again, that result is only barely statistically significant.)
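The odds-to-winning-percentage conversion works like this:

```python
# Converting the reported odds ratio into a winning percentage: a .500 team
# wins 1 game per loss; scale that by the 0.62 odds ratio for 0 days' rest.
baseline_odds = 1.0              # wins per loss for a .500 team on 3+ days' rest
odds = baseline_odds * 0.62      # wins per loss on no rest
win_pct = odds / (1 + odds)      # 0.62 / 1.62
print(f"winning percentage on 0 days' rest: {win_pct:.3f}")  # 0.383
```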

Where HFA comes in is that the authors noted that the way the NBA schedule was constructed, teams on the road played on fewer days' rest than home teams (see slide 5 for the numbers). They wondered whether that could account for the home team's advantage. They found that it only accounted for "9%" of the HFA.

Entine and Small downplay this result, but I find it quite a significant finding -- explaining even 9% of home field advantage is more than I've seen anywhere.

Thursday, November 01, 2007

Playoff "closeout" games

Only 30% of playoff games were "closeout" games, where a team could win or lose the series.

Only 25% were games where both teams were 2 or fewer wins away from winning the series.

Only 17% of games in 7-game first-round series and NBA finals met the "2 or fewer wins from winning" condition.

On the other hand, Goff argues, in the NCAA and NFL, every game meets the first two conditions. So maybe, he says, the NBA could go to best-of-3 for the first round, or something.

My take:

1. Just because a game can't close out a series doesn't mean it's not important. In truth, if you go by "series win probability added," the 0-0 game is much more important than the 3-0 game.

2. Why assume that fans care so much about series-ending or near-series-ending games? Maybe they like a mix of close series and blowout series.

3. Might fans not be upset when their 60-22 team loses to a 41-41 team in the first round? That'll happen fairly often. A best-of-three does indeed make the series less predictable -- but, it seems to me, at the cost of fairness.

4. The first round has only half the number of "both teams with two wins" games, which is almost certainly because the teams are mismatched. Instead of tinkering with the series length, why not just allow fewer teams in the playoffs?

5. The maximum possible proportion of possible "closeout" games comes when team A wins the first three games, and team B wins the next three. When that happens, 4/7 of the games are possible closeouts. If you want that, just change the rules so that the home team wins 99.9% of games. Make sure that team A is at home for the first three games, and team B the next three. Then sit back and enjoy! :)