Sabermetric Research

Phil Birnbaum

Wednesday, August 27, 2008

Wider faces = more aggressive: evidence from the NHL

The Economist reports on a Canadian study that claims that men with wider faces are more aggressive. The hypothesis is that both characteristics are caused by high testosterone levels at puberty, which is why they're linked.

For evidence, the authors looked at NHL players, and found that the wide-faced guys had significantly more penalty minutes than their narrow-visaged teammates.

The article doesn't link to the actual study, and there's one obvious question it doesn't answer. Isn't it possible that aggressiveness by wide-faced people is just learned behavior? Men with wide faces tend to be bigger men in general, and bigger men might be more aggressive because, with their strength and size, they can get away with it.

Hat tip: Tyler Cowen

Monday, August 25, 2008

How much does home-field advantage vary among teams?

What is the "reliability," in the statistical sense, of home field advantage? That is, if you had the opportunity to play a season over again, what would the correlation be between teams' HFA in the first season and their HFA in the second season?

One way to investigate the question might be to check real-life leagues from one season to the next. The problem, of course, is that things change over the winter – players move around, parks get modified, and so on.

Jones shows us a method dating back to 1910, called the Spearman-Brown prophecy formula. What you do is start by finding the correlation between half the season and the other half. To avoid timeline bias (a team might have changed personnel between the first and second halves of the season), you divide the season into odd and even games, and check the correlation that way.

Once you have that odd/even correlation, you calculate the full-season correlation like this:

full season r = 2(half season r) /(1 + half season r)

(That's a specific case of the formula, where you want to double the sample size. If you want to triple the sample size, you replace the "2" by "3" and multiply the r in the denominator by 2. In general, if your sample size multiplies by X, the formula becomes X(half season r) / (1 + (X-1)(half season r)).)

Jones defined HFA as home winning percentage minus road winning percentage. With that definition, he found the NBA half-season correlations turn out to be:

2002-03: +.243
2003-04: +.147
2004-05: +.323
2005-06: -.051

The formula doesn't work for negative correlations, but plugging those numbers in for the other three seasons gives full season reliabilities of:

2002-03: +.391
2003-04: +.256
2004-05: +.488
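For anyone who wants to check the arithmetic, here's a minimal Python sketch of the Spearman-Brown step. The function name and the `factor` argument are my own labels, not anything from Jones; the general form is the standard one, where multiplying the sample size by X gives Xr / (1 + (X-1)r).

```python
def spearman_brown(r, factor=2.0):
    """Predicted reliability when the sample is `factor` times as large."""
    return factor * r / (1.0 + (factor - 1.0) * r)

# Odd/even half-season correlations quoted above (the negative season is left out)
half_season_r = {"2002-03": 0.243, "2003-04": 0.147, "2004-05": 0.323}

# Doubling the sample (half season -> full season)
full_season_r = {season: spearman_brown(r) for season, r in half_season_r.items()}

for season, r in full_season_r.items():
    print(f"{season}: {r:+.3f}")
```

Calling `spearman_brown(r, 3)` would give the predicted reliability for a sample three times the size of the original half season.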

These look pretty high, but you have to keep in mind that the fourth one is negative. So, overall, it's not as strong a correlation as it looks. The top list, which has all four seasons, looks a bit bigger than zero, but not much.

Anyway, Jones talks about how researchers should be careful when using team HFA in research, because it doesn't meet the "80% reliability" rule of thumb for measurements. He suggests that reliability can be enhanced by grouping teams together, to create a larger sample (thus increasing the coefficient in the Spearman-Brown formula). But I don't think that's right. The formula assumes that the additional observations are from an identical sample; but the whole point of team HFA is that the Cavaliers *don't* have the same HFA as the Spurs.

I do agree with Jones that there's way too much noise in individual team HFAs to take them at face value. For one thing, there's a lot of random variation involved. The SD of HFA in a single season is 1.4 times as high as the SD of wins. For a .500 team, that's .156. That's an overestimate, since NBA results are farther from binomial than other sports (since the better team is much more likely to win in basketball than, say, baseball); but still, it gives you an idea that the SD of the luck is more than half as big as the actual number you're measuring.

Second, HFA is dependent on how good the team is. It's higher for teams near .500, and lower for better or worse teams. Two points to show why that's the case:

1. Teams that never win have an HFA of zero. Teams that never lose have an HFA of zero. This suggests, intuitively, that HFA is highest in the middle.

2. If you consider that HFA is worth a certain number of points – say, 3 – then it only makes a difference when the home team would otherwise have lost by 3 points or less. That happens most often when the home team is just slightly worse than the visiting team, which is a point near .500.
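Here's a rough Python sketch of the second point. It assumes game margins are normally distributed around the team's true edge over an average opponent, with a standard deviation of 12 points; that number, the normal model, and the function names are all my assumptions for illustration. The 3-point home edge is the one from the example above.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def hfa(margin, home_edge=3.0, sd=12.0):
    """Home winning percentage minus road winning percentage.

    `margin` is the team's expected point margin on a neutral court;
    `sd` (assumed) is the game-to-game spread of the actual margin.
    """
    home = norm_cdf((margin + home_edge) / sd)
    road = norm_cdf((margin - home_edge) / sd)
    return home - road

for margin in (-15, -10, -5, 0, 5, 10, 15):
    print(f"neutral-court margin {margin:+3d}: HFA = {hfa(margin):.3f}")
```

Under these assumptions, the HFA is largest for the team with a neutral-court margin of zero, and shrinks symmetrically as the team gets better or worse.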

The point of all this is that a substantial portion of the observed correlations aren't due to anything other than the team's overall skill. Good and bad teams will have lower HFAs in both odd and even games, and average teams will have higher HFAs in both odd and even games. That effect could be causing the entire correlation.

Has anyone seen any study that proves that a team has a specifically better home field advantage because of characteristics other than its overall skill level? I think I remember Bill James writing about how the Red Sox seemed to do better at home by tailoring the team to the park, but I don't remember a formal study.

In any case, I wouldn't be surprised if HFAs (after adjusting for overall talent) turned out to show very, very little difference between teams.

Tuesday, August 19, 2008

Analysis in new JQAS paper makes no sense

The new issue of JQAS came out a couple of weeks ago, and I'm starting to go through some of the articles. The first one kind of floored me. It's the longest paper I've seen in JQAS so far, at 64 pages. It's ostensibly about measuring the effects of NFL coaching. But I've started reading it, and it makes no sense at all.

The paper starts out with some baseball arguments, but they're more numerology than sabermetrics. It starts by implying that, because the batter is one person, and the pitcher and catcher are two people, the battery enjoys a 2-1 edge over the hitter. Also, you need four bases to score a run, and have only two outs to do so – which happens to be another 2-1 edge. (Braig knows you get three outs, but figures that since you're retired after the third out, you really only have two outs to expend.)

So baseball has an intrinsic 2-1 structure. And that's why, as Braig triumphantly notes, the historic Major League on-base percentage happens to be .333. See, it's two outs for the defense for every hit by the offense!

"On-base percentage confirms the battery's 2-1 design edge … in other words, hitters have succeeded at the *exact rate* that one would expect by taking outs from the hitters at a 2-1 rate."

That, of course, is ridiculous – the three numbers have nothing to do with each other, and that each works out to a 2:1 ratio is nothing but coincidence.

Moving on, Braig figures out what would happen if a team gets on base at .333 – a regular .333, repeatedly alternating an "on base" with two outs. It turns out that if you assume that (a) the "on base" is a single, and (b) all runners advance one base on an out, you wind up alternating innings where one run scores (with a .400 OBP) with innings where no run scores (with a .250 OBP). From this, he concludes,

"These models show that baseball success emerges from the hitters' ability to make a base 40% of the time in an inning … as the hitters' OBP in an inning approaches .400, the hitters approach scoring 1 run."

Um, why is that? Why should we expect that the contrived example, that has so very little resemblance to real baseball, should give a correct result?
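To be clear about what the contrived model does, here's a small Python sketch of it. It assumes, as described above, that every "on base" is a single and that runners move up exactly one base on every play (including the single itself); the function and variable names are mine.

```python
from itertools import cycle

def simulate(innings):
    """Alternate one single with two outs; return (runs, OBP) for each inning."""
    plate_appearances = cycle(["single", "out", "out"])
    results = []
    for _ in range(innings):
        outs = runs = times_on_base = batters = 0
        runners = []  # base currently occupied by each runner (1, 2 or 3)
        while outs < 3:
            event = next(plate_appearances)
            batters += 1
            if event == "out":
                outs += 1
                if outs == 3:
                    break  # inning over, nobody advances
            else:
                times_on_base += 1
            # every runner moves up one base; reaching "base 4" scores a run
            runners = [base + 1 for base in runners]
            runs += sum(1 for base in runners if base == 4)
            runners = [base for base in runners if base < 4]
            if event == "single":
                runners.append(1)
        results.append((runs, times_on_base / batters))
    return results

for number, (runs, obp) in enumerate(simulate(4), start=1):
    print(f"inning {number}: runs = {runs}, OBP = {obp:.3f}")
```

The sketch reproduces the alternation: one run on a .400 OBP, then no runs on a .250 OBP, and so on. That's a property of the rigid single-out-out sequence, not evidence about real baseball.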

So far every step in this process is completely wrong, including this next one: Braig concludes that you can measure a team's "offensive efficiency" by dividing its OBP by .400. So the 1927 Yankees were 95.25% efficient (.381/.400).

And finally, one last leap of logic: because .333 is the "standard," its 2:1 ratio built into the structure of baseball, it must be that the contribution of the hitters ("human capital") to the results is only the difference between their OBP and .333. So the 1927 Yankee hitters, who were .048 above .333, were responsible for only 15% of the total offensive efficiency of the team.

At this point I pretty much stopped reading – all this took only the first seven pages out of 64. And there are additional glaring absurdities, even in those first few pages, that I haven't touched on here. This has got to be among the worst pseudo-analyses that I've seen anywhere, never mind in a peer-reviewed publication.

If any intrepid readers want to check out the whole paper, to see if the football discussion has anything of value in it, please report back.

Friday, August 15, 2008

Why is the US lagging in 3-point percentage?

Tyler Cowen reports (via ESPN) that so far, the US basketball team has the worst 3-point percentage of any team in the Olympics.

He asks: "can you build a simple model showing this is likely the case for the best team?"

The comments are pretty high quality. The best comments, IMO, are the ones that don't try to answer the question, but try to figure out if there might be other reasons. My two favorites:

-- luck
-- the shorter international 3-point line (compared to the NBA) means the US players can't use their muscle memory, and have to think about the shots.

Both of these are testable: the first by waiting a few more games; the second (as a commenter points out) by seeing if the European NBA players are hitting more threes.

But I don't know much about basketball. What do you guys think?

Monday, August 11, 2008

Should you always expect pitchers to decline?

How should you change your expectations of a pitcher's performance as he ages?

For hitters, it's pretty much been established that, as a group, they will improve up to about age 27, then begin to decline. Is the same true for pitchers?

Studies of the history of some of the best pitchers, the ones with long careers, seem to suggest so. One such study is the one Justin Wolfers co-authored, then wrote about in the New York Times. I don't agree with the study's conclusions, but, even so, the diagram in the article does show the expected aging curve – better in the middle, and worse at the ends.

Last week, at the JSM convention in Denver, Jim Albert fitted some spline curves to various pitchers' careers (his study isn't online yet). He found a similar pattern for most pitchers – better in the late 20s and early 30s, and worse at the beginning and end of their careers.

But my little study showed something different. I used the "paired seasons" method, where you see what a pitcher did at (say) 25, and compare to what he did at 26. You repeat for all ages, and then chain everything together (by multiplying the improvements/declines) to get a career curve.
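As a rough illustration of the paired-seasons idea (not my actual code or data), here's a Python sketch. The Component ERA figures are made up, and taking the ratio of the group totals is just one simple way to measure the year-to-year change; it also skips the regression-to-the-mean adjustment.

```python
def paired_ratio(pairs):
    """Ratio of age X+1 performance to age X performance over matched pitchers."""
    this_year = sum(before for before, _ in pairs)
    next_year = sum(after for _, after in pairs)
    return next_year / this_year

def chain(ratios, start_age, start_level=1.0):
    """Multiply consecutive year-to-year ratios to get a curve indexed by age."""
    curve = {start_age: start_level}
    level = start_level
    for age in sorted(ratios):
        level *= ratios[age]
        curve[age + 1] = level
    return curve

# Hypothetical Component ERAs: (at age X, at age X+1) for the same pitchers
pairs_by_age = {
    25: [(3.80, 3.90), (4.20, 4.25)],
    26: [(3.90, 4.00), (4.50, 4.45)],
    27: [(4.00, 4.20), (3.60, 3.70)],
}

ratios = {age: paired_ratio(pairs) for age, pairs in pairs_by_age.items()}
curve = chain(ratios, start_age=25)

for age, level in curve.items():
    print(f"age {age}: {level:.3f} times the age-25 Component ERA")
```

Because the curve is built from ratios, a value above 1.000 means a higher (worse) Component ERA than at the starting age.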

Here's the career curve that I got. It's in Component ERA, so higher is worse:

(Technical note: As Tom Tango has demonstrated, that method winds up biased, showing larger declines or smaller improvements than it should, because of the selective sampling problem whereby luckier pitchers get more playing time. That problem is attenuated somewhat if you regress the first-year performance to the mean, which I did. The above curve is after regressing 10% to the MLB mean. The basic shape is the same if you regress 0% or 30%.)

This is NOT the typical U-shaped curve you would expect. Indeed, it seems to show that pitchers get worse every year, regardless of their age. When you see a group of pitchers of age X, you *always* expect them to be worse at age X+1. (Actually, there was a very slight improvement from 19 to 20, I think, but it was negligible and the sample size was small.)

This contradicts the earlier studies. It also contradicts empirical data somewhat. Because, if 22-year-old pitchers are better than 27-year-olds, how come there are so many more major-league pitchers who are 27? It doesn't make sense to suggest that teams are so dumb that they leave young pitchers in the minors who are better than their established staffs.

So what's going on? How can the two curves be consistent?

One suggestion (at The Book blog) was that all pitchers actually are physically better at young ages, but younger pitchers aren't ready for the majors until they learn to master their pitches and learn how to handle major-league hitters. Only once they have that experience do they get called up, at which point they start to decline in performance.

It's a nice theory, but it's contradicted by the Wolfers and Albert studies, which found U-shaped curves for the better pitchers.

So I'm leaning towards another theory (which was "#2" in a previous post on this subject), that the above curve is actually the sum of two separate curves. First, the Wolfers/Albert curve, which shows that, in retrospect, after analyzing a full career, pitchers do improve when young. And, second, the records of pitchers who did NOT have long careers, because they got injured when young and saw their performance get much worse before they were forced to retire.

And so even though the guys with the long careers have U-shaped curves, the average is worse at every subsequent age.

If that's true, it means that

(a) if a pitcher stays healthy, he improves into his late 20s before declining, but
(b) enough pitchers get injured that the odds are that the pitcher will actually decline.

Both curves are correct. But if you have a 25-year-old pitcher, and you're thinking of offering him a long-term contract, you have to keep in mind that his expected contribution is LESS than his talent at 25, because of the chance of career-ending injury.

That would also explain why pitchers are such a bad draft gamble, as Bill James showed back in 1985.

Wednesday, August 06, 2008

This guest post is by Charlie Pavitt. Take it away, Charlie ...

-----

I have a question about drug testing, which for the sake of this blog I will restrict to that for steroids by professional baseball leagues although it is just as relevant for all performance enhancing drugs (PEDs) by all sports organizations in general. I’ve had this question for a while, but this is a good time to ask it, given that the Tour de France was recently completed (with the usual disqualifications for PEDs along the way) and the Olympics is about to begin (with, leading up to the Games, the same).

No drug test is perfect. All lead to some percentage of false negatives (users whom the test misses) and false positives (innocents whom the test implies are guilty). The latter is what specifically concerns me. In our court of law, one is supposedly innocent until proven guilty. But every few weeks I read of another minor leaguer suspended for a couple months for a positive test. Given that every player knows about the tests, I wonder how many are really still using. Yeah, I can see the possibility that some of the more pampered of them, who have never been punished for anything they’ve done in the past decade because their athletic skill has made them sacrosanct in their communities, probably imagine they can use and get away with it. But I’m not at all convinced that they are all guilty. And, given the publicity it generates, a reported false positive would result in a drug-free player pretty much permanently tarred-and-feathered (Rafael Palmeiro comes to mind; I’m not saying that he wasn’t a user, I have no idea either way, but talk about a positive image permanently crashing down to earth in an eyeblink…).

I just did a bit of web searching (googled “drug testing baseball false positive”) and didn’t find a ton of helpful information on it. One website mentions a study with a huge 14% false positive rate. I’m not the only one concerned (see here, among others).

So here’s my specific, two-part question: What is the false positive rate for the test (assuming there is only one) used in professional baseball? And what precautions are there against false positives (at the very least, there should be a blood sample divided in two, with the second tested in case the first comes up positive)? If anyone has these answers, I’d like to know. More importantly, I’d like the general baseball-fan public to know also.

--Charlie Pavitt

Monday, August 04, 2008

Gender-blind math admissions may be biased against men

Here’s an interesting post, from Robin Hanson at Overcoming Bias, that illustrates a nice non-sports application of regression to the mean.

A few years ago, in a speech, Harvard president Larry Summers quoted a possible explanation for why there are so few women in math-related fields.

The hypothesis is that when it comes to math ability, men and women are equal, on average. But men’s ability is more spread out – there are more math geniuses, and also more math idiots (“idiots” is my term, not his). That is, the variance of men is higher than the variance of women.

Even if the difference in variance is small, it could lead to very large differences at the extremes. Suppose that both men and women have average math ability of 5' 8". And suppose the standard deviation among women is three inches, but among men it's four inches.

Now, suppose that math-related fields require a lot of math ability – say, 6' 5" or over. That's three standard deviations above the mean for women, but only 2.25 SDs for men.

That means 13 out of 10,000 women will reach 6' 5", but 122 men will.
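Those figures come straight from the normal distribution. Here's a short Python check, assuming ability is normally distributed and treating the three- and four-inch spreads as standard deviations; the helper name is mine.

```python
import math

def upper_tail(z):
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean, cutoff = 68.0, 77.0  # 5' 8" and 6' 5", in inches

women = upper_tail((cutoff - mean) / 3.0)  # 3.00 SDs above the mean
men = upper_tail((cutoff - mean) / 4.0)    # 2.25 SDs above the mean

print(f"women per 10,000: {women * 10000:.1f}")
print(f"men per 10,000:   {men * 10000:.1f}")
print(f"men per woman:    {men / women:.1f}")
```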

So a relatively minor difference in variance leads to a huge disparity at the extremes – in this case, 9 men to 1 woman. (As it turns out, the real spread is only 7% higher for men, not 33% higher like in my example. Still, the M/F ratio is 3:1 at 4 standard deviations.)

Summers drew a lot of fire for his remarks, and he resigned the following year.

(Aside: years earlier, if I recall correctly, someone had complained that there were many more blacks among the ranks of superstars than among average players, which showed that teams were biased in favor of whites when it didn't matter much. Bill James rebutted that accusation with a related observation – that if black players are, on average, just slightly better than whites, there will be a similar disparity at the extremes.

This is not the same argument – it relates to a difference in means, rather than variances – but it's pretty much the same idea.)

A couple of weeks ago, a new study came out that actually did the research, and found exactly what Summers had hypothesized. Most of the press reports got the story wrong, because they didn't understand that the key was that the variances were different. The reporters latched on to the fact that the means were the same, and incorrectly concluded that Summers was wrong.

Anyway, the point is regression to the mean. And Hanson points out another consequence of the male/female variance difference. Specifically, suppose you have a man and a woman, and they both score equally high on a math admissions test. If all you care about is the chance of choosing the better mathematician, and you can only admit one of them, which one should it be?

The statistical answer is: the man.

Why? Because there are fewer talented women, relative to equally-talented men, so it's more likely the woman got her high score by luck.

A high score can be obtained by a less-brilliant student who got lucky, or a more-brilliant student who got unlucky. The ratio of less- to more- is higher for women. Therefore, the high-scoring woman is more likely to be closer to average. Therefore, you have to regress to the mean more for the woman than for the man.
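Here's a minimal Python sketch of that argument, under the usual textbook assumptions: true talent is normally distributed, and the test score is talent plus independent normal luck. All the numbers (a mean of 100, talent SDs of 15 and 16, a luck SD of 10, a score of 140) are hypothetical, chosen only to show the direction of the effect.

```python
def expected_talent(score, mean, sd_talent, sd_noise):
    """Best estimate of true talent given one noisy test score."""
    # fraction of the score's variance that comes from talent rather than luck
    reliability = sd_talent**2 / (sd_talent**2 + sd_noise**2)
    return mean + reliability * (score - mean)

# Same mean, same test, same score; only the spread of talent differs
woman = expected_talent(140, mean=100, sd_talent=15, sd_noise=10)
man = expected_talent(140, mean=100, sd_talent=16, sd_noise=10)

print(f"woman: {woman:.1f}")
print(f"man:   {man:.1f}")
```

Both estimates are pulled back toward 100, but the group with the smaller talent spread gets pulled back further. Shrinking `sd_noise` (a longer test, or several tests) narrows the gap between the two estimates.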

Here's a baseball analogy. Suppose you have a group of twenty veterans. They're solid regulars, but none of them has shown signs of stardom in their five-year careers. And suppose you have a group of twenty draft choices, and you suspect that some of them are duds, but some of them are bona-fide superstars.

One player from each group hits .333 in April. Which do you think is the better hitter? Obviously, it's the rookie. The solid regular was probably just very lucky. The rookie was probably lucky too, but there's a chance that he's a star or superstar, in which case he might only have been a little lucky.

It's the same idea with the men and women -- not as extreme, because the difference in variance is small, but it's still there. The women are more likely to be the solid regulars, and the men have the potential to be the stars (or duds).

Of course, this doesn't really have any strong policy implications. Universities don't really have to hold women to a higher standard – they can just make the test longer, to reduce the effects of luck, or give multiple tests.

Still, it's a valid statistical consequence, and politically incorrect enough to be very interesting.