Sabermetric Research

Phil Birnbaum

Tuesday, December 18, 2012

Golf and luck

Given a player's talent, I routinely think of the results of a series of basketball free throws as "luck". That is, if the player has worked to become an 80% talent, whether he makes the current shot (80 percent chance) or misses it (20 percent) is just random, as if he flipped an 80% coin.

Some people don't like that idea ... they feel that because it's all within the player's control, it's wrong to think of it as "luck" or "random". I don't agree, but I won't argue that here. What I want to do here is try out a different example, one that I can use instead of free throws, that maybe we can all agree on.

So ... how about golf shots? Those aren't completely under a player's control, because of wind.

A difference in wind speed of only about 2 m/s (4.5 mph) is said to affect the ball's distance by around 15 meters (49 feet) (.pdf). That's pretty big. Pros sink 20-foot putts only 14 percent of the time, as compared to 38 percent for 10-foot putts ... and that's only a 10-foot difference, not 49 feet.

Now, you could argue that golfers should take the wind into account when swinging. And they do. But, wind changes while the ball is in the air, and it's literally impossible, from the ground, to predict how the wind will change. If 2 m/s wind is 15 meters of distance, we can guess that 0.2 m/s of wind is 1.5 meters of distance. If there's an unpredictable 0.2 m/s change for half the time the ball is in the air, that's 2 to 3 feet. That's still a fair bit. Moving a putt 2-3 feet closer is a big deal, especially when you're already close.
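Here's that scaling worked through as a quick sketch. It assumes, as the paragraph does, that distance scales linearly with wind speed; the variable names are just for illustration.

```python
# Linear scaling of the quoted wind figures (linearity is an assumption).
METERS_TO_FEET = 3.28084

distance_per_wind = 15.0 / 2.0   # meters of distance per 1 m/s of wind
gust = 0.2                       # unpredictable change in wind, in m/s
fraction_of_flight = 0.5         # the change lasts half the time the ball is in the air

shift_m = distance_per_wind * gust * fraction_of_flight
shift_ft = shift_m * METERS_TO_FEET

print(f"{shift_m:.2f} m = {shift_ft:.1f} ft")
```

That works out to about 0.75 meters, which is where the "2 to 3 feet" comes from.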

Or ... suppose a golfer gets a hole in one. It's reasonable to assume that if the wind had been even slightly different, in any direction, the ball wouldn't have gone in.

When does the ball go in on a tee shot? Consider where the ball would have landed if there were no hole. Let's say that if that spot is, maybe, 12 inches behind where the hole would be (in the line of trajectory), and four inches left to right, it would have gone in. That's 0.33 square feet. Let's round it up to 0.5.

If the wind makes an unpredictable difference of, say, 3 feet each direction, that's a circle of radius 3, or about 28 square feet.

28 divided by 0.5 is 56. So, because of wind, there'd be only a 2% chance the ball would go in if you did the exact same swing again.

That is: if you get a hole in one, you hit a 50-to-1 longshot, by luck. That's even if you're a perfect golfer in every respect.
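For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the ball is equally likely to land anywhere inside the 3-foot circle, which is a simplification; the names are illustrative only.

```python
import math

# Target zone: 12 inches deep by 4 inches wide, converted to square feet.
target_area = (12 / 12) * (4 / 12)   # about 0.33 sq ft
target_area_rounded = 0.5            # rounded up, as in the text

# Unpredictable wind scatter: a circle of radius 3 feet.
scatter_area = math.pi * 3 ** 2      # about 28.3 sq ft

odds = scatter_area / target_area_rounded
chance = target_area_rounded / scatter_area

print(f"scatter area: {scatter_area:.1f} sq ft")
print(f"about {odds:.0f}-to-1, or a {chance:.1%} chance")
```

Keeping pi unrounded gives a scatter area of about 28.3 square feet, so the odds come out nearer 57-to-1 than 56-to-1. Either way, it's roughly a 2% shot.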

------

Of course, the better a golfer you are, the more holes-in-one you're going to get. The argument is not that it's *all* luck -- the argument is that there's *some* luck.

Holing your tee shot is like winning the lottery. I'm not a very good golfer, but my lottery ticket might still come in someday. Tiger Woods, because of his skill, holds several thousand tickets, so he'll get lucky much more often.

------

"Iron Byron" is a machine that swings a golf club, exactly the same way each time -- or at least, as close to "exactly" as a machine can get. But the balls it hits don't land in exactly the same place. This site says that, after multiple swings, the pattern of balls was 15 feet by 8 feet for cavity-back clubs, and "about 1/4 the size" for the club they were developing.

For the purposes of this discussion, that's close enough to the 3-foot radius I guessed at.

You'd think what the machine did would be the limit of human performance. Of course, you might think humans can be more precise than machines, which seems unlikely -- but feel free to argue it if that's what you think. Keep in mind, though, that the human is always a different distance from the pin, and has to adjust his swing every shot! On the other hand, the machine doesn't have to figure out how hard to swing, because it doesn't matter.

So the human has to be *more* perfect than the machine, to get the same results.

-------

Both these arguments -- theoretical, and empirical -- seem to imply that non-human forces have an effect on a golf shot, an effect that's significant enough to affect who wins a tournament. In other words, that there is at least some "external" luck in golf.

For those of you who disagree that there's luck in free throws, does this argument convince you that there's luck in golf shots? If someone hits a hole in one and wins a PGA tournament by two strokes, would you be comfortable agreeing that luck had a lot to do with it?

Tuesday, December 11, 2012

Linear regressions cause problems when reality isn't linear

"Moneyball" popularized the idea that in baseball, walks are undervalued. I believe, however, that baseball insiders didn't learn from the book, and that they continued to overvalue batting average.

So I did a study.

I took every player-season from 2005 to 2011 (minimum 50 AB), and ran a regression to predict playing time. I used RC27 (runs created per 27 outs) as a proxy for skill; it's a fairly accurate measure of a batter's performance, and you'd think that better batters would get more plate appearances.

Then, I added a dummy variable for batting average.* (*Update: Oops! It's not really a dummy, as Alex points out in the comments. Dummy variables are binary 1/0 indicators, not linear ones like BA.)

If managers are evaluating players correctly, the dummy coefficient should be zero. After all, if a player creates 5.00 runs per game, it shouldn't matter whether he does it with walks, or power, or batting average. If it does matter, it must be that managers aren't evaluating offense properly.

It turned out that the dummy variable was very significant, at 10 SD. Every additional point of batting average was related to an extra 0.93 plate appearances for that player.

This suggests that managers are indeed overvaluing batting average.

------

I'm not making this up; I actually did this regression, and got that result. But I don't believe the conclusion.

Why not? Mainly, because I don't believe that playing time is linearly related to RC27. Past a certain point of good performance, you're not going to get any additional plate appearances, because you're already a full-time player. And, at the bottom, you're not going to get less than zero, no matter how bad you are. And you'll probably get roughly the same playing time whether you're at .050 or .090 -- either way, the estimate of your future performance isn't much different.

Taking that and oversimplifying, I think you'd expect an S-curve, not a straight line. For instance, just guessing, playing time might be low and horizontal from (say) 0.00 to 2.00, then sloping up from 2.00 to 6.00, then high and horizontal from 6.00 to infinity.

I'd argue that problem is serious enough that we shouldn't trust the results.

------

Now, I'm pretty sure my objection wouldn't necessarily change any minds. There are always lots of skeptics for any paper, saying, "hey, it could be this," or, "it could be that." I may be sure my argument is valid, but it's not obvious that it's valid, or that it's important enough an objection to dismiss the paper's conclusions.

Furthermore, most readers wouldn't have the time or inclination to follow my objections. They'd think, well, the paper passed peer review, and if it's wrong, another paper will come around later with other evidence.

And, so, generally, people will believe the paper's conclusion to be true. In fact, even as you read this, you might think I'm just nitpicking, and that the regression did, in fact, find something real.

------

So, let me try to convince you, and then I'll get to my "real" point.

There's a crucial assumption in linear regressions, that estimation errors are random and unbiased. That means, when the regression tries to predict playing time based on the other variables, we should expect a positive error (player got more PA than expected) about as often as a negative error (player got fewer PA than expected). That should be the case regardless of the values of the Xs -- that is, regardless of how well the player batted.

But that didn't happen here. Here's a graph of the regression's errors (residuals), plotted against the quality of the player:

That's not random.

The really bad players got a lot more PA than the prediction, and the really good players fewer. That's exactly what you'd get by my "S-shaped" hypothesis.

How big are the errors? Huge. For the worst 100 batters, the average actual playing time was 48 PA higher than the estimate. For the best 100 batters, the average was 165 PA lower than the estimate.

Compare that to the effect we got for batting average. A typical 20-point difference in batting average worked out to 19 PA. How can I argue that I've found a real 19 PA effect, when my measuring stick is obviously biased by a lot more than 19 PA?

All it would take, for the effect to be artificial, is for the +20 BA players to be concentrated in the top half of the graph. That's probably what's happening. The dots on the far left and right are mostly part-time players, because no full-time player performs that well or that badly. There are many more part-time players on the left than the right. And the residuals are biased high on the left. So, the BA effect is likely just a part-time effect.

But you don't have to buy my explanation. All you have to do is look at the graph, and see that there are huge biases at the extremes -- biases that are higher in magnitude than the effect we've found. At that point, you shouldn't need an explanation -- you should just realize that there's something wrong, and that we really shouldn't be drawing any conclusions from this study.

-------

In the real world, these types of regression studies don't normally use sabermetric stats like RC27 and such. More likely, they'll throw in all kinds of primary stats, like HR, BB, outs, and so on. And they'll add in other things, like all-star status, and draft position, and whatever else seems to be significant.

But the problem remains. That's because, in this case, the primary issue isn't the way performance is measured -- it's that the rate of performance is not a linear predictor of playing time. The model just doesn't work. You could have the most perfect performance statistic ever, one that's accurate down to the third decimal, and you'd still have the problem. If it's not linear, it's not linear.

------

Anyway, that was just a very long way of getting to my real point, which is: is there an "automated" test to check for problems like this, where the readers won't have to listen to my argument, and the problem will make itself obvious?

My first thought was: maybe we could see the problem by just looking at the correlation coefficient of that graph. But we can't. The correlation is zero! It always is, when you look at the residuals of a linear regression -- the math makes it work out that way.

But just because the correlation is zero, doesn't mean there's no bias. A graph of zero correlation can have all kinds of fancy shapes. For instance, a symmetrical smiley face pattern has r=.00, and so does a frowny face, and a sine wave. (See more here.) Those all are obviously biased for certain Xs. Only the traditional "cloud" shape is unbiased everywhere.

But how do you automatically test for "shape"? One way I can think of is to examine the extremes of the graph, because that's often where the effect is strongest. So, I'd suggest, for every variable in your model (other than the dummy you're concerned with):

1. Show the average error for the rows with the highest values;
2. Show the average error for the rows with the lowest values;
3. Show the average error for the middle values.

I used 100 rows for the top and bottom, but anything reasonable is fine; if you don't have a huge dataset, use the top and bottom 25%, or the top and bottom 10%, or whatever.

Whatever cutoff you choose, it should turn out that those average errors, subject to random variation, can't be too much larger than the effect you think you've found.

Suppose you're interested in a "rookie" dummy variable, and you find a statistically significant coefficient of 15 blorgs. But then you find that the fastest 10 percent of players are biased by 18 blorgs. That's probably OK -- for the 18 to cause the 15, you'd need 11/12 of rookies to be in the top 10%, which is unlikely.

On the other hand, if you find a "fast" effect of 115 blorgs, you're in trouble. Then you'd need only a weak relationship between rookies and speed to cause an effect of 15. That's quite likely.

So, it's not just the *shape* of the curve: it's the *magnitude* of the worst biases, compared to the effect. If you find a small effect, and you want to believe it's real, you have to prove that the regression controls for other factors at least that precisely.

-----

(It's true that one existing recommendation is to eyeball the residual graph to spot non-linearity. But, usually, they suggest looking only at the residuals for the regression as a whole -- errors vs. expected Y. That would work here, because the bias is strongly related to the expected Y. But, often, it's not (for instance, defense vs. age when your Y variable is salary). So, you really do need to look at each variable separately.)

-----

I'd like to see something like this, for a hypothetical study:

"We attempted to find an effect for age, and we found an effect that young players outperformed by 13 runs, significant at 4 SD. However, examining the residuals for the top 10% of each variable, we found an average of +35 for stolen bases, and +49 for defensive skill. We have reason to believe that young players are significantly more likely to be faster and better fielders, and therefore we believe our dummy might be evidence of a biased, ill-fitting model, rather than an actual effect."

------

You should do the "10%" thing for every variable -- and then do it for *combinations* of variables that measure similar things, or that might be correlated. Find the 10% of players most above average in a combination of doubles, home runs, walks -- and see if those are biased. Find the 10% with a combination of few triples, few stolen bases, and games at catcher.

You have to combine, a bit, because you might not find big enough effects for single factors. We found big biases for bad performance overall: if you split that into "bad performance for doubles", and "bad performance for triples," and so on, you might find only small biases and miss the big one.

All this could easily be automated by regression software. Most packages already can give you the correlation between your independent variables -- that is, they can tell you that players with lots of home runs tend to draw lots of walks and hit lots of doubles. So, have the software automatically look at players in the top 25% of all three categories, and see if there's a bias. Do the same for the bottom. Flag anything bigger than the size of the dummy you're looking at, along with a significance level. Then you can decide.

Another thing the software could do, is this: taking the variable of concern -- batting average, in my case -- look at all the other variables that correlate highly with it (RC27, in my case). Check the bias when those variables are all high, and when they're all low. And, again, flag anything large, for further review.

------

Would this work? It seems to me that it would, at least enough of the time to make it worthwhile. Is it already being done? Am I reinventing the wheel?
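To make the idea concrete, here is a rough sketch of the check on simulated data. It is not the regression from this post. The S-curve shape follows the guess above (flat below 2.00, sloping to 6.00, flat after), but the playing-time numbers, the noise level, the uniform spread of RC27, and all the function names are made up for illustration.

```python
import random

def s_curve_pa(rc27):
    """Hypothetical playing time: flat below 2.00, rising to 6.00, flat above."""
    clamped = min(max(rc27, 2.0), 6.0)
    return 50.0 + 150.0 * (clamped - 2.0)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def tail_bias(xs, residuals, k=100):
    """Mean residual for the k lowest-x rows, the middle rows, and the k highest-x rows."""
    ordered = [r for _, r in sorted(zip(xs, residuals))]
    mean = lambda v: sum(v) / len(v)
    return mean(ordered[:k]), mean(ordered[k:-k]), mean(ordered[-k:])

# Simulated player-seasons (made-up data, fixed seed for repeatability).
rng = random.Random(0)
rc27 = [rng.uniform(0.0, 8.0) for _ in range(2000)]
pa = [s_curve_pa(x) + rng.gauss(0.0, 30.0) for x in rc27]

# Fit a straight line to the S-shaped data and compute the errors.
a, b = fit_line(rc27, pa)
residuals = [y - (a + b * x) for x, y in zip(rc27, pa)]

low, mid, high = tail_bias(rc27, residuals)
print(f"correlation of residuals with RC27: {correlation(rc27, residuals):.6f}")
print(f"mean residual: worst 100 {low:+.0f}, middle {mid:+.0f}, best 100 {high:+.0f}")
```

In this made-up setup, the correlation between the residuals and RC27 should come out at zero to within rounding, while the worst and best 100 rows should still show large average errors of opposite sign: positive at the bottom, negative at the top. A real version would repeat `tail_bias` for every variable in the model, and for combinations of variables, and compare the results to the size of the effect being claimed.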

Monday, December 03, 2012

You can't find small effects with coarse measures

Suppose you want to do a regression to see if black players in MLB are underpaid.

How might you do that? Well, you might take everything you can think of, throw it into a dataset along with a "player is black" dummy variable, and do a regression to predict salary. If the dummy comes up significant, you've proven there's a difference between black players and non-black players.

But ... there's a problem with that. You've shown a significant difference between black players and white players, but you don't know if it's really because of the player's skin color. If black players and white players are different in certain ways, it could be that your model is just biased for those differences.

For instance ... you probably had a measure of player productivity in your regression. Suppose you used, say, "productivity = (TB+BB+SB)/PA". That's obviously not accurate: it treats a stolen base as equal to a single, and doesn't include CS, which correlates highly with SB. So, it's going to be biased too high for players who steal a lot of bases.

So if black players steal more bases than white players -- which I bet they do -- your measure will overrate the productivity of black players, and will incorrectly conclude that they're underpaid.
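Here's a small sketch of how that bias shows up. The two batters are hypothetical, with stat lines invented for illustration: identical at the plate, different only on the bases.

```python
def naive_productivity(tb, bb, sb, pa):
    """The deliberately crude measure from the text: (TB + BB + SB) / PA."""
    return (tb + bb + sb) / pa

# Two hypothetical batters (made-up stat lines); only the running game differs.
slugger = {"tb": 250, "bb": 60, "sb": 2, "cs": 1, "pa": 600}
speedster = {"tb": 250, "bb": 60, "sb": 40, "cs": 15, "pa": 600}

for name, p in (("slugger", slugger), ("speedster", speedster)):
    score = naive_productivity(p["tb"], p["bb"], p["sb"], p["pa"])
    print(f"{name}: {score:.3f} (caught stealing {p['cs']} times, ignored)")
```

The speedster rates well ahead, because every steal counts the same as a single and the 15 times caught stealing cost nothing. A regression built on that measure would expect him to be paid like a much better hitter than he is.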

------

I've made my example a particularly egregious one, but this happens all the time to a lesser extent. If your regression is trying to relate salary to productivity, you have to be able to accurately measure both salary and productivity. Salary is easy -- it's just the amount, in dollars. Productivity is hard.

Sabermetricians have been trying to measure batter productivity for ... well, forever. We've got Total Average, and Runs Created, and Base Runs, and Linear Weights, and Extrapolated Runs, and so on. None of them is perfect. All of them have certain biases. (Self-promotion: this .pdf has an article where I investigate some of them, and here's a related post.) Bill James, himself, noted that Runs Created overestimates for the best hitters.

If we're so limited in our ability to measure productivity in the first place, how can anyone possibly think we're able to measure *very small differences* in productivity, like racial bias, or clutch vs. non-clutch, or walk year vs. non-walk-year?

We can't. Statistical significance gives you the illusion we can, but we can't.

Look, suppose you do a study, and you find that black free agents are underpaid by, say, $100K a year. Even if it's statistically significant, $100,000 is only one-fifth of a run. How can you say, with any kind of confidence, that you've found an effect of 0.2 runs, when your measure of productivity is almost certainly biased by a lot more than that?

How can you find a 1 gram difference in weight, when your scale is only unbiased to the kilogram?

------

Now, someone might argue: "I agree with that, that many measures of productivity are biased. That's why I didn't use one. So as not to preselect a biased measure, I put all the components of hitting into the dataset, and let the *regression* pick the best measure!"

That helps, but it isn't enough. Because, what if the relationship isn't linear? Like, playing time. If you're good, you play full-time. If you're bad, you play zero. In the middle ... well, you play part time, and *maybe* that part is linear, and maybe it's not. But, overall, you've got an S-curve, not a straight line. So your estimates will be biased too high for the best players, and too low for the worst. (See the previous post.)

------

I'd argue that, if you find a small effect that you think is real, you need to prove that your model is good enough that what you've found is an actual effect, and not just measurement error from an arbitrary linear model.

I don't think that's too hard to do ... I'm going to try to put together an example for a followup post.