Sabermetric Research

Phil Birnbaum

Wednesday, January 27, 2010

More pitching randomness than just DIPS

Last month, Nick Steiner discovered something interesting: pitchers don't actually seem to pitch any worse when getting shelled than when getting batters out. That is: if you were to look at the types of pitches thrown, their speed, and their break, the shutout pitches look the same as the "give up five runs in three innings" pitches.

If that's the case, the implication is that the differences are batters, umpires, or luck. I think it's mostly luck. Previously, I argued that simulation games like APBA serve as evidence of that. The same pitcher's APBA card, with the same batters, can result in a shutout or a blowout just based on dice rolls, and the pattern of performance looks fairly realistic. I don't know of any studies on this, but I bet if you played out a random pitcher's season on APBA, and then you put his APBA first innings side-by-side with the first innings of his actual starts, you'd have a tough time figuring out which was which.

Anyway, Nick expands on the topic today, over at Hardball Times. He found a bunch of pitches that looked almost exactly the same: 2-1 count, nobody on, fastball following a fastball, righty pitcher, righty hitter, similar location, similar speed, similar break. Although the pitches and situations were almost the same, the results were wildly different: sometimes a strike, sometimes a ball, sometimes a home run, sometimes an out, sometimes a double.

The idea is similar to DIPS, which posits that what happens after a batter puts the ball in play is mostly out of the control of the pitcher. Nick is saying that, outside of the actual characteristics of the pitch, what happens to it is also out of the pitcher's control. All the pitcher can do is put the ball in a certain place at a certain speed with a certain movement, and then it's up to the batter and umpire and random chance.

So, suppose a certain pitch in a certain spot on a certain count usually results in 0.1 runs more than average. If a pitcher threw 50 of those, and it turned out the batters hit them hard, which resulted in 10 runs instead of 5, you could assume those extra five runs were random, and not the pitcher's "fault". That would allow you to better project his performance in future years -- as a GM, you might be able to find some underpriced pitchers, and trade away your overvalued ones. Just like a "DIPS ERA," which evaluates a pitcher based on factors other than his balls in play, you could create a "PitchF/X ERA" which evaluates a pitcher based only on the qualities of the pitches he threw, rather than what happened to them.
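To make the arithmetic concrete, here is a rough sketch of how a "PitchF/X ERA" might separate expected runs from luck. The bin label, the function name, and the dictionary layout are placeholders rather than anything from Nick's study, and the only numbers are the ones from the example above (50 pitches at 0.1 runs above average each, 10 runs actually allowed).

```python
# Hypothetical sketch: credit a pitcher with the average run value of each
# pitch he threw, rather than with what actually happened to it.

def expected_runs(pitch_counts, run_value_per_pitch):
    """Sum of (pitches thrown in a bin) * (average runs above average for that bin)."""
    return sum(n * run_value_per_pitch[bin_] for bin_, n in pitch_counts.items())

# Made-up bin label; the values are the ones used in the example in the text.
run_value_per_pitch = {"2-1 fastball, middle-in": 0.1}
pitch_counts = {"2-1 fastball, middle-in": 50}

expected = expected_runs(pitch_counts, run_value_per_pitch)  # 50 * 0.1
actual = 10.0                                                # runs actually allowed
luck = actual - expected                                     # the part you'd call random

print(expected, luck)
```

A real version would need a run value for every bin, estimated from data, which is the hard part.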

To do that, I think more work needs to be done. There probably aren't enough samples in each "bin" to get you good estimates of the value of a specific pitch. You might have to somehow smooth out the data, so you can extrapolate as to how much a 95 mph pitch is worth compared to a 93 mph pitch (Nick found the slower pitch actually led to better results for the pitcher, which is probably just a statistical anomaly you'd want to fix). You'd also have to think about other things that this analysis doesn't consider: patterns of pitch selection, tipping off pitches, understanding batters' weaknesses, and so on.

But even with those caveats, I think this is something that could work, with enough reliable data.

------

What I find interesting about all this stuff is that it further formalizes the idea that most of what happens in a baseball game is luck. There is a tendency to treat a player's accomplishments as a direct manifestation of his skill, even in small sample sizes where it probably isn't. When Albert Pujols hits a home run, there's a tendency to talk about how it was the pitcher's fault for giving him that pitch to hit, or why Pujols was able to recognize it and hit it out. Reality is probably different. I'd bet that a lot of Pujols home runs came on "good" pitches, in the sense that their expectation was zero or positive, but Pujols just happened to get good wood on them. And I bet you'd find more than the occasional weak ground ball on what was actually a juicy pitch to hit.

Because Pujols is random too. Given a certain pitch on a certain count in a certain place with a certain amount of break, Pujols is sometimes going to hit it out, sometimes he'll swing and miss, sometimes he'll ground out, and so on. It's all a matter of the probabilities coming together, like the dice rolls in an APBA game.

In "The Physics of Baseball," Robert Adair reports that the difference between hitting the ball to center field and hitting the ball foul is 1/100 of a second. No batter is so good that he can always time his swing to .01 seconds. Some batters may be closer to that than others: maybe one batter has an SD of 1/100 of a second, so that he hits the ball fair 2/3 of the time, and another has an SD of 2/100 of a second, so that he hits the ball fair only 38% of the time. Talent and practice and concentration can lower a player's SD, but not to zero. There's still a lot of luck involved even for Albert Pujols. Over a season, the luck will even out and he'll hit over .300, but in any given game, he could easily go 0-for-4, without there being anything wrong with his talent on that day.
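Those two percentages can be checked with a few lines of Python, under assumptions the text only implies: the timing error is normally distributed around the ideal moment, and a ball is "fair" if the swing is within 1/100 of a second either way. The function name and the `window` parameter are placeholders.

```python
from statistics import NormalDist

def fair_ball_rate(timing_sd, window=0.01):
    """Chance that a normally distributed timing error lands within +/- window seconds."""
    error = NormalDist(mu=0.0, sigma=timing_sd)
    return error.cdf(window) - error.cdf(-window)

print(round(fair_ball_rate(0.01), 3))  # SD of 1/100 second -> 0.683, about 2/3
print(round(fair_ball_rate(0.02), 3))  # SD of 2/100 second -> 0.383, about 38%
```

Doubling the SD roughly halves the fair-ball rate here, even though the batter's "aim" (the mean of his timing) hasn't changed at all.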

The luck isn't just because hitting a baseball is hard. Think about foul shooting in basketball. Even the best players don't shoot free throws with much more than 90% accuracy. And free throws are shot under the exact same circumstances every time, with no opposition trying to stop you. Foul shooting is something every NBA player has been doing since childhood, over and over and over, and they still can't get much better than 9 times out of 10. Why is that? It's probably the same thing as hitting a baseball: you have to do a lot of things right, with your eyes and arms and hands, and if you're more than a certain bit off, the ball won't go. And humans just aren't biologically built to be perfect enough that we can be within those very narrow limits every time. If the trajectory of a successful shot has to be within 0.1 degree of perfect (I'm making these numbers up), and our bodies are good enough only to give us a standard deviation of 0.06 degrees, even after practicing for 15 years, then no matter what, we're still going to be missing 10% of shots. That's just the way it is.
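The made-up numbers can be sanity-checked the same way, again assuming (my assumption, not a fact about shooters) that the error in the shot's angle is normally distributed around the ideal. The sketch below goes in both directions: from an SD to a make rate, and from a target make rate back to the SD it would take. Both function names are placeholders.

```python
from statistics import NormalDist

def make_rate(tolerance, sd):
    """Chance a shot lands within +/- tolerance of ideal, given the shooter's SD."""
    z = tolerance / sd
    return 2 * NormalDist().cdf(z) - 1

def sd_for_make_rate(tolerance, rate):
    """SD a shooter would need so that P(|error| < tolerance) equals rate."""
    # Two-sided window: the upper edge sits at the (1 + rate) / 2 quantile.
    z = NormalDist().inv_cdf((1 + rate) / 2)
    return tolerance / z

print(make_rate(0.1, 0.06))
print(sd_for_make_rate(0.1, 0.90))
```

With a tolerance of 0.1 degrees and an SD of 0.06, the make rate comes out a little over 90%, and the SD that gives exactly 90% is just under 0.061, so the invented numbers hang together.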

As a player, your goal can't be to make 100% of your shots. I think that's impossible, based on the limitations of the human brain and body. What you *can* do is practice enough, and the right way, that you reduce your slight errors, your standard deviation from the "ideal" shot, as much as possible. You can also work to make sure that you're always at your best. If you're normally a 90% shooter, but in clutch situations your hands shake and you hit only 80% -- well, getting your clutch performance from 80 to the 90 percent you're capable of will be a lot easier, I think, than raising your non-clutch 90 percent to 91 percent.

But no matter what, whether you're a pitcher, batter, basketball player, goalie, or whatever -- there is a limit to how perfect or predictable you can get. The best you can do is to work hard enough that your practice and talent gives you the best possible APBA card you can get. After that, you're at the mercy of the dice rolls.

Sunday, January 24, 2010

Do bad hitters see more fastballs than good hitters?

I love simple studies, where you ask an interesting question, and then just answer it by looking at the data, without the need for any fancy techniques.

So here's one from Dave Allen. Allen asks, do better hitters get worse pitches to hit? It turns out they do: at almost every ball-strike count, they get fewer fastballs. For instance, the 20 best hitters in the league got 62.6% fastballs on the first pitch, while the 20 worst hitters saw 66.3% fastballs.

A similar finding holds for the location of those fastballs. Again at 0-0, the good hitters saw only 50.7% of them in the zone, but the bad hitters got 54.8%.
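The whole method really is just counting. As a sketch of what that might look like (this is not Allen's code, and the record layout and field names are invented for illustration), you group pitches by hitter group and count, then divide fastballs by total pitches:

```python
from collections import defaultdict

def fastball_share(pitches):
    """Return {(hitter_group, count): fraction of pitches that were fastballs}."""
    totals = defaultdict(int)
    fastballs = defaultdict(int)
    for p in pitches:
        key = (p["hitter_group"], p["count"])
        totals[key] += 1
        if p["pitch_type"] == "fastball":
            fastballs[key] += 1
    return {key: fastballs[key] / totals[key] for key in totals}

# Toy records, invented only to exercise the function; not real pitch data.
toy_pitches = [
    {"hitter_group": "best", "count": "0-0", "pitch_type": "fastball"},
    {"hitter_group": "best", "count": "0-0", "pitch_type": "curveball"},
    {"hitter_group": "worst", "count": "0-0", "pitch_type": "fastball"},
    {"hitter_group": "worst", "count": "0-0", "pitch_type": "fastball"},
]

shares = fastball_share(toy_pitches)
print(shares)
```

The same function would handle the in-zone percentages if you swapped the fastball test for an "in the zone" test.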

For fastball frequency, the trend reverses at 0-2, 1-2, and 2-2 -- the good hitter gets more fastballs there. I'm not sure why that would be. Maybe on those counts it makes sense for the pitcher to "waste" a pitch outside, hoping the batter will swing at it? And maybe those pitches are more likely to be fastballs? (That would also explain why on 3-2, it's again the bad hitters that get all the fastballs -- you don't want to deliberately miss the strike zone on a full count.)

I don't know if that hypothesis makes sense, but you guys probably know more about this stuff than I do.

Here's the subset of Allen's data that I talked about here ... see his study for all of it.

It's studies like this that make me think that this kind of instantly-publishable "open-source" research (as a commenter on Tango's blog described it) delivers better results than peer-reviewed academic research, at least in sabermetrics. In academia, it seems like, to be accepted, it's not enough that a study teaches us something -- it also has to be "clever" or complex or sophisticated in a certain fashion -- usually a mathematical one. It's hard to describe, except in an "I know it when I see it" kind of way, but I bet anyone who reads a lot of papers will know what I mean.

I can't imagine a study like this one would make it into a journal. It just gets its results by counting. If you wanted to get these results into print, you'd have to embed them in another study of some kind, one that's more mathematically complex. That's just my impression, of course, and I could be wrong ... any academics out there, tell me what you think.

Regardless, I think it's true that on the internet, all that counts is whether the reader learns something about baseball. And we definitely do learn something here.

In my mind, studies like this require cleverness too, but in a different way: figuring out that you can get a quick answer to an important question, with a very simple method, is something that's not easy to do. Kudos to Dave Allen for thinking of this one and writing it up.

---

Update: forgot to hat tip The Book blog, where mgl discusses the findings.

Thursday, January 21, 2010

Chopped liver II

David Berri and J.C. Bradbury have a new paper out. Called "Working in the Land of the Metricians," it purports to be a guide for Ph.D. sports economists in how to interact with sabermetricians.

The subject is an important one, but, sadly, the paper isn't very constructive. There's a good quote-by-quote critique of the paper at Tango's blog, which I agree with substantially enough that I can just refer you there. (Also, since the paper is gated, you can get a good feel for what's happening in it by reading Tango's take.) UPDATE: the paper is now available in full (.pdf) at Dave Berri's site.

For my part, I'll just concentrate on one major point -- that Berri and Bradbury still treat the non-academic community's work as if it barely has any value at all. And they're pretty forthright about it. Despite paying lip service to the idea that maybe non-academic researchers have *something* to contribute, the authors remain wilfully blind to 98 percent of the sabermetric community's contribution to the field.

Indeed, Berri and Bradbury explicitly refuse to recognize that the active research community has any expertise at all:

"Birnbaum considers sabermetricians to be "no less intelligent than academic economists" and superior to economists in their understanding of baseball. This statement reveals a curious worldview. On one hand, the aspect that is universal across both groups -- members of both communities have been devoted sports fans since an early age -- is considered unique to the nonacademic sports anaysts. On the other hand, when it comes to the aspect that is unique to academics -- academia normally involves many years of advanced training and requires its participants to be judged competent by their peers in a "publish or perish" environment -- metricians demand equal recognition. In our view, this mentality begets misplaced confidence."

Get what they're saying here? Bill James, Tom Tango, Mitchel Lichtman, Andy Dolphin, Pete Palmer, those guys hired by baseball teams, all those other sabermetric researchers -- they're not baseball experts at all. They're *just sports fans*. How could they possibly know more about analyzing baseball than any other fan, unless they've formally studied econometrics?

It's astonishing that Berri and Bradbury could possibly believe that economists, even sports economists, know more about baseball than these guys, who have built their careers around analyzing the game. Equally astonishing is their implication that we *are* less intelligent than they are. At first, I thought that they didn't realize what they were saying. But, no, it seems pretty clear that they *do* believe it.

What has become apparent in sabermetricians' debates with Bradbury (and, to a lesser extent, Berri, who doesn't engage us as much) is that they disagree with us on almost every point we make, and that they seem to be uncomfortable arguing informally. I can't recall a single time that either of them has conceded that they're wrong, even on a small point. Economists are supposed to be fond of hypotheticals, simple models, and playful arguments to illustrate a point (see this Paul Krugman column), but Bradbury and Berri, not so much. Attempts to try to describe their models with simplified analogies are usually met with detailed rebukes about how we don't understand their econometric methods.

In that light, it's easier to see where they're coming from. They believe that (a) only formal, peer-reviewed research counts as knowledge, and (b) all us non-peer-reviewed people have been wrong every time we've disagreed with their logic. If both of those were actually true, they'd be right -- we'd just be ignorant sports fans who don't have any expertise and need to be educated.

As for the sabermetric findings that they actually use, like DIPS and OPS ... Berri and Bradbury seem to consider them a form of folk wisdom that the unwashed baseball fans managed to stumble upon, and argue that economists should not accept them until they've been verified by regression methods that would pass peer review and be publishable in academic journals.

Back in 2006, in the post that Bradbury and Berri quoted above, I accused Bradbury of ignoring the findings of sabermetricians in one of his papers. At the time, I thought perhaps I was too harsh. I was wrong. This paper shows that he truly believes that, as non-peer-reviewed sports fans with no special expertise, our research findings are unworthy of being cited.

Indeed, in Bradbury's exposition (I am assuming all the arguments in the baseball portion of the current paper are Bradbury's, although Berri is still listed as co-author), he treats the history of sabermetric knowledge as if it were mostly a series of academic papers. That's absurd; it's indisputable that, conservatively, at least 90 percent of our sabermetric knowledge came from outside academia. If you were writing the history of research about Linear Weights, what would you include? Think about it for a minute.

Ready? Here's how Bradbury sees it:

--In 1992, academic A.A. Blass published a study where he estimated linear weights by regression.

--In 2005, academic Ted Turocy published a paper which highlighted the "omitted variable" bias making the results not as accurate as they could be.

--In 1963, academic George Lindsey had published a paper with a rudimentary form of the same equation, but not using regression.

Got it? Four academic papers (Albert/Bennett is actually their book, Curve Ball, but no matter) and Thorn/Palmer. On top of that, the only mentions of Thorn and Palmer are their "updating" and "popularizing". Each of the academics' work, on the other hand, is described in some detail.

Is that how you would characterize the state of knowledge, that the history of accumulated knowledge about Linear Weights comes from these five academics and a cursory contribution from Pete Palmer? I mean, come on. There's a huge literature out there if you look outside academia, including studies on various improvements to the original.

It gets worse, in the DIPS discussion.

Bradbury starts off by reviewing the seminal Gerald Scully paper. In 1974, Scully (a famed sports economist who passed away in 2009) published a paper that tried to value players' monetary contributions to their teams. Regrettably, he used strikeout-to-walk ratio as a proxy for the pitcher's value to his team's success on the field. That's not a particularly good measure of the value of a pitcher; ERA is much better. Even before sabermetrics, everyone knew that, including casual baseball fans and Joe Morgan. And if they didn't, Bill James would have made it clear to them in his Abstracts, so there was no excuse for a baseball researcher to not be aware of that after, say, 1983.

And Bradbury acknowledges that, that ERA was perceived to be better than K/BB ratio. Does he cite common sense, conventional wisdom, or Bill James? Nope. He cites two academic papers from the 1990s. No, really:

Anyway, with that established, Bradbury continues. In 2001, Voros McCracken came along, and, in an "essay" on the "popular sabermetric Web site Baseball Prospectus," he "suggested" that pitchers have little control over what happens to balls in play. At this point, Bradbury checks the correlation of a pitcher's BABIP in consecutive seasons, and finds it's fairly low (.24).

"This supports McCracken's assertion," he writes.

Okay, thanks for the verification! But, er, actually, it's not like the sabermetric community was just sitting on its hands the past eight or nine years, staring at McCracken's hypothesis and wondering if someone would ever come along and tell them if it was true. Sabermetricians of all sorts have done huge volumes of work to refine and prove versions of the DIPS hypothesis. For a long time, you couldn't hit any sabermetric website without DIPS this and DIPS that hitting you from every angle, with all kinds of theories about it tested and studied.

It's perfectly fine that Bradbury verifies McCracken with his quick little regression here, but why wouldn't he acknowledge everyone else, by, for instance, saying that his study is an example of the kinds of confirmation studies that the sabermetric community has been doing since 2001? In light of the rest of the article, which implies that sabermetricians are just unsophisticated baseball fans, that omission would reasonably lead readers to incorrectly assume that Bradbury's little study is the first of its kind.

But, instead, Bradbury ignores all those years of sabermetric study of the DIPS issue, just because it wasn't academically published or peer-reviewed. That's as wrong now as it was when I wrote about it in 2006 -- especially in an essay that's ostensibly suggesting that academics can benefit from outside research.

The DIPS approach is as well accepted in sabermetrics as the Coase Theorem is accepted in economics. The difference is, if I published something mentioning Coase's hypothesis, and then published a study "supporting" it without citing any other study or mentioning that it's a canon of the economics literature, Bradbury would go ballistic, laying into my ignorance like ... like Tiger Woods on a supermodel. (Sorry.) The other way around, though, and it's all OK.

Anyway, that's just the prelude -- this is the point where Bradbury's argument gets really bizarre.

Why did Bradbury bring up DIPS? Because it shows that, instead of using ERA to evaluate the skill of a pitcher, it's better to just use walks, strikeouts, and home runs allowed, thus eliminating a lot of random chance from the pitcher's record. That part is absolutely fine. But then he comes back to Scully.

Remember when Scully did his 1974 study that used strikeout-to-walk ratio as a measure of a pitcher's value? And everyone agreed that he should have used ERA instead? Well, now, hang on! McCracken has shown us that if we ignore a pitcher's balls in play, we can get a better measure of his talent than if we use ERA. Eliminating balls in play leaves only BB, K, and HR. That means strikeouts and walks are really important. And, in turn, that means that Scully was actually correct back in 1974 when he emphasized strikeouts and walks by using K/BB ratio as his statistic of choice!

To make this bizarre argument Bradbury ignores the fact you have to combine K and BB in a very specific way, or it doesn't work well at all, and that K/BB ratio is still worse than ERA. And HR are important too. But, argues Bradbury, at least Scully was right in that it had to be K and BB. Maybe he wasn't completely correct, but he had the right idea.

Why did Gerald Scully choose to value pitchers by K/BB? According to Bradbury, it wasn't because he just gave it a guess. He did it because he intuitively anticipated that DIPS was true. Voros McCracken, the non-academic sabermetrician, just served later to confirm Scully's original insight.

See? The academics were right all along!

If that sounds ludicrous, it is. I really hope you don't believe what I'm saying here ... I hope you're thinking that nobody could actually make that argument with a straight face, that Gerald Scully's choice of K/BB ratio as his metric is an anticipation of a completely different, complex, 12-part formula for DIPS ERA that happens to partly depend on K and BB. I hope you believe that I'm making it up.

But I'm not. That's actually what Bradbury argues! Here are the quotes:

"If a measure varies considerably for an individual player over time, it is likely that the measure is heavily polluted by luck ... Scully appeared to understand this point with his choice of the strikeout-to-walk ratio to measure pitcher quality.

"Later research from outside academia [McCracken] confirmed Scully's original approach and suggested [also] including home-run prevention when evaluating pitchers. ...

"Consequently, we see Scully's general approach is confirmed by a metrician, demonstrating what the nonacademic sports research community can contribute to sports economics research."

Could there be a sillier, more self-serving rationalization?

Oh, and by the way ... it seems that Scully wasn't even a baseball fan at the time he started writing his study. How does Bradbury think Scully was able to anticipate a result that wouldn't become apparent until 27 years later? A result, moreover, that pertained to a sport Scully probably didn't know that much about, in a field of science that didn't even exist at the time, that wouldn't emerge until years of study by non-academic researchers, and that was so surprising it shocked even the most veteran researchers in the field?

Maybe it was that "years of advanced training" in economics -- obviously so much more valuable than the non-expertise in sabermetrics that Voros McCracken, a mere baseball fan, couldn't possibly have had.

Tuesday, January 19, 2010

Evaluating scientific debates: some ramblings

Last week's renewed debate on JC Bradbury's aging study (JC posted a new article to Baseball Prospectus, and comments followed there and on "The Book" blog) got me thinking about some things that are tangential to the study itself ... and since I have nothing else to write about at the moment, I thought I'd dump some of those random thoughts here.

1. Peer review works much better after publication than before.

When there's a debate between academics and non-academics, some observers argue that the academics are more likely to be correct, because their work was peer reviewed, while the critics' work was not.

I think it's the other way around. I think post-publication reaction, even informally on the internet, is a much better way to evaluate the paper than academic peer review.

Why? Because academic peer reviewers hear only one side of the question -- the author's. At best, they might have access to the comments of a couple of other referees. That's not enough.

After publication, on the internet, there's a back and forth between people on one side of the question and people on the other. That's the best way to get at the truth -- to have a debate about it.

Peer review is like the police deciding there's enough evidence to lay charges. Post-publication debate is like two lawyers arguing the case before a jury. It's when all the evidence is heard, not just the evidence on one side.

More importantly, no peer reviewer has as good a mastery of previous work on a subject as the collective mastery of the public. I may be an OK peer reviewer, but you know who's a better peer reviewer? The combination of me, and Tango, and MGL, and Pizza Cutter, and tens of other informed sabermetricians, some of whom I might only meet through the informal peer review process of blog commenting.

If you took twelve random sabermetricians whom I respect, and they unanimously came to the verdict that paper X is flawed, I would be at least 99% sure they were right and the peer reviewer was wrong.

2. The scientific consensus matters if you're not a scientist.

It's a principle of the scientific method that only evidence and argument count -- the identity of the arguer is irrelevant.

Indeed, there's a fallacy called "argument from authority," where someone argues that a particular view must be correct because the person espousing it is an expert on the subject. That's wrong because even experts can be wrong, and even the expertest expert has to bow to logic and evidence.

But that's a formal principle that applies to situations where you're trying to judge an argument on its merits. Not all of us are in a position to be able to do that all the time, and it's a reasonable shortcut in everyday life to base your decision on the expertise of the arguer.

If my doctor tells me I have disease X, and the guy who cleans my office tells me he saw my file and he thinks I really have disease Y ... well, it's perfectly legitimate for me to dismiss what the office cleaner says, and trust my doctor.

It only becomes "argument from authority" where I assert that I am going to judge the arguments on their merits. Then, and only then, am I required to look seriously at the office cleaner's argument, without being prejudiced by the fact that he has zero medical training.

Indeed, we make decisions based on authority all the time. We have to. There are many claims that are widely accepted, but still have a following of people who believe the opposite. There are people who believe the government is covering up UFO visits. There are people who believe the world is flat. There are people who believe 9/11 was an inside job.

If you're like me, you don't believe 9/11 was an inside job. And, again, if you're like me, you can't actually refute the arguments of those who do believe it. Still, your disbelief is rational, and based solely on what other people have said and written, and your evaluations of their credibility.

Disbelieving solely because of experts is NOT the result of a fallacy. The fallacy only happens when you try to use the experts as evidence. Experts are a substitute for evidence.

You get your choice: experts or evidence. If you choose evidence, you can't cite the experts. If you choose experts, you can't claim to be impartially evaluating the evidence, at least that part of the evidence on which you're deferring to the experts.

The experts are your agents -- if you look to them, it's because you are trusting them to evaluate the evidence in your stead. You're saying, "you know, your UFO arguments are extraordinary and weird. They might be absolutely correct, because you might have extraordinary evidence that refutes everyone else. But I don't have the time or inclination to bother weighing the evidence. So I'm going to just defer to the scientists who *have* looked at the evidence and decided you're wrong. Work on convincing them, and maybe I'll follow."

The reason I bring this up is that, over at BPro, MGL made this comment:

"I think that this is JC against the world on this one. There is no one in his corner that I am aware of, at least that actually does any serious baseball work. And there are plenty of brilliant minds who thoroughly understand this issue who have spoken their piece. Either JC is a cockeyed genius and we (Colin, Brian, Tango, me, et. al.) are all idiots, or..."

Is that comment relevant, or is it a fallacious argument from authority? It depends. If you're planning on reading all the studies and comments, and reaching a conclusion based on that, then you should totally ignore it -- whether an argument is correct doesn't depend on how many people think it is.

But if you're just reading casually and trying to get an intuitive grip on who's right, then it's perfectly legitimate.

And that's how MGL meant it. What he's saying is something like: "I've explained why I think JC is wrong and I'm right. But if you don't want to wade through all that, and if you're basing your unscientific decision on which side seems more credible -- which happens 99% of the time that we read opposing opinions on a question of scientific fact -- be aware that the weight of expert opinion is on my side."

Put that way, it's not an appeal to authority. It's a true statement about the scientific consensus.

3. Simple methods are often more trustworthy than complex ones.

There are lots of studies out there that have found that the peak age for hitters in MLB is about 27. There is one study, JC Bradbury's, that shows a peak of 29.

But it seems to me that there is a perception, in some quarters, that because JC's study is more mathematically sophisticated than the others, it's therefore more trustworthy. I think the opposite: that the complicated methods JC used make his results *less* believable, not more.

I've written before about simpler methods, in the context of regression and linear weights. Basically, there are two different methods that have been used to calculate the coefficients for the linear weights formula. One involves doing a regression. Another involves looking at play-by-play data and doing simple arithmetic. The simple method actually works better.

More importantly, for the argument I'm making here, the simple method is easily comprehensible, even without stats classes. It can be explained in a few sentences to any baseball fan of reasonable intelligence. And if you're going to say you know a specific fact, like that a single is worth about .46 runs, it's always nicer to know *why* than to have to trust someone else, who used a mathematical technique you don't completely understand.
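For anyone who hasn't seen the simple method, here is a minimal sketch of the arithmetic, assuming each play in the play-by-play data already carries the run expectancy before the play, the run expectancy after it, and the runs that scored on it. The field names are placeholders and the four plays are invented, so the answer won't be the real .46; the point is only that the whole calculation is an average.

```python
def linear_weight(events, event_type):
    """Average of (run expectancy after - before + runs scored) over plays of one type."""
    values = [
        e["re_after"] - e["re_before"] + e["runs_scored"]
        for e in events
        if e["type"] == event_type
    ]
    return sum(values) / len(values)

# Invented plays with invented run expectancies, only to exercise the function.
toy_events = [
    {"type": "single", "re_before": 0.50, "re_after": 0.90, "runs_scored": 0},
    {"type": "single", "re_before": 0.30, "re_after": 0.20, "runs_scored": 1},
    {"type": "single", "re_before": 0.25, "re_after": 0.50, "runs_scored": 0},
    {"type": "out", "re_before": 0.50, "re_after": 0.25, "runs_scored": 0},
]

single_weight = linear_weight(toy_events, "single")
print(single_weight)
```

Run it over every single in a real season and you get the value of a single; change the event type and you get the value of a walk or a home run.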

Another advantage of the simple technique is that, because so many more people understand it, its pros and cons are discovered early. A complex method can have problems that don't get found out until much later, if ever.

For instance, how much do hitters lose in batting skill between age 28 and age 35? Well, one way to find out is to average the performance of 28-year-olds, and compare it to the averaged performance of 29-year-olds, 30-year-olds, and so on, up to 35-year-olds. Pretty simple method, right, and easy to understand? If you do it, you'll find there's not much difference among the ages. You might conclude that players don't lose much between 28 and 35.

But there's an obvious flaw: the two groups don't comprise the same players. Only above-average hitters stay in the league at 35, so you're comparing good players at 35 to all players at 28. That's why they look similar: the average of a young Joe Morgan and a young Roy Howell looks similar to the average of an old Joe Morgan and a retired, zero-at-bat Roy Howell, even though Morgan and Howell each declined substantially in the intervening seven years.
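A two-player toy example shows the effect. The numbers below are invented, in arbitrary units, and "star" and "regular" only stand in for the two kinds of player described above; they are not anyone's real stat line.

```python
# Invented performance numbers in arbitrary units; not real stats.
careers = {
    "star": {28: 60, 35: 40},   # good enough to still be playing at 35
    "regular": {28: 20},        # out of the league before 35
}

def average_at_age(careers, age):
    """Average performance of everyone who actually played at that age."""
    values = [seasons[age] for seasons in careers.values() if age in seasons]
    return sum(values) / len(values)

def own_change(careers, start, end):
    """Each player's own change, for players observed at both ages."""
    return {
        name: seasons[end] - seasons[start]
        for name, seasons in careers.items()
        if start in seasons and end in seasons
    }

print(average_at_age(careers, 28))   # 40.0
print(average_at_age(careers, 35))   # 40.0
print(own_change(careers, 28, 35))   # {'star': -20}
```

The group average is 40 at both ages, which looks like no aging at all, even though the one player who lasted dropped by 20 and the other dropped out of the sample entirely.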

Now that flaw ... it's easy to spot, and the reason it's easy to spot is that the method is simple enough to understand. It's also easy to explain, and the reason it's easy to explain is again that the method is simple enough to understand.

If I use the more complicated method of linear regression (and a not very complicated regression), and describe it mathematically, it looks something like this:

"I ran an ordinary least squares regression, using the model P(it) = ax(it) + b[x(it)^2] + e, where P(it) is the performance of player i at age t, x(it) is the age of player i at age t, and all player-seasons of less than 300 PA were omitted. The e is an error term, assumed iid normal with mean 0."

The flaw is actually the same as in the original, simpler case: the sample of players is different at each age. But it's harder to see the flaw this way, isn't it? It's also harder to describe where the flaw resides -- there's no easy one-sentence explanation about Morgan and Howell like there was before.

So why would you trust the complicated method more than the simple one?

Now, I'm not saying that complexity is necessarily bad. A complex method might be more precise, and give you better results, assuming that there aren't any flaws. But, you still have to check for flaws. If the complex method gives you substantially different results (peak age 29) from the simple methods (peak age 27), that's a warning sign. And so you have to explain the difference. Something must be wrong, either with the complex method, or with all the simple methods. It's not enough to just explain why the complex method is right. You also have to explain why the simple methods, which came up with 27, came out so wrong.

In the absence of a convincing explanation, all you have are different methods, and no indication which is more reliable. In that case, why would you choose to trust the complicated method that you don't understand, but reject the simple methods that you *do* understand? The only reason for doing so is that you have more faith that whoever introduced the complicated method actually got everything right, the method and the calculations and the logic.

I don't think that's justified. My experience leads me to think that it's very, very risky to give that kind of blind trust without understanding the method pretty darn well.

Sunday, January 10, 2010

Book review: Wayne Winston's "Mathletics"

"Mathletics," by Wayne Winston, is a fine book. It's meant as an introduction to the sabermetrics of baseball, football, and basketball, with a little bit of math/Excel textbook built in. It's not perfect, but it suits its purpose very well, and it's probably the first book I'd suggest for anyone who wants a quick overview of what sabermetrics is all about in practical terms.

One of the things that I think makes the book work well is that it's not full of itself. It doesn't make grand pronouncements about how it's a revolution in thinking about sports, or how its breakthroughs are going to change the game. It just gets to work, with clear explanations of the various findings in sabermetrics. Every subject gets its own chapter, and the chapters are generally exactly as long as they need to get the point across. The discussion of Joe DiMaggio's hitting streak takes eight pages, but the Park Factors chapter is only three, because, really, that's all it takes to explain park factor.

About a third of the book is devoted to each of the three sports. I guess I'm most qualified to evaluate the baseball section, and I'd say the selection of subjects is pretty decent. The first few chapters deal with the oldest, most established results -- Pythagoras, linear weights, and runs created. There's a chapter on the various fielding evaluation methods, on streakiness, and on the "win probability added" method of evaluating offense. DIPS gets its own chapter, in the context of evaluating pitchers. There's even a chapter on replacement value, although, strangely, Winston discusses it only in the context of win probability, rather than methods that don't involve timing of events.

For the most part, it's a matter of personal opinion what topics in sabermetrics are more important and what topics are less important, and, since this is Winston's book and not mine, you should take my recommendations with a grain of salt. But my main complaint is that I wish there had been a discussion of random chance in the statistical record, and regression to the mean. Throughout the book, no mention is made of the fact that most extreme values of sports statistics are biased away from the mean, although I think there are a few casual mentions of small sample sizes. (But even as I write this, other topics occur to me ... Hall of Fame induction standards, for instance, and baseball draft findings.)

On the football side, there are discussions of quarterback rating methods, an analysis of NCAA overtime strategies, and NFL overtime probabilities. There's a chapter on the paradox of the passing premium, and one on fourth-down decision-making. All this stuff seems like solid summaries to me, at least from what I've learned about football strategy from research blogs like Brian Burke's.

One thing I learned about football that I'd never seen before (which might just be a gap in my football sabermetrics education, although I got the impression that this was original research by Winston) is a summary of the strategy of when to go for a two-point conversion instead of kicking the extra point. Winston presents a full table of the appropriate strategy depending on the score. Some of the findings are obvious, like never go for two when you're seven points behind (after the TD but before the PAT). But some are intriguing -- for instance, when you're six points ahead, you should go for the two-pointer if and only if there are likely to be fewer than 18 possessions left in the game.

Most of the baseball and football material was already familiar to me, as was about half the material in the basketball section (formulas for ranking players, a summary of the research on referee racism, etc.). But there was a bunch of basketball stuff I hadn't seen before, or didn't know much about. Again, some of that might be because I don't follow basketball research that closely. But I'm sure some of the stuff is original, as Winston works as a consultant to the NBA's Dallas Mavericks. I found the "plus-minus" chapters to be the most interesting (and they're also the longest), but, after reading them, I still wasn't quite sure how much of the results were real, and how much were just noise due to small sample sizes.

The plus-minus system tries to figure out a player's value by how his team does when he's on the floor. The problem with that, of course, is that the player's rating will be biased by the teammates he plays with: a crappy player might look good if he plays with Kevin Garnett all the time. The system tries to factor that out, by keeping track of all the teammates and opposition players on the floor at the same time, and finding a set of ratings that most consistently predicts outcomes based on those other nine players. (Winston uses a feature of Microsoft Excel called "Excel Solver" for this; I'm not sure how it would differ from an ordinary least-squares regression.)

The results are impressive, but there aren't any confidence intervals, or even simple intuitive measures of how reliable the results might be. I really like the plus-minus method in theory, but I've always wondered about how much you can trust its answers, and Winston doesn't really tell us here. The question is especially relevant because Winston goes on to try to figure out the "chemistry" of various lineups. For instance, suppose you have five players who are +1 each, but, when they're on the court together, the team winds up +15 instead of the expected +5. Winston would say that those five players complement each other somehow and perform exceptionally well together. I'd ask, could it just be random?

Another interesting study in the book, which I think is original to Winston, is a measure of which draft positions give you the best value per dollar (similar to the Massey/Thaler study of the NFL draft). It turns out that the 1-10 choices are by far the most lucrative, but that 6-10 slightly outperforms 1-5 after adjusting for salaries. There are only five years in Winston's study, though, and he tells us that the 6-10s are "pumped up by the phenomenal success of #10 picks Paul Pierce, Jason Terry, Joe Johnson, and Caron Butler."

Finally, there's a fourth section of the book, which discusses topics that aren't specific to a single sport. Gambling probabilities are covered, along with team rating methods, competitive balance, and other such things.

--------

As I said, I really like the book and its method of presentation ... but I have to say I don't agree with everything in it, and I think there are things in it that are just plain wrong. Winston spends two chapters trying to evaluate how play has improved over the decades ("Would Ted Williams Hit .406 today?"), but I don't think the computations work. The method, as has been done by many others, is to look at all players who played two consecutive years, and see how their performance changed from one year to the next. If their performance dropped by (say) two points, you conclude that the league improved by two points between those two seasons.

As I have argued before, I think that method doesn't measure league improvement -- I think it measures the difference between player performance in the first year of their career, as compared to the last year of their career. So I think Winston's conclusion, that Ted Williams would have hit .344 in 2005, is completely without basis.

Another problem is the chapter on parity. Winston regresses each NFL team's performance this season on its performance last season, and gets an r-squared of .12. He does the same thing for the NBA, and gets an r-squared of .32. He therefore concludes that the NFL has more parity, and it must be because of the salary cap, the draft, and the fact that contracts in the NFL are not guaranteed.

Those might all be factors, but, as I (and Tango, and GuyM, and many others) have pointed out, the main reason is that NFL teams play 16 games, while NBA teams play 82 games. Even if the other factors affecting year-to-year performance were exactly the same, the correlation would be lower in the NFL just because random chance is a much higher proportion of performance in a 16-game season than in an 82-game season.
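Here's a quick simulation of the schedule-length effect. It's a sketch under assumptions I've made up for illustration: both "leagues" have exactly the same spread of team talent (an SD of .100 in true winning percentage), talent carries over perfectly from one year to the next, and every game is an independent coin flip weighted by talent. The only thing that differs is the number of games.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid any dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def year_to_year_r_squared(games, n_teams=2000, talent_sd=0.10, seed=7):
    """r-squared between two seasons played by teams whose talent never changes."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teams):
        # Assumed true winning percentage, kept away from 0 and 1.
        talent = min(max(rng.gauss(0.5, talent_sd), 0.05), 0.95)
        year1.append(sum(rng.random() < talent for _ in range(games)) / games)
        year2.append(sum(rng.random() < talent for _ in range(games)) / games)
    return pearson_r(year1, year2) ** 2

r2_16_games = year_to_year_r_squared(games=16)
r2_82_games = year_to_year_r_squared(games=82)
print(f"16-game seasons: r-squared = {r2_16_games:.2f}")
print(f"82-game seasons: r-squared = {r2_82_games:.2f}")
```

Even with identical talent and zero roster turnover, the 16-game league shows the much lower r-squared, so a lower number for the NFL can't, by itself, be evidence of more parity.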

Winston also revisits the question of whether payroll can buy wins. He finds that there's a reasonable correlation between team pay and performance in baseball, but low or negative correlations in the NBA and NFL. That, he speculates, is because it's much easier to evaluate the statistics to figure out if a baseball player is good, than to figure out the relative skill of a football player or basketball player. Under that theory, NBA and NFL teams just aren't very good at figuring out who's valuable and who's not.

That doesn't sound plausible to me, that teams could be that blind. Most of the effect, I think, is that because the NBA and NFL both have a salary cap, the distribution of team payroll is very narrow. Therefore, most of the variation is luck, which means the r-squared is going to be lower.

That is: the r-squared is not an absolute measure of the relationship between pay and performance -- it's a *relative* measure, relative to the other sources of variance. In any given year, there will be a high correlation between my salary and the total salary of people in my house -- but a lower correlation between my salary and the total salary of people in the country. The R-squared depends heavily on the size of the *other* factors that contribute to variance. In the NBA and NFL, those factors are much larger than the (compressed) payroll. In MLB, however, you have teams that spend $200 million, and teams that spend $60 million. That means a lot more of the observed difference between teams is payroll-related.
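A small worked example of that point, with hypothetical numbers. Suppose wins are a straight-line function of payroll plus independent luck. Then the r-squared is just the payroll-driven variance divided by the total variance. The slope, the luck SD, and the two payroll spreads below are all assumptions I picked for illustration, not estimates for any real league.

```python
def r_squared(wins_per_million, payroll_sd, luck_sd):
    """Population r-squared for wins = slope * payroll + independent luck."""
    explained_variance = (wins_per_million * payroll_sd) ** 2
    return explained_variance / (explained_variance + luck_sd ** 2)

SLOPE = 0.1    # assumed: wins bought per extra $1 million, same in both leagues
LUCK_SD = 6.0  # assumed: SD of wins from everything other than payroll

wide_payrolls = r_squared(SLOPE, payroll_sd=35.0, luck_sd=LUCK_SD)   # no cap
capped_payrolls = r_squared(SLOPE, payroll_sd=5.0, luck_sd=LUCK_SD)  # salary cap

print(f"uncapped league: r-squared = {wide_payrolls:.3f}")
print(f"capped league:   r-squared = {capped_payrolls:.3f}")
```

A dollar buys exactly the same number of wins in both cases. The only thing that changed is how much the payrolls vary, and that alone is enough to take the r-squared from respectable to nearly zero.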

One last way to look at this: in Rotisserie League Baseball, there is a high correlation between player salary and performance: Albert Pujols goes for a lot more rotisserie dollars than Eric Hinske. But if you do a correlation between team pay and performance, you'll get a very low number, because all teams pay around $260!

Winston would do better to regress individual player performance on individual player salaries. If he did that, he'd find that there is indeed a strong link between pay and performance, but that the salary cap means it doesn't apply at the team level.

------

I should also mention a few picky things that could be improved. There are some silly errors that could have been fixed with a little more reviewing. For instance, in Chapter 1, Winston notes that in July, 2005, the Washington Nationals were 50-32 despite having allowed more runs than they scored. According to Pythagoras, based on their runs scored and allowed, they should have been around .500. "Sure enough," the book says, "the poor Nationals finished 81-81."

But, of course, that doesn't follow. Perhaps the Nationals should have finished .500 in their remaining 80 games, but that should have brought them to 90-72, not 81-81 -- you can't go back and reverse the games that already happened. That's just a little oversight that should have been caught, and could be misleading to someone who's reading about Pythagoras for the first time.
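The arithmetic, spelled out as a short sketch. The function name is just something I made up for this illustration; the 50-32 record and the .500 expectation are the ones quoted above.

```python
def projected_final_record(wins, losses, rest_of_season_pct, season_games=162):
    """Keep the games already played, and apply the expected winning
    percentage only to the games that are left."""
    remaining = season_games - wins - losses
    future_wins = round(remaining * rest_of_season_pct)
    return wins + future_wins, losses + (remaining - future_wins)

final_record = projected_final_record(50, 32, 0.500)
print(final_record)  # (90, 72)
```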

Another thing I found is that some of the Excel charts were a little off-putting. That's my opinion, which is not necessarily better than Winston's own editorial judgment (and, after all, it is his book, and part of its mandate is teaching a bit of Excel). But at least a little better formatting would have helped. In particular, numbers in cells should be rounded to the appropriate number of decimals; a chart showing the "mean strength" of the Buffalo Bills to be 3.107639211 is obviously a little too exact.

And I hate the term "Mathletics" as a substitute for sabermetrics. Hate, hate, hate. Hate.

-----

Another strength of the book is its bibliography. Even before getting to it, at the back of the book, it's obvious that the author is quite well read in the current state of the sabermetric art; almost every source I can think of (including this blog) is sourced somewhere in the text. The bibliography expands on the text references, with a listing of somewhere around 100 articles and websites, with full opinionated descriptions of what's in them. (Disclaimer: Winston says some very kind things about this site ... thanks!)

The only omission I found -- and it's a big one -- is that "The Book" blog isn't included. In my opinion, that should be among the first places sabermetricians go to learn what's new in the field (especially in baseball). Tango is very thorough in identifying which new research is worthy and which isn't, and I'm disappointed Winston didn't include that particular blog. However, "The Book" itself is listed, with a nicely favorable review and a link to Tango's own website (if not the book's).

-----

I think of "Mathletics" as a bit of a sabermetric Wikipedia between hard covers. Despite some shortcomings that I've described here, it's the only concise, current, beginner's description of sabermetric findings that I can think of. My preference would be to see it expanded a bit. I'd love for it to have a section on hockey -- there's lots of stuff we know thanks to Alan Ryder, Gabriel Desjardins, Tyler Dellow, and others -- and there are lots of other topics in the other three sports that could be added. I'd also prefer if more of the Excel stuff were left out of the book and placed on the author's website (where the full spreadsheets can be found).

But, as I said, it's Winston's book, not mine, and until he appoints me paid editor, I should appreciate it for what it is, which is a book that fills what I think is an important untapped need. Even as it stands, it's now the first book I'd recommend to any beginner who wants a quick overview of the state of sabermetric knowledge.

Saturday, January 02, 2010

Gabriel Desjardins: NHL teams are playing for overtime

In the National Hockey League, the losing team gets zero points in the standings if it loses in regulation time, but one point if it loses in overtime or the shootout. Either way, the winning team gets two points. That means that, if a game is tied after 60 minutes, three points will be split between the teams instead of two. In theory, this should give teams an incentive to "play for the tie" and thus increase the joint reward by 50 percent.

To see how important that third point is, suppose that one average NHL team were to successfully collude with all the other teams to play to a regulation tie. Both teams would ensure that they didn't score in regulation, but, come overtime, both teams would now try hard to win.

What would happen? Well, if it's an average team, it would win 41 games, for two points each, and lose the other 41 in overtime or the shootout, for one point each. That would give it a total of 123 points on the season.

123 points would have led the league last year.

Of course, no team is going to collude to play every game to overtime. But the incentive is still high. On average, getting to overtime gets you 1.5 points. To claim 1.5 points in a regulation game requires a winning percentage of .750. So, for all but the very best teams, it's theoretically better to wait for overtime than to try to win in regulation.
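Here's the point arithmetic as a short sketch. The variable names are mine, and the 50 percent overtime winning percentage is the "average team" assumption from above.

```python
POINTS_FOR_WIN = 2
POINTS_FOR_OVERTIME_LOSS = 1
POINTS_FOR_REGULATION_LOSS = 0

# An average team that takes all 82 games to overtime and wins half of them.
games = 82
overtime_wins = games // 2
overtime_losses = games - overtime_wins
season_points = (overtime_wins * POINTS_FOR_WIN
                 + overtime_losses * POINTS_FOR_OVERTIME_LOSS)

# Expected points from a single game that reaches overtime, for an average team.
overtime_expected_points = 0.5 * POINTS_FOR_WIN + 0.5 * POINTS_FOR_OVERTIME_LOSS

# A game decided in regulation pays 2 points with probability p and 0 otherwise,
# so the break-even p solves 2 * p = overtime_expected_points.
break_even_win_pct = overtime_expected_points / POINTS_FOR_WIN

print(season_points)             # 123
print(overtime_expected_points)  # 1.5
print(break_even_win_pct)        # 0.75
```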

Now, teams aren't actually going to act to maximize their expected points this way, for obvious reasons. First, the fans care about more than just standings points -- they want to be entertained. Paying customers and home viewers will get very upset at having to sit through 60 minutes of ragging the puck before the real action starts. Second, you could argue the idea of teams trying not to win (or even trying not to win "right now") is damaging to the integrity of the game.

Still, the incentive is there, and you should expect that teams will respond to it in whatever ways they can get away with. In a study I did about three years ago, I found that teams are playing to more regulation ties than they used to. After adjusting for the league level of scoring, there was about a 24 percent increase in the frequency of regulation ties in the NHL.

-------

Is it getting worse? Last week, in a Wall Street Journal column, hockey researcher Gabriel Desjardins found that it is, and also found some striking evidence that might suggest why. Gabriel found that in the past two years, 22.5% of games were tied after 60 minutes, almost exactly the same as in 2005 to 2007. You could conclude that the effect has levelled off. But, this year, there's been a jump. So far this year, Gabriel found, 27.9 percent of games have gone to overtime.

Why is this happening? Well, Gabriel observes, if teams are playing for the regulation tie, you'd expect them to concentrate their efforts in games where the score is tied late. After all, it's hard to play to preserve a tie in the middle of the first period: you'll have to change your style of play for 50 minutes.

So Gabriel looked at what happens in games that are tied with three minutes left in the third period. What did he find?

-- In the past four seasons, scoring in the last three minutes has been about a third lower than normal;

-- This season, it's dropped even further: it's more than 60% lower than normal!

Is this season's drop statistically significant? Well, I counted 157 games that went to overtime this year so far. I'm counting 10 days later than Gabriel was, so let's lower that to 145. If we assume there were also 145 games tied in the last three minutes, that's 435 minutes of hockey, or 7.25 games. If we assume that the distribution of goals per game (in the last three minutes) is Poisson with a mean of 4 (which is roughly the average of the past four years), that means the variance is also 4 (in a Poisson distribution, the mean equals the variance). That means the standard deviation is 2 goals per game. To get the SD of the average over 7.25 games, we divide by the square root of 7.25, which gives 0.74.

So the difference between this year's figure of 2.05, and the expected figure of about 4.00, is a bit less than 3 standard deviations. That's fairly significant. It's probably a combination of something real happening, and a certain amount of luck. But even if some of it is luck, the number does jump out at you.
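For anyone who wants to check the arithmetic, here it is as a short sketch. The inputs are the ones from the text (145 games, a Poisson mean of 4, this year's figure of 2.05); the variable names are my own.

```python
import math

tied_games = 145             # games assumed tied with three minutes left
minutes_observed = tied_games * 3
game_equivalents = minutes_observed / 60   # 435 minutes = 7.25 full games

expected_rate = 4.00         # goals per 60 minutes, roughly the four-year average
observed_rate = 2.05         # this year's figure

# Poisson: variance equals the mean, so the SD for one full game is sqrt(4) = 2.
sd_one_game = math.sqrt(expected_rate)

# SD of the per-game average over 7.25 game-equivalents.
sd_of_average = sd_one_game / math.sqrt(game_equivalents)

# Cross-check: total goals are Poisson with mean 4 * 7.25 = 29, so the SD of
# the total is sqrt(29); dividing by 7.25 gives the same per-game SD.
sd_cross_check = math.sqrt(expected_rate * game_equivalents) / game_equivalents

z_score = (expected_rate - observed_rate) / sd_of_average

print(f"game equivalents: {game_equivalents:.2f}")
print(f"SD of the average: {sd_of_average:.2f}")
print(f"standard deviations below expected: {z_score:.2f}")
```

That works out to a little over two and a half SDs, consistent with "a bit less than 3."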

------

Is the NHL happy with this state of affairs, where teams are playing for a tie where they can get away with it? I think, overall, they are. From the standpoint of the league, there are both pros and cons.

On the pro side, more overtime games equals more excitement. More overtimes also means more shootouts, and fans seem to like shootouts.

More subtly, games decided in overtime are less likely to be won by the better team, since there are only five minutes in which the better team must emerge instead of 60. The same is true for shootouts. Indeed, shootouts are so random, or at least so uncorrelated with the other skills of the team, that Gabriel also found evidence that teams might be playing for the tie in overtime as well, hoping to get to the shootout. Overtime scoring is also down this year, by the equivalent of about one goal per 60 minutes.

All this means that the winners are more random, which means the standings are more random, which means that more teams are in the hunt for a playoff spot. As I write this, only one team in the East is more than five points out of the playoffs, and only one team in the West is more than ten points out.

And even though the result of any given overtime may be mostly random, it's likely that the better teams are less likely to appear in overtimes, being able to beat their opponents in the first three periods. In that case, the effect is to award the "extra" third points to worse teams more often than better teams, which also has the effect of compressing the standings. I'm not actually sure how large an effect this is, but logic suggests that it must happen to some extent.

On the con side ... I guess the hockey would be a bit less exciting late in the game when both teams are playing for the tie. And, because of the randomness, you might get the most talented teams a bit less likely to move up in the standings. Those seem like pretty small cons.

So I can see why the league likes the new system ... the pros do seem to outweigh the cons, as far as fan interest goes.

But I hate it.

Part of the reason is personal taste; I don't like the idea that you can lose and still gain ground in the standings. And I don't like the idea that mediocre teams are collectively rewarded for not being able to put the other team away. As a statistician, it bugs me that some games are worth three points and some games worth two.

But the more important reason is that I don't like the incentives. When teams go into the game, from the very beginning, hoping it winds up a tie ... well, I think that's just wrong. And when the tie game gets to the third period, between two teams both fighting to make the playoffs, you know that both coaches are seriously thinking about getting to overtime, and securing themselves at least one point. You know that, if they could, they would collude to keep the game tied until the end of regulation. And, for all I know, there might already be some kind of unspoken agreement that certain things won't be done in the last few minutes of a tie game. I'm not expert enough to spot that, but the lower scoring in tie games speaks for itself.

The bottom line is that with the extra point available for the tie, teams aren't playing every instant of every game with the same strategy and desire to win that they would otherwise. And that can't be good for the game.