Expanded Horizons: Two Dogmas of Sabermetrics

In 1951, W.V.O. Quine published his landmark paper, “Two Dogmas of Empiricism.” His goal was to disprove a certain type of empiricism that was trendy among analytic philosophers in the early 20th century. That set of beliefs, logical positivism, sought to deny that any statement that was not empirically verifiable had any meaning whatsoever. What Quine showed was that the two eponymous beliefs, which he derisively called dogmas, were necessary to logical positivism but also false. That paper remains one of the most important works of 20th-century philosophy because it demonstrated the limits of a system of knowledge based only in observable fact and logic.

There exists, I think, a certain set of analysts in baseball who adhere to a sort of logical positivism. That belief is demonstrated by the drive to completely separate outcomes from processes, despite the fact that such cleavage is not actually possible. For example, to eliminate the effect of luck is one of the guiding goals of data-driven performance analysts. Here are two theorems that reflect that goal:

Pitchers have only limited control over outcomes on balls in play

Evidence of clutch performance as a repeatable skill does not manifest itself in any measurable statistics

There are others, too, that are more minor. But these two theorems in particular—commonly accepted among sabermetricians and fellow travelers—are widely rejected by analysts and fans more generally. In each of the two cases, well-reasoned, data-backed arguments from very smart people go all the way to the water’s edge. But the data stops short of allowing analysts to defend conclusive statements—dogmas, if you will—like “pitchers have no influence over the rate of hits on balls in play” and “there is no such thing as clutch hitting.” Such certainty would perhaps be desirable—it’d certainly make analysis easier—but it is simply not supported by the data available.

Much (most? all?) of the easy-to-remember wisdom imparted by “stats geeks,” as the uninitiated seem fond of calling them, has come from attempts to separate processes from outcomes. Most of these insights have been useful, but overreliance on them can be deadly. For precisely the same reasons that purely outcome-determined statistics incorporate luck, any newer metric that retains a scintilla of outcome (as opposed to process) data must be regarded with skepticism. But that’s just the problem: every single statistic we have, from pitcher wins to SIERA, is tinged with the bias of knowledge of outcomes.

First, we have to remember why any of this matters. After all, outcomes are what matters! It’s not how, as they say, it’s how many. Right?

No Coincidence You Can Also Open a Can of Prince Albert

Let’s take a recent example. Sunday, Prince Albert completed the Pujolsian feat and hit three home runs in a game. It wasn’t the first time he’d hit a hat trick (he did so twice in 2006 and once in 2004), but hitting three home runs in a single game is rare enough even for inner-circle Hall of Famers. So what was the probability that he would hit three home runs Sunday? This, it turns out, is a very difficult question to answer. Although we know for sure it came about, we cannot say that means there was a 100 percent, 90 percent, or even 50 percent chance it would happen. Of course, even if there was a .001 percent chance, it would happen on average once every 100,000 times, and it could be that we just so happen to live in that particular world.

In fact, we want to know the forward-looking probability from two days ago that he would hit three home runs. Such a figure might incorporate his career home-run rate (about 6 percent of plate appearances—can you believe he has 378 home runs?), his three-year weighted average home-run rate, some medical data about his back that day, the home-run tendencies of the likely pitchers, the park, the temperature, the month, the home plate umpire, what he ate for breakfast, and of course whether or not he is in fact a machine.
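As a rough illustration of how crude even the simplest forward-looking estimate is, the sketch below treats every plate appearance as an independent trial at the career rate of about 6 percent quoted above. The five plate appearances per game, the independence of those plate appearances, and the function name are all illustrative assumptions; none of the other variables on the list (park, pitcher, health, breakfast) enter the calculation at all.

```python
from math import comb


def prob_at_least(k_min: int, n_trials: int, p: float) -> float:
    """Binomial tail: P(X >= k_min) for n_trials independent trials with success rate p."""
    return sum(
        comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(k_min, n_trials + 1)
    )


# Career home-run rate of about 6 percent of plate appearances (from the text).
HR_RATE = 0.06
# ASSUMPTION: five plate appearances in the game; the real number varies.
PLATE_APPEARANCES = 5

three_hr_prob = prob_at_least(3, PLATE_APPEARANCES, HR_RATE)
print(f"P(3+ HR in {PLATE_APPEARANCES} PA) = {three_hr_prob:.4%}")
```

Under these assumptions the figure comes out to roughly 0.2 percent, but it moves noticeably if the rate or the number of plate appearances is changed, which is exactly the difficulty: the answer is only as good as the inputs chosen.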

The point is that there are dozens of variables that need to be considered, and any model that uses only some will be merely an approximation. Even if we put everything we know into a roux, we are still relying on outcome data: career home-run rates are themselves based on previous outcomes, the probabilities of which were just as complex. The eager DB-jockeys among you may be clamoring to regress the rate to the mean, but now we have to figure out which mean is most appropriate—30-year-old Hall of Fame first basemen? Average NL first basemen? Average major-league hitters?—and then we have introduced uncertainty because of our choice of population to which to regress Albert.
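To make the "which mean?" problem concrete, here is a minimal sketch of one common shrinkage form, in which the observed rate is blended with a population mean as if the hitter had `k` extra plate appearances at that mean. Every number in it is a placeholder: the 650-plate-appearance sample, the constant `k`, and the three population means are invented for illustration and are not estimates of anything.

```python
def regress_to_mean(observed_rate: float, n_obs: int, pop_mean: float, k: float) -> float:
    """Shrink an observed rate toward pop_mean; k acts like k extra PA at the population mean."""
    return (observed_rate * n_obs + pop_mean * k) / (n_obs + k)


# ASSUMPTION: hypothetical sample, 650 PA at a 6 percent home-run rate.
OBSERVED_RATE, N_PA = 0.06, 650
# ASSUMPTION: made-up shrinkage constant, not an estimated value.
K = 200
# ASSUMPTION: placeholder population means, invented for illustration only.
candidate_means = {
    "elite first basemen": 0.050,
    "average NL first basemen": 0.035,
    "average major-league hitters": 0.027,
}

estimates = {
    label: regress_to_mean(OBSERVED_RATE, N_PA, mean, K)
    for label, mean in candidate_means.items()
}
for label, est in estimates.items():
    print(f"{label:30s} -> {est:.4f}")
```

The same observed rate yields a different "true talent" estimate for each comparison population, so the choice of population is itself a modeling decision that carries its own uncertainty.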

Hey, He Might Have Hung That Slider On Purpose

That was a pretty simple example, don’t you think? But there are areas where we can say with more confidence that we are measuring skills and not outcomes. Take pitching, for example. There is a panoply of rate-based metrics out there designed to disassociate a pitcher’s value from the outcome after the ball leaves the hitter’s bat. The basic DIPS components are strikeouts (K), walks (BB), and home runs (HR), and various formulations expand on that to include ground balls (GBs). Some have gone further, relying on batted-ball types like line drives and fly balls (even regressing those batted-ball rates to the league mean), but all these stats seek to assign credit and blame to pitchers only for those aspects of pitching over which they have control. That is to say, to assign credit and blame only for processes and not outcomes.

But this valiant effort, too, has logical limits. After all, strikeouts and walks are prone to fluctuation based on opponent strength, park factors, and umpire tendencies. Even putting aside disagreement over whether it is more appropriate to use actual home runs or fly balls as the input, it ought to be clear that either is subject to random variation as well as park and other effects. Again, we can correct for most park factors, and we can regress rates to an appropriate mean, but then we are making choices that introduce uncertainty (even as we remove bias).

There is an even bigger problem with our pitching statistics. The fact that we don’t know where a pitcher/catcher battery intends the ball to go means that we can’t separate the pitcher’s approach from his command. It’s impossible to say whether a changeup inside was what the catcher wanted or whether it was meant to be down and away. Because many times the best pitch is the one that is unexpected (even if it might be very hittable if the hitter were looking for it), we can’t even necessarily assume that a pitcher missed simply because he threw a pitch out over the plate. Perhaps we can figure this out for individual pitches based on video data, but aggregating this data over a whole season is downright impossible without standardized camera locations.

The Pure Gardens of Outcomes

The two areas in which we judge based on outcomes the most are the two areas that are the trendiest in the performance analysis world: baserunning and defense. These are two areas in which we simply don’t have the data to make inferences about processes.

In the case of baserunning, it is impossible to separate not only the process from the outcome (as in the unlucky cases where the wind carried the throw unexpectedly, or the dirt had a soft spot), but also the decisions of the players from the decisions of the coaches, as well as the decision-making of the player from the in-game quickness of the player. All of these factors may conspire to make a baserunner seem better or worse than he truly is based on his processes.

Similarly in the case of defense, it is very hard to say even what outcomes resulted, at least at the individual level. It’s harder still to say when good outcomes were the result of good or bad processes, at least in a way that is aggregated over the course of a whole season. Invariably the raw data itself relies on human stringers who vary in their interpretations of the location and type of the batted balls. Even team-level data, like defensive efficiency, suffers from this problem because it only takes into account those balls that were in fact caught, not those balls the team deserved to have caught.

Question of the Day

I’m not suggesting that moving toward processes and away from outcomes isn’t a good thing. But I am suggesting that we ought to apply a similar level of skepticism to those halfway solutions we have at the moment. Perhaps the worst thing a data-loving fan could do is wait for perfect information (say in the form of Hit-f/x), because it not only isn’t coming, it isn’t possible. Am I wrong? Is knowledge of pure processes and skills possible to separate from knowledge of outcomes? Is the separation between truths that are analytic and those that are synthetic workable?

a lot of the problem is that when you claim outcome/process independence, you are likely to make totalizing claims based on outcome alone. instead of claiming to know something valuable about what's going on, you want to claim that based on these and these data, the situation is reducible to your model. however, this is obviously not the case. an awareness of process would mean that what we claim to know about baseball is limited to the "this is a valuable insight" variety. it encourages more refining of the analysis

Nice beginning reference. Bill James, I believe, concurred with this idea in his recent speech to the statistics department at Kansas University. The more we learn, the more we realize there is to learn. Thanks, good material.

I don't really disagree with most of what you are saying, but I don't really see your point. Yes we try to separate outcomes from processes and yes we do an imperfect job at it. Until we have a perfect "universe simulator" that includes ALL the variables (including those regarding the minds of the players) then we will always fall short of being able to completely separate the processes from the outcomes. That doesn't mean we can't do a pretty good job of estimating. Our estimations (even estimations of true talent) have uncertainties, and there are even uncertainties around those uncertainties ad infinitum. We should definitely be aware of that, but that shouldn't stop us from trying to improve and it doesn't give sabermetrics' detractors a free pass.

>>>>>>We should definitely be aware of that, but that shouldn't stop us from trying to improve and it doesn't give sabermetrics' detractors a free pass.<<<<
That's really the point of the article though. No one gets a "free pass." Both the 100 percent saber detractors and the 100 percent "stat geeks" have weaknesses in their arguments that are not insignificant.
Like slideric above said (quoting Bill James): "The more we learn, the more we realize there is to learn."
Recently I read someone who said pitcher wins are a "meaningless" stat. To me, this kind of thinking is the epitome of the kind of thing Tommy is talking about. Yes, wins aren't nearly as important as they used to be. And in a small sample size like a single game or even a single season, they don't have as much meaning as they do over the course of 4 or 5 seasons or a career. Not a lot of guys win 10-15 games a year. And if they do, it has meaning and value. Even if it isn't a saber number.
I think it's important to understand that we can't "know" everything and also understand how that affects the new metrics we use.

Great article. I love BP, but bristle when I see things that imply that player performance not in line with DIPS or BABIP is simply attributable to luck. This article does a good job in reminding us that variations from these process statistics can be caused by many factors that are either difficult or impossible to add into a regression equation.

I'm not quite sure how the analogy between positivism (which you've defined in terms of the verificationist principle, so I presume this is the point of contact) and sabermetrics is supposed to work.
So you are right, the verificationist principle of cognitive significance states that meaning is verification conditions. The analogy seems to be that sabermetricians have classified processes as unverifiable, except insofar as we can get at them through outcomes. The point of your piece is to suggest that, even if we cannot employ rigorous means to study these processes, that does not mean that they are meaningless, etc.
If that's the point, I agree wholeheartedly. Note a difference though, processes are *not unverifiable*. Here's a really simple way to get some evidence - we can ask people what they intended to do. Of course, as psychologists have long shown, we are not very good at introspecting our own reasons for acting. So perhaps we can conduct some experiments, etc. Now, nobody does this kind of work because it is not really important, has huge costs, and would be really difficult. The point is that the claims are verifiable.
This is important, because then the point of view is not really positivistic. It is an example of the standard methods of idealization in the sciences. There is a ton of data out there - and we do not know how to fully integrate it all in any of our baseball theories. So what do we do? Well, we idealize. We look at the data that we can do robust work with, and see what kinds of insights that gives us. We can then test those results by making predictions, etc. Some predictions will fail, marking out a poor theory. Some predictions will fail, marking out the limits of our idealization.
The discussion we need to have, and I think it is the one the author is after, is between these two cases. When a prediction fails, do we have good grounds for classifying it as a consequence of our idealization, or as falsifying evidence for our theory? If this is right, then the view is not positivistic (though as a philosopher, I applaud bringing Quine into a popular piece!), but actually an example of a laudable scientific practice.

iirc, the main argument in the two dogmas essay is an attack on the reduction of meaning into claims of empirical experience. similarly, this article is an attack on the reduction of baseball into simple outcome data

Agree with Mangey re: positivism; the criterion of meaningfulness is verifiability in principle, not actual verification. The connection to Two Dogmas seems weak. Quine said that the analytic/synthetic distinction (truths of reason, knowable by their meanings alone, independent of sense experience, versus those knowable on the basis of particular experiences) was untenable. *Anything* can be rejected on the basis of sense experience, and because of meaning and evidence holism, there is no such thing as *the* sense experience that would confirm or disconfirm any particular claim. Claims face the tribunal of sense experience not individually, but collectively.
That said, I enjoyed the article, and the point is well taken, re: sabredogma.

Since I skimmed but didn't read the article it may be appropriate to conclude the logical step is not to read this.
But I love the concept. IMHO (which is incorporated into what I do for a living) the point of 'sabermetrics' is to challenge paradigms & get to the truth. I love that BP has hired someone who is challenging 'sabermetrics' paradigms to get to the truth.
Will other BP authors now consider the possibility that sometimes Joe Morgan knows what he is talking about (and sometimes he doesn't)?

The statements "pitchers have no influence over the rate of hits on balls in play" and "there is no such thing as clutch hitting" are manifestly false. Rigorous statistical analysis has already put the lie to both these claims, and anybody with a genuine interest in sabermetrics should feel no discomfort in rejecting them.
Now, to say pitchers have *limited* control on BABIP is true and very useful. As far as clutch hitting, see for example Tango's analysis in The Book. There is evidence of statistically significant (if very small) skill in clutch hitting (or lack thereof).
As to the specific question of "which mean?" to regress Albert to: without doing my own in-depth statistical analysis to make the decision, I would say average major-league hitters at corner infield and outfield positions, adjusted for era. You might get a somewhat better number by limiting the sample to players who accumulated at least 5-6 years of service time. I answer this question not because it's the central question of the article, but because you seem to have the attitude that questions such as this are somehow nebulous or unanswerable. On the contrary: the quality of data improves over time, but at any given time you can generally find the right answer to this sort of question by looking at the data you *do* have available.
I don't know that "skepticism" is the right word: certainly we can't take as absolute gospel any predictive numbers that come out of an imperfect dataset, and we can always expect that as data improves predictions will improve, but we can still produce good, useful numbers in response to a lot of questions without ever even considering the idea of treating them as gospel in the first place.

This article and the resultant comments are probably a little over my head, but I think I get the gist, that the analysis of processes is inherently limited because we are using previous outcomes to analyze the processes, and those outcomes are based on myriad factors.
One way that maybe this makes sense to me is with respect to OBP and walks. I don't think it's necessarily valid to say, "that guy's got a great OBP, we should sign him" - there are numerous reasons why a player can have a good OBP, some desirable, and some maybe not.
The player could have a keen batting eye, he could be a dangerous hitter that pitchers don't want to pitch to, or he could be so selective that he walks (and strikes out) a ton, and most of his hits are XBH (thinking Adam Dunn).
At some point, however, the pitcher has to be afraid to throw the hitter good pitches, because ultimately, the pitcher could just pump fastballs down the middle of the plate.
My point is, that the process of being a high-OBP hitter at some point is the result of pitchers not wanting to throw good pitches to them (because of the possibility of a bad outcome).
This I think is part of the problem with a PURELY statistical-based analysis of players...at some point the players have to have actual tools that translate into favorable stats. The best organizations use both.

"The best organizations use both."
I think even the geekiest of stat geeks will agree that evaluations should be made with statistical analysis AND scouting. Personally I know that I cannot do the latter, so my analyses will always be incomplete. I think there is a difference though between recognizing one's own ignorance of a complementary technique and offering complete disdain for the technique (and I think both sides have been guilty of the latter).

@Nate Sheetz: I respectfully disagree about regressing to the mean. Selecting a population that approximately, but not precisely, matches the player in question, and then adjusting estimates on that basis, is more likely to introduce bias than to reduce it. I believe that this generally is going to be the case in any realm with a huge number of variables, most of which we cannot measure. The true parameters underlying the performance of even the same player are different from year to year, let alone a different mix of players playing in different circumstances for each year. One must accept that some variation simply is not going to be accounted for. Regression to the mean is best understood as a phenomenon to account for where appropriate, not an adjustment to be made in every analysis.