Manufactured Runs

The Deconstruction of Falling Stars

Before we can talk about Derek Jeter (and yes, I think there’s still something to say about Derek Jeter that you haven’t already heard this season), we should probably clarify which Derek Jeter we’re talking about. There really are two Derek Jeters—the one who exists in fact, and the one who exists in myth.

The actual Derek Jeter is interesting enough as a player that one wonders why the myth was necessary—always an exceptional hitter, Jeter has always been a player who could’ve had a job on any team in the league. He will go into the Hall of Fame on the first ballot, and nobody will bat an eye. Then there’s the Captain—the athlete whom ad agencies consider akin to Tiger Woods and Roger Federer. The player so exceptional that he can displace a generational talent like Alex Rodriguez from his natural position.

At one point, Jeter (both Jeters, really) was the brightest star in the firmament of the Yankees infield, made to seem even brighter by virtue of being flanked by a Brosius/Ventura-caliber player at third and whatever member of the Yankees’ Baskin-Robbins rotation at second was the current flavor of the month. Now, as he declines, he finds himself flanked by Alex Rodriguez and Robinson Cano, who are both the sort of stars that Jeter once was and that his contract would still suggest he is. (Rodriguez, of course, is on the decline as well, but in at least one important respect baseball stars are very unlike real stars: the bigger they are, the longer they tend to stay on the main sequence as they decline.)

So let’s do what the Yankees couldn’t, or wouldn’t, this offseason, and put aside Jeter the legend when considering the future of Jeter the player. In short, given what we know about Derek Jeter, what can we expect him to do going forward? I’ve said most everything I have to say about Jeter as a fielder, so let’s think of him the same way we will in posterity—as a hitter first and foremost.

Killing Worms
The effect that just one game can have on a player’s batting line is a vivid reminder of how early we still are in the season. As recently as the fourth of the month, Jeter was sporting an abysmal .212 True Average. By the seventh, he had worked that up to a slightly less cringe-worthy .226. (As a side note, this is the point at which I began to write this article, dear reader.)

Then, after just one game, he raised his TAv almost 25 points to .250. Now, despite a two-homer outburst on Sunday, nobody is about to confuse Jeter with a power hitter—this year especially, he’s been driving the ball into the ground quite a bit. That’s kept his BABIP and power-on-contact numbers depressed. So now the question becomes—what does the season to date tell us about Jeter going forward?

Keeping It Real
I’m sure you’re familiar with the sort of article where a player’s stats are listed, and then after the author waves a dowsing wand over his splits page and some component breakdowns, the player's performance is declared to be either “for real” or a clever forgery. One of the most popular ways to determine whether or not a player’s performance is “for real” is to look at his line drive rate. If the line drive rate is high, it supposedly means a high BABIP is “real” and a low BABIP isn’t; if the line drive rate is low, it supposedly means a low BABIP is “real” and a high one isn’t.

Let’s attempt to answer a simple question: given a player’s BABIP, how much additional information does his line drive rate give us about his rest-of-season performance? What I did was split the stats of every batter season from 2003 through 2010 into two pools—the first 100 plate appearances, and everything after that. I ran a weighted regression to predict rest-of-season BABIP using both BABIP and LD percentage from the first 100 PAs. The resulting equation looks like this:

BABIP_AFTER = 0.258 + 0.122 * BABIP_PRIOR + -0.022 * LD_PRIOR

Now, there is a correlation of .42 between BABIP_PRIOR and LD_PRIOR, but that’s not enough for us to worry that the correlation is affecting the standard errors of the regression coefficients, so we can interpret these in a rather straightforward fashion. BABIP is very clearly a better predictor in this regression than line drive rate. Also notable is the significance of the coefficients, measured in terms of p-value (roughly, the probability of obtaining a result at least this strong by chance alone, without the variable having a meaningful relationship with the value being predicted). For LD_PRIOR, the p-value is 0.1215, above the typical rule-of-thumb thresholds for “statistical significance” (normally .05 to .10). If LD_PRIOR is omitted from the regression, it’s essentially a wash as to which model is better: the model with LD_PRIOR has an r-squared of 0.028, compared to 0.027 for the model where it is omitted. (The model where LD_PRIOR is omitted does have a lower standard error, but the difference is no more significant than the difference in r-squared.)
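For readers who want to see the mechanics, here is a minimal sketch of that kind of comparison: fit a weighted regression with and without the line drive term and compare r-squared. Everything in it is an assumption for illustration. The data are randomly generated, not Retrosheet or BIS figures; the weights (rest-of-season plate appearances) are a guess, since the weighting used above isn't specified; and the helper name `weighted_ols` is hypothetical. It does not compute p-values.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Weighted least squares of y on X (intercept added).

    Returns (coefficients, weighted r-squared).
    """
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    sw = np.sqrt(w)                             # scale rows by sqrt(weight)
    beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
    resid = y - X1 @ beta
    ybar = np.average(y, weights=w)
    r2 = 1.0 - np.sum(w * resid**2) / np.sum(w * (y - ybar)**2)
    return beta, r2

# --- Synthetic batter seasons (made up, for illustration only) ---
rng = np.random.default_rng(0)
n = 2000
talent = rng.normal(0.300, 0.015, n)             # invented "true" BABIP skill
babip_prior = talent + rng.normal(0, 0.060, n)   # noisy early-season sample
ld_prior = 0.19 + 0.3 * (babip_prior - 0.300) + rng.normal(0, 0.04, n)
pa_after = rng.integers(50, 600, n)              # assumed regression weights
babip_after = talent + rng.normal(0, 0.5, n) / np.sqrt(pa_after)

# Model with both predictors vs. model with BABIP_PRIOR alone
beta_full, r2_full = weighted_ols(
    np.column_stack([babip_prior, ld_prior]), babip_after, pa_after)
beta_reduced, r2_reduced = weighted_ols(babip_prior, babip_after, pa_after)

print("with LD_PRIOR:   ", beta_full.round(3), round(r2_full, 4))
print("without LD_PRIOR:", beta_reduced.round(3), round(r2_reduced, 4))
```

Because the smaller model is nested inside the larger one, the larger model's r-squared can never be lower; the question is whether the gain is big enough to matter.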

But is this study repeatable, or is it simply caused by particulars of the dataset and years I chose? As an alternative, let’s examine the Retrosheet batted ball data (sourced from Project Scoresheet and later the Baseball Project) from 1988 through 1999. The regression formula should look pretty familiar:

BABIP_AFTER = 0.251 + 0.124 * BABIP_PRIOR + -0.007 * LD_PRIOR

In testing on the Scoresheet data, the p-value for the LD_PRIOR coefficient jumps to 0.53. After omitting LD_PRIOR from the regression, r-squared and standard error remain unaffected out to three significant digits.

In other words, once you know a hitter’s BABIP in 100 plate appearances, his line drive rate offers no additional predictive value on his future BABIP. Whenever I point out flaws in the collection of batted ball data, someone will inevitably pipe up to say that it’s better than nothing, but in this instance, at least, the best that can be said is that it’s indistinguishable from nothing.
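To put a number on "indistinguishable from nothing," the two fitted equations can be applied to a hypothetical hitter. This is only a sketch: the function name and the sample inputs are invented, and it assumes LD_PRIOR is expressed as a fraction (0.19 rather than 19), which is the only reading under which the coefficients give sensible BABIPs.

```python
def predict_babip(babip_prior, ld_prior, intercept, b_babip, b_ld):
    """Apply BABIP_AFTER = intercept + b_babip*BABIP_PRIOR + b_ld*LD_PRIOR."""
    return intercept + b_babip * babip_prior + b_ld * ld_prior

# Coefficients as published in the two regressions above
COEFS_2003_2010 = (0.258, 0.122, -0.022)
COEFS_1988_1999 = (0.251, 0.124, -0.007)

# Hypothetical hitter with a .300 BABIP through 100 PA,
# once with a low line drive rate and once with a high one.
low_ld = predict_babip(0.300, 0.15, *COEFS_2003_2010)
high_ld = predict_babip(0.300, 0.25, *COEFS_2003_2010)

# Same line drive rate, but BABIP swinging from .250 to .350.
low_babip = predict_babip(0.250, 0.19, *COEFS_2003_2010)
high_babip = predict_babip(0.350, 0.19, *COEFS_2003_2010)

print(f"LD swing of 10 points of rate moves the forecast by {low_ld - high_ld:.4f}")
print(f"BABIP swing of 100 points moves the forecast by {high_babip - low_babip:.4f}")
```

Under these assumptions, a huge swing in line drive rate moves the rest-of-season forecast by about two points of BABIP, while an equally large swing in early BABIP moves it by about twelve.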

The problem is that this sort of analysis spreads like a weed, and what I mean by a weed is this: “a valueless plant growing wild, especially one that grows on cultivated ground to the exclusion or injury of the desired crop.” What weeds do is consume air and sunlight at the expense of plants that bear useful fruit.

Let’s continue on with the larger ’88-’99 dataset for a minute, to illustrate what I’m talking about. If I omit LD_PRIOR from the model, I get an r-squared of 0.032. If I instead omit BABIP_PRIOR, the r-squared drops all the way to 0.006. So while adding LD_PRIOR to the model doesn’t make the results worse, using LD_PRIOR instead of BABIP_PRIOR certainly does. Unfortunately, there is a strain of analysis that seems to prefer those sorts of “expected __________” estimators to the exclusion of actual results. That’s when it turns into a weed—when it crowds out other types of analysis that are more beneficial.

Another thing to be careful of when dealing with these sorts of figures is changing definitions over time. Let’s scoot back to the more recent dataset, as it allows us to make comparisons with another dataset—the Baseball Info Solutions batted ball type data, as presented on Fangraphs. I took league averages in BABIP, LD rate, and BIS LD rate, and normalized them so that one was average for that stat, to make comparisons over time:

Changes in BABIP and in the Retrosheet line drive data at the league level are pretty well correlated, at 0.89. That makes intuitive sense: as BABIP rises, the number of balls that look like liners should go up. The correlation for the BIS data, on the other hand, is bizarre: it’s -0.57, which would seem to imply that as BABIP goes up, the number of well-hit balls decreases. What’s happening is that there’s a common factor: time. The league trend in BABIP just happens to be running in one direction, while the trend in BIS line drive rate over time is heading in another. The definition of a “line drive,” according to BIS, is changing over time.
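A toy illustration of how a shared time trend can manufacture a correlation like that. The two series below are invented (they are not the league figures discussed above): both share the same year-to-year wiggle, but one drifts up and the other drifts down. Removing a straight-line trend from each is my own way of showing the effect, not a method used in this article.

```python
import numpy as np

years = np.arange(2003, 2011)
t = years - years[0]

# Invented league-level series: a shared year-to-year wiggle,
# plus opposite linear drifts over time.
wiggle = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
babip = 0.295 + 0.001 * t + 0.002 * wiggle   # drifting up
ld_rate = 0.220 - 0.005 * t + 0.004 * wiggle  # drifting down

# Normalize so that 1.0 is the average for each stat
babip_norm = babip / babip.mean()
ld_norm = ld_rate / ld_rate.mean()

raw_corr = np.corrcoef(babip_norm, ld_norm)[0, 1]

def detrend(y, t):
    """Subtract the best-fit straight line in t from y."""
    slope, intercept = np.polyfit(t, y, 1)
    return y - (slope * t + intercept)

detrended_corr = np.corrcoef(detrend(babip_norm, t), detrend(ld_norm, t))[0, 1]

print(f"raw correlation:       {raw_corr:.2f}")
print(f"detrended correlation: {detrended_corr:.2f}")
```

In this made-up case the raw correlation is negative even though the two series move together in every single year once the drift is taken out.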

The takeaway is that a model based on BIS data is going to be far more sensitive to the years it is built on than a model based upon the Retrosheet data: a regression based on 2003 – 2006 BIS data will return different results than a regression based on 2007 – 2010 data, much more so than with the Retrosheet data. It’s also a problem if you want to compare trends in a player’s performance, as you have to sort out whether or not a change in LD percentage means anything more or less than a change in how the data is collected.

You’ll note that I made it all this way without mentioning bias in batted ball scoring, which is yet another problem to contend with when using this sort of observational data. I’ve restricted this discussion to batted ball stats, but any sort of observational data will have the potential for these kinds of issues; I’ve found plenty of problems with pitch location data from BIS as well, for instance. Not that you need observational data to do this sort of bad analysis: you can get to much the same place by indiscriminately bandying about close-and-late splits, or what have you. The key is to build a model of how things work, test it, and use it, not to cherry-pick players who are having unusual seasons and try to come up with an explanation for them.

What PECOTA Said
So what’s the proven model for predicting future performance? In short, past results—and the more of them, the better. In Jeter’s case, prior to the season he had a PECOTA weighted mean forecast of .264—in essence, that of a league-average hitter. To date, he’s had a .250 TAv. But what if we combine the two of them? In other words, how far does his season-to-date performance move the needle on our expectations for his immediate future?

Jeter’s forecast reflects roughly 3,500 past plate appearances—because PECOTA is making some behind-the-scenes adjustments, the effective PA quantity in PECOTA is going to be an estimate, not an exact count. His season-to-date performance, on the other hand, reflects only 136 PA. I’m sure you can see where this is going—a weighted average of the projected and observed TAv by the number of PAs isn’t enough to budge the needle any. (A slightly more sophisticated method of averaging the two together yields similar results—an expected TAv going forward of .263.) It was unlikely that Jeter was going to continue to slug under .300 for an entire season, and Jeter made this point more persuasively than I ever could have while I was in the midst of writing this article.
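The simple version of that arithmetic, as a sketch: weight the projected and observed TAv by their plate appearances. The function name is made up, and the 3,500 PA figure is the rough estimate given above, not an exact count. The "slightly more sophisticated method" isn't described here, so it is not reproduced.

```python
def pa_weighted_tav(proj_tav, proj_pa, obs_tav, obs_pa):
    """Average two TAv figures, weighting each by its plate appearances."""
    return (proj_tav * proj_pa + obs_tav * obs_pa) / (proj_pa + obs_pa)

# Preseason forecast (.264 on roughly 3,500 effective PA)
# blended with the season to date (.250 in 136 PA).
blended = pa_weighted_tav(0.264, 3500, 0.250, 136)

print(f"blended TAv: {blended:.4f}")
print(f"shift from the forecast: {0.264 - blended:.4f}")
```

With the season to date making up less than 4 percent of the combined sample, the blended figure lands within about half a point of the preseason forecast.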

But… let’s talk about those things I said I wasn’t going to talk about to start off with. Jeter is being paid to be something more than a league-average hitter with a terrible glove. The positions he could conceivably be moved to in order to hide his poor glove are all occupied by better players and would have the side effect of exposing his underwhelming bat even more. And the scrutiny is intensified by the decision that Jeter should lead off, allocating more plate appearances to him than to any other Yankee hitter.

So will he hit? Some. Probably not enough. Do the Yankees have any other options? Maybe. Probably not worth the trouble, at least at this point. (Of course, this is something of a self-fulfilling prophecy—the Yankees are painted into a corner now, but they spent most of the offseason wielding the brush.) Jeter is not, as the kids say, “done.” (Maybe, in some sense, it would be better if he were—then we wouldn’t have to watch.) There’s still some life in him. What is done—for now—is the Jeter of myth; right now, there’s just too much distance between the player who’s left and the player of our imaginations. That Jeter may, and probably will, come back—after the other Jeter retires and we allow these last few years to fade, the great Jeter will be remembered again. How could he not be? We have the stats, we have the highlight reels. We were there.

What about his run-up to 3,000 hits? Most players approaching a major milestone tend to "press" and slump leading up to it. Granted Jeter had a declining year last year, but any chance he relaxes a little after getting #3,000 and improves?

I think you'd also have to note that players reaching major milestones like this are near the end of their careers, and their performance is on the downside of the bell-curve. What looks like a slump is probably just a reflection of their new performance level.

Besides, we're not talking about "players," we're talking about one particular player, who may or may not perform as the average. Don't forget that PECOTA is just a projection; it's predictive only in the aggregate.

Of course we heard this in 2002, 2005, and 2008. If he continues to regress I wouldn't be shocked. However, if he got hot and had a 2 month run this summer where he hit .310 .390 .435 I wouldn't be shocked either.

Well I guess that's the point. One bad year just isn't enough to say that he's done. It's enough to say a trend may be beginning to develop, but not quite enough to call him DOA. So what he did last year is of some relevance, but to say that one year removed from a .334 .406 .465 season he's now incapable of sustaining a run close to that for 60 days is a little premature IMO.

"Jeter is being paid to be something more than a league-average hitter with a terrible glove."

I'm not sure it's meaningful to use Jeter's whole salary (which does indeed pay him as an above-league-average player). Whether or not one places any value on 'intangibles' like leadership and clutch-iness, I think the Yankees are paying Jeter at least in part because they were unwilling to call his bluff and weather the PR hit they would have taken for NOT signing him. In other words, I think the Yankees felt that the marginal cost of not keeping him was substantial. And if that's an accurate analysis, then whatever that marginal cost is should be ignored when calculating Jeter's performance relative to salary as it was a sunk cost that they would incur no matter what.

Am I the only person who suffers from MEGO whenever the articles mention "p-value" and "r-squared" and references of that nature?

To be honest, I know these things are statistically important but I also have absolutely no idea what they are. Is it asking too much for BP to write a simple article explaining in layman's terms (and I cannot stress that point enough) precisely how and why a "weighted regression" is done or how the other aforementioned figures are calculated?

It would be an immense help if BP produced a column showing how the statistical sausage is made--with as little jargon as possible--and kept it tacked to the top of the board so we could reference it at any time. I know I'd get way more out of Colin's work, for example, than I do now when I'm glossing over entire paragraphs at times.

Just a suggestion that I believe would greatly enhance user-friendliness of this site. Thank you, BP.