Spitballing

Playing with Playing Time

Projecting playing time is hard. When news hit that Adam Wainwright was almost certainly out for the year, forecasters went into fits. There’s no way to predict Tommy John surgery* for a pitcher coming off back-to-back 230-inning seasons. So you can say there’s a five percent chance he gets hurt and dismiss that possibility as too unlikely to weigh into your projection, or you can drag your forecast down by five percent. Regardless of which course of action is better, the only recourse after the fact is to “cheat” by manually updating the number of projected innings when such news comes out.

PECOTA, Dan Szymborski’s ZiPS, and Brian Cartwright’s Oliver all use manually adjusted depth charts. PECOTA also uses a simple average of past seasons for individual projections. Victor Wang tried his hand at projecting playing time using some more advanced techniques. But for now, I’d like to keep it simple.

Marcel the Monkey, the world’s most baseliniest projection system, projects Wainwright at a 2.98 ERA in 198 innings. Marcel was developed by Tom Tango to serve as a replacement-level forecasting system, as its methodology is entirely open source. Marcel is quite sophisticated in some ways, as it understands regression to the mean, handles the weighting of prior seasons, and includes an aging curve. While I wouldn’t trust myself to guess a random pitcher’s ERA over a projection system without using a computer of my own, I’m positive that I can do better in terms of innings pitched. I’m also positive that I can come up with an algorithm just as basic as the one that Marcel uses to better predict playing time on average.

Marcel starts by giving all batters 200 plate appearances. It then adds half of a batter’s plate appearances from the previous year and 0.1 times his total from two years prior. For pitchers, it uses the same weights, but with 60 innings for starters and 25 for relievers as the constants. I decided to use a similar dataset and see how easily I could beat the monkey with a couple of simple rules of my own.
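As a rough sketch, the playing-time rule described above fits in a single function. The function and argument names are my own for illustration, not taken from any Marcel source; only the constants and weights come from the description above.

```python
def marcel_playing_time(prev_year, two_years_ago, constant=200.0):
    """Marcel-style playing-time projection, as described in the text.

    constant: 200 PA for batters, 60 IP for starters, 25 IP for relievers.
    prev_year / two_years_ago: PA (batters) or IP (pitchers) in N-1 and N-2.
    """
    return constant + 0.5 * prev_year + 0.1 * two_years_ago


# Hypothetical batter: 600 PA last year, 550 PA the year before.
batter_pa = marcel_playing_time(600, 550)

# A starter coming off back-to-back 230-inning seasons:
# 60 + 115 + 23 = 198 innings.
starter_ip = marcel_playing_time(230, 230, constant=60.0)

print(batter_pa, starter_ip)
```

Note that a player with no Major League time in either year still gets the full constant, which is the behavior the rest of this piece takes issue with.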

My sample consisted of all regular-season plate appearances in year N since 1980 for everyone who played in the Majors in either year N-1 or year N-2 and played pro ball in some year thereafter.

Marcel’s equation explains 63 percent of the variance in a batter’s plate appearances in a given year. When I fit a regression to the data, the best-fit equation took 75 percent of the PAs in year N-1 plus 10 percent of the PAs in year N-2. But we can do better than the best-fit line with one simple separation of the data. For players who played more in season N-1 than in season N-2, the playing time in season N-2 becomes irrelevant when projecting season N: for batters, you project 80 percent of the previous year’s plate appearances, and for pitchers it’s closer to 75 percent of the previous year’s innings. Otherwise, the equation is 60 percent of year N-1 plus 20 percent of year N-2 for batters, and 60 percent of year N-1 plus 15 percent of year N-2 for pitchers.
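Here is a minimal sketch of that split rule, using the rounded weights quoted above. The function name and the `pitcher` flag are hypothetical, and the exact fitted coefficients may differ from these rounded values.

```python
def split_rule_playing_time(prev_year, two_years_ago, pitcher=False):
    """Project playing time with the two-case rule described in the text.

    prev_year / two_years_ago: PA (batters) or IP (pitchers) in N-1 and N-2.
    """
    if prev_year > two_years_ago:
        # Playing time went up last year: year N-2 is ignored entirely.
        weight = 0.75 if pitcher else 0.80
        return weight * prev_year
    # Otherwise blend both years: 60% of N-1 plus 20% (batters)
    # or 15% (pitchers) of N-2.
    older_weight = 0.15 if pitcher else 0.20
    return 0.60 * prev_year + older_weight * two_years_ago


# Hypothetical batter whose playing time rose from 400 PA to 600 PA.
print(split_rule_playing_time(600, 400))

# Hypothetical pitcher whose workload fell from 200 IP to 150 IP.
print(split_rule_playing_time(150, 200, pitcher=True))
```

There is no constant term, so a player with no playing time in either year projects to zero.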

The r-squared after splitting the data, however, was identical to that of the single best-fit line, at 0.64. What this means is that there are multiple ways to skin this cat. (Are there actually multiple ways to skin a cat? Cat-skinners, get at me.)

So which way is more correct? Below I plot the best-fit lines for batters and pitchers and the games played per games projected indexes.

I submit that it is more sensible to force the intercept to zero, since if someone hasn’t played in the Majors in either of the past two years, he probably won’t play the following year. Indeed, for players projected at fewer than 300 PAs by Marcel, Marcel generally overshoots by 180 PAs, while the just-as-simple best fit (hereinafter referred to as Marcello, Marcel’s Italian twin*) is on average within 10 PAs. A quarter of these players don’t play at all in the projected season, yet Marcel is projecting 95 percent of them for a career high in plate appearances. When Marcel projects for over 300 PAs, it misses by an average of 100 PAs, compared to an average of two PAs for Marcello. Marcello projects 20 fewer innings for a workhorse like Wainwright. There are only two hitters and one pitcher whom Marcello projects for more playing time than Marcel: Rickie Weeks, Austin Jackson, and C.J. Wilson.

*I’m not sure why their parents named them that way, or how they can be of different nationalities.
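To make "forcing the intercept to zero" concrete, here is a small least-squares sketch with two predictors and no constant term. The function name is my own, and the toy data is made up: it is constructed to follow 75 percent and 10 percent weights exactly, so it is not the real sample.

```python
def fit_through_origin(x1, x2, y):
    """Least-squares weights (a, b) for y ~ a*x1 + b*x2 with no intercept.

    Solves the 2x2 normal equations directly.
    """
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * t for u, t in zip(x1, y))
    s2y = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear; weights are not unique")
    a = (s1y * s22 - s2y * s12) / det
    b = (s2y * s11 - s1y * s12) / det
    return a, b


# Made-up PA totals for four hypothetical batters (years N-1 and N-2).
prev = [600, 300, 50, 450]
prev2 = [550, 400, 0, 100]
# Toy "actual" values built from known weights, so the fit should recover them.
actual = [0.75 * p1 + 0.10 * p2 for p1, p2 in zip(prev, prev2)]

a, b = fit_through_origin(prev, prev2, actual)
print(round(a, 3), round(b, 3))

# With no intercept, a player with no playing time projects to zero.
print(a * 0 + b * 0)
```

With real data the recovered weights would not be exact, but the structure is the same: no playing time in, no playing time out.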

This works back to the original uncertainty surrounding the question that a projection system is attempting to answer. Yes, it’s more likely that someone like Wainwright will pitch 200 innings than 180. For pitchers who throw 200 innings in back-to-back years, there’s a better than 50-50 chance that they have a third season at 200 innings-plus. However, the mean innings pitched for those players turns out to be way less than 200.
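A quick illustration of how both statements can hold at once. The ten outcomes below are entirely hypothetical, chosen only to show the shape of the problem: most seasons land at 200-plus innings, but a few injury-shortened seasons drag the mean well below that.

```python
from statistics import mean

# Hypothetical next-season innings totals for ten pitchers who each threw
# 200+ innings in back-to-back years. Made-up numbers, not real data.
outcomes = [210, 220, 205, 215, 200, 225, 190, 120, 60, 0]

# Share of pitchers who reach 200 innings again.
share_200_plus = sum(ip >= 200 for ip in outcomes) / len(outcomes)

# Average innings across the whole group, injuries included.
mean_ip = mean(outcomes)

print(share_200_plus, mean_ip)
```

In this made-up group, six of the ten pitchers repeat at 200-plus innings, yet the average is 164.5.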

The Bill James forecasts are notoriously optimistic because they shoot for the mode as opposed to a mean. At the other end is Marcel, which regresses everything towards league average to such an extent that it is resistant to outliers. Marcel is trying to provide a true talent estimate, and therefore trying to minimize the error around projected production level. I think that similarly minimizing the error around projected playing time for a system like Marcel makes more sense than projecting playing time based on a different set of circumstances in which the player has likely outperformed his projected production. Of course, when it comes to projecting playing time, it’s unlikely that even the best algorithm supplied with the best data could outperform old-fashioned flesh, blood, and brainpower.

Jeremy Greenhouse is an author of Baseball Prospectus.

Jeremy, you addressed the median and the mode in the text, but as I stare at the graphs, I find myself wondering if there is a simple way to draw a line that goes through both the dark cluster around the origin and the dark cluster of full playing time in the upper right. A mathematical technique, I mean. The "best fit" line is having to account for all the points along the x-axis.

Wouldn't the cluster in the lower left (for pitchers) be the relievers (along with the injuries and such)? And so if you separated by pitcher role, would that help? You could obviously approach the model itself from multiple angles (logit, piece-wise, two separate models completely) if you had some type of logit flag for GS > G*0.8 or something. Or is that more complicated than what you wanted to do?

Anyway it would be interesting to see how the plot looks with pitchers separated into two groups with some simple rule of thumb.

OK, I checked it out, and I think I've made it clear I don't feel Marcel is competent in projecting playing time. Simply using year N-1 to predict year N obviously comes in with a higher average error than Marcello, as Marcello is a best fit and year N-1 is unregressed. But year N-1 and Marcello have practically the same average absolute error. Marcel has a much larger average absolute error.