Follow me on Twitter

In response to my two articles on whether pitcher performance over the first 6 innings is predictive of their 7th inning performance (no), a common response from saber and non-saber leaning critics and commenters goes something like this:

No argument with the results or general method, but there’s a bit of a problem in selling these findings. MGL is right to say that you can’t use the stat line to predict inning number 7, but I would imagine that a lot of managers aren’t using the stat line as much as they are using their impression of the pitcher’s stuff and the swings the batters are taking.

You hear those kinds of comments pretty often even when a pitcher’s results aren’t good, “they threw the ball pretty well,” and “they didn’t have a lot of good swings.”

There’s no real way to test this and I don’t really think managers are particularly good at this either, but it’s worth pointing out that we probably aren’t able to do a great job capturing the crucial independent variable.

That is actually a comment on The Book Blog by Neil Weinberg, one of the editors of Beyond the Box Score and a sabermetric blog writer (I hope I got that somewhat right).

My (edited) response on The Book Blog was this:

Neil I hear that refrain all the time and with all due respect I’ve never seen any evidence to back it up. There is plenty of evidence, however, that for the most part it isn’t true.

If we are to believe that managers are any good whatsoever at figuring out which pitchers should stay and which should not, one of two things must be true:

1) The ones who stay must pitch well, especially in close games. That simply isn’t true.

2) The ones who do not stay would have pitched terribly. In order for that to be the case, we must be greatly under-estimating the TTO penalty. That strains credulity.

Let me explain the logic/math in # 2:

We have 100 pitchers pitching thru 6 innings. Their true talent is 4.0 RA9. 50 of them stay and 50 of them go, or some other proportion – it doesn’t matter.

We know that those who stay pitch to the tune of around 4.3. We know that. That’s what the data say. They pitch at the true talent plus the 3rd TTOP, after adjusting for the hitters faced in the 7th inning.

If we are to believe that managers can tell, to any extent whatsoever, whether a pitcher is likely to be good or bad in the next inning or so, then it must be true that the ones who stay will pitch better on the average then the ones who do not, assuming that the latter were allowed to stay in the game of course.

So let’s assume that those who were not permitted to continue would have pitched at a 4.8 level, .5 worse than the pitchers who were deemed fit to remain.

That tells us that if everyone were allowed to continue, they would pitch collectively at a 4.55 level, which implies a .55 rather than a .33 TTOP.

Are we to believe that the real TTOP is a lot higher than we think, but is depressed because managers know when to take pitchers out such that the ones they leave in actually pitch better than all pitchers would if they were all allowed to stay?

Again, to me that seems unlikely.

Anyway, here is some new data which I think strongly suggests that managers and pitching coaches have no better clue than you or I as to whether a pitcher should remain in a game or not. In fact, I think that the data suggest that whatever criteria they are using, be it runs allowed, more granular performance like K, BB, and HR, or keen, professional observation and insight, it is simply not working at all.

After 6 innings, if a game is close, a manager should make a very calculated decision as far as whether or not he should remove his starter. That decision ought to be based primarily on whether the manager thinks that his starter will pitch well in the 7th and possibly beyond, as opposed to one of his back-end relievers. Keep in mind that we are talking about general tendencies which should apply in close games going into the 7th inning. Obviously every game may be a little different in terms of who is on the mound, who is available in the pen, etc. However, in general, when the game is close in the 7th inning and the starter has already thrown 6 full, the decision to yank him or allow him to continue pitching is more important than when the game is not close.

If the game is already a blowout, it doesn’t matter much whether you leave in your starter or not. It has little effect on the win expectancy of the game. That is the whole concept of leverage. In cases where the game is not close, the tendency of the manager should be to do whatever is best for the team in the next few games and in the long run. That may be removing the starter because he is tired and he doesn’t want to risk injury or long-term fatigue. Or it may be letting his starter continue (the so-called “take one for the team” approach) in order to rest his bullpen. Or it may be to give some needed work to a reliever or two.

Let’s see what managers actually do in close and not-so-close games when their starter has pitched 6 full innings and we are heading into the 7th, and then how those starters actually perform in the 7th if they are allowed to continue.

In close games, which I defined as a tied or one-run game, the starter was allowed to begin the 7th inning 3,280 times and he was removed 1,138 times. So the starter was allowed to pitch to at least 1 batter in the 7th inning of a close game 74% of the time. That’s a pretty high percentage, although the average pitch count for those 3,280 pitcher-games was only 86 pitches, so it is not a complete shock that managers would let their starters continue especially when close games tend to be low scoring games. If a pitcher is winning or losing 2-1 or 3-2 or 1-0 or the game is tied 0-0, 1-1, 2-2, and the starter’s pitch count is not high, managers are typically loathe to remove their starter. In fact, in those 3,280 instances, the average runs allowed for the starter through 6 innings was only 1.73 runs (a RA9 of 2.6) and the average number of innings pitched beyond 6 innings was 1.15.

So these are presumably the starters that managers should have the most confidence in. These are the guys who, regardless of their runs allowed, or even their component results, like BB, K, and HR, are expected to pitch well into the 7th, right? Let’s see how they did.

These were average pitchers, on the average. Their seasonal RA9 was 4.39 which is almost exactly league average for our sample, 2003-2013 AL. They were facing the order for the 3rd time on the average, so we expect them to pitch .33 runs worse than they normally do if we know nothing about them.

These games are in slight pitcher’s parks, average PF of .994, and the batters they faced in the 7th were worse than average, including a platoon adjustment (it is almost always the case that batters faced by a starter in the 7th are worse than league average, adjusted for handedness). That reduces their expected RA9 by around .28 runs. Combine that with the .33 run “nick” that we expect from the TTOP and we expect these pitchers to pitch at a 4.45 level, again knowing nothing about them other than their seasonal levels and attaching a generic TTOP penalty and then adjusting for batter and park.

Surely their managers, in allowing them to pitch in a very close game in the 7th know something about their fitness to continue – their body language, talking to their catcher, their mechanics, location, past experience, etc. All of this will help them to weed out the ones who are not likely to pitch well if they continue, such that the ones who are called on to remain in the game, the 74% of pitchers who face this crossroad and move on, will surely pitch better than 4.45, which is about the level of a near-replacement reliever.

In other words, if a manager thought that these starters were going to pitch at a 4.45 level in such a close game in the 7th inning, they would surely bring in one of their better relievers – the kind of pitchers who typically have a 3.20 to 4.00 true talent.

So how did these hand-picked starters do in the 7th inning? They pitched at a 4.70 level. The worst reliever in any team’s pen could best that by ½ run. Apparently managers are not making very good decisions in these important close and late game situations, to say the least.

What about in non-close game situations, which I defined as a 4 or more run differential?

73% of pitchers who pitch through 6 were allowed to continue even in games that were not close. No different from the close games. The other numbers are similar too. The ones who are allowed to continue averaged 1.29 runs over the first 6 innings with a pitch count of 84, and pitched an average of 1.27 innings more.

These guys had a true talent of 4.39, the same as the ones in the close games – league average pitchers, collectively. They were expected to pitch at a 4.50 level after adjusting for TTOP, park and batters faced. They pitched at a 4.78 level, slightly worse than our starters in a close game.

So here we have two very different situations that call for very different decisions, on the average. In close games, managers should (and presumably think they are) be making very careful decision about whom to pitch in the 7th, trying to make sure that they use the best pitcher possible. In not-so-close games, especially blowouts, it doesn’t really matter who they pitch, in terms of the WE of the game, and the decision-making goal should be oriented toward the long-term.

Yet we see nothing in the data that suggests that managers are making good decisions in those close games. If we did, we would see much better performance from our starters than in not-so-close games and good performance in general. Instead we see rather poor performance, replacement level reliever numbers in the 7th inning of both close and not-so-close games. Surely that belies the, “Managers are able to see things that we don’t and thus can make better decisions about whether to leave starters in or not,” meme.

Let’s look at a couple more things to further examine this point.

In the first installment of these articles I showed that good or bad run prevention over the first 6 innings has no predictive value whatsoever for the 7th inning. In my second installment, there was some evidence that poor component performance, as measured by in-game, 6-inning FIP had some predictive value, but not good or great component performance.

Let’s see if we can glean what kind of things managers look at when deciding to yank starters in the 7th or not.

In all games in which a starter allows 1 or 0 runs through 6, even though his FIP was high, greater than 4, suggesting that he really wasn’t pitching such a great game, his manager let him continue 78% of the time, which was more than the 74% overall that starters pitched into the 7th.

In games where the starter allowed 3 or more runs through 6 but had a low FIP, less than 3, suggesting that he pitched better than his RA suggest, managers let them continue to pitch just 55% of the time.

Those numbers suggest that managers pay more attention to runs allowed than component results when deciding whether to pull their starter in the 7th. We know that that is not a good decision-making process as the data indicate that runs allowed have no predictive value while component results do, at least when those results reflect poor performance.

In addition, there is no evidence that managers can correctly determine who should stay and who to pull in close games – when that decision matters the most. Can we put to rest, for now at least, this notion that managers have some magical ability to figure out which of their starters has gas left in their tank and which do not? They don’t. They really, really, really don’t.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

The other day I wrote that pitcher performance though 6 innings, as measured solely by runs allowed, is not a good predictor of performance in the 7th inning. Whether a pitcher is pitching a shutout or has allowed 4 runs thus far, his performance in the 7th is best projected mostly by his full-season true talent level plus a times through the order penalty of around .33 runs per 9 innings (the average batter faced in the 7th inning appears for the 3rd time). Pitch count has a small effect on those late inning projections as well.

Obviously if you have allowed no or even 1 run through 6 your component results will tend to be much better than if you have allowed 3 or 4 runs, however there is going to be some overlap. Some small proportion of 0 or 1 run starters will have allowed a HR, 6 or 7 walks and hits, and few if any strikeouts. Similarly, some small percentage of pitchers who allow 3 or 4 runs through 6 will have struck out 7 or 8 batters and only allowed a few hits and walks.

If we want to know whether pitching ”well” or not through 6 innings has some predictive value for the 7th (and later) inning, it is better to focus on things that reflect the pitcher’s raw performance than simply runs allowed. It is an established fact that pitchers have little control over whether their non-HR batted balls fall for hits or outs or whether their hits and walks get “clustered” to produce lots of runs or are spread out such that few if any runs are scored.

It is also established that the components most under control by a pitcher are HR, walks, and strikeouts, and that pitchers who excel at the K, and limit walks and HR tend to be the most talented, and vice versa. It also follows that when a pitcher strikes out a lot of batters in a game and limits his HR and walks total that he is pitching “well,” regardless of how many runs he has allowed – and vice versa.

Accordingly, I have extended my inquiry into whether pitching “well” or not has some predictive value intra-game to focus on in-game FIP rather than runs allowed. My intra-game FIP is merely HR, walks, and strikeouts per inning, using the same weights as are used in the standard FIP formula – 13 for HR, 3 for walks and 2 for strikeouts.

So, rather than defining dealing as allowing 1 or fewer runs through 6 and not dealing as 3 or more runs, I will define the former as an FIP through 6 innings below some maximum threshold and the latter as above some minimum threshold. Although I am not nearly convinced that managers and pitching coaches, and certainly not the casual fan, look much further than runs allowed, I think we can all agree that they should be looking at these FIP components instead.

Here is the same data that I presented in my last article, this time using FIP rather than runs allowed to differentiate pitchers who have been pitching very well through 6 innings or not.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings

Avg runs allowed through 6

# of Games

RA9 in the 7th inning

Dealing (FIP less than 3 through 6)

1.02

5,338

4.39

Not-dealing (FIP greater than 4)

2.72

3,058

5.03

The first thing that should jump out at you is while our pitchers who are not pitching well do indeed continue to pitch poorly, our dealing pitchers, based upon K, BB, and HR rate over the first 6 innings, are not exactly breaking the bank either in the 7th inning.

Let’s put some context into those numbers.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings

True talent level based on season RA9

Expected RA9 in 7th

RA9 in the 7th inning

Dealing (FIP less than 3 through 6)

4.25

4.50

4.39

Not-dealing (FIP greater than 4)

4.57

4.62

5.03

As you can see, our new dealing pitchers are much better pitchers. They normally allow 4.25 runs per game during the season. Yet they allow 4.39 runs in the 7th despite pitching very well through 6, irrespective of runs allowed (and of course they allow few runs too). In other words, we have eliminated those pitchers who allowed few runs but may have actually pitched badly or at least not as well as their meager runs allowed would suggest. All of these dealing pitchers had some combination of high K rates, and low BB and HR rates through 6 innings. But still, we see only around .1 runs per 9 in predictive value – not significantly different from zero or none.

On the other hand, pitchers who have genuinely been pitching badly, at least in terms of some combination of a low K rate and high BB and HR rates, do continue to pitch around .4 runs per 9 innings worse than we would expect given their true talent level and the TTOP.

There is one other thing that is driving some of the difference. Remember that in our last inquiry we found that pitch count was a factor in future performance. We found that while pitchers who only had 78 pitches through 6 innings pitched about as well as expected in the 7th, pitchers with an average of 97 pitches through 6 performed more than .2 runs worse than expected.

In our above 2 groups, the dealing pitchers averaged 84 pitches through 6 and the non-dealing 88, so we expect some bump in the 7th inning performance of the latter group because of a touch of fatigue, at least as compared to the dealing group.

So when we use a more granular approach to determining whether pitchers have been dealing through 6, there is not any evidence that it has much predictive value – the same thing we concluded when we looked at runs allowed only. These pitchers only pitches .11 runs per 9 better than expected.

On the other hand, if pitchers have been pitching poorly for 6 innings, as reflected in the components in which they exert the most control, K, BB, and HR rates, they do in fact pitch worse than expected, even after accounting for a slight elevation in pitch count as compared to the dealing pitchers. That decrease in performance is about .4 runs per 9.

I also want to take this time to state that based on this data and the data from my previous article, there is little evidence that managers are able to identify when pitchers should stay in the game or should be removed. We are only looking at pitchers who were chosen to continue pitching in the 7th inning by their managers and coaches. Yet, the performance of those pitchers is worse than their seasonal numbers, even for the dealing pitchers. If managers could identify those pitchers who were likely to pitch well, whether they had pitched well in prior innings or not, clearly we would see better numbers from them in the 7th inning. At best a dealing pitcher is able to mitigate his TTOP, and a non-dealing pitcher who is allowed to pitch the 7th pitches terribly, which does not bode well for the notion that managers know whom to pull and and whom to keep in the game.

For example, in the above charts, we see that dealing pitchers threw .14 runs per 9 worse than their seasonal average – which also happens to be exactly at league average levels. The non-dealing pitchers, who were also deemed fit to continue by their managers, pitched almost ½ run worse than their seasonal performance and more than .6 runs worse than the league average pitcher. Almost any reliever in the 7th inning would have been a better alternative than either the dealing or non-dealing pitchers. Once again, I have yet to see some concrete evidence that the ubiquitous cry from some of the sabermetric naysayers, “Managers know more about their players’ performance prospects than we do,” has any merit whatsoever.

Note: “Guy,” a frequent participant on The Book Blog, pointed out an error I have been making in calculating the expected RA9 for starters. I have been using their season RA9 as the baseline, and then adjusting for context. That is wrong. I must consider the RA9 of the first 6 innings and then subtract that from the seasonal RA9. For example if a group of pitchers has a RA9 for the season of 4.40 and they have a RA9 of 1.50 for the first 6 innings, if they average 150 IP for the season, our baseline adjusted expectation for the 7th inning, not considering any effects from pitch count, TTOP, manager’s decision to let them continue, etc., is 73.3 (number of runs allowed over 150 IP for the season) minus 1 run for 6 innings, or 72.3 runs over 144 innings, which is an expected RA9 of 4.52, .12 runs higher than the seasonal RA9 of 4.40.

The same goes for the starters who have gotten shelled through 6. Their adjusted expected RA9 for any other time frame, e.g., the 7th inning, is a little lower than 4.40 if 4.40 is their full-season RA9. How much lower depends on the average number of runs allowed in those 6 innings. If it is 4, then we have 73.3 – 4, or 69.3, divided by 144, times 9, or 4.33.

So I will adjust all my numbers to the tune of .14 runs up for dealing pitchers and .07 down for non-dealing pitchers. The exact adjustments might vary a little from these, depending on the average number of runs allowed over the first 6 innings in the various groups of pitchers I looked at.

Almost everyone, to a man, thinks that a manager’s decision as to whether to allow his starter to pitch in the 6th, 7th, or 8th (or later) innings of an important game hinges, at least in part, on whether said starter has been dealing or getting banged around thus far in the game.

Obviously there are many other variables that a manager can and does consider in making such a decision, including pitch count, times through the order (not high in a manager’s hierarchy of criteria, as analysts have been pointing out more and more lately), the quality and handedness of the upcoming hitters, and the state of the bullpen, both in term of quality and availability.

For the purposes of this article, we will put aside most of these other criteria. The two questions we are going to ask is this:

If a starter is dealing thus far, say, in the first 6 innings, and he is allowed to continue, how does he fare in the very next inning? Again, most people, including almost every baseball insider, (player, manager, coach, media commentator, etc.), will assume that he will continue to pitch well.

If a starter has not been dealing, or worse yet, he is achieving particularly poor results, these same folks will usually argue that it is time to take him out and replace him with a fresh arm from the pen. As with the starter who has been dealing, the presumption is that the pitcher’s bad performance over the first, say, 6 innings, is at least somewhat predictive of his performance in the next inning or two. Is that true as well?

Keep in mind that one thing we are not able to look at is how a poorly performing pitcher might perform if he were left in a game, even though he was removed. In other words, we can’t do the controlled experiment we would like – start a bunch of pitchers, track how they perform through 6 innings and then look at their performance through the next inning or two.

So, while we have to assume that, in some cases at least, when a pitcher is pitching poorly and his manager allows him to pitch a while longer, that said manager still had some confidence in the pitcher’s performance over the remaining innings, we also must assume that if most people’s instincts are right, the dealing pitchers through 6 innings will continue to pitch exceptionally well and the not-so dealing pitchers will continue to falter.

Let’s take a look at some basic numbers before we start to parse them and do some necessary adjustments. The data below is from the AL only, 2003-2013.

Pitchers who have been dealing or not through 6 innings – how they fared in the 7th

Starters through 6 innings

# of Games

RA9 in the 7th inning

Dealing (0 or 1 run allowed through 6)

5,822

4.46

Not-dealing (3 or more runs allowed through 6)

2,960

4.48

First, let me explain what “RA9 in the 7th inning” means: It is the average number of runs allowed by the starter in the 7th inning extrapolated to 9 innings, i.e. runs per inning in the 7th multiplied by 9. Since the starter is often removed in the middle of the 7th inning whether has been dealing or not, I calculated his runs allowed in the entire inning by adding together his actual runs allowed while he was pitching plus the run expectancy of the average pitcher when he left the game, scaled to his talent level and adjusted for time through the order, based on the number of outs and base runners.
For example, let’s say that a starter who is normally 10% worse than a league average pitcher allowed 1 run in the 7th inning and then left with 2 outs and a runner on first base. He would be charged with allowing 1 plus (.231 * 1.1 * 1.08) runs or 1.274 runs in the 7th inning. The .231 is the average run expectancy for a runner on first base and 2 outs, the 1.1 multiplier is because he is 10% worse than a league average pitcher, and the 1.08 multiplier is because most batters in the 7th inning are appearing for the 3rd time (TTOP). When all the 7th inning runs are tallied, we can convert them into a runs per 9 innings or the RA9 you see in the chart above.

At first glance it appears that whether a starter has been dealing in prior innings or not has absolutely no bearing on how he is expected to pitch in the following inning, at least with respect to those pitchers who were allowed to remain in the game past the 6th inning. However, we have different pools of pitchers, batters, parks, etc., so the numbers will have to be parsed to make sure we are comparing apples to apples.

Let’s add some pertinent data to the above chart:

Starters through 6

RA9 in the 7th

Seasonal RA9

Dealing

4.46

4.29

Not-dealing

4.48

4.46

As you can see, the starters who have been dealing are, not surprisingly, better pitchers. However, interestingly, we have a reverse hot and cold effect. The pitchers who have allowed only 1 run or less through 6 innings pitch worse than expected in the 7th inning, based on their season-long RA9. Many of you will know why – the times through the order penalty. If you have not read my two articles on the TTOP, and I suggest you do, each time through the order, a starting pitcher fares worse and worse, to the tune of about .33 runs per 9 innings each time he faces the entire lineup. In the 7th inning, the average TTO is 3.0, so we expect our good pitchers, the ones with the 4.29 RA9 during the season, to average around 4.76 RA9 in the 7th inning (the 3rd time though the order, a starter pitches about .33 runs per 9 worse than he pitches overall, and the seasonal adjustment – see the note above – adds another .14 runs). They actually pitch to the tune of 4.46 or .3 runs better than expected after considering the TTOP. What’s going on there?

Well, as it turns out, there are 3 contextual factors that depress a dealing starter’s results in the 7th inning that have nothing to do with his performance in the 6 previous innings:

The batters that a dealing pitcher is allowed to face are 5 points lower in wOBA than the average batter that each faces over the course of the season, after adjusting for handedness. This should not be surprising. If any starting pitcher is allowed to pitch the 7th inning, it is likely that the batters in that inning are slightly less formidable or more advantageous platoon-wise, than is normally the case. Those 5 points of wOBA translate to around .17 runs per 9 innings, reducing our expected RA9 to 4.59.

The parks in which we find dealing pitchers are not-surprisingly, slightly pitcher friendly, with an average PF of .995, further reducing our expectation of future performance by .02 runs per 9, further reducing our expectation to 4.57.

The temperature in which this performance occurs is also slightly more pitcher friendly by around a degree F, although this would have a de minimus effect on run scoring (it takes about a 10 degree difference in temperature to move run scoring by around .025 runs per game).

So our dealing starters pitch .11 runs per 9 innings better than expected, a small effect, but nothing to write home about, and well within the range of values that can be explained purely by chance.

What about the starters who were not dealing? They out-perform their seasonal RA9 plus the TTOP by around .3 runs per 9. The batters they face in the 7th inning are 6 points worse than the average league batter after adjusting for the platoon advantage, and the average park and ambient temperature tend to slightly favor the hitter. Adjusting their seasonal RA9 to account for the fact that they pitched poorly through 6 (see my note at the beginning of this article), we get an expectation of 4.51. So these starters fare almost exactly as expected (4.48 to 4.51) in the 7th inning, after adjusting for the batter pool, despite allowing 3 or more runs for the first 6 innings. Keep in mind that we are only dealing with data from around 9,000 BF. One standard deviation in “luck” is around 5 points of wOBA which translates to around .16 runs per 9.

It appears to be quite damning that starters who are allowed to continue after pitching 6 stellar or mediocre to poor innings pitch almost exactly as (poorly as) expected – their normal adjusted level plus .33 runs per 9 because of the TTOP – as if we had no idea how well or poorly they pitched in the prior 6 innings.

Score one for simply using a projection plus the TTOP to project how any pitcher is likely to pitch in the middle to late innings, regardless of how well or poorly they have pitched thus far in the game. Prior performance in the same game has almost no bearing on that performance. If anything, when a manager allows a dealing pitcher to continue pitching after 6 innings, when facing the lineup for the 3rd time on the average, he is riding that pitcher too long. And, more importantly, presumably he has failed to identify anything that the pitcher might be doing, velocity-wise, mechanics-wise, repertoire-wise, command-wise, results-wise, that would suggest that he is indeed on that day and will continue to pitch well for another inning or so.

In fact, whether pitchers have pitched very well or very poorly or anything in between for the first 6 innings of a game, managers and pitching coaches seem to have no ability to determine whether they are likely to pitch well if they remain in the game. The best predictor of 7th inning performance for any pitcher who is allowed to remain in the game, is his seasonal performance (or projection) plus a fixed times through the order penalty. The TTOP is approximately .33 runs per 9 innings for every pass through the order. Since the second time through the order is roughly equal to a pitcher’s overall performance, starting with the 3rd time through the lineup we expect that starter to pitch .33 runs worse than he does overall, again, regardless of how he has pitched thus far in the game. The 4th time TTO, we expect a .66 drop in performance. Pitchers rarely if ever get to throw to the order for the 5th time.

Fatigue and Pitch Counts

Let’s look at fatigue using pitch count as a proxy, and see if that has any effect on 7th inning performance for pitchers who allowed 3 or more runs through 6 innings. For example, if a pitcher has not pitched particularly well, should we allow him to continue if he has a low pitch count?

Pitch count and 7th inning performance for non-dealing pitchers:

Pitch count through 6

Expected RA9

Actual RA9

Less than 85 (avg=78)

4.56

4.70

Greater than 90 (avg=97)

4.66

4.97

Expected RA9 accounts for the pitchers’ adjusted seasonal RA9 plus the pool of batters faced in the 7th inning including platoon considerations, as well as park and weather. The latter 2 affect the numbers minimally. As you can see, pitchers who had relatively high pitch counts going into the 7th inning but were allowed to pitch for whatever reasons despite allowing at least 3 runs thus far, fared .3 runs worse than expected, even after adjusting for the TTOP. Pitchers with low pitch counts did only about .14 runs worse than expected, including the TTOP. Those 20 extra pitches appear to account for around .17 runs per 9, not a surprising result. Again, please keep in mind that we dealing with limited sample sizes, so these small differences are inferential suggestions and are not to be accepted with a high degree of certainty. They do point us in a certain direction, however, and one which comports with our prior expectation – at least my prior expectation.

What about if a pitcher has been dealing and he also has a low pitch count going into the 7th inning. Very few managers, if any, would remove a starter who allowed zero or 1 run through 6 innings and has only thrown 65 or 70 pitchers. That would be baseball blasphemy. Besides the affront to the pitcher (which may be a legitimate concern, but one which is beyond the scope of this article), the assumption by nearly everyone is that the pitcher will continue to pitch exceptionally well. After all, he is not at all tired and he has been dealing! Let’s see if that is true – that these starters continue to pitch well, better than expected based on their projections or seasonal performance plus the TTOP.

Pitch count and 7th inning performance for dealing pitchers:

Pitch count through 6

Expected RA9

Actual RA9

Less than 80 (avg=72)

4.75

4.50

Greater than 90 (avg=96)

4.39

4.44

Keep in mind that these pitchers normally allow 4.30 runs per 9 innings during the entire season (4.44 after doing the seasonal adjustment). The reason the expected RA9 is so much higher for pitchers with a low pitch count is primarily due to the TTOP. For pitchers with a high pitch count, the batters they face in the 7th are 10 points less in wOBA than league average, thus the 4.39 expected RA9, despite the usual .3 to .35 TTOP.

Similar to the non-dealing pitchers, fatigue appears to play a factor in a dealing pitcher’s performance in the 7th. However, in either case, low-pitch or high-pitch, their performance through the first 6 innings has little bearing on their 7th inning performance. With no fatigue they out-perform their expectation by .25 runs per 9. The fatigued pitchers under-performed their overall season-long adjusted talent plus the usual TTOP by .05 runs per 9.

Again, we see that there is little value to taking out a pitcher who has been getting a little knocked around or leaving in a pitcher who has been dealing for 6 straight innings. Both groups will continue to perform at around their expected full-season levels plus any applicable TTOP, with a slight increase in performance for a low-pitch count pitcher and a slight decrease for a high-pitch count pitcher. The biggest increase we see, .25 runs, is for pitchers who were dealing and had very low pitch counts.

What about if we increase our threshold to pitchers who allow 4 or more runs over 6 innings and those who are pitching a shutout?

Starters through 6

Seasonal RA9

Expected RA9

7th inning RA9

Dealing (shutouts only)

4.23

4.62

4.70

Not-dealing (4 or more runs)

4.62

4.81

4.87

Here, we see no predictive value in the first 6 innings of performance. In fact, for some reason starters pitching a shutout pitched slightly worse than expected in the 7th inning, after adjusting for the pool of batters faced and the TTOP.

How about the holy grail of starters who are expected to keep lighting it up in the 7th inning – starters pitching a shutout and with a low pitch count? These were true talent 4.25 pitchers facing better than average batters in the 7th, mostly for the third time in the game, so we expect a .3 bump or so for the TTOP. Our expected RA9 was 4.78 after making all the adjustments, and the actual was 4.61. Nothing much to speak of. Their dealing combined with a low pitch count had a very small predictive value in the 7th. Less than .2 runs per 9 innings.

Conclusion

As I have been preaching for what seems like forever – and the data are in accordance – however a pitcher is pitching through X innings in a game, at least as measured by runs allowed, even at the extremes, has very little relevance with regard to how he is expected to pitch in subsequent innings. The best marker for whether to pull a pitcher or not seems to be pitch count.

If you want to know the most likely result, or the mean expected result at any point in the game, you should mostly ignore prior performance in that game and use a credible projection plus a fixed times through the order penalty, which is around .33 runs per 9 the 3rd time through, and another .33 the 4th time through. Of course the batters faced, park, weather, etc. will further dictate the absolute performance of the pitcher in question.

Keep in mind that I have not looked at a more granular approach to determining whether a pitcher has been pitching extremely well or getting shelled, such as hits, walks, strikeouts, and the like. It is possible that such an approach might yield a subset of pitching performance that indeed has some predictive value within a game. For now, however, you should be pretty convinced that run prevention alone during a game has little predictive value in terms of subsequent innings. Certainly a lot less than what most fans, managers, and other baseball insiders think.

There is a prolific base stealer on first base in a tight game. The pitcher steps off the rubber, varies his timing, or throws over to first several times during the AB. You’ve no doubt heard some version of the following refrain from your favorite media commentator: “The runner is disrupting the defense and the pitcher, and the latter has to throw more fastballs and perhaps speed up his delivery or use a slide step, thus giving the batter an advantage.”

There may be another side of the same coin: The batter is distracted by all these ministrations, he may even be distracted if and when the batter takes off for second, and he may take a pitch that he would ordinarily swing at in order to let the runner steal a base. All of this leads to decreased production from the batter, as compared to a proverbial statue on first, to which the defense and the pitcher pay little attention.

So what is the actual net effect? Is it in favor of the batter, as the commentators would have you believe (after all, they’ve played the game and you haven’t), or does it benefit the pitcher – an unintended negative consequence of being a frequent base stealer?

Now, even if the net effect of a stolen base threat is negative for the batter, that doesn’t mean that being a prolific base stealer is necessarily a bad thing. Attempting stolen bases, given a high enough success rate, presumably provides extra value to the offense independent of the effect on the batter. If that extra value exceeds that given up by virtue of the batter being distracted, then being a good and prolific base stealer may be a good thing. If the pundits are correct and the “net value of distraction” is in favor of the batter, then perhaps the stolen base or stolen base attempt is implicitly worth more than we think.

Let’s not also forget that the stolen base attempt, independent of the success rate, is surely a net positive for the offense, not withstanding any potential distraction effects. That is due to the fact that when the batter puts the ball in play, whether it is a hit and run or a straight steal, there are fewer forces at second, fewer GDP’s, and the runner advances the extra base more often on a single, double, or out. Granted, there are a few extra line drive and fly ball DP, but there are many fewer GDP to offset those.

If you’ve already gotten the feeling that this whole steal thing is a lot more complicated than it appears on its face, you would be right. It is also not easy, to say the least, to try and ascertain whether there is a distraction effect and who gets the benefit, the offense or the defense. You might think, “Let’s just look at batter performance with a disruptive runner on first as compared to a non-disruptive runner.” We can even use a “delta,” “matched pairs,” or “WOWY” approach in order control for the batter, and perhaps even the pitcher and other pertinent variables. For example, with Cabrera at the plate, we can look at his wOBA with a base stealing threat on first and a non-base stealing threat. We can take the difference, say 10 points in wOBA in favor of with the threat (IOW, the defense is distracted and not the batter), and weight that by the number of times we find a matched pair (the lesser of the two PA). In other words, a “matched pair” is one PA with a stolen base threat on first and one PA with a non-threat.

If Cabrera had 10 PA with a stolen base threat and 8 PA with someone else on first, we would weight the wOBA difference by 8 – we have 8 matched pairs. We do that for all the batters, weighting each batter’s difference by their number of matched pairs, and voila, we have a measure of the amount that a stolen base threat on first affects the batter’s production, as compared to a non-stolen base threat. Seems pretty simple and effective, right? Eh, not so fast.

Unfortunately there are myriad problems associated with that methodology. First of all, do we use all PA where the runner started on first but may have ended up on another base, or was thrown out, by the time the batter completed his PA? If we do that, we will be comparing apples to oranges. With the base stealing threats, there will be many more PA with a runner on second or third, or with no runners at all (on a CS or PO). And we know that wOBA goes down once we remove a runner from first base, because we are eliminating the first base “hole” with the runner being held on. We also know that the value of the offensive components are different depending on the runners and outs. For example, with a runner on second, the walk is not as valuable to the batter and the K is worse than a batted ball out which has a chance to advance the runner.

What if we only look at PA where the runner was still at first when the batter completed his PA? Several researchers have done that, included myself and my co-authors in The Book. The problem with that method is that those PA are not an unbiased sample. For the non-base stealers, most PA will end with a runner on first, so that is not a problem. But with a stolen base threat on first, if we only include those PA that end with the runner still on first, we are only including PA that are likely biased in terms of count, score, game situation, and even the pitcher. In other words, we are only including PA where the runner has not attempted a steal yet (other than on a foul ball). That could mean that the pitcher is difficult to steal on (many of these PA will be with a LHP on the mound), the score is lopsided, the count is biased one way or another, etc. Again, if we only look at times where the PA ended with the runner on first, we are comparing apples to oranges when looking at the difference in wOBA between a stolen base threat on first and a statue.

It almost seems like we are at an impasse and there is nothing we can do, unless perhaps we try to control for everything, including the count, which would be quite an endeavor. Fortunately there is a way to solve this – or at least come close. We can first figure out the overall difference in value to the offense between having a base stealer and a non-base stealer on first, including the actual stolen base attempts. How can we do that? That is actually quite simple. We need only look at the change in run expectancy starting from the beginning to the end of the PA, starting with a runner on first base only. We can then use the delta or matched pairs method to come up with an average difference in change in RE. This difference represents the sum total of the value of a base stealer at first versus a non-base stealer, including any effect, positive or negative, on the batter.

From there we can try and back out the value of the stolen bases and caught stealings (including pick-offs, balks, pick-off errors, catcher errors on the throw, etc.) as well as the extra base runner advances and the avoidance of the GDP when the ball is put into play. What is left is any “distraction effect” whether it be in favor of the batter or the pitcher.

First, in order to classify the base runners, I looked at their number of steal attempts per times on first (BB+HP+S+ROE) for that year and the year before. If it was greater than 20%, they were classified as a “stolen-base threat.” If it was less than 2%, they were classified as a statue. Those were the two groups I looked at vis-à-vis the runner on first base. All other runners (the ones in the middle) were ignored. Around 10% of all runners were in the SB threat group and around 50% were in the rarely steal group.

Then I looked at all situations starting with a runner on first (in one or the other stolen base group) and ending when the batter completes his PA or the runner makes the third out of the inning. The batter may have completed his PA with the runner still on first, on second or third, or with no one on base because the runner was thrown out or scored, via stolen bases, errors, balks, wild pitches, passed balls, etc.

I only included innings 1-6 (to try and eliminate pinch runners, elite relievers, late and close-game strategies, etc.) and batters who occupied the 1-7 slots. I created matched pairs for each batter such that I could use the “delta method” described above to compute the average difference in RE change. I did it year by year, i.e., the matched pairs had to be in the same year, but I included 20 years of data, from 1994-2013. The batters in each matched pair had to be on the same team as well as the same year. For example, Cabrera’s matched pairs of 8 PA with base stealers and 10 PA with non-base stealers would be in one season only. In another season, he would have another set of matched pairs.

Here is how it works: Batter A may have had 3 PA with a base stealer on first and 5 with a statue. His average change in RE (everyone starts with a runner on first only) at the end of the PA may have been +.130 runs for those 3 PA with the stolen base threat on first at the beginning of the PA.

For the 5 PA with a non-threat on first, his average change in RE may have been .110 runs. The difference is .02 runs in favor of the stolen base on first and that gets weighed by 3 PA (the lesser of the 5 and the 3 PA). We do the same thing for the next batter. He may have had a difference of -.01 runs (in favor of the non-threat) weighted by, say, 2 PA. So now we have (.02 * 3 – .01 * 2) / 5 as our total average difference in RE change using the matched pair or delta method. Presumably (hopefully) the pitcher, score, parks, etc. are the same or very similar for both groups. If they are, then that final difference represents the advantage of having a stolen base threat on first base, including the stolen base attempts themselves.

A plus number means a total net advantage to the offense with a prolific base stealer on first, including his SB, CS, and speed on the bases when the ball is put into play, and a negative number means that the offense is better off with a slow, non-base stealer on first, which is unlikely of course. Let’s see what the initial numbers tell us. By the way, for the changes in RE, I am using Tango’s 1969-1992 RE matric from this web site: http://www.tangotiger.net/re24.html.

We’ll start with less than 0 outs, so one of the advantages of a base stealer on first is staying out of the GDP (again, offset by a few extra line drive and fly ball DP). There were a total of 5,065 matched pair PA (adding the lesser of the two PA for each matched pair). Remember a matched pair is a certain batter with a base stealing threat on first and that same batter in the same year with a non-threat on first. The runners are on first base when the batter steps up to the plate but may not be when the PA is completed. That way we are capturing the run expectancy change of the entire PA, regardless of what happens to the runner during the PA.

The average advantage in RE change (again, that is the ending RE after the PA is over minus the starting RE, which is always with a runner on first only, in this case with 0 out) was .032 runs per PA. So, as we expect, a base stealing threat on first confers an overall advantage to the offensive team, at least with no outs. This includes the net run expectancy of SB (including balks, errors, etc.) and CS (including pick-offs), advancing on WP and PB, advancing on balls in play, staying out of the GDP, etc., as well as any advantage or disadvantage to the batter by virtue of the “distraction effect.”

The average wOBA of the batter, for all PA, whether the runner advanced a base or was thrown out during the PA, was .365 with a non-base stealer on first and .368 for a base stealer.

What are the differences in individual offensive components between a base stealing threat and a non-threat originally on first base? The batter with a statue who starts on first base has a few more singles, which is expected given that he hits with a runner on first more often. As well, the batter with a base stealing threat walks and strikes out a lot more, due to the fact he is hitting with a base open more often.

If we then compute the RE value of SB, CS (and balks, pickoffs, errors, etc.) for the base stealer and non-base stealer, as well as the RE value of advancing the extra base and staying out of the DP, we get an advantage to the offense with a base stealer on first of .034 runs per PA.

So, if the overall value of having a base stealer on first is .032 runs per PA, and we compute that .034 runs comes from greater and more efficient stolen bases and runner advances, we must conclude that that there is a .002 runs disadvantage to the batter when there is a stolen base threat on first base. That corresponds to around 2 points in wOBA. So we can say that with no outs, there is a 2 point penalty that the batter pays when there is a prolific base stealer on first base, as compared to a runner who rarely attempts a SB. In 5065 matched PA, one SD of the difference between a threat and non-threat is around 10 points in wOBA, so we have to conclude that there is likely no influence on the batter.

Let’s do the same exercise with 1 and then 2 outs.

With 1 out, in 3,485 matched pair, batters with non-threats hit .388 and batters with threats hit .367. The former had many more singles and of course fewer BB (a lot fewer) and K. Overall, with a non-base stealer starting on first base at the beginning of the PA, batters produced an RE that was .002 runs per PA better than with a base stealing threat. In other words, having a prolific, and presumably very fast, base stealer on first base offered no overall advantage to the offensive team, including the value of the SB, base runner advances, and avoiding the GDP.

If we compute the value that the stolen base threats provide on the base paths, we get .019 runs per PA, so the disadvantage to the batter by virtue of having a prolific base stealer on first base is .021 runs per PA, which is the equivalent of the batter losing 24 points in wOBA.

What about with 2 outs? With 2 outs, we can ignore the GDP advantage for the base stealer as well as the extra value from moving up a base on an out. So, once we get the average RE advantage for a base stealing threat, we can more easily factor out the stolen base and base running advantage to arrive at the net advantage or disadvantage to the batter himself.

With 2 outs, the average RE advantage with a base stealer on first (again, as compared to a non-base stealer) is .050 runs per PA, in a total of 2,390 matched pair PA. Here, the batter has a wOBA of .350 with a non-base stealer on first, and .345 with a base stealer. There is a still a difference in the number of singles because of the extra hole with the first baseman holding on the runner, as well as the usual greater rate of BB with a prolific stealer on base. (Interestingly, with 2 outs, the batter has a higher K rate with a non-threat on base – it is usually the opposite.) Let’s again tease out the advantage due to the actual SB/CS and base running and see what we’re left with. Here, you can see how I did the calculations.

With the non-base stealer, the runner on first is out before the PA is completed 1.3% of the time, he advances to second, 4.4% of the time, and to third, .2%. The total RE change for all that is .013 * -.216 + .044 * .109 + .002 * .157, or .0023 runs, not considering the count when these events occurred. The minus .216, plus .109, and plus .157 are the change in RE when a base runner is eliminated from first, advances from first to second, and advances from first to third prior to the end of the PA (technically prior to the beginning of the PA). The .013, .044, and .002 are the frequencies of those base running events.

For the base stealer, we have .085 (thrown out) times -.216 + .199 (advance to 2nd) * .109 + .025 (advance to 3rd) * .157, or .0117. So the net advantage to the base stealer from advancing or being thrown is .0117 minus .0023, or .014 runs per PA.

What about the advantage to the prolific and presumably fast base stealers from advancing on hits? The above .014 runs was from advances prior to the completion of the PA, from SB, CS, pick-offs, balks, errors, WP, and PB.

The base stealer advances the extra base from first on a single 13.5% more often and 21.7% more often on a double. Part of that is from being on the move and part of that is from being faster.

12.5% of the time, there is a single with a base stealing threat on first. He advances the extra base 13.5% more often, but the extra base with 2 outs is only worth .04 runs, so the gain is negligible (.0007 runs).

A runner on second and a single occurs 2.8% of the time with a stolen base threat on base. The base stealer advances the extra base and scores 14.6% more often than the non-threat for a gain of .73 runs (being able to score from second on a 2-out single is extremely valuable), for a total gain of .73 * .028 * .146, or .003 runs.

With a runner on first and a double, the base stealer gains an extra .0056 runs.

So, the total base running advantage when the runner on first is a stolen base threat is .00925 runs per PA. Add that to the SB/CS advantage of .014 runs, and we get a grand total of .023 runs.

Remember that the overall RE advantage was .050 runs, so if we subtract out the base runner advantage, we get a presumed advantage to the batter of .050 – .023, or .027 runs per PA. That is around 31 points in wOBA.

So let’s recap what we found. For each of no outs, 1 out, and 2 outs, we computed the average change in RE for every batter with a base stealer on first (at the beginning of the PA) and a non-base stealer on first. That tells us the value of the PA from the batter and the base runner combined. (That is RE24, by the way.) We expect that this number will be higher with base stealers, otherwise what is the point of being a base stealer in the first place if you are not giving your team an advantage?

Table I – Overall net value of having a prolific and disruptive base stealing threat on first base at the beginning of the PA, the value of his base stealing and base running, and the presumed value to the batter in terms of any “distraction effect.” Plus is good for the offense and minus good for the defense.

Outs

Overall net value

SB and base running value

“Batter distraction” value

0

.032 runs (per PA)

.034 runs

-.002 runs (-2 points of wOBA)

1

-.002 runs

.019

-.21 runs (-24 pts)

2

.050 runs

.023

+ .027 (31 pts)

We found that very much to be the case with no outs and with 2 outs, but not with 1 out. With no outs, the effect of a prolific base runner on first was .032 runs per PA, the equivalent of raising the batter’s wOBA by 37 points, and with 2 outs, the overall effect was .050 runs, the equivalent of an extra 57 points for the batter. With 1 out, however, the prolific base stealer is in effect lowering the wOBA of the batter by 2 points. Remember that these numbers include the base running and base stealing value of the runner as well as any “distraction effect” that a base stealer might have on the batter, positive or negative. In other words, RE24 captures the influence of the batter as well we the base runners.

In order to estimate the effect on the batter component, we can “back out” the base running value by looking at how often the various base running events occur and their value in terms of the “before and after” RE change. When we do that, we find that with 0 outs there is no effect on the batter from a prolific base stealer starting on first base. With 1 out, there is a 24 point wOBA disadvantage to the batter, and with 2 outs, there is a 31 point advantage to the batter. Overall, that leaves around a 3 or 4 point negative effect on the batter. Given the relatively small sample sizes of this study, one would not want to reject the hypothesis that having a prolific base stealer on first base has no net effect on the batter’s performance. Why the effect depends so much on the number of outs, and what if anything managers and players can do to mitigate or eliminate these effects, I will leave for the reader to ponder.

Those of you who follow me on Twitter know that I am somewhat obsessed with how teams (managers) construct their lineups. With few exceptions, managers tend to do two things when it comes to setting their daily lineups: One, they follow more or less the traditional model of lineup construction, which is to put your best overall offensive player third, a slugger fourth, and scrappy, speedy players in the one and/or two holes. Two, monkey with lineups based on things like starting pitcher handedness (relevant), hot and cold streaks, and batter/pitcher matchups, the latter two generally being not so relevant. For example, in 2012, the average team used 122 different lineups.

If you have read The Book (co-authored by Yours Truly, Tom Tango and Andy Dolphin), you may remember that the optimal lineup differs from the traditional one. According to The Book, a team’s 3 best hitters should bat 1,2, and 4, and the 4th and 5th best hitters 3 and 5. The 1 and 2 batters should be more walk prone than the 4 and 5 hitters. Slots 6 through 9 should feature the remaining hitters in more or less descending order of quality. As we know, managers violate or in some cases butcher this structure by batting poor, sometimes awful hitters, in the 1 and 2 holes, and usually slotting their best overall hitter third. They also sometimes bat a slow, but good offensive player, often a catcher, down in the order.

In addition to these guidelines, The Book suggests placing good base stealers in front of low walk, and high singles and doubles hitters. That often means the 6 hole rather than the traditional 1 and 2 holes in which managers like to put their speedy, base stealing players. Also, because the 3 hole faces a disproportionate number of GDP opportunities, putting a good hitter who hits into a lot of DP, like a Miguel Cabrera, into the third slot can be quite costly. Surprisingly, a good spot for a GDP-prone hitter is leadoff, where a hitter encounters relatively few GDP opportunities.

Of course, other than L/R considerations (and perhaps G/F pitcher/batter matchups for extreme players) and when substituting one player for another, optimal lineups should rarely if ever change. The notion that a team has to use 152 different lineups (like TB did in 2012) in 162 games, is silly at best, and a waste of a manager’s time and sub-optimal behavior at worst.

Contrary to the beliefs of some sabermetric naysayers, most good baseball analysts and sabermetricians are not unaware of or insensitive to the notion that some players may be more or less happy or comfortable in one lineup slot or another. In fact, the general rule should be that player preference trumps a “computer generated” optimal lineup slot. That is not to say that it is impossible to change or influence a player’s preferences.

For those of you who are thinking, “Batting order doesn’t really matter, as long as it is somewhat reasonable,” you are right and you are wrong. It depends on what you mean by “matter.” It is likely that in most cases the difference between a prevailing, traditional order and an optimal one, not-withstanding any effect from player preferences, is on the order of less than 1 win (10 or 11 runs) per season; however, teams pay on the free agent market over 5 million dollars for a player win, so maybe those 10 runs do “matter.” We also occasionally find that the difference between an actual and optimal lineup is 2 wins or more. In any case, as the old sabermetric saying goes, “Why do something wrong, when you can do it right?” In other words, in order to give up even a few runs per season, there has to be some relevant countervailing and advantageous argument, otherwise you are simply throwing away potential runs, wins, and dollars.

Probably the worst lineup offense that managers commit is putting a scrappy, speedy, bunt-happy, bat-control, but poor overall offensive player in the two hole. Remember that The Book (the real Book) says that the second slot in the lineup should be reserved for one of your two best hitters, not one of your worst. Yet teams like the Reds, Braves, and the Indians, among others, consistently put awful hitting, scrappy players in the two-hole. The consequence, of course, is that there are fewer base runners for the third and fourth hitters to drive in, and you give an awful hitter many more PA per season and per game. This might surprise some people, but the #2 hitter will get over 100 more PA than the #8 hitter, per 150 games. For a bad hitter, that means more outs for the team with less production. It is debatable what else a poor, but scrappy hitter batting second brings to the table to offset those extra empty 100 PA.

The other mistake (among many) that managers make in constructing what they (presumably) think is an optimal order is using current season statistics, and often spurious ones like BA and RBI, rather than projections. I would venture to guess that you can count on one hand, at best, the number of managers that actually look at credible projections when making decisions about likely future performance, especially 4 or 5 months into the season. Unless a manager has a time machine, what a player has done so far during the season has nothing to do with how he is likely to do in the upcoming game, other than how those current season stats inform an estimate of future performance. While it is true that there is obviously a strong correlation between 4 or 5 months past performance and future performance, there are many instances where a hitter is projected as a good hitter but has had an awful season thus far, and vice versa. If you have read my previous article on projections, you will know that projections trump seasonal performance at any point in the season (good projections include current season performance to-date – of course). So, for example, if a manager sees that a hitter has a .280 wOBA for the first 4 months of the season, despite a .330 projection, and bats him 8th, he would be making a mistake, since we expect him to bat like a .330 hitter and not a .280 hitter, and in fact he does, according to an analysis of historical player seasons (again, see my article on projections).

Let’s recap the mistakes that managers typically make in constructing what they think are the best possible lineups. Again, we will ignore player preferences and other “psychological factors” not because they are unimportant, but because we don’t know when a manager might slot a player in a position that even he doesn’t think is optimal in deference to that player. The fact that managers constantly monkey with lineups anyway suggests that player preferences are not that much of a factor. Additionally, more often than not I think, we hear players say things like, “My job is to hit as well as I can wherever the manager puts me in the lineup.” Again, that is not to say that some players don’t have certain preferences and that managers shouldn’t give some, if not complete, deference to them, especially with veteran players. In other words, an analyst advising a team or manager should suggest an optimal lineup taking into consideration player preferences. No credible analyst is going to say (or at least they shouldn’t), “I don’t care where Jeter is comfortable hitting or where he wants to hit, he should bat 8th!”

Managers typically follow the traditonal batting order philosophy which is to bat your best hitter 3rd, your slugger 4th, and fast, scrappy, good-bat handlers 1 or 2, whether they are good overall hitters or not. This is not nearly the same as an optimal batting order, based on extensive computer and mathematical research, which suggest that your best hitter should bat 2 or 4, and that you need to put your worst hitters at the bottom of the order in order to limit the number of PA they get per game and per season. Probably the biggest and most pervasive mistake that managers make is slotting terrible hitters at the top, especially in the 2-hole. Managers also put too many base stealers in front of power hitters and hitters who are prone to the GDP in the 3 hole.

Finally, managers pay too much attention (they should pay none) to short term and seasonal performance as well as specific batter/pitcher past results when constructing their batting orders. In general, your batting order versus lefty and righty starting pitchers should rarely change, other than when substituting/resting players, or occasionally when player projections significantly change, in order to suit certain ballparks or weather conditions, or extreme ground ball or fly ball opposing pitchers (and perhaps according to the opposing team’s defense). Other than L/R platoon considerations (and avoiding batting consecutive lefties if possible), most of these other considerations (G/F, park, etc.) are marginal at best.

With that as a background and primer on batting orders, here is what I did: I looked at all 30 teams’ lineups as of a few days ago. No preference was made for whether the opposing pitcher was right or left-handed or whether full-time starters or substitutes were in the lineup on that particular day. Basically these were middle of August random lineups for all 30 teams.

The first thing I did was to compare a team’s projected runs scored based on adding up each player’s projected linear weights in runs per PA and then weighting each lineup slot by its average number of PA per game, to the number of runs scored using a game simulator and those same projections. For example, if the leadoff batter had a linear weights projection of -.01 runs per PA, we would multiply that by 4.8 since the average number of PA per game for a leadoff hitter is 4.8. I would do that for every player in the lineup in order to get a total linear weights for the team. In the NL, I assumed an average hitting pitcher for every team. I also added in every player’s base running (not base stealing) projected linear weights, using the UBR (Ultimate Base Running) stat you see on Fangraphs. The projections I used were my own. They are likely to be similar to those you see on Fangraphs, The Hardball Times, or BP, but in some cases they may be different.

In order to calculate runs per game in a simulated fashion, I ran a simple game simulator which uses each player’s projected singles, doubles, triples, HR, UIBB+HP, ROE, G/F ratio, GDP propensity, and base running ability. No bunts, steals or any in-game strategies (such as IBB) were used in the simulation. The way the base running works is this: Every player is assigned a base running rating from 1-5, based on their base running projections in runs above/below average (typically from -5 to +5 per season). In the simulator, every time a base running opportunity is encountered, like how many bases to advance on a single or double, or whether to score from third on a fly ball, it checks the rating of the appropriate base runner and makes an adjustment. For example, on an outfield single with a runner on first, if the runner is rated as a “1” (slow and/or poor runner), he advances to third just 18% of the time, whereas if he is a “5”, he advances 2 bases 41% of the time. The same thing is done with a ground ball and a runner on first (whether he is safe at second and the play goes to first), a ground ball, runner on second, advances on hits, tagging up on fly balls, and advancing on potential wild pitches, passed balls, and errors in the middle of a play (not ROE).

Keep in mind that a lineup does 2 things. One, it gives players at the top more PA than players at the bottom, which is a pretty straightforward thing. Because of that, it should be obvious that you want your best hitters batting near the top and your worst near the bottom. But, if that were the only thing that lineups “do,” then you would simply arrange the lineup in a descending order of quality. The second way that a lineup creates runs is by each player interacting with other players, especially those near them in the order. This is very tricky and complex. Although a computer analysis can give us rules of thumb for optimal lineup construction, as we do in The Book, it is also very player dependent, in terms of each player’s exact offensive profile (again, ignoring things like player preferences or abilities of players to optimize their approach to each lineup slot). As well, if you move one player from one slot to another, you have to move at least one other player. When moving players around in order to create an optimal lineup, things can get very messy. As we discuss in The Book, in general, you want on base guys in front of power hitters and vice versa, good base stealers in front of singles hitters with low walk totals, high GDP guys in the one hole or at the bottom of the order, etc. Basically, constructing an optimal batting order is impossible for a human being to do. If any manager thinks he can, he is either lying or fooling himself. Again, that is not to say that a computer can necessarily do a better job. As with most things in MLB, the proper combination of “scouting and stats” is usually what the doctor ordered.

In any case, adding up each player’s batting and base running projected linear weights, after controlling for the number of PA per game in each batting slot, is one way to project how many runs a lineup will score per game. Running a simulation using the same projections is another way which also captures to some extent the complex interactions among the players’ offensive profiles. Presumably, if you just stack hitters from best to worst, the “adding up the linear weights” method will result in the maximum runs per game, while the simulation should result in a runs per game quite a bit less, and certainly less than with an optimal lineup construction.

I was curious as to the extent that the actual lineups I looked at optimized these interactions. In order to do that, I compared one method to the other. For example, for a given lineup, the total linear weights prorated by number of PA per game might be -30 per 150 games. That is a below average offensive lineup by 30/150 or .2 runs per game. If the lineup simulator resulted in actual runs scored of -20 per 150 games, presumably there were advantageous interactions among the players that added another 10 runs. Perhaps the lineup avoided a high GDP player in the 3-hole or perhaps they had high on base guys in front of power hitters. Again, this has nothing to do with order per se. If a lineup has poor hitters batting first and/or second, against the advice given in The Book, both the linear weights and the simulation methods would bear the brunt of that poor construction. In fact, if those poor hitters were excellent base runners and it is advisable to have good base runners at the top of the order (and I don’t know that it is), then presumably the simulation should reflect that and perhaps create added value (more runs per game) as compared to the linear weights method of projecting runs per game.

The second thing I did was to try and use a basic model for optimizing each lineup, using the prescriptions in The Book. I then re-ran the simulation and re-calculated the total linear weights to see which teams could benefit the most from a re-working of their lineup, at least based on the lineups I chose for this analysis. This is probably the more interesting query. For the simulations, I ran 100,000 games per team, which is actually not a whole lot of games in terms of minimizing the random noise in the resultant average runs per game. One standard error in runs per 150 games is around 1.31. So take these results with a grain or two of salt.

In the NL, here are the top 3 and bottom 3 teams in terms of additional or fewer runs that a lineup simulation produced, as compared to simply adding up each player’s projected batting and base running runs, adjusting for the league average number of PA per game for each lineup slot.

Top 3

Team

Linear Weights

Lineup Simulation

Gain per 150 games

ARI

-97

-86

11

COL

-23

-13

10

PIT

10

17

6

Here are those lineups:

ARI

Inciarte

Pennington

Peralta

Trumbo

Hill

Pacheco

Marte

Gosewisch

COL

Blackmon

Stubbs

Morneau

Arenado

Dickerson

Rosario

Culberson

Lemahieu

PIT

Harrison

Polanco

Martin

Walker

Marte

Snider

Davis

Alvarez

Bottom 3

Team

Linear Weights

Lineup Simulation

Gain per 150 games

LAD

43

28

-15

SFN

35

27

-7

WAS

42

35

-7

LAD

Gordon

Puig

Gonzalez

Kemp

Crawford

Uribe

Ellis

Rojas

SFN

Pagan

Pence

Posey

Sandoval

Morse

Duvall

Panik

Crawford

WAS

Span

Rendon

Werth

Laroche

Ramos

Harper

Cabrera

Espinosa

In “optimizing” each of the 30 lineups, I used some simple criteria. I put the top two overall hitters in the 2 and 4 holes. Whichever of the two had the greatest SLG batted 4th. The next two best hitters batted 1 and 3, with the highest SLG in the 3 hole. From 5 through 8 or 8, I simply slotted them in descending order of quality.

Here is a comparison of the simple “optimal” lineup to the lineups that the teams actually used. Remember, I am using the same personnel and changing only the batting orders.

Before giving you the numbers, the first thing that jumped out at me was how little most of the numbers changed. Conventional, and even most sabermetric, thought is that any one reasonable lineup is usually just about as good as any other, give or take a few runs. As well, a good lineup must strike a balance between putting better hitters at the top of the lineup, and those who are good base runners but poor overall hitters.

The average absolute difference between the runs per game generated by the simulator from the actual and the “optimal” lineup was 3.1 runs per 150 games per team. Again, keep in mind that much of that is noise since I am running only 100,000 games per team, which generates a standard error of something like 1.3 runs per 150 games.

The kicker, however, is that the “optimal” lineups, on the average, only slightly outperformed the actual ones, by only 2/3 of a run per team. Essentially there was no difference between the lineups chosen by the managers and ones that were “optimized” according to the simple rules explained above. Keep in mind that a real optimization – one that tried every possible batting order configuration and chose the best one – would likely generate better results.

That being said, here are the teams whose actual lineups out-performed and were out-performed by the “optimal” ones:

Most sub-optimal lineups

Team

Actual Lineup Simulation Results (Runs per 150)

“Optimal” Lineup Simulation Results

Gain per 150 games

STL

62

74

12

ATL

31

37

6

CLE

-33

-27

6

MIA

7

12

5

Here are those lineups. The numbers after each player’s name represents their projected batting runs per 630 PA (around 150 games). Keep in mind that these lineups faced either RH or LH starting pitchers. When I run my simulations, I am using overall projections for each player which do not take into consideration the handedness of the batter or any opposing pitcher.

Cardinals

Name

Projected Batting runs

Carpenter

30

Wong

-11

Holliday

26

Adams

14

Peralta

7

Pierz

-10

Jay

17

Robinson

-18

Here, even though we have plenty of good bats in this lineup, Matheny prefers to slot one of the worst in the two hole. Many managers just can’t resist doing so, and I’m not really sure why, other than it seems to be a tradition without a good reason. Perhaps it harkens back to the day when managers would often sac bunt or hit and run after the leadoff hitter reached base with no outs. It is also a mystery why Jay bats 7th. He is even having a very good year at the plate, so it’s not like his seasonal performance belies his projection.

What if we swap Wong and Jay? That generates 69 runs above average per 150 games, which is 7 runs better than with Wong batting second, and 5 runs worse than my original “optimal” lineup. Let’s try another “manual” optimization. We’ll put Jay lead off, followed by Carp, Adams, Holliday, Peralta, Wong, Pierz, and Robinson. That lineup produces 76 runs above average, 14 runs better than the actual one, and better than my computer generated simple “optimal” one. So for the Cardinals, we’ve added 1.5 wins per season just by shuffling around their lineup, and especially by removing a poor hitter from the number 2 slot and moving up a good hitter in Jay (and who also happens to be an excellent base runner).

Braves

Name

Projected Batting runs

Heyward

23

Gosselin

-29

Freeman

24

J Upton

20

Johnson

9

Gattis

-1

Simmons

-16

BJ Upton

-13

Our old friend Fredi Gonzalez finally moved BJ Upton from first to last (and correctly so, although he was about a year too late), he puts Heyward at lead off, which is pretty radical, yet he somehow bats one of the worst batters in all of baseball in the 2-hole, accumulating far too many outs at the top of the order. If we do nothing but move Gosselin down to 8th, where he belongs, we generate 35 runs, 4 more than with him batting second. Not a huge difference, but 1/2 win is a half a win. They all count and they all add up.

Indians

Name

Projected Batting runs

Kipnis

5

Aviles

-19

Brantley

13

Santana

6

Gomes

8

Rayburn

-9

Walters

-13

Holt

-21

Jose Ramirez

-32

The theme here is obvious. When a team puts a terrible hitter in the two-hole, they lose runs, which is not surprising. If we merely move Aviles down to the 7 spot and move everyone up accordingly, the lineup produces -28 runs rather than -33 runs, a gain of 5 runs just by removing Aviles from the second slot.

Marlins

Name

Projected Batting runs

Yelich

15

Solano

-21

Stanton

34

McGhee

-8

Jones

-10

Salty

0

Ozuna

4

Hechavarria

-27

With the Fish, we have an awful batter in the two hole, a poor hitter in the 4 hole, and decent batters in the 6 and 7 hole. What if we just swap Solano for Ozuna, getting that putrid bat out of the 2 hole? Running another simulation results in 13 runs above average per 150 games, besting the actual lineup by 6 runs.

Just for the heck of it, let’s rework the entire lineup, putting Ozuna in the 2 hole, Salty in the 3 hole, Stanton in the 4 hole, then McGhee, Jones, Solano, and Hechy. Surpisingly, that only generates 12 runs above average per 150, better than their actual lineup, but slightly worse than just swapping Solano and Ozuna. The achilles heel for that lineup, as it is for several others, appears to be the poor hitter batting second.

Most optimal lineups

Team

Actual Lineup Simulation Results (Runs per 150)

“Optimal” Lineup Simulation Results

Gain per 150 games

LAA

160

153

-7

SEA

45

39

-6

DET

13

8

-5

TOR

86

82

-4

Finally, let’s take a look at the actual lineups that generate more runs per game than my simple “optimal” batting order.

Angels

Name

Projected Batting runs

Calhoun

20

Trout

59

Pujols

7

Hamilton

17

Kendrick

10

Freese

8

Aybar

0

Iannetta

2

Cowgill

-7

Mariners

Name

Projected Batting runs

Jackson

11

Ackley

-3

Cano

35

Morales

1

Seager

13

Zunino

-14

Morrison

-2

Chavez

-24

Taylor

-2

Tigers

Name

Projected Batting runs

Davis

-2

Kinsler

6

Cabrera

50

V Martinez

17

Hunter

10

JD Martinez

-4

Castellanos

-20

Holaday

-44

Suarez

-23

Blue Jays

Name

Projected Batting runs

Reyes

11

Cabrera

15

Bautista

34

Encarnacion

20

Lind

6

Navarro

-7

Rasmus

-1

Valencia

-9

Lawasaki

-23

Looking at all these “optimal” lineups, the trend is pretty clear. Bat your best hitters at the top and your worst at the bottom, and do NOT put a scrappy, no-hit batter in the two hole! The average projected linear weights per 150 games for the number two hitter in our 4 best actual lineups is 19.25 runs. The average 2-hole hitter in our 4 worst lineups is -20 runs. That should tell you just about everything you need to know about lineups construction.

Note: According to The Book, batting your pitcher 8th in an NL lineup generates slightly more runs per game than batting him 9th, as most managers do. Tony LaRussa sometimes did this, especially with McGwire in the lineup. Other managers, like Maddon, occasionally do the same. There is some controversy over which option is optimal.

When I ran my simulations above, swapping the pitcher and the 8th hitter in the NL lineups. the resultant runs per game were around 2 runs worse (per 150) than with the traditional order. It probably depends on who the position player is at the bottom of the order and perhaps on the players at the top of the order as well.

Yesterday I looked at how and whether a hitter’s mid-season-to-date stats can help us to inform his rest-of-season performance, over and above a credible up-to-date mid-season projection. Obviously the answer to that depends on the quality of the projection – specifically how well it incorporates the season-to-date data in the projection model.

For players who were having dismal performances after the first, second, third, all the way through the fifth month of the season, the projection accurately predicted the last month’s performance and the first 5 months of data added nothing to the equation. In fact, those players who were having dismal seasons so far, even into the last month of the season, performed fairly admirably the rest of the way – nowhere near the level of their season-to-date stats. I concluded that the answer to the question, “When should we worry about a player’s especially poor performance?” was, “Never. It is irrelevant other than how it influences our projection for that player, which is not much, apparently.” For example, full-time players who had a .277 wOBA after the first month of the season, were still projected to be .342 hitters, and in fact, they hit .343 for the remainder of the season. Even halfway through the season, players who hit .283 for 3 solid months were still projected at .334 and hit .335 from then on. So, ignore bad performances and simply look at a player’s projection if you want to estimate his likely performance tomorrow, tonight, next week, or for the rest of the season.

On the other hand, players who have been hitting well-above their mid-season projections (crafted after and including the hot hitting) actually outhit their projections by anywhere from 4 to 16 points, still nowhere near the level of their “hotness,” however. This suggests that the projection algorithm is not handling recent “hot” hitting properly – at least my projection algorithm. Then again, when I looked at hitters who were projected at well-above average 2 months into the season, around .353, the hot ones and the cold ones each hit almost exactly the same over the rest of the season, equivalent to their respective projections. In that case, how they performed over those 3 months gave us no useful information beyond the mid-season projection. In one group, the “cold” group, players hit .303 for the first 2 months of the season, and they were still projected at .352. Indeed, they hit .349 for the rest of the season. The “hot” batters hit .403 for the first 2 months, they were projected to hit .352 after that and they did indeed hit exactly .352. So there would be no reason to treat these hot and cold above-average hitters any differently from one another in terms of playing time or slot in the batting order.

Today, I am going to look at pitchers. I think the perception is that because pitchers get injured more easily than position players, learn and experiment with new and different pitches, often lose velocity, their mechanics can break down, and their performance can be affected by psychological and emotional factors more easily than hitters, that early or mid-season “trends” are important in terms of future performance. Let’s see to what extent that might be true.

After one month, there were 256 pitchers or around 1/3 of all qualified pitchers (at least 50 TBF) who pitched terribly, to the tune of a normalized ERA (NERA) of 5.80 (league average is defined as 4.00). I included all pitchers whose NERA was at least 1/2 run worse than their projection. What was their projection after that poor first month? 4.08. How did they pitch over the next 5 months? 4.10. They faced 531 more batters over the last 5 months of the season.

What about the “hot” pitchers? They were projected after one month at 3.86 and they pitched at 2.56 for that first month. Their performance over the next 5 months was 3.85. So for the “hot” and “cold” pitchers after one month, their updated projection accurately told us what to expect for the remainder of the season and their performance to-date was irrelevant.

In fact, if we look at pitchers who had good projections after one month and divide those into two groups: One that pitches terribly for the first month, and one that pitches brilliantly for the first month, here is what we get:

Good pitchers who were cold for 1 month

First month: 5.38
Projection after that month: 3.79
Performance over the last 5 months: 3.75

Good pitchers who were hot for 1 month

First month: 2.49
Projection after that month: 3.78
Performance over the last 5 months: 3.78

So, and this is critical, one month into the season if you are projected to pitch above average, at, say 3.78, it makes no difference whether you have pitched great or terribly thus far. You are going to pitch at exactly your projection for the remainder of the season!

Yet the cold group faced 587 more batters and the hot group 630. Managers again are putting too much emphasis in those first month’s stats.

What if you are projected after one month as a mediocre pitcher but you have pitched brilliantly or poorly over the first month?

Bad pitchers who were cold for 1 month

First month: 6.24
Projection after that month: 4.39
Performance over the last 5 months: 4.40

Bad pitchers who were hot for 1 month

First month: 3.06
Projection after that month: 4.39
Performance over the last 5 months: 4.47

Same thing. It makes no difference whether a poor or mediocre pitcher had pitched well or poorly over the first month of the season. If you want to know how he is likely to pitch for the remainder of the season, simply look at his projection and ignore the first month. Those stats give you no more useful information. Again, the “hot” but mediocre pitchers got 44 more TBF over the final 5 months of the season, despite pitching exactly the same as the “cold” group over that 5 month period.

What about halfway into the season? Do pitchers with the same mid-season projection but one group was “hot” over the first 3 months and the other group was “cold,” pitch the same for the remaining 3 months? The projection algorithm does not handle the 3-month anomalous performances very well. Here are the numbers:

Good pitchers who were cold for 3 months

First month: 4.60
Projection after 3 months: 3.67
Performance over the last 3 months: 3.84

Good pitchers who were hot for 3 months

First month: 2.74
Projection after 3 months: 3.64
Performance over the last 3 months: 3.46

So for the hot pitchers the projection is undershooting them by around .18 runs per 9 IP and for the cold ones, it is over-shooting them by .17 runs per 9. Then again the actual performance is much closer to the projection than to the season-to-date performance. As you can see, mid-season pitcher stats halfway through the season are a terrible proxy for true talent/future performance. These “hot” and “cold” pitchers whose first half performance and second half projections were divergent by at least .5 runs per 9, performed in the second half around .75 runs per 9 better or worse than in the first half. You are much better off using the mid-season projection than the actual first-half performance.

For poorer pitchers who were “hot” and “cold” for 3 months, we get these numbers:

Poor pitchers who were cold for 3 months

First month: 5.51
Projection after 3 months: 4.41
Performance over the last 3 months: 4.64

Poor pitchers who were hot for 3 months

First month: 3.53
Projection after 3 months: 4.43
Performance over the last 3 months: 4.33

The projection model is still not giving enough weight to the recent performance, apparently. That is especially true of the “cold” pitchers. It over values them by .23 runs per 9. It is likely that these pitchers are suffering some kind of injury or velocity decline and the projection algorithm is not properly accounting for that. For the “hot” pitchers, the model only undervalues these mediocre pitchers by .1 runs per 9. Again, if you try and use the actual 3-month performance as a proxy for true talent or to project their future performance, you would be making a much bigger mistake – to the tune of around .8 runs per 9.

What about 5 months into the season? If the projection and the 5 month performance is divergent, which is better? Is using those 5 month stats a bad idea?

Yes, it still is. In fact, it is a terrible idea. For some reason, the projection does a lot better after 5 months than after 3 months. Perhaps some of those injured pitchers are selected out. Even though the projection slightly under and over values the hot and cold pitchers, using their 5 month performance as a harbinger of the last month is a terrible idea. Look at these numbers:

Poor pitchers who were cold for 5 months

First month: 5.45
Projection after 5 months: 4.41
Performance over the last month: 4.40

Poor pitchers who were hot for 5 months

First month: 3.59
Projection after 5 months: 4.39
Performance over the last month: 4.31

For the mediocre pitchers, the projection almost nails both groups, despite it being nowhere near the level of the first 5 months of the season. I cannot emphasize this enough: Even 5 months into the season, using a pitcher’s season-to-date stats as a predictor of future performance or a proxy for true talent (which is pretty much the same thing) is a terrible idea!

Look at the mistakes you would be making. You would be thinking that the hot group were comprised of 3.59 pitchers when in fact they were 4.40 pitchers who performed as such. That is a difference of .71 runs per 9. For your cold pitchers, you would undervalue them by more than a run per 9! What do managers do after 5 months of “hot” and “cold” pitching, despite the fact that both groups pitched almost the same for the last month of the season? They gave the hot group an average of 13 more TBF per pitcher. That is around a 3 inning difference in one month.

Here are the good pitchers who were hot and cold over the first 5 months of the season:

Good pitchers who were cold for 5 months

First month: 4.62
Projection after 5 months: 3.72
Performance over the last month: 3.54

Good pitchers who were hot for 5 months

First month: 2.88
Projection after 5 months: 3.71
Performance over the last month: 3.72

Here the “hot,” good pitchers pitched exactly at their projection despite pitching at .83 runs per 9 better over the first 5 months of the season. The “cold” group actually outperformed their projection by .18 runs and pitched better than the “hot” group! This is probably a sample size blip, but the message is clear: Even after 5 months, forget about how your favorite pitcher has been pitching, even for most of the season. The only thing that counts is his projection, which utilizes many years of performance plus a regression component, and not just 5 months worth of data. It would be a huge mistake to use those 5 month stats to predict these pitchers’ performances.

Managers can learn a huge lesson from this. The average number of batters faced in the last month of the season among the hot pitchers was 137, or around 32 IP. For the cold group, it was 108 TBF, or 25 IP. Again, the “hot” group pitched 7 more IP in only a month, yet they pitched worse than the “cold” group and both groups had the same projection!

The moral of the story here is that for the most part, and especially at the beginning and end of the season, ignore actual pitching performance to-date and use credible mid-season projections if you want to predict how your favorite or not-so favorite pitcher is likely to pitch tonight or over the remainder of the season. If you don’t, and that actual performance is significantly different from the updated projection, you are making a sizable mistake.

Recently on twitter I have been harping on the folly of using a player’s season-to-date stats, be it OPS, wOBA, RC+, or some other metric, for anything other than, well, how they have done so far. From a week into the season until the last pitch is thrown in November, we are inundated with articles and TV and radio commentaries about how so and so should be getting more playing time because his OPS is .956 or how player X should be benched or at least dropped in the order because he hitting .245 (in wOBA). Commentators, writers, analysts and fans wonder whether player Y’s unusually great or poor performance is “sustainable,” whether it is a “breakout” likely to continue, an age or injury related decline that portends an end to a career or a temporary blip after said injury is healed.

With web sites such as Fangraphs.com allowing us to look up a player’s current, up-to-date projections which already account for season-to-date performance, the question that all these writers and fans must ask themselves is, “Do these current season stats offer any information over and above the projections that might be helpful in any future decisions, such as whom to play or where to slot a player in the lineup, or simply whom to be optimistic or pessimistic about on your favorite team?”

Sure, if you don’t have a projection for a player, and you know nothing about his history or pedigree, a player’s season-to-date performance tells you something about what he is likely to do in the future, but even then, it depends on the sample size of that performance – at the very least you must regress that performance towards the league mean, the amount of regression being a function of the number of opportunities (PA) underlying the seasonal stats.

However, it is so easy for virtually anyone to look up a player’s projection on Fangraphs, Baseball Prospectus, The Hardball Times, or a host of other fantasy baseball web sites, why should we care about those current stats other than as a reflection of what a certain player has accomplished thus far in the season? Let’s face it, 2 or 3 months into the season, if a player who is projected at .359 (wOBA) is hitting .286, it is human nature to call for his benching, dropping him in the batting order, or simply expecting him to continue to hit in a putrid fashion. Virtually everyone thinks this way, even many astute analysts. It is an example of recency bias, which is one of the most pervasive human traits in all facets of life, including and especially in sports.

Who would you rather have in your lineup – Player A who has a Steamer wOBA projection of .350 but who is hitting .290 4 months into the season or Player B whom Steamer projects at .330, but is hitting .375 with 400 PA in July? If you said, “Player A,” I think you are either lying or you are in a very, very small minority.

Let’s start out by looking at some players whose current projection and season-to-date performance are divergent. I’ll use Steamer ROS (rest-of-season) wOBA projections from Fangraphs as compared to their actual 2014 wOBA. I’ll include anyone who has at least 200 PA and the absolute difference between their wOBA and wOBA projection is at least 40 points. The difference between a .320 and .360 hitter is the difference between an average player and a star player like Pujols or Cano, and the difference between a .280 and a .320 batter is like comparing a light-hitting backup catcher to a league average hitter.

Believe it or not, even though we are 40% into the season, around 20% of all qualified (by PA) players have a current wOBA projection that is more than 39 points greater or less than their season-to-date wOBA.

Now tell the truth: Who would you rather have at the plate tonight or tomorrow, Billy Butler, with his .359 projection and .278 actual, or Carlos Gomez, projected at .334, but currently hitting at .405? How about Hosmer (not to pick on the Royals) or Michael Morse? If you are like most people, you probably would choose Gomez over Butler, despite the fact that he is projected as 25 points worse, and Morse over Hosmer, even though Hosmer is supposedly 10 points better than Morse. (I am ignoring park effects to simplify this part of the analysis.)

So how can we test whether your decision or blindly going with the Steamer projections would likely be the correct thing to do, emotions and recency bias aside? That’s relatively simple, if we are willing to get our hands dirty doing some lengthy and somewhat complicated historical mid-season projections. Luckily, I’ve already done that. I have a database of my own proprietary projections on a month-by-month basis for 2007-2013. So, for example, 2 months into the 2013 season, I have a season-to-date projection for all players. It incorporates their 2009-2012 performance, including AA and AAA, as well as their 2-month performance (again, including the minor leagues) so far in 2013. These projections are park and context neutral. We can then compare the projections with both their season-to-date performance (also context-neutral) and their rest-of-season performance in order to see whether, for example, a player who is projected at .350 even though he has hit .290 after 2 months will perform any differently in the last 4 months of the season than another player who is also projected at .350 but who has hit .410 after 2 months. We can do the same thing after one month (looking at the next 5 months of performance) or 5 months (looking at the final month performance). The results of this analysis should suggest to us whether we would be better off with Butler for the remainder of the season or with Gomez, or with Hosmer or Morse.

I took all players in 2007-2013 whose projection was at least 40 points less than their actual wOBA after one month into the season. They had to have had at least 50 PA. There were 116 such players, or around 20% of all qualified players. Their collective projected wOBA was .341 and they were hitting .412 after one month with an average of 111 PA per player. For the remainder of the season, in a total of 12,922 PA, or 494 PA per player, they hit .346, or 5 points better than their projection, but 66 points worse than their season-to-date performance. Again, all numbers are context (park, opponent, etc.) neutral. One standard deviation in that many PA is 4 points, so a 5 point difference between projected and actual is not statistically significant. There is some suggestion, however, that the projection algorithm is slightly undervaluing the “hot” (as compared to their projection) hitter during the first month of the season, perhaps by giving too little weight to the current season.

What about the players who were “cold” (relative to their projections) the first month of the season? There were 92 such players and they averaged 110 PA during the first month with a .277 wOBA. Their projection after 1 month was .342, slightly higher than the first group. Interestingly, they only averaged 464 PA for the remainder of the season, 30 PA less than the “hot” group, even though they were equivalently projected, suggesting that managers were benching more of the “cold” players or moving them down in the batting order. How did they hit for the remainder of the season? .343 or almost exactly equal to their projection. This suggests that managers are depriving these players of deserved playing time. By the way, after only one month, more than 40% of all qualified players are hitting 40 points better or worse than their projections. That’s a lot of fodder for internet articles and sports talk radio!

You might be thinking, “Well, sure, if a player is “hot” or “cold” after only a month, it probably doesn’t mean anything.” In fact, most commentaries you read or hear will give the standard SSS (small sample size) disclaimer only a month or even two months into the season. But what about halfway into the season? Surely, a player’s season-to-date stats will have stabilized by then and we will be able to identify those young players who have “broken out,” old, washed-up players, or players who have lost their swing or their mental or physical capabilities.

About half into the season, around 9% of all qualified (50 PA per month) players were hitting 40 points or less than their projections in an average of 271 PA. Their collective projection was .334 and their actual performance after 3 months and 271 PA was .283. Basically, these guys, despite being supposed league-average full-time players, stunk for 3 solid months. Surely, they would stink, or at least not be up to “par,” for the rest of the season. After all, wOBA at least starts to “stabilize” after almost 300 PA, right? Well, these guys, just like the “cold” players after one month, hit .335 for the remainder of the season, 1 point better than their projection. So after 1 month or 3 months, their season-to-date performance tells us nothing that our up-to-date projection doesn’t tell us. A player is expected to perform at his projected level regardless of his current season performance after 3 months, at least for the “cold” players. What about the “hot” ones, you know, the ones who may be having a breakout season?

There were also about 9% of all qualified players who were having a “hot” first half. Their collective projection was .339, and their average performance was .391 after 275 PA. How did they hit the remainder of the season? .346, 7 points better than their projection and 45 points worse than their actual performance. Again, there is some suggestion that the projection algorithm is undervaluing these guys for some reason. Again, the “hot” first-half players accumulated 54 more PA over the last 3 months of the season than the “cold” first-half players despite hitting only 11 points better. It seems that managers are over-reacting to that first-half performance, which should hardly be surprising.

Finally, let’s look at the last month of the season as compared to the first 5 months of performance. Do we have a right to ignore projections and simply focus on season-to-date stats when it comes to discussing the future – the last month of the season?

The 5-month “hot” players were hitting .391 in 461 PA. Their projection was .343, and they hit .359 over the last month. So, we are still more than twice as close to the projection than we are to the actual, although there is a strong inference that the projection is not weighting the current season enough or doing something else wrong, at least for the “hot” players.

For the “cold” players, we see the same thing as we do at any point in the season. The season-to-date stats are worthless if you know the projection. 3% of all qualified players (at least 250 PA) hit at least 40 points worse than their projection after 5 months. They were projected at .338, hit .289 for the first 5 months in 413 PA, and then hit .339 in that last month. They only got an average of 70 PA over the last month of the season, as compared to 103 PA for the “hot” batters, despite proving that they were league-average players even though they stunk up the field for 5 straight months.

After 4 months, BTW, “cold” players actually hit 7 points better than their projection for the last 2 months of the season, even though their actual season-to-date performance was 49 points worse. The “hot” players hit only 10 points better than their projection despite hitting 52 points better over the first 4 months.

Let’s look at the numbers in another way. Let’s say that we are 2 months into the season, similar to the present time. How do .350 projected hitters fare for the rest of the season if we split them into two groups: One, those that have been “cold” so far and those that have been “hot.” This is like our Butler or Gomez, Morse or Hosmer question.

I looked at all “hot” and “cold” players who were projected at greater than .330 after 2 months into the season. The “hot” ones, the Carlos Gomez’ and Michael Morse’s, hit .403 for 2 months, and were then projected at .352. How did they hit over the rest of the season? .352.

What about the “cold” hitters who were also projected at greater than .330? These are the Butler’s and Hosmer’s. They hit a collective .303 for the first 2 months of the season, their projection was .352, the same as the “hot” hitters, and their wOBA for the last 4 months was .349! Wow. Both groups of good hitters (according to their projections) hit almost exactly the same. They were both projected at .353 and one group hit .352 and the other hit .349. Of course the “hot” group got 56 more PA per player over the remainder of the season, despite being projected the same and performing essentially the same.

Let’s try those same hitters who are projected at better than .330, but who have been “hot” or “cold” for 5 months rather than only 2.

Cold

Projected: .350 Season-to-date: .311 ROS: .351

Hot

Projected: .354 Season-to-date: .393 ROS: .363

Again, after 5 months, the players projected well who have been hot are undervalued by the projection, but not nearly as much as the season-to-date performance might suggest. Good players who have been cold for 5 months hit exactly as projected and the “cold” 5 months has no predictive value, other than how it changes the up-to-date projection.

For players who are projected poorly, less than a .320 wOBA, the 5-month hot ones outperform their projections and the cold ones under-perform their projections, both by around 8 points. After 2 months, there is no difference – both “hot” and “cold” players perform at around their projected levels over the last 4 months of the season.

So what are our conclusions? Until we get into the last month or two of the season, season-to-date stats provide virtually no useful information once we have a credible projection for a player. For “hot” players, we might “bump” the projection by a few points in wOBA even 2 or 3 months into the season – apparently the projection is slightly under-valuing these players for some reason. However, it does not appear to be correct to prefer a “hot” player like Gomez versus a “cold” one like Butler when the “cold” player is projected at 25 points better, regardless of the time-frame. Later in the season, at around the 4th or 5th month, we might need to “bump” our projection, at least my projection, by 10 or 15 points to account for a torrid first 4 or 5 months. However, the 20 or 25 point better player, according to the projection, is still the better choice.

For “cold” players, season-to-date stats appear to provide no information whatsoever over and above a player’s projection, regardless of what point in the season we are at. So, when should we be worried about a hitter if he is performing far below his “expected” performance? Never. If you want a good estimate of his future performance, simply use his projection and ignore his putrid season-to-date stats.

In the next installment, I am going to look at the spread of performance for hot and cold players. You might hypothesize that while being hot or cold for 2 or 3 months has almost no effect on the next few months of performance, perhaps it does change the distribution of that performance among the group of hot and cold players.

Note: These are rules of thumb which apply 90-99% of the time (or so). Some of them have a few or even many exceptions and nuances to consider. I do believe, however, that if every manager followed these religiously, even without employing any exceptions or considering any of the nuances, that he would be much better off than the status quo. There are also many other suggestions, commandments, and considerations that I would use, that are not included in this list.

1) Though shalt never use individual batter/pitcher matchups, recent batter or pitcher stats, or even seasonal batter or pitcher stats. Ever. The only thing that this organization uses are projections based on long-term performance. You will use those constantly.

2) Thou shalt never, ever use batting average again. wOBA is your new BA. Learn how to construct it and learn what it means.

3) Thou shalt be given and thou shalt use the following batter/pitcher matchups every game: Each batter’s projection versus each pitcher. They include platoon considerations. Those numbers will be used for all your personnel decisions. They are your new “index cards.”

4) Thou shalt never issue another IBB again, other than obvious late and close-game situations.

5) Thou shalt instruct your batters whether to sacrifice bunt or not, in all sacrifice situations, based on a “commit line.” If the defense plays in front of that line, thy batters will hit away. If they play behind the line, thy batters will bunt. If they are at the commit line, they may do as they please. Each batter will have his own commit line against each pitcher. Some batters will never bunt.

6) Thou shalt never sacrifice with runners at first and third, even with a pitcher at bat. You may squeeze if you want. With 1 out and a runner on 1st only your worst hitting pitchers will bunt.

7) Thou shalt keep thy starter in or remove him based on two things and two things only: One, his pitch count, and two, the number of times he has faced the order. Remember that ALL pitchers lose 1/3 of a run in ERA each time through the order, regardless of how they are pitching thus far.

8) Thou shalt remove thy starter for a pinch hitter in a high leverage situation if he is facing the order for the 3rd time or more, regardless of how he is pitching.

9) Speaking of leverage, thou shalt be given a leverage chart with score, inning, runners, and outs. Use it!

10) Thou shalt, if at all possible, use thy best pitchers in high leverage situations and thy worst pitchers in low leverage situations, regardless of the score or inning. Remember that “best” and “worst” are based on your new “index cards” (batter v. pitcher projections) or your chart which contains each pitcher’s generic projection. It is never based on how they did yesterday, last week, or even the entire season. Thou sometimes may use “specialty” pitchers, such as when a GDP or a K are at a premium.

11) Thou shalt be given a chart for every base runner and several of the most common inning, out, and score situations. There will be a number next to each player’s name for each situation. If the pitcher’s time home plus the catcher’s pop time are less than that number, thy runner will not steal. If it is greater, thy runner may steal. No runner shall steal second base with a lefty pitcher on the mound.

12) Thou shalt not let thy heart be troubled by the outcome of your decisions. No one who works for this team will ever question your decision based on the outcome. Each decision you make is either right, wrong, or a toss-up, before we know, and regardless of, the outcome.

13) Thou shalt be held responsible for your decisions, also regardless of the outcome. If your decisions are contrary to what we believe as an organization, we would like to hear your explanation and we will discuss it with you. However, you are expected to make the right decisions at all times, based on the beliefs and philosophies of the organization. We don’t care what the fans or the media think. We will take care of that. We will all make sure that our players are on the same page as we are.

14) Finally, thou shalt know that we respect and admire your leadership and motivational skills. That is one of the reasons we hired you. However, if you are not on board with our decision-making processes and willing to employ them at all times, please find yourself another team to manage.

Yesterday, I posted an article describing how I modeled to some extent a way to tell whether and by how much pitchers may be able to pitch in such a way as to allow fewer or more runs than their components, including the more subtle ones, like balks, SB/CS, WP, catcher PB, GIDP, and ROE suggest.

For various reasons, I suggest taking these numbers with a grain of salt. For one thing, I need to tweak my RA9 simulator to take into consideration a few more of these subtle components. For another, there may be some things that stick with a pitcher from year to year that have nothing to do with his “RA9 skill” but which serve to increase or decrease run scoring, given the same set of components. Two of these are a pitcher’s outfielder arms and the vagueries of his home park, which both have an effect on base runner advances on hits and outs. Using a pitcher’s actual sac flies against will mitigate this, but the sim is also using league averages for base runner advances on hits, which, as I said, can vary from pitchers to pitcher, and tend to persist from year to year (if a pitcher stays on the same team) based on his outfielders and his home park. Like DIPS, it would be better to do these correlations only on pitchers who switch teams, but I fear that the sample would be too small to get any meaningful results.

Anyway, I have a database now of the last 10 years’ differences between a pitcher’s RA9 and his sim RA9 (the runs per 27 outs generated by my sim), for all pitchers who threw to at least 100 batters in a season.

First here are some interesting categorical observations:

Jared Cross, of Steamer projections, suggested to me that perhaps some pitchers, like lefties, might hold base runners on first base better than others, and therefore depress scoring a little as compared to the sim, which uses league-average base running advancement numbers. Well, lefties actually did a hair worse in my database. Their RA9 was .02 greater than their sim RA. Righties were -.01 better. That does not necessarily mean that RHP have some kind of RA skill that LHP do not have. It is more likely a bias in the sim that I am not correcting for.

How about number of pitches in a pitcher’s repertoire. I hypothesized that pitchers with more pitches would be better able to tailor their approach to the situation. For example, with a base open, you want your pitcher to be able to throw lots of good off-speed pitches in order to induce a strikeout or weak contact, whereas you don’t mind if he walks the batter.

I was wrong. Pitchers with 3 or more pitches that they throw at least 10% of the time are .01 runs worse in RA9. Pitchers with only 2 or fewer pitches, are .02 runs better. I have no idea why that is.

How about pitchers who are just flat out good in their components such that their sim RA is low, like under 4.00 runs? Their RA9 is .04 worse. Again, their might be some bias in the sim which is causing that. Or perhaps if you just go out and there “air it out” and try and get as many outs and strikeouts as possible, regardless of the situation, you are not pitching optimally.

Conversely, pitchers with a sim RA of 4.5 or greater shave .03 points off their RA9. If you are over 5 in your sim RA, your actual RA9 is .07 points better and if you are below 3.5, your RA9 is .07 runs higher. So, there probably is something about having extreme components that even the sim is not picking up. I’m not sure what that could be. Or, perhaps if you are simply not that good of a pitcher, you have to find ways to minimize run scoring above and beyond the hits and walks you allow overall.

For the NL pitchers, their RA9 is .05 runs better than their sim RA, and for the AL, they are .05 runs worse. So the sim is not doing a good job with respect to the leagues, likely because of pitchers batting. I’m not sure why, but I need to fix that. For now, I’ll adjust a pitcher’s sim RA according to his league.

You might think that younger pitchers would be “throwers” and older ones would be “pitchers” and thus their RA skill would reflect that. This time you would be right – to some extent.

Pitchers less than 26 years old were .01 runs worse in RA9. Pitchers older than 30 were .03 better. But that might just reflect the fact that pitchers older than 30 are just not very good – remember, we have a bias in terms of quality of the sim RA and the difference between that and regular RA9.

Actually, even when I control for the quality of the pitcher, the older pitchers had more RA skill than the younger ones by around .02 to .04 runs. As you can see, none of these effects, even if they are other than noise, is very large.

Finally, here are the lists of the 10 best and worst pitchers with respect to “RA skill,” with no commentary. I adjusted for the “quality of the sim RA” bias, as well as the league bias. Again, take these with a large grain of salt, considering the discussion above.

Best, 2004-2013:

Sean Chacon -.18

Steve Trachsel -.18

Francisco Rodriguez -.18

Jose Mijares -.17

Scott Linebrink -.16

Roy Oswalt -.16

Dennys Reyes -.15

Dave Riske -.15

Ian Snell -.15

5 others tied for 10th.

Worst:

Derek Lowe .27

Luke Hochevar .20

Randy Johnson .19

Jeremy Bonderman .18

Blaine Boyer .18

Rich Hill .18

Jason Johnson .18

5 others tied for 8th place.

(None of these pitchers stand out to me one way or another. The “good” ones are not any you would expect, I don’t think.)

We showed in The Book that there is a small but palpable “pitching from the stretch” talent. That of course would effect a pitcher’s RA as compared to some kind of base runner and “timing” neutral measure like FIP or component ERA, or really any of the ERA estimators.

As well, a pitcher’s ability to tailor his approach to the situation, runners, outs, score, batter, etc., would also implicate some kind of “RA talent,” again, as compared to a “timing” neutral RA estimator.

A few months ago I looked to see if RE24 results for pitchers showed any kind of talent for pitching to the situation, by comparing that to the results of a straight linear weights analysis or even a BaseRuns measure. I found no year-to-year correlations for the difference between RE24 and regular linear weights. In other words, I was trying to see if some pitchers were able to change their approach to benefit them in certain bases/outs situations more than other pitchers. I was surprised that there was no discernible correlation, i.e., that it didn’t seem to be much of a skill if at all. You would think that some pitchers would either be smarter than others or have a certain skill set that would enable them, for example, to get more K with a runner on 3rd and less than 2 outs, more walks and fewer hits with a base open, or fewer home runs with runners on base or with 2 outs and no one on base. Obviously all pitchers, on the average, vary their approach a lot with respect to these things, but I found nothing much when doing these correlations. Essentially an “r” of zero.

To some extent the pitching from the stretch talent should show up in comparing RE24 to regular lwts, but it didn’t, so again, I was a little surprised at the results.

Anyway, I decided to try one more thing.

I used my “pitching sim” to compute a component ERA for each pitcher. I tried to include everything that would create or not create runs while he was pitching, like WP/PB, SB/CS, GIDP, roe, in addition to s,d,t,hr,bb, and so. I considered an IBB as a 1/2 BB in the sim, since I didn’t program IBB into it.

So now, for each year, I recorded the difference between a pitcher’s RA9 and his simulated component RA9, and then ran year-to-year correlations. This was again to see if I could find a “RA talent” wherever it may lie – clutch pitching, stretch talent, approach talent, etc.

I got a small year-to-year correlation which, as always, varied with the underlying sample size – TBF in each of the paired years. When I limited it to pitchers with at least 500 TBF in each year, I got an “r” of .142 with an average PA of 791 in each year. That comes out to a 50% regression at around 5000 PA, or 5 years for a full-time starter, similar to BABIP for pitchers. In other words, the “stabilization” point was around 5,000 TBF.

If that .142 is accurate (at 2 sigma the confidence interval is .072 to .211), I think that is pretty interesting. For example, notable “ERA whiz” Tom Glavine from 2001 to 2006, was an average of .246 in RA9 better than his sim RA9 (simulated component RA). If we regress that difference 50%, we get .133 runs per game, which is pretty sizable I think. That is over 1/3 of a win per season. Notable “ERA hack” Ricky Nolasco from 2008 to 2010 (I only looked at 2001-2010) was an average of .357 worse in his ERA. Regress that 62.5%, and we get .134 runs worse per season, also 1/3 of a win.

So, for example, if you want to know how to reconcile fWAR (FG) and bWAR (B-R) for pitchers, take the difference and regress according to the number of TBF, using the formula 5000/(5000+TBF) for the amount of regression.

Here are a couple more interesting ones, off the top of my head. I thought that Livan Hernandez seemed like a crafty pitcher, despite having inferior stuff late in his career. Sure enough, he out-pitched his components by around .164 runs per game over 9 seasons. After regressing, that’s .105 rpg.

The other name that popped into my head was Wakefield. I always wondered if a knuckler was able to pitch to the situation as well as other pitchers could. It does not seem like they can, with only one pitch with comparatively little control. His RA9 was .143 worse than his components suggest, despite his FIP being .3 runs per 9 worse than his ERA! After regressing, he is around .095 worse than his simulated component RA.

Of course, after looking at Wake, we have to check Dickey as well. He didn’t start throwing a knuckle ball until 2005, and then only half the time. His average difference between RA9 and simulated RA9 is .03 on the good side, but our sample size for him is small with a total of only 1600 TBF, implying a regression of 76%.

If you want the numbers on any of your favorite or no-so-favorite pitchers, let me know in the comments section.