Baseball Therapy

Should You Trust the Projections?

Last week, the sabermetric community had—well, not an argument, because the participants were generally professional and cordial to one another, but a debate about what we might expect over the rest of the season from a player who is currently enjoying a hot (or cold) streak. It all started with researcher Mitchel Lichtman (better known by his initials, MGL) posting two articles, one on hitters and one on pitchers, that made the case that we should trust the projection systems rather than expect a player’s recent performance to continue. Remember Charlie Blackmon, who was the best player in baseball for three weeks and was smart enough to make those weeks the first three weeks of the 2014 season? He’s a good example. He had never been anything special, nor was he projected for greatness this year. And in retrospect, his hot streak to start the season looks a lot like a small-sample fluke.

MGL’s methodology was reasonable. He identified hitters who had significantly outperformed their projections in April and then looked to see how well they did from May to September. He found that as a group, their subsequent performance was much closer to their projection than it was to their early-season hot streak, which held true even if he looked at longer stretches of overperformance to start the year. He found roughly the same for hitters who underperformed relative to their projections, and then found roughly the same for pitchers. His conclusion: Don’t get too wrapped up in an early season hot or cold streak. The player will most likely regress.

Dave Cameron followed with a post at FanGraphs, in which he summarized MGL’s work and briefly discussed the fact that the “trust the projection” mantra isn’t 100 percent accurate. He ended with this nugget: “Without perfect information, we’re going to be wrong on some guys. The evidence suggests the conservative path, leaning almost entirely on forecasts and putting little weight on seasonal performance, is the one that is wrong the least.” Again, this is a perfectly reasonable thing to say. The problem isn’t that Messrs. Lichtman and Cameron are wrong. What they’ve said is all factually correct and the product of good, solid thinking. And yes, most of the time, “trust the projection” will be correct. It’s a nice antidote to the hope-fueled longings of fans who swear that while everyone else is “due for a regression,” our guy has “made a significant adjustment to his game” and will sustain this .400 batting average and 60 home run pace all the way through Game 5 of the World Series (because we won’t need seven).

The problem is that we’re asking the wrong question. To understand why, we need an analogy.

Suppose that some serious disease that no one had seen before were making the rounds. Naturally, biomedical researchers and public health officials would be hard at work immediately trying to figure out what was going on and would likely try to develop a test that could pinpoint whether someone was infected with this disease. Early detection of just about anything saves lives. In an ideal world, we’d want the test to get it right every time. If a person really were infected, we’d want the test to say “yes.” If a person were disease free, we’d want the test to say “no.” It’s rare to get a test that’s 100 percent accurate, but that’s the goal.

Now, let’s say that we can reasonably assume, based on some surveillance and epidemiology data, that 10 percent of the population is infected. But which 10 percent? Ah, that’s where I would come to the rescue with my extra super-duper magical test, because I am brilliant. I would simply declare, according to my test, which is actually some stray wires taped to a cardboard box, that no one actually has the disease. And I would be right 90 percent of the time. That’s an A-minus, mom!

Oh right, that doesn’t really help the people who are infected, does it? Okay, instead I'll say "Everyone has the disease!" Now, I have accurately identified everyone who is infected, without missing anyone. I’ve gone up to an A-plus! Yeah, there are cases where that sort of “just assume everyone has the disease” model works in public health, but coming back to baseball, it’s basically like saying in March, “I know that a couple of these 750 players are going to break out this year. I just know it.” Technically, you can take credit for having “called” every breakout in baseball that year, but your original statement isn’t useful.

In public health and statistics, we call this a signal detection problem. A signal detection problem has two parts: something that we’re looking for (the signal) and some test that tries to find it. You can visualize the problem like this:

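|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     |                        |                            |
| Test Says it’s Not There |                        |                            |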

Now, let’s fill in two of those empty squares:

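|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     | Correct Identification |                            |
| Test Says it’s Not There |                        | Correct Rejection          |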

These are the squares that you want to be in. The perfect test would sort everything into these two boxes.

Now, about the other two:

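|                          | Signal is Really There | Signal is Not Really There |
|--------------------------|------------------------|----------------------------|
| Test Says it’s There     |                        | False Positive             |
| Test Says it’s Not There | False Negative         |                            |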

In the context of our test for breakouts, a false positive is the guy who has a hot month. He makes you all excited because this is the new breakout star, and you start talking about him to all your friends so that you can say that you were “on him” earlier than anyone else (or so that you can pretend that your team has a shot at the playoffs). But in mid-May, he turns back into a pumpkin. A false negative, on the other hand, is the one that “we” missed on. “We” all assumed that it was just a hot streak, but it turns out that he had changed.
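To make the four boxes concrete, here is a minimal Python sketch that tallies them for the two cardboard-box "tests" from the disease example. The population of 1,000 is an assumption made for round numbers (only the 10 percent infection rate comes from the example above), and the function names are made up for illustration.

```python
def tally(truth, calls):
    """Count the four signal-detection outcomes for paired truth/test lists."""
    counts = {
        "correct identification": 0,
        "false positive": 0,
        "false negative": 0,
        "correct rejection": 0,
    }
    for is_real, test_says_yes in zip(truth, calls):
        if test_says_yes:
            key = "correct identification" if is_real else "false positive"
        else:
            key = "false negative" if is_real else "correct rejection"
        counts[key] += 1
    return counts


def accuracy(counts):
    """Share of all cases that landed in one of the two 'good' boxes."""
    good = counts["correct identification"] + counts["correct rejection"]
    return good / sum(counts.values())


# Assumed population: 1,000 people, 10 percent of them infected.
truth = [True] * 100 + [False] * 900

nobody = tally(truth, [False] * 1000)    # "no one has the disease"
everybody = tally(truth, [True] * 1000)  # "everyone has the disease"

for name, counts in (("nobody", nobody), ("everybody", everybody)):
    print(name, counts, "accuracy:", accuracy(counts))
```

The "nobody" test is right on 900 of 1,000 cases while catching none of the 100 infected people; the "everybody" test catches all 100 and buries them under 900 false positives. Neither one has sorted anything.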

There are two different kinds of errors that a person can make in a signal detection problem, and in signal detection theory, there are two things that we want to know about a test to determine how good it is. One is a measure of how good a test is at sorting cases into the “good” boxes. This measure, called detectability (often abbreviated d’), is what you really want in a good test. But the other measure of a test is called response bias (often abbreviated with the Greek letter beta). This is a measure of which type of error your test will make more often.

You can think of it in terms of what you might do in a case where you looked at the evidence and found that it wasn’t quite clear whether you should go with “Definitely a breakout” or “Pshaw, just a small-sample fluke.” To which one do you usually give the benefit of the doubt? That’s your response bias. Again, in public health, there are cases where it makes sense to prefer one sort of error over another, but adjusting the response bias isn’t helping you to get more cases correctly classified. It’s just adjusting what sort of errors you will make. Sometimes that’s the only thing that you can do, and it can make the test better, but it’s no substitute for better detectability.
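For anyone who wants to see the two measures computed, here is a rough Python sketch using the standard textbook formulas for d’ and beta under the equal-variance Gaussian model. The two evaluators and their hit and false-alarm rates are invented for illustration; they are not drawn from anyone's actual data.

```python
from math import exp
from statistics import NormalDist


def dprime_and_beta(hit_rate, false_alarm_rate):
    """Return (d', beta) under the equal-variance Gaussian model.

    hit_rate: share of real breakouts the evaluator calls breakouts.
    false_alarm_rate: share of flukes the evaluator calls breakouts.
    Both rates must be strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa                   # detectability
    beta = exp((z_fa ** 2 - z_hit ** 2) / 2)  # likelihood ratio at the cutoff
    return d_prime, beta


# Hypothetical evaluators (made-up rates):
# the skeptic rarely cries "breakout"; the enthusiast cries it often.
skeptic = dprime_and_beta(hit_rate=0.50, false_alarm_rate=0.05)
enthusiast = dprime_and_beta(hit_rate=0.95, false_alarm_rate=0.50)

print("skeptic    d' = %.2f, beta = %.2f" % skeptic)
print("enthusiast d' = %.2f, beta = %.2f" % enthusiast)
```

In this made-up case the two evaluators have the same d’, so neither is the better detector. The skeptic's beta is above 1 (he leans toward "fluke") and the enthusiast's is below 1 (she leans toward "breakout"). They have simply chosen different errors to make. Note that a test that always says "no" has a hit rate and a false-alarm rate of exactly zero, and `inv_cdf` raises an error for those inputs: such a test has no measurable detectability under these formulas.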

Here’s the problem with “Always trust the projections.” It’s also the problem with “Everyone (or no one) has the disease.” We are trying to figure out whether a player who is playing above his head really is breaking out, or if it’s just acne. Going with “always trust the projection” is a way of saying “adjust your response bias toward saying ‘No breakout’ rather than working on making the test a better detector.” If we made a list of players who have exceeded expectations this year (pick whatever definition of that you want), most will probably revert to form, but some really are emerging from their chrysalis and have become beautiful butterflies. Let’s say that 10 percent of them are real breakouts (just picking a number). Saying “small-sample fluke” all the time will be correct 90 percent of the time. And only minimally useful.

The real question that teams are concerned with is the detectability question. Suppose that a team saw a player that was starting to break out at the end of a season and could detect that yes, this one was real. At the Winter Meetings, the team’s GM would invite the breakout player’s GM out for some lemonade-fueled debauchery on the hotel mini-golf course, and somewhere over by the windmill would mention an idea for a “minor” deal. Those are the kinds of moves that a World Series team is built on. Sticking to “always trust the projection” probably does keep you from over-reacting to (and over-paying for) a two-month hot stretch, and maybe if it’s one of your own guys, you can sell high on him, but even then, you’re only getting half the benefit that you could.

In fairness to Messrs. Lichtman and Cameron (Hi guys!), I doubt either one would significantly disagree with my general point, and likely they'd be all for a method that could better detect a real breakout when it's happening (or about to happen). They’d likely agree that in a perfect world, we’d have a perfect test, but since we don’t live in a perfect world or have a perfect test, it’s better to pick the option that makes you wrong the least often. That’s perfectly sound thinking from a statistical point of view, until you look at it from the point of view of a team or anyone else who needs to be able to pick out the real breakout. Anyone can adopt "trust the projections." There’s no strategic value in it at all. Tell me when I should disregard even my own model!

Again, to be fair, MGL's projection system (and others) allows some types of new information to rewrite the projection mid-season. (For example, there were specific mentions of a pitcher who is clearly losing velocity, which would be factored into the projection.) But there's another problem. What happens when there’s information that the model doesn't account for? Sure, a good model tries to take everything into account, but what happens when a scouting report comes back that says, "No really, he really has changed his whole approach and it's working for him"? We can't privilege all information like that.

Your cousin's girlfriend's brother's boss who has been a Rockies fan for 40 years (yeah, I know) isn't a reliable source of information on Charlie Blackmon. And yes, ideally, a more complete model might find a way to incorporate that sort of feedback to make the model better, but we're kidding ourselves if we think our models are that complete at this point. The problem with "trust the projection" is that you're leaving out any information that isn't fueling that projection, but that might still be important. The fact that projection systems miss on a lot of breakout guys is evidence that we’re leaving out some critical information.

We should aspire to greater things than that, even though that aspiration is a mighty tall order. We have good ways for measuring what a player did on the field, and some nifty one-number catch-all stats, but very little in the way of measuring some of the more base component skills. How good is Smith's pitch recognition? What does it mean that someone finally explained how not to chase breaking stuff low and away to him in a way that he could understand? How does that affect all the other variables? What does it mean that not only is his wrist actually healthy, but that he actually trusts it now? How does all that interact with the rest of his skillset? It's a harder question and a more humbling one. It's going to be messy to figure it out, with fits and starts and failures and maybe some long pauses between breakthroughs. But that sort of mistrust of even the most sophisticated model is the difference between saying something that's correct and reaching the point of saying something that's useful.

Projections have a way of covering themselves for unexpected results mid-season, which is that the pre-season projections tend to reflect a baseline expectation for the player's performance. Strong deviation from that baseline might encourage reactive steps that bring performance back near the expected level.

A case that springs to mind is Ryan Zimmerman in 2012. Marcel projected a .357 wOBA for the season, and he finished at .352. What neither Marcel nor any other projection system knew was that through June 23 Zimmerman would have a .590 OPS, and that this poor performance would lead to an injury evaluation. In late June Zimmerman began receiving cortisone shots in his shoulder, and his OPS the rest of the season was .967.

Was the model right? Are these adjustments due to deviation from expected performance part of what's assumed in the predicted result?

Love the article, and this is a great question. Do we give credit to a projection that is "wrong" twice? If Mark Buehrle finishes with 13 wins, was my projection for his season useful, given that he's had a Cy Young first half and would have to have a horrendous second half? It's kind of like saying the average of ten coin flips is "the edge".

I think the answer to your last question is "yes". There can be multiple paths to any one outcome that make up that outcome's likelihood. How often the outcome of OBA=.352 happens to players like Zimmerman is based in part on how players like Zimmerman have fared before, and some percent of them got to .352 in a more direct way, and some got there in this wandering way, as well as various other ways, I'm sure. It's all automatically folded into the modeling of empirical data.

Did MGL or anyone look at BABiP? If the player has a pretty normal BABiP or worse and he's hitting way over his expected level, then I'd guess he's going to significantly beat his projection the rest of the year. That's mainly why I bet on Jose Bautista in 2010 and Edwin Encarnacion in 2012 - and am betting on Eugenio Suarez - but not Danny Santana right now.

There are players and pitchers who have upgraded their game where there is evidence of both the actual upgrade and its effects.

Submitted for your approval: Clayton Kershaw's slider, which has improved measurably - just check its velocity change first, and then the other stats - over the last few years.

My take on it is if you can establish (with hard data) that someone has actually upgraded their game and you can see the impact of the upgrade (by examining data from the results), then it's time to step away from the projections.

I found it interesting that the next article by Ben Lindbergh highlights the players PECOTA missed and "will continue to do so".

I was disappointed that there was no mention of, or even better, reconciliation with, other articles by Mr. Carleton and others about the stabilization of various statistics, such as strikeout rates, walk rates, etc. If I recall, pitcher strikeout rates stabilize at around 70 batters faced. If I recall, Dellin Betances's strikeout rate after the first seventy batters faced in 2014 was substantially above the PECOTA projection. Why would this not offer an opportunity to improve the projection?

Because you're misunderstanding the concept of stabilising. There's nothing magical about the 70th batter faced that suddenly means his strikeout rate this season is much more likely to be sustainable than it was after 60 or 65 batters faced.

Granted that there is nothing magical about the 70th batter. Let me try to put it another way: the 2014 PECOTA projection for Betances was 60 strikeouts in 57.6 innings pitched, or 1.04 strikeouts per inning. (PECOTA does not display batters faced, but working backwards from projected hits, home runs, bases on balls, and BABIP, I came up with a projected 275 batters faced in 2014, for a strikeout rate of 0.21818 per batter faced.)

After the May 7 game against the Angels, Betances had faced 72 batters and recorded 30 strikeouts in 17.3 innings for 2014; a strikeout rate of 1.73 per inning and 0.41666 per batter faced.
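The arithmetic in this comparison can be checked with a short Python sketch. All of the inputs are the figures quoted in this thread; the 275 projected batters faced is the commenter's own back-calculated estimate, not a number PECOTA publishes, and the variable names are just illustrative.

```python
# Figures as quoted in the comments above.
projected = {"k": 60, "ip": 57.6, "bf": 275}  # bf is the commenter's estimate
observed = {"k": 30, "ip": 17.3, "bf": 72}    # through the May 7 game


def k_per_inning(line):
    """Strikeouts per inning pitched."""
    return line["k"] / line["ip"]


def k_per_batter(line):
    """Strikeouts per batter faced."""
    return line["k"] / line["bf"]


for label, line in (("projected", projected), ("observed", observed)):
    print(
        label,
        "K/IP = %.2f" % k_per_inning(line),
        "K/BF = %.3f" % k_per_batter(line),
    )

# How many times the projected per-batter rate the early-season rate was.
ratio = k_per_batter(observed) / k_per_batter(projected)
print("observed / projected K/BF = %.2f" % ratio)
```

By these numbers, the early-season strikeout rate per batter was a little under twice the projected one, which is the gap the question is asking about.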

Many projection systems do incorporate in-season data like that. Those are the most recent data we have, and if you're going to use something, it might as well be that.

The problem with using the stabilization points in the manner that you suggest (and it's a method which I commonly see used), is that they were never meant to be used that way. The fact that Betances reached 70 PA means that we can feel "comfortable enough" with that sample to say that over those 70 PA (17 innings), he really was (past tense) a pitcher with a talent level around 42% (yikes!). It's reasonable to think that over another 70 or 100 or 500 PA, he'd pitch similarly, but that's an assumption. Perfectly reasonable assumption, but not ironclad.

Going forward, he won't have the element of mystery any more (pitchers usually have the upper hand in the first meeting with a batter), he'll be a bunch of pitches deeper into his season (and more tired?) and he'll face a different suite of hitters, probably in higher leverage situations. In other words, the next 70 hitters could be very different than the first 70. Stabilization answers the question "If we gave him two sets of 70 batters in roughly the same circumstances, how closely would the two performances match?" The question of what he will do next month is entirely different.
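Here is a toy Python simulation of that "two sets of batters in the same circumstances" question. It is only a sketch: the spread of true strikeout talent (centered on 20 percent, with a standard deviation of 5 points), the 500 simulated pitchers, and the function names are all assumptions made up for illustration, and every simulated pitcher keeps exactly the same talent in both samples, which is the very assumption that may not hold for next month.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_r(batters_per_half, n_pitchers=500, seed=0):
    """Correlate strikeout rates across two same-size samples per pitcher."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_pitchers):
        # Assumed talent distribution, clipped to a plausible range.
        talent = min(max(rng.gauss(0.20, 0.05), 0.05), 0.45)
        for half in (first, second):
            strikeouts = sum(
                rng.random() < talent for _ in range(batters_per_half)
            )
            half.append(strikeouts / batters_per_half)
    return pearson(first, second)


r70 = split_half_r(70)
r700 = split_half_r(700)
print("two sets of 70 batters:  r = %.2f" % r70)
print("two sets of 700 batters: r = %.2f" % r700)
```

Under these assumptions, the two performances match more closely as the samples grow, and nothing special happens at any particular batter. The simulation says nothing about a pitcher whose circumstances change between the two samples.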

Would it be possible to write a projection system that included velocity drop/increase or an injury that could measurably affect performance? I suspect it would be incredibly difficult to maintain vs. current projection systems but velocity is a measurable stat. You could possibly look at post-TJS or post-concussion performances and see if there's a trend.