Introducing SIERA

Part 1

Baseball fans who have no use for advanced metrics can realize the flaws in evaluating pitchers by their won-lost records, but may struggle to understand the inherent flaws in the more commonly used earned run average. Henry Chadwick invented ERA in the 19th century to measure the effect of defense on pitching performance, but not until Voros McCracken explained the concept of Defense Independent Pitching Statistics (DIPS) did our understanding of the relationship between pitching and defense take a big step forward. McCracken explained that pitchers controlled the rates of whiffing, walking, and getting walloped with home runs, showing that the correlation between these statistics in consecutive years was strong. Though he inferred an ability for hurlers to control these numbers, another finding suggested little persistence in their Batting Average on Balls in Play (BABIP), leading to the conclusion that ERAs were dependent on defense (or luck), and therefore very volatile.

Armed with this information, sabermetricians began to develop methods of estimating ERA by controlling for the factors that can muddy the proverbial waters. These estimators enable the evaluation of pitching performance based on what pitchers actually control, making the tracking of their abilities more accurate. Watching trends in actual skills that pitchers control can help us better grasp whether shifts in ERA are the result of changes from the individual or from external factors. Since then, many competing estimators have emerged with their accompanying strengths and weaknesses. Perhaps the most popular ERA estimator is Fielding Independent Pitching (FIP), which uses the following straightforward formula: FIP = 3.20 + (3*BB - 2*K + 13*HR)/IP, where the 3.20 is a constant dependent on the league and year, used to place the outputted number on the ERA scale.
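As a rough illustration, the FIP calculation can be sketched in a few lines of Python. The function name and the sample season line are hypothetical, and the division is by innings pitched, consistent with the later remark that FIP uses IP as its denominator. The 3.20 constant is simply the value quoted above and would change with league and year.

```python
def fip(bb: float, k: float, hr: float, ip: float, constant: float = 3.20) -> float:
    """Fielding Independent Pitching, placed on the ERA scale by `constant`.

    `ip` is innings pitched as a true decimal (200 1/3 innings = 200.333...),
    not the box-score shorthand 200.1.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return constant + (3 * bb - 2 * k + 13 * hr) / ip


# Hypothetical season line: 50 BB, 200 K, 20 HR in 200 IP.
print(round(fip(bb=50, k=200, hr=20, ip=200.0), 2))  # 3.25
```

Each home run moves the numerator as much as six and a half strikeouts, which is why a fluky home-run total can swing FIP so much.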

Researchers have noted that, among the defense-independent statistics, home runs are by far the least predictable. Although home-run rate has shown itself to be more repeatable than BABIP, the lack of persistence makes such a comparison similar to justifying a D grade by mentioning that other classmates failed the test. Further research revealed that the percentage of fly balls that left the yard (HR/FB) sported about as little persistence as BABIP, and second-generation estimators attempted to eliminate HR/FB luck from estimation. One of the more obvious adjustments is to simply approximate the number of home runs that would have been hit if the pitcher had neutral luck in the fly ball department, and then re-compute FIP with this estimate. This metric, known as Expected Fielding Independent Pitching (xFIP), uses the regular FIP formula but replaces HR with xHR, the estimate described above. This estimator marked an upgrade over FIP given the accepted notion that HR/FB has much more of a foundation in luck than actual skill, but there was still ample room for improvement.
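A minimal sketch of that substitution follows, assuming xHR is simply the pitcher's fly balls multiplied by a league-wide HR/FB rate. The function names, the sample counts, and the 0.10 league rate are illustrative assumptions, not figures from the article.

```python
def expected_home_runs(fly_balls: float, league_hr_per_fb: float) -> float:
    """Home runs a pitcher would have allowed with league-average HR/FB luck."""
    return fly_balls * league_hr_per_fb


def xfip(bb: float, k: float, fly_balls: float, ip: float,
         league_hr_per_fb: float, constant: float = 3.20) -> float:
    """Regular FIP formula with actual HR replaced by expected HR (xHR)."""
    if ip <= 0:
        raise ValueError("ip must be positive")
    xhr = expected_home_runs(fly_balls, league_hr_per_fb)
    return constant + (3 * bb - 2 * k + 13 * xhr) / ip


# Hypothetical pitcher: 50 BB, 200 K, 220 fly balls in 200 IP,
# with an assumed (illustrative) league HR/FB rate of 0.10.
print(round(xfip(bb=50, k=200, fly_balls=220, ip=200.0, league_hr_per_fb=0.10), 2))
```

Two pitchers with identical walks, strikeouts, and fly balls get the same xFIP under this sketch regardless of how many of those fly balls actually cleared the fence.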

Nate Silver invented QERA back in 2006 for Baseball Prospectus to adjust for a few issues with FIP and xFIP, and while he referred to the stat as a toy, it represented a big step upward in the methodology of estimators. The formula—QERA = (2.69 - .66*GB% + 3.88*BB% - 3.4*K%)^2—derives one of its main benefits from the fact that it accounts for non-linear run scoring; the more baserunners allowed, the higher the percentage that will score. It also removes the bias that innings pitched totals are subject to batted-ball luck and a pitcher with a higher BABIP will have a lower K/IP even if he strikes out the same percentage of hitters. QERA has another problem of its own, in that GB% is really GB/Ball in Play (or, GB/BIP), while BB% and K% are measured per batters faced (SO/PA and BB/PA).

In other words, for pitchers who strike out and walk large numbers of hitters, changes in ground balls per ball in play affect their QERA as much as they do for pitchers who barely strike out or walk any hitters, even though the latter group’s ground-ball rate actually represents a higher tally. Further, while QERA picks up some of the interaction between walk, strikeout, and ground-ball rates, it does not necessarily weight them correctly.
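To make the denominator mismatch concrete, here is a small sketch using the QERA coefficients quoted above. The two pitcher profiles are hypothetical. Converting GB/BIP to grounders per plate appearance assumes every plate appearance that is not a walk or strikeout ends in a batted ball, a simplification that ignores hit batters.

```python
def qera(gb_per_bip: float, bb_per_pa: float, k_per_pa: float) -> float:
    """QERA as quoted in the article: the square of a linear combination."""
    return (2.69 - 0.66 * gb_per_bip + 3.88 * bb_per_pa - 3.4 * k_per_pa) ** 2


def gb_per_pa(gb_per_bip: float, bb_per_pa: float, k_per_pa: float) -> float:
    """Ground balls per plate appearance.

    Assumption: every PA that is not a walk or a strikeout produces a batted ball.
    """
    return gb_per_bip * (1.0 - bb_per_pa - k_per_pa)


# Two hypothetical pitchers with the same GB/BIP but very different contact rates.
contact_pitcher = dict(bb_per_pa=0.05, k_per_pa=0.10)
whiff_pitcher = dict(bb_per_pa=0.12, k_per_pa=0.25)

for label, profile in (("contact", contact_pitcher), ("whiff", whiff_pitcher)):
    print(label,
          "QERA:", round(qera(0.45, **profile), 2),
          "GB/PA:", round(gb_per_pa(0.45, **profile), 3))
```

The same 45 percent GB/BIP goes into the formula for both pitchers, yet it stands for noticeably more ground balls per batter faced for the contact pitcher.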

With that in mind, we have invented a new statistic, Skill-Interactive Earned Run Average (SIERA), which corrects the problems with old estimators while adding a few more realistic assumptions. This was done first by un-foiling all of the individual components in QERA while adjusting for the ground-ball denominator issue, and then testing to see which interactions and squared terms were relevant by using multiple linear regression analysis. Essentially, we changed the GB/BIP to (GB-FB-PU)/PA and evaluated all of the terms in the exponential regression, removing those with insignificant p-values; while the QERA formula only shows three variables, un-foiling the formula reveals several more. We identified two terms that were not useful: the squared term of walks, and the interaction between walk and strikeout rate. The squared terms on strikeout and ground-ball rates were both significant, and we also found important interactions between walks and grounders and between whiffs and grounders that have strong effects on run scoring.

Allows for the fact that a high ground-ball rate is more useful to pitchers who walk more batters, due to the potential that double plays wipe away runners.

Allows for the fact that a low fly-ball rate (and therefore, a low HR rate) is less useful to pitchers who strike out a lot of batters (e.g. Johan Santana's FIP tends to be higher than his ERA because the former treats all HR the same, even though Santana’s skill set portends that the bombs he allows will usually be solo shots).

Allows for the fact that adding strikeouts is more useful when you don't strike out many guys to begin with, since more runners get stranded.

Allows for the fact that adding ground balls is more useful when you already allow a lot of ground balls because there are frequently runners on first.

Corrects for the fact that QERA used GB/BIP instead of GB/PA (e.g. Joel Pineiro is all contact, so increasing his ground-ball rate means more ground balls than if Oliver Perez had done it, given he's not a high contact guy).

Corrects for the fact that FIP and xFIP use IP as a denominator which means that luck on balls in play changes one's FIP.

The new ground-ball statistic used is: (GB-(FB+PU))/PA. Now walks, strikeouts, and grounders use the same denominator, avoiding any type of weighting issues. GB/PA could have been used instead of GB/BIP, but our findings suggested that line drives per ball in play exhibited virtually no persistence, and did not represent a pitcher skill. When his line-drive rate is low, the pitcher is probably just lucky, but ground-ball, fly-ball, and pop-up rates will increase to make up the difference. Since the ground-ball rate for the league as a whole is similar to the sum of the fly-ball and pop-up rates, using the difference between the two eliminates some of the luck that would make this estimator look bigger than its britches. For the same reason, pop-up rate was allowed to negatively affect SIERA because it is a symptom of the pitcher throwing a ball that generates an upward trajectory, which could lead to an increase in home runs. A pitcher’s skills are throwing strikes, making hitters miss, and throwing with angles and spins such that the trajectory of the ball is downward when it hits the bat. A popup almost always represents an out, but it also represents a potential problem for the pitcher in the future.
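The three inputs, all expressed per plate appearance, might be computed along these lines. This is only a sketch: the function name, dictionary keys, and the sample counts are hypothetical.

```python
def same_denominator_rates(so: int, bb: int, gb: int, fb: int, pu: int, pa: int) -> dict:
    """Strikeout, walk, and net ground-ball rates, all per plate appearance."""
    if pa <= 0:
        raise ValueError("pa must be positive")
    return {
        "so_per_pa": so / pa,
        "bb_per_pa": bb / pa,
        # Ground balls minus fly balls and pop-ups; line drives are left out.
        "net_gb_per_pa": (gb - (fb + pu)) / pa,
    }


# Hypothetical season: 180 SO, 60 BB, 300 GB, 200 FB, 40 PU over 850 PA.
rates = same_denominator_rates(so=180, bb=60, gb=300, fb=200, pu=40, pa=850)
for name, value in rates.items():
    print(name, round(value, 3))
```

Because line drives never enter the calculation, a lucky dip in line-drive rate that spills evenly into grounders and fly balls leaves the net figure roughly unchanged.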

Simply running a regression analysis to predict park-adjusted ERA and developing a statistic that introduces these improvements to Defense Independent Pitching Statistics would be useless if it did not predict ERA better than other statistics. Not only did SIERA emerge as the leader in ERA estimators, we discovered more importantly that using the same regression analysis on different datasets shows that the coefficients developed continue to predict ERA better than other estimators, proving that our analysis was not biased by retroactively predicting the mark. Specifically, using 2003-08 data to generate a formula and then testing it on 2009 pitchers, SIERA emerged as the best estimator of park-adjusted ERA in the following year and the best at predicting same-year ERA amongst the estimators that treat HR/FB as luck; FIP and tRA consider it to be more skill-laden.

In other words, it is impossible to best FIP in terms of same-year mirroring unless HR/FB is treated as a skill, but tests have shown that HR/FB itself is unstable and not indicative of something within the control of the pitcher. FIP and tRA lead other estimators that do not credit the pitcher for this luck in predicting same year Earned Run Average, but SIERA overtakes both in predicting future performance, which is arguably much more important. After all, the primary goal of ERA estimators is to approximate a skill set that can successfully generate low ERAs while being as accurate as possible in the modeling and assumptions deriving the formula.

In the coming days, we will explain in more detail the derivation of SIERA, provide some tests to check its performance, and offer examples of pitchers for whom the metric performs vastly better than other estimators. The last part is very important, as a small change in ERA estimation is not necessarily a big deal unless there are pitchers who are perpetually underrated or overrated by similar statistics. This is certainly true in the case of SIERA and FIP for a player like Santana, whose solo home run tendencies are inaccurately punished by FIP in a way that underestimates his skill by a significant amount. The introduction of a metric that properly accounts for all that was mentioned above helps to evaluate pitchers in a more precise and useful way than ever before.

For now, we leave you with the formula for the statistic that will be kept here moving forward and will soon be found on the revamped reports:

Matt Swartz is an author of Baseball Prospectus.
Eric Seidman is an author of Baseball Prospectus.

Isn't one of Matt/Eric's points that the second-order interaction terms are actually implicit in Nate's QERA formula, it's just that we don't think about them, because of the way in which he left the form of the formula?

Exactly right. We also let them take on more realistic values because we unfoiled the regression.

For instance, in Nate's formula: QERA = (a + b*BB% + c*K% + d*GB%)^2, it's essential that b is positive and that c & d are negative, but that requires that b*c is negative when it is actually positive in nearly every regression we ran, and it also requires that d^2 is positive, when it actually should be negative.
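For anyone who wants to see the implied terms, here is a quick sketch that un-foils the QERA square using the coefficients quoted in the article. The helper name and the dictionary labels are made up for illustration.

```python
from itertools import combinations_with_replacement

# Linear part of QERA as quoted: (2.69 - .66*GB% + 3.88*BB% - 3.4*K%)^2
LINEAR = {"1": 2.69, "GB": -0.66, "BB": 3.88, "K": -3.4}


def unfoil(linear: dict) -> dict:
    """Expand (sum of c_i * x_i)^2 into a coefficient for each product x_i * x_j."""
    expanded = {}
    for a, b in combinations_with_replacement(linear, 2):
        coeff = linear[a] * linear[b]
        if a != b:
            coeff *= 2  # each cross term appears twice in the square
        expanded[(a, b)] = coeff
    return expanded


terms = unfoil(LINEAR)
for (a, b), coeff in terms.items():
    print(f"{a}*{b}: {coeff:+.4f}")
```

The BB*K coefficient comes out negative and the GB*GB coefficient positive, which are exactly the two signs the squared form forces and the free regression contradicts.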

Thanks for highlighting this point, though. I think it makes the metric more transparent.

Could you comment on the use of a second order positive term for K%? Is this a fitting artifact (ie no constant could accurately fit the data for a first order term), or is there a concept behind it? If it is the former, might I suggest letting the exponent of the first term float to clean up the equation a bit.

Very good question. We did run regressions on various subsets of data and this term was basically always positive. It fits with the general theme pretty well too. Think of it this way: when do you want a strikeout most? With runners on base. The more K's you get, the fewer runners are on base though, so it tends to get gradually less effective.

In Wednesday's article, this will be fleshed out a little more, but the methodology behind a lot of the quadratic and interaction terms entails asking "which pitchers need this most?" That's why GB rate is more important for pitchers who allow more base-runners, and why marginal improvements in K-rate are less important the higher the K-rate goes.

It happens bit by bit, so it's tough to pick an exact number. I know that SIERA tested particularly well for pitchers in the 6.5-9.0 range of K/9 while doing about as well as everything else for K/9 above that.

There is no inflection point. The inflection point would be where the second derivative of SIERA with respect to (SO/PA) equals zero. But the second derivative of SIERA with respect to (SO/PA) is a constant (2*10.169). [This assumes that SO is independent of the other parameters, which isn't quite true--in the extreme, a pitcher who strikes out every batter wouldn't have any BB, GB, or FB. But I suspect that this effect is small.]

Will, there's always a diminished return on increasing K rate per this fit, but if you're asking where the first negative Krate term is overwhelmed by the two positive Krate terms, the answer is going to be a function of GBrate, as there is the +9.561*(SO/PA)*((GB-FB-PU)/PA) term, which is also positive. The answer then, for a GBrate of 0.49 (Brandon Webb 2006) is Krate=0.66, for a GBrate of 0.35 (Adam Wainwright 2009) is Krate=0.73, and for a GBrate of 0.28 (Aaron Harang 2007) is Krate=0.76. These are obviously non-physical numbers, since you can't strike out even 66% of batters faced, and they result from this equation being fit to a data set of real values. Since you can't extrapolate a purely phenomenological equation outside of its set of data, these numbers are meaningless.

Matt and Eric, was there any thought to adding a second order term in BBrate? I was plotting the numbers for an average pitcher, of Krate vs SIERA, and found it to look too linear (by eye) in BBrate.

As far as the diminishing run prevention effect of strikeouts, it does really matter where the BB and GB numbers are because those determine the number of base-runners and the double play ability to remove those base-runners.

We did test the second order term for BB-rate, which we'll explain in more detail on Wednesday, but it kept coming up as insignificant so we left it out of the equation.

The positive coefficient on the (SO/PA)^2 term does mean that at some point additional SO/PA increases SIERA. This is when the first derivative of SIERA wrt (SO/PA) is equal to zero, which is at a SO/PA of (18.055-9.561(GB-FB-PU)/PA)/(2*10.169). This looks like an outrageous strikeout rate, so it probably isn't an issue.
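That turning point is easy to check numerically. The sketch below uses only the three strikeout-related coefficients quoted in this thread (the linear one taken as -18.055, as the expression above implies); the constant and function names are made up, and the rest of the SIERA formula is not needed for this derivative.

```python
# Strikeout-related SIERA coefficients as quoted in the comments above.
K_LINEAR = -18.055   # coefficient on SO/PA
K_SQUARED = 10.169   # coefficient on (SO/PA)^2
K_TIMES_GB = 9.561   # coefficient on (SO/PA) * ((GB-FB-PU)/PA)


def k_rate_turning_point(net_gb_rate: float) -> float:
    """SO/PA at which d(SIERA)/d(SO/PA) = 0, holding the other inputs fixed.

    Derivative: K_LINEAR + 2*K_SQUARED*k + K_TIMES_GB*net_gb_rate = 0
    """
    return (-K_LINEAR - K_TIMES_GB * net_gb_rate) / (2 * K_SQUARED)


# Ground-ball rates used in the examples earlier in the thread.
for net_gb_rate in (0.49, 0.35, 0.28):
    print(net_gb_rate, round(k_rate_turning_point(net_gb_rate), 3))
```

The three rates come out at roughly 0.66, 0.72, and 0.76, in line with the figures quoted earlier and far above any strikeout rate a real pitcher posts.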

Haha-- unfortunately, Glavine baffles SIERA as well, at least for 2003-2008 where there are actually batted ball statistics recorded. His SIERAs look similar to his other ERA estimators, all ahead of his ERA (about 4.9 versus 4.2 for those six years). The thing about Glavine was that he was far superior to his peers at pitching to the situation. I think that pitchers with really high ground ball rates may be particularly good at pitching to the situation, but at least from 2003-09, Glavine is pretty average there so he is not a puzzle SIERA can foil.

As a huge Braves fan throughout Glavine's career, I don't ever recall him being an extraordinary groundball pitcher-- at the very least, not on Maddux's level. He got his share of GBs, but mostly he just seemed to coax a lot of lazy fly balls to CF and LF.

There is speculation among Braves fans that his circle-change (for which he was famous) behaved somewhat like a knuckleball, which is a pitch that is known to outperform many ERA predictors (I'm assuming that Wakefield probably outperforms SIERA as well). But that's all anecdotal evidence, obviously.

Definitely could be part of it. Glavine's career BABIP is .286, while Wakefield's is .281. SIERA and FIP do about as well when it comes to Wakefield. The thing about Glavine's BABIP is partly that he played in front of good defense, so that's not all the effect. It definitely would explain some of it, though.

SIERA helped find some of the mistakes in the first round of PECOTAs but it wasn't early enough to actually build it in to 2010 PECOTA. It definitely could be part of the process more next year, though I'm not quite sure about that.

The 2010 Annual will list 2009 SIERA and will compute 2010 SIERA according to the projection. Pretty soon, the 2003-09 SIERA's will be available on the Statistics Reports and the 2009 SIERA's will do very well at helping predict 2010 ERA, at least net of park effects.

@Matt and Eric: interesting work. I think some of us would like to see a simple correlation matrix of SIERA against the other ERA estimators including FIP, QERA and ERA itself (!) based on cross-sectional (single season) data.

Also a correlation matrix of the same estimators over time, i.e., interseasonal correlations for each indicator with itself and with the other estimators, e.g., 2008 and 2009 will do.

Finally, some information about the correlations (intra- and interseasonal) for pitchers with different numbers of innings pitched (let's just say high (150+), medium (75-149), and low (<75), or some such breakdown).

If you can't put all of this into the article, perhaps you could offer it as a downloadable spreadsheet or table?

Over the next four days Matt and I have articles going really in-depth into a number of topics, one of which is testing SIERA against FIP, xFIP, QERA, tRA, ERA-Park, ERA, you get the picture. So just hold tight. We're going to have the data in the articles themselves too.

This says "a pitcher with a higher BABIP will have a lower K/IP even if he strikes out the same percentage of hitters."

Isn't that backwards? If you have a higher BABIP, you will face more hitters per batted ball, as more will get on base. If you K the same percentage of batters faced, you should get MORE K's per inning. In the extreme case, if two pitchers both strike out every third hitter, and one has a BABIP of .000 and the other has a BABIP of 1.000, the first will have one K/IP, and the latter will have 3 K/IP.

Oops-- you're right. It should be higher/higher and lower/lower rather than the way we did it. Thanks for pointing that out. It shouldn't obscure the error though-- looking at BABIP-neutral statistics on a per out (or per 3 outs) basis isn't really BABIP-neutral at all.
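The extreme case from the question can be worked through in a few lines. This sketch assumes a deliberately stripped-down game in which every plate appearance is either a strikeout or a ball in play, with no walks, home runs, or double plays; the function name is hypothetical.

```python
def k_per_inning(k_per_pa: float, babip: float) -> float:
    """Strikeouts per inning for a given strikeout rate and BABIP.

    Assumption: each PA is a strikeout or a ball in play, and a ball in play
    becomes an out with probability (1 - babip).
    """
    outs_per_pa = k_per_pa + (1.0 - k_per_pa) * (1.0 - babip)
    if outs_per_pa <= 0:
        raise ValueError("no outs are ever recorded with these inputs")
    batters_per_inning = 3.0 / outs_per_pa
    return k_per_pa * batters_per_inning


# Both pitchers strike out every third hitter; only BABIP differs.
print(round(k_per_inning(1 / 3, babip=0.0), 2))  # 1.0
print(round(k_per_inning(1 / 3, babip=1.0), 2))  # 3.0
```

Same strikeout percentage, three times the K/IP, purely because the unlucky pitcher has to face more batters to record each out.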

Are you asking how well this would match up with linear weights? If so, that's not directly in subsequent articles, but I think it would probably match it reasonably well, at least as more data is collected on batted ball numbers over the next few years and the coefficients are refined. Some of the strength in the estimation might come from situational pitching, which probably wouldn't show up quite as much in linear weights as I understand it, but certainly the magnitude of the interaction terms at the end might work out pretty well at least. It's a good question, but I'm not sure yet.

Quick note that we didn't mention in the formula-- for pitchers who give up MORE fly balls and pop ups than ground balls, the ((GB-FB-PU)/PA)^2 term would be positive, but we basically made it negative in that case. So that term should be negative or positive depending on the sign of (GB-FB-PU)/PA.
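One way to read that note is that the squared ground-ball input keeps the sign of the underlying rate, so a fly-ball pitcher gets a negative value before the coefficient is applied. Here is a minimal sketch of that reading; the function name is made up, and the coefficient itself is not quoted in this thread, so it is left out.

```python
def signed_square(net_gb_rate: float) -> float:
    """Square of (GB-FB-PU)/PA that keeps the sign of the rate itself.

    Assumption: this is the adjustment described in the note above,
    i.e. x * |x| rather than a plain x**2.
    """
    return net_gb_rate * abs(net_gb_rate)


# A ground-ball pitcher and a fly-ball pitcher with mirror-image net rates.
print(signed_square(0.10), signed_square(-0.10))
```

With a plain square, both pitchers would get the same value for this term even though one lives on grounders and the other on fly balls.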

On a somewhat related note, I'm sitting here at work where we have various TVs tuned to different news networks (I work for a media company), and I just saw Nate Silver discussing politics on MSNBC. From QERA to senatorial elections; what a career path.

Are all the predictive systems you mentioned using the same factors when arriving at your "park adjusted ERA"?

My other question is about weather, umpire and pitching to ballparks. I know that all three affect scoring, and when added together can do so significantly. But never do I see any mention of them in really, anything that anyone does. It really bothers me that we can see that over 30+ starts the same offense may produce dramatically different Runs/Game for a given starter on the same team, but it is somehow assumed that the umpires and the game conditions for each of those starters "even out". These things count, but they're not counted.

It predicts park-adjusted ERA the following year best, and I have a large doubt that starters are systematically paired with certain weather and umpires, so that should even out when predicting the following year's ERA. It also does better than other HR/FB-luck-neutral estimators in same-year ERA so it covers all the bases. Although there could be parks that favor lefties and righties, I'm not sure how this would correlate with K%, BB%, and GB% enough to affect this.

This looks like a very nice advance in metrics that assume that pitchers do not have significant variations in hardness of contact allowed. That is an incorrect assumption*, but the variations are small enough at the MLB level to make such a metric very useful, and may help get a handle on those slight but real differences.

*I've lost count of the number of ways this can be demonstrated. Most obviously, there is a significant correlation between team BABIP allowed and the three true outcomes; staffs that are better according to the latter also allow a lower BABIP just as you'd expect. But here's perhaps the best bit of evidence yet:

Take all the pitchers with 200+ BFP in consecutive seasons for the same club in the same park since we've had UZR data (2002-2009). The change in BABIP is of course correlated to the change in team UZR Range + Error, with the correlation surprisingly weak (r = .19) because BABIP is just so damn noisy. Now, adjust the yearly change in K/BFP for age and change in role (starter vs. relief) and toss that into the regression. Surprise! The change in BABIP is also significantly (p = .02) correlated to the change in K rate. That is very unlikely to be caused by changes in FB/GB ratio that correlate to K rate changes and very likely to be what it looks like: you strike out more batters, they also hit the ball less hard (and the opposite).

It's also interesting that the change in BABIP with age (again, with change in UZR included in the regression) precisely mirrors the change in HR / Contact with age. In both cases, there is no change until age 28 or 29, then a worsening. (Contrast to K rate, which significantly improves to age 27, then declines at the same rate). The worsening of BABIP after age 29 is not significant (p = .25), but in this data set neither is the improvement in BB rate by young pitchers (p = .26), and we're pretty sure that's real.

(Some of this is in a thread at SoSH looking at the impact of team defense on pitching.)

Thank you for pointing this out. The correlation of BABIP and Defense Independent Pitching Statistics is something I've discussed before. It's small but it's there. The benefit of using regression to do this is that it picks up this effect. Pitchers with higher K-rates have lower BABIPs, and both the extra K/PA and the fewer H/BIP lower ERA, but the regression will pick up both effects. The only thing that SIERA will leave out is BABIP effects that are uncorrelated with ground ball, strikeout, and walk rates, which are very small effects. That's why this is more of an ERA estimator based on skills than based on DIPS.

The typical newspaper boxscore doesn't list batted ball types, and even FanGraphs boxscores just lump pop ups in with fly balls. That's okay because fly balls and pop ups are always added together in the equation, so you can call all pop ups fly balls and the equation still works. Any game summary on Gameday will include pop ups too.