With so many complex statistics out there, I wondered if there was an easier way to project winning percentage or runs: something simple, yet a little more involved than Bill James’ classic Pythagorean Win Expectancy. To build a statistic like that, I needed one comprehensive stat for offense and one for pitching. I came up with the following and named them “Run Value” and “Pitching Run Value,” respectively.

RVAL = ( ( TB + BB – SO )/4) + RBI + HR

PRVAL = ( ( ( H + BB – SO )/4 ) + HR ) × FIP
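As a rough illustration, the two formulas above can be written as small Python functions. This is only a sketch, not the R program used for this article; the function names and the team totals are invented for the example.

```python
def rval(tb, bb, so, rbi, hr):
    """Offensive Run Value: ((TB + BB - SO) / 4) + RBI + HR."""
    return (tb + bb - so) / 4 + rbi + hr


def prval(h, bb, so, hr, fip):
    """Pitching Run Value: (((H + BB - SO) / 4) + HR) * FIP."""
    return ((h + bb - so) / 4 + hr) * fip


# Hypothetical season totals, invented for illustration only.
team_rval = rval(tb=2400, bb=550, so=1300, rbi=750, hr=200)
team_prval = prval(h=1400, bb=500, so=1300, hr=180, fip=4.00)

print(team_rval)   # 1362.5
print(team_prval)  # 1320.0
```

Keeping each formula in its own function makes it easy to apply the same calculation to every team-season in a data set.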

Both metrics are computed at the team level. For the batting stat, RVAL, higher is better: I tried to get down to the pure number of runs that a player or team produces, using the very relaxed definition of a run as four bases. For the pitching stat, PRVAL, lower is better: I applied the same idea to get a pure total of runs allowed. I then put the two stats into the win expectancy formula:

RVALWinExp = RVAL^1.83 / ( RVAL^1.83 + PRVAL^1.83 )
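A minimal sketch of the win expectancy formula above, written as a Python function. The function and parameter names are assumptions made for the example. Passing runs scored and runs allowed instead of RVAL and PRVAL gives the familiar Pythagorean expectation, and Bill James' classic version uses an exponent of 2 rather than 1.83.

```python
def win_expectancy(offense, defense, exponent=1.83):
    """Pythagorean-style expected winning percentage.

    offense and defense can be RVAL and PRVAL, or runs scored and
    runs allowed; exponent defaults to the 1.83 used in this article.
    """
    o = offense ** exponent
    d = defense ** exponent
    return o / (o + d)


# A team whose two inputs are equal is expected to win half its games.
print(win_expectancy(750, 750))  # 0.5
```

Because the exponent is a parameter, the same function can be reused to compare different exponents against actual winning percentage.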

I then ran a program in R to see how closely this stat correlates with actual team win percentage for all teams from the 1998 season through the 2018 season. I also tested how well Bill James’ win expectancy formula correlates with team win percentage over the same period. The results are below.

Bill James’ expected W.L.% → Actual W.L.%

r squared = .88

standard error = .0247

RVAL W.L.% → Actual W.L.%

r squared = .853

standard error = .0273
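The comparison above rests on two numbers per model: r squared and a standard error. The sketch below shows one way to compute both for a simple one-variable linear regression. It assumes that "standard error" means the residual standard error of the fit (as reported by a typical regression summary); the article does not say exactly which quantity was used, and the sample data are invented.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared, residual_standard_error).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    std_error = math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom
    return slope, intercept, r_squared, std_error


# Hypothetical expected vs. actual winning percentages (invented values).
expected = [0.450, 0.480, 0.500, 0.530, 0.580]
actual = [0.440, 0.490, 0.510, 0.520, 0.590]
slope, intercept, r2, se = fit_line(expected, actual)
```

A higher r squared and a lower standard error both indicate that the expected values track the actual ones more closely.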

Since my stat correlated fairly well with team win percentage, I decided to figure out whether the offensive RVAL was better than the pitching PRVAL or vice versa. So I tested RVAL against runs scored and PRVAL against runs against. The results are below:

RVAL → Runs Scored

r squared = .947

standard error = 22.8

PRVAL → Runs Against

r squared = .815

standard error = 94.7

The graphs and stats show that the offensive statistic is much more accurate than the pitching stat. Since I was getting such good results with the offensive stat, I decided to test it further, this time against Base Runs, a currently popular measure of offensive production. The results are pictured below:

Base Runs → Runs

r squared = .927

standard error = 21.9

My stat’s r squared value was higher by .02, although its standard error was slightly larger (22.8 versus 21.9), which indicates that my statistic is about as good as Base Runs. I realized that my stat must have some sort of significance to be this accurate in predicting runs. I then compared a histogram of RVAL with a histogram of Runs and saw that RVAL was too high and too spread out to mirror the distribution of runs. This is what it looked like:

I then found the mean of the runs data and scaled RVAL down so that the two distributions had the same center. Next, I shrank the standard deviation so that the data looked similar to the histogram of runs, and ended up with the following equation:

AdjustedRVAL = (RVAL/1.753685) – ( ( (RVAL/1.753685) – 745.1053 ) / 5 )
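A minimal sketch of this adjustment in Python. It assumes the shrink step is applied to the already-scaled value: divide first so the mean matches the mean of runs, then move each point one-fifth of the way toward that mean. The helper name `adjust` and its parameter names are invented for the example.

```python
def adjust(value, scale, target_mean, shrink_divisor):
    """Rescale a value, then pull it toward target_mean.

    scale          -- divisor that lines the mean up with target_mean
    shrink_divisor -- 5 means "move one-fifth of the way to the mean"
    """
    scaled = value / scale
    return scaled - (scaled - target_mean) / shrink_divisor


def adjusted_rval(rval):
    # Constants taken from the equation above.
    return adjust(rval, scale=1.753685, target_mean=745.1053, shrink_divisor=5)


# Simple check with round numbers: 2000 / 2 = 1000, which is 200 above
# a mean of 800, so it moves 200 / 5 = 40 closer to the mean.
print(adjust(2000, scale=2, target_mean=800, shrink_divisor=5))  # 960.0
```

A value that scales exactly to the mean is left at the mean, while every other value keeps its side of the mean and moves closer to it.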

This new formula divides each value so that the mean of the data matches the mean of the runs data, and then subtracts one-fifth of the difference between the scaled value and that mean. In other words, it moves each data point one-fifth of the way closer to the mean, bringing the data closer together. Comparing the two histograms again, I found this:

The new standard error for AdjustedRVAL was 18.2 and the r squared value remained the same. Now that the distributions look very similar to one another, I could see that my statistic predicts team runs scored pretty accurately. I then went back to the pitching statistic, PRVAL. Its results weren’t terrible, so I tried to improve it. I looked at the histogram:

I had already adjusted the mean of the data, which is why it appears centered in the mid-700s. I then did the same thing I had done with RVAL, bringing the data closer together, and ended up with the equation below:

AdjustedPRVAL = (PRVAL/2.2173) – ( ( (PRVAL/2.2173) – 744.31 ) / 1.6 )

This equation gave me the following comparison:

The new standard deviation of PRVAL was 35.5, and the r squared value remained the same. This depiction of the adjusted pitching statistic shows a pretty good relationship with runs against for teams dating back to 1998. With an r squared of .815, the relationship is not as strong as the one for offense, but it is not too shabby.
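The r squared values stay the same after both adjustments because each adjustment is a linear rescaling (multiply by a constant, then add a constant), and correlation does not change under that kind of transformation. The sketch below checks this numerically. The data are invented, and the `adjust` helper assumes the scale-then-shrink reading of the adjustment formulas.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def adjust(value, scale, target_mean, shrink_divisor):
    """Assumed reading: scale first, then shrink toward target_mean."""
    scaled = value / scale
    return scaled - (scaled - target_mean) / shrink_divisor


# Hypothetical team values, invented for illustration only.
raw_rval = [1250.0, 1310.0, 1190.0, 1400.0, 1330.0]
runs = [720.0, 750.0, 690.0, 810.0, 745.0]

adjusted = [adjust(v, 1.753685, 745.1053, 5) for v in raw_rval]

before = r_squared(raw_rval, runs)
after = r_squared(adjusted, runs)
# before and after agree up to floating-point rounding.
```

What the adjustment does change is the scale of the statistic, so that its values can be read directly as run totals.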

After creating this statistic and testing it out, I have found that it doesn’t take a complicated metric to make accurate predictions, as even simple stats created by trial and error can have interesting outcomes. I plan to keep using this statistic and testing its accuracy, and I am also developing a version that can be applied specifically to position players.

I am a high school senior at Xaverian Brothers High School in Massachusetts. I am a sabermetrics enthusiast and Red Sox diehard. I love Moneyball and I have created my own custom statistics. I also use R to analyze baseball stats and am currently learning Java in AP Computer Science.

London Yank

Joshua, You’re doing really good work. I have a few comments: 1. You want to keep in mind what your question is. For example, Pythagorean record asks “What should this team’s record be based on how many runs they have scored and conceded?” Base Runs asks “How many runs should this team have scored and conceded based on their batting and pitching outcomes? And, based on the run differential this team should have had, how many games should they have won?” It isn’t clear what question your statistic is designed to answer because, for example, you have mixed batting outcomes…

Thank you for your comments. I was hoping to get some good feedback because I am trying to learn the language on my own and also get a better understanding of what quantitative analysts are thinking about when they analyze baseball stats.

16 days ago

D4P

A few comments: 1. By including both RBI and HR in the RVAL formula, it seems like you are double-counting the runs scored by the hitter of a HR, in that said runs count both in HR and RBI. Is that a problem? 2. Using RBI to “predict runs” seems a bit redundant, in the sense that RBI is already a measure of (at least some of the) runs that have been scored. It seems like you’re essentially saying “Runs scored increases with the number of runs that were scored by being batted in”, which doesn’t seem like two different…

I was purposefully trying to count both RBI and HR in the formula because both are important in offensive production: home runs often mean a larger number of runs were scored, and RBI account for players being able to hit well when their teammates are in scoring position (situational hitting). I believe this is important in a team statistic. Is there a general problem with that, or does it depend on the goal of the stat? I’m actually not sure if I’m trying to predict runs exactly or not, but most likely not. I’m mostly experimenting and trying to see…

16 days ago

London Yank

At the team level, RBI and Runs Scored are basically the same thing. Every time a team gets one RBI, it also gets one run. The small discrepancy between team runs and team RBI is due to rare run scoring events such as a run that scores on a double play. So, predicting runs from RBI is a bit like predicting people’s heights by measuring them from their feet to their nose. It will be highly accurate but it doesn’t tell you much since the length from foot to nose is just directly measuring 95% of the thing you want…

16 days ago

Lanidrac

OK, nice work, but is it really necessary to create a new, more complicated metric that is only a little better than what we already have with the other two metrics?