
The backwards K is normally used to denote a called third strike in a strikeout. It’s typically written on a scorecard. I’ve been looking for the backwards K so I can denote the strikeout looking on Twitter, and I finally found it:

ꓘ

(for unsupported browsers — Chrome)

The easiest way to use this character is to copy and paste the backwards K from above and save it in a note or something you can copy and paste from routinely. This character actually comes from Apple’s implementation of the Unicode block for the artificial, Latinized version of the Lisu alphabet. This alphabet contains an upside-down, turned K which looks similar enough to a backwards K that I think it passes on Twitter.

If you don’t see the backwards K in the block above, your computer or mobile device probably isn’t using a font that supports that specific character. It’s supported on Macs and iPhones (as well as the Edge browser in Windows 10).
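If you’d rather generate the character than copy it, here is a minimal Python sketch. It assumes the character shown above is code point U+A4D8 in the Unicode Lisu block; the variable names are only for illustration.

```python
import unicodedata

# Assumption: the backwards K above is code point U+A4D8 (Lisu block).
backwards_k = "\ua4d8"

code_point = f"U+{ord(backwards_k):04X}"
char_name = unicodedata.name(backwards_k, "unknown")

# backwards_k now holds the character, ready to be written into a tweet or note.
print(code_point, char_name)
```

Whether the glyph actually renders still depends on the font, as described above.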

This post is a work in progress. The data concerning the pace of play is rather messy and this project is rather large compared to what I normally tackle. For that reason I’m going to start this post and update it as a ‘working post’. Please feel free to contact me if you have any input: @seandolinar on Twitter or sean.dolinar@gmail.com

Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers took between pitches. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I was able to calculate the time between each pitch. I used the resulting calculation combined with other information available about each at-bat to draw some conclusions about what affects pace of play.
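As a rough sketch of that calculation in Python: the time stamps and their format below are invented for illustration, and the real PITCH/fx fields may look different.

```python
from datetime import datetime

# Hypothetical time stamps for consecutive pitches in one at-bat.
# The format is an assumption; real PITCH/fx fields may differ.
pitch_times = [
    "2014-08-08 19:05:10",
    "2014-08-08 19:05:31",
    "2014-08-08 19:05:49",
    "2014-08-08 19:06:15",
]

def pitch_paces(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the seconds elapsed between each pitch and the one before it."""
    parsed = [datetime.strptime(ts, fmt) for ts in timestamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]

paces = pitch_paces(pitch_times)
print(paces)
```

Four pitches give three pitch-pace values, since the first pitch has nothing before it to measure against.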

The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]

The data matches what your experience with baseball suggests. Pitchers will slow down the game when there is a runner on base. This will happen for several reasons: run-game tactics, conferences on the mound, and even time for the ball to get back to the pitcher after the play. Given the fact there is a slight drop off for when there isn’t an open base or there are two outs, I would conclude that the run-game prevention tactics play a rather significant role in the pitch pace.

The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe the most likely outcome.
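To illustrate why the median holds up better than the mean on right-skewed data, here is a small Python sketch. The pitch paces are made up, with a couple of long delays standing in for mound visits and similar interruptions.

```python
from collections import Counter
from statistics import mean, median

# Made-up pitch paces in seconds, with a long right tail.
paces = [14, 16, 17, 18, 18, 19, 21, 22, 24, 27, 45, 90]

def bucket(seconds, width=5):
    """Label the bin a pace falls in, e.g. 17 -> '15-20'."""
    low = int(seconds // width) * width
    return f"{low}-{low + width}"

histogram = Counter(bucket(p) for p in paces)
average_pace = mean(paces)
median_pace = median(paces)

print(histogram.most_common(3))
print(f"mean={average_pace:.1f}s median={median_pace:.1f}s")
```

The two long delays pull the mean well above the median, while the median stays next to the most common bin.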

The pitch pace with the highest frequency with the bases empty is the 15-20 second range, while the most frequent pitch pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of having a 20 second pitch clock. From the distribution, it becomes apparent that keeping the pace to under 20 seconds would have an impact on the pace of play.

I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.

Data Background

PITCH/fx data isn’t designed to accurately measure the time between pitches; it has some problems. A human operator is needed to enter data on each pitch such as ball/strike, information about the hit, or if runs scored. For this reason, the data is very messy. Subtracting each pitch’s time stamp from that of the following pitch sometimes yields negative numbers, because the operator entered the previous pitch after the pitcher threw the next one. For these reasons I have to re-examine how I clean and process the PITCH/fx data.

Further Work

I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.
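A minimal sketch of that cleaning step in Python might look like the following. The row layout and values are invented, and using the median as the per-at-bat aggregate is my assumption rather than a settled choice.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (at_bat_id, pitch_number_in_at_bat, seconds_since_previous_pitch).
# Field names and values are invented for illustration.
rows = [
    (1, 1, 95.0),   # first pitch of the at-bat: the gap spans the previous play
    (1, 2, 18.0),
    (1, 3, -4.0),   # negative gap caused by late operator entry
    (1, 4, 22.0),
    (2, 1, 140.0),
    (2, 2, 25.0),
    (2, 3, 27.0),
]

def clean_and_aggregate(rows):
    """Drop first pitches and negative gaps, then take the median pace per at-bat."""
    by_at_bat = defaultdict(list)
    for at_bat_id, pitch_number, gap in rows:
        if pitch_number == 1 or gap < 0:
            continue
        by_at_bat[at_bat_id].append(gap)
    return {at_bat: median(gaps) for at_bat, gaps in by_at_bat.items()}

pace_by_at_bat = clean_and_aggregate(rows)
print(pace_by_at_bat)
```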

I want to look at leverage’s impact on the pitch pace. My initial analysis is that leverage doesn’t matter all too much when you consider if there’s a player on base or not since leverage and having a player on base are collinear. With cleaner data the effect of leverage or post season play might be more apparent.

I’m going to look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There also should be artifacts for weather delays.

Pitching changes should also be included. Inning breaks with new pitchers tend to be longer; it would be nice to see how much longer they are in aggregate.

All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.

The Royals and A’s had quite the entertaining 12-inning game Tuesday night. These are a few graphs I made from Twitter data. Yellow is Oakland; blue is Kansas City. The proportions of tweets between teams might be off, but I would venture to guess the Royals had much more social media activity than the A’s. The map shows geotagged tweets from 5PM to 1AM EDT from yesterday. The middle of the country was solid blue, California was pretty yellow, and the East Coast was rather mixed.

The volume of tweets per minute is a pretty cool view of what happened during the game. It looks like the Royals really outpaced the A’s for volume, but I’d have to use some controls to determine that for sure. These are just for fun.

I used Twitter’s streaming API to collect tweets with keywords like “Royals”, “A’s”, “TakeTheCrown”, “GreenCollar”, etc. I could have missed a crucial element of discussion, and none of this takes into account sentiment, just the frequency of mentions in tweets.
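The counting behind a tweets-per-minute graph can be sketched in Python as below. This is not the collection code; the tweets, time stamps, and keyword lists are all made up for illustration, and the collection step through the streaming API is left out entirely.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweets as (timestamp, text) pairs.
tweets = [
    ("2014-09-30 20:07:12", "Let's go Royals! #TakeTheCrown"),
    ("2014-09-30 20:07:45", "A's up early #GreenCollar"),
    ("2014-09-30 20:08:03", "Royals tie it up"),
    ("2014-09-30 20:08:30", "What a game"),
]

# Illustrative keyword lists based on the examples named above.
KEYWORDS = {
    "KC": ["royals", "takethecrown"],
    "OAK": ["a's", "greencollar"],
}

def tweets_per_minute(tweets):
    """Count keyword mentions per (minute, team); sentiment is ignored."""
    counts = Counter()
    for stamp, text in tweets:
        minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").strftime("%H:%M")
        lowered = text.lower()
        for team, words in KEYWORDS.items():
            if any(word in lowered for word in words):
                counts[(minute, team)] += 1
    return counts

volume = tweets_per_minute(tweets)
print(volume)
```

Simple substring matching like this is part of why the proportions between teams might be off: a short keyword such as “A’s” is easy to miss or to match by accident.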

One of the more fan-accessible advanced stats is playoff odds [technically postseason probabilities]. Playoff odds range from 0% – 100%, telling the fan the probability that a certain team will reach the MLB postseason. These are determined by creating a Monte Carlo simulation which runs the baseball season thousands of times [FanGraphs runs theirs 10,000 times]. In those simulations, if a team reaches the postseason 5,000 times, then the team is predicted to have a 50% probability of making the postseason. FanGraphs and Baseball Prospectus run these every day, so playoff odds can be collected every day and show the story of a team’s season if they are graphed.
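To make the Monte Carlo idea concrete, here is a heavily simplified Python sketch. It is not how FanGraphs or Baseball Prospectus build their models: the league, win probabilities, and the independent coin-flip games (no schedule, no divisions) are all assumptions made for illustration.

```python
import random

def playoff_odds(teams, games_left, spots, n_sims=10_000, seed=0):
    """Estimate each team's chance of finishing in the top `spots` by simulation.

    teams: dict of name -> (current_wins, assumed_win_probability)
    Simplifications: every remaining game is an independent coin flip and
    ties in the standings are broken at random.
    """
    rng = random.Random(seed)
    made_it = {name: 0 for name in teams}
    for _ in range(n_sims):
        final = {}
        for name, (wins, p_win) in teams.items():
            extra = sum(rng.random() < p_win for _ in range(games_left))
            final[name] = wins + extra
        # Sort by final wins; the random second key breaks ties.
        ranked = sorted(final, key=lambda t: (final[t], rng.random()), reverse=True)
        for name in ranked[:spots]:
            made_it[name] += 1
    return {name: count / n_sims for name, count in made_it.items()}

# Invented four-team league with 30 games left and two postseason spots.
league = {"A": (75, 0.58), "B": (72, 0.52), "C": (70, 0.50), "D": (64, 0.45)}
odds = playoff_odds(league, games_left=30, spots=2, n_sims=2_000)
print(odds)
```

Each team’s odds are just the share of simulated seasons in which it finished in a postseason spot, which is the same counting described above.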

Above is a composite graph of the three different types of teams. The Dodgers were identified as a good team early in the season and their playoff odds stayed high because of consistently good play. The Brewers started their season off strong but had two steep drop-offs in early July and early September. Even though the Brewers had more wins than the Dodgers, the FanGraphs playoff odds never valued the Brewers more than the Dodgers. The Royals started slow and had a strong finish to secure themselves their first postseason berth since 1985. All these seasons are different and their stories are captured by the graph. Generally, this is how fans will remember their team’s season: by the storyline.

Since the playoff odds change every day and become either 100% or 0% by the end of the season, the projections need to be compared to the actual results at the end of the season. The interpretation of having a playoff probability of 85% means that 85% of the time teams with the given parameters will make the postseason.

I gathered the playoff odds for the entire 2014 season from FanGraphs and put their predictions in buckets containing 10% increments of playoff probability. For the 20% bucket, the expectation is that about 20% of all the predictions in that bucket belong to teams that go on to the postseason. This can be applied to all the buckets: 0%, 10%, 20%, etc.
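The bucketing can be sketched in Python like this. The prediction pairs are invented placeholders, not the FanGraphs data, and rounding to whole percentages before bucketing is my own choice to avoid floating-point edge cases.

```python
from collections import defaultdict

# Hypothetical (predicted_probability, made_postseason) pairs, one per team-day.
predictions = [
    (0.03, False), (0.08, False), (0.22, False), (0.27, True),
    (0.24, False), (0.21, False), (0.88, True), (0.84, True),
    (0.86, False), (0.97, True),
]

def calibration_table(predictions):
    """Group predictions into 10% buckets and report the observed postseason rate."""
    buckets = defaultdict(list)
    for prob, made_it in predictions:
        percent = round(prob * 100)
        index = min(percent // 10, 9)  # a 100% prediction joins the 90-100% bucket
        buckets[index].append(made_it)
    return {
        f"{i * 10}-{i * 10 + 10}%": sum(outcomes) / len(outcomes)
        for i, outcomes in sorted(buckets.items())
    }

table = calibration_table(predictions)
print(table)
```

If the odds are well calibrated, the observed rate in each bucket should land inside that bucket’s range once there are enough predictions.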

Above is a chart comparing the buckets to the actual results. Since this is only using one year of data and only 10 teams made the playoffs, the results don’t quite match up to the buckets. The desired pattern is encouraging, but I would insist on looking at multiple years before making any real conclusions. The results for any given year are subject to the ‘stories’ of the 30 teams that play that season. For example, the 2014 season did have a team like the 2011 Red Sox, who failed to make the postseason after having a > 95% playoff probability. This is colloquially considered an epic ‘collapse’, but the 95% probability prediction not only implies there’s a chance the team might fail, but it PREDICTS that 5% of the teams will fail. So there would be nothing wrong with the playoff odds model if ‘collapses’ like the Red Sox only happened once in a while.

The playoff probability model relies on an expected winning percentage. Unlike a binary variable like making the postseason, a winning percentage has a more continuous quality to the data, so this will make the evaluation of the model easier. For the most part, teams do a good job staying around the initial predicted winning percentage, coming really close to the prediction by the end of the season. Not every prediction is correct, but if there are enough good predictions the predictive model is useful. Teams also aren’t static, so bad teams can become worse by trading away players at the trade deadline or improve by acquiring those good players who were traded. There are also factors like injuries or player improvement that the prediction system can’t account for because they are unpredictable by definition. The following line graph allows you to pick a team and check to see how they did relative to the predicted winning percentage. Some teams are spot on, but there are a few like the Orioles or Red Sox which are really far off.

The residual distribution [the actual values – the predicted values] should be a normal distribution centered around 0 wins. The following graph shows the residual distribution in numbers of wins; the teams in the middle had their actual results close to the predicted values. The values on the edges of the distribution are more extreme deviations. You would expect that improved teams would balance out the teams that got worse. However, the graph is skewed toward the teams that become much worse, implying that there would be some mechanism that makes bad teams lose more often. This is where attitude, trades, and changes in strategy would come into play. I’d go so far as to say this is evidence that soft skills of a team like chemistry break down.

Since I don’t have access to more years of FanGraphs projections or other projection systems, I can’t do a full evaluation of the team projections. More years of playoff odds should yield probability buckets that reflect the expectation much better than a single year. This would allow for more than 10 different paths to the postseason to be present in the data. In the absence of this, I would say the playoff odds and predicted win expectancy are on the right track and a good predictor of how a team will perform.

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the post for reference.

The Baseball Side

A team in the American League will average .4830 runs per inning, but does this mean they will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring none, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice as much as the mean, so the real data will ‘spread out’ more than a PD predicts.

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question: what’s the probability that I get 3 TAILS before I get 5 HEADS when I continue to flip a coin? This doesn’t at first intuitively seem like it relates to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to describe why 73% of all MLB innings are scoreless to a friend over a beer, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in rules regarding the designated hitter between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league. I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had 0.4468 runs/inning as the expected value with a .9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution.

It’s clear that there are a lot of scoreless innings, and very few innings having multiple runs scored. This distribution allows someone to calculate the probability of an MLB team scoring more than 7 runs in an inning or the probability that the home team forces extra innings down by a run in the bottom of the 9th. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout assuming he will pitch for all 9 innings.

Runs Per Game

The NBD and PD can be used to describe the runs scored in a game by a team as well. Once again, I separated the AL and NL, because the AL had an expected run value of 4.4995 runs/game and a 9.9989 variance, and the NL had 4.2577 runs/game expected value and 9.1394 variance. This data is taken from 2008-2013. I used a larger span of years to increase the total number of games.

Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution had a terrible fit when compared to the previous graph. Both models, however, underestimate the shutout rate. A remedy for this is to adjust for zero-inflation. This would increase the likelihood of getting a shutout in the model and adjust the rest of the probabilities accordingly. The need for zero-inflation implies that baseball scoring isn’t completely random. A manager is more likely to use his best pitchers to continue a shutout rather than randomly assign pitchers from the bullpen.

Hits Per Inning

It turns out the NBD/PD are useful in many other baseball statistics like hits per inning.

The distribution for hits per inning is similar to runs per inning, except the expected value is higher and the variance is lower relative to it. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits per inning distribution has more values in the middle and fewer at the extremes than the runs per inning distribution.

I could spend all day finding more applications of the NBD and PD, because there are really a lot of examples within baseball. Understanding how these discrete distributions work will help you understand how the game works, and they could be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

There are two parameters for this equation: expected value [$latex \lambda&s=1$] and the number of runs you are looking to calculate the probability for [$latex x&s=1$]. To determine the probability of a team scoring exactly three runs in a game, you would set $latex x = 3&s=1$ and using the AL expected runs per game you’d calculate:

$latex P(X = 3) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\% &s=2$

This is repeated for the entire set of $latex x&s=1$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
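The same calculation can be checked with a few lines of Python. This is just a sketch of the formula above using the AL expected values quoted in the post; cutting the run totals off at 30 is my own choice, since anything beyond that is negligible.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with expected value lam."""
    return exp(-lam) * lam ** x / factorial(x)

AL_RUNS_PER_GAME = 4.4995    # from the post (2008-2013)
AL_RUNS_PER_INNING = 0.4830  # from the post (2011-2013)

p_three_runs = poisson_pmf(3, AL_RUNS_PER_GAME)
game_distribution = [poisson_pmf(x, AL_RUNS_PER_GAME) for x in range(31)]
p_scoreless_inning = poisson_pmf(0, AL_RUNS_PER_INNING)

print(f"P(3 runs in a game) = {p_three_runs:.2%}")
print(f"P(scoreless inning) = {p_scoreless_inning:.2%}")
```

The scoreless-inning probability comes out to roughly 62%, well short of the 73% of innings that are actually scoreless, which is the underestimate described earlier.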

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, this assumption doesn’t hold true, so the empirical data from actual baseball results doesn’t quite fit the PD and is overdispersed. The NBD accounts for the variance by including it in the parameters.

The negative binomial distribution is usually symbolized by the following equation:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

where $latex r&s=1$ is the number of successes, $latex k&s=1$ is the number of failures, and $latex p&s=1$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $latex p&s=1$ or a clear concept of what will be measured, because the NBD measures the probability of binary, Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher, because a SUCCESS will be defined as getting out of the inning or game, and a FAILURE will be allowing 1 run to score. This will conform to the restriction by having a success [getting out of the inning/game] be the ultimate event of the series.

In order to make this work the NBD needs to be parameterized differently, for mean, variance, and number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $latex \alpha&s=1$ represents the ‘odds in favor’ of getting out of the inning. And $latex r&s=1$ is the expected value multiplied by the ‘odds in favor’ which will yield a real, non-integer for the number of successes. The NBD can then be written as
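One way to write this out, assuming the usual negative binomial mean $latex \mu = r(1-p)/p&s=1$ and variance $latex \sigma^2 = r(1-p)/p^2&s=1$ (the symbols $latex \mu&s=1$ and $latex \sigma^2&s=1$ are notation assumed here for the mean and variance), is a reconstruction from the definitions above:

$latex p = \frac{\mu}{\sigma^2}, \quad \alpha = \frac{p}{1-p} = \frac{\mu}{\sigma^2 - \mu}, \quad r = \mu\alpha = \frac{\mu^2}{\sigma^2 - \mu}&s=2$

$latex P(X = k) = \frac{\Gamma(r+k)}{k!\,\Gamma(r)} \left(\frac{\alpha}{1+\alpha}\right)^{r} \left(\frac{1}{1+\alpha}\right)^{k}&s=2$

Since $latex p = \alpha/(1+\alpha)&s=1$ and $latex 1-p = 1/(1+\alpha)&s=1$, this is the same distribution as the combination form above with the binomial coefficient replaced by its $latex \Gamma&s=1$ equivalent.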

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $latex \Gamma &s=1$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes, and the gamma function is a continuous function from 0 to infinity.
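Here is a hedged Python sketch of the gamma-function form, using the AL means and variances quoted in the post. It assumes the standard conversion from mean and variance to $latex r&s=1$ and $latex p&s=1$ (with $latex p&s=1$ equal to the mean divided by the variance); the function names are hypothetical, and the run totals are cut off at 60 only because the tail beyond that is negligible.

```python
from math import exp, lgamma, log

def nbd_params(mean, variance):
    """Convert mean and variance into (r, p); requires variance > mean."""
    alpha = mean / (variance - mean)   # 'odds in favor' of getting out
    r = mean * alpha                   # real-valued number of successes
    p = alpha / (1 + alpha)            # probability of success
    return r, p

def nbd_pmf(k, mean, variance):
    """P(X = k runs) using the gamma-function form of the negative binomial."""
    r, p = nbd_params(mean, variance)
    log_coeff = lgamma(r + k) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))

# AL values quoted in the post.
per_game = [nbd_pmf(k, 4.4995, 9.9989) for k in range(61)]
p_scoreless_inning = nbd_pmf(0, 0.4830, 1.0136)
most_likely_runs = max(range(len(per_game)), key=per_game.__getitem__)

print(f"P(scoreless inning) = {p_scoreless_inning:.1%}")
print(f"Most likely runs in a game: {most_likely_runs}")
```

With these inputs the scoreless-inning probability lands at about 72%, much closer to the observed 73% than the Poisson version, and the most likely run total in a game is 3.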

Conclusion

The negative binomial distribution is really useful in modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure would be letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain: the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shutouts, indicate elements of the game aren’t completely random. I attribute the underestimated number of shutouts to the increased use of the best relievers in shutout games compared to other games, which increases the total number of shutouts and subsequently decreases the frequency of other run-total games.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution please feel free to tweet me @seandolinar.

The Pirates bullpen has been a source of problems and criticisms for the Pirates this year. At the beginning of 2014, the bullpen had almost the same personnel as the 2013 season. Bullpens can vary wildly from year to year, and the Pirates relievers pitched out of their minds for most of 2013, so you’d expect there to be some fall off. Currently [August 26, 2014], the Pirates lead MLB with 22 blown saves. Personally, I abhor saves and blown saves, but I needed to get this out of the way, since it’s the stat that will get thrown around the most. And for reference Tony Watson [the All-Star] leads the team with 6 blown saves. So there’s that.

I wanted to look at some of the peripheral stats of the Pirates bullpen to understand the entire story. First, the Pirates starters have been terrible this year. They rank last in starter WAR, middle of the pack in FIP, and near the bottom in WPA. Analyzing that situation is for another day, but suffice it to say they give up a lot of runs before the bullpen gets into the game. The smaller the average lead the bullpen has to hold on to, the more often they will give up the lead [accrue a blown save]. Shutdowns and meltdowns are FanGraphs stats which are better for evaluating individual relievers than saves. They provide a broader evaluation of how a pitcher or bullpen has performed rather than just looking at save situations. For a shutdown a pitcher basically adds to the win probability, while for a meltdown a pitcher subtracts from the win probability. For instance, last night Jared Hughes had a meltdown, allowing three runs and inverting the win probability.

The Pirates are in the middle of the pack for both of those stats. There really isn’t anything interesting here.

Finally, the Pirates’ reliever xFIP is not very good. It’s towards the lower end of MLB. xFIP is one of the better park-independent, context-independent predictors of pitching skill. It just uses BB, K, and flyballs [for HR/FB]. This will also ‘adjust’ for some of Grilli and Frieri’s HRs that they gave up when they were struggling earlier this year. Those struggles won’t affect the bullpen moving forward since they are no longer on the team.

After this quick analysis, the answer to my initial question about the Pirates bullpen is that they aren’t good. They aren’t terrible, but they aren’t good. They do have two really good pitchers in Melancon and Watson, and two decent pitchers in Wilson and Hughes. Then the rest aren’t great. Taking this analysis, what could the Pirates do to improve? Frieri was a gamble that didn’t pay off. But honestly, I think from a management standpoint, you had to get rid of Grilli to get him out of the closer role. John Axford might help. He’s been good in his 5 appearances for the Pirates so far, and his career xFIP is 3.26, which is pretty good. As far as a trade, ‘proven’ relievers are overvalued in the free agent market, and the trade market was really expensive this year. Overall, one reliever isn’t going to affect your win total dramatically.

Bases loaded, no outs is one of the most tenuous points of a close baseball game. If you are rooting for the team at the plate, you feel confident your team will score here. Anything else would be a huge disappointment. If you are rooting for the fielding team and your pitcher gets out of the jam, you are elated and praising the pitching staff for being able to handle pressure. Even though bases loaded, no outs (BLNO) seems like a sure thing, there is about a 15% chance the team DOESN’T score at all.

I’ve created this table of the probability of scoring AT LEAST ONE RUN in the various base-out state situations using data from 2011-2013. The base-out states represent the 8 possible combinations of runners on base with the 3 out states that can exist [24 total]. 1-- means there’s only a runner on first, 1-3 means first and third, and 123 is bases loaded. Looking at the chart there is only an 85.18% chance that the team with BLNO scores a run. It’s one of the highest run probability situations, but there’s still a significant chance they won’t score a run.

This table considers every play that started with this base-out configuration and looks at the remainder of the inning to see if the team scored. [It uses every play in baseball from 2011-2013 including playoff games.] In general these numbers fluctuate slightly over time and between teams. This table is also context neutral, specifically batter neutral, so having Mike Trout at bat would significantly change the probability versus a player like Clint Barmes.
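The bookkeeping behind a table like this can be sketched in Python as follows. The play records here are invented and the tuple layout is an assumption; real play-by-play data would need to be parsed into this shape first.

```python
from collections import defaultdict

# Hypothetical half-innings: each play is (base_state, outs_before_play, runs_on_play).
# Base states use the notation above: '---' empty, '1--' runner on first, '123' loaded.
half_innings = [
    [("---", 0, 0), ("1--", 0, 0), ("12-", 0, 0), ("123", 0, 0),
     ("123", 1, 0), ("123", 2, 0)],                       # loaded, no outs, no runs
    [("---", 0, 0), ("--3", 0, 1), ("---", 1, 0), ("---", 2, 0)],
    [("---", 0, 0), ("1--", 0, 0), ("12-", 0, 0), ("123", 0, 2),
     ("-23", 0, 0), ("-23", 1, 0), ("-23", 2, 0)],
]

def scoring_probability(half_innings):
    """Share of plays in each base-out state after which at least one run scores."""
    seen = defaultdict(int)
    scored = defaultdict(int)
    for plays in half_innings:
        runs_remaining = 0
        # Walk backwards so each play knows the runs scored from it to the end.
        for base_state, outs, runs in reversed(plays):
            runs_remaining += runs
            seen[(base_state, outs)] += 1
            scored[(base_state, outs)] += runs_remaining > 0
    return {state: scored[state] / seen[state] for state in seen}

table = scoring_probability(half_innings)
print(table[("123", 0)])
```

Counting runs through the end of the inning, rather than only on the play itself, is what makes a run scored after two outs still count for the earlier bases loaded, no outs state.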

Looking at the table, it’s apparent that for scoring AT LEAST one run, the lead runner is the most important factor, since the base-out states have similar probabilities whenever the lead runner is at third, and likewise whenever the lead runner is at second. So having a lead-off triple is about as valuable [in the context of scoring ONLY one run] as having the bases loaded, no out.

There are different run and out possibilities that exist with each base-out state. For the lead-off triple, there is no force play on the bases, while a bases-loaded situation has a force play at every bag including home. Having bases loaded would turn a ground ball into a potential run-robbing force play, while a single runner on third would require a tag. Conversely, BLNO allows for walks and hit by pitches to drive in a run. This table also uses the entire rest of the inning, not just the play that occurs with BLNO. So if the team gets the bases loaded with no out, makes two outs, then scores a run, it still counts as a success. A double play, which is easier to get with bases loaded than with just a runner on third, will dramatically reduce the run probability of the next play, affecting the previous base-out state. In summary, there are trade-offs that affect the overall, context-neutral probability of the base-out state.

Example — Pirates Game

Failing to score a run in the context of this post means after loading the bases, the team does not score any runs before the end of the inning. All the probabilities are determined empirically.

Something kind of cool happened during the Pirates game last night (8/8/2014). There were two instances where the bases were loaded with no outs, and the teams weren’t able to score any runs. Not being able to score any runs with the bases loaded/no outs isn’t that uncommon. A run-probability table can tell you that ~14% of the time a team will fail to score any runs for the rest of the inning after achieving that base-out state.

A base-out state is one of the 24 possible combinations of baserunners and number of outs. So there are 8 base states, bases empty, runner on first, etc. to bases loaded, and three different out states, 0, 1, or 2 outs. 8 x 3 = 24.
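For anyone who wants to see the 24 states listed out, here is a short Python sketch. The dash notation for empty bases is borrowed from the run-probability tables and is just one possible labeling.

```python
from itertools import product

BASES = ("1", "2", "3")

# Every on/off combination of the three bases: 2 ** 3 = 8 base states.
base_states = [
    "".join(base if occupied else "-" for base, occupied in zip(BASES, combo))
    for combo in product((False, True), repeat=3)
]
out_states = (0, 1, 2)

# Pair every base state with every out state: 8 x 3 = 24.
base_out_states = list(product(base_states, out_states))
print(len(base_states), len(base_out_states))
```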

In the control room at the Pirates game last night, we were debating how often you see two occasions in the same game where no runs are scored after the bases are loaded with no outs. It turns out it’s relatively rare, but it happened twice at PNC Park before 2014: May 12, 2002 and August 28, 2003.

Between 2003 and 2013, bases were loaded with no out and no runs scored 1,092 times. There were 25 games in which this happened multiple times, which is 0.0923% of all games played during that time [27,094 games]. This is on par with the probability of seeing a no-hitter (0.111%) and less probable than seeing a walk-off walk to end the game (0.266%).
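The percentage is simple division, but a quick Python check makes it easier to restate as “one game in N”. The two counts are the ones quoted above; the variable names are mine.

```python
multi_games = 25        # games with two or more scoreless bases loaded/no outs situations
total_games = 27_094    # games played in the sample

share = multi_games / total_games
games_per_occurrence = total_games / multi_games

print(f"{share:.4%} of games, or about 1 in {games_per_occurrence:.0f}")
```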

The probability of seeing a game with two or more non-scoring bases loaded/no outs situations is 0.0923%

Using the table below bases empty/no outs will occur in every game (this happens at the start of every inning), and all the other base-out states have varying frequencies with runners on third with low out-states being the rarest. Bases loaded/no outs is the rarest base-out state occurring in only 21.92% of all games and occurring twice in the same game only in 6.05% of all games.

Just for reference, here is a chart of how often the base-out state events occur relative to all events. This would represent the probability that any random event (plate appearance, at-bat, stolen base, etc.) would have that base-out state.

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. This is rather intuitive when given the definition, but time series are different from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which type of stocks gained the most over a quarter in your portfolio. This is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see what a time series looks like. It’s messy. I created this data set, so I can describe this process accurately. It’s a first-order moving average process with a lag_1 coefficient of 0.9 and a series mean of 0. I’ve also included the normal linear regression (OLS) trend for the time series that shows it to have a slightly positive trend. This is a typical analytical technique to show that a time series is moving. In this case the trend is non-significant over these 50 data points. There is no trend, and the mean is zero.

The model that corresponds to the graph above has the general form as follows:

$latex y_t = \mu + a_t + \theta_1 a_{t-1}&s=2$

where $latex y&s=1$ is the time-dependent target variable, $latex \mu&s=1$ is the average of the entire series of data, $latex \theta&s=1$ is the regression coefficient, and $latex a&s=1$ is a time dependent shock to the system. The $latex t&s=1$ terms describe which time period the variable is from starting with the most current one, $latex t=50&s=1$.

Before describing the model above, it is important to fully understand what the $latex a_t&s=1$ represents. This is a shock term that can encompass a lot of different things. If you are considering something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in the absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $latex a_t&s=1$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $latex E[a_t] = 0&s=1$. The expected value is another way to describe the average of all the $latex a_t&s=1$ terms.
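To make the model concrete, here is a minimal Python sketch that simulates a series like the one described above. The coefficient of 0.9, the mean of 0, and the 50 data points come from the text; the function names, the seed, and the unit-variance normal shocks are my assumptions for illustration, so this will not reproduce the exact series in the graph.

```python
import random

def simulate_ma1(n, mu=0.0, theta=0.9, seed=42):
    """Simulate y_t = mu + a_t + theta * a_{t-1} with standard normal shocks."""
    rng = random.Random(seed)
    # One extra shock so the first observation has a lagged shock to use.
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [mu + shocks[t] + theta * shocks[t - 1] for t in range(1, n + 1)]

def ols_slope(y):
    """Slope of the least squares line of y against t = 0, 1, 2, ..."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

series = simulate_ma1(50)
print(len(series), round(ols_slope(series), 4))
```

Changing the seed gives a different series, and the fitted slope will wander slightly above or below zero even though the true process has no trend.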

Here’s a great way to think about the MA process. Think about simplified personal monthly expenditures where you have a constant salary and a modest savings account. Shocks that would be included in the $latex a_t&s=1$ term would be unexpected expenses. The unexpected expense could influence the next time period if you had to dip into savings. So a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HRs and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense. A shock from the previous year like expansion, steroids, or selection bias would carry over year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HRs per season increased over time up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between two years of HRs will be modeled. A difference ($latex \nabla&s=1$) is simply $latex y_t - y_{t-1}&s=1$. For example, the difference in HRs between 2013 and 2012 would be -279 HRs.
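Differencing is simple enough to show in a few lines of Python. This is just a sketch: the function name is mine, and the season totals are made-up numbers for illustration, not real MLB data.

```python
def first_difference(values):
    """Return y_t - y_{t-1} for each consecutive pair of observations."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Hypothetical season totals, NOT real MLB home run counts.
totals = [4000, 4150, 4100, 4300]
print(first_difference(totals))  # [150, -50, 200]
```

Note that the differenced series is one observation shorter than the original, since the first year has nothing to be compared against.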

The green line is the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line is the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$latex \nabla y_t = \mu + a_t - \theta a_{t-1}&s=2$

Substituting the estimated values for $latex \theta&s=1$ and $latex \mu&s=1$, a forecast can be made with the following equations:

$latex y_{t+1} = \mu + y_t + a_{t+1} - \theta a_{t}&s=2$

$latex y_{t+1} = 50.11163 + y_t + a_{t+1} - .45073 a_{t}&s=2$

The last equation is used to generate the forecast line and ultimately the 50% CI lines. The interpretation of this equation is that about half of the shock from the previous time period still has an effect on the change to the current period. The forecast predicts that home runs will actually increase over the next few years and not continue the decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use a HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.]
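Here is a rough Python sketch of how the forecast equation turns into a number. The two coefficients are the estimates from the equation above; the function names and the input values are hypothetical and only there for illustration. Since the future shock has an expected value of zero, it drops out of the point forecast, and the most recent shock is taken to be the amount the previous forecast missed by.

```python
MU = 50.11163    # estimated mean of the differenced series, from the text
THETA = 0.45073  # estimated MA(1) coefficient, from the text

def shock(y_actual, y_forecast):
    """The shock a_t is what the previous one-step forecast missed by."""
    return y_actual - y_forecast

def forecast_next(y_t, a_t, mu=MU, theta=THETA):
    """Point forecast of y_{t+1}; the future shock a_{t+1} is set to its mean, 0."""
    return mu + y_t - theta * a_t

# Hypothetical inputs, NOT real MLB totals.
previous_forecast = 4800.0
actual = 4700.0
a_t = shock(actual, previous_forecast)  # the season came in 100 HRs low
print(forecast_next(actual, a_t))
```

Because the sign on the lagged shock is negative, a season that came in below its forecast pushes the next forecast up, which is the model pulling back toward its trend.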

The HR/Gm is the time series that a baseball analyst would want to use, because it controls for extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model and can be described by the following equation:

$latex y_{t+1} = 0.0045989 + y_t + a_{t+1} - .49927 a_{t}&s=2$

Still, the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands as a really enigmatic outlier. There was no expansion that year. The best explanation is there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s coincides with the ‘steroid era’ and two close periods of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only that its ‘shock’ is mixed in with the expansion team ‘shock’. Also, it should be noted HRs/Gm haven’t returned to pre-1993 expansion levels.

Looking at the opposite of a home run, the strike outs per year has a trend that is much more steady, and it’s increasing.

The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out. The parameters can be found in the SAS output appendix I have for this page. The forecast has a definite increase in total strike outs over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at the K/Gm. The K/Gm time series turns out to be a different model than the first three models, because it is just a random walk around a linear trend.

This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $latex a_{t-1}&s=1$ term. The ARIMA(0,1,0) model is given by the following equation:

$latex \nabla y_t = \mu + a_t&s=2$

and the forecast equation with parameters in it would be:

$latex y_{t+1} = 0.11637 + y_t + a_{t+1}&s=2$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4, which is within the 50% CI of the forecasted value of 15.2497.
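Forecasting from a random walk with drift is about as simple as it gets, so here is a small Python sketch. The drift is the estimate from the equation above. The starting value is my assumption: it is the 2013 K/Gm implied by subtracting the drift from the 2014 forecast of 15.2497, so treat it as approximate. The function name is also mine.

```python
DRIFT = 0.11637  # estimated yearly increase in K/Gm, from the text

def forecast_random_walk(last_value, steps, drift=DRIFT):
    """Point forecasts for a random walk with drift: each year adds the drift."""
    return [last_value + drift * h for h in range(1, steps + 1)]

# Assumed starting point: the 2014 forecast minus one year of drift.
last_observed = 15.2497 - DRIFT
for year, value in zip(range(2014, 2017), forecast_random_walk(last_observed, 3)):
    print(year, round(value, 4))
```

Since the model has no memory of past shocks, the point forecast is just a straight line from the last observed value, and all of the uncertainty shows up in the widening confidence interval.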

While these models can make predictions about baseball, I wouldn’t consider these the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strike outs in 2014 than 2013. However, this example is a good academic exercise in understanding how univariate time series work. And hopefully it provides some insight into both time series and a little bit about trends in baseball.

My friend sparked my recent interest in Poisson distributions by mentioning how rare it is to meet a romantic interest/significant other that you’ll have a long term relationship with, as opposed to going out for just a few dates or not even dating at all. I immediately thought about earthquakes. It’s strange, but makes some sense, since the large-impact earthquakes are both very unpredictable and rare, much like dating. I’d love to show this actually happens, but since I can’t download relationship data, I’ve found something almost as good: baseball data!

A Poisson distribution [pronunciation] is used for count data and rare events over a specified time/area. This is in contrast to the more familiar bell-curve normal distribution, which uses continuous data. [For math/science people, it’s a decaying exponential] A few good examples of potential models using a Poisson distribution are the number of sick days a person uses throughout a year or traffic accidents per month on a certain stretch of road. Earthquake frequency modeling is probably one of the more famous uses of a Poisson distribution.

Getting back to baseball, runs are not common events, but I wouldn’t go so far as to call them rare events. However, in the context of individual innings, runs are rare. Going back to a previous post about the Pirates’ run probability, any given team in MLB only has a 26% chance that they will score in any given inning. This means that 73% of the time you are watching baseball you are watching the teams not score. I am interested in how often a team will score 0, 1, 2, 3 or more runs in an inning. To determine the probability that a certain number of runs are scored in any inning, a Poisson distribution can be used, and it follows the general form:

[latex]P(X) = \frac{e^{-\lambda}\lambda^X}{X!} [/latex]

Substituting the Run Expectancy at the beginning of an inning, which was .4615 runs per inning in 2013, for the [latex] \lambda [/latex] term, you will get the red distribution line below. [Run Expectancy/Expected Runs is a fancy way to say the average runs for a given situation.] The blue area represents the actual run frequencies, and the gold line is the distribution which I obtained from regression.
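The formula above is easy to evaluate directly. Here is a minimal Python sketch using the .4615 run expectancy from the text; the function name is my own, and this only computes the theoretical Poisson probabilities, not the actual frequencies from the data.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

LAMBDA = 0.4615  # 2013 run expectancy at the start of an inning, from the text

for runs in range(5):
    print(runs, round(poisson_pmf(runs, LAMBDA), 4))
```

With this value of lambda the formula gives roughly a 63% chance of a scoreless inning, which is noticeably lower than how often teams actually fail to score.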

The Poisson distribution describes how often runs are scored during innings pretty well, but it’s not perfect. The trend line underestimates the shutout and big-run innings, while overestimating the one-run innings. The model shown above is suffering from overdispersion, which means the variance [how spread out the data is] is larger than what the model assumes. The short reason to account for the lack of fit is that baseball isn’t completely random. You’ll have better teams who score multiple runs in an inning against poor teams who will in turn fail to score any runs in an inning. The disparity in teams will cause a wider variance in run scoring.

The gold line in the graph above is a distribution I obtained when I regressed the count data against the number of runs and obtained a ‘new’ mean. This distribution is a little bit closer to the empirical data, though it still suffers from overdispersion.

I’ve put all the counts and frequencies/probabilities into a table so that it is easier to reference. If you wanted to calculate the probability that you would see an entire game (full 9 innings) with 7 or more runs in an inning (like last night’s Pirates game), you would use the following formula:

[latex] P(\text{at least one 7+ run inning}) = 1 - P(\text{an inning without 7+ runs})^{18} [/latex]

[latex]= 1 - (1-[.0009+.0003+.0001])^{18} = .02314 [/latex]

So there’s a roughly 2% chance that any baseball game you attend will have an inning with 7 or more runs scored in it.
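The same arithmetic can be checked with a few lines of Python. The three inning probabilities are the ones used in the formula above; the function name is mine, and the calculation assumes each of the 18 half-innings is independent, which is the same assumption the formula makes.

```python
def prob_at_least_once(p_single, trials):
    """Chance an event with probability p_single happens at least once in independent trials."""
    return 1 - (1 - p_single) ** trials

# Probabilities of a 7+ run inning, as summed in the formula above.
p_big_inning = 0.0009 + 0.0003 + 0.0001

print(round(prob_at_least_once(p_big_inning, 18), 5))  # 0.02314
```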

There is a lot of debate about the usefulness of the comprehensive baseball statistic WAR (Wins Above Replacement). I don’t think that WAR is the end-all statistic, but it is a useful tool. Why? Because it can describe relatively accurately how a player contributes to a team. It also can help fans understand the real impact of one player. I might have to refer people here once people start clamoring that a single player will change the direction of a team at the trade deadline.

If anyone wants a primer on the details of what goes into the WAR stat, check out baseball-reference.com’s comparison between systems. Basically, WAR is the number of statistical wins the player is responsible for above a replacement player. In theory the replacement is the mediocre AAA player that is not a prospect. That statistic is the middle estimate of the impact the player will have; a player can be ‘responsible’ for more wins than their WAR number, but also drastically fewer. Think of WAR as the average wins he’s responsible for.

For probably over a year, I’ve wanted to see if WAR actually can predict the number of wins a team will have. I forget my original methods of trying to determine this, but this time around, I used FanGraphs’ WAR numbers for both pitching and batting from the last decade of seasons for all 30 teams. That’s 300 data points. After assembling the data and then running it through a basic linear regression, I was quite happy with what I saw. I’ve heard that if you add 48 to the team’s WAR number you will get their total wins, and this can be seen mathematically by looking at real data.

I’ve graphed the actual wins to WAR and actual wins to the Pythagorean predicted wins for comparison. [Pythagorean wins performed better.] The linear regression for the WAR comparison actually turns out to be incredibly powerful. The regression coefficient is almost exactly equal to one, meaning that each unit increase in WAR means an equal increase in wins. The y-intercept is +48.5, which means for the last decade the number of theoretical replacement wins has been just about 48. This should make sense, since the calculation of WAR is calibrated to a 48 win replacement level. The actual implementation of WAR works really well to predict team wins. Unfortunately, this model will have a 95% prediction interval of 20 wins. That seems like a lot, but it shows how much luck has to do with a baseball season.
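As a quick illustration of what the regression line says, here is a small Python sketch. The intercept of 48.5 is from the text. The slope of exactly 1.0 is my assumption, since the text only says the coefficient is almost exactly one, and the team WAR totals are hypothetical, not real FanGraphs numbers.

```python
REPLACEMENT_WINS = 48.5  # y-intercept reported in the text
SLOPE = 1.0              # assumption: the text says "almost exactly equal to one"

def predicted_wins(team_war, intercept=REPLACEMENT_WINS, slope=SLOPE):
    """Point estimate of season wins from a team's total WAR."""
    return intercept + slope * team_war

# Hypothetical team WAR totals for illustration only.
for war in (20.0, 35.0, 50.0):
    print(war, predicted_wins(war))
```

This only gives the point estimate on the trend line; the wide prediction interval mentioned above means an actual season can land well above or below it.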

Pythagorean wins are typically used to show how lucky a team has or hasn’t been in a given year. This is actually a slightly better predictor of a team’s success than WAR. There is less variance since run differential is just one step away from wins. You can see from the histograms that the spread on Pythagorean wins is less than with WAR. This can also be seen in the r-square for the linear regression. The Pythagorean wins linear model has an r-square of .87 while the WAR model has an r-square of .77. This ultimately means that 87% and 77% of the variance, respectively, is explained by each model, indicating that Pythagorean wins is slightly more accurate. The trade-off is that WAR can give you player-level detail while run differential is only team-specific.
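For anyone who hasn’t seen it, the Pythagorean expectation estimates a team’s winning percentage from runs scored and runs allowed. Here is a minimal Python sketch of the classic version with an exponent of 2. That exponent and the 162-game season are assumptions on my part, since other exponents are in common use and I’m not specifying which one was used for the graph. The run totals are hypothetical.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from the Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)

# Hypothetical run totals, NOT a real team's season.
print(round(pythagorean_wins(700, 650), 1))
```

A team that scores exactly as many runs as it allows comes out at 81 wins, and the gap between actual wins and this number is the usual rough measure of luck.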

As always, let’s look at what the Pirates did.

A theme I always harp on is that the 2013 Pirates were good and really lucky. This can be seen by the data point for 2013 falling above the linear regression trend line. If you were wondering, 2012 and 2011 (the two ‘collapse’ years) also fall above this line. I don’t know if this is the best way to measure a collapse, but the in-season stats did indicate regression during all three seasons, 2011-2013.