2.0 Importing baseball data

Now, let's try loading some historical baseball data into R. For the examples in this article, I focus on player salaries and team performance. We used Version 5.1 of the comma-delimited data from the Baseball Archive. The Baseball Archive is the result of a collaborative project begun by Sean Lahman to produce a freely available database of major league baseball stats. You can download baseball data from this site for non-commercial use, but the authors request a donation. (See the Baseball Archive for more information about terms of use and download links.)

R includes a function read.table that loads an external text file and returns a data frame. We'll use this function to load in a few of the most important tables. Suppose that you have your data in the directory "D:/lahman51-csv." The Baseball Archive tables are comma-separated value files with no column names in the header. Each call to read.table specifies the separator as a comma, specifies that there is no header line, and includes a list of column names. In our example, we'll load and use only two tables.

The salaries table includes only five columns. The teams table includes 49 different columns, summarizing the season for each team and year. As you can see, the columns include a mix of unique identifiers for teams and players, some information about each baseball team, and some common baseball statistics.

If you'd like more information about what each abbreviation stands for, see the documentation at the Baseball Archive. If you'd like more information about what each statistic means, mlb.com has a good reference page.

2.1 Example: Checking data integrity

Before we begin our analysis, let's check to make sure that the files loaded completely and that all the data makes sense. Many things can go wrong in extracting or loading data, so it's always a good idea to do a few sanity checks on a data mining or data analysis project. In this section, I show a few examples for the salaries table.

Let's start by checking that all the observations have been loaded. We know that there are 15,616 lines and five columns in the salaries file. To check that the file loaded correctly, we'll get the number of rows and columns in the salaries table using the "length" function.

Now, let's look at a few sample observations in each data frame. We'd like to take a look at the loaded data to make sure that nothing obvious looks wrong: columns of missing values, mislabeled columns, etc. In R, you can select a subset of the columns or rows from a data frame by using R's [] notation to select a subset of rows and columns.

No surprises so far. As a final check, let's look at some basic statistics for this table to make sure that every value makes sense. R includes a summary() function that provides some basic statistics about each variable. For numeric variables, this procedure returns some distribution information (the minimum value, value at the 25th percentile, median, value at the 75th percentile, and maximum) and the mean. For character variables, this procedure returns (up to) the six most common values. This data is great for sanity-checking a variable. If the minimum or maximum values look too low or high (respectively), you may need to filter outliers or do further checks that the data has loaded correctly. If the quartile, median, or mean values are surprising, or if the distribution is unexpected, you may want to double check your data source to make sure that it's correctly formatted, that it loaded completely, and that it means what you think it means.

We see that the salary data spans 1985 to 2003, as expected. The top team had 586 players over 19 years. The average number of players per year is 386/19 = ~30.8. The official roster has 25 players for most of each season, so that number is not unreasonable. The split between leagues (AL and NL) is about even, which makes sense. Now, let's look at the playerID numbers. First, let's check the maximum value. It looks like no player appears more than 19 times in the table, which is a good thing as there are only 19 years between 1985 and 2003. It also looks like there are only a few players who have played during all 19 years, which is also good as very few players play in the major leagues for 19 years or more. The maximum salary of $22 million looks right (that's A-Rod!), but the minimum looks funny. A salary of 0? Surely that would violate union rules! Let's do a little more investigation:

There are only two values that are zero; that's not a huge problem (we can investigate these two cases to correct them or just cut these values from our analysis). Below, we focus on year 2003 statistics, so this doesn't affect our analysis at all.

2.2 Example: Plotting a histogram of 2003 salaries

We're now ready to start doing some analysis. For convenience, let's start by creating a new data frame containing only salary values from 2003.

The easiest way to do this for a data frame is to use the subset function. This function takes a data frame as an argument and returns a new data frame containing a subset of the rows and columns from the input data frame. This function takes many options, but we'll focus on two: the data frame object and the row selection condition. For convenience, let's start by creating a data frame containing only salaries from the year 2003.

Now, let's take a look at the distribution of salaries in 2003 (a histogram), using the hist() command.

> hist(salaries2003$salary)

By default, R will split the salary range into ten evenly spaced segments, then count the number of observation in each segment, and plot the results as a column chart. (That's not completely true. What R will actually do is use a function called "Sturges" to calculate the appropriate number of bins given the data and then use this value. For this data, it works out to about ten bins.) When you type this command, R will pop up a new window showing the following diagram.

It looks like most salaries are pretty low, so let's increase the detail in this diagram. Let's increase the number of bins to 100.

> hist(salaries2003$salary, breaks=100)

After typing this command, R will display a graphic like this in the graphics window:

Hmmm, it looks like a lot of salaries are still bunched up together in the lowest bin. We can use a log scale to show more detail among the low values by plotting the log of salary. As you may remember, you can easily apply a function to a column of values using R. Let's calculate the (base 10) logarithm of each salary in our data and plot the results:

> hist(log10(salaries2003$salary), breaks=100)

R will plot a diagram like this.

It still looks like most salaries are pretty low. Most likely, this is because major league teams have a number of young players under contract for the league minimum (or close to the minimum). Turnover, even within a season, among low paid players is very high, as teams "call up" players from minor league teams and "send down" players to minor league teams. (As an exercise, it would be interesting to filter rookies from the data to focus on free agents or players who have been through salary arbitration.)

2.3 Example: Relationship between wins and team statistics in 2003

Next, let's look at the relationship between some different statistics about each team and the number of wins in the year 2003. Our primary focus is on wins, but we'll look at the relationship between other pairs of statistics as well. Let's start by creating a new data frame for 2003 and looking at the available columns:

There's a lot of good stuff here: hits (H), walks (BB), home runs (HR), runs scored (R), strikeouts (SO) by batters, runs allowed (RA), earned run average of pitching staff (ERA), strikeouts by pitching staff (SOA), and other variables. Unfortunately, there is no total salary, so we need to calculate total team salaries and add that column.

We know the salary of each player from the salaries data frame. We'd like to create a new data frame describing the total payroll of each team in 2003 and then merge the teams2003 table above with the salary information. Here's how to do that in R.

We now have salary information by team, so we're ready to plot some charts. R includes many functions for graphing values. One of my favorite plots is the pairs() plot. This function prints out a matrix of different scatter plots, one plot for every pair of variables in your data. It's a great way to look at a lot of relationships at once. It's a little hard to describe how to do this without an example, so let's use the pairs() function on a few variables in the 2003 team data.

As you can see, for every combination of variables in wins (w), runs (R), hits (H), home runs (HR), walks ("base on balls", BB), earned run average (ERA), strikeouts (SOA), and payroll (salary), there is a scatter plot. Actually, there are two scatter plots that are mirror images. Here's an example: there are two scatter plots comparing wins and runs. The second scatter plot from the left in the top row shows runs horizontally (the X axis) and wins vertically (the Y axis). The second scatter plot from the left in the top row shows wins horizontally (the X axis) and runs vertically (the Y axis).

This matrix of plots is a very good way to look for correlation between variables. On the diagonal, you can see the names of variables. For each plot (i,j) where i is the column and j is the row, the variable of the Y-axis is given by cell (j,j), and the variable of the X-axis is given by cell (i,i). For example, let's look at the cell in the fifth row and third column, (3,5). In this cell, the X-axis represents hits and the Y-axis represents walks.

That explanation was precise, but could be difficult to understand, so here's an illustration of how this works. I marked up a second copy of this plot to try to make it easier to understand. Here's a procedure for reading these charts. I picked a cell in the grid below and highlighted it in red. Let's start by figuring out what the vertical (Y) dimension means. All plots in a row share the same vertical (Y) dimension, so just look for the box in the row with a name. In this case, it's BB. Now, let's figure out the horizontal (X) dimension. All plots in a column share the same horizontal (X) dimension, so look for the box in the column with a name. Here, it is H.

There is another trick for deciphering these diagrams. Notice that a scale is given for some boxes. If each dimension has a different scale, this can help you read the boxes. For example, see the first label at the top of this pairs plot, showing a dimension from 600 to 900? Well, all plots in that column share that same scale. It turns out that the only measurement with an appropriate range is Runs, which is what that column means.

Now, what can we learn from these plots? Scatter plots are a great way to see relationships between pairs of values: a line or curve is a good indication of a strong relationship. You can see if the relationship is strong or weak depending on how noisy it is. If the points are all clustered together, that can also mean that there is a relationship. Let's focus on just payroll. The far right column has plots with payroll as the X (vertical) dimension, so these plots depict each measurement as a function of payroll. Most of these plots show a fuzzy relationship between payroll and performance. For example, teams with higher payrolls had lower ERAs, and teams with higher payrolls had more wins. However, this isn't as some other relationships in this data. For example, look at the relationship between strikeouts and ERA: the higher the strikeouts, the lower the ERA. Or look at the relationship between hits and runs. Just looking at this data, I'd guess that payroll is a good predictor of wins but that other metrics are much better. Let's use some other tools in R to see if this is true.

2.4 Example: Building a model for wins in 2003

How important is a team's payroll in determining wins? Let's try to fit a simple linear model, and ask R to show us some summary statistics about that model. We'll use an ordinary linear regression model, modeling wins as a function of payroll. The lm() function of R takes a model formula and data frame as arguments and returns a model object. (There are many other options available; see the help file for more information.) Below, we also use the summary() and plot() routines to examine the model object.

When you fit a line in R, the standard lm() routine returns a lot of statistical information about how well the model fits. For a thorough explanation of how this function works, and what all the results mean, please take a look at a statistics book.

Here's my interpretation of a few of these results. First, R shows the "residuals." The residuals are the difference between wins predicted by this model and the actual number of wins for each team. (So, a negative residual is an overestimate, a positive residual is an underestimate.) Ideally,
we'd see a symmetrical distribution--median near zero, min and max opposite each other. The median of 2.57 is pretty close to zero; there are 162 games a year, so that implies that the median predicted number of wins is about 2% over the mean number of Wins. In 2003, the Detroit Tigers came close to setting the record for the most losses in one season and ended with 43 losses and a payroll of $49 million. This model predicted 76.61647 wins for the tigers, hence the residual of -33.61647 shown above. At the opposite end of the scale, in 2003, the Oakland A's won 96 games with a payroll of only $50 million (remarkably close to the Tigers' payroll). According to this model, Oakland was expected to win 76.83840 games, hence the residual of 19.165917.

Next, R lists the coefficients and what each one means. The "Estimate" column contains the actual coefficients. In this case, the equation is Wins = 67.9329 + 1.99788 * 10-7 * salary. R also shows statistical values like standard error, t test, and p values that can be used to understand the significance of each variable. The significance codes (shown by asterisks on the right) are intended for quickly ranking the significance of each variable. Next, R lists the residual standard error, the coefficient of determination (R^2 and adjusted R^2), F statistics, and p values (all measurements of how well the model fits).

The coefficient of determination (R^2) represents the proportion of the variation in wins that is explained by the variation in salary. For this model, the R^2 value of .1753126 implies that salary explains some of the variation in wins. To help understand how well salary explains wins, let's look at what happens if we try to predict wins from other values. If we built a model for wins in terms of runs scored and earned from average, the R^2 value would be .8959. If we built a model for wins in terms of at bats, the R^2 value would be .04687. This implies that salary explains more than at bats but less than runs scored and earned run average. I'd say that there is a weak relationship between salary and wins: teams with higher salaries tend to perform better, but the players still have to go out and play baseball.

The analyses shown here demonstrate a weak correlation between payroll and wins. Teams that spend more money tend to do better than teams that spend less money, but not always. What the analysis doesn't show is why this is true. Some of this has to do with long-term contracts: young players are committed to the same team at a fixed salary for a few years, proving a good value to some low payroll teams. Some of this has to do with improperly valuing players with long-term contracts: as good as A-Rod is, a single $22 million player wasn't enough to make Texas a winning team. There's a lot more opportunity to explore what's going on here: what performance measurements teams pay for, and what performance measurements best predict wins.

3.0 Where to go from here

Whether you're a baseball fan who never thought she could calculate her own stats, an experienced stats geek who wants to check out some new tools, or a statistician looking for some fun data to analyze, I hope you learned something and try out some of these tools yourself.

Joseph Adler
has many years of experience in data mining and data analysis at companies including DoubleClick, American Express, and VeriSign. He is the inventor of several patents for computer security and cryptography, and the author of Baseball Hacks.