sab-R-metrics: Displaying Line Plots and Time Series Data

It's been a while since I've had the chance to add anything here. Last time I left everyone with some scatter plots and some customization tools for your graphics. This week will be a little briefer than the last few tutorials: I'd like to show you how to display line graphs for time series data. For this, I'll be using New York Yankees attendance and win percent from 1903 through 2010. I grabbed the data from Sports Business Data, and you can find it HERE at my website (along with the data from the other tutorials on the site).

Go ahead and set your working directory to the one you prefer and put the data file in that folder. We'll just name our data "yanks":

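##load data (already set working directory)
##NOTE: "yankees.csv" is a placeholder file name; use the name of the file you downloaded

yanks <- read.csv(file="yankees.csv", header=TRUE)
head(yanks)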

You can see here that there are three columns of data in the file: year, win, att. These are straightforward: 'year' is the year of that season, 'win' is the Yankees' win percent in that season, and 'att' is the average per-game attendance at Yankees home games that year.

Now, we could always go back and try some scatter plots with this new data. Say, for instance, we're interested in the relationship between winning and attendance. We could simply plot:
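The plotting call itself isn't reproduced here, so what follows is only a minimal sketch of what it could look like. The 'yanks' data frame below is simulated as a stand-in (these are not the real Yankees numbers) so the snippet runs on its own, and the axis labels and title are my own assumptions:

```r
## Simulated stand-in for the post's data (NOT real Yankees figures)
set.seed(1)
yanks <- data.frame(year = 1903:2010)
yanks$win <- round(runif(nrow(yanks), min = 0.40, max = 0.70), 3)
yanks$att <- round(10000 + 300 * (yanks$year - 1903) +
                   20000 * (yanks$win - 0.40) +
                   rnorm(nrow(yanks), sd = 3000))

## scatter plot of attendance against win percent (formula: y ~ x)
plot(att ~ win, data = yanks,
     xlab = "Win Percent", ylab = "Average Attendance Per Game",
     main = "Attendance and Winning")
```

With the real file loaded into 'yanks', only the plot() call is needed.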

Now this gives us an okay relationship, and it tells us pretty much what we'd expect: winning teams tend to draw more fans (see the upward slope of the points from left to right). On the other hand, the relationship is a bit of a mess: attendance was generally lower early in Yankee history than it is now, simply because demand for baseball has grown over time. And that's not to mention the fact that attendance is capped at a sellout point, which may mask some of the relationship in this simple look.

Also, there is some ambiguity about the direction of causality between winning and attendance. Of course people come to the game to see the team win, but the team also wins through investment, so in the long run the causality can be seen as going the other way around. Each of these is a serious econometric issue in sports economics and sport management that I don't plan to get into on this website.

Back to R. Here, I want to get into plotting lines. The best way to start is to use time series data. So instead of plotting these two variables, let's separately plot each one across time. This is pretty simple, and we'll begin again with the scatter plot of each across time:
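As a rough sketch of those two scatter plots across time (the original snippet isn't shown here): the data below is simulated so the code runs on its own and is not the real Yankees data, and the side-by-side layout and labels are assumptions on my part:

```r
## Simulated stand-in for the post's data (NOT real Yankees figures)
set.seed(1)
yanks <- data.frame(year = 1903:2010)
yanks$win <- round(runif(nrow(yanks), min = 0.40, max = 0.70), 3)
yanks$att <- round(10000 + 300 * (yanks$year - 1903) +
                   20000 * (yanks$win - 0.40) +
                   rnorm(nrow(yanks), sd = 3000))

## two panels side by side; keep the old settings so we can restore them
op <- par(mfrow = c(1, 2))

## each variable plotted against year
plot(att ~ year, data = yanks, xlab = "Year",
     ylab = "Average Attendance Per Game", main = "Attendance by Year")
plot(win ~ year, data = yanks, xlab = "Year",
     ylab = "Win Percent", main = "Win Percent by Year")

## restore the previous layout
par(op)
```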

Now this gives us a nice little picture of how attendance has increased over time. Obviously, average Yankee attendance wasn't the same in 1921 as in 2009, despite nearly identical winning percentages. This is why you have to be careful when looking at relationships over time. But we can do better than putting points on the plot like this. Looking at the win percent plot, it's difficult to gauge any patterns or cycles in the Yankees' winning. Usually a line plot is a better way to go about this. There are two easy ways to do this, and I'll start by adjusting our "plot" function:

##draw line plots of each over time
png(file="winattbyyearLINE.png", height=800, width=1400)
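Only the first line of that snippet is shown, so here is a hedged sketch of the plotting calls that would sit between the png() call and a closing dev.off(). It draws on whatever graphics device is current, uses simulated stand-in data (not the real Yankees numbers), and the labels are my own assumptions:

```r
## Simulated stand-in for the post's data (NOT real Yankees figures)
set.seed(1)
yanks <- data.frame(year = 1903:2010)
yanks$win <- round(runif(nrow(yanks), min = 0.40, max = 0.70), 3)
yanks$att <- round(10000 + 300 * (yanks$year - 1903) +
                   20000 * (yanks$win - 0.40) +
                   rnorm(nrow(yanks), sd = 3000))

op <- par(mfrow = c(1, 2))

## type="l" (lower case L) draws a line; lwd=2 doubles the default width
plot(att ~ year, data = yanks, type = "l", lwd = 2, xlab = "Year",
     ylab = "Average Attendance Per Game", main = "Attendance by Year")
plot(win ~ year, data = yanks, type = "l", lwd = 2, xlab = "Year",
     ylab = "Win Percent", main = "Win Percent by Year")

par(op)
```

If you wrap this in png(...), remember to finish with dev.off() so the file actually gets written.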

As you can see in the code above, I identified the plot type with 'type="l"' (that's a lower case L) to tell R that I want it to make a line out of the data. In addition, I used "lwd=2" to make the lines a bit thicker (the default is 1). However, R has a nice easy way to plot time series data without having to specify a line plot. We can simply use "plot.ts", and R will understand what we're doing. The code below should generate the same plots, but using this new function. You can see that there is no formula for the plotting here, just the dependent variable (the one you want to plot across time). The disadvantage, however, is that you need to customize the x-axis for it to show the year tick marks instead of the observation numbers. I prefer to use the simple 'plot' function with type="l" to avoid this, but sometimes you may want to customize that axis anyway. It's really up to you and what you are comfortable with.

###draw line plots of each over time using plot.ts
png(file="winattbyyearTS.png", height=800, width=1400)
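Again only the opening line is shown, so this is a sketch of one way the plot.ts version with a customized x-axis might look. The data is simulated (not the real Yankees numbers), and the choice of a tick every ten seasons and the names 'tick_at' and 'draw_ts' are my own assumptions:

```r
## Simulated stand-in for the post's data (NOT real Yankees figures)
set.seed(1)
yanks <- data.frame(year = 1903:2010)
yanks$win <- round(runif(nrow(yanks), min = 0.40, max = 0.70), 3)
yanks$att <- round(10000 + 300 * (yanks$year - 1903) +
                   20000 * (yanks$win - 0.40) +
                   rnorm(nrow(yanks), sd = 3000))

## put a tick at every 10th observation and label it with that season's year
tick_at <- seq(1, nrow(yanks), by = 10)

## helper (hypothetical name): plot.ts with the default x-axis suppressed,
## then draw our own axis showing years instead of observation numbers
draw_ts <- function(y, ylab, main) {
  plot.ts(y, lwd = 2, xaxt = "n", xlab = "Year", ylab = ylab, main = main)
  axis(side = 1, at = tick_at, labels = yanks$year[tick_at])
}

op <- par(mfrow = c(1, 2))
draw_ts(yanks$att, "Average Attendance Per Game", "Attendance by Year")
draw_ts(yanks$win, "Win Percent", "Win Percent by Year")
par(op)
```

Note that plot.ts gets just the variable, with no formula, which is why the x-axis has to be relabeled by hand.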

One thing to remember is that we can also plot the points first and then add the lines with "lines()" after the scatter plot. Depending on your objectives, this may be helpful for identifying where each year falls on the line (remember "cex=" tells R what size to make your points, while "pch=" tells it what type of point you'd like to use).
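A minimal sketch of the points-then-lines approach, again on simulated stand-in data (not the real Yankees numbers); the particular "pch" and "cex" values are just ones I picked for illustration:

```r
## Simulated stand-in for the post's data (NOT real Yankees figures)
set.seed(1)
yanks <- data.frame(year = 1903:2010)
yanks$win <- round(runif(nrow(yanks), min = 0.40, max = 0.70), 3)
yanks$att <- round(10000 + 300 * (yanks$year - 1903) +
                   20000 * (yanks$win - 0.40) +
                   rnorm(nrow(yanks), sd = 3000))

## scatter plot first: pch=19 gives solid circles, cex=0.8 shrinks them a bit
plot(att ~ year, data = yanks, pch = 19, cex = 0.8, xlab = "Year",
     ylab = "Average Attendance Per Game", main = "Attendance by Year")

## then add a line through the same points on the existing plot
lines(yanks$year, yanks$att, lwd = 2)
```

The data needs to be sorted by year for this to work, since lines() connects the points in the order they are given.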

So that's what I have for the beginners today. I know it's a bit short and not much in addition to last time, but I'm swamped today. Next time, I'm going to get into plotting a regression (and loess) line on these plots as well as cover some more color options like backgrounds and transparent colors. Using a regression line, we can get a better idea of the association on a scatter plot than simply looking at the points, while a loess regression can help to identify patterns in data (and can be very useful in visual inspection of time series data). As usual, I post the pretty R code below for this post:

#########################Line Plots and Time Series Plots######################