sab-R-metrics: Kernel Density Smoothing

Last time I left you, I had gone over some basics of doing loess regression in R. If you remember, loess is a sort of regression that allows wigglyness in your regression of some dependent variable Y on some independent variable X (I will generalize this to more than one dimension later on). However, sometimes we’re not always interested in how X affects Y. Sometimes we may simply be interested in the distribution and frequencies of some value of X. An example that comes to mind is Pitch F/X data, where we want to see the frequency of pitch locations in the strike zone.

For showing pitch location, we’re going to need two dimensions. Today, I’m going to begin with a single dimension of kernel density smoothing. It’s important to understand what this method is doing before jumping into displaying it in these multiple dimensions like the Pitch F/X heat maps you see everywhere nowadays.

For this tutorial, I want you to first go ahead and grab the Jeremy Guthrie pitch data from last time. You can download it directly here. We’ll look to smooth the starting velocity of pitches for Guthrie from the Pitch F/X data.

##set working directory and load dataand take a look at it

setwd(“c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics”)

guth <- read.csv(file=”guthrie.csv”, h=T)

head(guth)

Okay, now that the data is loaded in, let’s start with a histogram. Why? Well, density smoothing and histograms are very much related. Kernel density smoothing is really just a smooth, pretty version of a histogram. We’ll start simple with looking at the distribution of speeds for all pitches, then get a little more advanced and overlay the velocity of each pitch type on top of one another. For this exercise, we’ll have to look back to a previous post and use the “hist()” function in R. This will produce a nice histogram with the defaults. Go ahead and try it yourself. Practice using the RGB scheme and making transparent colors if you want. The histogram function has a bunch of options, but I’ll keep things basic for now. Don’t forget to give your plot a title and label the axes so we know what we’re plotting. Finally, be sure to tell R that you want to plot the density with the command “freq=FALSE“ within the histogram function. We’ll see why a little later. If you get stuck, you can reference the code below:

Okay, this looks pretty much like what we had in the previous tutorial. The range of pitch speeds included is strange on the left of the plot (we’ll fix this later, it’s likely just PFX data issues on 1 or 2 data points). But what does this have to do with kernel density smoothing? Well, density smoothing is another way of representing this distribution. If we don’t really like putting things in “bins” in a histogram, we can use a smoother to look at the distribution. There are some real advantages to this, as we will see in a few paragraphs. I’ll provide some very brief background on what the kernel smoother actually does.

In a histogram, we use bins with a given bandwidth to group together observations and get a rough estimate at the probability density function (PDF…not the Adobe kind) of our data. We can affect the shape by changing the bandwidth and number of bins if we’d like. But the discrete display isn’t always optimal. The kernel smoother allows us to take a weighted average at each observation. The points that are closer to each observation will be weighted more heavily in this average, while those far away are weighted less. This allows us to get a smooth representation of the density (much like the loess does when we have a dependent variable of interest–though loess is a variable bandwidth smoother). There are a number of ways to weight your density, but there are some standard “kernels” that are used in practice. The most popular are the Gaussian and the Epanechnikov kernels. Others include the Triweight, Biweight, Uniform, and so on. Really you can weight it however you want, but the Gaussian and Epanechnikov should do the trick for purposes here.

So, for the Gaussian kernel, picture the normal distribution. In kernel smoothing, we use the height of the standard normal to weight all the observations within the vicinity of each observation in our data set. So, for a 90 mph pitch, those that are 91 mph and 89 mph are likely weighted heavily in deciding the height of our smoothed representation. As you get further from the middle of the normal, you notice that the tails get shorter and shorter. Since the height of the curve is the weight, those observations further and further from our point of interest will be weighted less and less based on these tails. Moving this along each value in our distribution of pitch speeds (i.e. this is done multiple times for each value in the range of the data), we get a weighted, smoothed height of our density curve.

The default in R is the Gaussian kernel, but you can specify what you want by using the “kernel=” option and just typing the name of your desired kernel (i.e. “gaussian” or “epanechnikov”). Let’s apply this using the “density()” function in R and just using the defaults for the kernel.

In first part of the code above, I simply let R calculate the bandwidth using a rule of thumb (called “nrd0″ in R in the first plot…I won’t get too much into optimizing bandwidth, but use the same logic that we did with loess smoothing). Like with the loess regression, you may want to play with different bandwidths and see how this affects the smoothing and look of the distribution. You can adjust this using the option “bw=” within the “density()” function. I’ve plotted multiple lines with different bandwidths to illustrate above in the bottom plot.

You can see that the optimized bandwidth (the default) seems to smooth nicely. On the other hand, the red line (a small bandwidth) is much too wiggly, while the yellow line (very large bandwidth) tells us that Guthrie gets the ball up over 100 mph fairly often. Also, we see there seem to be some outliers (likely Pitch F/X errors). We can fix this up a bit by using some additional options in the “density()” function in R. By telling R to smooth “from=” and “to=“, we ensure that we’re only smoothing over the real range of the data and can ignore those weird few points on the left. In addition to this, we can ensure that there are fewer issues at the edges of the data (i.e. smoothing up past 100 mph–we had similar issues with loess smoothing if you remember the widening of the confidence intervals at the edges of the data range) using the option “cut=“, but I won’t cover this option today. I encourage you to fiddle with it on your own. Let’s try limiting the range of the smoothing for now:

Now this is a little bit easier to look at. Now, to see how this relates to our histogram, let’s overlay the data on the histogram. Call the histogram, and then we’ll add the above density curve using “lines()“. Note that we’ll have to set the y-axis limits to ensure that the entire smooth shows up on our plot (the range of density is larger for the smooth than the default histogram) and reset the x-limits to keep the histogram from including all the useless left-tail pitch speeds.

Notice how the smoother allowed us to see the tri-modal distribution of pitch speeds, while the histogram may not have provided us with enough bins. This seems to allude to the idea that Guthrie is throwing 3 (or 4) different pitch types that have different speeds. Let’s see what these are by estimating the density by pitch type. I’ll group fastball types together, and curveballs, change-ups, and sliders each separately. We can plot them on the same plot and see how they compare.

So, given what we see in the plot above, it looks like Gameday may have mis-classified some Sliders as Curveballs. Guthrie’s Change and Slider seem to be near the same velocity, while–of course–the fastest pitches are the fastball variants classified here. It also looks like there may be one or two Curves and Sliders given the tiny bump on the right for these pitches. Of course, this isn’t cluster analysis (though, this type of comparison is sort of what clustering does, and I’ll go over formal cluster analysis later in the series…promise!). However, we may be able to discern some things by running this same sort of comparison for movement variables included in the Pitch F/X data. I’ll leave that up to the reader to fiddle around with as a learning exercise.

Now that we’ve covered kernel density estimation in a single dimension, we can move on to covering this in two dimensions. Estimating it is quite easy in R, but displaying the data is the difficult part. We’ll need a third dimension to display data. This can be done using color or using a 3-dimensional looking plot. I prefer color, as it creates the really neat heat maps that we’ve seen around the net.