Developing The baseballr Package For R

When this package is finished, it will hopefully be a mighty tool for baseball researchers and analysts.

Introduction

Late in 2015 I wrote a piece here at The Hardball Times that walked through some of my favorite R packages for gathering and analyzing baseball data. As with most tools, no single package has everything I need, nor should it. Following that article, I started collecting various functions that I’ve written and routinely use, and decided to compile them into a formal package that anyone can easily load and use.

I’ve never written an R package before, so this is partly an excuse for me to learn a new skill. That means the development of the package will be slow and will have its fair share of bumps along the way. I thought I would share some initial views of the kind of functions I plan to include.

Data Acquisition

I featured a number of packages in my previous article that focused on grabbing data, whether it was full-season data for individual players and teams, or pitch-by-pitch data. However, right now there isn’t a package that makes it easy to pull real-time data on players during the season. The Lahman package is great, but that database is only updated once a year after the season. As a writer at The Hardball Times I have direct access to our database, but not everyone does. FanGraphs makes it very easy to download leaderboards in CSV format that include dozens of statistics for players updated daily, but there isn’t an easy way to grab that data from within R.

Sometimes I like to pull team data such as their schedule and record (which is very helpful for my “team consistency” work). Baseball-Reference is the easiest site to acquire this from, so I created a function that allows you to specify the team and year and get back detailed information about the outcome of each of their games.

Using the team_results_bref() function, here’s what the first 10 games of Houston’s 2015 schedule and results would look like:

Finally, it’s fairly easy to get player performance data for many standard splits, such as by month or by pitcher handedness. But we may want to grab information over a very specific time frame; say, batter performance from August 10, 2015 through the end of the 2015 season. Without access to a game-by-game database this would be impossible, or just incredibly time-consuming if you wanted to compile it by hand.
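To make the idea of a custom split concrete, here is a minimal sketch of what it involves if you already have a game-by-game log in R. Everything in it is an assumption for illustration: the `game_log` data frame, its invented values, and the helper name `split_by_dates()` are hypothetical and are not part of the package.

```r
# Toy game log: one row per player per game. All values are invented
# purely for illustration and are not real 2015 results.
game_log <- data.frame(
  player = c("A", "A", "A", "B", "B", "B"),
  date   = as.Date(c("2015-08-05", "2015-08-12", "2015-09-20",
                     "2015-08-09", "2015-08-10", "2015-10-04")),
  AB     = c(4, 5, 3, 4, 4, 5),
  H      = c(1, 3, 1, 2, 0, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep games inside [start, end], total the
# counting stats per player, then derive a rate stat from the totals.
split_by_dates <- function(log, start, end) {
  in_range <- log$date >= as.Date(start) & log$date <= as.Date(end)
  totals <- aggregate(cbind(AB, H) ~ player, data = log[in_range, ], FUN = sum)
  totals$AVG <- round(totals$H / totals$AB, 3)
  totals
}

late_2015 <- split_by_dates(game_log, "2015-08-10", "2015-10-04")
print(late_2015)
```

The point of summing the counting stats first and computing the rate afterward is that averaging per-game rates would weight a one-at-bat game the same as a five-at-bat game.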

The daily_batter_bref() function makes this very simple. All you need to pass to the function is the first and last date you are interested in. The function will then pull batter performance only over this time frame from Baseball-Reference (the first six records are shown below):

Metric Calculation

FanGraphs and Baseball-Reference do the hard work of calculating some of the most commonly used advanced metrics for visitors. However, there are times when you might want to calculate some of these metrics yourself.

Let’s take our last example, where you have data over a very specific time frame. FanGraphs doesn’t produce wOBA or wRC+ for custom time frames, but there is nothing stopping you from calculating statistics like these as long as you have the basic data.
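As a rough illustration of how little is needed, wOBA is just a weighted sum of batting events divided by a plate-appearance-style denominator (AB + BB - IBB + SF + HBP). The sketch below assumes linear weights close to the 2015 values FanGraphs publishes; treat them as placeholders and check the current figures before relying on them. The helper name `woba_sketch()`, the column names, and the stat lines are hypothetical, invented for this example, and this is not the package's own code.

```r
# Linear weights, approximately FanGraphs' published 2015 values.
# Treat these as placeholders and confirm them before real use.
weights_2015 <- c(uBB = 0.687, HBP = 0.718, X1B = 0.881,
                  X2B = 1.256, X3B = 1.594, HR = 2.065)

# Hypothetical helper: adds a wOBA column to a data frame that holds
# the standard counting stats under the column names used below.
woba_sketch <- function(df, w = weights_2015) {
  ubb <- df$BB - df$IBB  # unintentional walks
  numerator <- w[["uBB"]] * ubb + w[["HBP"]] * df$HBP +
    w[["X1B"]] * df$X1B + w[["X2B"]] * df$X2B +
    w[["X3B"]] * df$X3B + w[["HR"]] * df$HR
  denominator <- df$AB + df$BB - df$IBB + df$SF + df$HBP
  df$wOBA <- round(numerator / denominator, 3)
  df
}

# Invented stat lines, for illustration only.
batters <- data.frame(
  player = c("A", "B"),
  AB = c(100, 120), BB = c(12, 8), IBB = c(2, 0), HBP = c(1, 2),
  SF = c(1, 0), X1B = c(20, 25), X2B = c(6, 5), X3B = c(1, 0), HR = c(5, 2),
  stringsAsFactors = FALSE
)

result <- woba_sketch(batters)
# Sort from highest to lowest wOBA, as a leaderboard would.
print(result[order(-result$wOBA), c("player", "wOBA")])
```

Passing the weights in as an argument keeps the sketch usable for other seasons, since the weights change every year.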

The woba_plus() function will (eventually) calculate wOBA, wRC, and wRC+ for any player over any time frame, so long as you feed it the requisite data. For now, the function will only calculate wOBA (hey, I’m working on it).

As an example, let’s say you want to know the wOBA for players from August 10, 2015, through the end of the regular season. It’s a snap as long as you have the data in the right format. We can just feed the woba_plus() function the data we just scraped. Here I am just showing the top-15 players by their wOBA:

I am also planning to include functions that will calculate some of the custom metrics that I have developed and co-developed over the years. Take team consistency, for example. If someone wants to know how consistent each team was in terms of its run scoring and run prevention in 2015, they can easily calculate that with the team_consistency() function.
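For a rough sense of what a consistency measure can look like, here is a minimal sketch that assumes a Gini coefficient on per-game run totals as the yardstick; the packaged function may differ in its details. The name `gini_sketch()` and both sets of run totals are hypothetical, invented for illustration.

```r
# Gini coefficient of a non-negative vector with a positive total:
# 0 means every game had the same run total; larger values mean the
# runs were spread more unevenly across games.
gini_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Invented per-game run totals. Both teams average 4 runs per game,
# but they get there in very different ways.
steady_team  <- c(4, 3, 5, 4, 4, 5, 3, 4)
streaky_team <- c(0, 9, 1, 10, 0, 8, 2, 2)

consistency <- data.frame(
  team = c("steady", "streaky"),
  runs_per_game = c(mean(steady_team), mean(streaky_team)),
  gini = c(gini_sketch(steady_team), gini_sketch(streaky_team))
)
print(consistency)
```

Two teams with identical runs per game can end up with very different coefficients, which is exactly the information a simple average hides.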

You can play with the individual functions, or install the development version of the package using devtools. See here for instructions.

Next Steps

All of the development can be tracked on GitHub, including the development version of the package. My plan is to flesh out additional data acquisition functions, largely through existing application programming interfaces (APIs) or scraping of websites. Additional metrics will be added, specifically the ability to calculate things like wOBA on contact, wOBA per pitch based on PITCHf/x data, Edge% from PITCHf/x data, and individual player consistency/volatility. I am also toying with some visualization functions, but more on those later.

Feel free to send suggestions or requests along, especially any feedback on the draft versions of the functions (which will be housed here). I can’t promise I will be able to incorporate all of them (or even most of them), but I will certainly do what I can.

About Bill Petti

Bill leads Predictive Modeling and Data Science consulting at Gallup. In his free time, he writes for The Hardball Times, speaks about baseball research and analytics, has consulted for a Major League Baseball team, and has appeared on MLB Network's Clubhouse Confidential as well as several MLB-produced documentaries. He is also the creator of the baseballr package for the R programming language. Along with Jeff Zimmerman, he won the 2013 SABR Analytics Research Award for Contemporary Analysis. Follow him on Twitter @BillPetti.

The timing is impeccable. I focused this weekend on trying to put my programming chops into actually getting into computational baseball analysis – i.e., learning R, finally, looking at ways of acquiring data from various sources, and just seeing what could be done with a little creativity, the data and the tools to question the data.

I am pretty comfortable with R but otherwise completely computer illiterate, so forgive me if this is a simple question, but I have been trying to scrape the ZiPS/Fans/Steamer projections into R from FanGraphs with no success. Of course getting the data into R once is trivial, but it would be great to be able to scrape the ROS projections daily. Friends have pointed me to tutorials/blog posts on APIs and scraping but they are 1) all in Python and 2) seem to require knowledge of JAVA or the specific website in question. I think this might be an interesting add. Regardless, heading to devtools to download now…

I LOVE your tools and really appreciate you. The first time I recall seeing your name was at the Tableau Website seeking a little baseball data viz…That was good stuff, and ALL THESE NEW TOOLS are like Christmas Day for me…
Thank you !!!
I am a fan,
Andrea