How to use Classification Trees to Predict the Future

Classification and Regression Tree (CART) modeling sounds very boring, but it is actually one of the most interesting and intuitive modeling methods available. I have covered two types of modeling so far: Multiple Linear Regression and Neural Networks. CART modeling is by far the simplest and most comprehensible of the three. It lets us build a visible tree of decisions that ultimately gives us an answer. For football purposes, the CART method can sort players into tiers based on the information we plug into it.

Let us do a simple example relating to the NFL Draft to get an idea of how CART can be used. The NFL Draft has seven rounds, and with CART modeling we can try to predict the round (tier) in which a player will be drafted. In this example, we will use running backs (RBs) for consistency.

For this example I used the following variables:

*Rank: Mike Mayock’s pre-draft ranking for the RB. I tested a few mainstream analysts, and Mayock was the best for RBs

*Physical Grade: My personal grade on how athletic an RB was at the time of the draft

*Height: Height at the time of the Draft

*40-Yard Dash: The RB’s official 40-yard dash time, pre-draft

*CY: College Career Yards

*TDs: College Career Touchdowns

*YPC: College Yards per Carry

*TDpS: College Touchdowns per Season

The data I used consist of 170 running backs drafted since 2008 and their college statistics. We input the variables above to try to predict the round in which each back was drafted. Below is the very first step of a CART model. The graph below plots each running back against the round he was drafted in (what we care about). Notice how the mean in the bottom left is 3.97.

This is because 3.97 is the average round drafted across all of our data. The points are placed randomly along the X-axis, so left-to-right position has no meaning.

Our first real “statistics” step is to split the data. We look across all of our predictor variables for the single cutoff that divides the running backs into two groups whose mean rounds differ the most. This will make intuitive sense once you see the next graph. The graph below shows us that the single biggest indicator of the round in which an RB will be drafted is Mike Mayock’s pre-draft running back rank. This makes sense, since Mayock’s ranking is a subjective compilation of many different attributes a running back may or may not possess. Notice how the biggest factor is whether or not they are in the top 10. If they are in Mayock’s top 10, the average round they are drafted in is 2.69. If they are outside the top 10, their average drafted round is 5.23, which is a big difference!
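To make the splitting step concrete, here is a minimal Python sketch of how one variable can be searched for its best cutoff. It assumes the usual regression-tree criterion (pick the cutoff that leaves the smallest total squared error around the two group means), which this article does not spell out, so treat it as an illustration rather than the exact method behind the graphs. The function names and the eight rows of data are made up for the example and are not the 170 running backs.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest total SSE.

    Rows with x <= threshold go left, the rest go right.
    Returns None if every x is identical (nothing to split on).
    """
    best = None
    distinct = sorted(set(xs))
    # Candidate cutoffs are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return None if best is None else best[1:]

# Made-up toy data (hypothetical analyst rank and round drafted).
ranks = [1, 2, 3, 4, 5, 6, 7, 8]
rounds = [1, 1, 2, 2, 5, 6, 6, 7]

threshold, left_mean, right_mean = best_split(ranks, rounds)
print(threshold, left_mean, right_mean)
```

On this toy data the search lands on a cutoff between ranks 4 and 5, with a much lower average round on the left than on the right, which is the same kind of picture as the top-10 split described above.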

The next step is where things get a smidge complicated, but it’s simple if you understood the previous step. We repeat the same search inside each of the two groups to find the next most important factor. In this case, it again happens to be Mayock’s ranking. For those he ranked in the top 10, the biggest separator is whether they are ranked inside or outside the top 4.

Next, we have our first non-Mayock split in our tree. For those ranked outside the top 4, our biggest split is whether their physical grade was above or below 6.34. Ideally, we want to split the tree just enough times to get good future predictions without overfitting, that is, without splitting so finely that the tree just memorizes the quirks of the players we already have.
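The repeated splitting described above can be sketched as a small recursive function. This is a rough illustration under my own assumptions: it uses the standard squared-error criterion, and it limits overfitting with two simple stopping rules (a maximum depth and a minimum number of players per group). The names `grow`, `find_split`, `predict`, `max_depth`, and `min_rows` are hypothetical, and the eight rows are invented so the tree is easy to follow.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def find_split(rows, target, min_rows):
    """Try every feature and cutoff; return (cost, feature, threshold) or None."""
    best = None
    for feature in (k for k in rows[0] if k != target):
        distinct = sorted({r[feature] for r in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[feature] <= t]
            right = [r[target] for r in rows if r[feature] > t]
            # Skip cutoffs that would leave a group with too few players.
            if len(left) < min_rows or len(right) < min_rows:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, feature, t)
    return best

def grow(rows, target, max_depth=2, min_rows=2, depth=0):
    """Build a tree as nested dicts; every node stores its size and mean."""
    ys = [r[target] for r in rows]
    node = {"n": len(rows), "mean": mean(ys)}
    if depth >= max_depth:
        return node  # stop early to avoid overfitting
    split = find_split(rows, target, min_rows)
    if split is None or split[0] >= sse(ys):
        return node  # no cutoff improves on the unsplit group
    _, feature, t = split
    left = [r for r in rows if r[feature] <= t]
    right = [r for r in rows if r[feature] > t]
    node.update(
        feature=feature, threshold=t,
        left=grow(left, target, max_depth, min_rows, depth + 1),
        right=grow(right, target, max_depth, min_rows, depth + 1),
    )
    return node

def predict(node, row):
    """Walk down the tree and return the mean of the leaf the row lands in."""
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["mean"]

# Made-up toy data: hypothetical rank, physical grade, and round drafted.
rows = [
    {"rank": 1, "grade": 6.0, "round": 1}, {"rank": 2, "grade": 8.0, "round": 1},
    {"rank": 3, "grade": 5.8, "round": 2}, {"rank": 4, "grade": 7.5, "round": 2},
    {"rank": 5, "grade": 7.0, "round": 5}, {"rank": 6, "grade": 5.0, "round": 7},
    {"rank": 7, "grade": 7.2, "round": 5}, {"rank": 8, "grade": 5.5, "round": 7},
]

tree = grow(rows, "round", max_depth=2)
shallow = grow(rows, "round", max_depth=1)
print(tree["feature"], tree["threshold"])
print(predict(tree, {"rank": 6, "grade": 5.0}))
```

On this toy data the first split is on rank, the better-ranked group splits on rank again, and the lower-ranked group splits on grade. Raising `max_depth` or lowering `min_rows` lets the tree keep splitting, which fits the existing players more closely but is exactly where overfitting creeps in.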

Hopefully, now you have a basic understanding of what a CART model is!

Now for the negatives: CART models are not the best predictors on their own. A CART model groups things into tiers, and everyone in a tier gets the same prediction, which is not much help when you want to predict something like yards from scrimmage. An average CART model would essentially group the top 10 RBs and say that each of them will have more than 1,000 yards from scrimmage. I like to use CART models in conjunction with other modeling methods to find “tiers” and let other metrics separate the running backs within each tier. I plan on doing a more complicated CART article in the future, but I wanted to lay down a foundation of understanding first. If you would like even more information on CART, I go even more in-depth on Twitter (@DFF_Koala)! Thank you for reading this, and hopefully, it was helpful.