Data Analysis IV: Multi-Variate Regression

SE350 Team

Introducing Multi-variate Regression

Multi-variate regression works in exactly the same way as simple regression, except that it uses more than one explanatory variable. This also means that the model no longer lives in two-dimensional space and is harder to picture mentally, although all the same ideas apply.

The basic equation for multi-variate regression is

\[ y=\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_m x_m+e \]
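The equation above can be fit with R's lm function. A minimal sketch using the built-in mtcars data (not the course data): here \( y \) is fuel efficiency (mpg) and \( x_1, x_2 \) are weight (wt) and horsepower (hp), so \( m = 2 \).

```r
# Sketch: multi-variate regression with two explanatory variables,
# using R's built-in mtcars data as a stand-in for the course data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)     # the estimated beta_0, beta_1, beta_2
summary(fit)  # coefficient table, p-values, and R-squared values
```

Adding further variables to the formula (e.g. mpg ~ wt + hp + qsec) extends the model to more \( \beta \) terms in the same way.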

Choosing Variables

Ideally, pick variables that you can justify on practical or theoretical grounds

You could also choose the variables that give the largest value of adjusted \( R^2 \), or

You could choose those with the most significant p-values

Let the computer choose the best variables for you

Remember, when looking at the goodness-of-fit for multi-variate regression models, you must use adjusted \( R^2 \)!
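The reason: plain \( R^2 \) never decreases when you add a variable, even a useless one, while adjusted \( R^2 \) penalizes extra variables. A short sketch (again with mtcars, not the course data) showing how to pull both values out of a fitted model:

```r
# Sketch: comparing a small and a large model on adjusted R-squared,
# using mtcars as a stand-in for the course data.
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

summary(small)$r.squared       # multiple R-squared (always rises with more variables)
summary(big)$r.squared
summary(small)$adj.r.squared   # adjusted R-squared (penalizes complexity)
summary(big)$adj.r.squared     # compare models on THIS number
```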

Note that R is often the tool of choice for analyzing baseball data.

Multi-variate Regression with Baseball Data

We will import Baseball TEAM data:

team <- read.csv("BaseballTeam2014.csv", as.is = TRUE)

Note: this is a new data set that was not provided in the original data bundle.

Now let the computer do it for you

The function we will use to do this is stepAIC. It uses the Akaike information criterion (AIC) to measure the relative quality of a statistical model: AIC balances the goodness of fit of a model against its complexity. The stepAIC function is found in the MASS package in R. MASS ships with R, so you don't have to install it. It is not loaded automatically when R starts, however, so you need to call library(MASS) before using stepAIC.
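A sketch of the workflow, using mtcars as a stand-in for the team data (the direction and trace arguments shown are standard stepAIC options):

```r
# Sketch: automated variable selection with stepAIC, on mtcars
# as a stand-in for the baseball team data.
library(MASS)  # ships with R, but must be loaded explicitly

full <- lm(mpg ~ ., data = mtcars)  # start from all available predictors
best <- stepAIC(full, direction = "backward", trace = FALSE)

summary(best)  # the model stepAIC settled on, chosen by AIC
```

With trace = TRUE (the default) stepAIC prints each step of the search, which is useful for seeing which variables it drops and why.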