Regression Modelling

"Linear Regression is used predict or estimate the value of a response variable by modeling it against one or more explanatory variables. The variables must be pairwise, continuous and are assumed to have a linear relationship between them. This technique is widely popular in predictive analysis."

Assumptions of a Linear Regression

The residuals, calculated as the difference between actuals and predicted values measured along Y-axis, should follow a normal distribution (bell shaped curve).

No heteroscedasticity exists. This means the error terms has the same variance across all observations.

No multicollinearity exists amongst the predictor variables. Multicollinearity can happen when you have two or more highly correlated explanatory variables, typically when they explain each other. This happens when there is redundant information in data.

The relation between response and the predictor variables is linear and additive.

No autocorrelation exists in error terms, in case of a time series regression. This is evident from a lack of trend when the errors are plotted over time.

How To Fit Linear Regression Models ?

Use the lm() function to fit linear models as shown below. No additional packages need to be loaded in R to fit the models but some additional packages may be needed to get diagnostic parameters such as VIF {from car package}. The variable to be predicted is often called as ‘dependent variable’ or ‘response variable’ while the variables that are used to explain or predict the response variable are called either one of the following – ‘independent variable’, ‘predictor variable’, ‘predictors’, ‘explanatory variables’ or ‘covariates’.

In the example code below, the variable to be predicted (dependent variable) is named as “responseVar” and the explanatory variable as “pred1”, “pred2”, etc. You can add as many explanatory variables as you wish. Lets see the code to make models now..

The model object here is ‘lmMod’. In the above model creation code (line 3), the corresponding english version of the formula can be read as follows: responseVar is a linear function of the two predictor variables pred1 and pred2, all of which are present as variables (columns) in inputData.

Modelling Diagnostics

Step 1: The p-Value

The p-Values should be less than the pre-determined significance level (generally 0.05). If the p-value is less than 0.05, you can safely reject the null hypothesis that the value of co-efficient for that variable is zero (or in other words, reject the null hypothesis that the predictor variable has no explanatory relationship with the response variable).

Checking the p-Value of the variables (displayed at the right of each variable, marked by ***) and the model (displayed at the extreme bottom right of summary(lmMod)) should ideally be the first step in your diagnostics. The p-Value is a prime parameter used for validating a model when it is used for estimation purposes (finding value of response for test data).

But, it can be argued that a model with p-Value greater than the significance level may still be useful for forecasting purposes. So, in forecasting models, the predictor variables need not necessarily be discarded purely based on p-Value.

Step 2: Check The Variance Inflation Factor (VIF)

Variance inflation factor is a measure of the extent to which a predictor is explained by other predictors.

The VIF for a predictor variable can be calculated by doing a linear regression of that predictor against all the other predictors to be included in the model, and then obtaining the R2 from that regression. After obtaining the R2 like this, then VIF of that predictor is just..

VIF = 1/(1-R2)

It is desirable to keep the VIF as low as possible, ideally lesser than 2. A high VIF (4 and above), indicates that this predictor variable can be explained by the other existing predictors, so this variable may not be necessary. This condition is referred to as the existence of multi-collinearity.

library(car)
vif(lmMod) # get the variance inflation factor

Step 3: Check for Autocorrelation of residuals

This is a necessary step in case you are modelling time dependent data (i.e. your dependent variable is a time series). The durbin-watson test can be used to test if the error terms are auto-correlated. What this mean is, the error term of a observation in current time period is influenced by the error terms in some of the preceding time periods. This means that the error terms have a pattern to it and are not completely random, which also means, there is some part of the dependent variable that hasn’t been explained by the model thus built (which shows up as auto-correlation in error terms). This is an indication that you should tweak your model, so as to explain the unexplained. The ARIMA model is capable of addressing this.

durbinWatsonTest(lmMod) # get Durbin Watson statistic

A value close to 2 indicates the residuals are not autocorrelated, which is good.

Step 4: Check AIC

The lower the AIC (Akaike Information criterion) value the better the model. It helps to choose models trading off between complexity of model against the goodness of fit.AIC(lmMod)

Step 5: R-squared & adj-R squared

R-squared can be interpreted as the percentage of variance of the response variable explained by the predictors. Therefore, the higher the R-Sq, the better the model. This also means, adding more predictors to the model is always going to increase the R-squared value. Adjusted R-squared penalizes this effect. So it is a good practice to use adj-R Sq to compare the explanatory power of models built from the same dataset.

Also, as a thumb rule, when given a choice of models, choose the simpler one with lower AIC and higher adj R Sq.

A Three-Fold Cross Validation Plot. For a good fit, the lines should be parallel to each other with the corresponding small and large symbols lying close to each other, indicating lower residuals.

The fitted lines are plotted in different colors. For each line, there are points of same color and symbol, but of two different sizes. The small ones plotted on the line are the predicted values of the deleted observations for that fold. The corresponding larger points represent the actual values.

Therefore, the following conditions when satisfied, will indicate a stable model:

The large (actual) and corresponding small (predicted) values are close to each other along the fitted line. Indicates high prediction accuracy.

The fitted lines of different colors are parallel and on-top of each other. Indicating a stable model direction and less influence of outliers.