How to test a regression model for heteroscedasticity and if present, how to correct it?

Homoscedasticity (constant variance of the residuals) is one of the assumptions of linear regression: the variance of the residuals in the fitted model should not increase as the fitted values increase. We are about to learn how to test for the presence of heteroscedasticity and, if found, what measures can be taken to correct it.

Let's begin by building a simple linear regression model that tries to explain 'dist' from the built-in 'cars' dataset. The predictor in this case is 'speed'. The linear model object that is built is then tested for heteroscedasticity using the 'bptest()' function from the 'lmtest' package. Along with this, we also use 'gvlma()', which tests whether the assumptions of linear regression hold for an 'lm' object.
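The original code listing is not reproduced here; a minimal reconstruction of the steps the text describes (fit the model, then run bptest() and gvlma()) might look like:

```r
# Requires the 'lmtest' and 'gvlma' packages to be installed.
library(lmtest)
library(gvlma)

# Fit a simple linear regression: dist explained by speed (built-in cars data)
lmMod <- lm(dist ~ speed, data = cars)

bptest(lmMod)   # Breusch-Pagan test for heteroscedasticity
gvlma(lmMod)    # global validation of linear model assumptions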

A p-value < 0.05 indicates that the null hypothesis (that the variance of the residuals is constant) can be rejected, and therefore heteroscedasticity exists. This can be confirmed by running a global validation of linear model assumptions (gvlma) on the lm object.

Graphical interpretation

As suspected, the heteroscedasticity condition is NOT satisfied. Now let's plot and see how it looks as well.

par(mfrow=c(2,2))  # 4 charts in 1 panel
plot(lmMod)

The graph in the top-left of the panel reveals a bit of a 'pattern' in the variance: it is not completely random, as suggested by the curvy smoothing line. So the errors are definitely not so 'innocent'. They appear to know something (if only a little) about the dependent variable, so much so that the residuals themselves could be used as a predictor to model the 'dist' variable. In other words, the errors (read: residuals) are not completely random and could potentially contain some information (read: variance) that could in turn be used to predict the dependent variable. This is in fact one of the ways we could overcome the problem of heteroscedasticity.

How to correct for presence of heteroscedasticity?

By re-building the model with the created error term (lmMod$residuals) as one of the predictors

By transforming the dependent variable

Using the Box-Cox transformation

For this example, we will proceed with the second method to make amends to the model and predict 'dist'. The transformation we will use by default in this case is the Box-Cox transformation.
Once the optimal ‘lambda’ value is found, we will use it to transform the dependent variable and build a model using the newly transformed variable as dependent.
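The original code listing is not shown here; a minimal sketch of estimating lambda with caret's BoxCoxTrans() might look like:

```r
library(caret)

# Estimate the optimal Box-Cox lambda for the dependent variable
dist_bc <- BoxCoxTrans(cars$dist)
print(dist_bc)  # the printed summary reports the estimated lambda
```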

As seen in the last line, the estimated value of lambda was 0.5. So how is this lambda helpful? How is the transformation actually applied? Simple: each data point in the dependent variable is raised to the power of lambda, then 1 is subtracted from that number, and finally the result is divided by lambda itself.

transformed var = (y^lambda - 1) / lambda
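This formula can be checked by hand in R, assuming lambda = 0.5 as reported above:

```r
lambda <- 0.5

# Box-Cox transform applied manually to the dependent variable
dist_new <- (cars$dist ^ lambda - 1) / lambda
head(dist_new)
```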
But we don't have to do this manually; it can be done using the 'BoxCoxTrans' function from the caret package, as you just saw. Now let's append this new transformed variable to the cars data and re-build the model.
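A sketch of this step, assuming 'dist_bc' is the BoxCoxTrans object created earlier and naming the new model 'lmMod_bc' as in the text:

```r
library(caret)
library(lmtest)

dist_bc <- BoxCoxTrans(cars$dist)             # Box-Cox pre-processing object
cars$dist_new <- predict(dist_bc, cars$dist)  # append the transformed variable

# Re-build the model with the transformed variable as the dependent
lmMod_bc <- lm(dist_new ~ speed, data = cars)
summary(lmMod_bc)
bptest(lmMod_bc)  # re-run the Breusch-Pagan test on the new model
```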

Both from the above output as well as from the bptest(), we can verify that the heteroscedasticity problem is effectively resolved. Let's verify graphically as well.

plot(lmMod_bc)

In the top-left plot, the residuals now appear random, with a much straighter smoothing line, which implies that the variance of the residuals no longer changes as the fitted value increases: the heteroscedasticity has been removed. Case resolved!

NOTE: An important point is that the current model is built on a transformed dependent variable. Since this new variable is on a different scale compared to the original 'dist', we need to back-transform predictions by doing the opposite of the Box-Cox transformation, i.e. by multiplying by lambda, adding 1, and then raising the result to the power of 1/lambda. I will leave this as a fun exercise to try out on your own 🙂

"There is relatively weak evidence against the assumption of constant variance."
Could you please explain what this means?

Jai

I think it is a typo: the p-value should be less than 0.05 to reject the null hypothesis.

Rahul Bansal

If the p-value is greater than 0.05, we fail to reject the null hypothesis; we can say we do not have enough evidence to reject the null hypothesis.
Remember, the p-value can be read as the probability of making an error while rejecting the null hypothesis. So if the p-value is high, the probability of making an error by rejecting the null hypothesis is high.

Here we have defined alpha as 0.05, which means we reject the null hypothesis only when the p-value falls below 0.05.
Depending on the severity of the problem, we can choose a lower alpha too.
