Robust Regression

"Robust regression can be used in any situation in which OLS regression can be applied. It yields better accuracies over OLS and is particularly resourceful when there are no compelling reasons to exclude outliers in your data."

Robust regression can be implemented using the rlm() function in the MASS package. Outliers are weighted down according to the chosen weighting function: ‘huber’ (the default), ‘bisquare’ or ‘hampel’.
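
For example, the weighting function is selected through the psi argument of rlm(). A quick sketch on the Boston data (the rlm_* object names are just illustrative):

library(MASS)   # provides rlm() and the psi.* weighting functions
rlm_huber    <- rlm(medv ~ ., data = Boston)                      # Huber weights (the default)
rlm_bisquare <- rlm(medv ~ ., data = Boston, psi = psi.bisquare)  # Tukey's bisquare weights
rlm_hampel   <- rlm(medv ~ ., data = Boston, psi = psi.hampel)    # Hampel weights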

Example Problem

Let's see how robust regression fares against an equivalent lm() model with respect to prediction accuracy. For this analysis, we will use the built-in Boston dataset from the MASS package. Let's predict the median value of owner-occupied homes (medv) by training the model on 80% of the data, picked randomly. After building a basic model, we will test how well it predicts on the remaining 20% test data. We will build both the regular OLS model (lm()) and the corresponding robust regression model (rlm()).

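Below is a minimal sketch of the data split and model fitting that the accuracy calculation assumes. The seed, the boston_all/train_index/robMod names and the use of all columns as predictors are assumptions; the other names match the code that follows.

library(MASS)                               # Boston data, rlm(), stdres()
set.seed(100)                               # reproducible split
boston_all  <- Boston
train_index <- sample(1:nrow(boston_all), floor(0.8 * nrow(boston_all)))
inputData   <- boston_all[train_index, ]    # 80% training data
test_data   <- boston_all[-train_index, ]   # 20% test data
num_rows    <- nrow(inputData)              # used later for the Cook's distance cutoff

lmMod  <- lm(medv ~ ., data = inputData)    # ordinary least squares
robMod <- rlm(medv ~ ., data = inputData)   # robust regression (Huber weights by default)

lm_Predicted  <- predict(lmMod, test_data)  # predictions on the 20% test data
rob_Predicted <- predict(robMod, test_data)
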
# Calculate Accuracy (MinMax accuracy: for each test row, min(predicted, actual) / max(predicted, actual), averaged)
lm_actuals_pred  <- cbind(lm_Predicted, test_data$medv)
rob_actuals_pred <- cbind(rob_Predicted, test_data$medv)
mean(apply(lm_actuals_pred, 1, min) / apply(lm_actuals_pred, 1, max))    # 85.48%
mean(apply(rob_actuals_pred, 1, min) / apply(rob_actuals_pred, 1, max))  # 85.98%
As you can see, the accuracy from the rlm() model is only marginally better, which could be because Boston is a fairly clean, 'standard' dataset. But real-world data often contains many more outliers, in which case the improvement in prediction accuracy from rlm() can be more pronounced.

Optional diagnostics: Get the outlier observations

Cook's distance is computed for all the observations used to fit the model, to reveal the outliers. Here, observations with a Cook's distance greater than 4/num_rows are considered outliers.

cDist  <- cooks.distance(lmMod)   # Cook's distance for each observation
resids <- stdres(lmMod)           # standardized residuals
allObs <- cbind(inputData, cDist, resids)
allObs[cDist > 4/num_rows, ]      # show the outlier observations
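
Relatedly, the final weights stored in the robust fit show how strongly each training observation was weighted down. A sketch, assuming the robMod object from the model-building step above:

wts <- robMod$w       # final weights from the iterated re-weighted least squares fit
head(sort(wts), 10)   # the 10 most heavily down-weighted training observations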