Advanced Linear Regression: A Case Study

It is possible to build multiple regression models for just one set of response and predictor variables. When building models manually, it can be a herculean task to arrive at even one valid, statistically significant regression model, especially when you are new to the data or the problem. It is all the more frustrating to later discover multicollinearity, a model that does not perform equally well when cross-validated on random samples, poor prediction accuracy on test data, or worse.
What if we had the flexibility to build all the statistically valid candidate models, see their prediction accuracy, cross-validate them on random samples, compare all the important diagnostic parameters in one place, and finally pick the one that best suits our case? The details that follow attempt to do exactly that.

Since R offers the muscle for both statistical analysis and algorithmic approaches, we combine the best of both worlds to find, build, analyse and choose the best linear regression models. The following framework attempts to strike a balance between a robust algorithmic approach and an investigative one.

At the end of this, you will have all the best models in a list, along with a host of diagnostic parameters, cross-validation results and accuracy measures, all in one final regression output.

Problem description

The ‘Ozone’ data from the mlbench package contains the Los Angeles ozone pollution data collected in 1976. It is available as a data frame with 366 rows and 13 variables that may help explain the ozone reading in the region. It is up to us to find out which of the available variables are best suited to predict the ozone reading with maximum accuracy. The objective of this analysis is to accurately predict the ‘daily maximum one-hour average ozone reading’ using linear regression models.
We will make an 80:20 split of the data into development and validation samples, and use only the 80% sample for building the linear models (training).

The models thus built will be used to predict the ozone reading on the 20% validation sample. With the actual ozone readings from the 20% validation sample and the model predictions thus computed, we will calculate the prediction accuracy and use it as one of the important parameters in deciding the optimal model for this problem.
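The original code for this split is not reproduced here; a minimal sketch, assuming the Ozone data from mlbench is used directly (the seed and object names are illustrative):

```r
# Sketch: load the data and make an 80:20 development/validation split
library(mlbench)
data(Ozone)          # Los Angeles ozone data, 366 rows x 13 variables
inputData <- Ozone
set.seed(100)        # illustrative seed, for reproducibility
trainRows <- sample(1:nrow(inputData), 0.8 * nrow(inputData)) # 80% row indices
trainingData <- inputData[trainRows, ]  # development sample (used for training)
testData <- inputData[-trainRows, ]     # validation sample (used for prediction)
```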

Creating derived variables

Before moving on to exploratory analysis, it is a good idea to derive as many new variables from your predictors as you think would be relevant for your dependent variable. Why? Because we want to capture all the nuances that might be an explanatory cause for the dependent variable. Why create them so early? Because you can perform the analyses that follow on these variables as well.

Ideas for creating derived variables

Combination

The new variable is a combination (+/×) of a few of the original raw variables.

Aggregation

Aggregate the variables to some level. E.g.: sales in the last 12 months, average salary of a certain employment class, etc.

Transformation

Transform the variable into log, cube, etc. More discussion on Box-Cox transformation follows.
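As an illustration of the combination and transformation ideas above, here is a hedged sketch; the derived variable names (Temp_x_Humidity, log_pressure) are purely illustrative and not part of the original analysis:

```r
# Hypothetical derived variables (names and choices are illustrative)
inputData$Temp_x_Humidity <- inputData$Temperature_Sandburg * inputData$Humidity # combination
inputData$log_pressure <- log(inputData$pressure_height) # log transformation
```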

Here is a quick challenge: can you come up with a new derived variable for the ‘Ozone’ dataset? If you succeed, attach that variable to your input data and proceed with the analysis. It will be interesting to see if, at the end of this analysis, you can build a better-predicting model than what we have achieved here.
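The plotting code referred to below is not reproduced in this text; a base-graphics sketch that produces the three charts per variable (assuming the continuous predictors sit in inputData_cont and the response in a vector named response) could be:

```r
# Sketch: density, scatter and box plots for each continuous variable,
# saved as one PNG per variable in the working directory
for (colName in names(inputData_cont)) {
  png(paste0(colName, "_charts.png"), width = 900, height = 300)
  par(mfrow = c(1, 3)) # three charts in one frame
  plot(density(inputData_cont[, colName], na.rm = TRUE), main = paste("Density:", colName))
  plot(inputData_cont[, colName], response, main = paste("Scatter:", colName),
       xlab = colName, ylab = "response")
  boxplot(inputData_cont[, colName], main = paste("Box plot:", colName))
  dev.off()
}
```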

The above code will create the density plot, scatter plot and box plot, all within one frame, for every variable in the input data, and store the charts in your working directory. It is preferable for the density plot to have a bell-shaped curve aligned to the centre of the chart area. The scatter plot reveals whether there is an underlying linear relationship between the predictor and the response, while the box plot reveals any outliers in the variable (in which case dots will be present outside the top and bottom whiskers of the IQR box).

Missing value treatment

Approaches for handling missing values

Eliminating the variable: If a particular variable has more than 30% missing values, it is advisable to consider removing that variable altogether.

Eliminating the observations: If missing values are scattered throughout your dataset and you have a good number of observations in your sample, you may choose to eliminate the observations with missing values. If, after doing this, you find you are losing more than 30% of the sample size, you probably need to look for any particular variable that contributes the most missing values.

Imputation: Alternatively, you may impute the missing values with one of the placeholders below.

Imputation approaches

Mean: Replace with mean

Median: Replace with median

Midrange: Replace with mid-range (max + min)/2

Mode: Most frequent value

Interpolation based: Interpolate based on a function for the variable.

Based on distribution: Identify the distribution of the variable, and estimate using the distribution parameters.

Predict the missing values: Treat the variable with missing values as the dependent variable in a regression model. See method 2 below.

K Nearest Neighbours: Find the observations whose characteristics are closest to those of the observation with the missing value, via the k-Nearest Neighbours algorithm. See method 3 below.
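As a sketch of the simplest of these placeholders, mean imputation for the continuous variables (assuming they sit in inputData_cont) might look like:

```r
# Sketch: mean imputation for every continuous column
for (colName in names(inputData_cont)) {
  missingRows <- is.na(inputData_cont[, colName])
  inputData_cont[missingRows, colName] <- mean(inputData_cont[, colName], na.rm = TRUE)
}
```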

Method 2: Predict the missing values by modelling the variable as the dependent variable in a regression-based approach
Replace all missing values with the mean, except in the variable being treated. Build a model with the variable that needs missing value treatment as the response, while the remaining variables are used as predictors. This method is discussed in Julian J. Faraway's book, p. 158.
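A sketch of this regression-based approach, treating Humidity as the variable with missing values (both the variable choice and the data frame name imputeData are illustrative):

```r
# Sketch: regression-based imputation for one variable
complete <- !is.na(imputeData$Humidity)
lmImpute <- lm(Humidity ~ ., data = imputeData[complete, ])  # model built on complete rows
imputeData$Humidity[!complete] <- predict(lmImpute, newdata = imputeData[!complete, ]) # fill gaps
```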

Correlation analysis and Summary

The correlogram is a graphical representation of the correlation amongst all combinations of response and predictor variables. The investigator may choose to pick variables that have a high correlation with the response based on a pre-determined cutoff. But we do not do that filtering here, since a variable with relatively lower correlation can still prove significant and valuable in regression models by capturing unexplained variation, even if it is of ‘lower’ importance.

library(PerformanceAnalytics) # provides chart.Correlation
library(corrplot) # provides corrplot
par(mfrow=c(1,1))
inputData_response_cont_noNA <- na.omit(cbind(inputData_response, inputData_cont)) # remove rows with missing values
chart.Correlation(inputData_response_cont_noNA, histogram=TRUE, pch=19) # plot correlogram
corrplot(cor(inputData_response_cont_noNA), method="circle", is.corr=TRUE) # plot correlation matrix

# Get and store the summary statistics
sapply(inputData, summary) # get summary statistics for all columns

Box Cox Transformation for Continuous variables

Upon performing outlier treatment, you can expect the skewness of the variable to improve, but not always. Sometimes a Box-Cox transformation of the response variable is preferable, to reduce the skewness and bring it closer to a normal distribution. While the variable gets transformed, the rank order of its values remains the same after the transformation.

Why does a highly skewed non-normal response variable matter?

Because there is a larger variance in the long-tailed region of the response. This also means more predictors are expected to have a stronger effect in explaining the larger-variance part of the response variable, thereby creating a bias. This can affect all the estimates, resulting in spurious models. It can also be argued that high skewness is a consequence of heteroscedasticity (non-constant variance).

Therefore, if the skewness does not improve after treating the outliers, consider performing a Box-Cox transformation, especially when k-fold cross-validation reveals inconsistent results. Note that it is not mandatory to apply the Box-Cox transformation to every skewed variable; it is up to the best judgement of the investigator to decide whether it is indeed needed.

Nevertheless, just so you know how a Box-Cox transform is done, the procedure is below. As a default, a Box-Cox transformation of the response variable is sufficient, but here I have chosen to apply it to variables that have a skewness > 1 even after outlier corrections are applied.
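A sketch of the procedure using the car package (the choice of pressure_height as the skewed variable is illustrative; Box-Cox requires strictly positive values):

```r
# Sketch: Box-Cox transformation of one skewed variable
library(car)                        # provides powerTransform and bcPower
x <- na.omit(inputData_cont$pressure_height)
pt <- powerTransform(x)             # estimate the optimal lambda (requires x > 0)
x_boxcox <- bcPower(x, pt$lambda)   # apply the transformation; rank order is preserved
```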

Checking for significance: Continuous Variables

For each continuous predictor variable, we build a simple regression model. If the p-value of the model is less than a predetermined significance level, we choose not to reject the variable, and therefore store it in ‘significantPreds_cont’. This step is done purely to qualify meaningful variables that may later help us build better models.
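The code for this step is not shown in the text; a sketch consistent with the categorical version below (the 0.1 significance level is an assumption borrowed from that code) could be:

```r
# Sketch: qualify continuous predictors via simple regression p-values
significantPreds_cont <- character()
for (predName in names(inputData_cont)) {
  mod <- lm(response ~ inputData_cont[, predName])  # simple linear model
  p_value <- summary(mod)$coefficients[2, 4]        # p-value of the slope term
  if (!is.na(p_value) && p_value < 0.1) {           # predetermined significance level
    significantPreds_cont <- c(significantPreds_cont, predName)
  }
}
```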

Checking for significance: Categorical Variables

Since the dependent variable here can be considered an ordinal count variable, we can use the chi-squared test to determine whether a categorical predictor is significant in explaining it. Had our dependent variable been continuous, you would use the same method we just saw for continuous predictors (i.e. fitting a simple regression model with the categorical variable of interest as the predictor and checking the p-value).
So, after we perform the chi-squared test (see code below), if the p-value turns out to be less than the predetermined significance level, we qualify the categorical variable and store it in ‘significantPreds_cat’ for further steps.

# Selecting significant variables (categorical) - checking with the chi-squared test
significantPreds_cat <- character() # initialise output; significant predictors will accrue here
for (predName in names(inputData_cat)) {
  pred <- inputData_cat[, predName]
  chi_sq <- suppressWarnings(chisq.test(table(response, pred))) # perform chi-squared test
  p_value <- chi_sq$p.value # capture the p-value
  if (p_value < 0.1 && p_value > 0) { # check for significance
    significantPreds_cat <- c(significantPreds_cat, predName)
  }
}
inputData_cat_signif <- as.data.frame(inputData[, names(inputData) %in% significantPreds_cat])
colnames(inputData_cat_signif) <- significantPreds_cat # assign column names

Part 2: Estimate important variables based on mean drop in accuracy – using cforest

cforest is one of many techniques we can use to find the most important variables. You might want to try other techniques, especially when your data size becomes larger. For smaller data, cforest is convenient, especially because you can pass categorical variables to it as well.

# Selecting best variables (variable selection using cforest)
library(party) # provides cforest, cforest_unbiased and varimp
cf1 <- cforest(response ~ ., data=na.omit(inputData_signif[, names(inputData_signif) %in% c(boruta_signif, "response")]), control=cforest_unbiased(mtry=2, ntree=50)) # fit the random forest
impVars <- sort(varimp(cf1), decreasing=TRUE) # get variable importance
# Pick the top 7. Adjust this as per your needs; more variables take more computation time.
impVars <- if (length(impVars) > 7) impVars[1:7] else impVars

Now, the following variables have been filtered as the variables best suited to explain the ozone reading. By adjusting various settings such as the VIF level and the p-value significance level, or by including/dropping certain steps, you may end up with a different set of variables, depending on the business need. Despite the variable selection procedures, if business or real-world logic suggests that a variable (one that did not get picked up by the algorithm) is influential, you should probably try to include it while building the models in the next step. So here are the current best ones:

Temperature_Sandburg

pressure_height

Inversion_base_height

Humidity

Month (tentative)

Create all combinations of selected variables that will go into models as predictors
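The original construction of these combinations is not shown; an equivalent sketch using combn (variable names taken from the shortlist above, object names illustrative) might be:

```r
# Sketch: enumerate every non-empty combination of the selected predictors
selectedVars <- c("Temperature_Sandburg", "pressure_height",
                  "Inversion_base_height", "Humidity", "Month")
combos <- unlist(lapply(seq_along(selectedVars),
                        function(n) combn(selectedVars, n, simplify = FALSE)),
                 recursive = FALSE) # each element is one predictor combination
length(combos) # 2^5 - 1 = 31 combinations
```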

Building all the linear models and diagnostics

Now that we have all the possible combinations of predictors, we can expect to generate several good models from this mix. In the code that follows, you will first build the model and capture its summary in a variable (currSumm). Then the diagnostic parameters such as R-squared, p-value, AIC, etc. are captured.
After that, the influential observations and accuracy measures are calculated on the training and test data sets. Finally, we perform the k-fold cross-validation and capture the mean squared error for the model. We calculate all these metrics for every predictor combination in combos_matrix; they accrue in the Final_Output data frame, which is exported to the Final_Regression_Output.csv file. Apart from this, the model summary, VIF and predictions from each of the models are also printed to the console.
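A compressed sketch of that loop, capturing only a few of the diagnostics named above (combos stands in for the original combos_matrix; the accuracy, influence and cross-validation steps are omitted for brevity):

```r
# Sketch: fit one lm per predictor combination and accrue diagnostics
Final_Output <- data.frame()
for (combo in combos) {
  form <- as.formula(paste("response ~", paste(combo, collapse = " + ")))
  currMod <- lm(form, data = trainingData)  # build the model
  currSumm <- summary(currMod)              # capture the model summary
  Final_Output <- rbind(Final_Output, data.frame(
    formula = deparse(form),
    r.squared = currSumm$r.squared,
    adj.r.squared = currSumm$adj.r.squared,
    AIC = AIC(currMod), BIC = BIC(currMod)))
}
write.csv(Final_Output, "Final_Regression_Output.csv", row.names = FALSE)
```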

Final_Regression_Output is the final output file containing diagnostic parameters for all the models. In the table below, you will find a sample row of Final_Regression_Output for one of the models that we just built using the above logic.

Sample Row in Final_Regression_Output.csv

formula: response ~ Temperature_Sandburg + Inversion_base_height + Humidity
r.squared: 65.19%
adj.r.squared: 64.83%
AIC: 1728.90
BIC: 1747.23
Model.pValue: 5.27E-65
F.Statistic: 177.95
training.mape: 49.91%
predicted.mape: 45.71%
training.minmax.accuracy: 68.36%
predicted.minmax.accuracy: 73.14%
influential.obs (row nums): 8 20 78 90 104 105 127 166 182 195 221 240 243
k_fold_ms (cross validation mean squared error): 21.37%

Select the best model and make final prediction

Once we have the models, diagnostics and performance accuracies, it is up to the investigator to pick the model that best suits the specific business problem. A general convention to go by when picking a good regression model is as follows:

All things being equal, if you have two models with approximately equal predictive accuracy, always favour the simpler one.

Based on the results in Final_Output, we can infer that the model in row 13 has fared well on almost every parameter, especially with respect to AIC, predicted min-max accuracy and mean squared error in k-fold cross-validation. So we choose it as the optimal model and print the outputs.

How to interpret the cross validation chart?

Since we are performing a 5-fold cross-validation (m=5 in the CVlm function), you will notice 5 colours and symbols in the chart, one for each fold. You will also see 5 dashed lines in the chart, which are the best-fit lines for the respective folds. What happens during cross-validation is that your data is randomly split into m (=5) folds. Holding one of the 5 parts out as test data, a regression model is built on the remaining 4 parts as training data. The model is then used to predict on the held-out fold. This is done for each of the 5 parts, and then the actuals (bigger symbols in the chart) and predictions (smaller symbols close to the lines) are plotted on the same chart along with the lines of best fit.
A really good model will have all these lines parallel to each other and the bigger symbols close to the lines, indicating a high prediction accuracy. If either of these conditions is not satisfied, you can imagine what the implications of using that model would be.

Cross validation chart
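A sketch of how such a chart can be produced with CVlm from the DAAG package (the formula here is the row-13 model from the sample output above):

```r
# Sketch: 5-fold cross-validation, plotted one colour/symbol per fold
library(DAAG)
cvResults <- suppressWarnings(CVlm(data = trainingData,
    form.lm = response ~ Temperature_Sandburg + Inversion_base_height + Humidity,
    m = 5, plotit = TRUE, printit = FALSE))
attr(cvResults, "ms") # mean squared error across the folds
```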

Now let's pick the best model (row 13) from Final_Regression_Output.csv and generate the predictions and stats.
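A sketch of that final step, refitting the chosen model and scoring the 20% validation sample (the column name response and the data frame names are assumptions carried over from the earlier sketches):

```r
# Sketch: refit the chosen model and predict on the validation sample
bestMod <- lm(response ~ Temperature_Sandburg + Inversion_base_height + Humidity,
              data = trainingData)
predicted <- predict(bestMod, newdata = testData)  # predictions on the 20% sample
actuals_preds <- data.frame(actuals = testData$response, predicted = predicted)
# min-max accuracy: how close the predictions are to the actuals, row by row
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max), na.rm = TRUE)
# mean absolute percentage error
mape <- mean(abs(actuals_preds$predicted - actuals_preds$actuals) / actuals_preds$actuals, na.rm = TRUE)
```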