Multiple Linear Regression 1 of 2

In our previous post we looked at single-variable (univariate) linear regression using R. Single-variable regression is a good starting point, but it is rarely sufficient in practice, since there are usually multiple variables that together help predict the value of a dependent variable (DV). In this post we will look at multiple linear regression and also use the resulting model on a test dataset. We go through the same steps as we did for single-variable linear regression.

There are a few things that we need to do before we start building our model. The steps are listed here, and the details can be found at this site:

Loading the Data

Structure of the data (str and summary functions)
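The setup steps above can be sketched as follows. The file name "wine.csv" and the dataset name wine are assumptions for illustration; substitute your own file.

```r
# A minimal sketch of the setup steps, assuming a CSV file named "wine.csv"
wine = read.csv("wine.csv")

str(wine)       # structure: number of observations, variable names and types
summary(wine)   # min, max, quartiles, and mean for each variable
```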

After getting to know about our dataset we will start creating the model. The syntax is:

Model1 = lm(DV ~ IV1 + IV2 + … + IVk, data = dataset_name)

lm is the R function to create linear regression models

DV is the dependent variable

The ~ (tilde) sign in between the DV and the IV tells R to create a linear model between the DV and IV

IV1 through IVk are the k independent variables (IVs)

data = dataset_name helps R identify the dataset that it will be working on to create the model

The syntax above would result in creating a linear regression of the form:

DV = β0 + β1·IV1 + β2·IV2 + … + βk·IVk + ε

The syntax mentioned above creates the model, but nothing is displayed in the console window. To see the results of the linear model, we need to look at the summary of the model using:

summary(Model1)
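Putting the two steps together, here is a hedged sketch using the wine-price example this post works with. The predictor names other than Age and FrancePop (AGST, HarvestRain, WinterRain) are assumptions for illustration, as is the dataset name wine.

```r
# Sketch: regress wine price on several candidate predictors.
# Column names other than Age and FrancePop are illustrative assumptions.
Model1 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
            data = wine)

summary(Model1)   # coefficients, significance codes, and R-squared values
```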

There are various things that we can see in the summary of the model, as shown in the image below. For details about the individual elements, please visit this site.

For multiple linear regression, one important factor is multicollinearity. Multicollinearity is a situation where we have independent variables that are highly correlated with each other. So, what is correlation?

Correlation measures the linear relationship between 2 variables and is a number between -1 and +1. We use the following function to calculate the correlation between 2 variables:

cor(variable1, variable2)

The function cor(dataset_name) calculates the pair-wise linear correlations between all the variables in a dataset. A snapshot of it is given below:

The number in the box gives us the correlation between Age and France Population which is 0.994485097.
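Both uses of cor can be sketched as follows; the dataset name wine is an assumption carried over from the example above.

```r
# Correlation between two specific variables: a single number in [-1, +1]
cor(wine$Age, wine$FrancePop)

# Pair-wise correlation matrix for every pair of variables in the dataset.
# Note: this form requires all columns of the data frame to be numeric.
cor(wine)
```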

Now that we know about correlation, our focus will be on the coefficients and the R-squared values of the model. A snapshot is added here:

In the above image we see that Age and FrancePop do not have a star (*) or a period (.) at the end of their respective rows. This means that Age and FrancePop do not significantly affect the model that we have created. Let us remove them one at a time rather than all at once: there could be a high correlation between 2 variables, so removing one of them may make the other significant. We can see that the Multiple R-squared and Adjusted R-squared values are 0.8294 and 0.7845, which are pretty good. But since we have 2 insignificant IVs in the model, let's remove FrancePop from our model and see how the values for R-squared and Adjusted R-squared change.
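The refit without FrancePop can be sketched like this; as before, the predictor names other than Age are assumptions for illustration.

```r
# Refit the model after dropping the insignificant variable FrancePop.
# Predictor names other than Age are illustrative assumptions.
Model2 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

summary(Model2)   # compare Adjusted R-squared against the previous model
```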

In the above image we have removed FrancePop from our model and created a new model with the remaining variables. We see that our Adjusted R-squared has actually increased to 0.7943 from 0.7845. We can also see that Age has now become a very significant IV, with 2 stars (**) at the end of its row. The reason Age has become significant is the removal of FrancePop (remember that we had seen that Age and FrancePop were highly correlated). Since most of the IVs in this model are significant, we will stick with it. So our final equation for predicting the price of wine, given the various IVs, would be:
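As a preview of applying the model to a test dataset (covered in the next post), here is a hedged sketch. The file name "wine_test.csv", the model name finalModel, and the predictor names other than Age are assumptions for illustration.

```r
# Sketch: fit the final model, then apply it to unseen data.
# File and column names are illustrative assumptions.
finalModel = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

wineTest = read.csv("wine_test.csv")            # test data, same columns
predict(finalModel, newdata = wineTest)          # predicted prices
```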