Residuals vs Fitted Plot

Any point on the fitted line has a residual of zero. Points above the line have positive residuals, and points below it have negative residuals.
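To make the sign convention concrete, here is a minimal sketch that fits a straight line by ordinary least squares and reports each residual. The data values and the helper names `fit_line` and `residuals` are made up for illustration; they are not the data set or code behind the plots discussed in this post.

```python
# Hypothetical toy data, not the data set analysed in this post.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def residuals(xs, ys):
    """Residual = observed y minus fitted y, for every observation."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

res = residuals(xs, ys)
for x, r in zip(xs, res):
    side = "above" if r > 0 else "below" if r < 0 else "on"
    print(f"x={x}: residual={r:+.3f} ({side} the fitted line)")
```

With an intercept in the model, the residuals always sum to zero, so some points must sit above the line and some below it.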

The red line is a smoothed curve through the residuals (in R's default diagnostic plots this is a LOWESS smoother, not a polynomial fit). It gives us an idea of how the residuals move as the fitted values change.

In our case we can see that the residuals follow a curved pattern. This suggests that we may get a better model if we include a quadratic term. We will explore this point further by actually fitting such a model to see if it helps.

Normal Q-Q Plot

The Normal Q-Q plot is used to check whether our residuals follow a Normal distribution.

The residuals are approximately normally distributed if the points follow the dotted line closely.

In this case the residual points follow the dotted line closely, except for observation #22.
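The idea behind the Q-Q plot can be sketched in a few lines: sort the standardised residuals and pair each one with the quantile a standard Normal distribution would give at the same position. The residual values below are hypothetical, and the plotting positions (i - 0.5) / n are just one common convention; plotting software may use a slightly different formula.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a Normal Q-Q plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    # Standardise and sort the residuals: these are the sample quantiles.
    sample = sorted((r - m) / s for r in residuals)
    # Plotting positions (i - 0.5) / n are an assumed convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Hypothetical residuals; the last value is deliberately extreme.
res = [-1.2, -0.7, -0.4, -0.1, 0.0, 0.2, 0.5, 0.8, 1.1, 4.5]
for t, s in qq_points(res):
    print(f"theoretical={t:+.2f}  sample={s:+.2f}  gap={s - t:+.2f}")
```

Points with a small gap lie close to the reference line. The extreme value shows up as the point that strays furthest from it, much like a single labelled observation standing apart in the plot.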

Scale-Location Plot

One of the assumptions of regression is homoscedasticity, i.e. the variance of the residuals should be reasonably equal across the predictor range.

A horizontal red line is ideal and would indicate that residuals have uniform variance across the range.

As the residuals spread further apart, the red line goes up.

In our case the data is homoscedastic, i.e. has uniform variance, up to approximately 100000; beyond that it becomes heteroscedastic.
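The quantity usually drawn on the vertical axis of a Scale-Location plot is the square root of the absolute standardised residual, plotted against the fitted value. The sketch below computes it for a simple one-predictor fit. The data are hypothetical, built so that the scatter grows with x, and the function name `scale_location` is an assumption for illustration.

```python
import math

def scale_location(xs, ys):
    """Return (fitted, sqrt(|standardised residual|)) pairs for a simple linear fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    # Residual standard error with n - 2 degrees of freedom (intercept and slope).
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # Leverage of each observation in simple linear regression.
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
    return [(f, math.sqrt(abs(r))) for f, r in zip(fitted, std_resid)]

# Hypothetical data whose noise grows with x (heteroscedastic by construction).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.3, 11.5, 14.8, 15.0, 19.5, 18.0]

points = scale_location(xs, ys)
low = [v for _, v in points[:5]]
high = [v for _, v in points[5:]]
print("mean spread, lower fitted values: ", round(sum(low) / len(low), 3))
print("mean spread, higher fitted values:", round(sum(high) / len(high), 3))
```

If the second average is clearly larger than the first, the smoothed line in the plot would slope upward, which is the signature of heteroscedasticity described above.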

Residuals vs Leverage Plot

Before attacking the plot we must know what influence and leverage are. Let's understand them first.

Influence: The influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook’s distance is a pretty good measure of the influence of an observation.

Leverage: The leverage of an observation is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The higher the leverage of an observation, the greater the potential that point has to influence the model.
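Both definitions can be written down directly for a model with a single predictor. In the sketch below, leverage is 1/n plus the squared distance of x from its mean divided by the total squared spread of x, and Cook's distance combines the residual with the leverage. The data and the function name `leverage_and_cooks` are hypothetical; the last point is placed far from the other x values on purpose.

```python
def leverage_and_cooks(xs, ys):
    """Leverage and Cook's distance for each point of a simple linear fit."""
    n, p = len(xs), 2  # p = number of fitted coefficients (intercept, slope)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Leverage grows with the squared distance of x from the mean of x.
    lev = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    # Cook's distance: large when a point has a big residual, high leverage, or both.
    cooks = [e ** 2 / (p * mse) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]
    return lev, cooks

# Hypothetical data; the last x value sits far from the rest.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 12.0]

lev, cooks = leverage_and_cooks(xs, ys)
for i, (h, d) in enumerate(zip(lev, cooks), start=1):
    print(f"obs {i}: leverage={h:.3f}  Cook's distance={d:.3f}")
```

In this toy example the last observation pulls the line toward itself, so its raw residual is modest even though its leverage and Cook's distance are by far the largest. That is why leverage and influence are examined separately from residual size.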

Now that we are clear on what leverage is, let's analyze our leverage plot and draw inferences.

In this plot the dotted red lines are contours of Cook’s distance, and the areas of interest for us are the ones outside the dotted lines in the top right and bottom right corners. If any point falls in those regions, we say that the observation has high leverage and high influence on the model, meaning the fit would change noticeably if we excluded that point.

It's not always the case, though, that an outlier has high leverage, or vice versa.

In this case observation #22 has high leverage, and we have three choices:

Choice 1: Justify the inclusion of #22 and keep the model as is.

Choice 2: Include a quadratic term, as indicated by the Residuals vs Fitted plot, and remodel.

Choice 3: Exclude observation #22 and remodel.

We will try both Choice 2 and Choice 3 and see what kind of diagnostic plots we get.
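As a rough sketch of what Choice 2 and Choice 3 involve, the code below fits a straight line, a quadratic, and a straight line with one point removed, using plain least squares via the normal equations. The data are hypothetical, with a curved trend and an unusual last point standing in for #22; the helpers `solve`, `polyfit` and `sse` are illustrative names, not part of any library.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns [c0, c1, ...]."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)

def sse(xs, ys, coefs):
    """Sum of squared residuals for a fitted polynomial."""
    return sum((y - sum(c * x ** i for i, c in enumerate(coefs))) ** 2
               for x, y in zip(xs, ys))

# Hypothetical curved data with one unusual last point (standing in for #22).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 4.1, 8.8, 16.3, 24.9, 36.2, 48.8, 90.0]

linear = polyfit(xs, ys, 1)
quadratic = polyfit(xs, ys, 2)               # Choice 2: add a quadratic term
linear_drop = polyfit(xs[:-1], ys[:-1], 1)   # Choice 3: drop the unusual point

print("linear SSE (all points):   ", round(sse(xs, ys, linear), 2))
print("quadratic SSE (all points):", round(sse(xs, ys, quadratic), 2))
print("slope with all points:     ", round(linear[1], 3))
print("slope without last point:  ", round(linear_drop[1], 3))
```

A smaller sum of squared residuals alone does not settle the choice, since a quadratic can never fit worse than a line on the same data. The diagnostic plots for each refitted model still need to be checked, which is what the rest of this section does.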

In this case our diagnostic plots are much better. The residual trend line is almost horizontal and the residuals are well spread. The spread is almost uniform and no point has excess leverage. The Q-Q plot, however, shows that a few points do not lie along the Normal line, but that may be acceptable.

We will now check another model, without the quadratic term and excluding observation #22.