Inference statistics for linear regression

We have seen how to fit a model to existing data using linear regression. Now we want to assess how well the model describes those data points (every Outcome = Model + Error), using some statistics.

Fit a model

Our first goal should be to determine whether the data provide evidence of an association between price and carats. If the evidence is weak, then one might argue that bigger diamonds are not better!

To evaluate the model we will use a dedicated Python package, statsmodels. It is based on the original statistics module of SciPy (Scientific Python) by Jonathan Taylor, which was later removed, then corrected, improved, tested and released as a new package during the Google Summer of Code 2009.

Since statsmodels also offers functions to fit a linear regression model, we do not need to import sklearn to fit the model: we can do everything with statsmodels.

We will use its OLS() function, which fits a linear regression using the Ordinary Least Squares method.

The model we want to fit is:

y_hat = beta0 + beta1 * X

where y_hat is the estimated diamond Price (the dependent variable) and X is the diamond Weight (the independent variable).

An intercept is not included by default and should be added by the user:

This answers our first question: there is a relationship between the price and the weight of a diamond, and we can model it.

What kind of relation is it?

The next question could be whether the relation is linear and what it looks like.

Coefficients interpretation

Beta1 is the expected change in the response for a 1 unit change in the predictor.
In our case, we expect a 3721 Singapore dollar increase in price for every one-carat increase in diamond mass. This holds only within the restricted range considered; extrapolating the regression line to bigger diamond stones would not be advisable, as these stones are rarer and command a different price range.

Beta0 is the expected price when the weight is zero.
This does not always make sense, and in our case the negative intercept is even more puzzling: it suggests that a zero-carat diamond ring has a negative economic value!

Getting a more interpretable intercept

The intercept -259.63 is the expected price of a 0-carat diamond, which does not make much sense (unless you consider it the cost of bothering the diamond expert when you ask the price of a non-existent diamond 🙂).

It can be ignored (the model applies only to a restricted range of the data) or can be an indication that a different model could be more appropriate, for example a non-linear regression.

We can also try to find the expected price for a more suitable weight, for example the average diamond weight:

Thus about 500 (SIN) dollars is the expected price for an average-sized diamond of the sample (0.2042 carats for our dataset), which makes more sense.

As you can see, the slope is the same as in the previous model; only the intercept shifted.
This always holds when you shift the X values.

You can shift the X input by a certain value but you can also re-scale them.

This can be useful when one unit is quite large and we would prefer a finer scale.
For example, in our case one carat is worth almost 4000 SIN$, so it could make sense to talk about tenths of a carat (= 1/10 carat).

We expect a 372.102 (SIN) dollar change in price for every tenth of a carat increase in diamond mass.
Note that this is equivalent to dividing the slope coefficient by 10.

The intercept is the same as in the original model; only the slope coefficient is divided by 10. This always holds when you re-scale the X values.

Predicting the price of a diamond

Once we have a model of the relation, we can use it for predictions.
The statsmodels package provides a predict() method on each fitted model; it takes a new set of inputs and outputs the predicted values according to the model.

Let’s say that I want to buy a 0.2 carats diamond. How much should I expect it to cost?

I can take the beta parameters estimated by the model and simply put them into the linear regression formula:
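Using the coefficients reported above (beta0 = -259.6259, beta1 = 3721.0249), the computation is just the regression formula applied to the new weights:

```python
# Coefficients estimated by the model (reported earlier in the text).
beta0, beta1 = -259.6259, 3721.0249

new_carats = [0.16, 0.27, 0.34]
predicted = [beta0 + beta1 * x for x in new_carats]

print([round(p, 2) for p in predicted])  # [335.74, 745.05, 1005.52]
```

Equivalently, `simpleModel.predict()` applied to the new inputs (with the constant column added via `sm.add_constant`) returns the same values.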

Result: for 0.16, 0.27 and 0.34 carats, the predicted prices are respectively 335.74, 745.05 and 1005.52 (SIN) dollars.

How strong is the relationship?

We know that there is a relationship between diamond carats and prices; we would like to know the strength of this relationship. In other words, given a certain diamond weight, can we predict the price with a high level of accuracy? This would be a strong relationship. Or is a prediction of prices based on weight only slightly better than a random guess? This would be a weak relationship.

Residuals

As we have seen, the residuals are the difference between the observed (y) and the predicted outcome (y_hat):

85 SIN$ (in either direction) is the biggest error made by the model.

Don’t confuse errors and residuals. The error is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean); the residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).

We can learn many things from the residuals.
One is that their distribution and properties can give an indication about the model fit.

The residuals should be scattered evenly around zero, without showing any pattern, and have a constant variance.

If the residuals vs. fitted plot shows that the variance of the residuals increases as the fitted values increase (the points take the form of a horizontal cone), this is a sign of heteroscedasticity.

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.

Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable.

Examining the scatterplot of the residuals against the predicted values of the dependent variable would show the classic cone-shaped pattern of heteroscedasticity.

Other patterns could be:

curvilinear (indicates a non-linear relationship or a missing higher-order term)

a single point far away from zero (probably an outlier)

a single point far away from the others in the x-direction (probably an influential point)

Another assumption of regression analysis is that the residuals should follow a normal (or Gaussian) distribution.

In [25]: plt.hist(residuals)
Out[25]:

It looks normal but we can verify better with a Q-Q plot.

Q-Q plot to verify the residuals distribution

Q-Q plots (the name stands for “quantile-quantile plot”) can be used to check whether the data are normally distributed or not.

It is a plot where the axes are purposely transformed in order to make a normal (or Gaussian) distribution appear in a straight line. In other words, a perfectly normal distribution would exactly follow a line with slope = 1 and intercept = 0.

Therefore, if the plot does not appear to be – roughly – a straight line, then the underlying distribution is not normal. If it bends up, then there are more “high flyer” values than expected, for instance.

The theoretical quantiles are placed along the x-axis. That is, the x-axis is not your data, it is simply an expectation of where your data should have been, if it were normal.

The actual data is plotted along the y-axis.

The values are the standard deviations from the mean. So, 0 is the mean of the data, 1 is 1 standard deviation above, etc. This means, for instance, that 68.27% of all your data should be between -1 & 1, if you have a normal distribution.

statsmodels offers a handy qqplot() function:

In [26]: sm.qqplot(residuals, fit=True, line = '45')
Out[26]:

Estimating residual variation

The residual variation measures how well the regression line fits the data points.

It is the variation in the dependent variable (Price) that is not explained by the regression model and is represented by the residuals. We want the residual variation to be as small as possible.

Each residual is distributed normally with mean 0 and variance = sigma_squared.

We have previously seen that the ML Estimate of variance, sigma_squared, is sum(residuals squared) divided by n and we called it the Mean Squared Error (MSE).

Most people use (n-2) instead of n so that the estimator is unbiased (the -2 is accounting for the degrees of freedom for intercept and slope).

The square root of the estimate, sigma, is called the Root Mean Squared Error (RMSE). We want both MSE and RMSE to be as small as possible.

RMSE can be used to calculate the standardised residuals too. They equal the value of a residual divided by an estimate of its standard deviation (so, the RMSE). Large standardised residuals are an indication of an outlier.

In [31]: max(simpleModel.resid / RMSE)
Out[31]: 2.4927258932276684

Summarizing the variation: R-squared

The total variation is the residual variation (variation after removing predictors) plus the systematic variation (variation explained by regression model).

R-squared is the percentage of variability explained by the regression model:

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.

In [32]: simpleModel.rsquared
Out[32]: 0.97826077798603295

We are quite close to a perfect model.

You can use a fitted line plot to graphically illustrate different R-squared values. The more variation is explained by the model, the closer the data points fall to the line. Theoretically, if a model could explain 100% of the variation, the fitted values would always equal the observed values and all of the data points would fall on the line.

R-squared can be a misleading summary and needs to be interpreted carefully (deleting data can inflate R-squared, for example).

In conclusion (given the residuals distribution and the explained variation), the model is pretty good and the relation is very strong.

Because R-squared can be misleading, it is sometimes preferred to use the adjusted R-squared, which is R-squared adjusted for the number of predictors relative to the number of observations. There are several formulas that can be used; normally it is Wherry’s formula:

Confidence

How accurately can we predict the diamond prices?

For any given weight in carats, what is our prediction for the price, and what is the accuracy of this prediction?

In statistics, a sequence of random variables is independent and identically distributed (IID) if each random variable has the same probability distribution as the others and all are mutually independent.

Inference for regression

In the case of regression with IID sampling assumptions and normally distributed residuals, the statistics for our estimated beta coefficients:

follow a t-distribution in finite samples (and are asymptotically normal)

can be used to test a null hypothesis

can be used to create a confidence interval

In probability and statistics, the t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population.

The t-distribution becomes closer to the normal (Gaussian) distribution as its degrees of freedom (df) increases.

The t-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors.

If the population standard deviation of these errors is unknown and has to be estimated from the data, the t-distribution is often used to account for the extra uncertainty that results from this estimation.

Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling distribution of a particular statistic (e.g. the standard score) are required.

Confidence levels are expressed as a percentage (for example, a 95% confidence level). It means that if you took samples over and over again and computed an interval each time, 95 percent of those intervals would contain the population value.

When the population standard deviation sigma is not known, an interval estimate for the population mean with confidence level (1-alpha) is given by:

Xmean +- t * (estimated standard error of the mean)

where t is a critical value determined from the t-distribution in such a way that there is an area of (1-alpha) between -t and t.

Confidence intervals

A 95% confidence interval is defined as a range of values such that, with 95% probability over repeated sampling, the range will contain the true unknown value of the parameter.

Recall of quantiles and percentiles of a distribution

If you scored at the 95th percentile on an exam, it means that 95% of people scored worse than you and 5% scored better.

These are sample quantiles.

Now for a population: the i-th quantile of a distribution with distribution function F is simply the point x_i such that:

F(x_i) = i

A percentile is simply a quantile with i expressed as a percent. The 95th percentile of a distribution is the point below which a random variable drawn from the population falls with probability 0.95.

Approximately 68%, 95% and 99% of the normal density lies within 1, 2 and 3 standard deviations from the mean, respectively.
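This rule of thumb is easy to verify numerically with the normal CDF from scipy.stats:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
# of a standard normal distribution.
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```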

P-values

P-values are the most common measure of “statistical significance”.

The P-value is the probability, under the null hypothesis, of obtaining evidence as extreme or more extreme than what was actually observed. If the p-value is small, then either H0 is true and we have observed a rare extreme event, or H0 is false.

Let’s say a P-value is 0.1: then the probability of seeing evidence as extreme or more extreme than what was actually obtained, under H0, is 0.1 (10%).

By reporting a P-value, any observer can perform their own hypothesis test at whatever alpha level they choose. If the P-value is less than alpha, the null hypothesis is rejected.

Estimating the P-values for the hypothesis that the beta coefficients are not zero

We can use the t-distribution from scipy.stats to calculate the p-values:

from scipy.stats import t

The survival function of a random variable is the probability that the random variable is bigger than a given value x.
The sf() method of SciPy distributions returns this probability:

Summary of statistic values

The statsmodels package offers an overview of the model values, similar to what we calculated above:

In [64]: simpleModel.summary()
Out[64]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.978
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     2070.
Date:                Sat, 25 Aug 2014   Prob (F-statistic):           6.75e-40
Time:                        13:50:30   Log-Likelihood:                -233.20
No. Observations:                  48   AIC:                             470.4
Df Residuals:                      46   BIC:                             474.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -259.6259     17.319    -14.991      0.000    -294.487  -224.765
carats      3721.0249     81.786     45.497      0.000    3556.398  3885.651
==============================================================================
Omnibus:                        0.739   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.691   Jarque-Bera (JB):                0.181
Skew:                           0.056   Prob(JB):                        0.913
Kurtosis:                       3.280   Cond. No.                         18.5
==============================================================================

Many values can also be accessed directly, for example the standard errors:

In [65]: simpleModel.bse
Out[65]:
const     17.318856
carats    81.785880

You can see all the values available using dir():

dir(simpleModel)

As an exercise, you can use the more complete diamonds data available on the same JSE website which has additional features and see if adding variables such as the diamond colour help to get a more precise prediction.
