Linear Models Don’t have to Fit Exactly for P-Values To Be Accurate, Right, and Useful

There is no need to get confused with multiple linear regression, generalized linear model or general linear methods. The general linear model or multivariate regression model is a statistical linear model and is written as Y = XB + U.

Usually, a linear model includes a number of different statistical models such as ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and F-test. The GLM is a generalization of multiple linear regression models to the case of more than one dependent variable. So if Y, B, and U represent column vectors, the matrix equation above will portray a multiple linear regression.

Which are the key assumptions made in a multiple linear regression analysis?

Independent variables and outcome variables should have a linear relationship among them, and to find out whether there is a linear or curvilinear relationship; scatterplots can be leveraged.

Multivariate Normality: Residuals are normally distributed, as is assumed in multiple regressions.

No Multicollinearity: Independent variables are not correlated among, as is assumed in multiple regressions. To test these assumptions, Variance Inflation Factor – VIF is used.

Homoscedasticity: Error terms are similar across the values for independent variables in the assumptions made. predicted values Vs standardized residuals are used to showcase if points are successfully and equally distributed across independent variables.

Best data analytic solutions are derived to automatically include assumption tests and plots while conducting regression. From nominal, ordinal, or interval/ratio level variables; multiple regression requires at least two independent variables. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable in the analysis.

Assumptions in your regression or ANOVA model

We know, you know; how important are they because if they’re not met adequately, all the p-values will become inaccurate, wrong, & useless. But linear models don’t have to fit precisely for p-values to be accurate, right and useful; they are robust enough to departures from these assumptions. Statistics classes and several other coaching places have been imparting such knowledge or contradictory statements as you may say; to drive analysts crazy.

It is a debatable topic as to whether statisticians cooked this stuff up to torture researchers, pun intended; or do they do it to satisfy their egos?

Well, the reality is they don’t. Learning how to push robust assumptions is not that hard a task, of course when done with professional training and guidance, backed with some practice. Enlisted are few of the mistakes researchers usually make because of one, or both of the above claims.

1. P-value is the feel-good factor

Avoiding over-testing of assumptions is one of the ways out. Statistical tests can help in determining if assumptions made are met adequately or not. Having a p-value is the feel-good factor, isn’t it? It helps you avoid further complications, and one can do it by leveraging the golden rule of p<.05.

In no case, the tests should ignore the robustness. Assuming that every distribution is non-normal and heteroskedastic; would be a mistake. Tools may prove helpful, but are developed to treat every data set as if it is a nail. The right thing to do is use the hammer when really required, and not hammer everything.

2. GLM is robust but not for all the assumptions

Here, assumptions are made that everything is robust and tests are avoided. It is a normal practice which succeeds most of the times. But there are instances, when it does not work. Yes, the robust GLM, for deviations from some of the assumptions, but are not robust all the way and not for all the assumptions. So check all of them without fail.

3. Test wrong assumptions

Testing wrong assumptions also is one of the mistakes that researchers do. They look at any two regression books and it will give them different sets of assumptions.

Testing the related, but wrong thing

Insights here might seem partial as several of these “assumptions” should be checked, but they are not model assumptions; instead are data challenges. At times the assumptions are taken to their logical conclusions, which adds up to it looking to be partial. The reference guide is trying to make it more logical for you, but in that attempt, it leads you to test the related but the wrong thing. It might work out most of the times, but not always.

You need to be a member of AnalyticBridge to add comments!

"Independent variables and outcome variables should have a linear relationship among them." Wrong. "Linear" in (multiple) linear regression refers to the relationship between the outcome (dependent) variable and the unknown (to be estimated) *parameters* in the model. The model must be *linear* in the unknown parameters. There are many models that are *nonlinear* in the independent variables that can be transformed such that they are linear in the unknown parameters.