Short story: Why is it important that the predictor variables of a linear regression model are independent? If I am not interested in the coefficients but only in the question of which predictor variable is the most significant, am I allowed to use dependent predictor variables?

Long story: We would like to analyze the "quality" of a coating by using a linear regression model. Our parameters are as follows:

$Y$: the response variable, which measures the "quality" of the layer. It is a particular property of the coating that we are interested in.

$Q_1, \ldots, Q_m$: some dependent predictor variables, e.g. the density of the coated layer, its hardness, the coating rate, etc. These predictor variables can't be directly controlled during the coating process. They depend on the variables $P_1,\ldots, P_n$ and possibly on some unknown variables $X_1, \ldots, X_k$. Therefore, they can be expressed as functions $Q_j = Q_j(P_1,\ldots, P_n, X_1, \ldots, X_k)$.

Usually, I would model the response variable by the independent predictor variables. I would use the linear regression model
$$Y \sim \sum_{i=1}^n P_i$$
where I could include some interaction terms $P_i \cdot P_j$ as well. Then I would use the optimal parameter set for a coating experiment and for verification of the quality.
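For concreteness, here is a minimal sketch (with invented data and invented coefficients, three parameters standing in for $P_1, \ldots, P_n$) of fitting this model by ordinary least squares:

```python
import numpy as np

# Toy data: three independent process parameters and an assumed
# "true" linear effect. All numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 200
P = rng.normal(size=(n, 3))             # independent predictors P_1..P_3
beta_true = np.array([1.5, -2.0, 0.5])  # hypothetical true coefficients
Y = P @ beta_true + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column, solved by least squares
X = np.column_stack([np.ones(n), P])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # intercept near 0, slopes close to beta_true
```

With independent predictors the fitted coefficients recover the assumed effects cleanly, which is the baseline the later questions compare against.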

However, according to the literature the most important parameter for the "quality" $Y$ is the dependent predictor $Q_1$. Unfortunately, we are quite sure that we are missing some independent predictor variables $X_1, \ldots, X_k$, because our model $Q_1(P_1,\ldots, P_n)$ deviates from the measured value $Q_1^{(\text{measured})}$. Therefore, I would like to check whether the literature is correct and $Q_1$ is indeed the most significant predictor variable. To do so, I would like to estimate the most significant terms using the model
$$Y \sim \sum_{i=1}^n P_i + \sum_{i=1}^m Q_i$$
where I treat the dependent variables as if they were independent predictors. Furthermore, I would again like to include some interaction terms ($P_i \cdot P_j$, $P_i \cdot Q_j$, and $Q_i \cdot Q_j$).

My colleague says we should avoid this kind of analysis. His key argument is that independence of the predictors is an assumption of every linear regression model. I agree that it is an assumption, but simply restating the assumption is not a convincing argument. What would be a proper argument, and what is a proper way to proceed?

Note: We are using a stepwise linear model, where we include predictors one by one according to their significance. Therefore, ...

we first include only the most significant predictor variable. It is trivially independent, because it is the only predictor variable in the model. Let's call this predictor $Q_1$ and assume that it is a function of three other predictors: $Q_1 = Q_1(P_1, P_2, P_3)$.

if we include a second predictor variable, there could be a dependence between the two predictors. For example, suppose we include $P_2$. Now our linear regression model reads $Y = c_1 \cdot Q_1(P_1, P_2, P_3) + c_2 P_2$, and it could be that $P_2$ has the second-greatest significance only because it compensates for the contribution of $Q_1$.
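This worry can be made concrete with a toy simulation (all numbers invented): if $Q_1$ is, up to a small measurement error, an exact function of $P_1, P_2, P_3$, then a design containing both $Q_1$ and all of the $P_i$ is nearly singular, even though the smaller fit $Y \sim Q_1 + P_2$ is still well defined:

```python
import numpy as np

# Hypothetical setup: Q_1 is (almost) a function of P_1, P_2, P_3,
# and the assumed truth is Y = 2*Q_1 + noise.
rng = np.random.default_rng(1)
n = 500
P1, P2, P3 = rng.normal(size=(3, n))
Q1 = P1 + P2 + P3 + 0.05 * rng.normal(size=n)  # small measurement error
Y = 2.0 * Q1 + 0.1 * rng.normal(size=n)

# Putting Q_1 together with all of P_1, P_2, P_3 makes the design
# matrix nearly singular (huge condition number):
X_full = np.column_stack([Q1, P1, P2, P3])
print(np.linalg.cond(X_full))

# The fit Y ~ Q_1 + P_2 is still computable; here P_2 carries little
# extra information, so its coefficient mainly "corrects" the Q_1 term.
X = np.column_stack([np.ones(n), Q1, P2])
c, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(c)  # c[1] near 2, c[2] near 0, but with inflated standard error
```

The near-singularity is what makes stepwise significance rankings of such predictors fragile: a small change in the data can shuffle which of the collinear terms enters first.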

My intuition tells me that this is just like in linear algebra: the independent predictor variables represent an orthogonal basis, while a set of dependent predictor variables (e.g. $Q_1$ and $P_2$) represents a non-orthogonal basis. Orthogonality is nice, because if I change one coefficient I don't have to compensate by changing another coefficient as well. However, in principle I am allowed to use a non-orthogonal basis. Is this wrong?
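The orthogonal-basis intuition can be checked numerically (toy numbers): a non-orthogonal (correlated) design still gives valid least-squares estimates, but their variances are inflated by roughly $1/(1-r^2)$, where $r$ is the correlation between the two predictors:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 1000, 0.95
z1, z2 = rng.normal(size=(2, n))
x1 = z1
x2 = r * z1 + np.sqrt(1 - r**2) * z2  # corr(x1, x2) is approximately r

def coef_var(a, b):
    # diag((X'X)^{-1}) is proportional to Var(beta_hat) for unit noise
    X = np.column_stack([a, b])
    return np.diag(np.linalg.inv(X.T @ X))

v_orth = coef_var(z1, z2)  # (nearly) orthogonal basis
v_corr = coef_var(x1, x2)  # non-orthogonal basis
print(v_corr[0] / v_orth[0])  # roughly 1/(1 - r**2), i.e. about 10
```

So the non-orthogonal basis is not forbidden, and predictions can be fine, but the individual coefficients, and hence per-predictor significance, become unstable: changing one coefficient forces the other to compensate.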

PS: We already checked that the measurements of our parameters are fine. We do not have a measurement problem.

Comment: Highly dependent predictor variables do introduce a multicollinearity problem, which affects the stability of the regression coefficients. It doesn't necessarily lead to poor predictions. Also, there is no reason that some dependence of the predictor variables (called independent variables - a misnomer) creates any problem at all.
– Michael Chernick, Sep 30 '17 at 5:36