Estimating causal relationships from data is one of the fundamental endeavors of researchers. Ideally, we would conduct a controlled experiment to estimate causal relations; however, this is often infeasible. For example, education researchers cannot randomize educational attainment and must instead learn from observational data.

In the absence of experimental data, we use observational data to construct models that capture the relevant features of the causal relationship we are interested in. Models are successful if the features we did not include can be ignored without affecting our ability to ascertain that causal relationship. Sometimes, however, ignoring some features of reality yields models whose estimated relationships cannot be interpreted causally. In a regression framework, depending on our discipline or research question, this phenomenon goes by different names: endogeneity, omitted confounders, omitted variable bias, simultaneity bias, selection bias, etc.

Below I show how we can understand many of these problems in a unified regression framework and use simulated data to illustrate how they affect estimation and inference.

The general framework is
\begin{eqnarray*}
y &=& g\left(X\right) + \varepsilon \\
E\left(\varepsilon|X\right) &=& 0
\end{eqnarray*}
In the expression above, \(y\) is the outcome vector of interest, \(X\) is a matrix of covariates, \(\varepsilon\) is a vector of unobservables, and \(g\left(X\right)\) is a vector-valued function. The statement \(E\left(\varepsilon|X\right) = 0\) implies that once we account for all the information in the covariates, what we did not include in our model, \(\varepsilon\), does not give us any information, on average. It also implies that, on average, we can infer the causal relationship of our outcome of interest and our covariates. In other words, it implies that \(E\left(y|X\right) = g\left(X\right)\).

The expression \(E\left(\varepsilon|X\right) \neq 0\) implies that controlling for the covariates \(X\) does not suffice to obtain a causal relationship, because the unobservables are not negligible even after we incorporate the information in the covariates into our model.

Below I present three examples that fall into this framework. In the examples below, \(g\left(X\right)\) is linear, but the framework extends beyond linearity.

Example 1 (omitted variable bias and confounders). The true model is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + X_2\beta_2 + \varepsilon \\
E\left(\varepsilon| X_1, X_2\right)&=& 0
\end{eqnarray*}
However, the researcher does not include the covariate matrix \(X_2\) in the model and believes that the relationship between the covariates and the outcome is given by
\begin{eqnarray*}
y &=& X_1\beta_1 + \eta \\
E\left(\eta|X_1\right)&=& 0
\end{eqnarray*}

If \(E\left(\eta|X_1\right)= 0\), the researcher will get correct inference about \(\beta_1\) from linear regression. However, \(E\left(\eta|X_1\right)= 0\) holds only if \(X_2\) is irrelevant once we incorporate the information in \(X_1\), that is, if \(E\left(X_2|X_1\right)\beta_2=0\). To see this, we write
\begin{eqnarray*}
\eta &=& X_2\beta_2 + \varepsilon \\
E\left(\eta|X_1\right) &=& E\left(X_2|X_1\right)\beta_2 + E\left(\varepsilon|X_1\right) \\
&=& E\left(X_2|X_1\right)\beta_2
\end{eqnarray*}
where the last equality follows from the law of iterated expectations and \(E\left(\varepsilon|X_1, X_2\right)=0\).

If \(E\left(\eta|X_1\right) \neq 0\), we have omitted variable bias, which in this case comes from the relationship between the included and omitted variables, captured by \(E\left(X_2|X_1\right)\). Depending on your discipline, you would also refer to \(X_2\) as an omitted confounder.

In line 4, I set a parameter that correlates the two regressors in the model. In lines 6–8, I generate correlated regressors. In line 12, I generate the outcome variable. Below I estimate the model excluding one of the regressors.

The estimated coefficient is 0.495, but we know that the true value is 1. Also, our confidence interval suggests that the true value is somewhere between 0.476 and 0.515. Estimation and inference are misleading.
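The original Stata listing is not reproduced in this excerpt, but the data-generating process it describes can be sketched in NumPy. The parameter values below (true \(\beta_1 = 1\), \(\beta_2 = -1\), and a correlation parameter of 0.5) are assumptions chosen so that the omitted-variable bias pulls the estimate toward 0.5:

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 10_000
delta = 0.5                           # assumed parameter correlating the regressors
x1 = rng.normal(size=n)
x2 = delta * x1 + rng.normal(size=n)  # x2 is correlated with x1
e = rng.normal(size=n)
y = 1.0 * x1 - 1.0 * x2 + e           # assumed true values: beta1 = 1, beta2 = -1

# Regress y on x1 alone, omitting x2
X = np.column_stack([np.ones(n), x1])
b_short, *_ = np.linalg.lstsq(X, y, rcond=None)
# b_short[1] is biased: it converges to beta1 + delta*beta2 = 0.5, not to 1
```

Here the bias is \(E\left(X_2|X_1\right)\beta_2 = \delta\beta_2 = -0.5\), so the short regression recovers a value near 0.5 rather than the true 1.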

Example 2 (endogeneity). If \(E\left(X_j'\varepsilon \right) \neq 0\), we say that the covariates \(X_j\) are endogenous. By the law of iterated expectations, \(E\left(\varepsilon|X_j\right) = 0\) implies \(E\left(X_j'\varepsilon \right) = 0\). Thus, if \(E\left(X_j'\varepsilon \right) \neq 0\), it must be that \(E\left(\varepsilon|X_j\right) \neq 0\). Say \(X_1\) is endogenous; then, we can write the model under endogeneity within our framework as
\begin{eqnarray*}
y &=& X_1\beta_1 + \varepsilon \\
E\left(\varepsilon|X_1\right) &\neq& 0
\end{eqnarray*}

In lines 7–10, I generate correlated unobservable variables. In line 14, I generate a covariate, x2, that is correlated with one of the unobservables. In line 18, I generate the outcome variable. Because x2 is endogenous, its estimated coefficient will be far from the true value (in this case, \(-1\)). Below we observe exactly this:

The estimated coefficient is \(-0.498\), and our confidence interval suggests that the true value is somewhere between \(-0.510\) and \(-0.486\). Estimation and inference are misleading.
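With the Stata code unavailable in this excerpt, a NumPy sketch of an endogenous covariate follows. The shared shock u is an assumption that links x2 to the unobservable e; with the unit variances chosen here, the OLS bias on x2's coefficient is cov(x2, e)/var(x2) = 0.5:

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 10_000
u = rng.normal(size=n)           # assumed shared shock linking x2 and e
e = u + rng.normal(size=n)       # unobservable in the outcome equation
x1 = rng.normal(size=n)          # exogenous covariate
x2 = rng.normal(size=n) + u      # endogenous: correlated with e through u
y = 1.0 * x1 - 1.0 * x2 + e      # assumed true coefficient on x2 is -1

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[2] converges to -1 + cov(x2, e)/var(x2) = -1 + 0.5 = -0.5, not -1
```

The coefficient on the exogenous covariate x1 is still estimated consistently; only the endogenous covariate's coefficient is pulled away from its true value.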

Example 3 (selection bias). In this case, we only observe our outcome of interest for a subset of the population. The subset of the population we observe depends on a rule. For instance, we observe \(y\) if \(y_2\geq 0\). In this case, the conditional expectation of our outcome of interest is given by
\begin{eqnarray*}
E\left(y|X_1, y_2 \geq 0\right) &=& X_1\beta_1 + E\left(\varepsilon|X_1, y_2 \geq 0\right)
\end{eqnarray*}

Selection bias arises if \(E\left(\varepsilon|X_1, y_2 \geq 0 \right) \neq 0\). This implies that the selection rule is related to the unobservables in our model. If we define \(X \equiv (X_1, y_2 \geq 0)\), we can rewrite the problem in terms of our general framework:
\begin{eqnarray*}
y &=& X_1\beta_1 + \varepsilon \\
E\left(\varepsilon|X\right) &\neq& 0
\end{eqnarray*}

In lines 7 and 8, I generate correlated unobservable variables. In lines 12–15 I generate the exogenous covariates. In lines 19 and 20, I generate the two outcomes and drop observations according to the selection rule in line 21. If we use linear regression, we obtain

As in the previous cases, the point estimates and confidence intervals lead us to incorrect conclusions.
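The selection mechanism can also be sketched in NumPy. The correlation of 0.5 between the unobservables and the true slope of 1 are assumed values; the selection equation here depends on the same covariate as the outcome, so that the slope, and not just the intercept, is biased:

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 50_000
e2 = rng.normal(size=n)                              # unobservable in the selection equation
e1 = 0.5 * e2 + np.sqrt(0.75) * rng.normal(size=n)   # correlated with e2 (corr = 0.5)
x1 = rng.normal(size=n)
y = 1.0 * x1 + e1                                    # outcome of interest, assumed true slope 1
y2 = x1 + e2                                         # selection equation
keep = y2 >= 0                                       # y is observed only when y2 >= 0

X = np.column_stack([np.ones(keep.sum()), x1[keep]])
b, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
# The slope estimate falls below 1 because E(e1 | x1, y2 >= 0) varies with x1
```

Intuitively, observations with low x1 survive selection only when e2 (and hence e1) is large, so the unobservables are correlated with x1 in the selected sample.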

Concluding remarks

I have presented a general regression framework for understanding many of the problems that prevent us from interpreting our results causally. I also used simulated data to illustrate the effects of these problems on our point estimates and confidence intervals.

What is a projection model?

epinzon

Hello Laura,

The projection model is a special case of the regression model. So you can get results from a projection model using -regress-.

The regression model assumes that the unobservables are, on average, unrelated to the regressors and to any function of the regressors, such as X^3, sin(X), or cos(X). The projection model concerns itself only with linear combinations of the regressors. In that sense, it is a special case. Your intuition about linear regression can be formed in terms of the projection model.
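A toy NumPy example may make the distinction concrete (the quadratic relationship is an assumption chosen purely for illustration): when y = x^2, the linear projection of y on x is well defined and its error u satisfies E(x'u) = 0, yet E(u|x) = x^2 - 1 is not zero, so the regression-model assumption fails.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000
x = rng.normal(size=n)
y = x**2                          # a purely nonlinear relationship (toy assumption)

# Linear projection of y on (1, x) via least squares
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ b                     # projection error

# By construction, the projection condition E(x'u) = 0 holds numerically,
# but E(u | x) = x**2 - 1 is not zero, so E(eps | x) = 0 fails
```

Here the projection slope is (approximately) zero because cov(x, x^2) = 0 for a standard normal x, even though y depends on x deterministically.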

The literature on instrumental variables for linear models and, to some extent, GMM writes the problem in terms of the projection model, where E(X'e) = 0 is called a moment condition. This is why I decided to present endogeneity in terms of the projection model.

All that being said, I think the post would have benefited from a conceptual discussion of the projection model. In other words, thanks for the question.