Endogenous Variables and IV Regression in Python

November 17, 2016

Endogeneity occurs when an explanatory variable is correlated with the error term, so a regression coefficient can no longer be read as a causal effect. A common source is simultaneity, where causality runs in both directions. An instance of this might be AIDS funding in Uganda and AIDS occurrence in Uganda. The amount of funding is a function of the number of AIDS cases in Uganda, but the number of cases is also affected by funding. What came first, the chicken or the egg?

In this notebook I use a fertility data set to explore factors that might affect the age of a woman when she has her first child. The data were obtained by James Heakins, a former undergraduate student of Professor Wooldridge at Michigan State University, for a term project. They come from Botswana’s 1988 Demographic and Health Survey.

Here’s my roadmap for evaluating this data:

1. Look at some scatterplots of the variables to get an idea of their relationship to one another.

2. Estimate a naive equation with a possibly endogenous variable. The dependent variable will always be age at first birth.

3. Identify the endogenous variable and pick an appropriate instrument for it. Test for the relevancy of this instrument using an F-test.

4. Use two-stage least squares regression to re-estimate the model with the proper instrument included. I use IV2SLS written by the wonderful people at statsmodels.

5. As an exercise, replicate part 4 using matrix algebra in NumPy.

6. Test for exogeneity of the (supposedly) endogenous variable using the Hausman-Wu test.

7. Add another instrument to the mix and repeat step 3 for both instruments.

If you squint a bit, there seems to be a small positive relationship between how much education a woman receives and her age at first birth. As for the relationship between age at first birth and total number of children, women who have their first child at a young age seem to have more children overall.

Let’s compare the first birth age of women who use a method of contraception to those who don’t.

Somewhat interesting… the mean age at first birth for women who have ever used a method of birth control is lower than for those who have never used one. However, there is slightly more variation among the women who don’t use birth control.

Part 2: Estimate a naive equation with a possibly endogenous variable

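The equation I will estimate is:

$$\text{agefbrth} = \beta_0 + \beta_1\,\text{monthfm} + \beta_2\,\text{ceb} + \beta_3\,\text{educ} + \beta_4\,\text{idlnchld} + u$$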

We assume that there is no relationship between education and the number of children ever born or education and month of marriage. We also assume that there is no relationship between month of first marriage and number of children born.

There’s a huge problem with missing data in this dataset: roughly half of the observations are missing agefbrth. I now drop the rows that have null values for age of first birth, education, month of first marriage, children ever born, or ideal number of children (idlnchld).

# keep only the rows with no missing values in the columns we need
no_null = fertility[(fertility['agefbrth'].notnull()) &
                    (fertility['educ'].notnull()) &
                    (fertility['monthfm'].notnull()) &
                    (fertility['ceb'].notnull()) &
                    (fertility['idlnchld'].notnull())]
print("lost {} samples of data out of a total {} samples".format(
    fertility.shape[0] - no_null.shape[0], fertility.shape[0]))

ind_vars = ['monthfm', 'ceb', 'educ', 'idlnchld']
dep_var = 'agefbrth'
x = no_null[ind_vars]
y = no_null[dep_var]
x_const = sm.add_constant(x)  # sm is statsmodels.api
first_model_results = sm.OLS(y, x_const, missing='drop').fit()
first_model_results.summary()

There’s definitely some linear structure to the errors here, caused by the discrete nature of the dependent variable. (Maybe build a classification model for this?)

The exogeneity assumptions are not valid here. It’s reasonable to believe that the amount of education received is correlated with errors in age of first birth. Education and the month of the first marriage are possibly weakly related. I’m not sure how the school years in Botswana are structured, but if a woman is in school for part of a year, she may not want to get married during any of those months, thus affecting the month she is married. We proceed with caution.

Part 3: Pick an instrument and test for relevancy and strength

I hypothesize that the most endogenous variable is education. If a child is born at a young age, there is less time for education, and it is impossible to determine which is the causal variable.

I will use electricity as an instrumental variable. There is no reason to believe that errors in age of birth and electricity are directly related to each other. However, education and electricity are probably related because places that have electricity are probably more developed and thus more likely to have a school. So, electricity is related to age of first birth only via education.

Test for the relevancy of electricity as an instrumental variable:

Run the relevancy (first-stage) equation, where the exogenous variables and the instrument predict the endogenous variable.

Test whether the coefficient on the instrument is 0 via an F-test with one degree of freedom.

With an F-statistic of 440.417, this is surely a relevant and strong instrument. A common rule of thumb is that the first-stage F-statistic should be at least 10 for an instrument to be considered strong.
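The relevancy check can be sketched on simulated data. This is a minimal illustration rather than the notebook’s original cell: the data-generating process and its coefficients are assumptions, `electric` is a hypothetical name for the electricity indicator, and the standard errors are the classical homoskedastic ones. With a single restriction, the F-statistic is simply the square of the t-statistic on the instrument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated stand-ins for the survey columns (all values are made up).
monthfm = rng.integers(1, 13, size=n).astype(float)     # month of first marriage
ceb = rng.poisson(3, size=n).astype(float)              # children ever born
electric = rng.binomial(1, 0.15, size=n).astype(float)  # instrument (hypothetical name)
educ = 5.0 + 3.0 * electric - 0.3 * ceb + rng.normal(0, 3, size=n)

# Relevancy equation: exogenous variables and the instrument predict educ.
Z = np.column_stack([np.ones(n), monthfm, ceb, electric])
gamma_hat, *_ = np.linalg.lstsq(Z, educ, rcond=None)

# Classical (homoskedastic) covariance matrix of the estimates.
resid = educ - Z @ gamma_hat
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)

# F-test of H0: coefficient on the instrument is 0 (one restriction, so F = t^2).
t_electric = gamma_hat[3] / np.sqrt(cov[3, 3])
f_stat = t_electric ** 2
print("first-stage F-statistic on the instrument: {:.1f}".format(f_stat))
```

In this simulation the instrument is strong by construction, so the statistic lands far above the rule-of-thumb threshold of 10.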

Part 4: Instrumenting using two-stage least squares

Some background and information about two-stage least squares regression: It’s called two stage because there are actually two stages of regression done (earth shattering I know).

First stage

In the first stage, the matrix $X$, which contains the endogenous information, is projected onto $Z$. $Z$ is the matrix without endogenous information that includes the variable(s) that are our instruments. Mathematically:

$$X = Z\gamma + V$$

where $V$ is the error, and $\hat\gamma = (Z'Z)^{-1}Z'X$.

The projection of $X$ onto $Z$ is then:

$$\hat X = Z\hat\gamma = Z(Z'Z)^{-1}Z'X$$

Second stage

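We repeat the same process as above using the first-stage projection $\hat X$ in place of $X$, regressing $y$ on $\hat X$:

$$y = \hat X\beta + u, \qquad \hat\beta_{2SLS} = (\hat X'\hat X)^{-1}\hat X'y$$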

Specifying the two stage least squares model

The documentation for IV2SLS in statsmodels is somewhat confusing and conflicts with some of the terminology that I’ve used in my classes. So, for clarification:

endog is the dependent variable, y

exog is the x matrix that has the endogenous information in it. Include the endogenous variables in it.

instrument is the z matrix. Include all the variables that are not endogenous and replace the endogenous variables from the exog matrix (above) with whatever instruments you choose for them.

On average, every year of education increases age of first birth by .327 years. This speaks to the positive effects of education. Interestingly, it is the only statistically significant variable at the .01 level.
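To make the argument roles of IV2SLS concrete, here is a minimal sketch on simulated data, not the fertility data. The variable names `w`, `z`, `x` and the true coefficient of 0.5 are assumptions chosen for illustration. The point is only which matrix goes where: `exog` keeps the endogenous regressor, while `instrument` swaps it for the instrument.

```python
import numpy as np
from statsmodels.sandbox.regression.gmm import IV2SLS

rng = np.random.default_rng(1)
n = 5000

w = rng.normal(size=n)  # exogenous control
z = rng.normal(size=n)  # instrument: moves x, unrelated to u
u = rng.normal(size=n)  # structural error
x = 0.8 * z + 0.5 * w + 0.7 * u + rng.normal(size=n)  # endogenous: depends on u
y = 1.0 + 0.5 * x + 0.5 * w + u                       # true effect of x is 0.5

const = np.ones(n)
exog = np.column_stack([const, w, x])        # x matrix, endogenous regressor included
instrument = np.column_stack([const, w, z])  # z matrix, x replaced by its instrument

results = IV2SLS(y, exog, instrument).fit()
print(results.params)  # order: const, w, x
```

The estimate on `x` should sit close to the assumed true value of 0.5, which plain OLS would overstate in this setup because `x` and `u` are positively correlated.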

print_resids(no_endog_results.predict(),no_endog_results.resid)

print("the descriptive statistics for the errors and a histogram of them:\n\n", no_endog_results.resid.describe())
sns.distplot(no_endog_results.resid);

Part 5: Replicate using matrix algebra

First, replicate the OLS estimates:

x_mat_ols = np.matrix(x_const)
y_mat_ols = np.matrix(y)
y_mat_ols = np.reshape(y_mat_ols, (-1, 1))  # reshape so that it's a single column vector, not a row vector
b_ols = np.linalg.inv(x_mat_ols.T * x_mat_ols) * x_mat_ols.T * y_mat_ols
print(b_ols)
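The same matrix approach extends to the two-stage estimator. The sketch below is an assumption-laden illustration on simulated data (the names `w`, `z`, `x` and all coefficients are made up). It uses plain arrays with the `@` operator instead of `np.matrix`, and `np.linalg.solve` instead of an explicit inverse, which is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

w = rng.normal(size=n)  # exogenous control
z = rng.normal(size=n)  # instrument
u = rng.normal(size=n)  # structural error
x = 0.8 * z + 0.5 * w + 0.7 * u + rng.normal(size=n)  # endogenous regressor
y = 1.0 + 0.5 * x + 0.5 * w + u                       # true effect of x is 0.5

const = np.ones(n)
X = np.column_stack([const, w, x])  # contains the endogenous regressor
Z = np.column_stack([const, w, z])  # endogenous regressor replaced by the instrument

# OLS: b = (X'X)^{-1} X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# First stage: X_hat = Z (Z'Z)^{-1} Z'X
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Second stage: b = (X_hat'X_hat)^{-1} X_hat'y
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print("OLS :", b_ols)
print("2SLS:", b_2sls)
```

In this simulation OLS overstates the coefficient on `x` because `x` is positively correlated with the error, while the two-stage estimate recovers a value near the assumed 0.5.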

Part 6: Hausman-Wu test for endogeneity

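Add the residuals $\hat r$ from the relevancy (first-stage) equation to the original regression as an extra regressor, then test whether the coefficient on $\hat r$ is significantly different from 0 using an F-test with 1 degree of freedom.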

# add relevancy equation residuals on to the endogenous matrix
x_const['relevancy_resids'] = relevancy_results.resid
# run endogenous regression now with residuals added in
endog_test_results = sm.OLS(y, x_const, missing='drop').fit()
endog_test_results.summary()

We reject the null hypothesis that education is exogenous and conclude that education is indeed an endogenous variable.

The thinking behind this test is that the residuals should only include endogenous information of education because we explained all the exogenous information with monthfm and ceb. If we can then use that endogenous information to predict y in a meaningful way (i.e. the coefficient isn’t zero), then that is evidence that education is correlated with age of first birth via the error term.
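The logic of the test can be checked on simulated data where the regressor is endogenous by construction. This is a sketch under assumed values (the names `w`, `z`, `x`, the helper `ols`, and all coefficients are hypothetical), using classical standard errors.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and classical standard errors."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(3)
n = 5000
w = rng.normal(size=n)  # exogenous control
z = rng.normal(size=n)  # instrument
u = rng.normal(size=n)  # structural error
x = 0.8 * z + 0.5 * w + 0.7 * u + rng.normal(size=n)  # endogenous by construction
y = 1.0 + 0.5 * x + 0.5 * w + u

const = np.ones(n)
Z = np.column_stack([const, w, z])

# Step 1: relevancy equation, keep the residuals r_hat.
gamma_hat, _ = ols(Z, x)
r_hat = x - Z @ gamma_hat

# Step 2: original regression with r_hat added as an extra regressor.
X_aug = np.column_stack([const, w, x, r_hat])
beta_aug, se_aug = ols(X_aug, y)

# Step 3: F-test with one restriction on the coefficient of r_hat (F = t^2).
f_r = (beta_aug[3] / se_aug[3]) ** 2
print("F-statistic on r_hat: {:.1f}".format(f_r))
```

A statistic above 3.84, the 5% critical value of a chi-squared distribution with one degree of freedom, rejects exogeneity. As a side effect, the coefficient on `x` in the augmented regression equals the two-stage least squares estimate.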

Part 7: Add another instrument

Now we instrument for education using more than one instrumental variable. Living in an urban area should not be directly related to differences in the age of first birth, but it will affect educational attainment. Again, more developed areas should (presumably) have better access to schools and education.

I conclude that these are indeed strong and relevant instrumental variables.
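With two instruments, the relevancy test becomes a joint F-test that both instrument coefficients are zero. A minimal sketch on simulated data follows; the column names `electric` and `urban`, the helper `ssr`, and every coefficient are assumptions for illustration. It compares the sum of squared residuals of the first stage with and without the instruments.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(4)
n = 2000

# Simulated stand-ins for the survey columns (all values are made up).
monthfm = rng.integers(1, 13, size=n).astype(float)
ceb = rng.poisson(3, size=n).astype(float)
electric = rng.binomial(1, 0.15, size=n).astype(float)  # instrument 1
urban = rng.binomial(1, 0.5, size=n).astype(float)      # instrument 2
educ = 5.0 + 3.0 * electric + 1.5 * urban - 0.3 * ceb + rng.normal(0, 3, size=n)

exog_only = np.column_stack([np.ones(n), monthfm, ceb])        # restricted model
with_instr = np.column_stack([exog_only, electric, urban])     # unrestricted model

q = 2                    # number of restrictions (two instruments)
k = with_instr.shape[1]  # parameters in the unrestricted model
ssr_r = ssr(exog_only, educ)
ssr_u = ssr(with_instr, educ)

f_joint = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print("joint first-stage F-statistic: {:.1f}".format(f_joint))
```

The same rule of thumb applies: a joint F-statistic well above 10 suggests the instruments are jointly strong.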

Conclusion

While the predictive power of our model may not be stellar with an $R^2$ of 0.033, if the instruments are valid our estimates for $\beta$ are consistent and no longer contaminated by endogeneity. Education, instrumented with access to electricity and urban area, remains the most important factor in predicting the age at which a woman will have her first birth.

Statsmodels does a good job of IV regression, and all results match the output given by Stata. However, some features of Stata are lacking in statsmodels. A robust testing API for the Hausman-Wu test and Sargan’s test of overidentification would be very nice. In Stata, those tests are as simple as typing “estat overid”. Also, the examples on the statsmodels wiki are not stellar and could be expanded to include an econometric use case that I’m sure many data scientists and econometricians would find useful.
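In the meantime, Sargan’s statistic is short enough to compute by hand. The sketch below is one common textbook form, not a statsmodels API: regress the two-stage residuals on all instruments and multiply the resulting $R^2$ by the sample size. The function name `sargan_stat`, the simulated data, and the coefficients are all assumptions. With two instruments for one endogenous variable, the statistic is compared against a chi-squared distribution with one degree of freedom (5% critical value 3.84).

```python
import numpy as np

def sargan_stat(y, X, Z):
    """n * R^2 from regressing the 2SLS residuals on the full instrument matrix."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
    resid = y - X @ beta  # residuals use the original X, not X_hat
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ resid)
    # Z contains a constant, so the residuals have mean zero and
    # the uncentered R^2 below equals the usual centered one.
    r2 = (fitted @ fitted) / (resid @ resid)
    return len(y) * r2

rng = np.random.default_rng(5)
n = 5000
w = rng.normal(size=n)   # exogenous control
z1 = rng.normal(size=n)  # instrument 1
z2 = rng.normal(size=n)  # instrument 2
u = rng.normal(size=n)   # structural error
x = 0.8 * z1 + 0.8 * z2 + 0.5 * w + 0.7 * u + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, w, x])
Z = np.column_stack([const, w, z1, z2])

y_valid = 1.0 + 0.5 * x + 0.5 * w + u  # both instruments excluded from y
y_invalid = y_valid + 0.5 * z2         # z2 affects y directly: invalid instrument

stat_valid = sargan_stat(y_valid, X, Z)
stat_invalid = sargan_stat(y_invalid, X, Z)
print("valid instruments  :", stat_valid)
print("invalid instrument :", stat_invalid)
```

When one instrument enters the outcome equation directly, the statistic is far above the critical value, which is the signal the test is designed to give.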