Using the ACS: Predicting income extremes
Demographics and income

To what extent is an American's income predictable? Here, I look at a subset of this question, using logistic regression to predict whether or not an individual lives below the poverty line. I focus on my home state of Pennsylvania as a smaller test case.

Bottom line: logistic regression gave me insights regarding how demographics and life histories influence whether a person lives below the poverty line. Predictions using this fit were decent, but not great.

Demographics, life history, and poverty

Again, we learn to stay in school

The graph on the left shows the odds ratios and 95% confidence intervals for the logistic regression. An odds ratio greater than one means a person in that category is more likely to live below the poverty line, and an odds ratio less than one means they are less likely to. If the confidence interval bars extending out from a data point cross the line at 1.0, the effect is not statistically significant at the 5% level.

An odds ratio compares the odds that, all other factors being equal, a person in a given category lives below the poverty line with the odds for a base case. I will specify the base cases for each category in the discussion below.
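To make the link between a fitted coefficient and the plotted odds ratio concrete, here is a minimal sketch. It assumes a simple Wald interval (coefficient plus or minus 1.96 standard errors), which is an approximation and not the profile-likelihood interval used for the figure. The coefficient and standard error are hypothetical values chosen for illustration, not results from this analysis.

```python
import math


def odds_ratio_with_wald_ci(beta, se, z=1.96):
    """Turn a logistic regression coefficient into an odds ratio.

    Returns (odds_ratio, lower, upper), where the bounds come from a
    Wald interval on the log-odds scale (an approximation).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def is_significant(lower, upper):
    """An effect is significant when the interval excludes an odds ratio of 1."""
    return not (lower <= 1.0 <= upper)


# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, lower, upper = odds_ratio_with_wald_ci(beta=0.40, se=0.10)
print(f"odds ratio {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
print("significant:", is_significant(lower, upper))
```

A positive coefficient gives an odds ratio above one (more likely to live below the poverty line than the base case), and a negative coefficient gives one below one.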

Women are slightly more likely than men to live below the poverty line.

The racial categories here are binary, with the base case being, for example, not African American. We therefore see that people who are African American, Latin@, and Native American are more likely to live below the poverty line. Results for Asian American and Caucasian people are not significant.

The age categories are compared to very young adults, aged 18 - 25. The results suggest that, as Pennsylvanians get older, they are less likely to live below the poverty line. This result was surprising to me.

People without disabilities are less likely to have extremely low incomes than people with disabilities.

The world area of birth category is compared to people born in the United States. People born in Africa, Asia, and Latin America are significantly more likely to have extremely low incomes.

The education categories are compared to people who have not graduated high school. The longer people stay in school, the less likely they are to live below the poverty line. This analysis does not explore how other aspects of a person's life, such as their household income as children, affect their choice to stay in school.

The class of worker categories are compared to people working for for-profit business ventures. Unsurprisingly, people who do not work for pay are much more likely to have extremely low incomes. Self-employed people who run unincorporated businesses are also more likely to live below the poverty line than people working for for-profit companies. More interestingly, people working for the government and for non-profits are less likely to have extremely low incomes.

People who do not live with their spouse (including people with absent spouses and people who are not married) are substantially more likely to live below the poverty line than people who live with their spouse.

The length of home occupation categories are compared to people who have lived at their current address for less than one year. The longer you live in your home, the less likely you are to have extremely low income.

What variables were important?

Education and marriage

The figure to the right shows the relative importance of different variables in the logistic regression results, colored by category as above. For clarity, I show only the top ten predictors.

Living in a household without one's spouse (regardless of whether one has a spouse) was the most important factor in predicting poverty status. This result was surprising, particularly since this category includes unmarried partners living together. Education variables account for half of the top ten most important predictors. Length of home occupation, age, and disability status were also important.

How well can we predict poverty?

Kind of ok.

We can assess the predictive capability of the logistic regression model several ways.

The accuracy of its predictions on a separate, held-out test data set: 75%. This seems decent, except that if I always guessed a person was living above the poverty line, I would be right about 92% of the time. Doing worse than that majority-class baseline is not great.
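The baseline comparison above is simple arithmetic, sketched here. The only inputs are numbers quoted in this post: people below the poverty line make up 8.5% of the data, and the model's accuracy is 75%. The function name is made up for illustration.

```python
def majority_baseline_accuracy(minority_share):
    """Accuracy of always predicting the more common class."""
    return max(minority_share, 1.0 - minority_share)


# Share of people below the poverty line, as reported in this post.
baseline = majority_baseline_accuracy(0.085)
model_accuracy = 0.75  # accuracy reported in this post

print(f"always guess 'above the poverty line': {baseline:.1%}")
print(f"logistic regression model:             {model_accuracy:.1%}")
```

This is why raw accuracy is a weak yardstick when one class is rare: a model can lose to the trivial guess and still be useful at ranking people by risk.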

Its confusion matrix: out of a total of 25% incorrect responses, 2% are false positives and 23% are false negatives. Unsurprisingly, I do a much better job of predicting the majority class of people living above the poverty line.

The area under the ROC curve, pictured to the left. This curve shows the relationship between the model's false positive rate (one minus its specificity) and its sensitivity, or true positive rate. Ideally, the curve would look like a step function, with an area of 1; a model that guesses at random would have an area of 0.5. For my results, I have an area of 0.83, which is ok.
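For readers who want to see what sits behind these numbers, here is a small sketch of sensitivity, specificity, and the area under the ROC curve. It uses the fact that the area equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The labels and scores are toy values invented for illustration, not output from the model in this post.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """True positive rate and true negative rate at one score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 marks the class being predicted, scores are predicted probabilities.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.6, 0.1]

auc = roc_auc(toy_labels, toy_scores)
sens, spec = sensitivity_specificity(toy_labels, toy_scores)
print(f"AUC {auc:.2f}, sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Sweeping the threshold from 1 down to 0 and plotting sensitivity against one minus specificity traces out the ROC curve itself.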

More details for nerds:

I selected the variables used to perform the logistic regression based on a variety of criteria. I could not simply use all the variables in the PUMS dataset, partially because that was too computationally intensive, and partially because some of the variables overlap, which makes the fit rank-deficient. In addition, some variables have so many possible categories that results using those variables would be difficult to interpret; for example, one variable describing the industry a person works in has 367 categories. The variables chosen are coarse-grained, non-overlapping, and generate significant contributions to the prediction.

I used the R package caret to perform the logistic regression. I used downsampling to help mitigate class imbalance, since people living below the poverty line account for 8.5% of this data. I used 10-fold cross validation to estimate how well the model performs on data it was not fit to. Model fit was measured using the area under the ROC curve.
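The two resampling ideas in the previous paragraph can be sketched in a few lines. This is a plain Python illustration of the concepts, not what caret does internally: the function names, the `below_poverty` field, and the toy data set of 1,000 rows (85 of them below the poverty line, mirroring the 8.5% share) are all assumptions made for the example.

```python
import random


def downsample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class is the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, held_out


# Toy data: 8.5% of rows are below the poverty line.
rows = [{"below_poverty": i < 85} for i in range(1000)]
balanced = downsample(rows, label_of=lambda r: r["below_poverty"])
splits = list(k_fold_indices(len(balanced), k=10))

print(len(balanced), "rows after downsampling")
print(len(splits), "folds, first held-out fold has", len(splits[0][1]), "rows")
```

Each row lands in exactly one held-out fold, so averaging the score across folds uses every row for evaluation once.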

Confidence intervals are calculated using the profiled log-likelihood function.

This analysis was done without using the weights provided in the ACS that account for the different frequencies with which different kinds of people respond to surveys. Naively assuming that the survey sample is representative has two main effects. First, the data will look more like the typical survey responder than the typical American. Second, the standard errors will be smaller, and I will be more confident in my conclusions than is warranted.
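A toy example shows how ignoring survey weights can shift an estimate. This is a sketch under invented numbers: each record pairs a poverty flag with a hypothetical weight (how many people in the population that respondent stands in for), and none of these values come from the ACS.

```python
def poverty_rate(records, use_weights):
    """Share of people below the poverty line, optionally survey-weighted.

    records: iterable of (below_poverty, weight) pairs.
    """
    total = 0.0
    below = 0.0
    for below_poverty, weight in records:
        w = weight if use_weights else 1.0
        total += w
        if below_poverty:
            below += w
    return below / total


# Hypothetical respondents: the one person below the poverty line
# represents three times as many people as each of the others.
records = [(True, 300.0), (False, 100.0), (False, 100.0), (False, 100.0)]

unweighted = poverty_rate(records, use_weights=False)
weighted = poverty_rate(records, use_weights=True)
print(f"unweighted {unweighted:.0%}, weighted {weighted:.0%}")
```

When groups that respond less often are also more likely to be poor, the unweighted estimate understates the poverty rate, as it does in this toy case.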

I chose to explore this question because I have strong background knowledge on what my results should be, e.g. it is well known that women have lower incomes than men. This background knowledge helped me learn to implement and interpret logistic regression. In the future, I would like to investigate questions whose answers are less well understood.