Binscatter

This post illustrates the use of the Python module binscatter. The Stata
package that inspired this module has a far more extensive explanation of what
a binned scatter plot is and how to interpret it.

Let’s say we want a visual representation of the relationship between wages and
time spent on the job (tenure). People with more tenure also tend to have more
job experience, so we may want to understand how more tenure affects wages
either with or without accounting for experience.

Let’s start by making up some data. Note that if we were to regress wages on
experience and tenure, the coefficient on tenure would be 1, but if we regress
wages on tenure only, the coefficient is greater, because tenure then also
picks up part of the effect of experience.
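As a minimal sketch of what such made-up data could look like, the snippet below uses NumPy only. The data-generating process (the uniform draws, the 0.5 coefficient on experience, the noise level, and the helper name `ols`) is an assumption chosen for illustration, not necessarily the data used for the plots in this post. It is built so that the coefficient on tenure is 1 when experience is controlled for.

```python
import numpy as np

# Hypothetical data-generating process, chosen only for illustration.
rng = np.random.default_rng(0)
n_obs = 10_000
experience = rng.uniform(0, 40, n_obs)          # years of job experience
tenure = rng.uniform(0, 1, n_obs) * experience  # tenure never exceeds experience
wage = 1.0 * tenure + 0.5 * experience + rng.normal(0, 5, n_obs)


def ols(y, *regressors):
    """Return OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Index 0 is the intercept, index 1 is the coefficient on tenure.
long_coef = ols(wage, tenure, experience)[1]  # controlling for experience
short_coef = ols(wage, tenure)[1]             # tenure only
print(f"with experience: {long_coef:.2f}, tenure only: {short_coef:.2f}")
```

Under these assumptions the first coefficient should come out close to 1 and the second noticeably above 1, since tenure and experience are positively correlated.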

In the first plot, we’ll break observations into twenty bins by their level of tenure and plot the mean wage in each bin against the mean tenure. In the second plot, we’ll residualize both wages and tenure using experience, then do a binned scatter plot of residualized wages against residualized tenure.
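The two constructions can be sketched by hand with NumPy. This is not the binscatter module's API: `bin_means` and `residualize` are hypothetical helpers written for this illustration, the bins are assumed to be equal-sized groups of observations sorted by the x variable, and the made-up data below is likewise an assumption.

```python
import numpy as np


def bin_means(x, y, n_bins=20):
    """Sort by x, split into n_bins near-equal groups, return per-bin means."""
    order = np.argsort(x)
    x_groups = np.array_split(x[order], n_bins)
    y_groups = np.array_split(y[order], n_bins)
    return (np.array([g.mean() for g in x_groups]),
            np.array([g.mean() for g in y_groups]))


def residualize(v, control):
    """Residuals from an OLS regression of v on a constant and control."""
    X = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef


# Hypothetical data: the true coefficient on tenure, given experience, is 1.
rng = np.random.default_rng(0)
n_obs = 10_000
experience = rng.uniform(0, 40, n_obs)
tenure = rng.uniform(0, 1, n_obs) * experience
wage = tenure + 0.5 * experience + rng.normal(0, 5, n_obs)

# First plot: bin on raw tenure.
raw_x, raw_y = bin_means(tenure, wage)

# Second plot: residualize both variables on experience, then bin.
res_x, res_y = bin_means(residualize(tenure, experience),
                         residualize(wage, experience))

# Slope of a straight line fitted through each set of twenty bin means.
raw_slope = np.polyfit(raw_x, raw_y, 1)[0]
res_slope = np.polyfit(res_x, res_y, 1)[0]
print(f"raw slope: {raw_slope:.2f}, residualized slope: {res_slope:.2f}")
```

Passing `raw_x, raw_y` and `res_x, res_y` to matplotlib's `scatter` would draw the two binned scatter plots. With this data, the line through the residualized bin means should have a slope near 1, while the raw one should be steeper.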

Why is a residual plot helpful?

Define $\alpha$, $\beta$, and $\gamma$ to be best linear predictor coefficients and $E^*$ to be the best linear predictor function.

Let $\tilde{\mathrm{wage}}_i$ and $\tilde{\mathrm{tenure}}_i$ be the residuals from regressing wage and tenure, respectively, on experience. Then a binned scatter plot of $\tilde{\mathrm{wage}}_i$ against $\tilde{\mathrm{tenure}}_i$ has the slope that we are interested in: the coefficient on tenure from the regression of wages on both tenure and experience.
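One way to spell this argument out is sketched below. The labeling is an assumption: $\alpha$ is taken to be the intercept, $\beta$ the coefficient on tenure, and $\gamma$ the coefficient on experience, and the regressions on experience are assumed to include a constant so that the residuals have mean zero. The best linear predictor of wages is then

$$E^*[\mathrm{wage}_i \mid \mathrm{tenure}_i, \mathrm{experience}_i] = \alpha + \beta \, \mathrm{tenure}_i + \gamma \, \mathrm{experience}_i,$$

and, by the Frisch-Waugh-Lovell theorem, the best linear predictor of the residualized wage given the residualized tenure is

$$E^*[\tilde{\mathrm{wage}}_i \mid \tilde{\mathrm{tenure}}_i] = \beta \, \tilde{\mathrm{tenure}}_i.$$

Under this labeling, the slope is the same $\beta$ in both equations, which is why the residualized plot recovers the effect of tenure holding experience fixed.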