How can I run a piecewise regression in Stata? | Stata FAQ

Say that you want to look at the relationship between how much a child talks
on the phone and the age of the child. You get a random sample of 200 kids and
ask them how old they are and how many minutes they spend talking on the phone.
You start with a scatterplot of the data like below.

Looking at this, you are not happy with the nonlinearity that you see in the data, so you try adding a quadratic fit alongside the linear fit.

twoway (scatter talk age) (lfit talk age) (qfit talk age)

Thinking about this more, you suspect that the amount of time that kids talk
on the phone changes dramatically at age 14, and that the slope might change at that age as well.
A piecewise regression might make more sense: before age 14
there is one intercept and linear slope, and from age 14 on there is a different
intercept and a different linear slope, much like the freehand sketch below of
what the two regression lines might look like.
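Written as an equation, the model you have in mind could look like the sketch below. The symbols (the two intercepts, the two slopes, and the error term) and the choice to center age at 14 are notation introduced here for illustration.

```latex
\[
\text{talk}_i =
\begin{cases}
\alpha_1 + \beta_1\,(\text{age}_i - 14) + \varepsilon_i, & \text{age}_i < 14,\\[4pt]
\alpha_2 + \beta_2\,(\text{age}_i - 14) + \varepsilon_i, & \text{age}_i \ge 14.
\end{cases}
\]
```

With this centering, the difference between the two intercepts is the size of the jump at age 14, and the difference between the two slopes is the change in slope.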

Try 1: Separate regressions

To investigate this, we can run two separate regressions, one for kids younger than 14 and one for kids 14 and older.
We can then compare the results of these two models.
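The two separate regressions might be run as in the sketch below. It assumes the data are already in memory with the variables talk and age (the names used in the twoway command above).

```stata
* Assumes a dataset with the variables talk and age is already in memory.

* Segment 1: kids younger than 14
regress talk age if age < 14

* Segment 2: kids 14 and older
* (missing ages are excluded explicitly, because Stata treats
*  missing values as larger than any number)
regress talk age if age >= 14 & !missing(age)
```

In each of these models _cons is the predicted talking time at age 0, which is far outside the range of the data for both groups.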

If we center age at 14 and rerun the two regressions, the slopes for the two
groups stay the same, but the intercepts (_cons) are now the predicted talking
time at age 14 for the two groups. At age 14 there seems to be not only a change
in the slope (from .682 to 3.62) but also a jump in the intercept
(from 17.6 to 25.8). This suggests that at age 14 there is a discontinuous
jump in time spent talking on the phone as well as a change in the slope.
However, this is merely suggestive; we should really test it in a combined
model.
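One way to obtain intercepts that refer to age 14 is to center age at 14 before fitting the two regressions. The sketch below assumes that the variable age14, a name that appears later in this FAQ, is simply age minus 14.

```stata
* Assumption: age14 is age centered at 14, so age14 == 0 at exactly age 14.
generate age14 = age - 14

* Same two segments as before; only the meaning of _cons changes.
regress talk age14 if age < 14
regress talk age14 if age >= 14 & !missing(age)
```

Centering only shifts where the intercept is evaluated, so the slope in each segment is unaffected.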

Try 3: Combined model, coding for separate slope and intercept

We now combine the two models into a single model. To do this, we need to
create some new variables.

age1 is age centered at 14, but set to 0 from age 14 on
(representing the effect of age for kids younger than 14).

age2 is age centered at 14, but set to 0 before age 14
(representing the effect of age for kids 14 and older).

int1 is 1 before age 14 and 0 otherwise (representing the intercept
for kids younger than 14).

int2 is 1 from age 14 on and 0 otherwise (representing the intercept
for kids 14 and older).

That might have been confusing, so let us show what these variables look like in a table below.
Note that we have a strange person who is 13.9999 years old (very close to being 14, but
not quite). This person will be helpful for seeing the effect of the jump
from being under 14 to being 14.
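A sketch of how these four variables might be created and listed follows. The logical-expression trick used here is one of several possible ways to code them and is an assumption, not necessarily how the original data were prepared.

```stata
* (age < 14) evaluates to 1 when true and 0 when false, so multiplying by it
* zeroes out the centered age on the "wrong" side of 14.
* Missing ages are excluded so the new variables stay missing for them.
generate age1 = (age - 14) * (age < 14)  if !missing(age)
generate age2 = (age - 14) * (age >= 14) if !missing(age)
generate int1 = (age < 14)               if !missing(age)
generate int2 = (age >= 14)              if !missing(age)

* Show the coding, with a separator line where the group changes at age 14.
sort age
list age age1 age2 int1 int2, sepby(int2)
```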

Now we are ready to run our combined regression. We use the hascons option
because our model has an implied constant: int1 plus int2
always adds up to 1. With this option, the overall test of the model is
appropriate and Stata does not try to include its own constant.
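The combined model, together with the lincom comparisons of the two intercepts and the two slopes that later sections refer to, might look like the sketch below. It assumes age1, age2, int1, and int2 already exist as defined above.

```stata
* Assumes age1, age2, int1, and int2 have been created as described above.
* hascons tells regress that the regressors already contain a constant
* (here int1 + int2 = 1), so no separate _cons is added.
regress talk int1 int2 age1 age2, hascons

* Jump at age 14: intercept for 14 and older minus intercept for under 14
lincom int2 - int1

* Change in slope: slope for 14 and older minus slope for under 14
lincom age2 - age1
```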

Try 4: Alternate coding, coding to compare intercept and slope

This is another way you can code this model. Note that we include age14 (age
centered at 14, for all ages) and age2 as the two terms for age, and _cons and
int2 to represent the intercept values.
With this coding, age2 and int2 represent the change from being
less than 14 to being 14 and older.
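A sketch of this alternate coding follows. It assumes that age14 is age minus 14 for everyone, and that age2 and int2 exist as defined for the previous model.

```stata
* Create age14 (age centered at 14) only if it does not exist yet.
capture confirm variable age14
if _rc {
    generate age14 = age - 14
}

* Assumes age2 and int2 have been created as described above.
* age14 carries the slope before 14, age2 the change in slope at 14,
* _cons the intercept just before 14, and int2 the jump at 14.
regress talk age14 age2 int2
```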

age2 is the change in the slope as a result of becoming age 14 or higher (as compared to being less than 14). Note how this value of 2.94 corresponds to the
lincom command above comparing the slope for after 14 to the slope before 14.

_cons is the predicted mean for someone who is infinitesimally close to being 14 years old (but not quite 14).

int2 is the predicted mean for someone who just turned 14 years old minus the predicted mean for someone who is infinitesimally close to being 14 years old (the jump that occurs at age 14). Note how this corresponds to the result from the
lincom command above that tested the difference in the intercepts.

As you can see, the coefficients for age2 and int2 now focus on the
change that results from becoming 14 years old.

Below we compute the predicted values, calling them yhat2. Note how the predicted values are the same for this model and the prior model, because the
models are essentially the same; they are just parameterized differently.
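One way to check that the two parameterizations give the same predictions is sketched below. The names yhat (for the prior model's predictions) and diff are hypothetical, and the sketch assumes all of the coded variables from both models already exist.

```stata
* Prior model (separate slope and intercept coding)
regress talk int1 int2 age1 age2, hascons
predict yhat

* Alternate coding (change in intercept and slope)
regress talk age14 age2 int2
predict yhat2

* The absolute difference should be zero up to rounding error.
generate double diff = abs(yhat - yhat2)
summarize diff
```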

Try 5: Using mkspline and getting separate slope coding

Stata has a very nice convenience
command for these kinds of models called mkspline. Below we use the command to create
the variables xage1 (age before 14) and xage2 (age after 14).
We then show the coding
below.
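A sketch of the mkspline call, with a single knot at 14, and a quick listing of the resulting coding:

```stata
* Linear spline with one knot at age 14.
* xage1 = min(age, 14); xage2 = max(age - 14, 0)
mkspline xage1 14 xage2 = age

* Inspect the coding for the ten youngest kids.
sort age
list age xage1 xage2 in 1/10
```

Note that xage1 is not centered: it holds the raw age up to 14 and stays at 14 afterwards.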

We then run the regression below. Note that the effect for xage1 is
the slope before age 14, and xage2 is the slope after age 14. The term int2 corresponds
to the jump in the regression lines at age 14. The value for _cons is the predicted amount of
talking for someone who is zero years old.
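The regression described above might be run as in this sketch, which assumes xage1 and xage2 come from mkspline with a knot at 14 and that int2 is the 14-and-older indicator defined earlier. The lincom line is an optional extra that recovers the difference between the two slopes.

```stata
* Assumes xage1, xage2 (from mkspline) and int2 already exist.
regress talk xage1 xage2 int2

* Optional: slope after 14 minus slope before 14
lincom xage2 - xage1
```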

If the spline variables are instead created with the marginal option of mkspline
(here called yage1 and yage2), all of the coefficients are the same as in the last model, except for yage2. This
coefficient is now the change in the slope from before age 14 to after age 14 (i.e., 3.62 - .68 = 2.94).
Coded in this fashion, yage2 tests for differences in the slopes.
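The yage2 coding described above matches what mkspline produces with its marginal option. The sketch below assumes yage1 and yage2 are created that way and that int2 is the 14-and-older indicator defined earlier.

```stata
* With marginal, yage1 = age and yage2 = max(age - 14, 0), so the
* coefficient on yage2 is the change in slope at the knot.
mkspline yage1 14 yage2 = age, marginal

* Assumes int2 already exists.
regress talk yage1 yage2 int2
```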

Summary

This brief FAQ compared different ways of creating piecewise regression models. All of
these models are equivalent in that the overall test of the model is exactly the same
(always F(3, 196) = 210.66) and that they all generate exactly the same predicted values.
The differences in parameterization are merely a reshuffling of the intercepts and slopes
for the two segments of the regression model. You can choose the coding strategy that you like best,
but note that you can use lincom to combine or compare coefficients to form comparisons that
were not present in the original model. While the mkspline command is very convenient, some
might prefer the manual coding schemes we illustrated because of the interpretation they provide
with respect to the intercept terms.