Poisson Regression | Stata Data Analysis Examples

Version info: Code for this page was tested in Stata 12.

Poisson regression is used to model count variables.

Please note: The purpose of this page is to show how to use various data
analysis commands. It does not cover all aspects of the research process which
researchers are expected to do. In particular, it does not cover data
cleaning and checking, verification of assumptions, model diagnostics or
potential follow-up analyses.

Examples of Poisson regression

Example 1. The number of persons killed by mule or horse kicks in the
Prussian army per year.
Ladislaus Bortkiewicz collected data from 20 volumes of
Preussischen Statistik. These data were collected on 10 corps of
the Prussian army in the late 1800s over the course of 20 years.

Example 2. The number of people in line in front of you at the grocery store.
Predictors may include the number of items currently offered at a special
discounted price and whether a special event (e.g., a holiday, a big sporting
event) is three or fewer days away.

Example 3. The number of awards earned by students at one high school.
Predictors of the number of awards earned include the type of program in which the
student was enrolled (e.g., vocational, general or academic) and the score on their
final exam in math.

Description of the data

For the purpose of illustration, we have simulated a data set for Example 3 above.
In this example, num_awards is the outcome variable and indicates the
number of awards earned by students at a high school in a year, math is a continuous
predictor variable and represents students’ scores on their math final exam, and prog is a categorical predictor variable with
three levels indicating the type of program in which the students were
enrolled.

Let’s start with loading the data and looking at some descriptive
statistics.

Each variable has 200 valid observations and their distributions seem quite
reasonable. In this particular the unconditional mean and variance of our outcome variable
are not extremely different.

Let’s continue with our description of the variables in this dataset. The
table below shows the average numbers of awards by program type and seems to
suggest that program type is a good candidate for predicting the number of
awards, our outcome variable, because the mean value of the outcome appears to
vary by prog.

Analysis methods you might consider

Below is a list of some analysis methods you may have
encountered. Some of the methods listed are quite reasonable, while others have
either fallen out of favor or have limitations.

Poisson regression – Poisson regression is often used for modeling count
data. Poisson regression has a number of extensions useful for count models.

Negative binomial regression – Negative binomial regression can be used
for over-dispersed count data, that is when the conditional variance exceeds
the conditional mean. It can be considered as a generalization of Poisson
regression since it has the same mean structure as Poisson regression and it
has an extra parameter to model the over-dispersion. If the conditional
distribution of the outcome variable is over-dispersed, the confidence intervals for
Negative binomial
regression are likely to be narrower as compared to those from a Poisson regression.

Zero-inflated regression model – Zero-inflated models attempt to account
for excess zeros. In other words, two kinds of zeros are thought to
exist in the data, "true zeros" and "excess zeros". Zero-inflated
models estimate two equations simultaneously, one for the count model and one for the
excess zeros.

OLS regression – Count outcome variables are sometimes log-transformed
and analyzed using OLS regression. Many issues arise with this
approach, including loss of data due to undefined values generated by taking
the log of zero (which is undefined) and biased estimates.

Poisson regression

Below we use the poisson command to estimate a Poisson regression
model. The i. before prog indicates that it is a factor variable
(i.e., categorical variable), and that it should be included in the model as a
series of indicator variables.

We use the vce(robust) option to obtain robust standard errors for the
parameter estimates as recommended by Cameron and Trivedi (2009) to control for
mild violation of underlying assumptions.

The output begins
with the iteration log, which gives the values of the log of pseudolikelihoods starting
with the null model. The last value in the iteration log is the final value
of the log of pseudolikelihood for the full model and is displayed again.
Because we asked for robust standard errors, the maximized likelihood is
actually a pseudolikelihood.
The estimates of the parameters are maximum likelihood estimates and the
estimation of the variance-covariance matrix of the parameter estimates
leads to the pseudolikelihood. Log pseudolikelihood values
can be used to compare models.

The header information is presented next. On the right-hand side, the number of
observations used in the analysis (200) is given, along with the Wald chi-square
statistic with three degrees of
freedom for the full model, followed by the p-value for the chi-square. This
is a test that all of the estimated coefficients are equal to zero–a
test of the model as a whole. From the p-value, we can see that the model is statistically significant. The header also includes
a pseudo-R2, which is 0.21 in this example.

Below the header you will find the Poisson regression coefficients for
each of the variables along with robust standard errors, z-scores, p-values
and 95% confidence intervals for the coefficients. The coefficient for math
is .07.
This means
that the expected increase in log count for a one-unit increase in math
is
.07. The indicator variable
2.prog is the expected difference in log count between group 2 (prog=2)
and the reference group (prog=1). Compared to level 1 of prog, the
expected log count for level 2 of prog increases by about 1.1. The
indicator variable 3.prog is the expected difference in log count
between group 3 (prog=3) and the reference group (prog=1). Compared to level 1 of prog, the
expected log count for level 3 of prog increases by about .37. To determine if
prog itself, overall,
is statistically significant, we can use the test command to obtain the
two degrees-of-freedom test of this variable.

The two degree-of-freedom chi-square test indicates that prog, taken
together, is a
statistically significant predictor of num_awards.

To help assess the fit of the model, the estat gof command can be used to
obtain the goodness-of-fit chi-squared test. This is not a test of the model
coefficients (which we saw in the header information), but a test of the model form:
Does the poisson model form fit our data?

We conclude that the model fits reasonably well because the goodness-of-fit
chi-squared test is not statistically significant. If the test had been
statistically significant, it would indicate that the data do not fit the model
well. In that situation, we may try to determine if there are omitted
predictor variables, if our linearity assumption holds and/or if there is
an issue of over-dispersion.

Sometimes, we might want to present the regression results as incident rate
ratios, we can use the
irr option. These IRR values are equal to our coefficients from the
output above exponentiated.

The output above indicates that the incident rate for 2.prog is 2.96
times the incident rate for the reference group (1.prog). Likewise,
the incident rate for 3.prog is 1.45 times the incident rate for the
reference group holding the other variables constant. The percent change in the incident rate of num_awards
is an increase of 7% for every unit increase in math.

The coefficients have an additive effect in the log(y) scale and the IRR
have a multiplicative effect in the y scale.

For additional information on the various metrics in which the results can be
presented, and the interpretation of such, please see Regression Models for
Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long
and Jeremy Freese (2006).

To understand the model better, we can use the margins
command. Below we use the
margins command to calculate the predicted counts at each level of
prog, holding all other variables (in this example, math) in the
model at their mean values.

In the output above, we see that the predicted number of events for level 1
of prog is about .21, holding math at its mean. The predicted
number of events for level 2 of prog is higher at .62, and the
predicted number of events for level 3 of prog is about .31. Note that
the predicted count of level 2 of prog is (.6249446/.211411) = 2.96 times
higher than the predicted count for level 1 of prog. This matches what we
saw in the IRR output table.

Below we will obtain the predicted counts for values of math
that range from 35 to 75 in increments of 10.

The table above shows that with prog at its observed values and math
held at 35 for all observations, the average predicted count (or average number of
awards) is about .13; when math = 75, the average predicted count is about 2.17.
If we compare the predicted counts at math = 35 and math = 45, we can see that
the ratio is (.2644714/.1311326) = 2.017. This matches the IRR of 1.0727 for a
10 unit change: 1.0727^10 = 2.017.

You can graph the predicted number of events with the commands below.
The graph indicates that the most awards are predicted for those in the academic
program (prog = 2), especially if the student has a high math score. The
lowest number of predicted awards is for those students in the general program (prog
= 1).

Things to consider

If overdispersion seems to be an issue, we should first check if
our model is appropriately specified, such as omitted variables and
functional forms. For example, if we omitted the predictor variable prog
in the example above, our model would seem to have a problem with
over-dispersion. In other words, a mis-specified model could present a
symptom like an over-dispersion problem.

Assuming that the model is correctly specified, you may want to check for
overdispersion.
There are several ways to do this including the likelihood ratio test of
over-dispersion parameter alpha by running the same regression model using
negative binomial distribution (nbreg).

One common cause of over-dispersion is excess zeros, which in turn are
generated by an additional data generating process. In this situation,
zero-inflated model should be considered.

If the data generating process does not allow for any 0s (such as the
number of days spent in the hospital), then a zero-truncated model may be
more appropriate.

Count data often have an exposure variable, which indicates the number
of times the event could have happened. This variable should be
incorporated into a Poisson model with the use of the exp()
option.

The outcome variable in a Poisson regression cannot have negative numbers, and the exposure
cannot have 0s.

In Stata, a Poisson model can be estimated via glm command with the
log link and the Poisson family.

You will need to use the glm command to obtain the residuals to check other assumptions
of the Poisson model (see Cameron and Trivedi (1998) and Dupont (2002) for
more information).

Many different measures of pseudo-R-squared exist.
They all attempt to provide information similar to that provided by
R-squared in OLS regression, even though none of them can be interpreted
exactly as R-squared in OLS regression is interpreted. For a discussion of
various pseudo-R-squares, see Long and Freese (2006) or our FAQ page
What are pseudo R-squareds?.

Poisson regression is estimated via maximum likelihood estimation. It
usually requires a large sample size.