Negative Binomial Regression | SAS Data Analysis Examples

Please note: The purpose of this page is to show how to use various data
analysis commands. It does not cover all aspects of the research process which
researchers are expected to do. In particular, it does not cover data
cleaning and checking, verification of assumptions, model diagnostics or
potential follow-up analyses.

This page was updated using SAS 9.2.

Examples of negative binomial regression

Example 1. School administrators study the attendance behavior of high
school juniors at two schools. Predictors of the number of days of absence
include the type of program in which the student is enrolled and a standardized
test in math.

Example 2. A health-related researcher is studying the number of hospital
visits in the past 12 months by senior citizens in a community based on the
characteristics of the individuals and the types of health plans under which
each one is covered.

Description of the data

Let’s pursue Example 1 from above.

We have attendance data on 314 high school juniors from two urban high
schools in the file https://stats.idre.ucla.edu/wp-content/uploads/2016/02/nb_data.sas7bdat. The
response variable of interest is days absent, daysabs. The variable
math gives the standardized math score for each student. The variable
prog is a three-level nominal variable indicating the type of instructional
program in which the student is enrolled.

Let’s look at the data. It is always a good idea to start with descriptive
statistics and plots.
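A minimal sketch of such a first look is given below. It assumes the file has been downloaded to a local folder and copied into the WORK library as nb_data; the folder path and the libref in are placeholders of ours, not part of the original page.

```sas
/* ASSUMPTION: nb_data.sas7bdat was saved in this (hypothetical) folder */
libname in "/path/to/folder";

/* Copy the data set into WORK so it can be referred to as nb_data */
data nb_data;
  set in.nb_data;
run;

/* N, mean, standard deviation, minimum and maximum of the numeric variables */
proc means data = nb_data;
  var daysabs math;
run;

/* Frequency table for the three-level program variable */
proc freq data = nb_data;
  tables prog;
run;

/* Histogram of the outcome to inspect its shape */
proc sgplot data = nb_data;
  histogram daysabs;
run;
```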

Each variable has 314 valid observations and their distributions seem quite reasonable. The mean of our outcome variable is much lower than its variance.

Let’s continue with our description of the variables in this dataset. The table below shows the average number of days absent by program type. The mean value of the outcome appears to vary by prog, which suggests that program type is a good candidate for predicting the number of days absent, our outcome variable.
The variances within each level of prog are higher than the
means within each level. These are the conditional means and variances, and the
differences between them suggest that over-dispersion is present and that a negative binomial
model would be appropriate.
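The conditional means and variances described above could be obtained with something like the following sketch, which assumes the data set is available as nb_data in the WORK library.

```sas
/* Mean and variance of days absent within each level of prog */
proc means data = nb_data n mean var;
  class prog;
  var daysabs;
run;
```

If the variance within a program is well above the mean for that program, that is informal evidence of over-dispersion relative to a Poisson model.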

Analysis methods you might consider

Below is a list of some analysis methods you may have
encountered. Some of the methods listed are quite reasonable, while others have
either fallen out of favor or have limitations.

Negative binomial regression – Negative binomial regression can be used
for over-dispersed count data, that is, when the conditional variance exceeds
the conditional mean. It can be considered a generalization of Poisson
regression since it has the same mean structure as Poisson regression and it
has an extra parameter to model the over-dispersion. If the conditional
distribution of the outcome variable is over-dispersed, the confidence
intervals for the negative binomial regression are likely to be wider than
those from a Poisson regression model, because the Poisson model
underestimates the standard errors in that situation.

Poisson regression – Poisson regression is often used for modeling count
data. Poisson regression has a number of extensions useful for count models.

Zero-inflated regression model – Zero-inflated models attempt to account
for excess zeros. In other words, two kinds of zeros are thought to exist
in the data, "true zeros" and "excess zeros". Zero-inflated models estimate
two equations simultaneously, one for the count model and one for the excess
zeros.

OLS regression – Count outcome variables are sometimes log-transformed
and analyzed using OLS regression. Many issues arise with this approach,
including loss of data because the log of zero is undefined, as well as the
lack of capacity to model the dispersion.

Negative binomial regression analysis

Negative binomial models can be estimated in SAS using proc genmod. On the class statement we list the variable prog.
After prog, we use two options, which are given in parentheses. The
param=ref option changes the coding of prog from GLM coding,
which is the default in proc genmod, to reference coding. The ref=first option
changes the reference group to the first level of prog. We have
used two options on the model statement. The type3 option is
used to get the multi-degree-of-freedom test of the categorical variables listed
on the class statement, and the dist = negbin option is used to
indicate that a negative binomial distribution should be used.
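Put together, the statements and options described above correspond to a call along these lines. This is a sketch that assumes the data set is available as nb_data in the WORK library.

```sas
proc genmod data = nb_data;
  /* Reference coding for prog, with its first level as the reference group */
  class prog (param=ref ref=first);
  /* Negative binomial model; type3 requests the overall test for prog */
  model daysabs = prog math / type3 dist=negbin;
run;
```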

The output begins with the Model Information table and the Criteria for
Assessing Goodness of Fit table. The number of observations read and used
is given. In this example, we have no missing data, so all 314
observations that are read in are used in the analysis. In the Criteria
for Assessing Goodness of Fit table, we see the Pearson Chi-Square of 339.88. This
is not a test of the model coefficients, but a test of the model form: are
the data overdispersed when modeled with a negative binomial distribution?
A low p-value from this test suggests misspecification or other
problems with the model. We can get the p-value of this test by comparing
the statistic to a chi-square distribution with the degrees of freedom
reported in the same table. The non-significant p-value suggests that the negative
binomial model is a good fit for the data.
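One way to compute this p-value is a short data step like the sketch below. The chi-square value is the one quoted above. The degrees of freedom are an assumption on our part (314 observations minus four regression parameters) and should be replaced by the value shown in the DF column of the goodness-of-fit table.

```sas
data pvalue;
  chisq = 339.88;  /* Pearson Chi-Square from the goodness-of-fit table */
  df = 310;        /* ASSUMPTION: read the actual value from the DF column */
  /* Upper-tail probability of the chi-square distribution */
  pvalue = 1 - probchi(chisq, df);
run;

proc print data = pvalue noobs;
run;
```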

The Analysis of Maximum Likelihood Parameter Estimates table is
presented next, which gives the regression coefficients, standard errors, the Wald 95% confidence intervals
for the coefficients, chi-square tests and p-values for each of the model
variables. In this example,
the variable math has a coefficient of -0.006, which is statistically
significant. This means that for each one-unit increase in math,
the expected log count of the days absent decreases by 0.006. The
indicator for prog=2 is the expected difference in log count between
group 2 and the reference group (prog=1). The expected log count for
level 2 of prog is 0.44 lower than the expected log count for level
1. The indicator variable prog=3 is the expected difference in log
count between group 3 and the reference group. The expected log count for
level 3 of prog is 1.28 lower than the expected log count for level
1. To determine if prog itself, overall, is statistically
significant, we can look at the LR Statistics for Type 3 Analysis table, which
includes the two degree-of-freedom test of this variable. The chi-square value
for this test is 45.05 with a p-value less than .0001. This indicates that
the variable prog is a statistically significant predictor of daysabs.

Additionally, there is an estimate of the dispersion coefficient
(often called alpha). A Poisson model is one in which this alpha value is
constrained to zero. In this example, the estimated alpha has a 95%
confidence interval that does not include zero, suggesting that the negative
binomial model form is more appropriate than the Poisson. An estimate
greater than zero suggests over-dispersion (variance greater than mean). An
estimate less than zero suggests under-dispersion, which is very rare.
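The role of alpha can be made explicit with the mean and variance functions. The sketch below assumes the usual quadratic parameterization of the negative binomial, with \mu denoting the conditional mean of daysabs; the symbols are ours.

```latex
\begin{aligned}
E(y \mid x) &= \mu \\
\operatorname{Var}(y \mid x) &= \mu + \alpha\,\mu^{2}
\end{aligned}
```

Setting \alpha = 0 gives a variance equal to the mean, which is the Poisson model. Any \alpha > 0 gives a variance larger than the mean.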

We can also see the results as incidence rate ratios by using estimate statements with the exp option.
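A sketch of such estimate statements follows, assuming the data set is available as nb_data in the WORK library. The labels in quotes are arbitrary. Because prog uses reference coding with level 1 as the reference, it contributes two parameters (for levels 2 and 3), so each statement supplies two coefficients for prog.

```sas
proc genmod data = nb_data;
  class prog (param=ref ref=first);
  model daysabs = prog math / type3 dist=negbin;
  /* exp requests the exponentiated estimate, that is, the rate ratio */
  estimate 'prog 2 vs prog 1' prog 1 0 / exp;
  estimate 'prog 3 vs prog 1' prog 0 1 / exp;
  estimate 'math (one unit)'  math 1   / exp;
run;
```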

The output above indicates that the incidence rate for prog=2 is 0.64
times the incidence rate for the reference group (prog=1). Likewise, the
incidence rate for prog=3 is 0.28 times the incidence rate for the
reference group, holding the other variables constant. The percent change in the
incidence rate of daysabs is a decrease of about 0.6% (1 - 0.994) for every unit
increase in math.

The form of the model equation for negative binomial regression is the same
as that for Poisson regression. The log of the outcome is predicted with a
linear combination of the predictors:
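For the model fitted on this page, the equation can be written as in the sketch below. The coefficient symbols are our notation, and the terms in parentheses are 0/1 indicators for the program levels.

```latex
\begin{aligned}
\log(\widehat{daysabs}) &= \beta_0 + \beta_1 (prog = 2) + \beta_2 (prog = 3) + \beta_3\, math \\
\widehat{daysabs} &= \exp(\beta_0)\,\exp\bigl(\beta_1 (prog = 2)\bigr)\,\exp\bigl(\beta_2 (prog = 3)\bigr)\,\exp(\beta_3\, math)
\end{aligned}
```

With the rounded estimates reported earlier, \beta_1 is about -0.44, \beta_2 about -1.28 and \beta_3 about -0.006.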

The coefficients have an additive effect on the log(y) scale and the
IRRs have a multiplicative effect on the y scale. The dispersion
parameter in negative binomial regression does not affect the expected counts,
but it does affect the estimated variance of the expected counts.

For additional information on the various metrics in which the results can be
presented, and the interpretation of such, please see Regression Models for
Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long
and Jeremy Freese (2006).

Below we use estimate statements to calculate the predicted number of events at each level of
prog, holding all other variables (in this example, math) in the
model at their means.
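A sketch of these estimate statements is given below, assuming the data set is available as nb_data in the WORK library. Storing the mean of math in a macro variable (math_mean, a name of our choosing) is one convenient way to avoid typing it by hand; the value is rounded when it is stored.

```sas
/* Store the sample mean of math in the macro variable math_mean */
proc sql noprint;
  select mean(math) into :math_mean from nb_data;
quit;

proc genmod data = nb_data;
  class prog (param=ref ref=first);
  model daysabs = prog math / dist=negbin;
  /* Linear predictor at each level of prog with math at its mean;  */
  /* the exponentiated rows of the output are the predicted counts. */
  estimate 'prog 1' intercept 1 prog 0 0 math &math_mean / exp;
  estimate 'prog 2' intercept 1 prog 1 0 math &math_mean / exp;
  estimate 'prog 3' intercept 1 prog 0 1 math &math_mean / exp;
run;
```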

In the output above, we see that the predicted number of
events for level 1 of prog is about 10.24, holding math at its
mean. The predicted number of events for level 2 of prog is lower at
6.59, and the predicted number of events for level 3 of prog is about
2.85. Note that the predicted count of level 2 of prog is
(6.5879/10.2369) = 0.64 times the predicted count for level 1 of prog.
This matches what we saw in the incidence rate ratio output table.

We can similarly obtain the predicted number of events for values of math
while holding prog constant.
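For instance, the sketch below requests predicted counts at math = 20 and math = 40 with prog at its reference level. It assumes the data set is available as nb_data in the WORK library; the labels are arbitrary.

```sas
proc genmod data = nb_data;
  class prog (param=ref ref=first);
  model daysabs = prog math / dist=negbin;
  /* prog 0 0 keeps prog at its reference level (level 1) */
  estimate 'math = 20' intercept 1 prog 0 0 math 20 / exp;
  estimate 'math = 40' intercept 1 prog 0 0 math 40 / exp;
run;
```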

The table above shows that when prog is held at its reference level and
math at 20, the predicted count (or average number of days absent) is about
12.13; when prog is held at its reference level and math at 40, the
predicted count is about 10.76. If we compare the predicted counts at these two
levels of math, we can see that the ratio is (10.7569/12.1267) = 0.887. This
matches the IRR of 0.994 for a 20-unit change: 0.994^20 = 0.887.

You can graph the predicted number of events using the commands below.
proc genmod must be run with the output statement to obtain the
predicted values in a dataset we called pred1. We then sorted our
data by the predicted values and created a graph with proc sgplot.
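These steps can be sketched as follows, assuming the data set is available as nb_data in the WORK library. The output data set name pred1 comes from the text; the variable name pred for the predicted values is our choice.

```sas
proc genmod data = nb_data;
  class prog (param=ref ref=first);
  model daysabs = prog math / dist=negbin;
  /* Save the predicted mean counts in the variable pred */
  output out = pred1 predicted = pred;
run;

/* Sort so that the series lines are drawn in order */
proc sort data = pred1;
  by pred;
run;

/* One line of predicted days absent against math for each program */
proc sgplot data = pred1;
  series x = math y = pred / group = prog;
run;
```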

The graph indicates that the most days absent are predicted for those in
program 1. The lowest number of predicted days absent is for those students
in program 3.

Things to consider

It is not recommended that negative binomial models be applied to small
samples.

Negative binomial models assume that only one process
generates the data. If more than one process generates the data, then
it is possible to have more 0s than expected by the negative binomial model;
in this case, a zero-inflated model (either zero-inflated Poisson or
zero-inflated negative binomial) may be more appropriate.

If the data generating process does not allow for any 0s (such as the
number of days spent in the hospital), then a zero-truncated model may be
more appropriate. Such models can be estimated with proc countreg.

Count data often have an exposure variable, which indicates the number
of times the event could have happened. This variable should be
incorporated into your negative binomial model with the use of the offset
option on the model statement. Because the model uses a log link, the
variable named on the offset option should be the log of the exposure.
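The data used on this page have no exposure variable, so the sketch below is purely illustrative: the data set mycounts and the variables y, x and exposure are hypothetical names. It shows the log of the exposure being created first and then passed to the offset option.

```sas
/* HYPOTHETICAL data set and variable names, for illustration only */
data mycounts_off;
  set mycounts;
  log_exposure = log(exposure);  /* the offset must be on the log scale */
run;

proc genmod data = mycounts_off;
  model y = x / dist=negbin offset=log_exposure;
run;
```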

The outcome variable in a negative binomial regression cannot have negative
numbers.