Monday, October 15, 2012

In practice, we often find that count data are not well modeled by Poisson regression, even though Poisson models are often presented as the natural approach for such data. The negative binomial regression model is more flexible, and is therefore likely to fit better when the data are not Poisson.
In example 8.30 we compared the probability mass functions of the two distributions, and found that, for a given mean, the negative binomial closely approximates the Poisson as the scale parameter increases. But how does this affect the choice of regression model? How might another alternative, the overdispersed (or quasi-Poisson) model, compete with these? Today we build a rudimentary toolkit for assessing the performance of Poisson, negative binomial, and quasi-Poisson models, assuming the data are truly generated by one or the other process.
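
As a quick reminder of that approximation, here is a minimal R sketch. It assumes that the `size` argument of dnbinom() plays the role of the "scale parameter" mentioned above (under R's parameterization the variance is mu + mu^2/size); the mean of 3 and the grid of sizes are arbitrary choices for illustration.

```r
# Compare negative binomial and Poisson pmfs at a common mean.
# Assumption: R's 'size' corresponds to the scale parameter in the text.
mu = 3
x = 0:20
sizes = c(1, 10, 100, 1000)

# largest absolute difference between the two pmfs, for each size
maxdiff = sapply(sizes, function(size)
  max(abs(dnbinom(x, size = size, mu = mu) - dpois(x, lambda = mu))))

names(maxdiff) = sizes
maxdiff   # the differences shrink as size grows
```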

SAS
We'll begin by simulating Poisson and negative binomial data. Note that we also rely on the poismean_nb function that we created in example 8.30. This is needed because SAS only accepts the natural parameters of the distribution, while the mean is a (simple) function of the two parameters.
As is typical in such settings, we'll begin by generating data under the null of no association between, in this case, the normal covariate and the count outcome. The proportion of rejections should be no greater than alpha (5%, here). However, we'll include code to easily simulate data under the alternative as well. This will facilitate assessing the relative power of the models, later.

The models will be fit in proc genmod. (See sections 4.1.3, 4.1.5, table 4.1.) It would be good to write a little macro to change the distribution and the output names, but it's not necessary. To save space here, the repetitive lines are omitted. The naming convention is that the true distribution (p or nb) is listed first, followed by the fit model (p, nb, or pod, for overdispersed).

For analysis, we'll bring all the results together using the merge statement (section 1.5.7). Note that the output data sets contain the Wald CI limits as well as the estimates themselves; all have to be renamed in the merge, or they will overwrite each other.

Indicators of whether each CI excludes the null value are calculated with appropriate names, using logical tests that are 1 if true (rejections) and 0 if false. (See, e.g., section 1.4.9.) The final results can be obtained from proc means.

All of the estimates appear to be unbiased. However, Poisson regression, when applied to the truly negative binomial data, appears to be dramatically anticonservative, rejecting the null (i.e., with CI excluding the null value) 14% of the time. The overdispersed model may be slightly anticonservative as well: its estimated proportion of rejections is 5.55%, or 555 of 10,000 experiments. An exact CI for the proportion excludes 5%, here, although the anticonservative bias appears to be slight. To test other effect sizes, we'd change the mean, which is set in the first data step, and the target in the results data. It would also be valuable to change the scale parameter for the negative binomial.
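
The exact CI mentioned above can be checked quickly. The sketch below happens to use R's binom.test(), which reports a Clopper-Pearson interval; this is only one way to do the check and is not necessarily how the interval was originally computed.

```r
# Exact (Clopper-Pearson) 95% CI for 555 rejections in 10,000 experiments
ci = binom.test(555, 10000, p = 0.05)$conf.int
ci

# TRUE if the whole interval lies above the nominal 5%
ci[1] > 0.05
```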

R
We begin by defining two simple functions: one to extract the standard errors from a model, and a second to assess whether a Wald-type CI for a parameter estimate excludes some value. It's a bit confusing that a standard error extracting function is not part of R. Or perhaps it is, and someone will point out the obvious function in the comments. It's useful to extract the standard errors and construct the Wald CI ourselves in the current setting because the obvious alternative for constructing CI, the confint() function, uses profile likelihoods, which would be too time-consuming in a simulation setting. The second function accepts the parameter estimate, its standard error, and a fixed target value, and reports whether that value falls outside the CI. Both functions are actually single expressions, but having them in hand will reduce the typing in the main function.

# this will work for any model object that works with vcov()
# the test for positive variance should be unnecessary but can't hurt
# note: if ... else rather than ifelse(), so that the SE of every coefficient
# is returned; ifelse() with a length-one test returns only the first SE
stderrs = function(model) {
  if (min(diag(vcov(model))) > 0) sqrt(diag(vcov(model))) else NA
}
# short and sweet: 1 if target is out of Wald CI, 0 if in
ciout = function(est, se, target){
  ifelse( (est - 1.96*se > target) | (est + 1.96*se < target), 1,0)
}
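
To see the two helpers in action, here is a small usage sketch. It restates both functions so that it runs on its own, with stderrs() written using if ... else so that it returns one standard error per coefficient. The simulated data (200 observations, intercept 0.5, slope 0.3) are made up purely for illustration.

```r
stderrs = function(model) {
  if (min(diag(vcov(model))) > 0) sqrt(diag(vcov(model))) else NA
}
ciout = function(est, se, target) {
  ifelse((est - 1.96*se > target) | (est + 1.96*se < target), 1, 0)
}

# illustrative data only: Poisson outcome with a normal covariate
set.seed(1)
x = rnorm(200)
y = rpois(200, lambda = exp(0.5 + 0.3 * x))
fit = glm(y ~ x, family = poisson)

se = stderrs(fit)
se
# the same values appear in the summary() coefficient table
coef(summary(fit))[, "Std. Error"]

# 1 if the Wald CI for a coefficient excludes 0, 0 otherwise
ciout(coef(fit), se, target = 0)
```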

With these ingredients prepared, we're ready to write a function to fit the three models to the two sets of observed data. The function will accept a number of observations per data set and a true beta. The Poisson and overdispersed Poisson are fit with the glm() function (section 4.1.3, table 4.1) but the negative binomial uses the glm.nb() function found in the MASS package (section 4.1.5).
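
A function along these lines might look like the sketch below. This is a reconstruction under stated assumptions, not the original code: the name testnb_sketch, the intercept of 0, the default size of 1.5 for the negative binomial, and the layout of the returned matrix are all choices made here for illustration. Column names follow the convention described earlier (true distribution first, then the fit model).

```r
library(MASS)  # for glm.nb()

stderrs = function(model) {
  if (min(diag(vcov(model))) > 0) sqrt(diag(vcov(model))) else NA
}
ciout = function(est, se, target) {
  ifelse((est - 1.96*se > target) | (est + 1.96*se < target), 1, 0)
}

# n: observations per data set; beta: true slope on the log scale
# size: negative binomial size parameter (assumed value, for illustration)
testnb_sketch = function(n, beta, size = 1.5) {
  x = rnorm(n)
  mu = exp(beta * x)                      # intercept of 0 is an assumption
  ypois = rpois(n, lambda = mu)
  ynb = rnbinom(n, size = size, mu = mu)
  fits = list(
    p_p    = glm(ypois ~ x, family = poisson),
    p_nb   = glm.nb(ypois ~ x),
    p_pod  = glm(ypois ~ x, family = quasipoisson),
    nb_p   = glm(ynb ~ x, family = poisson),
    nb_nb  = glm.nb(ynb ~ x),
    nb_pod = glm(ynb ~ x, family = quasipoisson)
  )
  # slope estimate and its standard error from each fit
  est = sapply(fits, function(m) unname(coef(m)[2]))
  se = sapply(fits, function(m) unname(stderrs(m)[2]))
  # row 1: estimates; row 2: 1 if the Wald CI excludes the true beta
  rbind(est = est, reject = ciout(est, se, beta))
}

set.seed(1)
# glm.nb() may issue warnings when the data are close to Poisson
suppressWarnings(testnb_sketch(n = 200, beta = 0))
```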

Now we can use the convenient replicate() function to call the experiment many times. Since the output of testnb() is a matrix, the result of replicate() is a three-dimensional array, R * C * sheet, where each sheet corresponds to one experimental replicate. To summarize the results, we can use the rowMeans() function (with dims = 2, so that the averaging runs over the replicates) to get the proportion of rejections or the mean of the estimates.
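
The replicate() and rowMeans() pattern can be sketched with a deliberately small stand-in experiment. The function one_rep() below is hypothetical: it fits only the Poisson model, to Poisson and to negative binomial data generated under the null, and the sample size, mean, size parameter, and number of replicates are arbitrary.

```r
# hypothetical mini-experiment: returns a 2 x 2 matrix
# rows: slope estimate, rejection indicator; columns: true distribution
one_rep = function(n = 100) {
  x = rnorm(n)
  ypois = rpois(n, lambda = 1)
  ynb = rnbinom(n, size = 1, mu = 1)
  fitp = function(y) {
    s = coef(summary(glm(y ~ x, family = poisson)))
    c(est = s[2, 1], reject = as.numeric(abs(s[2, 1]) > 1.96 * s[2, 2]))
  }
  cbind(pois_data = fitp(ypois), nb_data = fitp(ynb))
}

set.seed(42)
sims = replicate(200, one_rep())
dim(sims)   # rows, columns, replicates

# dims = 2 treats the first two dimensions as "rows",
# so the mean is taken over the third (the replicates)
res = rowMeans(sims, dims = 2)
res
```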

The results agree completely with the SAS results discussed above.
The naive Poisson regression would appear to be a bad idea: if the data are negative binomial, tests don't have the nominal size. It would be valuable to replicate the experiment with some other distribution for the real data as well. One approach to modeling count data would be to fit the Poisson and assess the quality of the fit, which can be done in several ways. However, this iterative fitting also jeopardizes the size of the test, in theory. Perhaps we'll explore the practical impact of this in a future entry. Fortunately, at least in this limited example, a nice alternative exists: we can just fit the negative binomial by default. The costs of this in terms of power could be assessed with a thorough simulation study, but are likely to be small, since only one additional parameter is estimated. And the size of the test is hardly affected at all. The quasi-Poisson model could also be recommended, but has the drawback of relying on what is actually not a viable distribution for the data. Some sources suggest that it may be even more flexible than the negative binomial, however.
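
One of the several ways to assess the Poisson fit is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be near 1 for Poisson data. The sketch below is one such check, chosen here for illustration; the helper name pearson_disp and the simulation settings are assumptions.

```r
# Pearson chi-square divided by residual df; values well above 1
# suggest overdispersion relative to the Poisson
pearson_disp = function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

set.seed(2012)
n = 1000
x = rnorm(n)
ypois = rpois(n, lambda = 5)
ynb = rnbinom(n, size = 1, mu = 5)   # variance = mu + mu^2/size

d_pois = pearson_disp(glm(ypois ~ x, family = poisson))
d_nb = pearson_disp(glm(ynb ~ x, family = poisson))
c(poisson_data = d_pois, negbin_data = d_nb)
```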

An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

I think there are still other ways to skin this particular cat in R. But I can't fathom why there wouldn't be a built-in function to do it.

FTR, I think I used the summary() approach in an earlier entry, but I prefer this one, since it's based on a function that could easily be used for other purposes, and doesn't rely as heavily on R syntax.
