Math, programming, ecology, and more

Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data. [As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.] The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance>mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.

First, we will define a few of the variables that are used repeatedly throughout the subsequent code. [Note: click here to get all code from this post in a single file.] We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.

R

1

2

3

4

n=100# No. of samples per treatment

mean.A=10# Mean count for treatment A

mean.B=5# Mean count for treatment B

nd=data.frame(Trt=c("A","B"))# Used in predict( ) function

Poisson data

We generate random variates from a Poisson distribution with the rpois( ) function. Because mean=variance in a Poisson distribution, we only need to specify the number of random variates and the expected mean of the distribution.

We test for goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom.

R

1

2

3

1-pchisq(summary(model.pois)$deviance,

summary(model.pois)$df.residual

)

[1] 0.7565471

The GOF test indicates that the Poisson model fits the data (p > 0.05). If this were your actual data, you could breathe a sigh of relief because you could stop here. Well, not quite here. You will still want to use the model to predict mean counts for each treatment and standard errors for each parameter.

R

1

2

3

4

cbind(nd,

Mean=predict(model.pois,newdata=nd,type="response"),

SE=predict(model.pois,newdata=nd,type="response",se.fit=T)$se.fit

)

Trt Mean SE
1 A 10.03 0.3167015
2 B 4.77 0.2184031

Because we used a large sample size, the predicted means are similar to the expected means of 10 and 5.

Negative binomial data

Next we will use the 'MASS' package to generate random deviates from a negative binomial distribution, which involves a parameter, theta, that controls the variance of the distribution.

R

1

2

3

4

library(MASS)

data.nb=data.frame(Trt=c(rep("A",n),rep("B",n)),

Response=c(rnegbin(n,mean.A,5),rnegbin(n,mean.B,5))

)

Poisson model

We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).

The model estimates the dispersion parameter as 4.114; we set theta = 5 when generating random variates.

R

1

2

3

1-pchisq(summary(model.nb)$deviance,

summary(model.nb)$df.residual

)

[1] 0.1756968

The GOF test indicates that the negative binomial model fits the data.

R

1

2

3

4

cbind(nd,

Mean=predict(model.nb,newdata=nd,type="response"),

SE=predict(model.nb,newdata=nd,type="response",se.fit=T)$se.fit

)

Trt Mean SE
1 A 10.27 0.5992233
2 B 5.07 0.3364221

Here you see the 'danger' of ignoring overdispersion in the Poisson model. The SE estimates are lower for the Poisson model than for the negative binomial model, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.

Zero-inflated Poisson data

Lastly, we will add more more layer of complication to the story. If you have lots of zeros in your data, and have determined that Poisson and negative binomial models do not fit your data well, then you should turn to zero-inflated models with either Poisson or negative binomial error distributions. We need the 'VGAM' package to generate random variates from a zero-inflated Poisson distribution using the rzipois( ) function. The 3rd argument to the rzipois( ) function specifies the probability of drawing a zero beyond the expected number of zeros for a Poisson distribution with the specified mean. Here were are introducing a relatively small proportion of extra zeros and the same proportion for each treatment.

Thank you for your response.
However, I’am using this family-negative binomial command in GNM (not glm). If I don’t specify the theta, I get an error. “Error in negative.binomial(link = “log”) : ‘theta’ must be specified”.
Is there another way to get an estimate of theta? As the manual indeed says, if you don’t specify it, it will use an estimate using a Poisson. However, if I run a (regular) Poisson, he just assumes the theta to be 1? Right? How can I get an estimate of theta?

I’ve got some questions about glm.nb.
First, how to interpret the “theta” I took a look to the link above, but it didn’t help me…
I’ve got a theta of 0.2859, Std. Err.: 0.0161 and a 2 x log-likelihood: -5028.2840. Is there any way to say what those parameters mean ?

Secondly, is the pchisq() a goodness of fit test ? How to interpret this. with my glm.nb, pchisq = 0.95, is it enough information to say that my model fit the datas ?

Third, in my model I’ve got two variables which are binary variables, is factor(y) the good way to use binary variable in a model ?

And a last question : I would like to implement a model on the influence of some variable on a response variarable (nothing fancy^^) but the problem is that I’ve got only 16 datas in my response variable, how can I say if it is enough to implement a model ?

Thank you very much in advance for any answer (and thank you very much for the blog, very useful !!)

Can you talk more about how do you judge for overdispersion in summary(model.zip.3) by theta= 28840.7026 ?
If i have tried a model :
>model1summary(model1)
Theta = 0.9882
Number of iterations in BFGS optimization: 45
Log-likelihood: -715.6 on 33 Df
is it overdispersion ?