A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a postive whole number, which means we need to use special distributions to generate the data.
The Poisson Distribution
The most common count distribution is the Poisson distribution. It generates whole numbers greater than or equal to 0. It has one parameter, the mean, which is usually symbolized as \(\lambda\) (lambda). The Poisson distribution has the unique property that its mean and variance are equal. We can simulate values from a Poisson model in R using the rpois
function. Use the lambda
argument to set the mean. Below we generate 500 values from a distribution with lambda
= 4:
y < rpois(n = 500, lambda = 4)
table(y)
## y
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 14 39 69 88 96 82 62 29 15 2 2 1 1
barplot(table(y))
mean(y)
## [1] 3.982
var(y)
## [1] 4.017711
Notice the mean and variance are similar. With larger values of n
we’ll see them grow closer and closer together.
Now let’s say we want to generate a simple model that generates different counts based on whether you’re a male or female. Here’s one way to accomplish that:
set.seed(3)
n < 500
male < sample(c(0,1), size = n, replace = TRUE)
y_sim < rpois(n = n, lambda = exp(2 + 0.5 * (male == 1)))
Notice we used rpois
again but this time the mean is conditional on whether or not you’re a male. If you’re female, lambda is exp(2)
. If your male, lambda is exp(1.5)
. Why use the exp
function? Because that ensures lambda is positive. We could have just used positive numbers to begin with, but as we’ll see, modeling count data with a generalized linear model will default to using log
as the link function which assumes our original model was exponentiated.
Here’s a quick table of the counts we generated.
table(y_sim, male)
## male
## y_sim 0 1
## 0 218 196
## 1 29 44
## 2 2 10
## 3 1 0
We can already see more instances of males having higher counts, as we would expect since we have a postive coefficient for males in the model.
Let’s fit a model to our count data using the glm
function. How close can we get to recovering the “true” values of 2 and 0.5? We have to specify family = poisson
since we’re modeling count data.
m1 < glm(y_sim ~ male, family = poisson)
summary(m1)
##
## Call:
## glm(formula = y_sim ~ male, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0.7155 0.7155 0.5367 0.5367 3.5366
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.9379 0.1667 11.628 < 2e16 ***
## male 0.5754 0.2083 2.762 0.00575 **
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 361.75 on 499 degrees of freedom
## Residual deviance: 353.80 on 498 degrees of freedom
## AIC: 538.16
##
## Number of Fisher Scoring iterations: 6
Notice the coefficients in the summary are pretty close to what we specified in our model. We generated the data with a coefficient of 0.5 for males. The model estimated a coefficient of 0.57. The basic interpretation is that being male increases the log of expected counts by 0.57. That’s not easy to understand. If we exponentiate we get a multiplicative interpretation.
exp(0.57)
## [1] 1.768267
The interpretation now is that the expected count is about 1.77 times greater for males, or a 77% increase.
What if we wanted to generate data in which the expected count was 2 times greater for males? Using some trial and error we find that exp(0.7)
is about 2.
exp(0.7)
## [1] 2.013753
Therefore we could do the following to simulate such data:
set.seed(4)
n < 500
male < sample(c(0,1), size = n, replace = TRUE)
y_sim < rpois(n = n, lambda = exp(2 + 0.7 * (male == 1)))
And as expected, the expected count is about twice as high for males
m1 < glm(y_sim ~ male, family = poisson)
exp(coef(m1)['male'])
## male
## 2.149378
What kind of counts does this model simulate? First let’s look at our original data:
dat < data.frame(y_sim, male)
library(ggplot2)
ggplot(dat, aes(x = y_sim)) +
geom_bar() +
facet_wrap(~male)
Now let’s generate counts using our model. There are two ways we can go about this. Recall that a count model returns the expected count. That would be the lambda in a Poisson model. Using the expected count for females and males, we can randomly generate counts:
p.out < predict(m1, type = "response")
counts < rpois(n = n, lambda = p.out)
dat2 < data.frame(counts, male)
ggplot(dat2, aes(x = counts)) +
geom_bar() +
facet_wrap(~male)
This looks pretty similar to the original data. This is random, so if you run the code above, you’ll probably get something that looks different.
A second way to use the model to generate counts is to use the expected counts for males and females to generate probabilities for various counts, and then generate the counts by multiplying the probabilities by the original number of males and females. Below we use the dpois
function to calculate the expected probabilities of 0, 1, and 2 counts using the model generated male and female lambda values. We then multiply those probabilities by the number of females and males. That gives us expected number of 0, 1, and 2 counts for each gender. Since the expected counts have decimals, we use the round
function to convert them to whole numbers when we create our data frame.
p.out < predict(m1, type = "response", newdata = data.frame(male = c(0,1)))
female_fit < dpois(0:2,lambda = p.out[1]) * sum(male == 0)
male_fit < dpois(0:2,lambda = p.out[2]) * sum(male == 1)
dat3 < data.frame(male = rep(c(0,1), each = 3),
count = round(c(female_fit, male_fit)),
counts = rep(0:2, 2))
ggplot(dat3, aes(x = counts, y = count)) +
geom_col() +
facet_wrap(~male)
The result is almost indistinguishable from our original data. This is what we expect. We created the model to generate the data, and then fit the exact same model to the data we generated to recover the original parameters. This may seem like a pointless exercise, but it ensures we understand our count model. If know how to generate data from a count model, then we know how to interpret a count model fit to data.
An easier way to check model fit is to create a rootogram. We can do this using the rootogram
function in the countreg package.
# Need to install from RForge instead of CRAN
# install.packages("countreg", repos="http://RForge.Rproject.org")
countreg::rootogram(m1)
The tops of the bars are the expected frequencies of the counts given the model. The counts are plotted on the squareroot scale to help visualize smaller frequencies. The red line shows the fitted frequencies as a smooth curve. The xaxis is actually a horizontal reference line. Bars that hang below the line show underfitting, bars that hang above show overfitting. In this case it’s hard to see any over or underfitting because we fit the right model. In a moment we’ll see some rootograms that clearly identify an illfitting model.
The NegativeBinomial Distribution
The Poisson distribution assumes that the mean and variance are equal. This is a very strong assumption. A count distribution that allows the mean and variance to differ is the Negative Binomial distribution. Learning about the negative binomial distribution allows us to generate and model more general types of counts.
The variance of the negative binomial distribution is a function of its mean and a dispersion parameter, \(k\):
\[var(Y) = \mu + \mu^{2}/k\]
Sometimes \(k\) is referred to as \(\theta\) (theta). As \(k\) grows large, the second part of the equation approaches 0 and converges to a Poisson distribution.
We can generate data from a negative binomial distribution using the rnbinom
function. Instead of lambda
, it has a mu
argument. The dispersion parameter, \(k\), is specified with the size
argument. Below we generate 500 values from a negative binomial distribution with mu
= 4 and \(k\) = 5:
y < rnbinom(n = 500, mu = 4, size = 5)
table(y)
## y
## 0 1 2 3 4 5 6 7 8 9 10 11 12 16
## 26 70 66 75 82 58 43 30 20 15 6 6 2 1
barplot(table(y))
mean(y)
## [1] 3.948
var(y)
## [1] 6.734766
We see the variance is a good deal larger than the mean. Because of this we often say the distribution exhibits overdispersion.
Once again let’s generate a simple model that produces different counts based on whether you’re a male or female. Here’s one way to accomplish that using the same model as before, but this time with a dispersion parameter that we’ve set to 0.5. (Since the dispersion parameter is in the denominator, smaller values actually lead to more dispersion.)
set.seed(5)
n < 500
male < sample(c(0,1), size = n, replace = TRUE)
y_sim < rnbinom(n = n,
mu = exp(2 + 0.7 * (male == 1)),
size = 0.05)
A quick call to table shows us how the counts break down:
table(y_sim, male)
## male
## y_sim 0 1
## 0 236 222
## 1 13 6
## 2 4 4
## 3 2 4
## 4 1 1
## 5 1 2
## 6 0 3
## 11 0 1
We know we generated the data using a negative binomial distribution, but let’s first fit it with a Poisson model and see what we get.
m2 < glm(y_sim ~ male, family = poisson)
summary(m2)
##
## Call:
## glm(formula = y_sim ~ male, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0.7536 0.7536 0.5293 0.5293 7.6824
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.9656 0.1667 11.793 < 2e16 ***
## male 0.7066 0.2056 3.437 0.000589 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 577.19 on 499 degrees of freedom
## Residual deviance: 564.71 on 498 degrees of freedom
## AIC: 677.76
##
## Number of Fisher Scoring iterations: 7
The model certainly looks “significant”. The estimated coefficients are not too far off from the “true” values of 2 and 0.5. But how does this model fit? Let’s make a rootogram.
countreg::rootogram(m2)
That doesn’t look good. The Poisson model underfits 0 and 3 counts and way overfits 1 counts.
Now let’s fit the appropriate model. For this we need to load the MASS package which has the glm.nb
function.
library(MASS)
m3 < glm.nb(y_sim ~ male)
summary(m3)
##
## Call:
## glm.nb(formula = y_sim ~ male, init.theta = 0.06023302753, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 0.4582 0.4582 0.3805 0.3805 1.9220
##
## Coefficients:
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.9656 0.3039 6.467 1e10 ***
## male 0.7066 0.4186 1.688 0.0914 .
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.0602) family taken to be 1)
##
## Null deviance: 120.69 on 499 degrees of freedom
## Residual deviance: 117.84 on 498 degrees of freedom
## AIC: 433.23
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.0602
## Std. Err.: 0.0137
##
## 2 x loglikelihood: 427.2270
Notice the coefficients are identical to the Poisson model but the standard errors are much larger. Also notice we get an estimate for “Theta”. That’s the model’s estimated dispersion parameter. It’s not too far off from the “true” value of 0.05. How does the rootogram look?
countreg::rootogram(m3)
Much better! Not perfect but definitely better than what we saw with the Poisson model. The AIC for the negative binomial model is also much lower than the Poisson model (433 vs 677). It’s always a good idea to evaluate multiple pieces of information when comparing models.
The ZeroInflated NegativeBinomial Model
The negativebinomial distribution allows us to model counts with overdispersion (ie, variance is greater than the mean). But we often see another phenomenon with counts: excess zeros. This occurs in populations where some subjects will never perform or be observed experiencing the activity of interest. Think of modeling the number of servings of meat people eat in a day. There will likely be excess zeros because many people simply don’t eat meat. We have a mixture of populations: people who never eat meat, and those that do but will sometimes eat no meat in a given day. That leads to an inflation of zeros.
We can simulate such data by mixing distributions. Below we first simulate a series of ones and zeros from a binomial distribution. The probability is set to 0.9, which implies that about 0.1 of the data will be zeros. We then simulate data from a negative binomial distribution based on the binomial distribution. If the original data was 0 from the binomial distribution, it remains a 0. Otherwise we sample from a negative binomial distrbution, which could also be a 0.^{1} Think of this distribution as the meateaters. Some of them will occasionally not eat meat in a given day. Finally we plot the data and note the spike of zeros.
set.seed(6)
n < 1000
male < sample(c(0,1), size = n, replace = TRUE)
z < rbinom(n = n, size = 1, prob = 0.9)
# mean(z == 0)
y_sim < ifelse(z == 0, 0,
rnbinom(n = n,
mu = exp(1.3 + 1.5 * (male == 1)),
size = 2))
# table(y_sim, male)
barplot(table(y_sim))
Going the other direction, if we wanted to model such data (ie, get some estimate of the process that generated the data) we would use a zeroinflated model. The pscl
package provides a function that helps us do this called zeroinfl
. It works like the glm
and glm.nb
functions but the formula accommodates two specifications: one for the process generating counts (including possible zeros) and the other for the process generating just zeros. The latter is specified after the pipe symbol 
. We also have to specify the count distribution we suspect models the data. Below would be the correct model since it matches how we simulated the data. The first half of the formula is the process for the counts. We include male
because mu
was conditional on male
. After the pipe, we just include the number 1, which means fit an intercept only. That’s correct, because the probability of zero was not conditional on anything.
library(pscl)
m4 < zeroinfl(y_sim ~ male  1, dist = "negbin")
summary(m4)
##
## Call:
## zeroinfl(formula = y_sim ~ male  1, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## 1.1500 0.7180 0.1301 0.4628 5.8656
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.35238 0.04551 29.714 <2e16 ***
## male 1.42124 0.05709 24.894 <2e16 ***
## Log(theta) 0.69064 0.07192 9.602 <2e16 ***
##
## Zeroinflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 2.0947 0.1334 15.71 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 1.995
## Number of iterations in BFGS optimization: 9
## Loglikelihood: 3007 on 4 Df
In the summary we see we came close to recovering the true parameters of 1.3 and 1.5 we used in the rnbinom
function. The summary returns a logged version of theta. If we exponentiate we see that we also came close to recovering the “true” dispersion parameter of 2. (Theta is also available at the bottom of the summary.)
exp(0.69065)
## [1] 1.995012
The bottom half of the summary shows the estimated model for the zero generating process. This is a logistic regression model. The intercept is on the logodds scale. To convert that to probability we could take the inverse logit, which we can do with the plogis
function. We see that it comes very close to recovering the “true” probability of a zero:
plogis(2.0947)
## [1] 0.109613
We can also use a rootogram to assess model fit:
countreg::rootogram(m4)
The resulting plot looks pretty good. The red line is our mixture model. We see that our model accommodates the inflated zeros and then tapers down to accommodate the overdispersed count data.
What happens if we fit a zeroinflated model but misspecify the distribution? Let’s find out. Below we use zeroinfl
with a Poisson distribution.
m5 < zeroinfl(y_sim ~ male  1, dist = "poisson")
summary(m5)
##
## Call:
## zeroinfl(formula = y_sim ~ male  1, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## 1.9408 1.0843 0.2559 0.9142 10.4785
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.46640 0.02563 57.21 <2e16 ***
## male 1.31893 0.02814 46.88 <2e16 ***
##
## Zeroinflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.65062 0.08837 18.68 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 6
## Loglikelihood: 4303 on 3 Df
The summary looks good if we go by stars. But there’s not much there to assess model fit. Again the rootogram is an invaluable visual aid.
countreg::rootogram(m5)
Although we have a good model for the inflated zeros, our count model is lacking as indicated by the wavy pattern of alternating instances of over and underfitting.
We could also generate counts where both processes depend on being male. Below we use a logistic regression model to generate probabilities of zero. (See our post Simulating a Logistic Regression Model for more information.) If you’re female the probability of a 0 count is about 0.69. For males, the probability is about 0.27. Then we generate counts using a negativebinomial model as before.
set.seed(7)
n < 1000
male < sample(c(0,1), size = n, replace = TRUE)
z < rbinom(n = n, size = 1,
prob = 1/(1 + exp((0.8 + 1.8 * (male == 1)))))
y_sim < ifelse(z == 0, 0,
rnbinom(n = n,
mu = exp(1.3 + 1.5 * (male == 1)),
size = 2))
The correct model to recover the “true” values needs to include male
in both formulas, as demonstrated below:
m6 < zeroinfl(y_sim ~ male  male, dist = "negbin")
summary(m6)
##
## Call:
## zeroinfl(formula = y_sim ~ male  male, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## 0.9378 0.4626 0.4626 0.3721 5.5882
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.35609 0.08008 16.934 < 2e16 ***
## male 1.43752 0.08901 16.150 < 2e16 ***
## Log(theta) 0.63396 0.08589 7.381 1.57e13 ***
##
## Zeroinflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>z)
## (Intercept) 0.7761 0.1106 7.018 2.25e12 ***
## male 1.8474 0.1507 12.257 < 2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 1.8851
## Number of iterations in BFGS optimization: 9
## Loglikelihood: 2310 on 5 Df
Once again we come close to getting back the “true” values we used to simulate the data. The bottom half of the summary says that females have about a 68% percent chance of always being 0.
plogis(0.7761)
## [1] 0.684839
Adding the male coefficient allows us to get the expected probability that a male is always a 0 count, about 26%.
plogis(0.7761 + 1.8474)
## [1] 0.2551559
These are close to the “true” probabilities we assigned in the logistic regression model:
# female
1  1/(1 + exp((0.8)))
## [1] 0.6899745
# male
1  1/(1 + exp((0.8 + 1.8)))
## [1] 0.2689414
Try fitting some “wrong” models to the data and review the rootograms to see the lack of fit.
Hopefully you now have a little better understanding of how to simulate and model count data. For additional reading, see our other blog posts, Getting started with Negative Binomial Regression Modeling and Getting Started with Hurdle Models.
For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu
View the entire collection of UVA Library StatLab articles.
Clay Ford
Statistical Research Consultant
University of Virginia Library

The
VGAM
package provides a function calledrzinegbin
to generate data from a zeroinflated negativebinomial distribution. To replicate what we did “by hand”:rzinegbin(n = n, munb = exp(1.3 + 1.5 * (male == 1)), size = 2, pstr0 = 0.1)
↩