Getting Started with Rate Models

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?

What if we learned one intersection had 200,000 automobiles pass through it during the observation period while the other only had 10,000? That changes our interpretation of the counts quite a bit. They no longer seem “equal.” If we divide the counts by the number of automobiles, we get \(25/200,000 = 0.000125\) and \(25/10,000 = 0.0025\), respectively. The latter intersection has a higher rate of accidents. Dividing 1 by the rates tells us that the first intersection has about one accident for every \(1/0.000125 = 8000\) cars that come through the intersection, while the second has about one accident for every \(1/0.0025 = 400\) cars. Clearly the second intersection appears more dangerous. Therefore it seems we should model the rate of auto accidents instead of the count.
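
For example, the same arithmetic in R (which we use throughout this article) looks like this:


accidents <- c(25, 25)     # accident counts at the two intersections
cars <- c(200000, 10000)   # automobiles passing through during the period
accidents/cars             # rates: 0.000125 and 0.0025
1/(accidents/cars)         # cars per accident: 8000 and 400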

Let’s learn how to do basic rate modeling using R. This is sometimes referred to as Poisson regression for rates. To begin we will load some data that come from the text Categorical Data Analysis (2nd ed.) by Alan Agresti. The data, concerning patient survival after heart valve replacement, come from a paper by Laird and Olivier (1981). A sample of 109 patients were followed for a period of time after having their heart valve replaced. By the end of the observation period, 21 of the patients died. Of interest is whether age or type of heart valve had any association with survival. Age was dichotomized to “under 55” and “55 and over.” There were two types of heart valves: aortic and mitral. We can enter the data as follows:


heartvalve <- c("aortic", "aortic", "mitral", "mitral")
age <- c("<55", "55+", "<55", "55+")
deaths <- c(4, 7, 1, 9)
d <- data.frame(heartvalve, age, deaths)
d


    heartvalve age deaths
  1     aortic <55      4
  2     aortic 55+      7
  3     mitral <55      1
  4     mitral 55+      9

As in our intersection example above, these counts are not directly comparable. Not every subject was under observation for the same amount of time. Some may have been observed for 90 months while others were only observed for 3 months. Therefore we should model the rate of death instead of the count. To do this we need exposure data, or time at risk. Let’s add this to our data frame. Below we add a column called risktime that represents months of observation.


d$risktime <- c(1259, 1417, 2082, 1647)
d


    heartvalve age deaths risktime
  1     aortic <55      4     1259
  2     aortic 55+      7     1417
  3     mitral <55      1     2082
  4     mitral 55+      9     1647

We see, for example, that 4 deaths occurred in 1259 months of observation. These months include time contributed by subjects who were observed and never died. (Recall the sample size is 109.) To calculate the rate of death, we divide deaths by risktime:


d$rate <- round(d$deaths/d$risktime, 4)
d


    heartvalve age deaths risktime   rate
  1     aortic <55      4     1259 0.0032
  2     aortic 55+      7     1417 0.0049
  3     mitral <55      1     2082 0.0005
  4     mitral 55+      9     1647 0.0055

There appears to be a higher rate of death for the older age group.
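
A quick way to check that impression is to pool the two valve types and compute crude death rates by age group:


# total deaths per total months of observation, by age group
with(d, tapply(deaths, age, sum)/tapply(risktime, age, sum))


          <55         55+ 
  0.001496558 0.005221932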

Let’s now model the rate of death as a function of age and heart valve. What predictive effects, if any, do these factors have on the rate of death? To do this we use the glm() function with family = poisson as follows:


m <- glm(deaths ~  age + heartvalve + offset(log(risktime)), 
         data = d,  
         family = poisson)

The first part of the formula, deaths ~ age + heartvalve, is pretty straightforward. It says to model deaths as a function of age and heartvalve in an additive fashion. That means the effect of age does not depend on heartvalve and vice versa. But what about the last part: + offset(log(risktime))?

Clearly that’s how we include risktime so we can model the rate. But why do we take the log and wrap it in something called offset()? To understand this, let’s first define our model as a Poisson count model (as opposed to a rate model).

\[\text{deaths} \sim \text{Poisson}(\text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}))\]

This says the count of deaths is a random draw from a Poisson distribution with a mean of \(\text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve})\). Notice the mean is conditional on “age” and “heartvalve”: the expected death count depends on a weighted sum of “age” and “heartvalve” along with a fixed constant. This is the model we have proposed. (It may not be a good model.) The reason we exponentiate the formula using exp() is to ensure the mean is positive. A Poisson distribution generates counts and requires a positive mean. (Recall that \(e\) raised to any number, positive or negative, is positive.)
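
To make this concrete, here is a small simulation sketch using made-up coefficient values (not the ones estimated below). Whatever the sign of the linear predictor, exponentiating it yields a valid positive mean for the Poisson distribution:


# made-up coefficient values, for illustration only
b0 <- 1; b1 <- 1.2; b2 <- -0.3
age55 <- c(0, 1, 0, 1)      # indicator: 1 = age 55+
mitral <- c(0, 0, 1, 1)     # indicator: 1 = mitral valve
mu <- exp(b0 + b1*age55 + b2*mitral)   # always positive, even if the sum is negative
set.seed(1)
rpois(4, lambda = mu)       # one simulated count per group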

We can rewrite our model in the more traditional linear model format like so:

\[\text{deaths} = \text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve})\]

\[\text{log}(\text{deaths}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

Notice that taking the log of both sides produces the familiar linear model on the right-hand side. We refer to the log as the link function, because it’s the link to the linear model.

Of course we’re not modeling counts, but rates, so let’s restate our model incorporating "risktime".

\[\text{log}\left( \frac{\text{deaths}}{\text{risktime}} \right) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

Recalling the rules of logs, we can rewrite the left-hand side.

\[\text{log}(\text{deaths}) - \text{log}(\text{risktime}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

And then we can add \(\text{log}(\text{risktime})\) to both sides.

\[\text{log}(\text{deaths}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve} + \text{log}(\text{risktime})\]

And that is what we modeled above. Notice there is no \(\beta\) coefficient for \(\text{log}(\text{risktime})\). We simply add it to the right-hand side. This is called an offset. We define an offset in glm() using the offset() function. Hence this is why we added + offset(log(risktime)) to our model above.
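
As an aside, glm() also has an offset argument; specifying the offset there fits the same model, so the two approaches give identical coefficients:


m_alt <- glm(deaths ~ age + heartvalve,
             offset = log(risktime),
             data = d,
             family = poisson)
all.equal(coef(m), coef(m_alt))


  [1] TRUE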

Let’s take a look at the summary of our model.


summary(m)


  Call:
  glm(formula = deaths ~ age + heartvalve + offset(log(risktime)), 
      family = poisson, data = d)
  
  Deviance Residuals: 
       1       2       3       4  
   1.025  -0.602  -1.197   0.613  
  
  Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
  (Intercept)       -6.3121     0.5066 -12.460   <2e-16 ***
  age55+             1.2209     0.5138   2.376   0.0175 *  
  heartvalvemitral  -0.3299     0.4382  -0.753   0.4515    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for poisson family taken to be 1)
  
      Null deviance: 10.8405  on 3  degrees of freedom
  Residual deviance:  3.2225  on 1  degrees of freedom
  AIC: 22.349
  
  Number of Fisher Scoring iterations: 5

In the Estimate column we see the estimated values of our model coefficients (i.e., the \(\beta\) values). The coefficient for age55+, 1.221, is the predictive effect of being 55 or older. The output is on the log scale. Exponentiating it returns the multiplicative effect of age.


exp(coef(m)[2])


    age55+ 
  3.390401

The estimated rate of death for subjects 55 or older is about 3.4 times that of subjects under 55. The confidence interval for this coefficient can be calculated as follows:


exp(confint(m))[2,]


  Waiting for profiling to be done...
      2.5 %    97.5 % 
   1.323921 10.389913

Pretty wide, but it doesn’t include 1. (Remember we exponentiated, and \(e^0 = 1\).) It appears the death rate for subjects 55 and older is at least about 30% higher than that of subjects under 55. (The lower bound of the 95% confidence interval is about 1.32, and multiplying by 1.32 implies an increase of about 32%.)

What are the expected rates according to our model? If we use the fitted() function we get expected counts.


cbind(d[c("heartvalve","age")], count_fit = fitted(m))


    heartvalve age count_fit
  1     aortic <55  2.284108
  2     aortic 55+  8.715892
  3     mitral <55  2.715892
  4     mitral 55+  7.284108
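
As a small sanity check, the first row is the reference level for both factors, so its expected count is just the exponentiated intercept (the rate for that group) times its months at risk:


# expected count for row 1 (aortic, <55) = exp(intercept) * risktime
exp(coef(m)[["(Intercept)"]]) * d$risktime[1]


  [1] 2.284108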

To get expected rates, we need to use predict() with type = "response" and a supplied offset value, and then divide the predicted values by the offset. It doesn’t matter which value we pick for the offset. It can be 1000 or 5. The expected counts will change for different offset values, but the rate is always the same.


# 1000
p <- predict(m, type = "response", 
             newdata = data.frame(heartvalve, age, risktime = 1000))
cbind(d[c("heartvalve","age")], rate_fit = round(p/1000,4))


    heartvalve age rate_fit
  1     aortic <55   0.0018
  2     aortic 55+   0.0062
  3     mitral <55   0.0013
  4     mitral 55+   0.0044


# or use 5
p <- predict(m, type = "response", 
             newdata = data.frame(heartvalve, age, risktime = 5))
cbind(d[c("heartvalve","age")], rate_fit = round(p/5,4))


    heartvalve age rate_fit
  1     aortic <55   0.0018
  2     aortic 55+   0.0062
  3     mitral <55   0.0013
  4     mitral 55+   0.0044

An expected rate of 0.0062 for subjects 55 and older with a replaced aortic heart valve corresponds to about 1 death every 161 months (1/0.0062 = 161.3).
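
If the reciprocal scale is easier to interpret, we can apply the same calculation to all four predicted rates (reusing p from the risktime = 5 call above):


# expected months of observation per death in each group
cbind(d[c("heartvalve","age")], months_per_death = round(1/(p/5)))


    heartvalve age months_per_death
  1     aortic <55              551
  2     aortic 55+              163
  3     mitral <55              767
  4     mitral 55+              226

The 55+ aortic value of 163 differs slightly from the 161 above only because 0.0062 is a rounded rate.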

We can visualize our model using effect plots. The ggeffects package helps us do this. First we use the ggpredict() function to calculate expected counts, along with confidence intervals, at the four combinations of age and heart valve. As we did with the predict() function, we need to supply an offset value. We supply this to the condition argument as a named vector. The condition argument indicates which predictors should be held constant at specific values when making predictions. Again we choose 1000 for risktime. Even though the output is nicely formatted, the object eff_out is a data frame we can use for plotting.


# install.packages("ggeffects")
library(ggeffects)
eff_out <- ggpredict(m, terms = c("age", "heartvalve"), condition = c(risktime = 1000))
eff_out


  # Predicted values of deaths
  # x = age
  
  # heartvalve = aortic
  
  x   | Predicted |        95% CI
  -------------------------------
  <55 |      1.81 | [0.67,  4.90]
  55+ |      6.15 | [3.29, 11.51]
  
  # heartvalve = mitral
  
  x   | Predicted |       95% CI
  ------------------------------
  <55 |      1.30 | [0.50, 3.41]
  55+ |      4.42 | [2.25, 8.71]


# is it a data frame?
is.data.frame(eff_out)


  [1] TRUE


# the column names of the data frame:
names(eff_out)


  [1] "x"         "predicted" "std.error" "conf.low"  "conf.high" "group"

The x column name refers to age since we listed it first in the terms argument for ggpredict. The group column refers to heartvalve since we listed it second. The predicted column is the expected count. To plot expected rates, we need to divide the predicted, conf.low, and conf.high columns by 1,000.


col <- c("predicted", "conf.low", "conf.high")
eff_out[,col] <- eff_out[,col]/1000

Then we can create our effect plot. For this we use the ggplot2 package. Notice we use the position_dodge() function to define how to “dodge” the plotted points and error bars. Smaller values of width bring the points and bars closer together. If we don’t do this, the points and error bars will overplot on each other.


library(ggplot2)
pd <- position_dodge(width = 0.5)
ggplot(eff_out, aes(x = x, y = predicted, color = group)) +
  geom_point(position = pd) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), 
                width = 0.1, 
                position = pd) +
  labs(x = "Age", y = "Expected rate")

expected death rate by age group and heart valve group

We can see the difference in age groups is much more pronounced than the difference between heart valves. In fact the model summary above returned an inconclusive effect of heartvalve. We might want to drop heartvalve from the model. The update() function makes quick work of this. The formula notation . ~ . -heartvalve says, “Keep everything on the left and right side except heartvalve and refit.” The resulting likelihood ratio test (LRT) tells us the larger model with heartvalve doesn’t appear to be that much better than the model without.


m2 <- update(m, . ~ . -heartvalve, data = d)
anova(m2, m, test = "LRT")


  Analysis of Deviance Table
  
  Model 1: deaths ~ age + offset(log(risktime))
  Model 2: deaths ~ age + heartvalve + offset(log(risktime))
    Resid. Df Resid. Dev Df Deviance Pr(>Chi)
  1         2     3.7897                     
  2         1     3.2225  1  0.56721   0.4514

We can then create a new effect plot with our simplified model. Notice we no longer need to worry about dodging points.


eff_out2 <- ggpredict(m2, terms = "age", condition = c(risktime = 1000))
eff_out2[,col] <- eff_out2[,col]/1000
ggplot(eff_out2, aes(x = x, y = predicted)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), 
                width = 0.1) +
  labs(x = "Age", y = "Expected rate")

expected death rate by age group

Hopefully you have a better understanding of how rate models work and when to use them. In addition to Agresti (2002), Gelman and Hill (2007) provide a good overview of rate models.


References

  • Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley. Chapter 9.
  • Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. Chapter 6.
  • Laird, N. M., & Olivier, D. (1981). Covariance analysis of censored survival data using log-linear analysis techniques. Journal of the American Statistical Association, 76(374), 231–240.
  • Lüdecke, D. (2018). ggeffects: Tidy data frames of marginal effects from regression models. Journal of Open Source Software, 3(26), 772. https://doi.org/10.21105/joss.00772
  • R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing.


Clay Ford
Statistical Research Consultant
University of Virginia Library
May 30, 2020


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.