# Understanding Deviance Residuals

If you have ever performed binary logistic regression in R using the glm() function, you may have noticed a section titled “Deviance Residuals” near the top of the summary output. In this article we discuss how these residuals are calculated and what we can use them for. We also discuss other types of residuals available for binary logistic regression.

To begin, let’s fit an example binary logistic regression model using the ICU (Intensive Care Unit) data set included with the vcdExtra package. Below we model the probability of dying after admission to an adult ICU as a function of a subject’s age and whether or not they were unconscious upon admission. We also reset the row numbering of the ICU data frame to make it easier to extract rows of interest later in the article.

library(vcdExtra)
data(ICU)
# reset the row numbering
rownames(ICU) <- NULL
m <- glm(died ~ age + uncons, data = ICU, family = binomial)
summary(m)

Call:
glm(formula = died ~ age + uncons, family = binomial, data = ICU)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.3128  -0.6231  -0.5140  -0.3226   2.4421

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45771    0.80964  -4.271 1.95e-05 ***
age          0.02778    0.01215   2.287   0.0222 *
unconsYes    3.58783    0.79350   4.522 6.14e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 159.48  on 197  degrees of freedom
AIC: 165.48

Number of Fisher Scoring iterations: 5



At the top of the output we see a section titled “Deviance Residuals” which displays a five-number summary of the residuals. We can reproduce this summary by extracting the residuals and using the quantile() function.

quantile(residuals(m))

0%        25%        50%        75%       100%
-2.3127570 -0.6231176 -0.5140270 -0.3225848  2.4420598



In general, the reason we might be interested in this summary is to see how well our model fits the data. Residuals are the difference between what we observe and what our model predicts. Ideally the residuals are distributed roughly symmetrically about 0: we would like the 1Q/3Q values and the Min/Max values to be about the same in absolute value, and the Median to be close to 0. In addition, we would like the Min/Max values to be less than about 3 in absolute value. This is because Deviance Residuals can be roughly approximated with a standard Normal distribution when the model holds (Agresti, 2002). Residuals greater than 3 in absolute value are in the tails of a standard Normal distribution and usually indicate strain in the model.
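To make this rule of thumb concrete, here is a small simulated sketch (the data, coefficients, and seed are invented for illustration, not taken from the ICU data): when a binary logistic regression model is correctly specified, its Deviance Residuals rarely exceed 3 in absolute value.

```r
# Simulated sketch: deviance residuals from a correctly specified model
set.seed(1)
n <- 500
x <- rnorm(n)
p_true <- plogis(-1 + 0.8 * x)   # true model: logit(p) = -1 + 0.8x
y_sim <- rbinom(n, size = 1, prob = p_true)
fit <- glm(y_sim ~ x, family = binomial)

# Five-number summary of the deviance residuals
quantile(residuals(fit))

# Count of "extreme" residuals; usually 0 when the model holds
sum(abs(residuals(fit)) > 3)
```
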

But recall that our response variable, died, takes two values: 0 and 1; and that our model is predicting a probability that ranges between 0 and 1. So how are we getting residuals such as -2.31 or 2.44?

It turns out there are several residuals we can calculate from a binary logistic regression model. The most basic is the Raw Residual, which we’ll call $$e_i$$. This is simply the difference between what we observe ($$y_i$$) and the predicted probability ($$\hat{p_i}$$). It ranges between -1 and 1.

$e_i = y_i - \hat{p_i}$

We can calculate these by hand. Notice we first need to convert our response to 0s and 1s. (It is stored as a factor in the ICU data frame.) When finished, we show the observed value (y), predicted probability from the model (p_hat), and raw residual (e) for the first few observations.

p_hat <- predict(m, type = "response")
y <- ifelse(ICU$died == "Yes", 1, 0)
e <- y - p_hat
cbind(y, p_hat, e) |> head()

  y      p_hat           e
1 0 0.06253121 -0.06253121
2 0 0.13962618 -0.13962618
3 0 0.21110650 -0.21110650
4 0 0.12375704 -0.12375704
5 0 0.26106920 -0.26106920
6 0 0.17645557 -0.17645557


We can also get the raw residuals using residuals() with type = "response".

# raw residuals
e <- residuals(m, type = "response")


Another type of residual is the Pearson Residual. It is the raw residual divided by the estimated standard deviation of a binomial distribution with number of trials equal to 1 and p equal to $$\hat{p}$$. The Pearson residual is basically a rescaled version of the raw residual. We’ll call it $$r_i$$.

$r_i = \frac{e_i}{\sqrt{\hat{p_i}(1 - \hat{p_i})}}$

We can calculate this by hand or use residuals() with type = "pearson".

r <- e / sqrt(p_hat * (1 - p_hat))
# or
r <- residuals(m, type = "pearson")
cbind(y, p_hat, r) |> head()

  y      p_hat          r
1 0 0.06253121 -0.2582677
2 0 0.13962618 -0.4028467
3 0 0.21110650 -0.5172990
4 0 0.12375704 -0.3758138
5 0 0.26106920 -0.5943961
6 0 0.17645557 -0.4628861


Yet another residual is the Standardized Pearson Residual. This is the Pearson Residual adjusted for the leverage of predictors using what are called hatvalues. Hatvalues measure the distance of individual predictors from the mean of the predictors. High hatvalues indicate a subject or row could have outlying predictor values. This in turn could mean that a subject or row has substantial leverage in determining the predicted response. This adjustment increases the absolute value of certain residuals based on the leverage of the associated predictors. We’ll call this residual $$rs_i$$.

$rs_i = \frac{r_i}{\sqrt{1 - \hat{h_i}}}$

We can calculate this by hand using the hatvalues() function, or use the rstandard() function with type = "pearson".
rs <- r / sqrt(1 - hatvalues(m))
# or
rs <- rstandard(m, type = "pearson")
cbind(y, p_hat, rs) |> head()

  y      p_hat         rs
1 0 0.06253121 -0.2601841
2 0 0.13962618 -0.4040353
3 0 0.21110650 -0.5202722
4 0 0.12375704 -0.3770517
5 0 0.26106920 -0.6014345
6 0 0.17645557 -0.4645071


An interesting property of Standardized Pearson Residuals is that they have an approximate Standard Normal distribution if the model fits (Agresti, 2002). This can be useful for assessing model fit, as we demonstrate later in the article.

Finally we come to Deviance Residuals, which we’ll call $$d_i$$. This is a weird-looking formula derived from the likelihood ratio test for comparing logistic regression models, in which we compare our current model to a saturated model. A saturated model is a model with as many coefficients as there are observations. The formula for calculating this test statistic for a single observation produces the Deviance Residual. Notice the sign of $$d_i$$ is the same as that of $$e_i$$.

$d_i = \text{sign}(e_i) [-2(y_i \log\hat{p_i} + (1 - y_i)\log(1 - \hat{p_i}))]^{1/2}$

We can calculate this by hand or just use the residuals() function. We use the sign() function to get the sign of $$e_i$$.

d <- sign(e)*sqrt(-2*(y*log(p_hat) + (1 - y)*log(1 - p_hat)))
# or
d <- residuals(m)
cbind(y, p_hat, d) |> head()

  y      p_hat          d
1 0 0.06253121 -0.3593656
2 0 0.13962618 -0.5484311
3 0 0.21110650 -0.6886566
4 0 0.12375704 -0.5140270
5 0 0.26106920 -0.7778830
6 0 0.17645557 -0.6231176


Again, these are the Deviance Residuals we see summarized in the model summary output.¹ If we square the Deviance Residuals and add them all up, we get the Residual Deviance statistic printed at the bottom of the summary output:

sum(d^2)

[1] 159.4783


We can also extract this from the model object with the deviance() function.

deviance(m)

[1] 159.4783


This test statistic can help us assess whether our model is different from the saturated model.
The null hypothesis of the test is no difference between the models. The LRstats() function from the vcdExtra package runs this test for us. We fail to reject the null hypothesis in this case.

vcdExtra::LRstats(m)

Likelihood summary table:
     AIC    BIC LR Chisq  Df Pr(>Chisq)
m 165.48 175.37   159.48 197     0.9768


Deviance Residuals measure how much the probabilities estimated from our model differ from the observed proportions of successes. Bigger values indicate a bigger difference; smaller values, a smaller difference. The smallest Deviance Residual is in row 23. The observed response is 0 (did not die) and the model predicts a small probability of dying. Hence we get a small Deviance Residual (i.e., close to 0).

i <- which.min(abs(residuals(m))) # 23
ICU$died[i]

[1] No
Levels: No Yes


p_hat[i]

23
0.04683521


residuals(m)[i]

23
-0.3097337



The largest Deviance Residual is in row 165. The observed response is a 1 (did die), but the model predicts a small probability of dying. Thus we have a large Deviance Residual.

i <- which.max(abs(residuals(m))) # 165
ICU$died[i]

[1] Yes
Levels: No Yes


p_hat[i]

       165 
0.05070007


residuals(m)[i]

    165 
2.44206


We mentioned the formula for Deviance Residuals is a bit weird. It uses log transformations and addition. Why is that? To better understand this we need to talk about Maximum Likelihood.

Consider the following toy example of flipping coins. Let’s say we have a coin that lands heads 40% of the time (p = 0.4) and flip it 5 times (1 = heads, 0 = tails). We observe 1, 0, 1, 1, 0. The joint probability of getting this result is calculated as follows:²

p <- 0.4
p * (1 - p) * p * p * (1 - p)

[1] 0.02304


Or written more compactly using exponents:

(p^3)*(1-p)^2

[1] 0.02304


What if we didn’t know p? Our formula would look as follows:

$\text{joint probability} = p^3 (1 - p)^2$

One approach to finding p would be to try different values and select the one that maximizes the joint probability. In other words, what value of p would make the data we observed most likely? Let’s try 0.1 and 0.5.

p <- c(0.1, 0.5)
(p^3)*(1-p)^2

[1] 0.00081 0.03125


It looks like 0.5 is more likely than 0.1 because the joint probability is higher. Fortunately we don’t need to use trial and error. It can be shown that the value that maximizes the joint probability is simply the number of 1s divided by the number of trials: 3/5, or 0.6. So the most likely value of p for the binomial data we observed is 0.6.

Sometimes this concept is presented visually by plotting the joint probability versus various values of p. Below we see the peak of the curve is at p = 0.6.

p <- seq(0.01, 0.99, 0.01)
y <- (p^3)*(1-p)^2
plot(p, y, type = "l")
abline(v = 0.6, col = "red")


The value of the joint probability on the y-axis doesn’t really matter. We just want to know what p on the x-axis corresponds with the maximum value. In the general case, we refer to the joint probability as the likelihood (L), with number of successes (s) and total number of trials (n).
$L = p^s(1 - p)^{n-s}$

If we log transform both sides of this equation we get the log likelihood.

$\log L = s\log(p) + (n - s)\log(1 - p)$

We can also write this in terms of individual observations, where $$y_i$$ takes a value of 0 or 1. Notice this is similar to the formula we presented earlier for Deviance Residuals, with $$n = 1$$ and s equal to $$y_i$$.

$\log L = \sum_{i = 1}^{n} y_i\log(p) + (1 - y_i)\log(1 - p)$

This is easier to compute since we’re adding instead of taking products. To apply it to our data, we use the sum() function to compute the log likelihood. Let’s try 0.1 and 0.5 again.

obs <- c(1, 0, 1, 1, 0)
# Log Likelihood of 0.1
sum(obs*log(0.1) + (1 - obs)*log(1 - 0.1))

[1] -7.118476

# Log Likelihood of 0.5
sum(obs*log(0.5) + (1 - obs)*log(1 - 0.5))

[1] -3.465736


The log likelihood for 0.5 is higher than the log likelihood for 0.1. And of course the log likelihood for 0.6 is highest since it’s the maximum likelihood estimate.

sum(obs*log(0.6) + (1 - obs)*log(1 - 0.6))

[1] -3.365058


Again, the result itself is not important. We just want the value of p that maximizes it. It turns out that multiplying the log likelihood by -2 has desirable statistical qualities: the resulting value has an approximate chi-square distribution as our sample size gets bigger (Wilks, 1938). When we multiply by -2, we get the following value:

-2*sum(obs*log(0.6) + (1 - obs)*log(1 - 0.6))

[1] 6.730117


This is the Residual Deviance. If we analyze this data using glm(), we’ll get the same result. To do this we fit an intercept-only model using the syntax obs ~ 1.

m0 <- glm(obs ~ 1, family = binomial)
deviance(m0)

[1] 6.730117


Incidentally, if we take the inverse logit of the Intercept coefficient using the plogis() function, we get the MLE of 0.6:

plogis(coef(m0))

(Intercept) 
        0.6 


Now let’s step back and look at what we’re summing using the MLE of 0.6:

-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6))

[1] 1.021651 1.832581 1.021651 1.021651 1.832581


We have 5 elements corresponding to each observation.
If we take the square root of these, we get the absolute values of the Deviance Residuals.

sqrt(-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6)))

[1] 1.010768 1.353729 1.010768 1.010768 1.353729


Now compare these to the model-based Deviance Residuals. They’re the same in absolute value.

residuals(m0)

        1         2         3         4         5 
 1.010768 -1.353729  1.010768  1.010768 -1.353729 


To get the model-based Deviance Residuals, we multiply by the sign of the raw residual. As we mentioned before, the raw residual is the observed response (1/0) minus the predicted probability (in this case 0.6). The sign() function returns 1 or -1 depending on the sign.

sign(obs - fitted(m0)) * sqrt(-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6)))

        1         2         3         4         5 
 1.010768 -1.353729  1.010768  1.010768 -1.353729 


So we see that Deviance Residuals for binomial logistic regression are a scaled version of the components of the binomial log likelihood. In addition, since they sum to a statistic that has an approximate chi-square distribution, the components themselves can be approximated with a standard normal distribution. (Recall that a sum of n squared standard normal values has a chi-square distribution with n degrees of freedom.)

It’s important to note, however, that we should use Deviance Residuals (and Pearson Residuals) with caution when we fit subject-level data with continuous (numeric) predictors. For example, the ICU data set we used earlier is subject-level data with one row per subject. It also has a highly variable numeric predictor, age. This means we have combinations of predictors with only one observation. One such combination is age = 59 and uncons = “Yes”. We have only one subject with that combination of predictors. Therefore we should be wary of reading too much into the residual for this subject. The model is estimating the probability of dying when age = 59 and uncons = “Yes” based on just one observation.

Another approach to binary logistic regression is to use group-level data instead of subject-level data.
Group-level data contains responses that are counts of “successes” and “failures” at unique patterns of predictor variables. These are sometimes referred to as explanatory variable patterns (Bilder and Loughin, 2015). Deviance and Pearson residuals are more useful when modeling group-level data.

Let’s group the ICU data by unique combinations of predictor variables, refit the model, and compare the residuals to the subject-level model. First we convert the died variable to a numeric variable that takes values of 0 or 1. Then we sum the number of people who died grouped by age and uncons using the aggregate() function. Next we count the number of trials grouped by age and uncons. Finally we combine the results into a new data frame called ICU_grp and add a column with the proportion of “died” for each unique combination of age and uncons. In this new data frame, “died” is the number of people who died at each combination of age and uncons. Notice there are 6 subjects with age = 75 and uncons = “No”, 3 of whom died after admission to the ICU.

ICU$died <- ifelse(ICU$died == "Yes", 1, 0)
x <- aggregate(died ~ age + uncons, data = ICU, sum)
n <- aggregate(died ~ age + uncons, data = ICU, length)
ICU_grp <- data.frame(age = x$age,
                      uncons = x$uncons,
                      died = x$died,
                      trials = n$died,
                      proportion = x$died/n$died)
# look at rows 48 - 52
ICU_grp[48:52,]

   age uncons died trials proportion
48  73     No    0      4       0.00
49  74     No    0      2       0.00
50  75     No    3      6       0.50
51  76     No    1      4       0.25
52  77     No    0      6       0.00


Now we refit the model using glm(), but this time with different syntax. Instead of “died”, we use “proportion” as the response and add the argument weights = trials to provide the sample size for each proportion.

m2 <- glm(proportion ~ age + uncons, data = ICU_grp,
          family = binomial, weights = trials)

Or equivalently, we could use died/trials as the response along with the argument weights = trials.

m2 <- glm(died/trials ~ age + uncons, data = ICU_grp,
          family = binomial, weights = trials)
summary(m2)

Call:
glm(formula = died/trials ~ age + uncons, family = binomial,
    data = ICU_grp, weights = trials)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3128  -0.7111  -0.3618   0.5449   1.6639

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.45771    0.80964  -4.271 1.95e-05 ***
age          0.02778    0.01215   2.287   0.0222 *
unconsYes    3.58783    0.79350   4.522 6.14e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.266  on 75  degrees of freedom
Residual deviance: 56.584  on 73  degrees of freedom
AIC: 100.73

Number of Fisher Scoring iterations: 5


The estimated coefficients and associated hypothesis tests are the same, but the Residual Deviance is now 56.584 on 73 degrees of freedom (versus 159.48 on 197 degrees of freedom in the original model).³ This is because our model was fit to 76 “observations” instead of 200. At first glance this may seem like a disaster. Why would we throw out over half our data? But that’s not what we did. When it comes to the age and uncons variables, we have 76 unique patterns. We didn’t drop any data; we simply aggregated the number of “died” and “trials” for each unique variable combination. This results in more informative residuals.
We can demonstrate this with the Residuals vs Fitted plot. First let’s look at this plot for the original model fit to the subject-level data. We can do this by calling plot() on our model object and setting which = 1. Standardized Pearson Residuals are plotted on the y-axis versus predicted log-odds on the x-axis. This produces strange, uninformative, somewhat parallel lines. There’s not much we can infer from this plot.

plot(m, which = 1)


Now create the same plot using the model fit to the group-level data. This plot is more informative and looks similar to the type of Residuals vs Fitted plot we see when evaluating linear models.

plot(m2, which = 1)


As usual, we want to see that we’re not systematically over- or under-predicting. It appears that at the lower predicted values we’re under-fitting a bit: the predicted proportions are larger than the observed proportions. At the higher predicted values we’re over-fitting for a couple of predictor patterns: the predicted values are much larger than the observed proportions. The labeled points refer to rows in our data. We can use those to investigate the residuals. For example, looking at rows 68 and 76, we see that our model predicts a high probability of dying, but these subjects lived. Thus we have large residuals. But also notice each prediction is based on one observation. We should be cautious about reading too much into these residuals. Even though this model is based on group-level data, we still have instances of groups with just one observation.

ICU_grp$pred <- predict(m2, type = "response")
ICU_grp$deviance <- residuals(m2)
ICU_grp[c(68, 76),]

age uncons died trials proportion      pred  deviance
68  59    Yes    0      1          0 0.8543873 -1.963061
76  89    Yes    0      1          0 0.9310534 -2.312757



Earlier we mentioned that Standardized Pearson Residuals have an approximate Standard Normal distribution if the model fits. This implies that a QQ plot of residuals can provide some assessment of model fit. We can produce this plot using plot() with which = 2. Again, the plot created with the group-level model is more informative than the one created with the subject-level model.

plot(m2, which = 2)



This suggests our model is holding up OK in the middle range of predicted values but may be suspect in the extremes. It’s worth mentioning that binomial logistic regression models have no error term or normality assumption. In a standard linear model this plot assesses normality of residuals, which is one of the assumptions of the model. But in a binary logistic regression model, normality of residuals is simply evidence of a decent-fitting model. There is no assumption that residuals are random draws from a normal distribution.

We hope you now have a better understanding of Deviance and Pearson Residuals, how to interpret them, and when they’re most useful. Ideally we like to work with them when a model is based on group-level data, but with many continuous predictors, group-level data may not be possible or may result in too many unique predictor combinations.

References

• Agresti A (2002). Categorical Data Analysis, 2nd Edition. Wiley.
• Bilder C, Loughin T (2015). Analysis of Categorical Data with R. CRC Press.
• Friendly M, Meyer D (2016). Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data. CRC Press.
• Friendly M (2022). vcdExtra: ‘vcd’ Extensions and Additions. R package version 0.8-0, https://CRAN.R-project.org/package=vcdExtra.
• Harrell F (2015). Regression Modeling Strategies, 2nd ed. Springer. http://hbiostat.org/rmsc/
• R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
• Wilks S (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Ann. Math. Statist. 9(1): 60-62 (March, 1938). DOI: 10.1214/aoms/1177732360

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Research Consultant
University of Virginia Library
September 28, 2022

1. We can also calculate Standardized Deviance Residuals using rstandard() and type = "deviance". It, too, uses hatvalues to adjust the Deviance Residuals.↩︎

2. Technically, we should multiply by the number of combinations that could result in 3 heads and 2 tails (ie, the binomial coefficient) to get the true joint probability, but it doesn’t matter in the context of maximum likelihood.↩︎

3. Degrees of freedom is the number of observations minus the number of model coefficients.↩︎