If you have ever performed binary logistic regression in R using the glm()
function, you may have noticed a summary of “Deviance Residuals” at the top of the summary output. In this article we talk about how these residuals are calculated and what we can use them for. We also talk about other types of residuals available for binary logistic regression.
To begin, let's fit an example binary logistic regression model using the ICU (Intensive Care Unit) data set that's included with the vcdExtra package. Below we model the probability of dying once admitted to an Adult ICU as a function of a subject's age and whether or not they were unconscious upon admission. We also reset the row numbering of the ICU data frame to make it easier to extract rows of interest later in this article.
library(vcdExtra)
data(ICU)

# reset the row numbering
rownames(ICU) <- NULL

m <- glm(died ~ age + uncons, data = ICU, family = binomial)
summary(m)

Call:
glm(formula = died ~ age + uncons, family = binomial, data = ICU)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3128  -0.6231  -0.5140  -0.3226   2.4421  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.45771    0.80964  -4.271 1.95e-05 ***
age          0.02778    0.01215   2.287   0.0222 *  
unconsYes    3.58783    0.79350   4.522 6.14e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 159.48  on 197  degrees of freedom
AIC: 165.48

Number of Fisher Scoring iterations: 5
At the top of the output we see a section titled "Deviance Residuals" which displays a five-number summary of the residuals. We can reproduce this summary by extracting the residuals and using the quantile() function.
quantile(residuals(m))
        0%        25%        50%        75%       100% 
-2.3127570 -0.6231176 -0.5140270 -0.3225848  2.4420598 
In general, the reason we might be interested in this summary is to see how well our model is fitting the data. Residuals are the difference between what we observe and what our model predicts. It would be nice if our residuals were evenly distributed: we would like the 1Q/3Q values and Min/Max values to be about the same in absolute value, and the Median to be close to 0. In addition, we would like the Min/Max values to be less than about 3 in absolute value. This is because Deviance Residuals can be roughly approximated with a standard Normal distribution when the model holds (Agresti, 2002). Residuals greater than 3 in absolute value are in the tails of a standard Normal distribution and usually indicate strain in the model.
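To make the "standard Normal when the model holds" claim concrete, here is a small simulation sketch (not part of the ICU example; the sample size and coefficients are arbitrary choices for illustration). We generate data from a known logistic model, fit it with glm(), and check that essentially no deviance residuals exceed 3 in absolute value.

```r
# Simulation sketch: data generated from a known logistic model.
# All settings (n, coefficients) are arbitrary choices for illustration.
set.seed(1)
n <- 500
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)            # true probabilities
yy <- rbinom(n, size = 1, prob = p)    # simulated 0/1 responses
fit <- glm(yy ~ x, family = binomial)
d_sim <- residuals(fit)                # deviance residuals
quantile(d_sim)
mean(abs(d_sim) > 3)                   # should be 0 or very close to it
```

If the fitted model were badly misspecified, we would instead expect a noticeable fraction of residuals beyond 3 in absolute value.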
But recall that our response variable, died, takes two values: 0 and 1; and that our model is predicting a probability that ranges between 0 and 1. So how are we getting residuals such as -2.31 or 2.44?
It turns out there are several residuals we can calculate from a binary logistic regression model. The most basic is the Raw Residual, which we'll call \(e_i\). This is simply the difference between what we observe (\(y_i\)) and the predicted probability (\(\hat{p_i}\)). It ranges between -1 and 1.
\[e_i = y_i - \hat{p_i}\]
We can calculate these by hand. Notice we need to first convert our response to 0s and 1s. (It is stored as a factor in the ICU data frame.) When finished, we show the observed value (y), the predicted probability from the model (p_hat), and the raw residual (e) for the first few observations.
p_hat <- predict(m, type = "response")
y <- ifelse(ICU$died == "Yes", 1, 0)
e <- y - p_hat
cbind(y, p_hat, e) |> head()
  y      p_hat           e
1 0 0.06253121 -0.06253121
2 0 0.13962618 -0.13962618
3 0 0.21110650 -0.21110650
4 0 0.12375704 -0.12375704
5 0 0.26106920 -0.26106920
6 0 0.17645557 -0.17645557
We can also get raw residuals using residuals() with type = "response".

# raw residuals
e <- residuals(m, type = "response")
Another type of residual is the Pearson Residual. It is the raw residual divided by the estimated standard deviation of a binomial distribution with number of trials equal to 1 and p equal to \(\hat{p}\). The Pearson residual is basically a rescaled version of the raw residual. We'll call it \(r_i\).
\[r_i = \frac{e_i}{\sqrt{\hat{p_i}(1 - \hat{p_i})}} \]
We can also calculate this by hand or use residuals() with type = "pearson".

r <- e / sqrt(p_hat * (1 - p_hat))
# or
r <- residuals(m, type = "pearson")
cbind(y, p_hat, r) |> head()
  y      p_hat          r
1 0 0.06253121 -0.2582677
2 0 0.13962618 -0.4028467
3 0 0.21110650 -0.5172990
4 0 0.12375704 -0.3758138
5 0 0.26106920 -0.5943961
6 0 0.17645557 -0.4628861
Yet another residual is the Standardized Pearson Residual. This is the Pearson Residual adjusted for the leverage of predictors using what are called hat values. Hat values measure the distance of individual predictors from the mean of the predictors. High hat values indicate a subject or row could have outlying predictor values. This in turn could mean that a subject or row has substantial leverage in determining the predicted response. This adjustment increases the absolute value of certain residuals based on the leverage of the associated predictors. We'll call this residual \(rs_i\).
\[rs_i = \frac{r_i}{\sqrt{1 - \hat{h_i}}} \]
We can calculate this by hand using the hatvalues() function, or use the rstandard() function with type = "pearson".

rs <- r / sqrt(1 - hatvalues(m))
# or
rs <- rstandard(m, type = "pearson")
cbind(y, p_hat, rs) |> head()
  y      p_hat         rs
1 0 0.06253121 -0.2601841
2 0 0.13962618 -0.4040353
3 0 0.21110650 -0.5202722
4 0 0.12375704 -0.3770517
5 0 0.26106920 -0.6014345
6 0 0.17645557 -0.4645071
An interesting property of Standardized Pearson Residuals is that they have an approximate Standard Normal distribution if the model fits (Agresti, 2002). This can be useful for assessing model fit as we demonstrate later in the article.
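As a rough sketch of how this property can be used (with simulated group-level data, not the ICU model; all settings below are invented for illustration): when the model is correct, roughly 5% of standardized Pearson residuals should exceed 2 in absolute value, and values beyond 3 should be rare.

```r
# Simulation sketch with grouped binomial data, where the Normal
# approximation is better. All settings are arbitrary illustrations.
set.seed(2)
g <- 50                                   # number of groups
xg <- runif(g)
trials <- rep(20, g)                      # 20 trials per group
pg <- plogis(-1 + 2 * xg)                 # true model
deaths <- rbinom(g, size = trials, prob = pg)
fit_g <- glm(cbind(deaths, trials - deaths) ~ xg, family = binomial)
rs_g <- rstandard(fit_g, type = "pearson")
mean(abs(rs_g) > 2)                       # roughly 0.05 if the model fits
```

A much larger proportion would suggest the model is straining to fit the data.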
Finally we come to Deviance Residuals, which we'll call \(d_i\). The formula is a weird-looking expression derived from the likelihood ratio test for comparing logistic regression models, where we compare our current model to a saturated model. A saturated model is a model with as many coefficients as there are observations. The formula for calculating this test statistic for a single observation produces the Deviance Residual. Notice the sign of \(d_i\) is the same as that of \(e_i\).
\[d_i = \text{sign}(e_i) [-2(y_i \text{log}\hat{p_i} + (1 - y_i)\text{log}(1 - \hat{p_i}))]^{1/2}\]

We can calculate this by hand or just use the residuals() function. We use the sign() function to get the sign of \(e_i\).

d <- sign(e) * sqrt(-2*(y*log(p_hat) + (1 - y)*log(1 - p_hat)))
# or
d <- residuals(m)
cbind(y, p_hat, d) |> head()
  y      p_hat          d
1 0 0.06253121 -0.3593656
2 0 0.13962618 -0.5484311
3 0 0.21110650 -0.6886566
4 0 0.12375704 -0.5140270
5 0 0.26106920 -0.7778830
6 0 0.17645557 -0.6231176
Again, these are the Deviance Residuals we see summarized in the model summary output. ^{1}
If we square the Deviance Residuals and add them all up, we get the Residual Deviance statistic we see printed at the bottom of the summary output:
sum(d^2)
[1] 159.4783
We can also extract this from the model object with the deviance()
function.
deviance(m)
[1] 159.4783
This test statistic can help us assess whether our model differs from the saturated model. The null hypothesis of the test is that there is no difference between the models. The LRstats() function from the vcdExtra package runs this test for us. We fail to reject the null hypothesis in this case.
vcdExtra::LRstats(m)
Likelihood summary table:
     AIC    BIC LR Chisq  Df Pr(>Chisq)
m 165.48 175.37   159.48 197     0.9768
Deviance Residuals measure how much probabilities estimated from our model differ from the observed proportions of successes. Bigger values indicate a bigger difference. Smaller values mean a smaller difference.
The smallest Deviance Residual is in row 23. The observed response is 0 (did not die) and the model predicts a small probability of dying. Hence we get a small Deviance Residual (i.e., close to 0).
i <- which.min(abs(residuals(m))) # 23
ICU$died[i]
[1] No
Levels: No Yes
p_hat[i]
        23 
0.04683521 
residuals(m)[i]
        23 
-0.3097337 
The largest Deviance Residual is in row 165. The observed response is a 1 (did die), but the model predicts a small probability of dying. Thus we have a large Deviance Residual.
i <- which.max(abs(residuals(m))) # 165
ICU$died[i]
[1] Yes
Levels: No Yes
p_hat[i]
       165 
0.05070007 
residuals(m)[i]
    165 
2.44206 
We mentioned the formula for Deviance Residuals is a bit weird. It uses log transformations and addition. Why is that? To better understand this we need to talk about Maximum Likelihood. Consider the following toy example of flipping coins.
Let’s say we have a coin that lands heads 40% of the time (p = 0.4) and flip it 5 times (1 = heads, 0 = tails). We observe 1, 0, 1, 1, 0. The joint probability of getting this result is calculated as follows: ^{2}
p <- 0.4
p * (1 - p) * p * p * (1 - p)
[1] 0.02304
Or written more compactly using exponents:
(p^3)*(1 - p)^2
[1] 0.02304
What if we didn’t know p? Our formula would look as follows:
\[\text{joint probability} = p^3 (1 - p)^2\]
One approach to finding p would be to try different values and select the one that maximizes the joint probability. In other words, what value of p would make the data we observed most likely? Let’s try 0.1 and 0.5.
p <- c(0.1, 0.5)
(p^3)*(1 - p)^2
[1] 0.00081 0.03125
It looks like 0.5 is more likely than 0.1 because the joint probability is higher. Fortunately we don't need to use trial and error. It can be shown that the value that maximizes the joint probability is simply the number of 1s divided by the number of trials: 3/5 or 0.6. So the most likely value of p for the binomial data we observed is 0.6. Sometimes this concept is presented visually by plotting the joint probability versus various values of p. Below we see the peak of the curve is at p = 0.6.
p <- seq(0.01, 0.99, 0.01)
y <- (p^3)*(1 - p)^2
plot(p, y, type = "l")
abline(v = 0.6, col = "red")
The value of the joint probability on the yaxis doesn’t really matter. We just want to know what p on the xaxis corresponds with the maximum value.
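We can also confirm the location of the peak numerically with base R's optimize() function (this check is our addition, not part of the original example):

```r
# Numerically maximize the joint probability p^3 * (1 - p)^2 over (0, 1)
opt <- optimize(function(p) (p^3) * (1 - p)^2,
                interval = c(0, 1), maximum = TRUE)
opt$maximum    # approximately 0.6, matching the analytic result 3/5
```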
In the general case, we refer to joint probability as likelihood (L), with number of successes (s) and total number of trials (n).
\[L = p^s(1 - p)^{n-s}\]
If we log transform both sides of this equation we get log likelihood.
\[\log L = s\log(p) + (n - s)\log(1 - p)\]
We can also write this in terms of individual observations where \(y_i\) takes a value of 0 or 1. Notice this is similar to the formula we presented earlier for Deviance Residuals with \(n = 1\) and s equal to \(y_i\).
\[\log L = \sum_{i = 1}^{n} y_i\log(p) + (1 - y_i)\log(1 - p)\]
This is easier to compute since we’re adding instead of taking products. To use with our data, we use the sum()
function to compute the log likelihood. Let’s try 0.1 and 0.5 again.
obs <- c(1, 0, 1, 1, 0)
# Log Likelihood of 0.1
sum(obs*log(0.1) + (1 - obs)*log(1 - 0.1))
[1] -7.118476
# Log Likelihood of 0.5
sum(obs*log(0.5) + (1 - obs)*log(1 - 0.5))
[1] -3.465736
The log likelihood for 0.5 is higher than the log likelihood for 0.1. And of course the log likelihood for 0.6 is highest since it’s the maximum likelihood estimate.
sum(obs*log(0.6) + (1 - obs)*log(1 - 0.6))
[1] -3.365058
Again, the result is not important. We just want the value of p that maximizes the result.
It turns out that multiplying this by -2 has desirable statistical qualities. The resulting value has a chi-square distribution as our sample size gets bigger (Wilks, 1938). When we multiply by -2, we get the following value:
-2*sum(obs*log(0.6) + (1 - obs)*log(1 - 0.6))
[1] 6.730117
This is the Residual Deviance. If we analyze this data using glm(), we'll get the same result. To do this we fit an intercept-only model using the syntax obs ~ 1.

m0 <- glm(obs ~ 1, family = binomial)
deviance(m0)
[1] 6.730117
Incidentally, if we take the inverse logit of the Intercept coefficient, using the plogis() function, we get the MLE of 0.6:

plogis(coef(m0))
(Intercept) 
        0.6 
Now let’s step back and look at what we’re summing using the MLE of 0.6:
-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6))
[1] 1.021651 1.832581 1.021651 1.021651 1.832581
We have 5 elements corresponding to each observation. If we take the square root of these we get the absolute values of the Deviance Residuals.
sqrt(-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6)))
[1] 1.010768 1.353729 1.010768 1.010768 1.353729
Now compare these to the modelbased Deviance Residuals. They’re the same in absolute value.
residuals(m0)
        1         2         3         4         5 
 1.010768 -1.353729  1.010768  1.010768 -1.353729 
To get the model-based Deviance Residuals, we multiply by the sign of the raw residual. As we mentioned before, the raw residual is the observed response (1/0) minus the predicted probability (in this case 0.6). The sign() function returns 1 or -1 depending on the sign.

sign(obs - fitted(m0)) * sqrt(-2 * (obs*log(0.6) + (1 - obs)*log(1 - 0.6)))
        1         2         3         4         5 
 1.010768 -1.353729  1.010768  1.010768 -1.353729 
So we see that Deviance Residuals for binomial logistic regression are a scaled version of the components of the binomial log likelihood. In addition, since they sum to a statistic that has an approximate chi-square distribution, the components themselves can be approximated with a standard normal distribution. (Recall a sum of n squared standard normal values has a chi-square distribution with n degrees of freedom.)
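The parenthetical fact is easy to check with a quick simulation (the settings below are arbitrary): sums of n squared standard normal draws should average about n, the mean of a chi-square distribution with n degrees of freedom.

```r
# Simulation sketch: a sum of n squared standard normal draws behaves
# like a chi-square random variable with n degrees of freedom.
set.seed(3)
n_df <- 5
sums <- replicate(1e4, sum(rnorm(n_df)^2))
mean(sums)    # close to 5, the mean of a chi-square with 5 df
```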
Now it's important to note that we should use Deviance Residuals (and Pearson Residuals) with caution when we fit subject-level data with continuous (numeric) predictors. For example, the ICU data set we used earlier is subject-level data with one row per subject. It also has a highly variable numeric predictor, age. This means we have combinations of predictors with only one observation. One such combination is age = 59 and uncons = "Yes". We have only one subject with that combination of predictors. Therefore we should be wary of reading too much into the residual for this subject. The model is estimating the probability of dying when age = 59 and uncons = "Yes" based on just one observation.
Another approach to binary logistic regression is to use group-level data instead of subject-level data. Group-level data contains responses that are counts of "successes" and "failures" at unique patterns of predictor variables. These are sometimes referred to as explanatory variable patterns (Bilder and Loughin, 2015). Deviance and Pearson residuals are more useful when modeling group-level data.
Let’s group the ICU data by unique combinations of predictor variables, refit the model, and compare the residuals to the subjectlevel model.
First we convert the died variable to a numeric variable that takes values of 0 or 1. Then we sum the number of people who died grouped by age and uncons using the aggregate()
function. Next we count the number of trials grouped by age and uncons. Finally we combine the results into a new data frame called ICU_grp and add a column with the proportion of “died” for each unique combination of age and uncons. In this new data frame, “died” is the number of people who died at each combination of age and uncons. Notice there are 6 subjects with age = 75 and uncons = “No”, 3 of which died after admission to the ICU.
ICU$died <- ifelse(ICU$died == "Yes", 1, 0)
x <- aggregate(died ~ age + uncons, data = ICU, sum)
n <- aggregate(died ~ age + uncons, data = ICU, length)
ICU_grp <- data.frame(age = x$age, 
                      uncons = x$uncons, 
                      died = x$died, 
                      trials = n$died, 
                      proportion = x$died/n$died)

# look at rows 48 - 52
ICU_grp[48:52,]
   age uncons died trials proportion
48  73     No    0      4       0.00
49  74     No    0      2       0.00
50  75     No    3      6       0.50
51  76     No    1      4       0.25
52  77     No    0      6       0.00
Now we refit the model using glm()
, but this time we use different syntax. Instead of “died”, we use “proportion” as the response and add the argument weights = trials
to provide the sample size for each proportion.
m2 <- glm(proportion ~ age + uncons, data = ICU_grp, 
          family = binomial, weights = trials)
Or equivalently, we could use died/trials
as the response along with the argument weights = trials
.
m2 <- glm(died/trials ~ age + uncons, data = ICU_grp, 
          family = binomial, weights = trials)
summary(m2)

Call:
glm(formula = died/trials ~ age + uncons, family = binomial, 
    data = ICU_grp, weights = trials)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3128  -0.7111  -0.3618   0.5449   1.6639  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.45771    0.80964  -4.271 1.95e-05 ***
age          0.02778    0.01215   2.287   0.0222 *  
unconsYes    3.58783    0.79350   4.522 6.14e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.266  on 75  degrees of freedom
Residual deviance: 56.584  on 73  degrees of freedom
AIC: 100.73

Number of Fisher Scoring iterations: 5
The estimated coefficients and associated hypothesis tests are the same, but the Residual Deviance is now 56.584 on 73 degrees of freedom (versus 159.48 on 197 degrees of freedom in the original model).^{3} This is because our model was fit to 76 “observations” instead of 200. At first glance this may seem like a disaster. Why would we throw out over half our data? But that’s not what we did. When it comes to the age and uncons variables, we have 76 unique patterns. We didn’t drop any data, we simply aggregated the number of died and “trials” for each unique variable combination. This results in more informative residuals. We can demonstrate this with the Residuals vs Fitted plot.
First let's look at this plot for the original model fit to the subject-level data. We can do this by calling plot() on our model object and setting which = 1. Standardized Pearson Residuals are plotted on the y-axis versus predicted log-odds on the x-axis. This produces strange, uninformative, somewhat parallel lines. There's not much we can infer from this plot.
plot(m, which = 1)
Now create the same plot using the model fit to the group-level data. This plot is more informative and looks similar to the type of Residuals vs Fitted plot we see when evaluating linear models.
plot(m2, which = 1)
As usual we want to see that we're not systematically over- or under-predicting. It appears that at the lower predicted values we're underfitting a bit: the predicted proportions are larger than the observed proportions. At the higher predicted values we're overfitting for a couple of predictor patterns: the predicted values are much larger than the observed proportions.
The labeled points refer to rows in our data. We can use those to investigate the residuals. For example, looking at rows 68 and 76 we see that our model predicts high probability of dying, but these subjects lived. Thus we have large residuals. But also notice each prediction is based on one observation. We should be cautious reading too much into these residuals. Even though this model is based on grouplevel data we still have instances of groups with just one observation.
ICU_grp$pred <- predict(m2, type = "response")
ICU_grp$deviance <- residuals(m2)
ICU_grp[c(68, 76),]
   age uncons died trials proportion      pred  deviance
68  59    Yes    0      1          0 0.8543873 -1.963061
76  89    Yes    0      1          0 0.9310534 -2.312757
Earlier we mentioned that Standardized Pearson Residuals have an approximate Standard Normal distribution if the model fits. This implies that a QQ plot of residuals can provide some assessment of model fit. We can produce this plot using plot() with which = 2. Again, the plot created with the group-level model is more informative than the plot created with the subject-level model.
plot(m2, which = 2)
This suggests our model is holding up OK in the middle range of predicted values but is perhaps suspect at the extremes. It's worth mentioning that binomial logistic regression models have no error term or normality assumptions. In a standard linear model this plot assesses normality of residuals, which is one of the assumptions of a linear model. But in a binary logistic regression model, normality of residuals is simply evidence of a decent-fitting model. There is no assumption that residuals are random draws from a normal distribution.
We hope you now have a better understanding of Deviance and Pearson Residuals, how to interpret them, and when they're most useful. Ideally, we like to work with them when a model is based on group-level data, but with many continuous predictors, group-level data may not be possible or may result in too many unique predictor combinations.
References
Agresti A (2002). Categorical Data Analysis, 2nd Edition. Wiley.
Bilder C, Loughin T (2015). Analysis of Categorical Data with R. CRC Press.
Friendly M, Meyer D (2016). Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data. CRC Press.
Friendly M (2022). vcdExtra: 'vcd' Extensions and Additions. R package version 0.8-0, https://CRAN.R-project.org/package=vcdExtra.
Harrell F (2015). Regression Modeling Strategies, 2nd ed. Springer. http://hbiostat.org/rmsc/
R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Wilks S (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Ann. Math. Statist. 9(1): 60-62 (March 1938). DOI: 10.1214/aoms/1177732360
For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu
View the entire collection of UVA Library StatLab articles.
Clay Ford
Statistical Research Consultant
University of Virginia Library
September 28, 2022

We can also calculate Standardized Deviance Residuals using rstandard() and type = "deviance". It, too, uses hat values to adjust the Deviance Residuals.↩︎
Technically, we should multiply by the number of combinations that could result in 3 heads and 2 tails (i.e., the binomial coefficient) to get the true joint probability, but it doesn't matter in the context of maximum likelihood.↩︎

Degrees of freedom is the number of observations minus the number of model coefficients.↩︎
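A quick check of this footnote with a toy model (the simulated data below are made up purely for illustration):

```r
# Toy model: residual degrees of freedom = observations - coefficients
set.seed(4)
toy <- data.frame(x = rnorm(30), y = rbinom(30, size = 1, prob = 0.5))
fit_toy <- glm(y ~ x, data = toy, family = binomial)
df.residual(fit_toy) == nrow(toy) - length(coef(fit_toy))   # TRUE
```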