Understanding Ordered Factors in a Linear Model

Consider the following data from the text Design and Analysis of Experiments, 7 ed (Montgomery, Table 3.1). It has two variables: power and rate. Power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. Rate is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate. Below we enter the small data set by hand into R and create a quick plot. The plot suggests higher power leads to higher etch rates.

# Table 3.1
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725,
          542, 593, 651, 700, 
          530, 590, 610, 715,
          539, 579, 637, 685,
          570, 610, 629, 710)
etch <- data.frame(rate, power)
stripchart(rate ~ power, data = etch, pch = 1,
           vertical = TRUE, xlab = "power")
		   

A formal statistical analysis of this data requires a linear model. To perform the analysis in R we need to define the power variable as a factor. This tells R that power is a categorical variable and to treat it as such in a modeling procedure. But in addition, we may want to further specify that the levels of power are ordered. Power level 160 is less than 180, and 180 is less than 200, and so on. This is additional information we may want to account for. We can define power as an ordered factor in R using the ordered function. We do that below and save the ordered factor version as “powerF”. Notice that calling head to view the first 6 values of powerF shows us the ordering of the levels: 160 < 180 < 200 < 220

etch$powerF <- ordered(etch$power)
head(etch$powerF)
## [1] 160 180 200 220 160 180
## Levels: 160 < 180 < 200 < 220

It’s important to note that R is storing powerF internally as integers, which we can see by calling unclass or as.integer on the variable. (We’ll make use of this fact in the next plot we create.)

unclass(etch$powerF)
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
## attr(,"levels")
## [1] "160" "180" "200" "220"

Now let’s formally analyze the data using the lm function in R. This is equivalent to performing a one-way ANOVA. Below we model rate as a function of the ordered factor “powerF” and pipe the result in the summary function using base R’s native forward pipe operator, |>, introduced in version 4.1.0.

lm(rate ~ powerF, data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ powerF, data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  617.750      4.085 151.234  < 2e-16 ***
## powerF.L     113.011      8.169  13.833 2.56e-10 ***
## powerF.Q      22.700      8.169   2.779   0.0134 *  
## powerF.C       9.347      8.169   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

At the bottom of the output is the familiar ANOVA test result. The F-statistic is large and the p-value is tiny which provides evidence that the mean rates between the four power levels are different. (We have statistically sanctified what we saw in the plot.) In addition the adjusted R-squared suggests power explains about 91% of the variability we’re seeing in the etch rates. But what about the rest? What are these mysterious “power.L”, “power.Q”, and “power.C” labels? What do their estimates mean? What are the hypothesis tests associated with them?

The quick answer to these questions is that the L, Q and C stand for Linear, Quadratic, and Cubic; that the coefficients don’t have a ready interpretation; and that each hypothesis is a test for trend.

The “power.L” entry is a test for linear trend. The null is no linear trend (ie, a flat line). The large t value provides evidence against this. Again, this is not a surprise. Looking at the plot above we can visualize a straight line over the points.

The “power.Q” entry is a test for quadratic trend. The null is no quadratic trend (ie, a straight line). The t value of 2.8 provides some evidence against this. This is less obvious in the plot, though we can perhaps imagine a slight curve superimposed over the data.

The “power.C” entry is a test for cubic trend. The null is no cubic trend (ie, a straight or quadratic line). The t value of 1.1 is small and the p-value is quite large. There doesn’t appear to be any cubic trend in the data.

We can visualize these trends using ggplot. Below we call geom_smooth three times to plot linear, quadratic and cubic lines using lm. Notice we have to unclass the powerF variable to make this work. Defining the colors as “1”, “2” and “3” is an easy way to force ggplot to create a legend and place the labels in that order. We notice there isn’t much difference between the fitted quadratic and cubic lines, visually demonstrating why the cubic trend test was not “significant”.

library(ggplot2)
ggplot(etch) +
  aes(x = powerF, y = rate) +
  geom_point(shape = 1) +
  geom_smooth(aes(x = unclass(powerF), color = "1"), 
              formula = y ~ x, 
              method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "2"), 
              formula = y ~ poly(x, 2), 
              method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "3"), 
              formula = y ~ poly(x, 3), 
              method = lm, se = FALSE) +
  scale_color_discrete("Trend", labels = c("linear", "quadratic", "cubic")) +
  theme_classic()
  

If you’re thinking these tests for trend look a lot like a polynomial model, you’re right. That’s basically what it is. In fact we get identical test results (ie, same test statistics and p-values) if we simply fit a 3-degree polynomial model to the original numeric power variable.

lm(rate ~ poly(power, 3), data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ poly(power, 3), data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      617.750      4.085 151.234  < 2e-16 ***
## poly(power, 3)1  252.700     18.267  13.833 2.56e-10 ***
## poly(power, 3)2   50.759     18.267   2.779   0.0134 *  
## poly(power, 3)3   20.900     18.267   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

We also get the same test results if we unclass the ordered factor version of power and fit a 3-degree polynomial model.

lm(rate ~ poly(unclass(powerF), 3), data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ poly(unclass(powerF), 3), data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                617.750      4.085 151.234  < 2e-16 ***
## poly(unclass(powerF), 3)1  252.700     18.267  13.833 2.56e-10 ***
## poly(unclass(powerF), 3)2   50.759     18.267   2.779   0.0134 *  
## poly(unclass(powerF), 3)3   20.900     18.267   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

Did you notice the coefficients and standard errors of these two models differ from the first model with the ordered factor? That’s because the ordered factor model uses a contrast. A contrast is a matrix that transforms a series of 0/1 dummy variables into columns that can be estimated in a modeling routine. The default contrast for ordered factors in R is the polynomial contrast. We can see the contrast R uses by calling the contr.poly function. Simply tell it how many levels your categorical variable has. In our case we have 4.

contr.poly(4)
##              .L   .Q         .C
## [1,] -0.6708204  0.5 -0.2236068
## [2,] -0.2236068 -0.5  0.6708204
## [3,]  0.2236068 -0.5 -0.6708204
## [4,]  0.6708204  0.5  0.2236068

That’s the contrast used to convert dummy variables of an ordered factor into a model matrix. Let’s demonstrate. Below we generate a 4-column matrix of dummy variables for the power variable. There is one row per observation and one column per level of power. A 1 in a row identifies membership in one of the four power categories.

Xpower <- sapply(c(160, 180, 200, 220), 
               function(x)ifelse(etch$power==x, 1, 0))
colnames(Xpower) <- c("160", "180", "200", "220")
head(Xpower)
##      160 180 200 220
## [1,]   1   0   0   0
## [2,]   0   1   0   0
## [3,]   0   0   1   0
## [4,]   0   0   0   1
## [5,]   1   0   0   0
## [6,]   0   1   0   0

Now we multiply the dummy variable matrix by the polynomial contrast to obtain a model matrix that is estimable.

Xpower %*% contr.poly(4)
##               .L   .Q         .C
##  [1,] -0.6708204  0.5 -0.2236068
##  [2,] -0.2236068 -0.5  0.6708204
##  [3,]  0.2236068 -0.5 -0.6708204
##  [4,]  0.6708204  0.5  0.2236068
##  [5,] -0.6708204  0.5 -0.2236068
##  [6,] -0.2236068 -0.5  0.6708204
##  [7,]  0.2236068 -0.5 -0.6708204
##  [8,]  0.6708204  0.5  0.2236068
##  [9,] -0.6708204  0.5 -0.2236068
## [10,] -0.2236068 -0.5  0.6708204
## [11,]  0.2236068 -0.5 -0.6708204
## [12,]  0.6708204  0.5  0.2236068
## [13,] -0.6708204  0.5 -0.2236068
## [14,] -0.2236068 -0.5  0.6708204
## [15,]  0.2236068 -0.5 -0.6708204
## [16,]  0.6708204  0.5  0.2236068
## [17,] -0.6708204  0.5 -0.2236068
## [18,] -0.2236068 -0.5  0.6708204
## [19,]  0.2236068 -0.5 -0.6708204
## [20,]  0.6708204  0.5  0.2236068

Notice this is the same model.matrix that is used when we model the ordered factor with lm. (The column of ones is for the intercept.)

lm(rate ~ powerF, data = etch) |> model.matrix()
##    (Intercept)   powerF.L powerF.Q   powerF.C
## 1            1 -0.6708204      0.5 -0.2236068
## 2            1 -0.2236068     -0.5  0.6708204
## 3            1  0.2236068     -0.5 -0.6708204
## 4            1  0.6708204      0.5  0.2236068
## 5            1 -0.6708204      0.5 -0.2236068
## 6            1 -0.2236068     -0.5  0.6708204
## 7            1  0.2236068     -0.5 -0.6708204
## 8            1  0.6708204      0.5  0.2236068
## 9            1 -0.6708204      0.5 -0.2236068
## 10           1 -0.2236068     -0.5  0.6708204
## 11           1  0.2236068     -0.5 -0.6708204
## 12           1  0.6708204      0.5  0.2236068
## 13           1 -0.6708204      0.5 -0.2236068
## 14           1 -0.2236068     -0.5  0.6708204
## 15           1  0.2236068     -0.5 -0.6708204
## 16           1  0.6708204      0.5  0.2236068
## 17           1 -0.6708204      0.5 -0.2236068
## 18           1 -0.2236068     -0.5  0.6708204
## 19           1  0.2236068     -0.5 -0.6708204
## 20           1  0.6708204      0.5  0.2236068
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$powerF
## [1] "contr.poly"

The latter two models that explicitly modeled a 3-degree polynomial did not use a contrast. Instead the actual power values were squared and cubed.

Is there an advantage to using one over the other? Maybe we should have just used polynomial regression instead of creating an ordered factor? Fox and Weisberg (2019) tell us contr.poly is appropriate when the levels of your ordered factor are equally spaced and balanced. That is the case for the power variable. Each level is separated by 20 units and we have five observations for each power level. If we have an ordered factor that does not have equal spacing and/or is not balanced, we’re better off performing polynomial regression via the poly function.

It’s worth noting that even though we have power saved as an ordered factor and modeled with a polynomial contrast, we can still perform follow-up analyses usually associated with standard unordered categorical variables. For example, using the emmeans package we can make pairwise comparisons.

library(emmeans)
lm(rate ~ powerF, data = etch) |> 
  emmeans(specs = "powerF") |>
  pairs()
##  contrast  estimate   SE df t.ratio p.value
##  160 - 180    -36.2 11.6 16  -3.133 0.0294 
##  160 - 200    -74.2 11.6 16  -6.422 <.0001 
##  160 - 220   -155.8 11.6 16 -13.485 <.0001 
##  180 - 200    -38.0 11.6 16  -3.289 0.0216 
##  180 - 220   -119.6 11.6 16 -10.352 <.0001 
##  200 - 220    -81.6 11.6 16  -7.063 <.0001 
## 
## P value adjustment: tukey method for comparing a family of 4 estimates

Hopefully you now have a better understanding of what’s happening in R when you set a factor as ordered and use it in a linear model.

References

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Research Consultant
University of Virginia Library
July 1, 2021