Consider the following data from the text *Design and Analysis of Experiments, 7th ed.* (Montgomery, Table 3.1). It has two variables: power and rate. Power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. Rate is the etch rate, measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate. Below we enter the small data set by hand into R and create a quick plot. The plot suggests higher power leads to higher etch rates.

```r
# Table 3.1
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725, 542, 593, 651, 700, 530, 590, 610, 715,
          539, 579, 637, 685, 570, 610, 629, 710)
etch <- data.frame(rate, power)
stripchart(rate ~ power, data = etch, pch = 1, vertical = TRUE,
           xlab = "power")
```

A formal statistical analysis of this data requires a linear model. To perform the analysis in R we need to define the power variable as a *factor*. This tells R that power is a categorical variable and to treat it as such in a modeling procedure. In addition, we may want to further specify that the levels of power are *ordered*: power level 160 is less than 180, 180 is less than 200, and so on. This is additional information we may want to account for. We can define power as an *ordered factor* in R using the `ordered` function. We do that below and save the ordered factor version as “powerF”. Notice that calling `head` to view the first 6 values of powerF shows us the ordering of the levels: `160 < 180 < 200 < 220`

```r
etch$powerF <- ordered(etch$power)
head(etch$powerF)
## [1] 160 180 200 220 160 180
## Levels: 160 < 180 < 200 < 220
```

It’s important to note that R is storing powerF internally as integers, which we can see by calling `unclass` or `as.integer` on the variable. (We’ll make use of this fact in the next plot we create.)

```r
unclass(etch$powerF)
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
## attr(,"levels")
## [1] "160" "180" "200" "220"
```

Now let’s formally analyze the data using the `lm` function in R. This is equivalent to performing a one-way ANOVA. Below we model rate as a function of the ordered factor “powerF” and pipe the result into the `summary` function using base R’s native forward pipe operator, `|>`, introduced in version 4.1.0.

```r
lm(rate ~ powerF, data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ powerF, data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  617.750      4.085 151.234  < 2e-16 ***
## powerF.L     113.011      8.169  13.833 2.56e-10 ***
## powerF.Q      22.700      8.169   2.779   0.0134 *  
## powerF.C       9.347      8.169   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09
```
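As an aside, the overall F-test at the bottom of this summary is the same one we would get from a one-way ANOVA table, which collapses the three 1-df coefficient tests into a single 3-df test. A quick sketch, re-entering the data from above so the code is self-contained:

```r
# recreate the etch data and ordered factor from above
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725, 542, 593, 651, 700, 530, 590, 610, 715,
          539, 579, 637, 685, 570, 610, 629, 710)
etch <- data.frame(rate, power)
etch$powerF <- ordered(etch$power)

# one-way ANOVA table: F = 66.8 on 3 and 16 DF, as in the summary output
anova(lm(rate ~ powerF, data = etch))
```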

At the bottom of the output is the familiar ANOVA test result. The F-statistic is large and the p-value is tiny, which provides evidence that the mean rates of the four power levels are different. (We have statistically sanctified what we saw in the plot.) In addition, the adjusted R-squared suggests power explains about 91% of the variability we’re seeing in the etch rates. But what about the rest of the output? What are these mysterious “powerF.L”, “powerF.Q”, and “powerF.C” labels? What do their estimates mean? What are the hypothesis tests associated with them?

The quick answer to these questions is that the L, Q and C stand for Linear, Quadratic, and Cubic; that the coefficients don’t have a ready interpretation; and that each hypothesis is a test for *trend*.

The “powerF.L” entry is a test for *linear trend*. The null is no linear trend (ie, a flat line). The large t value provides evidence against this. Again, this is not a surprise: looking at the plot above we can visualize a straight line over the points.

The “powerF.Q” entry is a test for *quadratic trend*. The null is no quadratic trend (ie, a straight line). The t value of 2.8 provides some evidence against this. This is less obvious in the plot, though we can perhaps imagine a slight curve superimposed over the data.

The “powerF.C” entry is a test for *cubic trend*. The null is no cubic trend (ie, a straight or quadratic line). The t value of 1.1 is small and the p-value is quite large. There doesn’t appear to be any cubic trend in the data.
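To see where these numbers come from, we can reproduce the linear trend estimate and its standard error by hand: the estimate is the linear contrast weights applied to the four group means, and the standard error is the residual standard error scaled by the contrast weights and the per-group sample size. A sketch, re-entering the data from above so the code is self-contained:

```r
# recreate the etch data and ordered factor from above
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725, 542, 593, 651, 700, 530, 590, 610, 715,
          539, 579, 637, 685, 570, 610, 629, 710)
etch <- data.frame(rate, power)
etch$powerF <- ordered(etch$power)

w <- contr.poly(4)[, ".L"]                  # linear contrast weights
m <- tapply(etch$rate, etch$powerF, mean)   # four group means
sum(w * m)                                  # estimate: 113.011
s <- sigma(lm(rate ~ powerF, data = etch))  # residual standard error
s * sqrt(sum(w^2) / 5)                      # std. error: 8.169 (5 obs per group)
```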

We can visualize these trends using ggplot. Below we call `geom_smooth` three times to plot linear, quadratic and cubic lines using `lm`. Notice we have to `unclass` the powerF variable to make this work. Defining the colors as “1”, “2” and “3” is an easy way to force ggplot to create a legend and place the labels in that order. We notice there isn’t much difference between the fitted quadratic and cubic lines, visually demonstrating why the cubic trend test was not “significant”.

```r
library(ggplot2)
ggplot(etch) +
  aes(x = powerF, y = rate) +
  geom_point(shape = 1) +
  geom_smooth(aes(x = unclass(powerF), color = "1"),
              formula = y ~ x, method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "2"),
              formula = y ~ poly(x, 2), method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "3"),
              formula = y ~ poly(x, 3), method = lm, se = FALSE) +
  scale_color_discrete("Trend", labels = c("linear", "quadratic", "cubic")) +
  theme_classic()
```

If you’re thinking these tests for trend look a lot like a polynomial model, you’re right. That’s basically what it is. In fact we get identical test results (ie, same test statistics and p-values) if we simply fit a 3-degree polynomial model to the original numeric power variable.

```r
lm(rate ~ poly(power, 3), data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ poly(power, 3), data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      617.750      4.085 151.234  < 2e-16 ***
## poly(power, 3)1  252.700     18.267  13.833 2.56e-10 ***
## poly(power, 3)2   50.759     18.267   2.779   0.0134 *  
## poly(power, 3)3   20.900     18.267   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09
```

We also get the same test results if we `unclass` the ordered factor version of power and fit a 3-degree polynomial model.

```r
lm(rate ~ poly(unclass(powerF), 3), data = etch) |> summary()
## 
## Call:
## lm(formula = rate ~ poly(unclass(powerF), 3), data = etch)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -25.4  -13.0    2.8   13.2   25.6 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                617.750      4.085 151.234  < 2e-16 ***
## poly(unclass(powerF), 3)1  252.700     18.267  13.833 2.56e-10 ***
## poly(unclass(powerF), 3)2   50.759     18.267   2.779   0.0134 *  
## poly(unclass(powerF), 3)3   20.900     18.267   1.144   0.2694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 16 degrees of freedom
## Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
## F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09
```

Did you notice the coefficients and standard errors of these two models differ from the first model with the ordered factor? That’s because the ordered factor model uses a *contrast*. A contrast is a matrix that transforms a set of 0/1 dummy variables into columns that can be estimated in a modeling routine. The default contrast for ordered factors in R is the *polynomial contrast*. We can see the contrast R uses by calling the `contr.poly` function. Simply tell it how many levels your categorical variable has. In our case we have 4.

```r
contr.poly(4)
##              .L   .Q         .C
## [1,] -0.6708204  0.5 -0.2236068
## [2,] -0.2236068 -0.5  0.6708204
## [3,]  0.2236068 -0.5 -0.6708204
## [4,]  0.6708204  0.5  0.2236068
```
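The columns of this matrix are worth a closer look: each column sums to zero, and the columns are orthonormal, which (with balanced data like ours) is what makes the estimated trend coefficients uncorrelated with one another. A quick check:

```r
cp <- contr.poly(4)
colSums(cp)              # each column sums to zero (up to rounding error)
zapsmall(crossprod(cp))  # columns are orthonormal: the identity matrix
```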

That’s the contrast used to convert dummy variables of an ordered factor into a model matrix. Let’s demonstrate. Below we generate a 4-column matrix of dummy variables for the power variable. There is one row per observation and one column per level of power. A 1 in a row identifies membership in one of the four power categories.

```r
Xpower <- sapply(c(160, 180, 200, 220),
                 function(x) ifelse(etch$power == x, 1, 0))
colnames(Xpower) <- c("160", "180", "200", "220")
head(Xpower)
##      160 180 200 220
## [1,]   1   0   0   0
## [2,]   0   1   0   0
## [3,]   0   0   1   0
## [4,]   0   0   0   1
## [5,]   1   0   0   0
## [6,]   0   1   0   0
```

Now we multiply the dummy variable matrix by the polynomial contrast to obtain a model matrix that is estimable.

```r
Xpower %*% contr.poly(4)
##               .L   .Q         .C
##  [1,] -0.6708204  0.5 -0.2236068
##  [2,] -0.2236068 -0.5  0.6708204
##  [3,]  0.2236068 -0.5 -0.6708204
##  [4,]  0.6708204  0.5  0.2236068
##  [5,] -0.6708204  0.5 -0.2236068
##  [6,] -0.2236068 -0.5  0.6708204
##  [7,]  0.2236068 -0.5 -0.6708204
##  [8,]  0.6708204  0.5  0.2236068
##  [9,] -0.6708204  0.5 -0.2236068
## [10,] -0.2236068 -0.5  0.6708204
## [11,]  0.2236068 -0.5 -0.6708204
## [12,]  0.6708204  0.5  0.2236068
## [13,] -0.6708204  0.5 -0.2236068
## [14,] -0.2236068 -0.5  0.6708204
## [15,]  0.2236068 -0.5 -0.6708204
## [16,]  0.6708204  0.5  0.2236068
## [17,] -0.6708204  0.5 -0.2236068
## [18,] -0.2236068 -0.5  0.6708204
## [19,]  0.2236068 -0.5 -0.6708204
## [20,]  0.6708204  0.5  0.2236068
```

Notice this is the same model matrix used when we model the ordered factor with `lm`. (The column of ones is for the intercept.)

```r
lm(rate ~ powerF, data = etch) |> model.matrix()
##    (Intercept)   powerF.L powerF.Q   powerF.C
## 1            1 -0.6708204      0.5 -0.2236068
## 2            1 -0.2236068     -0.5  0.6708204
## 3            1  0.2236068     -0.5 -0.6708204
## 4            1  0.6708204      0.5  0.2236068
## 5            1 -0.6708204      0.5 -0.2236068
## 6            1 -0.2236068     -0.5  0.6708204
## 7            1  0.2236068     -0.5 -0.6708204
## 8            1  0.6708204      0.5  0.2236068
## 9            1 -0.6708204      0.5 -0.2236068
## 10           1 -0.2236068     -0.5  0.6708204
## 11           1  0.2236068     -0.5 -0.6708204
## 12           1  0.6708204      0.5  0.2236068
## 13           1 -0.6708204      0.5 -0.2236068
## 14           1 -0.2236068     -0.5  0.6708204
## 15           1  0.2236068     -0.5 -0.6708204
## 16           1  0.6708204      0.5  0.2236068
## 17           1 -0.6708204      0.5 -0.2236068
## 18           1 -0.2236068     -0.5  0.6708204
## 19           1  0.2236068     -0.5 -0.6708204
## 20           1  0.6708204      0.5  0.2236068
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$powerF
## [1] "contr.poly"
```

The latter two models that explicitly modeled a 3-degree polynomial did not use a contrast. Instead, `poly` built orthogonal polynomial columns directly from the numeric power values. (Note that by default `poly` does not literally square and cube the values; for that, set `raw = TRUE`.)
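We can verify that the default orthogonal version of `poly` and a raw polynomial fit (`poly(..., raw = TRUE)`) describe the same model: the fitted values and overall F-test are identical, but the individual coefficients differ because the raw columns power, power², and power³ are highly correlated. A sketch, re-entering the data from above so the code is self-contained:

```r
# recreate the etch data from above
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725, 542, 593, 651, 700, 530, 590, 610, 715,
          539, 579, 637, 685, 570, 610, 629, 710)
etch <- data.frame(rate, power)

fit_orth <- lm(rate ~ poly(power, 3), data = etch)              # orthogonal (default)
fit_raw  <- lm(rate ~ poly(power, 3, raw = TRUE), data = etch)  # raw powers

all.equal(fitted(fit_orth), fitted(fit_raw))  # TRUE: identical fitted values
round(coef(fit_raw), 4)                       # but very different coefficients
```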

Is there an advantage to using one over the other? Maybe we should have just used polynomial regression instead of creating an ordered factor? Fox and Weisberg (2019) tell us `contr.poly` is appropriate when the levels of your ordered factor are equally spaced and balanced. That is the case for the power variable: each level is separated by 20 units and we have five observations for each power level. If we have an ordered factor that does not have equal spacing and/or is not balanced, we’re better off performing polynomial regression via the `poly` function.
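To see why the equal-spacing caveat matters, consider a hypothetical version of this experiment where the top power setting is 260 instead of 220 (an assumption purely for illustration), and where the response is exactly linear in power. Polynomial regression on the actual values should find no quadratic trend, but the ordered-factor contrast, which only sees four equally spaced ranks, can report a spurious quadratic trend:

```r
# hypothetical unequally spaced levels -- purely for illustration
set.seed(1)
x <- rep(c(160, 180, 200, 260), 5)    # last gap is 60, not 20
y <- 2 + 0.5 * x + rnorm(20, sd = 3)  # response is truly linear in x
xF <- ordered(x)

# ordered factor: contr.poly assumes equally spaced scores 1:4
coef(summary(lm(y ~ xF)))["xF.Q", ]

# polynomial regression on the actual values respects the spacing
coef(summary(lm(y ~ poly(x, 3))))["poly(x, 3)2", ]
```

Here the factor-based quadratic test is "significant" purely because the contrast misrepresents the spacing of the levels.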

It’s worth noting that even though we have power saved as an ordered factor and modeled with a polynomial contrast, we can still perform follow-up analyses usually associated with standard unordered categorical variables. For example, using the emmeans package we can make pairwise comparisons.

```r
library(emmeans)
lm(rate ~ powerF, data = etch) |>
  emmeans(specs = "powerF") |>
  pairs()
##  contrast  estimate   SE df t.ratio p.value
##  160 - 180    -36.2 11.6 16  -3.133  0.0294
##  160 - 200    -74.2 11.6 16  -6.422  <.0001
##  160 - 220   -155.8 11.6 16 -13.485  <.0001
##  180 - 200    -38.0 11.6 16  -3.289  0.0216
##  180 - 220   -119.6 11.6 16 -10.352  <.0001
##  200 - 220    -81.6 11.6 16  -7.063  <.0001
## 
## P value adjustment: tukey method for comparing a family of 4 estimates
```

Hopefully you now have a better understanding of what’s happening in R when you set a factor as ordered and use it in a linear model.

**References**

- John Fox and Sanford Weisberg (2019). *An R Companion to Applied Regression, Third Edition*. Thousand Oaks, CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
- Douglas Montgomery (2009). *Design and Analysis of Experiments, Seventh Edition*. Wiley.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Research Consultant

University of Virginia Library

July 1, 2021