Loglinear models model cell counts in contingency tables. They’re a little different from other modeling methods in that they don’t distinguish between response and explanatory variables. All variables in a loglinear model are essentially “responses”.

To learn more about loglinear models, we’ll explore the following data from Agresti (1996, Table 6.3). It summarizes responses from a survey that asked high school seniors in a particular city whether they had ever used alcohol, cigarettes, or marijuana.

We’re interested in how the cell counts in this table depend on the levels of the categorical variables. If we were interested in modeling, say, marijuana use as a function of alcohol and cigarette use, then we might perform logistic regression. But if we’re interested in understanding the relationship between all three variables without one necessarily being the “response”, then we might want to try a loglinear model.

First, let’s get the data into R. We usually import data from an external file, but since the data is relatively small, we can type it in.

> seniors <- array(data = c(911, 44, 538, 456, 3, 2, 43, 279), dim = c(2,2,2), dimnames = list("cigarette" = c("yes","no"), "marijuana" = c("yes","no"), "alcohol" = c("yes","no")))

We use the `array`

function when we want to create a table with more than two dimensions. The `dim`

argument says we want to create a table with 2 rows, 2 columns, and 2 layers. In other words, we want to create two 2 x 2 tables: cigarette versus marijuana use for each level of alcohol use. The `dimnames`

argument provides names for the dimensions. It requires a list object, so we wrap the argument’s input in the `list`

function. The order is rows, columns, then layers. The `data`

argument provides the cell counts. Notice the order of entry. We fill the columns starting at the top left for each layer and work our way down. Finally we assign it to an object called “seniors”.

Notice when we print the object, it produces two 2 x 2 tables:

> seniors , , alcohol = yes marijuana cigarette yes no yes 911 538 no 44 456 , , alcohol = no marijuana cigarette yes no yes 3 43 no 2 279

If we want to print the table to look like our data source, we need to flatten it. We can do with the `ftable`

function. The `row.vars`

argument specifies which variables we want in the rows, and the order we want them in. Notice we use the names we specified in the `dimnames`

argument when we created the table.

> ftable(seniors, row.vars = c("alcohol","cigarette")) marijuana yes no alcohol cigarette yes yes 911 538 no 44 456 no yes 3 43 no 2 279

Before proceeding to modeling, we may want to explore the data. The `addmargins`

function provides marginal totals. We see that 2276 students participated in the survey and most of them tried alcohol.

> addmargins(seniors) , , alcohol = yes marijuana cigarette yes no Sum yes 911 538 1449 no 44 456 500 Sum 955 994 1949 , , alcohol = no marijuana cigarette yes no Sum yes 3 43 46 no 2 279 281 Sum 5 322 327 , , alcohol = Sum marijuana cigarette yes no Sum yes 914 581 1495 no 46 735 781 Sum 960 1316 2276

The `prop.table`

function allows us to calculate cell proportions. It has a `margin`

argument that allows us to specify which margin we want to work across. 1 means rows, 2 means columns, and 3 and above refer to the layers. Below we calculate proportions across the columns along the rows for each layer. Hence the argument `margin = c(1,3)`

:

> prop.table(seniors, margin = c(1,3)) , , alcohol = yes marijuana cigarette yes no yes 0.6287095 0.3712905 no 0.0880000 0.9120000 , , alcohol = no marijuana cigarette yes no yes 0.065217391 0.9347826 no 0.007117438 0.9928826

We see that of those students who tried cigarettes and alcohol, 62% also tried marijuana. Likewise, of those students who did not try cigarettes or alcohol, 99% also did not try marijuana. There definitely appears to be some sort of relationship here.

Now we’re ready to try some loglinear modeling. For this article we’ll use the `glm`

function, which means we need to transform our data into a different format. While there is a function, `loglin`

, that works directly with contingency tables, we find it easier to work with the formula-based `glm`

function.

We can easily transform our data by converting it to a data frame. Calling `as.data.frame`

on a table object in R returns a data frame with a column for cell frequencies where each row represents a unique combination of variables. But we also need to convert our array into a table. We can do that with `as.table`

. Below we do it with one line. After that we set the reference level for our three variables to “no”. This means the model coefficients will be expressed in terms of “yes” answers, which would seem to be more interesting given the age of our subjects.

> seniors.df <- as.data.frame(as.table(seniors)) > seniors.df[,-4] <- lapply(seniors.df[,-4], relevel, ref = "no") > seniors.df cigarette marijuana alcohol Freq 1 yes yes yes 911 2 no yes yes 44 3 yes no yes 538 4 no no yes 456 5 yes yes no 3 6 no yes no 2 7 yes no no 43 8 no no no 279

Now our data is ready for loglinear modeling using the `glm`

function. Before we get started, let’s think about why we call it “loglinear”. Let’s say I flip a quarter 40 times, you flip a nickel 40 times, and we record the results in a 2 x 2 contingency table. Here’s a quick simulation:

> set.seed(1) > qtr <- sample(c("H","T"),40,replace = T) > nik <- sample(c("H","T"),40,replace = T) > table(qtr,nik) nik qtr H T H 11 10 T 9 10

Since the coins are fair and independent of one another, we’d guess that the cell counts would be roughly equal (about 10 in each cell). In fact, when two categorical variables are independent, we can use their marginal probabilities to calculate their joint probabilities. In our example, the marginal probabilities are 0.5 for the heads and tails of each coin. The joint probabilities are the probabilities of a particular result in the table. For example, the probability of both coins landing heads is 0.5 * 0.5 = 0.25. To get an expected cell count, we multiply that probability by the number of trials. In our case that’s 40 * 0.25 = 10. In our simulation, we got 11.

We can generalize this with symbols: \(\mu_{ij} = n\pi_{i}\pi_{j}\). This says the cell count in row i and column j is equal to the product of the marginal probabilities and the number of observations. What we have here is a nice little model that describes how a cell count depends on row and column variables, provided the row and column variables are independent. If we take the log of each side it becomes additive (ie, linear):

$$\log \mu_{ij} = \log n + \log \pi_{i} + \log \pi_{j}$$

Thus we have a “loglinear” model. This particular model is called the loglinear model of independence for two-way contingency tables. If our two variables are not independent, this model does not work well. We would need an additional parameter in our model to allow the two variables to interact. We often call this an interaction term or higher-order term. In loglinear models, those are usually what we’re interested in. They allow us to determine if two variables are associated and in what way.

Let’s work with our survey data, which is a three-way contingency table. To start, we model Freq as a function of the three variables using the `glm`

function. Notice we set the `family`

argument to poisson since we’re modeling counts. This is known as the independence model. It assumes the three variables are independent of one another, which seems very unlikely.

> mod0 <- glm(Freq ~ cigarette + marijuana + alcohol, data = seniors.df, family = poisson) > summary(mod0) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.17254 0.06496 64.234 < 2e-16 *** cigaretteyes 0.64931 0.04415 14.707 < 2e-16 *** marijuanayes -0.31542 0.04244 -7.431 1.08e-13 *** alcoholyes 1.78511 0.05976 29.872 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 2851.5 on 7 degrees of freedom Residual deviance: 1286.0 on 4 degrees of freedom

Looking at the summary it appears this is a great model. We see highly significant coefficients and p-values near 0. But for loglinear models we want to check the residual deviance. As a rule of thumb, we’d like it to be close in value to the degrees of freedom. Here we have 1286 on 4 degrees of freedom. This indicates a poor fit. We can calculate a p-value if we like. The deviance statistic has an approximate chi-squared distribution, so we use the `pchisq`

function. The `df.residual`

function extracts the degrees of freedom. Setting `lower.tail = F`

means we get the probability of seeing a deviance as large or larger than what we observed.

> pchisq(deviance(mod0), df = df.residual(mod0), lower.tail = F) [1] 3.574437e-277

The null of this test is that the expected frequencies satisfy the given loglinear model. Clearly they do not!

A more intuitive way to investigate fit is to compare the fitted values to the observed values. One way we can do that is to combine the original data with the fitted values, like so:

> cbind(mod0$data, fitted(mod0)) cigarette marijuana alcohol Freq fitted(mod0) 1 yes yes yes 911 539.98258 2 no yes yes 44 282.09123 3 yes no yes 538 740.22612 4 no no yes 456 386.70007 5 yes yes no 3 90.59739 6 no yes no 2 47.32880 7 yes no no 43 124.19392 8 no no no 279 64.87990

The fitted model is way off from the observed data. For example, we observed 3 students who tried cigarettes and marijuana but not alcohol. Our model predicts about 90.

The coefficients in this model can be interpreted as odds if we exponentiate them. For example, if we exponentiate the coefficient for marijuana, we get the following:

> exp(coef(mod0)[3]) marijuanayes 0.7294833

That’s the odds of using marijuana because “no” is the baseline. We can check this manually by calculating the odds directly:

> margin.table(seniors, margin = 2)/sum(margin.table(seniors, margin = 2)) marijuana yes no 0.4217926 0.5782074 > 0.4217926 / 0.5782074 [1] 0.7294832

According to this model, the odds of using marijuana are about 0.73 to 1, regardless of whether you tried alcohol or cigarettes. However we know that’s probably not a good estimate. We can just look at the raw data and see there were many more people who tried marijuana when they also tried cigarettes and alcohol.

Let’s a fit more complex model that allows variables to be associated with one another, but maintains the same association regardless of the level of the third variable. We call this homogeneous association. This says that, for example, alcohol and marijuana use have some sort of relationship, but that relationship is the same regardless of whether or not they tried cigarettes. Fitting this model means we add interactions for each pairwise combination of variables. R makes this easy with its formula notation. Simply put parentheses around your variables and add an exponent of 2. This translates to “all variables and all pairwise interactions”:

> mod1 <- glm(Freq ~ (cigarette + marijuana + alcohol)^2, data = seniors.df, family = poisson) > summary(mod1) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.63342 0.05970 94.361 < 2e-16 *** cigaretteyes -1.88667 0.16270 -11.596 < 2e-16 *** marijuanayes -5.30904 0.47520 -11.172 < 2e-16 *** alcoholyes 0.48772 0.07577 6.437 1.22e-10 *** cigaretteyes:marijuanayes 2.84789 0.16384 17.382 < 2e-16 *** cigaretteyes:alcoholyes 2.05453 0.17406 11.803 < 2e-16 *** marijuanayes:alcoholyes 2.98601 0.46468 6.426 1.31e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 2851.46098 on 7 degrees of freedom Residual deviance: 0.37399 on 1 degrees of freedom

This model fits much better. Notice the residual deviance (0.37399) compared to the degrees of freedom (1). We can calculate a p-value if we like:

> pchisq(deviance(mod1), df = df.residual(mod1), lower.tail = F) [1] 0.5408396

The high p-value says we have insufficient evidence to reject the null hypothesis that the expected frequencies satisfy our model. Once again we can compare the fitted and observed values and see how well they match up:

> cbind(mod1$data, fitted(mod1)) cigarette marijuana alcohol Freq fitted(mod1) 1 yes yes yes 911 910.38317 2 no yes yes 44 44.61683 3 yes no yes 538 538.61683 4 no no yes 456 455.38317 5 yes yes no 3 3.61683 6 no yes no 2 1.38317 7 yes no no 43 42.38317 8 no no no 279 279.61683

So the model seems to fit (very well), but how do we describe the association between the variables? What effect do the variables have on one another? To answer this we look at the coefficients of the interactions. By exponentiating the coefficients, we get odds ratios. For example, let’s look at the coefficient for “marijuanayes:alcoholyes”.

> exp(coef(mod1)["marijuanayes:alcoholyes"]) marijuanayes:alcoholyes 19.80658

Students who tried marijuana have estimated odds of having tried alcohol that are 19 times the estimated odds for students who did not try marijuana. Because we fit a homogeneous association model, the odds ratio is the same regardless of whether they tried cigarettes.

It’s a good idea to calculate a confidence interval for these odds ratio estimates. We can do that with `confint`

function:

> exp(confint(mod1, parm = c("cigaretteyes:marijuanayes", "cigaretteyes:alcoholyes", "marijuanayes:alcoholyes"))) 2.5 % 97.5 % cigaretteyes:marijuanayes 12.645760 24.06925 cigaretteyes:alcoholyes 5.601452 11.09715 marijuanayes:alcoholyes 8.814046 56.64360

The associations are all pretty strong. We see that the odds of trying marijuana if you tried cigarettes is at least 12 times higher than the odds of trying marijuana if you hadn’t tried cigarettes, and vice versa.

What happens if we fit a model with a three-way interaction? This allows the pairwise associations to change according the level of the third variable. This is the saturated model since it has as many coefficients as cells in our table: 8. It perfectly fits the data. The formula “cigarette * marijuana * alcohol” means fit all interactions.

> mod2 <- glm(Freq ~ cigarette * marijuana * alcohol, data = seniors.df, family = poisson)

The deviance of this model is basically 0 on 0 degrees of freedom. The fitted counts match the observed counts:

> deviance(mod2) [1] -2.609024e-13 > cbind(mod2$data, fitted(mod2)) cigarette marijuana alcohol Freq fitted(mod2) 1 yes yes yes 911 911 2 no yes yes 44 44 3 yes no yes 538 538 4 no no yes 456 456 5 yes yes no 3 3 6 no yes no 2 2 7 yes no no 43 43 8 no no no 279 279

All things being equal, we prefer a simpler model. We usually don’t want to finish with a saturated model that perfectly fits our data. However it’s useful to fit a saturated model to verify the higher-order interaction is statistically not significant. We can verify that the homogeneous association model fits just as well as the saturated model by performing a likelihood ratio test. One way to do this is with the `anova`

function:

> anova(mod1, mod2) Analysis of Deviance Table Model 1: Freq ~ (cigarette + marijuana + alcohol)^2 Model 2: Freq ~ cigarette * marijuana * alcohol Resid. Df Resid. Dev Df Deviance 1 1 0.37399 2 0 0.00000 1 0.37399 > pchisq(0.373, df = 1, lower.tail = F) [1] 0.5413735

This says we fail to reject the null hypothesis that mod1 fits just as well as mod2.

We could try models with only certain interactions. For example, look at the following model:

> mod3 <- glm(Freq ~ (cigarette * marijuana) + (alcohol * marijuana), data = seniors.df, family = poisson)

This fits interactions for cigarette and marijuana, and alcohol and marijuana, but not cigarette and alcohol. The implication is that alcohol and cigarette use is independent of one another, controlling for marijuana use. Is this a suitable model? Once again we can perform a likelihood ratio test:

> anova(mod3, mod1) Analysis of Deviance Table Model 1: Freq ~ (cigarette * marijuana) + (alcohol * marijuana) Model 2: Freq ~ (cigarette + marijuana + alcohol)^2 Resid. Df Resid. Dev Df Deviance 1 2 187.754 2 1 0.374 1 187.38 > pchisq(187.38, df = 1, lower.tail = F) [1] 1.186427e-42

The p-value is tiny. The probability of seeing such a big change in deviance (187.38) if the models really were no different is remote. There appears to be good evidence that the homogeneous association model provides a much better fit than the model that assumes conditional independence between alcohol and cigarette use.

Loglinear models work for larger tables that extend into 4 or more dimensions. Obviously the interpretation of interactions becomes much more complicated when they involve 3 or more variables.

To learn more about loglinear models, see the references below. The example for this blog post comes from Chapter 6 of An Introduction to Categorical Data Analysis.

References:

– Agresti, A. *An Introduction to Categorical Data Analysis*, 1st Ed. 1996. Ch. 6.

– Faraway, J. *Extending the Linear Model with R*. 2006. Ch. 4.

– Venables, W.N and Ripley, B.D. *Modern Applied Statistics with S*, 4th Ed. 2002. Ch. 7.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay FordStatistical Research Consultant

University of Virginia Library

July 12, 2016