Simulating Multinomial Logistic Regression Data

In this article we demonstrate how to simulate data suitable for a multinomial logistic regression model using R. One reason to do this is to gain a better understanding of how multinomial logistic regression models work. Another is to simulate data for the purposes of estimating power and sample size for a planned experiment that will involve a multinomial logistic regression analysis.

Before we begin, let’s review why we might want to undertake a multinomial logistic regression analysis. The word “multinomial” in this case means “multiple unordered categories”. This could be three brands of cereal, five models of cars, three types of skin cancer, and so on. These unordered categories might be our dependent variable that we’re interested in understanding. For example, we could have data on cereal brand preference for 300 adults. Perhaps 30% prefer brand A, 25% prefer brand B, and 45% prefer brand C. Why is there variability in this brand preference? Perhaps sex, income level, race, occupation, age, and other independent variables could explain some of this variability. To assess this, we could model brand preference as a function of these variables using multinomial logistic regression.

The statistical model for multinomial logistic regression with p predictors and a dependent variable that has J categories is as follows, where \(j = 2, \dots, J\) indexes the non-baseline categories:

\[\text{log}(\pi_j/\pi_1) = \beta_{j0} + \beta_{j1}x_1 + \dots + \beta_{jp}x_p\]

The term on the left hand side is called the log-odds, or the logit. The ratio of probabilities, or odds, is the probability of being in category j divided by the probability of being in category 1, or the baseline category. Whenever we do multinomial logistic regression, one of the categories of the dependent variable needs to serve as the baseline. For this reason multinomial logistic regression is sometimes referred to as baseline-category logit regression.
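
For example, with J = 3 categories and category 1 as the baseline, the model consists of two equations:

\[\text{log}(\pi_2/\pi_1) = \beta_{20} + \beta_{21}x_1 + \dots + \beta_{2p}x_p\]

\[\text{log}(\pi_3/\pi_1) = \beta_{30} + \beta_{31}x_1 + \dots + \beta_{3p}x_p\]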

If we exponentiate both sides of the model we can solve for the probability of category j.

\[\pi_j = \pi_1\text{exp}(\beta_{j0} + \beta_{j1}x_1 + \dots + \beta_{jp}x_p)\] Since all the non-baseline category probabilities can be expressed this way and all the category probabilities sum to 1 (ie, \(\pi_1 + \pi_2 + \dots + \pi_J = 1\)), we can obtain the following expression:

\[\pi_1 + \pi_1\text{exp}(\beta_{20} + \beta_{21}x_1 + \dots + \beta_{2p}x_p) + \dots + \pi_1\text{exp}(\beta_{J0} + \beta_{J1}x_1 + \dots + \beta_{Jp}x_p) = 1\] Solving for \(\pi_1\) produces

\[\pi_1 = \frac{1}{1 + \sum_{k=2}^J \text{exp}(\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p)} \]

Substituting this expression in the formula for \(\pi_j\) above produces the following general expression for \(\pi_j\) in terms of the predictors:

\[\pi_j = \frac{\text{exp}(\beta_{j0} + \beta_{j1}x_1 + \dots + \beta_{jp}x_p)}{1 + \sum_{k=2}^J \text{exp}(\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p)} \]

We can use these two expressions to simulate probabilities for category membership, which can then be used to randomly assign membership to a category.

To demonstrate, we’ll assume we have two predictors:

  1. a numeric variable, x, that ranges uniformly from 0.5 to 3
  2. a categorical variable, g, consisting of two levels: “a” and “b”

There is nothing special about these predictors. This is just what we happened to choose. Let’s go ahead and create them. The set.seed(1) line ensures you will generate the same data if you wish to follow along in R.


set.seed(1)
# randomly sample 300 values from uniform dist'n ranging from 0.5 - 3
x <- runif(n = 300, min = 0.5, max = 3)
# randomly sample "a" and "b" 300 times with replacement
g <- sample(c("a", "b"), size = 300, replace = TRUE)

Now let’s use our predictors to generate an unordered categorical variable with three categories. Maybe it’s three brands of cereal, or three types of careers, or three kinds of cars. To do this, we need to create linear predictors that generate outcomes on the log-odds scale. The expression inside \(\text{exp()}\) above is the linear predictor: \(\beta_{j0} + \beta_{j1}x_1 + \dots + \beta_{jp}x_p\). Since the probabilities of category membership sum to one, we only need to generate two linear predictors; the baseline category’s probability is determined by the other two. Below we create two vectors of log odds: one for category 2 (lp2) and one for category 3 (lp3). For example, lp2 has an intercept (\(\beta_{20}\)) of 3, an x coefficient (\(\beta_{21}\)) of -2, and a g coefficient (\(\beta_{22}\)) of -0.7.


lp2 <- 3 + -2*x + -0.7*(g == "b")
lp3 <- -2 + 1.2*x + -0.3*(g == "b")

Next we use our expressions above for \(\pi_1\) and \(\pi_j\) to generate probabilities of category membership. Notice both expressions have the same denominator, so we only need to calculate it once. We then use the denominator and linear predictor results to generate the probabilities. Finally we combine into a matrix called P.


# denominator, ensures probabilities sum to 1
den <- (1 + exp(lp2) + exp(lp3))
p1 <- 1/den
p2 <- exp(lp2)/den
p3 <- exp(lp3)/den
P <- cbind(p1, p2, p3)
head(P)


            p1         p2        p3
[1,] 0.2852292 0.55877658 0.1559942
[2,] 0.3445178 0.39604434 0.2594379
[3,] 0.3575890 0.15067383 0.4917372
[4,] 0.2066306 0.01627929 0.7770901
[5,] 0.2411312 0.64997348 0.1088953
[6,] 0.2112217 0.01747830 0.7713000

Each row of P contains the probabilities of a given observation being in each of the three categories. For example, the first observation has the highest probability of being in category 2. All rows sum to 1, which we can verify by summing each row and checking that every sum equals 1:


# use round() to suppress precision errors 
all(round(apply(P, 1, sum)) == 1)


[1] TRUE
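
An equivalent check, shown as a quick sketch below, uses rowSums() with an explicit numeric tolerance instead of round():


# equivalent check: each row sum should be within a tiny tolerance of 1
all(abs(rowSums(P) - 1) < 1e-8)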

Now we’re ready to simulate group membership using the probabilities. To do this we use the sample() function, which allows you to sample a category based on given probabilities. For example, let’s say we have three categories, labeled 1 - 3, with probabilities 0.1, 0.3, and 0.6. We can randomly pick a category as follows:


set.seed(2)
sample(1:3, size = 1, prob = c(0.1, 0.3, 0.6))


[1] 3

Not surprisingly we sampled 3 since it has the highest probability of being selected.
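
If we repeat this sampling many times, the proportion of times each category is selected should approximate the probabilities we supplied. Here is a quick sketch; the proportions should land near 0.1, 0.3, and 0.6:


set.seed(2)
# sample 10,000 times and tabulate the proportions
table(sample(1:3, size = 10000, replace = TRUE, prob = c(0.1, 0.3, 0.6)))/10000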

We now want to apply the sample() function to each row of P and use the probabilities to randomly sample a category. We can do that using the apply() function. The second argument to apply(), MARGIN = 1, says apply the function to the rows (ie, the first dimension). We apply an anonymous function using sample() and pass each row to the prob argument. We save the result as “y” and table up the values.


set.seed(3)
y <- apply(P, MARGIN = 1, function(p) sample(x = 1:3, size = 1, prob = p))
table(y)


y
  1   2   3 
 99  76 125 

Most observations are in category 3, followed by category 1, and then category 2.
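
As a quick sanity check, the observed proportions should be close to the column means of P, which give the expected proportion of observations in each category. Below is a sketch using the objects created above:


# observed proportions versus expected proportions
table(y)/length(y)
colMeans(P)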

Finally we combine our response, “y”, with “x” and “g” into a data frame. Notice we made “y” a factor. It’s not required, but it’s a good idea since it’s supposed to be a categorical variable. And we’re done! We have simulated data suitable for multinomial logistic regression.


d <- data.frame(y = factor(y), x, g)
str(d)


'data.frame':   300 obs. of  3 variables:
 $ y: Factor w/ 3 levels "1","2","3": 2 3 3 3 2 3 3 3 1 2 ...
 $ x: num  1.16 1.43 1.93 2.77 1 ...
 $ g: chr  "a" "a" "a" "a" ...

Let’s go ahead and fit a model to this data and see how close it comes to the “true” values we used to simulate the data. We’ll use the multinom() function from the nnet package that comes with R. The model, y ~ x + g, is the correct model. That’s precisely how we generated “y”. (The trace = FALSE argument suppresses output that we don’t need for this article.)


library(nnet)
m <- multinom(y ~ x + g, data = d, trace = FALSE)

Now compare the model coefficients to the linear predictors we used to generate the data.


coef(m)


  (Intercept)         x           gb
2    3.726047 -2.838422 -0.527639019
3   -2.293166  1.304444  0.008536652

And here are the “true” coefficients we used:


lp2 <- 3 + -2*x + -0.7*(g == "b")
lp3 <- -2 + 1.2*x + -0.3*(g == "b")

Other than the g coefficient in the model for category 3, the rest of the estimated coefficients are relatively close to the true values.
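
We can also compare the model’s predicted probabilities to the true probabilities in P. For multinom objects, predict() with type = "probs" returns a matrix of predicted probabilities. A quick sketch using the objects from above:


# predicted probabilities from the fitted model versus the true probabilities
head(predict(m, type = "probs"))
head(P)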

If we wanted to assess the collective significance of the model predictors we might use the Anova() function from the car package.


car::Anova(m)


Analysis of Deviance Table (Type II tests)

Response: y
  LR Chisq Df Pr(>Chisq)    
x   157.70  2     <2e-16 ***
g     2.35  2     0.3088    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even though we simulated the data such that “g” had an effect, it appears 300 observations may not be enough to reliably detect the effect. To be sure, we should generate many data sets of size 300, fit a model to each data set, and see how often we detect an effect for “g” (ie, get a p-value below a certain threshold, say 0.05). The replicate() function makes this relatively easy. We define an expression between the curly braces where we generate a new y vector, create a data frame, fit the model, and check whether the Chi-square test for “g” is less than 0.05. We do this 500 times and save to an object called “rout”.


rout <- replicate(n = 500, expr = {
  y <- apply(P, MARGIN = 1, function(p) sample(x = 1:3, size = 1, prob = p))
  d <- data.frame(y = factor(y), x, g)
  m <- multinom(y ~ x + g, data = d, trace = FALSE)
  aod <- car::Anova(m)
  aod["g", "Pr(>Chisq)"] < 0.05
})

The “rout” vector consists of TRUE/FALSE values. Taking the mean of this vector returns the proportion of TRUEs.


mean(rout)


[1] 0.452

It appears we have a probability, or power, of about 0.45 to detect the effect of “g” with a sample size of 300. That’s not very high. What sample size would we need to reliably detect the effect of “g”? To figure this out, it helps to create a function where we can modify the sample size. We create one such function below, named “sim_mod”. The function takes one argument, n, which determines how many rows of data to generate. It then generates the data as before, fits a model 500 times, and returns the proportion of times the effect for “g” is detected. You’ll notice the function consists entirely of code we had already written above. We simply copied the code, pasted it between the curly brackets following function(n), and replaced “300” with “n”.


sim_mod <- function(n){
  # generate predictors
  x <- runif(n = n, min = 0.5, max = 3)
  g <- sample(c("a", "b"), size = n, replace = TRUE)
  # linear predictors
  lp2 <- 3 + -2*x + -0.7*(g == "b")
  lp3 <- -2 + 1.2*x + -0.3*(g == "b")
  # probabilities
  den <- (1 + exp(lp2) + exp(lp3))
  p1 <- 1/den
  p2 <- exp(lp2)/den
  p3 <- exp(lp3)/den
  P <- cbind(p1, p2, p3)
  rout <- replicate(n = 500, expr = {
    y <- apply(P, MARGIN = 1, function(p) sample(x = 1:3, size = 1, prob = p))
    d <- data.frame(y = factor(y), x, g)
    m <- multinom(y ~ x + g, data = d, trace = FALSE)
    aod <- car::Anova(m)
    aod["g", "Pr(>Chisq)"] < 0.05
  })
  mean(rout)
}

We can now apply our sim_mod() function to several candidate sample sizes and get an estimated power for each.


sizes <- sapply(c(400, 500, 600, 700), sim_mod)
sizes


[1] 0.614 0.700 0.764 0.832

It looks like we’re going to need about 700 observations to achieve power greater than 0.80 to reliably detect the hypothesized effect sizes for the predictor “g”.
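
To visualize the relationship between sample size and power, we might plot the estimated power against the candidate sample sizes. A minimal base R sketch using the results above:


n <- c(400, 500, 600, 700)
plot(n, sizes, type = "b", xlab = "sample size", ylab = "estimated power")
abline(h = 0.8, lty = 2)  # reference line at the conventional 0.80 power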

Let’s return to our linear predictor formulas and look at the coefficients:


lp2 <- 3 + -2*x + -0.7*(g == "b")
lp3 <- -2 + 1.2*x + -0.3*(g == "b")

What does the -0.7 coefficient for “g” mean in linear predictor 2 (lp2)? Likewise, what does the 1.2 coefficient for “x” mean in linear predictor 3 (lp3)? Technically, these are weights on the log-odds scale. So in linear predictor 2, the -0.7 coefficient on “g” says that the log-odds of being in category 2 versus category 1 decrease by 0.7 when an observation is in group b versus group a. In linear predictor 3, the log-odds of being in category 3 versus category 1 increase by 1.2 for every one unit increase in “x”. But what does that mean? The log-odds scale is not easy to interpret.

We can make interpretation a bit easier by exponentiating the coefficients. This returns an odds ratio. When we exponentiate -0.7, we get about 0.50.


exp(-0.7)


[1] 0.4965853

This says the odds of being in category 2 versus category 1 (the baseline) decrease by a factor of 0.50 when an observation is in group b versus group a. We can verify this manually by computing a linear predictor when group = b, another when group = a, converting to odds, and then taking the ratio. Notice we use plogis() to convert log-odds to probability when we calculate the odds. Recall odds = p/(1 - p). The resulting odds ratio matches what we got by exponentiating the coefficient.


# group = b with x = 0.5
lp2_b <- 3 + -2*0.5 + -0.7*1
# group = a with x = 0.5
lp2_a <- 3 + -2*0.5 + -0.7*0
# odds of being in category 2 when group = b
odds_b <- plogis(lp2_b)/(1 - plogis(lp2_b))
# odds of being in category 2 when group = a
odds_a <- plogis(lp2_a)/(1 - plogis(lp2_a))
# odds ratio
odds_b/odds_a


[1] 0.4965853

Likewise, if we exponentiate 1.2 in linear predictor 3, we get about 3.32.


exp(1.2)


[1] 3.320117

This says for every one unit increase in “x”, the odds of being in category 3 versus category 1 (the baseline) increase by a factor of 3.32. We can also verify this manually by calculating two linear predictors, one at x = 1 and another at x = 2 (ie, a one unit increase in “x”), and calculating the odds ratio.


# x = 1 with group = a
lp3_x1 <- -2 + 1.2*1 + -0.3*0
# x = 2 with group = a
lp3_x2 <- -2 + 1.2*2 + -0.3*0
# odds of being in category 3 when x = 1
odds_x1 <- plogis(lp3_x1)/(1 - plogis(lp3_x1))
# odds of being in category 3 when x = 2
odds_x2 <- plogis(lp3_x2)/(1 - plogis(lp3_x2))
# odds ratio
odds_x2/odds_x1


[1] 3.320117

These kinds of calculations can help you be more specific about how you generate your data. For example, let’s say you hypothesize the probabilities of category membership when x = 1 to be

  • P(category 1) = 0.3
  • P(category 2) = 0.4
  • P(category 3) = 0.3

But when x = 2, you hypothesize the probabilities change to

  • P(category 1) = 0.1
  • P(category 2) = 0.3
  • P(category 3) = 0.6

If we assume category 1 is the baseline we have two sets of odds:

  • odds when x = 1
    • P(category 2)/P(category 1) = (0.4/0.3) = 1.33
    • P(category 3)/P(category 1) = (0.3/0.3) = 1
  • odds when x = 2
    • P(category 2)/P(category 1) = (0.3/0.1) = 3
    • P(category 3)/P(category 1) = (0.6/0.1) = 6

This leads to two odds ratios:

  • odds ratio (category 2/category 1) = 3/1.33 ≈ 2.25
  • odds ratio (category 3/category 1) = 6/1 = 6
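
We can carry out these calculations in R. Here is a quick sketch using the hypothesized probabilities; the object names are our own:


# hypothesized probabilities at x = 1 and x = 2
p_x1 <- c(0.3, 0.4, 0.3)
p_x2 <- c(0.1, 0.3, 0.6)
# odds of categories 2 and 3 versus category 1 (the baseline)
odds_at_x1 <- p_x1[2:3]/p_x1[1]
odds_at_x2 <- p_x2[2:3]/p_x2[1]
# odds ratios for categories 2 and 3
odds_at_x2/odds_at_x1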

We could then use these odds ratios in linear predictor formulas. For example:


linpred1 <- 0.2 + log(2.25)*x
linpred2 <- 0.1 + log(6)*x

Since exponentiating returns an odds ratio, we need to take the log of the odds ratios to calculate the coefficients for the linear predictor.

We arbitrarily selected 0.2 and 0.1 to be the intercepts, but they could also be determined in a similar manner. They represent the log-odds when x = 0. If x = 0 is not an interesting or plausible value, you may want to center your data so that x = 0 represents the average. Let’s say we do plan to center our data. Then exponentiating the intercepts returns the following odds ratios:


exp(0.2)


[1] 1.221403


exp(0.1)


[1] 1.105171

The first says the odds of being in category 2 are higher by a factor of about 1.22 than the odds of being in category 1, when x is at its average value. The second says the odds of being in category 3 are higher by a factor of about 1.11 than the odds of being in category 1, when x is at its average value. As the researcher simulating the data, you’ll have to determine what intercept values make sense for you.
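
For example, suppose that after centering you hypothesize the category probabilities at the average value of x (ie, x = 0) to be 0.3, 0.37, and 0.33. The intercepts would then be the log-odds of each non-baseline category versus category 1. Below is a sketch using these hypothetical probabilities; the results come out to about 0.21 and 0.10, close to the 0.2 and 0.1 we selected above.


# hypothetical probabilities at x = 0 (ie, at the average of centered x)
p0 <- c(0.3, 0.37, 0.33)
# intercepts: log-odds of categories 2 and 3 versus category 1
log(p0[2:3]/p0[1])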


Clay Ford
Statistical Research Consultant
University of Virginia Library
February 24, 2023


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.