Power and Sample Size Analysis using Simulation

The power of a test is the probability of correctly rejecting a null hypothesis. For example, let’s say we suspect a coin is not fair and lands heads 65% of the time.

  • The null hypothesis is the coin is not biased to land heads.
  • The alternative hypothesis is the coin is biased to land heads.

We plan to flip the coin 50 times and calculate the proportion of heads. We’ll analyze the data using a one-sample proportion test. We’ll reject the null hypothesis if our p-value is below 0.05. Assuming our hunch is correct, that the coin lands heads 65% of the time, what is the probability of correctly rejecting the null hypothesis? That’s power. Ideally we want power to be high, say greater than 0.90.

In simple experiments such as this it is relatively easy to calculate power. In R we can use the pwr.p.test() function in the pwr package. The h argument is for effect size. We use the ES.h() function to calculate effect size of 0.65 versus 0.50. We also specify our sample size, our significance level, and the direction of our alternative hypothesis.

# install.packages("pwr")
library(pwr)
pwr.p.test(h = ES.h(p1 = 0.65, p2 = 0.50), 
           n = 50, sig.level = 0.05, 
           alternative = "greater")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.3046927
##               n = 50
##       sig.level = 0.05
##           power = 0.6948512
##     alternative = greater

The power is computed to be about 0.70. There’s a decent, but not great, chance we’ll reject the null hypothesis (assuming the coin really is biased to land heads 65% of the time). How many times should we flip the coin to have a power of 0.90? We can rerun the same code, but this time specify power and leave out n.

pwr.p.test(h = ES.h(p1 = 0.65, p2 = 0.50), 
           power = 0.90, sig.level = 0.05, 
           alternative = "greater")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.3046927
##               n = 92.24545
##       sig.level = 0.05
##           power = 0.9
##     alternative = greater

It looks we need to flip our coin at least 93 times to have a 90% chance of rejecting the null hypothesis. Once again this assumes the coin really is loaded to land heads 65% of the time.

We could also calculate power and sample size for this experiment using simulation. The basic idea is to simulate running the experiment many times and calculate the proportion of times we reject the null hypothesis. This proportion provides an estimate of power. Let’s try that.

Below we simulate flipping a biased coin 50 times using the sample() function. Notice we use the prob argument to specify that “H” is sampled 65% of the time. Next we run a one-sample proportion test to see if we have evidence of a coin biased to land heads. If the p-value is below 0.05, then we reject the null hypothesis.

d <- sample(c("H","T"), size = 50, replace = TRUE, prob = c(0.65, 0.35))
test <- prop.test(x = sum(d == "H"), n = 50, alternative = "greater")
test$p.value < 0.05
## [1] TRUE

We can replicate this code many times using the replicate() function. Simply wrap the code in curly brackets and specify how many times you want to replicate the code. Below we run our code 1000 times and save the result as “sim”. The result is a vector of TRUE/FALSE values. Taking the mean of TRUE/FALSE values returns the proportion of TRUES. In this case, the proportion of TRUES will be an estimate of power.

sim <- replicate(n = 1000, expr = {
  d <- sample(c("H","T"), size = 50, 
              replace = TRUE, prob = c(0.65, 0.35))
  test <- prop.test(x = sum(d == "H"), n = 50, 
                    alternative = "greater")
  test$p.value < 0.05
})
mean(sim)
## [1] 0.625

The estimate is a little over 0.60. The reason that’s lower than what we calculated before is because the prop.test() function uses something called Yates’ continuity correction. If we turn that off by setting correct = FALSE we get a similar result as before.

sim <- replicate(n = 1000, expr = {
  d <- sample(c("H","T"), size = 50, 
              replace = TRUE, prob = c(0.65, 0.35))
  test <- prop.test(x = sum(d == "H"), n = 50, 
                    alternative = "greater", 
                    correct = FALSE)
  test$p.value < 0.05
})
mean(sim)
## [1] 0.717

To try different sample sizes, we can create a function that allows us to set the sample size, and then run the function with various sample sizes to see how power changes. Below we create a function called sim_power() that is basically a copy of our previous code. We simply replaced “50” with n.

sim_power <- function(n){
  sim <- replicate(n = 1000, expr = {
  d <- sample(c("H","T"), size = n, 
              replace = TRUE, prob = c(0.65, 0.35))
  test <- prop.test(x = sum(d == "H"), n = n, 
                    alternative = "greater", 
                    correct = FALSE)
  test$p.value < 0.05
  })
mean(sim)
}

Now we can “apply” this function to various sample sizes and see how power changes. The sapply function simplifies the output to a vector.

sample_size <- c(50, 75, 100, 125)
power <- sapply(sample_size, sim_power)
data.frame(sample_size, power)
##   sample_size power
## 1          50 0.746
## 2          75 0.842
## 3         100 0.914
## 4         125 0.963

It looks like we would want to do around 100 flips for about 90% power, which agrees with our previous result using the pwr.p.test() function. If you run this code you’ll get slightly different answers because we’re using simulated data.

For a simple experiment such as coin flipping, simulation isn’t necessary to determine power and sample size. But simulation can be very helpful when your experiment is more complicated.

Let’s say we want to investigate different weights of paper and different designs of paper airplanes on the distance flown in inches. Perhaps we have two designs (a and b) and two weights of paper (20 lb and 28 lb). That’s 2 x 2 = 4 combinations of paper airplanes. What power do we have to detect a difference of 12 inches for design “b” with a significance level of 0.01, assuming a sample size of 10 per group and a common standard deviation of 10 inches between the four groups?

To begin, let’s just simulate one data set and one analysis.

n <- 10 # per group
design <- rep(c("a","b"), each = n*2)
weight <- rep(c("20 lb","28 lb"), times = n*2)
table(design, weight)
##       weight
## design 20 lb 28 lb
##      a    10    10
##      b    10    10

Next we simulate our response according to the following assumptions about expected flight distance. Perhaps these are based on a small pilot study.

  • design a, 20 lb paper: 72 inches of flight
  • design b, 20 lb paper: 72 + 12
  • design a, 28 lb paper: 72 + -4
  • design b, 28 lb paper: 72 + 12 + -4

These are known as additive effects. The effect of design b is to add 12 inches to expected flight distance. The effect of 28 lb paper is to add -4 inches to expected flight distance.

The rnorm function at the end adds “noise” to the distance and is analogous to setting the standard deviation for each design/weight combination to 10.

dist <- 72 +              
  (design == "b")*12 +    
  (weight == "28 lb")*-4 +
  rnorm(n*4, mean = 0, sd = 10)
  

And finally we run the analysis using the lm function.

m <- lm(dist ~ design + weight)

We can extract the hypothesis test for designb and check to see if it’s less than 0.01 as follows:

coef(summary(m))['designb','Pr(>|t|)'] < 0.01
## [1] TRUE

Now repeat the simulation and analysis 1000 times.

results <- replicate(1000, expr = {
  dist <- 72 + (design == "b")*12 + 
    (weight == "28 lb")*-4 + 
    rnorm(n*4, mean = 0, sd = 10)
  m <- lm(dist ~ design + weight)
  coef(summary(m))['designb','Pr(>|t|)'] < 0.01
})
mean(results)
## [1] 0.859

That’s fairly high power. If our assumptions are correct, 10 observations of each of the four different airplane designs provide a high probability of correctly rejecting the null of no design effect at a significance level of 0.01.

What if we wanted to detect an interaction between design and weight? Interactions imply the effects are NOT additive. The effect of design depends on weight and vice versa.

Let’s say it would be of interest if the interaction was 6 inches. For example, the effect of design b is 12 inches for 20 lb paper, but 12 + 6 = 18 inches for 28 lb paper.

We might simulate our response according to the following:

  • design a, 20 lb paper: 72
  • design b, 20 lb paper: 72 + 12
  • design a, 28 lb paper: 72 + -4
  • design b, 28 lb paper: 72 + 12 + -4 + 6

To specify the interaction in lm we use the : operator.

dist <- 72 + 
  (design == "b")*12 + 
  (weight == "28 lb")*-4 + 
  (design == "b")*(weight == "28 lb")*6 +
  rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight + design:weight)
coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01
## [1] FALSE

Let’s run it 1000 times and see what proportion of times we reject the null hypothesis of no interaction:

results <- replicate(1000, expr = {
  dist <- 72 + (design == "b")*12 + (weight == "28 lb")*-4 +
    (design == "b")*(weight == "28 lb")*6 +
    rnorm(n*4, mean = 0, sd = 10)
  m <- lm(dist ~ design + weight + design:weight)
  coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01
})
mean(results)
## [1] 0.045

The estimated power is very low. We need a much bigger sample size to detect this interaction.

Let’s turn this into a function so we can easily try different sample sizes.

sim_model <- function(n){
  results <- replicate(n = 1000, expr = {
    design <- rep(c("a","b"), each = n*2)
    weight <- rep(c("20 lb","28 lb"), times = n*2)
    dist <- 72 + (design == "b")*12 + (weight == "28 lb")*-4 +
      (design == "b")*(weight == "28 lb")*6 +
      rnorm(n*4, mean = 0, sd = 10)
    m <- lm(dist ~ design + weight + design:weight)
    coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01})
  mean(results)
}

# try samples sizes of 50, 100, 150
sample_size <- c(50,100,150)
power <- sapply(sample_size, sim_model)
data.frame(sample_size, power)
##   sample_size power
## 1          50 0.279
## 2         100 0.671
## 3         150 0.867

It appears we need about 15 times the sample size (150 per group versus 10 per group) to detect the hypothesized interaction with decent power! (See this blog post for more insight on sample sizes for interactions.)

Again, this is all one big thought experiment based on our assumptions. It doesn’t guarantee we’ll find large effects and get “significant” results. It simply gives us an approximate sample size to aim for given what we believe might be true about our paper airplane designs.

If we wanted we could build in more assumptions, such as unequal variances between groups, unequal sample sizes per group, block effects to account for different people throwing the airplanes, and so forth. This would make our simulation and subsequent analysis more complicated, but it may be necessary to get accurate estimates of power and sample size. Simulating data forces us to think about how we’re going to analyze our data, which in turn makes us confront our assumptions about how the data are generated.

References

  • R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • Stephane Champely (2020). pwr: Basic Functions for Power Analysis. R package version 1.3-0. https://CRAN.R-project.org/package=pwr
  • Andrew Gelman and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/hierarchical Models. Cambridge: Cambridge University Press. Chapter 20.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay Ford
Statistical Research Consultant
University of Virginia Library
October 1, 2021