Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a vector of 100 evenly spaced numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that is a linear function of x. When we plot x vs y, we get a straight line with an intercept of 1.2 and a slope of 2.1.


x <- seq(1,10, length.out = 100)
y <- 1.2 + 2.1 * x
plot(x, y)

y as a function of x

Now let’s add some “noise” to our data so that y is not completely determined by x. We can do that by randomly drawing values from a theoretical normal distribution with mean 0 and some set variance, and then adding them to the formula that generates y. The rnorm() function in R allows us to easily do this. Below we draw 100 random values from a normal distribution with mean 0 and standard deviation 2 and save as a vector called noise. (Recall that standard deviation is simply the square root of variance.) Then we generate y with the noise added. The function set.seed(222) allows you to get the same “random” data in case you want to follow along. Finally we re-plot the data.


set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2)
y <- 1.2 + 2.1 * x + noise
plot(x, y)

y as a function of x plus constant error variance

Now we have data that are a combination of a linear, deterministic component (\(y = 1.2 + 2.1x\)) and random noise drawn from a normal distribution with mean 0 and standard deviation 2. These are the basic assumptions we make about our data when we fit a traditional linear model. Below we use the lm() function to “recover” the “true” values we used to generate the data, of which there are three:

  • The intercept: 1.2
  • The slope: 2.1
  • The standard deviation: 2

m <- lm(y ~ x)
summary(m)


  Call:
  lm(formula = y ~ x)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -5.5831 -1.2165  0.3288  1.3022  4.3714 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  1.27426    0.44720   2.849  0.00534 ** 
  x            2.09449    0.07338  28.541  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 1.926 on 98 degrees of freedom
  Multiple R-squared:  0.8926, Adjusted R-squared:  0.8915 
  F-statistic: 814.6 on 1 and 98 DF,  p-value: < 2.2e-16

The intercept and slope estimates in the "Coefficients" section, 1.27 and 2.09, are quite close to 1.2 and 2.1, respectively. The residual standard error of 1.926 is also quite close to the constant value of 2. We produced a “good” model because we knew how y was generated and gave the lm() function the “correct” model to fit. While we can’t do this in real life, it’s a useful exercise to help us understand the assumptions of linear modeling.
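
If you would rather extract these quantities from the model object than read them off the summary, the coef() and sigma() functions return the estimated coefficients and residual standard error directly:


coef(m)    # intercept and slope (about 1.27 and 2.09)
sigma(m)   # residual standard error (about 1.93)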

Now what if the variance were not constant? What if we multiplied the standard deviation of 2 by the square root of x? As we see in the plot below, the vertical scatter of the points increases as x increases.


set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2 * sqrt(x))
y <- 1.2 + 2.1 * x + noise
plot(x, y)

y as a function of x plus non-constant error variance

We multiplied 2 by sqrt(x) because we’re specifying standard deviation. If we could specify variance, we would have multiplied 4 by just x.
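
As a quick sanity check (not part of the analysis, just arithmetic), we can confirm the two parameterizations describe the same amount of noise:


# squaring the standard deviation we used gives the variance: (2*sqrt(x))^2 = 4*x
all.equal((2 * sqrt(x))^2, 4 * x)   # should return TRUE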

If we fit the same model using lm(), we get a residual standard error of 4.488.


m2 <- lm(y ~ x)
summary(m2)


  Call:
  lm(formula = y ~ x)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -15.0460  -2.4013   0.5638   2.8734  10.2996 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)    1.313      1.042    1.26    0.211    
  x              2.096      0.171   12.26   <2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 4.488 on 98 degrees of freedom
  Multiple R-squared:  0.6051, Adjusted R-squared:  0.6011 
  F-statistic: 150.2 on 1 and 98 DF,  p-value: < 2.2e-16

We know that isn’t right, because we simulated the data. There was no constant standard deviation when we created the “noise.” Each random value was drawn from a different normal distribution, each with mean 0 but a standard deviation that varied according to x. This means our assumption of constant variance is violated. How would we detect this in real life?

The most common way is plotting residuals versus fitted values. This is easy to do in R. Just call plot() on the model object. This generates four different plots to assess the traditional modeling assumptions. See this blog post for more information. The plots we’re interested in are the 1st and 3rd plots, which we can specify with the which argument.


plot(m2, which = 1)

residuals against fitted values, non-constant variance model predicting y from x


plot(m2, which = 3)

standardized residuals against fitted values, non-constant variance model predicting y from x

In the first plot we see the variability around 0 increase as we move further to the right with bigger fitted values. In the third plot we also see increasing variability as we move to the right, though this time the residuals have been standardized and transformed to the square root scale. The positive trajectory of the smooth red line indicates an increase in variance.

So now that we’ve confirmed our assumption of constant variance is violated, what can we do? One approach is to log transform the data. This sometimes works if your response variable is positive and highly skewed. But that’s not really the case here: y is only slightly skewed, and we know our data are not on a log scale anyway.
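
A quick histogram confirms that y is only mildly skewed:


hist(y)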

Another approach is to model the non-constant variance. That’s the topic of this blog post.

To do this, we will use functions from the nlme package that is included with the base installation of R. The workhorse function is gls(), which stands for “generalized least squares.” We use it just as we would the lm() function, except we also use the weights argument along with a handful of variance functions. Let’s demonstrate with an example.

Below we fit the “correct” model to our data that exhibited non-constant variance. We load the nlme package so we can use the gls() function.1 We specify the model syntax as before, y ~ x. Then we use the weights argument to specify the variance function, in this case varFixed(), also part of the nlme package. This says our variance function has no parameters and a single covariate, x, which is precisely how we generated the data. The syntax, ~x, is a one-sided formula that can be read as “model variance as a function of x.”


library(nlme)
vm1 <- gls(y ~ x, weights = varFixed(~x))
summary(vm1)


  Generalized least squares fit by REML
    Model: y ~ x 
    Data: NULL 
         AIC      BIC    logLik
    576.2928 584.0477 -285.1464
  
  Variance function:
   Structure: fixed weights
   Formula: ~x 
  
  Coefficients:
                 Value Std.Error   t-value p-value
  (Intercept) 1.369583 0.6936599  1.974431  0.0511
  x           2.085425 0.1504863 13.857900  0.0000
  
   Correlation: 
    (Intr)
  x -0.838
  
  Standardized residuals:
         Min         Q1        Med         Q3        Max 
  -2.8942967 -0.6293867  0.1551594  0.6758773  2.2722755 
  
  Residual standard error: 1.925342 
  Degrees of freedom: 100 total; 98 residual

The result produces a residual standard error of 1.925 that is very close to 2, the “true” value we used to generate the data. The intercept and slope values, 1.37 and 2.09, are also very close to 1.2 and 2.1.

Once again we can call plot() on the model object. In this case only one plot is generated: standardized residuals versus fitted values:


plot(vm1)

standardized residuals against fitted values, model fit with gls() to accommodate non-constant variance

This plot looks good. As long as we model our variance as a function of x, the fitted model neither overfits nor underfits in any systematic way (unlike when we used lm() to fit the model and assumed constant variance).

The varFixed() function creates weights for each observation, symbolized as \(w_i\). The idea is that the higher the weight a given observation has, the smaller the variance of the normal distribution from which its noise component was drawn. In our example, as x increases the variance increases. Therefore higher weights should be associated with smaller x values. This can be accomplished by taking the reciprocal of x, that is \(w_i = 1/x\). So when \(x = 2\), \(w = 1/2\). When \(x = 10\), \(w = 1/10\). Larger x get smaller weights.

Finally, to ensure larger weights are associated with smaller variances, we divide the constant variance by the weight. Stated mathematically,

\[\epsilon_i \sim N(0, \sigma^2/w_i)\]

Or taking the square root to express standard deviation,

\[\epsilon_i \sim N(0, \sigma/ \sqrt{w_i})\]

So the larger the denominator (i.e., the larger the weight and, hence, the smaller the x), the smaller the variance and the more precise the observed y.
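
We can check this against our simulation: with \(w_i = 1/x_i\), dividing \(\sigma = 2\) by \(\sqrt{w_i}\) reproduces exactly the standard deviations we used to generate the noise.


# sigma / sqrt(w) recovers the sd = 2 * sqrt(x) used in the simulation
w <- 1/x
all.equal(2 / sqrt(w), 2 * sqrt(x))   # should return TRUE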

Incidentally, we can use lm() to weight observations as well. It too has a weights argument like the gls() function. The only difference is we have to be more explicit about how we express the weights. In this case, we have to specify the reciprocal of x. Notice the result is almost identical to what we get using gls() and varFixed().


m3 <- lm(y ~ x, weights = 1/x)
summary(m3)


  Call:
  lm(formula = y ~ x, weights = 1/x)
  
  Weighted Residuals:
      Min      1Q  Median      3Q     Max 
  -5.5725 -1.2118  0.2987  1.3013  4.3749 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   1.3696     0.6937   1.974   0.0511 .  
  x             2.0854     0.1505  13.858   <2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 1.925 on 98 degrees of freedom
  Multiple R-squared:  0.6621, Adjusted R-squared:  0.6587 
  F-statistic:   192 on 1 and 98 DF,  p-value: < 2.2e-16

The power of the nlme package is that it allows a variety of variance functions. The varFixed() function we just illustrated is the simplest and something that can be done with lm(). The other variance functions include:

  • varIdent()
  • varPower()
  • varExp()
  • varConstPower()
  • varComb()

Let’s explore each of these using simulated data.

varIdent()

The varIdent() function allows us to model different variances per stratum. To see how it works, we’ll first simulate data with this property. Below we use set.seed(11) in case someone wants to simulate the same random data. Then we set n equal to 400, the number of observations we will simulate. x is generated the same as before. In this example, we include an additional predictor called g which can be thought of as gender. We randomly sample 400 values of “m” and “f”. Next we create a vector of standard deviations, one per observation, and save it as msd. If g == "m", then the standard deviation is \(2 \times 2.5\). Otherwise it’s \(2\) for g == "f". We use this vector in the next line to generate y. Notice where msd is plugged into the rnorm() function. Also notice how we generate y: we include an interaction between x and g. When g == "f", the intercept and slope are 1.2 and 2.1. When g == "m", the intercept and slope are (1.2 - 1.5) and (2.1 + 2.8), respectively. Finally we place our simulated data into a data frame and plot with ggplot2.


set.seed(11)
n <- 400
x <- seq(1,10, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)

library(ggplot2)
ggplot(d, aes(x, y, color = g)) + 
  geom_point()


y against x, split up by m and f groups, non-constant variance between groups

Notice the different variances for each level of g. The variance for “m” is much greater than the variance for “f”; the “m” points have much more scatter. We simulated the data that way: we set the standard deviation for “m” to be 2.5 times that of “f”.

Now let’s use the gls() function with varIdent() in an attempt to recover these true values. We follow the same approach as before, defining our variance function in the weights argument. Below we specify in the form argument that the formula for modeling variance is conditional on g. The expression ~ 1|g is a one-sided formula that says the variance differs between the levels of g. The 1 just means we have no additional covariates in our model for variance. It is possible to include a formula such as ~ x|g, but that would be incorrect in this case since we did not use x in generating the variance. Looking at the plot also shows that while the variability in y is different between the groups, the variability does not increase as x increases.


vm2 <- gls(y ~ x*g, data = d, weights = varIdent(form = ~ 1|g))
summary(vm2)


  Generalized least squares fit by REML
    Model: y ~ x * g 
    Data: d 
         AIC     BIC    logLik
    2081.752 2105.64 -1034.876
  
  Variance function:
   Structure: Different standard deviations per stratum
   Formula: ~1 | g 
   Parameter estimates:
        f       m 
  1.00000 2.58127 
  
  Coefficients:
                   Value Std.Error  t-value p-value
  (Intercept)  0.9686349 0.3237724  2.99172  0.0029
  x            2.1222707 0.0525024 40.42239  0.0000
  gm          -1.9765090 0.9352152 -2.11343  0.0352
  x:gm         2.9957974 0.1553551 19.28355  0.0000
  
   Correlation: 
       (Intr) x      gm    
  x    -0.901              
  gm   -0.346  0.312       
  x:gm  0.304 -0.338 -0.907
  
  Standardized residuals:
          Min          Q1         Med          Q3         Max 
  -2.74115039 -0.67013954  0.01619031  0.69793776  2.72985748 
  
  Residual standard error: 2.005397 
  Degrees of freedom: 400 total; 396 residual

First, notice at the bottom of the output that the estimated residual standard error is 2.005397, very close to the “true” value of 2. Also notice in the “Variance function” section we get an estimated value of 2.58127 for group “m”, which is very close to the “true” value of 2.5 we used to generate the different variance for group “m”. In general, when you use the varIdent() function to estimate different variances between levels of strata, one of the levels will be set to baseline, and the others will be estimated as multiples of the residual standard error. In this case since “f” comes before “m” alphabetically, “f” was set to baseline, or 1. The estimated residual standard error for group “f” is \(2.005397 \times 1\). The estimated residual standard error for group “m” is \(2.005397 \times 2.58127\).
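
If you want to compute those two group-level standard deviations directly from the fitted object, one way (a small sketch that reaches into the model structure stored by gls(), rather than an official accessor) is to multiply the residual standard error by the estimated variance-function coefficients:


# residual SE times the varIdent multipliers: about 2.01 for "f" and 5.18 for "m"
vm2$sigma * coef(vm2$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)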

It’s important to note these are just estimates. To get a feel for the uncertainty of these estimates, we can use the intervals() function from the nlme package to get 95% confidence intervals. To reduce the output, we save the result and view selected elements of the object.


int <- intervals(vm2)

The varStruct element contains the 95% confidence interval for the parameter estimated in the variance function. The parameter in this case is the residual standard error multiplier for the male group.


int$varStruct


       lower    est.    upper
  m 2.245615 2.58127 2.967095
  attr(,"label")
  [1] "Variance function:"

The sigma element contains the 95% confidence interval for the residual standard error.


int$sigma


     lower     est.    upper 
  1.818639 2.005397 2.211335 
  attr(,"label")
  [1] "Residual standard error:"

Both intervals are quite small and contain the “true” value we used to generate the data.

varPower()

The varPower() function allows us to model variance as a power of the absolute value of a covariate. Once again, to help explain this we will simulate data with this property. Below, the main thing to notice is the sd argument of the rnorm() function. That’s where we take the standard deviation of 2 and then multiply it by the absolute value of x raised to the power of 1.5. This is similar to how we simulated data to demonstrate the varFixed() function above. In that case we simply assumed the exponent was 0.5. (Recall taking the square root of a number is equivalent to raising it to the power of 0.5.) Here we arbitrarily picked a power of 1.5. When we use gls() with varPower() we will attempt to recover the “true” value of 1.5 in addition to 2.


set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*abs(x)^1.5)
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()


y against x, variance of y is a power of x

We see the data exhibiting the classic form of variance increasing as the predictor increases. To correctly model this data using gls(), we supply it with the formula y ~ x and use the weights argument with varPower(). Notice we specify the one-sided formula just as we did with the varFixed() function. In this model, however, we’ll get an estimate for the power.


vm3 <- gls(y ~ x, data = d, weights = varPower(form = ~x))
summary(vm3)


  Generalized least squares fit by REML
    Model: y ~ x 
    Data: d 
         AIC      BIC    logLik
    8840.188 8859.811 -4416.094
  
  Variance function:
   Structure: Power of variance covariate
   Formula: ~x 
   Parameter estimates:
     power 
  1.520915 
  
  Coefficients:
                 Value Std.Error  t-value p-value
  (Intercept) 2.321502 0.4750149 4.887218       0
  x           1.582272 0.2241367 7.059404       0
  
   Correlation: 
    (Intr)
  x -0.845
  
  Standardized residuals:
           Min           Q1          Med           Q3          Max 
  -2.892319514 -0.655237190  0.001653178  0.691241690  3.346273816 
  
  Residual standard error: 1.872944 
  Degrees of freedom: 1000 total; 998 residual


int <- intervals(vm3)
int$varStruct


           lower     est.    upper
  power 1.445064 1.520915 1.596765
  attr(,"label")
  [1] "Variance function:"


int$sigma


     lower     est.    upper 
  1.650987 1.872944 2.124740 
  attr(,"label")
  [1] "Residual standard error:"

The power is estimated to be 1.52, which is very close to the “true” value of 1.5. The estimated residual standard error of 1.872944 is also quite close to 2. Both intervals capture the values we used to simulate the data. The coefficients in the model, on the other hand, are somewhat poorly estimated. This is not surprising given how much variability is in y, especially for \(x > 2\).

varExp()

The varExp() function allows us to model variance as an exponential function of a covariate. Yet again we’ll explain this variance function using simulated data. The only change is in the sd argument of the rnorm() function: we take a fixed value of 2 and multiply it by the exponential of 0.5 times x. Notice how rapidly the variance increases as we increase x. This is due to the exponential growth of the variance.


set.seed(55)
n <- 400
x <- seq(1, 10, length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*exp(0.5*x))
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()

y against x, variance of y is an exponential function of x

To work backwards and recover these values, we use the varExp() function in the weights argument of gls(). The one-sided formula does not change in this case. It says model the variance as a function of x. The varExp() function says that x has been multiplied by some value and exponentiated, so gls() will try to estimate that value.


vm4 <- gls(y ~ x, data = d, weights = varExp(form = ~x))
summary(vm4)


  Generalized least squares fit by REML
    Model: y ~ x 
    Data: d 
         AIC      BIC    logLik
    3873.153 3889.098 -1932.576
  
  Variance function:
   Structure: Exponential of variance covariate
   Formula: ~x 
   Parameter estimates:
      expon 
  0.4845623 
  
  Coefficients:
                 Value Std.Error  t-value p-value
  (Intercept) 1.198525 1.1172841 1.072713   0.284
  x           2.073297 0.4933502 4.202486   0.000
  
   Correlation: 
    (Intr)
  x -0.892
  
  Standardized residuals:
          Min          Q1         Med          Q3         Max 
  -3.16976627 -0.69525696  0.00614222  0.63718022  2.90423217 
  
  Residual standard error: 2.119192 
  Degrees of freedom: 400 total; 398 residual


int <- intervals(vm4)
int$varStruct


            lower      est.     upper
  expon 0.4562871 0.4845623 0.5128374
  attr(,"label")
  [1] "Variance function:"


int$sigma


     lower     est.    upper 
  1.786737 2.119192 2.513505 
  attr(,"label")
  [1] "Residual standard error:"

The “expon” estimate of 0.4845623 in the “Variance function” section is very close to our specified value of 0.5. Likewise, the estimated residual standard error of 2.119192 is close to the “true” value of 2. The model coefficient estimates are also close to the values we used to generate the data, but notice the uncertainty in the intercept. The hypothesis test in the summary can’t rule out a negative intercept. This again is not surprising since y has so much non-constant variance and since we have no observations of x equal to 0. Since the intercept is the value y takes when x equals 0, our estimated intercept is an extrapolation to an event we did not observe.

varConstPower()

The varConstPower() function allows us to model variance as a positive constant plus a power of the absolute value of a covariate. That’s a mouthful, but this is basically the same as the varPower() function except now we add a positive constant to it. The following code simulates data for which the varConstPower() function would be suitable to use. Notice it is identical to the code we used to simulate data for the varPower() section, except we add 0.7 to abs(x)^1.5. Why 0.7? No special reason. It’s just a positive constant we picked.


set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*(0.7 + abs(x)^1.5))
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()


y against x, variance of y is a power of x plus a constant

The correct way to model this data is to use gls() with the varConstPower() function. The one-sided formula is the same as before. The summary returns three estimates for the variance model: the constant, the power and the residual standard error. Recall the “true” values are 0.7, 1.5 and 2, respectively.


vm5 <- gls(y ~ x, data = d, weights = varConstPower(form = ~x))
summary(vm5)


  Generalized least squares fit by REML
    Model: y ~ x 
    Data: d 
         AIC      BIC    logLik
    9033.253 9057.782 -4511.627
  
  Variance function:
   Structure: Constant plus power of variance covariate
   Formula: ~x 
   Parameter estimates:
      const     power 
  0.3388157 1.4463162 
  
  Coefficients:
                 Value Std.Error  t-value p-value
  (Intercept) 2.550278 0.6140015 4.153538       0
  x           1.527665 0.2517976 6.067037       0
  
   Correlation: 
    (Intr)
  x -0.831
  
  Standardized residuals:
           Min           Q1          Med           Q3          Max 
  -2.843996437 -0.660034231  0.000752882  0.681577120  3.299594323 
  
  Residual standard error: 2.199816 
  Degrees of freedom: 1000 total; 998 residual


int <- intervals(vm5)
int$varStruct


            lower      est.    upper
  const 0.6010367 0.8170943 1.110819
  power 1.4849924 1.5343144 1.583636
  attr(,"label")
  [1] "Variance function:"


int$sigma


     lower     est.    upper 
  1.698709 1.877371 2.074822 
  attr(,"label")
  [1] "Residual standard error:"

The intervals are fairly tight and contain the “true” values we used to generate the data.

varComb()

Finally, the varComb() function allows us to model combinations of two or more variance models by multiplying together the corresponding variance functions. Obviously this can accommodate some very complex variance models. We’ll simulate some basic data that would be appropriate to use with the varComb() function.

What we do below is combine two different variance processes:

  • One that allows standard deviation to differ between gender (“f” = \(2\), “m” = \(2 \times 2.5\))
  • One that allows standard deviation to increase as x increases, where x is multiplied by 1.5 and exponentiated

To help with visualizing the data we have limited x to range from 1 to 3.


set.seed(77)
n <- 400
x <- seq(1, 3, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2) * exp(1.5*x)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)
ggplot(d, aes(x, y, color = g)) + 
  geom_point()


y against x split up by m and f groups, variance of y differs by group and as an exponential function of x

The plot shows increasing variance as x increases, but also differences in variance between genders. The variance of y for group “m” is much greater than the variance of y in group “f”, especially when x is greater than 1.5.

To correctly model the data generating process we specified above and attempt to recover the true values, we use the varComb() function as a wrapper around two more variance functions: varIdent() and varExp(). Why these two? Because we have different variances between genders, and we have variance increasing exponentially as a function of x.


vm6 <- gls(y ~ x*g, data = d, 
           weights = varComb(varIdent(form = ~ 1|g),
                             varExp(form = ~x)))
summary(vm6)


  Generalized least squares fit by REML
    Model: y ~ x * g 
    Data: d 
        AIC     BIC    logLik
    4431.45 4459.32 -2208.725
  
  Combination of variance functions: 
   Structure: Different standard deviations per stratum
   Formula: ~1 | g 
   Parameter estimates:
         f        m 
  1.000000 2.437046 
   Structure: Exponential of variance covariate
   Formula: ~x 
   Parameter estimates:
     expon 
  1.540608 
  
  Coefficients:
                   Value Std.Error    t-value p-value
  (Intercept)  -5.996337  6.539552 -0.9169339  0.3597
  x             8.829971  4.849272  1.8208860  0.0694
  gm          -17.572238 16.548540 -1.0618603  0.2889
  x:gm         16.932938 12.192793  1.3887661  0.1657
  
   Correlation: 
       (Intr) x      gm    
  x    -0.972              
  gm   -0.395  0.384       
  x:gm  0.387 -0.398 -0.974
  
  Standardized residuals:
          Min          Q1         Med          Q3         Max 
  -2.74441479 -0.67478610 -0.00262221  0.66079254  3.14975288 
  
  Residual standard error: 1.794185 
  Degrees of freedom: 400 total; 396 residual

The summary output contains two sections describing the variance model. The first estimates the multiplier for the “m” group to be 2.437, which is very close to the true value of 2.5. The exponential parameter is estimated to be 1.54, extremely close to the true value of 1.5 we used when generating the data. Finally the residual standard error is estimated to be about 1.79, close to the true value of 2. Calling intervals(vm6) shows very tight confidence intervals. The A and B prefixes in the varStruct element are just labels for the two different variance models.


int <- intervals(vm6)
int$varStruct


             lower     est.    upper
  A.m     2.119874 2.437046 2.801673
  B.expon 1.419366 1.540608 1.661850
  attr(,"label")
  [1] "Variance function:"


int$sigma


     lower     est.    upper 
  1.380008 1.794185 2.332669 
  attr(,"label")
  [1] "Residual standard error:"

Unfortunately, due to the large exponential variability, the estimates of the model coefficients are woefully bad.
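
To see just how far off they are, we can line the estimates up against the values we used to simulate the data (coef() on a gls fit returns the estimated coefficients):


# true coefficients versus the estimates reported in the summary above
cbind(true = c(1.2, 2.1, -1.5, 2.8), estimate = coef(vm6))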

What’s the point?

So why did we simulate all this fake data and then attempt to recover “true” values? What good is that? Fair questions. The answer is it helps us understand what assumptions we’re making when we specify and fit a model. When we specify and fit the following model…


m <- lm(y ~ x1 + x2)

…we’re saying we think y is approximately equal to a weighted sum of x1 and x2 (plus an intercept), with errors being random draws from a normal distribution with mean of 0 and some fixed standard deviation. Likewise, if we specify and fit the following model…


m <- gls(y ~ x1 + x2, data = d, weights = varPower(form = ~x1))

…we’re saying we think y is approximately equal to a weighted sum of x1 and x2 (plus an intercept), with errors being random draws from a normal distribution with mean of 0 and a standard deviation that grows as a multiple of x1 raised by some power.
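
Stated with the notation we used earlier, and writing \(\delta\) for the power that varPower() estimates (following its definition in Pinheiro & Bates, 2000), the two sets of assumptions are

\[\text{lm(): } \quad y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]

\[\text{gls() with varPower(): } \quad y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2 |x_{1i}|^{2\delta})\]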

If we can simulate data suitable for those models, then we have a better understanding of the assumptions we’re making when we use those models. Hopefully you now have a better understanding of what you can do to model variance using the nlme package.


References

  • Pinheiro, J., & Bates, D. (2000). Mixed-effects models in S and S-PLUS. Springer. https://doi.org/10.1007/b98882
  • Pinheiro, J., Bates, D., & R Core Team. (2020). nlme: Linear and nonlinear mixed effects models (Version 3.1-162). https://cran.r-project.org/package=nlme
  • R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/
  • Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag.

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 07, 2020


  1. The nlme package is perhaps better known for its lme() function, which is used to fit mixed-effect models (i.e., models with both fixed and random effects). This blog post demonstrates variance functions using gls(), which does not fit random effects. However, everything we present in this blog post can be used with lme().

For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.