One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.
Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that is a function of x. When we plot x vs y, we get a straight line with an intercept of 1.2 and a slope of 2.1.
x <- seq(1,10, length.out = 100)
y <- 1.2 + 2.1 * x
plot(x, y)
Now let’s add some “noise” to our data so that y is not completely determined by x. We can do that by randomly drawing values from a theoretical Normal distribution with mean 0 and some set variance, and then adding them to the formula that generates y. The rnorm function in R allows us to easily do this. Below we draw 100 random values from a Normal distribution with mean 0 and standard deviation 2 and save them as a vector called noise. (Recall that standard deviation is simply the square root of variance.) Then we generate y with the noise added. The function set.seed(222) allows you to get the same “random” data in case you want to follow along. Finally we re-plot the data.
set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2)
y <- 1.2 + 2.1 * x + noise
plot(x, y)
Now we have data that is a combination of a linear, deterministic component (\(y = 1.2 + 2.1x\)) and random noise drawn from a Normal distribution with mean 0 and standard deviation 2. These are the basic assumptions we make about our data when we fit a traditional linear model. Below we use the lm function to “recover” the “true” values we used to generate the data, of which there are three:
- The intercept: 1.2
- The slope: 2.1
- The standard deviation: 2
m <- lm(y ~ x)
summary(m)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5831 -1.2165 0.3288 1.3022 4.3714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.27426 0.44720 2.849 0.00534 **
## x 2.09449 0.07338 28.541 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.926 on 98 degrees of freedom
## Multiple R-squared: 0.8926, Adjusted R-squared: 0.8915
## F-statistic: 814.6 on 1 and 98 DF, p-value: < 2.2e-16
The (Intercept) and x (ie, slope) estimates in the Coefficients section, 1.27 and 2.09, are quite close to 1.2 and 2.1, respectively. The Residual standard error of 1.926 is also quite close to the true standard deviation of 2. We produced a “good” model because we knew how y was generated and gave the lm function the “correct” model to fit. While we can’t do this in real life, it’s a useful exercise to help us understand the assumptions of linear modeling.
Now what if the variance was not constant? What if we multiplied the standard deviation of 2 by the square root of x? As we see in the plot below, the vertical scatter of the points increases as x increases.
set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2 * sqrt(x))
y <- 1.2 + 2.1 * x + noise
plot(x, y)
We multiplied 2 by sqrt(x) because we’re specifying standard deviation. If we could specify variance, we would have multiplied 4 by just x.
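As a quick sanity check, we can confirm in R that the two specifications agree:
# (2*sqrt(x))^2 equals 4*x, up to floating-point error
all.equal((2 * sqrt(x))^2, 4 * x)  # TRUE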
If we fit the same model using lm, we get a Residual standard error of 4.488.
m2 <- lm(y ~ x)
summary(m2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.0460 -2.4013 0.5638 2.8734 10.2996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.313 1.042 1.26 0.211
## x 2.096 0.171 12.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 98 degrees of freedom
## Multiple R-squared: 0.6051, Adjusted R-squared: 0.6011
## F-statistic: 150.2 on 1 and 98 DF, p-value: < 2.2e-16
We know that isn’t right, because we simulated the data. There was no constant standard deviation when we created the “noise”. Each random value was drawn from a different Normal distribution, each with mean 0 but a standard deviation that varied according to x. This means our assumption of constant variance is violated. How would we detect this in real life?
The most common way is plotting residuals versus fitted values. This is easy to do in R: just call plot on the model object. This generates four different plots to assess the traditional modeling assumptions. See this blog post for more information. The plots we’re interested in are the 1st and 3rd plots, which we can specify with the which argument.
plot(m2, which = 1)
plot(m2, which = 3)
In the first plot we see the variability around 0 increase as we move further to the right with bigger fitted values. In the third plot we also see increasing variability as we move to the right, though this time the residuals have been standardized and transformed to the square root scale. The positive trajectory of the smooth red line indicates an increase in variance.
So now that we’ve confirmed our assumption of constant variance is violated, what can we do? One approach is to log transform the data. This sometimes works if your response variable is positive and highly skewed. But that’s not really the case here; y is only slightly skewed. (Call hist() on y to verify.) Plus we know our data is not on a log scale.
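If you want to run that quick check:
hist(y)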
Another approach is to model the non-constant variance. That’s the topic of this blog post.
To do this, we will use functions from the nlme package that is included with the base installation of R. The workhorse function is gls, which stands for “generalized least squares”. We use it just as we would the lm function, except we also use the weights argument along with a handful of variance functions. Let’s demonstrate with an example.
Below we fit the “correct” model to our data that exhibited non-constant variance. We load the nlme package so we can use the gls function.¹ We specify the model syntax as before, y ~ x. Then we use the weights argument to specify the variance function, in this case varFixed, also part of the nlme package. This says our variance function has no parameters and a single covariate, x, which is precisely how we generated the data. The syntax, ~x, is a one-sided formula that can be read as “model variance as a function of x.”
library(nlme)
vm1 <- gls(y ~ x, weights = varFixed(~x))
summary(vm1)
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## AIC BIC logLik
## 576.2928 584.0477 -285.1464
##
## Variance function:
## Structure: fixed weights
## Formula: ~x
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 1.369583 0.6936599 1.974431 0.0511
## x 2.085425 0.1504863 13.857900 0.0000
##
## Correlation:
## (Intr)
## x -0.838
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.8942967 -0.6293867 0.1551594 0.6758773 2.2722755
##
## Residual standard error: 1.925342
## Degrees of freedom: 100 total; 98 residual
The result produces a Residual standard error of 1.925 that is very close to 2, the “true” value we used to generate the data. The (Intercept) and slope estimates, 1.37 and 2.09, are also very close to 1.2 and 2.1.
Once again we can call plot on the model object. In this case only one plot is generated: standardized residuals versus fitted values.
plot(vm1)
This plot looks good. As long as we model our variance as a function of x, the fitted model neither overfits nor underfits in any systematic sort of way (unlike when we used lm to fit the model and assumed constant variance).
The varFixed function creates weights for each observation, symbolized as \(w_i\). The idea is that the higher the weight a given observation has, the smaller the variance of the Normal distribution from which its noise component was drawn. In our example, as x increases the variance increases. Therefore higher weights should be associated with smaller x values. This can be accomplished by taking the reciprocal of x, that is \(w_i = 1/x\). So when \(x = 2\), \(w = 1/2\); when \(x = 10\), \(w = 1/10\). Larger x values get smaller weights.
Finally, to ensure larger weights are associated with smaller variances, we divide the constant variance by the weight. Stated mathematically,
\[\epsilon_i \sim N(0, \sigma^2/w_i)\]
Or taking the square root to express standard deviation,
\[\epsilon_i \sim N(0, \sigma/ \sqrt{w_i})\]
So the larger the denominator (ie, the larger the weight and hence the smaller the x), the smaller the variance and the more precise the observed y.
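To make this concrete, here is a small calculation using \(\sigma = 2\), the standard deviation from our simulation:
# implied standard deviation under w = 1/x: sigma/sqrt(w) = sigma*sqrt(x)
w <- 1 / c(1, 2, 10)   # weights for x = 1, 2, 10
2 / sqrt(w)            # implied standard deviations: 2, 2*sqrt(2), 2*sqrt(10)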
Incidentally we can use lm to weight observations as well. It too has a weights argument like the gls function. The only difference is we have to be more explicit about how we express the weights. In this case, we have to specify the reciprocal of x. Notice the result is almost identical to what we get using gls and varFixed.
m3 <- lm(y ~ x, weights = 1/x)
summary(m3)
##
## Call:
## lm(formula = y ~ x, weights = 1/x)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -5.5725 -1.2118 0.2987 1.3013 4.3749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3696 0.6937 1.974 0.0511 .
## x 2.0854 0.1505 13.858 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.925 on 98 degrees of freedom
## Multiple R-squared: 0.6621, Adjusted R-squared: 0.6587
## F-statistic: 192 on 1 and 98 DF, p-value: < 2.2e-16
The power of the nlme package is that it allows a variety of variance functions. The varFixed function we just illustrated is the simplest and something that can be done with lm. The other variance functions include:
- varIdent
- varPower
- varExp
- varConstPower
- varComb
Let’s explore each of these using simulated data.
varIdent
The varIdent function allows us to model different variances per stratum. To see how it works, we’ll first simulate data with this property. Below we use set.seed(11) in case someone wants to simulate the same random data. Then we set n equal to 400, the number of observations we will simulate. x is generated the same as before. In this example we include an additional predictor called g, which can be thought of as gender. We randomly sample 400 values of “m” and “f”. Next we simulate a vector of two standard deviations and save it as msd. If g == "m", the standard deviation is \(2 \times 2.5\); otherwise it’s \(2\) for g == "f". We use this vector in the next line to generate y. Notice where msd is plugged into the rnorm function. Also notice how we generate y: we include an interaction between x and g. When g == "f", the intercept and slope are 1.2 and 2.1. When g == "m", the intercept and slope are (1.2 - 1.5) and (2.1 + 2.8), respectively. Finally we place our simulated data into a data frame and plot it with ggplot2.
set.seed(11)
n <- 400
x <- seq(1,10, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)
library(ggplot2)
ggplot(d, aes(x, y, color = g)) +
geom_point()
Notice the different variances for each level of g. The variance for “m” is much greater than the variance for “f”; it has a lot more scatter. We simulated the data that way: we set the standard deviation for “m” to be 2.5 times that of “f”.
Now let’s use the gls function with varIdent in an attempt to recover these true values. We proceed the same way as before: define our variance function in the weights argument. Below we specify in the form argument that the formula for modeling variance is conditional on g. The expression ~ 1|g is a one-sided formula that says the variance differs between the levels of g. The 1 just means we have no additional covariates in our model for variance. It is possible to include a formula such as ~ x|g, but that would be incorrect in this case since we did not use x in generating the variance. Looking at the plot also shows that while the variability in y differs between the groups, the variability does not increase as x increases.
vm2 <- gls(y ~ x*g, data = d, weights = varIdent(form = ~ 1|g))
summary(vm2)
## Generalized least squares fit by REML
## Model: y ~ x * g
## Data: d
## AIC BIC logLik
## 2081.752 2105.64 -1034.876
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1 | g
## Parameter estimates:
## f m
## 1.00000 2.58127
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.9686349 0.3237724 2.99172 0.0029
## x 2.1222707 0.0525024 40.42239 0.0000
## gm -1.9765090 0.9352152 -2.11343 0.0352
## x:gm 2.9957974 0.1553551 19.28355 0.0000
##
## Correlation:
## (Intr) x gm
## x -0.901
## gm -0.346 0.312
## x:gm 0.304 -0.338 -0.907
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.74115039 -0.67013954 0.01619031 0.69793776 2.72985748
##
## Residual standard error: 2.005397
## Degrees of freedom: 400 total; 396 residual
First notice at the bottom of the output that the estimated residual standard error is 2.005397, very close to the “true” value of 2. Also notice in the “Variance function” section that we get an estimated value of 2.58127 for group “m”, which is very close to the “true” value of 2.5 we used to generate the different standard deviation for group “m”. In general, when you use the varIdent function to estimate different variances between levels of strata, one of the levels will be set to baseline and the others will be estimated as multiples of the residual standard error. In this case, since “f” comes before “m” alphabetically, “f” was set to baseline, or 1. The estimated residual standard error for group “f” is \(2.005397 \times 1\). The estimated residual standard error for group “m” is \(2.005397 \times 2.58127\).
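If you want to pull these numbers out of the fitted object yourself, one way is sketched below: vm2$sigma holds the estimated residual standard error, and the varIdent coefficient is the multiplier for “m”.
# residual standard deviation for the baseline group ("f")
vm2$sigma
# residual standard deviation for group "m": baseline times the estimated multiplier
vm2$sigma * coef(vm2$modelStruct$varStruct, unconstrained = FALSE)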
It’s important to note these are just estimates. To get a feel for the uncertainty of these estimates, we can use the intervals function from the nlme package to get 95% confidence intervals. To reduce the output, we save the result and view selected elements of the object.
int <- intervals(vm2)
The varStruct element contains the 95% confidence interval for the parameter estimated in the variance function. The parameter in this case is the residual standard error multiplier for the male group.
int$varStruct
## lower est. upper
## m 2.245615 2.58127 2.967095
## attr(,"label")
## [1] "Variance function:"
The sigma element contains the 95% confidence interval for the residual standard error.
int$sigma
## lower est. upper
## 1.818639 2.005397 2.211335
## attr(,"label")
## [1] "Residual standard error:"
Both intervals are quite small and contain the “true” value we used to generate the data.
varPower
The varPower function allows us to model variance as a power of the absolute value of a covariate. Once again, to help explain this we will simulate data with this property. Below the main thing to notice is the sd argument of the rnorm function. That’s where we take the standard deviation of 2 and multiply it by the absolute value of x raised to the power of 1.5. This is similar to how we simulated data to demonstrate the varFixed function above. In that case we simply assumed the exponent was 0.5. (Recall that taking the square root of a number is equivalent to raising it to the power of 0.5.) Here we arbitrarily picked a power of 1.5. When we use gls with varPower, we will attempt to recover the “true” value of 1.5 in addition to 2.
set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*abs(x)^1.5)
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
We see the data exhibiting the classic form of variance increasing as the predictor increases. To correctly model this data using gls, we supply it with the formula y ~ x and use the weights argument with varPower. Notice we specify the one-sided formula just as we did with the varFixed function. In this model, however, we’ll get an estimate for the power.
vm3 <- gls(y ~ x, data = d, weights = varPower(form = ~x))
summary(vm3)
## Generalized least squares fit by REML
## Model: y ~ x
## Data: d
## AIC BIC logLik
## 8840.188 8859.811 -4416.094
##
## Variance function:
## Structure: Power of variance covariate
## Formula: ~x
## Parameter estimates:
## power
## 1.520915
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 2.321502 0.4750149 4.887218 0
## x 1.582272 0.2241367 7.059404 0
##
## Correlation:
## (Intr)
## x -0.845
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.892319514 -0.655237190 0.001653178 0.691241690 3.346273816
##
## Residual standard error: 1.872944
## Degrees of freedom: 1000 total; 998 residual
int <- intervals(vm3)
int$varStruct
## lower est. upper
## power 1.445064 1.520915 1.596765
## attr(,"label")
## [1] "Variance function:"
int$sigma
## lower est. upper
## 1.650987 1.872944 2.124740
## attr(,"label")
## [1] "Residual standard error:"
The power is estimated to be 1.52, which is very close to the “true” value of 1.5. The estimated residual standard error of 1.872944 is also quite close to 2. Both intervals capture the values we used to simulate the data. The coefficients in the model, on the other hand, are somewhat poorly estimated. This is not surprising given how much variability there is in y, especially for \(x > 2\).
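If you want to see that uncertainty directly, the object returned by intervals also contains a coef element with 95% confidence intervals for the model coefficients:
int$coef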
varExp
The varExp function allows us to model variance as an exponential function of a covariate. Yet again we’ll explain this variance function using simulated data. The only change is in the sd argument of the rnorm function. We take a fixed value of 2 and multiply it by exp(0.5*x), that is, x multiplied by 0.5 and then exponentiated. Notice how rapidly the variance increases as we increase x. This is due to the exponential growth of the variance.
set.seed(55)
n <- 400
x <- seq(1, 10, length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*exp(0.5*x))
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
To work backwards and recover these values, we use the varExp function in the weights argument of gls. The one-sided formula does not change in this case. It says model the variance as a function of x. The varExp function says that x has been multiplied by some value and exponentiated, so gls will try to estimate that value.
vm4 <- gls(y ~ x, data = d, weights = varExp(form = ~x))
summary(vm4)
## Generalized least squares fit by REML
## Model: y ~ x
## Data: d
## AIC BIC logLik
## 3873.153 3889.098 -1932.576
##
## Variance function:
## Structure: Exponential of variance covariate
## Formula: ~x
## Parameter estimates:
## expon
## 0.4845623
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 1.198525 1.1172841 1.072713 0.284
## x 2.073297 0.4933502 4.202486 0.000
##
## Correlation:
## (Intr)
## x -0.892
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -3.16976627 -0.69525696 0.00614222 0.63718022 2.90423217
##
## Residual standard error: 2.119192
## Degrees of freedom: 400 total; 398 residual
int <- intervals(vm4)
int$varStruct
## lower est. upper
## expon 0.4562871 0.4845623 0.5128374
## attr(,"label")
## [1] "Variance function:"
int$sigma
## lower est. upper
## 1.786737 2.119192 2.513505
## attr(,"label")
## [1] "Residual standard error:"
The “expon” estimate of 0.4845623 in the “Variance function” section is very close to our specified value of 0.5. Likewise the estimated residual standard error of 2.119192 is close to the “true” value of 2. The model coefficient estimates are also close to the values we used to generate the data, but notice the uncertainty in the (Intercept): the hypothesis test in the summary can’t rule out a negative intercept. This again is not surprising, since y has so much non-constant variance and since we have no observations of x equal to 0. Since the intercept is the value y takes when x equals 0, our estimated intercept is an extrapolation to an event we did not observe.
varConstPower
The varConstPower function allows us to model variance as a positive constant plus a power of the absolute value of a covariate. That’s a mouthful, but this is basically the same as the varPower function except now we add a positive constant to it. The following code simulates data for which the varConstPower function would be suitable to use. Notice it is identical to the code we used to simulate data for the varPower section, except we add 0.7 to x in the abs function. Why 0.7? No special reason. It’s just a positive constant we picked.
set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*abs(0.7 + x)^1.5)
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
The correct way to model this data is to use gls with the varConstPower function. The one-sided formula is the same as before. The summary returns three estimates for the variance model: the constant, the power, and the residual standard error. Recall the “true” values are 0.7, 1.5 and 2, respectively.
vm5 <- gls(y ~ x, data = d, weights = varConstPower(form = ~x))
summary(vm5)
## Generalized least squares fit by REML
## Model: y ~ x
## Data: d
## AIC BIC logLik
## 9319.794 9344.323 -4654.897
##
## Variance function:
## Structure: Constant plus power of variance covariate
## Formula: ~x
## Parameter estimates:
## const power
## 0.3554921 1.3593016
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 2.763788 0.7607449 3.633003 3e-04
## x 1.456677 0.2915621 4.996112 0e+00
##
## Correlation:
## (Intr)
## x -0.824
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.846999539 -0.664551194 0.006709051 0.679895874 3.287366683
##
## Residual standard error: 2.887761
## Degrees of freedom: 1000 total; 998 residual
int <- intervals(vm5)
int$varStruct
## lower est. upper
## const 0.02658804 0.3554921 4.753063
## power 1.14258677 1.3593016 1.576016
## attr(,"label")
## [1] "Variance function:"
int$sigma
## lower est. upper
## 1.814757 2.887761 4.595195
## attr(,"label")
## [1] "Residual standard error:"
The intervals capture the “true” values, though the interval for the constant is quite wide. If we didn’t know the true value, it would seem the constant could plausibly be 2 or even 4.
varComb
Finally, the varComb function allows us to model combinations of two or more variance models by multiplying together the corresponding variance functions. Obviously this can accommodate some very complex variance models. We’ll simulate some basic data that would be appropriate to use with the varComb function.
What we do below is combine two different variance processes:
- One that allows standard deviation to differ between gender (“f” = \(2\), “m” = \(2 \times 2.5\))
- One that allows standard deviation to increase as x increases, where x is multiplied by 1.5 and exponentiated
To help with visualizing the data, we have limited x to range from 1 to 3.
set.seed(77)
n <- 400
x <- seq(1, 3, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2) * exp(1.5*x)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)
ggplot(d, aes(x, y, color = g)) +
geom_point()
The plot shows increasing variance as x increases, but also differences in variance between genders. The variance of y for group “m” is much greater than the variance of y in group “f”, especially when x is greater than 1.5.
To correctly model the data-generating process we specified above and attempt to recover the true values, we use the varComb function as a wrapper around two more variance functions: varIdent and varExp. Why these two? Because we have different variances between genders, and we have variance increasing exponentially as a function of x.
vm6 <- gls(y ~ x*g, data = d,
           weights = varComb(varIdent(form = ~ 1|g),
                             varExp(form = ~x)))
summary(vm6)
## Generalized least squares fit by REML
## Model: y ~ x * g
## Data: d
## AIC BIC logLik
## 4431.45 4459.32 -2208.725
##
## Combination of variance functions:
## Structure: Different standard deviations per stratum
## Formula: ~1 | g
## Parameter estimates:
## f m
## 1.000000 2.437046
## Structure: Exponential of variance covariate
## Formula: ~x
## Parameter estimates:
## expon
## 1.540608
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -5.996337 6.539552 -0.9169339 0.3597
## x 8.829971 4.849272 1.8208860 0.0694
## gm -17.572238 16.548540 -1.0618603 0.2889
## x:gm 16.932938 12.192793 1.3887661 0.1657
##
## Correlation:
## (Intr) x gm
## x -0.972
## gm -0.395 0.384
## x:gm 0.387 -0.398 -0.974
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.74441479 -0.67478610 -0.00262221 0.66079254 3.14975288
##
## Residual standard error: 1.794185
## Degrees of freedom: 400 total; 396 residual
The summary output contains two sections for modeling variance. The first estimates the multiplier for the “m” group to be 2.437, which is very close to the true value of 2.5. The exponential parameter is estimated to be 1.54, extremely close to the true value of 1.5 we used when generating the data. Finally the residual standard error is estimated to be about 1.79, close to the true value of 2. Calling intervals(vm6) shows very tight confidence intervals. The A and B prefixes in the varStruct element are just labels for the two different variance models.
int <- intervals(vm6)
int$varStruct
## lower est. upper
## A.m 2.119874 2.437046 2.801673
## B.expon 1.419366 1.540608 1.661850
## attr(,"label")
## [1] "Variance function:"
int$sigma
## lower est. upper
## 1.380008 1.794185 2.332669
## attr(,"label")
## [1] "Residual standard error:"
Unfortunately due to the large exponential variability, the estimates of the model coefficients are woefully bad.
What’s the point?
So why did we simulate all this fake data and then attempt to recover “true” values? What good is that? Fair questions. The answer is it helps us understand what assumptions we’re making when we specify and fit a model. When we specify and fit the following model…
m <- lm(y ~ x1 + x2)
…we’re saying we think y is approximately equal to a weighted sum of x1 and x2 (plus an intercept) with errors being random draws from a Normal distribution with mean of 0 and some fixed standard deviation. Likewise, if we specify and fit the following model…
m <- gls(y ~ x1 + x2, data = d, weights = varPower(form = ~x1))
…we’re saying we think y is approximately equal to a weighted sum of x1 and x2 (plus an intercept) with errors being random draws from a Normal distribution with mean of 0 and a standard deviation that grows as a multiple of x1 raised to some power.
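To make that concrete, here is a sketch of how you might simulate data consistent with that second model. The coefficients, residual standard deviation, and power below are arbitrary values chosen just for illustration:
set.seed(1)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
b0 <- 1; b1 <- 2; b2 <- -0.5   # hypothetical regression coefficients
sigma <- 2; delta <- 1.2       # hypothetical residual SD and power
# standard deviation grows as sigma * |x1|^delta, as in the varPower examples above
y <- b0 + b1*x1 + b2*x2 + rnorm(n, sd = sigma * abs(x1)^delta)
d <- data.frame(y, x1, x2)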
If we can simulate data suitable for those models, then we have a better understanding of the assumptions we’re making when we use those models. Hopefully you now have a better understanding of what you can do to model variance using the nlme package.
For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu
View the entire collection of UVA Library StatLab articles.
Clay Ford
Statistical Research Consultant
University of Virginia Library
April 7, 2020
-
1. The nlme package is perhaps better known for its lme function, which is used to fit mixed-effect models (ie, models with both fixed and random effects). This blog post demonstrates variance functions using gls, which does not fit random effects. However, everything we present in this blog post can be used with lme.