# Latest Posts

## Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is the method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, “why not just run separate regressions for each dependent variable?” That’s actually a good idea! And in fact that’s pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.

To get started, let’s read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. This data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:

GEN, gender (male = 0, female = 1)
AMT, amount of drug taken at time of overdose
PR, PR wave measurement
DIAP, diastolic blood pressure
QRS, QRS wave measurement

We’ll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.

ami_data <- read.table("http://static.lib.virginia.edu/statlab/materials/data/ami_data.DAT")
names(ami_data) <- c("TOT","AMI","GEN","AMT","PR","DIAP","QRS")



Before going further you may wish to explore the data using the summary and pairs functions.

summary(ami_data)
pairs(ami_data)



Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and “binds” them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition per se but rather inclusion. Taken together the formula “cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS” translates to “model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS.” To fit this model we use the workhorse lm() function and save it to an object we named “mlm1”. Finally we view the results with summary().

mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)

Response TOT :

Call:
lm(formula = TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)

Residuals:
Min     1Q Median     3Q    Max
-399.2 -180.1    4.5  164.1  366.8

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.879e+03  8.933e+02  -3.224 0.008108 **
GEN          6.757e+02  1.621e+02   4.169 0.001565 **
AMT          2.848e-01  6.091e-02   4.677 0.000675 ***
PR           1.027e+01  4.255e+00   2.414 0.034358 *
DIAP         7.251e+00  3.225e+00   2.248 0.046026 *
QRS          7.598e+00  3.849e+00   1.974 0.074006 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 281.2 on 11 degrees of freedom
Multiple R-squared:  0.8871,	Adjusted R-squared:  0.8358
F-statistic: 17.29 on 5 and 11 DF,  p-value: 6.983e-05

Response AMI :

Call:
lm(formula = AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)

Residuals:
Min      1Q  Median      3Q     Max
-373.85 -247.29  -83.74  217.13  462.72

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.729e+03  9.288e+02  -2.938 0.013502 *
GEN          7.630e+02  1.685e+02   4.528 0.000861 ***
AMT          3.064e-01  6.334e-02   4.837 0.000521 ***
PR           8.896e+00  4.424e+00   2.011 0.069515 .
DIAP         7.206e+00  3.354e+00   2.149 0.054782 .
QRS          4.987e+00  4.002e+00   1.246 0.238622
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 292.4 on 11 degrees of freedom
Multiple R-squared:  0.8764,	Adjusted R-squared:  0.8202
F-statistic:  15.6 on 5 and 11 DF,  p-value: 0.0001132



Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if modeled each separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They’re identical.

m1 <- lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(m1)
m2 <- lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(m2)



The same diagnostics we check for models with one predictor should be checked for these as well. For a review of some basic but essential diagnostics see our post Understanding Diagnostic Plots for Linear Regression Analysis.

We can use R’s extractor functions with our mlm1 object, except we’ll get double the output. For example, instead of one set of residuals, we get two:

head(resid(mlm1))
TOT        AMI
1  132.82172  161.52769
2  -72.00392 -264.35329
3 -399.24769 -373.85244
4 -382.84730 -247.29456
5 -152.39129   15.78777
6  366.78644  217.13206



Instead of one set of fitted values, we get two:

head(fitted(mlm1))
TOT       AMI
1 3256.1783 2987.4723
2 1173.0039  917.3533
3 1530.2477 1183.8524
4  978.8473  695.2946
5 1048.3913  828.2122
6 1400.2136 1232.8679



Instead of one set of coefficients, we get two:

coef(mlm1)
TOT           AMI
(Intercept) -2879.4782461 -2728.7085444
GEN           675.6507805   763.0297617
AMT             0.2848511     0.3063734
PR             10.2721328     8.8961977
DIAP            7.2511714     7.2055597
QRS             7.5982397     4.9870508



Instead of one residual standard error, we get two:

sigma(mlm1)
TOT      AMI
281.2324 292.4363



Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don’t reproduce the output here because of the size, but we encourage you to view it for yourself:

vcov(mlm1)



The main takeaway is that the coefficients from both models covary. That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it’s not enough to eyeball the results from the two separate regressions! We need to formally test for their inclusion. And that test involves the covariances between the coefficients in both models.

Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term “multivariate” here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.

The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:

library(car)
Anova(mlm1)

Type II MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df   Pr(>F)
GEN   1   0.65521   9.5015      2     10 0.004873 **
AMT   1   0.69097  11.1795      2     10 0.002819 **
PR    1   0.34649   2.6509      2     10 0.119200
DIAP  1   0.32381   2.3944      2     10 0.141361
QRS   1   0.29184   2.0606      2     10 0.178092
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



The results are titled “Type II MANOVA Tests”. The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. “Type II” refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.

Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function, (notice the little “a”; this is different from the Anova() function in the car package). For example, below we create a new model using the update() function that only includes GEN and AMT. The expression “. ~ . – PR – DIAP – QRS” says “keep the same responses and predictors except PR, DIAP and QRS.”

mlm2 <- update(mlm1, . ~ . - PR - DIAP - QRS)
anova(mlm1, mlm2)

Analysis of Variance Table

Model 1: cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS
Model 2: cbind(TOT, AMI) ~ GEN + AMT
Res.Df Df Gen.var.  Pillai approx F num Df den Df Pr(>F)
1     11       43803
2     14  3    51856 0.60386   1.5859      6     22 0.1983



The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is “Pillai”, which is one of the four common multivariate test statistics.

The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.

lh.out <- linearHypothesis(mlm1, hypothesis.matrix = c("PR = 0", "DIAP = 0", "QRS = 0"))
lh.out

Sum of squares and products for the hypothesis:
TOT      AMI
TOT 930348.1 780517.7
AMI 780517.7 679948.4

Sum of squares and products for error:
TOT      AMI
TOT 870008.3 765676.5
AMI 765676.5 940708.9

Multivariate Tests:
Df test stat approx F num Df den Df  Pr(>F)
Pillai            3 0.6038599 1.585910      6     22 0.19830
Wilks             3 0.4405021 1.688991      6     20 0.17553
Hotelling-Lawley  3 1.1694286 1.754143      6     18 0.16574
Roy               3 1.0758181 3.944666      3     11 0.03906 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).

Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is

$$\frac{\begin{vmatrix}\bf{E}\end{vmatrix}}{\begin{vmatrix}\bf{E} + \bf{H}\end{vmatrix}}$$

In R we can calculate that as follows:

E <- lh.out$SSPE H <- lh.out$SSPH
det(E)/det(E + H)
[1] 0.4405021



Likewise the formula for Pillai is

$$tr[\bf{H}(\bf{H} + \bf{E})^{-1}]$$

tr means trace. That’s the sum of the diagonal elements of a matrix. In R we can calculate as follows:

sum(diag(H %*% solve(E + H)))
[1] 0.6038599



The formula for Hotelling-Lawley is

$$tr[\bf{H}\bf{E}^{-1}]$$

In R:

sum(diag(H %*% solve(E)))
[1] 1.169429



And finally the Roy statistics is the largest eigenvalue of $$\bf{H}\bf{E}^{-1}$$.

In R code:

e.out <- eigen(H %*% solve(E))
max(e.out$values) [1] 1.075818  Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary: summary(mlm2) Response TOT : Call: lm(formula = TOT ~ GEN + AMT, data = ami_data) Residuals: Min 1Q Median 3Q Max -756.05 -190.68 -59.83 203.32 560.84 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 56.72005 206.70337 0.274 0.7878 GEN 507.07308 193.79082 2.617 0.0203 * AMT 0.32896 0.04978 6.609 1.17e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 358.6 on 14 degrees of freedom Multiple R-squared: 0.7664, Adjusted R-squared: 0.733 F-statistic: 22.96 on 2 and 14 DF, p-value: 3.8e-05 Response AMI : Call: lm(formula = AMI ~ GEN + AMT, data = ami_data) Residuals: Min 1Q Median 3Q Max -716.80 -135.83 -23.16 182.27 695.97 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -241.34791 196.11640 -1.231 0.23874 GEN 606.30967 183.86521 3.298 0.00529 ** AMT 0.32425 0.04723 6.866 7.73e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 340.2 on 14 degrees of freedom Multiple R-squared: 0.787, Adjusted R-squared: 0.7566 F-statistic: 25.87 on 2 and 14 DF, p-value: 1.986e-05  Now let’s say we wanted to use this model to predict TOT and AMI for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need put our new data into a data frame with column names that match our original data. nd <- data.frame(GEN = 1, AMT = 1200) p <- predict(mlm2, nd) p TOT AMI 1 958.5473 754.0677  This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don’t calculate an interval but rather an ellipse to capture the uncertainty in two dimensions. Unfortunately at the time of this writing there doesn’t appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However we have written one below you can use called “predictionEllipse”. The details of the function go beyond a “getting started” blog post but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict. Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics. predictionEllipse <- function(mod, newdata, level = 0.95, ggplot = TRUE){ # labels lev_lbl <- paste0(level * 100, "%") resps <- colnames(mod$coefficients)
title <- paste(lev_lbl, "confidence ellipse for", resps[1], "and", resps[2])

# prediction
p <- predict(mod, newdata)

# center of ellipse
cent <- c(p[1,1],p[1,2])

# shape of ellipse
Z <- model.matrix(mod)
Y <- mod$model[[1]] n <- nrow(Y) m <- ncol(Y) r <- ncol(Z) - 1 S <- crossprod(resid(mod))/(n-r-1) # radius of circle generating the ellipse z0 <- c(1, as.matrix(newdata)) rad <- sqrt((m*(n-r-1)/(n-r-m))*qf(level,m,n-r-m)*t(z0)%*%solve(t(Z)%*%Z) %*% z0) # generate ellipse using ellipse function in car package ell_points <- car::ellipse(center = c(cent), shape = S, radius = c(rad), draw = FALSE) # ggplot2 plot if(ggplot){ require(ggplot2, quietly = TRUE) ell_points_df <- as.data.frame(ell_points) ggplot(ell_points_df, aes(x, y)) + geom_path() + geom_point(aes(x = TOT, y = AMI), data = data.frame(p)) + labs(x = resps[1], y = resps[2], title = title) } else { # base R plot plot(ell_points, type = "l", xlab = resps[1], ylab = resps[2], main = title) points(x = cent[1], y = cent[2]) } }  Here’s a demonstration of the function. predictionEllipse(mod = mlm2, newdata = nd)  The dot in the center is our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We’re 95% confident the true values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa. References Fox, J and Weisberg, S (2011). An {R} Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion Johnson, R and Wichern, D (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Prentice-Hall. Rudorfer, MV “Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose.” Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71. For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu Clay Ford Statistical Research Consultant University of Virginia Library October 27, 2017 ## Spring 2018 Data Science Short Courses The Library’s Research Data Services is offering a 1-credit data science course in Spring 2018 through the Data Science Institute: Data Wrangling in Python and the Public Interest Data Lab. Open to all, these courses are designed to provide short, guided training and practice on relevant data science skills. DS 6501 Data Wrangling in Python (1 credit, meets the first five weeks of the semester) Pete Alonzi T,R 2:00-3:15 from 1/18/2018-2/20/2018 Dell 2 100 The goal of this class is to make the skills of data wrangling in Python familiar and easy to use. We’ll cover data cleaning and data manipulation in Python, including reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation, and the matplotlib library for plotting. DS 5559 Public Interest Data Lab (1 credit, meets the first ten weeks of the semester) Michele Claibourn F 12:00-2:00 from 1/19/2018-3/30/2018 Brown Library The lab course will provide experience working collaboratively, openly, and reproducibly on data science projects organized by the lab director — for example, working with local agencies to understand their data and improve processes. The goal is to provide students with an opportunity to enhance their data skills and to gain experience working as a team on a joint project while promoting social good. See the SIS entry for information about the Spring 2018 project. To register search for Subject “DS” and the Course Number in SIS. ## Announcing the Data Purchase Program Did you know that UVA Library will consider purchasing datasets for you? It’s true! We are launching a new Data Purchase Program for UVA-affiliated researchers. Awards are directed at smaller research needs (around$1,000 to $5,000). However, the total amount available will depend on the number of applications and the potential overall impact on UVA’s research mission. The deadline for submission for the inaugural round of the Data Purchase Program is October 15, 2017. We will have another round in the Spring. Please visit the main Data Purchase Program page to learn more. Examples of Recent Data Purchases: Questions? Contact data@virginia.edu. ## Visualizing the Effects of Proportional-Odds Logistic Regression Proportional-odds logistic regression is often used to model an ordered categorical response. By “ordered”, we mean categories that have a natural ordering, such as “Disagree”, “Neutral”, “Agree”, or “Everyday”, “Some days”, “Rarely”, “Never”. For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model. In this post we demonstrate how to visualize a proportional-odds model in R. To begin, we load the effects package. The effects package provides functions for visualizing regression models. This post is essentially a tutorial for using the effects package with proportional-odds models. We also load the car, MASS and splines packages for particular functions, which we’ll explain as we encounter them. If you don’t have the effects or car packages, uncomment the lines below and run them in R. MASS and splines are recommended packages that come with R. > # install.packages("effects") > # install.packages("car") > library(effects) > library(car) > library(MASS) > library(splines)  To demonstrate how to visualize a proportional-odds model we’ll use data from the World Values Surveys (1995-1997) for Australia, Norway, Sweden, and the United States. This dataset, WVS, comes with the effects package. Once we load the effects package, the data is ready to access. > head(WVS) poverty religion degree country age gender 1 Too Little yes no USA 44 male 2 About Right yes no USA 40 female 3 Too Little yes no USA 36 female 4 Too Much yes yes USA 25 female 5 Too Little yes yes USA 39 male 6 About Right yes no USA 80 female  The response variable of interest is poverty, which is a 3-level ordered categorical variable. It contains the answer to the question “Do you think that what the government is doing for people in poverty in this country is about the right amount, too much, or too little?” The answers are an ordered categorical variable with levels “Too Little”, “About Right”, and “Too Much”. The other variables will serve as our predictors. These include country, gender, religion (belong to a religion?), education (hold a university degree?), and age in years. The data contains 5381 records. Before we get started we need to note this example comes from an article on the effects package in the Journal of Statistical Software by John Fox and Jangman Hong, the authors of the effects package. You should definitely take the time to read through that article and cite it if you plan to use the effects package for your own research. What we seek to do in this blog post is elaborate on the example and provide some additional details. Before we can visualize a proportional odds model we need to fit it. For this we use the polr function from the MASS package. The first model we fit models poverty as a function of country interacted with gender, religion, degree and age. The interaction allows the effects of the predictors to vary with each country. > wvs.1 <- polr(poverty ~ country*(gender + religion + degree + age), data = WVS) > summary(wvs.1) Call: polr(formula = poverty ~ country * (gender + religion + degree + age), data = WVS) Coefficients: Value Std. Error t value countryNorway 0.5308176 0.286989 1.84961 countrySweden 0.5446552 0.546029 0.99748 countryUSA -0.0347317 0.248059 -0.14001 gendermale 0.0696120 0.090212 0.77165 religionyes 0.0094685 0.112476 0.08418 degreeyes -0.1242920 0.167603 -0.74158 age 0.0155849 0.002597 6.00185 countryNorway:gendermale 0.1873611 0.144503 1.29659 countrySweden:gendermale 0.0563508 0.154414 0.36493 countryUSA:gendermale 0.2119735 0.139513 1.51938 countryNorway:religionyes -0.2186724 0.216256 -1.01118 countrySweden:religionyes -0.8789724 0.513263 -1.71252 countryUSA:religionyes 0.6002277 0.174433 3.44101 countryNorway:degreeyes 0.0558595 0.208202 0.26829 countrySweden:degreeyes 0.6281743 0.214295 2.93136 countryUSA:degreeyes 0.3030866 0.206394 1.46848 countryNorway:age -0.0157142 0.004367 -3.59846 countrySweden:age -0.0092122 0.004657 -1.97826 countryUSA:age 0.0005419 0.003975 0.13635 Intercepts: Value Std. Error t value Too Little|About Right 0.7161 0.1535 4.6644 About Right|Too Much 2.5355 0.1578 16.0666 Residual Deviance: 10347.07 AIC: 10389.07  The summary output is imposing. In addition to 19 coefficients we have 2 intercepts. Larger coefficients with large t-values are indicative of important predictors, but with so many interactions it’s hard to to see what’s happening or what the model “says”. To evaluate whether the interactions are significant, we use the Anova function from the car package. By default the Anova function returns Type II tests, which tests each term after all others, save interactions. (The base R anova function performs Type I tests which tests each term sequentially.) > Anova(wvs.1) Analysis of Deviance Table (Type II tests) Response: poverty LR Chisq Df Pr(>Chisq) country 250.881 3 < 2.2e-16 *** gender 10.749 1 0.0010435 ** religion 4.132 1 0.0420698 * degree 4.284 1 0.0384725 * age 49.950 1 1.577e-12 *** country:gender 3.049 3 0.3841657 country:religion 21.143 3 9.833e-05 *** country:degree 12.861 3 0.0049476 ** country:age 17.529 3 0.0005501 ***  The Anova result shows all interactions except country:gender are significant. But what do the interactions mean? How do, say, country and age interact? This is where the effects package enters. The effects package allows us to easily create effect displays. What are effect displays? The documentation for the effects package explains it this way: “To create an effect display, predictors in a term are allowed to range over their combinations of values, while other predictors in the model are held to typical values.” In other words, we take our model and use it to calculate predicted values for various combinations of certain “focal” predictors while holding other predictors at fixed values. Then we plot our predicted values versus the “focal” predictors to see how the response changes. Let’s demonstrate. The two primary functions are Effect and plot. Effect generates the predictions and plot creates the display. Let’s say we’re interested in the age and country interaction. We want to visualize how age affects views on poverty for each country. Since our model includes an interaction, which was significant, we expect to see different trajectories for each country. The following code generates the predicted values. The first argument, focal.predictors, is where we list the predictors we’re interested in. Notice it requires a vector, which is why we use the c() function. The second argument is the fitted model. > Effect(focal.predictors = c("age","country"), wvs.1) age*country effect (probability) for Too Little country age Australia Norway Sweden USA 20 0.5958889 0.5632140 0.6496015 0.4330782 30 0.5578683 0.5635318 0.6349615 0.3939898 40 0.5191570 0.5638496 0.6200674 0.3562127 50 0.4802144 0.5641674 0.6049438 0.3201442 60 0.4415107 0.5644851 0.5896166 0.2861049 70 0.4035049 0.5648027 0.5741134 0.2543308 80 0.3666230 0.5651203 0.5584631 0.2249734 90 0.3312402 0.5654379 0.5426958 0.1981045 age*country effect (probability) for About Right country age Australia Norway Sweden USA 20 0.3050532 0.3250955 0.2699789 0.3918455 30 0.3282699 0.3249058 0.2797785 0.4064109 40 0.3502854 0.3247160 0.2895692 0.4171736 50 0.3704966 0.3245261 0.2993161 0.4237414 60 0.3883075 0.3243362 0.3089823 0.4258700 70 0.4031610 0.3241462 0.3185295 0.4234793 80 0.4145714 0.3239561 0.3279182 0.4166592 90 0.4221528 0.3237659 0.3371079 0.4056632 age*country effect (probability) for Too Much country age Australia Norway Sweden USA 20 0.09905788 0.1116905 0.08041952 0.1750762 30 0.11386182 0.1115624 0.08526005 0.1995993 40 0.13055755 0.1114343 0.09036331 0.2266137 50 0.14928897 0.1113065 0.09574006 0.2561143 60 0.17018180 0.1111787 0.10140107 0.2880252 70 0.19333407 0.1110511 0.10735706 0.3221899 80 0.21880555 0.1109236 0.11361867 0.3583674 90 0.24660694 0.1107962 0.12019630 0.3962323  Notice the output lists three sections of probabilities corresponding to each level of the response. The first section lists predicted probabilities for answering “Too Little” for each country for ages ranging from 20 to 90 in increments of 10. The predicted probability a 20-year-old from the USA answers “Too Little” is about 0.43. The second section is for “About right”. The predicted probability a 20-year-old from the USA answers “About Right” is about 0.39. Finally, the third section is for “Too Much”. The predicted probability a 20-year-old from the USA answers “Too Much” is about 0.18. This information is much easier to digest as an effect display. Simply insert the original Effect function call into the plot function to create the effect display. > plot(Effect(focal.predictors = c("age","country"), wvs.1), rug = FALSE)  Now we see how the model “works”. For example, in the upper right plot, we see that in the USA, the probability of answering “Too Much” increases rather dramatically with age, while the probabilities for answering the same in Norway and Sweden stay low and constant. Likewise we see that the probability of USA respondents answering “Too Little” decreases with age while the probabilities for Norway and Sweden stay rather high and constant. The effect display shows us where the interactions are happening and to what degree. (Note: setting rug = FALSE turns off the rug plot. When set to TRUE, the default, the marginal distribution of the predictor is displayed on the x axis. In this case we don’t find it very helpful since we have so much data.) Recall that we need to use all predictors to generate these plots. Country and age are the focal predictors, so they are varied. This means gender, religion and degree are held fixed. We can find out the values they’re fixed at by saving the result of the Effect function and viewing the model matrix. > e.out <- Effect(focal.predictors = c("age","country"), wvs.1) > e.out$model.matrix[1,c("gendermale","religionyes","degreeyes")]
gendermale religionyes   degreeyes
0.4935886   0.8539305   0.2124140



This says gender was set to 0.4935886, religion to 0.8539305, and degree to 0.2124140. In the original data these are indicator variables that take values of 0 or 1 corresponding to No and Yes. Does it make sense to plug in decimals instead of 0s or 1s? It does if you think of modeling a population that is about 49% men, 85% religious, and 21% with a college degree. In fact this describes our sample. By default, the effects package will take the mean of numeric variables that are held fixed. We can verify that’s what it’s doing:

> mean(WVS$gender=="male") [1] 0.4935886 > mean(WVS$religion=="yes")
[1] 0.8539305
> mean(WVS$degree=="yes") [1] 0.212414  If we want to change these values, we can use the given.values argument. It needs to be a named vector that uses the terms as listed in the output summary. For example, to create an effect plot for religious men without a college degree: > e.out <- Effect(focal.predictors = c("country","age"), mod = wvs.1, given.values = c(gendermale = 1, religionyes = 1, degreeyes = 0)) > plot(e.out, rug = FALSE)  We see that the overall “story” of the display does not change; the same changes are happening in the same plots. But the overall probabilities have increased by evidence of the y axis now topping out at 0.7. We can also change the range and number of focal predictors using the xlevels argument. For example, we can set age to range from 20 to 80 in steps of 10. Notice it needs to be a named list. > e.out <- Effect(focal.predictors = c("country","age"), mod = wvs.1, xlevels = list(age = seq(20,80,10))) > plot(e.out, rug = FALSE)  We can investigate other interactions using the same syntax. Below is the effect display for the religion and country interaction. In this case, age is no longer a focal predictor and is held fixed at its mean (45.04). > plot(Effect(focal.predictors = c("religion","country"), wvs.1), rug = FALSE)  We notice, for example, that religious people in the USA have a higher probability of answering “Too Much” compared to their counterparts in the other countries surveyed. We also notice there is much more uncertainty about estimates in Sweden for people who answered “No” to the religion question. This is due to the small numbers of respondents in those categories, as we can see with the xtabs function. > xtabs(~ religion + country + poverty, data = WVS, country == "Sweden", drop.unused.levels = TRUE) , , poverty = Too Little country religion Sweden no 7 yes 597 , , poverty = About Right country religion Sweden no 5 yes 369 , , poverty = Too Much country religion Sweden no 3 yes 22  Following the example in Fox’s article, let’s fit another model that relaxes the linearity assumption for age. We can do this by generating what’s called a basis matrix for natural cubic splines. Instead of fitting a regular polynomial such as age + age^2, we fit piecewise cubic polynomials over the range of age separated by a certain number of intervals, or knots. The ns function in the splines package makes this easy to do. Below we use it in the model formula and specify 4 knots. Harrell (2001) suggests 3-5 knots is usually a good choice (p. 23), so 4 seems wise in this case. > wvs.2 <- polr(poverty ~ country*(gender + religion + degree + ns(age, 4)),data = WVS) > summary(wvs.2) > Anova(wvs.2)  Due to space considerations we don’t print the output of the summary and Anova functions. The model summary shows information for 31 coefficients and is very difficult to interpret. The Anova result is similar in substance to the first model, showing all interactions except country:gender significant. It’s worth noting the AIC of the second model is considerably lower than the first (10373.85 vs 10389.07), suggesting a better fit. Let’s look at the country and age interaction while allowing age to range from 20 – 80: > plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, xlevels = list(age = 20:80)), rug = FALSE)  We see that using a natural spline allows a nonlinear effect of age. For example we see the probability of answering “Too Little” in the USA decreases sharply from 20 to 30, increases from about age 30 to 45, and then decreases and levels out through age 80. The effects package also allows us to create “stacked” effect displays for proportional-odds models. We do this by setting style="stacked" in the plot function. > plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, xlevels = list(age = 20:80)), rug = FALSE, style="stacked")  This plot is useful for allowing us to compare probabilities across the response categories. For example, in Norway and Sweden, people are most likely to answer “Too Little” regardless of age. The blue shaded regions dominate their graphs. We can also create a “latent” version of the effect display. In this plot, the y axis is on the logit scale, which we interpret to be a latent, or hidden, scale from which the ordered categories are derived. We create it by setting latent = TRUE in the Effect function. > plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, xlevels = list(age = 20:80), latent = TRUE), rug = FALSE, ylim = c(0,3.5))  This plot is useful when we’re more interested in classification than probability. The horizontal lines in the plots correspond to the intercepts in the summary output. We can think of these lines as threshholds that define where we crossover from one category to the next on the latent scale. The TL-AR line indicates the boundary between the “Too Little” and “About Right” categories. The AR-TM line indicates the boundary between the “About Right” and “Too Much” categories. Like the “stacked” effect display, we see that someone from Norway and Sweden would be expected to answer “Too Little” regardless of age, though the confidence ribbon indicates this expectation is far from certain, especially for older and younger respondents. On the other hand, most USA respondents are expected to answer “About Right”. Let’s see how the “latent” plot changes when we set the non-focal predictors to college-educated, non-religious female. > plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, xlevels = list(age = 20:80), given.values = c(gendermale = 0, religionyes = 0, degreeyes = 1), latent = TRUE), rug = FALSE, ylim = c(0,3.5))  Notice now that predicted classification for Sweden is “About Right” over the age range but with increased uncertainty. We also see increased chances of answering “Too Little” for certain age ranges in the USA. We can add gender as a focal predictor to compare plots for males versus females: > plot(Effect(focal.predictors = c("country","age","gender"), mod = wvs.2, xlevels = list(age = 20:80), latent = TRUE), rug = FALSE, ylim = c(0,3.5))  Since we didn’t fit a 3-way interaction between country, gender and age, the trajectories do not change between genders. They simply shift horizontally between the two levels of gender. References Fox, J. and J. Hong (2009). Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. Journal of Statistical Software 32:1, 1–24, <http://www.jstatsoft.org/v32/i01/>. Harrell, Frank. Regression Modeling Strategies. Springer, 2001. R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu Clay Ford Statistical Research Consultant University of Virginia Library May 10, 2017 ## Getting started with the purrr package in R If you’re wondering what exactly the purrr package does, then this blog post is for you. Before we get started, we should mention the Iteration chapter in R for Data Science by Garrett Grolemund and Hadley Wickham. We think this is the most thorough and extensive introduction to the purrr package currently available (at least at the time of this writing.) Wickham is one of the authors of the purrr package and he spends a good deal of the chapter clearly explaining how it works. Good stuff, recommended reading. The purpose of this article is to provide a short introduction to purrr, focusing on just a handful of functions. We use some real world data and replicate what purrr does in base R so we have a better understanding of what’s going on. We visited Yahoo Finance on 13 April 2017 and downloaded about three weeks of historical data for three companies: Boeing, Johnson & Johnson and IBM. The following R code will download and unzip the data in your current working directory if you wish to follow along. URL <- "http://static.lib.virginia.edu/statlab/materials/data/stocks.zip" download.file(url = URL, destfile = basename(URL)) unzip(basename(URL))  We have three CSV files. In the spirit of being efficient we would like to import these files into R using as little code as possible (as opposed to calling read.csv three different times.) Using base R functions, we could put all the file names into a vector and then apply the read.csv function to each file. This results in a list of three data frames. When done we could name each list element using the names function and our vector of file names. # get all files ending in csv files <- list.files(pattern = "csv$")
names(dat) <- gsub("\\.csv", "", files) # remove file extension



Here is how we do the same using the map function from the purrr package.

install.packages("purrr") # if package not already installed
library(purrr)
names(dat2) <- gsub("\\.csv", "", files)



So we see that map is like lapply. It takes a vector as input and applies a function to each element of the vector. map is one of the star functions in the purrr package.

Let’s say we want to find the mean Open price for each stock. Here is a base R way using lapply and an anonymous function:

lapply(dat, function(x)mean(x$Open))$BA
[1] 177.8287

$IBM [1] 174.3617$JNJ
[1] 125.8409



We can do the same with map.

map(dat, function(x)mean(x$Open))$BA
[1] 177.8287

$IBM [1] 174.3617$JNJ
[1] 125.8409



But map allows us to bypass the function function. Using a tilda (~) in place of function and a dot (.) in place of x, we can do this:

map(dat, ~mean(.$Open))  Furthermore, purrr provides several versions of map that allow you to specify the structure of your output. For example, if we want a vector instead of a list we can use the map_dbl function. The “_dbl” indicates that it returns a vector of type double (ie, numbers with decimals). map_dbl(dat, ~mean(.$Open))
BA      IBM      JNJ
177.8287 174.3617 125.8409



Now let’s say that we want to extract each stock’s Open price data. In other words, we want to go into each data frame in our list and pull out the Open column. We can do that with lapply as follows:

lapply(dat, function(x)x$Open)$BA
[1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99
[18] 180.10 178.31 179.82 179.00 178.54 177.16

$IBM [1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01 [18] 175.65 176.29 178.46 175.71 176.18 177.85$JNJ
[1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38
[18] 128.04 128.45 128.44 127.05 126.86 125.83



Using map is a little easier. We just provide the name of the column we want to extract.

map(dat, "Open")
$BA [1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99 [18] 180.10 178.31 179.82 179.00 178.54 177.16$IBM
[1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01
[18] 175.65 176.29 178.46 175.71 176.18 177.85

$JNJ [1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38 [18] 128.04 128.45 128.44 127.05 126.86 125.83  We often want to plot financial data. In this case we may want to plot Closing price for each stock and look for trends. We can do this with the base R function mapply. First we create a vector of stock names for plot labeling. Next we set up one row of three plotting regions. Then we use mapply to create the plot. The “m” in mapply means “multiple arguments”. In this case we have two arguments: the Closing price and the stock name. Notice that mapply requires the function come first and then the arguments. stocks <- sub("\\.csv","", files) par(mfrow=c(1,3)) mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)



The purrr equivalent is map2. Again we can substitute a tilda (~) for function, but now we need to use .x and .y to identify the arguments. However the ordering is the same as map: data come first and then the function.

map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))  Each time we run mapply or map2 above, the following is printed to the console: $BA
NULL

$IBM NULL$JNJ
NULL



This is because both functions return a value. Since plot returns no value, NULL is printed. The purrr package provides walk for dealing with functions like plot. Here is the same task with walk2 instead of map2. It produces the plots and prints nothing to the console.

walk2(dat, stocks, ~plot(.x$Close, type="l", main = .y))  At some point we may want to collapse our list of three data frames into a single data frame. This means we’ll want to add a column to indicate which record belongs to which stock. Using base R this is a two step process. We do.call the rbind function to the elements of our list. Then we add a column called Stock by taking advantage of the fact that the row names of our data frame contain the name of the original list element, in this case the stock name. datDF <- do.call(rbind, dat) # add stock names to data frame datDF$Stock <- gsub("\\.[0-9]*", "", rownames(datDF)) # remove period and numbers
Date   Open   High    Low  Close  Volume Adj.Close Stock
BA.1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
BA.2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
BA.3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
BA.4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
BA.5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
BA.6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA



Using purrr, we could have used map_df instead of map with the read.csv function, but we would have lost the source file information.

dat2DF <- map_df(files, read.csv) # works, but which record goes with which stock?



We could also use purrr’s reduce function. That will collapse the list into a single data frame. But again we have no way of labeling which row came from which stock.

dat2DF <- reduce(dat, rbind) # works, but which record goes with which stock?



To accomplish this with purrr, we need to use the stocks vector we created earlier along with the map2_df function. This function applies a function to two arguments and returns a data frame. The function we want to apply is update_list, another purrr function. The update_list function allows you to add things to a list element, such as a new column to a data frame. Below we use the formula notation again and .x and .y to indicate the arguments. The result is a single data frame with a new Stock column.

dat2DF <- map2_df(dat2, stocks, ~update_list(.x, stock = .y))
Date   Open   High    Low  Close  Volume Adj.Close stock
1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA



Finally, we should consider reformatting the Date column as a Date instead of a Factor. The easiest way to deal with this would have been to use the read_csv function from the readr package instead of read.csv. But in the interest of demonstrating some more purrr functionality, let’s pretend we can’t do that. Further, let’s pretend we don’t know which columns are Factor, but we would like to convert them to Date if they are Factor. This time we give a purrr solution first.

To do this we nest one map function in another. The first one is dmap_if. dmap is just like map, except dmap returns a data frame. dmap_if allows us to define a condition to dictate whether or not we apply the function. In this case the condition is determined by is.factor. If is.factor returns TRUE, then we apply the ymd function from the lubridate package. Now dmap_if takes a data frame not a list, so we have to use map to apply dmap_if to each data frame in our list. The final code is as follows:

dat2 <- map(dat2, ~dmap_if(., is.factor, lubridate::ymd))



Doing this in base R is possible but far more difficult. We nest one lapply function inside another, but since lapply returns a list, we need to wrap the first lapply with as.data.frame. And within the first lapply we have to use the assignment operator as a function, which works but looks cryptic!

dat <- lapply(dat,
function(x)as.data.frame(
lapply(x,
function(y)
if(is.factor(y))
<-(y, lubridate::ymd(y))
else y)))



This article provides just a taste of purrr. We hope it gets you started learning more about the package. Be sure to read the documentation as well. Each help page contains illustrative examples. Note that purrr is a very young package. At the time of this writing it is at version 0.2.2. There are sure to be improvements and changes in the coming months and years.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2017

## Endangered Data Week (April 17-21)

The UVa Library is hosting a number of events as part of Endangered Data Week (April 17-21), a new, collaborative effort, coordinated across campuses, nonprofits, libraries, citizen science initiatives, and cultural heritage institutions, to shed light on public datasets that are in danger of being deleted, repressed, mishandled, or lost.

• Introduction to Libra Data (Dataverse at UVa) – Monday, April 17th 11am-noon, in Brown Library, Room 133.
• Introduction to Git/Github – Tuesday, April 18th, noon-1:30pm, in Brown Library, Room 133.
• Introduction to DocNow – Tuesday, April 18th, 2pm-4pm, in Alderman Library, Room 421.
• Web Scraping with R – Wednesday, April 19th, 10:30am-noon, in Brown Library, Room 133.
• Preserving Artifact and Architecture with Cultural Heritage Informatics – Friday, April 21st, 10:30am-11:30am, Location TBD.
• Endangered Data Week webinar – Friday, April 21st, 1pm-2:30pm, in Brown Library, Room 133.

Within the UVa Library, these events are being hosted by Research Data Services and the Scholars’ Lab.

For more information about other Endangered Data Week events around the country and around the world, please visit the Endangered Data Week website.

## Fall 2017 Data Science Short Courses

The Library’s Research Data Services is offering a 1-credit data science course in Fall 2017 through the Data Science Institute: Data Wrangling and Exploration in R.

DS 6559-001 Data Wrangling and Exploration in R (1 credit, meets the first five weeks of the semester)
Clay Ford
T,R 12:30-1:45 from 8/22/2017-9/21/2017
New Cabell Hall 489

This course covers data exploration, cleaning, and manipulation in R. Topics include reading in/writing out data in various formats, R data structures, working with date/time data, character manipulation, using regular expressions in R, reshaping data, data transformations, data aggregation and basic data visualization to aid in data cleaning.

Next spring, we anticipate offering Data Wrangling in Python and a new hands-on Public Interest Data Lab.

To register search for Subject “DS” and Course Number “6559” or “5559” in SIS. Full-time employees of UVA can use their Education Benefit and register through the Community Scholar Program.

## 2017-18 StatLab Fellows

The UVA Library’s Research Data Services is seeking StatLab Fellows for the 2017-2018 academic year. Responding to growing interest in applied data science experience along with a developing movement to use data and data science for the public interest, the program provides up to four UVA graduate students experience working collaboratively, openly, and reproducibly on data science projects — for example, working with local agencies to understand their data and improve processes, or working with news data to help citizens better navigate a complex media environment. The goal is to provide graduate students with an opportunity to enhance their data skills and to gain experience working as a team on a joint project while promoting social good.

Program Description
The StatLab provides consultation, training, and support around data analysis and statistical methods, and data wrangling and visualization. Research across the social, natural, engineering, and data sciences increasingly draws upon sophisticated research designs, complex statistical methods, and computational power to draw inferences about the world around us. The StatLab sees its mission as both contributing to and assisting with a wide portfolio of quantitative research and Fellows will have the opportunity to contribute to academic consulting, as well, providing exposure to a variety of designs, challenges, and methods across disciplines.

Fellows will have the opportunity to deepen their knowledge of methods and data analysis tools by: working collaboratively on a project, writing and blogging about their work, consulting and working with researchers across disciplinary boundaries, and presenting their own work to a diverse audience.

Fellowship Details
StatLab Fellows are expected to work between 5-10 hours per week in the fall and spring semesters and will be paid $20/hour (to provide some flexibility and accommodate variable schedules across the semester). Fellows should be available for a Friday noon meeting most weeks, and will: • contribute intellectually and methodologically to a collaborative project in public and reproducible ways; • expand or deepen their methodological knowledge and skills; • write blog posts about the project and their work; • contribute to consultations and collaborations with researchers seeking help from the StatLab on ocassion; • and advance and present their own research projects. Applicants should have completed at least two methods courses before applying and have experience in a statistical software environment such as R or Stata or in a computing environment such as Python. We encourage applications from women, people of color, LGBT students, first-generation college students and other under-represented groups. How to Apply Send a CV and cover letter by April 17 outlining: • your research interests; • your experience with data analysis, statistical methods, data wrangling, and visualization; • the skills you expect to bring to a collaborative project; • and a summary of what you hope to gain as a StatLab fellow. Email a complete application package to Michele Claibourn, mclaibourn@virginia.edu. Questions about the StatLab Fellowship and the application process should also be directed to Michele Claibourn. You can learn more about the StatLab at our website. ## Working with dates and time in R using the lubridate package Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on Feb 12, 2008 and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap year; February had 29 days.) What was the median time-in-study for all subjects? Another example are experiments that time participants performing an activity, applies a treatment to certain members, and then re-times the activity. What was the difference in times between subjects that received the treatment and those that did not? If our data is stored and read in as something like “01:23:03”, then we’ll need to convert to seconds. The lubridate package for the R statistical computing environment was designed to help us deal with these kinds of data. The out-of-the-box base R installation also provides functions for working with dates and times, but the functions in the lubridate package are a little easier to use and remember. Formatting dates When we import data into R, dates and times are usually stored as character or factor by default due to symbols such as “-”, “:” and “/”. (Though see the readr package for functions that attempt to parse date and times automatically.) Using the str or class functions will tell you how they’re stored. If dates or times are stored as character or factor that means we can’t calculate or summarize elapsed times. To format dates, lubridate provides a series of functions that are a permutation of the letters “m”, “d” and “y” to represent the ordering of month, day and year. For example, if our data has a column of dates such as May 11, 1996, our dates are ordered month-day-year. Therefore we would use the mdy function to transform the column to a date object. If our dates were in the order of, say, year-month-day, we would use the ymd function. lubridate provides functions for every permutation of “m”, “d”, “y”. Let’s demonstrate. Below we generate two character vectors of dates, inspect their class, reformat them using the mdy function and then inspect their class again. library(lubridate) begin <- c("May 11, 1996", "September 12, 2001", "July 1, 1988") end <- c("7/8/97","10/23/02","1/4/91") class(begin) ## [1] "character" class(end) ## [1] "character" (begin <- mdy(begin)) ## [1] "1996-05-11" "2001-09-12" "1988-07-01" (end <- mdy(end)) ## [1] "1997-07-08" "2002-10-23" "1991-01-04" class(begin) ## [1] "Date" class(end) ## [1] "Date" The dates now have class “Date” and are printed in year-month-day format. They may appear to still be character data when printed, but they are in fact numbers. The “Date” class means dates are stored as the number of days since January 1, 1970, with negative values for earlier dates. We can use the as.numeric function to view the raw values. as.numeric(begin) ## [1] 9627 11577 6756 as.numeric(end) ## [1] 10050 11983 7673 With dates stored in this fashion we can do things like subtraction to calculate number of days between two dates. We can also format dates that contain time information by appending _h, _hm, or _hms to any of the aforementioned functions. “h”, “m”, and “s” stand for hour, minute, and second, respectively. Below we add some time data to our dates and demonstrate how to use mdy_hms. begin <- c("May 11, 1996 12:05", "September 12, 2001 1:00", "July 1, 1988 3:32") end <- c("7/8/97 8:00","10/23/02: 12:00","1/4/91 2:05") (begin <- mdy_hm(begin)) ## [1] "1996-05-11 12:05:00 UTC" "2001-09-12 01:00:00 UTC" ## [3] "1988-07-01 03:32:00 UTC" (end <- mdy_hm(end)) ## [1] "1997-07-08 08:00:00 UTC" "2002-10-23 12:00:00 UTC" ## [3] "1991-01-04 02:05:00 UTC" class(begin) ## [1] "POSIXct" "POSIXt" class(end) ## [1] "POSIXct" "POSIXt" Notice the class is now “POSIXct”. “POSIXct” represents the number of seconds since the beginning of 1970. If a date is before 1970, the number of seconds is negative. Notice also the the letters “UTC” have been appended to the date-times. UTC is short for Universal Coordinated Time. You can read more about UTC here, but it’s basically the time standard by which the world regulates clocks. If we prefer we can specify a time zone when formatting dates by using the tz argument. Here’s how we can specify the Eastern Time Zone in the United States when formatting our dates. begin <- c("May 11, 1996 12:05", "September 12, 2001 1:00", "July 1, 1988 3:32") end <- c("7/8/97 8:00","10/23/02: 12:00","1/4/91 2:05") (begin <- mdy_hm(begin, tz = "US/Eastern")) ## [1] "1996-05-11 12:05:00 EDT" "2001-09-12 01:00:00 EDT" ## [3] "1988-07-01 03:32:00 EDT" (end <- mdy_hm(end, tz = "US/Eastern")) ## [1] "1997-07-08 08:00:00 EDT" "2002-10-23 12:00:00 EDT" ## [3] "1991-01-04 02:05:00 EST" Notice the last date is EST instead of EDT. EST means “Eastern Standard Time”. EDT means “Eastern Daylight Time”. Any day and time that falls during Daylight Savings is EDT. Otherwise it’s EST. How do we know the appropriate time zone phrase to use in the tz argument? We can use the OlsonNames function to see a character vector of all time zone names. Just enter OlsonNames() in the R console and hit Enter. We can also read in times without dates using the functions ms, hm, or hms, where again “h”, “m”, and “s” stand for “hours”, “minutes”, and “seconds”. Here are a few examples. time1 <- c("1:13", "0:58", "1:01") time2 <- c("12:23:11", "09:45:31", "12:05:22") time3 <- c("2:14", "2:16", "3:35") (time1 <- ms(time1)) ## [1] "1M 13S" "58S" "1M 1S" (time2 <- hms(time2)) ## [1] "12H 23M 11S" "9H 45M 31S" "12H 5M 22S" (time3 <- hm(time3)) ## [1] "2H 14M 0S" "2H 16M 0S" "3H 35M 0S" Once again, don’t be fooled by the print out. These times are actually stored as seconds. Use as.numeric to verify. as.numeric(time1) ## [1] 73 58 61 as.numeric(time2) ## [1] 44591 35131 43522 as.numeric(time3) ## [1] 8040 8160 12900 The class of these new time objects is neither “Date” nor “POSIX” but rather “Period”. class(time1) ## [1] "Period" ## attr(,"package") ## [1] "lubridate" Period is one of three classes lubridate provides for time spans. Let’s learn more about these classes. Durations, Intervals and Periods lubridate provides three classes, or three different ways, to distinguish between different types of time spans. 1. Duration 2. Interval 3. Period Understanding these classes will help you get the most out of lubridate. The most simple is Duration. This is simply a span of time measured in seconds. There is no start date. An Interval is also measured in seconds but has an associated start date. An Interval measures elapsed seconds between two specific points in time. A Period records a time span in units larger than seconds, such as years or months. Unlike seconds, years and months differ in time. June has 30 days while July has 31 days. February has 28 days except for leap years when it has 29 days. With the Period class, we can add 1 month to February 1 and get March 1. It allows us to perform calculations in calendar or clock time as opposed to absolute number of seconds. Let’s see these three classes in action. Below we define two dates in the US Eastern time zone. The start day is March 11, 2017 at 5:21 AM. The end day is March 12, 2017 at the same time. Note that Daylight Savings begins (or began, depending on when you’re reading this) on March 12 at 2:00 AM. start <- mdy_hm("3-11-2017 5:21", tz = "US/Eastern") end <- mdy_hm("3-12-2017 5:21", tz = "US/Eastern") Since we’re dealing with elapsed time between two dates, let’s start with Intervals. We can define an Interval using the %--% operator. time.interval <- start %--% end time.interval ## [1] 2017-03-11 05:21:00 EST--2017-03-12 05:21:00 EDT Notice how Intervals print. They show the beginng date and end date. And also notice how the time zone changes from EST to EDT indicating that Daylight Savings has started. If we look at the structure of an Interval object we see it contains elapsed time in seconds, 82800, and the start date. str(time.interval) ## Formal class 'Interval' [package "lubridate"] with 3 slots ## ..@ .Data: num 82800 ## ..@ start: POSIXct[1:1], format: "2017-03-11 05:21:00" ## ..@ tzone: chr "US/Eastern" To create a Duration between these two dates, we can use the as.duration function. time.duration <- as.duration(time.interval) time.duration ## [1] "82800s (~23 hours)" Notice a Duration object prints the elapsed time in seconds as well as something a little friendlier to read, in this case hours. Because Daylight Savings went into effect at 2:00 AM during the interval, an hour was skipped. Thus the duration between these two time points is only 23 hours. If we look at the structure of a Duration object we see it just contains elapsed time in seconds. str(time.duration) ## Formal class 'Duration' [package "lubridate"] with 1 slot ## ..@ .Data: num 82800 We can create a Period from an Interval using the as.period function. time.period <- as.period(time.interval) time.period ## [1] "1d 0H 0M 0S" A Period prints elapsed time as integers in the form of years, months, weeks, days and so on. Notice this Period is 1 day long. While only 23 hours have technically elapsed since the start date, according to our clock one day has elapsed. If we look at the structure we see a Period contains several slots for “clock time” values and, like the Duration object, no associated date. str(time.period) ## Formal class 'Period' [package "lubridate"] with 6 slots ## ..@ .Data : num 0 ## ..@ year : int 0 ## ..@ month : int 0 ## ..@ day : int 1 ## ..@ hour : int 0 ## ..@ minute: int 0 To recap: • An Interval is elapsed time in seconds between two specific dates. (If no time is provided, the time for each date is assumed to be 00:00:00, or midnight.) • A Duration is elapsed time in seconds independent of a start date. • A Period is elapsed time in “calendar” or “clock” time (4 weeks, 2 months, etc) independent of a start date. Calculations and conversions Once we format dates and define our time span we often want to do some calculations and conversions. For example, we may want to calculate the mean elapsed time in weeks for different groups. Let’s create some data and demonstrate. First we enter arbitrary start and end dates and define an Interval start <- c("2012-08-21", "2012-09-01", "2012-08-15", "2012-09-18") end <- c("2012-09-16", "2012-09-06", "2012-08-22", "2012-10-11") elapsed.time <- start %--% end If we view the elapsed.time object we’ll just see date ranges. We can use as.duration or even as.numeric to view the elapsed time in seconds but that’s not very useful in this case. It would be better if we converted seconds to another unit of time such as weeks or days. Fortunately lubridate makes this easy. The trick is to convert intervals to durations and then divide the duration by a duration object in the units we desire. That’s a mouthful but easy to demonstrate. Below we demonstrate how to convert to weeks. First we convert our interval to a duration, and then we divide by dweeks(1). The function call dweeks(1) generates a duration of one week in seconds, which is 604800. Dividing that into our duration returns number of weeks. as.duration(elapsed.time) / dweeks(1) ## [1] 3.7142857 0.7142857 1.0000000 3.2857143 We can do the same with hours, days, minutes and years. as.duration(elapsed.time) / dhours(1) ## [1] 624 120 168 552 as.duration(elapsed.time) / ddays(1) ## [1] 26 5 7 23 as.duration(elapsed.time) / dminutes(1) ## [1] 37440 7200 10080 33120 as.duration(elapsed.time) / dyears(1) ## [1] 0.07123288 0.01369863 0.01917808 0.06301370 Once we have the durations in the units we want, we can then do things like find the mean. mean(as.duration(elapsed.time) / dweeks(1)) ## [1] 2.178571 Of course this was just for demonstration. With only 4 values, the mean is not a very useful summary. As another example, consider the following vector of character data summarizing a duration of time. “12w” means 12 weeks and “4d” means 4 days. StudyTime <- c("12w 4d", "11w", "10w 5d", NA, "12w 6d") What if we wanted to convert that to numeric weeks? First we’ll give the R code and them explain how it works. as.duration(period(StudyTime, units = c("week","day"))) / dweeks(1) ## [1] 12.57143 11.00000 10.71429 NA 12.85714 First we use the period function to define a Period using our data. The units argument says the first part of our data represents weeks and the second part represents days. That is then converted to a Duration object that stores time in seconds. Finally we divide by dweeks(1) to convert seconds to weeks. Notice how the NA remains NA and that “11w” converts to 11 just fine even though it had no days appended to it. There is much more to the lubridate package. Read the vignette and check out the examples on each function’s help page. But hopefully the material in this post gets you started with reading in dates, creating time-spans, and making conversions and calculations. For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu Clay Ford Statistical Research Consultant University of Virginia Library January 11, 2017 ## The Wilcoxon Rank Sum Test The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as “is your data normal?” A “no” branch off this question will recommend a Wilcoxon test if you’re comparing two groups of continuous measures. So what is this Wilcoxon test? What makes it non-parametric? What does that even mean? And how do we implement it and interpret it? Those are some of the questions we aim to address in this post. First, let’s recall the assumptions of the two-sample t test for comparing two population means: 1. The two samples are independent of one another 2. The two populations have equal variance or spread 3. The two populations are normally distributed There’s no getting around #1. That assumption must be satisfied for a two-sample t-test. When assumptions #2 and #3 (equal variance and normality) are not satisfied but the samples are large (say, greater than 30), the results are approximately correct. But when our samples are small and our data skew or non-normal, we probably shouldn’t place much faith in the two-sample t-test. This is where the Wilcoxon Rank Sum Test comes in. It only makes the first two assumptions of independence and equal variance. It does not assume our data have have a known distribution. Known distributions are described with math formulas. These formulas have parameters that dictate the shape and/or location of the distribution. For example, variance and mean are the two parameters of the Normal distribution that dictate its shape and location, respectively. Since the Wilcoxon Rank Sum Test does not assume known distributions, it does not deal with parameters, and therefore we call it a non-parametric test. Whereas the null hypothesis of the two-sample t test is equal means, the null hypothesis of the Wilcoxon test is usually taken as equal medians. Another way to think of the null is that the two populations have the same distribution with the same median. If we reject the null, that means we have evidence that one distribution is shifted to the left or right of the other. Since we’re assuming our distributions are equal, rejecting the null means we have evidence that the medians of the two populations differ. The R statistical programming environment, which we use to implement the Wilcoxon rank sum test below, refers to this a “location shift”. Let’s work a quick example in R. The data below come from Hogg & Tanis, example 8.4-6. It involves the weights of packaging from two companies selling the same product. We have 8 observations from each company, A and B. We would like to know if the distribution of weights is the same at each company. A quick boxplot reveals the data have similar spread but may be skew and non-normal. With such a small sample it might be dangerous to assume normality. A <- c(117.1, 121.3, 127.8, 121.9, 117.4, 124.5, 119.5, 115.1) B <- c(123.5, 125.3, 126.5, 127.9, 122.1, 125.6, 129.8, 117.2) dat <- data.frame(weight = c(A,B), company = rep(c("A","B"), each=8)) boxplot(weight ~ company, data = dat)  Now we run the Wilcoxon Rank Sum Test using the wilcox.test function. Again, the null is that the distributions are the same, and hence have the same median. The alternative is two-sided. We have no idea if one distribution is shifted to the left or right of the other. wilcox.test(weight ~ company, data = dat) Wilcoxon rank sum test data: weight by company W = 13, p-value = 0.04988 alternative hypothesis: true location shift is not equal to 0  First we notice the p-value is a little less than 0.05. Based on this result we may conclude the medians of these two distributions differ. The alternative hypothesis is stated as the “true location shift is not equal to 0”. That’s another way of saying “the distribution of one population is shifted to the left or right of the other,” which implies different medians. The Wilcoxon statistic is returned as W = 13. This is NOT an estimate of the difference in medians. This is actually the number of times that a package weight from company B is less than a package weight from company A. We can calculate it by hand using nested for loops as follows (though we should note that this is not how the wilcox.test function calculates W): W <- 0 for(i in 1:length(B)){ for(j in 1:length(A)){ if(B[j] < A[i]) W <- W + 1 } } W [1] 13  Another way to do this is to use the outer function, which can take two vectors and perform an operation on all pairs. The result is an 8 x 8 matrix consisting of TRUE/FALSE values. Using sum on the matrix counts all instances of TRUE. sum(outer(B, A, "<")) [1] 13  Of course we could also go the other way and count the number of times that a package weight from company A is less than a package weight from company B. This gives us 51. sum(outer(A, B, "<")) [1] 51  If we relevel our company variable in data.frame dat to have “B” as the reference level, we get the same result in the wilcox.test output. dat$company <- relevel(dat$company, ref = "B") wilcox.test(weight ~ company, data = dat) Wilcoxon rank sum test data: weight by company W = 51, p-value = 0.04988 alternative hypothesis: true location shift is not equal to 0  So why are we counting pairs? Recall this is a non-parametric test. We’re not estimating parameters such as a mean. We’re simply trying to find evidence that one distribution is shifted to the left or right of the other. In our boxplot above, it looks like the distributions from both companies are reasonably similar but with B shifted to the right, or higher, than A. One way to think about testing if the distributions are the same is to consider the probability of a randomly selected observation from company A being less than a randomly selected observation from company B: P(A < B). We could estimate this probability as the number of pairs with A less than B divided by the total number of pairs. In our case that comes to $$51/(8\times8)$$ or $$51/64$$. Likewise we could estimate the probability of B being less than A. In our case that's $$13/64$$. So we see that the statistic W is the numerator in this estimated probability. The exact p-value is determined from the distribution of the Wilcoxon Rank Sum Statistic. We say "exact" because the distribution of the Wilcoxon Rank Sum Statistic is discrete. It is parametrized by the two sample sizes we're comparing. "But wait, I thought the Wilcoxon test was non-parametric?" It is! But the test statistic W has a distribution which does not depend on the distribution of the data. We can calculate the exact two-sided p-values explicitly using the pwilcox function (they’re two-sided, so we multiply by 2): For W = 13, $$P(W \leq 13)$$: pwilcox(q = 13, m = 8, n = 8) * 2 [1] 0.04988345  For W = 51, $$P(W \geq 51)$$, we have to get $$P(W \leq 50)$$ and then subtract from 1 to get $$P(W \geq 51)$$: (1 - pwilcox(q = 51 - 1, m = 8, n = 8)) * 2 [1] 0.04988345  By default the wilcox.test function will calculate exact p-values if the samples contains less than 50 finite values and there are no ties in the values. (More on “ties” in a moment.) Otherwise a normal approximation is used. To force the normal approximation, set exact = FALSE. dat$company <- relevel(dat$company, ref = "A") wilcox.test(weight ~ company, data = dat, exact = FALSE) Wilcoxon rank sum test with continuity correction data: weight by company W = 13, p-value = 0.05203 alternative hypothesis: true location shift is not equal to 0  When we use the normal approximation the phrase “with continuity correction” is added to the name of the test. A continuity correction is an adjustment that is made when a discrete distribution is approximated by a continuous distribution. The normal approximation is very good and computationally faster for samples larger than 50. Let’s return to “ties”. What does that mean and why does that matter? To answer those questions first consider the name “Wilcoxon Rank Sum test”. The name is due to the fact that the test statistic can be calculated as the sum of the ranks of the values. In other words, take all the values from both groups, rank them from lowest to highest according to their value, and then sum the ranks from one of the groups. Here’s how we can do it in R with our data: sum(rank(tins$weight)[tins$company=="A"]) [1] 49  Above we rank all the weights using the rank function, select only those ranks for company A, and then sum them. This is the classic way to calculate the Wilcoxon Rank Sum test statistic. Notice it doesn’t match the test statistic provided by wilcox.test, which was 13. That’s because R is using a different calculation due to Mann and Whitney. Their test statistic, sometimes called U, is a linear function of the original rank sum statistic, usually called W: $U = W – \frac{n_2(n_2 + 1)}{2}$ where $$n_2$$ is the number of observations in the other group whose ranks were not summed. We can verify this relationship for our data sum(rank(tins$weight)[tins$company=="A"]) - (8*9/2) [1] 13  This is in fact how the wilcox.test function calculates the test statistic, though it labels it W instead of U. The rankings of values have to be modified in the event of ties. For example, in the data below 7 occurs twice. One of the 7’s could be ranked 3 and the other 4. But then one would be ranked higher than the other and that’s not correct. We could rank them both 3 or both 4, but that wouldn’t be right either. What we do then is take the average of their ranks. Below this is $$(3 + 4)/2 = 3.5$$. R does this by default when ranking values. vals <- c(2, 4, 7, 7, 12) rank(vals) [1] 1.0 2.0 3.5 3.5 5.0  The impact of ties means the Wilcoxon rank sum distribution cannot be used to calculate exact p-values. If ties occur in our data and we have fewer than 50 observations, the wilcox.test function returns a normal approximated p-value along with a warning message that says “cannot compute exact p-value with ties”. Whether exact or approximate, p-values do not tell us anything about how different these distributions are. For the Wilcoxon test, a p-value is the probability of getting a test statistic as large or larger assuming both distributions are the same. In addition to a p-value we would like some estimated measure of how these distributions differ. The wilcox.test function provides this information when we set conf.int = TRUE. wilcox.test(weight ~ company, data = dat, conf.int = TRUE) Wilcoxon rank sum test data: weight by company W = 13, p-value = 0.04988 alternative hypothesis: true location shift is not equal to 0 95 percent confidence interval: -8.5 -0.1 sample estimates: difference in location -4.65  This returns a “difference in location” measure of -4.65. The documentation for the wilcox.test function states this “does not estimate the difference in medians (a common misconception) but rather the median of the difference between a sample from x and a sample from y.” Again we can use the outer function to verify this calculation. First we calculate the difference between all pairs and then find the median of those differences. median(outer(A,B,"-")) [1] -4.65  The confidence interval is fairly wide due to the small sample size, but it appears we can safely say the median weight of company A’s packaging is at least -0.1 less than the median weight of company B’s packaging. If we’re explicitly interested in the difference in medians between the two populations, we could try a bootstrap approach using the boot package. The idea is to resample the data (with replacement) many times, say 1000 times, each time taking a difference in medians. We then take the median of those 1000 differences to estimate the difference in medians. We can then find a confidence interval based on our 1000 differences. An easy way is to use the 2.5th and 97.5th percentiles as the upper and lower bounds of a 95% confidence interval. Here is one way to carry this out in R. First we load the boot package, which comes with R, and create a function called med.diff to calculate the difference in medians. In order to work with the boot package’s boot function, our function needs two arguments: one for the data and one to index the data. We have arbitrarily named these arguments d and i. The boot function will take our data, d, and resample it according to randomly selected row numbers, i. It will then return the difference in medians for the resampled data. library(boot) med.diff <- function(d, i) { tmp <- d[i,] median(tmp$weight[tmp$company=="A"]) - median(tmp$weight[tmp$company=="B"]) }  Now we use the boot function to resample our data 1000 times, taking a difference in medians each time, and saving the results into an object called boot.out. boot.out <- boot(data = dat, statistic = med.diff, R = 1000)  The boot.out object is a list object. The element named “t” contains the 1000 differences in medians. Taking the median of those values gives us a point estimate of the estimated difference in medians. Below we get -5.05, but you will likely get something different. median(boot.out$t)
[1] -5.05



Next we use the boot.ci function to calculate confidence intervals. We specify type = "perc" to obtain the bootstrap percentile interval.

boot.ci(boot.out, type = c("perc","bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = boot.out, type = "perc")

Intervals :
Level     Percentile
95%   (-9.399, -0.100 )
Calculations and Intervals on Original Scale



We notice the interval is not too different from what the wilcox.test function returned, but certainly bigger on the lower bound. Like the Wilcoxon rank sum test, bootstrapping is a non-parametric approach that can be useful for small and/or non-normal data.

References

Hogg, R.V. and Tanis, E.A., Probability and Statistical Inference, 7th Ed, Prentice Hall, 2006.
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
Jan 5, 2017