Getting Started with Matching Methods

A frequent research question is whether or not some “treatment” causes an effect. For example, does taking aspirin daily reduce the chance of a heart attack? Does more sleep lead to better academic performance for teenagers? Does smoking increase the risk of chronic obstructive pulmonary disease (COPD)?

To truly answer such questions, we need a time machine and a lack of ethics. We would need to be able to take a subject and, say, make them smoke for several years and observe whether or not they get COPD. Then travel back in time and observe the same subject NOT smoking to see if they get COPD. Only then could we really see the effect of smoking on that person’s chances of contracting COPD. To estimate the average effect of smoking on COPD, we would need to do the same with many subjects. Clearly the laws of physics and all IRBs forbid this.

To work around these issues, researchers often employ what are called “matching methods”. This involves taking observational data, such as data from surveys, and matching people who have similar characteristics but different treatments. I can’t travel through time and make a person smoke and not smoke. But I can identify two people who are similar in almost every way except their smoking status. Here’s a 200-pound, middle class, college educated, Caucasian man who smokes, and a 200-pound, middle class, college educated, Caucasian man who does NOT smoke. They’re not the same person, but they’re similar. It stands to reason that the effect of smoking on the smoker can approximate the effect smoking would have on the non-smoker, and vice versa. This is the idea of matching methods: create two groups that are similar in every respect but their treatment.

For a fantastic and easy-to-read overview of matching methods, see Elizabeth Stuart’s 2010 paper “Matching Methods for Causal Inference: A Review and a Look Forward”. See also Michele Claibourn’s excellent 2014 workshop on matching methods. In this brief article, we demonstrate how to get started with matching methods using R and the MatchIt package.

To begin we need to load some data. For demonstration purposes we’ll use a sample of the 2015 BRFSS survey (pronounced bur-fiss). BRFSS stands for Behavioral Risk Factor Surveillance System. BRFSS is a CDC telephone survey that collects state data about U.S. residents regarding their health-related risk behaviors and chronic health conditions. The entire survey has over 400,000 records and over 300 variables. The sample we’ll work with has 5000 records and 7 variables.

brfss <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/brfss_2015_sample.csv")
summary(brfss)
  COPD          SMOKE              RACE         AGE           SEX           WTLBS        AVEDRNK2     
 No :4707   Min.   :0.0000   Black   : 308   18-24: 258   Female:2521   Min.   : 85   Min.   : 1.000  
 Yes: 293   1st Qu.:0.0000   Hispanic: 339   25-34: 599   Male  :2479   1st Qu.:148   1st Qu.: 1.000  
            Median :0.0000   Other   : 254   35-44: 667                 Median :172   Median : 2.000  
            Mean   :0.1524   White   :4099   45-54: 914                 Mean   :178   Mean   : 2.122  
            3rd Qu.:0.0000                   55-64:1122                 3rd Qu.:200   3rd Qu.: 2.000  
            Max.   :1.0000                   65+  :1440                 Max.   :527   Max.   :30.000  

The summary provides details on the seven variables:

COPD: Ever told you have chronic obstructive pulmonary disease (COPD)?
SMOKE: Adults who are current smokers (0 = no, 1 = yes)
RACE: Race group
AGE: Age group
SEX: Gender
WTLBS: Weight in lbs
AVEDRNK2: During the past 30 days, when you drank, how many drinks did you drink on average?

We wish to investigate the effect of smoking on COPD. Does smoking increase the chance of contracting COPD? Obviously this is not something we could carry out as a randomized experiment. But perhaps using this observational data and matching methods we can simulate what our data might look like had we carried out a randomized experiment. We can create two groups, smokers and non-smokers, who are roughly equivalent in age, race, sex, weight and alcohol consumption. (Please note this is only for educational purposes. While the data is real and the methods are sound, we would certainly include several other variables as well as enlist the help of doctors and epidemiologists to investigate a research question like this.)

Before we start matching, let’s check the balance and overlap of our data. By balance we mean that the distributions of our data in the smoking and non-smoking groups are similar. By overlap, we mean that the continuous variables in the smoking and non-smoking groups span the same range of values. We can informally check both with some plots. Notice below we use the factor() function to ensure ggplot recognizes smoking as a categorical variable. We have smoking stored in our data as a numeric column of zeroes and ones because that’s how the MatchIt package requires treatment variables to be stored.

library(ggplot2)
p <- ggplot(brfss, aes(fill = factor(SMOKE))) + 
  geom_bar(position = "dodge") + 
  scale_fill_discrete("Smoke")
p + aes(x = AGE)
p + aes(x = RACE)
p + aes(x = SEX)

Above we notice a lack of balance. For example, the number of non-smokers increases with age, while the number of smokers stays fairly constant.

To check the balance and overlap of the continuous weight and drink variables we plot histograms.

ggplot(brfss, aes(x = WTLBS, fill = factor(SMOKE))) + 
  geom_histogram(position = "identity") +
  scale_fill_discrete("Smoke")
 
ggplot(brfss, aes(x = AVEDRNK2, fill = factor(SMOKE))) + 
  geom_histogram(position = "identity", binwidth = 1) +
  scale_fill_discrete("Smoke")

The balance looks roughly OK, but there appears to be a slight lack of overlap at the higher values. We can investigate more closely by looking at the quantiles within each group.

tapply(brfss$WTLBS, brfss$SMOKE, quantile, probs = seq(0.2, 1, 0.1))
$`0`
 20%  30%  40%  50%  60%  70%  80%  90% 100% 
 140  150  163  175  185  200  212  240  522 

$`1`
  20%   30%   40%   50%   60%   70%   80%   90%  100% 
140.0 150.0 160.0 172.5 180.0 195.0 209.8 235.0 506.0 


tapply(brfss$AVEDRNK2, brfss$SMOKE, quantile, probs = seq(0.2, 1, 0.1))
$`0`
 20%  30%  40%  50%  60%  70%  80%  90% 100% 
   1    1    1    1    2    2    2    3   30 

$`1`
 20%  30%  40%  50%  60%  70%  80%  90% 100% 
   1    2    2    2    3    3    4    6   30 

We see that once we get past the 30th percentile, the non-smokers begin to outweigh the smokers, and the smokers begin to outdrink the non-smokers, producing a lack of balance. On the other hand, it seems the lack of overlap isn’t as bad as we thought.
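
For a more direct check of overlap, we can also compare the minimum and maximum of each continuous variable within each group. This is a small supplement to the quantiles above:

tapply(brfss$WTLBS, brfss$SMOKE, range)    # min and max weight by smoking status
tapply(brfss$AVEDRNK2, brfss$SMOKE, range) # min and max drinks by smoking status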

At this point we’re ready to proceed to matching.

The idea of matching isn’t complicated. We want to simply find subjects with matching covariates among the smokers and non-smokers. However it’s often difficult to find exact matches, so instead we define a “closeness” or “distance” metric and use that to generate matches. In this tutorial we’ll use Nearest Neighbor Matching which is the default method in the MatchIt package. Nearest Neighbor Matching selects the best control (non-smoker) for each treated subject (smoker) using a distance measure called the propensity score. The end result is two groups of equal size and (hopefully) similar distributions of covariates.

The propensity score is the estimated probability of receiving treatment (i.e., being a smoker), conditional on the covariates. If two subjects, one who is a smoker and the other who is not, have similar propensity scores, then we think of them as potential matches.

To carry out Nearest Neighbor Matching we load the MatchIt package and use the matchit() function. Notice we need to assign the result of the matchit() function to an object. Below we name it “m.out”, but you can name it something else if you like. matchit() uses R’s formula syntax to define what we want to match on. On the left side of the tilde (~) is SMOKE. That’s our treatment variable. On the right side of the tilde are the variables we want to match on. We have kept it simple but you can match on interactions and higher powers. Also notice we have not included our dependent variable, COPD! That is by design. At this step we are trying to create balance across covariates among smokers and non-smokers. The analysis of COPD comes after the matching procedure.

install.packages("MatchIt")
library(MatchIt)
m.out <- matchit(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss)

That should run fairly quickly on just about any computer. The next step is to investigate the results of the matching procedure. Did we improve balance? Are the smokers and non-smokers similar across age, race, weight, number of drinks and sex? The quickest way to check this is to plot a summary of the results, like so:

s.out <- summary(m.out, standardize = TRUE)
plot(s.out)

First we call summary() on m.out and save it to a new object called “s.out”. For the plot to work we need to also set the standardize argument to TRUE. Then we call plot() on the s.out object. R will allow you to click on points to label them. When finished, press Esc. We labeled 3 points. The result is below.

On the left side of the plot is the standardized difference in means of covariates between smokers and non-smokers for all the data. We see that distance (the propensity score) has the largest difference, followed by number of drinks (AVEDRNK2) and the AGE65+ group. This tells us that prior to matching we had, for example, large differences in the mean values of AVEDRNK2 between smokers and non-smokers. The right side shows the standardized difference in means after matching. We hope to see all the points near 0. Specifically, we want to see large differences in means become smaller. The three largest differences prior to matching are now tiny. The dark lines show mean differences that increased after matching. This is actually common when you match on covariates that were already pretty well balanced to begin with. Since those standardized mean differences are still less than 0.1, we’re not too concerned about them.
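
To build intuition for what this plot shows, below is a minimal sketch of computing a standardized mean difference by hand for a single covariate. We standardize by the treated-group standard deviation, one common convention; MatchIt’s exact internal computation may differ slightly.

# standardized difference in means, using the treated-group SD
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}
smd(brfss$AVEDRNK2, brfss$SMOKE)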

We can also compare the distribution of propensity scores (distance) before and after matching.

plot(m.out,  type = "jitter", interactive = FALSE)
plot(m.out,  type = "hist")

What we want to see is that the Matched Treated (smokers) and Matched Control (non-smokers) distributions are roughly similar. We like what we see here.

We can also evaluate the matching results using tabular output. Just call summary() on the “m.out” object. Below are the results of our matching procedure.

summary(m.out)

Call:
matchit(formula = SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, 
    data = brfss)

Summary of balance for all data:
             Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance            0.1991        0.1440     0.0799    0.0551  0.0448   0.0547  0.1816
RACEBlack           0.1194        0.0512     0.2204    0.0682  0.0000   0.0682  1.0000
RACEHispanic        0.0709        0.0672     0.2505    0.0036  0.0000   0.0026  1.0000
RACEOther           0.0682        0.0477     0.2131    0.0206  0.0000   0.0197  1.0000
RACEWhite           0.7415        0.8339     0.3722   -0.0924  0.0000   0.0919  1.0000
AGE25-34            0.1824        0.1085     0.3111    0.0739  0.0000   0.0735  1.0000
AGE35-44            0.1811        0.1248     0.3306    0.0563  0.0000   0.0564  1.0000
AGE45-54            0.1837        0.1826     0.3864    0.0011  0.0000   0.0013  1.0000
AGE55-64            0.2192        0.2253     0.4179   -0.0062  0.0000   0.0066  1.0000
AGE65+              0.1640        0.3103     0.4627   -0.1462  0.0000   0.1470  1.0000
SEXMale             0.5459        0.4868     0.4999    0.0591  0.0000   0.0591  1.0000
WTLBS             177.7561      177.9875    43.8187   -0.2314  0.0000   2.1432 50.0000
AVEDRNK2            2.9843        1.9672     1.8102    1.0171  1.0000   1.0000  4.0000


Summary of balance for matched data:
             Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance            0.1991        0.1986     0.1086    0.0004       0   0.0006  0.0283
RACEBlack           0.1194        0.1155     0.3198    0.0039       0   0.0039  1.0000
RACEHispanic        0.0709        0.0669     0.2501    0.0039       0   0.0039  1.0000
RACEOther           0.0682        0.0761     0.2654   -0.0079       0   0.0079  1.0000
RACEWhite           0.7415        0.7415     0.4381    0.0000       0   0.0000  0.0000
AGE25-34            0.1824        0.1916     0.3938   -0.0092       0   0.0092  1.0000
AGE35-44            0.1811        0.1903     0.3928   -0.0092       0   0.0092  1.0000
AGE45-54            0.1837        0.1850     0.3886   -0.0013       0   0.0013  1.0000
AGE55-64            0.2192        0.1890     0.3917    0.0302       0   0.0302  1.0000
AGE65+              0.1640        0.1588     0.3657    0.0052       0   0.0052  1.0000
SEXMale             0.5459        0.5276     0.4996    0.0184       0   0.0184  1.0000
WTLBS             177.7561      177.8207    42.4245   -0.0646       1   2.7788 60.0000
AVEDRNK2            2.9843        2.9423     2.7596    0.0420       0   0.2756  3.0000

Percent Balance Improvement:
             Mean Diff.  eQQ Med  eQQ Mean  eQQ Max
distance        99.1869  99.9585   98.9125  84.3921
RACEBlack       94.2289   0.0000   94.2308   0.0000
RACEHispanic    -8.8341   0.0000  -50.0000   0.0000
RACEOther       61.7348   0.0000   60.0000   0.0000
RACEWhite      100.0000   0.0000  100.0000 100.0000
AGE25-34        87.5647   0.0000   87.5000   0.0000
AGE35-44        83.6772   0.0000   83.7209   0.0000
AGE45-54       -19.9887   0.0000    0.0000   0.0000
AGE55-64      -388.2488   0.0000 -360.0000   0.0000
AGE65+          96.4106   0.0000   96.4286   0.0000
SEXMale         68.9365   0.0000   68.8889   0.0000
WTLBS           72.1013     -Inf  -29.6583 -20.0000
AVEDRNK2        95.8709 100.0000   72.4409  25.0000

Sample sizes:
          Control Treated
All          4238     762
Matched       762     762
Unmatched    3476       0
Discarded       0       0

The first two sections look at all the data and the matched data, respectively. The first, second and fourth columns are probably the easiest to examine. The first and second columns show the means for the treated (smokers) and control (non-smokers) groups. Prior to matching, for example, about 16% of smokers are over age 65 versus about 31% of non-smokers, an absolute difference of about 15 percentage points. After matching we have roughly equal proportions of subjects over age 65 in both groups, with a negligible mean difference. The third section, Percent Balance Improvement, shows how much we have increased or decreased balance. Negative numbers in the Mean Diff. column indicate covariates where balance actually got worse. The AGE55-64 value of -388 seems pretty terrible, but what happened is that the balance went from 21.9% vs 22.5% pre-match to 21.9% vs 18.9% post-match. Ideally we don’t want balance getting worse, but in this case the balance is still quite good. The last section tells us we now have equal numbers (762) in the treated and control groups.
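
We can verify figures like the 16% versus 31% for the 65+ age group directly from the raw data. As a quick check:

# proportion in each age group by smoking status (each row sums to 1)
with(brfss, prop.table(table(SMOKE, AGE), margin = 1))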

The columns labeled eQQ refer to empirical quantile-quantile plots. They provide the median, mean, and maximum distance between the empirical quantile functions of the treated and control groups. Values greater than 0 indicate deviations between the groups in some part of the empirical distributions. These are most useful for continuous variables. We can view these plots by calling plot() on the matching object. The which.xs argument allows us to specify which variables to plot. Below we plot weight and number of drinks.

plot(m.out, which.xs = c("WTLBS", "AVEDRNK2"))

We hope to see the points in the right column (Matched) falling between the dotted lines. The idea is that if both groups, smokers and non-smokers, have good balance then the quantiles should lie on a 45-degree line. Once again we like what we see.

When we’re satisfied with the results of the matching procedure, the next step is to create a new data frame that contains only the matched data. For this we use the match.data() function on the matching object. Below we create a new data frame called brfss.match.

brfss.match <- match.data(m.out)

And now we’re ready to perform our statistical analysis as if our data had been generated through randomization. For instance, we might perform logistic regression to analyze how smoking affects the probability of contracting COPD.

brfss.match$SMOKE <- factor(brfss.match$SMOKE, labels = c("No", "Yes"))
mod <- glm(COPD ~ SMOKE + AGE + RACE + SEX + WTLBS + AVEDRNK2, 
           data = brfss.match, family = "binomial")
exp(confint(mod, parm = "SMOKEYes"))

   2.5 %   97.5 % 
2.868077 7.091090 

It appears the odds of reporting COPD are at least 2.9 times higher for smokers than for non-smokers, controlling for age, sex, race, weight and drinking.
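
For reference, the point estimate of that odds ratio can be pulled from the model in the same way; the confidence interval above is usually the more honest summary.

# point estimate of the odds ratio for smoking
exp(coef(mod)["SMOKEYes"])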

There is much more to learn about matching methods. We demonstrated Nearest Neighbor Matching, but there are several other methods. The MatchIt User’s Guide provides a nice overview of how to implement various matching methods: https://r.iq.harvard.edu/docs/matchit/2.4-20/User_s_Guide_to.html

It’s worth pointing out that matching does not guarantee that each treated subject will have a matching control with identical covariate values. The only guarantee is that groups of individuals with similar distance measures (e.g., propensity scores) will have similar covariate distributions. Having said that, it is possible to see which control subject was matched with which treated subject. The matching object has a match.matrix component, where the row names correspond to the names of the treated subjects and the cells correspond to the names of the matched controls. Here are the first few matches in our m.out match.matrix:

head(m.out$match.matrix)
   1     
29 "3128"
44 "3957"
62 "4669"
81 "1970"
82 "2849"
87 "4754"

So row 29 of brfss was a treated subject (a smoker) matched to the control subject in row 3128 (a non-smoker). If we compare them we notice they don’t exactly match up:

brfss[29,]
   COPD SMOKE     RACE   AGE  SEX WTLBS AVEDRNK2
29   No     1 Hispanic 25-34 Male   150       10
brfss[3128,]
     COPD SMOKE  RACE   AGE  SEX WTLBS AVEDRNK2
3128   No     0 Other 25-34 Male   164        8

The most glaring difference is RACE. But recall, they weren’t matched on the specific covariates but rather their propensity score. We can compare their propensity scores as follows (taking advantage of the fact that the row names in the matched data frame are the same as those in the original data frame):

i <- rownames(brfss.match) %in% c(29,3128)
brfss.match[i,]
     COPD SMOKE     RACE   AGE  SEX WTLBS AVEDRNK2  distance weights
29     No   Yes Hispanic 25-34 Male   150       10 0.4776235       1
3128   No    No    Other 25-34 Male   164        8 0.4634368       1

The propensity score is in the distance column. According to the propensity score, these subjects are similar. Matching on this distance metric helps ensure the smoking and non-smoking groups have similar covariate distributions. So even though these two specific subjects do not match on RACE, overall the smoking and non-smoking groups are balanced on RACE.

We mentioned earlier that the propensity score is the probability of receiving treatment (i.e., being a smoker), conditional on the covariates. The matchit() function calculates this for us, but we can do it ourselves by performing a logistic regression of SMOKE on the covariates that we want to balance on. Notice the formula is identical to the one we used with matchit().

pscore <- glm(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, 
              data = brfss, family = binomial)

Here are the first 6 propensity scores in the brfss.match data frame (again, the row numbers correspond to the rows in our original brfss data frame):

head(brfss.match[, "distance", drop = F])
     distance
1  0.19127152
3  0.08129118
13 0.10868555
14 0.41062392
21 0.07483546
22 0.32779268

We get the same results by using our model to make predictions for the same subjects:

predict(pscore, type = "response")[c(1,3,13,14,21,22)]
         1          3         13         14         21         22 
0.19127152 0.08129118 0.10868555 0.41062392 0.07483546 0.32779268 

Another type of distance measure is the linear propensity score. It’s just the log-odds version of the propensity score. Whereas the probability-based propensity score is bounded from 0 to 1, the linear propensity score has no such bounds. This means we can make better matches in the lower and upper extremes of the scores since the values are not being compressed near 0 or 1. In section 6.2 of her paper, Stuart actually recommends using the linear propensity score. Fortunately MatchIt makes it easy to do. Simply set the distance argument to "linear.logit":

m.out.lp <- matchit(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss, 
                 distance = "linear.logit")

The resulting propensity scores are on the log-odds scale, which we can verify as we did before:

brfss.match2 <- match.data(m.out.lp)
head(brfss.match2[, "distance", drop = FALSE])
     distance
1  -1.4417693
2  -2.0709984
14 -0.3613867
21 -2.5146797
22 -0.7181855
25 -1.9113250

predict(pscore)[c(1,2,14,21,22, 25)] # default prediction is on log-odds scale
         1          2         14         21         22         25 
-1.4417693 -2.0709984 -0.3613867 -2.5146797 -0.7181855 -1.9113250 

Notice that matching on the linear propensity score has resulted in different subjects being selected. We leave it as an exercise for the interested reader to verify that the improvement in balance is essentially the same as the matching performed on the probability-based propensity score.
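
As a starting point for that exercise, the same summary-and-plot approach we used earlier works on the new matching object:

s.out.lp <- summary(m.out.lp, standardize = TRUE)
plot(s.out.lp)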

To learn more about matching methods and causal inference in general, please see the following references:

  • Claibourn, M. (2014). Matching Methods for Causal Inference. Workshop presented at the UVa Library. http://static.lib.virginia.edu/statlab/materials/workshops/MatchingMethods.zip
  • Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge. Chapters 9 and 10.
  • Ho, D., Imai K., King, G., Stuart, E. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, Vol. 42, No. 8, pp. 1-28. URL http://www.jstatsoft.org/v42/i08/
  • Stuart, E. (2010). Matching Methods for Causal Inference: A Review and a Look Forward. Statistical Science. Vol. 25, No. 1, 1–21. DOI: 10.1214/09-STS313

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 24, 2018

2018-2019 StatLab Fellows

Graduate StatLab Fellows
The UVA Library’s Research Data Services is seeking multiple StatLab graduate fellows for the 2018-2019 academic year for one of two possible roles. Apply through UVA Handshake.

  1. Data Analytics Fellow: Fellows will deepen their knowledge of quantitative methods, data analysis and data science, and statistical computation by: working collaboratively with statistical and data science experts, supporting data science research on grounds through consultations with researchers across disciplinary boundaries, writing online instructional materials and articles, and developing and teaching an instructional workshop.
    • Consultations and collaborations: Consultations involve helping researchers from a variety of domains and with a wide range of experience with developing research designs, identifying appropriate analyses and models, explaining statistical methods and interpretation of results, data wrangling and statistical computation, and implementing models and algorithms in supported software environments. Fellows will join in on and eventually lead consultations.
    • Data science blog posts and articles: The Research Data Services webpage hosts a growing collection of highly-viewed StatLab posts and articles explaining common data analytic challenges, demonstrating the application of complex methods, introducing and using new computational libraries, generating effective visualizations, and more. Fellows will propose and write on multiple data science topics for the webpage.
    • Workshop instruction: The Research Data Services team in the Library supports a robust set of hands-on workshops each semester, from introductions to statistical and computing environments to instruction on advanced methods and machine learning. Workshops are free and open to all learners, and materials are archived on our webpages. Fellows will prepare and lead a data science workshop in the Spring.

    In addition, Fellows will contribute to overall technical and administrative support of the StatLab services in the Library.

  2. Data for Democracy Fellow: Fellows will have the opportunity to deepen their knowledge of quantitative methods, data analysis and data science, and statistical computation by: working collaboratively on a public interest data science project, contributing intellectually and methodologically; practicing open and reproducible data science workflows using tools like R, Python, GitHub, and ShareLaTeX; communicating about the work through blog posts, research articles, and presentations.
    • Current project: The Public Presidency project works with political news data using computational text analysis to imagine and develop new ways for the public to engage political information in an increasingly complex and polarized information environment.
    • Research elements include: acquiring the text of presidential news from sources representing a range of perceived ideological perspectives; extracting features of presidential news, attention, and activity; developing a dynamic visual representation of presidentially-relevant news that citizens and researchers could use to monitor information and compare features across information streams; testing the value of such information to citizen users via experimentation; and analyzing derived data to test theories in political communication and related disciplines.
    • Project goals include: in addition to the above research goals, the project seeks to develop a set of tools and methods that could be applied to other mediated topics, and to examine how machine learning can be used to promote collective civic intelligence.

    Fellows will contribute to the project, expanding on current ideas and methods to advance the research and project goals.

Fellowship Details
Fellows in both roles have the opportunity to build their data science and quantitative knowledge, experience, and portfolios. Fellows will be paid hourly at $20/hour for up to 10 hours a week for 15 weeks each semester with the possibility of continued work over the summer.

Applicants should have completed at least a year of statistical, computational, or data science coursework before applying and be proficient in at least one statistical software environment such as R, Python, Stata, or SPSS. We particularly encourage applications from women, people of color, LGBT students, first-generation college students and other under-represented groups in the data science field.

To apply, submit an application through UVA’s Handshake system by April 30 for full consideration (applications will be considered until the positions are filled), including a CV and cover letter highlighting:

  • The StatLab fellow role that interests you, or if you’re interested in both;
  • Your experience with data analysis, statistical methods, data wrangling, and visualization on research projects or in classwork;
  • Your experience, if any, with open and reproducible tools and workflows;
  • Your research interests and what you hope to gain as a StatLab fellow.

Desired Qualifications:

  • Graduate student standing (required);
  • Experience performing statistical analyses and interpreting the results;
  • Proficiency in at least one statistical software environment;
  • Familiarity with a range of statistical methods and data analytic techniques;
  • Willingness to support academic research across a variety of domains;
  • Strong oral and written communication skills.

About StatLab and Research Data Services at UVA Library
The Library’s Research Data Services (RDS) provides expertise around data discovery and software, data management and sharing, and data analytics and use. Within RDS, the StatLab offers consultation, collaboration, training and support around data science, applied statistics, and scientific computing. We support research needs around data wrangling and cleaning, analysis and visualization, statistical inference and computational methods, reproducibility and open science. RDS/StatLab staff collectively have expertise in open source programming languages like R and Python, in statistical environments like Stata and SPSS, and in computing technologies like Unix, GitHub, and ShareLaTeX.

StatLab, as a service, assists with a wide portfolio of quantitative research and contributes to data science education; StatLab, as an initiative, engages in collaborative knowledge production through projects of the Data for Democracy Lab. You can learn more about the StatLab at our website.

Getting Started with Moderated Mediation

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis.

The basic idea is that a mediator may depend on another variable called a “moderator”. For example, in our mediation analysis post we hypothesized that self-esteem was a mediator of student grades on the effect of student happiness. We illustrate this below with a path diagram. We see a direct effect of grades on happiness, but also an indirect effect of grades on happiness through self-esteem. A mediation analysis helps us investigate and quantify that indirect effect.

But what if we suspect that, say, gender moderates that indirect effect? In other words, what if we think that the mediation effect of self-esteem might differ between females and males? To analyze that question we use moderated mediation.

The difference between mediation and moderated mediation is that we include an interaction for the moderator in our models. Let’s demonstrate using R.

First we read in the data from our mediation analysis post, but this time with a gender variable added. (Note: this data and example are made up purely for illustration purposes.)

myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/mediationData2.csv')

Next we load the mediation package. If you don’t already have the mediation package, run the install.packages function below. Otherwise you can skip it.

install.packages("mediation")
library("mediation")

Now we define our mediator and outcome models with an interaction term for gender. The interaction needs to happen with both “treatment” and mediating variables. In this case, grades is our “treatment” and self-esteem is the mediator.

model.M <- lm(self.esteem ~ grades*gender, myData)
model.Y <- lm(happiness ~ grades*gender + self.esteem*gender, myData)

Notice this is just like the code in the mediation analysis post except we’ve added an interaction for gender in both models. The formula notation grades*gender is a short cut for writing grades + gender + grades:gender, where “:” is an interaction operator in R’s formula syntax. An interaction allows the effect of grades and self-esteem to vary according to gender.
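
If you want to see exactly which columns these interactions generate, you can inspect the model matrix. This is just a quick check, not a required step:

# view the design-matrix columns created by the interaction terms
head(model.matrix(model.Y))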

Now we run our mediation as before, but this time we only need a couple of simulation draws, since the uncertainty estimates will come later from test.modmed.

results <- mediate(model.M, model.Y, treat='grades', mediator='self.esteem', sims=2)

Finally we perform the moderated mediation using the test.modmed function. This is where we perform the simulation draws to calculate uncertainty. The first argument is the output of the mediation analysis. The second and third arguments are the different levels of the moderators. Notice they each need to be a list object. The last argument specifies the number of simulations. We use 500, but you may want to do as many as 1000.

test.modmed(results, covariates.1 = list(gender = "M"), covariates.2 = list(gender = "F"), sims = 500)


	Test of ACME(covariates.1) - ACME(covariates.2) = 0

data:  estimates from results
ACME(covariates.1) - ACME(covariates.2) = 0.22679, p-value = 0.38
alternative hypothesis: true ACME(covariates.1) - ACME(covariates.2) is not equal to 0
95 percent confidence interval:
 -0.2163987  0.8049392


	Test of ADE(covariates.1) - ADE(covariates.2) = 0

data:  estimates from results
ADE(covariates.1) - ADE(covariates.2) = -0.046475, p-value = 0.84
alternative hypothesis: true ADE(covariates.1) - ADE(covariates.2) is not equal to 0
95 percent confidence interval:
 -0.5832841  0.5655469

Since we’re using simulation to estimate uncertainty, your answer will differ slightly from the output above.

The first section is a test of difference between the average causal mediation effects (ACME), i.e., the indirect effect of grades through self-esteem on happiness. The estimated difference is about 0.22, but the 95% confidence interval spans from -0.21 to 0.80. Using traditional hypothesis testing we might conclude we cannot rule out 0 as the true difference between the mediation effects. Another conclusion might be that the true difference appears to be small, but we don’t have enough evidence to determine if that difference is positive or negative.

The second section is a test of difference between the average direct effects (ADE), i.e., the direct effect of grades on happiness. As with the indirect effect, we don’t have enough evidence to conclude if the difference in direct effects between genders is positive or negative.

In this case our moderator was a categorical variable but a moderator can also be continuous. We just have to specify different values of the moderator in the covariates arguments of test.modmed. See the documentation of test.modmed for an example. Just enter ?test.modmed in your R console.
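
For instance, if our models had been fit with a continuous moderator such as age (hypothetical here, since our models use gender), the call might look like the following sketch:

# hypothetical sketch: compare mediation effects at two values of a
# continuous moderator named "age" (not present in the models fit above)
test.modmed(results, covariates.1 = list(age = 20),
            covariates.2 = list(age = 70), sims = 500)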

References

  • Tingley, D., Yamamoto, T., Hirose, K., Keele, L., & Imai, K. (2014). Mediation: R package for causal mediation analysis. https://www.jstatsoft.org/article/view/v059i05
  • MacKinnon, D. (2008). Introduction to Statistical Mediation Analysis. Lawrence Erlbaum.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
March 2, 2018

Data Purchase Program – Spring 2018

We are pleased to announce the Spring round of the pilot year of the Data Purchase Program for UVA-affiliated researchers.  Awards are directed at smaller research needs (around $1,000 to $5,000).  However, the total amount available will depend on the number of applications and the potential overall impact on UVA’s research mission.  The deadline for submission for the Data Purchase Program is March 5, 2018.  Please visit the main Data Purchase Program page to learn more.

Questions?  Contact data@virginia.edu.

Getting Started with Multivariate Multiple Regression

Multivariate Multiple Regression is the method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, “why not just run separate regressions for each dependent variable?” That’s actually a good idea! And in fact that’s pretty much what multivariate multiple regression does. It regresses each dependent variable separately on the predictors. However, because we have multiple responses, we have to modify our hypothesis tests for regression parameters and our confidence intervals for predictions.

To get started, let’s read in some data from the book Applied Multivariate Statistical Analysis (6th ed.) by Richard Johnson and Dean Wichern. These data come from exercise 7.25 and involve 17 overdoses of the drug amitriptyline (Rudorfer, 1982). There are two responses we want to model: TOT and AMI. TOT is total TCAD plasma level and AMI is the amount of amitriptyline present in the TCAD plasma level. The predictors are as follows:

GEN, gender (male = 0, female = 1)
AMT, amount of drug taken at time of overdose
PR, PR wave measurement
DIAP, diastolic blood pressure
QRS, QRS wave measurement

We’ll use the R statistical computing environment to demonstrate multivariate multiple regression. The following code reads the data into R and names the columns.

ami_data <- read.table("http://static.lib.virginia.edu/statlab/materials/data/ami_data.DAT")
names(ami_data) <- c("TOT","AMI","GEN","AMT","PR","DIAP","QRS")

Before going further you may wish to explore the data using the summary and pairs functions.

summary(ami_data)
pairs(ami_data)

Performing multivariate multiple regression in R requires wrapping the multiple responses in the cbind() function. cbind() takes two vectors, or columns, and “binds” them together into two columns of data. We insert that on the left side of the formula operator: ~. On the other side we add our predictors. The + signs do not mean addition per se but rather inclusion. Taken together the formula “cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS” translates to “model TOT and AMI as a function of GEN, AMT, PR, DIAP and QRS.” To fit this model we use the workhorse lm() function and save it to an object we named “mlm1”. Finally we view the results with summary().

mlm1 <- lm(cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(mlm1)

Response TOT :

Call:
lm(formula = TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)

Residuals:
   Min     1Q Median     3Q    Max 
-399.2 -180.1    4.5  164.1  366.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.879e+03  8.933e+02  -3.224 0.008108 ** 
GEN          6.757e+02  1.621e+02   4.169 0.001565 ** 
AMT          2.848e-01  6.091e-02   4.677 0.000675 ***
PR           1.027e+01  4.255e+00   2.414 0.034358 *  
DIAP         7.251e+00  3.225e+00   2.248 0.046026 *  
QRS          7.598e+00  3.849e+00   1.974 0.074006 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 281.2 on 11 degrees of freedom
Multiple R-squared:  0.8871,	Adjusted R-squared:  0.8358 
F-statistic: 17.29 on 5 and 11 DF,  p-value: 6.983e-05


Response AMI :

Call:
lm(formula = AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-373.85 -247.29  -83.74  217.13  462.72 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.729e+03  9.288e+02  -2.938 0.013502 *  
GEN          7.630e+02  1.685e+02   4.528 0.000861 ***
AMT          3.064e-01  6.334e-02   4.837 0.000521 ***
PR           8.896e+00  4.424e+00   2.011 0.069515 .  
DIAP         7.206e+00  3.354e+00   2.149 0.054782 .  
QRS          4.987e+00  4.002e+00   1.246 0.238622    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 292.4 on 11 degrees of freedom
Multiple R-squared:  0.8764,	Adjusted R-squared:  0.8202 
F-statistic:  15.6 on 5 and 11 DF,  p-value: 0.0001132

Notice the summary shows the results of two regressions: one for TOT and one for AMI. These are exactly the same results we would get if we modeled each response separately. You can verify this for yourself by running the following code and comparing the summaries to what we got above. They’re identical.

m1 <- lm(TOT ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(m1)
m2 <- lm(AMI ~ GEN + AMT + PR + DIAP + QRS, data = ami_data)
summary(m2)

The same diagnostics we check for models with a single response should be checked for these as well. For a review of some basic but essential diagnostics, see our post Understanding Diagnostic Plots for Linear Regression Analysis.
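
Base R’s plot() method does not directly support mlm objects, so one convenient approach is to run the diagnostics on the separate single-response fits, m1 and m2, from above. A quick sketch:

par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2x2 grid
plot(m1)             # diagnostics for TOT
plot(m2)             # diagnostics for AMI
par(mfrow = c(1, 1))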

We can use R’s extractor functions with our mlm1 object, except we’ll get double the output. For example, instead of one set of residuals, we get two:

head(resid(mlm1))
         TOT        AMI
1  132.82172  161.52769
2  -72.00392 -264.35329
3 -399.24769 -373.85244
4 -382.84730 -247.29456
5 -152.39129   15.78777
6  366.78644  217.13206

Instead of one set of fitted values, we get two:

head(fitted(mlm1))
        TOT       AMI
1 3256.1783 2987.4723
2 1173.0039  917.3533
3 1530.2477 1183.8524
4  978.8473  695.2946
5 1048.3913  828.2122
6 1400.2136 1232.8679

Instead of one set of coefficients, we get two:

coef(mlm1)
                      TOT           AMI
(Intercept) -2879.4782461 -2728.7085444
GEN           675.6507805   763.0297617
AMT             0.2848511     0.3063734
PR             10.2721328     8.8961977
DIAP            7.2511714     7.2055597
QRS             7.5982397     4.9870508

Instead of one residual standard error, we get two:

sigma(mlm1)
     TOT      AMI 
281.2324 292.4363 

Again these are all identical to what we get by running separate models for each response. The similarity ends, however, with the variance-covariance matrix of the model coefficients. We don’t reproduce the output here because of the size, but we encourage you to view it for yourself:

vcov(mlm1)

The main takeaway is that the coefficients from both models covary. That covariance needs to be taken into account when determining if a predictor is jointly contributing to both models. For example, the effects of PR and DIAP seem borderline. They appear significant for TOT but less so for AMI. But it’s not enough to eyeball the results from the two separate regressions! We need to formally test for their inclusion. And that test involves the covariances between the coefficients in both models.

Determining whether or not to include predictors in a multivariate multiple regression requires the use of multivariate test statistics. These are often taught in the context of MANOVA, or multivariate analysis of variance. Again the term “multivariate” here refers to multiple responses or dependent variables. This means we use modified hypothesis tests to determine whether a predictor contributes to a model.

The easiest way to do this is to use the Anova() or Manova() functions in the car package (Fox and Weisberg, 2011), like so:

library(car)
Anova(mlm1)

Type II MANOVA Tests: Pillai test statistic
     Df test stat approx F num Df den Df   Pr(>F)   
GEN   1   0.65521   9.5015      2     10 0.004873 **
AMT   1   0.69097  11.1795      2     10 0.002819 **
PR    1   0.34649   2.6509      2     10 0.119200   
DIAP  1   0.32381   2.3944      2     10 0.141361   
QRS   1   0.29184   2.0606      2     10 0.178092   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results are titled “Type II MANOVA Tests”. The Anova() function automatically detects that mlm1 is a multivariate multiple regression object. “Type II” refers to the type of sum-of-squares. This basically says that predictors are tested assuming all other predictors are already in the model. This is usually what we want. Notice that PR and DIAP appear to be jointly insignificant for the two models despite what we were led to believe by examining each model separately.

Based on these results we may want to see if a model with just GEN and AMT fits as well as a model with all five predictors. One way we can do this is to fit a smaller model and then compare the smaller model to the larger model using the anova() function. (Notice the little “a”; this is different from the Anova() function in the car package.) For example, below we create a new model using the update() function that only includes GEN and AMT. The expression ". ~ . - PR - DIAP - QRS" says “keep the same responses and predictors except PR, DIAP and QRS.”

mlm2 <- update(mlm1, . ~ . - PR - DIAP - QRS)
anova(mlm1, mlm2)

Analysis of Variance Table

Model 1: cbind(TOT, AMI) ~ GEN + AMT + PR + DIAP + QRS
Model 2: cbind(TOT, AMI) ~ GEN + AMT
  Res.Df Df Gen.var.  Pillai approx F num Df den Df Pr(>F)
1     11       43803                                      
2     14  3    51856 0.60386   1.5859      6     22 0.1983

The large p-value provides good evidence that the model with two predictors fits as well as the model with five predictors. Notice the test statistic is “Pillai”, which is one of the four common multivariate test statistics.

The car package provides another way to conduct the same test using the linearHypothesis() function. The beauty of this function is that it allows us to run the test without fitting a separate model. It also returns all four multivariate test statistics. The first argument to the function is our model. The second argument is our null hypothesis. The linearHypothesis() function conveniently allows us to enter this hypothesis as character phrases. The null entered below is that the coefficients for PR, DIAP and QRS are all 0.

lh.out <- linearHypothesis(mlm1, hypothesis.matrix = c("PR = 0", "DIAP = 0", "QRS = 0"))
lh.out

Sum of squares and products for the hypothesis:
         TOT      AMI
TOT 930348.1 780517.7
AMI 780517.7 679948.4

Sum of squares and products for error:
         TOT      AMI
TOT 870008.3 765676.5
AMI 765676.5 940708.9

Multivariate Tests: 
                 Df test stat approx F num Df den Df  Pr(>F)  
Pillai            3 0.6038599 1.585910      6     22 0.19830  
Wilks             3 0.4405021 1.688991      6     20 0.17553  
Hotelling-Lawley  3 1.1694286 1.754143      6     18 0.16574  
Roy               3 1.0758181 3.944666      3     11 0.03906 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The Pillai result is the same as we got using the anova() function above. The Wilks, Hotelling-Lawley, and Roy results are different versions of the same test. The consensus is that the coefficients for PR, DIAP and QRS do not seem to be statistically different from 0. There is some discrepancy in the test results. The Roy test in particular is significant, but this is likely due to the small sample size (n = 17).

Also included in the output are two sum of squares and products matrices, one for the hypothesis and the other for the error. These matrices are used to calculate the four test statistics. These matrices are stored in the lh.out object as SSPH (hypothesis) and SSPE (error). We can use these to manually calculate the test statistics. For example, let SSPH = H and SSPE = E. The formula for the Wilks test statistic is

$$
\frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|}
$$

In R we can calculate that as follows:

E <- lh.out$SSPE
H <- lh.out$SSPH
det(E)/det(E + H)
[1] 0.4405021

Likewise the formula for Pillai is

$$
\mathrm{tr}[\mathbf{H}(\mathbf{H} + \mathbf{E})^{-1}]
$$

tr means trace. That’s the sum of the diagonal elements of a matrix. In R we can calculate as follows:

sum(diag(H %*% solve(E + H)))
[1] 0.6038599

The formula for Hotelling-Lawley is

$$
\mathrm{tr}[\mathbf{H}\mathbf{E}^{-1}]
$$

In R:

sum(diag(H %*% solve(E)))
[1] 1.169429

And finally the Roy statistic is the largest eigenvalue of \(\mathbf{H}\mathbf{E}^{-1}\).

In R code:

e.out <- eigen(H %*% solve(E))
max(e.out$values)
[1] 1.075818

Given these test results, we may decide to drop PR, DIAP and QRS from our model. In fact this is model mlm2 that we fit above. Here is the summary:

summary(mlm2)

Response TOT :

Call:
lm(formula = TOT ~ GEN + AMT, data = ami_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-756.05 -190.68  -59.83  203.32  560.84 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  56.72005  206.70337   0.274   0.7878    
GEN         507.07308  193.79082   2.617   0.0203 *  
AMT           0.32896    0.04978   6.609 1.17e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 358.6 on 14 degrees of freedom
Multiple R-squared:  0.7664,	Adjusted R-squared:  0.733 
F-statistic: 22.96 on 2 and 14 DF,  p-value: 3.8e-05


Response AMI :

Call:
lm(formula = AMI ~ GEN + AMT, data = ami_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-716.80 -135.83  -23.16  182.27  695.97 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -241.34791  196.11640  -1.231  0.23874    
GEN          606.30967  183.86521   3.298  0.00529 ** 
AMT            0.32425    0.04723   6.866 7.73e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 340.2 on 14 degrees of freedom
Multiple R-squared:  0.787,	Adjusted R-squared:  0.7566 
F-statistic: 25.87 on 2 and 14 DF,  p-value: 1.986e-05

Now let’s say we wanted to use this model to predict TOT and AMI for GEN = 1 (female) and AMT = 1200. We can use the predict() function for this. First we need to put our new data into a data frame with column names that match our original data.

nd <- data.frame(GEN = 1, AMT = 1200)
p <- predict(mlm2, nd)
p
       TOT      AMI
1 958.5473 754.0677

This predicts two values, one for each response. Now this is just a prediction and has uncertainty. We usually quantify uncertainty with confidence intervals to give us some idea of a lower and upper bound on our estimate. But in this case we have two predictions from a multivariate model with two sets of coefficients that covary! This means calculating a confidence interval is more difficult. In fact we don’t calculate an interval but rather an ellipse to capture the uncertainty in two dimensions.

Unfortunately at the time of this writing there doesn’t appear to be a function in R for creating uncertainty ellipses for multivariate multiple regression models with two responses. However we have written one below you can use called “predictionEllipse”. The details of the function go beyond a “getting started” blog post but it should be easy enough to use. Simply submit the code in the console to create the function. Then use the function with any multivariate multiple regression model object that has two responses. The newdata argument works the same as the newdata argument for predict. Use the level argument to specify a confidence level between 0 and 1. The default is 0.95. Set ggplot to FALSE to create the plot using base R graphics.

predictionEllipse <- function(mod, newdata, level = 0.95, ggplot = TRUE){
  # labels
  lev_lbl <- paste0(level * 100, "%")
  resps <- colnames(mod$coefficients)
  title <- paste(lev_lbl, "confidence ellipse for", resps[1], "and", resps[2])
  
  # prediction
  p <- predict(mod, newdata)
  
  # center of ellipse
  cent <- c(p[1,1],p[1,2])
  
  # shape of ellipse
  Z <- model.matrix(mod)
  Y <- mod$model[[1]]
  n <- nrow(Y)
  m <- ncol(Y)
  r <- ncol(Z) - 1
  S <- crossprod(resid(mod))/(n-r-1)
  
  # radius of circle generating the ellipse
  z0 <- c(1, as.matrix(newdata))
  rad <- sqrt((m*(n-r-1)/(n-r-m))*qf(level,m,n-r-m)*t(z0)%*%solve(t(Z)%*%Z) %*% z0)
  
  # generate ellipse using ellipse function in car package
  ell_points <- car::ellipse(center = c(cent), shape = S, radius = c(rad), draw = FALSE)
  
  # ggplot2 plot
  if(ggplot){
    require(ggplot2, quietly = TRUE)
    ell_points_df <- as.data.frame(ell_points)
    ggplot(ell_points_df, aes(x, y)) +
      geom_path() +
      geom_point(data = data.frame(x = cent[1], y = cent[2])) + # works for any pair of responses
      labs(x = resps[1], y = resps[2], 
           title = title)
  } else {
    # base R plot
    plot(ell_points, type = "l", xlab = resps[1], ylab = resps[2], main = title)
    points(x = cent[1], y = cent[2])
  }
}

Here’s a demonstration of the function.

predictionEllipse(mod = mlm2, newdata = nd)

The dot in the center is our predicted values for TOT and AMI. The ellipse represents the uncertainty in this prediction. We’re 95% confident the expected values of TOT and AMI when GEN = 1 and AMT = 1200 are within the area of the ellipse. Notice also that TOT and AMI seem to be positively correlated. Predicting higher values of TOT means predicting higher values of AMI, and vice versa.
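
One informal way to quantify that correlation, as a quick supplement to the plot, is the correlation of the residuals from the two responses:

# correlation between the TOT and AMI residuals
cor(resid(mlm2))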

References

Fox, J and Weisberg, S (2011). An R Companion to Applied Regression, Second Edition. Thousand Oaks CA: Sage. URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion

Johnson, R and Wichern, D (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Prentice-Hall.

Rudorfer, MV “Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose.” Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
October 27, 2017

Spring 2018 Data Science Short Courses

The Library’s Research Data Services is offering two 1-credit data science courses in Spring 2018 through the Data Science Institute: Data Wrangling in Python and the Public Interest Data Lab. Open to all, these courses are designed to provide short, guided training and practice on relevant data science skills.


DS 6501 Data Wrangling in Python (1 credit, meets the first five weeks of the semester)
Pete Alonzi
T,R 2:00-3:15 from 1/18/2018-2/20/2018
Dell 2 100

The goal of this class is to make the skills of data wrangling in Python familiar and easy to use. We’ll cover data cleaning and data manipulation in Python, including reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation, and the matplotlib library for plotting.


DS 5559 Public Interest Data Lab (1 credit, meets the first ten weeks of the semester)
Michele Claibourn
F 12:00-2:00 from 1/19/2018-3/30/2018
Brown Library

The lab course will provide experience working collaboratively, openly, and reproducibly on data science projects organized by the lab director — for example, working with local agencies to understand their data and improve processes. The goal is to provide students with an opportunity to enhance their data skills and to gain experience working as a team on a joint project while promoting social good. See the SIS entry for information about the Spring 2018 project.


To register search for Subject “DS” and the Course Number in SIS.

Announcing the Data Purchase Program

Did you know that UVA Library will consider purchasing datasets for you?  It’s true!

We are launching a new Data Purchase Program for UVA-affiliated researchers.  Awards are directed at smaller research needs (around $1,000 to $5,000).  However, the total amount available will depend on the number of applications and the potential overall impact on UVA’s research mission.  The deadline for submission for the inaugural round of the Data Purchase Program is October 15, 2017.  We will have another round in the Spring.  Please visit the main Data Purchase Program page to learn more.

Questions?  Contact data@virginia.edu.

Visualizing the Effects of Proportional-Odds Logistic Regression

Proportional-odds logistic regression is often used to model an ordered categorical response. By “ordered”, we mean categories that have a natural ordering, such as “Disagree”, “Neutral”, “Agree”, or “Everyday”, “Some days”, “Rarely”, “Never”. For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model. In this post we demonstrate how to visualize a proportional-odds model in R.

To begin, we load the effects package. The effects package provides functions for visualizing regression models. This post is essentially a tutorial for using the effects package with proportional-odds models. We also load the car, MASS and splines packages for particular functions, which we’ll explain as we encounter them. If you don’t have the effects or car packages, uncomment the lines below and run them in R. MASS and splines are recommended packages that come with R.

> # install.packages("effects")
> # install.packages("car")

> library(effects)
> library(car)
> library(MASS)
> library(splines) 

To demonstrate how to visualize a proportional-odds model we’ll use data from the World Values Surveys (1995-1997) for Australia, Norway, Sweden, and the United States. This dataset, WVS, comes with the effects package. Once we load the effects package, the data is ready to access.

> head(WVS)
      poverty religion degree country age gender
1  Too Little      yes     no     USA  44   male
2 About Right      yes     no     USA  40 female
3  Too Little      yes     no     USA  36 female
4    Too Much      yes    yes     USA  25 female
5  Too Little      yes    yes     USA  39   male
6 About Right      yes     no     USA  80 female

The response variable of interest is poverty, a 3-level ordered categorical variable containing answers to the question “Do you think that what the government is doing for people in poverty in this country is about the right amount, too much, or too little?” The ordered levels are “Too Little”, “About Right”, and “Too Much”. The other variables will serve as our predictors: country, gender, religion (belong to a religion?), degree (hold a university degree?), and age in years. The data contain 5381 records.
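
Since the model we fit below takes the factor’s ordering as given, it’s worth a quick check that R has the levels of poverty in the intended order:

> levels(WVS$poverty)
[1] "Too Little"  "About Right" "Too Much"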

Before we get started, we should note that this example comes from an article in the Journal of Statistical Software by John Fox and Jangman Hong, the authors of the effects package. You should definitely take the time to read through that article and cite it if you plan to use the effects package for your own research. What we seek to do in this blog post is elaborate on their example and provide some additional details.

Before we can visualize a proportional odds model we need to fit it. For this we use the polr function from the MASS package. The first model we fit models poverty as a function of country interacted with gender, religion, degree and age. The interaction allows the effects of the predictors to vary with each country.

> wvs.1 <- polr(poverty ~ country*(gender + religion + degree + age), data = WVS)
> summary(wvs.1)

Call:
polr(formula = poverty ~ country * (gender + religion + degree + 
    age), data = WVS)

Coefficients:
                               Value Std. Error  t value
countryNorway              0.5308176   0.286989  1.84961
countrySweden              0.5446552   0.546029  0.99748
countryUSA                -0.0347317   0.248059 -0.14001
gendermale                 0.0696120   0.090212  0.77165
religionyes                0.0094685   0.112476  0.08418
degreeyes                 -0.1242920   0.167603 -0.74158
age                        0.0155849   0.002597  6.00185
countryNorway:gendermale   0.1873611   0.144503  1.29659
countrySweden:gendermale   0.0563508   0.154414  0.36493
countryUSA:gendermale      0.2119735   0.139513  1.51938
countryNorway:religionyes -0.2186724   0.216256 -1.01118
countrySweden:religionyes -0.8789724   0.513263 -1.71252
countryUSA:religionyes     0.6002277   0.174433  3.44101
countryNorway:degreeyes    0.0558595   0.208202  0.26829
countrySweden:degreeyes    0.6281743   0.214295  2.93136
countryUSA:degreeyes       0.3030866   0.206394  1.46848
countryNorway:age         -0.0157142   0.004367 -3.59846
countrySweden:age         -0.0092122   0.004657 -1.97826
countryUSA:age             0.0005419   0.003975  0.13635

Intercepts:
                       Value   Std. Error t value
Too Little|About Right  0.7161  0.1535     4.6644
About Right|Too Much    2.5355  0.1578    16.0666

Residual Deviance: 10347.07 
AIC: 10389.07 

The summary output is imposing. In addition to 19 coefficients we have 2 intercepts. Larger coefficients with large t-values are indicative of important predictors, but with so many interactions it’s hard to see what’s happening or what the model “says”. To evaluate whether the interactions are significant, we use the Anova function from the car package. By default the Anova function returns Type II tests, which test each term after all others, save interactions. (The base R anova function performs Type I tests, which test each term sequentially.)

> Anova(wvs.1)
Analysis of Deviance Table (Type II tests)

Response: poverty
                 LR Chisq Df Pr(>Chisq)    
country           250.881  3  < 2.2e-16 ***
gender             10.749  1  0.0010435 ** 
religion            4.132  1  0.0420698 *  
degree              4.284  1  0.0384725 *  
age                49.950  1  1.577e-12 ***
country:gender      3.049  3  0.3841657    
country:religion   21.143  3  9.833e-05 ***
country:degree     12.861  3  0.0049476 ** 
country:age        17.529  3  0.0005501 ***

The Anova result shows all interactions except country:gender are significant. But what do the interactions mean? How do, say, country and age interact?

This is where the effects package enters. The effects package allows us to easily create effect displays. What are effect displays? The documentation for the effects package explains it this way:

“To create an effect display, predictors in a term are allowed to range over their combinations of values, while other predictors in the model are held to typical values.”

In other words, we take our model and use it to calculate predicted values for various combinations of certain “focal” predictors while holding other predictors at fixed values. Then we plot our predicted values versus the “focal” predictors to see how the response changes. Let’s demonstrate.

The two primary functions are Effect and plot. Effect generates the predictions and plot creates the display. Let’s say we’re interested in the age and country interaction. We want to visualize how age affects views on poverty for each country. Since our model includes an interaction, which was significant, we expect to see different trajectories for each country.

The following code generates the predicted values. The first argument, focal.predictors, is where we list the predictors we’re interested in. Notice it requires a vector, which is why we use the c() function. The second argument is the fitted model.

> Effect(focal.predictors = c("age","country"), wvs.1)

age*country effect (probability) for Too Little
    country
age  Australia    Norway    Sweden       USA
  20 0.5958889 0.5632140 0.6496015 0.4330782
  30 0.5578683 0.5635318 0.6349615 0.3939898
  40 0.5191570 0.5638496 0.6200674 0.3562127
  50 0.4802144 0.5641674 0.6049438 0.3201442
  60 0.4415107 0.5644851 0.5896166 0.2861049
  70 0.4035049 0.5648027 0.5741134 0.2543308
  80 0.3666230 0.5651203 0.5584631 0.2249734
  90 0.3312402 0.5654379 0.5426958 0.1981045

age*country effect (probability) for About Right
    country
age  Australia    Norway    Sweden       USA
  20 0.3050532 0.3250955 0.2699789 0.3918455
  30 0.3282699 0.3249058 0.2797785 0.4064109
  40 0.3502854 0.3247160 0.2895692 0.4171736
  50 0.3704966 0.3245261 0.2993161 0.4237414
  60 0.3883075 0.3243362 0.3089823 0.4258700
  70 0.4031610 0.3241462 0.3185295 0.4234793
  80 0.4145714 0.3239561 0.3279182 0.4166592
  90 0.4221528 0.3237659 0.3371079 0.4056632

age*country effect (probability) for Too Much
    country
age   Australia    Norway     Sweden       USA
  20 0.09905788 0.1116905 0.08041952 0.1750762
  30 0.11386182 0.1115624 0.08526005 0.1995993
  40 0.13055755 0.1114343 0.09036331 0.2266137
  50 0.14928897 0.1113065 0.09574006 0.2561143
  60 0.17018180 0.1111787 0.10140107 0.2880252
  70 0.19333407 0.1110511 0.10735706 0.3221899
  80 0.21880555 0.1109236 0.11361867 0.3583674
  90 0.24660694 0.1107962 0.12019630 0.3962323

Notice the output lists three sections of probabilities corresponding to each level of the response. The first section lists predicted probabilities for answering “Too Little” for each country for ages ranging from 20 to 90 in increments of 10. The predicted probability a 20-year-old from the USA answers “Too Little” is about 0.43. The second section is for “About Right”. The predicted probability a 20-year-old from the USA answers “About Right” is about 0.39. Finally, the third section is for “Too Much”. The predicted probability a 20-year-old from the USA answers “Too Much” is about 0.18.
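
As a quick sanity check, we can reproduce one of these probabilities by hand from the summary output. Under the proportional-odds model fit by polr, P(Y <= j) = plogis(zeta_j - eta), where zeta_j is an intercept and eta is the linear predictor. Below is a minimal sketch for the 20-year-old USA respondent, plugging in the sample means that the Effect function uses for the non-focal predictors (we show where those values come from shortly):

> b <- coef(wvs.1)
> # linear predictor for the USA, age 20, with gender, religion, and
> # degree held at their sample means
> eta <- b["countryUSA"] +
+   (b["gendermale"] + b["countryUSA:gendermale"]) * 0.4935886 +
+   (b["religionyes"] + b["countryUSA:religionyes"]) * 0.8539305 +
+   (b["degreeyes"] + b["countryUSA:degreeyes"]) * 0.2124140 +
+   (b["age"] + b["countryUSA:age"]) * 20
> unname(plogis(wvs.1$zeta["Too Little|About Right"] - eta))
[1] 0.4330782

This agrees with the “Too Little” probability for a 20-year-old in the USA printed above.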

This information is much easier to digest as an effect display. Simply insert the original Effect function call into the plot function to create the effect display.

> plot(Effect(focal.predictors = c("age","country"), wvs.1), rug = FALSE)

Now we see how the model “works”. For example, in the upper right plot, we see that in the USA, the probability of answering “Too Much” increases rather dramatically with age, while the probabilities for answering the same in Norway and Sweden stay low and constant. Likewise we see that the probability of USA respondents answering “Too Little” decreases with age while the probabilities for Norway and Sweden stay rather high and constant. The effect display shows us where the interactions are happening and to what degree. (Note: setting rug = FALSE turns off the rug plot. When set to TRUE, the default, the marginal distribution of the predictor is displayed on the x axis. In this case we don’t find it very helpful since we have so much data.)

Recall that we need to use all predictors to generate these plots. Country and age are the focal predictors, so they are varied. This means gender, religion and degree are held fixed. We can find out the values they’re fixed at by saving the result of the Effect function and viewing the model matrix.

> e.out <- Effect(focal.predictors = c("age","country"), wvs.1)
> e.out$model.matrix[1,c("gendermale","religionyes","degreeyes")]
 gendermale religionyes   degreeyes 
  0.4935886   0.8539305   0.2124140 

This says gender was set to 0.4935886, religion to 0.8539305, and degree to 0.2124140. In the original data these are indicator variables that take values of 0 or 1 corresponding to No and Yes. Does it make sense to plug in decimals instead of 0s or 1s? It does if you think of modeling a population that is about 49% men, 85% religious, and 21% with a college degree. In fact this describes our sample. By default, the effects package will take the mean of numeric variables that are held fixed. We can verify that’s what it’s doing:

> mean(WVS$gender=="male")
[1] 0.4935886
> mean(WVS$religion=="yes")
[1] 0.8539305
> mean(WVS$degree=="yes")
[1] 0.212414

If we want to change these values, we can use the given.values argument. It needs to be a named vector that uses the terms as listed in the output summary. For example, to create an effect plot for religious men without a college degree:

> e.out <- Effect(focal.predictors = c("country","age"), mod = wvs.1, 
                given.values = c(gendermale = 1, religionyes = 1, degreeyes = 0))
> plot(e.out, rug = FALSE)

We see that the overall “story” of the display does not change; the same changes are happening in the same plots. But the overall probabilities have increased, as evidenced by the y axis now topping out at 0.7.

We can also change the range and number of values used for the focal predictors with the xlevels argument. For example, we can set age to range from 20 to 80 in steps of 10. Notice it needs to be a named list.

> e.out <- Effect(focal.predictors = c("country","age"), mod = wvs.1,
                 xlevels = list(age = seq(20,80,10)))
> plot(e.out, rug = FALSE)

We can investigate other interactions using the same syntax. Below is the effect display for the religion and country interaction. In this case, age is no longer a focal predictor and is held fixed at its mean (45.04).

> plot(Effect(focal.predictors = c("religion","country"), wvs.1), rug = FALSE)

We notice, for example, that religious people in the USA have a higher probability of answering “Too Much” compared to their counterparts in the other countries surveyed. We also notice there is much more uncertainty about estimates in Sweden for people who answered “No” to the religion question. This is due to the small numbers of respondents in those categories, as we can see with the xtabs function.

> xtabs(~ religion + country + poverty, data = WVS, country == "Sweden", 
        drop.unused.levels = TRUE)
, , poverty = Too Little

        country
religion Sweden
     no       7
     yes    597

, , poverty = About Right

        country
religion Sweden
     no       5
     yes    369

, , poverty = Too Much

        country
religion Sweden
     no       3
     yes     22

Following the example in Fox’s article, let’s fit another model that relaxes the linearity assumption for age. We can do this by generating what’s called a basis matrix for natural cubic splines. Instead of fitting a regular polynomial such as age + age^2, we fit piecewise cubic polynomials over the range of age, joined at breakpoints called knots. The ns function in the splines package makes this easy to do. Below we use it in the model formula and specify 4 degrees of freedom, which lets ns choose the knot locations for us. Harrell (2001) suggests 3-5 knots is usually a good choice (p. 23); 4 degrees of freedom corresponds to 3 interior knots, comfortably within that range.
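
To get a feel for what ns generates, we can build the basis matrix outside the model formula and inspect it (an optional check):

> age.basis <- ns(WVS$age, 4)
> dim(age.basis)            # one row per observation, one column per spline term
[1] 5381    4
> attr(age.basis, "knots")  # the 3 interior knots, placed at quantiles of age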

> wvs.2 <- polr(poverty ~ country*(gender + religion + degree + ns(age, 4)), data = WVS)
> summary(wvs.2)
> Anova(wvs.2)

Due to space considerations we don’t print the output of the summary and Anova functions. The model summary shows information for 31 coefficients and is very difficult to interpret. The Anova result is similar in substance to the first model, showing all interactions except country:gender significant. It’s worth noting the AIC of the second model is considerably lower than the first (10373.85 vs 10389.07), suggesting a better fit.
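
The AIC function makes this comparison easy:

> AIC(wvs.1, wvs.2)
      df      AIC
wvs.1 21 10389.07
wvs.2 33 10373.85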

Let’s look at the country and age interaction while allowing age to range from 20 to 80:

> plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, 
            xlevels = list(age = 20:80)), rug = FALSE)

We see that using a natural spline allows a nonlinear effect of age. For example we see the probability of answering “Too Little” in the USA decreases sharply from 20 to 30, increases from about age 30 to 45, and then decreases and levels out through age 80.

The effects package also allows us to create “stacked” effect displays for proportional-odds models. We do this by setting style="stacked" in the plot function.

> plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, 
            xlevels = list(age = 20:80)), 
     rug = FALSE,
     style="stacked")

This plot is useful for allowing us to compare probabilities across the response categories. For example, in Norway and Sweden, people are most likely to answer “Too Little” regardless of age. The blue shaded regions dominate their graphs.

We can also create a “latent” version of the effect display. In this plot, the y axis is on the logit scale, which we interpret to be a latent, or hidden, scale from which the ordered categories are derived. We create it by setting latent = TRUE in the Effect function.

> plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, 
            xlevels = list(age = 20:80),
            latent = TRUE),
     rug = FALSE,
     ylim = c(0,3.5))

This plot is useful when we’re more interested in classification than probability. The horizontal lines in the plots correspond to the intercepts in the summary output. We can think of these lines as thresholds that define where we cross over from one category to the next on the latent scale. The TL-AR line indicates the boundary between the “Too Little” and “About Right” categories. The AR-TM line indicates the boundary between the “About Right” and “Too Much” categories. Like the “stacked” effect display, this plot suggests someone from Norway or Sweden would be expected to answer “Too Little” regardless of age, though the confidence ribbon indicates this expectation is far from certain, especially for older and younger respondents. On the other hand, most USA respondents are expected to answer “About Right”.
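
Since we suppressed the summary output for wvs.2, note that the threshold values plotted here are stored in the fitted model object and can be printed directly:

> wvs.2$zeta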

Let’s see how the “latent” plot changes when we set the non-focal predictors to represent a college-educated, non-religious female.

> plot(Effect(focal.predictors = c("country","age"), mod = wvs.2, 
            xlevels = list(age = 20:80),
            given.values = c(gendermale = 0, religionyes = 0, degreeyes = 1),
            latent = TRUE),
     rug = FALSE,
     ylim = c(0,3.5))

Notice now that predicted classification for Sweden is “About Right” over the age range but with increased uncertainty. We also see increased chances of answering “Too Little” for certain age ranges in the USA.

We can add gender as a focal predictor to compare plots for males versus females:

> plot(Effect(focal.predictors = c("country","age","gender"), mod = wvs.2, 
            xlevels = list(age = 20:80),
            latent = TRUE),
     rug = FALSE,
     ylim = c(0,3.5))

Since we didn’t fit a 3-way interaction between country, gender, and age, the shapes of the trajectories do not change between genders. They simply shift up or down between the two levels of gender.
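
If we wanted the age trajectories themselves to differ by gender, we would need to fit and visualize a model with a three-way interaction. A hypothetical sketch, not fit in this post:

> # wvs.3 <- polr(poverty ~ country*gender*ns(age, 4) +
> #               country*(religion + degree), data = WVS)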

References

Fox, J. and J. Hong (2009). Effect displays in R for multinomial and proportional-odds logit models: Extensions to the effects package. Journal of Statistical Software 32:1, 1–24, <http://www.jstatsoft.org/v32/i01/>.

Harrell, F. (2001). Regression Modeling Strategies. New York: Springer.

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
May 10, 2017

Getting started with the purrr package in R

If you’re wondering what exactly the purrr package does, then this blog post is for you.

Before we get started, we should mention the Iteration chapter in R for Data Science by Garrett Grolemund and Hadley Wickham. We think this is the most thorough and extensive introduction to the purrr package currently available (at least at the time of this writing). Wickham is one of the authors of the purrr package, and he spends a good deal of the chapter clearly explaining how it works. Good stuff, recommended reading.

The purpose of this article is to provide a short introduction to purrr, focusing on just a handful of functions. We use some real-world data and replicate what purrr does in base R so we have a better understanding of what’s going on.

We visited Yahoo Finance on 13 April 2017 and downloaded about three weeks of historical data for three companies: Boeing, Johnson & Johnson, and IBM. The following R code will download and unzip the data in your current working directory if you wish to follow along.

URL <- "http://static.lib.virginia.edu/statlab/materials/data/stocks.zip"
download.file(url = URL, destfile = basename(URL))
unzip(basename(URL))

We have three CSV files. In the spirit of being efficient, we would like to import these files into R using as little code as possible (as opposed to calling read.csv three separate times).

Using base R functions, we could put all the file names into a vector and then apply the read.csv function to each file. This results in a list of three data frames. When done we could name each list element using the names function and our vector of file names.

# get all files ending in csv
files <- list.files(pattern = "csv$") 
# read in data
dat <- lapply(files, read.csv)
names(dat) <- gsub("\\.csv", "", files) # remove file extension

Here is how we do the same using the map function from the purrr package.

install.packages("purrr") # if package not already installed
library(purrr)
dat2 <- map(files, read.csv)
names(dat2) <- gsub("\\.csv", "", files)

So we see that map is like lapply. It takes a vector as input and applies a function to each element of the vector. map is one of the star functions in the purrr package.

Let’s say we want to find the mean Open price for each stock. Here is a base R way using lapply and an anonymous function:

lapply(dat, function(x)mean(x$Open))
$BA
[1] 177.8287

$IBM
[1] 174.3617

$JNJ
[1] 125.8409

We can do the same with map.

map(dat, function(x)mean(x$Open))
$BA
[1] 177.8287

$IBM
[1] 174.3617

$JNJ
[1] 125.8409

But map allows us to bypass the explicit function definition. Using a tilde (~) in place of function and a dot (.) in place of x, we can do this:

map(dat, ~mean(.$Open))

Furthermore, purrr provides several versions of map that allow you to specify the structure of your output. For example, if we want a vector instead of a list we can use the map_dbl function. The “_dbl” suffix indicates that it returns a vector of type double (i.e., numbers with decimals).

map_dbl(dat, ~mean(.$Open))
      BA      IBM      JNJ 
177.8287 174.3617 125.8409 

Now let’s say that we want to extract each stock’s Open price data. In other words, we want to go into each data frame in our list and pull out the Open column. We can do that with lapply as follows:

lapply(dat, function(x)x$Open)
$BA
 [1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99
[18] 180.10 178.31 179.82 179.00 178.54 177.16

$IBM
 [1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01
[18] 175.65 176.29 178.46 175.71 176.18 177.85

$JNJ
 [1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38
[18] 128.04 128.45 128.44 127.05 126.86 125.83

Using map is a little easier. We just provide the name of the column we want to extract.

map(dat, "Open")
$BA
 [1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99
[18] 180.10 178.31 179.82 179.00 178.54 177.16

$IBM
 [1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01
[18] 175.65 176.29 178.46 175.71 176.18 177.85

$JNJ
 [1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38
[18] 128.04 128.45 128.44 127.05 126.86 125.83

We often want to plot financial data. In this case we may want to plot the closing price for each stock and look for trends. We can do this with the base R function mapply. First we create a vector of stock names for plot labeling. Next we set up one row of three plotting regions. Then we use mapply to create the plots. The “m” in mapply means “multiple arguments”: in this case we have two, the closing prices and the stock names. Notice that mapply requires the function to come first, followed by the arguments.

stocks <- sub("\\.csv","", files)
par(mfrow=c(1,3))
mapply(function(x,y)plot(x$Close, type = "l", main = y), x = dat, y = stocks)

The purrr equivalent is map2. Again we can substitute a tilde (~) for function, but now we need to use .x and .y to identify the arguments. The ordering, however, is the same as map: the data come first and then the function.

map2(dat, stocks, ~plot(.x$Close, type="l", main = .y))

Each time we run mapply or map2 above, the following is printed to the console:

$BA
NULL

$IBM
NULL

$JNJ
NULL

This is because both functions return a value, and since plot returns no value, NULL is printed. The purrr package provides walk for dealing with functions like plot that are called for their side effects. Here is the same task with walk2 instead of map2. It produces the plots and prints nothing to the console.

walk2(dat, stocks, ~plot(.x$Close, type="l", main = .y))

At some point we may want to collapse our list of three data frames into a single data frame. This means we’ll want to add a column to indicate which record belongs to which stock. Using base R this is a two-step process. We use do.call to rbind the elements of our list into one data frame. Then we add a column called Stock by taking advantage of the fact that the row names of the combined data frame contain the names of the original list elements, in this case the stock names.

datDF <- do.call(rbind, dat)
# add stock names to data frame
datDF$Stock <- gsub("\\.[0-9]*", "", rownames(datDF)) # remove period and numbers
head(datDF)
           Date   Open   High    Low  Close  Volume Adj.Close Stock
BA.1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
BA.2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
BA.3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
BA.4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
BA.5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
BA.6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA

Using purrr, we could have used map_df instead of map with the read.csv function, but we would have lost the source file information.

dat2DF <- map_df(files, read.csv) # works, but which record goes with which stock?
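
One workaround, assuming your version of purrr supports the .id argument to map_df (it is forwarded to dplyr::bind_rows), is to name the input vector so each row is labeled with its source:

# assumes map_df has the .id argument; names(files) supplies the labels
names(files) <- stocks
dat2DF <- map_df(files, read.csv, .id = "Stock")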

We could also use purrr’s reduce function. That will collapse the list into a single data frame. But again we have no way of labeling which row came from which stock.

dat2DF <- reduce(dat, rbind) # works, but which record goes with which stock?

To accomplish this with purrr, we need to use the stocks vector we created earlier along with the map2_df function. This function applies a function to pairs of elements from two vectors or lists and returns a data frame. The function we want to apply is update_list, another purrr function. The update_list function allows us to add things to a list element, such as a new column to a data frame. Below we use the formula notation again, with .x and .y indicating the two arguments. The result is a single data frame with a new stock column.

dat2DF <- map2_df(dat2, stocks, ~update_list(.x, stock = .y))
head(dat2DF)
        Date   Open   High    Low  Close  Volume Adj.Close stock
1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA

Finally, we should consider reformatting the Date column as a Date instead of a Factor. The easiest way to deal with this would have been to use the read_csv function from the readr package instead of read.csv. But in the interest of demonstrating some more purrr functionality, let’s pretend we can’t do that. Further, let’s pretend we don’t know which columns are Factor, but we would like to convert them to Date if they are Factor. This time we give a purrr solution first.

To do this we nest one purrr function inside another. The workhorse is dmap_if. dmap is just like map, except dmap returns a data frame. dmap_if allows us to define a condition that dictates whether or not the function is applied. In this case the condition is determined by is.factor: if is.factor returns TRUE, then we apply the ymd function from the lubridate package. Since dmap_if takes a data frame, not a list, we have to use map to apply dmap_if to each data frame in our list. The final code is as follows:

dat2 <- map(dat2, ~dmap_if(., is.factor, lubridate::ymd))

Doing this in base R is possible but far more difficult. We nest one lapply function inside another, but since lapply returns a list, we need to wrap the first lapply with as.data.frame. And within the first lapply we have to use the assignment operator as a function, which works but looks cryptic!

dat <- lapply(dat, 
              function(x) as.data.frame(   # inner lapply returns a list; rebuild the data frame
                lapply(x,
                       function(y)
                         if (is.factor(y)) 
                           `<-`(y, lubridate::ymd(y))  # assign the converted dates to y and return them
                         else y)))

This article provides just a taste of purrr. We hope it gets you started learning more about the package. Be sure to read the documentation as well. Each help page contains illustrative examples. Note that purrr is a very young package. At the time of this writing it is at version 0.2.2. There are sure to be improvements and changes in the coming months and years.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2017

Endangered Data Week (April 17-21)

The UVa Library is hosting a number of events as part of Endangered Data Week (April 17-21), a new, collaborative effort, coordinated across campuses, nonprofits, libraries, citizen science initiatives, and cultural heritage institutions, to shed light on public datasets that are in danger of being deleted, repressed, mishandled, or lost.

Below is the current list of events we are hosting at UVa for Endangered Data Week. No reservations are required, and light refreshments will be provided, so please join us if you can!

  • Introduction to Libra Data (Dataverse at UVa) – Monday, April 17th 11am-noon, in Brown Library, Room 133.
  • Introduction to Git/Github – Tuesday, April 18th, noon-1:30pm, in Brown Library, Room 133.
  • Introduction to DocNow – Tuesday, April 18th, 2pm-4pm, in Alderman Library, Room 421.
  • Web Scraping with R – Wednesday, April 19th, 10:30am-noon, in Brown Library, Room 133.
  • Preserving Artifact and Architecture with Cultural Heritage Informatics – Friday, April 21st, 10:30am-11:30am, Location TBD.
  • Endangered Data Week webinar – Friday, April 21st, 1pm-2:30pm, in Brown Library, Room 133.

Within the UVa Library, these events are being hosted by Research Data Services and the Scholars’ Lab.

For more information about other Endangered Data Week events around the country and around the world, please visit the Endangered Data Week website.