NOTE: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. King and Nielsen’s 2019 article, Why Propensity Scores Should Not Be Used for Matching, argue that propensity scores increase imbalance, inefficiency, model dependence, and bias. Others argue that while propensity scores may be sub-optimal, they can be useful in certain situations. As with all statistical analyses, proceed with caution, accept uncertainty, and be thoughtful, open and modest.

A frequent research question is whether or not some “treatment” causes an effect. For example, does taking aspirin daily reduce the chance of a heart attack? Does more sleep lead to better academic performance for teenagers? Does smoking increase the risk of chronic obstructive pulmonary disease (COPD)?

To truly answer such questions, we need a time machine and a lack of ethics. We would need to be able to take a subject and, say, make him or her smoke for several years and observe whether or not they get COPD. Then travel back in time and observe the same subject NOT smoking to see if they get COPD. Only then could we really see the effect of smoking on that person’s chances of contracting COPD. To estimate the average effect of smoking on COPD, we would need to do the same with many subjects. Clearly the laws of physics and all IRBs forbid this.

To work around these issues researchers often employ what are called “matching methods”. This involves taking observational data, such as data from surveys, and matching people who have similar characteristics but different treatments. I can’t travel through time and make a person smoke and not smoke. But I can identify two people who are similar in almost every way except their smoking status. Here’s a 200-pound, middle class, college educated, Caucasian man who smokes, and a 200-pound, middle class, college educated, Caucasian man who does NOT smoke. They’re not the same people but they’re similar. It stands to reason the effects of smoking on the smoker can approximate the effect of smoking on the non-smoker, and vice versa. This is the idea of matching methods: create two such groups that are similar in every respect but their treatment.

For a fantastic and easy-to-read overview of matching methods, see Elizabeth Stuart’s 2010 paper “Matching Methods for Causal Inference: A Review and a Look Forward“. See also Michele Claibourn’s excellent 2014 workshop on matching methods. In this brief article, we demonstrate how to get started with matching methods using R and the MatchIt package.

To begin we need to load some data. For demonstration purposes we’ll use a sample of the 2015 BRFSS survey (pronounced bur-fiss). BRFSS stands for Behavioral Risk Factor Surveillance System. BRFSS is a CDC telephone survey that collects state data about U.S. residents regarding their health-related risk behaviors and chronic health conditions. The entire survey has over 400,000 records and over 300 variables. The sample we’ll work with has 5000 records and 7 variables.

brfss <- read.csv("http://static.lib.virginia.edu/statlab/materials/data/brfss_2015_sample.csv") summary(brfss) COPD SMOKE RACE AGE SEX WTLBS AVEDRNK2 No :4707 Min. :0.0000 Black : 308 18-24: 258 Female:2521 Min. : 85 Min. : 1.000 Yes: 293 1st Qu.:0.0000 Hispanic: 339 25-34: 599 Male :2479 1st Qu.:148 1st Qu.: 1.000 Median :0.0000 Other : 254 35-44: 667 Median :172 Median : 2.000 Mean :0.1524 White :4099 45-54: 914 Mean :178 Mean : 2.122 3rd Qu.:0.0000 55-64:1122 3rd Qu.:200 3rd Qu.: 2.000 Max. :1.0000 65+ :1440 Max. :527 Max. :30.000

The summary provides details on the seven variables:

COPD: Ever told you have chronic obstructive pulmonary disease (COPD)?

SMOKE: Adults who are current smokers (0 = no, 1 = yes)

RACE: Race group

AGE: age group

SEX: gender

WTLBS: weight in lbs

AVEDRNK2: During the past 30 days, when you drank, how many drinks did you drink on average?

We wish to investigate the effect of smoking on COPD. Does smoking increase the chance of contracting COPD? Obviously this is not something we could carry out as a randomized experiment. But perhaps using this observational data and matching methods we can simulate what our data might look like had we carried out a randomized experiment. We can create two groups, smokers and non-smokers, who are roughly equivalent in age, race, sex, weight and alcohol consumption. (Please note this is only for educational purposes. While the data is real and the methods are sound, we would certainly include several other variables as well as enlist the help of doctors and epidemiologists to investigate a research question like this.)

Before we start matching, let’s check the balance and overlap of our data. By balance we mean that the distributions of our data in the smoking and non-smoking groups are similar. By overlap, we mean that the continuous variables in the smoking and non-smoking groups span the same range of values. We can informally check both with some plots. Notice below we use the `factor()`

function to ensure ggplot recognizes smoking as a categorical variable. We have smoking stored in our data as a numeric column of zeroes and ones because that’s how the MatchIt package requires treatment variables to be stored.

library(ggplot2) p <- ggplot(brfss, aes(fill = factor(SMOKE))) + geom_bar(position = "dodge") + scale_fill_discrete("Smoke") p + aes(x = AGE) p + aes(x = RACE) p + aes(x = SEX)

Above we notice a lack of balance. For example, the number of non-smokers increase with age while the number of smokers stays fairly constant.

To check the balance and overlap of the continuous weight and drink variables we plot histograms.

ggplot(brfss, aes(x = WTLBS, fill = factor(SMOKE))) + geom_histogram(position = "identity") + scale_fill_discrete("Smoke") ggplot(brfss, aes(x = AVEDRNK2, fill = factor(SMOKE))) + geom_histogram(position = "identity", binwidth = 1) + scale_fill_discrete("Smoke")

The balance maybe looks OK but there appears to be a slight lack of overlap at the higher values. We can investigate more closely by looking at the quantiles within each group.

tapply(brfss$WTLBS, brfss$SMOKE, quantile, probs = seq(0.2, 1, 0.1)) $`0` 20% 30% 40% 50% 60% 70% 80% 90% 100% 140 150 163 175 185 200 212 240 522 $`1` 20% 30% 40% 50% 60% 70% 80% 90% 100% 140.0 150.0 160.0 172.5 180.0 195.0 209.8 235.0 506.0 tapply(brfss$AVEDRNK2, brfss$SMOKE, quantile, probs = seq(0.2, 1, 0.1)) $`0` 20% 30% 40% 50% 60% 70% 80% 90% 100% 1 1 1 1 2 2 2 3 30 $`1` 20% 30% 40% 50% 60% 70% 80% 90% 100% 1 2 2 2 3 3 4 6 30

We see that once we get past the 30th quantile the non-smokers begin to outweigh the smokers, and the smokers begin to outdrink the non-smokers, producing a lack of balance. On the other hand, it seems the lack of overlap isn’t as bad as we thought.

At this point we’re ready to proceed to matching.

The idea of matching isn’t complicated. We want to simply find subjects with matching covariates among the smokers and non-smokers. However it’s often difficult to find exact matches, so instead we define a “closeness” or “distance” metric and use that to generate matches. In this tutorial we’ll use Nearest Neighbor Matching which is the default method in the MatchIt package. Nearest Neighbor Matching selects the best control (non-smoker) for each treated subject (smoker) using a distance measure called the propensity score. The end result is two groups of equal size and (hopefully) similar distributions of covariates.

The propensity score is the estimated probability of receiving treatment (ie, being a smoker), conditional on the covariates. If two subjects, one who is a smoker and the other who is not, have similar propensity scores, then we think of them as potential matches.

To carry out Nearest Neighbor Matching we load the MatchIt package and use the `matchit()`

function. Notice we need to assign the result of the `matchit()`

function to an object. Below we name it “m.out”, but you can name it something else if you like. `matchit()`

uses R’s formula syntax to define what we want to match on. On the left side of the tilde (~) is SMOKE. That’s our treatment variable. On the right side of the tilde are the variables we want to match on. We have kept it simple but you can match on interactions and higher powers. Also notice we have not included our dependent variable, COPD! That is by design. At this step we are trying to create balance across covariates among smokers and non-smokers. The analysis of COPD comes after the matching procedure.

install.packages("MatchIt") library(MatchIt) m.out <- matchit(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss)

That should run fairly quickly on just about any computer. The next step is to investigate the results of the matching procedure. Did we improve balance? Are the smokers and non-smokers similar across age, race, weight, number of drinks and sex? The quickest way to check this is to plot a summary of the results, like so:

s.out <- summary(m.out, standardize = TRUE) plot(s.out)

First we call `summary()`

on m.out and save to a new object called “s.out”. For the plot to work we need to also set the `standardize`

argument to TRUE. Then we call `plot()`

on the s.out object. R will allow you to click on points to label them. When finished, click Esc. We labeled 3 points. The result is below.

On the left side of the plot is the standardized difference in means of covariates between smokers and non-smokers for all the data. We see that distance (propensity score) is the largest, followed by number of drinks (AVEDRNK2) and the AGE65+ group. This tells us that prior to matching we had, for example, large differences in the mean values of AVEDRNK2 between smokers and non-smokers. The right side shows the standardized difference in means *after* matching. We hope to see all the points near 0. Specifically, we want to see large differences in means become smaller. The three largest differences prior to matching are now tiny. The dark lines show mean differences that *increased* after balance. This is actually common when you match on covariates that were already pretty well balanced to begin with. Since the standardized mean difference is still less than 0.1, we’re not too concerned about it.

We can also compare the distribution of propensity scores (distance) before and after matching.

plot(m.out, type = "jitter", interactive = FALSE) plot(m.out, type = "hist")

What we want to see is that the Matched Treated (smokers) and Matched Control (non-smokers) distributions are roughly similar. We like what we see here.

We can also evaluate the matching results using tabular output. Just call `summary()`

on the “m.out” object. Below is the results of our matching procedure.

summary(m.out) Call: matchit(formula = SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss) Summary of balance for all data: Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max distance 0.1991 0.1440 0.0799 0.0551 0.0448 0.0547 0.1816 RACEBlack 0.1194 0.0512 0.2204 0.0682 0.0000 0.0682 1.0000 RACEHispanic 0.0709 0.0672 0.2505 0.0036 0.0000 0.0026 1.0000 RACEOther 0.0682 0.0477 0.2131 0.0206 0.0000 0.0197 1.0000 RACEWhite 0.7415 0.8339 0.3722 -0.0924 0.0000 0.0919 1.0000 AGE25-34 0.1824 0.1085 0.3111 0.0739 0.0000 0.0735 1.0000 AGE35-44 0.1811 0.1248 0.3306 0.0563 0.0000 0.0564 1.0000 AGE45-54 0.1837 0.1826 0.3864 0.0011 0.0000 0.0013 1.0000 AGE55-64 0.2192 0.2253 0.4179 -0.0062 0.0000 0.0066 1.0000 AGE65+ 0.1640 0.3103 0.4627 -0.1462 0.0000 0.1470 1.0000 SEXMale 0.5459 0.4868 0.4999 0.0591 0.0000 0.0591 1.0000 WTLBS 177.7561 177.9875 43.8187 -0.2314 0.0000 2.1432 50.0000 AVEDRNK2 2.9843 1.9672 1.8102 1.0171 1.0000 1.0000 4.0000 Summary of balance for matched data: Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max distance 0.1991 0.1986 0.1086 0.0004 0 0.0006 0.0283 RACEBlack 0.1194 0.1155 0.3198 0.0039 0 0.0039 1.0000 RACEHispanic 0.0709 0.0669 0.2501 0.0039 0 0.0039 1.0000 RACEOther 0.0682 0.0761 0.2654 -0.0079 0 0.0079 1.0000 RACEWhite 0.7415 0.7415 0.4381 0.0000 0 0.0000 0.0000 AGE25-34 0.1824 0.1916 0.3938 -0.0092 0 0.0092 1.0000 AGE35-44 0.1811 0.1903 0.3928 -0.0092 0 0.0092 1.0000 AGE45-54 0.1837 0.1850 0.3886 -0.0013 0 0.0013 1.0000 AGE55-64 0.2192 0.1890 0.3917 0.0302 0 0.0302 1.0000 AGE65+ 0.1640 0.1588 0.3657 0.0052 0 0.0052 1.0000 SEXMale 0.5459 0.5276 0.4996 0.0184 0 0.0184 1.0000 WTLBS 177.7561 177.8207 42.4245 -0.0646 1 2.7788 60.0000 AVEDRNK2 2.9843 2.9423 2.7596 0.0420 0 0.2756 3.0000 Percent Balance Improvement: Mean Diff. eQQ Med eQQ Mean eQQ Max distance 99.1869 99.9585 98.9125 84.3921 RACEBlack 94.2289 0.0000 94.2308 0.0000 RACEHispanic -8.8341 0.0000 -50.0000 0.0000 RACEOther 61.7348 0.0000 60.0000 0.0000 RACEWhite 100.0000 0.0000 100.0000 100.0000 AGE25-34 87.5647 0.0000 87.5000 0.0000 AGE35-44 83.6772 0.0000 83.7209 0.0000 AGE45-54 -19.9887 0.0000 0.0000 0.0000 AGE55-64 -388.2488 0.0000 -360.0000 0.0000 AGE65+ 96.4106 0.0000 96.4286 0.0000 SEXMale 68.9365 0.0000 68.8889 0.0000 WTLBS 72.1013 -Inf -29.6583 -20.0000 AVEDRNK2 95.8709 100.0000 72.4409 25.0000 Sample sizes: Control Treated All 4238 762 Matched 762 762 Unmatched 3476 0 Discarded 0 0

The first two sections look at all the data and the matched data, respectively. The first, second and fourth columns are probably the easiest to examine. The first and second columns show the means for treated (smokers) and control (non-smokers) groups. Prior to matching, for example, we have 16% of smokers over age 65 versus 31% who are not smokers. That’s an absolute difference of about 15%. After matching we have roughly an equal proportion of subjects over age 65 in both groups with a negligible mean difference. The third section, Percent Balance Improvement, shows how much we have increased or decreased balance. Negative numbers in the Mean Diff column indicate covariates where balance actually got worse. The AGE55-64 value of -388 seems pretty terrible. What happened was the balance went from 21.9% vs 22.5% pre-match to 21.9% vs 18.9% post-match. Ideally we don’t want balance getting worse, but in this case the balance is still quite good. The last section tells us we now have equal numbers (762) in both the treated and control groups.

The columns labeled eQQ refer to empirical quantile-quantile plots. They provide the median, mean, and maximum distance between the empirical quantile functions of the treated and control groups. Values greater than 0 indicate deviations between the groups in some part of the empirical distributions. These are most useful for continuous variables. We can view these plots by calling `plot()`

on the matching object. The `which.xs`

argument allows us to specify which variables to plot. Below we plot weight and number of drinks.

plot(m.out, which.xs = c("WTLBS", "AVEDRNK2"))

We hope to see the points in the right column (Matched) falling between the dotted lines. The idea is that if both groups, smokers and non-smokers, have good balance then the quantiles should lie on a 45-degree line. Once again we like what we see.

When we’re satisfied with the results of the matching procedure, the next step is to create a new data frame that contains only the matched data. For this we use the `match.data()`

function on the matching object. Below we create a new data frame called brfss.match.

brfss.match <- match.data(m.out)

And now we’re ready to perform our statistical analysis as if our data had been generated through randomization. For instance, we might perform logistic regression to analyze how smoking affects the probability of contracting COPD.

brfss.match$SMOKE <- factor(brfss.match$SMOKE, labels = c("No", "Yes")) mod <- glm(COPD ~ SMOKE + AGE + RACE + SEX + WTLBS + AVEDRNK2, data = brfss.match, family = "binomial") exp(confint(mod, parm = "SMOKEYes")) 2.5 % 97.5 % 2.868077 7.091090

It appears that smokers are at least 2.9 times more likely to report COPD than non-smokers, controlling for age, sex, race, weight and drinking.

There is much more to learn about matching methods. We demonstrated Nearest Neighbor Matching, but there are several other methods. The MatchIt User’s Guide provides a nice overview of how to implement various matching methods: https://r.iq.harvard.edu/docs/matchit/2.4-20/User_s_Guide_to.html

It’s worth pointing out that matching does not necessarily guarantee that each treated subject will have a matching control that has identical covariate values. The only guarantee is that groups of individuals with similar distance measures (eg, propensity scores) will have similar covariate distributions. Having said that, it is possible to see which control subject was matched with which treated subject. The matching object has a `match.matrix`

component, where the row names correspond to the names of the treated, and the cells correspond to the names of the matched controls. Here are the first few matches in our m.out match.matrix:

head(m.out$match.matrix) 1 29 "3128" 44 "3957" 62 "4669" 81 "1970" 82 "2849" 87 "4754"

So row 29 of brfss was a treated subject (a smoker) matched to the control subject in row 3128 (a non-smoker). If we compare them we notice they don’t exactly match up:

brfss[29,] COPD SMOKE RACE AGE SEX WTLBS AVEDRNK2 29 No 1 Hispanic 25-34 Male 150 10 brfss[3128,] COPD SMOKE RACE AGE SEX WTLBS AVEDRNK2 3128 No 0 Other 25-34 Male 164 8

The most glaring difference is RACE. But recall, they weren’t matched on the specific covariates but rather their propensity score. We can compare their propensity scores as follows (taking advantage of the fact that the row names in the matched data frame are the same as those in the original data frame):

i <- rownames(brfss.match) %in% c(29,3128) brfss.match[i,] COPD SMOKE RACE AGE SEX WTLBS AVEDRNK2 distance weights 29 No Yes Hispanic 25-34 Male 150 10 0.4776235 1 3128 No No Other 25-34 Male 164 8 0.4634368 1

The propensity score is in the distance column. According to the propensity score, these subjects are similar. Matching on this distance metric helps ensure the smoking and non-smoking groups have similar covariate distributions. So even those these two specific subjects do not match on RACE, overall the smoking and non-smoking groups are balanced on RACE.

We mentioned earlier that the propensity score is the probability of receiving treatment (ie, being a smoker), conditional on the covariates. The `matchit()`

function calculates this for us but we can do it as well by performing a logistic regression of SMOKE on the covariates that we want to balance on. Notice the formula is identical to the one we used with `matchit()`

.

pscore <- glm(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss, family = binomial)

Here are the first 6 propensity scores in the brfss.match data frame (again, the row numbers correspond to the rows in our original brfss data frame):

head(brfss.match[, "distance", drop = F]) distance 1 0.19127152 3 0.08129118 13 0.10868555 14 0.41062392 21 0.07483546 22 0.32779268

We get the same results by using our model to make predictions for the same subjects:

predict(pscore, type = "response")[c(1,3,13,14,21,22)] 1 3 13 14 21 22 0.19127152 0.08129118 0.10868555 0.41062392 0.07483546 0.32779268

Another type of distance measure is the *linear propensity score*. It’s just the log-odds version of the propensity score. Whereas the probability-based propensity score is bounded from 0 to 1, the linear propensity score has no such bounds. This means we can make better matches in the lower and upper extremes of the scores since the values are not being compressed near 0 or 1. In section 6.2 of her paper, Stuart actually recommends using the linear propensity score. Fortunately MatchIt makes it easy to do. Simply set the `distance`

argument to `"linear.logit"`

:

m.out.lp <- matchit(SMOKE ~ RACE + AGE + SEX + WTLBS + AVEDRNK2, data = brfss, distance = "linear.logit")

The resulting propensity scores are on the log-odds scale, which we can verify as we did before:

brfss.match2 <- match.data(m.out.lp) head(brfss.match2[, "distance", drop = F]) distance 1 -1.4417693 2 -2.0709984 14 -0.3613867 21 -2.5146797 22 -0.7181855 25 -1.9113250 predict(pscore)[c(1,2,14,21,22, 25)] # default prediction is on log-odds scale 1 2 14 21 22 25 -1.4417693 -2.0709984 -0.3613867 -2.5146797 -0.7181855 -1.9113250

Notice that matching on the linear propensity score has resulted in different subjects being selected. We leave it as an exercise for the interested reader to verify that the improvement in balance is essentially the same as the matching performed on the probability-based propensity score.

To learn more about matching methods and causal inference in general, please see the following references:

- Claibourn, M. (2014). Matching Methods for Causal Inference. Workshop presented at the UVa Library. http://static.lib.virginia.edu/statlab/materials/workshops/MatchingMethods.zip
- Gelman, A. and Hill, J. (2007).
*Data Analysis Using Regression and Multilevel/Hierarchical Models*. Cambridge. Chapters 9 and 10. - Ho, D., Imai K., King, G., Stuart, E. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.
*Journal of Statistical Software*, Vol. 42, No. 8, pp. 1-28. URL http://www.jstatsoft.org/v42/i08/ - Stuart, E. (2010). Matching Methods for Causal Inference: A Review and a Look Forward.
*Statistical Science*. Vol. 25, No. 1, 1–21. DOI: 10.1214/09-STS313

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu

View the entire collection of UVA Library StatLab articles.

Clay FordStatistical Research Consultant

University of Virginia Library

April 24, 2018