Latest Posts

2017-18 StatLab Fellows

The UVA Library’s Research Data Services is seeking StatLab Fellows for the 2017-2018 academic year. Responding to growing interest in applied data science experience, along with a developing movement to use data and data science for the public interest, the program provides up to four UVA graduate students with experience working collaboratively, openly, and reproducibly on data science projects — for example, working with local agencies to understand their data and improve processes, or working with news data to help citizens better navigate a complex media environment. The goal is to provide graduate students with an opportunity to enhance their data skills and to gain experience working as a team on a joint project while promoting social good.

Program Description
The StatLab provides consultation, training, and support around data analysis, statistical methods, data wrangling, and visualization. Research across the social, natural, engineering, and data sciences increasingly draws upon sophisticated research designs, complex statistical methods, and computational power to draw inferences about the world around us. The StatLab sees its mission as both contributing to and assisting with a wide portfolio of quantitative research, and Fellows will have the opportunity to contribute to academic consulting as well, gaining exposure to a variety of designs, challenges, and methods across disciplines.

Fellows will have the opportunity to deepen their knowledge of methods and data analysis tools by: working collaboratively on a project, writing and blogging about their work, consulting and working with researchers across disciplinary boundaries, and presenting their own work to a diverse audience.

Fellowship Details
StatLab Fellows are expected to work between 5 and 10 hours per week in the fall and spring semesters and will be paid $20/hour (to provide some flexibility and accommodate variable schedules across the semester). Fellows should be available for a Friday noon meeting most weeks, and will:

  • contribute intellectually and methodologically to a collaborative project in public and reproducible ways;
  • expand or deepen their methodological knowledge and skills;
  • write blog posts about the project and their work;
  • contribute to consultations and collaborations with researchers seeking help from the StatLab on occasion;
  • and advance and present their own research projects.

Applicants should have completed at least two methods courses before applying and have experience in a statistical software environment such as R or Stata or in a computing environment such as Python. We particularly encourage applications from women, people of color, LGBT students, first-generation college students and other under-represented groups.

How to Apply
Send a CV and cover letter by April 17 outlining:

  • your research interests;
  • your experience with data analysis, statistical methods, data wrangling, and visualization;
  • the skills you expect to bring to a collaborative project;
  • and a summary of what you hope to gain as a StatLab fellow.

Email a complete application package to Michele Claibourn, mclaibourn@virginia.edu. Questions about the StatLab Fellowship and the application process should also be directed to Michele Claibourn. You can learn more about the StatLab at our website.

Working with dates and time in R using the lubridate package

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on Feb 12, 2008 and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap year; February had 29 days.) What was the median time-in-study for all subjects?

Another example is an experiment that times participants performing an activity, applies a treatment to certain members, and then re-times the activity. What was the difference in times between subjects that received the treatment and those that did not? If our data is stored and read in as something like “01:23:03”, then we’ll need to convert to seconds.

The lubridate package for the R statistical computing environment was designed to help us deal with these kinds of data. The out-of-the-box base R installation also provides functions for working with dates and times, but the functions in the lubridate package are a little easier to use and remember.

Formatting dates

When we import data into R, dates and times are usually stored as character or factor by default due to symbols such as “-”, “:” and “/”. (Though see the readr package for functions that attempt to parse date and times automatically.) Using the str or class functions will tell you how they’re stored. If dates or times are stored as character or factor that means we can’t calculate or summarize elapsed times.
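
For instance, here is a quick check on a small character vector of dates (a toy example; the object name x is arbitrary):

x <- c("2017-01-11", "2017-01-12")
class(x)
## [1] "character"
# x[2] - x[1] would throw an error: character strings can't be subtracted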

To format dates, lubridate provides a series of functions that are a permutation of the letters “m”, “d” and “y” to represent the ordering of month, day and year. For example, if our data has a column of dates such as May 11, 1996, our dates are ordered month-day-year. Therefore we would use the mdy function to transform the column to a date object. If our dates were in the order of, say, year-month-day, we would use the ymd function. lubridate provides functions for every permutation of “m”, “d”, “y”.

Let’s demonstrate. Below we generate two character vectors of dates, inspect their class, reformat them using the mdy function and then inspect their class again.

library(lubridate)
begin <- c("May 11, 1996", "September 12, 2001", "July 1, 1988")
end <- c("7/8/97","10/23/02","1/4/91")
class(begin)
## [1] "character"
class(end)
## [1] "character"
(begin <- mdy(begin))
## [1] "1996-05-11" "2001-09-12" "1988-07-01"
(end <- mdy(end))
## [1] "1997-07-08" "2002-10-23" "1991-01-04"
class(begin)
## [1] "Date"
class(end)
## [1] "Date"

The dates now have class “Date” and are printed in year-month-day format. They may appear to still be character data when printed, but they are in fact numbers. The “Date” class means dates are stored as the number of days since January 1, 1970, with negative values for earlier dates. We can use the as.numeric function to view the raw values.

as.numeric(begin)
## [1]  9627 11577  6756
as.numeric(end)
## [1] 10050 11983  7673

With dates stored in this fashion we can do things like subtraction to calculate number of days between two dates.
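
For example, subtracting the two Date vectors created above gives elapsed time in days (the values follow directly from the day counts shown by as.numeric):

end - begin
## Time differences in days
## [1] 423 406 917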

We can also format dates that contain time information by appending _h, _hm, or _hms to any of the aforementioned functions. “h”, “m”, and “s” stand for hour, minute, and second, respectively. Below we add some time data to our dates and demonstrate how to use mdy_hm.

begin <- c("May 11, 1996 12:05", "September 12, 2001 1:00", "July 1, 1988 3:32")
end <- c("7/8/97 8:00","10/23/02: 12:00","1/4/91 2:05")
(begin <- mdy_hm(begin))
## [1] "1996-05-11 12:05:00 UTC" "2001-09-12 01:00:00 UTC"
## [3] "1988-07-01 03:32:00 UTC"
(end <- mdy_hm(end))
## [1] "1997-07-08 08:00:00 UTC" "2002-10-23 12:00:00 UTC"
## [3] "1991-01-04 02:05:00 UTC"
class(begin)
## [1] "POSIXct" "POSIXt"
class(end)
## [1] "POSIXct" "POSIXt"

Notice the class is now “POSIXct”. “POSIXct” represents the number of seconds since the beginning of 1970. If a date is before 1970, the number of seconds is negative. Notice also that the letters “UTC” have been appended to the date-times. UTC is short for Coordinated Universal Time, the time standard by which the world regulates clocks. If we prefer, we can specify a time zone when formatting dates by using the tz argument. Here’s how we can specify the Eastern Time Zone in the United States when formatting our dates.

begin <- c("May 11, 1996 12:05", "September 12, 2001 1:00", "July 1, 1988 3:32")
end <- c("7/8/97 8:00","10/23/02: 12:00","1/4/91 2:05")
(begin <- mdy_hm(begin, tz = "US/Eastern"))
## [1] "1996-05-11 12:05:00 EDT" "2001-09-12 01:00:00 EDT"
## [3] "1988-07-01 03:32:00 EDT"
(end <- mdy_hm(end, tz = "US/Eastern"))
## [1] "1997-07-08 08:00:00 EDT" "2002-10-23 12:00:00 EDT"
## [3] "1991-01-04 02:05:00 EST"

Notice the last date is EST instead of EDT. EST means “Eastern Standard Time”. EDT means “Eastern Daylight Time”. Any day and time that falls during Daylight Savings is EDT. Otherwise it’s EST. How do we know the appropriate time zone phrase to use in the tz argument? We can use the OlsonNames function to see a character vector of all time zone names. Just enter OlsonNames() in the R console and hit Enter.
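
If the full list is too long to scan, one approach is to search it with grep (a quick sketch; the matches returned depend on your system's time zone database, but "US/Eastern" should be among them):

# time zone names containing "Eastern"
grep("Eastern", OlsonNames(), value = TRUE)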

We can also read in times without dates using the functions ms, hm, or hms, where again “h”, “m”, and “s” stand for “hours”, “minutes”, and “seconds”. Here are a few examples.

time1 <- c("1:13", "0:58", "1:01")
time2 <- c("12:23:11", "09:45:31", "12:05:22")
time3 <- c("2:14", "2:16", "3:35")

(time1 <- ms(time1))
## [1] "1M 13S" "58S"    "1M 1S"
(time2 <- hms(time2))
## [1] "12H 23M 11S" "9H 45M 31S"  "12H 5M 22S"
(time3 <- hm(time3))
## [1] "2H 14M 0S" "2H 16M 0S" "3H 35M 0S"

Once again, don’t be fooled by the print out. These times are actually stored as seconds. Use as.numeric to verify.

as.numeric(time1)
## [1] 73 58 61
as.numeric(time2)
## [1] 44591 35131 43522
as.numeric(time3)
## [1]  8040  8160 12900

The class of these new time objects is neither “Date” nor “POSIX” but rather “Period”.

class(time1)
## [1] "Period"
## attr(,"package")
## [1] "lubridate"

Period is one of three classes lubridate provides for time spans. Let’s learn more about these classes.

Durations, Intervals and Periods

lubridate provides three classes, or three different ways, to distinguish between different types of time spans.

  1. Duration
  2. Interval
  3. Period

Understanding these classes will help you get the most out of lubridate.

The simplest is Duration. This is simply a span of time measured in seconds. There is no start date.

An Interval is also measured in seconds but has an associated start date. An Interval measures elapsed seconds between two specific points in time.

A Period records a time span in units larger than seconds, such as years or months. Unlike seconds, years and months vary in length. June has 30 days while July has 31 days. February has 28 days except for leap years when it has 29 days. With the Period class, we can add 1 month to February 1 and get March 1. It allows us to perform calculations in calendar or clock time as opposed to an absolute number of seconds.
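
For example, adding a one-month Period to a date uses calendar logic (a small sketch using lubridate's months function):

ymd("2017-02-01") + months(1)
## [1] "2017-03-01"
ymd("2017-01-31") + months(1)  # there is no February 31, so we get NA
## [1] NA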

Let’s see these three classes in action. Below we define two dates in the US Eastern time zone. The start day is March 11, 2017 at 5:21 AM. The end day is March 12, 2017 at the same time. Note that Daylight Savings begins (or began, depending on when you’re reading this) on March 12 at 2:00 AM.

start <- mdy_hm("3-11-2017 5:21", tz = "US/Eastern")
end <- mdy_hm("3-12-2017 5:21", tz = "US/Eastern")

Since we’re dealing with elapsed time between two dates, let’s start with Intervals. We can define an Interval using the %--% operator.

time.interval <- start %--% end
time.interval
## [1] 2017-03-11 05:21:00 EST--2017-03-12 05:21:00 EDT

Notice how Intervals print. They show the beginning date and end date. Also notice how the time zone changes from EST to EDT, indicating that Daylight Savings has started. If we look at the structure of an Interval object we see it contains elapsed time in seconds, 82800, and the start date.

str(time.interval)
## Formal class 'Interval' [package "lubridate"] with 3 slots
##   ..@ .Data: num 82800
##   ..@ start: POSIXct[1:1], format: "2017-03-11 05:21:00"
##   ..@ tzone: chr "US/Eastern"

To create a Duration between these two dates, we can use the as.duration function.

time.duration <- as.duration(time.interval)
time.duration
## [1] "82800s (~23 hours)"

Notice a Duration object prints the elapsed time in seconds as well as something a little friendlier to read, in this case hours. Because Daylight Savings went into effect at 2:00 AM during the interval, an hour was skipped. Thus the duration between these two time points is only 23 hours.

If we look at the structure of a Duration object we see it just contains elapsed time in seconds.

str(time.duration)
## Formal class 'Duration' [package "lubridate"] with 1 slot
##   ..@ .Data: num 82800

We can create a Period from an Interval using the as.period function.

time.period <- as.period(time.interval)
time.period
## [1] "1d 0H 0M 0S"

A Period prints elapsed time as integers in the form of years, months, weeks, days and so on. Notice this Period is 1 day long. While only 23 hours have technically elapsed since the start date, according to our clock one day has elapsed.

If we look at the structure we see a Period contains several slots for “clock time” values and, like the Duration object, no associated date.

str(time.period)
## Formal class 'Period' [package "lubridate"] with 6 slots
##   ..@ .Data : num 0
##   ..@ year  : int 0
##   ..@ month : int 0
##   ..@ day   : int 1
##   ..@ hour  : int 0
##   ..@ minute: int 0

To recap:

  • An Interval is elapsed time in seconds between two specific dates. (If no time is provided, the time for each date is assumed to be 00:00:00, or midnight.)
  • A Duration is elapsed time in seconds independent of a start date.
  • A Period is elapsed time in “calendar” or “clock” time (4 weeks, 2 months, etc) independent of a start date.
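
To make the recap concrete, here is a minimal sketch creating one of each directly with lubridate's helper functions (dminutes, the %--% operator, and minutes):

dminutes(3)                               # a Duration: exactly 180 seconds
ymd("2017-01-01") %--% ymd("2017-02-01")  # an Interval: seconds plus a start date
minutes(3)                                # a Period: 3 minutes of "clock" time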

Calculations and conversions

Once we format dates and define our time span we often want to do some calculations and conversions. For example, we may want to calculate the mean elapsed time in weeks for different groups.

Let’s create some data and demonstrate. First we enter arbitrary start and end dates, convert them to Date objects with ymd, and define an Interval.

start <- ymd(c("2012-08-21", "2012-09-01", "2012-08-15", "2012-09-18"))
end <- ymd(c("2012-09-16", "2012-09-06", "2012-08-22", "2012-10-11"))
elapsed.time <- start %--% end

If we view the elapsed.time object we’ll just see date ranges. We can use as.duration or even as.numeric to view the elapsed time in seconds but that’s not very useful in this case. It would be better if we converted seconds to another unit of time such as weeks or days. Fortunately lubridate makes this easy.

The trick is to convert intervals to durations and then divide the duration by a duration object in the units we desire. That’s a mouthful but easy to demonstrate. Below we demonstrate how to convert to weeks. First we convert our interval to a duration, and then we divide by dweeks(1). The function call dweeks(1) generates a duration of one week in seconds, which is 604800. Dividing that into our duration returns the number of weeks.

as.duration(elapsed.time) / dweeks(1)
## [1] 3.7142857 0.7142857 1.0000000 3.2857143

We can do the same with hours, days, minutes and years.

as.duration(elapsed.time) / dhours(1)
## [1] 624 120 168 552
as.duration(elapsed.time) / ddays(1)
## [1] 26  5  7 23
as.duration(elapsed.time) / dminutes(1)
## [1] 37440  7200 10080 33120
as.duration(elapsed.time) / dyears(1)
## [1] 0.07123288 0.01369863 0.01917808 0.06301370

Once we have the durations in the units we want, we can then do things like find the mean.

mean(as.duration(elapsed.time) / dweeks(1))
## [1] 2.178571

Of course this was just for demonstration. With only 4 values, the mean is not a very useful summary.

As another example, consider the following vector of character data summarizing a duration of time. “12w” means 12 weeks and “4d” means 4 days.

StudyTime <- c("12w 4d", "11w", "10w 5d", NA, "12w 6d")

What if we wanted to convert that to numeric weeks? First we’ll give the R code and then explain how it works.

as.duration(period(StudyTime, units = c("week","day"))) / dweeks(1)
## [1] 12.57143 11.00000 10.71429       NA 12.85714

First we use the period function to define a Period using our data. The units argument says the first part of our data represents weeks and the second part represents days. That is then converted to a Duration object that stores time in seconds. Finally we divide by dweeks(1) to convert seconds to weeks. Notice how the NA remains NA and that “11w” converts to 11 just fine even though it had no days appended to it.

There is much more to the lubridate package. Read the vignette and check out the examples on each function’s help page. But hopefully the material in this post gets you started with reading in dates, creating time-spans, and making conversions and calculations.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
January 11, 2017

The Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as “is your data normal?” A “no” branch off this question will recommend a Wilcoxon test if you’re comparing two groups of continuous measures.

So what is this Wilcoxon test? What makes it non-parametric? What does that even mean? And how do we implement it and interpret it? Those are some of the questions we aim to address in this post.

First, let’s recall the assumptions of the two-sample t test for comparing two population means:

1. The two samples are independent of one another
2. The two populations have equal variance or spread
3. The two populations are normally distributed

There’s no getting around #1. That assumption must be satisfied for a two-sample t-test. When assumptions #2 and #3 (equal variance and normality) are not satisfied but the samples are large (say, greater than 30), the results are approximately correct. But when our samples are small and our data are skewed or non-normal, we probably shouldn’t place much faith in the two-sample t-test.

This is where the Wilcoxon Rank Sum Test comes in. It only makes the first two assumptions of independence and equal variance. It does not assume our data have a known distribution. Known distributions are described with math formulas. These formulas have parameters that dictate the shape and/or location of the distribution. For example, variance and mean are the two parameters of the Normal distribution that dictate its shape and location, respectively. Since the Wilcoxon Rank Sum Test does not assume known distributions, it does not deal with parameters, and therefore we call it a non-parametric test.

Whereas the null hypothesis of the two-sample t test is equal means, the null hypothesis of the Wilcoxon test is usually taken as equal medians. Another way to think of the null is that the two populations have the same distribution with the same median. If we reject the null, that means we have evidence that one distribution is shifted to the left or right of the other. Since we’re assuming our distributions are equal, rejecting the null means we have evidence that the medians of the two populations differ. The R statistical programming environment, which we use to implement the Wilcoxon rank sum test below, refers to this as a “location shift”.


Let’s work a quick example in R. The data below come from Hogg & Tanis, example 8.4-6. It involves the weights of packaging from two companies selling the same product. We have 8 observations from each company, A and B. We would like to know if the distribution of weights is the same at each company. A quick boxplot reveals the data have similar spread but may be skewed and non-normal. With such a small sample it might be dangerous to assume normality.

A <- c(117.1, 121.3, 127.8, 121.9, 117.4, 124.5, 119.5, 115.1)
B <- c(123.5, 125.3, 126.5, 127.9, 122.1, 125.6, 129.8, 117.2)
dat <- data.frame(weight = c(A,B), 
                  company = rep(c("A","B"), each=8))
boxplot(weight ~ company, data = dat)


Now we run the Wilcoxon Rank Sum Test using the wilcox.test function. Again, the null is that the distributions are the same, and hence have the same median. The alternative is two-sided. We have no idea if one distribution is shifted to the left or right of the other.

wilcox.test(weight ~ company, data = dat)

	Wilcoxon rank sum test

data:  weight by company
W = 13, p-value = 0.04988
alternative hypothesis: true location shift is not equal to 0

First we notice the p-value is a little less than 0.05. Based on this result we may conclude the medians of these two distributions differ. The alternative hypothesis is stated as the “true location shift is not equal to 0”. That’s another way of saying “the distribution of one population is shifted to the left or right of the other,” which implies different medians.

The Wilcoxon statistic is returned as W = 13. This is NOT an estimate of the difference in medians. This is actually the number of times that a package weight from company B is less than a package weight from company A. We can calculate it by hand using nested for loops as follows (though we should note that this is not how the wilcox.test function calculates W):

W <- 0
for(i in 1:length(B)){
  for(j in 1:length(A)){
    if(B[i] < A[j]) W <- W + 1
  }
}
W
[1] 13

Another way to do this is to use the outer function, which can take two vectors and perform an operation on all pairs. The result is an 8 x 8 matrix consisting of TRUE/FALSE values. Using sum on the matrix counts all instances of TRUE.

sum(outer(B, A, "<"))
[1] 13

Of course we could also go the other way and count the number of times that a package weight from company A is less than a package weight from company B. This gives us 51.

sum(outer(A, B, "<"))
[1] 51

If we relevel our company variable in data.frame dat to have “B” as the reference level, we get the same result in the wilcox.test output.

dat$company <- relevel(dat$company, ref = "B")
wilcox.test(weight ~ company, data = dat)

	Wilcoxon rank sum test

data:  weight by company
W = 51, p-value = 0.04988
alternative hypothesis: true location shift is not equal to 0

So why are we counting pairs? Recall this is a non-parametric test. We’re not estimating parameters such as a mean. We’re simply trying to find evidence that one distribution is shifted to the left or right of the other. In our boxplot above, it looks like the distributions from both companies are reasonably similar but with B shifted to the right, or higher, than A. One way to think about testing if the distributions are the same is to consider the probability of a randomly selected observation from company A being less than a randomly selected observation from company B: P(A < B). We could estimate this probability as the number of pairs with A less than B divided by the total number of pairs. In our case that comes to \(51/(8\times8)\) or \(51/64\). Likewise we could estimate the probability of B being less than A. In our case that's \(13/64\). So we see that the statistic W is the numerator in this estimated probability.
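
We can compute both of these estimated probabilities directly from the vectors defined earlier (a quick check):

sum(outer(A, B, "<")) / (8 * 8)  # estimated P(A < B)
[1] 0.796875
sum(outer(B, A, "<")) / (8 * 8)  # estimated P(B < A)
[1] 0.203125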

The exact p-value is determined from the distribution of the Wilcoxon Rank Sum Statistic. We say "exact" because the distribution of the Wilcoxon Rank Sum Statistic is discrete. It is parametrized by the two sample sizes we're comparing. "But wait, I thought the Wilcoxon test was non-parametric?" It is! But the test statistic W has a distribution which does not depend on the distribution of the data.

We can calculate the exact two-sided p-values explicitly using the pwilcox function (they’re two-sided, so we multiply by 2):

For W = 13, \(P(W \leq 13)\):

pwilcox(q = 13, m = 8, n = 8) * 2
[1] 0.04988345

For W = 51, \(P(W \geq 51)\), we have to get \(P(W \leq 50)\) and then subtract from 1 to get \(P(W \geq 51)\):

(1 - pwilcox(q = 51 - 1, m = 8, n = 8)) * 2
[1] 0.04988345

By default the wilcox.test function will calculate exact p-values if the samples contain fewer than 50 finite values and there are no ties in the values. (More on “ties” in a moment.) Otherwise a normal approximation is used. To force the normal approximation, set exact = FALSE.

dat$company <- relevel(dat$company, ref = "A")
wilcox.test(weight ~ company, data = dat, exact = FALSE)

	Wilcoxon rank sum test with continuity correction

data:  weight by company
W = 13, p-value = 0.05203
alternative hypothesis: true location shift is not equal to 0

When we use the normal approximation the phrase “with continuity correction” is added to the name of the test. A continuity correction is an adjustment that is made when a discrete distribution is approximated by a continuous distribution. The normal approximation is very good and computationally faster for samples larger than 50.

Let’s return to “ties”. What does that mean and why does that matter? To answer those questions first consider the name “Wilcoxon Rank Sum test”. The name is due to the fact that the test statistic can be calculated as the sum of the ranks of the values. In other words, take all the values from both groups, rank them from lowest to highest according to their value, and then sum the ranks from one of the groups. Here’s how we can do it in R with our data:

sum(rank(dat$weight)[dat$company=="A"])
[1] 49

Above we rank all the weights using the rank function, select only those ranks for company A, and then sum them. This is the classic way to calculate the Wilcoxon Rank Sum test statistic. Notice it doesn’t match the test statistic provided by wilcox.test, which was 13. That’s because R is using a different calculation due to Mann and Whitney. Their test statistic, sometimes called U, is a linear function of the original rank sum statistic, usually called W:

\[U = W - \frac{n_1(n_1 + 1)}{2}\]

where \(n_1\) is the number of observations in the group whose ranks were summed (here, company A, so \(n_1 = 8\)). We can verify this relationship for our data

sum(rank(dat$weight)[dat$company=="A"]) - (8*9/2)
[1] 13

This is in fact how the wilcox.test function calculates the test statistic, though it labels it W instead of U.

The rankings of values have to be modified in the event of ties. For example, in the data below 7 occurs twice. One of the 7’s could be ranked 3 and the other 4. But then one would be ranked higher than the other and that’s not correct. We could rank them both 3 or both 4, but that wouldn’t be right either. What we do then is take the average of their ranks. Below this is \((3 + 4)/2 = 3.5\). R does this by default when ranking values.

vals <- c(2, 4, 7, 7, 12)
rank(vals)
[1] 1.0 2.0 3.5 3.5 5.0

The impact of ties means the Wilcoxon rank sum distribution cannot be used to calculate exact p-values. If ties occur in our data and we have fewer than 50 observations, the wilcox.test function returns a normal approximated p-value along with a warning message that says “cannot compute exact p-value with ties”.

Whether exact or approximate, p-values do not tell us anything about how different these distributions are. For the Wilcoxon test, a p-value is the probability of getting a test statistic as extreme as, or more extreme than, the one observed, assuming both distributions are the same. In addition to a p-value we would like some estimated measure of how these distributions differ. The wilcox.test function provides this information when we set conf.int = TRUE.

wilcox.test(weight ~ company, data = dat, conf.int = TRUE)

	Wilcoxon rank sum test

data:  weight by company
W = 13, p-value = 0.04988
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -8.5 -0.1
sample estimates:
difference in location 
                 -4.65 

This returns a “difference in location” measure of -4.65. The documentation for the wilcox.test function states this “does not estimate the difference in medians (a common misconception) but rather the median of the difference between a sample from x and a sample from y.”

Again we can use the outer function to verify this calculation. First we calculate the difference between all pairs and then find the median of those differences.

median(outer(A,B,"-"))
[1] -4.65

The confidence interval is fairly wide due to the small sample size, but it appears we can safely say the median weight of company A’s packaging is at least 0.1 less than the median weight of company B’s packaging.

If we’re explicitly interested in the difference in medians between the two populations, we could try a bootstrap approach using the boot package. The idea is to resample the data (with replacement) many times, say 1000 times, each time taking a difference in medians. We then take the median of those 1000 differences to estimate the difference in medians. We can then find a confidence interval based on our 1000 differences. An easy way is to use the 2.5th and 97.5th percentiles as the upper and lower bounds of a 95% confidence interval.

Here is one way to carry this out in R.

First we load the boot package, which comes with R, and create a function called med.diff to calculate the difference in medians. In order to work with the boot package’s boot function, our function needs two arguments: one for the data and one to index the data. We have arbitrarily named these arguments d and i. The boot function will take our data, d, and resample it according to randomly selected row numbers, i. It will then return the difference in medians for the resampled data.

library(boot)
med.diff <- function(d, i) {
   tmp <- d[i,] 
   median(tmp$weight[tmp$company=="A"]) - 
     median(tmp$weight[tmp$company=="B"])
 }

Now we use the boot function to resample our data 1000 times, taking a difference in medians each time, and saving the results into an object called boot.out.

boot.out <- boot(data = dat, statistic = med.diff, R = 1000)

The boot.out object is a list object. The element named “t” contains the 1000 differences in medians. Taking the median of those values gives us a point estimate of the difference in medians. Below we get -5.05, but you will likely get something different.

median(boot.out$t)
[1] -5.05

Next we use the boot.ci function to calculate confidence intervals. We specify type = "perc" to obtain the bootstrap percentile interval.

boot.ci(boot.out, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.out, type = "perc")

Intervals : 
Level     Percentile     
95%   (-9.399, -0.100 )  
Calculations and Intervals on Original Scale

We notice the interval is not too different from what the wilcox.test function returned, though it extends a bit farther on the lower end. Like the Wilcoxon rank sum test, bootstrapping is a non-parametric approach that can be useful for small and/or non-normal data.

References

Hogg, R.V. and Tanis, E.A., Probability and Statistical Inference, 7th Ed, Prentice Hall, 2006.
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
Jan 5, 2017

Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient:

$$ \frac{n!}{2!(n - 2)!} = \binom{n}{2}$$

Using the R statistical computing environment, we can use the choose function to quickly calculate this. For example, how many possible 2-item combinations can I “choose” from 10 items:

choose(10,2)
[1] 45

We sometimes want to make pairwise comparisons to see where differences occur. Let’s say we go to 8 high schools in an area, survey 30 students at each school, and ask them whether or not they floss their teeth at least once a day. When finished we’ll have 8 proportions of students who answered “Yes”. An obvious first step would be to conduct a hypothesis test for any differences between these proportions. The null would be no difference between the proportions versus some difference. If we reject the null, we have evidence of differences. But where are the differences? This leads us to pairwise comparisons of proportions, where we make multiple comparisons. The outcome of these pairwise comparisons will hopefully tell us which schools have significantly different proportions of students flossing.

Making multiple comparisons leads to an increased chance of making a false discovery, i.e. rejecting a null hypothesis that should not have been rejected. When we run a hypothesis test, we always run a risk of finding something that isn’t there. Think of flipping a fair coin 10 times and getting 9 or 10 heads (or 0 or 1 heads). That’s improbable but not impossible. If it happened to us we may conclude the coin is unfair, but that would be the wrong conclusion if the coin truly was fair. It just so happened we were very unlucky to witness such an unusual event. As we said, the chance of this happening is low in a single trial, but we increase our chances of it happening by conducting multiple trials.

The probability of observing 0, 1, 9 or 10 heads when flipping a fair coin 10 times is about 2%, which can be calculated in R as follows:

pbinom(q = 1, size = 10, prob = 0.5) * 2
[1] 0.02148438

Therefore the probability of getting 2 to 8 heads is about 98%:

1 - pbinom(q = 1, size = 10, prob = 0.5) * 2
[1] 0.9785156

The probability of getting 2 to 8 heads in every one of 10 such trials is 98% multiplied by itself 10 times:

(1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^10
[1] 0.8047809

Therefore the probability of getting 0, 1, 9, or 10 heads in at least one of the 10 trials is now about 20%:

1 - (1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^10
[1] 0.1952191

We can think of this as doing multiple hypothesis tests. Flip 10 coins 10 times each, get the proportion of heads for each coin, and use 10 one-sample proportion tests to statistically determine if the results we got are consistent with a fair coin. In other words, do we get any p-values less than, say, 0.05?

We can simulate this in R. First we replicate 1,000 times the act of flipping 10 fair coins 10 times each and counting the number of heads using the rbinom function. This produces a 10 x 1000 matrix of results that we save as “coin.flips”. We then apply a function to each column of the matrix that runs 10 one-sample proportion tests using the prop.test function and saves a TRUE/FALSE value if any of the p-values are less than 0.05 (we talk more about the prop.test function below). This returns a vector we save as “results” that contains TRUE or FALSE for each replicate. R treats TRUE and FALSE as 0 or 1, so calling mean on results returns the proportion of TRUEs in the vector. We get about 20%, confirming our calculations. (If you run the code below you’ll probably get a slightly different but similar answer.)

trials <- 10
coin.flips <- replicate(1000, rbinom(n = 10, size = trials, prob = 0.5))

multHypTest <- function(x){
  pvs <- sapply(x, function(x)prop.test(x = x, n = trials, p = 0.5)$p.value)
  any(pvs < 0.05)
}

results <- apply(coin.flips,2,multHypTest)
mean(results)
[1] 0.206

That’s just for 10 trials. What about 15 or 20 or more? You can re-run the code above with trials set to a different value. We can also visualize it by plotting the probability of an unusual result (0, 1, 9, or 10 heads) versus the number of trials. Notice how rapidly the probability of a false discovery increases with the number of trials.

curve(expr = 1 - (1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^x, 
      xlim = c(1,50),
      xlab = "Number of tests",
      ylab = "Probability of 0, 1, 9, or 10 heads")


So what does all of this tell us? It reveals that traditional significance levels such as 0.05 are too high when conducting multiple hypothesis tests. We need to either adjust our significance level or adjust our p-values. As we’ll see, the usual approach is to adjust the p-values using one of several methods for p-value adjustment.
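
For instance, a Bonferroni-style adjustment of the significance level simply divides it by the number of tests; with the 28 pairwise comparisons we consider below, that works out to (a quick calculation):

0.05 / choose(8, 2)
[1] 0.001785714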

Let’s return to our example of examining the proportion of high school students (sample size 30 at each school) who floss at 8 different high schools. We’ll simulate this data as if the true proportion is 30% at each school (i.e., no difference). We use set.seed to make the data reproducible.

set.seed(15)
n <- 30
k <- 8
school <- rep(1:k, each = n)
floss <- replicate(k, sample(x = c("Y","N"), 
                             size = n, 
                             prob = c(0.3, 0.7), 
                             replace = TRUE))
dat <- data.frame(school, floss = as.vector(floss))

With our data generated, we can tabulate the number of Yes and No responses at each school:

flossTab <- with(dat, table(school, floss))
flossTab
      floss
school  N  Y
     1 18 12
     2 19 11
     3 14 16
     4 19 11
     5 26  4
     6 15 15
     7 20 10
     8 21  9

Using prop.table we can determine the proportions. Specifying margin = 1 means proportions are calculated across the rows for each school. (We also round to two decimal places for presentation purposes.) The second column contains the proportion of students who answered Yes at each school.

round(prop.table(flossTab, margin = 1),2)
      floss
school    N    Y
     1 0.60 0.40
     2 0.63 0.37
     3 0.47 0.53
     4 0.63 0.37
     5 0.87 0.13
     6 0.50 0.50
     7 0.67 0.33
     8 0.70 0.30

First we might want to run a test to see if we can statistically conclude that not all proportions are equal. We can do this with the prop.test function. The prop.test function requires that Yes (or “success”) counts be in the first column of a table and No (or “failure”) counts in the second column. Thus we switch the columns using subsetting brackets with a vector indicating column order.

prop.test(flossTab[,c("Y","N")])

	8-sample test for equality of proportions without continuity correction

data:  flossTab[, c("Y", "N")]
X-squared = 13.78, df = 7, p-value = 0.05524
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6    prop 7    prop 8 
0.4000000 0.3666667 0.5333333 0.3666667 0.1333333 0.5000000 0.3333333 0.3000000 

The p-value of 0.055 is borderline significant and indicates some evidence of differences among proportions. We generated the data so we know there actually is no difference! But if this were real data that we had spent considerable resources collecting, we might be led to believe (perhaps even want to believe) some differences indeed exist. That p-value is so close to significance! School #5, in particular, with a proportion of 13% looks far lower than school #3 with 53%. We could conclude this hypothesis test is significant at the 0.10 level and proceed to pairwise comparisons.

To do that in R we use the pairwise.prop.test function which requires a table in the same format as prop.test, Yes counts in the first column and No counts in the second column:

pairwise.prop.test(x = flossTab[,c("Y","N")])

	Pairwise comparisons using Pairwise comparison of proportions 

data:  flossTab[, c("Y", "N")] 

  1     2     3     4     5     6     7    
2 1.000 -     -     -     -     -     -    
3 1.000 1.000 -     -     -     -     -    
4 1.000 1.000 1.000 -     -     -     -    
5 1.000 1.000 0.073 1.000 -     -     -    
6 1.000 1.000 1.000 1.000 0.149 -     -    
7 1.000 1.000 1.000 1.000 1.000 1.000 -    
8 1.000 1.000 1.000 1.000 1.000 1.000 1.000

P value adjustment method: holm 

This produces a table of 28 p-values since there are 28 possible pairs between 8 items. We interpret the table by using row and column numbers to find the p-value for a particular pair. For example the p-value of 0.073 at the intersection of row 5 and column 3 is the p-value for the two-sample proportion test between school #5 and school #3. It appears to be insignificant at the traditional 5% level. All other p-values are clearly insignificant. In fact, most are 1. This is due to the p-value adjustment that was made. The output tells us the “holm” method was used. We won’t get into the details of how this method works, but suffice it to say it increases the p-values in an effort to adjust for the many comparisons being made. In this case, it does what it’s supposed to: it adjusts the p-values and allows us to make a good case that there are no differences between schools, at least not at the 5% level, which would be the correct decision.

We can do pairwise comparisons without adjusted p-values by setting p.adjust.method = "none". Let’s do that and see what happens:

# NOTE: This analysis is wrong!
pairwise.prop.test(x = flossTab[,c("Y","N")], p.adjust.method = "none")

	Pairwise comparisons using Pairwise comparison of proportions 

data:  flossTab[, c("Y", "N")] 

  1      2      3      4      5      6      7     
2 1.0000 -      -      -      -      -      -     
3 0.4376 0.2993 -      -      -      -      -     
4 1.0000 1.0000 0.2993 -      -      -      -     
5 0.0410 0.0736 0.0026 0.0736 -      -      -     
6 0.6038 0.4345 1.0000 0.4345 0.0055 -      -     
7 0.7888 1.0000 0.1927 1.0000 0.1270 0.2949 -     
8 0.5883 0.7842 0.1161 0.7842 0.2100 0.1876 1.0000

P value adjustment method: none 

Notice now we have significant differences for 3 pairs: (5,1), (5,3), and (6,5). Again we know this is wrong because we simulated the data. The truth is all schools have a floss rate of 30%. But we see that through random chance and not adjusting our p-values for multiple testing we got what look to be significant results. This illustrates the importance of using adjusted p-values when making multiple comparisons.

There are other p-value adjustment methods available. A common and conservative choice is the Bonferroni method. It simply multiplies all p-values by the number of pairs. In our example that is 28. To see all p-value adjustment methods available in R, enter ?p.adjust at the console.
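
To see how these adjustments behave, we can apply them to a small toy vector of p-values (values chosen only for illustration, not taken from the floss data):

p <- c(0.001, 0.01, 0.04)
p.adjust(p, method = "bonferroni")  # each p-value multiplied by 3
[1] 0.003 0.030 0.120
p.adjust(p, method = "holm")
[1] 0.003 0.020 0.040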

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
October 20, 2016

Welcome Meagan

Meagan Christensen joined our Social, Natural, and Engineering Sciences team in August as our new Social Science Librarian, and we are thrilled to have her with us!

Meagan will be working with the Economics, Politics, Psychology, and Sociology departments, engaging with students and faculty on their research and information needs — in the classroom, through consultations, and online — and helping these communities navigate resources both within the Library and across the University.

Meagan earned her MLIS from the University of Washington and her B.A. in Psychology and Sociology (with a minor in French) from the University of Portland. She joined UVA in February 2014 as the Online and Distance Learning Librarian, and has also worked as part of the Teaching and Learning team in Academic Engagement.

You can find Meagan in the Library Data Commons at Curry (Ruffner 302). Or email her at mck6n@virginia.edu.

Stata Basics: foreach and forvalues

There are times we need to do repetitive tasks in the process of data preparation, analysis, or presentation: for instance, computing a set of variables in the same manner, renaming or creating a series of variables, or repetitively recoding the values of a number of variables. In this post, I show a few simple example “loops” using the Stata commands -foreach-, -local- and -forvalues- to handle some common repetitive tasks.

-foreach-: loop over items

Consider this sample dataset of monthly average temperature for three years.

 
* input data
> clear
> input year mtemp1-mtemp12

          year     mtemp1     mtemp2     mtemp3     mtemp4     mtemp5     mtemp6     mtemp7     mtemp8     mtemp9    mtemp10    mtemp11    mtemp12
  1. 2013 4 3 5 14 18 23 25 22 19 15 7 6
  2. 2014 -1 3 5 13 19 23 24 23 21 15 7 5
  3. 2015 2 -1 7 14 21 24 25 24 21 14 11 10
  4. end

The mean temperatures for each month are in Celsius. If we want to convert them to Fahrenheit, we could do the computation for each of the 12 variables one at a time.

generate fmtemp1 = mtemp1*(9/5)+32
generate fmtemp2 = mtemp2*(9/5)+32
generate fmtemp3 = mtemp3*(9/5)+32
generate fmtemp4 = mtemp4*(9/5)+32
generate fmtemp5 = mtemp5*(9/5)+32
generate fmtemp6 = mtemp6*(9/5)+32
generate fmtemp7 = mtemp7*(9/5)+32
generate fmtemp8 = mtemp8*(9/5)+32
generate fmtemp9 = mtemp9*(9/5)+32
generate fmtemp10 = mtemp10*(9/5)+32
generate fmtemp11 = mtemp11*(9/5)+32
generate fmtemp12 = mtemp12*(9/5)+32

However this takes a lot of typing. Alternatively, we can use the -foreach- command to achieve the same goal. In the following code, we tell Stata to do the same thing (the computation: c*9/5+32) for each of the variables in the varlist mtemp1-mtemp12.

> foreach v of varlist mtemp1-mtemp12 {
    generate f`v' = `v'*(9/5)+32
  } 
 
* list variables
> ds
year      mtemp3    mtemp6    mtemp9    mtemp12   fmtemp3   fmtemp6   fmtemp9   fmtemp12
mtemp1    mtemp4    mtemp7    mtemp10   fmtemp1   fmtemp4   fmtemp7   fmtemp10
mtemp2    mtemp5    mtemp8    mtemp11   fmtemp2   fmtemp5   fmtemp8   fmtemp11

Note that braces must be specified with -foreach-. The open brace has to be on the same line as the foreach, and the close brace must be on a line by itself. It’s crucial to close loops properly, especially if you have one or more loops nested in another loop.
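
As a quick illustration of nesting, here is a toy sketch that loops over years and, within each year, over the first three temperature variables (the summarize calls are only there to show the structure):

> foreach yr of numlist 2013/2015 {
    foreach v of varlist mtemp1-mtemp3 {
      summarize `v' if year == `yr'
    }
  }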

-local-: define macro

This was a rather simple repetitive task which can be handled solely by the foreach command. Here we introduce another command, -local-, which is used a lot with commands like foreach to deal with repetitive tasks that are more complex. The -local- command is a way of defining a macro in Stata. A Stata macro can contain multiple elements; it has a name and contents. Consider the following two examples:

 
* define a local macro called month
> local month jan feb mar apr

> display `"`month'"' 
jan feb mar apr

Define a local macro called mcode and another called month, alter the contents of mcode in the foreach loop, then display them in the form “mcode: month”.

> local mcode 0
> local month jan feb mar apr
> foreach m of local month {
    local mcode = `mcode' + 1
    display "`mcode': `m'"
   }
1: jan
2: feb
3: mar
4: apr

Note when you call a defined macro, it has to be wrapped in “`” (left tick) and “‘” (apostrophe) symbols.
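
Here is one more toy example of defining and then calling a macro (the name cutoff is arbitrary); the macro expands to 20 before -display- evaluates the expression:

> local cutoff 20
> display `cutoff'*(9/5) + 32
68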

Rename multiple variables

Take the temperature dataset we created as an example. Let’s say we want to rename variables mtemp1-mtemp12 as mtempjan-mtempdec. We can do so by just tweaking the code in the previous example a bit.

Define local macro mcode and month, then rename the 12 vars in the foreach loop.

> local mcode 0
> local month jan feb mar apr may jun jul aug sep oct nov dec
> foreach m of local month {
    local mcode = `mcode' + 1
    rename mtemp`mcode' mtemp`m'
  }
> ds
year      mtempmar  mtempjun  mtempsep  mtempdec  fmtemp3   fmtemp6   fmtemp9   fmtemp12
mtempjan  mtempapr  mtempjul  mtempoct  fmtemp1   fmtemp4   fmtemp7   fmtemp10
mtempfeb  mtempmay  mtempaug  mtempnov  fmtemp2   fmtemp5   fmtemp8   fmtemp11

We can obtain the same results in a slightly different way. This time we use another 12 variables fmtemp1-fmtemp12 as examples. Again, we will rename them as fmtempjan-fmtempdec.

Define local macro month, then define local macro monthII inside the foreach loop, using the word extended macro function to pull the nth word out of the local macro month.

      
> local month jan feb mar apr may jun jul aug sep oct nov dec
> foreach n of numlist 1/12 {
    local monthII: word `n' of `month'
    display "`monthII'"
    rename fmtemp`n' fmtemp`monthII'   
  } 
jan
feb
mar
apr
may
jun
jul
aug
sep
oct
nov
dec

> ds
year       mtempmar   mtempjun   mtempsep   mtempdec   fmtempmar  fmtempjun  fmtempsep  fmtempdec
mtempjan   mtempapr   mtempjul   mtempoct   fmtempjan  fmtempapr  fmtempjul  fmtempoct
mtempfeb   mtempmay   mtempaug   mtempnov   fmtempfeb  fmtempmay  fmtempaug  fmtempnov

I usually run -display- to see what the macro looks like before actually applying the defined macro on tasks like changing variable names, just to make sure I don’t accidentally change them to some undesired results or even cause errors; however, the display line is not necessary in this case.

Here we rename them back to fmtemp1-fmtemp12.

> local mcode 0
> foreach n in jan feb mar apr may jun jul aug sep oct nov dec {
    local mcode = `mcode' + 1
    rename fmtemp`n' fmtemp`mcode'
  }

> ds
year      mtempmar  mtempjun  mtempsep  mtempdec  fmtemp3   fmtemp6   fmtemp9   fmtemp12
mtempjan  mtempapr  mtempjul  mtempoct  fmtemp1   fmtemp4   fmtemp7   fmtemp10
mtempfeb  mtempmay  mtempaug  mtempnov  fmtemp2   fmtemp5   fmtemp8   fmtemp11

-forvalues-: loop over consecutive values

The -forvalues- command is another command that gets used a lot in handling repetitive tasks. Consider the same temperature dataset we created. Suppose we would like to generate twelve dummy variables (warm1-warm12) to reflect whether each monthly average temperature is higher than the one in the previous year. For example, I will code warm1 for the year of 2014 as 1 if the value of fmtemp1 for 2014 is higher than the value for 2013. I will code all the warm variables as 99 for the year of 2013, since they don’t have references to compare in this case.

We can do this by running the following code, then repeating it twelve times to create the twelve variables warm1-warm12.

 
* _n creates sequences of numbers. Type "help _n" for descriptions and examples.
> generate warm1=1 if fmtemp1 > fmtemp1[_n-1]
(2 missing values generated)

> replace warm1=0 if fmtemp1 <= fmtemp1[_n-1]
(2 real changes made)

> replace warm1=99 if year==2013
(1 real change made)

> list year fmtemp1 warm1, clean

       year   fmtemp1   warm1  
  1.   2013      39.2      99  
  2.   2014      30.2       0  
  3.   2015      35.6       1  

However this takes a lot of typing and may even introduce unwanted mistakes in the process of typing or copy-pasting the lines over and over.

 
* drop warm1 we generated
> drop warm1

Instead, we can use -forvalues- to do so:

> forvalues i=1/12 {
    generate warm`i'=1 if fmtemp`i' > fmtemp`i'[_n-1]
    replace warm`i'=0 if fmtemp`i' <= fmtemp`i'[_n-1]
    replace warm`i'=99 if year==2013
  }
 
* see the results
> list year fmtemp1-fmtemp3 warm1-warm3, clean 

       year   fmtemp1   fmtemp2   fmtemp3   warm1   warm2   warm3  
  1.   2013      39.2      37.4        41      99      99      99  
  2.   2014      30.2      37.4        41       0       0       0  
  3.   2015      35.6      30.2      44.6       1       0       1  

Reference
Baum, C. (2005). A little bit of Stata programming goes a long way… Working Papers in Economics, 69.

 

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

Stata Basics: Reshape Data

In this post, I use a few examples to illustrate the two common data forms, wide form and long form, and how to convert datasets between the two forms, which here we call “reshaping” data. Reshaping is often needed when you work with datasets that contain variables with some kind of sequence, say, time-series data. It is fairly easy to transform data between wide and long forms in Stata using the -reshape- command; however, you should still be careful when you convert a dataset from one form to the other, so that you can catch possible mistakes in the process.

First, let’s see what the wide and long forms look like.
Here is a simple example of a wide form dataset, in which each year’s income is stored in its own column.

     
     +----------------------------+
     | id   inc80   inc81   inc82 |
     |----------------------------|
  1. |  1    5000    5500    6000 |
  2. |  2    2000    2200    3300 |
  3. |  3    3000    2000    1000 |
     +----------------------------+
													 

The same dataset in long form looks like this, in which each case takes 3 rows: one for each of the 3 years, with the corresponding income.

				
     +------------------+
     | id   year    inc |
     |------------------|
  1. |  1     80   5000 |
  2. |  1     81   5500 |
  3. |  1     82   6000 |
  4. |  2     80   2000 |
  5. |  2     81   2200 |
     |------------------|
  6. |  2     82   3300 |
  7. |  3     80   3000 |
  8. |  3     81   2000 |
  9. |  3     82   1000 |
     +------------------+

Which form works better for you? It depends on what you need to do with the data. You may find it easier to enter your records in wide format; however, in my experience, long format works better in many cases of data analysis. So let’s see how to convert a dataset from wide form to long form.

* load dataset reshape1
> webuse reshape1, clear

* list the data
> list
									
     +-------------------------------------------------------+
     | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
     |-------------------------------------------------------|
  1. |  1     0    5000    5500    6000      0      1      0 |
  2. |  2     1    2000    2200    3300      1      0      0 |
  3. |  3     0    3000    2000    1000      0      0      1 |
     +-------------------------------------------------------+
										

* let's make the first example simpler by keeping id, sex and the inc variables
> drop ue*
> list  
						
     +----------------------------------+
     | id   sex   inc80   inc81   inc82 |
     |----------------------------------|
  1. |  1     0    5000    5500    6000 |
  2. |  2     1    2000    2200    3300 |
  3. |  3     0    3000    2000    1000 |
     +----------------------------------+
							

Reshape from wide to long

The syntax should look like this in general: reshape long stub, i(i) j(j)
In this case, 1) the stub should be inc, which is the variable to be converted from wide to long, 2) i is the id variable, which is the unique identifier of observations in wide form, and 3) j is the year variable that I am going to create; it tells Stata that the suffix of inc (i.e., 80, 81, 82) should be put in a variable called year.

> reshape long inc, i(id) j(year)
> list

     +------------------------+
     | id   year   sex    inc |
     |------------------------|
  1. |  1     80     0   5000 |
  2. |  1     81     0   5500 |
  3. |  1     82     0   6000 |
  4. |  2     80     1   2000 |
  5. |  2     81     1   2200 |
     |------------------------|
  6. |  2     82     1   3300 |
  7. |  3     80     0   3000 |
  8. |  3     81     0   2000 |
  9. |  3     82     0   1000 |
     +------------------------+

Here is what Stata did for us. In wide form, we had 3 observations and 3 income variables for the 3 years (80-82); we now have 9 observations in long form, so the transformation looks right in terms of the number of observations/rows.

(note: j = 80 81 82)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       9
Number of variables                   5   ->       4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

* To convert this current data back to wide form, simply type:
> reshape wide
> list

     +----------------------------------+
     | id   inc80   inc81   inc82   sex |
     |----------------------------------|
  1. |  1    5000    5500    6000     0 |
  2. |  2    2000    2200    3300     1 |
  3. |  3    3000    2000    1000     0 |
     +----------------------------------+

Reshape from wide to long: more than one stub

Remember that we actually had more variables in the reshape1 dataset; let’s see how to reshape all of them.

* load the dataset
> webuse reshape1, clear
> list

     +-------------------------------------------------------+
     | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
     |-------------------------------------------------------|
  1. |  1     0    5000    5500    6000      0      1      0 |
  2. |  2     1    2000    2200    3300      1      0      0 |
  3. |  3     0    3000    2000    1000      0      0      1 |
     +-------------------------------------------------------+

* reshape from wide to long
* we simply put inc and ue as stubs, then put id and year as i and j as we did in the previous example.
> reshape long inc ue, i(id) j(year)
> list

     +-----------------------------+
     | id   year   sex    inc   ue |
     |-----------------------------|
  1. |  1     80     0   5000    0 |
  2. |  1     81     0   5500    1 |
  3. |  1     82     0   6000    0 |
  4. |  2     80     1   2000    1 |
  5. |  2     81     1   2200    0 |
     |-----------------------------|
  6. |  2     82     1   3300    0 |
  7. |  3     80     0   3000    0 |
  8. |  3     81     0   2000    0 |
  9. |  3     82     0   1000    1 |
     +-----------------------------+

Reshape from wide to long: complex unique identifier

Sometimes a variable called id does not serve as a unique identifier – and that's one of the reasons we need to be careful when reshaping data. Consider another sample dataset called reshape2.

* load the data
> webuse reshape2, clear
> list

     +----------------------------------+
     | id   sex   inc80   inc81   inc82 |
     |----------------------------------|
  1. |  1     0    5000    5500    6000 |
  2. |  2     1    2000    2200    3300 |
  3. |  3     0    3000    2000    1000 |
  4. |  2     0    2400    2500    2400 |
     +----------------------------------+

If you reshape using id as the unique identifier i, you'll get an error because the variable id does not uniquely identify the observations.

> reshape long inc, i(id) j(year)
(note: j = 80 81 82)
variable id does not uniquely identify the observations
    Your data are currently wide.  You are performing a reshape long.  You specified i(id) and j(year).  In
    the current wide form, variable id should uniquely identify the observations.  Remember this picture:

         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------|                   |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem observations.
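
Before fixing the data, it helps to see which id values are duplicated. The -reshape error- command suggested in the message will list the problem observations; an alternative sketch uses the -duplicates- command:

* locate the observations whose id value appears more than once (sketch)
> duplicates report id
> duplicates list id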

In this case, the id problem may simply be a mistake in the dataset; in other circumstances, however, you may need to create a unique identifier when reshaping the data. Let's modify the dataset reshape2 by turning the variable sex into a group id called gid.

> rename sex gid
> order gid id
> list

     +----------------------------------+
     | gid   id   inc80   inc81   inc82 |
     |----------------------------------|
  1. |   0    1    5000    5500    6000 |
  2. |   1    2    2000    2200    3300 |
  3. |   0    3    3000    2000    1000 |
  4. |   0    2    2400    2500    2400 |
     +----------------------------------+

Now we have a dataset with gid, id, and income for the 3 years – combining gid and id makes a unique identifier.

> reshape long inc, i(gid id) j(year)
(note: j = 80 81 82)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        4   ->      12
Number of variables                   5   ->       4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

> list

     +------------------------+
     | gid   id   year    inc |
     |------------------------|
  1. |   0    1     80   5000 |
  2. |   0    1     81   5500 |
  3. |   0    1     82   6000 |
  4. |   0    2     80   2400 |
  5. |   0    2     81   2500 |
     |------------------------|
  6. |   0    2     82   2400 |
  7. |   0    3     80   3000 |
  8. |   0    3     81   2000 |
  9. |   0    3     82   1000 |
 10. |   1    2     80   2000 |
     |------------------------|
 11. |   1    2     81   2200 |
 12. |   1    2     82   3300 |
     +------------------------+

Reshape from wide to long: character suffixes

You can still reshape data if the stub variables come with character suffixes. Here we use the bpwide data installed with Stata as an example.

* load data and list the first 4 observations
> sysuse bpwide, clear 
(fictional blood-pressure data)

> list in 1/4

     +-----------------------------------------------+
     | patient    sex   agegrp   bp_bef~e   bp_after |
     |-----------------------------------------------|
  1. |       1   Male    30-45        143        153 |
  2. |       2   Male    30-45        163        170 |
  3. |       3   Male    30-45        153        168 |
  4. |       4   Male    30-45        153        142 |
     +-----------------------------------------------+

* reshape data, note the string option added at the end 
> reshape long bp_, i(patient) j(when) string
(note: j = after before)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      120   ->     240
Number of variables                   5   ->       5
j variable (2 values)                     ->   when
xij variables:
                     bp_after bp_before   ->   bp_
-----------------------------------------------------------------------------

> list in 1/4

     +----------------------------------------+
     | patient     when    sex   agegrp   bp_ |
     |----------------------------------------|
  1. |       1    after   Male    30-45   153 |
  2. |       1   before   Male    30-45   143 |
  3. |       2    after   Male    30-45   170 |
  4. |       2   before   Male    30-45   163 |
     +----------------------------------------+
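
One follow-up that often comes up: the new when variable is a string, and some commands work better with numeric variables. Here is a hedged sketch converting it to a value-labeled numeric variable with -encode- (when_n is just an illustrative name):

* convert the string variable to a value-labeled numeric variable (sketch)
> encode when, generate(when_n)
> tabulate when_n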

Reshape from long to wide: -reshape wide-

To convert a dataset from long form to wide form, simply use the -reshape wide- command instead.

Consider the airacc data. To keep the example simple, we retain only 3 variables: airline, time and i_cnt.

> webuse airacc.dta, clear
> keep airline time i_cnt
> list in 1/8

     +------------------------+
     | airline   i_cnt   time |
     |------------------------|
  1. |       1      25      1 |
  2. |       1      17      2 |
  3. |       1      22      3 |
  4. |       1      34      4 |
  5. |       2      26      1 |
     |------------------------|
  6. |       2      45      2 |
  7. |       2      30      3 |
  8. |       2      25      4 |
     +------------------------+

In this case, the variable i_cnt is the one we are going to restructure from long to wide; just as with -reshape long-, the i variable is the unique identifier in wide form, and the j variable is the existing variable whose values become the suffixes in wide form.

> reshape wide i_cnt, i(airline) j(time)
(note: j = 1 2 3 4)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       80   ->      20
Number of variables                   3   ->       5
j variable (4 values)              time   ->   (dropped)
xij variables:
                                  i_cnt   ->   i_cnt1 i_cnt2 ... i_cnt4
-----------------------------------------------------------------------------

> list in 1/8

     +---------------------------------------------+
     | airline   i_cnt1   i_cnt2   i_cnt3   i_cnt4 |
     |---------------------------------------------|
  1. |       1       25       17       22       34 |
  2. |       2       26       45       30       25 |
  3. |       3       10       23        8       21 |
  4. |       4       17       18        5       21 |
  5. |       5       18       19       13       27 |
     |---------------------------------------------|
  6. |       6       36       32       23       27 |
  7. |       7       27       28       25       17 |
  8. |       8       31       14       22       17 |
     +---------------------------------------------+

Other usage is similar to -reshape long- as well; for instance, you can reshape more than one variable –

* load the airacc data again, this time we keep one more variable: inprog
> webuse airacc.dta, clear
> keep airline time i_cnt inprog
> list in 1/8

     +---------------------------------+
     | airline   inprog   i_cnt   time |
     |---------------------------------|
  1. |       1        1      25      1 |
  2. |       1        1      17      2 |
  3. |       1        0      22      3 |
  4. |       1        0      34      4 |
  5. |       2        0      26      1 |
     |---------------------------------|
  6. |       2        0      45      2 |
  7. |       2        0      30      3 |
  8. |       2        1      25      4 |
     +---------------------------------+

Reshape the two variables i_cnt and inprog, with the i and j variables remaining the same.

> reshape wide i_cnt inprog, i(airline) j(time)
(note: j = 1 2 3 4)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       80   ->      20
Number of variables                   4   ->       9
j variable (4 values)              time   ->   (dropped)
xij variables:
                                  i_cnt   ->   i_cnt1 i_cnt2 ... i_cnt4
                                 inprog   ->   inprog1 inprog2 ... inprog4
-----------------------------------------------------------------------------

> list in 1/8

     +-------------------------------------------------------------------------------------+
     | airline   inprog1   i_cnt1   inprog2   i_cnt2   inprog3   i_cnt3   inprog4   i_cnt4 |
     |-------------------------------------------------------------------------------------|
  1. |       1         1       25         1       17         0       22         0       34 |
  2. |       2         0       26         0       45         0       30         1       25 |
  3. |       3         0       10         0       23         1        8         0       21 |
  4. |       4         0       17         1       18         0        5         0       21 |
  5. |       5         0       18         0       19         0       13         1       27 |
     |-------------------------------------------------------------------------------------|
  6. |       6         0       36         0       32         0       23         1       27 |
  7. |       7         0       27         1       28         1       25         1       17 |
  8. |       8         1       31         0       14         0       22         0       17 |
     +-------------------------------------------------------------------------------------+

 

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

Stata Basics: Combine Data (Append and Merge)

When I first started working with data, which was in a statistics class, we mostly used clean, complete datasets as examples. Later I realized that's not always the case when doing research or data analysis for other purposes; in reality, we often need to put two or more datasets together before we can begin whatever statistical analysis we would like to perform. In this post, I demonstrate how to combine datasets into one file in two typical ways: append and merge, which combine data row-wise and column-wise, respectively.

Append data: -append-

Say you would like to stack one of your data files on top of another; the -append- command does exactly that. Usually the data files we want to append contain the same variables, so let's create two fictional data files, each with 4 variables: id, character name, character family, and the number of episodes the character appeared in.

* Set working directory
> cd [YOUR PATH] 

* create dataset 1
> clear
> input id str8 name str9 family epi

            id       name     family        epi
  1. 1 "Arya" "Stark" 33
  2. 2 "Cersei" "Lannister" 36
  3. 3 "Ned" "Stark" 11
  4. end

> save got1, replace
file got1.dta saved

> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  1     Arya       Stark    33 |
  2. |  2   Cersei   Lannister    36 |
  3. |  3      Ned       Stark    11 |
     +-------------------------------+

* create dataset 2
> clear
> input id str8 name str9 family epi

            id       name     family        epi
  1. 5 "Robert" "Baratheon" 7
  2. 4 "Jon" "Stark" 32
  3. 6 "Tyrion" "Lannister" 36
  4. end

> save got2, replace
file got2.dta saved

> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  5   Robert   Baratheon     7 |
  2. |  4      Jon       Stark    32 |
  3. |  6   Tyrion   Lannister    36 |
     +-------------------------------+

* combine the two datasets and see the results
> use got1, clear
> append using got2
> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  1     Arya       Stark    33 |
  2. |  2   Cersei   Lannister    36 |
  3. |  3      Ned       Stark    11 |
  4. |  5   Robert   Baratheon     7 |
  5. |  4      Jon       Stark    32 |
     |-------------------------------|
  6. |  6   Tyrion   Lannister    36 |
     +-------------------------------+

The combined dataset looks right; however, we cannot tell which dataset each observation came from. In some cases this makes it inconvenient to trace records back to the original files, or even causes problems in the analysis – say, if got1 and got2 contain records from two different seasons, we should mark that in the combined dataset. We can do this by generating a variable indicating the season in each file before we append them.

> use got1, clear
> generate season=1
> save got1, replace
file got1.dta saved

> use got2, clear
> generate season=2
> save got2, replace
file got2.dta saved

> use got1, clear
> append using got2
> list

     +----------------------------------------+
     | id     name      family   epi   season |
     |----------------------------------------|
  1. |  1     Arya       Stark    33        1 |
  2. |  2   Cersei   Lannister    36        1 |
  3. |  3      Ned       Stark    11        1 |
  4. |  5   Robert   Baratheon     7        2 |
  5. |  4      Jon       Stark    32        2 |
     |----------------------------------------|
  6. |  6   Tyrion   Lannister    36        2 |
     +----------------------------------------+

> save got3, replace
file got3.dta saved

Now we have a combined dataset with a variable indicating which original dataset the observations come from – although this dataset is officially fictional, as Robert Baratheon was not seen in season two…
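
As an aside, if all you need is a marker of which file each row came from, -append- can create one for you through its generate() option; as I understand the option, the new variable is 0 for observations from the master file and 1 for those from the appended file. A sketch (source is just an illustrative name):

* let -append- mark the source file itself (sketch)
> use got1, clear
> append using got2, generate(source)
> tabulate source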

Merge data: -merge-

It is usually pretty straightforward to append data; however, it can get a bit trickier when you need to combine data column-wise, that is, to merge data. Below we use two examples to demonstrate one-to-one merge and one-to-many merge.

One-to-one merge: -merge 1:1-

In the dataset we just appended (got3), we have 5 variables, with the id variable uniquely identifying the 6 observations in the data. Say we have another data file that contains the id variable and the same 6 observations, but with a new variable called status – in other words, a new column. In this case, if we want to combine this new data file with got3, we should use a one-to-one merge to match the records in the two files.

* First, we create the new data file with id and the new variable status
> clear 
> input id status

            id     status
  1. 1 1
  2. 2 1
  3. 3 0 
  4. 4 1
  5. 6 1
  6. 5 0
  7. end 

> list

     +-------------+
     | id   status |
     |-------------|
  1. |  1        1 |
  2. |  2        1 |
  3. |  3        0 |
  4. |  4        1 |
  5. |  6        1 |
     |-------------|
  6. |  5        0 |
     +-------------+

> save got4, replace
file got4.dta saved

* sort observations by id in got3
> use got3, clear
> sort id
> list

     +----------------------------------------+
     | id     name      family   epi   season |
     |----------------------------------------|
  1. |  1     Arya       Stark    33        1 |
  2. |  2   Cersei   Lannister    36        1 |
  3. |  3      Ned       Stark    11        1 |
  4. |  4      Jon       Stark    32        2 |
  5. |  5   Robert   Baratheon     7        2 |
     |----------------------------------------|
  6. |  6   Tyrion   Lannister    36        2 |
     +----------------------------------------+

> save got3m, replace
file got3m.dta saved

* sort observations by id in got4
> use got4, clear
> sort id
> list

     +-------------+
     | id   status |
     |-------------|
  1. |  1        1 |
  2. |  2        1 |
  3. |  3        0 |
  4. |  4        1 |
  5. |  5        0 |
     |-------------|
  6. |  6        1 |
     +-------------+

> save got4m, replace
file got4m.dta saved

* merge the two files, we base this merge on the id variable in both files
> use got3m, clear
> merge 1:1 id using got4m

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 6  (_merge==3)
    -----------------------------------------

> list

     +---------------------------------------------------------------+
     | id     name      family   epi   season   status        _merge |
     |---------------------------------------------------------------|
  1. |  1     Arya       Stark    33        1        1   matched (3) |
  2. |  2   Cersei   Lannister    36        1        1   matched (3) |
  3. |  3      Ned       Stark    11        1        0   matched (3) |
  4. |  4      Jon       Stark    32        2        1   matched (3) |
  5. |  5   Robert   Baratheon     7        2        0   matched (3) |
     |---------------------------------------------------------------|
  6. |  6   Tyrion   Lannister    36        2        1   matched (3) |
     +---------------------------------------------------------------+

Note that Stata creates a _merge variable in the merged results, which indicates how the merge went for each observation. The value of _merge is 1 if the observation comes from the master file (file1) only, 2 if it comes from the using file (file2) only, and 3 if it comes from both files – in other words, 3 means the observation is matched. In this example we can easily inspect every observation to see whether it matched. With more records, as is usually the case, you can tabulate the _merge variable to check for any mismatched cases.

 
> tabulate _merge

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
            matched (3) |          6      100.00      100.00
------------------------+-----------------------------------
                  Total |          6      100.00

It looks like every observation was matched in this example.
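
When you expect a perfect match, you can also make that expectation explicit at merge time. Here is a sketch using -merge-'s assert() and nogenerate options: assert(match) makes Stata stop with an error if any observation fails to match, and nogenerate suppresses the _merge variable.

* redo the merge, failing loudly if anything does not match (sketch)
> use got3m, clear
> merge 1:1 id using got4m, assert(match) nogenerate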

One-to-many merge: -merge 1:m-

Here I show an example of another kind of merge: a one-to-many merge. Let's illustrate when we would need a one-to-many merge by combining two sample datasets: one with information on dads, the other with records of their kids.

First we create the dads file with family id, family name, dad's name and dad's status, and sort the observations by family id.

> clear 
> input familyid str9 family str8 dname dstatus

      familyid     family      dname    dstatus
  1. 3 "Stark" "Ned" 0
  2. 1 "Baratheon" "Robert" 0
  3. 2 "Lannister" "Tywin" 1
  4. end

> list

     +-----------------------------------------+
     | familyid      family    dname   dstatus |
     |-----------------------------------------|
  1. |        3       Stark      Ned         0 |
  2. |        1   Baratheon   Robert         0 |
  3. |        2   Lannister    Tywin         1 |
     +-----------------------------------------+

> sort familyid
> save got5, replace
file got5.dta saved

Then we create the kids file with the corresponding variables and sort it by family id as well.

> clear
> input familyid str9 family str8 kname kstatus

      familyid     family      kname    kstatus
  1. 2 "Lannister" "Cersei" 1
  2. 3 "Stark" "Arya" 1
  3. 2 "Lannister" "Tyrion" 1
  4. 3 "Stark" "Jon" 1
  5. 1 "Baratheon" "Joffrey" 0
  6. end

> list

     +------------------------------------------+
     | familyid      family     kname   kstatus |
     |------------------------------------------|
  1. |        2   Lannister    Cersei         1 |
  2. |        3       Stark      Arya         1 |
  3. |        2   Lannister    Tyrion         1 |
  4. |        3       Stark       Jon         1 |
  5. |        1   Baratheon   Joffrey         0 |
     +------------------------------------------+

> sort familyid
> save got6, replace
file got6.dta saved

Now the two files share the familyid variable as an identifier. Since each dad may have more than one kid, we use a one-to-many merge to combine them.

 
* use the dads file as master file and kids file as using file
> use got5, clear
> merge 1:m familyid using got6

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 5  (_merge==3)
    -----------------------------------------

> list

     +---------------------------------------------------------------------------+
     | familyid      family    dname   dstatus     kname   kstatus        _merge |
     |---------------------------------------------------------------------------|
  1. |        1   Baratheon   Robert         0   Joffrey         0   matched (3) |
  2. |        2   Lannister    Tywin         1    Cersei         1   matched (3) |
  3. |        3       Stark      Ned         0       Jon         1   matched (3) |
  4. |        2   Lannister    Tywin         1    Tyrion         1   matched (3) |
  5. |        3       Stark      Ned         0      Arya         1   matched (3) |
     +---------------------------------------------------------------------------+

* sort by familyid 
> sort familyid 
> list

     +---------------------------------------------------------------------------+
     | familyid      family    dname   dstatus     kname   kstatus        _merge |
     |---------------------------------------------------------------------------|
  1. |        1   Baratheon   Robert         0   Joffrey         0   matched (3) |
  2. |        2   Lannister    Tywin         1    Cersei         1   matched (3) |
  3. |        2   Lannister    Tywin         1    Tyrion         1   matched (3) |
  4. |        3       Stark      Ned         0      Arya         1   matched (3) |
  5. |        3       Stark      Ned         0       Jon         1   matched (3) |
     +---------------------------------------------------------------------------+

So the steps are really the same for one-to-one and one-to-many merges; you just need to pick the right one depending on the datasets you are going to combine and the end product you would like to obtain from the merge.
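
The mirror image also exists: if the kids file were the master file, the same combination would be a many-to-one merge. A minimal sketch:

* same combination from the kids' side: many kids match one dad (sketch)
> use got6, clear
> merge m:1 familyid using got5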

 

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

Stata Basics: Subset Data

Sometimes only part of a dataset is relevant to your analysis. In this post, we show you how to subset a dataset in Stata, by variables or by observations. We use the census.dta dataset installed with Stata as the sample data.

Subset by variables

* Load the data
> sysuse census.dta
(1980 Census data by state)

* See the information of the data
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
----------------------------------------------------------------------------
Sorted by: 

-keep-: keep variables or observations

There are 13 variables in this dataset. Say we would like a separate file that contains only the list of states along with the region variable; we can use the -keep- command to do so.

> keep state state2 region

> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:             3                          6 Apr 2014 15:43
 size:           900                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
---------------------------------------------------------------------------
Sorted by: 

Now you should see only the three variables remaining in the data. Note that this change applies only to the copy of the data in memory, not to the file on disk – you need to use the -save- command to change the file itself. Be careful when you save this change, as you will permanently lose all the other variables that are not in the keep list. So here I save it as a new file called slist.

> save slist
file slist.dta saved

* Now if you load back the original file, you should still have all the variables
> sysuse census.dta, clear
(1980 Census data by state)

Note the clear option clears the current data in memory, which contains the three variables we kept – don't worry, you still have them on disk since we saved them as slist.dta.

-drop-: drop variables or observations

The -drop- command also works for subsetting data. Say we only need to work with the population variables for the different age groups; we can remove the other variables and save the result as a new file called census2.

> drop medage death marriage divorce

> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:             9                          6 Apr 2014 15:43
 size:         2,100                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
----------------------------------------------------------------------------
Sorted by: 

> save census2
file census2.dta saved
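
As a shortcut, Stata varlists accept wildcards, so the same subset can be requested without typing every population variable. A sketch, assuming the only variables starting with pop are the ones we want to keep:

* keep the identifying variables plus everything that starts with "pop" (sketch)
> sysuse census.dta, clear
> keep state state2 region pop*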

Subset by observations

We can also use the -keep- and -drop- commands to subset data by keeping or eliminating observations that meet one or more conditions. For example, we can keep the states in the South.

* Load the data again and clear the current one in memory
> sysuse census.dta, clear
(1980 Census data by state)
 
* See the contents of region
> tabulate region

     Census |
     region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |          9       18.00       18.00
    N Cntrl |         12       24.00       42.00
      South |         16       32.00       74.00
       West |         13       26.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Note that region is an integer variable with a value label called cenreg indicating the four regions. We can use -label list- to see how the integers are associated with the text labels representing the regions.

> label list cenreg
cenreg:
           1 NE
           2 N Cntrl
           3 South
           4 West

* The states in the South are coded as 3.
  
* Keep the observations/rows (the states) that are in South region
> keep if region==3
(34 observations deleted)

> tabulate region 

     Census |
     region |      Freq.     Percent        Cum.
------------+-----------------------------------
      South |         16      100.00      100.00
------------+-----------------------------------
      Total |         16      100.00

* Here are the 16 states in the South region that remain in the dataset.

Now let’s use -drop- to eliminate those states with population below the average.

 
* Load back the data
> sysuse census.dta, clear
(1980 Census data by state)

* summary statistics, mean=4518149 
> summarize pop 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         pop |         50     4518149     4715038     401851   2.37e+07

* drop the rows/states with population less than the mean 
> drop if pop < 4518149 
(33 observations deleted)

* list the states remained (those with population above the average)
> list state 

     +---------------+
     | state         |
     |---------------|
  1. | California    |
  2. | Florida       |
  3. | Georgia       |
  4. | Illinois      |
  5. | Indiana       |
     |---------------|
  6. | Massachusetts |
  7. | Michigan      |
  8. | Missouri      |
  9. | New Jersey    |
 10. | New York      |
     |---------------|
 11. | N. Carolina   |
 12. | Ohio          |
 13. | Pennsylvania  |
 14. | Tennessee     |
 15. | Texas         |
     |---------------|
 16. | Virginia      |
 17. | Wisconsin     |
     +---------------+

You probably want to be careful when using the -list- command. In this case, census.dta is a small dataset with only 50 rows/observations, and I eliminated 33 of them, so I know only a fairly small number of cases will be listed in the output. If you are working with a big dataset, you may not want to send that much information to your output.
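
A small refinement to the drop step above: instead of typing the mean by hand, you can pull it from the results stored by -summarize-, which avoids rounding or copy-paste mistakes. A sketch:

* drop below-average states without hard-coding the mean (sketch)
> sysuse census.dta, clear
> quietly summarize pop
> drop if pop < r(mean)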

 

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

Stata Basics: Create, Recode and Label Variables

This post demonstrates how to create new variables, recode existing variables, and label variables and their values. We use variables from the census.dta data that come with Stata as examples.

-generate-: create variables

Here we use the -generate- command to create a new variable representing the population younger than 18 years old. We do so by summing two existing variables: poplt5 (population < 5 years old) and pop5_17 (population 5 to 17 years old).

* Load data census.dta 
> sysuse census.dta
(1980 Census data by state)

* See the information of census.dta
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
-----------------------------------------------------------------------------
Sorted by: 

* Create a new variable pop0_17 representing youth population 
> generate pop0_17 = poplt5 + pop5_17

* Summary statistics for the three variables
> summarize poplt5 pop5_17 pop0_17

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      poplt5 |         50    326277.8    331585.1      35998    1708400
     pop5_17 |         50    945951.6    959372.8      91796    4680558
     pop0_17 |         50     1272229     1289731     130745    6388958

* -order-: reorder variables
> order state state2 region pop poplt5 pop0_17 

-replace-: replace contents of existing variables

Here we create the youth population variable again, but this time we express it in thousands and replace the one we just created.

> replace pop0_17 = pop0_17/1000
(50 real changes made)

> summarize pop0_17 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     pop0_17 |         50    1272.229    1289.731    130.745   6388.958

Recode variables

Say we want to break pop (total population) into three categories. First we use the -tabulate- command to see the frequencies of this variable.

> tabulate pop

 Population |      Freq.     Percent        Cum.
------------+-----------------------------------
    401,851 |          1        2.00        2.00
    469,557 |          1        2.00        4.00
    511,456 |          1        2.00        6.00
    594,338 |          1        2.00        8.00
    652,717 |          1        2.00       10.00
    690,768 |          1        2.00       12.00
    786,690 |          1        2.00       14.00
    800,493 |          1        2.00       16.00
    920,610 |          1        2.00       18.00
    943,935 |          1        2.00       20.00
    947,154 |          1        2.00       22.00
    964,691 |          1        2.00       24.00
    1124660 |          1        2.00       26.00
    1302894 |          1        2.00       28.00
    1461037 |          1        2.00       30.00
    1569825 |          1        2.00       32.00
    1949644 |          1        2.00       34.00
    2286435 |          1        2.00       36.00
    2363679 |          1        2.00       38.00
    2520638 |          1        2.00       40.00
    2633105 |          1        2.00       42.00
    2718215 |          1        2.00       44.00
    2889964 |          1        2.00       46.00
    2913808 |          1        2.00       48.00
    3025290 |          1        2.00       50.00
    3107576 |          1        2.00       52.00
    3121820 |          1        2.00       54.00
    3660777 |          1        2.00       56.00
    3893888 |          1        2.00       58.00
    4075970 |          1        2.00       60.00
    4132156 |          1        2.00       62.00
    4205900 |          1        2.00       64.00
    4216975 |          1        2.00       66.00
    4591120 |          1        2.00       68.00
    4705767 |          1        2.00       70.00
    4916686 |          1        2.00       72.00
    5346818 |          1        2.00       74.00
    5463105 |          1        2.00       76.00
    5490224 |          1        2.00       78.00
    5737037 |          1        2.00       80.00
    5881766 |          1        2.00       82.00
    7364823 |          1        2.00       84.00
    9262078 |          1        2.00       86.00
    9746324 |          1        2.00       88.00
   1.08e+07 |          1        2.00       90.00
   1.14e+07 |          1        2.00       92.00
   1.19e+07 |          1        2.00       94.00
   1.42e+07 |          1        2.00       96.00
   1.76e+07 |          1        2.00       98.00
   2.37e+07 |          1        2.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Then we create a new variable called pop_c and transform the original variable pop into three categories.

> generate pop_c = .
(50 missing values generated)

> replace pop_c = 1 if (pop <= 2000000) 
(17 real changes made) 

> replace pop_c = 2 if (pop >= 2000001) & (pop <= 4800000) 
(18 real changes made) 

> replace pop_c = 3 if (pop >= 4800001)
(15 real changes made)
 
* See if our recoding worked correctly
> tabulate pop pop_c

           |              pop_c
Population |         1          2          3 |     Total
-----------+---------------------------------+----------
   401,851 |         1          0          0 |         1 
   469,557 |         1          0          0 |         1 
   511,456 |         1          0          0 |         1 
   594,338 |         1          0          0 |         1 
   652,717 |         1          0          0 |         1 
   690,768 |         1          0          0 |         1 
   786,690 |         1          0          0 |         1 
   800,493 |         1          0          0 |         1 
   920,610 |         1          0          0 |         1 
   943,935 |         1          0          0 |         1 
   947,154 |         1          0          0 |         1 
   964,691 |         1          0          0 |         1 
   1124660 |         1          0          0 |         1 
   1302894 |         1          0          0 |         1 
   1461037 |         1          0          0 |         1 
   1569825 |         1          0          0 |         1 
   1949644 |         1          0          0 |         1 
   2286435 |         0          1          0 |         1 
   2363679 |         0          1          0 |         1 
   2520638 |         0          1          0 |         1 
   2633105 |         0          1          0 |         1 
   2718215 |         0          1          0 |         1 
   2889964 |         0          1          0 |         1 
   2913808 |         0          1          0 |         1 
   3025290 |         0          1          0 |         1 
   3107576 |         0          1          0 |         1 
   3121820 |         0          1          0 |         1 
   3660777 |         0          1          0 |         1 
   3893888 |         0          1          0 |         1 
   4075970 |         0          1          0 |         1 
   4132156 |         0          1          0 |         1 
   4205900 |         0          1          0 |         1 
   4216975 |         0          1          0 |         1 
   4591120 |         0          1          0 |         1 
   4705767 |         0          1          0 |         1 
   4916686 |         0          0          1 |         1 
   5346818 |         0          0          1 |         1 
   5463105 |         0          0          1 |         1 
   5490224 |         0          0          1 |         1 
   5737037 |         0          0          1 |         1 
   5881766 |         0          0          1 |         1 
   7364823 |         0          0          1 |         1 
   9262078 |         0          0          1 |         1 
   9746324 |         0          0          1 |         1 
  1.08e+07 |         0          0          1 |         1 
  1.14e+07 |         0          0          1 |         1 
  1.19e+07 |         0          0          1 |         1 
  1.42e+07 |         0          0          1 |         1 
  1.76e+07 |         0          0          1 |         1 
  2.37e+07 |         0          0          1 |         1 
-----------+---------------------------------+----------
     Total |        17         18         15 |        50 

We can also use the -recode- command to recode variables. Here we create another new variable called pop_c2, then do the recode in the same manner as we did for pop_c.

> generate pop_c2 = pop

> recode pop_c2 (min/2000000=1) (2000001/4800000=2) (4800001/max=3)
(pop_c2: 50 changes made)
 
* Summary statistics for the two recoded variables
> summarize pop_c pop_c2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       pop_c |         50        1.96    .8071113          1          3
      pop_c2 |         50        1.96    .8071113          1          3
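
Since pop_c and pop_c2 are meant to be identical, we can check that directly rather than eyeballing the summary statistics. A one-line sketch; -assert- is silent if the condition holds for every observation and errors out otherwise:

* confirm the two recoding approaches agree (sketch)
> assert pop_c == pop_c2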

If you are not happy with the original name of the total population variable, you can change it with the -rename- command. Here we rename pop as pop_t.

> rename pop pop_t 

Label variables and values

Now that we have created or recoded some variables, I would like to label them so that other people can get a better idea of what they are. I believe this is good practice even if you are the only person using the dataset – we forget things, and the labels serve as basic “documentation” of the dataset.

 
* See which variables need to be labeled
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            16                          6 Apr 2014 15:43
 size:         3,500                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g                 
pop_c2          float   %9.0g                 
----------------------------------------------------------------------------
Sorted by: 
 
* Label variable 
> label variable pop0_17 "Pop, < 18 years" 
> label variable pop_c "Categorized population"

* Remember we categorized the population into three categories in pop_c: 1, 2 and 3
> table pop_c

----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
        1 |         17
        2 |         18
        3 |         15
----------------------

Let’s label them as low, medium and high.

* Label values
* First we define those labels
> label define popcl 1 "low" 2 "medium" 3 "high"
 
* Then we attach the value label popcl to the variable pop_c
> label values pop_c popcl 
 
* Now the three categories are presented as low, medium and high 
> table pop_c 

----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
      low |         17
   medium |         18
     high |         15
----------------------

* Remove the duplicated variable pop_c2 
> drop pop_c2
 
* You can also label the dataset
> label data "1980 Census data by state: v2"

* see the information of the dataset 
> describe 

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state: v2
 vars:            15                          6 Apr 2014 15:43
 size:         3,300                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 Pop, < 18 years
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g      popcl      Categorized population
----------------------------------------------------------------------------
Sorted by: 
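
A related housekeeping habit, not part of the workflow above: the -notes- command attaches free-text notes to the dataset or to individual variables, which complements variable and value labels as lightweight documentation. A sketch:

* attach notes to the dataset and to a variable, then display them (sketch)
> notes: pop_c derived from total population using cutoffs at 2,000,000 and 4,800,000
> notes pop_c: 1=low, 2=medium, 3=high (value label popcl)
> notes list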

 

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library