# Latest Posts

## Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient:

$$\frac{n!}{2!(n-2)!} = \binom{n}{2}$$

In the R statistical computing environment, the choose function calculates this quickly. For example, how many possible 2-item combinations can I “choose” from 10 items?

choose(10,2)
[1] 45
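If we want to see the pairs themselves rather than just count them, the combn function in base R returns each combination as a column of a matrix. For instance, with our three items A, B and C:

```r
# each column is one of the choose(3, 2) = 3 possible pairs
combn(c("A", "B", "C"), 2)
#      [,1] [,2] [,3]
# [1,] "A"  "A"  "B"
# [2,] "B"  "C"  "C"
```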



We sometimes want to make pairwise comparisons to see where differences occur. Let’s say we go to 8 high schools in an area, survey 30 students at each school, and ask them whether or not they floss their teeth at least once a day. When finished we’ll have 8 proportions of students who answered “Yes”. An obvious first step would be to conduct a hypothesis test for any differences between these proportions. The null would be no difference between the proportions versus some difference. If we reject the null, we have evidence of differences. But where are the differences? This leads us to pairwise comparisons of proportions, where we make multiple comparisons. The outcome of these pairwise comparisons will hopefully tell us which schools have significantly different proportions of students flossing.

Making multiple comparisons leads to an increased chance of making a false discovery, i.e. rejecting a null hypothesis that should not have been rejected. When we run a hypothesis test, we always run a risk of finding something that isn’t there. Think of flipping a fair coin 10 times and getting 9 or 10 heads (or 0 or 1 heads). That’s improbable but not impossible. If it happened to us we may conclude the coin is unfair, but that would be the wrong conclusion if the coin truly was fair. It just so happened we were very unlucky to witness such an unusual event. As we said, the chance of this happening is low in a single trial, but we increase our chances of it happening by conducting multiple trials.

The probability of observing 0, 1, 9, or 10 heads when flipping a fair coin 10 times is about 2%, which can be calculated in R as follows:

pbinom(q = 1, size = 10, prob = 0.5) * 2
[1] 0.02148438
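Equivalently, we can sum the exact binomial probabilities of the four extreme outcomes with the dbinom function; this is just a check on the pbinom shortcut above:

```r
# P(0, 1, 9, or 10 heads in 10 flips of a fair coin),
# summing the four exact binomial probabilities
sum(dbinom(x = c(0, 1, 9, 10), size = 10, prob = 0.5))
# [1] 0.02148438
```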



Therefore the probability of getting 2 to 8 heads is about 98%:

1 - pbinom(q = 1, size = 10, prob = 0.5) * 2
[1] 0.9785156



The probability of getting 2 to 8 heads in all 10 trials is 98% multiplied by itself 10 times:

(1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^10
[1] 0.8047809



Therefore the probability of getting 0, 1, 9, or 10 heads at least once in 10 trials is about 20%:

1 - (1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^10
[1] 0.1952191
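The same calculation works for any per-trial probability: the chance of at least one false discovery in m independent trials is 1 - (1 - p)^m. As a quick sketch, a small helper (named fwer here just for illustration) reproduces the result above:

```r
# probability of at least one "unusual" result in m independent trials,
# given per-trial probability p (fwer is an illustrative name, not a base R function)
fwer <- function(p, m) 1 - (1 - p)^m

p <- pbinom(q = 1, size = 10, prob = 0.5) * 2
fwer(p, m = 10)
# [1] 0.1952191
```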



We can think of this as doing multiple hypothesis tests. Flip 10 coins 10 times each, get the proportion of heads for each coin, and use 10 one-sample proportion tests to statistically determine if the results we got are consistent with a fair coin. In other words, do we get any p-values less than, say, 0.05?

We can simulate this in R. First we replicate 1,000 times the act of flipping 10 fair coins 10 times each and counting the number of heads, using the rbinom function. This produces a 10 x 1000 matrix of results that we save as “coin.flips”. We then apply a function to each column of the matrix that runs 10 one-sample proportion tests using the prop.test function and saves a TRUE/FALSE value indicating whether any of the p-values are less than 0.05 (we talk more about the prop.test function below). This returns a vector we save as “results” that contains TRUE or FALSE for each replicate. R treats TRUE as 1 and FALSE as 0, so calling mean on results returns the proportion of TRUEs in the vector. We get about 20%, confirming our calculations. (If you run the code below you’ll probably get a slightly different but similar answer.)

trials <- 10
coin.flips <- replicate(1000, rbinom(n = 10, size = trials, prob = 0.5))

multHypTest <- function(x){
  pvs <- sapply(x, function(x) prop.test(x = x, n = trials, p = 0.5)$p.value)
  any(pvs < 0.05)
}

results <- apply(coin.flips, 2, multHypTest)
mean(results)
[1] 0.206

That’s just for 10 trials. What about 15 or 20 or more? You can re-run the code above with trials set to a different value. We can also visualize it by plotting the probability of an unusual result (0, 1, 9, or 10 heads) versus the number of tests. Notice how rapidly the probability of a false discovery increases with the number of tests.

curve(expr = 1 - (1 - pbinom(q = 1, size = 10, prob = 0.5) * 2)^x,
      xlim = c(1,50),
      xlab = "Number of tests",
      ylab = "Probability of 0, 1, 9, or 10 heads")

So what does all of this tell us? It reveals that traditional significance levels such as 0.05 are too high when conducting multiple hypothesis tests. We need to either adjust our significance level or adjust our p-values. As we’ll see, the usual approach is to adjust the p-values using one of several methods for p-value adjustment.

Let’s return to our example of examining the proportion of high school students (sample size 30 at each school) who floss at 8 different high schools. We’ll simulate this data as if the true proportion is 30% at each school (i.e., no difference). We use set.seed to make the data reproducible.

set.seed(15)
n <- 30
k <- 8
school <- rep(1:k, each = n)
floss <- replicate(k, sample(x = c("Y","N"), size = n, prob = c(0.3, 0.7), replace = TRUE))
dat <- data.frame(school, floss = as.vector(floss))

With our data generated, we can tabulate the number of Yes and No responses at each school:

flossTab <- with(dat, table(school, floss))
flossTab
      floss
school  N  Y
     1 18 12
     2 19 11
     3 14 16
     4 19 11
     5 26  4
     6 15 15
     7 20 10
     8 21  9

Using prop.table we can determine the proportions. Specifying margin = 1 means proportions are calculated across the rows for each school. (We also round to two decimal places for presentation purposes.)
The second column contains the proportion of students who answered Yes at each school.

round(prop.table(flossTab, margin = 1), 2)
      floss
school    N    Y
     1 0.60 0.40
     2 0.63 0.37
     3 0.47 0.53
     4 0.63 0.37
     5 0.87 0.13
     6 0.50 0.50
     7 0.67 0.33
     8 0.70 0.30

First we might want to run a test to see if we can statistically conclude that not all proportions are equal. We can do this with the prop.test function. The prop.test function requires that Yes (or “success”) counts be in the first column of a table and No (or “failure”) counts in the second column. Thus we switch the columns using subsetting brackets with a vector indicating column order.

prop.test(flossTab[,c("Y","N")])

	8-sample test for equality of proportions without continuity correction

data:  flossTab[, c("Y", "N")]
X-squared = 13.78, df = 7, p-value = 0.05524
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6    prop 7    prop 8 
0.4000000 0.3666667 0.5333333 0.3666667 0.1333333 0.5000000 0.3333333 0.3000000 

The p-value of 0.055 is borderline significant and indicates some evidence of differences among proportions. We generated the data, so we know there actually is no difference! But if this were real data that we had spent considerable resources collecting, we might be led to believe (perhaps even want to believe) that some differences indeed exist. That p-value is so close to significance! School #5, in particular, with a proportion of 13%, looks far lower than school #3 with 53%. We could conclude this hypothesis test is significant at the 0.10 level and proceed to pairwise comparisons.
To do that in R we use the pairwise.prop.test function, which requires a table in the same format as prop.test: Yes counts in the first column and No counts in the second column.

pairwise.prop.test(x = flossTab[,c("Y","N")])

	Pairwise comparisons using Pairwise comparison of proportions 

data:  flossTab[, c("Y", "N")] 

  1     2     3     4     5     6     7    
2 1.000 -     -     -     -     -     -    
3 1.000 1.000 -     -     -     -     -    
4 1.000 1.000 1.000 -     -     -     -    
5 1.000 1.000 0.073 1.000 -     -     -    
6 1.000 1.000 1.000 1.000 0.149 -     -    
7 1.000 1.000 1.000 1.000 1.000 1.000 -    
8 1.000 1.000 1.000 1.000 1.000 1.000 1.000

P value adjustment method: holm 

This produces a table of 28 p-values, since there are 28 possible pairs between 8 items. We interpret the table by using row and column numbers to find the p-value for a particular pair. For example, the p-value of 0.073 at the intersection of row 5 and column 3 is the p-value for the two-sample proportion test between school #5 and school #3. It appears to be insignificant at the traditional 5% level. All other p-values are clearly insignificant. In fact, most are 1. This is due to the p-value adjustment that was made. The output tells us the “holm” method was used. We won’t get into the details of how this method works, but suffice to say it increases the p-values in an effort to adjust for the many comparisons being made. In this case, it does what it’s supposed to: it adjusts the p-values and allows us to make a good case that there are no differences between schools, at least not at the 5% level, which would be the correct decision.

We can do pairwise comparisons without adjusted p-values by setting p.adjust.method = "none". Let’s do that and see what happens:

# NOTE: This analysis is wrong!
pairwise.prop.test(x = flossTab[,c("Y","N")], p.adjust.method = "none")

	Pairwise comparisons using Pairwise comparison of proportions 

data:  flossTab[, c("Y", "N")] 

  1      2      3      4      5      6      7     
2 1.0000 -      -      -      -      -      -     
3 0.4376 0.2993 -      -      -      -      -     
4 1.0000 1.0000 0.2993 -      -      -      -     
5 0.0410 0.0736 0.0026 0.0736 -      -      -     
6 0.6038 0.4345 1.0000 0.4345 0.0055 -      -     
7 0.7888 1.0000 0.1927 1.0000 0.1270 0.2949 -     
8 0.5883 0.7842 0.1161 0.7842 0.2100 0.1876 1.0000

P value adjustment method: none 

Notice that now we have significant differences for 3 pairs: (5,1), (5,3), and (6,5). Again, we know this is wrong because we simulated the data. The truth is all schools have a floss rate of 30%. But through random chance, and because we did not adjust our p-values for multiple testing, we got what look to be significant results. This illustrates the importance of using adjusted p-values when making multiple comparisons.

There are other p-value adjustment methods available. A common and conservative choice is the bonferroni method. It simply multiplies all p-values by the number of pairs; in our example that is 28. To see all p-value adjustment methods available in R, enter ?p.adjust at the console.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
October 20, 2016

## Welcome Meagan

Meagan Christensen joined our Social, Natural, and Engineering Sciences team in August as our new Social Science Librarian, and we are thrilled to have her with us! Meagan will be working with the Economics, Politics, Psychology, and Sociology departments, engaging with students and faculty on their research and information needs — in the classroom, through consultations, and online — and helping these communities navigate resources both within the Library and across the University. Meagan earned her MLIS from the University of Washington and her B.A.
in Psychology and Sociology (with a minor in French) from the University of Portland. She joined UVA in February 2014 as the Online and Distance Learning Librarian, and has also worked as part of the Teaching and Learning team in Academic Engagement. You can find Meagan in the Library Data Commons at Curry (Ruffner 302). Or email her at mck6n@virginia.edu.

## Stata Basics: foreach and forvalues

There are times we need to do repetitive tasks in the process of data preparation, analysis or presentation: for instance, computing a set of variables in the same manner, renaming or creating a series of variables, or repetitively recoding values of a number of variables. In this post, I show a few simple example “loops” using the Stata commands -foreach-, -local- and -forvalues- to handle some common repetitive tasks.

### -foreach-: loop over items

Consider this sample dataset of monthly average temperature for three years.

* input data
> clear
> input year mtemp1-mtemp12

  year mtemp1 mtemp2 mtemp3 mtemp4 mtemp5 mtemp6 mtemp7 mtemp8 mtemp9 mtemp10 mtemp11 mtemp12
  1. 2013 4 3 5 14 18 23 25 22 19 15 7 6
  2. 2014 -1 3 5 13 19 23 24 23 21 15 7 5
  3. 2015 2 -1 7 14 21 24 25 24 21 14 11 10
  4. end

The mean temperatures of each month are in Centigrade; if we want to convert them to Fahrenheit, we could do the computation for each of the 12 variables.

generate fmtemp1 = mtemp1*(9/5)+32
generate fmtemp2 = mtemp2*(9/5)+32
generate fmtemp3 = mtemp3*(9/5)+32
generate fmtemp4 = mtemp4*(9/5)+32
generate fmtemp5 = mtemp5*(9/5)+32
generate fmtemp6 = mtemp6*(9/5)+32
generate fmtemp7 = mtemp7*(9/5)+32
generate fmtemp8 = mtemp8*(9/5)+32
generate fmtemp9 = mtemp9*(9/5)+32
generate fmtemp10 = mtemp10*(9/5)+32
generate fmtemp11 = mtemp11*(9/5)+32
generate fmtemp12 = mtemp12*(9/5)+32

However this takes a lot of typing. Alternatively, we can use the -foreach- command to achieve the same goal.
In the following code, we tell Stata to do the same computation (c*9/5+32) for each of the variables in the varlist mtemp1-mtemp12.

> foreach v of varlist mtemp1-mtemp12 {
    generate f`v' = `v'*(9/5)+32
  }

* list variables
> ds
year     mtemp3   mtemp6   mtemp9   mtemp12  fmtemp3  fmtemp6  fmtemp9  fmtemp12
mtemp1   mtemp4   mtemp7   mtemp10  fmtemp1  fmtemp4  fmtemp7  fmtemp10
mtemp2   mtemp5   mtemp8   mtemp11  fmtemp2  fmtemp5  fmtemp8  fmtemp11

Note that braces must be specified with -foreach-. The open brace has to be on the same line as the foreach, and the close brace must be on a line by itself. It’s crucial to close loops properly, especially if you have one or more loops nested in another loop.

### -local-: define macro

This was a rather simple repetitive task which can be handled solely by the -foreach- command. Here we introduce another command, -local-, which is used a lot with commands like -foreach- to deal with repetitive tasks that are more complex. The -local- command is a way of defining macros in Stata. A Stata macro can contain multiple elements; it has a name and contents. Consider the following two examples:

* define a local macro called month
> local month jan feb mar apr
> display "`month'"
jan feb mar apr

Define a local macro called mcode and another called month, alter the contents of mcode in the foreach loop, then display them in the form “mcode: month”.

> local mcode 0
> local month jan feb mar apr
> foreach m of local month {
    local mcode = `mcode' + 1
    display "`mcode': `m'"
  }
1: jan
2: feb
3: mar
4: apr

Note that when you call a defined macro, it has to be wrapped in a ` (left tick) and ' (apostrophe).

### Rename multiple variables

Take the temperature dataset we created as an example. Let’s say we want to rename variables mtemp1-mtemp12 as mtempjan-mtempdec. We can do so by tweaking the code from the previous example a bit. Define local macros mcode and month, then rename the 12 variables in the foreach loop.
> local mcode 0
> local month jan feb mar apr may jun jul aug sep oct nov dec
> foreach m of local month {
    local mcode = `mcode' + 1
    rename mtemp`mcode' mtemp`m'
  }
> ds
year      mtempmar  mtempjun  mtempsep  mtempdec  fmtemp3  fmtemp6  fmtemp9  fmtemp12
mtempjan  mtempapr  mtempjul  mtempoct  fmtemp1   fmtemp4  fmtemp7  fmtemp10
mtempfeb  mtempmay  mtempaug  mtempnov  fmtemp2   fmtemp5  fmtemp8  fmtemp11

We can obtain the same results in a slightly different way. This time we use the other 12 variables, fmtemp1-fmtemp12, as examples. Again, we will rename them as fmtempjan-fmtempdec. Define the local macro month, then define a local macro monthII inside the foreach loop, using the extended macro function word to reference the contents of month.

> local month jan feb mar apr may jun jul aug sep oct nov dec
> foreach n of numlist 1/12 {
    local monthII: word `n' of `month'
    display "`monthII'"
    rename fmtemp`n' fmtemp`monthII'
  }
jan
feb
mar
apr
may
jun
jul
aug
sep
oct
nov
dec
> ds
year       mtempmar   mtempjun   mtempsep   mtempdec   fmtempmar  fmtempjun  fmtempsep  fmtempdec
mtempjan   mtempapr   mtempjul   mtempoct   fmtempjan  fmtempapr  fmtempjul  fmtempoct
mtempfeb   mtempmay   mtempaug   mtempnov   fmtempfeb  fmtempmay  fmtempaug  fmtempnov

I usually run -display- to see how the macro looks before actually applying it to tasks like changing variable names, just to make sure I don’t accidentally change them to some undesired result or even cause errors; however the display line is not necessary in this case. Here we rename them back to fmtemp1-fmtemp12.
> local mcode 0
> foreach n in jan feb mar apr may jun jul aug sep oct nov dec {
    local mcode = `mcode' + 1
    rename fmtemp`n' fmtemp`mcode'
  }
> ds
year      mtempmar  mtempjun  mtempsep  mtempdec  fmtemp3  fmtemp6  fmtemp9  fmtemp12
mtempjan  mtempapr  mtempjul  mtempoct  fmtemp1   fmtemp4  fmtemp7  fmtemp10
mtempfeb  mtempmay  mtempaug  mtempnov  fmtemp2   fmtemp5  fmtemp8  fmtemp11

### -forvalues-: loop over consecutive values

The -forvalues- command is another command that gets used a lot for repetitive work. Consider the same temperature dataset we created; suppose we would like to generate twelve dummy variables (warm1-warm12) to reflect whether each monthly average temperature is higher than in the previous year. For example, warm1 for the year 2014 is coded 1 if the value of fmtemp1 for 2014 is higher than the value for 2013. All the warm variables are coded 99 for the year 2013, since there is no earlier year to compare against. We could do this by running the following code, then repeating it twelve times to create the twelve variables warm1-warm12.

* _n creates sequences of numbers. Type "help _n" for descriptions and examples.
> generate warm1=1 if fmtemp1 > fmtemp1[_n-1]
(2 missing values generated)
> replace warm1=0 if fmtemp1 <= fmtemp1[_n-1]
(2 real changes made)
> replace warm1=99 if year==2013
(1 real change made)
> list year fmtemp1 warm1, clean
       year   fmtemp1   warm1
  1.   2013      39.2      99
  2.   2014      30.2       0
  3.   2015      35.6       1

However this takes a lot of typing and may even introduce mistakes in the process of typing or copy-pasting it over and over.

* drop the warm1 we generated
> drop warm1

Instead, we can use -forvalues- to do so:

> forvalues i=1/12 {
    generate warm`i'=1 if fmtemp`i' > fmtemp`i'[_n-1]
    replace warm`i'=0 if fmtemp`i' <= fmtemp`i'[_n-1]
    replace warm`i'=99 if year==2013
  }

* see the results
> list year fmtemp1-fmtemp3 warm1-warm3, clean
       year   fmtemp1   fmtemp2   fmtemp3   warm1   warm2   warm3
  1.   2013      39.2      37.4        41      99      99      99
  2.   2014      30.2      37.4        41       0       0       0
  3.   2015      35.6      30.2      44.6       1       0       1

Reference
Baum, C. (2005). A little bit of Stata programming goes a long way… Working Papers in Economics, 69.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Stata Basics: Reshape Data

In this post, I use a few examples to illustrate the two common data forms, wide form and long form, and how to convert datasets between the two forms; here we call it “reshaping” data. Reshaping is often needed when you work with datasets that contain variables with some kind of sequence, say, time-series data. It is fairly easy to transform data between wide and long forms in Stata using the -reshape- command; however, you may still like to be careful when you convert a dataset from one form to the other, so that you can eliminate possible mistakes in the process of transforming.

First, let’s see what the wide and long forms look like. Here is a simple example of a wide form dataset, in which every variable lives in a column.

     +----------------------------+
     | id   inc80   inc81   inc82 |
     |----------------------------|
  1. |  1    5000    5500    6000 |
  2. |  2    2000    2200    3300 |
  3. |  3    3000    2000    1000 |
     +----------------------------+

The same dataset in long form looks like this, in which each case takes 3 rows: the 3 years and the corresponding income.

     +------------------+
     | id   year    inc |
     |------------------|
  1. |  1     80   5000 |
  2. |  1     81   5500 |
  3. |  1     82   6000 |
  4. |  2     80   2000 |
  5. |  2     81   2200 |
     |------------------|
  6. |  2     82   3300 |
  7. |  3     80   3000 |
  8. |  3     81   2000 |
  9. |  3     82   1000 |
     +------------------+

Which form works better for you? It depends on what you need to do with the data. You may find it easier to enter your records in wide format, but in my experience, long format works better in many cases of data analysis. So let’s see how to convert a dataset in wide form to long form.
* load dataset reshape1
> webuse reshape1, clear

* list the data
> list

     +-------------------------------------------------------+
     | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
     |-------------------------------------------------------|
  1. |  1     0    5000    5500    6000      0      1      0 |
  2. |  2     1    2000    2200    3300      1      0      0 |
  3. |  3     0    3000    2000    1000      0      0      1 |
     +-------------------------------------------------------+

* let's make the first example simpler by keeping id, sex and the inc variables
> drop ue*
> list

     +----------------------------------+
     | id   sex   inc80   inc81   inc82 |
     |----------------------------------|
  1. |  1     0    5000    5500    6000 |
  2. |  2     1    2000    2200    3300 |
  3. |  3     0    3000    2000    1000 |
     +----------------------------------+

### Reshape from wide to long

The general syntax looks like this:

reshape long stub, i(i) j(j)

In this case: 1) the stub should be inc, the variable to be converted from wide to long; 2) i is the id variable, the unique identifier of observations in wide form; and 3) j is the year variable that I am going to create; it tells Stata that the suffix of inc (i.e., 80, 81, 82) should be put in a variable called year.

> reshape long inc, i(id) j(year)
> list

     +------------------------+
     | id   year   sex    inc |
     |------------------------|
  1. |  1     80     0   5000 |
  2. |  1     81     0   5500 |
  3. |  1     82     0   6000 |
  4. |  2     80     1   2000 |
  5. |  2     81     1   2200 |
     |------------------------|
  6. |  2     82     1   3300 |
  7. |  3     80     0   3000 |
  8. |  3     81     0   2000 |
  9. |  3     82     0   1000 |
     +------------------------+

Here is what Stata did for us. In wide form, we had 3 observations and 3 income variables for the 3 years (80-82); we now have 9 observations in long form. So the transformation looks right in terms of the number of observations/rows.

(note: j = 80 81 82)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->      9
Number of variables                   5   ->      4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

* To convert this current data back to wide form, simply type:
> reshape wide
> list

     +----------------------------------+
     | id   inc80   inc81   inc82   sex |
     |----------------------------------|
  1. |  1    5000    5500    6000     0 |
  2. |  2    2000    2200    3300     1 |
  3. |  3    3000    2000    1000     0 |
     +----------------------------------+

### Reshape from wide to long: more than one stub

Remember we actually had more variables in the reshape1 dataset; let’s see how to reshape it.

* load the dataset
> webuse reshape1, clear
> list

     +-------------------------------------------------------+
     | id   sex   inc80   inc81   inc82   ue80   ue81   ue82 |
     |-------------------------------------------------------|
  1. |  1     0    5000    5500    6000      0      1      0 |
  2. |  2     1    2000    2200    3300      1      0      0 |
  3. |  3     0    3000    2000    1000      0      0      1 |
     +-------------------------------------------------------+

* reshape from wide to long
* we simply put inc and ue as stubs, then put id and year as i and j as we did in the previous example.
> reshape long inc ue, i(id) j(year)
> list

     +-----------------------------+
     | id   year   sex    inc   ue |
     |-----------------------------|
  1. |  1     80     0   5000    0 |
  2. |  1     81     0   5500    1 |
  3. |  1     82     0   6000    0 |
  4. |  2     80     1   2000    1 |
  5. |  2     81     1   2200    0 |
     |-----------------------------|
  6. |  2     82     1   3300    0 |
  7. |  3     80     0   3000    0 |
  8. |  3     81     0   2000    0 |
  9. |  3     82     0   1000    1 |
     +-----------------------------+

### Reshape from wide to long: complex unique identifier

Sometimes a variable called id does not serve as a unique identifier, and that’s one of the reasons we need to be careful when reshaping data. Consider another sample dataset called reshape2.

* load the data
> webuse reshape2, clear
> list

     +----------------------------------+
     | id   sex   inc80   inc81   inc82 |
     |----------------------------------|
  1. |  1     0    5000    5500    6000 |
  2. |  2     1    2000    2200    3300 |
  3. |  3     0    3000    2000    1000 |
  4. |  2     0    2400    2500    2400 |
     +----------------------------------+

If you reshape using id as the unique identifier i, you’ll get an error, since the variable id does not uniquely identify the observations.

> reshape long inc, i(id) j(year)
(note: j = 80 81 82)
variable id does not uniquely identify the observations
    Your data are currently wide.  You are performing a reshape long.  You
    specified i(id) and j(year).  In the current wide form, variable id
    should uniquely identify the observations.

    Remember this picture:

         long                                wide
        +---------------+                   +-------------------+
        | i   j   a   b |                   | i   a1  a2  b1 b2 |
        |---------------|                   |-------------------|
        | 1   1   1   2 |                   | 1   1   3   2   4 |
        | 1   2   3   4 |                   | 2   5   7   6   8 |
        | 2   1   5   6 |                   +-------------------+
        | 2   2   7   8 |
        +---------------+

    Type reshape error for a list of the problem observations.

In this case, the id problem may be due to some mistakes in the dataset; in other circumstances, you may need to create a unique identifier when reshaping the dataset. Let’s modify the dataset reshape2 by turning the variable sex into a group id called gid.

> rename sex gid
> order gid id
> list

     +----------------------------------+
     | gid   id   inc80   inc81   inc82 |
     |----------------------------------|
  1. |   0    1    5000    5500    6000 |
  2. |   1    2    2000    2200    3300 |
  3. |   0    3    3000    2000    1000 |
  4. |   0    2    2400    2500    2400 |
     +----------------------------------+

Now we have a dataset with gid, id and income for the 3 years; combining gid and id makes a unique identifier.

> reshape long inc, i(gid id) j(year)
(note: j = 80 81 82)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        4   ->     12
Number of variables                   5   ->      4
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
-----------------------------------------------------------------------------

. list

     +------------------------+
     | gid   id   year    inc |
     |------------------------|
  1. |   0    1     80   5000 |
  2. |   0    1     81   5500 |
  3. |   0    1     82   6000 |
  4. |   0    2     80   2400 |
  5. |   0    2     81   2500 |
     |------------------------|
  6. |   0    2     82   2400 |
  7. |   0    3     80   3000 |
  8. |   0    3     81   2000 |
  9. |   0    3     82   1000 |
 10. |   1    2     80   2000 |
     |------------------------|
 11. |   1    2     81   2200 |
 12. |   1    2     82   3300 |
     +------------------------+

### Reshape from wide to long: character suffixes

You can still reshape data if the stub variables come with character suffixes. Here we use the bpwide data installed with Stata as an example.

* load data and list the first 4 observations
> sysuse bpwide, clear
(fictional blood-pressure data)
> list in 1/4

     +-----------------------------------------------+
     | patient    sex   agegrp   bp_bef~e   bp_after |
     |-----------------------------------------------|
  1. |       1   Male    30-45        143        153 |
  2. |       2   Male    30-45        163        170 |
  3. |       3   Male    30-45        153        168 |
  4. |       4   Male    30-45        153        142 |
     +-----------------------------------------------+

* reshape data, note the string option added at the end
> reshape long bp_, i(patient) j(when) string
(note: j = after before)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      120   ->    240
Number of variables                   5   ->      5
j variable (2 values)                     ->   when
xij variables:
                     bp_after bp_before   ->   bp_
-----------------------------------------------------------------------------

> list in 1/4

     +----------------------------------------+
     | patient     when    sex   agegrp   bp_ |
     |----------------------------------------|
  1. |       1    after   Male    30-45   153 |
  2. |       1   before   Male    30-45   143 |
  3. |       2    after   Male    30-45   170 |
  4. |       2   before   Male    30-45   163 |
     +----------------------------------------+

### Reshape from long to wide: -reshape wide-

To convert a dataset from long form to wide, simply use the -reshape wide- command instead. Consider the airacc data; to make a simple example, we only keep 3 variables: airline, time and i_cnt.

> webuse airacc.dta, clear
> keep airline time i_cnt
> list in 1/8

     +------------------------+
     | airline   i_cnt   time |
     |------------------------|
  1. |       1      25      1 |
  2. |       1      17      2 |
  3. |       1      22      3 |
  4. |       1      34      4 |
  5. |       2      26      1 |
     |------------------------|
  6. |       2      45      2 |
  7. |       2      30      3 |
  8. |       2      25      4 |
     +------------------------+

In this case, the variable i_cnt is the one we are going to restructure from long to wide, and just like with -reshape long-, the i variable is the unique identifier in wide form, and the j variable is the one that contains the suffix in wide form.

> reshape wide i_cnt, i(airline) j(time)
(note: j = 1 2 3 4)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       80   ->     20
Number of variables                   3   ->      5
j variable (4 values)              time   ->   (dropped)
xij variables:
                                  i_cnt   ->   i_cnt1 i_cnt2 ... i_cnt4
-----------------------------------------------------------------------------

> list in 1/8

     +---------------------------------------------+
     | airline   i_cnt1   i_cnt2   i_cnt3   i_cnt4 |
     |---------------------------------------------|
  1. |       1       25       17       22       34 |
  2. |       2       26       45       30       25 |
  3. |       3       10       23        8       21 |
  4. |       4       17       18        5       21 |
  5. |       5       18       19       13       27 |
     |---------------------------------------------|
  6. |       6       36       32       23       27 |
  7. |       7       27       28       25       17 |
  8. |       8       31       14       22       17 |
     +---------------------------------------------+

Other usages are similar to -reshape long- as well; for instance, reshaping more than one variable.

* load the airacc data again, this time we keep one more variable: inprog
> webuse airacc.dta, clear
> keep airline time i_cnt inprog
> list in 1/8

     +---------------------------------+
     | airline   inprog   i_cnt   time |
     |---------------------------------|
  1. |       1        1      25      1 |
  2. |       1        1      17      2 |
  3. |       1        0      22      3 |
  4. |       1        0      34      4 |
  5. |       2        0      26      1 |
     |---------------------------------|
  6. |       2        0      45      2 |
  7. |       2        0      30      3 |
  8. |       2        1      25      4 |
     +---------------------------------+

Reshape the two variables i_cnt and inprog, with the i and j variables remaining the same.

> reshape wide i_cnt inprog, i(airline) j(time)
(note: j = 1 2 3 4)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       80   ->     20
Number of variables                   4   ->      9
j variable (4 values)              time   ->   (dropped)
xij variables:
                                  i_cnt   ->   i_cnt1 i_cnt2 ... i_cnt4
                                 inprog   ->   inprog1 inprog2 ... inprog4
-----------------------------------------------------------------------------

> list in 1/8

     +-------------------------------------------------------------------------------------+
     | airline   inprog1   i_cnt1   inprog2   i_cnt2   inprog3   i_cnt3   inprog4   i_cnt4 |
     |-------------------------------------------------------------------------------------|
  1. |       1         1       25         1       17         0       22         0       34 |
  2. |       2         0       26         0       45         0       30         1       25 |
  3. |       3         0       10         0       23         1        8         0       21 |
  4. |       4         0       17         1       18         0        5         0       21 |
  5. |       5         0       18         0       19         0       13         1       27 |
     |-------------------------------------------------------------------------------------|
  6. |       6         0       36         0       32         0       23         1       27 |
  7. |       7         0       27         1       28         1       25         1       17 |
  8. |       8         1       31         0       14         0       22         0       17 |
     +-------------------------------------------------------------------------------------+

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Stata Basics: Combine Data (Append and Merge)

When I first started working with data, in a statistics class, we mostly used clean and complete datasets as examples. Later on, I realized that’s not always the case when doing research or data analysis for other purposes; in reality, we often need to put two or more datasets together before we can begin whatever statistical analysis tasks we would like to perform. In this post, I demonstrate how to combine datasets into one file in two typical ways: append and merge, that is, row-wise combining and column-wise combining, respectively.

### Append data: -append-

Say you would like to stack one of your data files on top of another; you can use the -append- command to do so. Usually the data files we would like to append contain the same variables, so let’s create two fictional data files, each of them with 4 variables: id, character name, character family and the number of episodes the character appeared in.
* Set working directory
> cd [YOUR PATH]

* create dataset 1
> clear
> input id str8 name str9 family epi
            id       name     family        epi
  1. 1 "Arya" "Stark" 33
  2. 2 "Cersei" "Lannister" 36
  3. 3 "Ned" "Stark" 11
  4. end

> save got1, replace
file got1.dta saved

> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  1     Arya       Stark    33 |
  2. |  2   Cersei   Lannister    36 |
  3. |  3      Ned       Stark    11 |
     +-------------------------------+

* create dataset 2
> clear
> input id str8 name str9 family epi
            id       name     family        epi
  1. 5 "Robert" "Baratheon" 7
  2. 4 "Jon" "Stark" 32
  3. 6 "Tyrion" "Lannister" 36
  4. end

> save got2, replace
file got2.dta saved

> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  5   Robert   Baratheon     7 |
  2. |  4      Jon       Stark    32 |
  3. |  6   Tyrion   Lannister    36 |
     +-------------------------------+

* combine the two datasets and see the results
> use got1, clear
> append using got2
> list

     +-------------------------------+
     | id     name      family   epi |
     |-------------------------------|
  1. |  1     Arya       Stark    33 |
  2. |  2   Cersei   Lannister    36 |
  3. |  3      Ned       Stark    11 |
  4. |  5   Robert   Baratheon     7 |
  5. |  4      Jon       Stark    32 |
     |-------------------------------|
  6. |  6   Tyrion   Lannister    36 |
     +-------------------------------+

The combined dataset looks right, but we cannot tell which dataset each observation came from. In some cases this makes it inconvenient to trace records back to the original files, or even causes problems in data analysis — say, if got1 and got2 contain records from two different seasons, we should mark that in the combined dataset. We can do this by generating a variable indicating season before we append the files.

> use got1, clear
> generate season=1
> save got1, replace
file got1.dta saved

> use got2, clear
> generate season=2
> save got2, replace
file got2.dta saved

> use got1, clear
> append using got2
> list

     +----------------------------------------+
     | id     name      family   epi   season |
     |----------------------------------------|
  1. |  1     Arya       Stark    33        1 |
  2. |  2   Cersei   Lannister    36        1 |
  3. |  3      Ned       Stark    11        1 |
  4. |  5   Robert   Baratheon     7        2 |
  5. |  4      Jon       Stark    32        2 |
     |----------------------------------------|
  6. |  6   Tyrion   Lannister    36        2 |
     +----------------------------------------+

> save got3, replace
file got3.dta saved

Now we have a combined dataset with a variable indicating which original dataset each observation comes from — although this dataset is officially fictional, as Robert Baratheon was not seen in season two…

### Merge data: -merge-

Appending data is usually pretty straightforward; it gets a bit trickier when you need to combine data in a column-wise manner, that is, merge data. Below we use two examples to demonstrate one-to-one merge and one-to-many merge.

#### One-to-one merge: -merge 1:1-

In the dataset we just appended (got3), we have 5 variables, with the id variable uniquely identifying the 6 observations in the data. Say we have another data file containing the id variable and the same 6 observations, but with a new variable called status — in other words, a new column. In this case, if we want to combine this new data file with got3, we should use a one-to-one merge to match the records in the two files.

* First, we create the new data file with id and the new variable status
> clear
> input id status
            id     status
  1. 1 1
  2. 2 1
  3. 3 0
  4. 4 1
  5. 6 1
  6. 5 0
  7. end

> list

     +-------------+
     | id   status |
     |-------------|
  1. |  1        1 |
  2. |  2        1 |
  3. |  3        0 |
  4. |  4        1 |
  5. |  6        1 |
     |-------------|
  6. |  5        0 |
     +-------------+

> save got4, replace
file got4.dta saved

* sort observations by id in got3
> use got3, clear
> sort id
> list

     +----------------------------------------+
     | id     name      family   epi   season |
     |----------------------------------------|
  1. |  1     Arya       Stark    33        1 |
  2. |  2   Cersei   Lannister    36        1 |
  3. |  3      Ned       Stark    11        1 |
  4. |  4      Jon       Stark    32        2 |
  5. |  5   Robert   Baratheon     7        2 |
     |----------------------------------------|
  6. |  6   Tyrion   Lannister    36        2 |
     +----------------------------------------+

> save got3m, replace
file got3m.dta saved

* sort observations by id in got4
> use got4, clear
> sort id
> list

     +-------------+
     | id   status |
     |-------------|
  1. |  1        1 |
  2. |  2        1 |
  3. |  3        0 |
  4. |  4        1 |
  5. |  5        0 |
     |-------------|
  6. |  6        1 |
     +-------------+

> save got4m, replace
file got4m.dta saved

* merge the two files, basing the merge on the id variable in both files
> use got3m, clear
> merge 1:1 id using got4m

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 6  (_merge==3)
    -----------------------------------------

> list

     +---------------------------------------------------------------+
     | id     name      family   epi   season   status        _merge |
     |---------------------------------------------------------------|
  1. |  1     Arya       Stark    33        1        1   matched (3) |
  2. |  2   Cersei   Lannister    36        1        1   matched (3) |
  3. |  3      Ned       Stark    11        1        0   matched (3) |
  4. |  4      Jon       Stark    32        2        1   matched (3) |
  5. |  5   Robert   Baratheon     7        2        0   matched (3) |
     |---------------------------------------------------------------|
  6. |  6   Tyrion   Lannister    36        2        1   matched (3) |
     +---------------------------------------------------------------+

Note that Stata creates a _merge variable in the merged results, which indicates how the merge was done for each observation. The value of _merge is 1 if the observation comes from file1 (the master file) only, 2 if it comes from file2 (the using file) only, and 3 if it appears in both files — in other words, 3 means the observation is matched.
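To see how the other _merge codes arise, here is a minimal sketch (file names and values hypothetical, not from the example above) that merges two tiny files whose ids only partly overlap:

```stata
* two hypothetical files with partially overlapping ids
clear
input id x
1 10
2 20
3 30
end
save m1, replace

clear
input id y
2 200
3 300
4 400
end
save m2, replace

* id 1 appears only in the master file  -> _merge==1
* id 4 appears only in the using file   -> _merge==2
* ids 2 and 3 appear in both files      -> _merge==3
use m1, clear
merge 1:1 id using m2
tabulate _merge
```

The x and y values for unmatched ids are simply left missing in the merged result.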
In this example, we can easily inspect every observation to see whether it matched. With more records in a dataset, as we normally have, you can tabulate the _merge variable to check for mismatched cases.

> tabulate _merge

                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
            matched (3) |          6      100.00      100.00
------------------------+-----------------------------------
                  Total |          6      100.00

It looks like every observation was matched in this merging example.

#### One-to-many merge: -merge 1:m-

Here I show an example of another kind of merge, called a one-to-many merge. Let's illustrate when we would need a one-to-many merge by combining two sample datasets: one with information on dads, another with records of their kids. First we create the dads file with family id, family name, dad's name and dad's status, and sort the observations by family id.

> clear
> input familyid str9 family str8 dname dstatus
      familyid     family      dname    dstatus
  1. 3 "Stark" "Ned" 0
  2. 1 "Baratheon" "Robert" 0
  3. 2 "Lannister" "Tywin" 1
  4. end

> list

     +-----------------------------------------+
     | familyid      family    dname   dstatus |
     |-----------------------------------------|
  1. |        3       Stark      Ned         0 |
  2. |        1   Baratheon   Robert         0 |
  3. |        2   Lannister    Tywin         1 |
     +-----------------------------------------+

> sort familyid
> save got5, replace
file got5.dta saved

Then we create the kids file with the same kinds of variables, sorted by family id as well.

> clear
> input familyid str9 family str8 kname kstatus
      familyid     family      kname    kstatus
  1. 2 "Lannister" "Cersei" 1
  2. 3 "Stark" "Arya" 1
  3. 2 "Lannister" "Tyrion" 1
  4. 3 "Stark" "Jon" 1
  5. 1 "Baratheon" "Joffrey" 0
  6. end

> list

     +------------------------------------------+
     | familyid      family     kname   kstatus |
     |------------------------------------------|
  1. |        2   Lannister    Cersei         1 |
  2. |        3       Stark      Arya         1 |
  3. |        2   Lannister    Tyrion         1 |
  4. |        3       Stark       Jon         1 |
  5. |        1   Baratheon   Joffrey         0 |
     +------------------------------------------+

> sort familyid
> save got6, replace
file got6.dta saved

Now the two files share the familyid variable as an identifier. Since each dad may have more than one kid, we use a one-to-many merge to combine them.

* use the dads file as the master file and the kids file as the using file
> use got5, clear
> merge 1:m familyid using got6

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 5  (_merge==3)
    -----------------------------------------

> list

     +---------------------------------------------------------------------------+
     | familyid      family    dname   dstatus     kname   kstatus        _merge |
     |---------------------------------------------------------------------------|
  1. |        1   Baratheon   Robert         0   Joffrey         0   matched (3) |
  2. |        2   Lannister    Tywin         1    Cersei         1   matched (3) |
  3. |        3       Stark      Ned         0       Jon         1   matched (3) |
  4. |        2   Lannister    Tywin         1    Tyrion         1   matched (3) |
  5. |        3       Stark      Ned         0      Arya         1   matched (3) |
     +---------------------------------------------------------------------------+

* sort by familyid
> sort familyid
> list

     +---------------------------------------------------------------------------+
     | familyid      family    dname   dstatus     kname   kstatus        _merge |
     |---------------------------------------------------------------------------|
  1. |        1   Baratheon   Robert         0   Joffrey         0   matched (3) |
  2. |        2   Lannister    Tywin         1    Cersei         1   matched (3) |
  3. |        2   Lannister    Tywin         1    Tyrion         1   matched (3) |
  4. |        3       Stark      Ned         0      Arya         1   matched (3) |
  5. |        3       Stark      Ned         0       Jon         1   matched (3) |
     +---------------------------------------------------------------------------+

So the steps are really the same for one-to-one and one-to-many merges; you just need to pick the right one depending on the datasets you are going to combine and the end product you would like to obtain from the merge.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Stata Basics: Subset Data

Sometimes only parts of a dataset mean something to you.
In this post, we show how to subset a dataset in Stata, by variables or by observations. We use the census.dta dataset installed with Stata as the sample data.

### Subset by variables

* Load the data
> sysuse census.dta
(1980 Census data by state)

* See the information of the data
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
---------------------------------------------------------------------------
Sorted by:

### -keep-: keep variables or observations

There are 13 variables in this dataset. Say we would like a separate file containing only the list of states along with the region variable; we can use the -keep- command to do so.

> keep state state2 region
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:             3                          6 Apr 2014 15:43
 size:           900
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
---------------------------------------------------------------------------
Sorted by:

You should now see only the three variables remaining in the data. Note that this change applies only to the copy of the data in memory, not to the file on disk — you need the -save- command to change the file itself. Be careful when saving this change, as you will permanently lose all the other variables that are not in the keep list. So here I save it as a new file called slist.

> save slist
file slist.dta saved

* Now if you load the original file again, you should still have all the variables
> sysuse census.dta, clear
(1980 Census data by state)

Note that the clear option wipes the current data from memory, which contained the three variables we kept — don't worry, they are still on your disk since we saved them as slist.dta.

### -drop-: drop variables or observations

The -drop- command also works for subsetting data. Say we only need to work with the population of different age groups; we can remove the other variables and save the result as a new file called census2.

> drop medage death marriage divorce
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:             9                          6 Apr 2014 15:43
 size:         2,100
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
---------------------------------------------------------------------------
Sorted by:

> save census2
file census2.dta saved

### Subset by observations

We can also use the -keep- and -drop- commands to subset data by keeping or eliminating observations that meet one or more conditions. For example, we can keep the states in the South.

* Load the data again, clearing the current data in memory
> sysuse census.dta, clear
(1980 Census data by state)

* See the contents of region
> tabulate region

     Census |
     region |      Freq.     Percent        Cum.
------------+-----------------------------------
         NE |          9       18.00       18.00
    N Cntrl |         12       24.00       42.00
      South |         16       32.00       74.00
       West |         13       26.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Note that region is an integer variable with a value label called cenreg indicating the four regions. We can use -label list- to see how the integers are associated with the text representing the regions.

> label list cenreg
cenreg:
        1 NE
        2 N Cntrl
        3 South
        4 West

* The states in the South are coded as 3.
* Keep the observations/rows (the states) in the South region
> keep if region==3
(34 observations deleted)

> tabulate region

     Census |
     region |      Freq.     Percent        Cum.
------------+-----------------------------------
      South |         16      100.00      100.00
------------+-----------------------------------
      Total |         16      100.00

* The 16 Southern states (rows) remain in the dataset.

Now let's use -drop- to eliminate the states with population below the average.

* Load the data again
> sysuse census.dta, clear
(1980 Census data by state)

* summary statistics; mean=4518149
> summarize pop

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         pop |        50     4518149     4715038     401851   2.37e+07

* drop the rows/states with population less than the mean
> drop if pop < 4518149
(33 observations deleted)

* list the remaining states (those with population above the average)
> list state

     +---------------+
     |         state |
     |---------------|
  1. |    California |
  2. |       Florida |
  3. |       Georgia |
  4. |      Illinois |
  5. |       Indiana |
     |---------------|
  6. | Massachusetts |
  7. |      Michigan |
  8. |      Missouri |
  9. |    New Jersey |
 10. |      New York |
     |---------------|
 11. |   N. Carolina |
 12. |          Ohio |
 13. |  Pennsylvania |
 14. |     Tennessee |
 15. |         Texas |
     |---------------|
 16. |      Virginia |
 17. |     Wisconsin |
     +---------------+

You probably want to be careful when using the -list- command. In this case, census.dta is a small dataset with only 50 rows/observations, and I eliminated 33 of them, so I know only a fairly small number of cases will be listed in the output. If you are working with a big dataset, you may not want to dump too much information into your output.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Stata Basics: Create, Recode and Label Variables

This post demonstrates how to create new variables, recode existing variables, and label variables and values of variables. We use variables from the census.dta data that come with Stata as examples.

### -generate-: create variables

Here we use the -generate- command to create a new variable representing the population younger than 18 years old.
We do so by summing the two existing variables poplt5 (population < 5 years old) and pop5_17 (population 5 to 17 years old).

* Load data census.dta
> sysuse census.dta
(1980 Census data by state)

* See the information of census.dta
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
---------------------------------------------------------------------------
Sorted by:

* Create a new variable pop0_17 representing the youth population
> generate pop0_17 = poplt5 + pop5_17

* Summary statistics for the three variables
> summarize poplt5 pop5_17 pop0_17

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      poplt5 |        50    326277.8    331585.1      35998    1708400
     pop5_17 |        50    945951.6    959372.8      91796    4680558
     pop0_17 |        50     1272229     1289731     130745    6388958

* -order-: reorder variables
> order state state2 region pop poplt5 pop0_17

### -replace-: replace contents of existing variables

Here we create the youth population variable again, but this time in thousands, replacing the one we just created.

> replace pop0_17 = pop0_17/1000
(50 real changes made)

> summarize pop0_17

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     pop0_17 |        50    1272.229    1289.731    130.745   6388.958

### Recode variables

Say we want to break pop (total population) into three categories. First we use the -tabulate- command to see the frequencies of this variable.

> tabulate pop

 Population |      Freq.     Percent        Cum.
------------+-----------------------------------
    401,851 |          1        2.00        2.00
    469,557 |          1        2.00        4.00
    511,456 |          1        2.00        6.00
    594,338 |          1        2.00        8.00
    652,717 |          1        2.00       10.00
    690,768 |          1        2.00       12.00
    786,690 |          1        2.00       14.00
    800,493 |          1        2.00       16.00
    920,610 |          1        2.00       18.00
    943,935 |          1        2.00       20.00
    947,154 |          1        2.00       22.00
    964,691 |          1        2.00       24.00
    1124660 |          1        2.00       26.00
    1302894 |          1        2.00       28.00
    1461037 |          1        2.00       30.00
    1569825 |          1        2.00       32.00
    1949644 |          1        2.00       34.00
    2286435 |          1        2.00       36.00
    2363679 |          1        2.00       38.00
    2520638 |          1        2.00       40.00
    2633105 |          1        2.00       42.00
    2718215 |          1        2.00       44.00
    2889964 |          1        2.00       46.00
    2913808 |          1        2.00       48.00
    3025290 |          1        2.00       50.00
    3107576 |          1        2.00       52.00
    3121820 |          1        2.00       54.00
    3660777 |          1        2.00       56.00
    3893888 |          1        2.00       58.00
    4075970 |          1        2.00       60.00
    4132156 |          1        2.00       62.00
    4205900 |          1        2.00       64.00
    4216975 |          1        2.00       66.00
    4591120 |          1        2.00       68.00
    4705767 |          1        2.00       70.00
    4916686 |          1        2.00       72.00
    5346818 |          1        2.00       74.00
    5463105 |          1        2.00       76.00
    5490224 |          1        2.00       78.00
    5737037 |          1        2.00       80.00
    5881766 |          1        2.00       82.00
    7364823 |          1        2.00       84.00
    9262078 |          1        2.00       86.00
    9746324 |          1        2.00       88.00
   1.08e+07 |          1        2.00       90.00
   1.14e+07 |          1        2.00       92.00
   1.19e+07 |          1        2.00       94.00
   1.42e+07 |          1        2.00       96.00
   1.76e+07 |          1        2.00       98.00
   2.37e+07 |          1        2.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Then we create a new variable called pop_c and recode the original variable pop into three categories.

> generate pop_c = .
(50 missing values generated)

> replace pop_c = 1 if (pop <= 2000000)
(17 real changes made)

> replace pop_c = 2 if (pop >= 2000001) & (pop <= 4800000)
(18 real changes made)

> replace pop_c = 3 if (pop >= 4800001)
(15 real changes made)

* See if our recoding worked correctly
> tabulate pop pop_c

           |              pop_c
Population |         1          2          3 |     Total
-----------+---------------------------------+----------
   401,851 |         1          0          0 |         1
   469,557 |         1          0          0 |         1
   511,456 |         1          0          0 |         1
   594,338 |         1          0          0 |         1
   652,717 |         1          0          0 |         1
   690,768 |         1          0          0 |         1
   786,690 |         1          0          0 |         1
   800,493 |         1          0          0 |         1
   920,610 |         1          0          0 |         1
   943,935 |         1          0          0 |         1
   947,154 |         1          0          0 |         1
   964,691 |         1          0          0 |         1
   1124660 |         1          0          0 |         1
   1302894 |         1          0          0 |         1
   1461037 |         1          0          0 |         1
   1569825 |         1          0          0 |         1
   1949644 |         1          0          0 |         1
   2286435 |         0          1          0 |         1
   2363679 |         0          1          0 |         1
   2520638 |         0          1          0 |         1
   2633105 |         0          1          0 |         1
   2718215 |         0          1          0 |         1
   2889964 |         0          1          0 |         1
   2913808 |         0          1          0 |         1
   3025290 |         0          1          0 |         1
   3107576 |         0          1          0 |         1
   3121820 |         0          1          0 |         1
   3660777 |         0          1          0 |         1
   3893888 |         0          1          0 |         1
   4075970 |         0          1          0 |         1
   4132156 |         0          1          0 |         1
   4205900 |         0          1          0 |         1
   4216975 |         0          1          0 |         1
   4591120 |         0          1          0 |         1
   4705767 |         0          1          0 |         1
   4916686 |         0          0          1 |         1
   5346818 |         0          0          1 |         1
   5463105 |         0          0          1 |         1
   5490224 |         0          0          1 |         1
   5737037 |         0          0          1 |         1
   5881766 |         0          0          1 |         1
   7364823 |         0          0          1 |         1
   9262078 |         0          0          1 |         1
   9746324 |         0          0          1 |         1
  1.08e+07 |         0          0          1 |         1
  1.14e+07 |         0          0          1 |         1
  1.19e+07 |         0          0          1 |         1
  1.42e+07 |         0          0          1 |         1
  1.76e+07 |         0          0          1 |         1
  2.37e+07 |         0          0          1 |         1
-----------+---------------------------------+----------
     Total |        17         18         15 |        50

We can use the -recode- command to recode variables as well. Here we create another new variable called pop_c2, then recode it in the same manner as we did for pop_c.

> generate pop_c2 = pop
> recode pop_c2 (min/2000000=1) (2000001/4800000=2) (4800001/max=3)
(pop_c2: 50 changes made)

* Summary statistics for the two recoded variables
> summarize pop_c pop_c2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       pop_c |        50        1.96    .8071113          1          3
      pop_c2 |        50        1.96    .8071113          1          3

If you are not happy with the original name of the total population variable, you can change it with the -rename- command. Here we rename pop to pop_t.

> rename pop pop_t

### Label variables and values

Now we have some new variables created or recoded from the original variables. I would like to label them so that other people can get a better idea of what they are. I believe this is good practice even if you are the only person using the dataset — we forget things, and the labels serve as basic "documentation" of the dataset.

* See which variables need to be labeled
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            16                          6 Apr 2014 15:43
 size:         3,500
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g
pop_c2          float   %9.0g
---------------------------------------------------------------------------
Sorted by:

* Label variables
> label variable pop0_17 "Pop, < 18 years"
> label variable pop_c "Categorized population"

* Remember we coded pop_c into three categories: 1, 2 and 3
> table pop_c

----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
        1 |         17
        2 |         18
        3 |         15
----------------------

Let's label them as low, medium and high.

* Label values
* First we define the labels
> label define popcl 1 "low" 2 "medium" 3 "high"

* Then we attach the value label popcl to the variable pop_c
> label values pop_c popcl

* Now the three categories are presented as low, medium and high
> table pop_c

----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
      low |         17
   medium |         18
     high |         15
----------------------

* Remove the duplicated variable pop_c2
> drop pop_c2

* You can also label the dataset itself
> label data "1980 Census data by state: v2"

* See the information of the dataset
> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state: v2
 vars:            15                          6 Apr 2014 15:43
 size:         3,300
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 Pop, < 18 years
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g      popcl      Categorized population
---------------------------------------------------------------------------
Sorted by:

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Stata Basics: Data Import, Use and Export

In Stata, the very first step of analyzing a dataset — or doing anything with it — is opening the dataset in Stata so that it knows which file you are going to work with.
Yes, you can simply double-click a Stata data file ending in .dta to open it, or you can do something fancier to achieve the same goal — like write some code. Okay, there is at least one more reason than fanciness that makes me prefer writing syntax to clicking through things in Stata: I like to have everything I did recorded, so that I can easily reproduce the same work or reuse the scripts when working on similar tasks next time. In this post, I introduce ways of reading in, using and saving Stata and other formats of data files.

### -sysuse-: read in datasets that come with Stata

Several example datasets are installed with Stata. This command reads one of them, census.dta, into memory. You should see the data in your Stata Data Browser after running the following line.

> sysuse census.dta
(1980 Census data by state)

### -describe-: the information of the dataset in memory

> describe

Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900
-------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
-------------------------------------------------------------------------------------
Sorted by:

Tip: run "set more off" to tell Stata not to pause for -more- messages

### -summarize-: summary statistics

> summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       state |         0
      state2 |         0
      region |        50        2.66    1.061574          1          4
         pop |        50     4518149     4715038     401851   2.37e+07
      poplt5 |        50    326277.8    331585.1      35998    1708400
-------------+---------------------------------------------------------
     pop5_17 |        50    945951.6    959372.8      91796    4680558
      pop18p |        50     3245920     3430531     271106   1.73e+07
      pop65p |        50    509502.8    538932.4      11547    2414250
    popurban |        50     3328253     4090178     172735   2.16e+07
      medage |        50       29.54    1.693445       24.2       34.7
-------------+---------------------------------------------------------
       death |        50    39474.26    41742.35       1604     186428
    marriage |        50     47701.4    45130.42       4437     210864
     divorce |        50    23679.44    25094.01       2142     133541

### -clear-: wipe out the data in memory

> clear

### -use-: read in Stata datasets

Most of the time, we use datasets stored either on our machine or on the web. Simply use the -use- command to read the data file into memory.

* read in data files on the web
> use http://www.stata-press.com/data/r14/apple.dta
(Apple trees)

> describe

Contains data from http://www.stata-press.com/data/r14/apple.dta
  obs:            10                          Apple trees
 vars:             2                          16 Jan 2014 11:23
 size:           100
-------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------
treatment       int     %8.0g                 Fertilizer
weight          double  %10.0g                Average weight in grams
-------------------------------------------------------------------------
Sorted by:

### -cd-: change directory

Now let's save this dataset from the web to your machine. You can use the -cd- command to tell Stata where to save the file.

* see the current directory
> pwd
/Users/Username/Desktop/StataBasics

* Change directory (plug in the path on your machine)
> cd YOUR PATH

* Your directory/path may look like this -
* Stata for Windows:
* cd C:\Users\username\data
* Stata for Mac:
* cd /Users/username/data

### -save-: save files

> save apple
file apple.dta saved

* use the replace option to overwrite an existing file
> save apple, replace
file apple.dta saved

### -dir-: display file names

* see what's in your working directory
> dir
* you should see apple.dta listed in your directory

### -insheet- and -outsheet-: import and export .csv files

Oftentimes we work with Stata and other software on the same project; in that case we need to import data files that are not in a Stata format, or export Stata data files to other formats. Here is an example of how to save a dataset as a .csv file and read it back into Stata.

* -outsheet-: save as a .csv file
> outsheet using apple.csv, comma

* -insheet-: read in a .csv file
> insheet using "apple.csv", clear
(2 vars, 10 obs)

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Using Data.gov APIs in R

Data.gov catalogs government data and makes it available on the web; you can find data on a variety of topics such as agriculture, business, climate, education, energy, finance, public safety and many more. It is a good starting point for finding data if you don't already know which particular data source to begin with; however, it can still be time consuming when it comes to actually downloading the raw data you need. Fortunately, Data.gov also includes APIs from across government, which can help with obtaining raw datasets. In this post, I share examples of using Data.gov APIs in R to download data from the web. Many APIs are available from the agencies partnering with Data.gov — in this post, I use College Scorecard data from the Department of Education and Alternative Fuel Stations data from the National Renewable Energy Laboratory as examples.
Before running this script, you'll need to install the httr package if you haven't done so before. Make sure your machine is connected to the Internet, then run install.packages("httr") — you only need to do this once.

### API key

Get your API key at: https://api.data.gov/signup/. Make sure to plug your own API key into the following R code.

### Working directory and R package

Set the working directory on your computer (the path where you want R to read/store files) and load the httr package.

# Set working directory
> setwd('~/DataApiR') # plug in the working directory on your machine

# Load package
> library(httr)

### College Scorecard Data API

The College Scorecard project provides information on college costs and outcomes at the individual postsecondary institution level. Reading the documentation of these data will help with understanding how the dataset is structured. If you're interested in using the data, find variables in their Data Dictionary. In the following example, I download all available data for Emory University, then extract variables from the downloaded data.

# plug in your API key
> myapikey <- "YOUR API KEY"

# the url path to the service
> URL <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# GET(): download all available data for Emory University
> get.data <- GET(URL, query=list(api_key=myapikey, school.name="Emory University"))

# content(): extract the content from the query
> emory.data <- content(get.data)
> class(emory.data) # it's a list object

## [1] "list"

# what's in emory.data
> names(emory.data) # contains two components: metadata, results

## [1] "metadata" "results"

# what's inside the results component
> names(emory.data$results[[1]])


##  [1] "2008"     "2009"     "2006"     "ope6_id"  "2007"     "2004"
##  [7] "2013"     "2005"     "location" "2014"     "2002"     "2003"
## [13] "id"       "1996"     "1997"     "school"   "1998"     "2012"
## [19] "2011"     "2010"     "ope8_id"  "1999"     "2001"     "2000"
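Under the hood, GET() essentially percent-encodes the query list onto the base URL before sending the request. Here is a quick Python sketch of the request URL that gets built; DEMO_KEY is a placeholder, so substitute your own api.data.gov key if you want to fetch anything real.

```python
from urllib.parse import urlencode

# placeholder key -- substitute your own api.data.gov key
myapikey = "DEMO_KEY"
URL = "https://api.data.gov/ed/collegescorecard/v1/schools?"

# same query list GET() is given above, percent-encoded onto the base URL
query = {"api_key": myapikey, "school.name": "Emory University"}
request_url = URL + urlencode(query)
print(request_url)
```

Pasting the printed URL into a browser (with a valid key) returns the same JSON that content() parses into a list on the R side.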


# see available dev-categories for 2013 data
> names(emory.data$results[[1]]$`2013`)


## [1] "earnings"   "academics"  "student"    "admissions" "repayment"
## [6] "aid"        "cost"       "completion"

# available variables under the cost category for 2013 data
# available variables under the cost category for 2013 data
> names(emory.data$results[[1]]$`2013`$cost)

## [1] "title_iv"      "avg_net_price" "attendance"    "tuition"
## [5] "net_price"

# elements of the tuition variable
> names(emory.data$results[[1]]$`2013`$cost$tuition)

## [1] "out_of_state" "in_state"     "program_year"

Hopefully you now get an idea of how the dataset is structured by seeing the levels of the list. Let’s try to extract variables from the downloaded data. To run the following code, you need to install the magrittr package.

# load package
> library(magrittr)

# subset list for annual data only
> emory.ann <- emory.data$results[[1]][c(as.character(1996:2013))]
> names(emory.ann)


##  [1] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [11] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"

# extract enrollment of undergraduate degree-seeking students for each year
> s.size <- emory.ann %>%
sapply(function(x) x$student$size) %>%
unlist()

# extract percentage of first-generation students for each year
> s.fg <- emory.ann %>%
sapply(function(x) x$student$share_firstgeneration) %>%
unlist()

# combine the two variables into a data frame
> emory.s <- data.frame(s.size, s.fg)

# see the first few rows of the data frame


##      s.size      s.fg
## 1996   6027 0.1405167
## 1997   5996 0.1400849
## 1998   6316 0.1480363
## 1999   6215 0.1534494
## 2001   6265 0.1474711
## 2002   6187 0.1468254

# create a variable of year from the row number
> emory.s$year <- rownames(emory.s)

# create a variable s.fg.n: number of first-generation students
> emory.s$s.fg.n <- round(emory.s$s.size*emory.s$s.fg)

# save the data as a .csv file
> write.csv(emory.s, file="emory.s.csv", row.names = F)
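The sapply/unlist steps above just walk the list of annual records and pull one nested field out of each. For readers who think in another language, the same extract-and-combine pattern is sketched below in Python on a two-year mock of the Emory list; emory_ann here is a hand-built stand-in using the 1996–1997 values shown in the data frame above, not the actual API response.

```python
# hand-built mock of two annual entries from the College Scorecard response
emory_ann = {
    "1996": {"student": {"size": 6027, "share_firstgeneration": 0.1405167}},
    "1997": {"student": {"size": 5996, "share_firstgeneration": 0.1400849}},
}

# like sapply(function(x) x$student$size): one value per year
s_size = {yr: rec["student"]["size"] for yr, rec in emory_ann.items()}
s_fg = {yr: rec["student"]["share_firstgeneration"] for yr, rec in emory_ann.items()}

# like round(emory.s$s.size * emory.s$s.fg): number of first-generation students
s_fg_n = {yr: round(s_size[yr] * s_fg[yr]) for yr in emory_ann}
print(s_fg_n)  # {'1996': 847, '1997': 840}
```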



Just to play a bit more with the data we got, we can plot the variables we extracted and created. Install the ggplot2 package to run the following code.

# load package
> library(ggplot2)

# Line graph of total enrollment and first generation student number
> ggplot(emory.s, aes(year)) +
geom_line(aes(y = s.size, colour = "s.size", group=1)) +
geom_line(aes(y = s.fg.n, colour = "s.fg.n", group=1)) +
xlab("Year") + ylab("Number") + # Set axis labels
ggtitle("Enrollment: Emory University") + # set title
scale_colour_discrete(name="Enrollment",  # modify legend
labels=c("First Generation", "Total")) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) # adjust x-axis text position



### The R Package for College Scorecard Data API

Many data APIs can be accessed via R packages, and College Scorecard is one of them. I find the rscorecard wrapper fairly easy to use for downloading purposes. Check the package’s GitHub page for more details. Here is a simple example of using the package to obtain College Scorecard records that meet the conditions of a query.

> library(rscorecard)

# extract Virginia institutions' 2013 data with three variables
> df <- sc_init() %>%
sc_filter(stabbr == 'VA') %>%
sc_select(unitid, instnm, stabbr) %>%
sc_year(2013) %>%
sc_get()

# see first few cases
> head(df)


##     unitid                                         instnm stabbr year
## 1   481526               The Chrysm Insitute of Esthetics     VA 2013
## 2 24893410 ECPI University-Culinary Institute of Virginia     VA 2013
## 3   475194         Miller-Motte Technical College-Roanoke     VA 2013
## 4   484765                 University of Phoenix-Virginia     VA 2013
## 5   459268                 South UniversityVirginia Beach     VA 2013
## 6 23368402           Strayer University-Woodbridge Campus     VA 2013



### Renewable Energy Data API

Let’s use another institution’s data API as an example. The National Renewable Energy Laboratory offers APIs for users to access energy data in a few categories; see its documentation. In the following example, I query its database of alternative fuel stations in the Transportation category. Again, reading the data documentation should help with constructing the query.

# load packages
> library(httr)
> library(magrittr)
> library(ggplot2)


# Get all stations data in Virginia, remember to plug in your own API key
> get.afs <- GET("http://api.data.gov/nrel/alt-fuel-stations/v1.json?api_key=[YOUR API KEY]&state=VA") 

# extract content from the query
> afs <- content(get.afs)

> names(afs)


## [1] "station_locator_url" "total_results"       "station_counts"
## [4] "fuel_stations"


# how many stations in the downloaded data
> afs$total_results

## [1] 526

# see variables/fields under fuel_stations
> names(afs$fuel_stations[[1]])


##  [1] "access_days_time"        "cards_accepted"
##  [3] "date_last_confirmed"     "expected_date"
##  [5] "fuel_type_code"          "id"
##  [7] "groups_with_access_code" "open_date"
##  [9] "owner_type_code"         "status_code"
## [11] "station_name"            "station_phone"
## [13] "updated_at"              "geocode_status"
## [15] "latitude"                "longitude"
## [17] "city"                    "intersection_directions"
## [19] "plus4"                   "state"
## [23] "bd_blends"               "e85_blender_pump"
## [25] "ev_connector_types"      "ev_dc_fast_num"
## [27] "ev_level1_evse_num"      "ev_level2_evse_num"
## [29] "ev_network"              "ev_network_web"
## [33] "lpg_primary"             "ng_fill_type_code"
## [35] "ng_psi"                  "ng_vehicle_class"


# extract vars: station_name, fuel_type_code
> fsname <- afs$fuel_stations %>% sapply(function(x) x$station_name) %>%
unlist()
> ftcode <- afs$fuel_stations %>% sapply(function(x) x$fuel_type_code) %>%
unlist()

# combine the two vars in a data frame
> afsdf <- data.frame(fsname, ftcode)

# see the first few rows
> head(afsdf)


##                              fsname ftcode
## 1   Virginia Natural Gas - Lance Rd    CNG
## 2 Virginia Natural Gas - VNG Office    CNG
## 3                      Dixie Gas Co    LPG
## 4                            U-Haul    LPG
## 5                  Suburban Propane    LPG
## 6                        Revere Gas    LPG


# plot ftcode: type of alternative fuel the station provides
> ggplot(afsdf, aes(x=ftcode)) +
geom_bar(stat="count", fill="steelblue") +
theme_minimal()
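The bar chart above is simply a count of stations per fuel type code. That tally can be sketched in Python with a mock of afs$fuel_stations built from the six sample rows printed above; only the two fields we extracted are included.

```python
from collections import Counter

# mock fuel_stations list, shaped like the rows of afsdf above
fuel_stations = [
    {"station_name": "Virginia Natural Gas - Lance Rd", "fuel_type_code": "CNG"},
    {"station_name": "Virginia Natural Gas - VNG Office", "fuel_type_code": "CNG"},
    {"station_name": "Dixie Gas Co", "fuel_type_code": "LPG"},
    {"station_name": "U-Haul", "fuel_type_code": "LPG"},
    {"station_name": "Suburban Propane", "fuel_type_code": "LPG"},
    {"station_name": "Revere Gas", "fuel_type_code": "LPG"},
]

# stations per fuel type -- the quantity geom_bar(stat="count") displays
counts = Counter(s["fuel_type_code"] for s in fuel_stations)
print(counts)  # Counter({'LPG': 4, 'CNG': 2})
```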



References
18F/open-data-maker. (2016). GitHub. Retrieved 20 September 2016, from https://github.com/18F/open-data-maker/blob/api-docs/API.md.
Boehmke, B. (2016). Scraping via APIs. http://bradleyboehmke.github.io. Retrieved 20 September 2016, from http://bradleyboehmke.github.io/2016/01/scraping-via-apis.html.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

## Getting UN Comtrade Data with R

The UN Comtrade Database provides free access to global trade data. You can get data by using their data extraction interface or their API. In this post, we share some possible ways of downloading, preparing and plotting trade data in R.

Before running this script, you’ll need to install the rjson package if you haven’t done so before. Make sure your machine is connected to the Internet, and run install.packages("rjson") – you only need to do this once.

### Get country/area codes

Now we can load the package, then use the fromJSON function to read the country/area codes into R. Typically, the codes are needed to identify countries and extract the corresponding data.

# load rjson package
> library(rjson)

# the url of the country/area codes in .json
> string <- "http://comtrade.un.org/data/cache/partnerAreas.json"

# fromJSON(): read the .json object and convert to R object
> reporters <- fromJSON(file=string)

# convert to data frame
> reporters <- as.data.frame(t(sapply(reporters$results, rbind)))

### Function to extract data

The get.Comtrade function below builds a query URL from the API parameters (reporting area r, partner area p, time period ps, trade flow rg, commodity code cc, and so on), downloads the data, and returns it along with the validation information; when the format is .json, it also reshapes the downloaded list into a data frame. It follows the example R script provided in the UN Comtrade API documentation.

> get.Comtrade <- function(url="http://comtrade.un.org/api/get?",
maxrec=50000, type="C", freq="A", px="HS", ps="now",
r, p, rg="all", cc="TOTAL", fmt="json") {
string <- paste(url,
"max=", maxrec, "&",   # maximum number of records returned
"type=", type, "&",    # type of trade (C=commodities)
"freq=", freq, "&",    # frequency (A=annual)
"px=", px, "&",        # classification
"ps=", ps, "&",        # time period
"r=", r, "&",          # reporting area
"p=", p, "&",          # partner country
"rg=", rg, "&",        # trade flow
"cc=", cc, "&",        # commodity code
"fmt=", fmt,           # output format
sep="")
if(fmt == "csv") {
raw.data <- read.csv(string, header=TRUE)
return(list(validation=NULL, data=raw.data))
} else {
if(fmt == "json") {
raw.data <- fromJSON(file=string)
data <- raw.data$dataset
validation <- unlist(raw.data$validation, recursive=TRUE)
ndata <- NULL
if(length(data) > 0) {
var.names <- names(data[[1]])
data <- as.data.frame(t(sapply(data, rbind)))
ndata <- NULL
for(i in 1:ncol(data)) {
data[sapply(data[,i], is.null), i] <- NA
ndata <- cbind(ndata, unlist(data[,i]))
}
ndata <- as.data.frame(ndata)
colnames(ndata) <- var.names
}
return(list(validation=validation, data=ndata))
}
}
}

### Extract total trade flows data

Now let’s try using the function get.Comtrade defined above to download data. We specify one reporter and two partners; all other parameters take the defaults set in the function.

# extract latest annual HS total trade flows in .json format (default parameters),
# specifying the codes of reporter (USA=842) and partners (France=251, UK=826)

# call the download function
> dt1 <- get.Comtrade(r="842", p="251,826")

# extract data in .csv format
> dt2 <- get.Comtrade(r="842", p="251,826", fmt="csv")
> class(dt2)

## [1] "list"

# convert to data frame
> dt2df <- as.data.frame(do.call(rbind, dt2))
> head(dt2df, 2)

## Classification Year Period Period.Desc. Aggregate.Level
## data.1 H4 2015 2015 2015 0
## data.2 H4 2015 2015 2015 0
## Is.Leaf.Code Trade.Flow.Code Trade.Flow Reporter.Code Reporter
## data.1 0 1 Import 842 USA
## data.2 0 2 Export 842 USA
## Reporter.ISO Partner.Code Partner Partner.ISO X2nd.Partner.Code
## data.1 USA 251 France FRA NA
## data.2 USA 251 France FRA NA
## X2nd.Partner X2nd.Partner.ISO Customs.Proc..Code Customs
## data.1 NA NA NA NA
## data.2 NA NA NA NA
## Mode.of.Transport.Code Mode.of.Transport Commodity.Code
## data.1 NA NA TOTAL
## data.2 NA NA TOTAL
## Commodity Qty.Unit.Code Qty.Unit Qty Alt.Qty.Unit.Code
## data.1 All Commodities 1 No Quantity NA NA
## data.2 All Commodities 1 No Quantity NA NA
## Alt.Qty.Unit Alt.Qty Netweight..kg. Gross.weight..kg.
## data.1 NA NA NA NA
## data.2 NA NA NA NA
## Trade.Value..US.. CIF.Trade.Value..US.. FOB.Trade.Value..US.. Flag
## data.1 48692329891 NA NA 0
## data.2 31454467959 NA NA 0

### Extract cereal export data

Here we extract cereal export data for three countries. This list contains the HS codes defining commodities.

# prepared cereal food exports from Canada, Mexico and USA, 2010-2014
> dt3 <- get.Comtrade(r="124,484,842", p="0", ps="2010,2011,2012,2013,2014",
rg=2, cc="1904", fmt="csv")
> dt3df <- as.data.frame(do.call(rbind, dt3))

### Plot the data

Here we use the R package ggplot2 to plot the cereal export values of Canada, Mexico and the USA from 2010 to 2014. Make sure to install the package before using the ggplot function. In this example, we plot year on the x-axis and trade value on the y-axis, color the lines by country, then modify the axis labels and set the title of the plot.

# round trade value to millions and save as var. TradeValue
> dt3df$TradeValue <- round(dt3df$Trade.Value..US../1e6, 1)

> library(ggplot2)

# plot cereal trade value of the three countries
> ggplot(dt3df, aes(x=Year, y=TradeValue, group=Reporter, colour=Reporter)) +
geom_line() +
xlab("Year") + ylab("Trade Value (M)") + # set axis labels
ggtitle("Cereal Export")  # set title
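Stepping back, the heart of get.Comtrade is the paste() call that strings the API parameters into a query URL. Here is that construction sketched in Python, with the same parameter names and defaults the function uses; comtrade_url is a hypothetical helper written for illustration, not part of any package.

```python
def comtrade_url(r, p, base="http://comtrade.un.org/api/get?",
                 maxrec=50000, type="C", freq="A", px="HS", ps="now",
                 rg="all", cc="TOTAL", fmt="json"):
    """Mirror get.Comtrade's paste(): join the parameters into a query URL."""
    params = [("max", maxrec), ("type", type), ("freq", freq), ("px", px),
              ("ps", ps), ("r", r), ("p", p), ("rg", rg), ("cc", cc),
              ("fmt", fmt)]
    return base + "&".join(f"{k}={v}" for k, v in params)

# the dt1 query above: USA (842) reporting, partners France (251) and UK (826)
url = comtrade_url(r="842", p="251,826")
print(url)
```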



References