R

Working with dates and time in R using the lubridate package

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on Feb 12, 2008 and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap […]

The Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as “is your data normal?” A “no” branch off this question will recommend a Wilcoxon test if you’re comparing two groups of continuous measures. So what is this […]

Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient: $$ \frac{n!}{2!(n – 2)!} = \binom {n}{2}$$ Using the R […]

Using Data.gov APIs in R

Data.gov catalogs government data and makes them available on the web; you can find data in a variety of topics such as agriculture, business, climate, education, energy, finance, public safty and many more. It is a good start point for finding data if you don’t already know which particular data source to begin your search, […]

Getting UN Comtrade Data with R

The UN Comtrade Database provides free access to global trade data. You can get data by using their data extraction interface or using their API to do so. In this post, we share some possible ways of downloading, preparing and plotting trade data in R. Before running this script, you’ll need to install the rjson […]

Look People are Going to Think… (Debate Rhetoric Redux)

I’m still looking at the rhetoric from the presidential debates, this time focusing on the first general election debate between Hillary Clinton and Donald Trump. I turned the data frame from last week into a corpus, did some pre-processing with the tm package (remove capitalization, punctuation, stopwords, stemming the words and then completing stems with […]

Using Census Data API with R

Datasets provided by the US Census Bureau, such as Decennial Census and American Community Survey (ACS), are widely used by many researchers, among others. You can certainly find and download census data from the Census Bureau website, from our licensed data source Social Explorer, or other free sources such as IPUMS-USA, then load the data […]

Debate Prep!

I’m teaching a Text as Data short course (using R) right now, and as a card-carrying political scientist, I couldn’t resist using the ongoing campaign as an example (this was, in party, a way of handling my own anxiety about last Monday’s debate — this is what I was doing while watching). So here goes… […]

A tidyr Tutorial

The tidyr package by Hadley Wickham centers on two functions: gather and spread. If you have struggled to understand what exactly these two functions do, this tutorial is for you. To begin we need to wrap our heads around the idea of “key-value pairs”. The help pages for gather and spread use this terminology to […]

Getting Started with Factor Analysis

Take a look at the following correlation matrix for Olympic decathlon data calculated from 280 scores from 1960 through 2004 (Johnson and Wichern, p. 499): 100m LJ SP HJ 400m 100mH DS PV JV 1500m 100m 1.0000 0.6386 0.4752 0.3227 0.5520 0.3262 0.3509 0.4008 0.1821 -0.0352 LJ 0.6386 1.0000 0.4953 0.5668 0.4706 0.3520 0.3998 0.5167 […]