Using Data.gov APIs in R

Data.gov catalogs government data and makes it available on the web; you can find data on a variety of topics such as agriculture, business, climate, education, energy, finance, public safety and many more. It is a good starting point for finding data if you don’t already know which particular data source to begin your search with; however, it can still be time-consuming to actually download the raw data you need. Fortunately, Data.gov also includes APIs from across government, which can help with obtaining raw datasets. In this post, I share examples of using Data.gov APIs in R to download data from the web. Many APIs are available from the agencies partnering with Data.gov – in this post, I use College Scorecard data from the Department of Education and Alternative Fuel Stations data from the National Renewable Energy Laboratory as examples.

Before running this script, you’ll need to install the httr package if you haven’t done so already. Make sure your machine is connected to the Internet, then run install.packages("httr") – you only need to do this once.
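
Later sections of this post also use the magrittr, ggplot2 and rscorecard packages, so you can install everything in one go:

# one-time installation of all packages used in this post
> install.packages(c("httr", "magrittr", "ggplot2", "rscorecard"))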

API key

Get your API key at https://api.data.gov/signup/. Make sure to plug your own API key into the following R code.
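
You may not want to hard-code the key in scripts you share. One alternative is to store it in an environment variable (for example, in your ~/.Renviron file) and read it with Sys.getenv(); the variable name DATAGOV_API_KEY below is just a made-up example.

# read the API key from an environment variable instead of typing it in the script
> myapikey <- Sys.getenv("DATAGOV_API_KEY")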

Working directory and R package

Set the working directory on your computer (the path to where you want R to read/store files) and load the httr package.

# Set working directory 
> setwd('~/DataApiR') # plug in the working directory on your machine

# Load package
> library(httr)

 

College Scorecard Data API

The College Scorecard project provides information on college costs and outcomes at the individual postsecondary institution level. Reading the documentation of this data will help you understand how the dataset is structured. If you’re interested in using this data, find variables in their Data Dictionary. In the following example, I download all available data for Emory University, then extract variables from the downloaded data.

# plug in your API key
> myapikey <- "YOUR API KEY"

# the url path to the service
> URL <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# GET(): download all available data for Emory University
> get.data <- GET(URL, query=list(api_key=myapikey,
                                  school.name="Emory University"))
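
# optional sanity check: stop_for_status() raises an informative error if
# the request failed, and status_code() should print 200 on success
> stop_for_status(get.data)
> status_code(get.data)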

# content(): extract the content from the query
> emory.data <- content(get.data) 
> class(emory.data) # it's a list object

## [1] "list"

# what's in emory.data
> names(emory.data) # contains two components: metadata, results

## [1] "metadata" "results"

# what's inside the results component
> names(emory.data$results[[1]])

##  [1] "2008"     "2009"     "2006"     "ope6_id"  "2007"     "2004"    
##  [7] "2013"     "2005"     "location" "2014"     "2002"     "2003"    
## [13] "id"       "1996"     "1997"     "school"   "1998"     "2012"    
## [19] "2011"     "2010"     "ope8_id"  "1999"     "2001"     "2000"

# see available dev-categories for 2013 data
> names(emory.data$results[[1]]$`2013`)

## [1] "earnings"   "academics"  "student"    "admissions" "repayment" 
## [6] "aid"        "cost"       "completion"

# available variables under the cost category for 2013 data
> names(emory.data$results[[1]]$`2013`$cost)

## [1] "title_iv"      "avg_net_price" "attendance"    "tuition"      
## [5] "net_price"

# elements of the tuition variable
> names(emory.data$results[[1]]$`2013`$cost$tuition)

## [1] "out_of_state" "in_state"     "program_year" 
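
To pull an actual value rather than just element names, index all the way down the list. For instance, 2013 out-of-state tuition:

# a single value from the nested list: 2013 out-of-state tuition
> emory.data$results[[1]]$`2013`$cost$tuition$out_of_state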

Hopefully seeing the levels of the list gives you an idea of how the dataset is structured. Let’s try to extract variables from the downloaded data. To run the following code, you need to install the magrittr package.

# load package
> library(magrittr)

# subset list for annual data only
> emory.ann <- emory.data$results[[1]][c(as.character(1996:2013))]
> names(emory.ann)

##  [1] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [11] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"

# extract enrollment of undergraduate degree-seeking students for each year
> s.size <- emory.ann %>%
    sapply(function(x) x$student$size) %>% 
    unlist()

# extract percentage of first-generation students for each year
> s.fg <- emory.ann %>%
    sapply(function(x) x$student$share_firstgeneration) %>% 
    unlist()

# combine the two variables into a data frame
> emory.s <- data.frame(s.size, s.fg)

# see the first few rows of the data frame
> head(emory.s) 

##      s.size      s.fg
## 1996   6027 0.1405167
## 1997   5996 0.1400849
## 1998   6316 0.1480363
## 1999   6215 0.1534494
## 2001   6265 0.1474711
## 2002   6187 0.1468254

# note: the 2000 row is missing above, presumably because its values were
# NULL in the downloaded data and unlist() drops NULL elements

# create a year variable from the row names
> emory.s$year <- rownames(emory.s)

# create a variable s.fg.n: number of first-generation students
> emory.s$s.fg.n <- round(emory.s$s.size*emory.s$s.fg)

# save the data as a .csv file
> write.csv(emory.s, file="emory.s.csv", row.names = F)

Just to play a bit more with the data we got, we can plot the variables we extracted and created. Install the ggplot2 package to run the following code.

# load package
> library(ggplot2)

# Line graph of total enrollment and first generation student number
> ggplot(emory.s, aes(year)) + 
    geom_line(aes(y = s.size, colour = "s.size", group=1)) + 
    geom_line(aes(y = s.fg.n, colour = "s.fg.n", group=1)) +
    xlab("Year") + ylab("Number") + # Set axis labels
    ggtitle("Enrollment: Emory University") + # set title      
    scale_colour_discrete(name="Enrollment",  # modify legend
                          labels=c("First Generation", "Total")) +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) # adjust x-axis text position
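
If you want to keep the figure, ggplot2’s ggsave() writes the most recent plot to a file whose format is inferred from the extension (the file name here is just an example):

# save the last plot to disk
> ggsave("emory_enrollment.png")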
 

[Figure: Enrollment: Emory University – line graph of total enrollment and first-generation student numbers by year]

The R Package for College Scorecard Data API

Many data APIs can be accessed via R packages, and the College Scorecard API is one of them. I find the rscorecard wrapper fairly easy to use for downloading purposes. Check the package’s GitHub page for more details. Here is a simple example of using the package to obtain College Scorecard records that meet the conditions of the query.

> library(rscorecard)
> sc_key('YOUR API KEY')

# extract Virginia institutions' 2013 data with three variables
> df <- sc_init() %>% 
    sc_filter(stabbr == 'VA') %>% 
    sc_select(unitid, instnm, stabbr) %>% 
    sc_year(2013) %>% 
    sc_get()

# see first few cases
> head(df)

##     unitid                                         instnm stabbr year
## 1   481526               The Chrysm Insitute of Esthetics     VA 2013
## 2 24893410 ECPI University-Culinary Institute of Virginia     VA 2013
## 3   475194         Miller-Motte Technical College-Roanoke     VA 2013
## 4   484765                 University of Phoenix-Virginia     VA 2013
## 5   459268                 South UniversityVirginia Beach     VA 2013
## 6 23368402           Strayer University-Woodbridge Campus     VA 2013
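
sc_filter() accepts multiple conditions at once. As a sketch, assuming the Data Dictionary’s coding where control == 1 denotes public institutions, you could restrict the query further:

# public Virginia institutions only (control coding per the Data Dictionary)
> df.pub <- sc_init() %>% 
    sc_filter(stabbr == 'VA', control == 1) %>% 
    sc_select(unitid, instnm, stabbr) %>% 
    sc_year(2013) %>% 
    sc_get()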



Renewable Energy Data API

Let’s use another institution’s data API as an example. The National Renewable Energy Laboratory offers APIs for users to access energy data in a few categories; see its documentation. In the following example, I query their database of alternative fuel stations in the Transportation category. Again, reading the data documentation should help with constructing the data query.

# load packages
> library(httr)
> library(magrittr)
> library(ggplot2)

# Get data for all alternative fuel stations in Virginia; remember to plug in your own API key
> get.afs <- GET("https://api.data.gov/nrel/alt-fuel-stations/v1.json",
                 query=list(api_key="YOUR API KEY", state="VA"))

# extract content from the query
> afs <- content(get.afs)

# see what's available in the downloaded data
> names(afs)

## [1] "station_locator_url" "total_results"       "station_counts"     
## [4] "fuel_stations"

# how many stations in the downloaded data
> afs$total_results

## [1] 526
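
# the fuel_stations list should have one element per station; its length
# should match total_results if the response wasn't truncated
> length(afs$fuel_stations)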

# see variables/fields under fuel_stations
> names(afs$fuel_stations[[1]])

##  [1] "access_days_time"        "cards_accepted"         
##  [3] "date_last_confirmed"     "expected_date"          
##  [5] "fuel_type_code"          "id"                     
##  [7] "groups_with_access_code" "open_date"              
##  [9] "owner_type_code"         "status_code"            
## [11] "station_name"            "station_phone"          
## [13] "updated_at"              "geocode_status"         
## [15] "latitude"                "longitude"              
## [17] "city"                    "intersection_directions"
## [19] "plus4"                   "state"                  
## [21] "street_address"          "zip"                    
## [23] "bd_blends"               "e85_blender_pump"       
## [25] "ev_connector_types"      "ev_dc_fast_num"         
## [27] "ev_level1_evse_num"      "ev_level2_evse_num"     
## [29] "ev_network"              "ev_network_web"         
## [31] "ev_other_evse"           "hy_status_link"         
## [33] "lpg_primary"             "ng_fill_type_code"      
## [35] "ng_psi"                  "ng_vehicle_class"

# extract vars: station_name, fuel_type_code
> fsname <- afs$fuel_stations %>%
    sapply(function(x) x$station_name) %>%
    unlist()
> ftcode <- afs$fuel_stations %>%
    sapply(function(x) x$fuel_type_code) %>%
    unlist()

# combine the two vars in a data frame
> afsdf <- data.frame(fsname, ftcode)

# see the first few rows
> head(afsdf)

##                              fsname ftcode
## 1   Virginia Natural Gas - Lance Rd    CNG
## 2 Virginia Natural Gas - VNG Office    CNG
## 3                      Dixie Gas Co    LPG
## 4                            U-Haul    LPG
## 5                  Suburban Propane    LPG
## 6                        Revere Gas    LPG
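
One caveat: sapply() combined with unlist() silently drops NULL elements, so if some stations were missing a field, the two vectors above could end up with different lengths. A small hypothetical helper that substitutes NA for missing fields avoids that:

# pull_field() is a made-up convenience function: it extracts one field from
# every station and fills in NA where the field is NULL
> pull_field <- function(stations, field) {
    sapply(stations, function(x) if (is.null(x[[field]])) NA else x[[field]])
  }
> city <- pull_field(afs$fuel_stations, "city")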

# plot ftcode: type of alternative fuel the station provides
> ggplot(afsdf, aes(x=ftcode)) +
    geom_bar(stat="count", fill="steelblue") +
    theme_minimal()
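
For a quick numeric summary instead of a plot, a simple frequency table does the same job:

# count stations by fuel type code
> table(afsdf$ftcode)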



[Figure: bar chart of Virginia alternative fuel stations by fuel type code]

References
18F/open-data-maker. (2016). GitHub. Retrieved 20 September 2016, from https://github.com/18F/open-data-maker/blob/api-docs/API.md.
Boehmke, B. (2016). Scraping via APIs. http://bradleyboehmke.github.io. Retrieved 20 September 2016, from http://bradleyboehmke.github.io/2016/01/scraping-via-apis.html.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library