If you’re wondering what exactly the purrr package does, then this blog post is for you.
Before we get started, we should mention the Iteration chapter in R for Data Science by Garrett Grolemund and Hadley Wickham. We think this is the most thorough introduction to the purrr package currently available (at least at the time of this writing). Wickham is one of the authors of the purrr package, and he spends a good deal of the chapter clearly explaining how it works. Good stuff, recommended reading.
The purpose of this article is to provide a short introduction to purrr, focusing on just a handful of functions. We use some real world data and replicate what purrr does in base R so we have a better understanding of what’s going on.
We visited Yahoo Finance on 13 April 2017 and downloaded about three weeks of historical data for three companies: Boeing, Johnson & Johnson and IBM. The following R code will download and unzip the data in your current working directory if you wish to follow along.
    URL <- "http://static.lib.virginia.edu/statlab/materials/data/stocks.zip"
    download.file(url = URL, destfile = basename(URL))
    unzip(basename(URL))
We have three CSV files. In the spirit of being efficient we would like to import these files into R using as little code as possible (as opposed to calling read.csv three different times).
Using base R functions, we could put all the file names into a vector and then apply the read.csv function to each file. This results in a list of three data frames. When done we could name each list element using the names function and our vector of file names.
    # get all files ending in csv
    files <- list.files(pattern = "csv$")
    # read in data
    dat <- lapply(files, read.csv)
    # remove file extension
    names(dat) <- gsub("\\.csv", "", files)
Here is how we do the same using the map function from the purrr package.
    install.packages("purrr") # if package not already installed
    library(purrr)
    dat2 <- map(files, read.csv)
    names(dat2) <- gsub("\\.csv", "", files)
So we see that map is like lapply. It takes a vector as input and applies a function to each element of the vector. map is one of the star functions in the purrr package.
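To see the correspondence without downloading anything, here is a minimal sketch on a small made-up list of data frames. The toy list and its values are assumptions for illustration only, not the stock data.

```r
library(purrr)

# Toy stand-in for the list of stock data frames (values are made up)
toy <- list(
  A = data.frame(Open = c(1, 2, 3), Close = c(2, 3, 4)),
  B = data.frame(Open = c(10, 20, 30), Close = c(20, 30, 40))
)

# Apply nrow to each element, once with base R and once with purrr
rows_base  <- lapply(toy, nrow)
rows_purrr <- map(toy, nrow)

# Compare the two results: both are named lists with one element per input
same_result <- identical(rows_base, rows_purrr)
print(same_result)
```

Both calls take the list first and the function second, so swapping one for the other is usually a one-word change.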
Let’s say we want to find the mean Open price for each stock. Here is a base R way using lapply and an anonymous function:
    lapply(dat, function(x) mean(x$Open))
    $BA
    [1] 177.8287
    $IBM
    [1] 174.3617
    $JNJ
    [1] 125.8409
We can do the same with map.
    map(dat, function(x) mean(x$Open))
    $BA
    [1] 177.8287
    $IBM
    [1] 174.3617
    $JNJ
    [1] 125.8409
map allows us to bypass the function keyword. Using a tilde (~) in place of function and a dot (.) in place of x, we can do this:
    map(dat, ~mean(.$Open))
Furthermore, purrr provides several versions of map that allow you to specify the structure of your output. For example, if we want a vector instead of a list we can use the map_dbl function. The “_dbl” indicates that it returns a vector of type double (i.e., numbers with decimals).
    map_dbl(dat, ~mean(.$Open))
          BA      IBM      JNJ
    177.8287 174.3617 125.8409
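For comparison, the closest base R counterpart to map_dbl is probably vapply with a numeric(1) template. The sketch below uses a made-up toy list (an assumption for illustration, not the stock data) to show the two side by side.

```r
library(purrr)

# Toy stand-in for the list of stock data frames (values are made up)
toy <- list(
  A = data.frame(Open = c(1, 2, 3)),
  B = data.frame(Open = c(10, 20, 30))
)

# purrr: formula shorthand, result must be a double vector
means_purrr <- map_dbl(toy, ~ mean(.$Open))

# base R: vapply with an explicit template for the return type
means_base <- vapply(toy, function(x) mean(x$Open), numeric(1))

print(means_purrr)
print(means_base)
```

Like vapply, map_dbl raises an error if the function does not return a single number for some element, which is often preferable to silently getting a list back.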
Now let’s say that we want to extract each stock’s Open price data. In other words, we want to go into each data frame in our list and pull out the Open column. We can do that with lapply as follows:
    lapply(dat, function(x) x$Open)
    $BA
     [1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99
    [18] 180.10 178.31 179.82 179.00 178.54 177.16
    $IBM
     [1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01
    [18] 175.65 176.29 178.46 175.71 176.18 177.85
    $JNJ
     [1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38
    [18] 128.04 128.45 128.44 127.05 126.86 125.83
map is a little easier. We just provide the name of the column we want to extract.
    map(dat, "Open")
    $BA
     [1] 178.25 177.50 179.00 178.39 177.56 179.00 176.88 177.08 178.02 177.25 177.40 176.29 174.37 176.85 177.34 175.96 179.99
    [18] 180.10 178.31 179.82 179.00 178.54 177.16
    $IBM
     [1] 171.04 170.65 172.53 172.08 173.47 174.70 173.52 173.82 173.98 173.86 174.30 173.94 172.69 175.12 174.43 174.04 176.01
    [18] 175.65 176.29 178.46 175.71 176.18 177.85
    $JNJ
     [1] 124.54 124.26 124.87 125.12 124.85 124.72 124.51 124.73 124.11 124.74 125.05 125.62 125.16 125.86 126.10 127.05 128.38
    [18] 128.04 128.45 128.44 127.05 126.86 125.83
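Passing a string to map behaves much like calling the `[[` extractor on each element. A minimal sketch, again on a made-up toy list (an assumption, not the stock data):

```r
library(purrr)

# Toy stand-in for the list of stock data frames (values are made up)
toy <- list(
  A = data.frame(Open = c(1, 2, 3), Close = c(2, 3, 4)),
  B = data.frame(Open = c(10, 20, 30), Close = c(20, 30, 40))
)

# purrr: a string is treated as the name of the element to extract
open_purrr <- map(toy, "Open")

# base R: the same extraction using the `[[` function directly
open_base <- lapply(toy, `[[`, "Open")

print(open_purrr)
```

The typed variants presumably combine with this shorthand as well, but that depends on each extracted element having length one, which is not the case for a whole column.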
We often want to plot financial data. In this case we may want to plot the Closing price for each stock and look for trends. We can do this with the base R function mapply. First we create a vector of stock names for plot labeling. Next we set up one row of three plotting regions. Then we use mapply to create the plots. The “m” in mapply stands for “multivariate”: it applies a function over multiple arguments in parallel. In this case we have two arguments: the Closing price and the stock name. Notice that mapply requires that the function come first and then the arguments.
    stocks <- sub("\\.csv", "", files)
    par(mfrow = c(1, 3))
    mapply(function(x, y) plot(x$Close, type = "l", main = y), x = dat, y = stocks)
The purrr equivalent is map2. Again we can substitute a tilde (~) for function, but now we need to use .x and .y to identify the arguments. However, the ordering is the same as map: data come first and then the function.
    map2(dat, stocks, ~plot(.x$Close, type = "l", main = .y))
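The pairing of .x and .y is easier to inspect with a function that returns text instead of drawing a plot. The sketch below uses a made-up toy list and label vector (both assumptions for illustration) and map2_chr, the character-returning variant of map2.

```r
library(purrr)

# Toy stand-ins for the stock list and the vector of stock names (made up)
toy <- list(
  A = data.frame(Close = c(2, 3, 4)),
  B = data.frame(Close = c(20, 30, 40))
)
toy_names <- c("Stock A", "Stock B")

# purrr: data first, function last; .x is an element of toy, .y of toy_names
labels_purrr <- map2_chr(toy, toy_names,
                         ~ sprintf("%s: mean close %.1f", .y, mean(.x$Close)))

# base R: function first, then the arguments to iterate over in parallel
labels_base <- mapply(function(x, y) sprintf("%s: mean close %.1f", y, mean(x$Close)),
                      x = toy, y = toy_names)

print(labels_purrr)
```

Each call of the function sees the first element of both inputs, then the second of both, and so on, which is the same pairing used for the plots.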
Each time we run mapply or map2 above, the following is printed to the console:
    $BA
    NULL
    $IBM
    NULL
    $JNJ
    NULL
This is because both mapply and map2 return a value. Since plot returns NULL, a list of NULLs is collected and printed. The purrr package provides walk for dealing with functions like plot that are called for their side effects. Here is the same task with walk2 instead of map2. It produces the plots and prints nothing to the console.
    walk2(dat, stocks, ~plot(.x$Close, type = "l", main = .y))
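The difference between map2 and walk2 can be checked without a graphics device by using cat as the side-effect function. This is a minimal sketch on made-up inputs (the toy list and labels are assumptions for illustration).

```r
library(purrr)

# Toy stand-ins for the stock list and the vector of stock names (made up)
toy <- list(
  A = data.frame(Close = c(2, 3, 4)),
  B = data.frame(Close = c(20, 30, 40))
)
toy_names <- c("Stock A", "Stock B")

# cat() is called only for its side effect (writing a line to the console)
res <- walk2(toy, toy_names,
             ~ cat(.y, "has", nrow(.x), "rows\n"))

# walk2 returns its first argument invisibly, so nothing extra is printed
# and the input can be passed along to a further step
returned_input <- identical(res, toy)
print(returned_input)
```

Because the input comes back unchanged, a walk call can sit in the middle of a chain of operations without interrupting it.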
At some point we may want to collapse our list of three data frames into a single data frame. This means we’ll want to add a column to indicate which record belongs to which stock. Using base R this is a two-step process. We first use do.call to apply the rbind function to the elements of our list. Then we add a column called Stock by taking advantage of the fact that the row names of our data frame contain the name of the original list element, in this case the stock name.
    datDF <- do.call(rbind, dat)
    # add stock names to data frame
    datDF$Stock <- gsub("\\.[0-9]*", "", rownames(datDF)) # remove period and numbers
    head(datDF)
               Date   Open   High    Low  Close  Volume Adj.Close Stock
    BA.1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
    BA.2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
    BA.3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
    BA.4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
    BA.5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
    BA.6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA
Using purrr, we could have used map_df instead of map with the read.csv function, but we would have lost the source file information.
    dat2DF <- map_df(files, read.csv) # works, but which record goes with which stock?
We could also use purrr’s reduce function. That will collapse the list into a single data frame. But again we have no way of labeling which row came from which stock.
    dat2DF <- reduce(dat, rbind) # works, but which record goes with which stock?
To accomplish this with purrr, we need to use the stocks vector we created earlier along with the map2_df function. This function applies a function to two arguments and returns a data frame. The function we want to apply is update_list, another purrr function. The update_list function allows you to add things to a list element, such as a new column to a data frame. Below we use the formula notation again, with .x and .y to indicate the arguments. The result is a single data frame with a new stock column.
    dat2DF <- map2_df(dat2, stocks, ~update_list(.x, stock = .y))
    head(dat2DF)
            Date   Open   High    Low  Close  Volume Adj.Close stock
    1 2017-04-12 178.25 178.25 175.94 176.05 2920000    176.05    BA
    2 2017-04-11 177.50 178.60 176.96 178.57 2259700    178.57    BA
    3 2017-04-10 179.00 179.97 177.48 177.56 2259500    177.56    BA
    4 2017-04-07 178.39 179.09 177.26 178.85 2704700    178.85    BA
    5 2017-04-06 177.56 178.22 177.12 177.37 2343600    177.37    BA
    6 2017-04-05 179.00 180.18 176.89 177.08 2387100    177.08    BA
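The same labeling can be sketched with only map2, cbind and reduce, which may be useful if update_list or map2_df is unavailable or deprecated in your installed purrr version (an assumption worth checking against its documentation). The toy list below is made up for illustration and is not the stock data.

```r
library(purrr)

# Toy stand-in for the list of stock data frames (values are made up)
toy <- list(
  A = data.frame(Open = c(1, 2, 3), Close = c(2, 3, 4)),
  B = data.frame(Open = c(10, 20, 30), Close = c(20, 30, 40))
)

# Step 1: add a Stock column to each data frame; .y is recycled down the rows
labelled <- map2(toy, names(toy), ~ cbind(.x, Stock = .y))

# Step 2: stack the labelled data frames into one
combined <- reduce(labelled, rbind)

print(combined)
```

Labeling each data frame before stacking keeps the source information in a column, so there is no need to parse it back out of the row names afterwards.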
Finally, we should consider reformatting the Date column as a Date instead of a Factor. The easiest way to deal with this would have been to use the read_csv function from the readr package instead of read.csv. But in the interest of demonstrating some more purrr functionality, let’s pretend we can’t do that. Further, let’s pretend we don’t know which columns are Factor, but we would like to convert them to Date if they are Factor. This time we give a purrr solution first.
To do this we nest one map function in another. The first one is dmap_if. dmap is just like map, except that dmap returns a data frame. dmap_if allows us to define a condition to dictate whether or not we apply the function. In this case the condition is determined by is.factor. If is.factor returns TRUE, then we apply the ymd function from the lubridate package. Now dmap_if takes a data frame, not a list, so we have to use map to apply dmap_if to each data frame in our list. The final code is as follows:
    dat2 <- map(dat2, ~dmap_if(., is.factor, lubridate::ymd))
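If dmap_if is not available in your installed version of purrr, a similar conditional conversion can be sketched with modify_if, which applies a function only to the elements that satisfy a predicate and keeps the input's type. This is an alternative under stated assumptions, not the method used above: the toy data are made up, and as.Date stands in for lubridate::ymd to avoid an extra dependency.

```r
library(purrr)

# Toy stand-in for the stock list, with Date stored as a factor (made up)
toy <- list(
  A = data.frame(Date = factor(c("2017-04-12", "2017-04-11")), Open = c(1, 2)),
  B = data.frame(Date = factor(c("2017-04-12", "2017-04-11")), Open = c(10, 20))
)

# Hypothetical helper: convert a factor of ISO dates to Date
to_date <- function(col) as.Date(as.character(col))

# Outer map walks the list; inner modify_if walks the columns of each
# data frame and converts only those for which is.factor() is TRUE
toy_fixed <- map(toy, ~ modify_if(.x, is.factor, to_date))

str(toy_fixed$A)
```

The nesting mirrors the structure described above: one function iterates over the data frames, the other over the columns inside each one.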
Doing this in base R is possible but far more difficult. We nest one lapply function inside another, but since lapply returns a list, we need to wrap the inner lapply in as.data.frame. And within the inner lapply we use the assignment operator as a function, which works but looks cryptic!
    dat <- lapply(dat, function(x) as.data.frame(
      lapply(x, function(y) if (is.factor(y)) `<-`(y, lubridate::ymd(y)) else y)))
This article provides just a taste of purrr. We hope it gets you started learning more about the package. Be sure to read the documentation as well. Each help page contains illustrative examples. Note that purrr is a very young package. At the time of this writing it is at version 0.2.2. There are sure to be improvements and changes in the coming months and years.
For questions or clarifications regarding this article, contact the UVa Library StatLab: email@example.com
Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2017