The tidyr package by Hadley Wickham centers on two functions:
spread. If you have struggled to understand what exactly these two functions do, this tutorial is for you.
To begin we need to wrap our heads around the idea of “key-value pairs”. The help pages for
spread use this terminology to explain how they work. Without some intuition for key-value pairs, it can be difficult to truly understand how these functions work.
Let’s generate some data to help explain this concept. The code below comes from the
gather help page with a few modifications. It creates stock price data for three fictional companies named X, Y and Z. We’ve added a set.seed line so you can reproduce the results if you wish to follow along.
set.seed(1) stocks <- data.frame( time = as.Date('2009-01-01') + 0:9, X = rnorm(10, 20, 1), Y = rnorm(10, 20, 2), Z = rnorm(10, 20, 4) ) stocks time X Y Z 1 2009-01-01 19.37355 23.02356 23.67591 2 2009-01-02 20.18364 20.77969 23.12855 3 2009-01-03 19.16437 18.75752 20.29826 4 2009-01-04 21.59528 15.57060 12.04259 5 2009-01-05 20.32951 22.24986 22.47930 6 2009-01-06 19.17953 19.91013 19.77549 7 2009-01-07 20.48743 19.96762 19.37682 8 2009-01-08 20.73832 21.88767 14.11699 9 2009-01-09 20.57578 21.64244 18.08740 10 2009-01-10 19.69461 21.18780 21.67177
Look at row 1. It shows three stock prices for companies X, Y and Z for 2009-01-01. The numbers are the values. Which values go with which company? We have to look at the column headers to find out. We can think of the column headers as the “keys” to knowing which value goes with which company. The value 19.37355 is keyed to X. That’s a key-value pair. Another key-value pair is 23.02356 and Y. The value 23.02356 is keyed to stock Y. The three key-value pairs in row 1 are unique to 2009-01-01.
If you understood the last paragraph, then you understand the concept of key-value pairs (at least well enough to learn how to use the tidyr package). The help page for
gather says that it “takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.” Applying the
gather function to the data above would mean gathering the X, Y and Z columns into two columns of key-value pairs. For the date 2009-01-01, that would look something like this:
2009-01-01 X 19.37355 2009-01-02 X 20.18364 2009-01-03 X 19.16437
We took multiple columns (X, Y and Z) and collapsed into two columns consisting of key-value pairs. The column with X, Y and Z contains the keys. The last column contains the values. The date column value was duplicated so we know which key-value pair goes with which date. This is sometimes referred to as “reshaping wide data to long data”. Let’s demonstrate how to do this with the
The first argument is the data frame we wish to reshape. The next two arguments are the key and value. What’s tricky is that you give them the names that you want your new key and value columns to be titled. They do NOT refer to columns in the data frame. Below we name our new columns “stock” and “price”. The column headers (X, Y, Z) will be in the stock column. The values in the X, Y and Z columns will go in the price column. The final arguments to provide are the columns that contain the key-value pairs. We assign our reshaped data frame to a new object called “stocksL”.
install.packages("tidyr) # if not already installed library(tidyr) stocksL <- gather(data = stocks, key = stock, value = price, X, Y, Z) stocksL time stock price 1 2009-01-01 X 19.37355 2 2009-01-02 X 20.18364 3 2009-01-03 X 19.16437 4 2009-01-04 X 21.59528 5 2009-01-05 X 20.32951 6 2009-01-06 X 19.17953 7 2009-01-07 X 20.48743 8 2009-01-08 X 20.73832 9 2009-01-09 X 20.57578 10 2009-01-10 X 19.69461 11 2009-01-01 Y 23.02356 12 2009-01-02 Y 20.77969 13 2009-01-03 Y 18.75752 14 2009-01-04 Y 15.57060 15 2009-01-05 Y 22.24986 16 2009-01-06 Y 19.91013 17 2009-01-07 Y 19.96762 18 2009-01-08 Y 21.88767 19 2009-01-09 Y 21.64244 20 2009-01-10 Y 21.18780 21 2009-01-01 Z 23.67591 22 2009-01-02 Z 23.12855 23 2009-01-03 Z 20.29826 24 2009-01-04 Z 12.04259 25 2009-01-05 Z 22.47930 26 2009-01-06 Z 19.77549 27 2009-01-07 Z 19.37682 28 2009-01-08 Z 14.11699 29 2009-01-09 Z 18.08740 30 2009-01-10 Z 21.67177
The documentation for
gather demonstrates the same operation like this:
gather(stocks, stock, price, -time)
The last argument, -time, means all columns except time contain the key-value pairs. We can also do it like this:
gather(stocks, stock, price, X:Z)
The last argument, X:Z, means the columns X through Z, inclusive, contain the key-value pairs. This method is very useful if you have many adjacent columns that contain key-value pairs.
Again it’s worth noting the 2nd and 3rd arguments are the column names we want to create after gathering the key-value pairs. Here is the same thing we just did, but with different column names (stk and pr):
head(gather(stocks, stk, pr, -time)) time stk pr 1 2009-01-01 X 19.37355 2 2009-01-02 X 20.18364 3 2009-01-03 X 19.16437 4 2009-01-04 X 21.59528 5 2009-01-05 X 20.32951 6 2009-01-06 X 19.17953
The tidyr package description states that it’s “an evolution of ‘reshape2′”. Let’s see how we do the same thing with the reshape2 package.
install.packages("reshape2") # if not already installed library(reshape2) melt(stocks, measure.vars = c("X","Y","Z"), variable.name = "stock", value.name = "price") time stock price 1 2009-01-01 X 19.37355 2 2009-01-02 X 20.18364 3 2009-01-03 X 19.16437 4 2009-01-04 X 21.59528 ...[remaining output not shown]
melt function, the columns containing the key-value pairs are provided to the measure.vars argument. The variable.name and value.name arguments provide the new column names for the key and value columns, respectively, but these are optional. The
melt function will create default column names for us if we don’t provide any. Notice we also have to use quotes around variable names which we don’t have to do with
Reshaping data wide to long often gives us what Wickham refers to as “tidy data” (Wickham, 2014). Wickham defines tidy data as follows:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Our stocks data is now “tidy”:
1. After reshaping (or gathering, or melting) our data, each variable forms a column. Our three variables are time, stock, and price.
2. Each row is now an observation. Before reshaping our data, each row represented three observations.
3. Our table (or data frame) consists of one type of unit: observations made per company per day
Tidy data is often required in R (and other statistical programs) in order to carry out tasks such as modeling or plotting. Making data tidy often means reshaping it. But according to Wickham, the word “reshape” is imprecise. He refers to the process as “melting” in his 2014 paper and as “gathering key-value pairs” in the documentation for tidyr. Regardless of what you call the process, the
gather function is designed to help you obtain tidy data for further analysis.
The other primary tidyr function is
spread, which spreads key-value pairs across multiple columns. It is the complement of
gather. It takes two columns, key and value, and spreads them out such that the keys are the column headers and the values are in the columns to which they’re keyed. This is sometimes called “reshaping long data to wide data”.
Look at the first 6 records of stocksL:
head(stocksL) time stock price 1 2009-01-01 X 19.37355 2 2009-01-02 X 20.18364 3 2009-01-03 X 19.16437 4 2009-01-04 X 21.59528 5 2009-01-05 X 20.32951 6 2009-01-06 X 19.17953
The first key-value pair is X and 19.37355. The next key-value pair is X and 20.18364. And so on. Spreading this data means creating columns for each key (X, Y and Z) and placing their respective values in each column. We demonstrate how to do this with the
The first argument is the data frame we want to reshape. The second and third arguments are the key and value columns, respectively. In contrast to
gather, these arguments refer to columns that already exist in our data. In this case, that’s stock and price. These are the key-value pairs we want to spread out. Here how’s to do it with the “long” data frame we created, stocksL:
spread(data = stocksL, key = stock, value = price) time X Y Z 1 2009-01-01 19.37355 23.02356 23.67591 2 2009-01-02 20.18364 20.77969 23.12855 3 2009-01-03 19.16437 18.75752 20.29826 4 2009-01-04 21.59528 15.57060 12.04259 5 2009-01-05 20.32951 22.24986 22.47930 6 2009-01-06 19.17953 19.91013 19.77549 7 2009-01-07 20.48743 19.96762 19.37682 8 2009-01-08 20.73832 21.88767 14.11699 9 2009-01-09 20.57578 21.64244 18.08740 10 2009-01-10 19.69461 21.18780 21.67177
We have recovered our original data frame, undoing the work of
gather. That is what we mean when we say
spread is the complement of
The reshape2 package has a function that does this called
dcast. It’s a little more complicated to use than
spread. First, it requires a “casting formula”. The left hand side is what you want in the first column while the left hand side is what you want to comprise the new columns. The value.var argument is what you want placed in the new columns. Below we show how to reshape the “long” stocksL data frame back to “wide” format using
dcast(stocksL, time ~ stock, value.var = "price") time X Y Z 1 2009-01-01 19.37355 23.02356 23.67591 2 2009-01-02 20.18364 20.77969 23.12855 3 2009-01-03 19.16437 18.75752 20.29826 4 2009-01-04 21.59528 15.57060 12.04259 5 2009-01-05 20.32951 22.24986 22.47930 6 2009-01-06 19.17953 19.91013 19.77549 7 2009-01-07 20.48743 19.96762 19.37682 8 2009-01-08 20.73832 21.88767 14.11699 9 2009-01-09 20.57578 21.64244 18.08740 10 2009-01-10 19.69461 21.18780 21.67177
Notice that while you don’t have to quote the variable names in the casting formula, you do have to quote the variable in the value.var argument. Also, unlike
dcast can do more than spread data. It can aggregate data as well. See the
dcast documentation for several examples.
Choosing between tidyr and reshape2 is mostly a personal preference. Both get the job done when it comes to reshaping data. One possible advantage to using tidyr is that it’s designed to work well with dplyr data pipelines. dplyr is a package that provides a grammar for data manipulation. It’s meant to make data manipulation easier by providing consistent and easy-to-remember syntax. (Here’s an introductory blog post we wrote if you want to learn more, or Google for “dplyr tutorial” for more such articles.) A pipeline refers to piping, or chaining, together commands. In a dplyr pipeline, the output of one function becomes the input to the next function. One way to think of a pipeline is logically giving instructions, in order, of how to do a task. For example: “take the stocks data, reshape it to long, and then find the maximum and minimum stock price for each company.”
The way this works with a dplyr data pipeline is as follows:
stocks %>% gather(stock,price,X:Z)%>% group_by(stock) %>% summarise(min = min(price), max = max(price)) stock min max 1 X 19.16437 21.59528 2 Y 15.57060 23.02356 3 Z 12.04259 23.67591
We start with the stocks data frame followed by the pipe operator,
%>%. The pipe passes stocks to the first argument of the next function, which is
gather. Notice we don’t populate the first argument of the
gather function. The pipe operator does this for us. After the
gather function another pipe operator passes the reshaped data to a dplyr function,
group_by. This function groups the data by stock. This is followed by one more pipe operator which passes the grouped data to another dplyr function,
summarise, which calculates the min and max for each group (X, Y and Z). Notice how the pipeline mirrors how we might explain in words what we’re doing to the data: “we’re taking stocks, gathering columns X – Z, grouping by stock, and calculating the minimum and maximum price for each group.”
If you’re fluent with tidyr and dplyr, this is a fast and easy way to work with data. For comparison, here is the same thing carried out with the base R
stocksL <- gather(stocks, stock, price, X, Y, Z) aggregate(price ~ stock, data = stocksL, function(x)c(min=min(x),max=max(x))) stock price.min price.max 1 X 19.16437 21.59528 2 Y 15.57060 23.02356 3 Z 12.04259 23.67591
This works just as well, but our column headers are slightly different and we had to know how to create an anonymous function. We also had to save the reshaped data before using it with
aggregate, which we didn’t have to do with the pipeline. (Well, we didn’t have to do that. We could have plugged in the entire
gather function call into the data argument of
aggregate, but that would make for hard-to-read code.)
Wickham, Hadley (2014). “Tidy Data.” Journal of Statistical Software, 59(10), https://www.jstatsoft.org/article/view/v059i10 .
For questions or clarifications regarding this article, contact the UVa Library StatLab: email@example.comClay Ford
Statistical Research Consultant
University of Virginia Library
August 24, 2016