dplyr: an R package for fast and easy data manipulation

Whether you’re brand new to R or a long-time user, the dplyr package is worth checking out. Released in January 2014, dplyr provides simple functions that can be chained together to manipulate data quickly and easily.

The idea behind dplyr is that data manipulation often involves common tasks, such as selecting certain variables, filtering on certain conditions, deriving new variables from existing variables, and so forth. If we think of these tasks as “verbs”, we can define a grammar of sorts for data manipulation. In dplyr the main verbs (or functions) are filter, arrange, select, mutate, summarize, and group_by. You can probably guess what these functions do by their names, but let’s describe them and try them out:

  • filter – select a subset of the rows of a data frame
  • arrange – reorder the rows of a data frame
  • select – select columns of a data frame
  • mutate – add new columns to a data frame that are functions of existing columns
  • summarize – collapse a data frame, or each group of rows, into a single row of summary values
  • group_by – describe how to break a data frame into groups of rows
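
To make the list concrete, here is a minimal sketch that calls each verb on its own. It assumes dplyr is already installed and uses R's built-in ToothGrowth data frame (columns len, supp, and dose). The object names and the new column lenCentered are made up for illustration.

```r
library(dplyr)

# filter: keep only the rows where dose equals 2
high_dose <- filter(ToothGrowth, dose == 2)

# arrange: reorder the rows by len, smallest first
sorted <- arrange(ToothGrowth, len)

# select: keep only the len and dose columns
two_cols <- select(ToothGrowth, len, dose)

# mutate: add a column computed from an existing one
# (lenCentered is a hypothetical name)
centered <- mutate(ToothGrowth, lenCentered = len - mean(len))

# summarize: collapse the whole data frame to a single row
overall <- summarize(ToothGrowth, meanLen = mean(len))

# group_by: mark the rows as belonging to groups; the data are unchanged
grouped <- group_by(ToothGrowth, supp)

head(sorted, n = 3)
overall
```

In every case the data frame is the first argument and the result is a new data frame; ToothGrowth itself is not modified.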

These functions work fine on their own, but the real power of dplyr is that it lets you combine (or chain together) these functions into concise, readable commands for manipulating data. To chain commands together, connect them with the chain operator, %>%, which passes the result on its left as the first argument to the function on its right.
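
As a rough illustration of what chaining buys you, the sketch below writes the same three steps twice: once as nested function calls and once with %>%. It assumes dplyr is installed and uses the built-in ToothGrowth data; the names nested and chained are only for this example.

```r
library(dplyr)

# Nested calls have to be read from the inside out
nested <- head(arrange(filter(ToothGrowth, supp == "OJ"), len), n = 3)

# The same steps chained together read top to bottom
chained <- ToothGrowth %>%
  filter(supp == "OJ") %>%
  arrange(len) %>%
  head(n = 3)

# Both approaches produce the same data frame
identical(nested, chained)
```

The chained version performs exactly the same calls; it only changes the order in which you read them.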

Let’s demonstrate. First you’ll need to install dplyr. Recall in R that you only need to install a package once. Thereafter you load it each session with the library function.
install.packages("dplyr")
library(dplyr)

To keep matters simple, we’ll use a data set that comes with R called ToothGrowth. It contains tooth length measurements for 60 guinea pigs, each of which received one of three dose levels of Vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods (orange juice or ascorbic acid).
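
Before manipulating the data, it can help to look at its structure. The following sketch uses only base R functions, so it runs without dplyr; the name cell_counts is just for illustration.

```r
# Column names and types: len (numeric), supp (factor), dose (numeric)
str(ToothGrowth)

# First few rows
head(ToothGrowth)

# Number of observations for each delivery method and dose level
cell_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
cell_counts
```

The table shows a balanced design: each of the six combinations of delivery method and dose has the same number of observations.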

Let’s say I want to calculate the mean length of teeth (len) for each delivery method (supp). In dplyr we do that as follows:
ToothGrowth %>%
  group_by(supp) %>%
  summarize(meanLen = mean(len))

Notice how we chain the commands together. First we state which data set we’re using, then we specify how it should be grouped (by supp, i.e. the delivery method), and finally we use the summarize function to create a new variable, meanLen, that equals the mean tooth length in each group.
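
If you want to convince yourself that the chain computes what you expect, one option is to compare it with a base R calculation. This is only a sanity check sketched with tapply, not something dplyr requires; the names by_supp and base_means are made up here.

```r
library(dplyr)

# Group means computed with dplyr
by_supp <- ToothGrowth %>%
  group_by(supp) %>%
  summarize(meanLen = mean(len))

# The same means computed with base R's tapply
base_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)

# Compare the two sets of means (both are ordered by the levels of supp)
all.equal(by_supp$meanLen, as.numeric(base_means))
```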

What about mean teeth length by delivery method and dose level? No problem. Just add dose to the group_by function:
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(meanLen = mean(len))
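
The summarize function is not limited to one statistic per call. As a small sketch, the following computes the group size, mean, and standard deviation together. The column names n, meanLen, and sdLen and the object name cell_stats are arbitrary choices; n() is dplyr's helper for counting the rows in the current group.

```r
library(dplyr)

cell_stats <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(
    n = n(),              # number of rows in each group
    meanLen = mean(len),  # mean tooth length in each group
    sdLen = sd(len)       # standard deviation in each group
  )

cell_stats
```

Depending on your dplyr version, summarize may print a message about how the result is grouped; that message is informational and does not indicate a problem.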

We can create new variables as well. Here we standardize tooth length in the six supp x dose groups:
ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(stdLen = (len - mean(len)) / sd(len))

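The mutate function creates a new variable called “stdLen” that equals the given expression. Because the data are grouped, mean(len) and sd(len) are computed separately within each supp x dose group.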
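
One way to check that the standardization was done within groups rather than across the whole data set is to summarize the new variable by group: each group should have a mean of about 0 and a standard deviation of 1. The sketch below does this; the names std, check, meanStd, and sdStd are made up for the example.

```r
library(dplyr)

# Standardize len within each supp x dose group
std <- ToothGrowth %>%
  group_by(supp, dose) %>%
  mutate(stdLen = (len - mean(len)) / sd(len))

# std is still grouped, so summarize works group by group
check <- std %>%
  summarize(meanStd = mean(stdLen), sdStd = sd(stdLen))

check
```

The means may print as tiny nonzero numbers instead of exactly 0; that is ordinary floating-point rounding.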

Can you figure out what this does?
ToothGrowth %>%
  filter(dose == 0.5) %>%
  select(len, supp) %>%
  arrange(len)

Hopefully it makes sense. We first keep only the rows with dose level equal to 0.5, then select only the len and supp columns, and finally sort in ascending order by len.
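
The order of the steps matters, because each function only sees what the previous one returned. The sketch below swaps filter and select to show what goes wrong. It assumes there is no separate object named dose in your workspace, and it uses tryCatch only so the example keeps running after the error; the names ok and swapped are illustrative.

```r
library(dplyr)

# Works: dose is still a column when filter runs
ok <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  select(len, supp)

# Fails: select has already dropped dose, so filter cannot find it
# (assumes no other object called dose exists in the workspace)
swapped <- tryCatch(
  ToothGrowth %>%
    select(len, supp) %>%
    filter(dose == 0.5),
  error = function(e) "error: dose is no longer a column"
)

nrow(ok)
swapped
```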

How about this?
ToothGrowth %>%
  filter(dose == 0.5) %>%
  select(len, supp) %>%
  arrange(desc(len)) %>%
  head(n = 5)

Same as before, but now we’re sorting in descending order (notice the desc function, included with dplyr) and using the head function to get the top 5 tooth lengths. If the head function looks familiar, that’s because it’s a standard R function. That’s right, you can chain in other functions besides those that come with dplyr.

If we want to save this data set, we can use the normal assignment operator in R:
TG05 <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  select(len, supp) %>%
  arrange(desc(len)) %>%
  head(n = 5)
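
The saved object is an ordinary data frame, so you can inspect it with the usual base R tools. A short sketch that rebuilds TG05 and then checks its shape and ordering:

```r
library(dplyr)

TG05 <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  select(len, supp) %>%
  arrange(desc(len)) %>%
  head(n = 5)

dim(TG05)    # 5 rows, 2 columns
class(TG05)  # "data.frame"
TG05$len     # the five largest lengths at dose 0.5, largest first
```

Note that the chain never changes ToothGrowth itself; only the result assigned to TG05 is stored.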

This is but a taste of what the dplyr package offers. If you want to learn more, install the package and read the accompanying vignettes. They’re quite friendly and useful.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 30, 2012