Chapter 2.2.1 dplyr Package
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset.
The dplyr package was developed by Hadley Wickham of RStudio and is an optimized and distilled version of his plyr package. The dplyr package does not provide any “new” functionality to R per se, in the sense that everything dplyr does could already be done with base R, but it greatly simplifies existing functionality in R.
One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames. With this grammar, you can communicate what you are doing to a data frame in a way that other people can understand (assuming they also know the grammar). This is useful because it provides an abstraction for data manipulation that previously did not exist. Another useful contribution is that the dplyr functions are very fast, as many key operations are coded in C++.
Summary
The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().
Once you learn the dplyr grammar there are a few additional benefits:
- dplyr can work with other data frame “backends” such as SQL databases. There is an SQL interface for relational databases via the DBI package.
- dplyr can be integrated with the data.table package for large fast tables.
The dplyr package is a handy way to both simplify and speed up your data frame management code. It’s rare that you get such a combination at the same time!
The data frame is a key data structure in statistics and in R. The basic structure of a data frame is that there is one observation per row and each column represents a variable, a measure, feature, or characteristic of that observation. R has an internal implementation of data frames that is likely the one you will use most often. However, there are packages on CRAN that implement data frames via things like relational databases, which allow you to operate on very large data frames (but we won’t discuss them here).
Given the importance of managing data frames, it’s important that we have good tools for dealing with them. In previous chapters we have already discussed some tools like the subset() function and the use of [ and $ operators to extract subsets of data frames. However, other operations, like filtering, re-ordering, and collapsing, can often be tedious operations in R whose syntax is not very intuitive.
The dplyr package is designed to mitigate a lot of these problems and to provide a highly optimized set of routines specifically for dealing with data frames.
Common dplyr function properties
All of the functions that we will discuss in this chapter have a few common characteristics. In particular:
- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
- The return result of a function is a new data frame.
- Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.
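A minimal sketch of these shared properties, using filter() as the example verb. The airquality data frame that ships with base R is an assumption made here so that the code runs on its own; it is not the dataset used in this chapter.

```r
library(dplyr)

# airquality (from base R's datasets package) stands in for the chapter's data.
# First argument: a data frame. Later arguments: what to do with it.
# The column Temp is referred to by its bare name, without `$`.
hot <- filter(airquality, Temp > 90)

# The base R equivalent has to repeat the data frame name with `$`.
hot_base <- airquality[airquality$Temp > 90, ]

class(hot)          # the result is itself a data frame
nrow(airquality)    # the original data frame is left unchanged
```

Because every verb takes a data frame first and returns a data frame, the output of one verb can be fed directly into the next.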
dplyr grammar
Some of the key “verbs” provided by the dplyr package are:
- select: return a subset of the columns of a data frame, using a flexible notation
- filter: extract a subset of rows from a data frame based on logical conditions
- arrange: reorder rows of a data frame
- rename: rename variables in a data frame
- mutate: add new variables/columns or transform existing variables
- summarise/summarize: generate summary statistics of different variables in the data frame, possibly within strata
- %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
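Two of the verbs in this list, arrange() and mutate(), can be sketched briefly as follows. The built-in airquality data frame and the new column name TempC are assumptions chosen for illustration only.

```r
library(dplyr)

# Reorder rows by temperature, hottest first; desc() reverses the sort order.
by_temp <- arrange(airquality, desc(Temp))
head(by_temp, 3)

# Add a new column computed from an existing one
# (airquality records Temp in degrees Fahrenheit).
with_celsius <- mutate(airquality, TempC = (Temp - 32) * 5 / 9)
head(with_celsius, 3)
```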
The dplyr package has a number of its own data types that it takes advantage of. For example, there is a handy print method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.
select()
The select() function can be used to select columns of a data frame that you want to focus on. Often you’ll have a large data frame containing “all” of the data, but any given analysis might only use a subset of variables or observations. The select() function allows you to get the few columns you might need.
Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.
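Both approaches might look like the sketch below. It assumes the built-in airquality data frame rather than this chapter's dataset, so the column names are illustrative.

```r
library(dplyr)

# By position: the first three columns.
by_index <- select(airquality, 1:3)

# By name: a range of adjacent columns, written without quotes.
by_name <- select(airquality, Ozone:Wind)

names(by_index)
names(by_name)

# A minus sign drops columns instead of keeping them.
no_dates <- select(airquality, -(Month:Day))
names(no_dates)
```

Selecting by name is usually the safer choice, since it keeps working if the column order changes.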
filter()
The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing subset() function in R but is quite a bit faster in my experience.
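As a rough illustration, the sketch below filters on two conditions and makes the same request with subset(). The built-in airquality data frame and the thresholds are assumptions chosen for the example; no timing comparison is attempted here.

```r
library(dplyr)

# Conditions separated by commas are combined with AND.
# Rows where a condition evaluates to NA (Ozone has missing values) are dropped.
hot_ozone <- filter(airquality, Temp > 85, Ozone > 100)

# The same request using base R's subset().
hot_ozone_base <- subset(airquality, Temp > 85 & Ozone > 100)

hot_ozone
```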
rename()
Renaming a variable in a data frame in R is surprisingly hard to do! The rename() function is designed to make this process easier.
I leave it as an exercise for the reader to figure out how to do this in base R without dplyr.
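On the dplyr side, a call might look like the following sketch. The built-in airquality data frame and the new names ozone and solar are assumptions made for illustration.

```r
library(dplyr)

# rename() takes new_name = old_name pairs.
# Columns that are not mentioned keep their names and positions.
renamed <- rename(airquality, ozone = Ozone, solar = Solar.R)

names(renamed)
```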
%>%
The pipeline operator %>% is very handy for stringing together multiple dplyr functions in a sequence of operations.
Notice above that every time we wanted to apply more than one function, the steps got buried in a sequence of nested function calls that is difficult to read, i.e.
> third(second(first(x)))
This nesting is not a natural way to think about a sequence of operations. The %>% operator allows you to string operations in a left-to-right fashion, i.e.
> first(x) %>% second %>% third
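To make the schematic concrete, here is a sketch of the same three steps written both ways. The built-in airquality data frame and the particular steps (filter, sort, take the top rows) are assumptions chosen for illustration.

```r
library(dplyr)

# Nested: the calls must be read from the inside out.
nested <- head(arrange(filter(airquality, Month == 7), desc(Temp)), 3)

# Piped: the calls read left to right, in the order the steps happen.
piped <- airquality %>%
  filter(Month == 7) %>%
  arrange(desc(Temp)) %>%
  head(3)

# Both forms should produce the same three rows.
all.equal(nested, piped)
```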
Take the example that we just did in the last section where we computed the mean of o3 and no2 within quintiles of pm25. There we had to
- create a new variable pm25.quint
- split the data frame by that new variable
- compute the mean of o3 and no2 in the sub-groups defined by pm25.quint
That can be done with the following sequence in a single R expression.
> mutate(chicago, pm25.quint = cut(pm25, qq)) %>%
+         group_by(pm25.quint) %>%
+         summarize(o3 = mean(o3tmean2, na.rm = TRUE),
+                   no2 = mean(no2tmean2, na.rm = TRUE))
# A tibble: 6 × 3
   pm25.quint       o3      no2
       <fctr>    <dbl>    <dbl>
1   (1.7,8.7] 21.66401 17.99129
2  (8.7,12.4] 20.38248 22.13004
3 (12.4,16.7] 20.66160 24.35708
4 (16.7,22.6] 19.88122 27.27132
5 (22.6,61.5] 20.31775 29.64427
6          NA 18.79044 25.77585
This way we don’t have to create a set of temporary variables along the way or create a massive nested sequence of function calls.
Notice in the above code that I pass the chicago data frame to the first call to mutate(), but then afterwards I do not have to pass the first argument to group_by() or summarize(). Once you travel down the pipeline with %>%, the first argument is taken to be the output of the previous element in the pipeline.
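A self-contained version of the same pattern is sketched below for readers who do not have the chicago data at hand. It assumes the built-in airquality data frame; the break points and the names breaks, temp.quint, and quint_means are hypothetical choices for this sketch and are not the chapter's qq or pm25.quint.

```r
library(dplyr)

# Quintile break points for temperature (an assumption for this sketch;
# the chapter's `qq` is defined outside this section).
breaks <- quantile(airquality$Temp, seq(0, 1, 0.2))

quint_means <- airquality %>%
  # create a new grouping variable
  mutate(temp.quint = cut(Temp, breaks, include.lowest = TRUE)) %>%
  # split the data frame by that variable
  group_by(temp.quint) %>%
  # compute the means within each sub-group
  summarize(ozone = mean(Ozone, na.rm = TRUE),
            wind  = mean(Wind, na.rm = TRUE))

quint_means
```

Only the first step names the data frame; each later verb receives the previous result as its first argument through the pipe.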