dplyr {#r-dplyr}
Let's talk about data manipulation. We'll be using the dplyr package, which is part of the tidyverse. You can find a quick reference page under Help -> Cheatsheets in RStudio. First, we need to load our packages & data.
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
The pipe ( |> ) operator strings functions together in a sequence. It takes the result of the function on its left and makes it the first argument to the function on the right. Let's say you wanted to calculate the base-12 log of the mean of the square root of the absolute value of numbers between -50 and 50. The traditional way to write that would be:
log(mean(sqrt(abs(-50:50))), base = 12)
This is rather difficult to read; it has a lot of nested parentheses, and you need to start from the inside and work your way out to see what's happening. With the pipe, you could re-write it like this:
-50:50 |> abs() |> sqrt() |> mean() |> log(base = 12)
Using pipes can make your code much clearer, and is quite helpful when creating a sequence of related transformations on data.
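As a quick check that the two forms really are the same computation, here is a minimal sketch that runs both and compares them. It uses only base R and assumes R 4.1 or later (the first version with the |> operator); the variable names nested and piped are just for illustration.

```r
# Nested form: read from the inside out
nested <- log(mean(sqrt(abs(-50:50))), base = 12)

# Piped form: read from left to right
piped <- -50:50 |> abs() |> sqrt() |> mean() |> log(base = 12)

# x |> f(y) is rewritten as f(x, y) before it runs,
# so both versions are the same computation
identical(nested, piped)
```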
In RStudio, you can insert the pipe by pressing Ctrl+Shift+M.
When looking for help online, you may also find a different version of the pipe (%>%); this is an older version of the pipe that is part of the tidyverse. It works similarly to |> but has a number of technical drawbacks. In most cases, you can use them interchangeably.
The mutate() function creates a new column in a data frame. For example, the total length of a lizard is defined as its snout-vent length (SVL) plus its tail length.
mutate(.data = lizards, total_length = SVL + Tail) |> View() # Use View to look at the results in RStudio
This uses the lizards data to create a new column, total_length. Within the mutate command, you can reference columns directly by their names (like you do for aes() in ggplot). Mutate can create multiple new columns in a single command.
mutate(lizards, # generally, the .data argument is not named
  # All subsequent arguments refer to new columns
  total_length = SVL + Tail,
  rel_limb = Limb/SVL,
  log_total_length = log(total_length),
  # You can also change an existing column by saving something to its name
  Color_morph = paste(Color_morph, "morph") # add "morph" after each color
) |>
  View()
Note that this doesn't modify the lizards dataset. It creates a new data frame that has an additional column. You'll need to save it as a new variable to use it.
lizards_full = lizards |> # It's also traditional to pipe the data argument in
  mutate(total_length = SVL + Tail,
         rel_limb = Limb/SVL,
         log_total_length = log(total_length))
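If you don't have the lizard data handy, you can see the same "mutate returns a copy" behavior with a tiny made-up data frame. This is only a sketch: toy and its values are invented for illustration and are not part of the anoles dataset.

```r
library(dplyr)

# Hypothetical stand-in for the lizards data (values invented for illustration)
toy <- tibble(
  SVL  = c(50, 62, 71),
  Tail = c(80, 95, 110)
)

# mutate() returns a new data frame; save it to keep the new column
toy_full <- toy |>
  mutate(total_length = SVL + Tail)

names(toy)       # the original still has only SVL and Tail
names(toy_full)  # the saved copy also has total_length
```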
Tidyverse functions are designed to be piped together. For example:
lizards |> mutate(Total_length = SVL + Tail) |> ggplot(aes(x = Site, y = Total_length)) + geom_boxplot()
Creating a quick plot at the end of a data manipulation step can be a good way to get a visual idea of what you're doing.
Here are a few other helpful things to do with mutate():
lizards |>
  mutate(
    intercept = 1, # Add a constant
    row_number = 1:n() # the n() function tells you how many rows are
    # in the current data frame (it only works in mutate & related functions)
  ) |>
  View()
Here are a few exercises to try:
- Add a column to the lizards dataset that gives the lizard's height relative to the maximum height of any lizard (hint: use max(Height) in a mutate command to find that value).
- Use mutate to calculate relative limb length and circumference (Diameter * pi), then pipe that result into a scatter plot of relative limb length vs. circumference. Note that pi is a pre-defined variable in R.

Let's define "large lizards" as those with an SVL greater than 60:
lizards |> mutate(large = SVL > 60) |> View()
The large column is a logical vector, with TRUE & FALSE values. We can use logical vectors to get a subset of the data frame where the vector is TRUE with the filter() function.
lizards |> mutate(large = SVL > 60) |> filter(large) |> View()
Note that only TRUE values of large remain in the data frame. It's not actually necessary to create a column before filtering:
lizards |> filter(SVL > 60) |> View()
The filter() command returns every row where its logical conditions are TRUE. Conditional expressions that create logical values include the following:
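- x == y: TRUE if x equals y (This is not the same as x = y)!
- x != y: TRUE if x does not equal y
- x > y, x >= y, x < y, x <= y: greater than, greater than or equal to, less than, less than or equal to
- between(x, y, z): TRUE if x >= y AND x <= z
- is.na(x): TRUE if x is a missing value (NA)
- x %in% y: TRUE if x is an element in y

Note that all of these are vectorized: they return one TRUE or FALSE for each element of x. The two-sided comparisons match x and y up by position; the exception is x %in% y, which checks each element of x against every element of y.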
c(1, 2) == c(2, 3) - 1
c(1, 2, 3) == c(1, 4, 9)
c(1, 2) == c(2, 1) # positions don't match
c(1, 2, 3, 4, 5) %in% c(1, 2) # for %in%, position only matters for the left argument
You can also combine and modify logical values:
- x & y: returns TRUE if both x and y are TRUE
- x | y: returns TRUE if either x or y is TRUE
- !x: returns the opposite of x; this one is particularly useful to combine with other logical functions; for example, filter(data, !is.na(x)) will return all rows where x is not a missing value.

If you give filter() more than one condition, it applies all of them by combining them with &.
lizards |> filter(Color_morph == "Blue", Site %in% c("A", "B", "C", "D", "E")) |> ggplot(aes(x = SVL, y = Height, color = Site)) + geom_point()
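To see that comma-separated conditions behave like &, here is a small sketch on a made-up data frame. The toy table and its values are hypothetical and only borrow the column names used above.

```r
library(dplyr)

# Hypothetical data with the same column names as the lizards table
toy <- tibble(
  Site        = c("A", "B", "C", "A", "F"),
  Color_morph = c("Blue", "Blue", "Green", "Brown", "Blue"),
  SVL         = c(55, 63, 70, 48, 66)
)

# Two comma-separated conditions...
by_comma <- toy |> filter(Color_morph == "Blue", Site %in% c("A", "B"))

# ...keep the same rows as a single condition joined with &
by_and <- toy |> filter(Color_morph == "Blue" & Site %in% c("A", "B"))

by_comma
by_and
```

Only the first two rows are blue lizards from site A or B, so both versions return those two rows.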
Exercises:
- ... %in% or a combination of == and | to meet the first condition.

You can subset certain columns with the select() function. There are several ways to do this. The simplest is by name:
lizards |> select(Site, Color_morph, SVL) |> View()
You can also select by position:
lizards |> select(1, 2, 7) |> View()
This is more useful for ranges of values:
lizards |> select(1:4, 7) |> View()
You can also use character vectors:
lizards |> select("Site", "Color_morph", "SVL") |> View()
Note that if you want to use a variable that has column names saved as a character vector, you'll need to use a helper function (all_of) to tell select that you want to look for the contents of the variable, not the name of the variable:
select_vars = c("Site", "Color_morph", "SVL")
lizards |>
  select(all_of(select_vars)) |> # without all_of, it would try to look for a column called "select_vars"
  View()
You can remove columns by using a negative sign. (Note that if you mix names with and without negative signs, the negative ones only remove columns from the ones you've already selected, rather than from the whole data frame.)
lizards |> select(-Color_morph, -Limb) |> View()
lizards |> select(-(1:5)) |> View()
You can also use the where helper function to select columns based on their characteristics. For example, the is.numeric function returns TRUE if its argument is numeric; you can use it to select all numeric columns.
lizards |> select(where(is.numeric)) |> View()
You could do the same for text or logical vectors with is.character or is.logical, respectively. Note that there aren't parentheses after is.numeric in the above code. That's because we aren't calling it on any particular value; instead, the where function calls it on every column of the data frame, and we're just telling it what function to use.
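A rough way to picture what where() is doing is to call the function on each column yourself. The sketch below does that with sapply() on a hypothetical three-column table (toy and its values are made up), then shows that select() keeps exactly the columns that came back TRUE.

```r
library(dplyr)

# Hypothetical data: one text column and two numeric columns
toy <- tibble(
  Site = c("A", "B"),
  SVL  = c(55, 63),
  Mass = c(4.1, 6.3)
)

# Roughly what where() does: call the function once per column
numeric_cols <- sapply(toy, is.numeric)
numeric_cols

# select() keeps the columns where the function returned TRUE
kept <- toy |> select(where(is.numeric)) |> names()
kept
```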
There's a lot more you can do with this if you want to get fancy; the documentation is available at ?tidyselect::language.
You can use arrange() to sort by one or more column values. To sort lizards from lowest to highest mass:
lizards |> arrange(Mass) |> View()
If you wish to sort from highest to lowest, use the desc() helper function:
lizards |> arrange(desc(Mass)) |> View()
When you have categorical variables, you'll often have ties:
lizards |> arrange(Site) |> View()
In this case, it's helpful to sort by multiple variables; the following code orders by Site, then by color morph within site, then by SVL.
lizards |> arrange(Site, Color_morph, SVL) |> View()
One particularly useful thing you can do with this is create rankings.
lizards |> arrange(desc(SVL)) |> mutate(size_rank = 1:n()) |> View()
Summarize is like mutate, but it generally produces columns that are shorter than the input. It's typically used for summary stats. For example, this calculates several characteristics of SVL.
lizards |> summarize(mean_SVL = mean(SVL), sd_SVL = sd(SVL), med_SVL = median(SVL), count = n()) |> View()
A useful trick for summarize is to take the mean of a logical vector; TRUE and FALSE are interpreted as 1 and 0, so this gives you a frequency. For example, if you wanted to get the proportion of color morphs:
lizards |> summarize(freq_Blue = mean(Color_morph == "Blue"), freq_Brown = mean(Color_morph == "Brown"), freq_Green = mean(Color_morph == "Green") )
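The same trick works on any logical vector, outside of summarize too. Here is a minimal base R sketch; the morphs vector is a made-up example, not taken from the lizard data.

```r
# Hypothetical color morph values
morphs <- c("Blue", "Brown", "Blue", "Green")

is_blue <- morphs == "Blue"  # TRUE FALSE TRUE FALSE

n_blue    <- sum(is_blue)   # TRUE counts as 1, so this is the number of blue morphs: 2
freq_blue <- mean(is_blue)  # and this is the proportion of blue morphs: 0.5
```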
You can use the across helper function to apply the same summary function to multiple columns.
lizards |> summarize( across(.cols = c(SVL, Tail), .fns = mean) ) |> View()
The .cols argument identifies the columns to use for the summary, using the same methods as select(); the .fns argument should be one or more functions to apply to each column. If you wish to use more than one summary function, you need to give each one a name by creating a named vector:
lizards |>
  summarize(
    across(.cols = where(is.numeric), # apply to all numeric columns
           .fns = c(Mean = mean, StDev = sd)) # named vector (Mean and StDev)
  ) |>
  View()
This applies the functions mean and sd to all numeric columns. Each result column gets the name we gave the function ("Mean" or "StDev") added to the end of the original column name.
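To make the naming pattern concrete, here is a small sketch on a hypothetical two-column table (toy and its values are invented). It applies the same pair of named functions and looks at the column names that come back.

```r
library(dplyr)

# Hypothetical measurements
toy <- tibble(
  SVL  = c(50, 60, 70),
  Tail = c(80, 100, 120)
)

toy_summary <- toy |>
  summarize(
    across(.cols = c(SVL, Tail),
           .fns = c(Mean = mean, StDev = sd))
  )

# One column per (input column, function) pair, named column_function
names(toy_summary)
# "SVL_Mean" "SVL_StDev" "Tail_Mean" "Tail_StDev"
```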
The real power of the dplyr package comes from being able to apply all of the above functions to grouped subsets of a data frame. To create a grouped table, use the group_by function:
lizards |> group_by(Site)
This doesn't appear to do much on its own; however, look what happens when you combine it with summarize:
lizards |> group_by(Site) |> summarize(mean_SVL = mean(SVL), sd_SVL = sd(SVL), count = n()) |> View()
This calculates the mean and SD of SVL, along with the number of lizards, for each site. You can also group by multiple factors:
lizards |>
  group_by(Site, Color_morph) |>
  summarize(mean_SVL = mean(SVL), count = n()) |>
  ungroup() |> # Removes grouping; usually a good idea at the end unless you want surprises
  View()
Grouping doesn't just work with summarize; for example, you can use it with mutate to find the relative height of each lizard within its site:
lizards |>
  group_by(Site) |>
  mutate(rel_height = Height/max(Height)) |> # max(Height) returns the max height in each site
  ungroup() |>
  ggplot(aes(x = Site, y = rel_height)) +
  geom_jitter(width = .2, height = 0)
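It can help to compare a grouped mutate with an ungrouped one side by side. The sketch below uses a hypothetical four-row table (toy, with made-up heights): the same expression, Height / max(Height), gives different answers depending on whether the data are grouped.

```r
library(dplyr)

# Hypothetical perch heights at two sites
toy <- tibble(
  Site   = c("A", "A", "B", "B"),
  Height = c(10, 20, 40, 80)
)

compared <- toy |>
  mutate(rel_overall = Height / max(Height)) |> # max over the whole table (80)
  group_by(Site) |>
  mutate(rel_within = Height / max(Height)) |>  # max within each site (20 for A, 80 for B)
  ungroup()

compared
```

Here rel_overall is 0.125, 0.25, 0.5, 1, while rel_within is 0.5, 1, 0.5, 1, because each site is scaled by its own maximum.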
You can also use this to easily calculate frequencies, for example by combining summarize and mutate:
lizards |>
  group_by(Site, Color_morph) |>
  summarize(count = n()) |> # count is the total number of each morph at each site
  group_by(Site) |> # Calculate color morph frequency at each site
  mutate(site_frequency = count / sum(count)) |>
  View()
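Here is the same pattern as a sketch on a tiny made-up table, so you can check the arithmetic by hand. The toy data are hypothetical; the .groups = "drop" argument is an optional addition that removes the grouping left over from summarize before regrouping by site.

```r
library(dplyr)

# Hypothetical sightings: three lizards at site A, two at site B
toy <- tibble(
  Site        = c("A", "A", "A", "B", "B"),
  Color_morph = c("Blue", "Blue", "Green", "Blue", "Brown")
)

freqs <- toy |>
  group_by(Site, Color_morph) |>
  summarize(count = n(), .groups = "drop") |>  # one row per site/morph combination
  group_by(Site) |>
  mutate(site_frequency = count / sum(count)) |> # divide by the total for that site
  ungroup()

freqs
```

At site A, blue is 2 of 3 lizards and green is 1 of 3; at site B, each morph is 1 of 2, so the frequencies within each site add up to 1.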
Some Exercises:
lizards |>
  # Put your summarize() code here
  # You should name your new columns max_height and mean_limb
  ggplot(aes(x = max_height, y = mean_limb)) +
  geom_smooth(method = "lm") +
  geom_point()