In Christopher-Peterson/Bio373L-Book: Does not matter.

Data manipulation with `dplyr` {#r-dplyr}

Let's talk about data manipulation. We'll be using the dplyr package, which is part of tidyverse. You can find a quick reference page under Help -> Cheatsheets in RStudio. First, we need to load our packages & data.

# This invisible block is a workaround for Travis CI not having the tidyverse meta-package; 
library(ggplot2)
library(dplyr)
library(readr)
library(tibble)
library(forcats)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
# Capture View()
real_view = View
View = function(x,...){
  # Scrollable table output
  # knitr::kable(head(x))
  x |> head(n = 10) |> 
  kableExtra::kbl() |> 
  kableExtra::kable_paper() |> 
  kableExtra::scroll_box(width = "100%")
}

library(tidyverse) 
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data

The Pipe (|>) {#r-dplyr-pipe}

The pipe ( |> ) operator strings functions together in a sequence. It takes the result of the function on its left and makes it the first argument to the function on the right. Let's say you wanted to calculate the base-12 log of the mean of the square root of the absolute value of numbers between -50 and 50. The traditional way to write that would be:

log(mean(sqrt(abs(-50:50))), base = 12)

This is rather difficult to read; it has a lot of nested parentheses, and you need to start from the inside and work your way out to see what's happening. With the pipe, you could re-write it like this:

-50:50 |> abs() |> sqrt() |> 
  mean() |> log(base = 12)

Using pipes can make your code much clearer, and is quite helpful when creating a sequence of related transformations on data.

In RStudio, you can insert the pipe by pressing Ctrl+Shift+M.

When looking for help online, you may also find a different version of the pipe (%>%); this is an older version of the pipe that is part of the tidyverse. It works similarly to |> but has a number of technical drawbacks. In most cases, you can use them interchangeably.

Adding/modifying columns (mutate) {#r-dplyr-mutate}

The mutate() function creates a new column in a data frame. For example, the total length of a lizard is defined as its snout-vent length (SVL) plus it's tail length.

mutate(.data = lizards, total_length = SVL + Tail) |> 
  View() # Use View to look at the results in RStudio

This uses the lizards data to create a new column, total_length. Within the mutate command, you can reference columns directly by their names (like you do for aes() in ggplot). Mutate can create multiple new columns in a single command.

mutate(lizards, # generally, the .data argument is not named
       # All subsequent arguments refer to new columns
       total_length = SVL + Tail,
       rel_limb = Limb/SVL,
       log_total_length = log(total_length),
       # You can also change an existing column by saving something to its name
       Color_morph = paste(Color_morph, "morph") # add "morph" after each color
       ) |> View()

Note that this doesn't modify the lizards dataset. It creates a new data frame that has an additional column. You'll need to save it as a new variable to use it.

lizards_full = lizards |> # It's also traditional to pipe the data argument in
  mutate(total_length = SVL + Tail,
         rel_limb = Limb/SVL,
         log_total_length = log(total_length))

Tidyverse functions are designed to be piped together. For example:

lizards |> 
  mutate(Total_length = SVL + Tail) |> 
  ggplot(aes(x = Site, y = Total_length)) +
  geom_boxplot()

Creating a quick plot at the end of a data manipulation step can be a good way to get a visual idea of what you're doing.

Here are a few other helpful things to do with mutate():

lizards |> mutate(
  intercept = 1, # Add a constant
  row_number = 1:n() # the n() function tells you how many rows are 
    # in the current data frame (it only works in mutate & related functions)
) |>  View()

Here are a few exercises to try:

Add a column to the lizards dataset that gives the lizard's height relative to the maximum height of any lizard (hint: use max(Height) in a mutate command to find that value).
Calculate perch circumference (Diameter * pi), then pipe that result into a scatter plot of relative limb length vs. circumference. Note that pi is a pre-defined variable in R.

Subsetting by row (filter) {#r-dplyr-filter}

Let's define "large lizards" as:

lizards |> 
  mutate(large = SVL > 60) |> 
  View()

The large column is a logical vector, with TRUE & FALSE values. We can use logical vectors to get a subset of the data frame where the vector is TRUE with the filter() function.

lizards |>
  mutate(large = SVL > 60) |> 
  filter(large) |> 
  View()

Note that only TRUE values of large remain the the data frame. It's not actually necessary to create a column before filtering:

lizards |> filter(SVL > 60) |> View()

The filter() command returns every row where its logical conditions are TRUE. Conditional statements that create logical statements include the following:

x == y: TRUE if x equals y (This is not the same as x = y)!
x != y: TRUE if x does not equal y
x > y: x >= y;
x < y: x <= y;
between(x, y, z): TRUE if x >= y AND x <= z
is.na(x): TRUE if x is a missing value (NA)
x %in% y: TRUE if x is an element in y

Note that all of these (except for x %in% y) are vectorized.

c(1, 2) == c(2, 3) - 1
c(1, 2, 3) == c(1, 4, 9)
c(1, 2) == c(2, 1) # positions don't match
c(1, 2, 3, 4, 5) %in% c(1, 2) # for %in%, position only matters for the left argument

You can also combine and modify logical values:

x & y: returns TRUE if both x and y are TRUE
x | y: returns TRUE if either x or y are TRUE
!x: returns the opposite of x; this one is particularly useful to combine with other logical functions; for example filter(data, !is.na(x)) will return all rows where x is not a missing value.

If you give filter() more than one condition, it applies all of them by combining them with &.

lizards |> 
  filter(Color_morph == "Blue",
         Site %in% c("A", "B", "C", "D", "E")) |> 
  ggplot(aes(x = SVL, y = Height, color = Site)) + geom_point()

Exercises:

Print a dataframe that shows only the lizards higher than 150 cm. How many are there (the console printout should tell you).
How many lizards perching on trees or shrubs are not brown? Visualize the height to diameter relationship between them. Hint: you can use either %in% or a combination of == and | to meet the first condition.

Subsetting by column (select) {#r-dplyr-select}

You can subset certain columns with the select() function. There are several ways to do this. The simplest is by name:

lizards |> select(Site, Color_morph, SVL) |> View()

You can also select by position:

lizards |> select(1, 2, 7) |> View()

This is more useful for ranges of values:

lizards |> select(1:4, 7) |> View()

You can also use character vectors:

lizards |> select("Site", "Color_morph", "SVL") |> View()

Note that if you want to use a variable that has column names saved as a character vector, you'll need to use a helper function (all_of) to tell select that you want to look for the contents of the variable, not the name of the variable:

select_vars = c("Site", "Color_morph", "SVL")
lizards |> 
  select(all_of(select_vars)) |> # without all_of, it would try to look for a column called "select_vars"
  View()

You can remove columns by using a negative sign. (Note that negative signs are ignored if you have any names without negative signs).

lizards |> select(-Color_morph, -Limb) |> View()

lizards |> select(-(1:5)) |> View()

You can also use the where helper function to select columns based on their characteristics. For example, the is.numeric function returns TRUE if its argument is a number; you can use it to select all numeric columns.

lizards |> 
  select(where(is.numeric)) |> 
  View()

You could do the same for text or logical vectors with is.character or is.logical, respectively. Note that there aren't parentheses after is.numeric in the above code. that's because we aren't calling it on any particular value; instead, the where function calls it on every column of the data frame, and we're just telling it what function to use.

There's a lot more you can do with this if you want to get fancy; the documentation is available at ?tidyselect::language.

Sorting by columns (arrange) {#r-dplyr-arrange}

You can use arrange() to sort by one or more column values. To sort lizards from lowest to highest mass:

lizards |> arrange(Mass) |> View()

If you wish to sort from highest to lowest, use the desc() helper function:

lizards |> arrange(desc(Mass)) |> View()

When you have categorical variables, you'll often have ties:

lizards |> arrange(Site) |> View()

In this case, it's helpful to sort by multiple variables; the following code orders by Site, then by color morph within site, then by SVL.

lizards |> arrange(Site, Color_morph, SVL) |> View()

One particularly useful thing you can do with this is create rankings.

lizards |> 
  arrange(desc(SVL)) |> 
  mutate(size_rank = 1:n()) |>
  View()

Summarizing data {#r-dplyr-summarize}

Summarize is like mutate, but it generally produces columns that are shorter than the input. It's typically used for summary stats. For example, this calculates several characteristics of SVL.

lizards |> 
  summarize(mean_SVL = mean(SVL), 
            sd_SVL = sd(SVL),
            med_SVL = median(SVL), 
            count = n()) |> 
  View()

A useful trick for summarize is to take the mean of a logical vector; TRUE and FALSE are interpreted as 1 and 0, so this gives you a frequency. For example, if you wanted to get the proportion of color morphs:

lizards |> 
  summarize(freq_Blue = mean(Color_morph == "Blue"),
            freq_Brown = mean(Color_morph == "Brown"),
            freq_Green = mean(Color_morph == "Green")
            )

You can use the across helper function to apply the same summary function to multiple rows.

lizards |> 
  summarize(
    across(.cols = c(SVL, Tail),
           .fns = mean)
    ) |> View()

The .cols argument identifies the columns to use for the summary, using the same methods as select(), the .fns should be one or more functions to apply to each column. If you wish to use more than one summary function, you need to create a named vector:

lizards |> 
  summarize(
    across(.cols = where(is.numeric), # apply to all numeric functions
           .fns = c(Mean = mean, StDev = sd)) # named vector (Mean and StDev)
    ) |> View()

This applies the functions mean and sd to all numeric columns; The results have the names "Mean" and "StDev" that we gave each function applied to the end of the column.

Group Operations {#r-dplyr-group}

The real power of the dplyr package comes from being able to apply all of the above functions to grouped subsets of a data frame. To create a grouped table use the group_by function:

lizards |> group_by(Site)

This doesn't appear to do much on its own; however, look what happens when you combine it with summarize:

lizards |> group_by(Site) |> 
  summarize(mean_SVL = mean(SVL),
            sd_SVL = sd(SVL),
            count = n()) |> 
  View()

This calculates the mean, SD of SVL for each site. You can group by multiple factors

lizards |> group_by(Site, Color_morph) |> 
  summarize(mean_SVL = mean(SVL), 
            count = n()) |> 
  ungroup() |> # Removes grouping; usually a good idea at the end unless you want surprises
  View()

Grouping doesn't just work with summarize; for example, you can use it with mutate to find the relative height of each lizard within its site:

lizards |> group_by(Site) |> 
  mutate(
    rel_height = Height/max(Height)) |> 
    # max(Height) returns the max height in each site
  ungroup() |> 
  ggplot(aes(x = Site, y = rel_height)) + 
  geom_jitter(width = .2, height = 0)

You can also use this to easily calculate frequencies:

An alternate way to do this would be to combine rsummariseandmutate`.

lizards |> 
  group_by(Site, Color_morph) |> 
  summarize(count = n()) |> 
  # count is the total number of each morph at each site
  group_by(Site) |> 
  # Calculate color morph frequency at each site
  mutate(site_frequency = count / sum(count)) |> 
  View()

Some Exercises:

Visualize the relationship between the maximum height at a site and the average limb length. Use this to help:

 lizards |> 
  # Put your summarize() code here
  # You should name your new columns max_height and mean_limb
    ggplot(aes(x = max_height, y = mean_limb)) + 
    geom_smooth(method = "lm") + geom_point()