In Christopher-Peterson/Bio373L-Book: Does not matter.

Wrangling, reshaping, and tidying data with `tidyr` {#r-tidyr}

Tidy data

Most of the previous appendices have required you to have your data in a tidy format. Tidy data consist of data frames with the following characteristics:

Each column is a different variable
Each row is a single observation
Each cell is a single value

The packages in the tidyverse are generally designed to operate on tidy data; however, other functions may require the data to be in a different shape. It's also frequently convenient to record field data in a non-tidy format. The tidyr package is designed to reshape your data. For more information on tidy data, please see this article.

We're going to be using some new datasets for this chapter; if you don't have them already, download them here and put them in your "example_data" directory.

# This invisible block is a workaround for Travis CI not having the tidyverse meta-package; 
library(ggplot2)
library(dplyr)
library(readr)
library(tibble)
library(tidyr)
library(forcats)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this
succession_data_wide <- read_csv("example_data/succession_wide_untidy.csv")
yeast_data <- read_csv("example_data/yeast_data_partial.csv")
knitr::opts_chunk$set(collapse = TRUE)
# Capture View()
# rm(View)
real_view = View
View = function(x,...){
  # Scrollable table output
  # knitr::kable(head(x))
  x |> head(n = 10) |> 
    kableExtra::kbl() |> 
    kableExtra::kable_paper() |> 
    kableExtra::scroll_box(width = "100%")
}

library(tidyverse) 
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
succession_data_wide <- read_csv("example_data/succession_wide_untidy.csv")
yeast_data <- read_csv("example_data/yeast_data_partial.csv")

Going from many columns to many rows: `pivot_longer()` {#pivot-longer-tutorial}

Let's look at the succession data file; this is slightly modified data from Fall 2018, which was collected as part of the first lab (Section \@ref(#lab1)).

View(succession_data_wide)

A quick description of the columns:

Type: Contains information on both the habitat type and Canopy/Understory
Team, Sample, Quadrant: identifiers for where the data were collected
Distance: Distance in meters fromt the central point to the tree
DBH: Diameter at breast height for canopy trees
16 species columns: These indicate whether the named species is present with a 1; there should only be a single 1 per row.

This clearly does not meet the definition of tidy data; the "Species" variable is split between columns, and the habitat type and tree type are both combined into a single "type" column.

The first step to tidying this data is consolidate the last sixteen columns; we're going to use pivot_longer() for that. In its simplest form, pivot_longer() converts several columns into two: one with the original column names, and the other with the original column values. This will result in a data frame with more rows than the original (hence, "longer").

succession_data_wide |> 
  pivot_longer(
    # The first argument is the column names you want to reshape
    # This uses the same syntax of select() and across() from dplyr
    cols = -c(Type:DBH), # minus sign grabs everything NOT between Type & DBH: 
    names_to = "Species", # Name of the column that stores the old column names
    values_to = "is_present") |>  # Name of the oclumn that stores old cell values
  View()

If you'll notice, the first 16 rows are identical in the Type:DBH columns, while Species and is_present vary. A few alternative ways to do the same thing:

# The only difference between these is the .cols argument
succession_data_wide |> pivot_longer(
    .cols = 7:22, # positions of the columns to pivot
    names_to = "Species", values_to = "is_present")
succession_data_wide |> pivot_longer(
    -c(Type, Team, Sample, Quadrant, Distance, DBH), # explicitly list names
    names_to = "Species", values_to = "is_present")
succession_data_wide |> pivot_longer(
  # You can use ranges of column names to keep, though that's not practical here
    `Acer negundo (Boxwood elder)`:`Ulmus crassifolia (Cedar elm)`,
    names_to = "Species", values_to = "is_present")
succession_data_wide |> pivot_longer(
  .cols = contains("("), # graps all columns with an open parentheses in them
  names_to = "Species", values_to = "is_present")

Note that the .cols argument generally doesn't have quotations around its column names, while the names_to and values_to do. The simplest explanation for it is that the .cols columns already exist in the data, so R knows how to find them; the others don't, so we use quotes to create the names (This isn't entirely true, but the full details are quite complicated and it's a useful rule of thumb for tidyverse functions).

One thing to note about our output is that there's a lot of NA values in is_present, indicating species that were not found at each point. We don't really care about those, so let's get rid of them.

succession_longer = 
  succession_data_wide |> pivot_longer(
    cols = -c(Type:DBH), names_to = "Species", values_to = "is_present") |>
  filter(!is.na(is_present)) |> # Remove missing values
  select(-is_present) # This column is no longer useful
View(succession_longer)

(( Add example to practice with genotype data))

Splitting and merging columns: `separate() and unite()` {#separate-tutorial}

To fully tidy this, we need to split the "Type" column into habitat type and tree type columns. The separate() function is useful for this.

succession_sep = succession_longer |> 
    separate(col = Type, # Column to separate; note that the arg is col, not .col
           into = c("Habitat", "Tree_type"), # Names of the new columns to separate into
           sep = "-") # the character used to mark the separation
View(succession_sep)

Note that the sep argument can be either text or a integer(s); if it's a character vector, the character(s) are removed during the split (as with above). If sep is an integer (or integer vector), then the split is made after that/those position(s) in the text without removing anything.

It would also be useful to have a column that specifically identifies the each sample point; currenlty, that information is split between the Habitat, Team, and Sample columns. The unite() function does this (it's the complement of separate).

succession_tidy = succession_sep |> 
  unite(col = "sample_point", # Name of new colum
        Habitat, Team, Sample, # columns to unite (note; these are all seaprate args)
        sep = "-", # separate the columns with a dash
        remove = FALSE # by default, unite() removes the columns to separate; this disables that because we want to keep Habitat
        ) |> 
  select(-Team, -Sample) 
succession_tidy |> View()

An alternative option to unite() is to use a combination of mutate() and paste():

succession_sep |> 
  mutate(sample_point = paste(Habitat, Team, Sample, sep = "-")) |> 
  select(-Team, -Sample) |> 
  View() # The column order will be different, but otherwise it's the same.

Pivoting multiple columns {#pivot-longer-multi-tutorial}

We're going to take a look a dataset from Brauer et al (2008). The experiment tested yeast gene expression under nutrient limitation (for several different nutrients & several different concentrations).

View(yeast_data)

The first four columns are single variables (gene name, biological process, molecular function, and systematic_id). The remaining columns (G0.05 through U0.3) contain TWO variables in their name: the nutrient that was added to the substrate (first letter) and rate at which it was added (the rest). The values in these columns is the gene expression level. To tidy the data, we need to convert these into three columns:

Substrate
Concentration
Gene expression

While we could do this with a combination of pivot_longer() and separate(), pivot_longer() can do both tasks at once with a few extra arguments.

yeast_data |> 
  pivot_longer(G0.05:U0.3,
    names_to = c("Substrate", "Concentration"), # The names will go into these columns
    names_sep = 1, # separate the names between columns after the first character
    values_to = "Gene_expression") |> 
  glimpse()

The names_sep argument works exactly like sep in separate().

Notice that Concentration is listed as a character vector (<chr>). It's probably a good idea to have it as a numeric vector instead. The names_transform argument lets you specify a function that can be applied to the names column(s) after they're re-shaped. In this case, we're going to use as.numeric() to convert it into a number.

# We're going to create a list that tells pivot_longer to use the 
  # as.numeric() function on Concentration
  # a list() is basically a big box you can put any other kind of data structure into
  # Each item in a list is called an elements; elements can be named or not
  # This list has one named element; if we wanted to transform more columns, we could include them here as well
convert_concentration = list(
  Concentration = as.numeric # Note there's no parentheses; this is because we're identifying the function, not calling it.
  ) 

# We'll talk more about lists later

tidy_yeast = yeast_data |> 
  pivot_longer(G0.05:U0.3,
    names_to = c("Substrate", "Concentration"), names_sep = 1, 
    names_transform = convert_concentration, # We could also have defined the list here
    values_to = "Gene_expression") 
tidy_yeast |> glimpse()

Combining `pivot_longer()` with `summarise()`

Here's a particularly powerful combination of tidyverse() functions you can use when you're exploring a new dataset: calculate summary statistics on all of your variables with summarise(across(...)), tidy the results with pivot_longer(), then create a faceted ggplot.

lizard_smry_by_color = lizards |> 
  # Group the data by color morph
  group_by(Color_morph) |> 
  summarize(
    across(where(is.numeric), # For all numeric columns...
           list(Mean = mean, Med = median, SD = sd)
           # run mean(), meidan(), and sd() on each group & variable
           # Note that the list elements follows the pattern Name = function
    ),N = n() ) # This is outside of across, since it's only run once per group
glimpse(lizard_smry_by_color)

This creates a data frame with columns for the color, the sample size, and a bunch that follow the pattern Trait_statistic. An ideal way to tidy these data would be reduce these columns into Trait, Mean, Median, and SD. We can do that like this:

lizard_smry_tidy = lizard_smry_by_color |> 
  pivot_longer(-c(Color_morph, N), # exclude the columns
               names_to = c("Trait", ".value"),
               names_sep = "_") |> 
  mutate(SE = SD/sqrt(N)) # may as well calculate standard error while we're here
View(lizard_smry_tidy)

This call didn't have a values_to argument in it; instead, one of the names_to was listed as ".value". This is a special indicator that tells pivot_longer to create one column for each matching name and place the corresponding values into it.

From here, you can make a nice little summary statistic plot:

lizard_smry_tidy |> 
  ggplot(aes(x = Color_morph)) + 
  facet_wrap(~Trait, scales = "free_y") +
  # SD:
  geom_linerange(aes(ymin = Mean - SD, ymax = Mean + SD), color = "cornflowerblue") + 
  # Confidence Interval:
  geom_errorbar(aes(ymin = Mean - SE * 1.96, ymax = Mean + SE * 1.96), 
                color = "red", width = .3) + 
  # Mean
  geom_point(aes(y = Mean), size = 2) + 
  ylab("Trait Mean, SD, and 95% CI")