tidyr
{#r-tidyr}Most of the previous appendices have required you to have your data in a tidy format. Tidy data consist of data frames with the following characteristics:
The packages in the tidyverse
are generally designed to operate on tidy data; however, other functions may require the data to be in a different shape. It's also frequently convenient to record field data in a non-tidy format. The tidyr
package is designed to reshape your data. For more information on tidy data, please see this article.
We're going to be using some new datasets for this chapter; if you don't have them already, download them here and put them in your "example_data" directory.
# This invisible block is a workaround for Travis CI not having the tidyverse meta-package; library(ggplot2) library(dplyr) library(readr) library(tibble) library(tidyr) library(forcats) library(cowplot) theme_set(theme_cowplot()) lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this succession_data_wide <- read_csv("example_data/succession_wide_untidy.csv") yeast_data <- read_csv("example_data/yeast_data_partial.csv") knitr::opts_chunk$set(collapse = TRUE) # Capture View() # rm(View) real_view = View View = function(x,...){ # Scrollable table output # knitr::kable(head(x)) x |> head(n = 10) |> kableExtra::kbl() |> kableExtra::kable_paper() |> kableExtra::scroll_box(width = "100%") }
library(tidyverse) library(cowplot) theme_set(theme_cowplot()) lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data succession_data_wide <- read_csv("example_data/succession_wide_untidy.csv") yeast_data <- read_csv("example_data/yeast_data_partial.csv")
pivot_longer()
{#pivot-longer-tutorial}Let's look at the succession data file; this is slightly modified data from Fall 2018, which was collected as part of the first lab (Section \@ref(#lab1)).
View(succession_data_wide)
A quick description of the columns:
This clearly does not meet the definition of tidy data; the "Species" variable is split between columns, and the habitat type and tree type are both combined into a single "type" column.
The first step to tidying this data is consolidate the last sixteen columns; we're going to use pivot_longer()
for that. In its simplest form, pivot_longer()
converts several columns into two: one with the original column names, and the other with the original column values. This will result in a data frame with more rows than the original (hence, "longer").
succession_data_wide |> pivot_longer( # The first argument is the column names you want to reshape # This uses the same syntax of select() and across() from dplyr cols = -c(Type:DBH), # minus sign grabs everything NOT between Type & DBH: names_to = "Species", # Name of the column that stores the old column names values_to = "is_present") |> # Name of the oclumn that stores old cell values View()
If you'll notice, the first 16 rows are identical in the Type:DBH
columns, while Species
and is_present
vary. A few alternative ways to do the same thing:
# The only difference between these is the .cols argument succession_data_wide |> pivot_longer( .cols = 7:22, # positions of the columns to pivot names_to = "Species", values_to = "is_present") succession_data_wide |> pivot_longer( -c(Type, Team, Sample, Quadrant, Distance, DBH), # explicitly list names names_to = "Species", values_to = "is_present") succession_data_wide |> pivot_longer( # You can use ranges of column names to keep, though that's not practical here `Acer negundo (Boxwood elder)`:`Ulmus crassifolia (Cedar elm)`, names_to = "Species", values_to = "is_present") succession_data_wide |> pivot_longer( .cols = contains("("), # graps all columns with an open parentheses in them names_to = "Species", values_to = "is_present")
Note that the .cols
argument generally doesn't have quotations around its column names, while the names_to
and values_to
do. The simplest explanation for it is that the .cols
columns already exist in the data, so R knows how to find them; the others don't, so we use quotes to create the names (This isn't entirely true, but the full details are quite complicated and it's a useful rule of thumb for tidyverse
functions).
One thing to note about our output is that there's a lot of NA
values in is_present
, indicating species that were not found at each point. We don't really care about those, so let's get rid of them.
succession_longer = succession_data_wide |> pivot_longer( cols = -c(Type:DBH), names_to = "Species", values_to = "is_present") |> filter(!is.na(is_present)) |> # Remove missing values select(-is_present) # This column is no longer useful View(succession_longer)
(( Add example to practice with genotype data))
separate() and unite()
{#separate-tutorial}To fully tidy this, we need to split the "Type" column into habitat type and tree type columns. The separate()
function is useful for this.
succession_sep = succession_longer |> separate(col = Type, # Column to separate; note that the arg is col, not .col into = c("Habitat", "Tree_type"), # Names of the new columns to separate into sep = "-") # the character used to mark the separation View(succession_sep)
Note that the sep
argument can be either text or a integer(s); if it's a character vector, the character(s) are removed during the split (as with above). If sep
is an integer (or integer vector), then the split is made after that/those position(s) in the text without removing anything.
It would also be useful to have a column that specifically identifies the each sample point; currenlty, that information is split between the Habitat
, Team
, and Sample
columns. The unite()
function does this (it's the complement of separate
).
succession_tidy = succession_sep |> unite(col = "sample_point", # Name of new colum Habitat, Team, Sample, # columns to unite (note; these are all seaprate args) sep = "-", # separate the columns with a dash remove = FALSE # by default, unite() removes the columns to separate; this disables that because we want to keep Habitat ) |> select(-Team, -Sample) succession_tidy |> View()
An alternative option to unite()
is to use a combination of mutate()
and paste()
:
succession_sep |> mutate(sample_point = paste(Habitat, Team, Sample, sep = "-")) |> select(-Team, -Sample) |> View() # The column order will be different, but otherwise it's the same.
We're going to take a look a dataset from Brauer et al (2008). The experiment tested yeast gene expression under nutrient limitation (for several different nutrients & several different concentrations).
View(yeast_data)
The first four columns are single variables (gene name, biological process, molecular function, and systematic_id). The remaining columns (G0.05
through U0.3
) contain TWO variables in their name: the nutrient that was added to the substrate (first letter) and rate at which it was added (the rest). The values in these columns is the gene expression level. To tidy the data, we need to convert these into three columns:
While we could do this with a combination of pivot_longer()
and separate()
, pivot_longer()
can do both tasks at once with a few extra arguments.
yeast_data |> pivot_longer(G0.05:U0.3, names_to = c("Substrate", "Concentration"), # The names will go into these columns names_sep = 1, # separate the names between columns after the first character values_to = "Gene_expression") |> glimpse()
The names_sep
argument works exactly like sep
in separate()
.
Notice that Concentration
is listed as a character vector (<chr>
). It's probably a good idea to have it as a numeric vector instead. The names_transform
argument lets you specify a function that can be applied to the names column(s) after they're re-shaped. In this case, we're going to use as.numeric()
to convert it into a number.
# We're going to create a list that tells pivot_longer to use the # as.numeric() function on Concentration # a list() is basically a big box you can put any other kind of data structure into # Each item in a list is called an elements; elements can be named or not # This list has one named element; if we wanted to transform more columns, we could include them here as well convert_concentration = list( Concentration = as.numeric # Note there's no parentheses; this is because we're identifying the function, not calling it. ) # We'll talk more about lists later tidy_yeast = yeast_data |> pivot_longer(G0.05:U0.3, names_to = c("Substrate", "Concentration"), names_sep = 1, names_transform = convert_concentration, # We could also have defined the list here values_to = "Gene_expression") tidy_yeast |> glimpse()
pivot_longer()
with summarise()
Here's a particularly powerful combination of tidyverse()
functions you can use when you're exploring a new dataset: calculate summary statistics on all of your variables with summarise(across(...))
, tidy the results with pivot_longer()
, then create a faceted ggplot
.
lizard_smry_by_color = lizards |> # Group the data by color morph group_by(Color_morph) |> summarize( across(where(is.numeric), # For all numeric columns... list(Mean = mean, Med = median, SD = sd) # run mean(), meidan(), and sd() on each group & variable # Note that the list elements follows the pattern Name = function ),N = n() ) # This is outside of across, since it's only run once per group glimpse(lizard_smry_by_color)
This creates a data frame with columns for the color, the sample size, and a bunch that follow the pattern Trait_statistic
. An ideal way to tidy these data would be reduce these columns into Trait
, Mean
, Median
, and SD
. We can do that like this:
lizard_smry_tidy = lizard_smry_by_color |> pivot_longer(-c(Color_morph, N), # exclude the columns names_to = c("Trait", ".value"), names_sep = "_") |> mutate(SE = SD/sqrt(N)) # may as well calculate standard error while we're here View(lizard_smry_tidy)
This call didn't have a values_to
argument in it; instead, one of the names_to
was listed as ".value"
. This is a special indicator that tells pivot_longer to create one column for each matching name and place the corresponding values into it.
From here, you can make a nice little summary statistic plot:
lizard_smry_tidy |> ggplot(aes(x = Color_morph)) + facet_wrap(~Trait, scales = "free_y") + # SD: geom_linerange(aes(ymin = Mean - SD, ymax = Mean + SD), color = "cornflowerblue") + # Confidence Interval: geom_errorbar(aes(ymin = Mean - SE * 1.96, ymax = Mean + SE * 1.96), color = "red", width = .3) + # Mean geom_point(aes(y = Mean), size = 2) + ylab("Trait Mean, SD, and 95% CI")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.