In astamm/teachr: Tutorials for Teaching Tidyverse-Based R Basics

library(learnr)
library(testwhat)
library(magrittr)

tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)

Disclaimer

This tutorial draws heavily on tutorials published on GitHub by RStudio and its Education team, mainly their 2-day internal R bootcamp and the RStudio Cloud primers, and on a blog post by David Robinson.

Tidy data

Data can come in many different shapes. Here are some ways of structuring exactly the same information. The following six tables all display the number of tuberculosis cases documented by the World Health Organization in Afghanistan, Brazil, and China in 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report.

tidyr::table1
tidyr::table2
tidyr::table3
tidyr::table4a
tidyr::table4b
tidyr::table5

The tidyverse tools used in these tutorials work best with one particular format, called tidy data. A data set is in tidy format if:

Every column is a variable;
Every row is an observation;
Every cell is a single value.

quiz(
  caption = "Tidy Data Quiz", 
  question(
    "Among the previous tables, which one(s) is/are tidy?", 
    answer("`table1`", correct = TRUE), 
    answer("`table2`"), 
    answer("`table3`"), 
    answer("`table4a`"), 
    answer("`table4b`"), 
    answer("`table5`"), 
    random_answer_order = TRUE
  )
)
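As a rough illustration of what the three rules buy in practice, the sketch below computes a tuberculosis rate per 10,000 people from table1. Because cases and population each sit in their own column, a single dplyr::mutate() call is enough. The column name rate_per_10k is our own choice, not something defined by tidyr.

```r
library(magrittr)

# In table1, cases and population are separate columns, so the rate
# is a plain column-wise computation.
tb_rate <- tidyr::table1 %>%
  dplyr::mutate(rate_per_10k = cases / population * 10000)

tb_rate
```

Doing the same with table4a and table4b would first require reshaping and joining the two tables, which is the kind of work the rest of this tutorial addresses.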

Why tidy data?

Let us take a look at a fun data set, called starwars, which lists characters from the Star Wars movies and some information about them. This data set ships with the dplyr package and can be accessed through:

dplyr::starwars

Suppose we want to focus on the height of the characters and calculate the mean height per gender. Let us look at a reduced data set with this focus in mind:

dplyr::starwars %>% 
  dplyr::select(name, height, gender) %>% 
  dplyr::arrange(gender, name, height) %>% 
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender, 
    color = "white", 
    bold = TRUE, 
    background = kableExtra::spec_color(
      x = gender %>% 
        forcats::as_factor() %>% 
        as.numeric(), 
      option = "E", 
      direction = -1
    )
  )) %>% 
  kableExtra::kable(escape = FALSE, align = "c") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )

Thanks to the tidy format, dplyr can effectively split the original tibble into several smaller tibbles, one for each level of the gender categorical variable:

starwars <- dplyr::starwars %>% 
  dplyr::select(name, height, gender) %>% 
  dplyr::arrange(gender, name, height) %>% 
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender, 
    color = "white", 
    bold = TRUE, 
    background = kableExtra::spec_color(
      x = gender %>% 
        forcats::as_factor() %>% 
        as.numeric(), 
      option = "E", 
      direction = -1
    )
  ))
t1 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bfemale\\b"))
t2 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bhermaphrodite\\b"))
t3 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bmale\\b"))
t4 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bnone\\b"))
t5 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bNA\\b"))
list(t1, t2, t3, t4, t5) %>% 
  kableExtra::kable(
    escape = FALSE, 
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )

The tidy format is extremely handy because:

You can filter into smaller tibbles by values of a given categorical variable and subsequently focus your analysis on a subset of observations that all share the same value for that categorical variable;
The manual splitting performed with dplyr::filter() can actually be done behind the scenes by dplyr::group_by(); it combines nicely with dplyr::summarise() to get summaries, visualisations (see next tutorial) or analysis results for each subset of observations created by splitting the original tibble by the values of a categorical variable.

For instance, back to the starwars data set, it is straightforward from the tibble in tidy format to get the average height of characters by gender:

starwars %>% 
  dplyr::group_by(gender) %>% 
  dplyr::summarise(height = mean(height, na.rm = TRUE)) %>% 
  kableExtra::kable(
    escape = FALSE, 
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )
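To make the link between manual splitting and grouping concrete, here is a minimal sketch that computes the mean height of one group both ways and checks that the results agree. It works on the raw dplyr::starwars data without any table styling. The labels stored in gender depend on the installed dplyr version, so the sketch picks a label from the data instead of hard-coding one; the helper names heights, one_group, manual_mean and by_gender are ours.

```r
library(magrittr)

heights <- dplyr::starwars %>%
  dplyr::select(name, height, gender)

# Pick the first non-missing gender label found in the data.
one_group <- heights$gender[!is.na(heights$gender)][1]

# Manual route: keep one group, then average its heights.
manual_mean <- heights %>%
  dplyr::filter(gender == one_group) %>%
  dplyr::pull(height) %>%
  mean(na.rm = TRUE)

# Grouped route: one summary row per gender, all groups at once.
by_gender <- heights %>%
  dplyr::group_by(gender) %>%
  dplyr::summarise(height = mean(height, na.rm = TRUE))

# The grouped summary contains the manually computed value.
grouped_mean <- by_gender$height[which(by_gender$gender == one_group)]
all.equal(manual_mean, grouped_mean)
```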

The tidyr package

General presentation of the features

The tidyr package is designed to help users easily reshape their imported data into tidy format, but it does more than that. Quoting from the tidyr website, its functions fall into five main categories:

Pivoting. It converts between long and wide forms. See pivot_longer(), pivot_wider(), and vignette("pivot", package = "tidyr") for more details.
Rectangling. It turns deeply nested lists (as from JSON) into tidy tibbles. See unnest_longer(), unnest_wider(), hoist() and vignette("rectangle", package = "tidyr") for more details.
Nesting. It converts grouped data to a form where each group becomes a single row containing a nested tibble, and unnesting does the opposite. See nest(), unnest() and vignette("nest", package = "tidyr") for more details.
Splitting and combining character columns. Use separate() and extract() to pull a single character column into multiple columns; use unite() to combine multiple columns into a single character column.
Handling missing values. Make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with next/previous value with fill() or a known value with replace_na().
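The missing-value helpers in the last category are not used elsewhere in this tutorial, so here is a small sketch of what they do. The counts tibble is a made-up toy data set (hypothetical sites, years and counts), used only for illustration.

```r
library(magrittr)

# Hypothetical toy data: the combination (site B, year 2021) was never
# recorded, and the count for (site A, year 2021) is an explicit NA.
counts <- tibble::tibble(
  site = c("A", "A", "B"),
  year = c(2020, 2021, 2020),
  n = c(5, NA, 7)
)

# complete(): add a row for every site/year combination, so that the
# implicit missing value for (B, 2021) becomes an explicit NA.
completed <- counts %>%
  tidyr::complete(site, year)

# replace_na(): replace missing counts with a known value.
zero_filled <- completed %>%
  tidyr::replace_na(list(n = 0))

# fill(): replace missing counts with the previous non-missing value.
carried <- completed %>%
  tidyr::fill(n, .direction = "down")

# drop_na(): remove the rows whose count is missing.
dropped <- completed %>%
  tidyr::drop_na(n)
```

Note that fill() here ignores the site boundary; in real data you would usually group by site first so that values are not carried from one site to the next.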

Making data tidy

The following animation illustrates how we can use the pivot_*() functions from the tidyr package to switch between the long and wide representations of the data:

knitr::include_graphics("images/tidyr-longer-wider.gif")

Now, let us look more closely at the syntax of the pivot_longer() function, which is the function you will typically use to make data tidy. To that end, consider the staff data set, extracted from a report of the American Association of University Professors (AAUP), a nonprofit membership association of faculty and other academic professionals. It reports the distribution of instructional staff employees for selected years between 1975 and 2011:

staff <- readr::read_csv("www/instructional-staff.csv")
staff

There are in fact three variables in this data set: faculty type, year and percentage. However, the .csv file does not explicitly report these as variables. In other words, the data has not been collected in a tidy format. Instead, each row in the CSV represents a faculty type, and the columns are the years for which we have the percentage data. The values are the percentage of hires of that type of faculty for each year. We can use tidyr to reshape the imported data into tidy format with a single function call:

staff_long <- staff %>%
  tidyr::pivot_longer(
    cols = -faculty_type, 
    names_to = "year", 
    values_to = "percentage"
  )
staff_long

The function that brings data into tidy format is tidyr::pivot_longer(). Let us comment on its syntax:

pivot_longer(data, cols, names_to = "name", values_to = "value")

The first argument is data as usual;
The second argument, cols, is where you specify which columns to pivot into longer format -- in this case all columns except for the faculty_type;
The third argument, names_to, is a string specifying the name of the column to create from the data stored in the column names of data -- in this case year;
The fourth argument, values_to, is a string specifying the name of the column to create from the data stored in cell values, in this case percentage.
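The four arguments can be tried out without the CSV file on a small stand-in. The sketch below uses a made-up tibble, staff_toy, that mimics the layout of the staff data (its labels and percentages are hypothetical, not taken from the AAUP report). It also uses the names_transform argument to store year as an integer rather than a string, which assumes a tidyr version that provides that argument (1.1.0 or later, to our knowledge), and pivots back with pivot_wider() to show that the two functions are inverses.

```r
library(magrittr)

# Hypothetical stand-in for the staff data: one row per faculty type,
# one column per year. The numbers are invented for illustration.
staff_toy <- tibble::tibble(
  faculty_type = c("Full-time", "Part-time"),
  `1975` = c(60, 40),
  `2011` = c(45, 55)
)

# Long (tidy) form: one row per faculty type and year.
staff_toy_long <- staff_toy %>%
  tidyr::pivot_longer(
    cols = -faculty_type,
    names_to = "year",
    values_to = "percentage",
    names_transform = list(year = as.integer)
  )

# Back to the wide form: years become column names again.
staff_toy_wide <- staff_toy_long %>%
  tidyr::pivot_wider(
    names_from = year,
    values_from = percentage
  )
```

Each of the two year columns contributes one row per faculty type, so the long form has four rows and three columns.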

`separate()`

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:

tidyr::table3

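The rate column contains both the cases and population variables, and we need to split it into two columns. separate() takes the name of the column to separate and the names of the columns to separate into. The separator can be a character, as in the first call below, or an integer position, as in the second call, which splits year after its second character into century and year: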

tidyr::table3 %>% 
  tidyr::separate(
    col = rate, 
    into = c("cases", "population"), 
    sep = "/", 
    convert = TRUE
  ) %>% 
  tidyr::separate(
    col = year, 
    into = c("century", "year"), 
    sep = 2, 
    convert = FALSE
  )

knitr::include_graphics("images/tidy-17.png")

unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.
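As a brief sketch of unite(), consider table5, in which the year is already split into century and year columns. The call below glues them back together; the output column name full_year is our own choice, and sep = "" is needed because unite() inserts an underscore between the pieces by default.

```r
library(magrittr)

# Combine the century and year columns of table5 into a single column.
# By default the input columns are removed from the result.
table5_united <- tidyr::table5 %>%
  tidyr::unite(col = "full_year", century, year, sep = "")

table5_united
```

The resulting column is still a character column; convert it with as.integer() if a number is needed.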

Exercise

Background: gene expression in starvation

Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA -- what we call being expressed (if a gene is never turned into RNA, it may as well not be there at all). This provides a sort of cellular switchboard that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.

Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium) but sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

Starving the yeast of these nutrients lets us find genes that:

Raise or lower their activity in response to growth rate. Growth-rate dependent expression patterns can tell us a lot about cell cycle control, and how the cell responds to stress.
Respond differently when different nutrients are being limited. These genes may be involved in the transport or metabolism of those nutrients.

The original gene expression data set

Let us look at the original gene expression data set:

readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds", 
  delim = "\t"
)

Each of those columns like G0.05, N0.3 and so on represents gene expression values for that sample, as measured by the microarray. The column titles show the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was .05. A higher value means the gene was more expressed in that sample, lower means the gene was less expressed. In total the yeast was grown with 6 limiting nutrients and 6 growth rates, which makes 36 samples, and therefore 36 columns, of gene expression data.

What is untidy about this dataset?

Column headers are values, not variable names. Our column names contain the values of two variables: nutrient (G, N, P, etc) and growth rate (0.05-0.3). For this reason, we end up with not one observation per row, but 36! This is a very common issue in biological datasets: you often see one-row-per-gene and one-column-per-sample, rather than one-row-per-gene-per-sample.
Multiple variables are stored in one column. The NAME column contains lots of information, split up by ||’s. If we examine one of the names, it looks like:

SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129

which contains both systematic IDs and biological information about the gene. The details of these fields are not annotated in the paper, but we can figure out most of them. The name contains:

Gene name e.g. SFB2. Note that not all genes have a name;
Biological process e.g. proteolysis and peptidolysis;
Molecular function e.g. metalloendopeptidase activity;
Systematic ID e.g. YNL049C. Unlike a gene name, every gene in this dataset has a systematic ID;
Another ID number e.g. 1082129.
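As a hint for this second issue, the sketch below splits the single example name shown above into its five fields with separate(). It works on a one-row tibble built by hand rather than on the full data set. The sep argument is a regular expression, so the two vertical bars must be escaped, and the column names passed to into are our own choice (number in particular is a hypothetical label for the last ID).

```r
library(magrittr)

# One-row tibble holding the example NAME value quoted above.
one_gene <- tibble::tibble(
  NAME = "SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129"
)

split_gene <- one_gene %>%
  # Split on the literal "||" separator (escaped, since sep is a regex).
  tidyr::separate(
    col = NAME,
    into = c("name", "bp", "mf", "systematic_name", "number"),
    sep = "\\|\\|"
  ) %>%
  # Remove the spaces left around each piece.
  dplyr::mutate(dplyr::across(dplyr::everything(), trimws))

split_gene
```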

Your turn

Tidy the data to end up with the following 7 variables:

name: Gene name.
bp: Biological process.
mf: Molecular function.
systematic_name: Systematic ID.
nutrient: Limiting nutrient; this has six possible values: glucose (G), ammonium (N), sulfate (S), phosphate (P), uracil (U) or leucine (L).
rate: Growth rate; a number, ranging from .05 to .3. .05 means slow growth (the yeast were being starved hard of that nutrient) while .3 means fast growth.
expression: Expression level; these are the values currently stored in those columns, as measured by the microarray.

original_data <- readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds", 
  delim = "\t"
)

astamm/teachr documentation built on Jan. 12, 2023, 7:21 a.m.
