library(learnr)
library(testwhat)
library(magrittr)

tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)

Disclaimer

This tutorial draws heavily on tutorials published on GitHub by RStudio and its Education team, mainly their 2-day internal R bootcamp and the RStudio Cloud primers, and on a blog post by David Robinson.

Tidy data

Data can come in many different shapes. Here are some ways of structuring the exact same information. The following six tables all display the number of tuberculosis cases documented by the World Health Organization in Afghanistan, Brazil, and China in 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report.

tidyr::table1
tidyr::table2
tidyr::table3
tidyr::table4a
tidyr::table4b
tidyr::table5

The tidyverse packages used in this tutorial work best with one particular format, called tidy data. A data set is in tidy format if:

  1. Every column is a variable;
  2. Every row is an observation;
  3. Every cell is a single value.
quiz(
  caption = "Tidy Data Quiz", 
  question(
    "Among the previous tables, which one(s) is/are tidy?", 
    answer("`table1`", correct = TRUE), 
    answer("`table2`"), 
    answer("`table3`"), 
    answer("`table4a`"), 
    answer("`table4b`"), 
    answer("`table5`"), 
    random_answer_order = TRUE
  )
)
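
To make the three rules concrete, here is a minimal sketch of what they buy you: when every variable has its own column, a new variable is a single vectorised operation on existing columns. The name rates and the per-10,000 scaling are illustrative choices, not part of the original tutorial.

```r
library(magrittr)

# tidyr::table1 stores country, year, cases and population in separate
# columns, so a derived variable only needs one mutate() call.
rates <- tidyr::table1 %>%
  dplyr::mutate(rate = cases / population * 10000)  # cases per 10,000 inhabitants

rates
```

The same computation on the other layouts would first require reshaping or string splitting.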

Why tidy data?

Let us take a look at a fun data set, called starwars, which lists characters from the Star Wars movies along with some information about them. This data set is part of the dplyr package and can be accessed through:

dplyr::starwars

Suppose we want to focus on the height of the characters and calculate the mean height per gender. Let us look at a reduced data set with this focus in mind:

dplyr::starwars %>% 
  dplyr::select(name, height, gender) %>% 
  dplyr::arrange(gender, name, height) %>% 
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender, 
    color = "white", 
    bold = TRUE, 
    background = kableExtra::spec_color(
      x = gender %>% 
        forcats::as_factor() %>% 
        as.numeric(), 
      option = "E", 
      direction = -1
    )
  )) %>% 
  kableExtra::kable(escape = FALSE, align = "c") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )

Thanks to the tidy format, dplyr can effectively split the original tibble into several smaller tibbles, one for each level of the categorical variable gender:

starwars <- dplyr::starwars %>% 
  dplyr::select(name, height, gender) %>% 
  dplyr::arrange(gender, name, height) %>% 
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender, 
    color = "white", 
    bold = TRUE, 
    background = kableExtra::spec_color(
      x = gender %>% 
        forcats::as_factor() %>% 
        as.numeric(), 
      option = "E", 
      direction = -1
    )
  ))
t1 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bfemale\\b"))
t2 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bhermaphrodite\\b"))
t3 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bmale\\b"))
t4 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bnone\\b"))
t5 <- starwars %>% 
  dplyr::filter(stringr::str_detect(gender, "\\bNA\\b"))
list(t1, t2, t3, t4, t5) %>% 
  kableExtra::kable(
    escape = FALSE, 
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )

The tidy format is extremely handy because:

  1. You can filter the data into smaller tibbles by the values of a categorical variable, and then focus your analysis on a subset of observations that all share the same value of that variable;
  2. The manual splitting done with dplyr::filter() can be performed behind the scenes by dplyr::group_by(), which combines nicely with dplyr::summarise() to produce summaries, visualisations (see next tutorial) or analysis results for each subset of observations obtained by splitting the original tibble by the values of a categorical variable.

For instance, going back to the starwars data set, the tidy format makes it straightforward to get the average height of characters by gender:

starwars %>% 
  dplyr::group_by(gender) %>% 
  dplyr::summarise(height = mean(height, na.rm = TRUE)) %>% 
  kableExtra::kable(
    escape = FALSE, 
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"), 
    full_width = FALSE
  )
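
The equivalence between manual splitting and grouping can be checked directly. The sketch below works on the unstyled dplyr::starwars data. It picks the first non-missing gender level rather than hard-coding one, because the level labels differ between dplyr versions. The names heights, lvl, manual_mean, grouped and grouped_mean are illustrative.

```r
library(magrittr)

heights <- dplyr::starwars %>%
  dplyr::select(name, height, gender)

# Pick one existing level of gender without assuming its label.
lvl <- heights$gender[!is.na(heights$gender)][1]

# Manual approach: filter one level, then average its heights.
manual_mean <- heights %>%
  dplyr::filter(gender == lvl) %>%
  dplyr::pull(height) %>%
  mean(na.rm = TRUE)

# Grouped approach: every level is handled in a single pipeline.
grouped <- heights %>%
  dplyr::group_by(gender) %>%
  dplyr::summarise(mean_height = mean(height, na.rm = TRUE), n = dplyr::n())

grouped_mean <- grouped %>%
  dplyr::filter(gender == lvl) %>%
  dplyr::pull(mean_height)

all.equal(manual_mean, grouped_mean)
```

The grouped version scales to any number of levels without additional code, which is the main practical argument for it.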

The tidyr package

General presentation of the features


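The tidyr package is designed to help users reshape their imported data into tidy format, but it does more than that. According to the package website, tidyr functions fall into five main categories:

  1. Pivoting, which converts between long and wide forms, with pivot_longer() and pivot_wider();
  2. Rectangling, which turns deeply nested lists (as from JSON) into tidy tibbles;
  3. Nesting, which converts grouped data to a form where each group becomes a single row containing a nested data frame, and unnesting, which does the opposite;
  4. Splitting and combining character columns, with separate(), extract() and unite();
  5. Handling missing values, with complete(), drop_na(), fill() and replace_na().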

Making data tidy

The following animation illustrates how we can use the pivot_*() functions from the tidyr package to switch between long and wide representations of the data:

knitr::include_graphics("images/tidyr-longer-wider.gif")

Now, let us look more closely at the syntax of the pivot_longer() function, which is the function you will most often need to make data tidy. To that end, consider the staff data set, extracted from a report of the American Association of University Professors (AAUP), a nonprofit membership association of faculty and other academic professionals. It reports the distribution of instructional staff employees for selected years between 1975 and 2011:

staff <- readr::read_csv("www/instructional-staff.csv")
staff

There are in fact three variables in this data set: faculty type, year and percentage. However, the .csv file does not explicitly report these as variables. In other words, the data has not been collected in a tidy format. Instead, each row in the CSV represents a faculty type, and the columns are the years for which we have the percentage data. The values are the percentages of hires of that type of faculty in each year. We can use tidyr to reshape the imported data into tidy format with a single function call:

staff_long <- staff %>%
  tidyr::pivot_longer(
    cols = -faculty_type, 
    names_to = "year", 
    values_to = "percentage"
  )
staff_long
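
Because the CSV file is not reproduced here, the following sketch uses a hypothetical stand-in, staff_toy, with the same layout and invented numbers. It illustrates two points: the former column names arrive as text unless converted (here through names_transform, available in recent tidyr versions), and pivot_wider() undoes the reshaping.

```r
library(magrittr)

# Hypothetical stand-in for the staff data: same layout, invented numbers.
staff_toy <- tibble::tibble(
  faculty_type = c("Full-time", "Part-time"),
  `1975` = c(60, 40),
  `2011` = c(40, 60)
)

staff_toy_long <- staff_toy %>%
  tidyr::pivot_longer(
    cols = -faculty_type,
    names_to = "year",
    values_to = "percentage",
    names_transform = list(year = as.integer)  # column names are text; store year as integer
  )

# Going back to the wide layout recovers the original table.
staff_toy_wide <- staff_toy_long %>%
  tidyr::pivot_wider(names_from = year, values_from = percentage)

staff_toy_long
staff_toy_wide
```

Storing year as a number matters as soon as it is used on a plot axis or in a model.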

The function that brings data into tidy format is tidyr::pivot_longer(). Let us comment on its syntax:

pivot_longer(data, cols, names_to = "name", values_to = "value")

Here data is the data frame to reshape; cols selects the columns to pivot into longer format, using tidyselect syntax (for example, -faculty_type selects every column except faculty_type); names_to gives the name of the new column that will hold the former column names; and values_to gives the name of the new column that will hold the cell values.

separate()

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:

tidyr::table3

The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into:

tidyr::table3 %>% 
  tidyr::separate(
    col = rate, 
    into = c("cases", "population"), 
    sep = "/", 
    convert = TRUE
  ) %>% 
  tidyr::separate(
    col = year, 
    into = c("century", "year"), 
    sep = 2, 
    convert = FALSE
  )
knitr::include_graphics("images/tidy-17.png")

unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.
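
As a minimal sketch of that inverse relationship, the code below splits the year column of table3 as above and then glues the two pieces back together. The names split_year, reunited and full_year are illustrative.

```r
library(magrittr)

# Split year into century and year, as in the separate() example above.
split_year <- tidyr::table3 %>%
  tidyr::separate(col = year, into = c("century", "year"), sep = 2)

# unite() pastes the pieces back together; sep = "" avoids the default "_".
reunited <- split_year %>%
  tidyr::unite(col = "full_year", century, year, sep = "")

reunited
```

Note that the reunited column is of character type, so it has to be converted explicitly if a numeric year is needed.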

Exercise

Background: gene expression in starvation

Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA -- what we call being expressed (if a gene is never turned into RNA, it may as well not be there at all). This provides a sort of cellular switchboard that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.

Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium) but sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).

Starving the yeast of these nutrients lets us find genes that:

The original gene expression data set

Let us look at the original gene expression data set:

readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds", 
  delim = "\t"
)

Each of those columns like G0.05, N0.3 and so on contains gene expression values for that sample, as measured by the microarray. The column titles show the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was 0.05. A higher value means the gene was more expressed in that sample; a lower value means it was less expressed. In total the yeast was grown with 6 limiting nutrients and 6 growth rates, which makes 36 samples, and therefore 36 columns, of gene expression data.
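
As a hint at how such headers can be decoded, here is a minimal sketch on a hypothetical miniature table: the gene names and expression values in toy are invented, and only the header pattern (one letter for the nutrient followed by the growth rate) comes from the description above. It relies on the names_sep argument of pivot_longer(), which accepts a character position at which to split.

```r
library(magrittr)

# Hypothetical miniature of the expression table: invented genes and values.
toy <- tibble::tibble(
  gene  = c("geneA", "geneB"),
  G0.05 = c(-0.24, 0.28),
  N0.3  = c(0.11, -0.45)
)

toy_long <- toy %>%
  tidyr::pivot_longer(
    cols = -gene,
    names_to = c("nutrient", "rate"),
    names_sep = 1,                              # split each header after its first character
    names_transform = list(rate = as.numeric),  # "0.05" becomes 0.05
    values_to = "expression"
  )

toy_long
```

Each row of toy_long is then one gene in one sample, with the nutrient and the growth rate stored as separate variables.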

What is untidy about this dataset?

SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129

Such an entry holds both systematic IDs and biological information about the gene. The details of each of these fields aren’t annotated in the paper, but we can figure out most of them. It contains:

Your turn

Tidy the data to end up with the following 7 variables:

original_data <- readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds", 
  delim = "\t"
)


astamm/teachr documentation built on Jan. 12, 2023, 7:21 a.m.