library(learnr)
library(testwhat)
library(magrittr)
tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)
Large parts of this tutorial are built from tutorials published on GitHub by RStudio and its Education team, mainly from their 2-day internal R bootcamp and the RStudio Cloud primers, and from a blog post by David Robinson.
Data can come in many different shapes. Here are some ways of structuring the exact same information content. The following six tables all display the number of tuberculosis cases documented by the World Health Organization in Afghanistan, Brazil, and China in 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout. The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report.
tidyr::table1
tidyr::table2
tidyr::table3
tidyr::table4a
tidyr::table4b
tidyr::table5
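R prefers just one format called tidy data. A data set is in tidy format if:
- each variable is in its own column;
- each observation is in its own row;
- each value is in its own cell.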
quiz(
  caption = "Tidy Data Quiz",
  question(
    "Among the previous tables, which one(s) is/are tidy?",
    answer("`table1`", correct = TRUE),
    answer("`table2`"),
    answer("`table3`"),
    answer("`table4a`"),
    answer("`table4b`"),
    answer("`table5`"),
    random_answer_order = TRUE
  )
)
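As a small illustration of why the tidy layout is convenient, the sketch below uses the table1 and table2 data sets that ship with tidyr. The rate per 10,000 people and the object names rates and table2_tidy are illustrative choices, not part of the original tutorial.

```r
library(magrittr)

# With the tidy table1, a new variable is a single column-wise computation.
rates <- tidyr::table1 %>%
  dplyr::mutate(rate = cases / population * 10000)

# table2 stores the variable names in `type` and the values in `count`,
# so it first has to be widened before the same computation is possible.
table2_tidy <- tidyr::table2 %>%
  tidyr::pivot_wider(names_from = type, values_from = count)

rates
table2_tidy
```

After widening, table2_tidy has the same columns as table1 (country, year, cases, population).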
Let us take a look at a fun data set, called starwars, which lists characters from the Star Wars movies and some information about them. This data set is part of the dplyr package and can be accessed through:
dplyr::starwars
Suppose we want to focus on the height of the characters and calculate the mean height per gender. Let us look at a reduced data set with this focus in mind:
dplyr::starwars %>%
  dplyr::select(name, height, gender) %>%
  dplyr::arrange(gender, name, height) %>%
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender,
    color = "white",
    bold = TRUE,
    background = kableExtra::spec_color(
      x = gender %>% forcats::as_factor() %>% as.numeric(),
      option = "E",
      direction = -1
    )
  )) %>%
  kableExtra::kable(escape = FALSE, align = "c") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Thanks to the tidy format, dplyr can effectively split the original tibble into several smaller tibbles, one for each level of the gender categorical variable:
starwars <- dplyr::starwars %>%
  dplyr::select(name, height, gender) %>%
  dplyr::arrange(gender, name, height) %>%
  dplyr::mutate(gender = kableExtra::cell_spec(
    x = gender,
    color = "white",
    bold = TRUE,
    background = kableExtra::spec_color(
      x = gender %>% forcats::as_factor() %>% as.numeric(),
      option = "E",
      direction = -1
    )
  ))
t1 <- starwars %>%
  dplyr::filter(stringr::str_detect(gender, "\\bfemale\\b"))
t2 <- starwars %>%
  dplyr::filter(stringr::str_detect(gender, "\\bhermaphrodite\\b"))
t3 <- starwars %>%
  dplyr::filter(stringr::str_detect(gender, "\\bmale\\b"))
t4 <- starwars %>%
  dplyr::filter(stringr::str_detect(gender, "\\bnone\\b"))
t5 <- starwars %>%
  dplyr::filter(stringr::str_detect(gender, "\\bNA\\b"))
list(t1, t2, t3, t4, t5) %>%
  kableExtra::kable(
    escape = FALSE,
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
The tidy format is extremely handy because:
- the splitting done manually above with dplyr::filter() can actually be performed behind the scenes by dplyr::group_by();
- it can nicely be combined with dplyr::summarise() to get summaries, visualisations (see next tutorial) or analysis results based on the individual subsets of observations created by splitting the original tibble by value of a categorical variable.

For instance, back to the starwars data set, it is straightforward from the tibble in tidy format to get the average height of characters by gender:
starwars %>%
  dplyr::group_by(gender) %>%
  dplyr::summarise(height = mean(height, na.rm = TRUE)) %>%
  kableExtra::kable(
    escape = FALSE,
    align = "c"
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
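To make the link between manual splitting and grouping concrete, here is a minimal sketch on a made-up tibble. The object heights, its values, and the group labels "x" and "y" are hypothetical and only serve the illustration. It checks that filtering one group by hand gives the same mean as the grouped summary.

```r
library(magrittr)

# Hypothetical toy data: four individuals in two groups, one missing height.
heights <- tibble::tibble(
  name   = c("A", "B", "C", "D"),
  height = c(150, 170, 180, NA),
  group  = c("x", "x", "y", "y")
)

# Manual split: keep one group, then average its heights.
mean_x <- heights %>%
  dplyr::filter(group == "x") %>%
  dplyr::pull(height) %>%
  mean(na.rm = TRUE)

# Same computation for every group at once.
by_group <- heights %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(height = mean(height, na.rm = TRUE))

mean_x
by_group
```

The grouped version scales to any number of groups without writing one filter per level.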
The tidyr package is designed to help users easily reshape their imported data into tidy format, but it does more than that. Quoting from the website, tidyr functions fall into five main categories:
- Pivoting, which converts between long and wide forms. See pivot_longer(), pivot_wider(), and vignette("pivot", package = "tidyr") for more details.
- Rectangling, which turns deeply nested lists (as from JSON) into tidy tibbles. See unnest_longer(), unnest_wider(), hoist() and vignette("rectangle", package = "tidyr") for more details.
- Nesting, which converts grouped data to a form where each group becomes a single row containing a nested data frame, and unnesting, which does the opposite. See nest(), unnest() and vignette("nest", package = "tidyr") for more details.
- Splitting and combining character columns. Use separate() and extract() to pull a single character column into multiple columns; use unite() to combine multiple columns into a single character column.
- Handling missing values. Make implicit missing values explicit with complete(); make explicit missing values implicit with drop_na(); replace missing values with next/previous value with fill() or a known value with replace_na().

The following animation illustrates how we can use the pivot_*() functions from the tidyr package to alternate between long and wide representations of the data:
knitr::include_graphics("images/tidyr-longer-wider.gif")
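The round trip between the two representations can be sketched on a small made-up tibble. The object names wide, long and wide_again, and the column names key and val, are illustrative assumptions.

```r
library(magrittr)

# Hypothetical wide table: one row per id, one column per measured key.
wide <- tibble::tibble(
  id = c(1, 2),
  x  = c("a", "b"),
  y  = c("c", "d")
)

# Wide to long: column names go to `key`, cell values go to `val`.
long <- wide %>%
  tidyr::pivot_longer(cols = c(x, y), names_to = "key", values_to = "val")

# Long to wide: the inverse operation rebuilds one column per key.
wide_again <- long %>%
  tidyr::pivot_wider(names_from = key, values_from = val)

long
wide_again
```

Here long has one row per (id, key) pair, and wide_again holds the same values as wide.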
Now, let us go a little deeper into the syntax of the pivot_longer() function, which is the function you will most often need to make data tidy. To that end, let us look at the staff data set, which is an extract from a report of the American Association of University Professors (AAUP), a nonprofit membership association of faculty and other academic professionals. It reports the distribution of instructional staff employees for some years between 1975 and 2011:
staff <- readr::read_csv("www/instructional-staff.csv")
staff
There are in fact 3 variables in this data set: faculty type, year and percentage. However, the .csv file does not explicitly report these as variables. In other words, the data has not been collected in a tidy format. Instead, each row in the CSV represents a faculty type, and the columns are the years for which we have the percentage data. The values are the percentage of hires of that type of faculty for each year. We can use tidyr to reshape the imported data into tidy format using only one function call:
staff_long <- staff %>%
  tidyr::pivot_longer(
    cols = -faculty_type,
    names_to = "year",
    values_to = "percentage"
  )
staff_long
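Because the years start out as column names, a call like the one above yields a character year column. The sketch below shows one possible way to get an integer year in the same call through the names_transform argument of pivot_longer(), assuming tidyr 1.1.0 or later. The tibble staff_toy and its values are made up for illustration and are not the AAUP figures.

```r
library(magrittr)

# Hypothetical stand-in for the staff data: one row per faculty type,
# one column per year. The percentages are invented.
staff_toy <- tibble::tibble(
  faculty_type = c("Type A", "Type B"),
  `1975` = c(40, 60),
  `2011` = c(30, 70)
)

staff_toy_long <- staff_toy %>%
  tidyr::pivot_longer(
    cols = -faculty_type,
    names_to = "year",
    values_to = "percentage",
    # Convert the former column names from character to integer.
    names_transform = list(year = as.integer)
  )

staff_toy_long
```

An equivalent alternative is to pivot first and convert afterwards with dplyr::mutate(year = as.integer(year)).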
The function to bring data into tidy format is tidyr::pivot_longer(). Let us comment on its syntax:
pivot_longer(data, cols, names_to = "name", values_to = "value")
- the first argument is data, as usual;
- the second argument, cols, is where you specify which columns to pivot into longer format (in this case, all columns except faculty_type);
- the third argument, names_to, is a string specifying the name of the column to create from the data stored in the column names of data (in this case, year);
- the fourth argument, values_to, is a string specifying the name of the column to create from the data stored in cell values (in this case, percentage).

separate()
separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:
tidyr::table3
The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into:
tidyr::table3 %>%
  tidyr::separate(
    col = rate,
    into = c("cases", "population"),
    sep = "/",
    convert = TRUE
  ) %>%
  tidyr::separate(
    col = year,
    into = c("century", "year"),
    sep = 2,
    convert = FALSE
  )
knitr::include_graphics("images/tidy-17.png")
unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.
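A minimal sketch of unite(), using the table5 data set shipped with tidyr, where the year is stored in two columns, century and year. The new column name full_year is an illustrative choice. Note that unite() separates values with an underscore by default, hence the explicit sep = "".

```r
library(magrittr)

# Glue `century` and `year` back together into a single character column.
table5_united <- tidyr::table5 %>%
  tidyr::unite(col = "full_year", century, year, sep = "")

table5_united
```

The resulting full_year column is of type character; convert it with as.integer() if a numeric year is needed.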
Through the process of gene regulation, a cell can control which genes are transcribed from DNA to RNA -- what we call being expressed (if a gene is never turned into RNA, it may as well not be there at all). This provides a sort of cellular switchboard that can activate some systems and deactivate others, which can speed up or slow down growth, switch what nutrients are transported into or out of the cell, and respond to other stimuli. A gene expression microarray lets us measure how much of each gene is expressed in a particular condition. We can use this to figure out the function of a specific gene (based on when it turns on and off), or to get an overall picture of the cell’s activity.
Brauer et al. (2008) used microarrays to test the effect of starvation and growth rate on baker’s yeast (a popular model organism for studying molecular genomics because of its simplicity). Basically, if you give yeast plenty of nutrients (a rich medium), except that you sharply restrict its supply of one nutrient, you can control the growth rate to whatever level you desire (we do this with a tool called a chemostat). For example, you could limit the yeast’s supply of glucose (sugar, which the cell metabolizes to get energy and carbon), of leucine (an essential amino acid), or of ammonium (a source of nitrogen).
Starving the yeast of these nutrients lets us find genes that:
Let us look at the original gene expression data set:
readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds",
  delim = "\t"
)
Each of those columns like G0.05, N0.3 and so on represents gene expression values for that sample, as measured by the microarray. The column titles show the condition: G0.05, for instance, means the limiting nutrient was glucose and the growth rate was .05. A higher value means the gene was more expressed in that sample, lower means the gene was less expressed. In total the yeast was grown with 6 limiting nutrients and 6 growth rates, which makes 36 samples, and therefore 36 columns, of gene expression data.
What is untidy about this dataset?
The NAME column contains lots of information, split up by ||’s. If we examine one of the names, it looks like:
SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129
which has both some systematic IDs and some biological information about the gene. The details of each of these fields aren’t annotated in the paper, but we can figure out most of them. It contains:
Tidy the data to end up with the following 7 variables:
- name: Gene name.
- bp: Biological process.
- mf: Molecular function.
- systematic_name: Systematic ID.
- nutrient: Limiting nutrient; this has six possible values: glucose (G), ammonium (N), sulfate (S), phosphate (P), uracil (U) or leucine (L).
- rate: Growth rate; a number, ranging from .05 to .3. .05 means slow growth (the yeast were being starved hard of that nutrient) while .3 means fast growth.
- expression: Expression level; these are the values currently stored in those columns, as measured by the microarray.

original_data <- readr::read_delim(
  file = "http://varianceexplained.org/files/Brauer2008_DataSet1.tds",
  delim = "\t"
)
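One possible approach can be sketched on a one-row toy stand-in, so that it runs without downloading the file. The tibble toy is hypothetical: its NAME string is the example quoted above, but the GID and GWEIGHT columns and the expression values are made up, and the real file may contain further columns to drop. The intermediate names number and sample are illustrative as well.

```r
library(magrittr)

# Hypothetical miniature of the raw data: one gene, two samples.
toy <- tibble::tibble(
  GID     = "GENE_X",   # made-up identifier
  NAME    = "SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129",
  GWEIGHT = 1,          # made-up weight
  G0.05   = -0.24,      # made-up expression values
  N0.3    = 0.11
)

toy_tidy <- toy %>%
  # 1. Split NAME into its five "||"-delimited fields.
  tidyr::separate(
    col = NAME,
    into = c("name", "bp", "mf", "systematic_name", "number"),
    sep = "\\|\\|"
  ) %>%
  # 2. Remove the blanks left around each field.
  dplyr::mutate(dplyr::across(c(name, bp, mf, systematic_name), trimws)) %>%
  # 3. Drop the columns that are not among the 7 target variables.
  dplyr::select(-number, -GID, -GWEIGHT) %>%
  # 4. Move the sample columns into (sample, expression) pairs.
  tidyr::pivot_longer(
    cols = -c(name, bp, mf, systematic_name),
    names_to = "sample",
    values_to = "expression"
  ) %>%
  # 5. Split the sample label after its first character: nutrient, then rate.
  tidyr::separate(
    col = sample,
    into = c("nutrient", "rate"),
    sep = 1,
    convert = TRUE
  )

toy_tidy
```

Splitting at position 1 works here because every sample label starts with a one-letter nutrient code followed by the growth rate.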