library(learnr) library(tidyverse) library(gapminder) library(knitr) library(kableExtra) knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) #beautify table printout pretty_table <- function(table, max_rows = 6, max_cols = 6) { if (nrow(table) > max_rows) { table <- table %>% head(max_rows) } if (ncol(table) > max_cols) { table <- table[, 1:max_cols] } kable(table) %>% kable_styling(c('striped', 'hover'), full_width = F, position = 'left') } nba <- tibble( player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"), team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"), day1points = c("25","23","30","41","26","20"), day2points = c("24","25","33","45","26","23") )
Recall the basic components that make a dataset tidy:
1) Each variable forms a column.
2) Each observation forms a row.
3) Each type of observational unit forms a table
Here's a quick test dataset called nba
(credit to this site)
nba <- tibble( player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"), team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"), day1points = c("25","23","30","41","26","20"), day2points = c("24","25","33","45","26","23") ) pretty_table(nba)
Recall that to use pivot_longer
you need to specify:
cols
: The set of columns that represent values not variables
names_to
: The name of the variable whose values form column names
values_to
: The name of the variable whose values are spread over the cells (called the value
)
Use pivot_longer
to make this nba
dataset below into a 'tidy' one, call it nba_tidy
.
nba_tidy <- nba %>% head(nba_tidy)
Your nba_tidy
table should look like this:
nba_tidy <- nba %>% pivot_longer(cols = c('day1points', 'day2points'), names_to = 'day', values_to = 'points') pretty_table(nba_tidy)
nba_tidy <- nba %>% pivot_longer(cols = c('day1points', 'day2points'), names_to = 'day', values_to = 'points') head(nba_tidy)
Now use mutate
and the str_replace
function to format the days as 'day1', 'day2', etc.
Note about str_replace
: it takes as input a vector of strings, a text pattern to replace, and the text to replace it with. For example, here we replace the text "_test"
with nothing (""
) wherever it appears in the input vector.
str_replace(c('boat_test', 'car', 'train_test'), '_test', '')
nba_tidy <- nba %>% pivot_longer(cols = c('day1points', 'day2points'), names_to = 'day', values_to = 'points')
nba_tidy <- nba_tidy %>% mutate(day = ) head(nba_tidy)
nba_tidy <- nba_tidy %>% mutate(day = str_replace(day, 'points', '')) head(nba_tidy)
Your table should now look like this:
nba_tidy <- nba_tidy %>% mutate(day = str_replace(day, 'points', '')) pretty_table(nba_tidy)
To use pivot_wider()
you need to specify:
names_from
: The column containing the variable names.
values_from
: The column containing the values from multiple variables (the value
column).
Use pivot_wider
to make the nba_tidy
dataset back into what you started with. Remember you can use rename
to rename the columns as needed.
nba_tidy <- nba %>% pivot_longer(cols = c('day1points', 'day2points'), names_to = 'day', values_to = 'points') %>% mutate(day = str_replace(day, 'points', ''))
nba_tidy %>%
nba_tidy %>% pivot_wider(names_from = 'day', values_from = 'points') %>% rename(day1points = day1, day2points = day2) %>% head()
As a reminder, here's what the original table looked like:
nba %>% head()
Use separate
to split the team
column of nba
into one called team
and one called location
. The first 3 characters refer to the team, and the rest to the location (city or state).
nba %>% separate(col = team, into = c('team', 'location'), sep = 3)
# sep = 3
Starting with the nba_tidy
table we made above, use group_by
and summarise
to create a table called team_bests
which contains the max points recorded from a given player for each team.
team_bests <- nba_tidy %>%
team_bests <- nba_tidy %>% group_by(team) %>% summarise(team_max_points = max(points))
It should look like this:
team_bests <- nba_tidy %>% group_by(team) %>% summarise(team_max_points = max(points)) pretty_table(team_bests)
#prepare the nba_tidy and team_bests tables nba_tidy <- nba %>% pivot_longer(cols = c('day1points', 'day2points'), names_to = 'day', values_to = 'points') %>% mutate(day = str_replace(day, 'points', '')) team_bests <- nba_tidy %>% group_by(team) %>% summarise(team_max_points = max(points))
Now use a left_join
to add this info into your nba_tidy
table. Then make a new column called is_team_max
which is TRUE if the point total for a given player is the overall max for that team and FALSE otherwise.
nba_tidy %>% left_join(team_bests, by = 'team') %>% ...
... %>% mutate(is_team_max = points == team_max_points)
It should look like this:
nba_tidy %>% left_join(team_bests, by = 'team') %>% mutate(is_team_max = points == team_max_points) %>% pretty_table()
NOTE: as I recently learned, a more efficient way of doing this would be to use group_by
in combination with mutate
like this:
nba_tidy %>% group_by(team) %>% mutate(team_max_points = max(points), is_team_max = points == team_max_points)
The who
table contains data from the WHO Global Tuberculosis Report. First inspect the table. You can also read about the dataset using help(who)
All the columns from new_sp_m014
onward are actually values. Use pivot_longer()
to collect all these columns into a tidy table called who_tidy
. Call the names column "Types" and the values column "Cases".
Note, use the values_drop_na=TRUE
option with pivot_longer to ignore rows with NA in the values column.
who_tidy <- who %>% head(who_tidy)
cols = new_sp_m014:new_rel_f65 #this will pick the range of 'value columns'
who_tidy <- who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = 'Types', values_to = 'Cases', values_drop_na = TRUE) head(who_tidy)
Your table should look like this:
who_tidy <- who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = 'Types', values_to = 'Cases', values_drop_na = TRUE) pretty_table(who_tidy)
who_tidy <- who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = 'Types', values_to = 'Cases', values_drop_na = TRUE)
Look at the names of the Types
column. If you read about this dataset (help(who)
) you will see that these names actually encode multiple different pieces of information:
Let's split this variable into multiple columns that each encode one piece of info
Use the following three steps to update the columns in your who_tidy
table:
1) First, notice that the 'rel' cases don't have a '_' after 'new', which we need to fix. Use mutate
along with str_replace
to replace any matches of 'newrel'
with 'new_rel'
2) Now each piece in the Types
column is separated by a _
. Use separate()
to split the Types
column into three columns, call them "obs_type", "diagnosis_meth" and "SexAge".
3) Now use separate()
again to spit the SexAge
column into a column called "Sex" and one called "Age". (Here you can split by position, since the sex is always the first character)
who_tidy <- who_tidy %>%
who_tidy %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel'))
who_tidy %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel')) %>% separate(Types, into = c('obs_type', 'diagnosis_meth', 'SexAge'), sep = '_')
who_tidy %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel')) %>% separate(Types, into = c('obs_type', 'diagnosis_meth', 'SexAge'), sep = '_') %>% separate(SexAge, into = c('Sex', 'Age'), sep = 1)
When you're done, who_tidy
should look like this:
who_tidy %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel')) %>% separate(Types, into = c('obs_type', 'diagnosis_meth', 'SexAge'), sep = '_') %>% separate(SexAge, into = c('Sex', 'Age'), sep = 1) %>% pretty_table()
Now let's aggregate data about total cases in each year by continent.
First we need to make a table that has info on which continent each country is in. We can do that by taking the gapminder table, and using the distinct()
, like this:
country_continents <- gapminder %>% distinct(country, continent) pretty_table(country_continents)
who_tidy <- who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = 'Types', values_to = 'Cases', values_drop_na = TRUE) %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel')) %>% separate(Types, into = c('obs_type', 'diagnosis_meth', 'SexAge'), sep = '_') %>% separate(SexAge, into = c('Sex', 'Age'), sep = 1) country_continents <- gapminder %>% distinct(country, continent)
Now add in the continent information in country_continents
into the who_tidy
table using an inner_join
(this makes sure we only include data from countries where we also have the continent in our lookup table)
Then calculate the total number of new female cases per year, continent, and diagnosis method. Restrict to years after 1990. Save the results into a new table called tot_by_continent
tot_by_continent <- who_tidy %>% head(tot_by_continent)
tot_by_continent <- who_tidy %>% ... %>% group_by(year, continent, diagnosis_meth) %>% summarise(tot_cases = sum(Cases))
tot_by_continent <- who_tidy %>% inner_join(country_continents, by = 'country') %>% filter(year > 1990, Sex == 'f') %>% group_by(year, continent, diagnosis_meth) %>% summarise(tot_cases = sum(Cases)) head(tot_by_continent)
Your result should look like this:
tot_by_continent <- who_tidy %>% inner_join(country_continents, by = 'country') %>% filter(year > 1990, Sex == 'f') %>% group_by(year, continent, diagnosis_meth) %>% summarise(tot_cases = sum(Cases)) pretty_table(tot_by_continent)
who_tidy <- who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = 'Types', values_to = 'Cases', values_drop_na = TRUE) %>% mutate(Types = str_replace(Types, 'newrel', 'new_rel')) %>% separate(Types, into = c('obs_type', 'diagnosis_meth', 'SexAge'), sep = '_') %>% separate(SexAge, into = c('Sex', 'Age'), sep = 1) country_continents <- gapminder %>% distinct(country, continent) tot_by_continent <- who_tidy %>% inner_join(country_continents, by = 'country') %>% filter(year > 1990, Sex == 'f') %>% group_by(year, continent, diagnosis_meth) %>% summarise(tot_cases = sum(Cases))
Use the tot_by_continent
table to make a ggplot
scatterplot (with lines connecting points) of the total number of TB cases over time. Encode continent by color. Only include diagnosis_meth
of either "sn" or "sp", and use facet_wrap
to make separate sub-plots by diagnosis_meth
data_to_plot <- tot_by_continent %>% filter(diagnosis_meth %in% c('sn', 'sp'))
facet_wrap(~diagnosis_meth)
Your plot should look like this:
ggplot(tot_by_continent %>% filter(diagnosis_meth %in% c('sn', 'sp')), aes(year, tot_cases, color = continent)) + geom_point() + geom_line() + facet_wrap(~diagnosis_meth) + labs(y = 'Total TB Cases')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.