library(learnr) library(magrittr) library(tidyverse) library(gapminder) library(knitr) library(kableExtra) knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) tutorial_options(exercise.timelimit = 60, exercise.blanks = "___+") #beautify table printout pretty_table <- function(table, max_rows = 10, max_cols = 10) { if (nrow(table) > max_rows) { table <- table %>% head(max_rows) } if (ncol(table) > max_cols) { table <- table[, 1:max_cols] } kable(table) %>% kable_styling(c('striped', 'hover'), full_width = F, position = 'left') }
Topic 1: Formatting data the 'tidy' way
Topic 2: How to combine datasets
#create some untidy datasets gp_1 <- gapminder %>% mutate(gdp = pop * gdpPercap) %>% dplyr::select(country, continent, year, pop, gdp) gp_2 <- gp_1 %>% gather(-c(country, continent, year), key = 'type', value = 'value') %>% arrange(country, year) gp_3 <- gp_1 %>% mutate(gdp = round(gdp), gdpPercap = paste0(gdp, '/', pop)) %>% select(-c(pop, gdp)) gp_4 <- gp_1 %>% filter(year < 1970) %>% select(country, year, gdp) %>% spread(key = year, value = gdp)
1) Each variable forms a column.
2) Each observation forms a row.
3) Each type of observational unit forms a table
Standardized framework for structuring data
Works well with R tools
head(gp_1, 3) %>% pretty_table
gp_1 %>% mutate(gpdPerCap = gdp / pop) %>% filter(country %in% c('Serbia', 'Spain', 'Turkey')) %>% ggplot(aes(year, gpdPerCap, color = country)) + geom_line()
head(gp_2, 4) %>% pretty_table()
head(gp_2, 4) %>% pretty_table()
head(gp_3, 4) %>% pretty_table()
head(gp_4, 4) %>% pretty_table()
“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
1) One variable is spread across multiple columns
2) One observation is scattered across multiple rows
pivot_longer
pivot_wider
head(gp_4) %>% pretty_table()
Want to collapse 'value columns' into a new pair of variables, one to store the variable in the column names, the other to store the variable spread out over cells in the table
Need to specify:
cols
: The set of columns that represent values not variables
names_to
: The name of the variable whose values form column names
values_to
: The name of the variable whose values are spread over the cells (called the value
)
Starting from a table with the year value distributed across columns
head(gp_4) %>% pretty_table()
Use pivot longer to consolidate these year columns into a single variable called year
gp_4 %>% pivot_longer(cols = c('1952', '1957', '1962', '1967'), names_to = 'year', values_to = 'gdp') %>% pretty_table()
gp_4 %>% pivot_longer(cols = !c('country'), names_to = 'year', values_to = 'gdp') %>% pretty_table()
gp_4 %>% pivot_longer(cols = c('1952':'1967'), names_to = 'year', values_to = 'gdp') %>% pretty_table()
pivot_wider()
is the opposite of pivot_longer()
Increases columns, decreases rows
Use it when an observation is scattered across multiple rows
In this case, the population and gdp values for each observation are on separate rows.
To use pivot_wider()
you need to specify:
names_from
: The column containing the variable names. Here it's type
values_from
: The column containing the values from multiple variables (the value
column). Here it's value
gp_2 %>% pretty_table()
gp_2 %>% pivot_wider(names_from = 'type', values_from = 'value') %>% head(5) %>% pretty_table()
When multiple variables are encoded in a single column
gp_3 %>% pretty_table()
separate
splits them into two new columns
gp_3 %>% separate(col = gdpPercap, into = c('gdp', 'pop'), sep = '/', convert = TRUE) %>% head(5) %>% pretty_table()
convert = TRUE
ensures that the resulting gdp and pop variables are treated asnumeric rather than stringsgp_1 %>% separate(col = year, into = c('century', 'year'), sep = 2) %>% head(5) %>% pretty_table()
Concatenate by rows (assume same columns)
Concatenate by columns (assume same rows)
Merge (join
) tables with different kinds of info.
bind_rows
Split data into separate tables by year and then recombine by row concatenation
gpmind_1962 <- gapminder %>% filter(year == 1962) gpmind_1972 <- gapminder %>% filter(year == 1972) gpmind_1977 <- gapminder %>% filter(year == 1977) #combine them by pasting together rows gpmind_row_comb <- bind_rows(gpmind_1962, gpmind_1972, gpmind_1977) gpmind_row_comb %>% group_by(year) %>% summarise(n = n())
Split into two tables by column
gpmind_loc <- gapminder %>% select(country, continent) gpmind_data <- gapminder %>% select(lifeExp, pop, gdpPercap) head(gpmind_loc, 3) %>% pretty_table() head(gpmind_data, 3) %>% pretty_table()
bind_cols
pastes together the columnsbind_cols(gpmind_data, gpmind_loc) %>% head(3) %>% pretty_table()
It's generally not possible to simply paste together tables by row or column because they don't "line up"
Say we have a table of country 'codes', and we want to add this information to our gapminder table
country_codes %>% head(4) %>% pretty_table()
bind_cols
!join
{.smaller}Matches rows of the two tables based on the values of country
.
gapminder %>% left_join(country_codes, by = 'country') %>% head(4) %>% pretty_table()
-Specify this variable (or set of variable) in the 'by' argument
cont_year <- tibble(continent = c('Asia', 'Europe', 'Africa', 'Asia', 'Oceania'), year = c(1962, 1992, 1972, 2002, 1982), some_stat = c(1, 2, 3, 4, 5))
How can we add this info, specific to the continent and year, to our original table?
cont_year %>% pretty_table()
Easy: you can just specify multiple fields to match by
gapminder %>% inner_join(cont_year, by = c('continent', 'year')) %>% head(4) %>% pretty_table()
https://ismayc.github.io/talks/ness-infer/slide_deck.html#85
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.