source("R/utils.R")
r format_date(params$datetime)
r params$instructor
r params$level
Required packages to install:
```r install.packages(c("dplyr", "tidyr"))
```
At the end of this session you will be able:
%>%
operator.mutate()
.filter()
and select()
.arrange()
.group_by()
and summarise()
.gather()
(and optionally spread()
).For learning:
For help:
?tidyr
, ?dplyr
We'll be using a very simple dataset that is built-in to the tidyr package. This dataset contains data on tuberculosis cases in Afghanistan, Brazil, and China from 1999 and 2000.
library(dplyr) library(tidyr) names(table1) str(table1) summary(table1) # a tibble is a modified data.frame, making it slightly easier to use as_tibble(table1)
Create an exercise.R
file (we will have done that in class) and put these code into the file. Then choose one of the below datasets that you will use for only the exercises:
airquality
(recommended for complete beginners)swiss
(recommended for complete beginners)population
storms
starwars
Copy the code below, replacing the ___
with the relevant dataset. This format is used throughout the session.
# load the packages library(dplyr) library(tidyr) # variables names(___) # general contents str(___) # using the dplyr function glimpse(___) # quick summary summary(___) # compare a data.frame with a tibble as.data.frame(___) as_tibble(___) # convert your dataset into a tibble ___ <- as_tibble(___)
The dataset is tidy when:
library(tidyr) library(dplyr)
Look at the structure of the toy tidyr datasets table1
, table3
, and table4a
. Which is tidy?
table1 table3 table4a
More or less, table1
is tidy while the others are not. That's because in table1
, each column has specific and unique information and each row represents a single observation from a single location at a single point in time.
Benefits of tidy data and tidy code:
%>%
pipe operator can help clarify complex data processing workflows.%>%
%>%
pipe operator can help clarify complex data processing workflows.# normal R way of nesting functions head(glimpse(table1)) # the pipe way of linking functions table1 %>% glimpse() %>% head()
mutate()
The mutate()
function adds or replaces a variable/column in a dataset. To add
# Replace an existing variable table1 %>% mutate(population = population / 1000000) # Or create a new variable based on a condition table1 %>% mutate(after_2000 = if_else(year >= 2000, "yes", "no")) # Create or replace multiple variables by using the "," table1 %>% mutate(new_column = "only one value", population = population / 1000000) # Create a new variable using existing variables and save a new dataframe table1_rate <- table1 %>% mutate(rate = (cases / population) * 100000) table1_rate
Time: 10 min
# explore the structure of the data names(___) # pipe the data into mutate function and: new_dataset <- ___ %>% # dataset mutate( # 1. create a new variable ___ = ___, # 2. transform/replace an existing variable ___ = ___, # 3. create a new variable using a conditional ___ = if_else(___, TRUE, FALSE) ) new_dataset
select()
# select columns/variables by name, without quotes table1_rate %>% select(country, year, rate) # to *not* select a variable, us minus (-) table1_rate %>% select(-country) # when you have many variables with similar names, use "matching" functions table1_rate %>% select(starts_with("c"), ends_with("e"), contains("pop"))
filter()
Warning: Filter using logic... and humans are very bad at logic... Make sure to be very certain that you think your logic is what the code reads as logic... Lots of mistakes can be made at this stage! Especially with complex logic situations.
# when country is equal to table1_rate %>% filter(country == "Brazil") # when country is *not* equal to table1_rate %>% filter(country != "Brazil") # when year is equal to table1_rate %>% filter(year == 1999) # Or when year is equal to or more than table1_rate %>% filter(year >= 1999) # when year is 1999 *and* country is Brazil table1_rate %>% filter(year == 1999 & country == "Brazil") # when year is 1999 *or* country is Brazil table1_rate %>% filter(year == 1999 | country == "Brazil")
arrange()
# ascending order by rate table1_rate %>% arrange(rate) # descending order by rate table1_rate %>% arrange(desc(rate)) # ascending order by year and rate table1_rate %>% arrange(year, rate)
Time: 10 min
# filter the rows by two variables # choose two numeric variables and two numbers ___ %>% # format: variable >= number # example: weight >= 50 filter(___ >= ___ & ___ <= ___) # note the "&" above, compare with "|" below # how do they differ? ___ %>% filter(___ >= ___ | ___ <= ___) # pipe the data and continue piping ___ %>% # filter a variable, preferably a character variable, filter(______ == ______) %>% # sort by two variables arrange(___, ___) %>% # now keep only three variables select(___, ___, ___)
If you are familiar with ggplot2 from the ggplot2 session, then you can optional try to visualize the dataset.
group_by()
, summarise()
# summarise on its own # summarise only can output one (1) value, eg. a max or an average table1_rate %>% summarise(cum_cases = sum(cases), max = max(cases)) # grouping by country, on its own, does nothing table1_rate %>% group_by(country) # group_by combined with summarise is powerful table1_rate %>% group_by(country) %>% summarise(cum_cases = sum(cases), max = max(cases)) # grouping by year table1_rate %>% group_by(year) %>% summarise(cum_cases = sum(cases), max = max(cases))
gather()
Compare how table1
looks normally and after converting to the long form with gather()
.
table1 # convert to long form by stacking all cases with all population table1 %>% gather(variable, value, -country, -year) # this does the same: table1 %>% gather(variable, value, cases, population)
Converting to long form makes other types of exploration much much easier. For instance, combined with group_by()
and summarise()
:
table1 %>% gather(variables, values, cases, population) %>% group_by(year, variables) %>% summarise(cum_sum = sum(values), max = sum(values))
Time: Until end of the session.
Copy the code below and run it to get the dataset to use for this exercise. Using everything you learned in this session, try to recreate the table below from the dataset given.
library(ggplot2) economics_data <- economics %>% # get the year information mutate(year = as.numeric(substr(date, 0, 4))) %>% as_tibble() economics_data
economics_data %>% filter(year >= 1990, year <= 2000) %>% mutate(unemploy_rate = unemploy / pop) %>% select(year, unemploy_rate, uempmed) %>% gather(variable, value, -year) %>% group_by(year, variable) %>% summarise( Max = max(value), Median = median(value), Min = min(value) ) %>% arrange(desc(year)) %>% knitr::kable(caption = "From the data above, try to recreate this data below in this table.")
spread()
To convert from long to wide, you use spread()
.
# original table2 # after spread table2 %>% spread(type, count)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.