# load packages ----------------------------------------------------------------
library(introexercises)
library(learnr)
library(gradethis)
library(dplyr)
library(flair)
library(ggplot2)
library(lubridate)
library(fontawesome)
library(tidyr)
library(forcats)
library(janitor)
library(kableExtra)
# library(RMariaDB)        # connect to sql database

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60,
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE
)

# ## event recorder ---------------------------------------------------------------
# ## see for details:
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
#
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname = 'excersize_log',
#                        host = "144.126.246.140")
#
#
# ## define a function to collect data
# ## note that tutorial_id is defined in YAML
# ## you could set the tutorial_version too (by specifying version:) but use package version instead
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#
#   ## define a sql query
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id,
#                        tutorial_version,
#                        date_time,
#                        user_id,
#                        event,
#                        section,
#                        label,
#                        question,
#                        answer,
#                        code,
#                        correct)
#                        VALUES('", tutorial_id, "',
#                        '", tutorial_version, "',
#                        '", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label, "',
#                        '", paste0('"', data$question, '"'), "',
#                        '", paste0('"', data$answer, '"'), "',
#                        '", paste0('"', data$code, '"'), "',
#                        '", data$correct, "')",
#                      sep = '')
#
#   # Execute the query on the sqldtbase that we connected to above
#   rsInsert <- dbSendQuery(sqldtbase, event_log)
#
# }
#
# options(tutorial.event_recorder = recorder_function)
# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)

# Data prep --------------------------------------------------------------------
# Import
combined <- rio::import(system.file("dat/old_version/linelist_combined_20141201.rds", package = "introexercises"))
Welcome to the live course "Introduction to R for applied epidemiologists", offered by Applied Epi - a nonprofit organisation that offers open-source tools, training, and support to frontline public health practitioners.
knitr::include_graphics("images/logo.png", error = FALSE)
This exercise focuses on pivoting columns within data frames from wide-to-long, and introduces the column class "factor".
This exercise will guide you through a set of tasks.
You should perform these tasks in RStudio and on your local computer.
There are several ways to get help:
1) Look for the "helpers" (see below)
2) Ask your live course instructor/facilitator for help
3) Ask a colleague or other participant in the course for tips
4) Post a question in Applied Epi Community in the category for questions about Applied Epi Training
Here is what those "helpers" will look like:
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Here you will see a helpful hint!
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
linelist %>% filter( age > 25, district == "Bolo" )
Here is more explanation about why the solution works.
Please complete the quiz questions that you encounter throughout the tutorial. Answering will help you to comprehend the material, and will also help us to improve the exercises for future students.
To practice, please answer the following questions:
quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write it yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)
question_numeric(
  "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
  answer(10, message = "Try not to worry, we will help you succeed!", correct = TRUE),
  answer(9, message = "Try not to worry, we will help you succeed!", correct = TRUE),
  answer(8, message = "Try not to worry, we will help you succeed!", correct = TRUE),
  answer(7, message = "Try not to worry, we will help you succeed!", correct = TRUE),
  answer(6, message = "Ok, we will get there together", correct = TRUE),
  answer(5, message = "Ok, we will get there together", correct = TRUE),
  answer(4, message = "I like your confidence!", correct = TRUE),
  answer(3, message = "I like your confidence!", correct = TRUE),
  answer(2, message = "I like your confidence!", correct = TRUE),
  answer(1, message = "I like your confidence!", correct = TRUE),
  allow_retry = TRUE,
  correct = "Thanks for sharing. ",
  min = 1,
  max = 10,
  step = 1
)
You will see these icons throughout the exercises:
Icon |Meaning
------|--------------------
r fontawesome::fa("eye", fill = "darkblue") |Observe
r fontawesome::fa("exclamation", fill = "red") |Alert!
r fontawesome::fa("pen", fill = "brown") |An informative note
r fontawesome::fa("terminal", fill = "black") |Time for you to code!
r fontawesome::fa("window-restore", fill = "darkgrey") |Change to another window
r fontawesome::fa("bookmark", fill = "orange") |Remember this for later
Applied Epi Incorporated, 2022
This work is licensed by Applied Epi Incorporated under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please email contact@appliedepi.org with questions about the use of these materials for academic courses and epidemiologist training programs.
In this exercise you will:
Code from this exercise can be added into your R script, if you want. The code from this exercise is not vital to future exercises. If you are tired, you can simply read through the exercise and absorb the material.
This exercise uses the combined data frame that was created in the previous exercise on "Joining data". If you did not complete that exercise, or are seeing errors when trying to use combined, you can import a "backup" combined data frame from the "data/clean/backup/" folder using this command:
combined <- import(here("data", "clean", "backup", "linelist_combined_20141201.rds"))
Add a new section heading in your script called "Pivoting - patient timelines".
The new header should look something like this:
# Pivoting - patient timelines ----------------------------------------
Now that we have joined all the datasets together in combined, we have a more complete picture of each patient's movement through the health system. We have information on date_infection, date_onset, date_report, date_hospitalisation, and date_outcome.
Let's create a small data frame to examine the timelines of 5 patients.
Here is what the end result will look like:
timelines <- combined %>% 
  arrange(date_onset) %>%                   # sort dataset so that earliest are at the top
  head(5) %>%                               # keep only the top 5 rows
  select(case_id, starts_with("date"))      # keep only certain columns

timelines_long <- timelines %>% 
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date"
  ) %>% 
  mutate(date_type = fct_relevel(date_type, "date_infection", "date_onset", "date_report", "date_hospitalisation", "date_outcome"))

timelines_long %>% 
  ggplot(mapping = aes(x = date, y = case_id, color = date_type, shape = date_type, group = case_id))+
  geom_point(size = 4)+
  geom_line()+
  theme_minimal()
Above, we have the timelines of 5 cases displayed in detail, each on their own row. For each case, their milestone dates are visualized by points of varying color and shape.
Let's build this together, and along the way learn about pivoting longer and about factors.
First, let's reduce the dataset to the 5 cases with the earliest onset dates, keeping only the column case_id and any column that begins with "date":

timelines <- combined %>% 
  arrange(date_onset) %>%                   # sort dataset so that earliest are at the top
  head(5) %>%                               # keep only the top 5 rows
  select(case_id, starts_with("date"))      # keep only certain columns
Thankfully we can use the "tidyselect" helper function starts_with() to refer to all of the date columns at once.
Let's look at this new dataset:
timelines
As you know, ggplot() with geom_point() will ask for column names to use for mapping to the axes (x = and y =).
quiz(
  question("In its current form, which column would be assigned to the X-axis to create the plot?",
    answer("date_onset", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date_outcome", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date_infection", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date", message = "This is not a column in the current dataset."),
    answer("Not possible in current format", correct = TRUE, message = "Yes, the dataset must be transformed."),
    allow_retry = TRUE
  )
)
To use this dataset in ggplot() we need to transform or "pivot" the columns into "long" format. This will result in a dataset with only 3 columns:

1) case_id
2) date_type (a new column with values like "date_infection" and "date_report" - the current column names)
3) date (the actual date values, all in one column)

To do this, we will use pivot_longer() to collect all of the date columns and pivot their values into just those two new columns (date_type and date).
At its most minimal, the function needs only the following argument:

cols = (this is a vector of the columns to pivot; in this case, the date columns)

Thankfully, we can reference all the "date" columns with the helper starts_with("date"). In other circumstances you might list them within a vector c().
# Pivot dates longer
timelines_long <- timelines %>% 
  pivot_longer(cols = starts_with("date"))
See what this new dataset looks like, below:
timelines_long
Notice the following things:

1) The former column names are now values in a column called "name". These are now character values.
2) The date values are now all in a column called "value".
3) There are now multiple rows for each case_id - once for each possible date.

quiz(
  question("How did the dimensions of the data frame change?",
    answer("The pivoted data frame is the same as the old."),
    answer("The pivoted data frame has more columns"),
    answer("The pivoted data frame has more columns, but fewer rows"),
    answer("The pivoted data frame has fewer columns, but more rows", correct = TRUE, message = "Yes, since there were 5 date columns pivoted, there are now 5x as many rows as before. All the 5 date columns have been collapsed into 2 columns."),
    allow_retry = TRUE
  )
)
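To see the change in dimensions on something small, here is a minimal sketch using a made-up data frame. The object names (toy, toy_long), case IDs, and dates are hypothetical and are not part of the course data; only pivot_longer() and starts_with() come from the exercise.

```r
library(tidyr)

# Hypothetical mini dataset: 2 cases and 3 date columns (all values are made up)
toy <- tibble::tibble(
  case_id      = c("a1", "b2"),
  date_onset   = as.Date(c("2014-05-01", "2014-05-03")),
  date_report  = as.Date(c("2014-05-04", "2014-05-06")),
  date_outcome = as.Date(c("2014-05-10", NA))
)

# Pivot with only the cols = argument, so the default column names are used
toy_long <- pivot_longer(toy, cols = starts_with("date"))

dim(toy)         # 2 rows, 4 columns
dim(toy_long)    # 6 rows, 3 columns (2 cases x 3 date columns)
names(toy_long)  # the defaults for the two new columns are "name" and "value"
```

Note that the missing date is kept as its own row; by default pivot_longer() does not drop missing values.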
If you want, you can re-run the pivoting command and add these arguments, which allow you to change the default names for the two new columns:

names_to = (try "date_type")
values_to = (try "date")

# Pivot dates longer
timelines_long <- timelines %>% 
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date")
timelines_long
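When the columns to pivot do not share a prefix, they can be listed within c() instead of using a helper. The sketch below assumes a made-up data frame (toy and its values are hypothetical) and combines c() with the two naming arguments.

```r
library(tidyr)

# Hypothetical mini dataset (values are made up)
toy <- tibble::tibble(
  case_id     = c("a1", "b2"),
  date_onset  = as.Date(c("2014-05-01", "2014-05-03")),
  date_report = as.Date(c("2014-05-04", "2014-05-06"))
)

# List the columns explicitly and name the two new columns
toy_long <- pivot_longer(
  toy,
  cols      = c(date_onset, date_report),
  names_to  = "date_type",
  values_to = "date"
)

names(toy_long)   # "case_id" "date_type" "date"
```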
What happens if we make the ggplot right now, using the dataset timelines_long?
quiz(
  question("Which column in the pivoted data frame will be mapped to the X-axis?",
    answer("case_id", message = "No, this discrete column will be on the Y-axis."),
    answer("date", correct = TRUE, message = "Yes, this column is continuous date values."),
    answer("date_type", message = "No, this column contains discrete character values. It will be used as color for the points and lines."),
    allow_retry = TRUE
  )
)
ggplot(data = timelines_long,    # use the long dataset
       mapping = aes(
         x = date,               # dates of all types displayed along the x-axis
         y = case_id,            # case_id are discrete, character values
         color = date_type,      # color of the points
         shape = date_type,      # shape of the points
         group = case_id))+      # connect the points of each case with one line
  geom_point(size = 4)+          # show points
  geom_line()+                   # show lines
  theme_minimal()
The points and lines are there, but are they in a sensible order in the legend?
If a variable has an inherent order, we might call it an "ordinal" variable. For example, imagine that the values in a column were "first", "second", or "third". We would want them to appear in a plot in that specific order.
In R, these types of variables should be converted to the class "factor". A factor has "levels", such that the values are ordered (first, second, third, fourth, etc.).
In this case, the expected ordering would be:
1) "date_infection"
2) "date_onset"
3) "date_report"
4) "date_hospitalisation"
5) "date_outcome"
Of course some patients may be hospitalised before they are reported, but generally let's say that this is the order that we want to embed in the variable.
What is the current class of date_type?
class(timelines_long$date_type)
It is not a factor. The character values have no inherent ordering. By default they will appear alphabetically.
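As a small illustration of that alphabetical default, the sketch below uses a stand-alone character vector (date_types is a hypothetical object, not a column in the course data) and shows the order R falls back on when no levels are specified.

```r
# The five date types, written in the chronological order we would like
date_types <- c("date_infection", "date_onset", "date_report",
                "date_hospitalisation", "date_outcome")

# Sorting character values, or converting them to a factor without
# specifying levels, orders them alphabetically
sort(date_types)
levels(factor(date_types))
# "date_hospitalisation" "date_infection" "date_onset" "date_outcome" "date_report"
```

Hospitalisation now comes first, which is not the order in which these events usually happen.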
We can change this using fct_relevel() from the {forcats} package. This function converts the column to class "factor" and gives you the opportunity to set the desired order.
Below, we add a mutate() step to the pipe chain that re-defines this new column, and then lists the values in the order we want.
# Pivot dates longer
timelines_long <- timelines %>% 
  
  # pivot the dataset longer
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date") %>% 
  
  # set the new column date_type as class factor, and define order for its values
  mutate(date_type = fct_relevel(date_type, "date_infection", "date_onset", "date_report", "date_hospitalisation", "date_outcome"))
The class is now "factor"
class(timelines_long$date_type)
And it has "levels"
levels(timelines_long$date_type)
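The same behaviour can be tried on a tiny stand-alone vector. This is only a sketch: severity and its values are made up for illustration, and it also shows that you do not have to list every level.

```r
library(forcats)

# Hypothetical character values with a natural order
severity <- c("medium", "low", "high", "low")

# Without intervention the levels would be alphabetical: high, low, medium
levels(factor(severity))

# List all the levels in the desired order
severity_fct <- fct_relevel(severity, "low", "medium", "high")
class(severity_fct)    # "factor"
levels(severity_fct)   # "low" "medium" "high"

# Listing only some levels moves them to the front; the rest keep their order
levels(fct_relevel(severity, "low"))   # "low" "high" "medium"
```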
After re-running the chain above (with the mutate), we try the ggplot again - see how the ordering has changed (look at the legend):
ggplot(data = timelines_long,
       mapping = aes(
         x = date,
         y = case_id,
         color = date_type,
         shape = date_type,
         group = case_id))+
  geom_point(size = 4)+
  geom_line()+
  theme_minimal()
quiz(
  question("Which case seems to have an error in date_outcome?",
    answer("dce5cc"),
    answer("9d4019"),
    answer("974bc1"),
    answer("76b97a"),
    answer("2ae019", correct = TRUE, message = "Yes, the recorded date of outcome is prior to the recorded date of onset."),
    allow_retry = TRUE
  )
)
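A date problem like this can also be checked with code rather than by eye. The sketch below assumes a made-up wide data frame (toy, its case IDs, and its dates are hypothetical) and uses filter() to keep the rows where the outcome is recorded before onset.

```r
library(dplyr)

# Hypothetical wide data: one row per case (values are made up)
toy <- tibble::tibble(
  case_id      = c("x1", "x2", "x3"),
  date_onset   = as.Date(c("2014-05-01", "2014-05-03", "2014-05-05")),
  date_outcome = as.Date(c("2014-05-10", "2014-04-28", NA))
)

# Keep only the rows where the outcome is recorded before the onset
suspect <- toy %>% 
  filter(date_outcome < date_onset)

suspect$case_id   # "x2"
```

Rows with a missing date are dropped by filter(), because the comparison returns NA for them.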
There are many other {forcats} functions to handle factors, see this chapter of the Epi R Handbook.
One {forcats} function that is worth showing you is fct_lump(). This function lumps the less frequent values in a column together into an "Other" category.
See this epidemic curve - because the column district is assigned to the aesthetic fill =, it shows every district in the legend. This is quite overwhelming and difficult to interpret!
ggplot(data = combined,
       mapping = aes(
         x = date_onset,
         fill = district))+
  geom_histogram(binwidth = 7)
We can use fct_lump() and its variations like fct_lump_n() to reduce the number of districts that are shown in the plot: fct_lump_n() shows only the top "n" values (by counts), with all remaining put in "Other", while fct_lump_prop() shows only those that exceed a given proportion of rows, with all remaining in "Other".

Below, we wrap district within fct_lump_n() and specify that we want to keep only the 3 most-common districts.
ggplot(data = combined,
       mapping = aes(
         x = date_onset,
         fill = fct_lump_n(district, 3)))+
  geom_histogram(binwidth = 7)+
  labs(fill = "District")
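To compare the two variations side by side, here is a small sketch on a made-up vector. The district names and counts are hypothetical and do not come from the course data.

```r
library(forcats)

# Hypothetical districts: North x5, South x3, East x1, West x1
district <- c(rep("North", 5), rep("South", 3), "East", "West")

# Keep only the single most common district; the rest become "Other"
lumped_n <- fct_lump_n(district, n = 1)
table(lumped_n)

# Keep districts that make up more than 25% of the rows; the rest become "Other"
lumped_prop <- fct_lump_prop(district, prop = 0.25)
table(lumped_prop)
```

With these made-up counts, the first call puts 5 rows into "Other" (South, East, and West) and the second puts 2 rows into "Other" (East and West).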
Note that applying this function here within the ggplot() does not change the underlying district data. The data are lumped only for this plot. If you want to lump the underlying data you can do that with mutate() in a cleaning pipe.
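As a rough sketch of what that could look like, the example below lumps a made-up district column inside mutate(). The data frame toy and the new column name district_lumped are hypothetical; storing the result in a new column is just one option, chosen here so that the original values stay available.

```r
library(dplyr)
library(forcats)

# Hypothetical data (values are made up)
toy <- tibble::tibble(
  district = c(rep("North", 5), rep("South", 3), "East", "West")
)

# Store the lumped version as a new column, keeping the 2 most common districts
toy_lumped <- toy %>% 
  mutate(district_lumped = fct_lump_n(district, n = 2))

# Count rows per lumped district
count(toy_lumped, district_lumped)
```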
We will not focus on pivoting wider in this exercise, as it is less common. However, know that if you need to pivot data wider you can find good examples in these two chapters of the Epi R Handbook:
Congratulations! You are done with this exercise. You have made case timelines, practiced pivoting data longer, and used some functions to handle factors!