
# load packages ----------------------------------------------------------------
library(introexercises)
library(learnr)
library(gradethis)
library(dplyr)
library(flair)
library(ggplot2)
library(lubridate)
library(fontawesome)
library(tidyr)
library(forcats)
library(janitor)
library(kableExtra)
# library(RMariaDB)        # connect to sql database 

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated 
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60, 
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE 
)

# ## event recorder ---------------------------------------------------------------
# ## see for details: 
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
# 
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user     = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname   = 'excersize_log',
#                        host     = "144.126.246.140")
# 
# 
# ## define a function to collect data 
# ## note that tutorial_id is defined in YAML
#     ## you could set the tutorial_version too (by specifying version:) but use package version instead 
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#     
#   ## define a sql query 
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id, 
#                        tutorial_version, 
#                        date_time, 
#                        user_id, 
#                        event, 
#                        section,
#                        label, 
#                        question, 
#                        answer, 
#                        code, 
#                        correct)
#                        VALUES('", tutorial_id,  "', 
#                        '", tutorial_version, "', 
#                        '", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label,  "',
#                        '", paste0('"', data$question, '"'),  "',
#                        '", paste0('"', data$answer,   '"'),  "',
#                        '", paste0('"', data$code,     '"'),  "',
#                        '", data$correct, "')",
#                        sep = '')
# 
#     # Execute the query on the sqldtbase that we connected to above
#     rsInsert <- dbSendQuery(sqldtbase, event_log)
#   
# }
# 
# options(tutorial.event_recorder = recorder_function)

# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)


# Data prep --------------------------------------------------------------------
# Import
combined <- rio::import(system.file("dat/old_version/linelist_combined_20141201.rds", package = "introexercises"))


Introduction to R for Applied Epidemiology and Public Health

Welcome

Welcome to the live course "Introduction to R for applied epidemiologists", offered by Applied Epi - a nonprofit organisation that offers open-source tools, training, and support to frontline public health practitioners.

knitr::include_graphics("images/logo.png", error = F)

Pivoting data

This exercise focuses on pivoting columns within data frames from wide-to-long, and introduces the column class "factor".

Format

This exercise will guide you through a set of tasks.
You should perform these tasks in RStudio and on your local computer.

Getting Help

There are several ways to get help:

1) Look for the "helpers" (see below)
2) Ask your live course instructor/facilitator for help
3) Ask a colleague or other participant in the course for tips
4) Post a question in Applied Epi Community in the category for questions about Applied Epi Training

Here is what those "helpers" will look like:

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Here you will see a helpful hint!

r fontawesome::fa("check", fill = "red") Click to see a solution (try it yourself first!)

linelist %>% 
  filter(
    age > 25,
    district == "Bolo"
  )

Here is more explanation about why the solution works.

Quiz questions

Please complete the quiz questions that you encounter throughout the tutorial. Answering will help you to comprehend the material, and will also help us to improve the exercises for future students.

To practice, please answer the following questions:

quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write it yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)

question_numeric(
 "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
 answer(10, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(9, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(8, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(7, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(6, message = "Ok, we will get there together", correct = T),
 answer(5, message = "Ok, we will get there together", correct = T),
 answer(4, message = "I like your confidence!", correct = T),
 answer(3, message = "I like your confidence!", correct = T),
 answer(2, message = "I like your confidence!", correct = T),
 answer(1, message = "I like your confidence!", correct = T),
 allow_retry = TRUE,
 correct = "Thanks for sharing. ",
 min = 1,
 max = 10,
 step = 1
)

Icons

You will see these icons throughout the exercises:

Icon |Meaning
------|--------------------
r fontawesome::fa("eye", fill = "darkblue")|Observe
r fontawesome::fa("exclamation", fill = "red")|Alert!
r fontawesome::fa("pen", fill = "brown")|An informative note
r fontawesome::fa("terminal", fill = "black")|Time for you to code!
r fontawesome::fa("window-restore", fill = "darkgrey")|Change to another window
r fontawesome::fa("bookmark", fill = "orange")|Remember this for later

License

Applied Epi Incorporated, 2022
This work is licensed by Applied Epi Incorporated under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Please email contact@appliedepi.org with questions about the use of these materials for academic courses and epidemiologist training programs.

Learning objectives

In this exercise you will:

Practice pivoting a dataset from wide to long
See the class "factor" applied to give an order to a column's values

Code from this exercise can be added into your R script, if you want. The code from this exercise is not vital to future exercises. If you are tired, you can simply read through the exercise and absorb the material.

Preparation

This exercise uses the combined data frame that was created in the previous exercise on "Joining data". If you did not complete the exercise, or are seeing errors when trying to use combined, you can import a "backup" combined data frame in the "data/clean/backup/" folder using this command:

combined <- import(here("data", "clean", "backup", "linelist_combined_20141201.rds"))

New section heading

Add a new section heading in your script called "Pivoting - patient timelines".

The new header should look something like this:

# Pivoting - patient timelines ----------------------------------------

Pivoting to plot patient timelines

Now that we have joined all the datasets together in combined, we have a more complete picture of each patient's movement through the health system. We have information on date_infection, date_onset, date_report, date_hospitalisation, and date_outcome.

Let's create a small data frame to examine the timelines of 5 patients.

Here is what the end result will look like:

timelines <- combined %>% 
  arrange(date_onset) %>%                 # sort dataset so that earliest are at the top
  head(5) %>%                             # keep only the top 5 rows
  select(case_id, starts_with("date"))    # keep only certain columns 

timelines_long <- timelines %>% 
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date"
  ) %>% 
  mutate(date_type = fct_relevel(date_type, "date_infection", "date_onset", "date_report", "date_hospitalisation", "date_outcome"))

timelines_long %>% 
  ggplot(mapping = aes(x = date, y = case_id, color = date_type, shape = date_type, group = case_id))+
  geom_point(size = 4)+
  geom_line()+
  theme_minimal()

Above, we have the timelines of 5 cases displayed in detail, each on their own row. For each case, their milestone dates are visualized by points of varying color and shape.

Let's build this together, and along the way learn about pivoting longer and about factors.

Select cases

First, let's reduce the dataset to the following:

Sort the dataset so the cases with the earliest onset are at the top
Filter to only the top 5 rows
Select only the columns case_id and any column that begins with "date"

timelines <- combined %>% 
  arrange(date_onset) %>%                 # sort dataset so that earliest are at the top
  head(5) %>%                             # keep only the top 5 rows
  select(case_id, starts_with("date"))    # keep only certain columns

Thankfully we can use the "tidyselect" helper function starts_with() to refer to all of the date columns at once.
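
To see the helper in isolation, here is a minimal sketch on a tiny, invented data frame (the object toy and its values are made up for illustration and are not part of the course data):

```r
library(dplyr)

# Toy data frame, invented for illustration (not the course linelist)
toy <- tibble(
  case_id     = c("a1", "b2"),
  age         = c(34, 7),
  date_onset  = as.Date(c("2014-05-01", "2014-05-03")),
  date_report = as.Date(c("2014-05-04", "2014-05-06"))
)

# starts_with("date") matches every column whose name begins with "date"
toy_dates <- toy %>% 
  select(case_id, starts_with("date"))

# The age column is dropped; case_id and both date columns are kept
names(toy_dates)
```

The same helper works anywhere a tidyselect column selection is accepted, which is why it reappears in pivot_longer() later in this exercise.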

Let's look at this new dataset:

timelines

As you know, ggplot() with geom_point() will ask for column names to use for mapping to the axes (x = and y =).

quiz(
  question("In its current form, which column would be assigned to the X-axis to create the plot?",
    answer("date_onset", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date_outcome", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date_infection", message = "This will not work because in the plot, the date axis reflects all the different date types"),
    answer("date", message = "This is not a column in the current dataset."),
    answer("Not possible in current format", correct=TRUE, message = "Yes, the dataset must be transformed."),
    allow_retry = TRUE
  )
)

Pivot longer

To use this dataset in ggplot() we need to transform or "pivot" the columns into "long" format. This will result in a dataset with only 3 columns:

case_id
date_type (a new column with values like "date_infection" and "date_report" - the current column names)
date (the actual date values, all in one column)

To do this, we will use pivot_longer() to collect all of the date columns and pivot their values into just those two new columns (date_type and date).

At its most minimal, the function needs only the following argument:

cols = (this is a vector of the columns to pivot; in this case, the date columns)

Thankfully, we can reference all the "date" columns with the helper starts_with("date"). In other circumstances you might list them within a vector c().

# Pivot dates longer
timelines_long <- timelines %>% 
  pivot_longer(cols = starts_with("date"))

See what this new dataset looks like, below:

timelines_long

Notice the following things:

The function has taken all the date column names and placed them in a new column called "name". These are now character values.
It has also taken all the date values and placed them in a new column called "value".
There are now 5 rows for each case_id - one for each possible date.
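
The same behaviour can be checked on a small scale. The sketch below uses an invented data frame (toy, with made-up values) and lists the columns explicitly with c() instead of starts_with(), to show the default output names and how the dimensions change:

```r
library(dplyr)
library(tidyr)

# Toy data frame, invented for illustration (not the course linelist)
toy <- tibble(
  case_id      = c("a1", "b2"),
  date_onset   = as.Date(c("2014-05-01", "2014-05-03")),
  date_report  = as.Date(c("2014-05-04", "2014-05-06")),
  date_outcome = as.Date(c("2014-05-10", "2014-05-12"))
)

# List the columns to pivot explicitly inside c()
toy_long <- toy %>% 
  pivot_longer(cols = c(date_onset, date_report, date_outcome))

names(toy_long)   # default new column names are "name" and "value"
dim(toy)          # 2 rows, 4 columns before pivoting
dim(toy_long)     # 6 rows, 3 columns after pivoting
```

Each of the 2 cases now has one row per pivoted date column, so 2 x 3 = 6 rows.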

quiz(
  question("How did the dimensions of the data frame change?",
    answer("The pivoted data frame is the same as the old."),
    answer("The pivoted data frame has more columns"),
    answer("The pivoted data frame has more columns, but fewer rows"),
    answer("The pivoted data frame has fewer columns, but more rows", correct=TRUE, message = "Yes, since 5 date columns were pivoted, there are now 5 times as many rows as before. The 5 date columns have been collapsed into 2 columns."),
    allow_retry = TRUE
  )
)

If you want, you can re-run the pivoting command and add these arguments, which allow you to change these default names for the two new columns:

names_to = (try "date_type")
values_to = (try "date")

# Pivot dates longer
timelines_long <- timelines %>% 
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date")

timelines_long

Plotting

What happens if we make the ggplot right now, using the dataset timelines_long?

quiz(
  question("Which column in the pivoted data frame will be mapped to the X-axis?",
    answer("case_id", message = "No, this discrete column will be on the Y-axis."),
    answer("date", correct = TRUE, message = "Yes, this column is continuous date values."),
    answer("date_type", message = "No, this column contains discrete character values. It will be used as color for the points and lines."),
    allow_retry = TRUE
  )
)

ggplot(data = timelines_long,    # use the long dataset
         mapping = aes(
           x = date,               # dates of all types displayed along the x-axis
           y = case_id,            # case_id are discrete, character values
           color = date_type,      # color of the points
           shape = date_type,      # shape of the points
           group = case_id))+      # connect each case's points with one line
  geom_point(size = 4)+            # show points
  geom_line()+                     # show lines
  theme_minimal()

The points and lines are there, but are they in a sensible order in the legend?

Factors

If a variable has an inherent order, we might call it an "ordinal" variable. Imagine that the values in a column were "first", "second", or "third". We would want them to appear in a plot in that specific order.

In R, these types of variables should be converted to the class "factor". A factor has "levels", such that the values are ordered (first, second, third, fourth, etc.).

In this case, the expected ordering would be:

1) "date_infection"
2) "date_onset"
3) "date_report"
4) "date_hospitalisation"
5) "date_outcome"

Of course, some patients may be hospitalised before they are reported, but generally let's say that this is the order that we want to embed in the variable.

What is the current class of date_type?

class(timelines_long$date_type)

It is not a factor. The character values have no inherent ordering. By default they will appear alphabetically.
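
A quick way to see the alphabetical default, sketched on an invented character vector (milestones is a made-up name, not an object from the exercise):

```r
# Invented milestone labels, for illustration only
milestones <- c("date_report", "date_infection", "date_outcome", "date_onset")

class(milestones)   # a plain character vector

# Converting with base R factor() and no further arguments
# produces levels in alphabetical order
default_levels <- levels(factor(milestones))
default_levels
```

Here "date_outcome" is placed before "date_report", which is alphabetical but not the chronological order we would want in a legend.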

We can change this using fct_relevel() from the {forcats} package. This function converts the column to class "factor" and gives you the opportunity to set the desired order.

Below, we add a mutate() step to the pipe chain that re-defines this new column, and then lists the values in the order we want.

# Pivot dates longer
timelines_long <- timelines %>% 

  # pivot the dataset longer
  pivot_longer(
    cols = starts_with("date"),
    names_to = "date_type",
    values_to = "date") %>% 

  # set the new column date_type as class factor, and define order for its values
  mutate(date_type = fct_relevel(date_type, "date_infection", "date_onset", "date_report", "date_hospitalisation", "date_outcome"))

The class is now "factor"

class(timelines_long$date_type)

And it has "levels"

levels(timelines_long$date_type)
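
As a side check outside the pipe, the effect of fct_relevel() can be sketched on a standalone, invented character vector (milestones and ordered_milestones are made-up names for illustration):

```r
library(forcats)

# Invented milestone labels, for illustration only
milestones <- c("date_report", "date_infection", "date_outcome", "date_onset")

# List the values in the order we want; any value not listed
# would keep its place after the listed ones
ordered_milestones <- fct_relevel(
  milestones,
  "date_infection", "date_onset", "date_report", "date_outcome"
)

class(ordered_milestones)    # now a factor
levels(ordered_milestones)   # levels follow the order given above
```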

After re-running the chain above (with the mutate), we try the ggplot again - see how the ordering has changed (look at the legend):

timelines_long %>% 
  ggplot(
    mapping = aes(
      x = date,
      y = case_id,
      color = date_type,
      shape = date_type,
      group = case_id))+
  geom_point(size = 4)+
  geom_line()+
  theme_minimal()

quiz(
  question("Which case seems to have an error in date_outcome?",
    answer("dce5cc"),
    answer("9d4019"),
    answer("974bc1"),
    answer("76b97a"),
    answer("2ae019", correct=TRUE, message = "Yes, the recorded date of outcome is prior to the recorded date of onset."),
    allow_retry = TRUE
  )
)

There are many other {forcats} functions to handle factors; see this chapter of the Epi R Handbook.

fct_lump()

One {forcats} function that is worth showing you is fct_lump(). This function will aggregate together values in a column into an "Other" category based on frequency.

See this epidemic curve - because the column district is assigned to the aesthetic fill =, it shows every district in the legend. This is quite overwhelming and difficult to interpret!

ggplot(data = combined, 
       mapping = aes(
         x = date_onset,
         fill = district))+
  geom_histogram(binwidth = 7)

We can use fct_lump() and its variations like fct_lump_n() to reduce the number of districts that are shown in the plot:

fct_lump_n() shows only the top "n" values (by counts), with all remaining put in "Other"
fct_lump_prop() shows only those that exceed a given proportion of rows (set with prop =), with all remaining in "Other"
There are other variations that you can see in the R documentation
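
A minimal sketch of the two variations on an invented vector (the district names and counts below are made up and do not come from the course data):

```r
library(forcats)

# Invented district values: North x5, South x3, East x1, West x1
district <- c(rep("North", 5), rep("South", 3), "East", "West")

# Keep the 2 most common values; everything else becomes "Other"
lumped_n <- fct_lump_n(district, n = 2)
table(lumped_n)

# Keep values that make up more than 25% of rows; the rest become "Other"
lumped_prop <- fct_lump_prop(district, prop = 0.25)
table(lumped_prop)
```

With these made-up counts both calls keep "North" and "South" and place the two single-row districts in "Other"; on real data the two approaches will often give different results.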

Below, we wrap district within fct_lump_n() and specify that we want to keep only the 3 most-common districts.

ggplot(data = combined, 
       mapping = aes(
         x = date_onset,
         fill = fct_lump_n(district, 3)))+
  geom_histogram(binwidth = 7)+
  labs(fill = "District")

Note that applying this function here within the ggplot() does not change the underlying district data. The data are lumped only for this plot. If you want to lump the underlying data you can do that with mutate() in a cleaning pipe.
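
A sketch of what that mutate() step could look like, using an invented data frame (toy, district_lumped, and the values are hypothetical; in your own script you would apply this to your cleaned linelist):

```r
library(dplyr)
library(forcats)

# Toy data frame, invented for illustration (not the course linelist)
toy <- tibble(
  case_id  = as.character(1:10),
  district = c(rep("North", 5), rep("South", 3), "East", "West")
)

# Store the lumped version in a new column so the original is preserved
toy_lumped <- toy %>% 
  mutate(district_lumped = fct_lump_n(district, n = 2))

# Count rows per lumped category
count(toy_lumped, district_lumped)
```

Writing to a new column, rather than overwriting district, keeps the full detail available for other tables and plots.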

Pivoting wider

We will not focus on pivoting wider in this exercise, as it is less common. However, know that if you need to pivot data wider, you can find good examples in the Epi R Handbook.
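
For orientation only, here is a minimal sketch of the reverse operation with pivot_wider() from {tidyr}, on an invented long data frame (toy_long and its values are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Toy long-format data, invented for illustration
toy_long <- tibble(
  case_id   = c("a1", "a1", "b2", "b2"),
  date_type = c("date_onset", "date_report", "date_onset", "date_report"),
  date      = as.Date(c("2014-05-01", "2014-05-04", "2014-05-03", "2014-05-06"))
)

# Spread the values of date_type back out into their own columns
toy_wide <- toy_long %>% 
  pivot_wider(
    names_from  = date_type,   # column whose values become new column names
    values_from = date         # column whose values fill the new columns
  )

toy_wide   # one row per case_id, one column per date type
```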

End

Congratulations! You are done with this exercise. You have made case timelines, practiced pivoting data longer, and used some functions to handle factors!

appliedepi/introexercises documentation built on April 22, 2024, 1:01 a.m.
