# load packages ----------------------------------------------------------------
library(introexercises) # get data for exercises  
library(learnr)         # create lessons from rmd
library(gradethis)      # evaluate exercises
library(dplyr)          # wrangle data  
library(stringr)        # strings
library(flair)          # highlight code
library(ggplot2)        # visualise data     
library(lubridate)      # work with dates  
library(epikit)         # for emojis   
library(skimr)          # browse data
library(fontawesome)    # for emojis  
library(janitor)        # clean data
# library(RMariaDB)       # connect to sql database 

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated 
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60, 
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE 
)

# ## event recorder ---------------------------------------------------------------
# ## see for details: 
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
# 
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user     = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname   = 'excersize_log',
#                        host     = "144.126.246.140")
# 
# 
# ## define a function to collect data 
# ## note that tutorial_id is defined in YAML
#     ## you could set the tutorial_version too (by specifying version:) but use package version instead 
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#     
#   ## define a sql query 
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id, 
#                        tutorial_version, 
#                        date_time, 
#                        user_id, 
#                        event, 
#                        section,
#                        label, 
#                        question, 
#                        answer, 
#                        code, 
#                        correct)
#                        VALUES('", tutorial_id,  "', 
#                        '", tutorial_version, "', 
#                        '", format(Sys.time(), "%Y-%M%-%D %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label,  "',
#                        '", paste0('"', data$question, '"'),  "',
#                        '", paste0('"', data$answer,   '"'),  "',
#                        '", paste0('"', data$code,     '"'),  "',
#                        '", data$correct, "')",
#                        sep = '')
# 
#     # Execute the query on the sqldtbase that we connected to above
#     rsInsert <- dbSendQuery(sqldtbase, event_log)
#   
# }
# 
# options(tutorial.event_recorder = recorder_function)
# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)
# data prep --------------------------------------------------------------------
surv_raw <- rio::import(system.file("dat/old_version/surveillance_linelist_20141201.csv", package = "introexercises"))

Introduction to R for Applied Epidemiology and Public Health

Welcome

Welcome to the live course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation that offers open-source tools, training, and support to frontline public health practitioners.

knitr::include_graphics("images/logo.png", error = F)

Data cleaning

This exercise focuses on cleaning data, including filter rows, selecting columns, deduplicating rows, creating new columns, creating categorical variables, and cleaning dates.

Format

This exercise will guide you through a set of tasks.
You should perform these tasks in RStudio and on your local computer.

Getting Help

There are several ways to get help:

1) Look for the "helpers" (see below) 2) Ask your live course instructor/facilitator for help
3) Ask a colleague or other participant in the course for tips
4) Post a question in Applied Epi Community in the category for questions about Applied Epi Training

Here is what those "helpers" will look like:

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Here you will see a helpful hint!


r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

linelist %>% 
  filter(
    age > 25,
    district == "Bolo"
  )

Here is more explanation about why the solution works.


Quiz questions

Please complete the quiz questions that you encounter throughout the tutorial. Answering will help you to comprehend the material, and will also help us to improve the exercises for future students.

To practice, please answer the following questions:

quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)
question_numeric(
 "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
 answer(10, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(9, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(8, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(7, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(6, message = "Ok, we will get there together", correct = T),
 answer(5, message = "Ok, we will get there together", correct = T),
 answer(4, message = "I like your confidence!", correct = T),
 answer(3, message = "I like your confidence!", correct = T),
 answer(2, message = "I like your confidence!", correct = T),
 answer(1, message = "I like your confidence!", correct = T),
 allow_retry = TRUE,
 correct = "Thanks for sharing. ",
 min = 1,
 max = 10,
 step = 1
)

Icons

You will see these icons throughout the exercises:

Icon |Meaning ------|-------------------- r fontawesome::fa("eye", fill = "darkblue")|Observe
r fontawesome::fa("exclamation", fill = "red")|Alert!
r fontawesome::fa("pen", fill = "brown")|An informative note
r fontawesome::fa("terminal", fill = "black")|Time for you to code!
r fontawesome::fa("window-restore", fill = "darkgrey")|Change to another window
r fontawesome::fa("bookmark", fill = "orange")|Remember this for later

License

Creative Commons License Applied Epi Incorporated, 2022
This work is licensed by Applied Epi Incorporated under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Please email contact@appliedepi.org with questions about the use of these materials for academic courses and epidemiologist training programs.

Learning objectives

In this exercise you will:

Begin a new R script for Ebola analysis

Open your RStudio project

r fontawesome::fa("window-restore", fill = "darkgrey") Open the "intro_course" R project that you created in the previous exercise.

You can do this by navigating to the "intro_course" folder, and double-clicking on the blue ".Rproj" file in the folder. RStudio should open.

knitr::include_graphics("images/new_r_project_explorer.png", error = F)

Begin a new R script

Open a new R script, either by clicking the new script icon in the upper-left corner of RStudio, or by clicking File -> New File -> R Script.

This will be the primary R script that you work in for the next 2 days. Name it "ebola_analysis.R" and save it into the "scripts" subfolder of the R project.

Add documentation

Add a section heading at the top called "About this script". To do this, place your cursor where the new section should start and press Ctrl, Shift, and R at the same time (or Cmd Shift R on a Mac). In the pop-up, name the section "About this script".

The new header should look something like this:

# About this script ----------------------------------------------

Write # comment lines below this heading to document:

Remember that you can navigate your script using these section headers using the "Outline" button in the upper-left of the R script.

Load packages

The first coding step of any script is to install and/or load the packages for the entire script. It is best to do this at the very top, so that any reviewer can see which packages are used.

Add a new section heading that will contain the code to "Load packages".

# Load packages  ----------------------------------------------

Now write a command using {pacman} and its function p_load() that will install/load the following packages:

Write your command vertically so that you can add descriptive comments for each package.
Try to write this command first, and then check the solution code below.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

# Load packages ----------------------------------------------
# The pacman package will install each package if necessary,
# and load it for use in the current session

pacman::p_load(
  rio,          # for importing data
  here,         # for locating files
  skimr,        # for reviewing the data
  janitor,      # for data cleaning  
  lubridate,    # for date cleaning  
  epikit,       # creating age categories
  gtsummary,    # for nice tables
  tidyverse     # for data management and visualization
)

Don't forget to write pacman::p_load(), which loads the {pacman} package, to then be able to load all the other packages.

Import data

Add a new section heading that will contain the code to "Import data".

# Import data ----------------------------------------------

In this section, write a command that imports the "surveillance_linelist_20141201.csv" from the "data/raw" subfolder of the R project.
Use import() and here() to import the file, and name the dataset in R as surv_raw (a shorthand for "surveillance linelist raw") using the assignment operator <-

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))


Exploratory analysis

First let's run some code to look at the different values in the dataset.

Add a new section heading that will contain the code to "Exploratory analysis".

# Exploratory analysis -----------------------------------------

Much of the code below you will learn in upcoming modules. For now, simply copy-paste the code into your R Console and run it. Do not spend much time trying to understand the code. Then, answer the questions about the outputs that appear in the Console and Plots panes.

Column names

The {base} function names() returns the column names of a data frame. We can place the name of our dataset inside, and view the result in the Console.

names(surv_raw)

Another way to write the same command, is to pipe surv_raw into names().

surv_raw %>% names()
quiz(caption = "Quiz - column names",
  question("Which of the column names contain non-standard syntax and will need to be cleaned?",
    allow_retry = T,
    answer("onset date", correct = T, message = "This one contains a space"),
    answer("wt (kg)", correct = T, message = "Spaces and parentheses need to be removed"),
    answer("hospital", message = "Spaces and parentheses need to be removed"),
    answer("age unit", correct = T, message = "space needs to be removed"),
    answer("ht (cm)", correct = T, message = "Spaces and parentheses need to be removed"),
    answer("date of report", correct = T, message = "Spaces need to be removed or converted")

  )
)

Missing values in R

As we conduct our exploratory analysis, we need to be aware of how R handles missing values.

In R, missing values are represented by a reserved (special) value - NA. Note that NA is treated differently by certain functions because having missingness recognized is important.

Note that NA is typed without quotes. “NA” is different and is just a normal character value (also a Beatles lyric from the song Hey Jude).

Your data may have other ways of representing missingness, such as 99, a period, “Missing”, or “Unknown”. Sometimes, empty cells in a CSV or XLSX can import into R with a blank space - it will look empty but actually there is a character space like this: " ", or even a character no space, like this: "".

Below is a short table to show you how these scenarios look.

To assess whether something in R is truly missing (NA), we use the function is.na(), which asks simply "Is this value NA?" and returns either TRUE or FALSE. The third column of the table is the result of is.na() applied to the value in the second column.

Only the final value in this table will be recognized as a truly "missing value" by R.

# create the demo dataset
#########################


demo_missing <- data.frame(
  `As written in data` = c('"Missing"', '"Unknown"', '" "', '""', '"."', '"NA"', NA),
  `As displayed in column` = c("Missing", "Unknown", " ", "", ".", "NA", NA)) %>% 

  rename(
    `As written in data` = `As.written.in.data`,
    `As displayed in column` = `As.displayed.in.column`
  ) %>% 

  mutate(`As assessed by is.na()` = is.na(`As displayed in column`))

#   
# demo_missing <- tribble(
#   
#   ~`As written in data`, ~`As displayed in column`,
#   '"Missing"',            "Missing",       
#   '"Unknown"',            "Unknown",       
#   '" "',                  " ",             
#   '""',                   "",              
#   NA,                      NA) %>% 
#   
#   mutate(`As assessed by is.na()` = is.na(`As displayed in column`))

demo_missing

As shown above, you can sometimes have an empty character value which will look blank but would actually be referenced in R as: "" (different than NA). For example, below we pipe our raw data frame into the tabyl() function, and tell this function to print the number of rows for each unique value in the column fever.

# tabulate fever
surv_raw %>% 
  tabyl(fever)

Do you see the counts for "no", "yes", and the third value is ... an empty space. This is not properly recognized as NA (missing) by R, so we will need to recode this value in our cleaning.

r fontawesome::fa("eye", fill = "darkblue") Let's keep an eye out for how missing values have been imported in our data set, as we begin to clean.

Gender

The package {janitor} and its function tabyl() can be used to make a rapid tabulation of values in a specified column.

Below, we again run a command to pipe our dataset into the tabyl() function, to get the counts for each unique value in the column gender.

# tabulate gender
surv_raw %>% 
  tabyl(gender)
quiz(caption = "Quiz - Gender",
  question_numeric(
    "How many rows are coded as 'Unknown' gender?",
    answer(26, correct = T),
    allow_retry = TRUE,
    correct = "Correct, nice work.",
    min = 1,
    max = 700,
    step = 1
  )
)

In the previous section we noted that missing values should be listed as NA. Is that the case here?

No. The missing values have been coded by the dataset owner as "Unknown". This may be sufficient for some analyses in R, but usually we want to start our analysis workflow with these as NA so that R correctly handles them as missing data.

We will need to address this in our cleaning steps later.

Age

Let's review the age distribution using the {ggplot2} package, which is the most common package for data visualization.

Below, the function ggplot() uses our data frame surv_raw, the column age is mapped to the X-axis, and the data are visualized as a histogram. Again, do not worry about understanding this function now - we will have two entire modules that cover {ggplot2}.

# a histogram of the case ages 
ggplot(data = surv_raw, mapping = aes(x = age))+
  geom_histogram()

Note that a warning message may appear when you run this command. A "Warning" message in the Console means that R is able to execute your command, but there is something you should probably know. In contrast, an "Error" message means that R was not able to execute your command.

Below, the command is changed only slightly, to display a boxplot instead of a histogram.

# a boxplot of the case age distribution
ggplot(data = surv_raw, mapping = aes(x = age))+
  geom_boxplot()
quiz(caption = "Quiz - age distribution",
  question("The median age is:",
    allow_retry = T,
    answer("Below 20", correct = T, message = ""),
    answer("Above 20", message = "The median age can be found in the boxplot")
  )
)

Ah! But some of the cases have recorded age in months.

surv_raw %>% 
  tabyl(`age unit`)

Note that because this column name has a space in it, we had to write it surrounded by backticks, in order for R to understand that we are referring to a known column name. Cleaning this column name will be one of our first actions once we begin our data cleaning steps.

quiz(caption = "Quiz - age unit",
  question_numeric(
    "How many rows contain age recorded in months?",
    answer(5, message = "", correct = T),
    allow_retry = TRUE,
    correct = "Correct, nice work.",
    min = 1,
    max = 700,
    step = 1
  )
)

Date of onset

The first concern when "cleaning" a date column in R, is which "class" the column has been assigned by default.

The class() command is a {base} function that returns the class of an object.

Note the class of the object surv_raw, our linelist.

class(surv_raw)

Below, we inspect the class of the column age. Here we use the $ operator to specify that within the data frame surv_raw, we want to examine the column age.

Type this command letter-by-letter. When you type the $, you should see a list appear of all the columns in the data frame (this is an auto-complete feature, which can be adjusted in RStudio -> Tools -> Global Options).

class(surv_raw$age)

Now let's see what the class of the column onset date is:

class(surv_raw$`onset date`)

During the import of the data, R has read this column as character values (e.g. "11/03/2014") and did not assume that it was a date. Indeed - how would R know which digits referred to months, or to days? In CSV files in particular, these columns are often first imported as character values. For analysis purposes, it is essential that this column is properly recognized as class "date" and the days and months correctly identified. We will address this in the cleaning steps.

quiz(caption = "Quiz - column class",
  question("What is the class of the column `date of report`?",
    allow_retry = TRUE,
    answer("Numeric"),
    answer("Character", correct = T),
    answer("Proletariat"),
    answer("Logical"),
    answer("Date")
  )
)

Numeric columns

We can also use histograms to examine other numeric columns like weight, height, temperature, etc.

Does anything seem amiss? Are there any values to clean? Values that are impossible?

This code is for the weight column.

ggplot(data = surv_raw, mapping = aes(x = `wt (kg)`))+
     geom_histogram()

This code is for the height column.

ggplot(data = surv_raw, mapping = aes(x = `ht (cm)`))+
     geom_histogram()

Location

There are two columns in the dataset that describe the district of the case:

How well do these two align? We can run a cross-tabulation:

The first-listed column (district of residence) will appear in the rows, and the district of detection will appear as the columns.

surv_raw %>% 
  tabyl(adm3_name_res, adm3_name_det)
quiz(caption = "Quiz - residence",
  question("How many cases were residents of Mountain Rural but were detected in Central II?",
    allow_retry = T,
    answer("12", message = "residence is the rows, and detected is the columns"),
    answer("4", message = "residence is the rows, and detected is the columns"),
    answer(surv_raw %>% filter(adm3_name_res == "Mountain Rural" & adm3_name_det == "Central II") %>% nrow(), correct = T, message = "residence is the rows, and detected is the columns"),
    answer("209", message = "residence is the rows, and detected is the columns")
  )
)

See the first row and the last column of the tabulation - it appears that again, missing values were read into R as a blank character space. This may be a problem in all of the columns that are class "character". Don't worry, we will address this when cleaning the data.

r fontawesome::fa("exclamation", fill = "red") Have you saved your script lately?

Cleaning pipe chain

Now that you have reviewed the data, you can begin to build a "pipe chain" in your R script that will clean the surveillance dataset.

The "pipe" operator %>%

A pipe chain refers to the use of the "pipe" operator %>% to chain together operations (such as cleaning, reordering and manipulating) for altering a dataset.

This operator allows us to carry out several different alterations of the data frame in a single operation. It can be helpful to think of the pipe operator %>% as telling the computer "and then" between your functions.

Keyboard shortcut

R coders use the pipe operator so frequently that there is a keyboard shortcut! You will want to memorize this:

Windows/Linux: Press Ctrl, Shift, and "m" at the same time
Mac: Press Cmd, Shift, and "m" at the same time

Try it out in your script!

Building your cleaning pipe chain

For a given dataset, it is best to store all the cleaning steps in one connected sequence of commands, all linked together by "pipe" operators %>%

To begin, let's do the following:

Create a cleaning section in your script

Add a new section to your script, below the exploratory analysis section:

# Clean surveillance data -----------------------------------------

You can press Ctrl+Shift+R to quickly create a new section in the script.

Begin the cleaning pipe chain

A cleaning pipe chain has three main components:

1) The command begins with an <- assignment operator, with the name you want to give to the "clean" data on the left
2) To the right of the <- you write the name of the "raw" dataset, and a %>% pipe operator
3) On the subsequent line, you write a series of cleaning functions, all connected to each other by the %>% pipe operator (e.g. renaming columns, de-duplicating, recoding values, filtering rows by logical criteria, etc.)

You will build this chain step-by-step, function-by-function, over this and the subsequent exercise. You should only write ONE cleaning pipe chain for the surveillance data. See the image below, for an example, with the %>% operators highlighted:

Do NOT write the code below - simply view it as an example. This is the script that we will write over the coming sessions.

knitr::include_graphics("images/cleaning_pipe_chain.png", error = F)

Because all the steps are linked with %>% pipe operators, the entire pipe chain is effectively one command. Because this command uses the assignment operator <-, the modified dataset will not be printed to the R Console when the command is run. Instead, the changes are saved to the new "clean" data frame surv. To see that clean data frame you have two options:

1) Click on surv in the Environment pane (upper-right of RStudio)
2) Use the command View() (note the capital "V")

You should view the dataset after each function is applied, so that you can track the changes.

Now let's begin building your cleaning pipe chain!

Standardize column names

Often, the first task is to clean the column names. This makes the rest of the coding easier, as you can reference columns by their clean syntax.

Let us insert the function clean_names() into the pipe chain. This is a very easy function to use - you do not need to put anything in the parentheses! clean_names() will ensure that all column names are lowercase, that spaces are converted to underscores, and that any special characters are removed.

Run your pipe chain with clean_names() and save the result to surv.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 
  clean_names()


View surv by clicking on it in RStudio. Note that some column names have changed!

quiz(caption = "Quiz - clean_names()",
  question("How is the column name 'ht (cm)' written after using clean_names()?",
    allow_retry = T,
    answer("height"),
    answer("ht_cm", correct = T),
    answer("ht_(cm)"),
    answer("HT-CM")
  ),

  question("What is the new name of the column that contains date of symptom onset?",
    allow_retry = T,
    answer("onset date"),
    answer("date-of-onset"),
    answer("onset_date", correct = T),
    answer("onset")
  )
)

That was easy! You have begun your cleaning pipe chain, but there is more cleaning to do...

Manually clean column names

We still need to make some changes to column names manually:

Simplify the location column names for easier reading (and typing!)

The primary mechanism for these manual changes is the {dplyr} function rename(). Here is the basic syntax:

surv <- surv_raw %>% 
  clean_names() %>% 
  rename(
    NEWname = OLDname,
    NEWname = OLDname
  )

A few things to note:

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 
  clean_names() %>% 
  rename(
    # NEW NAME     # OLD NAME
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det
  )


Review surv once you have successfully run your expanded pipe chain. Verify that the desired changes did take place.

Create a "Testing" area

Add a new section heading below your cleaning pipe chain section that will contain "Testing" code. (remember you can use the shortcut Ctrl Shift and R to create a section heading as below):

# Testing code -----------------------------------------

You will find that often you want to try out code or commands, that are not (yet) added to your cleaning pipe chain. You can do this here.

This section will hold stand-alone, independent commands that are not part of your cleaning pipe chain. You can think of this as a "playground" to test things before making changes to your pipe chain.

Select columns

The select() function is versatile and often used to:

Select columns to keep

The primary purpose of select() is to select columns to keep in the dataset.

Any column name that is not included in the parentheses will be dropped. Try typing the below command in your "Testing area" and running it. It takes your partially-clean data set surv and selects only a few columns, printing (not saving) the result.

surv %>% 
  select(epilink, age, gender)

What did you notice about the new ordering of the columns? select() also re-orders columns.

It may be taxing to list every column individually. You can use a colon ( : ) to keep the two named columns and all those in between. Modify your test code as below:

# keeps fever, vomit, and the columns in between them 
surv %>% 
  select(fever:vomit)

You can also list some columns, to move them to the beginning, and finish with everything() to include all the other columns (note the empty parentheses at the end of everything()). There are other such "tidyselect" helpers available to you such as contains() and starts_with() that you can read more about in this Epi R Handbook chapter.

Modify your test code as below and try running it:

# use everything()
surv %>% 
  select(fever, gender, date_report, everything())

Drop columns by name

In the Testing area, practice using a minus symbol ( - ) to drop the column epilink. All other columns are retained.

surv %>% 
  select(-epilink)

Now practice dropping multiple columns by wrapping all their names in a vector with c() - this applies the negative sign to all of them. We learned about the {base} function c() in the previous module, which stands for "concatenate". Here it is used to create a vector of column names and the - sign is applied to all of them.

# Removes these three columns
surv %>% 
  select(-c(epilink, district_det, age))

# an equivalent command to above
surv %>% 
  select(-epilink, -district_det, -age)

A more advanced (and efficient) maneuver is to use select() to rename columns! If you try this, it uses the same syntax as rename(), but remember that it only keeps the columns that you write!

Add to pipe chain

r fontawesome::fa("terminal", fill = "black") Now it is time to add a select() statement to your cleaning pipe chain!

Move away from the "Testing" area and add to your pipe chain a step that removes the column row_num from the dataset. This column is not necessary for our analyses.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

Your cleaning pipe chain should now look like this:

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num)


Deduplicate

At this point, we should check for duplicates in our data. This can be done with the distinct() function. If run with empty parentheses, it will remove rows that are 100% duplicates (it will keep only the first row, of rows for which the value in every column is the same).

Add the function distinct() to the pipe chain, and run your pipe chain.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

Your cleaning pipe chain should now look like this:

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct()


Were any rows removed?

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Compare the number of rows in surv to the number of rows in surv_raw (in the Environment pane).


distinct() can be used in more advanced ways, such as to only consider certain columns when evaluation duplicates (e.g. remove rows that have the same age, gender, and first name).

For example, adjusting your command to read: distinct(case_id) will consider a row as a "duplicate" if the case_id values are identical (values in ALL columns no longer have to be identical to be considered a "duplicate").

Notice however, that the default behavior if you run distinct(case_id) is that all the other columns were dropped from the dataset! We probably would not want that. You can add the argument .keep_all = TRUE as below, and all those columns will be kept.

distinct(case_id, .keep_all = TRUE)

You can read more about the distinct() function its documentation, and in the Epi R Handbook cleaning chapter.

In fact, there is an entire chapter of the Epi R Handbook dedicated to de-duplication that we encourage you to read if you do a lot of record deduplication.

Before you move on, please revert your de-duplication command to read as only distinct() with nothing in the parentheses.

r fontawesome::fa("exclamation", fill = "red") Have you saved your script lately?

Column classes

The class() function

Each column in the data frame has a specific "class". Classes reflect how R understands the values in the column.

You reviewed the classes of each column with the skim() function from the {skimr} package. You can also view the class of a single column using this syntax (run this in your "Testing" section):

# print the class of a column
class(surv$gender)

How does this class() command compare to a pipe chain?

This is a "stand-alone" command - not a part of your pipe chain. Because it is an independent command, we use the $ operator to link the column name to its data frame: surv$gender. This linkage is required because, as you know, R can handle multiple datasets at one time.

If we did not specify the data frame, R would return an error: "Error: object 'gender' not found". In a pipe chain the $ is not necessary, because the data frame is being passed from one function to the next with the pipe operator.

What are possible column classes?

Common column classes include:

An early part of the data cleaning process is ensuring that R correctly understands the class for each column. R is generally good about guessing this, but sometimes it needs your help.

Date class

We recognized that the column date_onset (previously named "onset date" in the raw dataset) is recognized by R as class "character". Often, dates in Excel worksheets arrive in this class, or in numeric (e.g. 48256), or in an even stranger class called "POSIXct". Either way, we want to designate this column as a normal and dependable class "date".

First, we need to understand how the date is currently written in the data before the function is applied.

Run the following command in your "Testing" section, to look at the first six values of date_onset in the data frame surv:

head(surv$date_onset)
quiz(caption = "Quiz - raw date format",
  question("Which format is date_onset??",
    allow_retry = T,
    answer("YYYY-MM-DD"),
    answer("DD/MM/YYYY"),
    answer("MM/DD/YYYY", correct = T)
  )
)

{lubridate} has several helper functions designed to convert character objects to dates in a more intuitive and lenient way. These three functions correspond to the format of the date prior to use of the function. The three functions are ymd(), mdy(), and dmy(). Their names correspond to the approximate format they expect the date be in before this function is applied.

Thankfully, they are adaptable in that they allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January):

The ymd() function converts raw character values to date class that are written as: year, month, day.

# ymd() should be used to convert raw values written as: year, month, day
ymd("2020-10-11")
ymd("20201011")

The mdy() does the same but for raw character values written as: month, day, year.

# mdy() should be used to convert raw values written as: month, day, year
mdy("Oct 11 20")
mdy("10/11/20")

and the dmy() should be used for raw character values written as: day, month , year.

# mdy() should be used to convert raw values written as: month, day, year
dmy("11 October 2020")
dmy("11/10/2020")

In ALL of the above commands, the outcome is values in date class that display as YYYY-MM-DD.

quiz(caption = "Quiz - date format",
  question("Which function should be used to convert the column date_onset?",
    allow_retry = T,
    answer("ymd()"),
    answer("dmy()"),
    answer("mdy()", correct = T)
  )
)

The most important thing to remember is to use the function that matches the current (raw) date format

mutate()

Changing a column is done with mutate(). This is a general function for changing a column - that could include:

1) Creating a new column
2) Changing or re-defining an existing column

column name = a function to use to create/modify the column

Thus, the command to add to our pipe chain to modify the column date_onset to be converted to class "date" would look like this:

mutate(date_onset = mdy(date_onset))

Above, the code is in effect saying: "change (mutate) the date_onset column, such that it becomes the date_onset column as modified by the function mdy()."

Now add two mutate() commands to the pipe chain, that convert date_onset and date_report to class "date".

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report))


Now we can inspect the date column, and use it in plotting functions like a continuous variable. Previously, as class character, a plot like the one created below would not have been possible.

We give you this code below not to learn plotting, but simply to test it in RStudio. You will learn how to make plots with {ggplot2} later in the course.

Write it in your "Testing" section and run it.

# plot a histogram of dates using the cleaned dataset and date column
ggplot(data = surv, mapping = aes(x = date_onset))+
  geom_histogram()
quiz(caption = "Quiz - Date",
  question("In the most recent weeks, what is the apparent trend in the number of reported cases?",
    allow_retry = T,
    answer("Increasing"),
    answer("Decreasing", correct = T)
  )
)

Age as numeric

There is another column that we should check the class of: age. What is the class of this column? (write this command in your "Testing" section)

class(surv$age)

The command above should have returned "integer". While this is OK for the moment... we may want to perform calculations with this column that would result in it needing to accept decimal values. As class "integer", this column cannot accept decimals, but as class "numeric" it can.

Thankfully in {base} R there are functions like as.numeric(), as.character(), and as.integer() to easily convert columns from one class to another. Simply put the column name within the parentheses, and it is converted.

In your pipe chain, add a mutate() command that converts this column using the as.numeric() function.

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

In the mutate() function, re-define age as equal to itself as wrapped within the as.numeric() function.


r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age))


r fontawesome::fa("exclamation", fill = "red") Have you saved your script lately?

Simple recoding

There are a variety of function that can be used to perform simple recoding of values. We will show you some in the sections below, as you add them to your pipe chain.

Missing values

As identified earlier, in "raw" data there is a value "Unknown" in the column gender.

Run the code below (in your "Testing" section), which creates a counts table of the unique values in the column gender in the "raw" data.

surv_raw %>% 
  tabyl(gender)

We want to tell R that if the value in gender is "Unknown" then R should change the value to NA. We may change this later, but for now, let's convert it to NA.

Use mutate() on the column gender and set it equal to the function na_if(). In the parentheses of na_if(), again write gender, then a comma, and then "Unknown" in quote marks: mutate(gender = na_if(gender, "Unknown")).

The function na_if() replace a specified value with NA. This code is saying: "Change gender, such that its values are equivalent to the column gender as changed by the function na_if() so that any instances of "Unknown" are replaced by NA".

Add this command to your pipe chain - it will not run as a standalone command by itself.

# This command would be added to a pipe chain
mutate(gender = na_if(gender, "Unknown"))

Note: it is also useful to know the opposite function, replace_na(), which replaces NA with a specified value.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" gender to NA
  mutate(gender = na_if(gender, "Unknown"))


Now re-run the tabyl() command on column gender, using the clean dataset surv, and see how the missing values are represented as NA.

We need to apply this change for all the columns in the dataset that are class "character" (gender, hospital, gender, wt_kg, ht_cm, fever, chills, cough...). It is a long list...

Across

As mentioned earlier, when R imports "raw" files from CSV or XLSX sometimes it records missing values in "character" columns as an empty character "", instead of as the special symbol NA.

Run the code below (in your "Testing" section), which creates a counts table of the unique values in the column fever.

surv %>% 
  tabyl(fever)

We need to tell R that if the value in fever is empty "" then R should change the value to NA. Don't spend the time to look, but there are many columns in the dataset where missing is recorded as "". How can we make all these changes efficiently?

We are going to give you a command that uses across() within the mutate() to change multiple columns at once. This is an intermediate R command, so please do not spend time trying to understand it. Later, you can read more about across() in the Epi R Handbook here.

Add this command to your pipe chain.

mutate(across(.cols = where(is.character), .fns = na_if, ""))

Briefly, this command is saying: Mutate all the columns that are character class, and use the function na_if() to convert any instances of "" to NA. It should change the columns fever, vomit, hospital, district, and any other character columns.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" gender to NA
  mutate(gender = na_if(gender, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = na_if, ""))


Assessing missingness

While we are discussing missing values, let's introduce you to the core R functions used to assess missingness.

We have two date columns in the data frame: date_onset and date_report. The first is the date of symptom onset, as recorded by the surveillance team. The second is the date the case was reported to the surveillance team.

Which one should we be primarily using?

There are many factors that go into a decision like this, such as the purpose of the analysis, the audience, etc. One important factor is data quality and missingness.

Missingness in R is represented by NA - or at least it should be - as we discussed above. The function to assess missingness is is.na() (whereas !is.na() is used to test for non-missingness - the exclamation point is a negation operator). In both of these functions, the object you want to assess in placed within the parentheses.

r fontawesome::fa("bookmark", fill = "orange") Please remember the below commands related to missingness in R.

To count the number of rows that meet a certain logical criteria, you can further wrap them in sum().

Run these two commands below (as stand-alone commands in the "Testing" section)

sum(is.na(surv$date_onset))   # number of rows where is.na(surv$date_onset) evaluates to TRUE
sum(is.na(surv$date_report))  # number of rows where is.na(surv$date_report) evaluates to TRUE
quiz(caption = "Quiz - Date",
  question("Which column has more missing data?",
    allow_retry = T,
    answer("date_onset", correct = T),
    answer("date_report")
  ),
  question("Which percent of 'date_onset' is missing?",
    allow_retry = T,
    answer("16.4%"),
    answer("10.2%"),
    answer("1%"),
    answer("5.2%", correct = T)
  )
)

Re-coding hospital names

Our pipe chain is getting quite long! Each step is important, and thankfully it is all well-documented in our code. This transparency is useful for ourselves, anyone reviewing our code later, and for scientific integrity!

Now we turn to the column hospital. Try this code in your "Testing" section. It will print a tabulation of all the values in this column:

surv %>% 
  tabyl(hospital)

It seems there is some cleaning to be done! Note the different spellings of "hospital" and "hopital", of "Military" and "Mitilary", and of "Port" and "Port Hospital". These need to be syncronized.

An easy option is recode(). The syntax is relatively simple when we want to simply exchange one value for another:

surv %>% 
  mutate(hospital = recode(hospital,
    "OLD value" = "NEW value",
    "OLD value" = "NEW value",
    "OLD value" = "NEW value"))

The above code can be read as: "Re-define hospital as hospital with these recodes applied". Note that this uses the opposite syntax order from rename(), which was NEW = OLD.

Using the above code as a template, try to write a command at the end of your pipe chain that aligns all the hospital names to the following:

We will leave NA hospitals as they are.

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Begin as before by piping to a mutate(). Within the mutate() parentheses, write the name of the column to re-define hospital and an equals sign. On the right side of the equals sign, you write the function recode(). Within the parentheses of recode(), the first argument is again the name of the column to re-define.

After a comma, continue with one line per change, with the old value on the left, and the new value on the right. As these are character values, put them in quotes. Don't forget commas between each line, and to close with the correct number of parentheses.


r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" gender to NA
  mutate(gender = na_if(gender, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = na_if, "")) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH"))


Recoding gender

Can you add a recode() statement to your pipe chain that converts the values in the column gender from "m" to "male" and from "f" to "female"?

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" gender to NA
  mutate(gender = na_if(gender, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = na_if, "")) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode gender
  mutate(gender = recode(gender,
    "m" = "male",
    "f" = "female"))


End

Congratulations! You have finished the first part of the data cleaning in this course!

Save your script, and check in with your instructor to let them know that you finished.

If you have extra time, ensure that your script is aligned with the "1_ebola_after_cleaning_part1.R" file in the "scripts/backups" folder.

Extras

{tidylog}

If you have extra time, try to install the {tidylog} package by inserting it into your pacman::p_load() command and re-running it.

Once loaded, this function records every data cleaning step that you run (e.g. filter(), mutate(), select(), etc.) and will print a record of the changes in the console. For example, re-run your cleaning pipe chain and see the impact of each step in the process.

Deduplication

Deduplication is a significant topic that is often more complex than the simple example in this exercise. We have written about more complex deduplication scenarios is this the Epi R Handbook chapter on de-duplication.

More date cleaning

Date and times involve many special circumstances. Familiarise yourself with this Epi R Handbook chapter on working with dates.

Of particular note, review the package {parsedate} and its function parse_date(). This is used to clean those very messy date columns for which the {lubridate} functions mdy(), dmy(), and ymd() will not work because the dates are not in the same general format across all the rows. parse_date() will evaluate each row individually, intelligently guessing the format of each cell. This takes more time, but is useful if the dates are very messy.

Additional reading

If you have extra time, we encourage you to review the following for more data cleaning tips:

If you have questions about any of the above subjects, contact your facilitator and if they are not busy helping other students they can help you.



appliedepi/introexercises documentation built on April 22, 2024, 1:01 a.m.