# load packages ---------------------------------------------------------------- library(introexercises) # get data for exercises library(learnr) # create lessons from rmd library(gradethis) # evaluate exercises library(dplyr) # wrangle data library(stringr) # strings library(flair) # highlight code library(ggplot2) # visualise data library(lubridate) # work with dates library(epikit) # for emojis library(skimr) # browse data library(fontawesome) # for emojis library(janitor) # clean data # library(RMariaDB) # connect to sql database ## set options for exercises and checking --------------------------------------- ## Define how exercises are evaluated gradethis::gradethis_setup( ## note: the below arguments are passed to learnr::tutorial_options ## set the maximum execution time limit in seconds exercise.timelimit = 60, ## set how exercises should be checked (defaults to NULL - individually defined) # exercise.checker = gradethis::grade_learnr ## set whether to pre-evaluate exercises (so users see answers) exercise.eval = FALSE ) # ## event recorder --------------------------------------------------------------- # ## see for details: # ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events # ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R # # ## connect to your sql database # sqldtbase <- dbConnect(RMariaDB::MariaDB(), # user = Sys.getenv("userid"), # password = Sys.getenv("pwd"), # dbname = 'excersize_log', # host = "144.126.246.140") # # # ## define a function to collect data # ## note that tutorial_id is defined in YAML # ## you could set the tutorial_version too (by specifying version:) but use package version instead # recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) { # # ## define a sql query # ## first bracket defines variable names # ## values bracket defines what goes in each variable # event_log <- paste("INSERT INTO responses ( # tutorial_id, # tutorial_version, # date_time, # user_id, # event, # section, # label, # question, # answer, # code, # correct) # VALUES('", tutorial_id, "', # '", tutorial_version, "', # '", format(Sys.time(), "%Y-%M%-%D %H:%M:%S %Z"), "', # '", Sys.getenv("SHINYPROXY_PROXY_ID"), "', # '", event, "', # '", data$section, "', # '", data$label, "', # '", paste0('"', data$question, '"'), "', # '", paste0('"', data$answer, '"'), "', # '", paste0('"', data$code, '"'), "', # '", data$correct, "')", # sep = '') # # # Execute the query on the sqldtbase that we connected to above # rsInsert <- dbSendQuery(sqldtbase, event_log) # # } # # options(tutorial.event_recorder = recorder_function)
# hide non-exercise code chunks ------------------------------------------------ knitr::opts_chunk$set(echo = FALSE)
# data prep -------------------------------------------------------------------- surv_raw <- rio::import(system.file("dat/old_version/surveillance_linelist_20141201.csv", package = "introexercises"))
Welcome to the live course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation that offers open-source tools, training, and support to frontline public health practitioners.
knitr::include_graphics("images/logo.png", error = F)
This exercise focuses on cleaning data, including filter rows, selecting columns, deduplicating rows, creating new columns, creating categorical variables, and cleaning dates.
This exercise will guide you through a set of tasks.
You should perform these tasks in RStudio and on your local computer.
There are several ways to get help:
1) Look for the "helpers" (see below)
2) Ask your live course instructor/facilitator for help
3) Ask a colleague or other participant in the course for tips
4) Post a question in Applied Epi Community in the category for questions about Applied Epi Training
Here is what those "helpers" will look like:
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Here you will see a helpful hint!
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
linelist %>% filter( age > 25, district == "Bolo" )
Here is more explanation about why the solution works.
Please complete the quiz questions that you encounter throughout the tutorial. Answering will help you to comprehend the material, and will also help us to improve the exercises for future students.
To practice, please answer the following questions:
quiz( question_radio("When should I view the red 'helper' code?", answer("After trying to write the code myself", correct = TRUE), answer("Before I try coding", correct = FALSE), correct = "Reviewing best-practice code after trying to write yourself can help you improve", incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer." ) )
question_numeric( "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?", answer(10, message = "Try not to worry, we will help you succeed!", correct = T), answer(9, message = "Try not to worry, we will help you succeed!", correct = T), answer(8, message = "Try not to worry, we will help you succeed!", correct = T), answer(7, message = "Try not to worry, we will help you succeed!", correct = T), answer(6, message = "Ok, we will get there together", correct = T), answer(5, message = "Ok, we will get there together", correct = T), answer(4, message = "I like your confidence!", correct = T), answer(3, message = "I like your confidence!", correct = T), answer(2, message = "I like your confidence!", correct = T), answer(1, message = "I like your confidence!", correct = T), allow_retry = TRUE, correct = "Thanks for sharing. ", min = 1, max = 10, step = 1 )
You will see these icons throughout the exercises:
Icon |Meaning
------|--------------------
r fontawesome::fa("eye", fill = "darkblue")
|Observe
r fontawesome::fa("exclamation", fill = "red")
|Alert!
r fontawesome::fa("pen", fill = "brown")
|An informative note
r fontawesome::fa("terminal", fill = "black")
|Time for you to code!
r fontawesome::fa("window-restore", fill = "darkgrey")
|Change to another window
r fontawesome::fa("bookmark", fill = "orange")
|Remember this for later
Applied Epi Incorporated, 2022
This work is licensed by Applied Epi Incorporated under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Please email contact@appliedepi.org with questions about the use of these materials for academic courses and epidemiologist training programs.
In this exercise you will:
r fontawesome::fa("window-restore", fill = "darkgrey")
Open the "intro_course" R project that you created in the previous exercise.
You can do this by navigating to the "intro_course" folder, and double-clicking on the blue ".Rproj" file in the folder. RStudio should open.
knitr::include_graphics("images/new_r_project_explorer.png", error = F)
Open a new R script, either by clicking the new script icon in the upper-left corner of RStudio, or by clicking File -> New File -> R Script.
This will be the primary R script that you work in for the next 2 days. Name it "ebola_analysis.R" and save it into the "scripts" subfolder of the R project.
Add a section heading at the top called "About this script". To do this, place your cursor where the new section should start and press Ctrl, Shift, and R at the same time (or Cmd Shift R on a Mac). In the pop-up, name the section "About this script".
The new header should look something like this:
# About this script ----------------------------------------------
Write # comment lines below this heading to document:
Remember that you can navigate your script using these section headers using the "Outline" button in the upper-left of the R script.
The first coding step of any script is to install and/or load the packages for the entire script. It is best to do this at the very top, so that any reviewer can see which packages are used.
Add a new section heading that will contain the code to "Load packages".
# Load packages ----------------------------------------------
Now write a command using {pacman} and its function p_load()
that will install/load the following packages:
Write your command vertically so that you can add descriptive comments for each package.
Try to write this command first, and then check the solution code below.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
# Load packages ---------------------------------------------- # The pacman package will install each package if necessary, # and load it for use in the current session pacman::p_load( rio, # for importing data here, # for locating files skimr, # for reviewing the data janitor, # for data cleaning lubridate, # for date cleaning epikit, # creating age categories gtsummary, # for nice tables tidyverse # for data management and visualization )
Don't forget to write pacman::p_load()
, which loads the {pacman} package, to then be able to load all the other packages.
Add a new section heading that will contain the code to "Import data".
# Import data ----------------------------------------------
In this section, write a command that imports the "surveillance_linelist_20141201.csv" from the "data/raw" subfolder of the R project.
Use import()
and here()
to import the file, and name the dataset in R as surv_raw
(a shorthand for "surveillance linelist raw") using the assignment operator <-
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))
First let's run some code to look at the different values in the dataset.
Add a new section heading that will contain the code to "Exploratory analysis".
# Exploratory analysis -----------------------------------------
Much of the code below you will learn in upcoming modules. For now, simply copy-paste the code into your R Console and run it. Do not spend much time trying to understand the code. Then, answer the questions about the outputs that appear in the Console and Plots panes.
The {base} function names()
returns the column names of a data frame. We can place the name of our dataset inside, and view the result in the Console.
names(surv_raw)
Another way to write the same command, is to pipe surv_raw
into names()
.
surv_raw %>% names()
quiz(caption = "Quiz - column names", question("Which of the column names contain non-standard syntax and will need to be cleaned?", allow_retry = T, answer("onset date", correct = T, message = "This one contains a space"), answer("wt (kg)", correct = T, message = "Spaces and parentheses need to be removed"), answer("hospital", message = "Spaces and parentheses need to be removed"), answer("age unit", correct = T, message = "space needs to be removed"), answer("ht (cm)", correct = T, message = "Spaces and parentheses need to be removed"), answer("date of report", correct = T, message = "Spaces need to be removed or converted") ) )
As we conduct our exploratory analysis, we need to be aware of how R handles missing values.
In R, missing values are represented by a reserved (special) value - NA
. Note that NA
is treated differently by certain functions because having missingness recognized is important.
Note that NA
is typed without quotes. “NA” is different and is just a normal character value (also a Beatles lyric from the song Hey Jude).
Your data may have other ways of representing missingness, such as 99, a period, “Missing”, or “Unknown”. Sometimes, empty cells in a CSV or XLSX can import into R with a blank space - it will look empty but actually there is a character space like this: " ", or even a character no space, like this: "".
Below is a short table to show you how these scenarios look.
To assess whether something in R is truly missing (NA
), we use the function is.na()
, which asks simply "Is this value NA
?" and returns either TRUE
or FALSE
. The third column of the table is the result of is.na()
applied to the value in the second column.
Only the final value in this table will be recognized as a truly "missing value" by R.
# create the demo dataset ######################### demo_missing <- data.frame( `As written in data` = c('"Missing"', '"Unknown"', '" "', '""', '"."', '"NA"', NA), `As displayed in column` = c("Missing", "Unknown", " ", "", ".", "NA", NA)) %>% rename( `As written in data` = `As.written.in.data`, `As displayed in column` = `As.displayed.in.column` ) %>% mutate(`As assessed by is.na()` = is.na(`As displayed in column`)) # # demo_missing <- tribble( # # ~`As written in data`, ~`As displayed in column`, # '"Missing"', "Missing", # '"Unknown"', "Unknown", # '" "', " ", # '""', "", # NA, NA) %>% # # mutate(`As assessed by is.na()` = is.na(`As displayed in column`)) demo_missing
As shown above, you can sometimes have an empty character value which will look blank but would actually be referenced in R as: "" (different than NA
). For example, below we pipe our raw data frame into the tabyl()
function, and tell this function to print the number of rows for each unique value in the column fever
.
# tabulate fever surv_raw %>% tabyl(fever)
Do you see the counts for "no", "yes", and the third value is ... an empty space. This is not properly recognized as NA
(missing) by R, so we will need to recode this value in our cleaning.
r fontawesome::fa("eye", fill = "darkblue")
Let's keep an eye out for how missing values have been imported in our data set, as we begin to clean.
The package {janitor} and its function tabyl()
can be used to make a rapid tabulation of values in a specified column.
Below, we again run a command to pipe our dataset into the tabyl()
function, to get the counts for each unique value in the column gender
.
# tabulate gender surv_raw %>% tabyl(gender)
quiz(caption = "Quiz - Gender", question_numeric( "How many rows are coded as 'Unknown' gender?", answer(26, correct = T), allow_retry = TRUE, correct = "Correct, nice work.", min = 1, max = 700, step = 1 ) )
In the previous section we noted that missing values should be listed as NA
. Is that the case here?
No. The missing values have been coded by the dataset owner as "Unknown". This may be sufficient for some analyses in R, but usually we want to start our analysis workflow with these as NA
so that R correctly handles them as missing data.
We will need to address this in our cleaning steps later.
Let's review the age distribution using the {ggplot2} package, which is the most common package for data visualization.
Below, the function ggplot()
uses our data frame surv_raw
, the column age
is mapped to the X-axis, and the data are visualized as a histogram. Again, do not worry about understanding this function now - we will have two entire modules that cover {ggplot2}.
# a histogram of the case ages ggplot(data = surv_raw, mapping = aes(x = age))+ geom_histogram()
Note that a warning message may appear when you run this command. A "Warning" message in the Console means that R is able to execute your command, but there is something you should probably know. In contrast, an "Error" message means that R was not able to execute your command.
Below, the command is changed only slightly, to display a boxplot instead of a histogram.
# a boxplot of the case age distribution ggplot(data = surv_raw, mapping = aes(x = age))+ geom_boxplot()
quiz(caption = "Quiz - age distribution", question("The median age is:", allow_retry = T, answer("Below 20", correct = T, message = ""), answer("Above 20", message = "The median age can be found in the boxplot") ) )
Ah! But some of the cases have recorded age in months.
surv_raw %>% tabyl(`age unit`)
Note that because this column name has a space in it, we had to write it surrounded by backticks, in order for R to understand that we are referring to a known column name. Cleaning this column name will be one of our first actions once we begin our data cleaning steps.
quiz(caption = "Quiz - age unit", question_numeric( "How many rows contain age recorded in months?", answer(5, message = "", correct = T), allow_retry = TRUE, correct = "Correct, nice work.", min = 1, max = 700, step = 1 ) )
The first concern when "cleaning" a date column in R, is which "class" the column has been assigned by default.
The class()
command is a {base} function that returns the class of an object.
Note the class of the object surv_raw
, our linelist.
class(surv_raw)
Below, we inspect the class of the column age
. Here we use the $
operator to specify that within the data frame surv_raw
, we want to examine the column age
.
Type this command letter-by-letter. When you type the $, you should see a list appear of all the columns in the data frame (this is an auto-complete feature, which can be adjusted in RStudio -> Tools -> Global Options).
class(surv_raw$age)
Now let's see what the class of the column onset date
is:
class(surv_raw$`onset date`)
During the import of the data, R has read this column as character values (e.g. "11/03/2014") and did not assume that it was a date. Indeed - how would R know which digits referred to months, or to days? In CSV files in particular, these columns are often first imported as character values. For analysis purposes, it is essential that this column is properly recognized as class "date" and the days and months correctly identified. We will address this in the cleaning steps.
quiz(caption = "Quiz - column class", question("What is the class of the column `date of report`?", allow_retry = TRUE, answer("Numeric"), answer("Character", correct = T), answer("Proletariat"), answer("Logical"), answer("Date") ) )
We can also use histograms to examine other numeric columns like weight, height, temperature, etc.
Does anything seem amiss? Are there any values to clean? Values that are impossible?
This code is for the weight column.
ggplot(data = surv_raw, mapping = aes(x = `wt (kg)`))+ geom_histogram()
This code is for the height column.
ggplot(data = surv_raw, mapping = aes(x = `ht (cm)`))+ geom_histogram()
There are two columns in the dataset that describe the district of the case:
adm3_name_res
- this is the case's district of residence adm3_name_det
- this is the case's district of detection How well do these two align? We can run a cross-tabulation:
The first-listed column (district of residence) will appear in the rows, and the district of detection will appear as the columns.
surv_raw %>% tabyl(adm3_name_res, adm3_name_det)
quiz(caption = "Quiz - residence", question("How many cases were residents of Mountain Rural but were detected in Central II?", allow_retry = T, answer("12", message = "residence is the rows, and detected is the columns"), answer("4", message = "residence is the rows, and detected is the columns"), answer(surv_raw %>% filter(adm3_name_res == "Mountain Rural" & adm3_name_det == "Central II") %>% nrow(), correct = T, message = "residence is the rows, and detected is the columns"), answer("209", message = "residence is the rows, and detected is the columns") ) )
See the first row and the last column of the tabulation - it appears that again, missing values were read into R as a blank character space. This may be a problem in all of the columns that are class "character". Don't worry, we will address this when cleaning the data.
r fontawesome::fa("exclamation", fill = "red")
Have you saved your script lately?
Now that you have reviewed the data, you can begin to build a "pipe chain" in your R script that will clean the surveillance dataset.
%>%
A pipe chain refers to the use of the "pipe" operator %>%
to chain together operations (such as cleaning, reordering and manipulating) for altering a dataset.
This operator allows us to carry out several different alterations of the data frame in a single operation. It can be helpful to think of the pipe operator %>%
as telling the computer "and then" between your functions.
R coders use the pipe operator so frequently that there is a keyboard shortcut! You will want to memorize this:
Windows/Linux: Press Ctrl, Shift, and "m" at the same time
Mac: Press Cmd, Shift, and "m" at the same time
Try it out in your script!
For a given dataset, it is best to store all the cleaning steps in one connected sequence of commands, all linked together by "pipe" operators %>%
To begin, let's do the following:
Add a new section to your script, below the exploratory analysis section:
# Clean surveillance data -----------------------------------------
You can press Ctrl+Shift+R to quickly create a new section in the script.
A cleaning pipe chain has three main components:
1) The command begins with an <-
assignment operator, with the name you want to give to the "clean" data on the left
2) To the right of the <-
you write the name of the "raw" dataset, and a %>%
pipe operator
3) On the subsequent line, you write a series of cleaning functions, all connected to each other by the %>%
pipe operator (e.g. renaming columns, de-duplicating, recoding values, filtering rows by logical criteria, etc.)
You will build this chain step-by-step, function-by-function, over this and the subsequent exercise. You should only write ONE cleaning pipe chain for the surveillance data. See the image below, for an example, with the %>%
operators highlighted:
Do NOT write the code below - simply view it as an example. This is the script that we will write over the coming sessions.
knitr::include_graphics("images/cleaning_pipe_chain.png", error = F)
Because all the steps are linked with %>%
pipe operators, the entire pipe chain is effectively one command. Because this command uses the assignment operator <-
, the modified dataset will not be printed to the R Console when the command is run. Instead, the changes are saved to the new "clean" data frame surv
. To see that clean data frame you have two options:
1) Click on surv
in the Environment pane (upper-right of RStudio)
2) Use the command View()
(note the capital "V")
You should view the dataset after each function is applied, so that you can track the changes.
Now let's begin building your cleaning pipe chain!
Often, the first task is to clean the column names. This makes the rest of the coding easier, as you can reference columns by their clean syntax.
Let us insert the function clean_names()
into the pipe chain. This is a very easy function to use - you do not need to put anything in the parentheses! clean_names()
will ensure that all column names are lowercase, that spaces are converted to underscores, and that any special characters are removed.
Run your pipe chain with clean_names()
and save the result to surv
.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% clean_names()
View surv
by clicking on it in RStudio. Note that some column names have changed!
quiz(caption = "Quiz - clean_names()", question("How is the column name 'ht (cm)' written after using clean_names()?", allow_retry = T, answer("height"), answer("ht_cm", correct = T), answer("ht_(cm)"), answer("HT-CM") ), question("What is the new name of the column that contains date of symptom onset?", allow_retry = T, answer("onset date"), answer("date-of-onset"), answer("onset_date", correct = T), answer("onset") ) )
That was easy! You have begun your cleaning pipe chain, but there is more cleaning to do...
We still need to make some changes to column names manually:
onset_date
to date_onset
(it is more standard to put "date" at the front, especially if there are multiple date columns) date_of_report
to date_report
Simplify the location column names for easier reading (and typing!)
adm3_name_res
to district_res
adm3_name_det
to district_det
The primary mechanism for these manual changes is the {dplyr} function rename()
. Here is the basic syntax:
surv <- surv_raw %>% clean_names() %>% rename( NEWname = OLDname, NEWname = OLDname )
A few things to note:
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% clean_names() %>% rename( # NEW NAME # OLD NAME date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det )
Review surv
once you have successfully run your expanded pipe chain. Verify that the desired changes did take place.
Add a new section heading below your cleaning pipe chain section that will contain "Testing" code. (remember you can use the shortcut Ctrl Shift and R to create a section heading as below):
# Testing code -----------------------------------------
You will find that often you want to try out code or commands, that are not (yet) added to your cleaning pipe chain. You can do this here.
This section will hold stand-alone, independent commands that are not part of your cleaning pipe chain. You can think of this as a "playground" to test things before making changes to your pipe chain.
The select()
function is versatile and often used to:
The primary purpose of select()
is to select columns to keep in the dataset.
Any column name that is not included in the parentheses will be dropped. Try typing the below command in your "Testing area" and running it. It takes your partially-clean data set surv
and selects only a few columns, printing (not saving) the result.
surv %>% select(epilink, age, gender)
What did you notice about the new ordering of the columns? select()
also re-orders columns.
It may be taxing to list every column individually. You can use a colon ( : ) to keep the two named columns and all those in between. Modify your test code as below:
# keeps fever, vomit, and the columns in between them surv %>% select(fever:vomit)
You can also list some columns, to move them to the beginning, and finish with everything()
to include all the other columns (note the empty parentheses at the end of everything()
). There are other such "tidyselect" helpers available to you such as contains()
and starts_with()
that you can read more about in this Epi R Handbook chapter.
Modify your test code as below and try running it:
# use everything() surv %>% select(fever, gender, date_report, everything())
In the Testing area, practice using a minus symbol ( - ) to drop the column epilink
. All other columns are retained.
surv %>% select(-epilink)
Now practice dropping multiple columns by wrapping all their names in a vector with c()
- this applies the negative sign to all of them. We learned about the {base} function c()
in the previous module, which stands for "concatenate". Here it is used to create a vector of column names and the - sign is applied to all of them.
# Removes these three columns surv %>% select(-c(epilink, district_det, age)) # an equivalent command to above surv %>% select(-epilink, -district_det, -age)
A more advanced (and efficient) maneuver is to use select()
to rename columns! If you try this, it uses the same syntax as rename()
, but remember that it only keeps the columns that you write!
r fontawesome::fa("terminal", fill = "black")
Now it is time to add a select()
statement to your cleaning pipe chain!
Move away from the "Testing" area and add to your pipe chain a step that removes the column row_num
from the dataset. This column is not necessary for our analyses.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
Your cleaning pipe chain should now look like this:
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num)
At this point, we should check for duplicates in our data. This can be done with the distinct()
function. If run with empty parentheses, it will remove rows that are 100% duplicates (it will keep only the first row, of rows for which the value in every column is the same).
Add the function distinct()
to the pipe chain, and run your pipe chain.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
Your cleaning pipe chain should now look like this:
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct()
Were any rows removed?
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Compare the number of rows in surv
to the number of rows in surv_raw
(in the Environment pane).
distinct()
can be used in more advanced ways, such as to only consider certain columns when evaluation duplicates (e.g. remove rows that have the same age, gender, and first name).
For example, adjusting your command to read: distinct(case_id)
will consider a row as a "duplicate" if the case_id
values are identical (values in ALL columns no longer have to be identical to be considered a "duplicate").
Notice however, that the default behavior if you run distinct(case_id)
is that all the other columns were dropped from the dataset! We probably would not want that. You can add the argument .keep_all = TRUE
as below, and all those columns will be kept.
distinct(case_id, .keep_all = TRUE)
You can read more about the distinct()
function its documentation, and in the Epi R Handbook cleaning chapter.
In fact, there is an entire chapter of the Epi R Handbook dedicated to de-duplication that we encourage you to read if you do a lot of record deduplication.
Before you move on, please revert your de-duplication command to read as only distinct()
with nothing in the parentheses.
r fontawesome::fa("exclamation", fill = "red")
Have you saved your script lately?
class()
functionEach column in the data frame has a specific "class". Classes reflect how R understands the values in the column.
You reviewed the classes of each column with the skim()
function from the {skimr} package. You can also view the class of a single column using this syntax (run this in your "Testing" section):
# print the class of a column class(surv$gender)
How does this class()
command compare to a pipe chain?
This is a "stand-alone" command - not a part of your pipe chain. Because it is an independent command, we use the $
operator to link the column name to its data frame: surv$gender
. This linkage is required because, as you know, R can handle multiple datasets at one time.
If we did not specify the data frame, R would return an error: "Error: object 'gender' not found". In a pipe chain the $
is not necessary, because the data frame is being passed from one function to the next with the pipe operator.
What are possible column classes?
Common column classes include:
An early part of the data cleaning process is ensuring that R correctly understands the class for each column. R is generally good about guessing this, but sometimes it needs your help.
We recognized that the column date_onset
(previously named "onset date" in the raw dataset) is recognized by R as class "character". Often, dates in Excel worksheets arrive in this class, or in numeric (e.g. 48256), or in an even stranger class called "POSIXct". Either way, we want to designate this column as a normal and dependable class "date".
First, we need to understand how the date is currently written in the data before the function is applied.
Run the following command in your "Testing" section, to look at the first six values of date_onset
in the data frame surv
:
head(surv$date_onset)
quiz(caption = "Quiz - raw date format", question("Which format is date_onset??", allow_retry = T, answer("YYYY-MM-DD"), answer("DD/MM/YYYY"), answer("MM/DD/YYYY", correct = T) ) )
{lubridate} has several helper functions designed to convert character objects to dates in a more intuitive and lenient way. These three functions correspond to the format of the date prior to use of the function. The three functions are ymd()
, mdy()
, and dmy()
. Their names correspond to the approximate format they expect the date be in before this function is applied.
Thankfully, they are adaptable in that they allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January):
The ymd()
function converts raw character values to date class that are written as: year, month, day.
# ymd() should be used to convert raw values written as: year, month, day ymd("2020-10-11") ymd("20201011")
The mdy()
does the same but for raw character values written as: month, day, year.
# mdy() should be used to convert raw values written as: month, day, year mdy("Oct 11 20") mdy("10/11/20")
and the dmy()
should be used for raw character values written as: day, month , year.
# mdy() should be used to convert raw values written as: month, day, year dmy("11 October 2020") dmy("11/10/2020")
In ALL of the above commands, the outcome is values in date class that display as YYYY-MM-DD.
quiz(caption = "Quiz - date format", question("Which function should be used to convert the column date_onset?", allow_retry = T, answer("ymd()"), answer("dmy()"), answer("mdy()", correct = T) ) )
The most important thing to remember is to use the function that matches the current (raw) date format
mutate()
Changing a column is done with mutate()
. This is a general function for changing a column - that could include:
1) Creating a new column
2) Changing or re-defining an existing column
column name = a function to use to create/modify the column
Thus, the command to add to our pipe chain to modify the column date_onset
to be converted to class "date" would look like this:
mutate(date_onset = mdy(date_onset))
Above, the code is in effect saying: "change (mutate) the date_onset
column, such that it becomes the date_onset
column as modified by the function mdy()
."
Now add two mutate()
commands to the pipe chain, that convert date_onset
and date_report
to class "date".
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report))
Now we can inspect the date column, and use it in plotting functions like a continuous variable. Previously, as class character, a plot like the one created below would not have been possible.
We give you this code below not to learn plotting, but simply to test it in RStudio. You will learn how to make plots with {ggplot2} later in the course.
Write it in your "Testing" section and run it.
# plot a histogram of dates using the cleaned dataset and date column ggplot(data = surv, mapping = aes(x = date_onset))+ geom_histogram()
quiz(caption = "Quiz - Date", question("In the most recent weeks, what is the apparent trend in the number of reported cases?", allow_retry = T, answer("Increasing"), answer("Decreasing", correct = T) ) )
There is another column that we should check the class of: age
. What is the class of this column? (write this command in your "Testing" section)
class(surv$age)
The command above should have returned "integer". While this is OK for the moment... we may want to perform calculations with this column that would result in it needing to accept decimal values. As class "integer", this column cannot accept decimals, but as class "numeric" it can.
Thankfully in {base} R there are functions like as.numeric()
, as.character()
, and as.integer()
to easily convert columns from one class to another. Simply put the column name within the parentheses, and it is converted.
In your pipe chain, add a mutate()
command that converts this column using the as.numeric()
function.
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
In the mutate()
function, re-define age
as equal to itself as wrapped within the as.numeric()
function.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report)) %>% # convert age to numeric class mutate(age = as.numeric(age))
r fontawesome::fa("exclamation", fill = "red")
Have you saved your script lately?
There are a variety of function that can be used to perform simple recoding of values. We will show you some in the sections below, as you add them to your pipe chain.
As identified earlier, in "raw" data there is a value "Unknown" in the column gender
.
Run the code below (in your "Testing" section), which creates a counts table of the unique values in the column gender
in the "raw" data.
surv_raw %>% tabyl(gender)
We want to tell R that if the value in gender
is "Unknown" then R should change the value to NA
. We may change this later, but for now, let's convert it to NA
.
Use mutate()
on the column gender
and set it equal to the function na_if()
. In the parentheses of na_if()
, again write gender
, then a comma, and then "Unknown" in quote marks: mutate(gender = na_if(gender, "Unknown"))
.
The function na_if()
replace a specified value with NA
. This code is saying: "Change gender
, such that its values are equivalent to the column gender
as changed by the function na_if()
so that any instances of "Unknown" are replaced by NA
".
Add this command to your pipe chain - it will not run as a standalone command by itself.
# This command would be added to a pipe chain mutate(gender = na_if(gender, "Unknown"))
Note: it is also useful to know the opposite function, replace_na()
, which replaces NA
with a specified value.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report)) %>% # convert age to numeric class mutate(age = as.numeric(age)) %>% # convert "Unknown" gender to NA mutate(gender = na_if(gender, "Unknown"))
Now re-run the tabyl()
command on column gender
, using the clean dataset surv
, and see how the missing values are represented as NA
.
We need to apply this change for all the columns in the dataset that are class "character" (gender, hospital, gender, wt_kg, ht_cm, fever, chills, cough...). It is a long list...
As mentioned earlier, when R imports "raw" files from CSV or XLSX sometimes it records missing values in "character" columns as an empty character ""
, instead of as the special symbol NA
.
Run the code below (in your "Testing" section), which creates a counts table of the unique values in the column fever
.
surv %>% tabyl(fever)
We need to tell R that if the value in fever
is empty "" then R should change the value to NA
. Don't spend the time to look, but there are many columns in the dataset where missing is recorded as "". How can we make all these changes efficiently?
We are going to give you a command that uses across()
within the mutate()
to change multiple columns at once. This is an intermediate R command, so please do not spend time trying to understand it. Later, you can read more about across()
in the Epi R Handbook here.
Add this command to your pipe chain.
mutate(across(.cols = where(is.character), .fns = na_if, ""))
Briefly, this command is saying: Mutate all the columns that are character class, and use the function na_if()
to convert any instances of "" to NA
. It should change the columns fever
, vomit
, hospital
, district
, and any other character columns.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report)) %>% # convert age to numeric class mutate(age = as.numeric(age)) %>% # convert "Unknown" gender to NA mutate(gender = na_if(gender, "Unknown")) %>% # properly record missing values in many character columns mutate(across(.cols = where(is.character), .fns = na_if, ""))
While we are discussing missing values, let's introduce you to the core R functions used to assess missingness.
We have two date columns in the data frame: date_onset
and date_report
. The first is the date of symptom onset, as recorded by the surveillance team. The second is the date the case was reported to the surveillance team.
Which one should we be primarily using?
There are many factors that go into a decision like this, such as the purpose of the analysis, the audience, etc. One important factor is data quality and missingness.
Missingness in R is represented by NA
- or at least it should be - as we discussed above. The function to assess missingness is is.na()
(whereas !is.na()
is used to test for non-missingness - the exclamation point is a negation operator). In both of these functions, the object you want to assess in placed within the parentheses.
r fontawesome::fa("bookmark", fill = "orange")
Please remember the below commands related to missingness in R.
is.na()
is used to assess whether a value is NA
!is.na()
is used to assess whether a value is NOT NA
To count the number of rows that meet a certain logical criteria, you can further wrap them in sum()
.
Run these two commands below (as stand-alone commands in the "Testing" section)
sum(is.na(surv$date_onset)) # number of rows where is.na(surv$date_onset) evaluates to TRUE sum(is.na(surv$date_report)) # number of rows where is.na(surv$date_report) evaluates to TRUE
quiz(caption = "Quiz - Date", question("Which column has more missing data?", allow_retry = T, answer("date_onset", correct = T), answer("date_report") ), question("Which percent of 'date_onset' is missing?", allow_retry = T, answer("16.4%"), answer("10.2%"), answer("1%"), answer("5.2%", correct = T) ) )
Our pipe chain is getting quite long! Each step is important, and thankfully it is all well-documented in our code. This transparency is useful for ourselves, anyone reviewing our code later, and for scientific integrity!
Now we turn to the column hospital
. Try this code in your "Testing" section. It will print a tabulation of all the values in this column:
surv %>% tabyl(hospital)
It seems there is some cleaning to be done! Note the different spellings of "hospital" and "hopital", of "Military" and "Mitilary", and of "Port" and "Port Hospital". These need to be syncronized.
An easy option is recode()
. The syntax is relatively simple when we want to simply exchange one value for another:
surv %>% mutate(hospital = recode(hospital, "OLD value" = "NEW value", "OLD value" = "NEW value", "OLD value" = "NEW value"))
The above code can be read as: "Re-define hospital
as hospital
with these recodes applied". Note that this uses the opposite syntax order from rename()
, which was NEW = OLD.
Using the above code as a template, try to write a command at the end of your pipe chain that aligns all the hospital names to the following:
We will leave NA
hospitals as they are.
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Begin as before by piping to a mutate()
. Within the mutate()
parentheses, write the name of the column to re-define hospital
and an equals sign. On the right side of the equals sign, you write the function recode()
. Within the parentheses of recode()
, the first argument is again the name of the column to re-define.
After a comma, continue with one line per change, with the old value on the left, and the new value on the right. As these are character values, put them in quotes. Don't forget commas between each line, and to close with the correct number of parentheses.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report)) %>% # convert age to numeric class mutate(age = as.numeric(age)) %>% # convert "Unknown" gender to NA mutate(gender = na_if(gender, "Unknown")) %>% # properly record missing values in many character columns mutate(across(.cols = where(is.character), .fns = na_if, "")) %>% # re-code hospital column mutate(hospital = recode(hospital, # for reference: OLD = NEW "Mitilary Hospital" = "Military Hospital", "Port" = "Port Hospital", "Port Hopital" = "Port Hospital", "St. Mark's Maternity Hospital (SMMH)" = "SMMH"))
Can you add a recode()
statement to your pipe chain that converts the values in the column gender
from "m" to "male" and from "f" to "female"?
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
surv <- surv_raw %>% # automatically clean column names clean_names() %>% # manually clean column names rename( date_onset = onset_date, date_report = date_of_report, district_res = adm3_name_res, district_det = adm3_name_det) %>% # remove unnecessary column select(-row_num) %>% # de-duplicate rows distinct() %>% # convert date_onset to date class mutate(date_onset = mdy(date_onset)) %>% mutate(date_report = mdy(date_report)) %>% # convert age to numeric class mutate(age = as.numeric(age)) %>% # convert "Unknown" gender to NA mutate(gender = na_if(gender, "Unknown")) %>% # properly record missing values in many character columns mutate(across(.cols = where(is.character), .fns = na_if, "")) %>% # re-code hospital column mutate(hospital = recode(hospital, # for reference: OLD = NEW "Mitilary Hospital" = "Military Hospital", "Port" = "Port Hospital", "Port Hopital" = "Port Hospital", "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% # recode gender mutate(gender = recode(gender, "m" = "male", "f" = "female"))
Congratulations! You have finished the first part of the data cleaning in this course!
Save your script, and check in with your instructor to let them know that you finished.
If you have extra time, ensure that your script is aligned with the "1_ebola_after_cleaning_part1.R" file in the "scripts/backups" folder.
If you have extra time, try to install the {tidylog} package by inserting it into your pacman::p_load()
command and re-running it.
Once loaded, this function records every data cleaning step that you run (e.g. filter()
, mutate()
, select()
, etc.) and will print a record of the changes in the console. For example, re-run your cleaning pipe chain and see the impact of each step in the process.
Deduplication is a significant topic that is often more complex than the simple example in this exercise. We have written about more complex deduplication scenarios is this the Epi R Handbook chapter on de-duplication.
Date and times involve many special circumstances. Familiarise yourself with this Epi R Handbook chapter on working with dates.
Of particular note, review the package {parsedate} and its function parse_date()
. This is used to clean those very messy date columns for which the {lubridate} functions mdy()
, dmy()
, and ymd()
will not work because the dates are not in the same general format across all the rows. parse_date()
will evaluate each row individually, intelligently guessing the format of each cell. This takes more time, but is useful if the dates are very messy.
If you have extra time, we encourage you to review the following for more data cleaning tips:
across()
to clean multiple columns at once If you have questions about any of the above subjects, contact your facilitator and if they are not busy helping other students they can help you.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.