In appliedepi/introexercises:

# load packages ----------------------------------------------------------------
library(introexercises) # get data for exercises  
library(learnr)         # create lessons from rmd
library(gradethis)      # evaluate exercises
library(dplyr)          # wrangle data  
library(tidyr)          # pivot  
library(forcats)        # factors
library(stringr)        # strings
library(flair)          # highlight code
library(ggplot2)        # visualise data     
library(lubridate)      # work with dates  
library(epikit)         # for emojis   
library(skimr)          # browse data
library(fontawesome)    # for emojis  
library(janitor)        # clean data
# library(RMariaDB)       # connect to sql database 

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated 
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60, 
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE 
)

# ## event recorder ---------------------------------------------------------------
# ## see for details: 
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
# 
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user     = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname   = 'excersize_log',
#                        host     = "144.126.246.140")
# 
# 
# ## define a function to collect data 
# ## note that tutorial_id is defined in YAML
#     ## you could set the tutorial_version too (by specifying version:) but use package version instead 
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#     
#   ## define a sql query 
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id, 
#                        tutorial_version, 
#                        date_time, 
#                        user_id, 
#                        event, 
#                        section,
#                        label, 
#                        question, 
#                        answer, 
#                        code, 
#                        correct)
#                        VALUES('", tutorial_id,  "', 
#                        '", tutorial_version, "', 
#                        '", format(Sys.time(), "%Y-%M%-%D %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label,  "',
#                        '", paste0('"', data$question, '"'),  "',
#                        '", paste0('"', data$answer,   '"'),  "',
#                        '", paste0('"', data$code,     '"'),  "',
#                        '", data$correct, "')",
#                        sep = '')
# 
#     # Execute the query on the sqldtbase that we connected to above
#     rsInsert <- dbSendQuery(sqldtbase, event_log)
#   
# }
# 
# options(tutorial.event_recorder = recorder_function)

# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)

# data prep --------------------------------------------------------------------
surv_raw <- rio::import(system.file("dat/surveillance_linelist_20141201.csv", package = "introexercises"))

# for quizes
surv <- rio::import(system.file("dat/surveillance_linelist_clean_20141201.rds", package = "introexercises"))

malaria_counts <- rio::import(system.file("dat/malaria_facility_count_data.rds", package = "introexercises"))

datadict <- rio::import(system.file("dat/linelist_datadict.csv", package = "introexercises"))

site_coverage_tidy <- rio::import(system.file("dat/messy_data_examples.xlsx", package = "introexercises"), sheet = "tidy_site_coverage")

Introduction to R for Applied Epidemiology and Public Health

Welcome

Welcome to the course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation and the leading provider of R training, support, and tools to frontline public health practitioners.

knitr::include_graphics("images/logo.png", error = F)

Data cleaning

This exercise focuses on logical recoding of values, filtering rows, creating age categories, and other more complex data cleaning tasks.

Format

This exercise guides you through tasks that you should perform in RStudio on your local computer.

Getting Help

There are several ways to get help:

1) Look for the "helpers" (see below) 2) Ask your live course instructor/facilitator for help
3) Schedule a 1-on-1 call with an instructor for "Course Tutoring" 4) Post a question in Applied Epi Community

Here is what those "helpers" will look like:

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Here you will see a helpful hint!

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

linelist %>% 
  filter(
    age > 25,
    district == "Bolo"
  )

Here is more explanation about why the solution works.

Quiz questions

Answering quiz questions will help you to comprehend the material. The answers are not recorded.

To practice, please answer the following questions:

quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)

question_numeric(
 "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
 answer(10, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(9, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(8, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(7, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(6, message = "Ok, we will get there together", correct = T),
 answer(5, message = "Ok, we will get there together", correct = T),
 answer(4, message = "I like your confidence!", correct = T),
 answer(3, message = "I like your confidence!", correct = T),
 answer(2, message = "I like your confidence!", correct = T),
 answer(1, message = "I like your confidence!", correct = T),
 allow_retry = TRUE,
 correct = "Thanks for sharing. ",
 min = 1,
 max = 10,
 step = 1
)

License

Please email contact@appliedepi.org with questions about the use of these materials.

Learning objectives

In this exercise you will:

Continue to apply a series of {dplyr} functions to clean a messy dataset
Apply advanced logic to recode columns
Finalize your cleaning command
Complete a series of mini-exercises to test your comprehension

Setup

Open the "ebola" RStudio project by double-clicking on the "ebola.Rproj" file in the "ebola" folder.

Open the "ebola_analysis.R" script you had worked on previously. You will continue to expand the cleaning command.

Run all the commands from the top of this script to the bottom, most importantly:

The pacman::p_load() command to load packages
The combined import() and here() commands to import the raw data
The cleaning command, to re-create the surv object (cleaned data) in your Environment

Take a moment to organize and simplify the "Exploratory analysis", "Clean Surveillance Linelist", and "Testing area" sections of your script so they do not cause errors. Ideally, you should be able to run your entire script with no errors.

Now, compare your cleaning command to the one below and make sure your cleaning command does not have any missing steps.

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female"))

Remember you should have only one cleaning command. Any other miscellaneous commands like class() and tabyl() should be in the Testing Area.

Logical recoding

Sometimes, the recoding of values is not as simple as replacing one value for another. Instead, you can use logical statements and operators to implement more complex recoding actions.

Negative values

In the weight column wt_kg, there are some nonsensical negative values to fix.
Open the surv data frame in the viewer (click the data frame name in the Environment pane). Then, sort the dataset by clicking on the wt_kg column. Note the negative values.

Alternatively, in your "Testing area" section, run the {base} function summary() on the column surv$wt_kg to return key summary statistics on this numeric column:

summary(surv$wt_kg)

The negative values must be a data entry error! We should remove these from our clean data by converting them to NA (missing).

Remember, it is better practice to make such edits in your R script, rather than in Excel. This way, because the changes are recorded, are repeatable on any refreshed versions of the dataset, and can be reversed easily.

The logic we want to apply is simple: "If the value in our wt_kg column is negative, convert it to NA, otherwise keep the value the same." Because the logic is simple, we can use the ifelse() function within mutate() to change the wt_kg column.

The syntax for the ifelse() function is as follows:

ifelse(CONDITION APPLIED TO ROW, OUTCOME IF TRUE, OUTCOME IF FALSE)

In this example:

ifelse(value in row of wt_kg column is < 0, replace negative value with NA, otherwise keep the value as is)

Here is the example, written within a mutate() command that could be added to the cleaning command:

# mutate wt_kg to replace negative values with NA, else keep value in wt_kg
mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))

Add this to your cleaning command and verify that it worked correctly.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))

Now that you have re-run your cleaning command and updated your surv data frame, let's re-run the summary() command in the "Testing area" to check the new range of the values:

summary(surv$wt_kg)

Review logical operators

Below, we have provided a summary of useful logical operators in R. There is no need to memorize them (see the Epi R Handbook basics chapter), but they are here for your review.

Relational operators

Relational operators compare values and are often used when defining new variables and subsets of datasets.

+--------------------------+------------+--------------+------------------------------------------+ | Meaning | Operator | Example | Example Result | +==========================+============+==============+==========================================+ | Equal to | == | "A" == "a" | FALSE (because R is case sensitive) | +--------------------------+------------+--------------+------------------------------------------+ | Not equal to | != | 2 != 0 | TRUE | +--------------------------+------------+--------------+------------------------------------------+ | Greater than | > | 4 > 2 | TRUE | +--------------------------+------------+--------------+------------------------------------------+ | Less than | < | 4 < 2 | FALSE | +--------------------------+------------+--------------+------------------------------------------+ | Greater than or equal to | >= | 6 >= 4 | TRUE | +--------------------------+------------+--------------+------------------------------------------+ | Less than or equal to | <= | 6 <= 4 | FALSE | +--------------------------+------------+--------------+------------------------------------------+ | Value is missing | is.na() | is.na(7) | FALSE | +--------------------------+------------+--------------+------------------------------------------+ | Value is not missing | !is.na() | !is.na(7) | TRUE | +--------------------------+------------+--------------+------------------------------------------+

Note that the == (double equals) asks the question to R: "IS the value on the right equal to the value on the left?. The result is either TRUE or FALSE.

age == 12 asks is the value in column age for a given row equal to 12?

In contrast, the = (single equals) is used to assign values, such as for an argument within a function:

max(age, na.rm = TRUE) This sets the na.rm= argument to TRUE.
paste(jurisdictions, collapse = ",") This sets the collapse argument to a comma.

Logical operators

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

+---------------------+-----------------------------------------------------------------------+ | Meaning | Operator | +=====================+=======================================================================+ | AND | & | +---------------------+-----------------------------------------------------------------------+ | OR | | (vertical bar) | +---------------------+-----------------------------------------------------------------------+ | Parentheses | ( ) Used to group criteria together and clarify order of operations | +---------------------+-----------------------------------------------------------------------+

Other simple logic recoding options

Now that you have learned simple logical re-coding, you can use ifelse() for many situations. Here are two other simple logical recoding functions that you should bookmark to read about later:

if_else() is particularly useful if re-coding dates. NOTE this is slightly different from the ifelse() function, which we used above.
replace() can be a shortcut for ifelse() logical recoding

`case_when()` for complex recoding

For more complex recoding, you should use case_when() from {dplyr} (part of the tidyverse). This function allows you to run the equivalent of multiple ifelse() statements - it is R's equivalent to the SQL CASE WHEN statement.

It is just a coincidence that the case_when() function has a name that sounds like it was made for building epidemiological case definitions! It is a general function - not specific to public health. The generic case_when() function format is as follows:

mutate(NEWcolumn = case_when(
    LOGIC STATEMENT 1  ~ VALUE FOR NEWcolumn IF STATEMENT 1 IS TRUE,  
    LOGIC STATEMENT 2  ~ VALUE FOR NEWcolumn IF STATEMENT 2 IS TRUE,  
    LOGIC STATEMENT 3  ~ VALUE FOR NEWcolumn IF STATEMENT 3 IS TRUE)
)

Note the two sides of the logic statement lines are separated by a ~ (tilde) symbol. This symbol can be difficult to find on the keyboard. For some, it is located close to the Enter key. For others, it is located in the bottom left of the keyboard. Take a moment to identify where the ~ is located on your keyboard, and notify an instructor if you can not find it.

Remember you can always use ?case_when() to look up the documentation.

Here is an example of case_when(), used within mutate(), to create a case definition column:

mutate(case_def = case_when(                        # create new column case_def
  lab_confirmed == TRUE             ~ "Confirmed",  # Confirmed case if lab positive
  epilink == "yes" & fever == "yes" ~ "Suspect",    # else, Suspect case if epilink and fever
  TRUE                              ~ "To investigate")) # else, if none of previous are true, assign as "To investigate"

Let us examine what is happening in the above command:

Setup

A mutate() function should be linked by a pipe to previous cleaning commands
To the left of the equals is the name of the column we want to create or adjust: case_def
To the right of the equals is the case_when() function, used to assign the values for this new column

Structure within the case_when()

case_when() commands consist of formulae on separate lines. Each has a Right-Hand Side (RHS) and a Left-Hand Side (LHS), separated by a “tilde” ~
The left side is a logical statement that can evaluate to TRUE or FALSE for a given row in the data
The right side is the value that is assigned if the left side is TRUE for a given row

ATTENTION: All outcome values on the RIGHT-side must all be the same class - either numeric, integer, character, date, logical, etc.

If a patient is lab confirmed they are immediately categorized as a "Confirmed" case in our new case_def column. If lab confirmation is FALSE or missing, then the second formula is checked: if the case has an epidemiological link AND a fever, they are categorized as a "Suspect" case. Finally, any patient / row where the first and second conditions are not met, the value "To investigate" is assigned.

How the statements are evaluated

Statements are evaluated from top to bottom (not all at once), so the order is important.

In this example, every row in the data frame is evaluated for the top logical statement first (lab_confirmed == TRUE). If this statement is TRUE for that row, its value in column case_def is set as "Confirmed". However if lab_confirmed is not TRUE for that row in the data, only then is the second logical statement checked. Thus, for each row in the data frame, the first logical statement which evaluates to TRUE is used.

Write the formulae in an intentional order - from very specific (at the top) to very broad (at the bottom).

If none of the logical statements apply to a row in the data, that row's value is assigned NA by default. You can adjust this "default" value by doing one of the following:

Make a final logical statement simply written as TRUE. In our case definition scenario, any rows where neither logical criteria 1 or 2 were met, the value of "To Investigate" is assigned.
Alternatively, set the argument .default = "To investigate". However this can be confusing to beginners because it uses the = instead of the ~.

Run ?case_when in the Console to learn more about the .default = argument. if you are interested.

Now that we have reviewed the case_when() command, add a mutate() line with the `case_when() to the bottom of your cleaning command and re-run it.

Please type the command and do not just copy/paste.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg)) %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate"))

Now we can tabulate the new column case_def and see the results (write in the Testing section):

tabyl(surv, case_def)

It is also wise to go look at the dataset to confirm that the logic was applied the way that you intended!

Create column `age_years`

Now apply your case_when() knowledge to this next scenario.

Remember that according to the age_unit column, some age values are recorded in "years", others in "months", and some are missing a unit. We need to create a standard age_years column that contains a numeric age in years.

Add another step to your cleaning command to create a new column age_years. Use mutate() and case_when() to assess the age and age_unit columns and perform calculations accordingly.

Below, we give you the outline - it is your responsibility to fill in the logic statements on the left, and the calculations on the right.

  # create age-in-years
  mutate(age_years = case_when(

    # logic for age_unit    # adjustment to make age_years  
    [if months]               ~ [calculation],    # return: age / 12
    [if years]                ~ [calculation],    # return: age 
    [if missing]              ~ [calculation]))   # if unit is missing, assume years

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Within the parentheses of case_when(), on the LEFT sides you should write logical criteria that will be assessed for each row in the dataset, and on the right sides you should write the resulting value to be returned.

The first logical criteria should be: "is age_unit equal to "months"? In code, this is written age_unit == "months" (note the double equals, which asks the question of equivalence). On the right side of the ~, you would write an equation that would produce the correct outcome in the new column age_years. In this case, it would be the value in age divided by 12. In code this would be written as: age_unit == "months" ~ age / 12. Don't forget a comma at the end, and to always continue on the next line with the next logical criteria.

If you happen to encounter this error: 'names' attribute [1] must be the same length as the vector [0], then confirm that you converted the age column to Class numeric earlier in the cleaning command. This error is because by dividing by 12, we are introducing decimals, which is not acceptable if the column is still "integer" class.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age))     # if age_unit missing assume years, else NA

That was difficult! If the above exercise with case_when() was very challenging for you, do not despair! It will come with practice. Talk with an instructor about the errors you encountered.

Age categories

Now that we have a valid age_years column where all ages are recorded in years, let's create a categorical column of age categories.

There are several ways to do this (see this chapter of the Epi R Handbook for details), but here we show the most simple method.

The function age_categories() from {epikit} can be used within mutate() to define a new column age_cat with 10-year age categories.

The arguments for age_categories are:

1) The numeric column with age values to be divided into categories (age_years)
2) lower = A number that is the lower numeric boundary (often we use 0)
3) upper = A number that is the upper numeric boundary for categories (e.g. set this to 70, so that all ages above 70 are put in the category "70+")
4) by = A number for the width of the categories (often we use 10, so for example the youngest category would be "0-9")

Add this line to the bottom of your cleaning command.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age)) %>% # if unit missing assume years, else NA

  # create age category column
  mutate(age_cat = age_categories(         # create new column
    age_years,                             # numeric column to make groups from
    lower = 0,
    upper = 70,
    by = 10))

You can also define age categories other ways. For example, you can provide a vector of age breaks such as c(0, 5, 10, 15, 20, 30, 40, 50) to the breaks = argument. Ask an instructor if you want to know how to do this, or read the function's documentation by entering ?age_categories in the R console.

Other functions that do similar tasks, but less easily, include cut() from {base} R. See the Epi R Handbook's data cleaning chapter.

r fontawesome::fa("exclamation", fill = "red") Have you saved your script lately?

Date differences

Now that both of the date columns are correctly classified as "Date", we can do math on them with simple + and - operators.

Add another step to your cleaning command using mutate() to create a new column named diff. Define this new column to be equal to the number of days between date_onset and date_report for each case.

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Within the mutate, put the name of the new column diff, then an equals sign, then use subtraction ( - ) to find the difference between the two dates, in days.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age)) %>% # if unit missing assume years, else NA

  # create age category column
  mutate(age_cat = age_categories(         # create new column
    age_years,                             # numeric column to make groups from
    lower = 0,
    upper = 70,
    by = 10)) %>% 

  # Make date-difference column  
  mutate(diff = date_report - date_onset)

Open the surv dataset and examine a few rows manually. Note the difference between date_report and date_onset, as compared to the value in the new diff column. Is it how you expect?

The diff column's values appear like "4 days" - this is a special class called "difftime". If you want the result to be purely numeric, you can wrap the result of the calculation in the {base} function as.numeric().

DO NOT DO THIS NOW, but just for reference, adding as.numeric() to the command above would look like this:

mutate(diff = as.numeric(date_report - date_onset))

Currently, a case with date_onset and date_report on 9 November 2014 is recorded with diff of 0 days. Do not change this, but note that in your work context, you may choose to add a + 1 so that same-day admission is recorded as a value of 1.

Location

We have two columns that express district / location of cases - district_res is the district of residence, and district_det is where they were detected.

How similar are these two columns?

One way to assess how similar the columns are, is to create a new column that indicates whether the two columns' values in a given row are different. Then, you can tabulate this new column.

The syntax is actually quite simple using mutate() and the != operator (which means IS NOT EQUAL TO).

Add a new step to the cleaning command that compares the two district columns, and re-run your cleaning command.

# New column will be TRUE if the two values are different
mutate(moved = district_res != district_det)

Now assess the change - open the surv dataset. Click the filter button on the upper-left to limit the view to rows in which the new column moved is TRUE. Was the logic for the new column applied correctly? Are TRUE and FALSE appearing where they should be?

quiz(caption = "Quiz - Moving cases",
  question_numeric(
    "How many cases have different districts of residence and of detection?",
    answer(surv_raw %>%
     distinct() %>%
     mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 
     mutate(moved = adm3_name_res != adm3_name_det) %>% 
     filter(moved) %>%
     nrow(),
           correct = T),
    allow_retry = TRUE,
    correct = "Correct, nice work.",
    min = 1,
    max = 700,
    step = 1
  )
)

You can use R code to find the answer as well, by using the tabyl() function on the new column. Try this out in your "Testing area section" of your script.

tabyl(surv, moved)

Prioritize certain values with `coalesce()`

Now that you know there is some discrepancy, which of these values should be prioritized? District of residence, or district of detection?

This is a question that each outbreak epidemiology team must answer given their knowledge of the data collection process, and of the outbreak itself. For this exercise, we will prioritize location of detection, because it might better reflect the nearby populations who are at higher risk for infection (which can be best understood by good case investigation).

Create a new column in the data simply named: district.

Within mutate(), use the coalesce() function, which will take its first value if it is available, else it will take its second value. This function allows us to "fill-in" missing values by using the first non-missing value (given a preference order), or to prioritize one value over another if they differ.

Add the following step to your cleaning command, and re-run it. Make sure to rerun your cleaning command so your surv object is updated with an additional district column.

# Add this command to the pipechain
mutate(district = coalesce(district_det, district_res))

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age)) %>% # if unit missing assume years, else NA

  # create age category column
  mutate(age_cat = age_categories(         # create new column
    age_years,                             # numeric column to make groups from
    lower = 0,
    upper = 70,
    by = 10)) %>% 

  # Make date-difference column  
  mutate(diff = date_report - date_onset) %>% 

  # create column marking TRUE if district of residence and detection differ
  mutate(moved = district_res != district_det) %>% 

  # create new column that prioritizes district of detection
  mutate(district = coalesce(district_det, district_res))

View the dataset again and verify with your eye that this code worked as expected

Add the following command to your "Testing area section" and answer the quiz question below.

# cross tabulate "moved" column with "district_res" column
tabyl(surv, moved, district_res)

quiz(caption = "Quiz - Moving cases origin",
  question_numeric(
    "Of cases with different residence and detection locations, how many had residence in 'Mountain Rural' district?",
    answer(surv_raw %>%
             distinct() %>%
             mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 
             mutate(moved = adm3_name_res != adm3_name_det) %>% 
             filter(moved & adm3_name_res == "Mountain Rural") %>%
             nrow(),
           correct = T),
    allow_retry = TRUE,
    correct = "Correct, nice work.",
    min = 1,
    max = 700,
    step = 1
  )
)

r fontawesome::fa("exclamation", fill = "red") Have you saved your script lately?

Filter rows

Sub-setting rows is done with the function filter(). This function is very useful! Inside the parentheses of the filter() function, you write logical criteria that are applied to each row. Rows that meet the criteria are kept.

Briefly review again the relational and logical operators in R. These can always be found in the Epi R Handbook R Basics page.

Relational operators

Logical operators

Note that the == (double equals) asks the question to R: "IS the value on the right equal to the value on the left?. The result is either TRUE or FALSE.

age == 12 Asks is the value in column age for a given row equal to 12?

In contrast, the = (single equals) is used to assign values, such as for an argument within a function:

max(age, na.rm = TRUE) This sets the na.rm= argument to TRUE
paste(jurisdictions, collapse = ",") This sets the collapse argument to a comma

`%in%`

The %in% operator is a very useful operator for matching values, and for quickly assessing if a value is within a given vector or data frame. For example, given a simple vector, we can assess using %in% whether a value (on the left) is in the vector (on the right). See the example below:

my_countries <- c("United States", "Vietnam", "Mongolia", "Peru")

"Peru" %in% my_countries
"Tanzania" %in% my_countries

To ask if a value is not %in% a vector, put an exclamation mark (!) in front of the logic statement:

# to negate, put an exclamation in front
!"Peru" %in% my_countries
!"Tanzania" %in% my_countries

This also works with columns, because columns are "vectors".

Remove cases defined as "To investigate"

For the purposes of this exercise, we will keep only "Confirmed" and "Suspect" cases in the data, and exclude any "To Investigate" rows.

The most simple way of using filter() is to apply a logical condition (e.g. ==, !=, >, <) so that only rows where the logical statement is TRUE are kept.

For example only (do NOT add this to your code), this is a filter() statement that keeps only the rows where sex is equal to "female":

filter(sex == "female")

The %in% operator can be used to test whether a row contains one of the values included within a vector of values.

DO NOT ADD THIS TO YOUR COMMAND

filter(sex %in% c("male", "female"))

Now, add another step to your cleaning command using filter(), %in%, and c() to keep only rows where case_def is either "Confirmed" or "Suspect". Then re-run your cleaning command to see the changes.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age)) %>% # if unit missing assume years, else NA

  # create age category column
  mutate(age_cat = age_categories(         # create new column
    age_years,                             # numeric column to make groups from
    lower = 0,
    upper = 70,
    by = 10)) %>% 

  # Make date-difference column  
  mutate(diff = date_report - date_onset) %>% 

  # create column marking TRUE if district of residence and detection differ
  mutate(moved = district_res != district_det) %>% 

  # create new column that prioritizes district of detection
  mutate(district = coalesce(district_det, district_res)) %>% 

  # remove suspect cases
  filter(case_def %in% c("Confirmed", "Suspect"))

Filters and missing values

r fontawesome::fa("exclamation", fill = "red") It is important to note that filter(case_def %in% c("Confirmed", "Suspect")) will remove rows in which case_def is missing (NA).

Likewise, filtering based on a numeric or date criteria may also remove rows that are missing those values. For example:

filter(age > 12) will keep only rows in which age is greater than 12. Rows with missing age will be removed.
filter(date_onset > ymd("2014-11-25") will keep only rows in which date_onset is after 25 November 2014. Rows with missing date of onset will be removed.

To keep rows with missing values, add an "OR" bar and then the is.na() function to the filter:

filter(age > 12 | is.na(age))
filter(date_onset > ymd("2014-11-25") | is.na(date_onset))

The statements above allow rows to remain that meet the first criteria, OR that have missing values.

Now, review your cleaning command and the solution script above. Ensure that all the cleaning steps in the solution are also present in your R script "ebola_analysis.R".

Re-arrange columns

Click on your data frame surv in the Environment pane to review the order of the columns.

quiz(caption = "Quiz - Re-arrange columns",
  question("Where did the new columns appear, which you created during this exercise?",
    allow_retry = TRUE,
    answer("On the left-most side of the data frame"),
    answer("On the right-most side of the data frame", correct = TRUE),
    answer("In the middle of the data frame")
  )
)

We may prefer to save or export the dataset with some columns placed next to each other - for example age and age_years.

If you remember from the data cleaning demo on select(), you can use the select() function to re-arrange columns as well as to select certain columns.

For example, you can list some columns to move them to the beginning, and finish the select() statement by using the tidyselect helper function everything(), which includes all the other columns. There are other "tidyselect" helpers available to you such as contains() and starts_with(), which you can write within the select() command. Read more about these helpers in this Epi R Handbook chapter.

Here is an example (do NOT add this example to your pipe):

select(case_id, contains("Date"), everything())

Add a select() function to the end of your cleaning command that re-arranges the columns into the following order:

case_id first, followed by any column that starts with "date", followed by diff, sex, age, age_unit, age_years, age_cat, hospital, district, district_res, district_det, moved, all other columns

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv <- surv_raw %>% 

  # automatically clean column names
  clean_names() %>% 

  # manually clean column names   
  rename(
    date_onset = onset_date,
    date_report = date_of_report,
    district_res = adm3_name_res,
    district_det = adm3_name_det) %>%

  # remove unnecessary column
  select(-row_num) %>% 

  # de-duplicate rows  
  distinct() %>% 

  # convert date_onset to date class
  mutate(date_onset = mdy(date_onset)) %>% 
  mutate(date_report = mdy(date_report)) %>% 

  # convert age to numeric class
  mutate(age = as.numeric(age)) %>% 

  # convert "Unknown" sex to NA
  mutate(sex = na_if(sex, "Unknown")) %>% 

  # properly record missing values in many character columns
  mutate(across(.cols = where(is.character), .fns = ~na_if(.x, ""))) %>% 

  # re-code hospital column
  mutate(hospital = recode(hospital,
    # for reference: OLD = NEW
    "Mitilary Hospital"  = "Military Hospital",
    "Port"               = "Port Hospital",
    "Port Hopital"       = "Port Hospital",
    "St. Mark's Maternity Hospital (SMMH)" = "SMMH")) %>% 

  # recode sex
  mutate(sex = recode(sex,
    "m" = "male",
    "f" = "female")) %>% 

  # convert negative weight values to NA
  mutate(wt_kg = ifelse(wt_kg < 0, NA, wt_kg))  %>% 

  # create case definition
  mutate(case_def = case_when(
    lab_confirmed == TRUE             ~ "Confirmed",
    epilink == "yes" & fever == "yes" ~ "Suspect",
    TRUE                              ~ "To investigate")) %>% 

  # create age-in-years
  mutate(age_years = case_when(
    age_unit == "months" ~ age/12,   # if age is given in months
    age_unit == "years"  ~ age,      # if age is given in years
    is.na(age_unit)      ~ age)) %>% # if unit missing assume years, else NA

  # create age category column
  mutate(age_cat = age_categories(         # create new column
    age_years,                             # numeric column to make groups from
    lower = 0,
    upper = 70,
    by = 10)) %>% 

  # Make date-difference column  
  mutate(diff = date_report - date_onset) %>% 

  # create column marking TRUE if district of residence and detection differ
  mutate(moved = district_res != district_det) %>% 

  # create new column that prioritizes district of detection
  mutate(district = coalesce(district_det, district_res)) %>% 

  # remove suspect cases
  filter(case_def %in% c("Confirmed", "Suspect")) %>% 

  # re-arrange columns
  select(case_id, starts_with("date"), diff, sex, age, age_unit, age_years, age_cat, hospital, district, district_res, district_det, moved, everything())

Remember to re-run your cleaning command so your surv data frame is updated with this final step.

Export clean dataset

Now that we have cleaned this dataset, we should save it in RStudio project for later use, and perhaps to share with other people.

We can use the export() function from the {rio} package - this is the same package that we used to import() the raw data.

Exporting is a stand-alone command - not part of your cleaning command. The arguments that it expects are:

1) The R object (data frame) that should be exported
2) The desired name for the file, in quotation marks, with extension (e.g. "surveillance_linelist_clean_20141201.csv")

The command below is stand-alone (not part of your cleaning command). This means there is no dataset piped to it, so you have to write surv as the first argument.

As written below, surv will be saved as a CSV file to the root folder of your RStudio project.

Run the command and verify that it was saved in the intended location.

# Export cleaned file to the root folder of the R project  
export(surv, "surveillance_linelist_clean_20141201.csv")

Saving as .rds

If you save the file as .xlsx or .csv, any person using R to further analyze the data will have to re-clean it by again converting the columns to their correct classes.

To avoid this, save the clean dataset as an ".rds" file. RDS is a file format specific to R, and is very useful if you will work in R with the exported data again.

For example, if you work in an epidemiology team and need to send data files to a GIS team for map production, and they use R as well, just send them the .rds file! Then all the column classes are retained and they have less cleaning work to do.

In the example below, surv is saved as a .rds file. We also specify that we want it to be saved in a subfolder ("data/clean") by writing the file path using here() in the same way that we do to import the data.

Try this one too!

# Export cleaned file  
export(surv, here("data", "clean", "surveillance_linelist_clean_20141201.rds"))

If you want to export the data as CSV or Excel, simply change the file extension in the command to .csv or .xlsx.

End

Congratulations! You have completed the cleaning exercise! This is a huge step in your introduction to R. You have now been introduced to most of the important data management functions in R.

Be sure to save your R script, and check-in with your facilitator.

In case you did not finish, do not worry. Look in the "ebola/scripts/backup" folder. You will find R scripts for the Ebola analysis at every stage of this course. So you can always us these as a "backup" for later modules.*

Click on to see exciting extra material...

Extra - Tidy data

In these sections, we will review concepts of "tidy data" and some common pitfalls that prevent easy analysis of Excel spreadsheet data in R.

knitr::include_graphics("images/tidy_shelves.png")

How you structure your dataset will greatly impact how complicated your R code needs to be. Advance attention during your data entry phase will save you a lot of time during data cleaning and analysis!

Reflect: What are some features of your data that make it particularly difficult to clean for analysis?

Definitions

Tidy data has a specific meaning in data analysis - it refers to how information has been stored within the structure of the dataset. So let's begin with some terminology.

Structurally, datasets (called "data frames" in R) consist of cells, columns, rows.

However, "values", "variables" and "observations" are more abstract concepts.

Variables measure one underlying attribute (age, outcome, or date of onset)
Observations relate to a unit of analysis
Values are one piece of information associated with one variable and one observation

Let's test your understanding of these terms.

quiz(
  question("50 kg, 75 kg, 67 kg, 90 kg are...",
    answer("Different observations of the variable `temperature`", message = "Think again! It does not make much sense that `temperature` would be measured in kilograms! It is also unlikely that these values would be referred to as observations..."),
    answer("Different variables of the observation `weight`", message = "Think again! These values all have the same unit, so they would probably be one variable rather than different ones. `weight` as an observation might also not be the most logical way of recording this data..."),
    answer("Different values of the variable `weight`", correct = TRUE, message = "Indeed, these are correct values (in this case we are dealing with numbers, so quantitative data). As the different values have the same unit, they can be grouped under one variable (in this case, weight would make sense)."),
    answer("None of these answers are correct", message = "Are you sure? Try and think of what these different terms refer to!"),
    answer("All of these answers are correct", message = "Are you sure? Try and think of what these different terms refer to!"),
    allow_retry = TRUE
  ),
  question("`Name` would most likely be...",
    answer("A value", message = "If this was a value, what would be the variable and observation it could be grouped under?"),
    answer("A variable", correct = TRUE, message = "Indeed, different values could include 'Josephine', 'Alex' or 'Billy'."),
    answer("An observation", message = "If this was an observation, what is the unit of measure? How would you get other observations?"),
    allow_retry = TRUE
  ),
  question("Thinking of data structure, how would an observation best be represented? ",
    answer("As a row", correct = TRUE, message = "That's right! More on this below..."),
    answer("As a column", message = "Think again! Is this the most efficient way of representing your data?"),
    answer("As a cell", message = "Think again! Is this the most efficient way of representing your data?"),
    allow_retry = TRUE
  )
)

Principles of tidy data

Your data can be stored in many ways - why is it important to have data that is "tidy"?

Consistency in data structure, consistency in tools, and efficient analyses
Placing variables in columns allows may R function to easily handle your data (e.g. ggplot() for plotting) - the packages from the {tidyverse} are designed to work with tidy data

There are three principles that make a dataset “tidy”:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

Source: R for Data Science

Ideally, they align: columns = variables and rows = observations

knitr::include_graphics("images/tidy_image.png")

But this is not always the case... especially in public health data. You have surely seen data entered in this "wide" format:

long_example <- tribble(
     ~country, ~January, ~February, ~March, 
     "Mozambique", 3200,    3300,      4100,  
     "Lesotho", 500,     750,       900,   
     "South Africa", 5100, 6200, 8100,)

long_example

quiz(
  question("'January' is a column, but is it a variable?",
    answer("Yes", message = "No, remember a variable is the more abstract, underlying attribute being measured."),
    answer("No", correct = TRUE, message = "January is one value of the variable 'month'"),
    allow_retry = TRUE
  ),
  question("Where is the variable 'month'?",
    answer("In the third column", message = "No, it is not in one column"),
    answer("In a different spreadsheet", message = "You are being silly."),
    answer("It is spread across many columns", correct = TRUE, message = "Correct, it does not have its own column."),
    allow_retry = TRUE
  ),
  question("Does each observation have its own row?",
    answer("Yes", message = "No, each value is an observation and there are three in each row"),
    answer("No", correct = TRUE, message = "Correct, there are three observations in each row."),
    allow_retry = TRUE
  )
)

Here is the same data, but in a "tidy" format:

long_example_pivoted <- long_example %>%
  pivot_longer(cols=2:4, names_to="month", values_to = "cases") 

long_example_pivoted

Reflecting on tidy data practices, what proportion of the datasets you use are tidy?

"Machine-readable" data

When you first begin to collect data, you must ask the question: is the primary audience of this dataset humans or machines?

Recording information such that it is optimized for for "human-readability" can be very different than optimization for "machine-readability" and analysis. Be clear from the beginning what your priority is. Please allow us to give some general advice - it is generally easier to transition from machine-readable to human-readable than the other way!

Excel spreadsheets

At Applied Epi, we promote the use of R for many reasons, but we know that for most applied epidemiologists, Excel is tool that is fundamental to their workflows. There is nothing wrong with using Excel. Workflows that involve R almost always also involve Excel. But it is important that you use Excel in a manner that allows you to also maximize the benefits of R.

knitr::include_graphics("images/interoperability.png")

Excel is a powerful and beginner-friendly software. While it is also possible to do some analysis in Excel, you will likely find that more sophisticated analyses and data management operations may be very complicated or impossible to carry out in this software. These situations are where versatile programs such as R become very useful.

knitr::include_graphics("images/difficulty_plot-1.png")

In this course, you will become more familiar with R code syntax. In this section, we will focus on steps that you can take to ensure that an Excel dataset can be easily interpreted by R for analysis.

The main reason one encounters problems analyzing data from Excel spreadsheets is when the spreadsheet was designed to prioritize easy reading by humans, not easy reading by machines/software.

To help you see the difference, below are some fictional examples of spreadsheets that prioritize human-readability over machine-readability.

Importing data from spreadsheets

Write a command to import the spreadsheet "messy_data_examples.xlsx" and store it as the object partner_tracking. The spreadsheet is saved in the folder "ebola/data/raw".

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

partner_tracking <- import(here("data", "raw", "messy_data_examples.xlsx"))

Now, open the spreadsheet using Microsoft Excel. If you can do this, note how there are actually 4 sheets in this Excel workbook. Using the import() function defaults has only imported the first sheet!

Revise your importing command to use the sheet = argument to specify the sheet name to import. Write and run one command for each of the sheets, as shown below:

# import sheet that tracks partners with color codes
partner_tracking <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_colors")

# import sheet with messy format site coverage
site_coverage <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_site_coverage")

# import sheet with "tidy" formatted site coverage data
site_coverage_tidy <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "tidy_site_coverage")

# import sheet with messy GIS data
messy_gis <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_gis")

Now that you have imported these four sheets, let us review them.

Colors

The sheet that was imported and named parter_tracking records the status of a public health intervention by various response partners in several Provinces, Districts, and sub-Districts. On the right side, the status for particular dates is indicated by the cell color, and there is a small color-based dictionary in the upper-right of the sheet.

knitr::include_graphics("images/messydata2.PNG")

We must ask ourselves - is this sheet written to maximize human readability, or machine readability? A prime example of prioritizing human-readability over machine-readability is the use of color-based coding of cells in a spreadsheet.

It is quite difficult in R to interpret the color of each cell in an Excel spreadsheet. If you use color, you should not only use color - also make the values reflect the differences between cells.

Storing information like in this sheet is not easily interpreted by R - nor by humans with color-blindness!

Furthermore, different pieces of information are combined into one cell (multiple partner organizations working in one area, or the status “TBC” in the same cell as “Partner D”).

question("In this case, what do you think would be the best modification before loading the dataset to R?",
  answer("Keep as is, R will recognise the colours.",
         message = "It is very difficult for R to interpret the colors."),
  answer("Keep as is, but add text to the colours.",
         message = "This would help as R cannot read colours. However, this data format would still not be well read in R and the colour dictionary at the right would cause issues."),
  answer("Keep as is, but move the colour dictionary to another sheet.",
         message = "Moving the colour dictionary to another sheet would be helpful and could be loaded separately as a lookup, however R cannot read colours so you would still face issues when loading the main table."),
  answer("Change the dataset so that each column is a variable and each cell contains one unique value, integrating the colours as a new variable.", correct = TRUE,
         message = "Indeed, transforming this data to a tidy format is absolutely necessary if any analysis will be performed on this data. Removing the necessity for colours by adding a variable to specify what these represent would be the way to go."),
  answer("Change the dataset by adding a column to specify what the colour means.",
         message = "Whilst this is generally the way to go to transform a dataset using a colour dictionary before uploading it to R, this dataset contains other issues making it untidy and hard to analyse in the current format."),
  allow_retry = TRUE
)

While color-based dictionaries may be helpful for human-readability of a dataset, colors should never be used as the only way of recording data. Machines will not be able to interpret it for analysis.

Again, it is important to think of how best to represent that variable that is currently being represented by color - most likely you should restructure your data and represent it in its own column.

And as a general rule, dictionaries (whether they are colour dictionaries as shown here or data dictionaries) should be kept separately from the main table. In Excel, it would be best practice to have this dictionary on another sheet. When importing it to R, you can then import it as a separate data frame.

But we're getting ahead of ourselves... more on data dictionaries later!

Merged cells

The sheet that is now imported as site_coverage contains information about "coverage" at a number of sites in May and June 2022:

knitr::include_graphics("images/site_coverage_untidy.jpg")

Note the merged cells. Merged cells are often useful to make data human-readable but can cause problems for analysis by machines.

question("What issues do you think R will encounter when loading this dataset?",
  answer("R will not recognise the merged cells and will drop the value.",
         message = "Indeed, R cannot recognise merged cells. However it will not drop all those values."),
  answer("R will recognise the merged cell and also create a merged cell.",
         message = "Data frames in R cannot contain merged cells."),
  answer("R will not recognise the merged cell so will only keep the value in the first row and column of the merged area.", correct = TRUE),
  answer("R will recognise the merged cells and duplicate these values across the merged area.",
         message = "Unfortunately, most import functions in R will not do this by default."),
  allow_retry = TRUE
)

How did the merged cells import into R using the import() function from {rio}? Click on the data frame to open it in the Viewer pane. Look at row 8 for an example.

knitr::include_graphics("images/site_coverage_imported.jpg")

As you see, importing this dataset into R in the given Excel format leads to data loss in multiple ways:

Most columns do not have the right header
Many observations (i.e. rows) are missing data
The actual values of interest are not all in the right rows

Using the import() function, R will not recognize the merged cell formatting, and all cells except the first merged cell will be read-in as empty! This will lead to loss of data and make analysis difficult.

One solution is to use a different package to import the data. The {openxlsx} R package handles Excel workbooks with more precision than {rio}. Its function read.xlsx() offers an argument fillMergedCells = which can be set to TRUE. For this function, the argument for the sheet name is also sheet =.

sites <- openxlsx::read.xlsx(here("data", "raw", "messy_data_examples.xlsx"),
                             sheet = "messy_site_coverage",
                             fillMergedCells = TRUE)

The argument fillMergedCells = can result in the merged value appearing in all its cells, but the spreadsheet is still very difficult to analyze:

knitr::include_graphics("images/site_coverage_filled.jpg")

Tidy data

Which columns to use is still not clear - it would take many R commands to clean the data in order to produce even a simple tabulation of "Yes" values by site.

How would you enter these data into a spreadsheet in a manner that is "tidy" and machine-readable?

question("If these data were entered in a tidy format, what would be the column headers?",
  answer("Day, Cell, Province, X", message = ""),
  answer("Date, Province, Site, Status", correct=T, message = "Correct, these are the important variables, with Status being either Yes or No."),
  answer("May, June"),
  answer("A, B, Site, Date"),
  allow_retry = TRUE
)

See the sheet that you imported as the object site_coverage_tidy. This contains the same data as in site_coverage, but in a tidy format (also called "long" format). See how:

Every value has its own cell
Every observation has its own row
Every underlying variable has its own column

knitr::include_graphics("images/site_coverage_long.jpg")

The above format is not very easy for humans to read, but it is easily imported and handled by R.

Once in R, it is relatively easy to work with the dataset. We do not expect you to understand the code below, but know that it cleans and expands this tidy data so that every possible date and site are present in the data.

# import the long data 
site_coverage_clean <- site_coverage_tidy %>%    # create complete dataset 
  mutate(Date = ymd(Date)) %>%           # convert dates to proper class in R
  complete(                              # complete all sites and dates
    Date = seq.Date(
      from = min(Date),
      to = max(Date),
      by = "day"),
    Site = seq(1:14),
    fill = list(Status = "No")) %>%      # If not already listed in data, status is "No"
  mutate(Province = as_factor(ifelse(Site %in% 1:7, "A", "B")), # add Province
         Site = as_factor(Site))

Now the dataset has expanded from r nrow(site_coverage_tidy) rows to r nrow(site_coverage_clean) rows - all possible dates and sites - a complete data set! (one row for each cell in the original messy Excel spreadsheet!)

site_coverage_clean

Expanded possibilities

With the data in "tidy" format, the possibilities for transforming and analysing the data are wide open!

For example, we can use the {ggplot2} data visualization R package to create a "heat plot" that resembles the original Excel spreadsheet.

We do not expect you to understand or type this code right now. You will learn more about {ggplot2} in later sessions

# create heat tile plot
ggplot(data = site_coverage_clean,
       mapping = aes(x = Date, y = fct_rev(Site),
                     fill = Status, label = Status))+
  geom_tile(color = "white")+
  geom_text()+
  scale_x_date(
    date_breaks = "day",
    labels = scales::label_date_short(),
    expand = c(0,0))+
  scale_fill_manual(
    values = c(
      "Yes" = "darkgreen",
      "No" = "orange"))+
  theme_minimal(base_size = 16)+
  labs(title = "Site coverage",
       y = "Site")+
  facet_wrap(~Province, ncol = 1, scales = "free_y")

If you don't understand the above code, that is OK - we only want to show you that with a few line of R code you can create an "Excel-like", "human-readable" output that is much more easy to analyze than the original spreadsheet.

In many ways, this setup is more useful than the original Excel spreadsheet:

The data are able to be analyzed, whereas in the Excel they could not be
New rows added to the Excel can be re-imported and the plot updates in seconds
This plot can be exported to a PDF report and emailed to partner organizations

Unlike the Excel, this dataset in R can be analyzed! It takes only a few lines of code to tabulate Status by Province:

site_coverage_clean %>% 
  tabyl(Province, Status)

Or by Date:

site_coverage_clean %>% 
  tabyl(Date, Status) %>% 
  arrange(desc(No))

Or the data can be aggregated into weeks and the number of unfilled spots tabulated:

site_coverage_clean %>% 
  group_by(week_of = floor_date(Date, "week")) %>% 
  summarise(days_coverage_needed = sum(Status == "No"))

Or the data can be used to quickly make other informative plots:

site_coverage_clean %>% 
  filter(Status == "Yes") %>% 
  ggplot(mapping = aes(x = fct_infreq(Site)))+
  geom_bar(fill = "dodgerblue")+
  coord_flip()+
  theme_minimal(base_size = 16)+
  labs(title = "Number of days 'covered', by site",
       x = "Site",
       y = "Number of days with coverage")

Empty space

Examine Excel sheet "messy_gis", which records information about specific clinics, including GPS coordinates the operating partner organization, and bed capacity.

knitr::include_graphics("images/messydata.PNG")

quiz(
  question("What do you foresee may be issues when importing this into R? Tick all that are most likely to apply.",
    answer("I don't think there will be any issues when we upload this to R.", message = "The format will not be the same in R."),
    answer("R will not be able to read this dataset.", message = "R will be able to read this dataset, however it will require a lot of cleaning!"),
    answer("The colours won't appear.", correct = TRUE, message = "The colours won't appear, which may be an issue if the colours represent another variable!"),
    answer("The colours will appear but it is not clear what they represent.", message = "The colours will not appear in R."),
    answer("Some rows are empty.", correct = TRUE, message = "The extra empty rows will require an extra cleaning step."),
    answer("Some row values are empty.", correct = TRUE, message = "R can handle missing values in different cells. The issue with this dataset is that some empty cells imply another value (that from the row above)."),
    answer("Some columns are empty.", correct = TRUE, message = "The empty columns will require an extra cleaning step."),
    answer("Some column values are empty.", message = "R can handle missing values in columns."),
    answer("Spaces in the column names.", correct = TRUE, message = "Column names will be changed from what we see in Excel if they have spaces and as a result and may require an extra cleaning step!"),
    answer("Spaces in the cell values.", message = "R can handle spaces when it comes to cell values. Note that these values will then be recorded as a character (i.e. string) variable."),
    answer("Different recording formats in the same column.", message = "If the same column has different recording formats, R will still be able to read it. It may however save it in the wrong format which will require an extra cleaning step. Recording in different formats in the same column also does not follow the principles of tidy data."),
    allow_retry = TRUE
  )
)

The extra empty rows and columns within the dataset will cause cleaning headaches in R.

question("Which of the empty cells do you think will be most problematic?",
  answer("The first empty row as R will not recognise the column names",
         message = "R will actually recognise the second row as the header row in this case. So whilst this is not ideal, this empty row is not too problematic."),
  answer("The third empty row.",
         message = "The third empty row will be loaded as a row of `NA` values. Whilst not ideal, this can be fairly easily removed."),
  answer("The empty columns.", correct = TRUE,
         message = "These will be the most problematic as you will probably have to manually remove these in R, because one 'real' column (`Patients`) also does not contain values. This means that you will need to take note of every column number that you want to remove and do so manually!"),
  answer("All empty cells.",
         message = "All empty cells are not necessarily issues, as sometimes there is no data available for one particular variable (which may be data itself, or grounds for exclusion from analysis)."),
  allow_retry = TRUE
)

The empty columns in this dataset are likely to be the most problematic out of the options given above, as you will manually have to remove these in R.

The empty rows from row number 4 onwards are also a problem, as they actually imply the value that is given in the cell above, but will be recorded as NA values. This will require significant extra cleaning.

On the other hand, the values in the column Patients are actually missing. In this case, it would be wiser to write "missing" or some other indicator for this in each of these cells, to explicitly indicate that these data are missing for this particular variable.

Take a look at how the data were imported into R as the object messy_gis.

Excel-to-R resources

Here are some links to tutorials to help you transition to R from Excel:

R has robust ways to import Excel workbooks, work with the data, export/save Excel files, and work with the nuances of Excel sheets. It is true that some of the more aesthetic Excel formatting can get lost in translation (e.g. italics, sideways text, etc.). If your work flow requires passing documents back-and-forth between R and Excel while retaining the original Excel formatting, try packages such as {openxlsx}.

Extra practice

In these sections are some data cleaning exercises for practice.

Spotlight on children

Can you create these data frames, starting from the clean surv data frame?

Create a separate data frame for children and save it to the "outputs" folder as "children_spotlight.csv"

Start from the clean data frame surv and assign this new data frame with the name children
Reduce the data frame to only contain cases with age less than 18 years
Create a new column infant that says "Infant" if the case is less than 1 year old, and says "Non-infant" otherwise
Reduce the data frame to only show the columns case_id, age_years, infant, sex, district, hospital, and date_report
In a separate command, save this children data frame to the root folder of your RStudio project as "children_spotlight.csv"

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

# create data frame with only child cases
children <- surv %>% 
  filter(age_years < 18) %>% 
  mutate(infant = ifelse(age_years < 1, "Infant", "Non-infant")) %>% 
  select(case_id, age_years, infant, sex, district, hospital, date_report)

# export
export(children, "children_spotlight.csv")

GIS dataset

Can you create these data frames, starting from the clean surv data frame?

Your GIS team needs a data frame to put into their routine maps for this outbreak. They use R, so they expect a .rds file containing the case_id, any columns containing information on location or whether the patient moved significantly from their residence before being detected.

They have access to your RStudio project and expect the file to be saved into the "outputs" sub-folder with the current date of the data at the end of the file name.

Can you make this file for them?

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

# create data frame with only gis information
gis <- surv %>% 
  select(case_id, district, district_res, district_det, moved, lat, lon)

# export
export(gis, here("outputs", "gis_ebola_cases_20141201.rds"))

String detection

The R package {stringr} is part of the {tidyverse} and contains many different functions to handle words and character "strings". If you ever want to split, unite, or search within a character value, you should look in the Characters and strings chapter of the Epi R Handbook.

For example the function str_detect() returns TRUE or FALSE depending on whether particular characters are found within a specified value. The true arguments are:

1) string = this is the column or text to search in
2) pattern = this the character pattern to search for

For example:

str_detect(string = name, pattern = "Isha") will return TRUE if the characters "Isha" are found in a particular row of the colunm name
str_detect(string = occupation, pattern = "nurse") will return TRUE if the characters "nurse" are found in a particular row of the colunm occupation

Be aware that this function is case sensitive by default. You can adjust this using the tip found here in the Epi R Handbook.

Now, starting with the clean data frame surv, create a new data frame that contains only the cases reported from district whose names contains "West". Then export this as "western_district_cases.csv" to the "outputs" folder.

quiz(caption = "Quiz - Western districts",
  question("How many cases were there in western districts?",
    allow_retry = TRUE,
    answer("170"),
    answer(surv %>% filter(str_detect(string = district, pattern = "West")) %>% nrow(), correct = TRUE),
    answer("418")
  )
)

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

# create data frame with only cases from the "West" districts
western_district_cases <- surv %>% 
  filter(str_detect(string = district, pattern = "West"))

# export
export(western_district_cases, here("outputs", "western_district_cases.csv"))

Searching for multiple patterns

You can search for multiple patterns at the same time, for example: "Doctor" OR "Physician" OR "Surgeon" by using the "OR bar" in your pattern string. For example:

str_detect(string = district, pattern = "West|North|west|north")

Read more about this function, and how to make it not case-sensitive in this section of the Epi R Handbook.

Data dictionaries

A data dictionary, also sometimes referred to as a 'key', is a separate table from your main recording sheet. This dictionary allows to specify what specific variables mean, whether they be column names, colours, or other.

Variable definition

A data dictionary describes the meaning, units, and range of values that each column holds.

Even if you are familiar with a dataset, the meaning of the column names may not be obvious! That is when data dictionaries become useful, as they will provide information on the column name meaning.

quiz(
  question("For example, what do you think a column named `age_cat5` represents?",
    answer("The 5th age category"),
    answer("The age of the five cats"),
    answer("The age category, as categorised by 5 years age ranges", correct = TRUE),
    answer("The age category, as categorised in 5 groups"),
  allow_retry = TRUE
  )
)

Listing possible values and levels

Data dictionaries can also used to specify the acceptable values for a variable.

For example, you could specify the unit for a numeric variable (kilograms, pounds, years, months, etc.), or the increments of age groups. This is similar to what a cell with a drop-down list of values would provide in Excel.

quiz(
  question("Which of these would be helpful to add in a data dictionary to specify the value levels or format?",
    answer("Date: YYYY-MM-DD", correct = TRUE, message = "Adding the date format is helpful as a date can be written in many different ways (e.g. years in 2 or 4 digits, putting the day/month first, etc.) and specifying what input format the date should have will help save time when it comes to cleaning your dataset"),
    answer("sex: 'M' or 'F' or 'other' or 'unknown'", correct= TRUE, message = "Specifying the input format is helpful as you can input sex as words or initials, which may differ depending on the language or your input data. This can also be helpful to decide where you are only recording 'males' and 'females' or if you are recording other sexs too."),
    answer("Age: in years", correct = TRUE, message = "Specifying the unit of one value is important especially if that unit is not evident. In this case, age could be recorded in days, months or years for example."),
    answer("Age category, '0-4' or '5-14' or '15-44', '45+', or 'unknown'", correct=TRUE, message="When asking for data to be categorised, it is very important to specify the categories that you are expecting as these may not be obvious and can differ depending on the research question and data type."),
    answer("Presence of fever on admission: 'yes', 'no', or 'unknown'", correct = TRUE, message = "If you are expecting a binary data input, it is important to specify that you are as well as the format you are expecting. In this example, rather than 'yes' and 'no', the data collector could have asked for '1' and '0', where 1 = 'yes' and 0 = 'no'. If asking for numerical inputs to binary data, it is important to specify what these numbers represent."),
    allow_retry = TRUE
  )
)

Data dictionary example

Here are the first 5 rows of the surv_raw data frame that you imported:

head(surv_raw, 5)

And below is a data dictionary for this data:

knitr::kable(datadict)

This data dictionary allows to understand what each column value means, as well as what units the values were recorded in. These help keep data frames tidy, so that the imported data frame can be understood by computers but the epidemiologist still has a clear understanding of what each column represents.

It is good practice to create data dictionaries when you collect data and create new templates. These will also allow you to have column names that are easier to use for analysis. When creating your template in Excel, try to remember these tips so that your analysis in R will be as easy as possible:

Keep column name concise
Do not use spaces, but rather underscores (_) in the column names
Avoid the use of special characters, such as hyphen (-), commas (,), percentage signs (%), or currency signs (for example$ or £). These will either not be read by R or the column names will then have to be called on using backticks
Avoid recording unit values in column name unless the unit is shown in multiple units (it might be worth considering creating a new column for the value unit in this case)
Bold or italics will not be read by R
Stay consistent in how you name columns (for example, avoid mixing upper- and lower-case and if possible, keep column names all in lower-case)

Data dictionary variable descriptions

Thinking of the best practices you have just learned, try to answer the following questions.

quiz(
  question("What would be the best description for a column named `Date`?",
    answer("Date, YY-MM-JJ", message = "This description may not be explicit (what occurence is this date recording?) and the format is inconsistent: YY refers to 'years' (i.e. in English) whereas JJ refers to 'jour' (i.e. in French)."),
    answer("Date, YY-MM-DD", message = "This description may not be explicit (what occurence is this date recording?)."),
    answer("Date of specimen collection, YYYY-MM-DD", correct = TRUE, message = "This description explicitly specifies what event the date is recording and the format clear and consistent."),
    answer("Date, YYYY.MM.DD", message = "This description may not be explicit (what occurence is this date recording?)."),
    allow_retry = TRUE
  )
)

Data dictionaries are extremely useful and important if the data is recorded by multiple people or analysed by people that were not the collectors. Being as explicit as possible in the dictionary minimises the risk of misunderstanding and inaccurate recording. Data dictionaries should be kept in separate documents or sheets of your Excel document.

The R package {epikit}, jointly developed by Applied Epi and other organizations, has functions specifically for importing Kobo dictionaries into R.

Recording data

When recording data, the most important aspect is to stay consistent. This will help minimize the time it takes to clean the data as well as re-use the same code on new data.

Recording dates

Dates can be recorded in numerous formats. For example:

The order of days, months, or years may change
The numbers required for days, months, or years may change
Months may be written numerically or as characters
If using a numerical date, the separator may change

question("What is the optimal way of recording a date?",
  answer("YYYY-MM-DD", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("DD/MM/YY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("MM/DD/YYYY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("MM.DD.YY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("All of those may be correct", correct = TRUE,
         message="Indeed, all of the above options may be correct ways of recording dates!"),
  answer("Neither of those are correct",
         message="Are you sure about this?"),
  allow_retry = TRUE
)

Dates can be recorded in multiple ways, none of which is particularly superior to another. The most important aspect to remember is to stay consistent in how the date is recorded, whether that is numerically or in strings, the type of separator used, the order or the amount of numbers expected for days, months, years or hours and minutes!

Recording sex

question("Out of the below options, which do you think are good ways of recording sex?",
  answer("`f`,`h`", correct = TRUE),
  answer("`Female`, `Boy`, `Man`, `Baby`",
         message = "'Boy' and 'Baby' would not fall under a 'sex' variable. If using 'man', consistency would require the counterpart 'woman' rather than 'female' but when recording sex, 'female' and 'male' are the more correct form."),
  answer("`female`, `Male`",
         message = "Consistency in the capitalisation of the words is necessary!"),
  answer("`F`,`M`", correct = TRUE),
  answer("`0`,`1`", correct = TRUE),
  answer("`f`,`man`",
         message = "Consistency in the recording format is necessary! It would also be better to use the term 'male' rather than 'man' when recording sex."),
  allow_retry = TRUE
)

When recording sexs, consistency is key, whichever format is used.

The letters 'F' and 'M' are generally widely understood. Note that this needs to come with a data dictionary so that if other initials are used (for example H and F for 'homme' and 'femme', in French), the person analyzing the data will be aware of what these represent.

If using a binary number format to represent sex, a data dictionary is crucial to specify which number refers to which sex.

Whilst not incorrect, it is generally better to avoid spelling out the sex as this leaves more room for typos or using different styles of letter capitaliszation (which will be read out as different values in R).

Recording location

Look at the GPS column in the dataset below:

knitr::include_graphics("images/messydata.PNG")

question("What is the issue with it?",
  answer("I don't think there's any issue with it."),
  answer("The data is recorded in the wrong format."),
  answer("The data is recorded in multiple formats."),
  answer("The data is recorded in the right format but over multiple rows."),
  answer("The data is recorded in the wrong format and over multiple rows."),
  answer("The data is recorded in multiple formats and over multiple rows", correct = TRUE),
  allow_retry = TRUE
)

There are two issues with how this data is recorded:

The data is recorded in multiple formats
The data is recorded over multiple rows

question("How do you think these are best remedied?",
  answer("You harmonise the formats (pick one) and duplicate the data from the first row to the one below.",
         message = "Are you sure duplicating the data from one row to the next is the most efficient way of presenting your data?"),
  answer("You harmonise the formats (pick one) and separate the coordinates over two columns (merged cell).",
         message = "Remember that R can't read merged cells!"),
  answer("You harmonise the formats (pick one) and separate the coordinates over two columns (`latitude` and `longitude`).",correct = TRUE),
  answer("The format in rows 16 and 17 is the right way to record GPS data. You convert the other values to this format and you combine the values from the two rows into one cell.",
         message = "Whilst combining the two coordinate values into one cell is right, there is not one format that is better than another when recording GPS coordinates. You can pick any that you think is suitable, but the most important aspect is to stay consistent."),
  answer("The format in rows 16 and 17 is the right way to record GPS data. You drop the values that do not conform to this format and you combine the values from the two rows into one cell.",
         message = "Whilst combining the two coordinate values into one cell is right, there is not one format that is better than another when recording GPS coordinates. You can pick any that you think is suitable, but the most important aspect is to stay consistent. You should not drop the values as you would then be deleting valid data."),
  answer("The coordinates are well recorded except for row 14. You drop row 14 and 15 and convert the coordinates over two columns.",
         message = "Dropping rows 14 and 15 would mean getting rid of data across other variables too, so you should not do that. You could replace the value 'pending' by `NA`."),
  allow_retry = TRUE
)

GPS coordinates can be given in different formats:

Degrees, minutes and seconds (DMS), for example: 41°24’12.2"N 2°10’26.5"E
Degrees and decimal minutes (DMM), for example: 41 24.2028, 2 10.4418
Decimal degrees (DD), for example: 41.40338, 2.17403

Any of these units are fine to record in, but the most important rule to remember when recording location is to stay consistent with the format you are using.

Good practices for collecting and storing data

In this tutorial, you have learned about what tidy data is and its importance for data analysis. Let's review some of these concepts which will be important to keep in mind next time you design a dataset.

Collecting data

Before collecting your data, think of:

Your investigation objectives and the data needed to fulfill these
Your outcome measures
Your measures of exposure
Other relevant variables that may act as possible confounders
If this is data that will be collected over a certain period of time or by different people, think of the format your are creating for collection: is it explicit? Is it expandable?

An ideal dataset will be exhaustive enough to permit your analysis without being too complicated to fill when collecting the data.

Storing data

In this tutorial we have shown you examples of datasets stored in Excel and analysed in R. Whether using this format or not, keep in mind that:

Each variable should be in a column for collection (these may be grouped together for analysis)
Each observation should have its own row (these may be grouped together for analysis)
Each value should have its own cell

When creating your data collection template and when collecting data, remember to:

Have clear, explicit and concise column names
Avoid the use of special characters (eg greek letters or accents)
Specify the unit of each variable, if relevant, and use the same unit throughout that column
Specify the different levels of each variable, if relevant, and only use those throughout that column
Be consistent with your use of capitalization throughout your column
Explicitly state is your data is missing
Do not leave cells empty if they imply another meaning
Avoid the use of merged cells
Avoid the use of colors

If collecting data over multiple spreadsheet sheets, think of:

1) Whether is makes sense to collect it over multiple sheets. For example, if each sheet represents recording for a different location or month, could you not instead add a column to specify which location/month the observation belongs to?

2) If it makes sense to have your data across multiple sheets but you are recording common variables across both sheets, stay consistent in how you are recording your data and the units that you are using! Also stay consistent in which cells you are recording your data so you can automate your data extraction from the Excel sheets into R (rather than manually selecting which cells you will extract data from in each sheet!)

appliedepi/introexercises documentation built on April 22, 2024, 1:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

appliedepi/introexercises

In appliedepi/introexercises:

Introduction to R for Applied Epidemiology and Public Health

Welcome

Data cleaning

Format

Getting Help

Quiz questions

License

Learning objectives

Setup

Logical recoding

Negative values

Review logical operators

Relational operators

Logical operators

Other simple logic recoding options

case_when() for complex recoding

Create column age_years

Age categories

Date differences

Location

Prioritize certain values with coalesce()

Filter rows

Relational operators

Logical operators

%in%

Remove cases defined as "To investigate"

Filters and missing values

Re-arrange columns

Export clean dataset

Saving as .rds

End

Extra - Tidy data

Definitions

Principles of tidy data

"Machine-readable" data

Excel spreadsheets

Importing data from spreadsheets

Colors

Merged cells

Tidy data

Expanded possibilities

Empty space

Excel-to-R resources

Extra practice

Spotlight on children

GIS dataset

String detection

Searching for multiple patterns

Data dictionaries

Variable definition

Listing possible values and levels

Data dictionary example

Data dictionary variable descriptions

Recording data

Recording dates

Recording sex

Recording location

Good practices for collecting and storing data

Collecting data

Storing data

R Package Documentation

Browse R Packages

We want your feedback!

`case_when()` for complex recoding

Create column `age_years`

Prioritize certain values with `coalesce()`

`%in%`