In appliedepi/introexercises:

library(introexercises)  # get data for exercises 
library(learnr)          # create lessons from rmd 
library(gradethis)       # evaluate exercises
library(dplyr)           # wrangle data
library(flair)           # highlight code 
library(ggplot2)         # visualise data 
library(lubridate)       # work with dates
library(forcats)         # work with factors
library(fontawesome)     # for emojis 
library(scales)          # defining axes and units 
library(stringr)         # work with character strings
library(apyramid)        # creating demographic pyramids 
library(viridis)         # defining colour palettes 
library(janitor)         # clean data
library(tsibble)         # working with epiweeks
library(tidyr)           # clean data
# library(RMariaDB)        # connect to sql database

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated 
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60, 
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE 
)

# ## event recorder ---------------------------------------------------------------
# ## see for details: 
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
# 
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user     = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname   = 'excersize_log',
#                        host     = "144.126.246.140")
# 
# 
# ## define a function to collect data 
# ## note that tutorial_id is defined in YAML
#     ## you could set the tutorial_version too (by specifying version:) but use package version instead 
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#     
#   ## define a sql query 
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id, 
#                        tutorial_version, 
#                        date_time, 
#                        user_id, 
#                        event, 
#                        section,
#                        label, 
#                        question, 
#                        answer, 
#                        code, 
#                        correct)
#                        VALUES('", tutorial_id,  "', 
#                        '", tutorial_version, "', 
#                        '", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label,  "',
#                        '", paste0('"', data$question, '"'),  "',
#                        '", paste0('"', data$answer,   '"'),  "',
#                        '", paste0('"', data$code,     '"'),  "',
#                        '", data$correct, "')",
#                        sep = '')
# 
#     # Execute the query on the sqldtbase that we connected to above
#     rsInsert <- dbSendQuery(sqldtbase, event_log)
#   
# }
# 
# options(tutorial.event_recorder = recorder_function)

# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)

# data prep --------------------------------------------------------------------
combined <- rio::import(system.file("dat/linelist_combined_20141201.rds", package = "introexercises"))

surv_raw <- rio::import(system.file("dat/surveillance_linelist_20141201.csv", package = "introexercises"))

Introduction to R for Applied Epidemiology and Public Health

Welcome

Welcome to the course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation and the leading provider of R training, support, and tools to frontline public health practitioners.

knitr::include_graphics("images/logo.png", error = F)

Getting help and troubleshooting errors

This exercise teaches how to troubleshoot error messages, read function documentation, and effectively engage with R user communities to get help.

Format

This exercise guides you through tasks that you should perform in RStudio on your local computer.

Getting Help

There are several ways to get help:

1) Look for the "helpers" (see below)
2) Ask your live course instructor/facilitator for help
3) Schedule a 1-on-1 call with an instructor for "Course Tutoring"
4) Post a question in Applied Epi Community

Here is what those "helpers" will look like:

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Here you will see a helpful hint!

r fontawesome::fa("check", fill = "red") Click to see a solution (try it yourself first!)

linelist %>% 
  filter(
    age > 25,
    district == "Bolo"
  )

Here is more explanation about why the solution works.

Quiz questions

Answering quiz questions will help you to comprehend the material. The answers are not recorded.

To practice, please answer the following questions:

quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write it yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)

question_numeric(
 "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
 answer(10, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(9, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(8, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(7, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(6, message = "Ok, we will get there together", correct = T),
 answer(5, message = "Ok, we will get there together", correct = T),
 answer(4, message = "I like your confidence!", correct = T),
 answer(3, message = "I like your confidence!", correct = T),
 answer(2, message = "I like your confidence!", correct = T),
 answer(1, message = "I like your confidence!", correct = T),
 allow_retry = TRUE,
 correct = "Thanks for sharing. ",
 min = 1,
 max = 10,
 step = 1
)

License

Please email contact@appliedepi.org with questions about the use of these materials.

Learning objectives

In this exercise you will:

Practice catching common errors in R code
Learn to read R documentation
Learn how to create a minimal, reproducible example ("reprex") of a problem

The purpose of this exercise is for you to practice asking the community for help with a coding problem.

In this module you will learn how to ask us (and many other epis) questions for free via our community forum.

Script clean up

Style

Many organizations have R code style guides.

Please take a brief look at these examples. Think about your own coding style and how well it aligns:

On writing good comments, from the Tidyverse style guide
Using pipes and indentations, from the Tidyverse style guide
ggplot2 style, from the Tidyverse style guide

Code review

Restart your R session (Session drop-down menu -> Restart R).

Make sure your environment is empty (use the "broom" button in the top right if necessary).

Take 5-10 minutes and review your R script. Ensure that:

All code runs without errors, and unnecessary commands are either removed, deactivated, or moved to the bottom.
All the code is written with clear and best-practice syntax
The code is well commented, so that someone else can understand what you are doing

Ask an instructor for assistance, tips, or for a final review of your script.

Error troubleshooting

This section of the exercise will warm up your brain by showing you common error messages. It is your task to understand the reason for the error message.

Try your best, and view this as a learning opportunity. You will surely be in a position to troubleshoot errors in your code, or for others, very soon!

R's error messages can be cryptic, but practicing how to understand them is extremely useful.

Remember, the answers are anonymous, so don't worry if you get some questions wrong!

Could not find function

knitr::include_graphics("images/error_import.png", error = F)

quiz(
  question_radio("In the image above, what is the likely cause of the error message?",
    answer("The user forgot to run the command to load their packages",
           correct = TRUE,
           message = "Right! The error 'could not find function' either means the function name was spelled incorrectly, or its package was not loaded."),
    answer("R has a memory lapse and cannot remember how to import data",
           correct = FALSE,
           message = "R does not just forget things, this is probably user error."),
    answer("The user is not in the correct RStudio project",
           correct = FALSE,
           message = "The ebola RStudio project is correct and is not the problem."),
    allow_retry = TRUE
  )
)
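Once you have answered the question above, you can reproduce this kind of error safely in your own Console. The sketch below is only an illustration: the function name import_linelist is hypothetical and was chosen because no loaded package defines it. It captures the error message with tryCatch() so the script keeps running, then shows that the package::function() syntax works even when a package has not been loaded.

```r
# Call a function that is not defined in any loaded package.
# `import_linelist` is a hypothetical name used only for this illustration.
msg <- tryCatch(
  import_linelist("linelist.csv"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Writing package::function() tells R exactly where to find the function,
# so it works even if the package was never loaded with library().
first_rows <- utils::head(letters, 3)
print(first_rows)
```

If you see this error in your own work, first check the spelling of the function name, then check that the command loading your packages has been run.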

No such file

knitr::include_graphics("images/error_import_extension.png", error = F)

quiz(
  question_radio("In the image above, what is the likely cause of the error message? Note that the Files pane in the bottom right shows the RStudio project directory for reference.",
    answer("The user forgot to run the command to load their packages",
           correct = FALSE),
    answer("The linelist is not saved in the indicated folder",
           correct = FALSE,
           message = "It is visible in the Files pane that the linelist is saved in the indicated data/raw folder"),
    answer("The user forgot to write the file extension of the linelist (.csv)",
           correct = TRUE,
           message = "The file extension must be provided. The import() function also uses this to determine the best mechanism to import that type of data."),
    answer("The user is not in the correct RStudio project",
           correct = FALSE,
           message = "The ebola RStudio project is correct and is not the problem."),
    allow_retry = TRUE
  )
)
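After answering the question above, you may find it useful to check a file path yourself before importing. The following is a minimal sketch using only base R and a temporary folder. The file name linelist_demo.csv is made up for this illustration and is not part of the course data.

```r
# Write a tiny csv file into a temporary folder (hypothetical file name)
demo_dir <- tempdir()
csv_path <- file.path(demo_dir, "linelist_demo.csv")
write.csv(data.frame(case_id = 1:3), csv_path, row.names = FALSE)

# The path with the extension points to a real file
with_extension <- file.exists(csv_path)
print(with_extension)

# The same path without ".csv" does not
without_extension <- file.exists(file.path(demo_dir, "linelist_demo"))
print(without_extension)

# List the matching files in the folder to see their exact names
print(list.files(demo_dir, pattern = "linelist_demo"))

# Extract the extension from the path
print(tools::file_ext(csv_path))
```

Listing the files in the folder, as list.files() does here, is often the quickest way to spot a missing extension or a misspelled file name.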

Column doesn't exist

knitr::include_graphics("images/error_rename_data.png", error = F)

knitr::include_graphics("images/error_rename_script.png", error = F)

quiz(
  question_radio("After loading packages and importing data, the user tried to run the data cleaning command. What is the likely cause of the error message?",
    answer("The column date_onset does not exist in the raw dataset",
           correct = FALSE,
           message = "By the time the rename() is run, are the columns in the raw data relevant anymore? Think about what is changing them before the rename() function is run."),
    answer("The old column name should be written with a space, like in the raw data",
           correct = FALSE,
           message = "By the time the rename() function is used, the column names have already been standardized by the clean_names() function, so all spaces have been replaced by underscores."),
    answer("The columns in rename() should be written in the opposite order",
           correct = TRUE,
           message = "The rename function expects the new name of the column first, then an equals sign, then the old name of the column. The column names used should have underscores instead of spaces, because they have been modified by clean_names()."),
    answer("The pipe is using the incorrect dataset",
           correct = FALSE,
           message = "The cleaning command is correctly set up to create a clean dataset, starting from the raw dataset."),
    allow_retry = TRUE
  )
)

Error in tabyl(): object not found

knitr::include_graphics("images/error_missing_case.png", error = F)

quiz(
  question_radio("The user ran the cleaning command above, and then tried to produce the cross-tabulation of the clean data. What is the cause of the error message?",
    answer("The tabyl() function is not recognizing the surv object, because it is piped into it.",
           correct = FALSE,
           message = "The dataset can be piped into tabyl(), or provided as the first argument inside the parentheses."),
    answer("The user has created the object surv, but in the table is referencing Surv.",
           correct = TRUE,
           message = "R is case-sensitive. The difference between the lowercase 's' and the uppercase 'S' matters."),
    answer("The user forgot to save their script before running the table command.",
           correct = FALSE,
           message = "Saving of the script is not relevant to this. You can run commands without saving your script."),
    allow_retry = TRUE
  )
)
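After answering the question above, you can see R's case-sensitivity for yourself with the short sketch below. The object names surv_demo and Surv_demo are hypothetical stand-ins so that nothing in your own Environment is overwritten.

```r
# Create an object with a lowercase name (made-up values)
surv_demo <- data.frame(sex = c("m", "f", "f"))

# Refer to it with an uppercase "S" and capture the resulting error message
msg <- tryCatch(
  nrow(Surv_demo),
  error = function(e) conditionMessage(e)
)
print(msg)

# exists() confirms which spelling R knows about
print(exists("surv_demo"))
print(exists("Surv_demo"))
```

When you see "object not found", compare the name in the error message character by character against the names listed in your Environment pane.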

Getting help

When you encounter an error or other problem, what are your options?

You can look in our free Epidemiologist R Handbook for the answer. We built this book to be practical, useful, and cover all the topics an applied epidemiologist might need in their daily routines.
You can check the function's documentation, online or in RStudio itself.
You could ask a colleague, or an AI tool such as ChatGPT (we will address ChatGPT at the end of the module).
You could book a call with Applied Epi's 24/7 multilingual R Support Desk.
You can use a search engine like Google - this will likely direct you to Forum posts made by other R users who have asked similar questions.
You could post a question on a beginner-friendly forum, like Applied Epi's Community Forum. Answers are free!

Function documentation

When an R user creates an R function (or an entire package), they write "documentation" to assist users of the function. When the package is installed, the documentation is also downloaded and available in the Help pane in the bottom-right of RStudio.

The quality of this documentation varies greatly. Still, knowing the basic structure can help you when you seek help in using a function.

Vocabulary

Documentation will often mention objects of various classes, such as "data frame", "vector", "list", "matrix", etc.

Data frame - A "data frame" is the most common way of storing data in R. In common language, it is a "dataset" with rows and columns of equal length. All of the datasets we have worked with in R are data frames (surv_raw, surv, hosp, etc.)

Vector - A "vector" is a sequence of values, all of which must be the same class (e.g. date, numeric, character, integer, etc.). In a data frame, each column is itself a vector. You can also have standalone vectors that are created with the function c().

List - Lists in R are a more intermediate/advanced object that we have not taught in this class. In brief, they are like a vector of other objects. For example, you can have a list of data frames, a list of vectors, etc. This is a very useful object type for repeated actions (loops or other types of iteration).
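The three object types described above can be created and inspected with a few lines of base R. This is a small sketch with made-up values. The names ages, districts, linelist_demo, and demo_list are hypothetical and used only for illustration.

```r
# Two standalone vectors created with c(); each holds values of one class
ages      <- c(25, 31, 47)
districts <- c("Bolo", "West II", "Bolo")
print(class(ages))
print(class(districts))

# A data frame combines vectors of equal length as columns
linelist_demo <- data.frame(
  age = ages,
  district = districts,
  stringsAsFactors = FALSE
)
print(class(linelist_demo))

# Each column pulled out with $ is itself a vector
print(class(linelist_demo$age))

# A list can hold objects of different types, such as a data frame and a vector
demo_list <- list(linelist_demo, ages)
print(class(demo_list))
print(length(demo_list))
```

Running class() on an object is a quick way to check whether it matches what a function's documentation asks for.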

quiz(
  question("Which of the following is a vector? (select all that apply)",
    answer("A data frame named linelist",
           correct = FALSE,
           message = "No, a data frame contains multiple columns."),

    answer("A column linelist$age",
           correct = TRUE,
           message = "Yes, a column is a vector."),

    answer("c('Suspect', 'Confirmed') used in a filter() command to keep certain rows",
           correct = TRUE,
           message = "Yes, the Suspect and Confirmed are combined into one vector using the function c()."),
    allow_retry = TRUE
  )
)

Structure

Function documentation is generally organised into distinct sections:

Description - The purpose of the function and generally how it works.
Usage - How to write the function, with the most important arguments.
Arguments - The arguments which can be provided to the function, including requirements for the values provided to them, which are optional, and their default values.
Value - What the output of the function is.
Examples - These examples typically use data frames that are built into R to showcase the function's use.
Notes - Special notes about function usage (often including function-specific categories).

You can look up the Help Documentation for any function from the R packages loaded in your current RStudio session by using a '?' followed by the function name with no parentheses.

For example, this command run through your console displays the documentation for tabyl() under the Help tab of the Files, Plots, Packages, Help, and Viewer pane in RStudio.

?tabyl

Try running the above command in your Console and review the documentation shown under the Help tab.
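The ? shortcut is shorthand for the help() function. The sketch below shows a few equivalent or related commands. It uses head() from the {utils} package as a stand-in because that package ships with every R installation. The same pattern should apply to tabyl() once {janitor} is installed.

```r
# help() is the long form of the ? shortcut.
# For tabyl() you could write: help("tabyl", package = "janitor")
h <- help("head", package = "utils")
print(class(h))   # printing `h` itself would open the help page

# args() prints a function's arguments and their default values
print(args(utils::head))

# formals() returns the arguments as a list, so you can get just their names
arg_names <- names(formals(utils::head))
print(arg_names)
```

Checking the arguments this way is a fast complement to the Usage and Arguments sections of the documentation.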

quiz(
  question_radio("Which formats can the first argument of the tabyl function take?",
    answer("Only a data.frame",
           correct = FALSE,
           message = "No, review the documentation of the tabyl function."),

    answer("Only a vector",
           correct = FALSE,
           message = "No, review the documentation of the tabyl function."),

    answer("As long as it is in character format, any object type works. ",
           correct = FALSE,
           message = "No, review the documentation of the tabyl function."),

    answer("data.frame or vector",
           correct = TRUE,
           message = "The tabyl() function requires its first argument to be a data.frame or vector object."),
    allow_retry = TRUE
  )
)

Package documentation

Scroll to the bottom of the tabyl() documentation.

Notice at the bottom there is information on the package that tabyl() belongs to:

Package name, in this case {janitor}
Package version, the version you have installed on your computer.

Click on the "Index" link and review the documentation for the {janitor} package. You will see links to documentation for all the functions within the package.

You can also access the package documentation with a command like this:

?janitor
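A few other base R commands give package-level information from the Console. The sketch below uses the built-in {utils} package so that it runs on any installation. Replacing "utils" with "janitor" is an assumption that {janitor} is installed on your computer.

```r
# The installed version of a package
pkg_version <- packageVersion("utils")
print(pkg_version)

# The names of all functions that a package makes available
pkg_functions <- getNamespaceExports("utils")
print(head(sort(pkg_functions)))

# These commands open documentation, so they are shown as comments here:
# help(package = "janitor")      # the package index page
# vignette(package = "janitor")  # the list of vignettes, if any exist
```

The version number is worth including whenever you ask for help, because function behaviour can change between versions.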

Review the documentation of the adorn_totals() function in the {janitor} package.

quiz(
  question_radio("What is the class of the output created from adorn_totals?",
    answer("Numeric",
           correct = FALSE,
           message = "Review the Value area of the documentation."),

    answer("Character",
           correct = FALSE,
           message = "Review the Value area of the documentation."),

    answer("Numeric or Character",
           correct = FALSE,
           message = "Review the Value area of the documentation."),

    answer("Data frame",
           correct = TRUE,
           message = "adorn_totals() outputs a data frame that also has the special class 'tabyl' which allows it to store additional information about the attached totals and underlying data."),
    allow_retry = TRUE
  )
)

Most people find documentation difficult to read! If you feel this way, you are not alone. Often, package authors also create websites or online walk-throughs for their packages. These are more readable and may even include more in-depth examples (e.g. vignettes) of how to use functions within a package.

Online documentation

CRAN site and vignettes

CRAN is a place where packages can be hosted online and made available for download. Each package stored there has a standard, basic website. Sometimes, links to package specific vignettes and tutorials can be found here.

Here is the CRAN site for rio. See the section on "Vignettes" and click on some to open them. Here is their introductory vignette.

Custom websites

Go to this website https://www.danieldsjoberg.com/gtsummary/. This is a website created by Daniel Sjoberg, the creator of the {gtsummary} package.

In addition to the standard documentation stored in RStudio, he has created a website to make the documentation more accessible. Many authors do this as a service to their audiences.

Here is the website for the tidyverse (dplyr, ggplot2, stringr, forcats, and more...) group of packages.

Github

For many packages, their underlying code is available on the site "Github". There, updates are shown publicly and users can submit requests or problems to be fixed. In fact, users often submit fixes as well! This is one of the many ways R-users can show-case how useful it is to code in an open-source language like R!

Here is the Github site for janitor. Click the "Issues" tab at the top to see pending issues and suggestions from the public.

Open-source ethos and thanking authors

Remember that R is built by its users, and is available for free. There is no company overseeing R and building all of its packages. A user writes a package and then it is vetted by millions of users - the best ones rise to the top.

Before you get frustrated at lack of documentation, take a deep breath and remember that the vast majority of packages are made on a volunteer basis by people who wish to help others. These authors often do not have much spare time.

Sometimes, they have donation buttons, or suggest that you "pass on" the goodwill to someone else learning R (for example, by answering their questions in our Community Forum)!

ChatGPT

AI tools like ChatGPT can be useful for getting R code help, and we encourage you to try them. However, for this introductory course, our priority is to teach you R programming without the assistance of AI.

Tools like ChatGPT can sometimes provide incorrect or confusing answers, and they might not know the latest developments for a package. You should also be careful about feeding them sensitive information. You still need to know how R code works in order to understand where AI is making a mistake.

We believe that forums are a place to connect with other epidemiologists and public health professionals. In forums, you can ask nuanced questions about epidemiological methods, network with peers, and, importantly, help others.

Note, as AI becomes more commonplace, many institutions have started requiring citation of AI if used. Review your agency's policies before using AI.

Search engine strategies

Search engines can be a powerful starting point to get help, but certain strategies will make your search much more effective.

You might find it useful to try a Google search when:

You have an error message you don't understand
You are trying to clean or format dates but can't remember strptime formatting
You want to create a nice background for your plot but can't remember ggplot's theme specifics
You are curious whether a package exists that conducts survival analysis (Hint: it does!)
You want to learn how to create a specific type of plot, such as a map of an outbreak

Search strategies:

To write an effective search:

Exclude information that is unique to your data or analysis.

For example, let's say you receive the following error message:

knitr::include_graphics("images/error_missing_recode.png", error = F)

In a Google search, you would not include the data frame name ("surv"), column names ("district") or specific values ("West II"), as these are unique to your own data or analysis.

The focus should be on the broader error message. The error message above is specific to the recode() function. In this scenario, the most effective search would only include:

Caused by error in recode.character(): ! argument ".x" is missing, with no default

Let's try this out now. Copy and paste the above into the Google search engine.

Depending on your search history, you may have different results compared to others. Click through some different results to review some answers. Did you find the solution to your problem?!

Useful places to look for the answer might include Stack Exchange, Stack Overflow, or the Epi R Handbook. For example, you may have the following result appear in your Google search.

knitr::include_graphics("images/search_engine_error_missing_recode_result.png", error = F)

If you click into this page, you will find the Epi R Handbook chapter provides the answer.

Ctrl f

To search within a webpage for a specific term, press the Ctrl and "f" keys together (Cmd and "f" on a Mac). For example, the page linked above is a chapter covering many common R errors, so you can use this shortcut to search for recode in the page:

knitr::include_graphics("images/search_engine_error_missing_recode_solution.png", error = F)

It quickly located text related to our specific error of interest!

We are missing the column name within recode(). An easy fix!

Note: Just like you used ctrl f (or cmd f for a mac) to search for recode on this web page, you can also use ctrl f or cmd f in RStudio to search for specific terms in your R script. This can be very useful, especially when debugging errors in the script!

Forums

There are a number of forums for posting R code questions. For example, Stack Exchange and Stack Overflow are very popular. But it can be intimidating or scary to post in these forums!

Our team at Applied Epi noticed that most beginner public health R users do not post in these forums because

1) They are intimidated by the strict culture of the forum, and
2) These forums are not public-health focused

So, we decided to make a beginner-friendly and public health-focused forum. That is our Applied Epi Community!

You can post questions about R code, or other public health / epi methods topics!

Many public health agencies are now establishing their own internal R user groups or support email groups. But almost everyone is asking the same questions! If more agencies post their (non-sensitive) R code and epi methods questions in one public forum, then everyone will be able to find answers more easily and walls between nations and agencies are reduced.

Ask a good question in a forum

The most efficient way to get a useful answer on a forum is to take the time to ask a good question.

"Help me help you!" - Lots of epidemiologists ready to assist you

Imagine someone trying to help you with an R code error. To quickly understand your problem and fix it, they must be able to re-create your problem on their computer.

They do not have access to your full dataset. But they still want to be able to run your code and see the exact problem you encountered...

You can facilitate this by including a "minimal, reproducible example" in your post, which is called a "reprex".

Data sensitivity and sharing

r fontawesome::fa("exclamation", fill = "red") Remember to always think carefully about what data you are allowed to post publicly! Ensure there is no patient, identifiable, or otherwise sensitive information.

Note - you should not post an entire dataset. Typically, only 5-15 rows of data, and a few columns, are required for someone to reproduce your problem.

Alternatively, consider:

Anonymising, "jittering", or otherwise obscuring any sensitive values, and clearly stating in the post that the data are fake
Creating a fake dataset with a similar structure
Using a public dataset that has a similar structure
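As one way to build a fake dataset with a similar structure, the sketch below generates a small data frame of random values in base R. The column names, district labels, and number of rows are all made up for illustration. Adapt them to mirror the structure of your own data.

```r
# Fix the random seed so that everyone who runs this gets the same fake data
set.seed(42)
n_rows <- 8

# Every value below is randomly generated and does not describe a real person
fake_linelist <- data.frame(
  case_id  = sprintf("FAKE%03d", seq_len(n_rows)),
  sex      = sample(c("m", "f"), n_rows, replace = TRUE),
  district = sample(c("District A", "District B", "District C"),
                    n_rows, replace = TRUE),
  age      = sample(0:90, n_rows, replace = TRUE),
  stringsAsFactors = FALSE
)

print(fake_linelist)
```

If you share data like this, say clearly in your post that the values are fake.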

Reprex example 1

Now, you will work through an example of building a "reproducible example" of a problem, and post it in the Training area of the Applied Epi Community forum. If you do not have access to the Training area, please alert one of the instructors so they can grant you access and ensure you can post on the forum. You will need your Applied Epi account to log in to the Applied Epi Community forum.

Purpose of a reprex

The purpose of a reprex is to summarize your problem so that readers can re-create it on their own computers.

A good reprex is:

Minimal - include only the parts of your data and code that are required to reproduce your problem
Reproducible - include all context needed to reproduce the problem, e.g. the packages loaded, the commands run, etc.

Now we will create a reprex together. If you forget these steps at a later date, you can always find them written in Applied Epi's Community Forum here.

Scenario

Start with a very basic R script that uses the Ebola dataset.

Go to the ebola/scripts/examples folder, and open "example_analysis1.R".

It should look like this:

############################
# Example analysis 1
############################

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse)

# import data
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

# clean the surveillance data
surv_clean <- surv_raw %>% 
  clean_names()

# make a horizontal bar plot of cases per district, filled by sex
ggplot(
  data = Surv_clean,
  mapping = aes(y = adm3_name_res, fill = sex))+
geom_bar()

This script does the following:

1) Loads packages
2) Imports the raw linelist data
3) Cleans the data's column names
4) Uses ggplot() to create a bar plot

Run all the commands one-by-one.

r fontawesome::fa("exclamation", fill = "red") Something went wrong!

When you tried to produce the bar plot, no plot appeared and there was an error message:

object 'Surv_clean' not found

It is possible that you can already see the reason for the error. Even if you already know the reason for the error, work through the steps below to post in the forum for assistance.

The skill you learn today is how to create an effective post - it is a skill that will help you for many years.

Let's make a "reprex" forum post to see if anyone can help!

Make a new script for your forum post

Begin your reprex with a clean script. Open a separate R script and save it to the ebola/scripts/examples folder as "reprex1.R".

Have both scripts available in RStudio so that you can switch between them ("example_analysis1.R" and "reprex1.R").

Copy commands

Now populate the "reprex1.R" script with only the commands needed to re-create your problem.

You will need to think about this. Typically, this includes:

1) A command to load packages
2) A command to import data
3) The few commands you ran to process the data, if relevant
4) Any commands that revealed, or likely caused, the problem/error

In this short example, we can copy the following commands:

1) The pacman::p_load() command. This will tell the forum responder which packages were loaded at the time.

2) The import() command. We will adjust this in a moment.

3) The data cleaning command that creates surv_clean

Ask yourself - is the ggplot() command necessary to recreate the problem? Yes. That is where the problem was first identified.

4) Include the ggplot() command in the "reprex1.R" script as well.

Your "reprex1.R" script should now look like this:

# reprex1 script  

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse)

# import data
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

# clean the surveillance data
surv_clean <- surv_raw %>% 
  clean_names()

# make a horizontal bar plot of cases per district, filled by sex
ggplot(
  data = Surv_clean,
  mapping = aes(y = adm3_name_res, fill = sex))+
geom_bar()

Clear your Environment (click the "broom" icon in the top-right corner) and then run all these commands again. Ensure that you still see the error message you expect.

The dataset for the post

A responder in the forum will not have access to the "surveillance_linelist_20141201.csv" dataset. This is stored in the RStudio project on your computer. Most often, it is neither practical nor permissible to share the entire dataset in the forum.

Instead, you can provide the forum with a minimal version of your dataset - just a few columns and rows - which they can use to recreate your problem on their computer.

Step 1. Load the {reprex} and {datapasta} packages

Install and/or load the "reprex" and "datapasta" packages. You can do this by adding them to the pacman::p_load() command at the top of your "reprex1.R" script. Your pacman::p_load() command should now look as follows:

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse, reprex, datapasta)

Step 2. Make your minimal data frame

Identify the minimal rows and columns needed to reproduce your problem. Typically, 5-10 rows and a few columns will suffice.

In your "reprex1.R" script, write a new command at the bottom of your script. It will use select() and head() to reduce your surv_raw dataset into a very small dataset with only a few rows and few columns.

Write and run this command:

# create a minimal dataset, by reducing surv_raw to 5 rows and 2 columns
surv_raw %>% 
  head(5) %>%                 # take the top 5 rows only
  select(adm3_name_res, sex)  # keep only the relevant columns

Note: This command is not creating a new object. It should simply print a small data frame to the Console.

Note: Start the command with your raw data. This ensures that the forum user begins their troubleshooting from the same dataset that you began with.

Step 3. A command to generate the raw data

Will people on the forum have access to surv_raw? No.

This means we need to create our raw data within the script as well. The next step is to transition our commands so that the raw data and minimal data frame are generated purely with R code. There will be no reliance on the original surv_raw dataset.

Add a pipe to the command at the bottom of your script and pass the minimal dataset to the dpasta() function, from the {datapasta} package.

# create a minimal dataset, by reducing surv_raw to 5 rows and 2 columns
surv_raw %>% 
  head(5) %>%                      # take the top 5 rows only
  select(adm3_name_res, sex) %>%   # keep only the relevant columns
  dpasta()                         # convert to stand-alone R code

When you run this command, you will notice that R code has been automatically pasted into your script below the command. For example:

data.frame(
  stringsAsFactors = FALSE,
     adm3_name_res = c(NA,"Mountain Rural",
                       "Mountain Rural","East II","West III"),
               sex = c("m", "f", "f", "f", "f")
)

Run this command. See how it produces the exact same minimal dataset, using the base R function data.frame(). The values are "hard-coded" (explicitly written into the command) with the proper column structure.

The result: anyone with this command can begin working with your minimal dataset - despite not having access to the original surv_raw dataset.
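
If {datapasta} is not available, base R's dput() can serve a similar purpose. The sketch below is an alternative to the dpasta() approach above, not part of this exercise's workflow. The object name surv_min is hypothetical, and its values are copied from the example output.

```r
# hypothetical stand-in for the minimal dataset printed in Step 2
surv_min <- data.frame(
  adm3_name_res = c(NA, "Mountain Rural", "Mountain Rural", "East II", "West III"),
  sex = c("m", "f", "f", "f", "f"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; this output can be copied into a script
dput(surv_min)

# optional check: rebuilding from the printed code gives an equivalent data frame
rebuilt <- eval(parse(text = capture.output(dput(surv_min))))
all.equal(surv_min, rebuilt)
```

The dput() output is less tidy than the dpasta() output, but it needs no extra packages.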

Step 4. Make the reprex totally reproducible

Now, in the "reprex1.R" script, CUT and PASTE the above output data.frame() command to replace the import() command. In the script, the surv_raw object should now be created from the data.frame() command instead of using import().

# import data
surv_raw <- data.frame(
  stringsAsFactors = FALSE,
  adm3_name_res = c(NA,"Mountain Rural",
                    "Mountain Rural","East II","West III"),
  sex = c("m", "f", "f", "f", "f")
)

Once finished, you can deactivate/comment out (using the # symbol) the code at the bottom of the script which makes the minimal dataset.

Your script should now look like this:

# reprex 1

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse, reprex, datapasta)

# import data
surv_raw <- data.frame(
  stringsAsFactors = FALSE,
  adm3_name_res = c(NA,"Mountain Rural",
                    "Mountain Rural","East II","West III"),
  sex = c("m", "f", "f", "f", "f")
)

# clean the surveillance data
surv_clean <- surv_raw %>% 
  clean_names()

# make a horizontal bar plot of cases per district, filled by sex
ggplot(
  data = Surv_clean,
  mapping = aes(y = adm3_name_res, fill = sex))+
geom_bar()


# make the minimal dataset
#surv_raw %>% 
#  head(5) %>%                      # take the top 5 rows only
#  select(adm3_name_res, sex) %>%   # keep only the relevant columns
#  dpasta()                         # convert to stand-alone R code

Step 5. Test the reprex

Clear your Environment again by clicking on the broom icon in the top-right of the Environment pane. This means no objects are saved and ensures your reprex is being tested in isolation.

Run all the commands in the "reprex1.R" script. The data command should not have import() in it - only the data.frame() with the hard-coded values.

You should still encounter the same error message. If not, consider using a different minimal dataset (e.g. more rows, or different columns). For example, to include 10 rows instead of 5 rows, you could use head(10) instead of head(5) when creating your minimal dataframe.
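
The Environment can also be cleared with code instead of the broom icon. The sketch below uses the base R functions rm() and ls(); the object names temp_a and temp_b are hypothetical. It removes the visible objects only and does not unload packages.

```r
# create two throwaway objects (hypothetical names, for illustration only)
temp_a <- 1
temp_b <- "two"

# remove every visible object from the current environment
rm(list = ls())

# list what is left; the throwaway objects should no longer appear
ls()
```

Run this in your Console rather than including it in the script you share, because it would also clear the responder's Environment.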

Step 6. Generate a shareable reprex

The next step is to process the reprex using the {reprex} package. This will help you easily share your reprex in the forum.

You have already loaded the {reprex} package as part of the pacman::p_load command in your R script.

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse, reprex, datapasta)

First, highlight all the R code in the "reprex1.R" script that is required to recreate your problem. This is the code that will be run in isolation (from an empty Environment) and posted to the forum. When highlighting the R code, consider which parts of the script (if any) are not required for your reproducible example.

Once you have highlighted your R code, run the following command in your Console: reprex_addin(). This will open a pop-up window to assist with creating your reprex.

Select the following options in the pop-up window:

1) Where is reprex source? "current selection" (refers to highlighted R script)
2) Target venue: "Github or Stack overflow" (refers to online forum post)
3) Scroll down to select "Append session info" (check box at the bottom of the pop up)

Then press the Render button.

knitr::include_graphics("images/reprex_addin.png", error = F)

This action takes the highlighted R code and runs it in an isolated R Environment. It records any errors, warnings, outputs, information about your R version, and loaded packages used in the reprex.

All this is combined into a pretty HTML format that can be posted in a forum for a responder to easily receive. This appears in the Viewer pane, and also is automatically added to your clipboard (meaning it is ready to be pasted by you into a forum post).

knitr::include_graphics("images/reprex_output.png", error = F)

If nothing appears in the Viewer pane, run the command again. Occasionally it can be buggy and need to be run twice. Check the Viewer pane to ensure the error message looks as you expected.

This example is entirely reproducible. Anyone can copy it into their R script, run it, and begin troubleshooting your problem.
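
The same result can likely be produced without the pop-up window by calling reprex::reprex() directly. This is a minimal sketch, assuming that {reprex} is installed and able to render on your computer. The temporary file is a hypothetical stand-in for "reprex1.R", and the two lines written to it are invented for illustration.

```r
# write a tiny two-line script to a temporary file
# (a hypothetical stand-in for "reprex1.R")
script_path <- tempfile(fileext = ".R")
writeLines(
  c("surv_clean <- data.frame(sex = c('m', 'f'))",
    "nrow(Surv_clean)  # deliberate capitalisation error"),
  script_path
)

# render the reprex without the pop-up window, if {reprex} is installed
if (requireNamespace("reprex", quietly = TRUE)) {
  reprex::reprex(
    input = script_path,   # file containing the code to run in isolation
    venue = "gh",          # Markdown output, as for "Github or Stack Overflow"
    session_info = TRUE    # append R version and package details
  )
}
```

The arguments mirror the three options selected in the pop-up window: the source of the code, the target venue, and whether to append session info.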

Posting your reprex

Finally, you have your reprex ready to go! Let's go post it in the forum.

Go to Applied Epi Community and login with your Applied Epi account.

There are two areas where you can post R code problems or questions:

1) The Training area - use this for today's practice posts

knitr::include_graphics("images/training_area.png", error = F)

2) The "R Code" area - use this in the future for questions about your analyses

knitr::include_graphics("images/rcode_area.png", error = F)

Go to the Training area of the forum and click "New Topic" on the right side. Note that the written portion of a forum post uses Markdown-style formatting. We will learn more about this next session, but for now you can try using * to create bullet points, or # to create different levels of headings in your post (# for a big heading, ## for a smaller heading, and so on). The post editor on the forum also provides point-and-click options for bold, italics, and bullet points.

Follow the prompts in the forum to write a simple post that:

Has a brief, informative title (use # for headings and ## for sub-headings within the written portion of the post)
Is "tagged" with relevant tags
Explains your desired outcome and steps you have already taken to solve the problem
Thanks anyone who tries to help you

At the bottom, paste your reprex. Remember that the reprex is on your clipboard already, it is ready to be pasted. Either use ctrl+v (cmd + v for macs) shortcut, or right-click and paste. If you have lost it, simply re-run the reprex by highlighting the required code in your R script and running reprex_addin() in the Console, and then paste into the forum post.

knitr::include_graphics("images/reprex1_post.png", error = F)

The content on the right side is a preview. When you have finished editing your post, click "Create Topic" and go look at your post in the forum!

An answer

You have posted in the Training area, so you may not receive an answer.

However, if you post a question in the actual "R Code" area, epidemiologists from around the world will be alerted to your question and may decide to help you.

How does someone respond to a question?

The responder will copy the reprex by clicking the copy icon in its top-right corner. They can easily paste this into an R script on their computer. Because the reprex contains a minimal dataset and is reproducible, they can run your code exactly as you did and see the same error message.

An answer!

They quickly identify that the problem is in the object names. Your script saves the clean data as surv_clean, but ggplot()'s data = argument is using Surv_clean. Because R is case-sensitive, it does not know the Surv_clean dataset, so it reports that this object is not found.

The forum responder can edit your code appropriately, and paste it in their reply.
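
The case-sensitivity problem can be reproduced in a few lines. This is a minimal sketch with a hypothetical three-row data frame; tryCatch() is used only so that the error message is captured rather than stopping the script.

```r
# the object is created with a lower-case "s"
surv_clean <- data.frame(sex = c("m", "f", "f"))

# asking for "Surv_clean" fails; tryCatch() captures the message instead of stopping
msg <- tryCatch(
  nrow(Surv_clean),
  error = function(e) conditionMessage(e)
)
print(msg)

# exists() confirms which spelling R knows about
exists("surv_clean")
exists("Surv_clean")
```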

Be sure to reply to thank them, and mark their answer as the "Solution"!!!

More troubleshooting practice

Try to answer these questions about common R errors. Again, simply do your best and ask an instructor for help if needed.

If you prefer to spend your time practicing another "reprex", you are welcome to skip these questions and advance to the next section.

Object not found

knitr::include_graphics("images/error_pipe_missing_data.png", error = F)

knitr::include_graphics("images/error_pipe_missing.png", error = F)

quiz(
  question_radio("In the image above, the user received an error after running the entire data cleaning command. What is the cause of this error?",
    answer("The adm3_name_res column is not in the surv dataset",
           correct = FALSE,
           message = "This column is in the surv dataset."),
    answer("The adm3_name_res column is not in the surv_raw dataset",
           correct = FALSE,
           message = "This column is in the surv_raw dataset."),
    answer("Some of the values in the adm3_name_res column are missing",
           correct = FALSE,
           message = "Missing values in a column would not cause this error."),
    answer("The user forgot to put a pipe operator on the end of line 22",
           correct = TRUE,
           message = "Correct. The dataset was not piped to the filter() line. Therefore, R did not understand what the column adm3_name_res was, without the context of the dataset."),
    answer("The user should have written the filter() line above the mutate() line",
           correct = FALSE,
           message = "No, the order of these two lines has no impact on this error message."),
    allow_retry = TRUE
  )
)

quiz(
  question_radio("Why does the error message reference 'string' and 'pattern' if it has to do with a missing pipe?",
    answer("On line 23, the filter() function applies a logical test to the rows in the dataset via the function str_detect(), which looks for patterns in a character column.",
           correct = TRUE,
           message = "'string' and 'pattern' are the names of the first two arguments of str_detect(). Learn more about str_detect() in the Epi R Handbook. Alternatively, run ?str_detect in your Console to access the Help documentation for str_detect()."),
    answer("The software R is constructed via microscopic strings that vibrate at varying speeds.",
           correct = FALSE,
           message = "R is a normal programming software."),
    answer("R error messages always contain these words",
           correct = FALSE,
           message = "While not always clear, R messages do not always have the same text."),
    allow_retry = TRUE
  )
)

Unused argument

knitr::include_graphics("images/error_cleaning_ifelse.png", error = F)

quiz(
  question_radio("In the image above, the user received an error after running the entire data cleaning command. What is the cause of this error?",
    answer("The ifelse() function requires the user to specify which argument is the test",
           correct = FALSE,
           message = "If the arguments of the ifelse() function are written in their default order, they do not need to be specified with their names."),
    answer("It is not necessary to write 'child'. Any row that is not an adult will be marked as NA.",
           correct = FALSE,
           message = "ifelse() does require the user to specify a value for when the logical test is met, and for when the test is not met."),
    answer("The age column must be converted to numeric before being used in ifelse()",
           correct = FALSE,
           message = "While it is true that age must be numeric or integer to be evaluated properly, in this case it already is numeric. This is not the source of the error."),
    answer("The user must nest the ifelse() within a mutate() function",
           correct = TRUE,
           message = "Correct. Each line that receives data piped from a previous line must begin with a function that accepts a dataset as its first argument. Within the mutate(), after an equals sign, the ifelse() can be written. The correct syntax for the line is: mutate(age_group = ifelse(age > 18, 'adult', 'child'))."),
    allow_retry = TRUE
  )
)
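
Once you have answered the question above, the corrected syntax can be tried on a small example. This is a minimal sketch, assuming {dplyr} is installed; the toy data frame and its values are hypothetical.

```r
library(dplyr)

# hypothetical toy data; the real dataset has many more columns
surv <- data.frame(
  case_id = c("a1", "b2", "c3"),
  age     = c(4, 35, 62)
)

# ifelse() is written inside mutate(), which accepts the piped dataset
surv_grouped <- surv %>%
  mutate(age_group = ifelse(age > 18, "adult", "child"))

surv_grouped
```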

Unfinished command

knitr::include_graphics("images/error_cleaning_paren.png", error = F)

quiz(
  question_radio("As shown above, the user ran the entire cleaning command, but no output or changes occurred. The Console shows a + symbol, indicating the command is not finished. What is the cause of this?",
    answer("There is too much space between each line in the cleaning command.",
           correct = FALSE,
           message = "Empty space does not impact code execution, but can make it easier to read."),
    answer("Ending with a filter() command will always cause this situation",
           correct = FALSE,
           message = "There is no function that will always cause this situation."),
    answer("It is not a problem, the user should continue with their next command.",
           correct = FALSE,
           message = "If R thinks that this command is not finished, then running the next command will produce unexpected results because the two commands will be linked in unintended ways."),
    answer("The X's are clues - there is a missing closing parenthesis on line 41",
           correct = FALSE,
           message = "Line 41 is complete and accurate. One opening parenthesis, and one closing parenthesis."),
    answer("The X's are clues - there is a missing closing parenthesis on line 38",
           correct = TRUE,
           message = "On line 38 there are two functions. The filter() function is restricting the dataset to only three districts. The districts are three character values combined into a vector by the c() function. This allows the %in% operator to evaluate them all at one time. However, there is only one closing parenthesis on the line, when there should be two: one for filter() and one for c()."),
    allow_retry = TRUE
  )
)

Object not found (ggplot)

knitr::include_graphics("images/error_ggplot_data.png", error = F)

quiz(
  question_radio("In the image above, what is the cause of the error?",
    answer("The district_res column is not in the surv dataset.",
           correct = FALSE,
           message = "This column is in the surv dataset."),
    answer("The ggplot2 package has not been loaded so the ggplot() function is not working properly.",
           correct = FALSE,
           message = "The ggplot2 package has been loaded. If it were not loaded, the error 'could not find function' would appear instead."),
    answer("The dataset object to be used for the plot has not been specified.",
           correct = TRUE,
           message = "Yes, this command is missing a specification of the data. It must either be specified to the data= argument in the ggplot() function, data= argument in the geom_bar() function, or piped into the ggplot command from above."),
    allow_retry = TRUE
  )
)

surv_raw not found

knitr::include_graphics("images/error_pipe_hang.png", error = F)

quiz(
  question_radio("The user above clicked on the cleaning command and pressed the Run button. No clean data appeared in the Environment. What is the cause of the error message?",
    answer("The tabyl() command below should have referenced surv, not surv_raw.",
           correct = FALSE,
           message = "This is not the problem. One of the underlying problems is that the surv data has not even been created! (see the Environment pane)"),
    answer("A package has not been loaded, so R cannot find the surv_raw() function.",
           correct = FALSE,
           message = "Although you are correct that the message 'could not find function...' usually indicates a misspelled function name or an unloaded package, in this case surv_raw is not a function we want to run - it is a dataset!"),
    answer("R thinks surv_raw on line 18 is a function, because it has a pipe written after it.",
           correct = FALSE,
           message = "Yes, R thinks surv_raw is a function, but NOT because it has a pipe written AFTER it."),
    answer("R thinks surv_raw on line 18 is a function, because it has a pipe written before it.",
           correct = TRUE,
           message = "Yes, R thinks surv_raw is a function, because there is an extra pipe written on line 14, at the end of the cleaning command. This should be removed. It is causing R to think that the output of line 14 (the clean dataset) should be piped into a function on line 18."),
    allow_retry = TRUE
  )
)

surv not found

knitr::include_graphics("images/error_assign.png", error = F)

quiz(
  question_radio("The user ran the cleaning command above, and then tried to produce the cross-tabulation of the clean data. What is the cause of the error message?",
    answer("The surv object was not created, because the user did not run the cleaning command separately from the table command.",
           correct = FALSE,
           message = "The surv object was not created, but not because of how they ran the commands. It is possible to run multiple commands at once by highlighting them as shown."),
    answer("The surv object was not created, because the select() function removes unused objects in the cleaning pipeline.",
           correct = FALSE,
           message = "The select() function does not remove objects. It selects columns to keep or remove (using the - symbol)."),
    answer("The surv object was not created, because the cleaning pipeline starting in line 29 was not assigned (<-) to an object called surv. ",
           correct = TRUE,
           message = "You can see in the Environment that surv has not been created. You can see in the Console that the dataset printed to the Console instead of being saved to the Environment. On line 29, there should be the name surv, and the assignment operator, to save the object in the Environment."),
    allow_retry = TRUE
  )
)

"I wanted an epicurve colored by hospital"

knitr::include_graphics("images/error_ggplot_quotes.png", error = F)

quiz(
  question_radio("The user wanted an epidemic curve histogram with 'stacked bars' and one color for each hospital. Why is the histogram all one color?",
    answer("The user should have used geom_col() instead of geom_histogram",
           correct = FALSE,
           message = "No, geom_histogram() is the correct function to show an epidemic curve from case linelist data and a continuous date variable."),
    answer("The command is missing the stat = 'identity' argument",
           correct = FALSE,
           message = "No, this adjustment is outdated and not even relevant for this scenario."),
    answer("The user has placed the fill aesthetic in quotation/speech marks",
           correct = TRUE,
           message = "The speech marks inside the aes() function are not correct - R does not understand hospital to be a column in the data, but instead just as a text value/word. The fill is therefore not using the column hospital to distinguish groupings in the data."),
    answer("The user should add facet_wrap() to the end of the command.",
           correct = FALSE,
           message = "facet_wrap() creates small plots for each group indicated to it. It does not create stacked bars with different colors in one plot."),
    allow_retry = TRUE
  )
)

quiz(
  question_radio("What is the reason for the warning message in the Console?",
    answer("There are 33 rows missing a hospital value.",
           correct = FALSE,
           message = "Almost... "),
    answer("There are 33 rows missing a date_onset value.",
           correct = TRUE,
           message = "Yes, a histogram can only plot observations that have a value for the variable mapped to the x-axis."),
    answer("There are 33 rows missing either hospital or date_onset value",
           correct = FALSE,
           message = "For a histogram, the x-axis is most important and any missing values will be excluded out of necessity."),
    answer("33 rows had dates of onset that were more than 2 incubation periods after the last reported case, so they are not part of this outbreak.",
           correct = FALSE,
           message = "R is not an epidemiologist."),
    allow_retry = TRUE
  )
)
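
Once you have answered the questions above, the effect of the quotation marks can be checked on a small example. This is a minimal sketch, assuming {ggplot2} is installed. The toy linelist is hypothetical, and layer_data() is used only to count the fill colors each plot would draw.

```r
library(ggplot2)

# hypothetical toy linelist with two hospitals
linelist <- data.frame(
  date_onset = as.Date("2014-10-01") + c(0, 1, 2, 8, 9, 15),
  hospital   = c("A", "B", "A", "B", "A", "B")
)

# quoted: "hospital" is treated as a single constant text value
p_quoted <- ggplot(linelist, aes(x = date_onset, fill = "hospital")) +
  geom_histogram(bins = 3)

# unquoted: hospital is read as a column, giving one fill color per hospital
p_column <- ggplot(linelist, aes(x = date_onset, fill = hospital)) +
  geom_histogram(bins = 3)

# count the distinct fill colors in the first layer of each plot
n_fill_quoted <- length(unique(layer_data(p_quoted)$fill))
n_fill_column <- length(unique(layer_data(p_column)$fill))
c(quoted = n_fill_quoted, column = n_fill_column)
```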

Column adm3_name_res doesn't exist

knitr::include_graphics("images/error_missing_select.png", error = F)

quiz(
  question_radio("The user ran the cleaning command above, and then tried to produce the cross-tabulation of the clean data. What is the cause of the error message?",
    answer("The table command should be in its own section of the script",
           correct = FALSE,
           message = "Sections in the script are for readability purposes, but do not impact execution of the code."),

    answer("The column adm3_name_res is not present in the surv_raw dataset.",
           correct = FALSE,
           message = "This column is present in the raw data."),

    answer("The column adm3_name_res is not present in the surv dataset, because it is removed by the filter() line.",
           correct = FALSE,
           message = "filter() does not remove columns. In fact, the filter() line in this cleaning command uses the adm3_name_res column."),

    answer("The column adm3_name_res is not present in the surv dataset, because it is removed by the select() line.",
           correct = TRUE,
           message = "The select() line of the cleaning command lists the columns to keep in the new surv dataset. The adm3_name_res column is not listed, so it is removed. The tabyl function subsequently tries to use this column, but it is no longer present."),
    allow_retry = TRUE
  )
)

Invalid type of argument

knitr::include_graphics("images/error_dplyr_equals.png", error = F)

quiz(
  question_radio("Using the clean data, the user tried to make a summary table of cases and number of feverish cases by district. What is the cause of the error message?",
    answer("The user should have grouped by both district and fever.",
           correct = FALSE,
           message = "No, the user wants one row in the table per district. They were correct to use group_by() on the district column."),

    answer("The user should have used summarize() instead of summarise().",
           correct = FALSE,
           message = "Both of these spellings of the function work."),

    answer("The user should have added na.rm = TRUE to their sum() function.",
           correct = FALSE,
           message = "True, they should have included this to get accurate results, but this is not the cause of the error."),

    answer("They used the incorrect number of equals signs inside the sum() function when counting the fevers.",
           correct = TRUE,
           message = "Within the sum() function, the user wishes to count the number of times per district that the value in the column fever is 'yes'. Therefore, they are asking a question and should use double equals (==). The way this is written, it seems the user is trying to assign / make an affirmative statement that the column fever IS yes... so R is confused."),
    allow_retry = TRUE
  )
)
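
Once you have answered the question above, the corrected summary can be tried on a small example. This is a minimal sketch, assuming {dplyr} is installed; the toy data frame and its values are hypothetical.

```r
library(dplyr)

# hypothetical toy data with one missing fever value
surv <- data.frame(
  district = c("East", "East", "West", "West", "West"),
  fever    = c("yes", "no", "yes", NA, "yes")
)

fever_table <- surv %>%
  group_by(district) %>%
  summarise(
    cases       = n(),                             # rows per district
    fever_cases = sum(fever == "yes", na.rm = TRUE) # == asks a question for each row
  )

fever_table
```

The double equals returns TRUE or FALSE for each row, and sum() counts the TRUE values. The na.rm = TRUE argument stops a single missing value from turning the whole count into NA.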

Unexpected symbol

knitr::include_graphics("images/error_symbol_comma.png", error = F)

quiz(
  question_radio("The user is trying to use the clean data to produce a bar plot. What is the cause of the error message?",
    answer("The user should have used the raw data.",
           correct = FALSE,
           message = "No, the clean data is the correct dataset."),

    answer("The ggplot() command should close with a parenthesis after the data is assigned.",
           correct = FALSE,
           message = "The aesthetic mappings should also occur in the ggplot() function. This is not the source of the error."),

    answer("There should be a + symbol after surv, on line 54.",
           correct = FALSE,
           message = "In ggplot commands, the + symbol should be written only after the close of a function's parentheses."),

    answer("There should be a comma after surv, on line 54.",
           correct = TRUE,
           message = "There is a comma missing between the data= argument and the mapping= argument."),
    answer("The user needs to define an x-axis column in the mappings.",
           correct = FALSE,
           message = "No, a missing aesthetic would not produce an 'unexpected symbol' error. There is a comma missing between the data= argument and the mapping= argument."),
    allow_retry = TRUE
  )
)

argument is missing, with no default

knitr::include_graphics("images/error_missing_recode.png", error = F)

quiz(
  question_radio("The user is trying to clean values in the district column with recode(). What is the cause of the error message?",
    answer("The user should have used the function case_when()",
           correct = FALSE,
           message = "No, to do manual recoding, the function recode() is a valid option."),

    answer("The user needs to write .x = TRUE inside the recode() function",
           correct = FALSE,
           message = "No, this is not an argument of recode()."),

    answer("There is an extra parenthesis on line 36.",
           correct = FALSE,
           message = "The parenthesis on line 36 is correctly closing the mutate() that starts on line 32."),

    answer("The recode() function does not know which column to use.",
           correct = TRUE,
           message = "The recode() function requires a first argument which is the name of the column. So the correct line will be mutate(district = recode(district, ...))"),
    allow_retry = TRUE
  )
)
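
Once you have answered the question above, the corrected recode() syntax can be tried on a small example. This is a minimal sketch, assuming {dplyr} is installed; the toy data frame and the misspelled value "Mountan Rural" are hypothetical.

```r
library(dplyr)

# hypothetical toy data with one misspelled district
surv <- data.frame(
  district = c("Mountain Rural", "East II", "Mountan Rural")
)

# the column name is the first argument of recode(), followed by old = new pairs
surv_recoded <- surv %>%
  mutate(district = recode(district, "Mountan Rural" = "Mountain Rural"))

surv_recoded
```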

Reprex example 2

If you still have time, make another Training post using the "example_analysis2.R" script, also found in the "ebola/scripts/examples/" folder.

The script should look like this:

# Example script 2

# install and load packages
pacman::p_load(rio, here, janitor, tidyverse)

# import data
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

# try to convert column to class "Date"
surv_clean <- surv_raw %>% 
  clean_names() %>% 
  mutate(onset_date = ymd(onset_date))  

# epicurve
ggplot(data = surv_clean,
       mapping = aes(x = onset_date))+
geom_histogram()

This script does the following:

1) Loads packages
2) Imports the raw linelist data
3) Cleans the data by converting the onset date column to class Date, using mutate() and ymd()
4) Plots an epidemic curve (histogram)

No error message

Run the commands in "example_analysis2.R".

There was a warning message after the cleaning command:

All formats failed to parse. No formats found.

A minimal dataset

And the epidemic curve is empty!

How can you identify where the problem started?

Add commands to the bottom of the script to examine the onset_date column in the surv_clean dataset

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Use class() on the column to check the Class. For example:

class(surv_clean$onset_date)

Use range() on the column to check the maximum and minimum values. For example:

range(surv_clean$onset_date)

See that the cleaning command successfully changed the class of the column to Date... but the range (minimum, and maximum values) is entirely NA!

If you open surv_clean to look at the values, you will see that all of the date values were converted to NA during the cleaning process!
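
Counting the missing values is another quick check alongside class() and range(). This is a minimal sketch in base R; the vector onset_date is a hypothetical stand-in for surv_clean$onset_date in which every value is missing.

```r
# hypothetical column where every date is missing
onset_date <- as.Date(c(NA, NA, NA, NA, NA))

class(onset_date)         # the class of the column
range(onset_date)         # minimum and maximum values
sum(is.na(onset_date))    # number of missing values
mean(is.na(onset_date))   # proportion of missing values
```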

Make a "reprex" forum post to see if anyone can help! Below are some reminders of the steps...

Make a new script

Create a new R script and save it as "reprex2.R" in the "ebola/scripts/examples" folder.
Include only the commands needed to re-create your problem.

Typically, this would include:

1) A command to load packages
2) A command to import data
3) The few commands you ran to process the data, if relevant
4) Any commands that revealed, or likely caused, the problem/error

In this case, all the commands are necessary except the ggplot() command. If you choose to include it, that is OK, but it is not necessary.

# REPREX 2
# install and load packages
pacman::p_load(rio, here, janitor, tidyverse)

# import data
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

# try to convert column to class "Date"
surv_clean <- surv_raw %>% 
  clean_names() %>% 
  mutate(onset_date = ymd(onset_date))

# check the CLEANED date column class and date range
class(surv_clean$onset_date)
range(surv_clean$onset_date)

Make the minimal dataset

Nobody in the forum has access to "surveillance_linelist_20141201.csv", so you need to provide them with a minimal surv_raw dataset created using head(), select(), and dpasta().

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Load the {reprex} and {datapasta} packages by adding them to your pacman::p_load() function.

Begin a new command at the bottom of the "reprex2.R" script, starting with the RAW data.

Pipe to head() and select() to restrict the dataset to only a few rows and the relevant columns. To select the onset date column from the raw data, you may need to surround its name with backticks, because in the raw dataset the name contains a space.

Once you check this output, add a pipe to pass the data to dpasta(). This will print the standalone version of the minimal data into your script.

r fontawesome::fa("check", fill = "red") Click to see a solution (try it yourself first!)

We kept 3 columns, simply to give some context to the onset date values.

surv_raw %>% 
  head(5) %>%                             # take the top 5 rows only
  select(case_id, sex, `onset date`) %>%  # keep only these columns
  dpasta()                                # produce standalone code

Cut the data.frame() command that dpasta() produced and paste it in place of the import() function, so that surv_raw is created from the hard-coded values.

Then, verify that the same problem is reproduced using this minimal dataset

1) Click the "broom" icon in your Environment to clear all objects
2) Re-run the code in your script to make sure you are re-creating the same problem
3) Use the class() and range() functions to review the output dataframe - have the surv_clean date values been converted to NA? Open the dataset and look.

r fontawesome::fa("check", fill = "red") Click to see a solution (try it yourself first!)

# REPREX 2
# install and load packages
pacman::p_load(rio, here, janitor, tidyverse, datapasta, reprex)

# import data
surv_raw <- data.frame(
  stringsAsFactors = FALSE,
  case_id = c("694928","86340d","92d002","544bd1","6056ba"),
  sex = c("m", "f", "f", "f", "f"),
  onset_date = c("11/9/2014","10/30/2014","8/16/2014","8/29/2014","10/20/2014")
)

# try to convert column to class "Date"
surv_clean <- surv_raw %>% 
  clean_names() %>% 
  mutate(onset_date = ymd(onset_date))

# check the CLEANED date column class and date range
class(surv_clean$onset_date)
range(surv_clean$onset_date)

This means that the example is now completely independent from the original dataset. We have a minimal, reproducible example of the problem!

Make the reprex

Highlight all of the reprex code, and in the console run reprex_addin().

Select the following in the pop-up window:

1) "current selection"
2) "Github or Stack Overflow"
3) "append session info" (check box at the bottom)
4) Then press the "Render" button

If it does not work, empty your environment, highlight the necessary R code in your script for the reprex, and run reprex_addin() again in your Console.

Post the reprex in the forum

Go to community.appliedepi.org and enter the Training area. Create a "New Topic".

Write some text about your problem and goals
Give an informative title and tags
Thank anyone who decides to help you
Paste your reprex at the bottom

The reprex should already be on your clipboard, ready to paste (ctrl + v, cmd + v for macs, or right-click and paste). The preview will appear on the right side.

An answer

The scenario finishes...

Within an hour, an epidemiologist in another country has read the post, copied the reprex, and run the script on their computer.

They quickly identified that we used the function ymd() to convert the date column, but our raw data were written as Month-Day-Year (for example, "10/30/2014"). Therefore, we should have used the function mdy()! That is why the column was converted to class Date, but all the values became NA (none were understood).

In your "example_analysis2.R" script, try changing the ymd() to mdy() and see if the epidemic curve is created successfully...
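As a minimal sketch of what the responder found, the snippet below runs both functions on the five onset dates from the minimal dataset. The object names onset_raw, wrong, and right are only illustrative, and suppressWarnings() is used here simply to hide the parsing warning that ymd() raises.

```r
library(lubridate)

# the five onset dates from the minimal dataset (written month/day/year)
onset_raw <- c("11/9/2014", "10/30/2014", "8/16/2014", "8/29/2014", "10/20/2014")

# ymd() expects year-month-day, so none of the values are understood
wrong <- suppressWarnings(ymd(onset_raw))
class(wrong)        # the class is Date...
all(is.na(wrong))   # ...but check whether every value is NA

# mdy() matches the month/day/year order of the raw values
right <- mdy(onset_raw)
class(right)
range(right)
```

Checking both class() and range() matters here: the class alone looks correct even when every value has been lost.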

Mystery solved! You would reply to thank the responder and mark their answer as the "Solution".

End

Congratulations! Save the links to your posts. You will need to provide one link in order to get your certificate of completion for the course.

We encourage you to monitor the forum, and even offer help to others!

Extras

Answer someone else's post

You can also help people! Simply copy their reprex, paste it into a script on your computer, and run the commands. See if you can reply to a few of the training posts...

Post a reprex from your own data

Ideally, use a project from your work!

Create a new "intro_course" subfolder for your analysis and add an RStudio project. Then create a script and find an error or an unexpected result.

Post the reprex in the Applied Epi Community Forum using the same method as you learned today. This time, you can post in the R Code area (not the Training area), so that other people can help you.

If you want other example datasets to practice with, see below:

Other practice datasets

H7N9 Influenza Outbreak

Data: These data are stored here: "intro_course/learning_materials/extra_datasets/H7N9_china_2013_EN.csv". You can cut/paste the file into your RStudio project for this module.

These data comprise 136 cases of influenza A H7N9 in China, analysed by Kucharski et al. (2014). Data were collated by Adam Kucharski et al. from ProMed, WHO, FluTrackers, news reports and research articles. Transfer to R and its documentation in the {outbreaks} package was by Simon Frost (sdwfrost@gmail.com). The data were modified slightly by Applied Epi for training purposes.

Suggested objectives: (modify as you wish)

Import the data
Review the data
How many columns are there in the dataset?
How many rows are there in the dataset?
What class is the column “date_of_symptoms”?
What class is the column “age”?
What class is the column “province”?
Clean the data
Create a cleaning command using the %>% operator to link several cleaning functions together:
Clean the column names so that there are no spaces or special characters
Filter the rows to remove any cases from the province “Anhui”
Convert the column “date_of_symptoms” to class Date
In the column “sex”, convert “m” to “male” and convert “f” to “female”
Clean the column "age" and create a new column with age category
Look at the column “result”. Do any of the values need cleaning or correction?
Make any changes that are necessary so that the values are only "Death", "Recovered", and NA.
Make a table of counts and percents
Use the tabyl() function from the {janitor} package to make a table that shows the number of cases for each province that died and recovered.
Extra challenge: Can you add percents (by province) next to the counts?
Make a table with more detailed statistics
Use the group_by() and summarise() functions from the {dplyr} package (part of the {tidyverse}) to make a table that has the following columns:
- n_case (the number of cases for each province)
- n_death (the number of deaths in each province)
- max_sym (the latest date of symptom onset in each province)
- min_age (the lowest age present in each province)
Extra challenge: add a column for pct_death (the percent of cases who died in each province)
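If the group_by() and summarise() pattern is unfamiliar, the sketch below shows its general shape on a small toy data frame. The values in toy are invented for illustration and are not taken from the H7N9 dataset; only the column names follow the objectives above. The pct_death column is left out so that the extra challenge remains open.

```r
library(dplyr)

# toy data invented for illustration (NOT the real H7N9 values)
toy <- data.frame(
  province = c("Shanghai", "Shanghai", "Zhejiang", "Zhejiang", "Zhejiang"),
  result   = c("Death", "Recovered", "Death", NA, "Recovered"),
  age      = c(87, 27, 38, 67, 54),
  date_of_symptoms = as.Date(c("2013-02-19", "2013-02-27", "2013-03-07",
                               "2013-03-25", "2013-03-29"))
)

toy_summary <- toy %>% 
  group_by(province) %>%                               # one output row per province
  summarise(
    n_case  = n(),                                     # number of rows (cases)
    n_death = sum(result == "Death", na.rm = TRUE),    # count rows where result is "Death"
    max_sym = max(date_of_symptoms, na.rm = TRUE),     # latest symptom onset date
    min_age = min(age, na.rm = TRUE)                   # lowest age
  )

toy_summary
```

Note the use of na.rm = TRUE: without it, a single missing value in a group would make that group's sum(), max(), or min() return NA.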

The {outbreaks} R package contains 22 publicly available outbreak datasets. The package is maintained by RECON.

Install the {outbreaks} package from CRAN, then run this code to see a list of the datasets:

data(package="outbreaks")

You can save one of the datasets to your R environment by referencing it from the {outbreaks} package like this:

# save the measles outbreak dataset as "measles" in your R environment
measles <- outbreaks::measles_hagelloch_1861

Below are two datasets we propose as options.

1861 Measles outbreak

These data comprise 188 cases of measles among children in the German city of Hagelloch, 1861. The data were originally collected by Dr. Albert Pfeilsticker (1863) and augmented and re-analysed by Dr. Heike Oesterle (1992).

You can read the description of each column in the Help pane, using this command:

?outbreaks::measles_hagelloch_1861

Suggested objectives: (modify as you wish)

Import the dataset
Clean the dataset, including ensuring appropriate column classes and the creation of an age category column
Create an epidemic curve using an appropriate date column
Create a demographic pyramid of the outbreak
Create descriptive tables of gender, class, complications, and age category
Consider the use of {dplyr}'s group_by() and summarise() functions
Consider the use of {flextable} or {gtsummary} to make nice HTML tables for presentation
Summarise your analysis in an R Markdown report or PowerPoint slide deck for presentation

2015 MERS linelist and contact tracing

These datasets correspond to the initial information collected by the Epidemic Intelligence group at the European Centre for Disease Prevention and Control (ECDC) during the first weeks of the 2015 outbreak of Middle East respiratory syndrome coronavirus (MERS-CoV) in South Korea. The data were used to follow the daily evolution of this outbreak using publicly available information. This dataset is meant for teaching purposes; it represents neither the final outbreak investigation results nor a consolidated and complete description of the transmission chain.

Read a complete description of the data, and the columns, by running:

# Read about South Korea MERS dataset
?outbreaks::mers_korea_2015

Data: There are two dataframes:
1) A linelist of MERS-CoV cases and their attributes

# save case linelist to Environment
mers_cases <- outbreaks::mers_korea_2015$linelist

2) A data frame describing the relationships between MERS-CoV cases

# save contacts linelist to Environment
mers_contacts <- outbreaks::mers_korea_2015$contacts

Suggested objectives

Import and clean the respective datasets
Create new columns that display the time difference in days between important dates
Join the linelist and contacts datasets, and explore relationships between their respective attributes.
Plot an epidemic curve of the cases
Plot bar plots, a demographic pyramid, or other plots of the case attributes
Summarise your findings in an R Markdown report
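For the time-difference and join objectives, a minimal sketch on toy data is shown below. The data frames toy_cases and toy_contacts and all of their column names (id, dt_onset, dt_report, from, to) are assumptions made for illustration; check the Help page of the real dataset for the actual column names before adapting this.

```r
library(dplyr)

# toy stand-ins invented for illustration; real column names may differ
toy_cases <- data.frame(
  id        = c("c1", "c2", "c3"),
  dt_onset  = as.Date(c("2015-05-11", "2015-05-18", "2015-05-20")),
  dt_report = as.Date(c("2015-05-19", "2015-05-20", "2015-05-20"))
)

toy_contacts <- data.frame(
  from = c("c1", "c1"),   # source case
  to   = c("c2", "c3")    # case they are linked to
)

# new column: days between symptom onset and report
toy_cases <- toy_cases %>% 
  mutate(days_onset_report = as.numeric(dt_report - dt_onset))

# attach the attributes of the source case ("from") to each contact pair
toy_pairs <- toy_contacts %>% 
  left_join(toy_cases, by = c("from" = "id"))

toy_pairs
```

Subtracting one Date column from another gives a difference in days, and as.numeric() converts it to a plain number that is easier to summarise or plot.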

Malaria case counts

Data: This is a fictional dataset generated by Applied Epi staff. These data are stored here: "intro_course/learning_materials/extra_datasets/malaria_facility_case_counts.rds". You can cut/paste the file into your RStudio project for this module.

Each row represents malaria case counts for a particular “facility-day”. The case counts (the right-most columns) are stored in a “wide” format such that the information for every age group on a given facility-day is stored in a single row.

Suggested objectives:

Import and clean the data
Plot an epidemic curve using the "total" counts column
Create a "tidy" data frame that is pivoted longer, so that each row represents the case count for a specific age group at a specific facility on a specific day.
Plot an epidemic curve that shows the contribution of each age group (e.g. "stacked bars")
Assess reporting delays by calculating the difference between values in the columns "data_date" (date the data were collected) and "submitted_date" (date the data were submitted to the surveillance system).
Create a demographic pyramid (using age_pyramid() with the argument count = for showing pre-aggregated counts)

Click to read a hint

For the epidemic curves, remember the difference between geom_histogram() (bar height reflects the number of rows in the data) and geom_col() (the aes(y = ) argument can make bar height reflect an aggregated count value).

For pivoting, recall that you can use {tidyselect} helpers like starts_with() to select all columns that start with the word "malaria". You can get further help on pivoting from this chapter of the Epi R Handbook.
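The two hints above can be sketched together on a small toy data frame. The values and the exact column names in toy_wide (for example malaria_0_4) are invented for illustration and may not match the real file; the new column names age_group and cases are also just one possible choice.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# toy wide data invented for illustration; real column names may differ
toy_wide <- data.frame(
  facility     = c("A", "A", "B", "B"),
  data_date    = as.Date(c("2020-08-01", "2020-08-02", "2020-08-01", "2020-08-02")),
  malaria_0_4  = c(3, 5, 2, 0),
  malaria_5_14 = c(1, 2, 4, 1),
  malaria_15   = c(6, 4, 3, 2)
)

# pivot longer: one row per facility, day, and age group
toy_long <- toy_wide %>% 
  pivot_longer(
    cols      = starts_with("malaria"),   # all columns whose names begin with "malaria"
    names_to  = "age_group",              # old column names go here
    values_to = "cases"                   # old cell values go here
  )

# geom_col(): bar height comes from the aggregated counts mapped to y
epicurve <- ggplot(toy_long, aes(x = data_date, y = cases, fill = age_group)) +
  geom_col()

epicurve
```

Because the counts are already aggregated, geom_histogram() would be misleading here: it would count rows (always one per facility, day, and age group) rather than cases.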

New RStudio project

After you have selected a dataset for this module, create a new subfolder in "intro_course" with an RStudio project, and give it an appropriate name for the data you selected.

Look in the upper-right corner of RStudio to ensure that you are working in the correct project.

Explore your dataset

Take 10-15 minutes to do the following:

Create subfolders in your RStudio project to hold data, scripts, and outputs
Save your dataset(s) in the data subfolder, as appropriate
Open a new R script, and save it in the scripts subfolder
Write a command to load/install appropriate R packages
Write a command to import your data, giving it an appropriate name
Look at your data frame and conduct basic exploratory analysis

Click to read a hint

Use pacman::p_load() to load and install R packages

Use import() and here() to locate your data within the RStudio project, and import it into R

Use some of the functions below to understand your data:

skim() from the {skimr} package
tabyl() from the {janitor} package
range() and summary() from {base} R
ggplot() to make bar plots or histograms
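As a rough sketch of what a first look might involve, the snippet below applies a few of these functions to a toy data frame. The object toy and its columns are invented for illustration; replace them with your own data frame and column names.

```r
library(janitor)

# toy data invented for illustration
toy <- data.frame(
  sex        = c("m", "f", "f", NA, "f"),
  age        = c(12, 45, 33, 27, 60),
  date_onset = as.Date(c("2021-01-03", "2021-01-10", "2021-01-04",
                         "2021-02-01", "2021-01-20"))
)

# counts and percents for a categorical column (missing values get their own row)
sex_counts <- tabyl(toy, sex)
sex_counts

# earliest and latest dates
range(toy$date_onset)

# min, quartiles, mean, and max for a numeric column
summary(toy$age)
```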

Choose a problem

As you work, choose a problem to post in the Community forum. For this exercise, one of the following is best:

An error message whose cause you do not understand
An unexpected result whose cause you do not understand
A task you have tried, but are not sure how to complete

Even if you have no problem, you can make up a realistic problem (just to practice posting).

appliedepi/introexercises documentation built on April 22, 2024, 1:01 a.m.
