library(introexercises) # get data for exercises
library(learnr)         # create lessons from rmd
library(gradethis)      # evaluate exercises
library(dplyr)          # wrangle data
library(flair)          # highlight code
library(ggplot2)        # visualise data
library(lubridate)      # work with dates
library(fontawesome)    # for emojis
library(janitor)        # clean data
# library(RMariaDB)     # connect to sql database

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60,
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE
)

# ## event recorder ---------------------------------------------------------------
# ## see for details:
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
#
# ## connect to your sql database
# ## (credentials are read from environment variables, never written in the script)
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user = Sys.getenv("userid"),
#                        password = Sys.getenv("pwd"),
#                        dbname = 'excersize_log',
#                        host = "144.126.246.140")
#
#
# ## define a function to collect data
# ## note that tutorial_id is defined in YAML
# ## you could set the tutorial_version too (by specifying version:) but use package version instead
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
#
#   ## define a sql query
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id,
#                        tutorial_version,
#                        date_time,
#                        user_id,
#                        event,
#                        section,
#                        label,
#                        question,
#                        answer,
#                        code,
#                        correct)
#                        VALUES('", tutorial_id, "',
#                        '", tutorial_version, "',
#                        '", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label, "',
#                        '", paste0('"', data$question, '"'), "',
#                        '", paste0('"', data$answer, '"'), "',
#                        '", paste0('"', data$code, '"'), "',
#                        '", data$correct, "')",
#                        sep = '')
#
#   # Execute the query on the sqldtbase that we connected to above
#   rsInsert <- dbSendQuery(sqldtbase, event_log)
#
# }
#
# options(tutorial.event_recorder = recorder_function)
# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)
# data prep --------------------------------------------------------------------
surv_raw <- rio::import(system.file("dat/surveillance_linelist_20141201.csv", package = "introexercises"))
tests <- rio::import(system.file("dat/testing_data.csv", package = "introexercises"))
Welcome to the course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation and the leading provider of R training, support, and tools to frontline public health practitioners.
knitr::include_graphics("images/logo.png", error = F)
This exercise focuses on setting up R for the first time and an introduction to basic R coding.
This exercise guides you through tasks that you should perform in RStudio on your local computer.
There are several ways to get help:
1) Look for the "helpers" (see below)
2) Ask your live course instructor/facilitator for help
3) Schedule a call with an instructor for "Course Tutoring"
4) Post a question in Applied Epi Community
Here is what those "helpers" will look like:
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Here you will see a helpful hint!
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
linelist %>%
  filter(
    age > 25,
    district == "Bolo"
  )
Here is more explanation about why the solution works.
Answering quiz questions will help you to comprehend the material. The answers are not recorded.
To practice, please answer the following questions:
quiz( question_radio("When should I view the red 'helper' solution code?", answer("After trying to write the code myself", correct = TRUE), answer("Before I try coding", correct = FALSE), correct = "Reviewing best-practice code after trying to write yourself can help you improve", incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer." ) )
question_numeric( "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?", answer(10, message = "Try not to worry, we will help you succeed!", correct = T), answer(9, message = "Try not to worry, we will help you succeed!", correct = T), answer(8, message = "Try not to worry, we will help you succeed!", correct = T), answer(7, message = "Try not to worry, we will help you succeed!", correct = T), answer(6, message = "Ok, we will get there together", correct = T), answer(5, message = "Ok, we will get there together", correct = T), answer(4, message = "I like your confidence!", correct = T), answer(3, message = "I like your confidence!", correct = T), answer(2, message = "I like your confidence!", correct = T), answer(1, message = "I like your confidence!", correct = T), allow_retry = TRUE, correct = "Thanks for sharing. ", min = 1, max = 10, step = 1 )
Please email contact@appliedepi.org with questions about the use of these materials.
In this exercise you will:
If you were not able to complete the installations before the course, alert one of the instructors now.
Please have the following software installed on your computer prior to the start of the course.
1) R (most recent version)
2) RStudio (most recent version)
3) RTools (only needed for Windows machines, and not strictly necessary)
If you have difficulty, consult an instructor or see the installation guide in the course folder for tips (see next step).
If you have not already done so, download the zipped course folder at this link.
Unzip/extract the folder and save it on your computer's desktop - not on a shared drive.
To "unzip" a folder once it is downloaded, right-click on the folder and select "Extract All". If offered a choice of location to save the unzipped folder, save it to your desktop.
The folder structure should look like this:
* r emo::ji("folder") Desktop
    * r emo::ji("folder") intro_course
        * r emo::ji("folder") ebola
        * r emo::ji("folder") covid
        * r emo::ji("folder") module1
        * r emo::ji("folder") learning_materials
        * r emo::ji("document") packages_to_install.R

You should have already installed the R packages for the course.
If you did not already do this, follow these instructions:
1) Go into the "intro_course" folder, and open the file "packages_to_install.R". If it is your first time opening an R script on your computer, you may need to specify that you want to open the file using RStudio.
* r emo::ji("folder") Desktop
    * r emo::ji("folder") intro_course
        * r emo::ji("folder") ebola
        * r emo::ji("folder") covid
        * r emo::ji("folder") module1
        * r emo::ji("folder") learning_materials
        * r emo::ji("document") packages_to_install.R

2) Follow the instructions at the top of the script. Highlight ALL the text in the script and then press the "Run" button located near the top-center of RStudio. Alternatively, highlight all the text and then press the keys Ctrl and Enter.
This script will take several minutes to install most of the R packages that you need for this course.
If you have not already done so, open RStudio now.
R is a language used for statistical computing and graphics, developed in 1991 and based on the language S - read more of R's history here. R is different from other programming languages in that its original purpose is data analysis.
R is distinct from other data analysis languages because:
Why does it matter that R is "open-source"?
R is not a company, and there is no "headquarters" of R. Because R is "open-source", its tools are created and vetted by its millions of users. This decentralisation of power is inherently democratising, and allows R to rapidly respond to emergent needs (e.g. the COVID-19 pandemic). R is trusted and used by many major institutions: for example, the US Food and Drug Administration (FDA).
The base of the R software is governed by a "core group" and updated every few months. Each R version is assigned a number (like "R version 4.1.2 (2021-11-01)") and a name to make it easier to remember, like "Bird Hippie". In the future, you can easily update your version of R by re-downloading it. Often, R and RStudio will prompt you to download newer versions when they are available.
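To check which version of R is installed on your computer, you can ask R directly. The short sketch below uses objects that are built into R; the values printed will depend on your own installation.

```r
# Print the full version string of the R installation in use
R.version.string

# Print only the version nickname
R.version$nickname
```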
RStudio is an interface for using R. While it is possible to open and use R directly, it is much more common to use R through an Integrated Development Environment (IDE) such as RStudio, which allows a more friendly experience and easier file organization. "RStudio" is offered for free by the company Posit.
When you open RStudio, it will automatically find and use the R installed on your computer. There is no need to open both programs.
You can think of R as the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (the frame, seats, accessories, etc.).
Now, please check your understanding of R and RStudio by answering the questions below:
quiz(caption = "Quiz - R and RStudio", question("What is RStudio?", allow_retry = TRUE, answer("An application that makes it easier to use R.", correct = TRUE, message = "RStudio is an 'Integrated Development Environment' (IDE) that makes it easier to write, use, debug, and save R code."), answer("A spreadsheet program like Microsoft Excel."), answer("Another name for R", message = "R and RStudio are two separate things. R is a language, like English or French. RStudio can be thought of as a program that helps you use the language, like how a word processing program helps you write."), answer("An application that lets you use R without writing any code", message = "You still have to write code - and that's a good thing! Code provides a reproducible record of your work, which is best practice.") ), question("Is RStudio free to download and use?", answer("Yes", correct = TRUE, message = "The RStudio IDE is a software offered for free by the company Posit. There are other, less common, IDEs to use R, such as Tinn-R and Emacs."), answer("No", message = "RStudio IDE is free and open-source.") ), question("Do you need to install both R and RStudio?", answer("Yes", correct = TRUE, message = "While it is possible to work in R without an interface like RStudio, this is not recommended for beginners."), answer("No", message = "While it is possible to work in R without an interface like RStudio, this is not recommended for beginners.") ), question("Once both programs are installed, which one should you open to begin working?", answer("RStudio", correct = TRUE, message = "Opening RStudio will automatically start R."), answer("R", message = "For beginners, it is best to work in R *through* RStudio. Open RStudio.")) )
Look around RStudio. Observe the major RStudio panes. You should see 3 or 4 panes:
1) The Console (left or lower-left)
2) The Environment (upper-right)
3) The Files, Plots, Packages, Help, and Viewer panes (lower-right)
4) The Source pane (upper-left)

If you see only the first 3 panes, click File -> New file -> R script to open a new R script and achieve the classic look below.
Take a few minutes to familiarize yourself with the locations of the various panes, using the diagram below.
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/rstudio_overview.png", error = F)
The Source Pane (upper-left)
This pane is a space to edit and run your scripts, which is where you write R commands. This pane also displays datasets. For Stata users, this pane is similar to your Do-file and Data Editor windows.
The R Console Pane (lower-left)
This is the R software itself - the “engine” that actually runs commands. Non-graphic outputs and error/warning messages appear here. You can type commands directly into the R Console, but they are not saved, as they are when run from a script. If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.
The Environment Pane (upper-right)
This pane shows objects available to you, which can include datasets or specific values you have saved for later use (e.g. a specific "epiweek" number). In Stata, this is most similar to the Variables Manager window.
Files, Plots, Packages, Help and Viewer Pane (lower-right)
This pane includes several tabs. The Files tab is a browser to open, rename, or delete files. The Plots tab displays graphs and plots, whereas interactive outputs display in the Viewer tab. The Help tab displays documentation and help files. The Packages tab allows you to install, update, and load/unload R packages. This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.
For more detail on RStudio capabilities, download this PDF: RStudio IDE cheatsheet.
When you first install RStudio, it is important to adjust one default setting.
Please follow these steps in your RStudio session now:
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/RStudio_tools_options_1.png", error = F)
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/options_rdata.png", error = F)
Why do we suggest this? In the long term, you will become a better coder if RStudio begins empty, with a "clean slate", each time you open it. This forces you to write complete code that never relies on what you did in a previous session. This makes your analysis more "reproducible" and avoids headaches when you share the code with others.
Next, you will set up an RStudio project.
An RStudio project is a self-contained and portable R working environment - effectively a folder for all the files associated with a distinct project (data files, R scripts, outputs, etc.).
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/project_briefcase.png", error = F)
If you do not work within an RStudio project, it will be harder to organize files, import data, and have your scripts used by people on other computers.
During this course, you will create at least 3 RStudio projects inside the "intro_course" folder:
1) A project for Module 1 (practice)
2) A project for analysis of an Ebola outbreak
3) A project for analysis of a COVID-19 outbreak
Please follow these steps to create an RStudio project for Module 1:
1) Open RStudio (ensure that you open RStudio and not just R).
2) In RStudio, in the top left click File -> New Project. In the pop-up window, select "Existing directory".
knitr::include_graphics("images/create_project.png")
3) Create the project in the "intro_course/module1" subfolder
* Click "Browse" and navigate to the "intro_course" folder that you downloaded and unzipped earlier (probably saved on your Desktop) and then into the "module1" subfolder.
* Click "Create project" (RStudio may briefly close and re-open)
Voila! This will be the project for your work in this first module.
If you are working in an RStudio project, you will see the name of the project in the upper-right corner of RStudio. If you are not in an RStudio project, it will read "Project: (None)". What do you see in your RStudio?
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/Rproject_dropdown.png", error = F)
r fontawesome::fa("window-restore", fill = "darkgrey")
Minimize RStudio and open your computer's folder navigator (e.g. Windows Explorer). Navigate to the "module1" folder. The contents of the folder should now look like this:
knitr::include_graphics("images/new_r_project_explorer.png", error = F)
In the folder, you should see a small file with an icon that looks like an "R box" - this is the RStudio project file (.Rproj).
To open the project next time, just double-click this project file. RStudio will open, and all your files for this project will be at-the-ready.
Observe that in RStudio, in the "Files" pane (lower-right), the project's contents are also visible.
knitr::include_graphics("images/new_r_project.png", error = F)
You can read more about RStudio projects in this chapter of the Epi R Handbook.
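You can also confirm from within R that the project is active. This is a minimal sketch using two base R functions: when an RStudio project is open, the working directory is the project folder, so on your computer the printed path should end in "module1".

```r
# Print the current working directory (the project folder when a project is open)
getwd()

# List the files and folders inside the working directory
list.files()
```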
If you have not already done so, open a new R script by clicking File -> New file -> R script.
knitr::include_graphics("images/new_script.jpg", error = F)
You should see an empty space appear in the upper-left of RStudio. This space is the R script.
knitr::include_graphics("images/RStudio_new_script.png", error = F)
Currently, the script is not saved. Click the save icon above the script, or click File -> Save As.
Name the script as "module1_script.R" and ensure it saves in the "module1" folder.
Note that the file extension for an R script is ".R". In working with R, you will encounter other extensions, but remember that this one is for R scripts.
r fontawesome::fa("eye", fill = "darkblue")
Observe that your script should now also appear in the "Files" pane (lower-right of RStudio) in the "module1" folder.
quiz( question("Which of these file names is an R script?", allow_retry = TRUE, answer("survey_analysis.Rproj", message = "No, this extension signifies an R project. You could click this to open RStudio and work on the project."), answer(".Rhistory", message = "No, .Rhistory is a special file that saves a record of commands and outputs. It is rarely viewed."), answer("cholera_plots.Rmd", message = ".Rmd signifies an 'R markdown' script, which is not a standard R script. You'll learn about R markdown later in the course."), answer("measles.R", correct = TRUE, message = "Yes, the .R signifies that this is an R script."), answer("pirate_ship.arrrrr", message = "No, this is just silly!") ) )
A script is a place to write commands (instructions) for R. A typical R script for public health might include sections like:
In the example below, the green text and hash (#) symbols are "comments" or "notes" for the reader, or used to delineate sections of the script. The black text is R code commands.
DO NOT TYPE THE EXAMPLE CODE BELOW INTO YOUR R SCRIPT.
Simply review the format, focusing on how the # symbol is used for commenting or creating section headings. You will have the opportunity to write scripts like this later in the course.
knitr::include_graphics("images/example_script.png", error = F)
r fontawesome::fa("terminal", fill = "black")
Let's do some coding!
To get comfortable running commands, let's begin with the simplest use of R: as a calculator.
Below are common mathematical operators in R. These are often used to perform addition, division, to create new columns in datasets, etc. Spaces around the operators will not affect the command, but make the code more readable.
| Purpose             | Example in R |
|---------------------|--------------|
| addition            | 2 + 3        |
| subtraction         | 2 - 3        |
| multiplication      | 2 * 3        |
| division            | 30 / 5       |
| exponent            | 2\^3         |
| order of operations | ( )          |
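As an illustration, here is how the operators in the table behave when run as commands. The last two lines use numbers chosen only for this sketch, to show how parentheses change the order of operations.

```r
2 + 3      # addition: returns 5
2 - 3      # subtraction: returns -1
2 * 3      # multiplication: returns 6
30 / 5     # division: returns 6
2^3        # exponent: returns 8

# Parentheses change the order of operations
2 + 3 * 4    # multiplication happens first: returns 14
(2 + 3) * 4  # addition happens first: returns 20
```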
As shown in the demonstration, you can run R commands through the console the following ways:
knitr::include_graphics("images/RStudio_run_console.png", error = F)
The [1] before the 62 answer informs you that this is the first (and in this case the only) output.
You can type into the script and then run the command in one of these ways:
Place your cursor on the line in the script, and press the "Run" button.
knitr::include_graphics("images/RStudio_run_script.png", error = F)
Note that the results will always appear in the Console, even if the command is run from the script.
Type some simple mathematical commands into the R script. Try to run the commands in ALL of the ways listed above. Which is your preferred method to run a command?
Now, run a command in your R script to find the solution to the following simple math problem:
quiz(caption = "R as a calculator", question_numeric( "What is the sum of 12, -99, 2, 147, and 29?", answer(91, message = "Excellent!", correct = T), allow_retry = TRUE, correct = "Correct, nice work.", min = 1, max = 3000, step = 1 ) )
If you are struggling to write the code, review the hint below! The solution is also provided if needed.
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Use the addition operator + between the numbers, place the cursor on that line of the script, and press the "Run" button in the upper-right of the script. Don't forget that the second value is NEGATIVE 99.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
12 + -99 + 2 + 147 + 29
Let's try to develop some code for a question more relevant to public health:
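You are managing a COVID-19 testing campaign across 3 sites, and you need to place a supply purchase order for rapid tests for next month. Based on the following information, how many tests do you need to order for next month?

* Site 1 uses 200 tests per month
* Site 2 uses 550 tests per month
* Site 3 has 2 sub-sites that each need 925 tests per month
* You should order 10% extra as a buffer against higher demand
* You have 420 tests left over from the previous month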
Write and run a command in your script to calculate how many tests you need to order for next month. If you are struggling to start, review the hint below first!
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Use parentheses ( ), the asterisk multiplication operator, and the addition + and subtraction - operators. First, add together the known needs (200, 550, and two times 925). Wrap that all in parentheses, because you need to multiply that sum by the desired buffer (110%, but in decimal form). That total should also be wrapped in parentheses, because you need to then subtract the extra left-over from the previous month (which does not need to be purchased).
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
((200 + 550 + (925 * 2)) * 1.1) - 420
quiz(caption = "Supply chain exercise", question_numeric( "How many tests should you order for the coming month?", answer(2440, correct = T), allow_retry = TRUE, correct = "Correct, nice work.", incorrect = "Try again, or view the hint above. Think about order of operations...", min = 1, max = 3000, step = 1 ), question( "Where in RStudio was the answer printed?", allow_retry = TRUE, answer("The R Console, in the lower-left", correct = TRUE), answer("The Plots pane, in the lower-right", message = "No it was not printed to the Plots pane. Check the Console in the lower-left."), answer("In the R Script", message = "No, it was not printed in the R Script. Look below at the Console pane.") ) )
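To see why the parentheses in the solution matter, the sketch below builds the same calculation one step at a time and then compares it with a version written without parentheses. It is intended as an illustration of order of operations, not as the only way to write the command.

```r
# Step 1: total monthly need across the three sites
200 + 550 + (925 * 2)                  # prints 2600

# Step 2: add the 10% buffer to that total
(200 + 550 + (925 * 2)) * 1.1          # prints 2860

# Step 3: subtract the tests left over from last month
((200 + 550 + (925 * 2)) * 1.1) - 420  # prints 2440

# Without parentheses, the buffer is applied only to 925 * 2
200 + 550 + 925 * 2 * 1.1 - 420        # prints 2365
```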
r fontawesome::fa("pen", fill = "brown")
It is important to write notes in your script so that other people (and you in the future) can understand it!
A "hash" symbol (#) deactivates any text written to the right of it, and is used to insert comments throughout the script.
For example:
# This is a comment in the script
# This is another comment
2 + 2 # this is a comment after some active R code
Please update your script to include the following:
Future readers of your code will thank you!
See a possible solution below.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
Note that author, date, and contact email should reflect your own details (e.g. your name, today's date, and your contact details).
# Purpose: Calculate monthly order for testing supplies
# Author: Neale Batra
# Last updated: 2 February, 2024
# Contact email: contact@appliedepi.org

# Site 1 uses 200 per month
# Site 2 uses 550 per month
# Site 3 has 2 sub-sites that each need 925 tests per month
# Order 10% extra as a buffer against higher demand
# At the end, subtract the number of tests remaining from last month
((200 + 550 + (925 * 2)) * 1.1) - 420
R allows you to store objects for later use. They are stored in your R Environment, the pane in the upper-right of RStudio.
Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object. Each object is assigned a name of your choice, which can be referenced in later commands. In any script you may create ("define") and re-define hundreds of objects!
Some examples:
* epiweek, with the value 14
* city, with the value "Kigali"
* run_report, with value either TRUE or FALSE
* linelist or hospital_data
* my_colors
We will explore the various types of objects later, but for now let's practice defining simple objects.
Create objects by assigning them a value with the "assignment operator": <-
You can think of the assignment operator <- as the words “is defined as”.
This operator looks like an arrow. It takes the value on the right side and assigns it to the name on the left. Assignment commands generally follow a standard order:
object_name <- value (or calculation/process that produces a value)
By running the command with the assignment operator, you can create an object, or by re-running the command you can re-define the object with a new value.
Tip: The keyboard shortcut to create the <- is Alt and - (Windows) or Option and - (Mac).
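As a sketch of the pattern, the commands below define three of the example objects named earlier and then re-define one of them. This is for reading only; there is no need to run it, because the next step expects your Environment to be empty.

```r
# Define three objects using the assignment operator
epiweek    <- 14         # a number
city       <- "Kigali"   # text, written inside quotation marks
run_report <- TRUE       # a logical value (TRUE or FALSE)

# Re-define an object by running a new assignment with the same name
epiweek <- 15
```

After the last command, epiweek holds the value 15 and the earlier value of 14 is gone.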
r fontawesome::fa("eye", fill = "darkblue")
Look at your "Environment" pane in RStudio. Right now, there should be no entries in the Environment pane, because we have not defined any objects yet.
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/empty_environment.png", error = F)
Now, write the code below in your R script, and run the command.
Remember that to run a command in your script, either highlight the whole command or place your cursor in the command, and click "Run".
Please actually type out the command (don't just copy/paste!) to get familiar with typing the R commands!
confirmed_cases <- 34
Notice that the object confirmed_cases is now in the Environment, with the value of 34. Now, we can run other commands using the object confirmed_cases, and R will know to use the value 34.
quiz( question_checkbox("When you ran the command above to define an object, what happened in RStudio (select ALL that are correct)?", answer("A new stored object appeared in the Environment pane, named 'confirmed_cases', with a value of 34", correct = TRUE, message = ""), answer("The command that was run appeared in the Console (lower-left) pane", correct = TRUE, message = "Note that only the command appeared in the R Console, not any calculated output"), answer("34 appeared in the Console as calculated output", message = "This command simply assigns a value to an object name, it does not ask R to print the result of any calculation"), allow_retry = TRUE ) )
r fontawesome::fa("pen", fill = "brown")
Notes about naming of objects:
r emo::ji("cross mark") my object name

r emo::ji("cross mark") Having both dataset and Dataset

r emo::ji("cross mark") 2nd_wave_of_cases_from_Santa_Clara_County

r emo::ji("check") cases_zambia

r emo::ji("check") linelist_raw

r emo::ji("check") lab_20140216
Note: in R literature, you may often see people using df as an object name. This is shorthand for "data frame" (a dataset with columns and rows), the type of object they are saving.
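The short sketch below illustrates two of the naming problems shown above. The values 10 and 20 are arbitrary, and make.names() is a base R helper used here only to show how R would alter a name that is not valid as written.

```r
# R is case-sensitive: these are two separate objects
dataset <- 10
Dataset <- 20

dataset == Dataset   # returns FALSE, because the two objects hold different values

# A name containing spaces is not valid; make.names() shows an altered, valid version
make.names("my object name")   # returns "my.object.name"
```

Having two objects whose names differ only by a capital letter is allowed by R, but it is easy to confuse them when reading or editing a script.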
Now, run a command of only the object name confirmed_cases (look at the R Console pane for the output):
confirmed_cases
This command asked R to print (also called "return") the value assigned to the object confirmed_cases. The value 34 was printed to the Console.
This raises a critical point that you must understand as a beginner R user - there are two fundamental types of R commands:
1) Saving commands, which use the <- to save an object.

Some object in your Environment will be created or modified. The record of the command will be printed to the Console, but not any value output.

2) Printing commands, which print an output to the Console (and do not use the <-).

R will print the command and a value to the Console, but not make any lasting changes to any object in the Environment.
Always ask yourself: "What is my command asking R to do? Print? or Save?". We will reinforce this with examples throughout the course.
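A minimal sketch of the difference is shown below. The object name deaths and its value are hypothetical, chosen only for this illustration.

```r
# Saving command: creates the object deaths, nothing is printed
deaths <- 5

# Printing command: prints 5 to the Console, changes nothing
deaths

# Printing command: prints 6, but deaths itself still holds 5
deaths + 1

# Saving command: deaths is re-defined as 6, nothing is printed
deaths <- deaths + 1
```

Only the two commands that use <- change what is stored in the Environment.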
Create an object named suspect_cases and assign the value 12. Run this command, then check that the object suspect_cases has been created in your Environment with the assigned value of 12.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
suspect_cases <- 12
Now, write and run the following command. See how you can now reference the values by calling only the assigned names.
total_cases <- confirmed_cases + suspect_cases
quiz( question("Was this command a PRINTING command, or a SAVING command?", answer("Printing", message = "No, there was an arrow used, and NO output number was printed to the Console."), answer("Saving", message = "Yes, there was an arrow used, and a new object was saved to the environment.", correct = TRUE), allow_retry = TRUE ), question("When the above command is run, did the value of total_cases print in the R console?", answer("Yes", message = "This command used the assignment operator, and so only asked R to save the value, not print it to the console."), answer("No", correct = TRUE, message = "Correct, this command used the assignment operator, and so only asked R to save the value, not print it to the console."), allow_retry = TRUE ), question("How can you know the number of total cases (check all that apply)?", answer("Look at the value of total_cases in the Environment", correct = TRUE), answer("Yell at R until it tells me", correct = FALSE), answer("Run a command of only total_cases, to print its current value", correct = TRUE), allow_retry = TRUE ), question("What is the number of total cases?", answer("34", message = "You have selected the number of CONFIRMED cases only"), answer("12", message = "You have selected the number of SUSPECT cases only"), answer("46", correct = TRUE, message = "You have successfully selected the total number of cases"), allow_retry = TRUE ) )
What happens if you receive news that there are 10 additional confirmed cases?
If you edit the first command to read confirmed_cases <- 44, does the value of confirmed_cases in the RStudio Environment pane immediately change to 44? (No.) Does the value of total_cases immediately change to 56? (Also no.)
quiz( question_checkbox("Select all the steps that must happen for the value of total_cases to be updated to 56?", answer("Run the command: confirmed_cases <- 44", message = "Yes, first the confirmed_cases must be re-defined", correct = T), answer("Re-run the command: total_cases <- confirmed_cases + suspect_cases", message = "Yes, second, the total must be updated with the new number of confirmed cases", correct = T), answer("Give your course instructor a gift", message = "No, bribing your course instructor will not make R do magic."), allow_retry = TRUE ) )
If you change a written value in your script, it does not automatically update the rest of your script, nor does it change any values stored in R!
You must re-run the commands in order for the changes to be registered by R.
In this case, you must re-run two commands to update the value of total_cases (and they must be run in the correct order!)
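The sequence can be sketched as follows, using the same objects and values as in this section. Run from top to bottom, it shows that total_cases changes only when its own command is re-run.

```r
confirmed_cases <- 34
suspect_cases   <- 12
total_cases     <- confirmed_cases + suspect_cases
total_cases      # prints 46

# Re-define confirmed_cases only
confirmed_cases <- 44
total_cases      # still prints 46: the total has not been re-calculated

# Re-run the total so that it uses the new value
total_cases <- confirmed_cases + suspect_cases
total_cases      # prints 56
```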
You might wonder - "why do I need to run each line of my script one-by-one?". Well, you don't have to! Use your mouse to highlight multiple commands in the script and then run them.
Before you run multiple lines, check for unfinished code that might cause an error. Remember, R will stop if it encounters an error.
The # symbol can also be used to temporarily deactivate code. If placed at the beginning of the line, R will ignore this code.
# Example of a line being deactivated with a # symbol
confirmed_cases <- 44
suspect_cases <- 12
#cases_with_status_pending <- 3
total_cases <- confirmed_cases + suspect_cases
Let's apply what you have learned to the supply chain example from the previous section. As a reminder: you are managing COVID-19 test supplies across 3 sites. You have the following information:
Earlier, you wrote this command to calculate the needed number of tests:
((200 + 550 + (925 * 2)) * 1.1) - 420
Re-write this calculation using objects, so that it can be easily updated each month.
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Begin by writing 3 commands, which define the needs for each of the 3 sites.
Then, write one command that defines excess tests remaining from the previous month.
Finally, write a command that prints the total number of tests needed by replacing the numbers with the object names defined in the previous commands.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
# define values (objects) for use in the calculation
site1 <- 200
site2 <- 550
site3 <- 925 * 2
extra <- 420
# run this command to print the amount needed to order
((site1 + site2 + site3) * 1.1) - extra
Now that you've set up your calculation so that it can be easily updated, use the code you've created to answer the following question:
quiz( question("How many tests should you order if site 1 needs 250, site 2 needs 730, site 3's two sites need 1050 each, and you have 37 extra from the previous month?", answer("3253", message = "Check your parentheses - is 1.1 multiplied on ALL of the sites, as it should be?"), answer("3351", correct = TRUE, message = "Nice work!"), answer("2980", message = "You have made an incorrect calculation, check your code against the solution"), allow_retry = TRUE ) )
Think, how would you address the following questions?
If you are unsure, ask an instructor.
r fontawesome::fa("exclamation", fill = "red")
Remember to save your R script often! Just click the small "save icon" in the row of icons above the script, or click File -> Save
You have now written a useful script! But will a colleague be able to understand your logic and commands?
As demonstrated earlier, any text written to the right of the hash symbol is ignored by R. You can place the hash symbol:
# Purpose: Calculate monthly order for testing supplies
# Author: (your name)
# Last updated: (date)
# Contact email: (your email)
# define values (objects) for use in the calculation
site1 <- 200
site2 <- 550
site3 <- 925 * 2
extra <- 420
# run this command to print the amount needed to order
((site1 + site2 + site3) * 1.1) - extra

# Purpose: Calculate monthly order for testing supplies
# Author: (your name)
# Last updated: (date)
# Contact email: (your email)
# define values (objects) for use in the calculation
site1 <- 200        # needs for site 1
site2 <- 550        # needs for site 2
site3 <- 925 * 2    # needs for site 3 subsites
extra <- 420        # number of tests remaining from last month
# run this command to print the amount needed to order
((site1 + site2 + site3) * 1.1) - extra
Below is an example of a well-documented script that imports data and does basic epidemiological analyses. You do NOT need to write this code in your script - simply look at it. Note how clear it is to read - each section is clearly notated with plenty of spaces and new lines present between portions of code.
knitr::include_graphics("images/example_script.png", error = F)
Section headings are a useful way of organising your code. You can utilize a keyboard shortcut to insert a section header into your script.
Place your cursor where the new section should start and press Ctrl, Shift, and R at the same time (or Cmd Shift R on a Mac). In the pop-up, name the section, for example "Monthly supply needs".
The new section header should look something like this:
# Monthly supply needs ----------------------------------------------
RStudio will recognize this section, and it will appear in the script "Outline", located in the top right of the R script pane. This clickable Outline tool can be very useful to navigate scripts with hundreds or even thousands of lines.
knitr::include_graphics("images/section_headings.png", error = F)
Adapt your script to utilize section headings and comments
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
# About this script ----------------------------------------------
# Purpose: Calculate monthly order for testing supplies
# Author: (your name)
# Last updated: (date)
# Contact email: (your email)
# Monthly supply needs ----------------------------------------------
# You can change these numbers to reflect the expected needs for each site
site1 <- 200        # needs for site 1
site2 <- 550        # needs for site 2
site3 <- 925 * 2    # needs for site 3 subsites
extra <- 420        # number of tests remaining from last month
# Calculation ----------------------------------------------
# Note: This equation includes a 10% buffer, via the 1.1 factor
((site1 + site2 + site3) * 1.1) - extra
Objects can also hold "character" values (text, words). These are placed within quotation marks, like "New York City" or "dm76wk34" (a randomly-generated case unique identifier).
Note that the character objects can be created with "double quotes" or 'single quotes' to the same effect (it can be useful to place single quotes within double quotes sometimes).
When your R script recognizes that something you have written is a character value (once the first and last quote marks are written) it will turn a different color.
name <- "Oliver"
district <- "Bolo"
occupation <- "nurse"
Try to define some character objects in your script, like the ones above. Experiment with printing them to the Console, and with changing their value.
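As a small illustration of the note on quotation marks, the sketch below compares the two quote styles. The functions identical() and class() have not been introduced in this module and are used here only to check the results; the object name visit_note is made up for the example.

```r
# double and single quotes create the same character value
identical("nurse", 'nurse')      # TRUE

# a single quote (apostrophe) can sit inside double quotes
visit_note <- "This is the patient's first visit"
visit_note

# class() reports what kind of value an object holds
class(visit_note)                # "character"
```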
Note that thirty-four could be written in R as:
34
(a numeric value, capable of being used in mathematical calculations), or
"34"
(a character value, which R treats as text). What happens if you try to add 30 + "12"?
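If you would like to check your answer, the sketch below shows one way to explore the difference between numeric and character values. The functions class(), tryCatch() and as.numeric() are not covered in this module; tryCatch() is used only so that the example keeps running after the error.

```r
class(34)      # "numeric"
class("34")    # "character"

# adding a number and a character value stops with an error;
# here the error message is captured as text instead of halting the script
result <- tryCatch(
  30 + "12",
  error = function(e) conditionMessage(e)
)
result         # the error message that R would have shown

# converting the character value to a number makes the calculation possible
30 + as.numeric("12")   # 42
```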
We will use more character objects in the next section...
The previous examples of running commands, creating objects, and mathematical calculations do not showcase R's best abilities.
The real power of R comes from *functions*! Functions are at the core of using R, and are how you perform more complex tasks.
A function receives inputs, does some action with those inputs, and produces an output. What the output is depends on the function.
Functions operate upon an object placed within the function’s parentheses. For example, the function sqrt()
returns the square root of a number:
sqrt(64)
Likewise, the function sum()
accepts an unlimited number of numeric values and returns the sum.
sum(2, 5, 10, -8, 100)
In your R script, use the functions min()
and max()
to find the minimum and maximum of the numbers 3, 55, 9, -4, and 33.
We won't do a quiz, because the answers should be quite easy... but did your code work?
# calculate the minimum value
min(3, 55, 9, -4, 33)
# calculate the maximum value
max(3, 55, 9, -4, 33)
c()
Let's try the function c()
. The "c" represents the term "concatenate" (you can remember it as "combine", too), because this function combines the values within its parentheses into one unit.
We call the unit produced by c()
a vector. A vector is a unit of several values, which must be of the same class (either all numeric, all character, all logical, etc.) and must be separated by commas.
See this example, where we create a vector of numeric values (the ages of 5 patients) and save it as an object named patient_ages.
# create a vector of patient ages
patient_ages <- c(5, 10, 60, 25, 12)
Try the above command in your R script. Now, what happens when you run the command patient_ages
? All the numbers print to the R Console.
It is useful that these numbers can be referenced by one name, because now we can apply changes to all of them with just one step:
patient_ages * 2
I am not sure why we would need to multiply all the patient ages by 2, but it sure was easy, wasn't it?!
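The rule that all values in a vector must share one class can be seen directly. In this sketch, the object mixed is a made-up example, and class() is used only to inspect the result.

```r
patient_ages <- c(5, 10, 60, 25, 12)
patient_ages * 2          # 10 20 120 50 24
class(patient_ages)       # "numeric"

# mixing numbers and text: R converts every value to character
mixed <- c(5, "ten", 60)
mixed                     # "5" "ten" "60"
class(mixed)              # "character"
```

Because mixed is now a character vector, a command such as mixed * 2 would produce an error.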
Try creating your own vector - make a vector of some names of districts/cities/counties in your home region. Name it jurisdictions
. Does your command look similar to this?
# A character vector of jurisdiction names in Mozambique
jurisdictions <- c("Maputo", "Inhambane", "Gaza", "Zambezia", "Manica", "Sofala")
What can we do with a character vector? We cannot multiply it by 2... For fun, let us put this vector in another function: toupper()
, which changes all of the characters to upper case:
toupper(jurisdictions)
Note: Even though the vector contains character values, when typing the object name, or name of the vector, you do not use quotes. It is an R object just like confirmed_cases
, and so should be written plainly in code.
What does the output look like? What if you try the function tolower()
? This could be useful to standardize names or other character words when joining two datasets!
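To see why standardizing case can help before matching or joining, consider this sketch. The values in district_raw are invented for illustration: the same district typed three different ways.

```r
# the same district entered three different ways (made-up values)
district_raw <- c("Maputo", "MAPUTO", "maputo")

# R treats these as three different values
district_raw == "maputo"            # FALSE FALSE TRUE

# after converting to lower case they all match
district_clean <- tolower(district_raw)
district_clean == "maputo"          # TRUE TRUE TRUE
unique(district_clean)              # "maputo"
```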
quiz( question("Was this toupper() command a PRINTING command, or a SAVING command?", answer("Printing", message = "Yes, There is no arrow operator. The values in the jurisdictions object were only changed temporarily in order to print them in UPPER CASE to the Console. The values were not changed in the Environment.", correct = TRUE), answer("Saving", message = "No, there was no arrow used, and the values in the jurisdictions object were not actually changed."), allow_retry = TRUE ) )
Most functions you will encounter in R have named arguments. These allow you to specify the settings under which the function will operate.
# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.
knitr::include_graphics("images/arguments-buttons.png", error = F)
Let's return to the character vector jurisdictions
that you defined earlier. Your vector may include other jurisdictions local to your home.
jurisdictions
Imagine you want to print these jurisdiction names in a report (yes, you can automate reports with R!), but this output looks ugly and is not very readable in a sentence. Perhaps you want commas between the names.
The function paste()
accepts values and combines them.
The first argument of paste()
is the name of the object with the values you want to combine. For this example, we will use the jurisdictions
object we created.
paste()
has another argument that is named collapse =
which inserts an item or symbol (e.g. a space, or a comma, or both!) in between each of the values.
Arguments are written within the function's parentheses with a single equals sign, so we refer to the collapse
argument as collapse =
. Note that arguments are separated by commas within the function.
Write the following code in your script and run the command through your console to see the result:
paste(jurisdictions, collapse = ",")
Let's understand what is going on:
1) The first argument expects the name of a vector, in this case: jurisdictions. Often, the first argument of a function is data to be operated upon, and does not require a name nor an equals sign.
2) The second argument is collapse =, to which we provide a character value (in quotation marks) to appear between each of the words. In the example above, we have chosen a comma: ",".
paste(jurisdictions, collapse = ",")
The output looks like this:
paste(jurisdictions, collapse = ",")
Replicate the code above with your jurisdictions
vector in RStudio. Note how this function knows to not place a comma after the last value in jurisdictions
.
How would you adjust the code to add a space after each comma?
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
# Note the space after the comma, within the quotation marks paste(jurisdictions, collapse = ", ")
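For reference, here is what the two versions return when run on the example vector of Mozambican jurisdictions (your own vector will give different names). The last command is an optional extra, assuming you want to place the list inside a sentence; report_sentence is a made-up object name.

```r
jurisdictions <- c("Maputo", "Inhambane", "Gaza", "Zambezia", "Manica", "Sofala")

paste(jurisdictions, collapse = ",")
# "Maputo,Inhambane,Gaza,Zambezia,Manica,Sofala"

paste(jurisdictions, collapse = ", ")
# "Maputo, Inhambane, Gaza, Zambezia, Manica, Sofala"

# the collapsed text is a single character value, so it can be pasted into a sentence
report_sentence <- paste("Cases were reported in:", paste(jurisdictions, collapse = ", "))
report_sentence
```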
Functions often have many arguments, and not all are required for them to run. In the paste()
exercise above, collapse =
is optional. You can use the function without specifying a value for it.
In another example, functions that make plots have arguments like title =
, subtitle =
, color =
, etc. You do not need to supply values for the function to successfully run.
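Optional arguments usually have a default value that is used when you do not specify one. The sketch below illustrates this with two base R functions, round() and paste(); the sep = argument of paste() has not been introduced above and is shown only as an example of a default.

```r
# round() has an optional digits = argument; when omitted, it defaults to 0
round(3.14159)               # 3
round(3.14159, digits = 2)   # 3.14

# paste() has an optional sep = argument; its default is a single space
paste("Site", 1)             # "Site 1"
paste("Site", 1, sep = "_")  # "Site_1"
```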
R coders do not have encyclopaedic brains - we look up this information all the time in the Help documentation or resources like our Epidemiologist R Handbook.
A function's arguments and any default values can be read in the function's documentation. To read the documentation, search the name of the function in the Help pane in the lower-right of RStudio. Alternatively, type ?
before the function name in the RStudio Console pane (for example, ?paste
).
The documentation details will look something like this (we can help you interpret, as they can be difficult to understand at first):
knitr::include_graphics("images/r_documentation.png", error = F)
quiz( question("What are the arguments in the following command:\n\nage_pyramid(data = linelist, split_by = 'gender', age_group = 'agecat5', proportional = TRUE)", answer("linelist, split_by, age_group, proportional", message = "Careful! linelist comes after an equals sign. It is the value assigned to the argument data = "), answer("data, split_by, age_group, proportional", correct = T), answer("age_pyramid, linelist, 'gender', proportional", message = "age_pyramid is the function, and linelist is the value assigned to the argument data = "), allow_retry = T ) )
quiz( question("All R functions have multiple arguments that require input", allow_retry = T, answer("True"), answer("False", correct = TRUE, message = "Not all functions have multiple arguments, and typically for functions with multiple arguments, many have default values that do not need to be supplied.") ) )
Briefly review these other operators and base R functions.
| Purpose                | Function                              |
|------------------------|---------------------------------------|
| rounding               | round(x, digits = n)                  |
| rounding               | janitor::round_half_up(x, digits = n) |
| absolute value         | abs(x)                                |
| square root            | sqrt(x)                               |
| exponential (e^x)      | exp(x)                                |
| natural logarithm      | log(x)                                |
| log base 10            | log10(x)                              |
Note: See this page in the Epi R Handbook before using rounding functions. There is mathematical nuance that is important in some circumstances.
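As a brief illustration of that nuance: base R's round() rounds a value ending in exactly .5 to the nearest even number, which can surprise people who expect .5 to always round up. The sketch below assumes the {janitor} package may or may not be installed, so that part only runs if it is available.

```r
# base R: values ending in exactly .5 go to the nearest even number
round(0.5)   # 0
round(1.5)   # 2
round(2.5)   # 2
round(3.5)   # 4

# {janitor} offers a version that always rounds .5 upwards
if (requireNamespace("janitor", quietly = TRUE)) {
  janitor::round_half_up(2.5)   # 3
}
```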
Briefly review these common statistical functions in R. We will use these frequently.
| Objective               | Function           |
|-------------------------|--------------------|
| mean (average)          | mean(x, na.rm=T)   |
| median                  | median(x, na.rm=T) |
| standard deviation      | sd(x, na.rm=T)     |
| quantiles*              | quantile(x)        |
| sum                     | sum(x, na.rm=T)    |
| minimum value           | min(x, na.rm=T)    |
| maximum value           | max(x, na.rm=T)    |
| range of numeric values | range(x, na.rm=T)  |
| summary                 | summary(x)         |
CAUTION: By default, any missing values in the calculation (written in R as
NA
) will result in an output of NA
, unless the argument na.rm =
is set to TRUE
, which removes (rm) NAs from the calculation. This can be written shorthand as na.rm = T
. This will make more sense once we begin to use datasets.
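Here is a minimal sketch of the effect of na.rm =, using a made-up vector with one missing value. The function is.na(), used at the end to count missing values, has not been introduced in this module.

```r
# made-up monthly counts, with one missing value
tests_used <- c(120, NA, 95, 210)

mean(tests_used)                 # NA
mean(tests_used, na.rm = TRUE)   # 141.6667
sum(tests_used, na.rm = TRUE)    # 425

# count how many values are missing
sum(is.na(tests_used))           # 1
```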
[DANGER: If providing a standalone vector of numbers to one of the above statistical functions, be sure to wrap the numbers within c()
.]{style="color: red;"}
# If supplying raw numbers to a statistical function, wrap them in c()
mean(1, 6, 12, 10, 5, 0)       # !!! INCORRECT !!!
mean(c(1, 6, 12, 10, 5, 0))    # CORRECT
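The incorrect version is dangerous because it does not produce an error. In the sketch below, mean() treats only the first number as the data and interprets the remaining numbers as other arguments, so it quietly returns a wrong answer. By contrast, sum(), min() and max() are written to accept separate values.

```r
# only the first value (1) is treated as the data
mean(1, 6, 12, 10, 5, 0)       # 1

# wrapped in c(), all six values are averaged
mean(c(1, 6, 12, 10, 5, 0))    # 5.666667

# sum() accepts separate values, so both forms agree
sum(1, 6, 12, 10, 5, 0)        # 34
sum(c(1, 6, 12, 10, 5, 0))     # 34
```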
Until now, you have used {base} R functions that come installed with R, such as sum()
, c()
, and min()
. These are a very small portion of all R functions.
An R package is a shareable bundle of related functions that you can download and use.
Packages typically have a theme, for example:
Names of packages are often written in text with curly brackets { }. This is not done when writing package names in your R code.
The names are also often clever puns - the fun spirit of the R community is clear!
To install most R packages, use R commands to download the package from "CRAN" to your computer's "R library".
Many R users create specialized functions, which they share in packages with the R community. For packages to be widely distributed, they are usually shared on the Comprehensive R Archive Network (CRAN), which is R's central software repository - an archive of R packages that have passed basic scrutiny.
As of December 2023, there are 20,190 packages available on CRAN. Some of these are immensely popular, with hundreds of thousands of downloads each month.
Are you worried about viruses and security when downloading a package from CRAN? Read this article on the topic.
If you are using a Virtual Private Network (VPN) you may need to turn it off in order to install R packages.
Once a package is installed, it is stored in your R “library”. You can then access the functions it contains by “loading” the package for use during your current R session.
In summary: to use the functions available in an R package, 2 steps must be implemented:
1) Install the package (only once)
2) Load the package (each time you start an R session)
Think of R as your personal library: when you install a package, your library gains a new "book" of functions. But each time you want to use a function from that book, you must borrow (“load”) it from your library.
knitr::include_graphics("images/bookshelf1.png", error = F)
Our first tip is this: write your code that installs and loads packages at the top of the script (as the first command).
This makes it clear to yourself and other readers which packages are required to run the script.
There are {base} R commands to install and load packages. They are not the most efficient approach, but you can always revert back to this method.
install.packages()
is the {base} R function to install packages (note that "packages" is written in plural). Write the name of the package in the parentheses, in quotes. Package names are case sensitive.
# install the janitor package with base R
install.packages("janitor")
r fontawesome::fa("exclamation", fill = "red")
Note that this command will install the package every time it is run - even if the package is already installed. Also, it does not load the package for use in the current session (step 2 from above).
If you want to install multiple packages in one command, they must be listed within a character vector c()
.
# install multiple packages with base R
install.packages(c("janitor", "rio", "here"))
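Because install.packages() re-installs a package even when it is already present, some people wrap it in a check. The sketch below is one possible approach and not part of this course's workflow: install_if_missing() is a hypothetical helper name, and requireNamespace() is a base R function that tests whether a package is available without loading it for use.

```r
# hypothetical helper: install a package only if it is not yet in the library
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# {stats} is installed with R itself, so nothing is downloaded here
install_if_missing("stats")
```

This is essentially the check that the {pacman} package, introduced next, performs for you.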
Now you try it: write a command near the top of your script to install the package "pacman". If a pop-up window asks if you want to re-start R, say "No".
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
install.packages("pacman")
Do you remember how often you need to load a package? Every time you start R.
You can do this using the {base} R function library()
.
Write and run the following command below the install.packages("pacman")
command you just added to your script.
# Loads the package pacman for use in the current R session
library(pacman)
The functions of the {pacman} package are now available for use.
The package "pacman" helps you efficiently install and load other packages, and greatly eases sharing R scripts among colleagues. Its name refers to "package manager" (not the video game character!)
Its p_load()
function does Steps 1 & 2 efficiently. It installs each package only if it is not already installed. Additionally, it will load each package for use.
The syntax is quite simple. Place the package names within the parentheses, separated by commas. No quotes are needed. Package names are case-sensitive.
Write this command near the top of your script, as the first command, and run it.
Place a # in front of any install.packages()
and library()
commands so they are de-activated.
pacman::p_load(rio, here, janitor, tidyverse)
When your colleague runs this command on their computer, only the packages they do not already have on their computer will be installed, and all will be loaded. This uses fewer lines of code, and does not result in unnecessary installations. It does depend on your colleague having the {pacman} package already installed.
The special syntax pacman::p_load()
uses two colons :: to explicitly connect the package name pacman
and the function name p_load()
. This syntax is useful because it loads the {pacman} package (assuming it is already installed), avoiding the need for library(pacman)
. This is one of the few times that this :: syntax is useful.
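The same package::function() pattern works for any installed package. As a small sketch, the example below calls file_ext() from the {tools} package, which is installed with R but is not loaded by default. It is used here purely as an illustration and is not needed for this course.

```r
# call one function from an installed package without running library() first
tools::file_ext("testing_data.csv")   # "csv"
```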
IMPORTANT: Have your pacman::p_load()
command written only once near the top of your script. If you later realise that you need additional packages, simply add them to this command and re-run it.
In any R command, the newlines and indenting will not impact the execution of the code, but can dramatically improve readability!
Therefore, the pacman command can be written vertically, with comments to explain why we are loading specific packages.
Edit your command so that it looks like this.
# Load all the packages needed, installing if necessary
pacman::p_load(
  rio,          # for importing data
  here,         # for file paths
  janitor,      # for data cleaning
  tidyverse     # for data management
)
It is generally advised to list the {tidyverse} package last, due to a phenomenon called "function masking". Ask an instructor if you want to know more.
quiz(caption = "Quiz - packages", question("How often do you need to install a package on your computer?", allow_retry = T, answer("Every time you restart R", message = "Packages only need to be installed once. There is no relation to restarting R."), answer("Only once", correct = TRUE, message = "Packages must be installed once. After a long time, you may want to update them by re-installing."), answer("Never (assuming you are connected to the internet)", message = "Packages must be installed once. You can not access them dynamically through the internet."), answer("Each time you restart your computer", message = "Packages only need to be installed once. There is no relation to restarting your computer.") ), question("How often do you need to load a package?", allow_retry = T, answer("Every time you start or restart R", correct = TRUE, message = "Packages must be loaded each time you start an R session."), answer("Only once", message = "Packages must be installed only once, but must be loaded at the beginning of each R session."), answer("Never (assuming you are connected to the internet)", message = "Packages must be installed once and loaded at the beginning of each R session. You can not access them dynamically through the internet."), answer("Each time you restart your computer", message = "Packages only need to be installed once and loaded at the beginning of each R session. There is no relation to restarting your computer.") ), question("Newlines and indents can be used to improve readability without impacting code execution.", allow_retry = T, answer("True", correct = TRUE), answer("False") ), question("The pacman function p_load() does which of the following (select ALL that apply).", allow_retry = T, answer("Installs each package if it is not yet installed", correct = TRUE), answer("Loads the packages for use", correct = TRUE), answer("Produces a small yellow pacman emoji that eats your code line-by-line") ) )
Let's begin working with data! To work with a dataset that is saved on your computer as an Excel, CSV, or similar file, you must import it into the R environment as an object.
The dataset will be saved as a data frame object, which consists of columns and rows.
Unlike some other software, R can store multiple datasets at once.
In order to import a dataset into R, you must tell R where to access the file on your computer (e.g. a specific folder). This can be surprisingly difficult sometimes.
However, by using an RStudio project and saving the data within the project, the whole process becomes much easier.
In the "module1" project directory, see that the project's "top-level" or "root" folder (where the "module1.Rproj" file is located) contains the file "testing_data.csv". We will import this dataset into our current R session.
knitr::include_graphics("images/data_in_root.png", error = F)
There are {base} R functions for importing data, but they can be confusing and difficult to remember. For example, there are separate functions for importing different types of files into R (e.g. .xlsx, .csv, .tsv, .txt, etc).
Thankfully, there is one function that works for almost all file formats, which is the import()
function from the package {rio}.
The import()
function expects to receive a character value - the file path to the data that you wish to import. In this case, our data file is saved in the RStudio project root folder, so you only need to provide the file name and extension, in quotation marks, as below:
import("testing_data.csv")
Create a new script section, using the Ctrl + Shift + R shortcut, and name it "Practice importing data"
Type and run the import()
command above in the "Practice importing data" section of your script.
r fontawesome::fa("exclamation", fill = "red")
Did you see this error?
Error in import("testing_data.csv") : could not find function "import"
If so, it means that you did not install and load the {rio} package. If you need help, review the previous section and look for the pacman::p_load()
command that loads several packages, including {rio}.
Once you get the command to run successfully, you probably saw a lot of text appear in the Console. That was the dataset!
Think: what did your command ask R to do? Was it a PRINTING command, or a SAVING command?
...You asked R to import the dataset... and because there was no <-
(assignment operator) in the command, the default action was to print the output in the console.
Now, adjust your command to save the dataset as an object in your Environment, with the name tests
.
r fontawesome::fa("lightbulb", fill = "gold")
Click to read a hint
Use the assignment operator <-
before the function. Don't forget quotation marks around the name of the file.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
tests <- import("testing_data.csv")
When you run this code, you should see the new object tests
appear in the Environment pane, with a short description of the number of observations and variables. Congratulations, you have now imported a dataset into R! Well done!
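If the import fails because R cannot find the file, it can help to check where R is looking. The sketch below is an optional diagnostic, not a required step: file.exists() and getwd() are base R functions that have not been covered in this module, and the object name data_file is made up for the example.

```r
# the file name, exactly as written in the import() command
data_file <- "testing_data.csv"

if (!file.exists(data_file)) {
  # show the folder in which R is currently looking for the file
  message("File not found. R is currently looking in: ", getwd())
} else if (requireNamespace("rio", quietly = TRUE)) {
  tests <- rio::import(data_file)
  print(dim(tests))   # number of rows, then number of columns
}
```

When you work inside the RStudio project, the folder printed by getwd() should be the project's root folder.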
Now that you have imported the data, let's take a look!
In the Environment pane, click once on the blue circle next to the dataset name.
The expanded area beneath the name is an overview of the columns in the data frame. There is also an abbreviation that indicates the "class" of the column (character, integer, etc.) and the first few values in the column. How exciting!
To view the dataset, click on the name of the object tests
in the Environment pane. This opens a new tab displaying the data frame, in the pane next to the R script.
Practice scrolling through the data frame, and applying filters (see the filter icon in the upper-left of the data display).
quiz(caption = "Quiz - data review", question("How many columns are there in the data frame?", allow_retry = T, answer("665", message = "No, this is the number of rows/observations"), answer(ncol(tests), correct = TRUE, message = "Yes this is correct!"), answer("4", message = "This is not correct, try again."), answer("25", message = "No, this is the number of rows shown in the initial display") ), question("What is the value in the 4th column and the 10th row?", allow_retry = T, answer(tests[10, 4], correct = TRUE), answer("899"), answer("38.3"), answer("Site 3") ), question("How many of the rows include between 100 and 300 tests requested (inclusive)?", allow_retry = T, answer("14"), answer(tests %>% filter(tests_requested >= 100 & tests_requested <= 300) %>% nrow(), correct = TRUE), answer("20"), answer("4") ) )
If you struggle to answer the final question, click the white box below "tests_requested" to filter the column numerically, then read the summary of filtered rows at the bottom of the pane. Note that this action is not filtering the actual data frame, but only your temporary view of the data frame.
Let's take this opportunity to explore the dataset.
Go to your pacman::p_load()
command near the top of your script and add these packages to the list. Add them before the "tidyverse" package, and don't forget to put commas!
Your command should now look like this:
# Load all the packages needed, installing if necessary
pacman::p_load(
  rio,          # for importing data
  here,         # for file paths
  janitor,      # for data cleaning
  skimr,        # for exploratory analysis
  flextable,    # for formatting tables
  scales,       # for formatting values
  gtsummary,    # for pretty tables
  tidyverse     # for data management
)
Click to place your cursor anywhere in the pacman::p_load()
in your R script and re-run the command by clicking Run or pressing Ctrl and Enter at the same time (or Cmd Enter on a Mac).
You can return a useful summary of the dataset using the R package {skimr} and its function skim()
.
You should have already installed and loaded {skimr} by adding it to your pacman::p_load()
command and re-running this command. If you encounter any errors, notify your facilitator.
Return to the "Practice importing data" section of your script. Place the name of your imported data frame object (tests
) in the skim()
function and run the command. What content appears in the R console?
If the output is not easily readable, expand your Console pane to be wider by moving the boundaries between the RStudio panes (double click and drag), and then run the command again.
r fontawesome::fa("check", fill = "red")
Click to see a solution (try it yourself first!)
skim(tests)
What does the output show?
quiz(caption = "Quiz - reviewing the data", question("How many columns has R classified as numeric?", allow_retry = T, answer("12", message = "No, this is more than the columns in the dataset!"), answer(tests %>% select(where(is.numeric)) %>% ncol(), correct = TRUE, message = "Yes this is correct!"), answer("1", message = "No, this is the number of columns that are POSIXct - a type of date format") ), question("Review the information on the character columns. How many unique values are in the sites column?", allow_retry = T, answer(length(unique(tests$site)), correct = TRUE), answer("6"), answer("12"), answer("0") ), question("Review the information on numeric columns. What is the median (50th percentile) of monthly tests requested?", allow_retry = T, answer("32.2"), answer(median(tests$tests_requested, na.rm=TRUE), correct = TRUE), answer("1250"), answer("99.0") ) )
There are other functions available to summarise objects, such as:
summary()
from {base} R, and
glimpse()
from the {dplyr} package (loaded as part of {tidyverse}).
Try these other functions on your data frame. Compare the outputs from each function. Which output do you prefer?
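If you want to compare outputs without the course dataset, the sketch below builds a tiny made-up data frame (toy_tests, with invented values) and applies a few base R functions to it. The functions data.frame(), str(), nrow(), ncol() and names() are not introduced in this module and are shown only as further options.

```r
# a tiny made-up data frame, for illustration only
toy_tests <- data.frame(
  site            = c("Site 1", "Site 2", "Site 3"),
  tests_requested = c(200, 550, 1850),
  tests_used      = c(180, NA, 1700)
)

summary(toy_tests)   # one summary per column
str(toy_tests)       # compact overview: class and first values of each column
nrow(toy_tests)      # 3
ncol(toy_tests)      # 3
names(toy_tests)     # the column names
```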
In an earlier quiz question, we asked you to find the value in the 4th column, and the 10th row. Instead of searching manually, you can also write code to isolate and view pieces of the data frame.
$ index operator
"Indexing" means referencing only one part of an object. The $
operator is a {base} R method to extract just one column from a data frame. Write the name of the data frame, then $
, then the name of the column, as shown below.
Add the following code to your R script to print just one column of tests
:
tests$tests_requested
Here is something you should know: columns are "vectors" - a long line of values of the same class, just like jurisdictions
and patient_ages
from earlier in this module. In fact, a data frame is simply a collection of vectors (columns!).
Just as you put the vector jurisdictions
in the function toupper()
, you can also put a column within a function. For example:
We can use the mean()
function to return the mean of one column. Try this out in your R script:
mean(tests$tests_requested)
We can use the summary()
function to summarise information about one column. Try this out in your R script:
summary(tests$tests_requested)
As you typed, slowly, did you see a small menu appear that showed all the columns in the data frame? You can click to select a column from the drop down menu when it appears if you do not want to type the rest and potentially make a spelling mistake.
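The idea that a column is a vector can be checked on a small example. The sketch below uses a made-up data frame (toy_tests, with invented values, including one missing value) and not the course dataset. It also shows why the earlier caution about na.rm = matters when you summarise a column.

```r
# a tiny made-up data frame with one missing value
toy_tests <- data.frame(
  site            = c("Site 1", "Site 2", "Site 3", "Site 4"),
  tests_requested = c(200, 550, NA, 1850)
)

# extracting a column returns a vector
toy_tests$tests_requested                        # 200 550 NA 1850
is.vector(toy_tests$tests_requested)             # TRUE

# the missing value affects the result unless na.rm = TRUE
mean(toy_tests$tests_requested)                  # NA
mean(toy_tests$tests_requested, na.rm = TRUE)    # 866.6667
```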
Another indexing operator to be aware of is the "square brackets" that look like [ ]
. These extract a sub-part of a larger object using the following template:
object[ROW, COLUMN]
For example, tests[10, 4] would return the value in tests at the 10th row and the 4th column.
You can return an entire column using brackets by leaving the ROW part empty (but don't forget the comma!): tests[ , 4]
Or you can return an entire row by doing the opposite: tests[10, ]
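The bracket template can be tried on a small made-up data frame, as in this sketch (demo and its values are assumptions for illustration only).

```r
# Hypothetical data frame, invented for this illustration only
demo <- data.frame(
  site  = c("North", "South", "East"),
  month = c(1, 1, 1),
  tests = c(120, 95, 140)
)

# One value: row 2, column 3
demo[2, 3]

# Entire 3rd column (ROW part left empty)
demo[ , 3]

# Entire 2nd row (COLUMN part left empty)
demo[2, ]

# A column can also be referenced by name instead of position
demo[2, "tests"]
```

Note that with a base R data frame, demo[ , 3] returns a vector, whereas a tibble would return a one-column tibble. Referencing columns by name is often safer than by position, because the code keeps working if the column order changes.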
Other objects can be subset as well. For example, the summary() function, when used on a numeric column, returns an object that contains the minimum, first quartile, median, mean, third quartile, and maximum.
Try the 3 commands below to see how the output changes as further indexing is applied:
# Return the summary of the column 'tests_requested'
summary(tests$tests_requested)

# Return the 3rd element of the summary (the median)
summary(tests$tests_requested)[3]

# Return the number only, of the 3rd element of the summary
# Double brackets look deeper into nested objects
summary(tests$tests_requested)[[3]]

# Do something with the number
summary(tests$tests_requested)[[3]] + 4
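The difference between single and double brackets may be easier to see on a short made-up vector. In this sketch, x and s are names chosen for illustration only.

```r
# Hypothetical numeric vector, invented for this illustration only
x <- c(10, 20, 30, 40, 100)

# Store the summary so that we can index it
s <- summary(x)

# The summary is a named object: see the labels of its elements
names(s)

# Single brackets keep the label ("Median") attached to the value
s[3]

# Double brackets return only the number
s[[3]]

# Elements can also be extracted by their label
s[["Median"]]

# The bare number can be used in further calculations
s[[3]] + 4
```

Extracting by label, as in s[["Median"]], can make your code easier to read than counting positions.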
You will use the $ operator very frequently. It is less common to need or use the brackets, but they are still good to know.
Below, we provide you with some R code to make a plot of the COVID-19 testing data.
You do NOT need to type this code - simply copy and paste! Paste this code into your script, below the import commands, and run it.
Later in this course, you will learn how to understand and write this code, as well as learn how to create plots of your own!
# create a plot of tests used by site and month
ggplot(data = tests,                 # use the tests dataset
       mapping = aes(                # map axes to columns in the dataset
         x = month,                  # set x-axis to month column
         y = tests_used,             # set y-axis to tests_used column
         color = site)) +            # set line color by site
  geom_line() +                      # display the data as lines
  theme_light() +                    # simple background
  labs(title = "COVID-19 tests used, by month")   # add a title
Do you see the output appear in the RStudio Plots pane in the bottom right? You can adjust the size of the pane, as needed.
How fun! With just a few lines of code we made a beautiful graphic! R is an extremely versatile tool. You could save the plot as a PNG, embed it into a PowerPoint (PPT) slide, or embed it into a routine report that updates automatically when new data are reported.
There are many ways to make summary tables in R, which we will explore in a later module. Below are some examples.
Copy and paste the following code into your script and run it - do not worry if you do not understand it. You will eventually!
This code uses the {gtsummary} package and its function tbl_summary() to produce an HTML table that can be embedded with R into a Word, PDF, or HTML report, or website/dashboard.
# make an HTML summary table, by site
tests %>%                                                    # start with tests data
  select(site, tests_requested, tests_used, staffing) %>%    # select our columns
  tbl_summary(by = site) %>%                                 # summary table by site
  modify_caption("Assessment of testing site needs")         # add title
In this example we use {tidyverse} functions to calculate more complex descriptive statistics, and the {flextable} package to format them as a summary table.
Copy and paste the R code below into the bottom of your script, then click your cursor into the top line of the command and hit "Run" to run the entire command.
Later in the course, you will be able to understand what this code does. It begins by aggregating the rows by testing site, creating new summary columns with averages, maximums, and percents by site. The later lines adapt the table into a pretty format with a header, caption, and even highlighting that automatically responds to the values!
# make a summary table of testing site performance
tests %>%

  # group the data by site
  group_by(site) %>%

  # calculate summary statistics
  summarise(
    avg_tests = round(mean(tests_used, na.rm = TRUE), digits = 1),
    max_tests = max(tests_used, na.rm = TRUE),
    # row position of the peak, read as a month number
    # (assumes each site has 12 rows ordered January to December)
    peak_month = month(which.max(tests_used), label = TRUE),
    # percent of the 12 months, kept numeric so the highlighting rule below compares numbers
    pct_understaffed = round(100 * sum(staffing == "understaffed", na.rm = TRUE) / 12)) %>%

  # make the table look pretty (formatting)
  qflextable() %>%
  set_header_labels(
    site = "Site",
    avg_tests = "Average\ntests",
    max_tests = "Max\ntests",
    peak_month = "Peak\nmonth",
    pct_understaffed = "% of months\n understaffed"
  ) %>%
  bg(i = ~pct_understaffed >= 50, j = 5, part = "body", bg = "#FF7F7F") %>%
  add_footer_lines("Sites understaffed >= 50% of months are highlighted.") %>%
  italic(part = "footer") %>%
  add_header_lines(values = "COVID-19 Testing Site Performance (annual summary)") %>%
  align(part = "header", i = 1, align = "center") %>%
  bold(i = 1, bold = TRUE, part = "header")
You could print this table as a PNG, send it to a Word or PDF document, or embed it into a slide deck or online report.
Importantly, to update the table you only need to re-run your script on an updated dataset. If testing_data.csv were updated with new data and you re-ran your script as written, the table would reflect the new data without any changes to the script. Amazing!
Do not be afraid to close RStudio at the end of your R session. Let's try this now! Do the following steps:
1) Save your script. You can do this by going to the File menu and clicking "Save" or by clicking the Save icon that looks like a "floppy disk".
2) Close out of RStudio (e.g. click the x in the top right corner to exit).
Now, practice re-opening RStudio *to this "module1" project*. Do you remember how?
quiz(caption = "Quiz - Reopen RStudio",
  question("Which ways below will work to open RStudio and resume your work in a specific RStudio project?",
    allow_retry = TRUE,
    answer("Just open RStudio from its desktop icon - it will automatically open to my project", correct = FALSE, message = "Not necessarily! Opening RStudio generically will automatically open the most recent project"),
    answer("Navigate to the folder of the desired project, and then open its RStudio project file", correct = TRUE, message = "Yes, you can click on the RStudio project file for the project you wish to open. "),
    answer("Open RStudio generally, then navigate using the top-right menu to the project I want to work on", correct = TRUE, message = "Yes, you can navigate to a project from within RStudio.")
  )
)
Now that RStudio is open again to your project, what do you notice? If you followed the instructions at the beginning of this module on workspace settings and the handling of ".RData", you should see that:
Open your R script, and run the commands one-at-a-time, starting from the top of the script. Watch as the objects are re-created, outputs are printed, data are re-imported. If you encounter errors, consider re-arranging the commands in the script. If you are confused, check with an instructor.
Note, you should always run your pacman::p_load() command first, to ensure you have all the necessary packages and associated functions loaded and ready to use for your current session.
Note, we have stored a "solution" script in the "intro_course/module1/backup" folder. It is an example of what your script might look like at the end of the module.
Congratulations on finishing the first module!
Click on to the next section if you want extra material...
If you finish early, here are some tasks for extra learning:
See this Tidyverse style guide for tips on how to align with best-practice R coding.
While most R packages conduct specific analyses or make workflows more efficient, R programmers are also fun people who made packages for amusement.
Try installing the R package {praise}, and then see what the function praise() does from the {praise} package.
pacman::p_load(praise) praise()
This is a fun package to use if you are building a tutorial!
Now try installing the R package {cowsay} - a package for printing silly images of animals made from punctuation symbols.
pacman::p_load(cowsay)
The first argument of the function say() is what =, to which you can provide a character value that will be spoken by a cat:
say(what = "Hi, I am a cat who is learning R!")
The second argument is by = and it can accept the name of another animal such as "chicken", "yoda", "spider", "ant", or "frog".
say(what = "Even frogs like to learn R!", by = "frog")
You can change the colors with the by_color = and what_color = arguments. See more in the package documentation by entering ?say in your R Console.
Have fun with this!
The data visualization functionality of R is so high quality that users have written a package to generate art.
Install the package {aRtsy} and load it for use.
pacman::p_load(aRtsy)
Some of the artwork can take a long time to generate (see the [package documentation](https://koenderks.github.io/aRtsy/)), but try this one:
canvas_collatz(colors = colorPalette("tuscany3"))
As described in the package documentation:
The Collatz conjecture is also known as the 3x+1 equation. The algorithm draws lines according to a simple rule set:
1. Take a random positive number.
2. If the number is even, divide it by 2.
3. If the number is odd, multiply the number by 3 and add 1.
4. Repeat to get a sequence of numbers.
By visualizing the sequence for each number, overlaying sequences that are the same, and bending the edges differently for even and odd numbers in the sequence, organic looking structures can occur.
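If you are curious, the rule set above can be sketched in a few lines of base R. This is only an illustration of the number sequence, not the code used by {aRtsy}. The function name collatz_sequence() is made up for this example, and stopping once the sequence reaches 1 is an assumption we have added.

```r
# Hypothetical helper, invented for this illustration only
# It returns the sequence of numbers produced by the rule set,
# stopping when the sequence reaches 1 (our assumption)
collatz_sequence <- function(start) {
  # the starting value must be a positive whole number
  stopifnot(start >= 1, start == round(start))

  values  <- start
  current <- start

  while (current != 1) {
    if (current %% 2 == 0) {
      current <- current / 2        # even: divide by 2
    } else {
      current <- 3 * current + 1    # odd: multiply by 3 and add 1
    }
    values <- c(values, current)    # add the new number to the sequence
  }

  values
}

# Starting at 6 gives: 6 3 10 5 16 8 4 2 1
collatz_sequence(6)
```

Try a few other starting numbers and see how the length of the sequence changes.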
Read the information at this link and adjust the settings of your RStudio via the "Tools" menu and clicking "Global Options". Which Appearance settings do you prefer?
Look through this list of R User communities and find one near you. See also this list of R-Ladies chapters. R-Ladies is one of the biggest groups of R users in the world! Now that you have an Applied Epi account, don't forget to check out the Applied Epi community forum as well. The forum is a space to ask R code questions, learn about R packages or functions of interest, and discuss all things R with alumni, current students, and members of the Applied Epi team!
A very popular R social phenomenon is "Tidy Tuesday" - the weekday when many R users do a fun data visualization task in R and share it with the community. It is a fun way to practice and learn new R tips.
Check out the hashtag "#TidyTuesday" on Twitter (or follow @TidyTuesday_Bot) and review the cool plots that people make with R!
If you are interested, watch this 7-minute video interview with one of R's co-founders, Ross Ihaka, produced by Maori TV in his home of New Zealand. He discusses his philosophy behind creating R, the Maori influences on his life, and more.
Read through the "R Basics" chapter of the Epi R Handbook - it is full of things that you have learned today, but also topics that will be new.
If you are confused by any section, ask your instructor for clarification.
The magic of R really happens when you create your own functions. This is an advanced skill that we do not cover in this course, and you do not typically need to create functions until you are a more experienced R user. You do not need to try this code, unless you want to. This is purely background information.
However, this aspect of R is where its versatility really begins to shine. Imagine if you could convert your entire workflow into one command?
For demonstration purposes, below, the testing supply chain script from the previous section is converted into a function:
# create a function that accepts 4 inputs and returns the
# needs, based on the equation

# create the function calc_tests()
calc_tests <- function(site1, site2, site3, extra){   # list the arguments, and open the function
  needs <- ((site1 + site2 + site3)*1.1) - extra      # we embed the equation inside the function
  return(needs)                                       # the function returns the result
}                                                     # close the function
Once the above code is run, the function is defined (it will appear in the R environment pane just like the other objects).
Now that the function calc_tests() has been defined and created, we can use it to run the equation given values for the arguments, like this:
calc_tests(site1 = 200, site2 = 550, site3 = (925*2), extra = 420)
Or with different values for the function's arguments:
calc_tests(site1 = 400, site2 = 150, site3 = (700*2), extra = 85)
See how we've wrapped up all the code into a function! Very cool. Think of the possibilities... you can create functions that meet your local needs!
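As a possible extension, an argument can be made optional by giving it a default value. The sketch below is a hypothetical variant (the name calc_tests_optional() is made up for this example) in which extra defaults to 0, so it can be left out of the call.

```r
# Hypothetical variant, invented for this illustration only
# 'extra' has a default value of 0, so it becomes optional
calc_tests_optional <- function(site1, site2, site3, extra = 0) {
  needs <- ((site1 + site2 + site3) * 1.1) - extra   # same equation as before
  return(needs)
}

# With 'extra' provided: (200 + 550 + 1850) * 1.1 - 420
calc_tests_optional(site1 = 200, site2 = 550, site3 = 925 * 2, extra = 420)

# With 'extra' left out, the default of 0 is used
calc_tests_optional(site1 = 200, site2 = 550, site3 = 925 * 2)
```

Default values are a common design choice for arguments that most users will rarely need to change.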
If you write a function that is useful to others, you can publish it in an R package - a unit of multiple related functions. Everyone else can test and try your functions, and your work can help people around the world! This is the beauty of open-source software.
This chapter in the Epi R Handbook covers more detail about importing and exporting data, including:
We will discuss the use of here() in the next module.
Read about the {vroom} R package, which is made for quickly importing very large datasets. Try it out!