In appliedepi/introexercises:

library(introexercises)  # get data for exercises 
library(learnr)          # create lessons from rmd
library(gradethis)       # evaluate exercises
library(dplyr)           # wrangle data
library(flair)           # highlight code
library(ggplot2)         # visualise data
library(lubridate)       # work with dates
library(fontawesome)     # for emojis
library(janitor)         # clean data
library(tidyr)
library(forcats)
library(scales)
# library(RMariaDB)        # connect to sql database

## set options for exercises and checking ---------------------------------------

## Define how exercises are evaluated 
gradethis::gradethis_setup(
  ## note: the below arguments are passed to learnr::tutorial_options
  ## set the maximum execution time limit in seconds
  exercise.timelimit = 60, 
  ## set how exercises should be checked (defaults to NULL - individually defined)
  # exercise.checker = gradethis::grade_learnr
  ## set whether to pre-evaluate exercises (so users see answers)
  exercise.eval = FALSE 
)

# ## event recorder ---------------------------------------------------------------
# ## see for details:
# ## https://pkgs.rstudio.com/learnr/articles/publishing.html#events
# ## https://github.com/dtkaplan/submitr/blob/master/R/make_a_recorder.R
# 
# ## connect to your sql database
# sqldtbase <- dbConnect(RMariaDB::MariaDB(),
#                        user = 'sander', 
#                        password = 'E9hqb2Tr5GumHHu',
#                        # user     = Sys.getenv("userid"),
#                        # password = Sys.getenv("pwd"),
#                        dbname   = 'excersize_log',
#                        host     = "144.126.246.140")
# 
# 
# ## define a function to collect data
# ## note that tutorial_id is defined in YAML
#     ## you could set the tutorial_version too (by specifying version:) but use package version instead
# recorder_function <- function(tutorial_id, tutorial_version, user_id, event, data) {
# 
#   ## define a sql query
#   ## first bracket defines variable names
#   ## values bracket defines what goes in each variable
#   event_log <- paste("INSERT INTO responses (
#                        tutorial_id,
#                        tutorial_version,
#                        date_time,
#                        user_id,
#                        event,
#                        section,
#                        label,
#                        question,
#                        answer,
#                        code,
#                        correct)
#                        VALUES('", tutorial_id,  "',
#                        '", tutorial_version, "',
#                        '", format(Sys.time(), "%Y-%M%-%D %H:%M:%S %Z"), "',
#                        '", Sys.getenv("SHINYPROXY_PROXY_ID"), "',
#                        '", event, "',
#                        '", data$section, "',
#                        '", data$label,  "',
#                        '", paste0('"', data$question, '"'),  "',
#                        '", paste0('"', data$answer,   '"'),  "',
#                        '", paste0('"', data$code,     '"'),  "',
#                        '", data$correct, "')",
#                        sep = '')
# 
#     # Execute the query on the sqldtbase that we connected to above
#     rsInsert <- dbSendQuery(sqldtbase, event_log)
# 
# }
# 
# options(tutorial.event_recorder = recorder_function)

# hide non-exercise code chunks ------------------------------------------------
knitr::opts_chunk$set(echo = FALSE)

# data prep --------------------------------------------------------------------
surv_raw <- rio::import(system.file("dat/old_version/surveillance_linelist_20141201.csv", package = "introexercises"))

malaria_counts <- rio::import(system.file("dat/malaria_facility_count_data.rds", package = "introexercises"))

datadict <- rio::import(system.file("dat/linelist_datadict.csv", package = "introexercises"))

site_coverage_tidy <- rio::import(system.file("dat/messy_data_examples.xlsx", package = "introexercises"), sheet = "tidy_site_coverage")

Introduction to R for Applied Epidemiology

Welcome

Welcome to the live course "Introduction to R for applied epidemiology", offered by Applied Epi - a nonprofit organisation that offers open-source tools, training, and support to frontline public health practitioners.

knitr::include_graphics("images/logo.png", error = F)

File management, data import, tidy data, and exploratory analysis

This exercise focuses on organizing your files in an RStudio project and how to import datasets into R.

Format

This exercise will guide you through a set of tasks.
You should perform these tasks in RStudio on your local computer.

Getting Help

There are several ways to get help:

1) Look for the "helpers" (see below) 2) Ask your live course instructor/facilitator for help
3) Ask a colleague or other participant in the course for tips
4) Post a question in Applied Epi Community in the category for questions about Applied Epi Training

Here is what those "helpers" will look like:

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Here you will see a helpful hint!

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

linelist %>% 
  filter(
    age > 25,
    district == "Bolo"
  )

Here is more explanation about why the solution works.

Quiz questions

Please complete the quiz questions that you encounter throughout the tutorial. Answering will help you to comprehend the material, and will also help us to improve the exercises for future students.

To practice, please answer the following questions:

quiz(
  question_radio("When should I view the red 'helper' code?",
    answer("After trying to write the code myself", correct = TRUE),
    answer("Before I try coding", correct = FALSE),
    correct = "Reviewing best-practice code after trying to write yourself can help you improve",
    incorrect = "Please attempt the exercise yourself, or use the hint, before viewing the answer."
  )
)

question_numeric(
 "How anxious are you about beginning this tutorial - on a scale from 1 (least anxious) to 10 (most anxious)?",
 answer(10, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(9, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(8, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(7, message = "Try not to worry, we will help you succeed!", correct = T),
 answer(6, message = "Ok, we will get there together", correct = T),
 answer(5, message = "Ok, we will get there together", correct = T),
 answer(4, message = "I like your confidence!", correct = T),
 answer(3, message = "I like your confidence!", correct = T),
 answer(2, message = "I like your confidence!", correct = T),
 answer(1, message = "I like your confidence!", correct = T),
 allow_retry = TRUE,
 correct = "Thanks for sharing. ",
 min = 1,
 max = 10,
 step = 1
)

Icons

You will see these icons throughout the exercises:

Icon |Meaning ------|-------------------- r fontawesome::fa("eye", fill = "darkblue")|Observe
r fontawesome::fa("exclamation", fill = "red")|Alert!
r fontawesome::fa("pen", fill = "brown")|An informative note
r fontawesome::fa("terminal", fill = "black")|Time for you to code!
r fontawesome::fa("window-restore", fill = "darkgrey")|Change to another window
r fontawesome::fa("bookmark", fill = "orange")|Remember this for later

License

Applied Epi Incorporated, 2022
This work is licensed by Applied Epi Incorporated under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Please email contact@appliedepi.org with questions about the use of these materials for academic courses and epidemiologist training programs.

Learning objectives

In this exercise you will:

Arrange an R project with well-named files
Import a CSV dataset into R
Conduct exploratory analyses
Explore "tidy data" formats and common pitfalls with spreadsheets
Review best practices for naming files, data collection, and data dictionaries

Setup

In this module, you will continue to write code in your script "demo.R" from the previous module.

Open your RStudio project

Begin by opening your RStudio project, by double-clicking the "intro_course.Rproj" file in your "intro_course" folder on your desktop.

This will open the RStudio project with all of the connections between files intact. In the far upper-right corner of the RStudio project you will see "intro_course" displayed. If you do not see this, ask your instructor for assistance.

# adding xfun::relative_path() creates a dynamic file path between the Rmd location and the here() path.
# It dynamically creates the ../../etc filepath.

knitr::include_graphics("images/Rproject_dropdown.png", error = F)

Recall that an RStudio project is a self-contained and portable R working environment - a folder for all the files associated with a distinct project (data files, R scripts, outputs, etc.).

You can zip an RStudio project and email it to someone else. They should be able to open it and replicate your work exactly!
An RStudio project can be linked to a repository on version-control software like Git/Github

If you do not work in an R project, it will be harder to organize files, to share your work with others, and you may encounter lots of frustration struggling to import files with something called the "working directory". When you use an R project, your scripts become much more easy to share and less likely to "break" when someone else uses them.

Open your R script

Look in the Files pane and open your script "demo.R" from the previous module.

Load packages

Recall that you have to load your R packages each time you open R. This is different from installing (downloading) the packages, which only happens one time.

You have written code already to install and load the packages. Ensure that this code is located at the top of your "demo.R" script, and re-run this command.

pacman::p_load(
     rio,          # for importing data
     here,         # for file paths
     janitor,      # for data cleaning
     lubridate,    # for working with dates
     flextable,    # for making pretty tables
     tidyverse     # for data management
)

Since you have already installed these packages, this command should run very quickly as it loads them for use.

Project organization

File naming can impact your quality of life

File naming can seem like a small thing to focus on, but the ramifications can be significant. Proper file naming can really save you time and headache, and switching your practice can be easy.

Imagine the power of R - you can write code to automatically import the most recent data files by the date in their file name! But this can be hindered by poor file naming practices.

Even if you never try coding such as that, you surely know that bad file naming practices can cause significant problems for version control and archiving.

Good and bad file names

Have you ever seen files like these?

r emo::ji("cross mark") notes.docx
r emo::ji("cross mark") a.R
r emo::ji("cross mark") 2b.xlsx
r emo::ji("cross mark") notes.txt

How about something like this?

r emo::ji("cross mark") case control San Jose factory.pptx
r emo::ji("cross mark") abstract for Mark?.xlsx
r emo::ji("cross mark") December 20 update.R
r emo::ji("cross mark") draft "malaria study" report(1).docs

These are very difficult to work with. Try to follow these principles when naming files (your future self will thank you!):

Be nice to machines
Be nice to humans
Make sorting and searching easy

Let's discuss tips to achieve each of these principles.

Names that are machine-readable

Don't use white (empty) space in file names (machines sometimes get confused by spaces)

r emo::ji("cross mark") Draft Vax outreach team.docx
r emo::ji("check") vaccination-outreach_draft01.docx

Use letters, numbers, hyphens (-), and underscores (_) only (special characters can have special meanings and confuse machines e.g. ^.*?+|$ )

r emo::ji("cross mark") Ministry report ?.docx
r emo::ji("check") ministry_report_draft01.docx

Never have two file names that differ only by case (some operating systems treat b and B the same, while others treat them as different), and be consistent with your case (usually lowercase is better)

r emo::ji("cross mark") study.docx
r emo::ji("cross mark") Study.docx
r emo::ji("cross mark") Belgium-and-france.docx

r emo::ji("check") study.docx
r emo::ji("check") belgium-and-france.docx

Names that are human-readable

Use hyphens and underscores differently

Hyphen should mean: "different words, but part of the same idea"
Underscore should mean: "different idea"

r emo::ji("check") 20200316_goma_linelist.xlsx
r emo::ji("check") transmission-analysis_california_mmwr.docx

Names that sort easily

Dates should be written as YYYY-MM-DD (the ISO 8601 standard). Usually, it is best to place the dates before other name components to preserve chronology.

File names with this style of dates do not sort in chronological order!

r emo::ji("cross mark") 1-April-2012_linelist.xlsx
r emo::ji("cross mark") 1-Jan-2009_linelist.xlsx
r emo::ji("cross mark") 1-Jan-2012_linelist.xlsx
r emo::ji("cross mark") 12-Jan-2012_linelist.xlsx
r emo::ji("cross mark") 2-Jan-2012_linelist.xlsx
r emo::ji("cross mark") 31-Dec-2009_linelist.xlsx

But this style does sort correctly!

r emo::ji("check") 2009-01-01_linelist.xlsx
r emo::ji("check") 2009-12-01_linelist.xlsx
r emo::ji("check") 2009-12-31_linelist.xlsx
r emo::ji("check") 2012-01-01_linelist.xlsx
r emo::ji("check") 2012-01-02_linelist.xlsx
r emo::ji("check") 2012-04-01_linelist.xlsx

To order files without dates, use numbers as prefixes (left pad with 0 so all numbers have the same length!)

r emo::ji("check") 01_introduction.docx
r emo::ji("check") 02_methods.docx
r emo::ji("check") 03_analysis.docx

... (more chapters)...

r emo::ji("check") 19_appendix-04.docx
r emo::ji("check") 20_appendix-05.docx
r emo::ji("check") 21_appendix-06.docx

In the end, file naming does not come with hard rules - it is a personal choice. However, certain choices can lead to more frustration, lost files, and coding headaches later.

quiz(caption = "Quiz - file names",
  question("What is problematic about this file name (select ALL that apply):\n\nSARS.Exposure.listing 1stdraft 01032016.csv",
    allow_retry = T,
    answer("The date is incorrectly ordered", correct = T, message = "The date should be in YYYY-MM-DD format"),
    answer("It is actually for Measles, not SARS ", message = "You don't know that! This is an example :-)"),
    answer("It uses special characters", correct = T, message = "File names should avoid periods"),
    answer("It uses spaces", correct = T, message = "Avoid spaces in file names"),
    answer("The date should be at the beginning", correct = T, message = "For proper ordering, the date should be at the beginning")
  )
)

Clean file names

r fontawesome::fa("window-restore", fill = "darkgrey") In your File Explorer, view the "intro_course" folder that we gave to you at the beginning of the course, in which you have created the R project.

Here is how your folder structure should look:

r emo::ji("folder") project ("intro_course")
- r emo::ji("folder") data
  - r emo::ji("folder") clean
    - r emo::ji("folder")backup
  - r emo::ji("folder") raw
    - r emo::ji("folder")backup
  - r emo::ji("folder")pop
  - r emo::ji("folder") shp
- r emo::ji("folder") scripts
  - r emo::ji("folder")backup
- r emo::ji("folder") outputs
- r emo::ji("folder") learning materials
  - r emo::ji("folder") slides
  - r emo::ji("folder") covid_case_study
- r emo::ji("document") surveillance_linelist_20141201.csv
- intro_course.Rproj

r emo::ji("warning") You may also have some other files like ".Rhistory" - do NOT delete these. Just leave them.

Clean file names

r fontawesome::fa("window-restore", fill = "darkgrey") Now enter the folder "data/raw/hospital".

Regard the files - these are linelists of Ebola patients admitted for care, as sent to you on the 1st of December 2014, from 4 different hospitals (Port Hospital, Military Hospital, St. Mark's Maternity Hospital (SMMH), and Central Hospital). There are also files containing records from "Other" hospitals, and of patients for whom the hospital is "Missing".

How can these file names be improved? Both for human readability, but also to facilitate standardized code that we may write to import them into R?

Edit the file names to that they are standardized with best practices above

When finished, view the "solution" below. We suggest that you align your file names with these.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

20141201_hosp_central.csv
20141201_hosp_military.csv
20141201_hosp_port.csv
20141201_hosp_smmh.csv
20141201_hosp_other.csv
20141201_hosp_missing.csv

Project structure

A few remarks here on folder structure:

Folder structure

Typically, public health analysis folders will contain subfolders such as:
- data (often subdivided into "raw" and "clean", or by time period)
- scripts (or sometimes, just called "R")
- outputs
- images (if you are embedding images in your R documents)

README files

Most folders should have a README file that explains what lives in the folder, how it is updated, how it is used, etc.

In a finished project, the README should help users navigate and understand the contents.
In a project template, the README contains instructions for you to setup the project!

Now, add a .txt file that is named README.txt. Place the brief information that you learned above about where these data come from (the hospitals) in this file, and save it in the "raw" folder as README.txt.

You can do this by opening NotePad or a similar software, entering the text, and saving it in the folder. If you cannot find a plain text editor, you can use Microsoft Word.

Concepts in this section on project structure were borrowed from this slide deck on project structure by DJ Navarro.

Importing a dataset

To begin working with data, you must import it into the R environment as an object.

Once imported, the dataset will be saved as a data frame object, which consists of columns and rows.

Script organization

Recall that often a script is run from top-to-bottom. Therefore, you will need to organize your script in a logical manner.

Typically, you begin with loading packages, and then continue to importing data.

Create a new script section, using the Ctrl + Shift + R shortcut, and name it "Practice importing data"

Location of the data

In order to import data into R, you must tell R where to access the data file on your computer (e.g. a specific folder).

This can be surprisingly difficult (e.g. when data are stored on shared drives). However, by saving the data within our R project the whole process becomes much easier.

{rio} - the easy package for importing data

There are {base} R functions for importing data, but they can be confusing and difficult to remember - there are separate functions for each type of file (e.g. .xlsx, .csv, .tsv, .txt, etc).

Thankfully, there is one function that works for almost all file formats, which is the import() function from the package {rio}.

Let us find our data file

Do you see the file "surveillance_linelist_20141201.csv" that is saved in the "root" project folder? This is the baseline folder - the same folder as the "intro_course.Rproj" file. To access the root folder from the R project file you do not need to click into any sub-directories (sub-folders).

Run the import() command

In its most simple form, the import() function accepts a character value - the file path to the data that you wish to import. In this case, the data file which we want to import is saved in the root folder, so you only need to provide the file name and extension, in quotation marks, as below:

import("surveillance_linelist_20141201.csv")

r fontawesome::fa("exclamation", fill = "red") Did you see this error?

Error in import("surveillance_linelist_20141201.csv") : could not find function "import"

If so, it means that you did not install and load the {rio} package. If you need help, review the previous section and look for the pacman::p_load() command that loads several packages including {rio}.

Once you get the command to run successfully, you probably saw a lot of text appear in the Console. That was the dataset!

Think: what did your command ask R to do? It asked R to import the dataset... and the default action is to print/display it.

You did not ask R to save the dataset as a named object. How would you do that?

In RStudio, try again to import the data and save it as the object surv_raw (a short name for "surveillance linelist raw").

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Use the assignment operator <- before the function. Don't forget quotation marks around the name of the file.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv_raw <- import("surveillance_linelist_20141201.csv")

You should now see the new object surv_raw appear in the R Environment pane, under the subcategory "Data". It should have a short description of the number of observations and variables. Congratulations, you have now imported a dataset into R!

Importing data from subfolders

In most cases, it is better organizationally to store data in a subfolder - even perhaps by week, or by status as either "raw" or "clean". In this course, most of the data are initially stored in a subfolder named "data".

r fontawesome::fa("window-restore", fill = "darkgrey") Now, go to your File Explorer to the root folder of the R project.

!!! Cut the file "surveillance_linelist_20141201.csv" from the root folder, and paste it into the "data/raw" subfolder. !!!

And so we are introduced to the topic of file paths.

Any importing function needs to know where to look for the file that you want to import - this is the file path.

A typical file path ("absolute" file path, or a "full" file path) may look like this:

surv_raw <- import("C:/Users/Laura/Documents/intro_course/data/surveillance_linelist_20141201.csv")

The problem with this is all of the path before "intro_course" ... the C: ... the "Laura" ... etc. None of these may be present in the computer of Laura's colleague, and this command would not run. The file path would be "broken". Or, the next user may be using a Mac, so the slashes would all need to be reversed! Thankfully there is a solution:

The {here} package and relative file paths

The package {here} and its function here() make it easy to tell R where to find and to save your files - in essence, it builds relative file paths.

Used in conjunction with an RStudio project, {here} allows you to describe the location of files in your RStudio project in relation to the project’s root directory (the top-level folder). This is useful when the project may be shared or accessed by multiple people/computers or when a script may be moved around within the project.

Used with an RStudio project, it prevents complications due to the unique file paths on different computers (e.g. "C:/Users/Laura/Documents...") by “starting” the file path in a place common to all users (the project root).

r fontawesome::fa("terminal", fill = "black") Make sure you have installed and loaded the package {here}. Then try running the command here() - leave the parentheses empty. What do you see?

In "Laura"'s situation, the output might look like this:

"C:/Users/Laura/Documents/intro_course")

here() has automatically created the full file path that comes before the RStudio project root folder ("intro_course").

It is easy to add sub-folders at the end of the path, like this:

here("data", "raw")

This would produce the file path:

"C:/Users/Laura/Documents/intro_course/data/raw")

Ending the here() command with a file name and extension could look like this:

here("data", "raw", "surveillance_linelist_20141201.csv")

This would produce the file path:

"C:/Users/Laura/Documents/intro_course/data/raw/surveillance_linelist_20141201.csv")

Now comes the secret combination: we insert this here() command into the import() command! We nest one function inside another one! Take a look at this:

# import the file from the "data" and "raw" subfolders
surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

Note the two parentheses at the end - one for each function. The file path returned by here() is passed to the first argument of import() as a character value.

A note on slashes

You may wonder - where are the forward slashes and back slashes in the here() command? Well, here() allows the slash-agnostic entry format above because it can adapt to whether you are working on a Mac or a PC!

On a PC:

project/data/linelist_20141201.csv
project/data/linelist_20141208.csv

On a Mac:

project\data\linelist_20141201.csv
project\data\linelist_20141208.csv

You do not know who will be using your script later, or what kind of computer they will have. Therefore, it is easier to simply go with: here("data", "linelist_20141201.csv").

Import from a subfolder

r fontawesome::fa("terminal", fill = "black") Now in RStudio, write a command to import "surveillance_linelist_20141201.csv" from the subfolder "data/raw" (where you moved it, earlier) and save it in R as surv_raw using the assignment operator.

r fontawesome::fa("lightbulb", fill = "gold") Click to read a hint

Begin the command with the object name that you want to assign, then the assignment operator, and then the import() command. Within the import() parentheses, place the here() function, which should list the names of the sub-folders (in quotes) and include the file name with extension (.csv).

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

surv_raw <- import(here("data", "raw", "surveillance_linelist_20141201.csv"))

quiz(caption = "Quiz - importing",
  question("How would you write the import() and here() command for the following filepath (assuming the R project is located in the folder 'malaria')?:\n\nE:/Mario/My Documents/R stuff/malaria/received/wk22/cases.csv",
    allow_retry = T,
    answer("import() here('received/wk22/cases.csv')", message = "No, the here() function must be nested within the import() function."),
    answer("import(here('received', 'wk22', 'cases', '.csv'))", message = "No, include the file extension with the file name."),
    answer("here(import('received', 'wk22', 'cases.csv'))", message = "No, the here() function must be nested within the import() function, because it is providing the file path to it."),
    answer("import(here('received', 'wk22', 'cases.csv'))", correct = T)
  )
)

Manual file selection

A useful alternative to providing a file path is using the {base} R function file.choose(). This triggers a pop-up window to appear, from which you can manually select the file to import.

Your import command would look like this (note the empty parentheses at the end of file.choose():

surv_raw <- import(file.choose())

This approach has the disadvantage of not being as well documented (for reproducibility) but can be extremely useful if you are routinely receiving updated data by email and simply need to select the most up-to-date version saved to a local folder.

In the Epi R Handbook chapter on importing, there is sample code on how to write code to automatically import the most recent dataset from a folder. However, file.choose() can often be a more simple alternative.

Try to write a command with file.choose() in your R script and test this process. If you do not see the pop-up window, check behind your RStudio window.

Inspect a dataset

Now that you have imported the data, let's take a look!

Take a peek

r fontawesome::fa("eye", fill = "darkblue") The dataset is now saved as a data frame object in the RStudio Environment pane. Find its name there, and click once on the blue circle next to the name. What do you see?

The expanded area beneath the data frame is an overview of all the columns in the data frame. After the name of the column, there is a colon, then an abbreviation that indicates the "class" of the column (we'll discuss this more later), and then the first few values in the column. How exciting!

For even more detail, collapse this overview and click on the name of the data frame itself (surv_raw).

This should open a new tab in the upper-left pane, next to the R script. This tab will display the data frame!

Practice scrolling through the data frame, and applying filters (see the icon in the upper-left of the display).

Try these exercises:

quiz(caption = "Quiz - data review",
  question("How many columns are there in the data frame?",
    allow_retry = T,
    answer("665", message = "No, this is the number of rows/observations"),
    answer(ncol(surv_raw), correct = TRUE, message = "Yes this is correct!"),
    answer("19", message = "This is not correct, try again."),
    answer("25", message = "No, this is the number of rows shown in the initial display")
  ),

  question("What is the value in the 4th column and the 50th row?",
    allow_retry = T,
    answer(surv_raw[50, 4], correct = TRUE),
    answer("Port Hospital"),
    answer("38.3"),
    answer("West II")
  ),

  question("How many of the rows include a weight 40 to 45 kg (inclusive)?",
    allow_retry = T,
        answer("112"),
        answer(surv_raw %>% filter(`wt (kg)` >= 40 & `wt (kg)` <= 45) %>% nrow(), correct = TRUE),
        answer("226"),
        answer("95")
  )
)

Indexing

Above we asked you to find the value in the 4th column, and the 50th row. Instead of searching manually, you can also write code to isolate and view pieces of the dataframe.

The $ operator can be used on a data frame to reference a specific column. It is one of the indexing operators in R, which returns a sub-part of a larger object. For example, surv_raw$gender will return the entire column gender.

Try typing and running the following commands

surv_raw$hospital
surv_raw$case_id

As you typed, slowly, did you see anything appear as you arrived at typing the $? You should have seen a small menu appear that showed all the columns in the data frame surv_raw. Using your arrow keys you can select a column from that menu - if you do not want to type the rest and potentially make a spelling mistake.

Try writing commands to show other columns, and experiment with this pop-up menu.

Here is something you should know: columns can also be referred to as "vectors" - vectors are a long line of values of the same class. So in fact, a data frame is simply a collection of vectors (columns!).

You can also create other, independent vectors, like you have already done with c(). But most commonly, you will work with vectors in the context of a data frame. "Vector" is a good term to know.

Another indexing operator to be aware of is the "square brackets" that look like [ ]. These brackets can also be used to isolate/return a sub-part of a larger object: dataframe[ROW, COLUMN]
This command would return the value at the 12th row and the 4th column: surv_raw[12, 4]

You can return an entire column by leaving the ROWS part empty (but don't forget the comma!): surv_raw[ ,4]
Or an entire row by doing the opposite: surv_raw[12, ]

Other objects can be subset as well. For example, the summary() function when used on a column returns an object that contains the minimum, median, mean, IQR, etc.

Try these three commands in R to see how the output changes as further indexing is applied:

# Return the summary of the column 'age'
summary(surv_raw$age)

# Return the 3rd element of the summary
summary(surv_raw$age)[3]

# Return the number only, of the 3rd element of the summary
summary(surv_raw$age)[[3]]

# Do something with the number
summary(surv_raw$age)[[3]] + 4

You will use the $ very frequently. It is less common to need the brackets. But still - good to know about in case you see them or need to use them.

Skim

You can return a good summary of the dataset using the R package {skimr} and its function skim().

Use p_load(skimr) or install.packages("skimr") to install this package and load it for use.

If you encounter any errors when trying to install the package, notify your facilitator.

Place the name of the linelist in skim() and run. What content appears in the R console?

If the output is not easily readable, expand your Console pane to be wider, and run the command again.

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

skim(surv_raw)

What does the output show?

quiz(caption = "Quiz - reviewing the data",
  question("How many columns has R classified as numeric?",
    allow_retry = T,
    answer("32", message = "No, this is more than the columns in the dataset!"),
    answer(surv_raw %>% select(where(is.numeric)) %>% ncol(), correct = TRUE, message = "Yes this is correct!"),
    answer("13", message = "No, this is the number of columns of class character"),
    answer("1", message = "No, this is the number of columns that are POSIXct - a type of date format")
  ),

  question("Review the information on character columns. What is the number of unique values in the hospital column?",
    allow_retry = T,
    answer(length(unique(surv_raw$hospital)), correct = TRUE),
    answer("Port Hospital"),
    answer("3"),
    answer("8")
  ),

  question("Review the information on numeric columns. What is the median age?",
    allow_retry = T,
        answer("32.2"),
        answer(median(surv_raw$age, na.rm=TRUE), correct = TRUE),
        answer("1250"),
        answer("99.0")
  )
)

Tidy data

In the subsequent sections, we will briefly review concepts of "tidy data" and some common pitfalls that prevent easy analysis of spreadsheet data in R.

knitr::include_graphics("images/tidy_shelves.png")

How you structure your dataset will greatly impact how complicated your R code needs to be. Advance attention during your data entry phase will save you a lot of time during data cleaning and analysis!

Reflect: What are some features of your data that make it particularly difficult to clean for analysis?

Definitions

Tidy data has a specific meaning in data analysis - it refers to how information has been stored within the structure of the dataset. So let's begin with some terminology.

Structurally, datasets (called "data frames" in R) consist of cells, columns, rows.

However, "values", "variables" and "observations" are more abstract concepts.

Variables measure one underlying attribute (age, outcome, or date of onset)
Observations relate to a unit of analysis
Values are one piece of information associated with one variable and one observation

Let's test your understanding of these terms.

quiz(
  question("50 kg, 75 kg, 67 kg, 90 kg are...",
    answer("Different observations of the variable `temperature`", message = "Think again! It does not make much sense that `temperature` would be measured in kilograms! It is also unlikely that these values would be referred to as observations..."),
    answer("Different variables of the observation `weight`", message = "Think again! These values all have the same unit, so they would probably be one variable rather than different ones. `weight` as an observation might also not be the most logical way of recording this data..."),
    answer("Different values of the variable `weight`", correct = TRUE, message = "Indeed, these are correct values (in this case we are dealing with numbers, so quantitative data). As the different values have the same unit, they can be grouped under one variable (in this case, weight would make sense)."),
    answer("None of these answers are correct", message = "Are you sure? Try and think of what these different terms refer to!"),
    answer("All of these answers are correct", message = "Are you sure? Try and think of what these different terms refer to!"),
    allow_retry = TRUE
  ),
  question("`Name` would most likely be...",
    answer("A value", message = "If this was a value, what would be the variable and observation it could be grouped under?"),
    answer("A variable", correct = TRUE, message = "Indeed, different values could include 'Josephine', 'Alex' or 'Billy'."),
    answer("An observation", message = "If this was an observation, what is the unit of measure? How would you get other observations?"),
    allow_retry = TRUE
  ),
  question("Thinking of data structure, how would an observation best be represented? ",
    answer("As a row", correct = TRUE, message = "That's right! More on this below..."),
    answer("As a column", message = "Think again! Is this the most efficient way of representing your data?"),
    answer("As a cell", message = "Think again! Is this the most efficient way of representing your data?"),
    allow_retry = TRUE
  )
)

Principles of tidy data

Your data can be stored in many ways - why is it important to have data that is "tidy"?

Consistency in data structure, consistency in tools, and efficient analyses
Placing variables in columns allows may R function to easily handle your data (e.g. ggplot() for plotting) - the packages from the {tidyverse} are designed to work with tidy data

There are three principles that make a dataset “tidy”:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

Source: R for Data Science

Ideally, they align: columns = variables and rows = observations

knitr::include_graphics("images/tidy_image.png")

But this is not always the case... especially in public health data. You have surely seen data entered in this "wide" format:

long_example <- tribble(
     ~country, ~January, ~February, ~March, 
     "Mozambique", 3200,    3300,      4100,  
     "Lesotho", 500,     750,       900,   
     "South Africa", 5100, 6200, 8100,)

long_example

quiz(
  question("'January' is a column, but is it a variable?",
    answer("Yes", message = "No, remember a variable is the more abstract, underlying attribute being measured."),
    answer("No", correct = TRUE, message = "January is one value of the variable 'month'"),
    allow_retry = TRUE
  ),
  question("Where is the variable 'month'?",
    answer("In the third column", message = "No, it is not in one column"),
    answer("In a different spreadsheet", message = "You are being silly."),
    answer("It is spread across many columns", correct = TRUE, message = "Correct, it does not have its own column."),
    allow_retry = TRUE
  ),
  question("Does each observation have its own row?",
    answer("Yes", message = "No, each value is an observation and there are three in each row"),
    answer("No", correct = TRUE, message = "Correct, there are three observations in each row."),
    allow_retry = TRUE
  )
)

Here is the same data, but in a "tidy" format:

long_example_pivoted <- long_example %>%
  pivot_longer(cols=2:4, names_to="month", values_to = "cases") 

long_example_pivoted

Reflecting on tidy data practices, what proportion of the datasets you use are tidy?

"Machine-readable" data

When you first begin to collect data, you must ask the question: is the primary audience of this dataset humans or machines?

Recording information such that it is optimized for for "human-readability" can be very different than optimization for "machine-readability" and analysis. Be clear from the beginning what your priority is. Please allow us to give some general advice - it is generally easier to transition from machine-readable to human-readable than the other way!

Excel spreadsheets

At Applied Epi, we promote the use of R for many reasons, but we know that for most applied epidemiologists, Excel is tool that is fundamental to their workflows. There is nothing wrong with using Excel. Workflows that involve R almost always also involve Excel. But it is important that you use Excel in a manner that allows you to also maximize the benefits of R.

knitr::include_graphics("images/interoperability.png")

Excel is a powerful and beginner-friendly software. While it is also possible to do some analysis in Excel, you will likely find that more sophisticated analyses and data management operations may be very complicated or impossible to carry out in this software. These situations are where versatile programs such as R become very useful.

knitr::include_graphics("images/difficulty_plot-1.png")

In this course, you will become more familiar with R code syntax. In this section, we will focus on steps that you can take to ensure that an Excel dataset can be easily interpreted by R for analysis.

The main reason one encounters problems analyzing data from Excel spreadsheets is when the spreadsheet was designed to prioritize easy reading by humans, not easy reading by machines/software.

To help you see the difference, below are some fictional examples of spreadsheets that prioritize human-readability over machine-readability.

Importing data from spreadsheets

Write a command to import the spreadsheet "messy_data_examples.xlsx" and store it as the object partner_tracking. The spreadsheet is saved in the folder "intro_course/data/raw".

r fontawesome::fa("check", fill = "red")Click to see a solution (try it yourself first!)

partner_tracking <- import(here("data", "raw", "messy_data_examples.xlsx"))

Now, open the spreadsheet using Microsoft Excel. If you can do this, note how there are actually 4 sheets in this Excel workbook. Using the import() function defaults has only imported the first sheet!

Revise your importing command to use the sheet = argument to specify the sheet name to import. Write and run one command for each of the sheets, as shown below:

# import sheet that tracks partners with color codes
partner_tracking <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_colors")

# import sheet with messy format site coverage
site_coverage <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_site_coverage")

# import sheet with "tidy" formatted site coverage data
site_coverage_tidy <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "tidy_site_coverage")

# import sheet with messy GIS data
messy_gis <- import(here("data", "raw", "messy_data_examples.xlsx"), sheet = "messy_gis")

Now that you have imported these four sheets, let us review them.

Colors

The sheet that was imported and named parter_tracking records the status of a public health intervention by various response partners in several Provinces, Districts, and sub-Districts. On the right side, the status for particular dates is indicated by the cell color, and there is a small color-based dictionary in the upper-right of the sheet.

knitr::include_graphics("images/messydata2.PNG")

We must ask ourselves - is this sheet written to maximize human readability, or machine readability? A prime example of prioritizing human-readability over machine-readability is the use of color-based coding of cells in a spreadsheet.

It is quite difficult in R to interpret the color of each cell in an Excel spreadsheet. If you use color, you should not only use color - also make the values reflect the differences between cells.

Storing information like in this sheet is not easily interpreted by R - nor by humans with color-blindness!

Furthermore, different pieces of information are combined into one cell (multiple partner organizations working in one area, or the status “TBC” in the same cell as “Partner D”).

question("In this case, what do you think would be the best modification before loading the dataset to R?",
  answer("Keep as is, R will recognise the colours.",
         message = "It is very difficult for R to interpret the colors."),
  answer("Keep as is, but add text to the colours.",
         message = "This would help as R cannot read colours. However, this data format would still not be well read in R and the colour dictionary at the right would cause issues."),
  answer("Keep as is, but move the colour dictionary to another sheet.",
         message = "Moving the colour dictionary to another sheet would be helpful and could be loaded separately as a lookup, however R cannot read colours so you would still face issues when loading the main table."),
  answer("Change the dataset so that each column is a variable and each cell contains one unique value, integrating the colours as a new variable.", correct = TRUE,
         message = "Indeed, transforming this data to a tidy format is absolutely necessary if any analysis will be performed on this data. Removing the necessity for colours by adding a variable to specify what these represent would be the way to go."),
  answer("Change the dataset by adding a column to specify what the colour means.",
         message = "Whilst this is generally the way to go to transform a dataset using a colour dictionary before uploading it to R, this dataset contains other issues making it untidy and hard to analyse in the current format."),
  allow_retry = TRUE
)

While color-based dictionaries may be helpful for human-readability of a dataset, colors should never be used as the only way of recording data. Machines will not be able to interpret it for analysis.

Again, it is important to think of how best to represent that variable that is currently being represented by color - most likely you should restructure your data and represent it in its own column.

And as a general rule, dictionaries (whether they are colour dictionaries as shown here or data dictionaries) should be kept separately from the main table. In Excel, it would be best practice to have this dictionary on another sheet. When importing it to R, you can then import it as a separate data frame.

But we're getting ahead of ourselves... more on data dictionaries later!

Merged cells

The sheet that is now imported as site_coverage contains information about "coverage" at a number of sites in May and June 2022:

knitr::include_graphics("images/site_coverage_untidy.jpg")

Note the merged cells. Merged cells are often useful to make data human-readable but can cause problems for analysis by machines.

question("What issues do you think R will encounter when loading this dataset?",
  answer("R will not recognise the merged cells and will drop the value.",
         message = "Indeed, R cannot recognise merged cells. However it will not drop all those values."),
  answer("R will recognise the merged cell and also create a merged cell.",
         message = "Data frames in R cannot contain merged cells."),
  answer("R will not recognise the merged cell so will only keep the value in the first row and column of the merged area.", correct = TRUE),
  answer("R will recognise the merged cells and duplicate these values across the merged area.",
         message = "Unfortunately, most import functions in R will not do this by default."),
  allow_retry = TRUE
)

How did the merged cells import into R using the import() function from {rio}? Click on the data frame to open it in the Viewer pane. Look at row 8 for an example.

knitr::include_graphics("images/site_coverage_imported.jpg")

As you see, importing this dataset into R in the given Excel format leads to data loss in multiple ways:

Most columns do not have the right header
Many observations (i.e. rows) are missing data
The actual values of interest are not all in the right rows

Using the import() function, R will not recognize the merged cell formatting, and all cells except the first merged cell will be read-in as empty! This will lead to loss of data and make analysis difficult.

One solution is to use a different package to import the data. The {openxlsx} R package handles Excel workbooks with more precision than {rio}. Its function read.xlsx() offers an argument fillMergedCells = which can be set to TRUE. For this function, the argument for the sheet name is also sheet =.

sites <- openxlsx::read.xlsx(here("data", "raw", "messy_data_examples.xlsx"),
                             sheet = "messy_site_coverage",
                             fillMergedCells = TRUE)

The argument fillMergedCells = can result in the merged value appearing in all its cells, but the spreadsheet is still very difficult to analyze:

knitr::include_graphics("images/site_coverage_filled.jpg")

Tidy data

Which columns to use is still not clear - it would take many R commands to clean the data in order to produce even a simple tabulation of "Yes" values by site.

How would you enter these data into a spreadsheet in a manner that is "tidy" and machine-readable?

question("If these data were entered in a tidy format, what would be the column headers?",
  answer("Day, Cell, Province, X", message = ""),
  answer("Date, Province, Site, Status", correct=T, message = "Correct, these are the important variables, with Status being either Yes or No."),
  answer("May, June"),
  answer("A, B, Site, Date"),
  allow_retry = TRUE
)

See the sheet that you imported as the object site_coverage_tidy. This contains the same data as in site_coverage, but in a tidy format (also called "long" format). See how:

Every value has its own cell
Every observation has its own row
Every underlying variable has its own column

knitr::include_graphics("images/site_coverage_long.jpg")

The above format is not very easy for humans to read, but it is easily imported and handled by R.

Once in R, it is relatively easy to work with the dataset. We do not expect you to understand the code below, but know that it cleans and expands this tidy data so that every possible date and site are present in the data.

# import the long data 
site_coverage_clean <- site_coverage_tidy %>%    # create complete dataset 
  mutate(Date = ymd(Date)) %>%           # convert dates to proper class in R
  complete(                              # complete all sites and dates
    Date = seq.Date(
      from = min(Date),
      to = max(Date),
      by = "day"),
    Site = seq(1:14),
    fill = list(Status = "No")) %>%      # If not already listed in data, status is "No"
  mutate(Province = as_factor(ifelse(Site %in% 1:7, "A", "B")), # add Province
         Site = as_factor(Site))

Now the dataset has expanded from r nrow(site_coverage_tidy) rows to r nrow(site_coverage_clean) rows - all possible dates and sites - a complete data set! (one row for each cell in the original messy Excel spreadsheet!)

site_coverage_clean

Expanded possibilities

With the data in "tidy" format, the possibilities for transforming and analysing the data are wide open!

For example, we can use the {ggplot2} data visualization R package to create a "heat plot" that resembles the original Excel spreadsheet.

We do not expect you to understand or type this code right now. You will learn more about {ggplot2} in later sessions

# create heat tile plot
ggplot(data = site_coverage_clean,
       mapping = aes(x = Date, y = fct_rev(Site),
                     fill = Status, label = Status))+
  geom_tile(color = "white")+
  geom_text()+
  scale_x_date(
    date_breaks = "day",
    labels = scales::label_date_short(),
    expand = c(0,0))+
  scale_fill_manual(
    values = c(
      "Yes" = "darkgreen",
      "No" = "orange"))+
  theme_minimal(base_size = 16)+
  labs(title = "Site coverage",
       y = "Site")+
  facet_wrap(~Province, ncol = 1, scales = "free_y")

If you don't understand the above code, that is OK - we only want to show you that with a few line of R code you can create an "Excel-like", "human-readable" output that is much more easy to analyze than the original spreadsheet.

In many ways, this setup is more useful than the original Excel spreadsheet:

The data are able to be analyzed, whereas in the Excel they could not be
New rows added to the Excel can be re-imported and the plot updates in seconds
This plot can be exported to a PDF report and emailed to partner organizations

Unlike the Excel, this dataset in R can be analyzed! It takes only a few lines of code to tabulate Status by Province:

site_coverage_clean %>% 
  tabyl(Province, Status) %>% 
  qflextable()

Or by Date:

site_coverage_clean %>% 
  tabyl(Date, Status) %>% 
  arrange(desc(No)) %>% 
  qflextable()

Or the data can be aggregated into weeks and the number of unfilled spots tabulated:

site_coverage_clean %>% 
  group_by(week_of = floor_date(Date, "week")) %>% 
  summarise(days_coverage_needed = sum(Status == "No")) %>% 
  qflextable()

Or the data can be used to quickly make other informative plots:

site_coverage_clean %>% 
  filter(Status == "Yes") %>% 
  ggplot(mapping = aes(x = fct_infreq(Site)))+
  geom_bar(fill = "dodgerblue")+
  coord_flip()+
  theme_minimal(base_size = 16)+
  labs(title = "Number of days 'covered', by site",
       x = "Site",
       y = "Number of days with coverage")

Empty space

Examine Excel sheet "messy_gis", which records information about specific clinics, including GPS coordinates the operating partner organization, and bed capacity.

knitr::include_graphics("images/messydata.PNG")

quiz(
  question("What do you foresee may be issues when importing this into R? Tick all that are most likely to apply.",
    answer("I don't think there will be any issues when we upload this to R.", message = "The format will not be the same in R."),
    answer("R will not be able to read this dataset.", message = "R will be able to read this dataset, however it will require a lot of cleaning!"),
    answer("The colours won't appear.", correct = TRUE, message = "The colours won't appear, which may be an issue if the colours represent another variable!"),
    answer("The colours will appear but it is not clear what they represent.", message = "The colours will not appear in R."),
    answer("Some rows are empty.", correct = TRUE, message = "The extra empty rows will require an extra cleaning step."),
    answer("Some row values are empty.", correct = TRUE, message = "R can handle missing values in different cells. The issue with this dataset is that some empty cells imply another value (that from the row above)."),
    answer("Some columns are empty.", correct = TRUE, message = "The empty columns will require an extra cleaning step."),
    answer("Some column values are empty.", message = "R can handle missing values in columns."),
    answer("Spaces in the column names.", correct = TRUE, message = "Column names will be changed from what we see in Excel if they have spaces and as a result and may require an extra cleaning step!"),
    answer("Spaces in the cell values.", message = "R can handle spaces when it comes to cell values. Note that these values will then be recorded as a character (i.e. string) variable."),
    answer("Different recording formats in the same column.", message = "If the same column has different recording formats, R will still be able to read it. It may however save it in the wrong format which will require an extra cleaning step. Recording in different formats in the same column also does not follow the principles of tidy data."),
    allow_retry = TRUE
  )
)

The extra empty rows and columns within the dataset will cause cleaning headaches in R.

question("Which of the empty cells do you think will be most problematic?",
  answer("The first empty row as R will not recognise the column names",
         message = "R will actually recognise the second row as the header row in this case. So whilst this is not ideal, this empty row is not too problematic."),
  answer("The third empty row.",
         message = "The third empty row will be loaded as a row of `NA` values. Whilst not ideal, this can be fairly easily removed."),
  answer("The empty columns.", correct = TRUE,
         message = "These will be the most problematic as you will probably have to manually remove these in R, because one 'real' column (`Patients`) also does not contain values. This means that you will need to take note of every column number that you want to remove and do so manually!"),
  answer("All empty cells.",
         message = "All empty cells are not necessarily issues, as sometimes there is no data available for one particular variable (which may be data itself, or grounds for exclusion from analysis)."),
  allow_retry = TRUE
)

The empty columns in this dataset are likely to be the most problematic out of the options given above, as you will manually have to remove these in R.

The empty rows from row number 4 onwards are also a problem, as they actually imply the value that is given in the cell above, but will be recorded as NA values. This will require significant extra cleaning.

On the other hand, the values in the column Patients are actually missing. In this case, it would be wiser to write "missing" or some other indicator for this in each of these cells, to explicitly indicate that these data are missing for this particular variable.

Take a look at how the data were imported into R as the object messy_gis.

Excel-to-R resources

Here are some links to tutorials to help you transition to R from Excel:

R has robust ways to import Excel workbooks, work with the data, export/save Excel files, and work with the nuances of Excel sheets. It is true that some of the more aesthetic Excel formatting can get lost in translation (e.g. italics, sideways text, etc.). If your work flow requires passing documents back-and-forth between R and Excel while retaining the original Excel formatting, try packages such as {openxlsx}.

Data dictionaries

A data dictionary, also sometimes referred to as a 'key', is a separate table from your main recording sheet. This dictionary allows to specify what specific variables mean, whether they be column names, colours, or other.

Variable definition

A data dictionary describes the meaning, units, and range of values that each column holds.

Even if you are familiar with a dataset, the meaning of the column names may not be obvious! That is when data dictionaries become useful, as they will provide information on the column name meaning.

quiz(
  question("For example, what do you think a column named `age_cat5` represents?",
    answer("The 5th age category"),
    answer("The age of the five cats"),
    answer("The age category, as categorised by 5 years age ranges", correct = TRUE),
    answer("The age category, as categorised in 5 groups"),
  allow_retry = TRUE
  )
)

Listing possible values and levels

Data dictionaries can also used to specify the acceptable values for a variable.

For example, you could specify the unit for a numeric variable (kilograms, pounds, years, months, etc.), or the increments of age groups. This is similar to what a cell with a drop-down list of values would provide in Excel.

quiz(
  question("Which of these would be helpful to add in a data dictionary to specify the value levels or format?",
    answer("Date: YYYY-MM-DD", correct = TRUE, message = "Adding the date format is helpful as a date can be written in many different ways (e.g. years in 2 or 4 digits, putting the day/month first, etc.) and specifying what input format the date should have will help save time when it comes to cleaning your dataset"),
    answer("Gender: 'M' or 'F' or 'other' or 'unknown'", correct= TRUE, message = "Specifying the input format is helpful as you can input gender as words or initials, which may differ depending on the language or your input data. This can also be helpful to decide where you are only recording 'males' and 'females' or if you are recording other genders too."),
    answer("Age: in years", correct = TRUE, message = "Specifying the unit of one value is important especially if that unit is not evident. In this case, age could be recorded in days, months or years for example."),
    answer("Age category, '0-4' or '5-14' or '15-44', '45+', or 'unknown'", correct=TRUE, message="When asking for data to be categorised, it is very important to specify the categories that you are expecting as these may not be obvious and can differ depending on the research question and data type."),
    answer("Presence of fever on admission: 'yes', 'no', or 'unknown'", correct = TRUE, message = "If you are expecting a binary data input, it is important to specify that you are as well as the format you are expecting. In this example, rather than 'yes' and 'no', the data collector could have asked for '1' and '0', where 1 = 'yes' and 0 = 'no'. If asking for numerical inputs to binary data, it is important to specify what these numbers represent."),
    allow_retry = TRUE
  )
)

Data dictionary example

Here are the first 5 rows of the surv_raw data frame that you imported:

head(surv_raw, 5)

And below is a data dictionary for this data:

knitr::kable(datadict)

This data dictionary allows to understand what each column value means, as well as what units the values were recorded in. These help keep data frames tidy, so that the imported data frame can be understood by computers but the epidemiologist still has a clear understanding of what each column represents.

It is good practice to create data dictionaries when you collect data and create new templates. These will also allow you to have column names that are easier to use for analysis. When creating your template in Excel, try to remember these tips so that your analysis in R will be as easy as possible:

Keep column name concise
Do not use spaces, but rather underscores (_) in the column names
Avoid the use of special characters, such as hyphen (-), commas (,), percentage signs (%), or currency signs (for example$ or £). These will either not be read by R or the column names will then have to be called on using backticks
Avoid recording unit values in column name unless the unit is shown in multiple units (it might be worth considering creating a new column for the value unit in this case)
Bold or italics will not be read by R
Stay consistent in how you name columns (for example, avoid mixing upper- and lower-case and if possible, keep column names all in lower-case)

Data dictionary variable descriptions

Thinking of the best practices you have just learned, try to answer the following questions.

quiz(
  question("What would be the best description for a column named `Date`?",
    answer("Date, YY-MM-JJ", message = "This description may not be explicit (what occurence is this date recording?) and the format is inconsistent: YY refers to 'years' (i.e. in English) whereas JJ refers to 'jour' (i.e. in French)."),
    answer("Date, YY-MM-DD", message = "This description may not be explicit (what occurence is this date recording?)."),
    answer("Date of specimen collection, YYYY-MM-DD", correct = TRUE, message = "This description explicitly specifies what event the date is recording and the format clear and consistent."),
    answer("Date, YYYY.MM.DD", message = "This description may not be explicit (what occurence is this date recording?)."),
    allow_retry = TRUE
  )
)

Data dictionaries are extremely useful and important if the data is recorded by multiple people or analysed by people that were not the collectors. Being as explicit as possible in the dictionary minimises the risk of misunderstanding and inaccurate recording. Data dictionaries should be kept in separate documents or sheets of your Excel document.

The R package {epikit}, jointly developed by Applied Epi and other organizations, has functions specifically for importing Kobo dictionaries into R.

Recording data

When recording data, the most important aspect is to stay consistent. This will help minimize the time it takes to clean the data as well as re-use the same code on new data.

Recording dates

Dates can be recorded in numerous formats. For example:

The order of days, months, or years may change
The numbers required for days, months, or years may change
Months may be written numerically or as characters
If using a numerical date, the separator may change

question("What is the optimal way of recording a date?",
  answer("YYYY-MM-DD", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("DD/MM/YY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("MM/DD/YYYY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("MM.DD.YY", message="That's right, this is a correct way of recording a date! Are you sure this is the only correct way though?"),
  answer("All of those may be correct", correct = TRUE,
         message="Indeed, all of the above options may be correct ways of recording dates!"),
  answer("Neither of those are correct",
         message="Are you sure about this?"),
  allow_retry = TRUE
)

Dates can be recorded in multiple ways, none of which is particularly superior to another. The most important aspect to remember is to stay consistent in how the date is recorded, whether that is numerically or in strings, the type of separator used, the order or the amount of numbers expected for days, months, years or hours and minutes!

Recording gender

question("Out of the below options, which do you think are good ways of recording gender?",
  answer("`f`,`h`", correct = TRUE),
  answer("`Female`, `Boy`, `Man`, `Baby`",
         message = "'Boy' and 'Baby' would not fall under a 'gender' variable. If using 'man', consistency would require the counterpart 'woman' rather than 'female' but when recording gender, 'female' and 'male' are the more correct form."),
  answer("`female`, `Male`",
         message = "Consistency in the capitalisation of the words is necessary!"),
  answer("`F`,`M`", correct = TRUE),
  answer("`0`,`1`", correct = TRUE),
  answer("`f`,`man`",
         message = "Consistency in the recording format is necessary! It would also be better to use the term 'male' rather than 'man' when recording gender."),
  allow_retry = TRUE
)

When recording genders, consistency is key, whichever format is used.

The letters 'F' and 'M' are generally widely understood. Note that this needs to come with a data dictionary so that if other initials are used (for example H and F for 'homme' and 'femme', in French), the person analyzing the data will be aware of what these represent.

If using a binary number format to represent gender, a data dictionary is crucial to specify which number refers to which gender.

Whilst not incorrect, it is generally better to avoid spelling out the gender as this leaves more room for typos or using different styles of letter capitaliszation (which will be read out as different values in R).

Recording location

Look at the GPS column in the dataset below:

knitr::include_graphics("images/messydata.PNG")

question("What is the issue with it?",
  answer("I don't think there's any issue with it."),
  answer("The data is recorded in the wrong format."),
  answer("The data is recorded in multiple formats."),
  answer("The data is recorded in the right format but over multiple rows."),
  answer("The data is recorded in the wrong format and over multiple rows."),
  answer("The data is recorded in multiple formats and over multiple rows", correct = TRUE),
  allow_retry = TRUE
)

There are two issues with how this data is recorded:

The data is recorded in multiple formats
The data is recorded over multiple rows

question("How do you think these are best remedied?",
  answer("You harmonise the formats (pick one) and duplicate the data from the first row to the one below.",
         message = "Are you sure duplicating the data from one row to the next is the most efficient way of presenting your data?"),
  answer("You harmonise the formats (pick one) and separate the coordinates over two columns (merged cell).",
         message = "Remember that R can't read merged cells!"),
  answer("You harmonise the formats (pick one) and separate the coordinates over two columns (`latitude` and `longitude`).",correct = TRUE),
  answer("The format in rows 16 and 17 is the right way to record GPS data. You convert the other values to this format and you combine the values from the two rows into one cell.",
         message = "Whilst combining the two coordinate values into one cell is right, there is not one format that is better than another when recording GPS coordinates. You can pick any that you think is suitable, but the most important aspect is to stay consistent."),
  answer("The format in rows 16 and 17 is the right way to record GPS data. You drop the values that do not conform to this format and you combine the values from the two rows into one cell.",
         message = "Whilst combining the two coordinate values into one cell is right, there is not one format that is better than another when recording GPS coordinates. You can pick any that you think is suitable, but the most important aspect is to stay consistent. You should not drop the values as you would then be deleting valid data."),
  answer("The coordinates are well recorded except for row 14. You drop row 14 and 15 and convert the coordinates over two columns.",
         message = "Dropping rows 14 and 15 would mean getting rid of data across other variables too, so you should not do that. You could replace the value 'pending' by `NA`."),
  allow_retry = TRUE
)

GPS coordinates can be given in different formats:

Degrees, minutes and seconds (DMS), for example: 41°24’12.2"N 2°10’26.5"E
Degrees and decimal minutes (DMM), for example: 41 24.2028, 2 10.4418
Decimal degrees (DD), for example: 41.40338, 2.17403

Any of these units are fine to record in, but the most important rule to remember when recording location is to stay consistent with the format you are using.

Good practices for collecting and storing data

In this tutorial, you have learned about what tidy data is and its importance for data analysis. Let's review some of these concepts which will be important to keep in mind next time you design a dataset.

Collecting data

Before collecting your data, think of:

Your investigation objectives and the data needed to fulfill these
Your outcome measures
Your measures of exposure
Other relevant variables that may act as possible confounders
If this is data that will be collected over a certain period of time or by different people, think of the format your are creating for collection: is it explicit? Is it expandable?

An ideal dataset will be exhaustive enough to permit your analysis without being too complicated to fill when collecting the data.

Storing data

In this tutorial we have shown you examples of datasets stored in Excel and analysed in R. Whether using this format or not, keep in mind that:

Each variable should be in a column for collection (these may be grouped together for analysis)
Each observation should have its own row (these may be grouped together for analysis)
Each value should have its own cell

When creating your data collection template and when collecting data, remember to:

Have clear, explicit and concise column names
Avoid the use of special characters (eg greek letters or accents)
Specify the unit of each variable, if relevant, and use the same unit throughout that column
Specify the different levels of each variable, if relevant, and only use those throughout that column
Be consistent with your use of capitalization throughout your column
Explicitly state is your data is missing
Do not leave cells empty if they imply another meaning
Avoid the use of merged cells
Avoid the use of colors

If collecting data over multiple spreadsheet sheets, think of:

1) Whether is makes sense to collect it over multiple sheets. For example, if each sheet represents recording for a different location or month, could you not instead add a column to specify which location/month the observation belongs to?

2) If it makes sense to have your data across multiple sheets but you are recording common variables across both sheets, stay consistent in how you are recording your data and the units that you are using! Also stay consistent in which cells you are recording your data so you can automate your data extraction from the Excel sheets into R (rather than manually selecting which cells you will extract data from in each sheet!)

End

Congratulations! You have completed the first exercise in this course!

Be sure to save your R script, and check-in with your facilitator.

Extras

If you still have time remaining, try some of the activities below.

Installing fun packages

While most R packages conduct specific analyses or make workflows more efficient, R programmers are also fun people who made packages for amusement.

{praise}

Try installing the R package {praise}, and then see what its function praise() does.

praise()

This is a fun package to use if you are building a tutorial!

{cowsay}

Now try installing the R package {cowsay} - a package for printing silly images of animals made from punctuation symbols.

pacman::p_load(cowsay)

Its function say() has one required argument what = to which you can provide a character value that will be spoken by a cat:

say(what = "Hi, I am a cat who is learning R!")

The second argument is by = and it can accept the name of another animal such as "chicken", "yoda", "spider", "ant", or "frog".

say(what = "Even frogs like to learn R!", by = "frog")

You can change the colors with the by_color = and what_color = arguments. See more in the package documentation by entering ?say in your R Console.

Have fun with this!

{aRtsy}

The data visualization functionality of R is so high quality that users have written a package to generate art. Here is the package documentation.

Install the package "aRtsy" and load it for use.

pacman::p_load(aRtsy)

Some of the artwork can take a long time to generate, but if you are feeling adventurous try these commands:

canvas_collatz(colors = colorPalette("tuscany3"))

or this one:

canvas_flow(colors = colorPalette("dark2"))

or this one:

canvas_petri(colors = colorPalette("sooph"))

Customizing RStudio

Read the information at this link and adjust the settings of your RStudio via the "Tools" menu and clicking "Global Options". Which Appearance settings do you prefer?

R user communities

Look through this list of R User communities and find one near you. See also this list of R-Ladies chapters. R-Ladies is one of the biggest groups of R users in the world!

Tidy Tuesday

A very popular R social phenomenon is "Tidy Tuesday" - the weekday when many R users do a fun data visualization task in R and share it with the community. It is a fun way to practice and learn new R tips.

Check out the hash "TidyTuesday" on Twitter (or Follow @TidyTuesday_Bot) and review the cool plots that people make with R!

Origins of R

If you are interested, watch this 7-minute video interview with of one of R's co-founders, Ross Ihaka, produced by Maori TV in his home of New Zealand. He discusses his philosophy behind creating R, the Maori influences on his life, and more.

Review the basics

Read through the "R Basics" chapter of the Epi R Handbook - it is full of things that you have learned today, but also topics that will be new.

If you are confused by any section, ask your instructor for clarification.

Review the Import and Export chapter

This chapter in the Epi R Handbook covers more detail about importing and exporting data, including:

How to import Google Spreadsheets
How to import data from an API
How to enter data manually
How to import the most recent dataset in a folder

Importing big data

Read about the {vroom} R package, which is made for quickly importing very large datasets. Try it out!

appliedepi/introexercises documentation built on April 22, 2024, 1:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

appliedepi/introexercises

In appliedepi/introexercises:

Introduction to R for Applied Epidemiology

Welcome

File management, data import, tidy data, and exploratory analysis

Format

Getting Help

Quiz questions

Icons

License

Learning objectives

Setup

Open your RStudio project

Open your R script

Load packages

Project organization

File naming can impact your quality of life

Good and bad file names

Names that are machine-readable

Names that are human-readable

Names that sort easily

Clean file names

Clean file names

Project structure

Folder structure

README files

Importing a dataset

Script organization

Location of the data

{rio} - the easy package for importing data

Importing data from subfolders

The {here} package and relative file paths

A note on slashes

Import from a subfolder

Manual file selection

Inspect a dataset

Take a peek

Indexing

Skim

Tidy data

Definitions

Principles of tidy data

"Machine-readable" data

Excel spreadsheets

Importing data from spreadsheets

Colors

Merged cells

Tidy data

Expanded possibilities

Empty space

Excel-to-R resources

Data dictionaries

Variable definition

Listing possible values and levels

Data dictionary example

Data dictionary variable descriptions

Recording data

Recording dates

Recording gender

Recording location

Good practices for collecting and storing data

Collecting data

Storing data

End

Extras

Installing fun packages

{praise}

{cowsay}

{aRtsy}

Customizing RStudio

R user communities

Tidy Tuesday

Origins of R

Review the basics

Review the Import and Export chapter

Importing big data

R Package Documentation

Browse R Packages

We want your feedback!