In WdeNooy/UsingRTutorials: Provides learnr Tutorials for a Using R Course

library(learnr)
library(gradethis)
library(knitr)

tutorial_options(exercise.timelimit = 60, exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

# Ensure that library is loaded.
library(tidyverse)

# Ensure that the data is loaded for the remainder of this tutorial.
Glasgow <- UsingRTutorials::Glasgow

Overview

First 1.5 hours: Course content
- Q&A
- Data Wrangling with dplyr::
- Workflow and Data Import
- Integrating text and data: R Markdown
Second 1.5 hours: Data project
- Finish Sprint #1, plan and start Sprint #2
- Plenary updates by the SCRUM masters

Q&A

Any questions about the organization of the course?
Any questions about last tutorial's topics?
Any new topics that must receive attention today?

Data Wrangling with `dplyr::`

Data wrangling: Transforming (raw) data into (useful) information.

__Programming Tip__
- You are responsible for the correctness of your transformations.
- So __check__ every transformation step.

Today, we use a data set containing information about friendships, tobacco, alcohol, and substance use among 160 students, who were followed over their second, third and fourth year at a secondary school in Glasgow (Teenage Friends and Lifestyle Study research project).

The data set, named Glasgow, is available within this tutorial, so you do not have to load it.

student: respondent ID, as a character string.
age: respondent age, in years with one decimal digit.
sex: respondent sex, boy or girl.
smoking_at_home: any smokers at home, yes or no.
smoking_parents: smoking by at least one parent, yes or no.
smoking_siblings: smoking by at least one sibling, yes or no.
wave: time of observation, starting in February 1995, when the pupils were aged 13, and ending in January 1997.
alcohol: respondent alcohol consumption: 1 (none), 2 (once or twice a year), 3 (once a month), 4 (once a week) and 5 (more than once a week).
cannabis: respondent cannabis consumption: 1 (none), 2 (tried once), 3 (occasional) and 4 (regular).
tobacco: respondent tobacco consumption: 1 (none), 2 (occasional) and 3 (regular, i.e. more than once per week).
money: respondent's pocket money per month, in British pounds.
romantic: whether the student had a romantic relation, yes or no.
friendships: number of friendship nominations received from other respondents.

Piping

The tidyverse approach to data wrangling can be summarized as follows:

Transform data with functions: data frame → new data frame.
Break down transformations into logical steps.
Chain transformations into a pipe (%>%): Use the resulting data of the previous step as input data of the next step.

data.frame(
  Function = c("filter(): select cases", 
               "arrange(): sort cases", 
               "select(): select variables", 
               "mutate(): compute new variables", 
               "summarise(): aggregate (collapse) data", 
               "group_by(): split by group"),
  Goal = c("I want to focus on part of my cases.", 
           "I want to rearrange my cases.", 
           "I want to focus on some of my variables.", 
           "I want to change variables.", 
           "I want summary statistics.", 
           "I want summaries or changed variables for each group.")
  ) %>%
  knitr::kable(
    caption = "Main data transformation functions",
    booktabs = TRUE
    ) %>% 
  kableExtra::kable_styling(
    bootstrap_options = "striped"
    )

Or, visually:

knitr::include_graphics("images/mainwrangling.png")

Apply the `tidyverse` approach to the code below: Start with the data set and join all transformations in one pipe such that the result is shown on the screen.

helpData1 <- filter(Glasgow, money >= 0)
helpData2 <- group_by(helpData1, sex, student)
helpData3 <- summarise(helpData2, n_rom = sum(romantic == "yes", na.rm = TRUE))
count(helpData3, sex, n_rom)

# In a pipe, the data frame originating from a previous step is automatically
# the data frame used for the next step. You don't have to save an intermediary
# data frame or specify its name in a pipe. And don't forget to add the pipe
# symbol!

# A pipe for the first function:
Glasgow %>% 
  filter(money >= 0)

# Have a look at the result of this pipe with View().
Glasgow %>% 
  filter(money >= 0) %>% View()
# Note that the View window may be hidden behind your RStudio screen.

# A pipe for the first two functions:
Glasgow %>% 
  filter(money >= 0) %>%
  group_by(sex, student)
# Note that group_by() does not change the result.
# Its effect is that subsequent functions are applied to each group 
# instead of to the whole data set.

# See the impact of group_by on the summarise() function:
Glasgow %>% 
  filter(money >= 0) %>%
  group_by(sex, student) %>% 
  summarise(n_rom = sum(romantic == "yes", na.rm = TRUE)) %>% View()
# Run the code also without group_by() to see the difference.

# You can now finish this pipe on your own, right?

Glasgow %>%
  filter(money >= 0) %>%
  group_by(sex, student) %>%
  summarise(n_rom = sum(romantic == "yes", na.rm = TRUE)) %>%
  count(sex, n_rom)

gradethis::grade_code(
  incorrect = "Don't mind an `Error occured while checking the submission` message."
)

__Programming Tip__ For readability, add a comment to each step in the pipe, explaining what the step does or is meant to do to the data.
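A minimal sketch of a commented pipe. The data frame `scores` and its variables are made up for illustration; they are not part of the Glasgow data.

```r
library(dplyr)

# Hypothetical mock data: test scores (maximum 20 points) of four students.
scores <- tibble(
  student = c("a", "b", "c", "d"),
  score   = c(12, -1, 18, 7)
)

scores_pct <- scores %>%
  # drop cases with an impossible (negative) score
  filter(score >= 0) %>%
  # express the score as a percentage of the maximum (20 points)
  mutate(score_pct = 100 * score / 20) %>%
  # show the students with the highest percentage first
  arrange(desc(score_pct))

scores_pct
```

Each comment states the purpose of the step, not just the name of the function, so a reader can check whether the code does what it is meant to do.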

Add a comment to every step of the solution to the previous exercise, explaining the purpose of the code in this step.

A Frequency Table

Frequency tables count how often values appear in the data: counts per group.

A count is a summary: We replace all cases within a group by one case containing the number of cases in that group.

Glasgow %>%
  # we need summarizing numbers for each number of friendships, so group first
  # grouping automatically sorts on the variable, so the cumulation works fine
  group_by(friendships) %>%
  # summary statistics: note that a new variable can be used immediately
  summarise(
    Freq = n(),
    Perc = 100 * Freq / nrow(Glasgow)
  )

Create the above table with the absolute frequencies (raw counts) and relative frequencies (percentages) of the number of friendships per student in the `Glasgow` data set.

As always, use tidyverse functions and join all functions in one pipe.

# You must group the data before you summarise.
Glasgow %>%
  group_by( ??? )
# Which variable provides the groups?
# In other words: The values of which variable are counted?

# Use summarise() with the n() function to count the number of cases per group. 
Glasgow %>%
  group_by( ??? ) %>%
  summarise(
    Freq = n()
  )
# Instead of summarise(Freq = n()), we can use the function count().

# For relative frequencies, divide the raw counts by the number of cases. 
Glasgow %>%
  group_by( ??? ) %>%
  summarise(
    Freq = n(),
    Perc = ??? / nrow(Glasgow)
  )
# Note that we can use the data set again within a pipe step!

gradethis::grade_result(
  pass_if(
      ~ {nrow(.result) == 13 && ncol(.result) == 3 && identical(names(.result), c("friendships", "Freq", "Perc")) && round(.result[[1, 3]]) == 13  && round(.result[[2, 3]]) == 12 },
    # function(x) {nrow(x) == 13 && ncol(x) == 4 && names(x) == c("friendships", "Freq", "Perc", "CumPerc") && round(x[[1, 3]]) == 13  && round(x[[2, 4]]) == 25 },
    "You correctly grouped and summarised the number of friendships, using the exact same variable names as in the presented table. And you did not forget to create percentages instead of proportions."),
  fail_if(~ nrow(.result) != 13, "How can you get one row for each number of friendships?"),
  fail_if(~ ncol(.result) != 3, "Did you summarize the frequencies and the percentages? Use `summarize()` to calculate the frequencies and percentages; this will give you one row for each number of friendships."),
  fail_if(~ !(identical(names(.result), c("friendships", "Freq", "Perc"))), "Use the names of new variables exactly as they are used in the presented table."),
  fail_if(~ round(.result[[1, 3]]) != 13, "Did you notice that we need percentages, not proportions? Use `Perc = 100 * Freq / nrow(Glasgow)`.")
)

__Programming Tip__
- The current version of __summarise()__ by default undoes the last grouping. Hence the message in the console "`summarise()` ungrouping output (override with `.groups` argument)".
- This is the safe option. It is easy to forget that data are grouped, but results on grouped data can be very different from what you expect or want.
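The sketch below illustrates this behaviour on a small made-up data frame (`visits` is hypothetical). It assumes dplyr 1.0.0 or later, in which `summarise()` has a `.groups` argument; `group_vars()` lists the grouping variables that remain.

```r
library(dplyr)

# Hypothetical mock data with two grouping variables.
visits <- tibble(
  sex  = c("boy", "boy", "girl", "girl", "girl"),
  wave = c("t1", "t2", "t1", "t1", "t2"),
  n    = c(3, 1, 2, 4, 5)
)

# Default: only the last grouping variable (wave) is dropped,
# so the result is still grouped by sex.
by_sex <- visits %>%
  group_by(sex, wave) %>%
  summarise(total = sum(n))
group_vars(by_sex)

# Explicitly drop all grouping after summarising.
ungrouped <- visits %>%
  group_by(sex, wave) %>%
  summarise(total = sum(n), .groups = "drop")
group_vars(ungrouped)
```

If later steps should work on the whole data set, either set `.groups = "drop"` or add `ungroup()` to the pipe.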

Recoding and Grouping

Some useful functions for recoding or grouping variables have been added to tidyverse since the publication of the book R for Data Science:

recode(x, old = new, old = new, ...): Replace single old values by new values in variable x.

# School year instead of wave indicator.
Glasgow %>% mutate(schoolyear = 
    recode(wave,  "t1" = 2,
                  "t2" = 3,
                  "t3" = 4))
# Note that the old value is named first.

case_when(criterion ~ new value, criterion ~ new value, ...): Replace sets of old values (according to a criterion) by new values.

# Group number of friends.
Glasgow %>% 
  mutate(friends_class = case_when(
    friendships == 0 ~ "No friends", 
    friendships < 5 ~ "1 - 4 friends", 
    TRUE ~ "5+ friends"))
# The condition is to the left of the tilde, the new value to the right.
# Note the importance of the order of the steps: persons without friends 
# are excluded from the group with fewer than 5 friends.

ntile(x, n): Group variable x into n bins, each containing approximately the same number of cases.

# Group number of friends in three (more or less) equally large groups.
Glasgow %>% 
  mutate(friends_bin = ntile(friendships, 3))

na_if(x, y): replace a specific value y on variable x with NA.

# Set -1 friends to missing.
Glasgow %>% 
  mutate(friends_nonmissing = na_if(friendships, -1))

Now, do it yourself.

Create a new variable `money_class` dividing the Glasgow students into three groups containing more or less the same number of cases. Send the results to the screen.

# Use __mutate()__ to create the new variable.
Glasgow %>% mutate(money_class = ??? )
# Which of the recoding and grouping functions should you use?

Glasgow %>% mutate(money_class = ntile(x = money, n = 3))

gradethis::grade_code(
  correct = "", 
  incorrect = "Please, specify argument names."
  )

Create a new variable `money_class2` with the following groups for the Glasgow students' pocket money:
- group -1: values below 0 (negative pocket money?),
- group 0: 0 (no pocket money),
- group 1: 1 - 10,
- group 2: more than 10 pounds per month.

Send the results to the screen.

# Use mutate().
Glasgow %>% mutate(
  money_class2 = ???()
)
# Which of the recoding and grouping functions should you use?

# Indeed, use case_when().
Glasgow %>% mutate(
  money_class2 = case_when(

  )
)
# How do you specify the conditions and new values?

# What is the condition for groups -1 (negative pocket money) and 0 (no pocket money)? 
# Use __==__, __>__, or __<__ for "equals", "larger than", and "smaller than".
Glasgow %>% mutate(
  money_class2 = case_when(
    money ?? ~ -1,
    money ?? ~ 0
  )
)

# What is the condition for group 1: 1 - 10? 
# Use __>=__ and __<=__ for "larger or equal" and "smaller or equal".
Glasgow %>% mutate(
  money_class2 = case_when(
    money < 0 ~ -1,
    money == 0 ~ 0,
    ??? ~ 1
  )
)

# What is the condition for group 2: more than 10 pounds per month? 
# You can assign all remaining cases to this group.
Glasgow %>% mutate(
  money_class2 = case_when(
    money < 0 ~ -1,
    money == 0 ~ 0,
    money <= 10 ~ 1,
    ??? ~ 2
  )
)

Glasgow %>% mutate(money_class2 = case_when(money < 0 ~ -1, money == 0 ~ 0, money <= 10 ~ 1, TRUE ~ 2))

__Programming Tip__
- Use `count()` (which is equivalent to `group_by() %>% summarise(n = n())`) to better understand a variable or a combination of two or more variables.
- Check a recoded variable: Pipe the original and recoded variable into the `count()` function.
    + Example: `%>% count(money_class2, money)`
    + Browse the frequency table: are the original values linked to the correct groups on the new variable?
- Pay special attention to (combinations that involve) missing values. Missing values may create more missing values in data transformation steps because most transformations involving a missing value result in a missing value.
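A sketch of such a check on made-up data (`pocket` is a hypothetical data frame that includes one negative and one missing value):

```r
library(dplyr)

# Hypothetical mock data with a negative and a missing value.
pocket <- tibble(money = c(-1, 0, 0, 5, 10, 12, 25, NA))

check <- pocket %>%
  mutate(money_class2 = case_when(
    money < 0 ~ -1,
    money == 0 ~ 0,
    money <= 10 ~ 1,
    TRUE ~ 2
  )) %>%
  # one row for each combination of recoded and original value
  count(money_class2, money)

check
```

Browsing the table shows that the case with missing pocket money ends up in group 2: none of the comparisons is true for `NA`, so the catch-all `TRUE` condition applies. This is the kind of surprise that the frequency table is meant to reveal.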

Instead of grouping the values, set `-1` on variable `money` to missing (change the `money` variable). A negative number of pounds as pocket money cannot be correct.

Send the results to the screen.

# Use mutate().
Glasgow %>% mutate(
  money = ???()
)
# Which of the recoding and grouping functions should you use?

# Indeed, the na_if() function.
# Check out the help for this function to see which arguments you have to use.

Glasgow %>% mutate(money = na_if(x = money, y = -1))

gradethis::grade_code(
  correct = "", 
  incorrect = ""
  )

Missing Values

In R, missing values are indicated by NA.

How are missing values treated?

The alcohol variable in the Glasgow data set has missing values.

What happens with the missing values in the following commands?

quiz(
  caption = "",
  question("`Glasgow %>% filter(alcohol == \"1 none\")`",
    answer("Missing values are included."),
    answer("Missing values are ignored.", correct = TRUE),
    answer("The result is a missing value.")
  ),
  question("`Glasgow %>% select(alcohol)`",
    answer("Missing values are included.", correct = TRUE),
    answer("Missing values are ignored."),
    answer("The result is a missing value.")
  ),
  question("`Glasgow %>% summarise(no_alcohol = sum(alcohol == \"1 none\"))`",
    answer("Missing values are included."),
    answer("Missing values are ignored."),
    answer("The result is a missing value.", correct = TRUE)
  ),
  question("`Glasgow %>% summarise(no_alcohol = sum(alcohol == \"1 none\", na.rm = TRUE))`",
    answer("Missing values are included."),
    answer("Missing values are ignored.", correct = TRUE),
    answer("The result is a missing value.")
  )
)

__Programming Tip__
- If you are not sure about what some code exactly does, run it (on a dataset) and check the results.
- You can use the code box below to check the commands of the above questions.

#Copy code from the questions here...

Dealing with missing values

Missing values are special: we cannot use them like other values.
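A few base R lines illustrate why missing values are special (the vector `x` is mock data):

```r
# Any calculation or comparison involving NA yields NA.
NA + 1        # NA
NA > 1        # NA
NA == NA      # NA: two unknown values are not known to be equal

# Mock vector with one missing value.
x <- c(1, 2, NA)
x == 2                  # FALSE TRUE NA
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 1.5
```

Because a comparison with `NA` never yields `TRUE`, a dedicated function is needed to detect missing values.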

Correct the code below to count the number of missing values on the `alcohol` variable in the `Glasgow` data set.

summarise(Glasgow, alcohol_NA = (alcohol == NA))

__Hint:__ We cannot use `== NA`. Check page 2 of the Data Transformation with `dplyr` cheat sheet for a function to work with missing values (`NA`). Oh, and with how many rows do you want to end up?

summarise(Glasgow, alcohol_NA = sum(is.na(alcohol)))

gradethis::grade_code()

The previous exercise does not use a pipe because we apply just one transformation. Here, a pipe is perhaps a bit too much.

__Programming Tip__ It is very easy to mix up __=__ and __==__.
- __=__ means the same as __<-__ in R, namely "becomes". __y = 0__ means that data object __y__ becomes zero.
- __==__ means "is equal to". __y == 0__ checks if __y__ equals zero, which is either true or false.
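The difference in a short sketch (the data frame `toy` is made up for illustration):

```r
library(dplyr)

# Assignment: y becomes 0 (equivalent to y <- 0).
y = 0

# Comparison: is y equal to 0? The result is a logical value.
is_zero <- y == 0
is_zero

# Inside dplyr functions, = names a new variable or an argument,
# whereas == tests a condition.
toy <- tibble(y = c(0, 1, 0))                    # hypothetical mock data
zero_rows <- toy %>% filter(y == 0)              # keeps the rows where y equals 0
flagged   <- toy %>% mutate(is_zero = y == 0)    # adds a logical variable is_zero
flagged
```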

Mock test data

Finally, how does R treat a logical variable if we sum() it?

Time for another little trick:

Create a small input data set.
Use it to test what a function does.

Predict the output of the function. Change the code below a few times until you are certain about what `sum()` does with logical values (`TRUE` or `FALSE` or `NA`).

sum(c(TRUE, TRUE, FALSE, FALSE, NA))

It may help your understanding to also try mean() instead of sum().

__Hint:__ Actually, the help on `sum()` tells you how logicals are treated.

gradethis::grade_result(
  fail_if(~ is.na(.result), "Don't forget to add `na.rm=TRUE` to ignore missing values."),
  pass_if(~ { .result > 0 && .result < 1 }, "R replaces `TRUE` by `1` and `FALSE` by `0` when it calculates with a logical variable."),
  pass_if(~ TRUE, "R replaces `TRUE` by `1` and `FALSE` by `0` when it calculates with a logical variable.")
)

__Programming Tip__
- In R, `c()` creates a __vector__, which is a series of values (of the same type).
- A variable in a data frame is a vector.
- `c()` is also used when we have to pass more than one value to a function argument.
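A short base R sketch of these three points, using made-up values:

```r
# c() combines values into a vector; all values have the same type.
ages <- c(13.1, 13.4, 12.9)
class(ages)              # "numeric"

# Mixing types forces a common type: here everything becomes character.
mixed <- c(1, "two", TRUE)
class(mixed)             # "character"

# A variable in a data frame is a vector.
toy <- data.frame(student = c("s001", "s002", "s003"), age = ages)
is.vector(toy$age)       # TRUE

# c() passes several values to one function argument (here: to %in%).
selected <- toy[toy$student %in% c("s001", "s003"), ]
selected
```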

Multi-Case Functions

mutate() - Ordinary use: Calculate a new variable value for each case from the case's 'own' value on one or more variables. Example: grouping a variable.

Special use: Calculate a new variable value for each case from the values on a variable for other cases.

dplyr cheat sheet:

OFFSETS: use values from a preceding (lag()) or successive (lead()) case in the data frame;
CUMULATIVE AGGREGATES: compute the sum (etc.) of all preceding cases;
RANKINGS: assign rank to value in comparison to all other values.

What if we use these functions with:

data sorting (arrange()),
and grouping (group_by())?
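The sketch below combines sorting and grouping with one function from each category on a made-up data frame (`toy` is hypothetical, deliberately unsorted):

```r
library(dplyr)

# Hypothetical mock data: two students, three waves each, deliberately unsorted.
toy <- tibble(
  student = c("a", "b", "a", "b", "a", "b"),
  wave    = c("t2", "t1", "t1", "t3", "t3", "t2"),
  score   = c(5, 2, 3, 6, 4, 1)
)

toy_multi <- toy %>%
  # sort by wave within student, so "next" and "cumulative" follow time
  arrange(student, wave) %>%
  # restart every multi-case function for each student
  group_by(student) %>%
  mutate(
    next_score = lead(score),      # offset: value of the following case
    running    = cumsum(score),    # cumulative aggregate: this and all preceding cases
    rank_score = min_rank(score)   # ranking within the student, 1 = lowest
  ) %>%
  ungroup()

toy_multi
```

Within each student, `next_score` is missing for the last wave and `running` restarts at the first wave. Without `group_by()`, values of student a would leak into the results for student b.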

The Glasgow data set contains the number of friendships of each student in three successive waves (t1, t2, and t3).

Calculate two new variables:
- `prev_friendships`: the number of friendships in the preceding wave (if any);
- `change`: the increase or decrease in a student's number of friendships from one wave to the next.

Retain only the variables `student`, `wave`, `friendships`, `prev_friendships`, and `change`, so it is easy to inspect the results.

Glasgow %>%
  #sort on student and wave within student
  arrange( ____ ) %>%
  #group by student, so only data for the same student is used
  group_by( ____ ) %>%
  #use a special function to calculate the difference
  mutate( 
    prev_friendships = ____(____), #number of friendships in the preceding wave (if any)
    change = ____ #difference: later minus earlier
    ) %>%
  select( _____ )

# correct code
Glasgow %>%
  #sort on student and wave within student
  arrange(student, wave) %>%
  #group by student, so only data for the same student is used
  group_by(student) %>%
  #use lag() to calculate the difference
  mutate( 
    prev_friendships = lag(friendships), #number of friendships in the preceding wave (if any); this command can be included in the next
    change = friendships - prev_friendships #difference: later minus earlier
    ) %>%
  select(student, wave, friendships, prev_friendships, change)

gradethis::grade_result(
  pass_if(~ {
    #required variables created (named) and retained
    "student" %in% names(.result) && "wave" %in% names(.result) &&
       "friendships" %in% names(.result) && "prev_friendships" %in% names(.result) &&
       "change" %in% names(.result) &&
    #only required variables selected
    ncol(.result) == 5 &&
    #correctly sorted
    identical(.result$student[1], "s001") && identical(.result$wave[1], "t1") &&
    #prev_friendships correctly calculated
    identical(.result$prev_friendships, transmute(group_by(arrange(Glasgow, student, wave), student), prev_friendships = lag(friendships))$prev_friendships) &&
    #change correctly calculated
    identical(.result$change, .result$friendships - .result$prev_friendships)
    }, 
    "You correctly sorted and grouped the data before taking the preceding value of friendships as the value for prev_friendships, which you used to calculate change."),
  fail_if(~ !("student" %in% names(.result) && "wave" %in% names(.result) &&
       "friendships" %in% names(.result) && "prev_friendships" %in% names(.result) &&
       "change" %in% names(.result)), 
       "Did you create the two new variables with the right names and retain the required variables in the data set?"),
  fail_if(~ ncol(.result) != 5, 
       "Did you select only the required variables in the data set at the end?"),
  fail_if(~ !(identical(.result$student[1], "s001") && identical(.result$wave[1], "t1")), 
        "Sort the data on student and wave before you create the new variables."),
  fail_if(~ !(identical(.result$prev_friendships, transmute(group_by(arrange(Glasgow, student, wave), student), prev_friendships = lag(friendships))$prev_friendships)), 
       "Did you group the data by student? Did you use the `lag()` function to create a new variable containing the number of friendships in the preceding wave?"),
  fail_if(~ !(identical(.result$change, .result$friendships - .result$prev_friendships)), 
        "You did not calculate the change in number of friendships correctly from `friendships` and `prev_friendships`. Did you subtract the wrong variable?")
)

__Hint:__
- Sort the data such that the cases for a student are together and in temporal order. Consult the _Data Transformation with dplyr_ cheat sheet to find the right function for using information from the preceding case.
- Use help on a function if the description on the cheat sheet is not clear to you.

__Programming Tip__
- If you use a multi-case function with grouping, check that the function correctly restarts for a new group.
- Pay special attention to the first and last value within a group: are these values as they should be?

What happens if you use the multi-case function without grouping? Comment out the grouping step in the previous answer box and inspect the results.

Missing Observations

What if we calculate the change in friendships for a student missing an observation for a wave?

Manually calculate the change in number of friends of student s998 in the example data fragment below. What does the result mean?

#Example of observation missing for one student.
data.frame(
  student = c("s997", "s998", "s998", "s999"),
  wave = c("t3", "t1", "t3", "t1"),
  friendships = c(4, 2, 3, 6)
  ) %>%
  knitr::kable(booktabs = TRUE) %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)

Don't assume that your data are perfect! Use code to check your data.

Step 1 - Formulate the precise conditions that you want to check:

For every student, we should have exactly three observations (rows).
The three observations per student should have different wave values.
Only wave values 't1', 't2', and 't3' should occur.

Step 2 - Translate the conditions into R code.

Use aggregation: count() or group_by() and summarise().
Use filter() to select cases that violate the conditions.
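As a sketch, the three conditions can be checked on the example data fragment of this section with `count()` and `filter()`. The object names (`fragment`, `incomplete`, and so on) are made up for illustration.

```r
library(dplyr)

# The example data fragment with missing observations.
fragment <- tibble(
  student = c("s997", "s998", "s998", "s999"),
  wave = c("t3", "t1", "t3", "t1"),
  friendships = c(4, 2, 3, 6)
)

# Condition 1: students who do not have exactly three observations.
incomplete <- fragment %>%
  count(student, name = "n_obs") %>%
  filter(n_obs != 3)

# Condition 2: combinations of student and wave that occur more than once.
duplicated_waves <- fragment %>%
  count(student, wave, name = "n_rows") %>%
  filter(n_rows > 1)

# Condition 3: observations with a wave value other than t1, t2, or t3.
unexpected_waves <- fragment %>%
  filter(!wave %in% c("t1", "t2", "t3"))

incomplete
```

Every check returns the violating cases, so an empty result means that the condition holds. Here, all three students violate the first condition, while the second and third conditions are met.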

- Add comments to the code below that explain how this code checks the first two conditions specified above.
- Are there any cases that violate the first two conditions?

# correct code
Glasgow %>%
  #for each student...
  group_by(student) %>%
  #count the number of different waves in the data
  summarise(
    n_obs = n(),
    n_dist = n_distinct(wave)
    ) %>%
  #filter cases with n_obs != 3 or n_dist != 3
  filter( n_obs != 3 | n_dist != 3)

Glasgow %>%
  group_by(student) %>%
  summarise(
    n_obs = n(),
    n_dist = n_distinct(wave)
    ) %>%
  filter( n_obs != 3 | n_dist != 3)

__Hint:__ Group by student before you summarize. Use function __n()__ to count the number of observations (rows). Logical OR is represented by __|__ and __!=__ means "is not equal to". If no cases remain, your code may be correct.

# correct code
Glasgow %>%
  #count the different waves in the data
  count(wave)

Check the third condition (above) using `count()` (or `group_by()` and `summarize()`).

gradethis::grade_result(
  pass_if(~ identical(.result, count(Glasgow, wave)), "Indeed, we only have values `t1`, `t2`, and `t3`."),
  fail_if(~ !identical(.result, count(Glasgow, wave)), "Use `count()` to create a frequency table of wave values.")
)

__Programming Tip__ - Never assume that data are complete. Formulate which regularities you expect in the data and use code to check them.

Fancy Stuff

Nice tables

R has several packages for creating tabular output. In a later session, we will discuss some packages for tabulating statistical output.

Here, we present a function (kable()) in a basic package for tabular output (knitr), which works very well with piping and R Markdown (discussed later in this tutorial). In addition, we use the kableExtra package to fine-tune tables created with kable().

Let's start by noting that we must have the data in the required shape before we create a table. In other words, kable() does not do any counting or summarizing for us. It only displays our data.

Make sense of the code below; add comments explaining every step. What happens if you do not drop the `sex` variable?

Glasgow %>%
  group_by(sex, wave) %>%
  summarise(tobacco_prop = mean(tobacco != "1 none", na.rm = TRUE)) %>%
  ungroup() %>%
  select(-sex) %>% 
  kable(
    digits = 2,
    col.names = c("Wave", "Proportion using tobacco"),
    align = "lcc",
    caption = "Proportion of Glasgow students using tobacco."
  )

# Use help on the `kable()` function (`?kable`) for more information on the options provided by `kable()`.

The above table is not particularly pretty, so let us improve it with the help of package kableExtra.

library(kableExtra)
Glasgow %>%
  group_by(sex, wave) %>%
  summarise(tobacco_prop = mean(tobacco != "1 none", na.rm = TRUE)) %>%
  ungroup() %>%
  select(-sex) %>% 
  kable(
    digits = 2,
    col.names = c("Wave", "Proportion using tobacco"),
    align = "lcc",
    caption = "Proportion of Glasgow students using tobacco."
  ) %>%
  kableExtra::kable_classic(full_width = FALSE) %>%
  kableExtra::pack_rows("Boys", 1, 3) %>%
  kableExtra::pack_rows("Girls", 4, 6)

Play around with this table using the code box below. Check out the options offered by `kableExtra` on the [package website](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html).

Adding summaries to plots

If you store summary information about a data set in a data object, you can display the summary information in ggplot() along with the original data. Every geom can have its own data argument, so specify the data object with summaries in a geom that visualizes the summary information.

The below graph, for example, shows the number of friendships for boys and girls over the three waves (with jitter) as well as their average number of friends.

#Calculate and save average number of friends.
averages <- Glasgow %>%
  #group waves by sex
  group_by(sex, wave) %>%
  #calculate mean, ignoring missing values
  summarise(mean_friends = mean(friendships, na.rm = TRUE))
#Create plot
ggplot() +
  #add individual friendship score with jitter
  geom_jitter(data = Glasgow, aes(x = wave, y = friendships, color = sex)) +
  #add mean scores as large boxes
  geom_point(data = averages, aes(x = wave, y = mean_friends, color = sex), shape = "square", size = 4) +
  #add lines to link the means: note the `group` argument
  geom_line(data = averages, aes(x = wave, y = mean_friends, group = sex, color = sex))

Play around with this plot. For example, use `geom_text()` to add the values of the group means to the plot.
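One possible way to add the group means as text is sketched below on a small made-up data set (`toy` and `toy_averages` are hypothetical stand-ins for `Glasgow` and `averages`). The position and rounding of the labels are arbitrary choices.

```r
library(dplyr)
library(ggplot2)

# Hypothetical mock data standing in for the Glasgow data.
toy <- tibble(
  sex = rep(c("boy", "girl"), each = 6),
  wave = rep(c("t1", "t2", "t3"), times = 4),
  friendships = c(2, 3, 4, 4, 5, 3, 1, 2, 6, 3, 4, 2)
)

# Average number of friends for each combination of sex and wave.
toy_averages <- toy %>%
  group_by(sex, wave) %>%
  summarise(mean_friends = mean(friendships, na.rm = TRUE), .groups = "drop")

p <- ggplot() +
  # individual scores, jittered horizontally only
  geom_jitter(data = toy, aes(x = wave, y = friendships, color = sex),
              width = 0.2, height = 0) +
  # group means as large squares
  geom_point(data = toy_averages, aes(x = wave, y = mean_friends, color = sex),
             shape = "square", size = 4) +
  # group means as text, rounded to one decimal, just above the squares
  geom_text(data = toy_averages,
            aes(x = wave, y = mean_friends, label = round(mean_friends, 1), color = sex),
            vjust = -1, show.legend = FALSE)
p
```

Like the other geoms, `geom_text()` gets the summary data through its own `data` argument; the `label` aesthetic tells it which variable to print.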

Workflow and Data Import

From this point on in this tutorial, you are supposed to work in RStudio.

R project

An R Project keeps everything in the same place.

- Create a new project for your Data Project: *File > New Project…*
    + **New Directory**: Start a project in a brand new working directory; or
    + **Existing Directory**: Create a new project in an existing directory.
    + Select a suitable directory name (which is also the project name).

The project directory is the working directory.

This is where R looks for your (data) files, unless you specify otherwise.
This directory is displayed in the RStudio Files tab.

Store the Data Project data files in this directory.

R Workspace

While you are working on a project, R collects information, creates data objects, and so on.

Everything in R’s memory is called the workspace.
Part of the workspace is shown in RStudio’s Environment panel.

The workspace can be saved:

RStudio menu: Session > Save Workspace As...;
R command: save.image("filename.RData").

And loaded:

RStudio menu: Session > Load Workspace...;
R command: load("filename.RData").
An R workspace file has the file name extension .RData.

RStudio's standard settings save and load the workspace when you close and open a project. This is risky. It is better to have a clean, reproducible workspace without data from previous runs.

So, adjust the global settings of RStudio:
- Go to _Tools > Global Options_,
- Uncheck _Restore .RData into workspace at startup_,
- Select _Never_ for _Save workspace to .RData on exit_.

Global settings need to be set only once.

R Script

R version of SPSS syntax.
A file with extension .R, containing R code.
Similar to answer boxes in this tutorial.

We are not going to use script files because we embed all our code within the R Markdown document (next topic).

R Markdown: Reproducible Research

For both the weekly problem sets and your Data Project, you will be working with R Markdown.

An R Markdown file contains all steps from data to results:

Commands to clean and analyze data.
Comments to explain steps in data cleaning and analysis.
Text, graphs, and tables presenting the research to the reader.

Eve`R`ything in one place!

R Markdown YAML

Open a new R Markdown document in RStudio. Adjust the first part of the document, called YAML (title, authors, date) and add a table of contents.

__Tips:__
- You can set some YAML options in RStudio with _Output Options_ under the settings button (to the right of the Knit button).
- Alternatively, use the cheat sheet (RStudio: _Help > Cheat Sheets > R Markdown Cheat Sheet_).

First code chunk

Good practice:

Load all required libraries in the first code chunk.
- Example: library(tidyverse) #load the tidyverse packages
Set all global settings in the first code chunk.
- Recommended global settings for (knitting/rendering) code chunks: knitr::opts_chunk$set(eval = TRUE, echo = FALSE, warning = FALSE, message = FALSE)

Adjust the settings in your R Markdown document and load the tidyverse package.

More info on the meaning of code chunk settings:

R Markdown cheat sheet (RStudio: Help > Cheat Sheets > R Markdown Cheat Sheet) or at the knitr web page.

Code chunk names

A code chunk may have a name (label), e.g., setup. A chunk name:

May not contain a space.
May only occur once in an R Markdown document.
Informative chunk names are handy: quickly navigate via RStudio's code outline (button at the bottom left of the R Markdown screen).

Another option for quick navigation:

Open a table of contents of the sections and sub-sections: document outline button (at the top right).

Load your data

To work with your data, they must be loaded.

- Download the Data Project data file marked for initial practice from Canvas to your project directory. Or: Each team member downloads a different Data Project data file.
- Get rid of the text and code chunks from the standard R Markdown document.
- Add a header to start the section on data description.
- Add your first code chunk (use the _Insert_ button at the top of the R Markdown screen).

All Data Project data sets are in csv format.

Add a code chunk, name it, and import your data file into a data object with `read_csv()`.

Note that read_csv() is part of the readr package, which is automatically loaded by the tidyverse package.

This function has some important features:

read_csv() uses the first row as variable names.
It guesses variable type: character, integer, double, ...
It reports the guessed variable types as a column specification (cols( ... )), which you can pass to the col_types argument.
Files ending in .gz, .bz2, .xz, or .zip are automatically uncompressed.
Files starting with http://, https://, ftp://, or ftps:// are automatically downloaded.

We don't like guesses; we want to be sure that the variable type is correct.

- Check the variable types reported when the data was read against the original data set.
- Add the `col_types` argument to your `read_csv()` command and copy the reported (guessed) variable types (`cols( ... )`) behind the equals sign.

Now, you are sure that the data will be read in the right way.
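A minimal sketch of an import with explicit column types. The file and its variables are made up for illustration and written to a temporary location, so the example runs without a Data Project file.

```r
library(readr)

# Write a small mock csv file to a temporary location (hypothetical data).
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c("student,age,friendships",
    "s001,13.1,4",
    "s002,12.9,2"),
  csv_file
)

# Import with explicit column types instead of guesses.
toy <- read_csv(
  csv_file,
  col_types = cols(
    student = col_character(),
    age = col_double(),
    friendships = col_integer()
  )
)

toy
```

With `col_types` specified, `read_csv()` no longer has to guess. If the console message shows only a summary of the guessed types, `spec()` applied to the imported data object should print the full column specification.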

__Programming Tips__
- Always assume that things go wrong, so you have to convince yourself that your code produces the right results.
- Use comments abundantly. Explain why you do things in a particular way. It helps your group members and your future self to understand the code.
- In a code chunk, press Ctrl/Cmd-Enter to run the command in which your cursor is positioned.

R Markdown plot

Code chunks that produce a plot will show a plot in a knitted document.

You may want to change the appearance of the plot with code chunk options.

Main options for a code chunk creating a plot (book p. 465-467, {28.7.1, 28.7.2}):

fig.cap = "": Add a caption to the plot.
fig.asp = 0.6: Set the ratio of plot height to plot width.
out.width = "75%": Set plot width as a percentage of text width.
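As a sketch of how these options go in the chunk header, the chunk below combines all three. The chunk name `example-plot`, the caption text, and the plot itself (built from the `mpg` data set that ships with ggplot2) are placeholders chosen for illustration; `fig.asp = 1` gives a square plot.

```{r example-plot, fig.cap = "Engine displacement and highway mileage.", fig.asp = 1, out.width = "75%"}
library(ggplot2)

# Placeholder plot: any code that produces a plot works here.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Print the plot so that it appears in the knitted document.
p
```

The options are separated by commas after the chunk name, and the caption only appears once the document is knitted.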

- Add one of the plots created in the Data Project part of the previous session in a separate code chunk.
- Add a caption to the plot and change the plot to a square layout. Knit the document to check the results.

__Programming Tips__

- Carefully inspect a knitted R Markdown document for the presence and layout of plots and tables and for unwanted code or R messages.
- Don't worry about code, or output created by code, that you may not need in the end. It is easy to skip a code chunk and not display its results by setting the code chunk option `eval` to `FALSE`. Preserving (but hiding) unnecessary code saves you from having to write the code again later on.
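A minimal sketch of a skipped chunk is shown below; the chunk name `old-summary` and the command inside it are placeholders. Note that `eval = FALSE` only prevents the code from being run: the code itself is still printed in the knitted document unless `echo = FALSE` is set as well, either for this chunk or for the whole document.

```{r old-summary, eval = FALSE, echo = FALSE}
# Kept for later reference; not run and not shown when knitting.
summary(mtcars)
```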

Adding (formatted) text

Now, add a description of the plot or data file for the reader of the document.

Use R Markdown text formatting options if needed; see Help > Markdown Quick Reference in RStudio for the main options.

Some often-used text formatting options in R Markdown:

-   Headers: #, ##, ###
-   Font type: *italics*, **bold**
-   (un)ordered lists: First level *, second level indented +
-   links: [linked phrase](http://example.com)
-   image: ![](example.png)
-   blockquotes: >
-   LaTeX equations: $equation$
-   superscript^2^ and subscript~2~

Knitting to PDF

Knit (render) an R Markdown document with the Knit button.

knitr::include_graphics("images/Knit.png")

Knitting to HTML:

Is fastest.
Always knit to HTML first, to check the results.

Knitting to a paper document:

Knit to PDF or Word.
PDF output is supported better than Word,
but requires the installation of a TeX package (see the online book on R Markdown).

Knit your R Markdown document to PDF to check that it works and that the TeX package is correctly installed.

If you want to fine-tune your PDF document:

Add keep_tex: TRUE to the document YAML, like this (mind the indentation):

output:
  pdf_document: 
    keep_tex: TRUE

The YAML option keep_tex: TRUE saves the TeX file (extension .tex), which you can open in a TeX editor, for example Overleaf or Texmaker.

Data Project

Start working on the R Markdown document for your Data Project in RStudio.

Decide how you are going to collaborate on one R Markdown document.
Sprint 1 Review & Sprint 1 Retrospective.
Sprint 2 Planning.
Remaining time: Work on the Sprint 2 Backlog.

Plenary updates Sprint 1 SCRUM masters

Last 15 minutes of the session.

Reminders

Problem Set 1

Available from 1 PM today.
Download R Markdown document with assigned exercises and accompanying data set from Canvas.
Add answers and R code in the R Markdown document.
Knit R Markdown document to PDF or Word.
Submit R Markdown document and PDF or Word document to the Canvas assignment.
Submission deadline: Sunday, 12 PM (noon).
Lecturer's feedback returned in the PDF or Word document.

And as always...

Study course content
Meet your buddy
Work on Sprint 2
Sprint 2 SCRUM masters: Keep your team on course!

WdeNooy/UsingRTutorials documentation built on Jan. 25, 2023, 2:39 a.m.
