library(learnr)
library(gradethis)
library(knitr)
tutorial_options(exercise.timelimit = 60, exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
# Ensure that library is loaded.
library(tidyverse)
# Ensure that the data is loaded for the remainder of this tutorial.
Glasgow <- UsingRTutorials::Glasgow
First 1.5 hours: Course content
dplyr::
Second 1.5 hours: Data project
dplyr::
Data wrangling: Transforming (raw) data into (useful) information.
Today, we use a data set containing information about friendships, tobacco, alcohol, and substance use among 160 students, who were followed over their second, third and fourth year at a secondary school in Glasgow (Teenage Friends and Lifestyle Study research project).
The data set, named `Glasgow`, is available within this tutorial, so you do not have to load it.
The `tidyverse` approach to data wrangling can be summarized as follows:
Pipe (`%>%`): Use the resulting data of the previous step as input data of the next step.
data.frame(
  Function = c("filter(): select cases",
               "arrange(): sort cases",
               "select(): select variables",
               "mutate(): compute new variables",
               "summarise(): aggregate (collapse) data",
               "group_by(): split by group"),
  Goal = c("I want to focus on part of my cases.",
           "I want to rearrange my cases.",
           "I want to focus on some of my variables.",
           "I want to change variables.",
           "I want summary statistics.",
           "I want summaries or changed variables for each group.")
) %>%
  knitr::kable(
    caption = "Main data transformation functions",
    booktabs = TRUE
  ) %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Or, visually:
knitr::include_graphics("images/mainwrangling.png")
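The six functions can be chained in one pipe. The sketch below uses a small hypothetical data frame (`toy`, with made-up values and a made-up `money_week` variable), not the `Glasgow` data, so it can be run anywhere `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data, not the Glasgow data set.
toy <- tibble(
  student = c("a", "b", "c", "d", "e", "f"),
  sex     = c("boy", "girl", "boy", "girl", "boy", "girl"),
  money   = c(10, 25, -1, 0, 5, 15)
)

toy_summary <- toy %>%
  filter(money >= 0) %>%               # select cases: drop the -1 code
  mutate(money_week = money / 4) %>%   # compute a new variable
  select(sex, money, money_week) %>%   # select variables
  group_by(sex) %>%                    # split by group
  summarise(                           # aggregate each group to one row
    n = n(),
    mean_money = mean(money)
  ) %>%
  arrange(desc(mean_money))            # sort cases

toy_summary
```

Each step receives the data frame produced by the step before it, so no intermediate data frames have to be saved.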
helpData1 <- filter(Glasgow, money >= 0)
helpData2 <- group_by(helpData1, sex, student)
helpData3 <- summarise(helpData2, n_rom = sum(romantic == "yes", na.rm = TRUE))
count(helpData3, sex, n_rom)
# In a pipe, the data frame originating from a previous step is automatically # the data frame used for the next step. You don't have to save an intermediary # data frame or specify its name in a pipe. And don't forget to add the pipe # symbol!
# A pipe for the first function:
Glasgow %>%
  filter(money >= 0)

# Have a look at the result of this pipe with View().
Glasgow %>%
  filter(money >= 0) %>%
  View()
# Note that the View window may be hidden behind your RStudio screen.

# A pipe for the first two functions:
Glasgow %>%
  filter(money >= 0) %>%
  group_by(sex, student)
# Note that group_by() does not change the result.
# Its effect is that subsequent functions are applied to each group
# instead of to the whole data set.

# See the impact of group_by() on the summarise() function:
Glasgow %>%
  filter(money >= 0) %>%
  group_by(sex, student) %>%
  summarise(n_rom = sum(romantic == "yes", na.rm = TRUE)) %>%
  View()
# Also run the code without group_by() to see the difference.
# You can now finish this pipe on your own, right?
Glasgow %>%
  filter(money >= 0) %>%
  group_by(sex, student) %>%
  summarise(n_rom = sum(romantic == "yes", na.rm = TRUE)) %>%
  count(sex, n_rom)
gradethis::grade_code( incorrect = "Don't mind an `Error occured while checking the submission` message." )
Frequency tables count how often values appear in the data: counts per group.
A count is a summary: We replace all cases within a group by one case containing the number of cases in that group.
Glasgow %>%
  # we need summarizing numbers for each number of friendships, so group first
  # grouping automatically sorts on the variable, so the cumulation works fine
  group_by(friendships) %>%
  # summary statistics: note that a new variable can be used immediately
  summarise(
    Freq = n(),
    Perc = 100 * Freq / nrow(Glasgow)
  )
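As a sketch of the same idea on hypothetical data (eight made-up students in `toy`), `count()` can replace `group_by()` plus `summarise(Freq = n())`, and a cumulative percentage (here named `CumPerc`, an assumed name) follows from `cumsum()` because the groups are already sorted.

```r
library(dplyr)

# Hypothetical toy data: number of friendships for eight students.
toy <- tibble(friendships = c(0, 1, 1, 2, 2, 2, 3, 3))

freq_table <- toy %>%
  # count() groups, counts, and ungroups in one step
  count(friendships, name = "Freq") %>%
  mutate(
    Perc = 100 * Freq / sum(Freq),  # percentages, not proportions
    CumPerc = cumsum(Perc)          # running total of the percentages
  )

freq_table
```

The last value of `CumPerc` should always be 100, which is a quick check that no cases were lost.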
As always, use tidyverse functions and join all functions in one pipe.
# You must group the data before you summarise.
Glasgow %>%
  group_by( ??? )
# Which variable provides the groups?
# In other words: The values of which variable are counted?

# Use summarise() with the n() function to count the number of cases per group.
Glasgow %>%
  group_by( ??? ) %>%
  summarise( Freq = n() )
# Instead of summarise(Freq = n()), we can use the function count().

# For relative frequencies, divide the raw counts by the number of cases.
Glasgow %>%
  group_by( ??? ) %>%
  summarise(
    Freq = n(),
    Perc = ??? / nrow(Glasgow)
  )
# Note that we can use the data set again within a pipe step!
gradethis::grade_result(
  pass_if(
    ~ {
      nrow(.result) == 13 &&
        ncol(.result) == 3 &&
        identical(names(.result), c("friendships", "Freq", "Perc")) &&
        round(.result[[1, 3]]) == 13 &&
        round(.result[[2, 3]]) == 12
    },
    # function(x) {nrow(x) == 13 && ncol(x) == 4 && names(x) == c("friendships", "Freq", "Perc", "CumPerc") && round(x[[1, 3]]) == 13 && round(x[[2, 4]]) == 25 },
    "You correctly grouped and summarised the number of friendships, using the exact same variable names as in the presented table. And you did not forget to create percentages instead of proportions."
  ),
  fail_if(~ nrow(.result) != 13, "How can you get one row for each number of friendships?"),
  fail_if(~ ncol(.result) != 3, "Did you summarize the frequencies and the percentages? Use `summarize()` to calculate the frequencies and percentages; this will give you one row for each number of friendships."),
  fail_if(~ !(identical(names(.result), c("friendships", "Freq", "Perc"))), "Use the names of new variables exactly as they are used in the presented table."),
  fail_if(~ round(.result[[1, 3]]) != 13, "Did you notice that we need percentages, not proportions? Use `Perc = 100 * Freq / nrow(Glasgow)`.")
)
Some useful functions for recoding or grouping variables have been added to `tidyverse` since the publication of the book R for Data Science:
`recode(x, old = new, old = new, ...)`: Replace single old values by new values in variable `x`.
# School year instead of wave indicator.
Glasgow %>%
  mutate(schoolyear = recode(wave, "t1" = 2, "t2" = 3, "t3" = 4))
# Note that the old value is named first.
`case_when(criterion ~ new value, criterion ~ new value, ...)`: Replace sets of old values (according to a criterion) by new values.
# Group number of friends.
Glasgow %>%
  mutate(friends_class = case_when(
    friendships == 0 ~ "No friends",
    friendships < 5 ~ "1 - 4 friends",
    TRUE ~ "5+ friends"))
# The condition is to the left of the tilde, the new value to the right.
# Note that the order of the steps matters: persons without friends are
# classified first, so they are excluded from the group with fewer than 5 friends.
`ntile(x, n)`: Group variable `x` into `n` bins, each containing approximately the same number of cases.
# Group number of friends in three (more or less) equally large groups.
Glasgow %>%
  mutate(friends_bin = ntile(friendships, 3))
`na_if(x, y)`: Replace a specific value `y` on variable `x` with `NA`.
# Set -1 friends to missing.
Glasgow %>%
  mutate(friends_nonmissing = na_if(friendships, -1))
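The four functions can be combined in a single `mutate()`. The following is a minimal sketch on hypothetical values (the `toy` data frame and the names `friends_clean`, `friends_class`, and `friends_bin` are made up for illustration); it assumes that -1 codes a missing answer, as in the `na_if()` example.

```r
library(dplyr)

# Hypothetical values: -1 stands for a missing answer.
toy <- tibble(
  wave = c("t1", "t2", "t3", "t1", "t2", "t3"),
  friendships = c(-1, 0, 2, 4, 5, 8)
)

toy_recoded <- toy %>%
  mutate(
    # recode(): one old value becomes one new value
    schoolyear = recode(wave, "t1" = 2, "t2" = 3, "t3" = 4),
    # na_if(): turn the -1 code into NA before grouping
    friends_clean = na_if(friendships, -1),
    # case_when(): the first matching condition wins
    friends_class = case_when(
      is.na(friends_clean) ~ NA_character_,
      friends_clean == 0 ~ "No friends",
      friends_clean < 5 ~ "1 - 4 friends",
      TRUE ~ "5+ friends"
    ),
    # ntile(): three bins of roughly equal size; NA stays NA
    friends_bin = ntile(friends_clean, 3)
  )

toy_recoded
```

Cleaning the -1 code first matters: without it, `friendships < 5` would put the missing answer in the "1 - 4 friends" group.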
Now, do it yourself.
# Use __mutate()__ to create the new variable.
Glasgow %>%
  mutate(money_class = ??? )
# Which of the recoding and grouping functions should you use?
Glasgow %>% mutate(money_class = ntile(x = money, n = 3))
gradethis::grade_code( correct = "", incorrect = "Please, specify argument names." )
# Use mutate().
Glasgow %>%
  mutate( money_class2 = ???() )
# Which of the recoding and grouping functions should you use?

# Indeed, use case_when().
Glasgow %>%
  mutate( money_class2 = case_when( ) )
# How do you specify the conditions and new values?

# What is the condition for groups -1 (negative pocket money) and 0 (no pocket money)?
# Use __==__, __>__, or __<__ for "equals", "larger than", and "smaller than".
Glasgow %>%
  mutate(
    money_class2 = case_when(
      money ?? ~ -1,
      money ?? ~ 0
    )
  )

# What is the condition for group 1: 1 - 10?
# Use __>=__ and __<=__ for "larger or equal" and "smaller or equal".
Glasgow %>%
  mutate(
    money_class2 = case_when(
      money < 0 ~ -1,
      money == 0 ~ 0,
      ??? ~ 1
    )
  )

# What is the condition for group 2: more than 10 pounds per month?
# You can assign all remaining cases to this group.
Glasgow %>%
  mutate(
    money_class2 = case_when(
      money < 0 ~ -1,
      money == 0 ~ 0,
      money <= 10 ~ 1,
      ??? ~ 2
    )
  )
Glasgow %>% mutate(money_class2 = case_when(money < 0 ~ -1, money == 0 ~ 0, money <= 10 ~ 1, TRUE ~ 2))
Send the results to the screen.
# Use mutate().
Glasgow %>%
  mutate( money = ???() )
# Which of the recoding and grouping functions should you use?
# Indeed, the na_if() function. # Check out the help for this function to see which arguments you have to use.
Glasgow %>% mutate(money = na_if(x = money, y = -1))
gradethis::grade_code( correct = "", incorrect = "" )
In R, missing values are indicated by `NA`.
The `alcohol` variable in the `Glasgow` data set has missing values.
quiz( caption = "", question("`Glasgow %>% filter(alcohol == \"1 none\")`", answer("Missing values are included."), answer("Missing values are ignored.", correct = TRUE), answer("The result is a missing value.") ), question("`Glasgow %>% select(alcohol)`", answer("Missing values are included.", correct = TRUE), answer("Missing values are ignored."), answer("The result is a missing value.") ), question("`Glasgow %>% summarise(no_alcohol = sum(alcohol == \"1 none\"))`", answer("Missing values are included."), answer("Missing values are ignored."), answer("The result is a missing value.", correct = TRUE) ), question("`Glasgow %>% summarise(no_alcohol = sum(alcohol == \"1 none\", na.rm = TRUE))`", answer("Missing values are included."), answer("Missing values are ignored.", correct = TRUE), answer("The result is a missing value.") ) )
#Copy code from the questions here...
Missing values are special: we cannot use them like other values.
summarise(Glasgow, alcohol_NA = (alcohol == NA))
summarise(Glasgow, alcohol_NA = sum(is.na(alcohol)))
gradethis::grade_code()
The previous exercise does not use a pipe because we apply just one transformation. Here, a pipe is perhaps a bit too much.
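The behaviour of `NA` can be checked on a small vector in base R. In this sketch the vector and the label `"2 some"` are hypothetical; only `"1 none"` is taken from the questions above.

```r
# Hypothetical vector with one missing value.
alcohol <- c("1 none", NA, "2 some", "1 none")

# Comparing anything with NA yields NA, never TRUE or FALSE.
compared_to_na <- alcohol == NA

# is.na() is the test that does work.
is_missing <- is.na(alcohol)
n_missing <- sum(is_missing)

# One NA among the comparisons makes the whole sum NA...
n_none_raw <- sum(alcohol == "1 none")
# ...unless missing values are removed first.
n_none <- sum(alcohol == "1 none", na.rm = TRUE)
```

This is why `alcohol == NA` cannot be used to find missing values: R does not know whether an unknown value equals another unknown value.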
Finally, how does R treat a logical variable if we `sum()` it?
Time for another little trick:
sum(c(TRUE, TRUE, FALSE, FALSE, NA))
It may help your understanding to use `mean()` as well as `sum()`.
gradethis::grade_result( fail_if(~ is.na(.result), "Don't forget to add `na.rm=TRUE` to ignore missing values."), pass_if(~ { .result > 0 && .result < 1 }, "R replaces `TRUE` by `1` and `FALSE` by `0` when it calculates with a logical variable."), pass_if(~ TRUE, "R replaces `TRUE` by `1` and `FALSE` by `0` when it calculates with a logical variable.") )
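A short base R sketch of what happens to the logical vector from the exercise: `TRUE` counts as 1 and `FALSE` as 0, so the sum is a count and the mean is a proportion. The object names are made up for illustration.

```r
answers <- c(TRUE, TRUE, FALSE, FALSE, NA)

# The numeric values R calculates with: 1 for TRUE, 0 for FALSE, NA stays NA.
as_numbers <- as.integer(answers)

# Number of TRUE values among the non-missing cases.
n_true <- sum(answers, na.rm = TRUE)

# Proportion of TRUE values among the non-missing cases.
prop_true <- mean(answers, na.rm = TRUE)
```

The proportion is calculated over the four non-missing values, not over all five.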
mutate()
- Ordinary use: Calculate a new variable value for each case from the case's 'own' value on one or more variables. Example: grouping a variable.
`dplyr` cheat sheet:
- Value of the preceding (`lag()`) or successive (`lead()`) case in the data frame.
What if we use these functions with:
- sorted data (`arrange()`),
- grouped data (`group_by()`)?
The `Glasgow` data set contains the number of `friendships` of each student in three successive `wave`s (t1, t2, and t3).
Glasgow %>%
  #sort on student and wave within student
  arrange( ____ ) %>%
  #group by student, so that only data for the same student are used
  group_by( ____ ) %>%
  #use a special function to calculate the difference
  mutate(
    prev_friendships = ____(____), #number of friendships in the preceding wave (if any)
    change = ____ #difference: later minus earlier
  ) %>%
  select( _____ )

# correct code
Glasgow %>%
  #sort on student and wave within student
  arrange(student, wave) %>%
  #group by student, so that only data for the same student are used
  group_by(student) %>%
  #use lag() to calculate the difference
  mutate(
    #number of friendships in the preceding wave (if any); this command can be included in the next
    prev_friendships = lag(friendships),
    #difference: later minus earlier
    change = friendships - prev_friendships
  ) %>%
  select(student, wave, friendships, prev_friendships, change)
gradethis::grade_result( pass_if(~ { #required variables created (named) and retained "student" %in% names(.result) && "wave" %in% names(.result) && "friendships" %in% names(.result) && "prev_friendships" %in% names(.result) && "change" %in% names(.result) && #only required variables selected ncol(.result) == 5 && #correctly sorted identical(.result$student[1], "s001") && identical(.result$wave[1], "t1") && #prev_friendships correctly calculated identical(.result$prev_friendships, transmute(group_by(arrange(Glasgow, student, wave), student), prev_friendships = lag(friendships))$prev_friendships) && #change correctly calculated identical(.result$change, .result$friendships - .result$prev_friendships) }, "You correctly sorted and grouped the data before taking the preceding value of friendships as the value for prev_friendships, which you used to calculate change."), fail_if(~ !("student" %in% names(.result) && "wave" %in% names(.result) && "friendships" %in% names(.result) && "prev_friendships" %in% names(.result) && "change" %in% names(.result)), "Did you create the two new variables with the right names and retain the required variables in the data set?"), fail_if(~ ncol(.result) != 5, "Did you select only the required variables in the data set at the end?"), fail_if(~ !(identical(.result$student[1], "s001") && identical(.result$wave[1], "t1")), "Sort the data on student and wave before you create the new variables."), fail_if(~ !(identical(.result$prev_friendships, transmute(group_by(arrange(Glasgow, student, wave), student), prev_friendships = lag(friendships))$prev_friendships)), "Did you group the data by student? Did you use the `lag()` function to create a new variable containing the number of friendships in the preceding wave?"), fail_if(~ !(identical(.result$change, .result$friendships - .result$prev_friendships)), "You did not calculate the change in number of friendships correctly from `friendships` and `prev_friendships`. 
Did you subtract the wrong variable?") )
What if we calculate the change in friendships for a student missing an observation for a wave?
#Example of observation missing for one student.
data.frame(
  student = c("s997", "s998", "s998", "s999"),
  wave = c("t3", "t1", "t3", "t1"),
  friendships = c(4, 2, 3, 6)
) %>%
  knitr::kable(booktabs = TRUE) %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE)
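Applying the `lag()` approach to the four example rows shows the problem. This is a sketch; the helper variable `prev_wave` is added here only to make visible which wave the lagged value comes from.

```r
library(dplyr)

# The example rows: student s998 has no observation for wave t2.
gaps <- tibble(
  student = c("s997", "s998", "s998", "s999"),
  wave = c("t3", "t1", "t3", "t1"),
  friendships = c(4, 2, 3, 6)
)

changes <- gaps %>%
  arrange(student, wave) %>%
  group_by(student) %>%
  mutate(
    prev_wave = lag(wave),                 # wave of the preceding row
    prev_friendships = lag(friendships),   # friendships in the preceding row
    change = friendships - prev_friendships
  ) %>%
  ungroup()

changes
```

For student s998, `change` compares wave t3 with wave t1, because `lag()` takes the preceding row, not the preceding wave. Students with a single observation get `NA`.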
Don't assume that your data are perfect! Use code to check your data.
Step 1 - Formulate the precise conditions that you want to check:
Step 2 - Translate the conditions into R code.
- Use `count()` or `group_by()` and `summarise()`.
- Use `filter()` to select cases that violate the conditions.
# correct code
Glasgow %>%
  #for each student...
  group_by(student) %>%
  #count the number of observations and the number of different waves in the data
  summarise(
    n_obs = n(),
    n_dist = n_distinct(wave)
  ) %>%
  #filter cases with n_obs != 3 or n_dist != 3
  filter(n_obs != 3 | n_dist != 3)
Glasgow %>% group_by(student) %>% summarise( n_obs = n(), n_dist = n_distinct(wave) ) %>% filter( n_obs != 3 | n_dist != 3)
# correct code
Glasgow %>%
  #count the different waves in the data
  count(wave)
gradethis::grade_result( pass_if(~ identical(.result, count(Glasgow, wave)), "Indeed, we only have values `t1`, `t2`, and `t3`."), fail_if(~ !identical(.result, count(Glasgow, wave)), "Use `count()` to create a frequency table of wave values.") )
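To see what such a check returns when the data are not perfect, it can be run on a small hypothetical data frame (`toy`, made up for this sketch) in which one student misses a wave and another has a duplicated wave.

```r
library(dplyr)

# Hypothetical data: s002 misses wave t2, s003 has wave t1 twice.
toy <- tibble(
  student = c("s001", "s001", "s001", "s002", "s002", "s003", "s003", "s003"),
  wave    = c("t1", "t2", "t3", "t1", "t3", "t1", "t1", "t2")
)

problems <- toy %>%
  group_by(student) %>%
  summarise(
    n_obs = n(),               # number of rows per student
    n_dist = n_distinct(wave)  # number of different waves per student
  ) %>%
  # keep only students who violate "exactly three different waves"
  filter(n_obs != 3 | n_dist != 3)

problems
```

An empty result means that all students pass the check; here two students are reported. Counting rows alone would not have caught s003.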
R has several packages for creating tabular output. In a later session, we will discuss some packages for tabulating statistical output.
Here, we present a function (`kable()`) in a basic package for tabular output (`knitr`), which works very well with piping and R Markdown (discussed later in this tutorial). In addition, we use the `kableExtra` package to fine-tune tables created with `kable()`.
Let's start by noting that we must have the data in the required shape before we create a table. In other words, `kable()` does not do any counting or summarizing for us. It only displays our data.
Glasgow %>%
  group_by(sex, wave) %>%
  summarise(tobacco_prop = mean(tobacco != "1 none", na.rm = TRUE)) %>%
  ungroup() %>%
  select(-sex) %>%
  kable(
    digits = 2,
    col.names = c("Wave", "Proportion using tobacco"),
    align = "lcc",
    caption = "Proportion of Glasgow students using tobacco."
  )
# Use help on the `kable()` function (`?kable`) for more information on the options provided by `kable()`.
The above table is not particularly pretty, so let us improve it with the help of package `kableExtra`.
library(kableExtra)
Glasgow %>%
  group_by(sex, wave) %>%
  summarise(tobacco_prop = mean(tobacco != "1 none", na.rm = TRUE)) %>%
  ungroup() %>%
  select(-sex) %>%
  kable(
    digits = 2,
    col.names = c("Wave", "Proportion using tobacco"),
    align = "lcc",
    caption = "Proportion of Glasgow students using tobacco."
  ) %>%
  kableExtra::kable_classic(full_width = FALSE) %>%
  kableExtra::pack_rows("Boys", 1, 3) %>%
  kableExtra::pack_rows("Girls", 4, 6)
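The division of labour between summarizing and displaying can be tried without the `Glasgow` data. This sketch uses hypothetical values (the `toy` data frame and the label `"2 occasional"` are made up) and asks for a plain-text pipe table, so the result can be inspected in the console.

```r
library(dplyr)
library(knitr)

# Hypothetical data: tobacco use of four students in two waves.
toy <- tibble(
  wave = c("t1", "t1", "t1", "t1", "t2", "t2", "t2", "t2"),
  tobacco = c("1 none", "1 none", "2 occasional", "1 none",
              "1 none", "2 occasional", NA, "2 occasional")
)

tobacco_table <- toy %>%
  # Step 1: bring the data in the required shape (one row per wave).
  group_by(wave) %>%
  summarise(tobacco_prop = mean(tobacco != "1 none", na.rm = TRUE)) %>%
  # Step 2: kable() only formats and displays these rows.
  kable(
    format = "pipe",
    digits = 2,
    col.names = c("Wave", "Proportion using tobacco"),
    align = "lc"
  )

tobacco_table
```

Rounding with `digits` affects only the displayed table; the summarized data keep their full precision.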
If you store summary information about a data set in a data object, you can display the summary information in `ggplot()` along with the original data. Every geom can have its own data argument, so specify the data object with summaries in a geom that visualizes the summary information.
The below graph, for example, shows the number of friendships for boys and girls over the three waves (with jitter) as well as their average number of friends.
#Calculate and save average number of friends.
averages <- Glasgow %>%
  #group waves by sex
  group_by(sex, wave) %>%
  #calculate mean, ignoring missing values
  summarise(mean_friends = mean(friendships, na.rm = TRUE))
#Create plot
ggplot() +
  #add individual friendship score with jitter
  geom_jitter(data = Glasgow, aes(x = wave, y = friendships, color = sex)) +
  #add mean scores as large boxes
  geom_point(data = averages, aes(x = wave, y = mean_friends, color = sex),
             shape = "square", size = 4) +
  #add lines to link the means: note the `group` argument
  geom_line(data = averages, aes(x = wave, y = mean_friends, group = sex, color = sex))
From this point on in this tutorial, you are supposed to work in RStudio.
An R Project keeps everything in the same place.
The project directory is the working directory.
While you are working on a project, R collects information, creates data objects, and so on.
The workspace can be saved (`save.image("filename.RData")`) and loaded (`load("filename.RData")`).
The default workspace file is `.RData`.
RStudio's standard settings save and load the workspace when you close and open a project. This is risky. It is better to have a clean, reproducible workspace without data from previous runs.
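A minimal sketch of what saving and loading a workspace does, using a temporary file instead of `.RData` so that nothing in your project is overwritten. The object `demo_value` is hypothetical.

```r
# Create an object in the global environment (the workspace).
assign("demo_value", 42, envir = globalenv())

# Save the complete workspace to a temporary file.
workspace_file <- tempfile(fileext = ".RData")
save.image(workspace_file)

# Remove the object, then restore it by loading the saved workspace.
rm("demo_value", envir = globalenv())
load(workspace_file, envir = globalenv())
restored <- get("demo_value", envir = globalenv())
```

The object comes back without the code that created it having been run again, which is exactly why an automatically restored workspace can hide that your code no longer reproduces your results.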
Global settings need to be set only once.
Script files have extension `.R` and contain R code.
We are not going to use script files because we embed all our code within the R Markdown document (next topic).
For both the weekly problem sets and your Data Project, you will be working with R Markdown.
An R Markdown file contains all steps from data to results:
Good practice:
library(tidyverse) #load the tidyverse packages
knitr::opts_chunk$set(eval = TRUE, echo = FALSE, warning = FALSE, message = FALSE)
More info on the meaning of code chunk settings: the R Markdown cheat sheet (`RStudio: Help > Cheat Sheets > R Markdown Cheat Sheet`) or the `knitr` web page.
A code chunk may have a name (label, e.g., `setup` in the pic below):
Another option for quick navigation:
To work with your data, they must be loaded.
All Data Project data sets are in `csv` format.
Note that `read_csv()` is part of the `readr` package, which is automatically loaded by the `tidyverse` package.
This function has some important features:
- `read_csv()` uses the first row as variable names.
- `read_csv()` guesses the variable types unless you specify them with the `col_types` argument.
We don't like guesses; we want to be sure that the variable type is correct.
Now, you are sure that the data will be read in the right way.
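A sketch of reading a csv file with explicit variable types. The file is a temporary, hypothetical one written on the spot; the column names and types are assumptions for illustration, not those of your Data Project data set.

```r
library(readr)

# Write a small hypothetical csv file to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c("student,wave,friendships",
    "s001,t1,3",
    "s001,t2,NA",
    "s002,t1,0"),
  csv_file
)

# Read the file, stating the type of every variable instead of letting readr guess.
demo <- read_csv(
  csv_file,
  col_types = cols(
    student = col_character(),
    wave = col_factor(levels = c("t1", "t2", "t3")),
    friendships = col_integer()
  )
)

demo
```

If a value does not fit the stated type, `read_csv()` reports a parsing problem, so errors in the data surface when you load them rather than later in the analysis.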
Code chunks that produce a plot will show a plot in a knitted document.
You may want to change the appearance of the plot with code chunk options.
Main options for a code chunk creating a plot (book p. 465-467, §28.7.1, §28.7.2):
- `fig.cap = ""`: Add a caption to the plot.
- `fig.asp = 0.6`: Set the ratio of plot height to plot width.
- `out.width = "75%"`: Set plot width as a percentage of text width.
Use R Markdown text formatting options if needed; see Help > Markdown Quick Reference in RStudio for the main options.
Some often used text formatting in R Markdown.
- Headers: #, ##, ###
- Font type: *italics*, **bold**
- (un)ordered lists: First level *, second level indented +
- links: [linked phrase](http://example.com)
- image: ![caption](path/to/image.png)
- blockquotes: >
- LaTeX equations: $equation$
- superscript^2^ and subscript~2~
Knit (render) an R Markdown document with the Knit button.
knitr::include_graphics("images/Knit.png")
Knitting to HTML:
Knitting to a paper document:
If you want to fine-tune your PDF document:
Add `keep_tex: TRUE` to the document YAML, like this (mind the indentation):
output:
  pdf_document:
    keep_tex: TRUE
YAML option `keep_tex: TRUE` saves the TeX file (extension `.tex`), which you can open in a TeX editor, for example Overleaf or Texmaker.
Start working on the R Markdown document for your Data Project in RStudio.
Last 15 minutes of the session.