knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE ) #necessary to render tutorial correctly library(learnr) library(htmltools) #tidyverse library(dplyr) library(ggplot2) #non tidyverse library(correlation) library(GGally) library(knitr) source("./www/discovr_helpers.R") #Read dat files needed for the tutorial exam_tib <- discovr::exam_anxiety liar_tib <- discovr::biggest_liar
# Create bib file for R packages here::here("inst/tutorials/discovr_07/packages.bib") |> knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats', 'tibble', 'ggplot2', 'knitr', "GGally", "correlation"), file = _)
r cat_space(fill = blu)
Welcome to the discovr
space pirate academyHi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()
s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj()
, and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.
On your way you will face many challenges, but follow Mae's system to keep yourself on track:
r bmu(height = 1.5)
This icon flags materials for teleporters. That's what we like to call the new cat-dets, you know, the ones who have just teleported into the academy. This material is the core knowledge that everyone arriving at space academy must learn and practice. For accessibility, these sections will also be labelled with [(1)]{.alt}.r user_visor(height = 1.5)
Once you have been at space pirate academy for a while, you get your own funky visor. It has various modes. My favourite is the one that allows you to see everything as a large plate of tuna. More important, sections marked for cat-dets with visors goes beyond the core material but is still important and should be studied by all cat-dets. However, try not to be disheartened if you find it difficult. For accessibility, these sections will also be labelled with [(2)]{.alt}.r user_astronaut(height = 1.5)
Those almost as brilliant as Mae (because no-one is quite as brilliant as her) get their own space suits so that they can go on space pirate adventures. They get to shout RRRRRR really loudly too. Actually, everyone here gets to should RRRRRR really loudly. Try it now. Go on. It feels good. Anyway, this material is the most advanced and you can consider it optional unless you are a postgraduate cat-det. For accessibility, these sections will also be labelled with [(3)]{.alt}.It's not just me that's here to help though, you will meet other characters along the way:
r alien(height = 1.5)
aliens love dropping down onto the planet and probing humanoids. Unfortunately you'll find them probing you quite a lot with little coding challenges. Helps is at hand though. r robot(height = 1.5)
bend-R is our coding robot. She will help you to try out bits of r rproj()
by writing the code for you before you encounter each coding challenge.r bug(height = 1.5)
we also have our friendly alien bugs that will, erm, help you to avoid bugs in your code by highlighting common mistakes that even Mae Jemstone sometimes makes (but don't tell her I said that or my tuna supply will end). Also, use hints and solutions to guide you through the exercises (Figure 1).
By for now and good luck - you'll be amazing!
Before attempting this tutorial it's a good idea to work through this tutorial on how to install, set up and work within r rproj()
and r rstudio()
.
The tutorials are self-contained (you practice code in code boxes). However, so you get practice at working in r rstudio()
I strongly recommend that you create an Quarto document within an r rstudio()
project and practice everything you do in the tutorial in the Quarto document, make notes on things that confused you or that you want to remember, and save it. Within this Quarto document you will need to load the relevant packages and data.
This tutorial uses the following packages:
correlation
[@correlation2020; @R-correlation]GGally
[@R-GGally]here
[@R-here]knitr
[@R-knitr]It also uses these tidyverse
packages [@R-tidyverse; @tidyverse2019]: dplyr
[@R-dplyr], forcats
[@R-forcats], ggplot2
[@wickhamGgplot2ElegantGraphics2016], and readr
[@R-readr].
There are (broadly) two styles of coding:
Explicit: Using this style you declare the package when using a function: package::function()
. For example, if I want to use the mutate()
function from the package dplyr
, I will type dplyr::mutate()
. If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).
Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name)
, and then refer to functions without their package. For example, if I want to use the mutate()
function from the package dplyr
, I will use library(dplyr)
in my first code chunk and type the function as mutate()
when I use it subsequently.
Coding style is a personal choice. The Google r rproj()
style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse
because the dplyr
and ggplot2
packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2::
before every function from ggplot2
.
You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse
package (and any others if you're using a concise style) at the beginning of your Quarto document:
library(tidyverse)
To work outside of this tutorial you need to download the following data files:
Set up an r rstudio()
project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:
liar_tib <- here::here("data/biggest_liar.csv") |> readr::read_csv() exam_tib <- here::here("data/exam_anxiety.csv") |> readr::read_csv()
To work outside of this tutorial you need to turn categorical variables into factors and set an appropriate baseline category using forcats::as_factor
and forcats::fct_relevel
.
For [liar_tib]{.alt} execute the following code:
liar_tib <- liar_tib |> dplyr::mutate( novice = forcats::as_factor(novice) )
For [exam_tib]{.alt} execute the following code:
exam_tib <- exam_tib |> dplyr::mutate( id = forcats::as_factor(id), sex = forcats::as_factor(sex) )
r bmu()
Correlation process [(1)]{.alt}Figure 2 shows a general procedure to follow when computing a bivariate correlation coefficient. First, check for sources of bias as outlined. The two most important ones in this context are linearity and normality. Remember that we're fitting linear model to the data, so if the relationship between variables is not linear then this model is invalid. To meet this requirement, the outcome variable needs to be measured at the interval or ratio level as does the predictor variable (one exception is that a predictor variable can be a categorical variable with only two categories). As far as normality is concerned, we care about this only if we want confidence intervals or significance tests and if the sample size is small.
If the data have outliers, are not normal implying a non-normal sampling distribution (and the sample is small) or your variables are measured at the ordinal level then you can use Spearman's rho or Kendall's tau, which are versions of the correlation coefficient applied to ranked data. Ranking the data reduces the impact of outliers but we lose information so, we can instead fit a robust variant such as the percentile bend correlation or Winsorized correlation. Furthermore, given that normality of the sampling distribution matters only for inferring significance and computing confidence intervals in small samples, we could use a bootstrap to compute the confidence interval in small samples, then we don't need to worry about this assumption.
r bmu()
Visualizing the data [(1)]{.alt}In a previous tutorial we looked at an example relating to exam anxiety: a psychologist was interested in the effects of exam stress and revision on exam performance. She had devised and validated a questionnaire to assess state anxiety relating to exams (called the Exam Anxiety Questionnaire, or EAQ). This scale produced a measure of anxiety scored out of 100. Anxiety was measured before an exam, and the percentage mark of each student on the exam was used to assess the exam performance. She also measured the number of hours spent revising.
r alien()
Alien coding challengeThese data are preloaded in this tutorial in a tibble called [exam_tib]{.alt}. Use the code box to see these data.
exam_tib
Note there are five variables: the participant id, the hours spent revising (revise), their exam_grade, their exam anxiety, and their biological sex. We can visualise the data easily using the GGally
package. When you want to plot continuous variables, the ggscatmat()
function from this package produces a matrix of scatterplots (below the diagonal), distributions (along the diagonal) and the correlation coefficient (above the diagonal).
r robot()
Code exampleThe ggscatmat()
function takes the general form:
GGally::ggscatmat(my_tibble, columns = c("variable 1", " variable 2", " variable 3" …))
Basically, you feed in the name of the tibble containing the variables, and use the columns argument to name the variables that you want to plot.
r alien()
Alien coding challengeFor example, to plot the variables called exam_grade, revise, and anxiety we replace ["variable 1"]{.alt} with ["exam_grade"]{.alt}, ["variable 2"]{.alt} with ["revise"]{.alt} and so on. It's as simple as that! Plot the scatterplot of the three variables.
GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety"))
r alien()
Alien coding challengeLike other plots we have done, we can also apply a ggplot2
theme in the usual way. Copy the code from the previous task and try applying theme_minimal()
GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety")) + theme_minimal()
The resulting plot shows that all of the variables are skewed. This skew could be a problem if we want to do significance tests or look at confidence intervals. The sample contains 103 observations, which is reasonably large, and possibly large enough for the central limit theorem to relieve of us of concerns about normality. We should consider using a robust method to compute the correlation coefficient itself.
r bmu()
Pearson's correlation using r rproj()
[(1)]{.alt}The correlation
package brings together lots of correlation-related stuff into a nice, easy-to-use bundle of loveliness. It has one workhorse function called … wait for it … correlation()
.
r robot()
Code exampleThe correlation()
function takes the general form:
correlation::correlation(tibble, method = "pearson", p_adjust = "holm", ci = 0.95 )
I've listed the main arguments, but there are others. For now, we'll just look at the main arguments that we will use:
To use the function, take your tibble, pipe it into the select()
function from dplyr
to select the variables you want to correlate, then pipe that into the correlation function. You can use the same code structure whether you want to correlate two variables or produce all correlations between pairs of multiple variables.
r robot()
Code exampleTo calculate the Pearson correlation between the variables exam_grade and revise in [exam_tib]{.alt}, we'd execute:
exam_tib |> dplyr::select(exam_grade, revise) |> correlation::correlation()
This code takes the [exam_tib]{.alt} and uses the select()
function to select the variables exam_grade and revise. The result is that a tibble with these two variables in is fed into the correlation()
function to compute the correlation between them.
r alien()
Alien coding challengeCalculate the Pearson correlation between the variables exam_grade and anxiety. Use kable()
to round the output to 3 decimal places.
exam_tib |> dplyr::select(exam_grade, anxiety) |> correlation::correlation() # add kable() to round to 3 decimal places
exam_tib |> dplyr::select(exam_grade, anxiety) |> correlation::correlation() |> knitr::kable(digits = 3)
r robot()
Code exampleThis approach scales up nicely. For example, if we now also want to see the correlations between all three variables we need only extend the list of variables within the select()
function. Adapt the code to compute the correlations between exam grade, revision and anxiety.
exam_tib |> dplyr::select() |> correlation::correlation() |> knitr::kable(digits = 3)
exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation() |> knitr::kable(digits = 3)
quiz( question(sprintf("The confidence interval for the association between exam grade and revision is 0.22 to 0.55. What does this tell us?"), answer("If this confidence interval is one of the 95% that contains the population value then the population value of *r* lies between 0.22 and 0.55.", correct = TRUE), answer("There is a 95% chance that the population value of *r* lies between 0.22 and 0.55", message = "You cannot make probability statements from a confidence interval. We don't know whether this particular CI is one of the 95% that contains the population value of *r*."), answer("The probability of this confidence interval containing the population value is 0.95.", message = "The probability of this confidence interval containing the population value is either 0 (it doesn't) or 1 (it does) but it's impossible to know which."), answer("I can be 95% confident that the population value of *r* lies between 0.22 and 0.55", message = "Confidence intervals do not quantify your subjective confidence."), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ), question("The *p*-value for the association between exam grade and revision is < 0.001, what does this value mean?", answer("The probability of getting a value of *t* at least as big as the value we have observed, if the value of *r* were, in fact, zero is less than 0.001. I'm going to assume, therefore, that the association between exam grade and revision is not zero.", correct = T), answer("The probability that *r* = 0.40 is a chance result is less than 0.001", message = "*p*-values do not tell us whether results occur by chance."), answer("The probability that exam grade and revision are not assciated is 0.001", message = "*p*-values do not tell us about the probability of the null hypothesis"), answer("I got a different *p*-value to 0.0o1", message = "You may have done the analysis incorrectly. Try again!"), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ) )
pearson <- exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation() rrob <- exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation(method = "percentage")
The resulting output tells us that exam grade has a correlation with revision of r = r round(pearson$r[1], 2)
, and a similar strength relationship with exam anxiety, r = r round(pearson$r[2], 2)
(but in the opposite direction). As revision increases so does exam performance, but as anxiety increase exam performance decreases. We also see that revision has a very strong negative relationship with exam anxiety, r = r round(pearson$r[3], 2)
. The more you revise the less anxiety you have about the exam. All p-values are less than 0.001 (and, so also less than 0.05) and would be interpreted as the correlation coefficients being significantly different from zero. The significance values tell us the probability of getting correlation coefficients at least this big as these in a sample of 103 people if the null hypothesis were true (there was no relationship between these variables) is very low (close to zero in fact). If we're prepared to assume that the sample is one of the 95% of samples that will produce a confidence interval containing the population value, then the confidence intervals tell us about the uncertainty around the value of r. For example, under this assumption, the population value of the association between exam grade and revision falls between r round(pearson$CI_low[1], 2)
and r round(pearson$CI_high[1], 2)
.
r user_astronaut()
Robust correlation coefficients [(3)]{.alt}Given the skew in some of the variables, we are probably better off estimating a robust correlation coefficient, like the percentage bend correlation coefficient [@Wilcox_1994].
r alien()
Alien coding challengeWe can obtain the percentage bend correlation coefficient by setting [method = "percentage"]{.alt} within the correlation()
. The code that we just used is reproduced below. Adapt it to obtain the percentage bend correlation coefficient.
exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation()
# Include method = "percentage" in correlation()
exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation(method = "percentage") # Use kable() to round the first 8 columns to 2 dp and the 9th column to 8 dp
exam_tib |> dplyr::select(exam_grade, revise, anxiety) |> correlation::correlation(method = "percentage") |> knitr::kable(digits = c(rep(2, 8), 8))
Compare the values of these robust correlations to the raw Pearson correlations you obtained earlier. Exam grade had a Pearson correlation with revision of r = r round(pearson$r[1], 2)
, but the percentage bend correlation is smaller (r = r round(rrob$r[1], 2)
). Similarly the robust versions of the other correlations are smaller than the raw Pearson correlations (although they all remain significant with ps < 0.05).
r bmu()
Spearman's Correlation Coefficient [(1)]{.alt}Spearman correlations are very easy once you've mastered correlation()
. Let's switch example. Imagine I wanted to test a theory that more creative people will be better liars. I gathered together 68 past contestants from the World's Biggest Liar competition (held annually at the Santon Bridge Inn in Wasdale in case you're interested) and noted where they were placed in the competition (first, second, third, etc.); I also gave them a creativity questionnaire (maximum score 60). The position in the competition is an ordinal variable because the places are categories but have a meaningful order (first place is better than second place and so on). Therefore, Spearman's correlation coefficient should be used (Pearson's r requires interval or ratio data). The data for this study are preloaded in this tutorial in a tibble called [liar_tib]{.alt}. If you're working outside of this tutorial see the instructions at the beginning for loading the data.
spear <- liar_tib |> dplyr::select(position, creativity) |> correlation::correlation(method = "spearman") tau <- liar_tib |> dplyr::select(position, creativity) |> correlation::correlation(method = "kendall")
r alien()
Alien coding challengeLook at the tibble by executing its name:
liar_tib
The main variables are in two columns: one labelled creativity and one labelled position, but there's also a variable id containing the participant's id and a categorical variable called novice that specifies whether the participant was entering the contest for the first time or was a previous entrant. Note that the variable position is numeric (the position in the competition is represented with numbers).
r alien()
Alien coding challengePlot creativity against position using ggscatmat()
# Use the GGally::ggscatmat() function (replace the xs) GGally::ggscatmat(xxxxxx, columns = c("xxxxxx", "xxxxxx"))
# This will get you the plot: GGally::ggscatmat(liar_tib, columns = c("creativity", "position")) # Now add a theme
# The plot with theme_minimal() applied GGally::ggscatmat(liar_tib, columns = c("creativity", "position")) + theme_minimal()
To get the Spearman correlation between creativity and position we can basically use correlation()
exactly as we did for the Pearson correlation, except we add [method = "spearman"]{.alt} to the function.
r robot()
Code exampleIn general, we could execute this code to get the Spearman correlation between variable_1 and variable_2 from the tibble called [my_tib]{.alt} (compare it with the code we used for Pearson's r)
my_tib |> dplyr::select(variable_1, variable_2) |> correlation::correlation(method = "spearman")
r alien()
Alien coding challengeUse the code example to obtain the Spearman correlation between the variables position and creativity from [liar_tib]{.alt}. Use kable()
to round the p-value to 3 decimal places and everything else to 2.
#Start with the tibble containing the data
liar_tib
# Now pipe it into the select() function liar_tib |> dplyr::select() # Now, put the variables names in select that you want to select
# Put the variables names in select that you want to select liar_tib |> dplyr::select(position, creativity) # Now pipe this into correlation()
liar_tib |> dplyr::select(position, creativity) |> correlation::correlation() # set method to Spearman
liar_tib |> dplyr::select(position, creativity) |> correlation::correlation(method = "spearman") # Use kable() to round the p-value to 3 decimal places and everything else to 2
liar_tib |> dplyr::select(position, creativity) |> correlation::correlation(method = "spearman") |> knitr::kable(digits = c(rep(2, 7), 3))
From the output we can see that the Spearman correlation coefficient between the two variables is $r_s$ = r round(spear$r[1], 2)
, with an associated p-value of r round(spear$p[1], 3)
and a sample size of r spear$n_Obs
. There was a significant negative relationship between creativity scores and how well someone did in the World's Biggest Liar competition: as creativity increased, position decreased. This might seem contrary to what we predicted until you remember that a low number means that you did well in the competition (a low number such as 1 means you came first, and a high number like 4 means you came fourth). Therefore, our hypothesis is supported: as creativity increased, so did success in the competition.
r bmu()
Kendall's tau [(1)]{.alt}Kendall's tau, denoted by $\tau$, is another non-parametric correlation and it should be used rather than Spearman's coefficient when you have a small data set with a large number of tied ranks. The correlation()
function will compute Kendall's $\tau$ by including [method = "kendall"]{.alt}.
r alien()
Alien coding challengeRepeat the previous exercise but replace [method = "spearman"]{.alt} with [method = "kendall"]{.alt}. Use kable()
to round the p-value to 4 decimal places and everything else to 3.
liar_tib |> dplyr::select(position, creativity) |> correlation::correlation(method = "kendall") |> knitr::kable(digits = c(rep(3, 7), 4))
The output shows $\tau$ = r round(tau$tau[1], 2)
, which is closer to zero than the Spearman correlation (it has decreased from r round(spear$r[1], 2)
to r round(tau$tau[1], 2)
). Kendall's value is likely a more accurate gauge of what the correlation in the population would be.
r rproj()
r rproj()
and r rstudio()
.r rstudio()
cheat sheets.r rstudio()
list of online resources.I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.