knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

#necessary to render tutorial correctly
library(learnr) 
library(htmltools)
#tidyverse
library(dplyr)
library(ggplot2)
#non tidyverse
library(correlation)
library(GGally)
library(knitr)

source("./www/discovr_helpers.R")

#Read dat files needed for the tutorial

exam_tib <- discovr::exam_anxiety
liar_tib <- discovr::biggest_liar
# Create bib file for R packages
here::here("inst/tutorials/discovr_07/packages.bib") |>
  knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats', 'tibble', 'ggplot2', 'knitr', "GGally", "correlation"), file = _)

discovr: Associations

Overview

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R. **Usage:** This tutorial accompanies [Discovering Statistics Using R and RStudio](https://www.discovr.rocks/) [@field_discovering_2023] by [Andy Field](https://en.wikipedia.org/wiki/Andy_Field_(academic)). It contains material from the book so there are some copyright considerations but I offer them under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Tl;dr: you can use this tutorial for teaching and non-profit activities but please don't meddle with it or claim it as your own work.

r cat_space(fill = blu) Welcome to the discovr space pirate academy

Hi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj(), and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.

On your way you will face many challenges, but follow Mae's system to keep yourself on track:

It's not just me that's here to help though, you will meet other characters along the way:

Also, use hints and solutions to guide you through the exercises (Figure 1).

Each codebox has a hints or solution button that activates a popup window containing code and text to guide you through each exercise.
Figure 1: In a code exercise click the hints button to guide you through the exercise.

By for now and good luck - you'll be amazing!

Workflow

Packages

This tutorial uses the following packages:

It also uses these tidyverse packages [@R-tidyverse; @tidyverse2019]: dplyr [@R-dplyr], forcats [@R-forcats], ggplot2 [@wickhamGgplot2ElegantGraphics2016], and readr [@R-readr].

Coding style

There are (broadly) two styles of coding:

  1. Explicit: Using this style you declare the package when using a function: package::function(). For example, if I want to use the mutate() function from the package dplyr, I will type dplyr::mutate(). If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).

  2. Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name), and then refer to functions without their package. For example, if I want to use the mutate() function from the package dplyr, I will use library(dplyr) in my first code chunk and type the function as mutate() when I use it subsequently.

Coding style is a personal choice. The Google r rproj() style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse because the dplyr and ggplot2 packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2:: before every function from ggplot2.

You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse package (and any others if you're using a concise style) at the beginning of your Quarto document:

library(tidyverse)

Data

To work outside of this tutorial you need to download the following data files:

Set up an r rstudio() project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:

liar_tib <- here::here("data/biggest_liar.csv") |> readr::read_csv()
exam_tib <- here::here("data/exam_anxiety.csv") |> readr::read_csv()

Preparing data

To work outside of this tutorial you need to turn categorical variables into factors and set an appropriate baseline category using forcats::as_factor and forcats::fct_relevel.

For [liar_tib]{.alt} execute the following code:

liar_tib <- liar_tib |> 
  dplyr::mutate(
    novice = forcats::as_factor(novice)
  )

For [exam_tib]{.alt} execute the following code:

exam_tib <- exam_tib |>
  dplyr::mutate(
    id = forcats::as_factor(id),
    sex = forcats::as_factor(sex)
  )

r bmu() Correlation process [(1)]{.alt}

Figure 2 shows a general procedure to follow when computing a bivariate correlation coefficient. First, check for sources of bias as outlined. The two most important ones in this context are linearity and normality. Remember that we're fitting linear model to the data, so if the relationship between variables is not linear then this model is invalid. To meet this requirement, the outcome variable needs to be measured at the interval or ratio level as does the predictor variable (one exception is that a predictor variable can be a categorical variable with only two categories). As far as normality is concerned, we care about this only if we want confidence intervals or significance tests and if the sample size is small.

If the data have outliers, are not normal implying a non-normal sampling distribution (and the sample is small) or your variables are measured at the ordinal level then you can use Spearman's rho or Kendall's tau, which are versions of the correlation coefficient applied to ranked data. Ranking the data reduces the impact of outliers but we lose information so, we can instead fit a robust variant such as the percentile bend correlation or Winsorized correlation. Furthermore, given that normality of the sampling distribution matters only for inferring significance and computing confidence intervals in small samples, we could use a bootstrap to compute the confidence interval in small samples, then we don't need to worry about this assumption.

See main text for description.
Figure 2: The general process for conducting correlation analysis.

r bmu() Visualizing the data [(1)]{.alt}

In a previous tutorial we looked at an example relating to exam anxiety: a psychologist was interested in the effects of exam stress and revision on exam performance. She had devised and validated a questionnaire to assess state anxiety relating to exams (called the Exam Anxiety Questionnaire, or EAQ). This scale produced a measure of anxiety scored out of 100. Anxiety was measured before an exam, and the percentage mark of each student on the exam was used to assess the exam performance. She also measured the number of hours spent revising.

r alien() Alien coding challenge

These data are preloaded in this tutorial in a tibble called [exam_tib]{.alt}. Use the code box to see these data.


exam_tib

Note there are five variables: the participant id, the hours spent revising (revise), their exam_grade, their exam anxiety, and their biological sex. We can visualise the data easily using the GGally package. When you want to plot continuous variables, the ggscatmat() function from this package produces a matrix of scatterplots (below the diagonal), distributions (along the diagonal) and the correlation coefficient (above the diagonal).

r robot() Code example

The ggscatmat() function takes the general form:

GGally::ggscatmat(my_tibble, columns = c("variable 1", " variable 2", " variable 3" …))

Basically, you feed in the name of the tibble containing the variables, and use the columns argument to name the variables that you want to plot.

r alien() Alien coding challenge

For example, to plot the variables called exam_grade, revise, and anxiety we replace ["variable 1"]{.alt} with ["exam_grade"]{.alt}, ["variable 2"]{.alt} with ["revise"]{.alt} and so on. It's as simple as that! Plot the scatterplot of the three variables.


GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety"))

r alien() Alien coding challenge

Like other plots we have done, we can also apply a ggplot2 theme in the usual way. Copy the code from the previous task and try applying theme_minimal()


GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety")) +
  theme_minimal()

The resulting plot shows that all of the variables are skewed. This skew could be a problem if we want to do significance tests or look at confidence intervals. The sample contains 103 observations, which is reasonably large, and possibly large enough for the central limit theorem to relieve of us of concerns about normality. We should consider using a robust method to compute the correlation coefficient itself.

r bmu() Pearson's correlation using r rproj() [(1)]{.alt}

The correlation package brings together lots of correlation-related stuff into a nice, easy-to-use bundle of loveliness. It has one workhorse function called … wait for it … correlation().

r robot() Code example

The correlation() function takes the general form:

correlation::correlation(tibble,
                         method = "pearson",
                         p_adjust = "holm",
                         ci = 0.95
                         )

I've listed the main arguments, but there are others. For now, we'll just look at the main arguments that we will use:

To use the function, take your tibble, pipe it into the select() function from dplyr to select the variables you want to correlate, then pipe that into the correlation function. You can use the same code structure whether you want to correlate two variables or produce all correlations between pairs of multiple variables.

r robot() Code example

To calculate the Pearson correlation between the variables exam_grade and revise in [exam_tib]{.alt}, we'd execute:

exam_tib |> 
  dplyr::select(exam_grade, revise) |> 
  correlation::correlation()

This code takes the [exam_tib]{.alt} and uses the select() function to select the variables exam_grade and revise. The result is that a tibble with these two variables in is fed into the correlation() function to compute the correlation between them.

r alien() Alien coding challenge

Calculate the Pearson correlation between the variables exam_grade and anxiety. Use kable() to round the output to 3 decimal places.


exam_tib |> 
  dplyr::select(exam_grade, anxiety) |> 
  correlation::correlation() 
# add kable() to round to 3 decimal places
exam_tib |> 
  dplyr::select(exam_grade, anxiety) |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)

r robot() Code example

This approach scales up nicely. For example, if we now also want to see the correlations between all three variables we need only extend the list of variables within the select() function. Adapt the code to compute the correlations between exam grade, revision and anxiety.

exam_tib |> 
  dplyr::select() |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)
exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)
quiz(
  question(sprintf("The confidence interval for the association between exam grade and revision is 0.22 to 0.55. What does this tell us?"),
    answer("If this confidence interval is one of the 95% that contains the population value then the population value of *r* lies between 0.22 and 0.55.", correct = TRUE),
    answer("There is a 95% chance that the population value of *r* lies between 0.22 and 0.55", message = "You cannot make probability statements from a confidence interval. We don't know whether this particular CI is one of the 95% that contains the population value of *r*."),
    answer("The probability of this confidence interval containing the population value is 0.95.", message = "The probability of this confidence interval containing the population value is either 0 (it doesn't) or 1 (it does) but it's impossible to know which."),
    answer("I can be 95% confident that the population value of *r* lies between 0.22 and 0.55", message = "Confidence intervals do not quantify your subjective confidence."),
    correct = "Correct - well done!",
    random_answer_order = TRUE,
    allow_retry = T
  ),
  question("The *p*-value for the association between exam grade and revision is < 0.001, what does this value mean?",
    answer("The probability of getting a value of *t* at least as big as the value we have observed, if the value of *r* were, in fact, zero is less than 0.001. I'm going to assume, therefore, that the association between exam grade and revision is not zero.", correct = T),
    answer("The probability that *r* = 0.40 is a chance result is less than 0.001", message = "*p*-values do not tell us whether results occur by chance."),
    answer("The probability that exam grade and revision are not assciated is 0.001", message = "*p*-values do not tell us about the probability of the null hypothesis"),
    answer("I got a different *p*-value to 0.0o1", message = "You may have done the analysis incorrectly. Try again!"),
    correct = "Correct - well done!",
    random_answer_order = TRUE,
    allow_retry = T
  )
)
pearson <- exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation()

rrob <- exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation(method = "percentage")

The resulting output tells us that exam grade has a correlation with revision of r = r round(pearson$r[1], 2), and a similar strength relationship with exam anxiety, r = r round(pearson$r[2], 2) (but in the opposite direction). As revision increases so does exam performance, but as anxiety increase exam performance decreases. We also see that revision has a very strong negative relationship with exam anxiety, r = r round(pearson$r[3], 2). The more you revise the less anxiety you have about the exam. All p-values are less than 0.001 (and, so also less than 0.05) and would be interpreted as the correlation coefficients being significantly different from zero. The significance values tell us the probability of getting correlation coefficients at least this big as these in a sample of 103 people if the null hypothesis were true (there was no relationship between these variables) is very low (close to zero in fact). If we're prepared to assume that the sample is one of the 95% of samples that will produce a confidence interval containing the population value, then the confidence intervals tell us about the uncertainty around the value of r. For example, under this assumption, the population value of the association between exam grade and revision falls between r round(pearson$CI_low[1], 2) and r round(pearson$CI_high[1], 2).

`r cat_space()` **Tip: Rounding** Remember that we can use `knitr::kable()` to control the number of decimal places that are printed in our rendered document. For example, up to now we have printed to three decimal places using wzxhzdk:22 If we want different columns to contain different numbers of decimal places we can provide a vector of values to [digits]{.alt} with each value corresponding to a column of the table. For example, the output from `correlation()` has 11 columns. The *p*-value is in column 9. Let's say we want everything rounded to a maximum of 2 decimal places except for the *p*-value, which we want rounded to a maximum of 8 decimal places. We can achieve this using wzxhzdk:23 Note that within `c()` we list the number of decimal places for each column, so we have a series of eight 2s, which sets the first eight columns to 2 decimal places, and then an 8, which sets the decimal places for column 9. We don't need to worry about columns 10 and 11 because we don't need to round them (column 10 contains text and column 11 a whole number). In fact, it's tiresome to type so many values of 2, so we could use the `rep()` function, which takes this form wzxhzdk:24 So, `rep(2, 8)` means *repeat the value 2 eight times* and returns `2, 2, 2, 2, 2, 2, 2`. We can use this function within `c()` as follows wzxhzdk:25 As before, `c()` will contain eight 2s followed by an 8. Try this code in the following section.

r user_astronaut() Robust correlation coefficients [(3)]{.alt}

Given the skew in some of the variables, we are probably better off estimating a robust correlation coefficient, like the percentage bend correlation coefficient [@Wilcox_1994].

r alien() Alien coding challenge

We can obtain the percentage bend correlation coefficient by setting [method = "percentage"]{.alt} within the correlation(). The code that we just used is reproduced below. Adapt it to obtain the percentage bend correlation coefficient.

exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation()
# Include method = "percentage" in correlation()
exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation(method = "percentage")
# Use kable() to round the first 8 columns to 2 dp and the 9th column to 8 dp
exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation(method = "percentage") |> 
  knitr::kable(digits = c(rep(2, 8), 8))

Compare the values of these robust correlations to the raw Pearson correlations you obtained earlier. Exam grade had a Pearson correlation with revision of r = r round(pearson$r[1], 2), but the percentage bend correlation is smaller (r = r round(rrob$r[1], 2)). Similarly the robust versions of the other correlations are smaller than the raw Pearson correlations (although they all remain significant with ps < 0.05).

r bmu() Spearman's Correlation Coefficient [(1)]{.alt}

Spearman correlations are very easy once you've mastered correlation(). Let's switch example. Imagine I wanted to test a theory that more creative people will be better liars. I gathered together 68 past contestants from the World's Biggest Liar competition (held annually at the Santon Bridge Inn in Wasdale in case you're interested) and noted where they were placed in the competition (first, second, third, etc.); I also gave them a creativity questionnaire (maximum score 60). The position in the competition is an ordinal variable because the places are categories but have a meaningful order (first place is better than second place and so on). Therefore, Spearman's correlation coefficient should be used (Pearson's r requires interval or ratio data). The data for this study are preloaded in this tutorial in a tibble called [liar_tib]{.alt}. If you're working outside of this tutorial see the instructions at the beginning for loading the data.

spear <- liar_tib |>
  dplyr::select(position, creativity) |> 
  correlation::correlation(method = "spearman")

tau <- liar_tib |>
  dplyr::select(position, creativity) |> 
  correlation::correlation(method = "kendall")

r alien() Alien coding challenge

Look at the tibble by executing its name:


liar_tib

The main variables are in two columns: one labelled creativity and one labelled position, but there's also a variable id containing the participant's id and a categorical variable called novice that specifies whether the participant was entering the contest for the first time or was a previous entrant. Note that the variable position is numeric (the position in the competition is represented with numbers).

r alien() Alien coding challenge

Plot creativity against position using ggscatmat()


# Use the GGally::ggscatmat() function (replace the xs)

GGally::ggscatmat(xxxxxx, columns = c("xxxxxx", "xxxxxx"))
# This will get you the plot:
GGally::ggscatmat(liar_tib, columns = c("creativity", "position"))

# Now add a theme
# The plot with theme_minimal() applied

GGally::ggscatmat(liar_tib, columns = c("creativity", "position")) +
  theme_minimal()

To get the Spearman correlation between creativity and position we can basically use correlation() exactly as we did for the Pearson correlation, except we add [method = "spearman"]{.alt} to the function.

r robot() Code example

In general, we could execute this code to get the Spearman correlation between variable_1 and variable_2 from the tibble called [my_tib]{.alt} (compare it with the code we used for Pearson's r)

my_tib |>
  dplyr::select(variable_1, variable_2) |> 
  correlation::correlation(method = "spearman")

r alien() Alien coding challenge

Use the code example to obtain the Spearman correlation between the variables position and creativity from [liar_tib]{.alt}. Use kable() to round the p-value to 3 decimal places and everything else to 2.

`r cat_space()` **Tip: More rounding** For Spearman and Kendall's tau the output is slightly different to Pearson. The *p*-value is in column 8, rather than column 9. Therefore, to round the *p*-value to a different number of decimal places to other values we'd use wzxhzdk:38 replacing [dp_general]{.alt} with the number of decimal places for the first 7 columns, and replacing [dp_for_p]{.alt} with the number of decimal places for the column of *p*-values. (Note within `rep()` we ask for 7 repetitions rather than 8.) For example, to round the *p*-value to 8 decimal places and everything else to 2 we'd use wzxhzdk:39

#Start with the tibble containing the data
liar_tib
# Now pipe it into the select() function
liar_tib |>
  dplyr::select()
# Now, put the variables names in select that you want to select
# Put the variables names in select that you want to select
liar_tib |>
  dplyr::select(position, creativity)

# Now pipe this into correlation()
liar_tib |>
  dplyr::select(position, creativity) |> 
  correlation::correlation()
# set method to Spearman
liar_tib |>
  dplyr::select(position, creativity) |> 
  correlation::correlation(method = "spearman")
# Use kable() to round the p-value to 3 decimal places and everything else to 2
liar_tib |>
  dplyr::select(position, creativity) |> 
  correlation::correlation(method = "spearman") |> 
  knitr::kable(digits = c(rep(2, 7), 3))

From the output we can see that the Spearman correlation coefficient between the two variables is $r_s$ = r round(spear$r[1], 2), with an associated p-value of r round(spear$p[1], 3) and a sample size of r spear$n_Obs. There was a significant negative relationship between creativity scores and how well someone did in the World's Biggest Liar competition: as creativity increased, position decreased. This might seem contrary to what we predicted until you remember that a low number means that you did well in the competition (a low number such as 1 means you came first, and a high number like 4 means you came fourth). Therefore, our hypothesis is supported: as creativity increased, so did success in the competition.

r bmu() Kendall's tau [(1)]{.alt}

Kendall's tau, denoted by $\tau$, is another non-parametric correlation and it should be used rather than Spearman's coefficient when you have a small data set with a large number of tied ranks. The correlation() function will compute Kendall's $\tau$ by including [method = "kendall"]{.alt}.

r alien() Alien coding challenge

Repeat the previous exercise but replace [method = "spearman"]{.alt} with [method = "kendall"]{.alt}. Use kable() to round the p-value to 4 decimal places and everything else to 3.


liar_tib |> 
  dplyr::select(position, creativity) |> 
  correlation::correlation(method = "kendall") |> 
  knitr::kable(digits = c(rep(3, 7), 4))

The output shows $\tau$ = r round(tau$tau[1], 2), which is closer to zero than the Spearman correlation (it has decreased from r round(spear$r[1], 2) to r round(tau$tau[1], 2)). Kendall's value is likely a more accurate gauge of what the correlation in the population would be.

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R. **A message from Mae Jemstone:** Once when I was travelling through the constellation of *dragonis murralis* on our mission to improve statistical literacy, we crunched some data and discovered that high anxiety is often associated with poorer performance in Mathematical subjects. It's not that anxious people are less able, but their anxiety is a barrier that makes them feel that they can't do things, even though they can. Associations like this are helpful to get us thinking about relationships between variables, which is the first step towards predicting variables! You've taken that first step towards understanding how we use measured variables to predict what might happen. Well done! And if you're one of those anxious people, you are not alone, and you *can* do this. Think about ways that you can feel the anxious feelings but not let them disrupt your thinking: when you feel the anxiety take some controlled, deep breaths for a minute. Be mindful. Believe in yourself, because I do.

Resources {data-progressive=FALSE}

Statistics

r rproj()

Acknowledgement

I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).

References



profandyfield/discovr documentation built on May 4, 2024, 4:32 p.m.