knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

#necessary to render tutorial correctly
library(learnr) 
library(htmltools)
#tidyverse
library(dplyr)
library(ggplot2)
#non tidyverse
library(Hmisc)
library(knitr)

source("./www/discovr_helpers.R")

#Read data files needed for the tutorial

wish_tib <- discovr::jiminy_cricket
notebook_tib <- discovr::notebook
exam_tib <- discovr::exam_anxiety
# Create bib file for R packages
here::here("inst/tutorials/discovr_05/packages.bib") |>
  knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats'), file = _)

discovr: Visualizing data

Overview

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R.

**Usage:** This tutorial accompanies [Discovering Statistics Using R and RStudio](https://www.discovr.rocks/) [@field_discovering_2023] by [Andy Field](https://en.wikipedia.org/wiki/Andy_Field_(academic)). It contains material from the book so there are some copyright considerations but I offer them under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Tl;dr: you can use this tutorial for teaching and non-profit activities but please don't meddle with it or claim it as your own work.

r cat_space(fill = blu) Welcome to the discovr space pirate academy

Hi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj(), and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.

On your way you will face many challenges, but follow Mae's system to keep yourself on track:

It's not just me that's here to help though, you will meet other characters along the way:

Also, use hints and solutions to guide you through the exercises (Figure 1).

Each codebox has a hints or solution button that activates a popup window containing code and text to guide you through each exercise.
Figure 1: In a code exercise click the hints button to guide you through the exercise.

Bye for now and good luck - you'll be amazing!

Workflow

Packages

This tutorial uses the following packages:

It also uses these tidyverse packages [@R-tidyverse; @tidyverse2019]: readr [@R-readr], dplyr [@R-dplyr], forcats [@R-forcats] and ggplot2 [@wickhamGgplot2ElegantGraphics2016].

Coding style

There are (broadly) two styles of coding:

  1. Explicit: Using this style you declare the package when using a function: package::function(). For example, if I want to use the mutate() function from the package dplyr, I will type dplyr::mutate(). If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).

  2. Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name), and then refer to functions without their package. For example, if I want to use the mutate() function from the package dplyr, I will use library(dplyr) in my first code chunk and type the function as mutate() when I use it subsequently.

Coding style is a personal choice. The Google r rproj() style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse because the dplyr and ggplot2 packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2:: before every function from ggplot2.

You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse package (and any others if you're using a concise style) at the beginning of your Quarto document:

library(tidyverse)
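
To make the two styles concrete, here is a minimal sketch. The tibble [toy_tib]{.alt} and its variables are made up purely for illustration (they are not part of the tutorial data). Both styles should produce the same result; the only difference is how the function is called.

```r
# Hypothetical toy data, used only for illustration
toy_tib <- tibble::tibble(id = 1:3, score = c(10, 20, 30))

# Explicit style: name the package every time you use one of its functions
explicit <- dplyr::mutate(toy_tib, double_score = score * 2)

# Concise style: load the package once, then drop the prefix
library(dplyr)
concise <- mutate(toy_tib, double_score = score * 2)

# Check that both styles give the same tibble
identical(explicit, concise)
```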

Data

To work outside of this tutorial you need to download the following data files:

Set up an r rstudio() project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:

wish_tib <- here::here("data/jiminy_cricket.csv") |> readr::read_csv()
notebook_tib <- here::here("data/notebook.csv") |> readr::read_csv()
exam_tib <- here::here("data/exam_anxiety.csv") |> readr::read_csv()

Preparing data

To work outside of this tutorial you need to turn categorical variables into factors and set an appropriate baseline category using forcats::as_factor() and forcats::fct_relevel().

For the [wish_tib]{.alt} execute the following code:

wish_tib <- wish_tib |>
  dplyr::mutate(
    strategy = forcats::as_factor(strategy),
    time = forcats::as_factor(time) |> forcats::fct_relevel("Baseline")
  )

For [notebook_tib]{.alt} execute the following code:

notebook_tib <- notebook_tib |>
  dplyr::mutate(
    sex = forcats::as_factor(sex),
    film = forcats::as_factor(film)
  )

For [exam_tib]{.alt} execute the following code:

exam_tib <- exam_tib |>
  dplyr::mutate(
    id = forcats::as_factor(id),
    sex = forcats::as_factor(sex)
  )
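
If you're wondering what these two functions actually do, the sketch below might help. It uses a made-up character vector ([time_raw]{.alt} is hypothetical, not the tutorial data). When given text, as_factor() creates factor levels in the order in which the values first appear, and fct_relevel() moves the level you name to the front so that it becomes the baseline category.

```r
# Hypothetical character data in which "5 years" happens to appear first
time_raw <- c("5 years", "Baseline", "5 years", "Baseline")

# as_factor() orders levels by first appearance
time_fct <- forcats::as_factor(time_raw)
levels(time_fct)

# fct_relevel() moves "Baseline" to be the first level
time_fct <- forcats::fct_relevel(time_fct, "Baseline")
levels(time_fct)
```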

r bmu() ggplot2 [(1)]{.alt}

The most versatile package for producing plots in r rproj() is ggplot2, which automatically installs as part of the tidyverse package. Figure 2 shows how ggplot2 works. You begin with some data and you initialize a plot with the ggplot() function, within which you name the tibble or data frame that you want to use. You then set a bunch of aesthetics using the aes() function. Primarily, you name the variable you want plotted on the x-axis, the variable for the y-axis and any aesthetics that you want to set for the plot using a variable (for example, you might want to vary the colour of bars by levels of a variable). You then add layers to the plot that control what the plot shows and perhaps adjust the visual properties of the objects on the layer. For example, you might add a layer of dots to show group means, change their appearance to be filled with different colours, then add a layer of error bars on top of them. There are various key concepts that relate to controlling aspects of the layers of the plot:

Each of the things above is a layer (like a transparency) that can be added to a plot. There are also aesthetics, which control what the things on a layer look like (in other words, their visual properties). Examples of aesthetics are the fill colour of points and bars, line colours (of linear models, error bars, lines around bars etc.), the shape of data points, the size of data points, the type of line (full, dashed, dotted etc.). These aesthetics can be set directly for an object (e.g., making all data points red) or can be set using a variable (e.g., colouring data points based on whether they came from an experimental or control group).
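
Here is a minimal sketch of the difference between setting an aesthetic directly and setting it using a variable. The data frame [toy_tib]{.alt} and its variables are invented for illustration only. In the first plot the colour is set outside of aes(), so every point is red; in the second the colour is placed inside aes(), so it varies by group.

```r
library(ggplot2)

# Hypothetical data: two groups with made-up scores
toy_tib <- data.frame(
  group = c("control", "control", "experimental", "experimental"),
  x = c(1, 2, 3, 4),
  y = c(2, 4, 3, 5)
)

# Aesthetic set directly: all data points are red
set_plot <- ggplot(toy_tib, aes(x, y)) +
  geom_point(colour = "red", size = 3)

# Aesthetic set using a variable: colour depends on the group
mapped_plot <- ggplot(toy_tib, aes(x, y, colour = group)) +
  geom_point(size = 3)

set_plot
mapped_plot
```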

This is a lot to take in, so consider this a reference point (rather than expecting to remember all of the above). We'll get a feel for ggplot2 by doing examples. You may also find the official reference guide and, of course, my book chapter helpful.

See main text for description.
Figure 2: A ggplot is made up of layers.

r bmu() Boxplots (aka Box-Whisker plots) [(1)]{.alt}

Dreams are good, but a completely blinkered view that they'll come true without any work on your part is not. Imagine I collected some data from 250 people on their level of success using a composite measure involving their salary, quality of life and how closely their life matches their aspirations. This gave me a score from 0 (complete failure) to 100 (complete success). I then implemented an intervention: I told people that for the next 5 years they should either wish upon a star for their dreams to come true or work as hard as they could to make their dreams come true. I measured their success again 5 years later. People were randomly allocated to these two instructions. The data are in [wish_tib]{.alt}. The variables are id (the person's id), strategy (hard work or wishing upon a star), time (baseline or 5 years), and success (the rating on my dodgy scale).

First, we're going to create a boxplot of the success scores at baseline and after 5 years. To create a boxplot in ggplot we use the geom_boxplot() function. We've seen that the general setup of a plot uses this command:

ggplot2::ggplot(my_tib, aes(variable_for_x_axis, variable_for_y_axis))

Within the ggplot() function replace [my_tib]{.alt} with the name of the tibble containing the data you want to plot, and within the aes() function replace [variable_for_x_axis]{.alt} with the name of the variable to be plotted on the x-axis (horizontal), and replace [variable_for_y_axis]{.alt} with the name of the variable to be plotted on the y-axis (vertical).

r robot() Code example

We could set up the plot with this command:

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot()

Let's break down this command:

Job done.

r alien() Alien coding challenge

Remember from discovr_02 that we can make the plot nicer by using labs() to add labels to the x- and y-axes, and by applying a theme such as theme_minimal(). We literally add these layers using the + symbol. Use the code box to label the x-axis as Time and y as Success (%), and apply a minimal theme.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot()
# To add axis labels include
+ labs(x = "label", y = "label")
# To add theme_xxxxx() include
+ theme_xxxxx()
#Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()

Note that the axes have new labels, and a different theme has been applied (for example, the grey background is gone).

The boxplot shows that success increased (very slightly) after 5 years (the median, shown by the horizontal line within the box, is higher) but the spread of scores has also increased (the whiskers are longer at 5 years than at baseline).

r bmu() Grouping by colour [(1)]{.alt}

The boxplot we have created shows how success changed over time, but it doesn't show us what effect wishing on a star had compared to hard work. We can see this by splitting the data by the variable strategy. We can do this in several ways. First, we can ask ggplot to vary the [fill]{.alt} of the boxes or the [colour]{.alt} of the lines around the boxes by the variable strategy by adding it to the aes() function in the original command to set up the plot. For example, to vary the fill of the boxplots by strategy, we'd change the first line of our command to be:

r robot() Code example

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, fill = strategy))

Note that all I have done is to add [fill = strategy]{.alt} to the initial aesthetic. The rest of the command stays the same.

r alien() Alien coding challenge

Your original code is reproduced below. Adapt it to include [fill = strategy]{.alt} and run it. Compare the plot to the previous version. Note that the plot still splits the data by time along the x-axis, but within each category the data from the wishing on a star group is shown in a different colour to the data from the hard work group.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, fill = strategy))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()

We can see that success only increases after 5 years in the hard work group (but the spread of success scores is huge too at 5 years in that group).

Instead of using [fill]{.alt} to differentiate the two strategy groups, we can use [colour]{.alt}. This leaves the boxes white for all groups, but uses different colours for the lines around the boxes.

r robot() Code example

Like with [fill]{.alt}, we adapt the first line of code, but this time to include [colour = strategy]{.alt}:

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))

r alien() Alien coding challenge

Add [colour = strategy]{.alt} to the code below and see what happens when you run it.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()

This is great but the legend for the variable strategy has a lower case 's' and isn't very informative. It'd be nice if it said 'Success strategy'. Currently we have specified labels for the x- and y-axis by including:

labs(x = "Time", y = "Success (%)")

To specify the label for the variable that is used to determine the fill or colour of the plot, we add it to the labs() function. For example, if we used strategy to determine the fill of the plot then we'd add [fill = "label"]{.alt}, where label is the text we want to use:

r robot() Code example

labs(x = "Time", y = "Success (%)", fill = "Success strategy")

Similarly, if we had used strategy to determine the colour of the plot then we'd add [colour = "label"]{.alt} to the function

labs(x = "Time", y = "Success (%)", colour = "Success strategy")

r alien() Alien coding challenge

The code to create a boxplot that uses [fill]{.alt} to differentiate the two success strategies is copied below. Edit the code, using what you've just learnt, to change the label for the [fill]{.alt} property to be Success strategy. Run the code and see how the legend changes.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, fill = strategy))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)", fill = "Success strategy") +
  theme_minimal()
`r bug()` **De-bug: don't forget `+`** A common cause of error messages when using `ggplot()` is forgetting to put a `+` at the end of each line (except the last). If you get an error message check that each line that builds up a plot has a `+` at the end of it (i.e. each function is separated by `+`). I make this mistake *all* the time!

r bmu() Grouping using facet_wrap() [(1)]{.alt}

A second way to split the data is to add a facet layer, for example, by adding facet_wrap() to the plot. This function takes the general form:

facet_wrap(facet, nrow = NULL, ncol = NULL, scales = "fixed")

There are other arguments, but these are the main ones:

r alien() Alien coding challenge

The box below displays the code that you used above to generate a boxplot of success scores over time. Add the line facet_wrap(~strategy) to the command (above the bottom line that applies the theme), then execute the code to see what happens.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  facet_wrap(~strategy) +
  theme_minimal()

Note that the data from the wish upon a star and hard work groups are now displayed in separate panels.

r alien() Alien coding challenge

Now edit facet_wrap() to be facet_wrap(~strategy, ncol = 1), rerun the code and see what happens. The plots should now be stacked vertically instead of being side by side.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  geom_boxplot() +  
  labs(x = "Time", y = "Success (%)") +
  facet_wrap(~strategy, ncol = 1) +
  theme_minimal()

r bmu() Plotting means [(1)]{.alt}

Plotting means is slightly more tricky. If you want to plot from the raw data (rather than a tibble containing the summary information) then your best bet is to use the stat_summary() function and then specify the geom to use within it. Let's begin by plotting the mean success split by time. We can do this by setting up the plot exactly as we did for the boxplot, but instead of using geom_boxplot() we use:

stat_summary(fun = "mean", geom = "point", size = 4)

In the stat_summary() function, we're asking r rproj() to calculate the means ([fun = "mean"]{.alt}). The argument [geom = "point"]{.alt} asks ggplot2 to display the means as dots using geom_point(). The final argument, [size = 4]{.alt}, determines the size of the dots and overrides the default (you can omit this argument if you like).
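
If you want to convince yourself that stat_summary() really is plotting the group means, the sketch below compares the values that ggplot2 computes with means calculated by hand. The data frame [toy_tib]{.alt} is made up for illustration, and I'm assuming that layer_data() from ggplot2 returns the summarized values for the layer in a column called [y]{.alt}.

```r
library(ggplot2)

# Hypothetical data: three made-up success scores at each time point
toy_tib <- data.frame(
  time = factor(
    rep(c("Baseline", "5 years"), each = 3),
    levels = c("Baseline", "5 years")
  ),
  success = c(40, 50, 60, 55, 65, 75)
)

mean_plot <- ggplot(toy_tib, aes(time, success)) +
  stat_summary(fun = "mean", geom = "point", size = 4)

# The y-values that ggplot2 computed for the layer (one per time point)
plotted_means <- layer_data(mean_plot)$y

# The same means calculated by hand
manual_means <- tapply(toy_tib$success, toy_tib$time, mean)

plotted_means
manual_means
```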

r robot() Code example

The full code is below. Note that the only thing that has changed from the code we used for a boxplot, is that we have replaced geom_boxplot() with stat_summary(fun = "mean", geom = "point", size = 4).

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()

r bmu() Adjusting the scales [(1)]{.alt}

The plot we've just produced is all well and good, but ggplot has scaled the y-axis from 50 to 58 and has displayed breaks at the values 50, 52, 54, and 56. This maximizes the differences between means - the small difference looks huge. We shouldn't do this. There are two functions that we can use to add layers that control the scale of the axes.

coord_cartesian()

coord_cartesian(ylim = c(lower_limit, upper_limit), xlim = c(lower_limit, upper_limit))

This code adjusts the y-axis and x-axis to display values from [lower_limit]{.alt} to [upper_limit]{.alt}. You would replace each [lower_limit]{.alt} and [upper_limit]{.alt} with relevant numbers. We want to change only the y-axis so we'll ignore [xlim]{.alt} for now. If we want our y-axis to display values from 0 to 100 (the full range of the scale) we would add to the plot:

coord_cartesian(ylim = c(0, 100))

scale_y_continuous()

scale_y_continuous(breaks = seq(lower_limit, upper_limit, increment))

I've used the function seq() which takes the form

seq(lower_limit, upper_limit, increment)

where [lower_limit]{.alt} is the value you want to start at, [upper_limit]{.alt} is the value you want to stop at, and [increment]{.alt} is the size of the increment you want. For example, if we wanted breaks to be displayed at 0, 10, 20, 30 and so on up to 100, we'd specify seq(0, 100, 10) which will create a sequence from 0 to 100 in intervals of 10. There is a similar function scale_x_continuous() for changing the x-axis.
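
You can run seq() on its own to check the breaks it will create before you put it inside scale_y_continuous(). A quick sketch (the object name [breaks]{.alt} is just for illustration):

```r
# Sequence from 0 to 100 in steps of 10
breaks <- seq(0, 100, 10)
breaks

# Count how many breaks this creates
length(breaks)
```

The sequence includes both the lower and upper limit, so you get 11 values here rather than 10.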

r robot() Code example

For now, we're adjusting only the y-axis. If we want it to show values from 0 to 100 and display labels on every value of 10, we would add these lines to the plot:

coord_cartesian(ylim = c(0, 100)) +
scale_y_continuous(breaks = seq(0, 100, 10)) +

r alien() Alien coding challenge

Try adding these two lines of code to the previous code (above the bottom line that applies the theme) that we used to plot the means. Compare the resulting plot with the previous one.

`r cat_space()` **Tip: Apply themes last** It's good practice to apply themes last (i.e. have the theme function as the final line of the command) because `ggplot2` adds each layer in order. If the theme is the last line it will be applied to the entire plot.
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  theme_minimal()
# Add coord_cartesian() first. Put it above theme_minimal() so the theme is applied last
# don't forget the + sign between coord_cartesian() and theme_minimal()

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 100)) +
  theme_minimal()
# Now add scale_y_continuous(). Again, put it above theme_minimal() so the theme is applied last
# don't forget the + sign between scale_y_continuous() and theme_minimal()

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

r bmu() Grouping means [(1)]{.alt}

Just like with boxplots, we can group means by success strategy using the same methods. For example, we can add facet_wrap(~strategy) to display the two strategies as different panels.

r alien() Alien coding challenge {#facet_wish}

Below is the code we have built up so far. Add facet_wrap(~strategy) + as a new line directly above the last line (the one that applies the theme).

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  facet_wrap(~strategy) +
  theme_minimal()

Instead of using facets, we can display the two strategies in different colours, like we did for boxplots. To do this we need to make the same two adjustments to our code that we made earlier on:

r alien() Alien coding challenge

Execute the code below, then make the two adjustments above and execute it again to see the difference.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Add `colour = strategy` to the first line, within `aes()` This line should read:

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
# Add `colour = "Success strategy"` to the `labs()` function to apply 
# a meaningful label to the variable **strategy**. This line will read:

labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
# Solution:

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

There is a problem though: the dots at baseline overlap.

r bmu() Adjusting the position of geoms [(1)]{.alt}

We can avoid the problem of dots overlapping by adjusting their horizontal position. The stat_summary() function (like most geoms) has a [position]{.alt} argument that can be set using the function [position_dodge(width = value)]{.alt}. This function plots geoms so that they 'dodge' each other on the horizontal plane. You have to replace [value]{.alt} with a number that sets the size of the 'dodge'. Play around with values until it looks good; 0.9 works well for this plot.

r robot() Code example

To set the position of the dots, we need to adjust stat_summary() from:

stat_summary(fun = "mean", geom = "point", size = 4)

to:

stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9))

r alien() Alien coding challenge

Execute the code below, then add [position = position_dodge(width = 0.9)]{.alt} to stat_summary() and run the code again. Note that the dots no longer overlap.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

r bmu() Violin plots [(1)]{.alt}

As well as plotting the mean success score across the various times and groups, it's also useful to plot the distribution of scores around that mean. We can do that using a violin plot. We can add a 'violin' using the geom_violin() function. Let's add a 'violin' to our previous plot. The box below shows the code we have built up so far. Run this code if you want to remind yourself of what the plot looks like.

r robot() Code example

To add the distribution of scores to the plot, simply add the line:

geom_violin() +

r user_visor() Exploring layers [(2)]{.alt}

This is a good opportunity to remind you that each line of the command adds a layer to the plot in the order you specify them. This optional section might help you to understand how layering works in ggplot2.

r alien() Alien coding challenge

Add the line geom_violin() + directly below the line that specifies stat_summary().

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  geom_violin() +
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

r alien() Alien coding challenge

Now add the line geom_violin() + directly above the line that specifies stat_summary()

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +  
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

You should find that in the first plot the dots showing the means disappear. This is because the violin geom is filled white (the space between the lines isn't transparent). Because we specify geom_violin() after stat_summary(), the violins are layered on top of the dots showing the means and hide them. In the second plot, because we specify geom_violin() before stat_summary(), the dots are layered on top of the violins, so we can see them.
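
If you'd like to see the layer order without looking at the plot, here is a rough sketch that lists the geoms in the order they were added. It uses made-up data ([toy_tib]{.alt}) and a helper function that I've called layer_order(), which is not part of ggplot2. It also assumes that a plot object stores its layers in an element called [layers]{.alt}, which is an internal detail of ggplot2 that could change between versions.

```r
library(ggplot2)

# Hypothetical data: two groups with made-up scores
toy_tib <- data.frame(
  group = rep(c("a", "b"), each = 5),
  score = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

base_plot <- ggplot(toy_tib, aes(group, score))

# Violins added after the means (violins drawn on top)
violin_last <- base_plot +
  stat_summary(fun = "mean", geom = "point") +
  geom_violin()

# Violins added before the means (means drawn on top)
violin_first <- base_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point")

# Hypothetical helper: name of the geom used by each layer, bottom layer first
layer_order <- function(plot) {
  vapply(plot$layers, function(layer) class(layer$geom)[1], character(1))
}

layer_order(violin_last)
layer_order(violin_first)
```

The last geom listed is the one drawn on top of everything before it.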

r alien() Alien coding challenge

To really drum this point home, look at the code below (which mirrors task 1 above). Note that within geom_violin() I have included [alpha = 1]{.alt}. This argument sets the transparency of the geom, and the default is 1. Run this code and note that it does exactly the same thing as the code for the first task above. The dots are concealed because we have specified geom_violin() after stat_summary(). Now change [alpha = 1]{.alt} to [alpha = 0.9]{.alt}. This makes the violins very slightly transparent. You should now see the dots behind the violins. Try running the code with values of alpha of 0.8, 0.6, 0.2 and 0 (fully transparent). As the violins get more transparent, the dots behind become more visible.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  geom_violin(alpha = 1) +
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

r user_visor() Plotting confidence intervals [(2)]{.alt}

The mean in the sample is an estimate, and estimates have uncertainty attached to them. It's a really good idea to include an indicator of this uncertainty on a plot. Typically, this is done by adding error bars to the means that show the 95% confidence interval. When we plotted a mean we added this layer to our plot:

stat_summary(fun = "mean", geom = "point")

Basically we set the data to plot to be the function that returns the mean value ([fun = "mean"]{.adj}), and the geom to be a point ([geom = "point"]{.adj}). If we want to plot the 95% confidence interval around the mean both of these things change. The number of data points changes because for every mean we now want to plot three data points (the mean and the upper and lower limit of the corresponding confidence interval) instead of one (the mean). The geom changes because we can't plot three values using a single point.

To change the number of data points we use [fun.data]{.adj} instead of [fun]{.adj}, and instead of specifying [mean]{.adj} we specify [mean_cl_normal]{.adj} for a normal confidence interval or [mean_cl_boot]{.adj} for a robust confidence interval based on a bootstrap. We change the geom to [geom = "pointrange"]{.adj} which is a geom that shows a point with a line through it representing a range (in this case, the limits of the confidence interval).

These two adjustments are made within stat_summary():

r robot() Code example

stat_summary(fun.data = "mean_cl_normal", geom = "pointrange")
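
To get a feel for the three values that the pointrange displays, here is a sketch that calculates a mean and a 95% confidence interval by hand. The scores are made up, and I'm assuming the usual interval based on the *t*-distribution, which (as far as I know) is what [mean_cl_normal]{.adj} uses. The object names are just for illustration.

```r
# Hypothetical success scores for one group
scores <- c(52, 61, 47, 58, 66, 49, 55, 63)

n <- length(scores)
m <- mean(scores)
se <- sd(scores) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # critical t for a 95% interval

# Lower limit, mean and upper limit: the three values plotted for each mean
ci <- c(ymin = m - t_crit * se, y = m, ymax = m + t_crit * se)
ci

# Cross-check the limits against the interval from a one-sample t-test
t.test(scores)$conf.int
```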

r alien() Alien coding challenge

Below is a copy of the code used to create the last plot. Adapt it to add a 95% confidence interval to the means.

`r cat_space()` **Tip: Size** I would delete [size = 4]{.adj} because `ggplot2` applies the [size]{.adj} attribute to both the dot and the bar of the [pointrange]{.adj} geom and, in this situation, makes it look silly. However, there may be situations where you want to adjust the size of both the point and the line, in which case it would be appropriate to include the [size]{.adj} argument.
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  geom_violin() +
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()
# Solution:
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success, colour = strategy))
wish_plot +
  geom_violin() +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  labs(x = "Time", y = "Success (%)", colour = "Success strategy") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  theme_minimal()

r alien() Alien coding challenge

Below is a copy of the code used to create a plot from earlier that grouped means using facet_wrap(). Adapt it to add a 95% bootstrap confidence interval to the means.

wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun = "mean", geom = "point", size = 4) +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 70)) +
  scale_y_continuous(breaks = seq(0, 70, 10)) +
  facet_wrap(~strategy) +
  theme_minimal()
wish_plot <- ggplot2::ggplot(wish_tib, aes(time, success))
wish_plot +
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange") +  
  labs(x = "Time", y = "Success (%)") +
  coord_cartesian(ylim = c(0, 70)) +
  scale_y_continuous(breaks = seq(0, 70, 10)) +
  facet_wrap(~strategy) +
  theme_minimal()
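
If you're wondering what 'based on a bootstrap' means, here is a minimal sketch of the general idea using base R and made-up data ([scores]{.alt} is hypothetical). It uses a simple percentile bootstrap: resample the data with replacement many times, compute the mean of each resample, and take the middle 95% of those means. I'm assuming this captures the spirit of [mean_cl_boot]{.adj}; the function itself relies on the `Hmisc` package and may differ in its details.

```r
set.seed(123)  # makes the resampling reproducible

# Made-up scores for one group (hypothetical data, for illustration only)
scores <- c(12, 25, 31, 38, 44, 47, 52, 60, 71, 90)

# Resample the scores with replacement 1000 times, storing each resample's mean
boot_means <- replicate(1000, mean(sample(scores, replace = TRUE)))

# The middle 95% of the bootstrapped means gives the interval limits
boot_ci <- c(
  y    = mean(scores),
  ymin = unname(quantile(boot_means, 0.025)),
  ymax = unname(quantile(boot_means, 0.975))
)
boot_ci
```

Because the limits come from random resampling, they will change slightly each time unless you fix the seed as above.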

r bmu() Transfer tasks [(1)]{.alt}

Imagine that a film company director was interested in whether there was really such a thing as a 'chick flick' (a film that has the stereotype of appealing to women more than to men). He took 20 men and 20 women and showed half of each sample a film that was supposed to be a 'chick flick' (The Notebook). The other half watched a documentary about notebooks as a control. In all cases the company director measured participants' emotional arousal as an indicator of how much they enjoyed the film. The data are in [notebook_tib]{.alt}, which contains three variables:

r alien() Alien coding challenge task 1

Plot a boxplot of the data that shows sex on the x-axis, and fills the boxplots in different colours for different films. Name the plot object [note_plot]{.alt}.


# Set up the plot (replace the xs)
note_plot <- ggplot2::ggplot(xxxx, aes(xxx, xxxx, fill = xxxx))
# add the boxplot geom
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, fill = film))
note_plot +
  geom_boxplot()
# add labels

labs(x = "xxxxx", y = "xxxx", fill = "xxxxx")

# Don't forget a `+` after geom_boxplot() on the previous line
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, fill = film))
note_plot +
  geom_boxplot() +
  labs(x = "Biological sex", y = "Arousal", fill = "Film watched")

# now, set limits of the y-axis

coord_cartesian(ylim = c(xxx, xxxx))

# Don't forget a `+` after labs() on the previous line
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, fill = film))
note_plot +
  geom_boxplot() +
  labs(x = "Biological sex", y = "Arousal", fill = "Film watched") +
  coord_cartesian(ylim = c(0, 50))

# now, set breaks of the y-axis

scale_y_continuous(breaks = seq(xx, xx, xx))

# Don't forget a `+` after coord_cartesian() on the previous line
# Finally, apply a theme:

note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, fill = film))
note_plot +
  geom_boxplot() +
  labs(x = "Biological sex", y = "Arousal", fill = "Film watched") +
  coord_cartesian(ylim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, 50, 5)) +
  theme_minimal()

r alien() Alien coding challenge task 2

Plot a violin plot (with means) of the data that shows sex on the x-axis, and plots points and violins for different films in different colours. Name the plot object [note_plot]{.alt}.


# Set up the plot (replace the xs)
note_plot <- ggplot2::ggplot(xxxx, aes(xxx, xxxx, colour = xxxx))
# add the violin geom
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, colour = film))
note_plot +
  geom_violin()
# add the means using stat_summary()
# don't forget position_dodge()!
# clue (fill in the Xs)

stat_summary(fun = xxxx, geom = xxxxx, size = xxxxx, position = position_dodge(xxxxxxxx))
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, colour = film))
note_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9))

# Now add axis labels

labs(x = "xxxxx", y = "xxxx", colour = "xxxxx")

# Don't forget a `+` after stat_summary() on the previous line
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, colour = film))
note_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  labs(x = "Biological sex", y = "Arousal", colour = "Film watched")

# now, set limits of the y-axis

coord_cartesian(ylim = c(xxx, xxxx))

# Don't forget a `+` after labs() on the previous line
note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, colour = film))
note_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  labs(x = "Biological sex", y = "Arousal", colour = "Film watched") +
  coord_cartesian(ylim = c(0, 50))

# now, set breaks of the y-axis

scale_y_continuous(breaks = seq(xx, xx, xx))

# Don't forget a `+` after coord_cartesian() on the previous line
# Finally, apply a theme:

note_plot <- ggplot2::ggplot(notebook_tib, aes(sex, arousal, colour = film))
note_plot +
  geom_violin() +
  stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
  labs(x = "Biological sex", y = "Arousal", colour = "Film watched") +
  coord_cartesian(ylim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, 50, 5)) +
  theme_minimal()

r bmu() Scatterplots [(1)]{.alt}

A psychologist was interested in the effects of exam stress on exam performance. She devised and validated a questionnaire to assess state anxiety relating to exams (called the Exam Anxiety Questionnaire, or EAQ). This scale produced a measure of anxiety scored out of 100. Anxiety was measured before an exam, and the percentage mark of each student on the exam was used to assess the exam performance. The first thing that the psychologist should do is draw a scatterplot of the two variables. The data are in [exam_tib]{.alt}, which contains 5 variables:

A scatterplot is just the values of one variable plotted on the x-axis, against the values of another on the y-axis.

r robot() Code example

If we wanted to plot anxiety on the x-axis and exam_grade on the y we could set this up in the usual way:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))

This command creates an object called [exam_plot]{.alt} using the data in [exam_tib]{.alt}, and uses the aes() function to specify that anxiety is plotted on the x-axis and exam_grade on the y. We'd then need to simply add geom_point() to represent the data points:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point()

r alien() Alien coding challenge

Use the code example to create the scatterplot. Use what you have already learnt to add labels to the axes and apply a minimal theme.


# set up the basic plot as in the code example:
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point()
# add labels
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point() +
  labs(x = "Exam anxiety", y = "Exam mark (%)")
# apply a theme
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point() +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()

r user_visor() Changing the appearance of points [(2)]{.alt}

We can use the options of geom_point() to change the colour of the points, their size, their shape and their transparency. Many of these arguments work with other geoms too:

For colours it is useful to use hex codes. These are codes that specify exact colours and you can find lists of these codes on websites such as color hex which also contains various palettes of colours.

r robot() Code example

To make the points blue using hex code #56B4E9, we could specify:

geom_point(colour = "#56B4E9")

We could also change the shape of the geom. Figure 3 shows the numbers representing particular shapes. For example, there are three variants of a circle: a hollow circle (shape number 1), a solid circle (shape number 16) and a filled circle with a border (shape number 21). Common shapes all have these three variants (numbers represent the hollow, solid and bordered versions respectively): square (0, 15, 22), triangle pointed upwards (2, 17, 24), and diamond (5, 18, 23).

`r cat_space()` **Tip: Mappings** If you ever forget these mappings then execute `?points`. The resulting help file lists the numbers and shapes.
See main text for description.
Figure 3: Mapping of shapes to numeric values.
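
The bordered variants (shape numbers 21 to 25) behave a little differently from the others: [colour]{.alt} sets the border and [fill]{.alt} sets the interior. Here's a small sketch using made-up data ([toy_tib]{.alt} is hypothetical, not part of the tutorial) so that you can see the effect:

```r
library(ggplot2)

# Made-up data (hypothetical, for illustration only)
toy_tib <- data.frame(x = 1:5, y = c(3, 1, 4, 1, 5))

# Shape 21 is a circle with a border: colour sets the border, fill sets the interior
shape_plot <- ggplot(toy_tib, aes(x, y)) +
  geom_point(shape = 21, colour = "black", fill = "#56B4E9", size = 4) +
  theme_minimal()

shape_plot
```

Try swapping 21 for 16 (the solid circle): the [fill]{.alt} argument should no longer have a visible effect because the solid shapes are coloured entirely by [colour]{.alt}.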

r robot() Code example

We can combine these arguments to change lots of things at once. The code below will make the points blue ([colour = "#56B4E9"]{.alt}), larger than default ([size = 4]{.alt}), solid triangles ([shape = 17]{.alt}) and slightly transparent ([alpha = 0.6]{.alt}).

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", size = 4, shape = 17, alpha = 0.6) +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()

r alien() Alien coding challenge

Try changing the values of colour, shape, size and alpha and note the effect it has on the plot.

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", size = 4, shape = 17, alpha = 0.6) +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()

r user_visor() Summarizing the trend [(2)]{.alt}

We can add a line summarizing the trend in the data using geom_smooth(). To fit a straight line we can set a method of "lm" (stands for linear model, more on that in later tutorials) and change its colour to be a nice orange (hex code #E69F00). By default, a confidence interval is plotted around the line. We can colour this interval orange by including [fill = "#E69F00"]{.alt}.

r robot() Code example

The complete code would be:

geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00")
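
If you're curious about what the line represents, the sketch below fits the same kind of straight line directly with lm() using made-up data ([toy_exam]{.alt} and its values are hypothetical). I'm assuming here that the line and 95% interval drawn by geom_smooth(method = "lm") correspond to the fitted values and confidence limits from this kind of model; the details are covered in later tutorials.

```r
# Made-up anxiety and exam scores (hypothetical data, for illustration only)
toy_exam <- data.frame(
  anxiety    = c(10, 20, 30, 40, 50),
  exam_grade = c(90, 80, 75, 60, 55)
)

# Fit the straight line that method = "lm" would draw through these points
toy_lm <- lm(exam_grade ~ anxiety, data = toy_exam)
coef(toy_lm)  # intercept and slope of the line

# Predicted values with the limits of the 95% confidence interval around the line
predict(toy_lm, interval = "confidence")
```

The slope is negative for these made-up scores, so the line would slope downwards from left to right.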

r alien() Alien coding challenge

Add the code for geom_smooth() from the example to the code box (underneath geom_point()) and run the code to see the plot. It should now have a line on top of the data points.

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", alpha = 0.6) +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw() 
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", alpha = 0.6) +
  geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00") +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()

r bmu() Grouped scatterplots [(1)]{.alt}

As with the other plots we've seen, we can split the data into categories. For example, if we wanted to compare the relationship between exam anxiety and exam performance in male and female students, we could do this by adding a facet:

r alien() Alien coding challenge

Add facet_wrap(~sex) in the box below so that data for men and women are plotted in separate panels:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", alpha = 0.6) +
  geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00") +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()         
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade))
exam_plot +
  geom_point(colour = "#56B4E9", alpha = 0.6) +
  geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00") +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  facet_wrap(~sex) +
  theme_bw()

We can also specify different colours for men and women using [colour = sex]{.alt} when we set up the plot:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex))

To colour the interval around the line by sex, we'd also need to include [fill = sex]{.alt}:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex, fill = sex))
`r bug()` **De-bug: colour clashes** Colours specified in a `geom()` override the colour argument in the original `ggplot()` function. Therefore, if you set the colour by a variable such as **sex** in `ggplot()` you must delete any colour arguments in the geom itself for this to take effect.

r robot() Code example

This code results in data points and a line coloured by sex:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex, fill = sex))
exam_plot +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  theme_minimal()         

In contrast, this code results in data points that are all blue (hex code #56B4E9) and a line that is orange (hex code #E69F00). In other words, the data haven't been split by sex:

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex, fill= sex))
exam_plot +
  geom_point(colour = "#56B4E9") +
  geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00") +
  theme_minimal()                  
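
You can check this overriding behaviour without even looking at the plot. The sketch below uses made-up data ([toy_tib]{.alt} is hypothetical) and the `ggplot2` function layer_data() to pull out the colours that a layer actually ends up using:

```r
library(ggplot2)

# Made-up data (hypothetical, for illustration only)
toy_tib <- data.frame(
  anxiety    = c(60, 70, 80, 65, 75, 85),
  exam_grade = c(70, 55, 40, 75, 60, 50),
  sex        = c("Female", "Female", "Female", "Male", "Male", "Male")
)

toy_plot <- ggplot(toy_tib, aes(anxiety, exam_grade, colour = sex))

# Colour is left to the mapping in ggplot(), so points are coloured by sex
mapped_cols <- unique(layer_data(toy_plot + geom_point())$colour)

# Colour is fixed inside the geom, which overrides the mapping
fixed_cols <- unique(layer_data(toy_plot + geom_point(colour = "#56B4E9"))$colour)

length(mapped_cols)  # two colours, one per sex
fixed_cols           # a single colour for every point
```

When the colour is set inside the geom, every point ends up with the same hex code even though **sex** is still mapped in ggplot().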

r bmu() Adjusting the axis [(1)]{.alt}

r alien() Alien coding challenge

Use what you learnt earlier to scale the y-axis from 0 to 140 in intervals of 10.

exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex, fill = sex))
exam_plot +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()         
exam_plot <- ggplot2::ggplot(exam_tib, aes(anxiety, exam_grade, colour = sex, fill = sex))
exam_plot +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(0, 140)) +
  scale_y_continuous(breaks = seq(0, 140, 10)) +
  labs(x = "Exam anxiety", y = "Exam mark (%)") +
  theme_bw()
discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R. **A message from Mae Jemstone:** Well done on completing phase 5 of your mission! Visualizing data is an essential skill - both being able to produce plots and also to interpret them. There will be many times when newspapers, social media and politicians are waving plots at you to try to make a point, or influence you. You have acquired a very useful skill in being able to interpret these plots for yourself and see through the spin or bullshit. Good work!

Resources {data-progressive=FALSE}

Statistics

r rproj()

Acknowledgement

I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).

References



profandyfield/discovr documentation built on May 4, 2024, 4:32 p.m.