# adventr: visualizing data In adventr: Interactive R Tutorials to Accompany Field (2016), "An Adventure in Statistics"

```library(forcats)
library(learnr)
library(tidyverse)

knitr::opts_chunk\$set(echo = FALSE)
tutorial_options(exercise.cap = "Exercise")

#Read dat files needed for the tutorial

#setup objects for code blocks

strength_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength),
med_strength = median(strength),
var_strength = var(strength),
sd_strength = sd(strength),
iqr_strength = IQR(strength, type = 8)
)

strength_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength),
med_strength = median(strength),
var_strength = var(strength),
sd_strength = sd(strength),
iqr_strength = IQR(strength, type = 8)
)
```

# An Adventure in R: Presenting data (summarizing groups and more ggplot2)

## Overview

This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, ^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]

• Who is the tutorial aimed at?
• What is covered?
• This tutorial looks at how to summarize data for groups. We learn a bit more about manipulating tibbles and ggplot2. It would be a useful tutorial to run alongside teaching based on Chapter 5 of An Adventure in Statistics.
• This tutorial does not teach the background theory: it is assumed you have either attended my lecture or read the relevant chapter in the aforementioned books (or someone else's)
• The aim of this tutorial is to augment the theory that you already know by guiding you through fitting linear models using R and RStudio and asking you questions to test your knowledge along the way.

## Story précis

### Why a précis?

Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.

### General context for the story

It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.

Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.

But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.

Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions

• The Chippers who have had WiFi-enabled chips implanted into their brains, enabling them to record and broadcast what they see and think in real time; upload memories for future generations into a newly-created memoryBank and live-stream music and films directly into their brains.
• The Clocktarians, followers of the old pre-Prism ways who use steam punk style technologies, who have elected not to have chips in their brains, regarded by the Chippers as backward-looking stuck in a ‘clockwork, Victorian society’.

Everyone has a star, a limitless space on which to store their digital world.

Zach and Alice are Clocktarians. Their technology consists mainly of:

• A Proteus, a device made from programmable matter that can transform shape and function simply by the owners’ wishes. Zach calls his a diePad, in the shape of a tombstone in an ironic reference to an over-reliance on technology at the expense of memory.
• A Reality Checker, a clockwork mechanism that, at the point of critical velocity, projects an opaque human head that is linked to everything and can tell you anything. Every head has a personality and Zach’s is a handsome, laid back ‘dude’ who is like an electronic friend, who answers questions if he feels like it and often winds Zach up by giving him false information. And he often flirts with Alice.

### Main Protagonists

• Zach
• Rock musician in band called The Reality Enigma.
• Spellbinding performer, has huge fan-base.
• Only people living in Elpis get to see The Reality Enigma in the flesh. Otherwise all performances are done via an oculus riff, a multisensory headset for experiencing virtual gigs.
• Zach’s music has influenced and changed thousands of lives.
• Wishes he had lived pre-Revolutionary times, the turn of the 21st Century, a golden age for music when bands performed in reality at festivals.
• Kind, gentle and self-doubting.
• Believes science and maths are dull and uninspiring. Creates a problem between him and Alice as she thinks that because he isn’t interested in science, he isn’t interested in her. Leads to lots of misunderstandings between them.
• Alice
• Shy, lonely, academically-gifted – estranged from the social world until she met Zach in the college library.
• Serious scientist, works at the Beimeni Centre of Genetics.
• At 21, won the World Science Federation’s Einstein Medal for her genetics research
• Desperately wants Zach to get over his fear of science so he can open his mind to the beauty of it.

Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.

On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.

Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.

Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.

Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.

## Packages and data

### Packages

This tutorial uses the following packages:

• `tidyverse` [@RN11407]

This package is automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing `install.packages("package_name")`, where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing `library(package_name)`, where package_name is the name of the package.

### Data

This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:

You can load the file in several ways:

• Assuming that you follow the workflow recommended in the tutorial adventr_02 (see also this online tutorial), you can load the data into an object called `jig_tib` by executing:
• `jig_tib <- readr::read_csv("../data/ais_05_jigsaw.csv")`
• If you don't follow my suggested workflow, you will adjust the file location in the above command.
• Alternatively, if you have an internet connection (and my server has not exploded!) load the file direct from the URL by executing:
• `jig_tib <- readr::read_csv("http://www.discoveringstatistics.com/repository/ais_data/ais_05_jigsaw.csv")`

## Descriptive statistics for groups

### The data

In chapter 5 of An Adventure in Statistics Milton and Zach attend a recruitment event by a corporation called JIG:SAW. During it, Rob Nutcot presents data about JIG:SAW's genetic enhancement programme. Afterwards he hands Zach a brochure full of graphs of data comparing JIG:SAW employees to non-employees on various physical attributes such as speed (footspeed), visual acuity (vision) and strength. Zach can't interpret these graphs. Milton takes Zach to visit Dr Sisyphus Tuff in a bar called '6' located in the Evil Pockets, an uninhabited part of the city of Elpis that attracted 'undesirables'. Dr Tuff is an expert in visualizing data but has been driven slightly crazy by his quest for visual perfection.

The data from which Rob Nutcot's graphs were generated are in the tibble called `jig_tib`; use the code box to view this tibble.

```
```
```jig_tib
```

You should see a tibble with 7 variables:

• id : employee ID
• employee: whether or not the employee works for JIG:SAW
• job_type: Categories of employee
• footspeed: Footspeed of the individual, in miles per hour
• strength: Maximal push force of the individual in newtons
• vision: Visual acuity
• sex: Biological sex of the individual

### Creating summaries by group

In several graphs in Nutcot's report the data are summarized by group (for example JIG:SAW employees vs. non-employees). We can summarize data by applying the `group_by()` and `summarize()` functions from the tidyverse package (specifically the `dplyr` package) to our tibble. For example, lets saw we wanted to get the mean visual acuity separately for the JIG:SAW employees and non-employees and store these values in a tibble called `vis_mean`. We could achieve this using the following pipe command:

```vis_mean <- jig_tib %>%
dplyr::group_by(employee) %>%
dplyr::summarize(
mean = mean(vision)
)
```

Let's break this command down:

• `vis_mean <- jig_tib %>%` translates as create an object called `vis_mean` and the starting point is to copy the tibble called `jig_tib`. The pipe operator (`%>%`) indicates that more processing will be done on the `jig_tib` tibble before `vis_mean` is created.
• `dplyr::group_by(employee) %>%` tells R to group by the variable employeee and because this command comes after the pipe operator we know that the thing that we're going to 'group by' employee is the object before the pipe operator (i.e. `jig_tib`). If `group_by()` is used outside of a pipe command then you need to specify a data frame or tibble within it otherwise it won't know what object to group. the pipe operator again tells R that there is more processing to come.
• `dplyr::summarize(mean = mean(vision))` creates a new variable called `mean_vis` within the tibble and then uses the function `mean()`, which we learnt about in a previous tutorial, to place values into that variable.

The easiest way to get a feel for what's happening is to execute the code and see what happens. The code is already types into the code box to show you a good way to format this kind of piped command. Note that the top line sets up the object to be created, then the commands to control how the object is created are indented on the lines below. You start a new line after each pipe operator (so that it's easy to see the components of the pipe), and within the `summarize()` command we indent any commands. It may not be obvious why we do this, but it will when we move onto more complex examples.

The final line `vis_mean` is there simply to display the newly created object. Run this code to see what is created by this command.

```vis_mean <-
jig_tib %>%
dplyr::group_by(employee) %>%
dplyr::summarize(
mean_vis = mean(vision)
)
vis_mean
```

You should see a tibble with two columns (employee and mean_vis), the first contains the different groups defined within the variable employee and the second contains the mean visual acuity of those two groups. Imagine that we wanted to get means within both employee groups split by the biological sex of the employee. We can do this by adding the second variable into the `group_by()` function. For example, instead of `group_by(employee)` we could use `group_by(employee, sex)`. Note how the two variables are separated by a comma.

Use the code box (and the example above) to create a tibble called `strength_mean`, which contains the mean of the variable strength split by employee status and biological sex.

```
```
```strength_mean <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength)
)
strength_mean
```

The resulting values are those plotted in Figure 5.4 in the book.

We're not restricted to the mean, we can use all of the functions that we learnt about in the previous tutorial (`var()`, `median()`, `sd()` etc.) within the summarize function. For example, imagine we wanted a table (tibble) containing a detailed summary of the footspeed scores. We could create this using a pipe command like this:

```speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed),
med_speed = median(footspeed),
var_speed = var(footspeed),
sd_speed = sd(footspeed),
iqr_speed = IQR(footspeed, type = 8)
)
```

Note that I have listed several commands within the `summarize()` function. Each one is separated by a comma (the `IQR()` function is not followed by a comma because it is the last command). Each line within the `summarize()` function creates a variable based on a function such as `mean()`. For example, `mean_speed = mean(footspeed)` creates a new variable called mean_speed which contains the mean value of the variable footspeed.

This example should make it clearer why it is useful to format the command using indented lines: it is really simple to see what variables are being created. Note that because we have no missing values I have not needed to specify the `na.rm =` option in any of the functions within `summarize()` but I have specified `type = 8` for the IQR to override the default value of 7. Copy this code into the box below and run it. Include the name of the tibble as a final line to inspect the resulting tibble.

```
```
```speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed),
med_speed = median(footspeed),
var_speed = var(footspeed),
sd_speed = sd(footspeed),
iqr_speed = IQR(footspeed, type = 8)
)
speed_sum
```

You should see a tibble with 4 rows representing the different combinations of JIG:SAW employee vs. non-employee and male and female, and 7 columns listing the combinations of employee and sex and the five new variables that you created containing the summary statistics that you specified. In the code box below adapt the code that you just used to create a tibble called `strength_sum` that creates summary statistics for strength grouped by employee and sex. For the new variables you create replace 'speed' with strength (i.e., instead of creating `mean_speed` create `mean_strength`).

```
```
```strength_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength),
med_strength = median(strength),
var_strength = var(strength),
sd_strength = sd(strength),
iqr_strength = IQR(strength, type = 8)
)
strength_sum
```

## Plotting grouped data

### Boxplots

Figure 5.6 in An adventure in statistics shows a boxplot of footspeed split by employee. This is easy to create in `ggplot` using the `geom_boxplot()` function. We could set up the plot with this command:

```speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed))
speed_box +
geom_boxplot() +
labs(x = "Employee status", y = "Footspeed (mph)") +
theme_bw()
```

Let's break down this command (refer back to the tutorial on `ggplot2` if you need to):

• `speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed))` creates an object called `speed_box` that contains the plot. The `ggplot()` function is then used to specify that the plot uses the `jig_tib` tibble and plots the variable employee on the x-axis and the variable footspeed on the y-axis.
• `speed_box + geom_boxplot() +` takes the object `speed_box` and adds a boxplot geom to it.
• `labs(x = "Employee status", y = "Footspeed (mph)") +` applies descriptive labels to the x and y axes.
• `theme_bw()` applies a black and white theme to the plot

Copy this code into the box and run it to see the boxplot.

```
```
```speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed, fill = sex))
speed_box +
geom_boxplot() +
labs(x = "Employee status", y = "Footspeed (mph)") +
theme_bw()
```

Job done. If we wanted to split the data by a second grouping variable (for example sex) we can do this by specifying that `ggplot` varies the `fill` of the boxes or the `colour` of the lines around the boxes by this second variable. To do this we add this specification to the `aes()` function in the original command to set up the graph. For example, we'd change the first line to be:

`speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed, fill = sex))`

Note that all I have done is to add `fill = sex` to the initial aesthetic. The rest of the command can stay the same. Adapt the code above to include `fill = sex` and run it again. Note that the graph now splits the data by employee along the x-axis and within that splits it by sex using different colours to fill the boxes. Now change the word 'fill' to 'colour' and run the code again. You should see that now the data and the box outlines and whiskers are coloured differently for men and women.

Having added another variable to the plot, we might want to exit the label displayed on the graph. Currently we have specified labels for the x- and y-axis by including:

`labs(x = "Employee status", y = "Footspeed (mph)")`

If we want to edit the label for the variable that is used to determine the fill or colour of the plot, we can add this to the `labs()` function. If we used sex to determine the fill of the plot then we'd add `fill = "label"`, where label is the text we want to use:

`labs(x = "Employee status", y = "Footspeed (mph)", fill = "Biological sex")`

If we had used sex to determine the colour of the plot then we'd add `colour = "label"` to the function:

`labs(x = "Employee status", y = "Footspeed (mph)", colour = "Biological sex")`

What if you want to change the colours that are used to fill the boxes or the box outlines? That's easy. You can add the `scale_fill_manual()` or `scale_colour_manual()` functions to the command. Both functions have a similar syntax:

`scale_fill_manual(values = c("colour_1", "colour_2", "colour_3" ...))`

In which you change "colour_1" etc. to be the name of a colour of a hex code. In this example, we're using two colours. Let's change them to be a colour-blind friendly blue (hex code #56B4E9) and a colour-blind friendly orange(hex code #E69F00). The function would be:

`scale_fill_manual(values = c("#56B4E9", "#E69F00"))`

Our completed code would be:

```speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed, fill = sex))
speed_box +
geom_boxplot() +
labs(x = "Employee status", y = "Footspeed (mph)", fill = "Biological sex") +
scale_fill_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()
```

Copy this code into the box below and run it to see how the fill of the boxes has changed to the colours we specified. Then edit the code to change the colour rather than the fill to be these new colours (i.e., change 'fill' to 'colour' in the first line and the `labs()` function and change `scale_fill_manual` to be `scale_colour_manual`). Run the code to see the effect.

```
```
```speed_box <- ggplot2::ggplot(jig_tib, aes(employee, footspeed, colour = sex))
speed_box +
geom_boxplot() +
labs(x = "Employee status", y = "Footspeed (mph)", fill = "Biological sex") +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()
```

### Plotting means

Plotting means is slightly more tricky. If you want to plot from the raw data (rather than a tibble containing the summary information) then your best bet is to use the `stat_summary()` function and then specify the geom to use within it. Let's begin by plotting the mean strength split by employee. We can do this with the following command:

```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4)
```

All we're doing here is setting up the plot as normal and then adding `stat_summary()` to it and within that function asking it to calculate the means and their confidence intervals (`fun = "mean"`) and displaying them using geom_point (`geom = "point"`). `size = 4` simply makes the points that display the means bigger than the default (you can omit this option if you like). Copy the code into the code box and run it to see the resulting plot. Bar charts are generally a massive waste of ink, but try replacing `geom = "point"` with `geom = "bar"` and re-running the code. Congratulations, you have just displayed the same information as the graph with points but using a tonne more ink!

```
```
```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()
```

The result can be tidied up by adjusting the scales, adding axis labels, and applying a nicer theme. Try adding the following commands into the code box and seeing what effect it has on the graph when you run the code (don't forget that every line should end with a `+` except for the last line):

• `labs(x = "Employee status", y = "Maximal push force (n)")` to label the axes
• `coord_cartesian(ylim = c(1000, 1800))` to adjust the y-axis to display values from 1000 to 1800
• `scale_y_continuous(breaks = seq(1000, 1800, 100))` to set the breaks on the y-axis. I've used the function `seq()` which takes the form `seq(from, to, by)` where from is the value you want to start at, to is the value you want to stop at, and by is the size of the increment you want. In this case, I have specified `seq(1000, 1800, 100)` which will create a sequence between 1000 and 1800 in intervals of 100.
• `theme_bw()` to apply a black and white theme

### Facet_wrap()

The fun begins if you want to divide the data by a second grouping variable (for example sex). One easy way is to use `facet_wrap()` to create separate plots for different groups. This function takes the general form:

`facet_wrap(facet, nrow = NULL, ncol = NULL, scales = "fixed")`

There are other options, but I have listed the four main ones, which are:

• `facet` specifies how you want to create the facet. To create separate plots for males and females our facet would be `~sex`.
• `nrow` specifies how many rows of plots to display. There is no default, the function just tries to make sensible choices. If we wanted the male and female plots side by side we want them arranged in 1 row, so we could be explicit and include the command `nrow = 1`.
• `ncol` specifies how many columns of plots to display. There is no default, the function tries to make sensible choices. If we wanted the male and female plots on top of each other we want them arranged in 1 column, so we could be explicit and include the command `ncol = 1`. In reality nrow and ncol become important when you have lots of plots. For example if you were plotting data from 12 different hospitals, you might want these arranged in 2 rows and 6 columns, 4 rows and 3 columns, 6 rows and two columns and so on.
• `scales` by default the scales of the plots are set to be the same ("fixed") but sometimes it's handy to let them vary across different plots, in which case set `scales = "free"` or use "free_x" or "free_y" to allow only the x-axis or y-axis to vary across plots.

The code box below displays the code that you used above to generate the previous plot. Add `facet_wrap(~sex)` to the command and execute the code to see what happens.

```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()
```
```strength_plot <- ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw() +
facet_wrap(~sex)
```

### The more complicated way

Another approach is to plot sex not in different panels, but in different colours or fills. Doing so is a bit more involved. Lets start by varying males and females by colour. To set this up, we can start by adding `colour = sex` to the aesthetic where we set up the plot:

`strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength, colour = sex))`

Compare this with the first line of code in the box below (which replicates the plot split solely by employee). Add `colour = sex` into the first line and run the code. What happens?

You should find that the male and female means and CIs appear in different colours, but they appear at the same horizontal location (that is, the male points are directly above the female ones). For these data that doesn't particularly matter because the error bars don't overlap. However, if they did overlap it would be difficult to interpret the plot.

```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()
```

We can avoid this potential problem by adjusting the horizontal position using `position_dodge()`. This command plots geoms so that they 'dodge' each other on the horizontal plane. You have to include a value for the 'dodge', 0.9 works well for this plot, for others play around with values until it looks good. In the code box above add `position = position_dodge(width = 0.9)` into the `stat_summary()`. It should look like this:

`stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9))`

Run the code and you should see that the male and female points are set at different locations along the horizontal axis.

### Adding raw data to a plot

This section is only really for those feeling confident! If you want to get really flash, you could add your raw data to the plot too. We want to plot it behind the geom that is displaying the mean, and we can do this by placing the function before the `stat_summary()` function. Let's just dive in. The function we will use is `geom_point()` and we can set it up as follows:

`geom_point(position = position_dodge(width = 0.9), alpha = 0.3) +`

This function will add points (dots) to the plot showing the raw data. The commands inside the function are doing this:

• `position = position_dodge(width = 0.9)` just like for our `stat_summary()` we need the raw data for men and women to dodge each other. If you set the dodge identical to the value in `stat_summary()` as we have done here then the raw data will appear in the same location as the geom displaying the mean. You can set a slightly wider dodge (e.g., `width = 1`) so that the data and the means are side by side. Try changing the value to see how the data points move horizontally.
• `alpha = 0.3` this makes the data points semi-transparent, which is handy because it means you can see where there are lots of similar scores.

Add this function to the command in the box below (place it before the `stat-summary()` function). Finally, by adding the raw data we need to extend the y-axis so that we can see all of the scores. Previously we have set the limits (and breaks) to vary from 1000 to 1800, edit these values to show a range between 900 and 2500. When you have made all of those edits to the code, run it to see the plot. Try also adding a `scale_colour_manual()` function to specify the same colour-blind friendly colours that we used before.

```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()
```
```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
geom_point(position = position_dodge(width = 0.9), alpha = 0.3) +
stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(900, 2500)) +
scale_y_continuous(breaks = seq(900, 2500, 100)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()
```

Because we changed the size of the points displaying the mean to make them bigger, this looks OK (in that the raw scores do not obscure the mean). Howeber, we could also change the shape of the geom to distinguish the means from the raw scores. The `shape =` option allows us to specify a number that indicate a shape for the `geom_point()` function. Figure 1 shows the numebrs representing particula shapes. If you ever forget these mappings then execute `?points`; the resulting help file lists the numbers and shapes.

The code box includes the command that we have just used. First, add `shape = 0` to the `stat_summary()` function. In other words change:

`stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9))`

to be:

`stat_summary(fun = "mean", geom = "point", size = 4, shape = 0, position = position_dodge(width = 0.9))`

Run the code, and note that the means are now displayed as hollow squares. Feel free to try other shapes!

Now change the `stat_summary()` function back to the default by deleting the `shape` option. Now try changing the shapes that display the raw scores. Let's change them to crosses by adding `shape = 4` to `geom_point()`. (We can delete `alpha = 0.3` because shape 4 is not filled so setting the transparency is redundant.) So we want to change:

`geom_point(position = position_dodge(width = 0.9), alpha = 0.3)`

to be

`geom_point(position = position_dodge(width = 0.9), shape = 4)`

Run the code and you should see that the raw data points are crosses.

```strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
geom_point(position = position_dodge(width = 0.9), alpha = 0.3) +
stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(900, 2500)) +
scale_y_continuous(breaks = seq(900, 2500, 100)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()
```

## Scatterplots

Zach also showed Dr Tuff a scatterplot of footspeed against strength. At the simplest level we could set this up as:

```scat <- ggplot2::ggplot(jig_tib, aes(footspeed, strength))
scat +
geom_point()
```

The first line creates an object called `scat` that uses the `jig_tib` tibble and from it plots footspeed on the x-axis and strength on the y-axis. Try it in the code box below.

We can use the options of `geom_point()` to change the colour of the points, their size, their shape (see above) and their transparency. For example, to make the points blue (hex code #56B4E9), bigger and more transparent than the defaults we could specify:

`geom_point(colour = "#56B4E9", size = 2, alpha = 0.6)`

Edit the code below to include these options and run it to see how the plot is affected. We can also add axis labels and apply a theme by adding these functions:

`labs(x = "Footspeed (mph)", y = "Maximal push force (n)")` `theme_bw()`

Add those to the code (remember that each line except the first and last should end with a `+`). Finally, we could add a line of bets fit using `geom_smooth()`. To fit a straight line we can set a method of "lm" (stands for linear model) and change its colour to be a nice orange (hex code #E69F00). By default a confidence interval around the line will be plotted (you can turn it off by adding `SE = F`), so we can also colour this orange by including `fill = "#E69F00"`. The complete function would be:

`geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00")`

Add this function to the code box (underneath `geom_point()`) and run the code to see the plot. It should now have a line on top of the data points.

```
```
```scat <- ggplot2::ggplot(jig_tib, aes(footspeed, strength))
scat +
geom_point(colour = "#56B4E9", size = 2, alpha = 0.6) +
geom_smooth(method = "lm", colour = "#E69F00", fill = "#E69F00") +
labs(x = "Footspeed (mph)", y = "Maximal push force (n)") +
theme_bw()
```

## Other resources

### Statistics

• The tutorials typically follow examples described in detail in @RN10163, so for most of them there's a thorough account in there. You might also find @RN4832 useful for the R stuff.
• There are free lectures and screen casts on my YouTube channel
• There are free statistical resources on my website www.discoveringstatistics.com