library(forcats)
library(learnr)
library(tidyverse)
library(boot)

knitr::opts_chunk\$set(echo = FALSE, warning = FALSE)
tutorial_options(exercise.cap = "Exercise")

#Read dat files needed for the tutorial

#setup objects for code blocks

An Adventure in R: Inferential statistics and robust estimation

Overview

This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, ^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]

• Who is the tutorial aimed at?
• What is covered?
• This tutorial looks at how to get (and plot) inferential statistics such as confidence intervals for the mean. We also look at robust confidence intervals. It would be a useful tutorial to run alongside teaching based on Chapters 8 and 9 of An Adventure in Statistics.
• This tutorial does not teach the background theory: it is assumed you have either attended my lecture or read the relevant chapter in the aforementioned books (or someone else's)
• The aim of this tutorial is to augment the theory that you already know by guiding you through fitting linear models using R and RStudio and asking you questions to test your knowledge along the way.

Story précis

Why a précis?

Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.

General context for the story

It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.

Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.

But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.

Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions

• The Chippers who have had WiFi-enabled chips implanted into their brains, enabling them to record and broadcast what they see and think in real time; upload memories for future generations into a newly-created memoryBank and live-stream music and films directly into their brains.
• The Clocktarians, followers of the old pre-Prism ways who use steam punk style technologies, who have elected not to have chips in their brains, regarded by the Chippers as backward-looking stuck in a ‘clockwork, Victorian society’.

Everyone has a star, a limitless space on which to store their digital world.

Zach and Alice are Clocktarians. Their technology consists mainly of:

• A Proteus, a device made from programmable matter that can transform shape and function simply by the owners’ wishes. Zach calls his a diePad, in the shape of a tombstone in an ironic reference to an over-reliance on technology at the expense of memory.
• A Reality Checker, a clockwork mechanism that, at the point of critical velocity, projects an opaque human head that is linked to everything and can tell you anything. Every head has a personality and Zach’s is a handsome, laid back ‘dude’ who is like an electronic friend, who answers questions if he feels like it and often winds Zach up by giving him false information. And he often flirts with Alice.

Main Protagonists

• Zach
• Rock musician in band called The Reality Enigma.
• Spellbinding performer, has huge fan-base.
• Only people living in Elpis get to see The Reality Enigma in the flesh. Otherwise all performances are done via an oculus riff, a multisensory headset for experiencing virtual gigs.
• Zach’s music has influenced and changed thousands of lives.
• Wishes he had lived pre-Revolutionary times, the turn of the 21st Century, a golden age for music when bands performed in reality at festivals.
• Kind, gentle and self-doubting.
• Believes science and maths are dull and uninspiring. Creates a problem between him and Alice as she thinks that because he isn’t interested in science, he isn’t interested in her. Leads to lots of misunderstandings between them.
• Alice
• Shy, lonely, academically-gifted – estranged from the social world until she met Zach in the college library.
• Serious scientist, works at the Beimeni Centre of Genetics.
• At 21, won the World Science Federation’s Einstein Medal for her genetics research
• Desperately wants Zach to get over his fear of science so he can open his mind to the beauty of it.

Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.

On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.

Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.

Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.

Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.

Packages and data

Packages

This tutorial uses the following packages:

• boot to conduct bootstrapping [@RN11409]
• Hmisc to obtain confidence intervals [@RN11417]
• tidyverse for general data processing [@RN11407]

These packages are automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.

Note that boot is installed by default in R, but you will need to execute library(boot) to use it.

Data

This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:

You can load the file in several ways:

• Assuming that you follow the workflow recommended in the tutorial adventr_02 (see also this online tutorial), you can load the data into an object called jig_tib by executing:
• If you don't follow my suggested workflow, you will adjust the file location in the above command.
• Alternatively, if you have an internet connection (and my server has not exploded!) load the file direct from the URL by executing:

Confidence intervals

After breaking into the JIG:SAW complex and catching a glimpse of Alice, Zach is in shock. Returning home he is desolate: the evidence of his eyes suggests that Alice is not being held hostage at JIG:SAW and the reality kicks in that she may in fact have left him. He feels worthless. He meets up with Nick, the drummer in his band, at the Repository and they discuss what has happened. Nick is utterly convinced that Alice wouldn’t have left Zach out of choice and Zach returns home even more determined to get back into JIG:SAW to speak to her and find out for himself. Milton is reluctant and tries to convince Zach that he would need to step up his understanding of statistics to a level of which he is not capable. Based on data showing that performing under a different name in a maths test can free mental blocks to maths, it is decided that Zach needs to dress up as Florence Nightingale - a ‘gifted statistician’ as well as a nurse. Feel free to do the same.

Milton goes on to explain confidence intervals to Zach.

quiz(
question("If a 95% confidence interval for the mean ranges from 45 to 51, what does this tell us?",
answer("If this confidence interval is one of the 95% that contains the population value then the population value of the mean lies between 45 and 51.", correct = TRUE),
answer("There is a 95% chance that the population value of the mean lies between 45 and 51", message = "You cannot make probability statements from a confidence interval. We don't know whether this particular CI is one of the 95% that contains the population value of the mean."),
answer("The probability of this confidence interval containing the population mean is 0.95.", message = "The probability of this confidence interval containing the population mean is either 0 (it doesn't) or 1 (it does) but it's impossible to know which."),
answer("I can be 95% confident that the population value of the mean lies between 45 and 51", message = "Confidence intervals do not quantify your subjective confidence."),
correct = "Correct - well done!",
allow_retry = T
)
)

In a previous tutorial we looked at how to summarize data to get summary statistics such as the mean, median, variance and so on. It is often useful to get the confidence interval too. This section looks at how to do that. We're going to use the data in the tibble called jig_tib again, which contains data about the strength, speed, and visual acuity of JIG:SAW employees and non-employees. Remember that this tibble contains 7 variables:

• id : employee ID
• employee: whether or not the employee works for JIG:SAW
• job_type: Categories of employee
• footspeed: Footspeed of the individual, in miles per hour
• strength: Maximal push force of the individual in newtons
• vision: Visual acuity
• sex: Biological sex of the individual

In the previous tutorial we looked at combining group_by() and summarize() in a pipe command to create a tibble containing summary statistics (split by group). For example, we used the following command to create a tibble called speed_sum that contained the mean of the variable footspeed grouped by combinations of the variables employee and sex from the tibble called jig_tib.

speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed)
)
quiz(
question("In the above command, what does **dplyr::group_by(employee, sex)** do?",
answer("Groups the output by all combinations of the variables **employee** and **sex** (I.e. employee-male, employee-female, non-employee-male, non-employee-female.", correct = TRUE),
answer("Creates one output grouped by **employee** (i.e. employee, non-employee), and a separate output grouped by sex (male, female)."),
correct = "Correct - well done!",
allow_retry = T
),
question("What does the pipe operator (%>%) do?",
answer("Carries any instruction on the left of the operator forward into any instrucion on the right of the operator", correct = TRUE),
answer("Allows you to chain independent commands without them affecting each other.", message = "Sorry, that's incorrect. The pipe operator is does the opposite: it allows you to chain commands so that commands on the left are carried forwared in to commands on the right"),
answer("Multiplies anything on the left of the operator by anything on the right.", message = "Sorry, that's incorrect. The pipe operator is does the opposite: it allows you to chain commands so that commands on the left are carried forwared in to commands on the right"),
correct = "Correct - well done!",
allow_retry = T
)
)

To incorporate confidence intervals into this command isn't entirely straightforward. One way is to use the mean_cl_normal() function from the ggplot2 package. (Under the hood this function is using the package Hmisc so if you're working outside of this tutorial then you'll need to install and load this package.) This function takes any model as its input and returns an object that contains estimates from the model and their confidence intervals. In general it takes the following form:

ggplot2::mean_cl_normal(object, conf.int = 0.95, na.rm = TRUE)

To keep things simple, imagine the 'object' we put into this function is a variable, like footspeed. We'd execute:

ggplot2::mean_cl_normal(jig_tib\$footspeed, conf.int = 0.95, na.rm = TRUE)

If we place only jig_tib\$footspeed into the function we'll get a 95% confidence interval and missing values will be removed, which is usually what we want. To get a different confidence interval specify a proportion in conf.int =. For example, ggplot2::mean_cl_normal(jig_tib\$footspeed, conf.int = 0.9) will give us a 90% confidence interval around the mean of footspeed. When you put a variable into the function it returns an object containing the mean (labelled y in the object), the lower boundary of the confidence interval (labelled ymin), and the upper boundary of the confidence interval (labelled ymax). To access these values simultaneously we execute the function (or create an object from it and execute that), but if we want to access them individually we can do that by appending their name after the \$ symbol. For example:

ggplot2::mean_cl_normal(jig_tib\$footspeed)\$ymin

returns the value of the lower boundary of the (by default 95%) confidence interval. If we add this command within the summarize() statement in our pipe, we'll get the lower boundaries for the confidence intervals around the mean when the data are grouped by employee and sex. One final point is that we wouldn't include this line exactly as I've written it because we have specified the jig_tib tibble earlier in the pipe, which means it will be carried forward into the summarize statement. This means that we can specify the variable of interest as simply footspeed rather than jig_tib\$footspeed because the function will already know to look for footspeed within the jig_tib tibble.

Therefore, to create a tibble with means and confidence intervals grouped by employee and sex we can adapt the code that we've already used in previous tutorials to include the mean_cl_normal() function:

speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed),
ci_low = ggplot2::mean_cl_normal(footspeed)\$ymin,
ci_upp = ggplot2::mean_cl_normal(footspeed)\$ymax
)

Note that the code is the same as we used before except that we have added two lines (which should make sense based on the explanation above):

• ci_low = ggplot2::mean_cl_normal(footspeed)\$ymin creates a variable called ci_low by applying the function mean_cl_normal() to the variable footspeed and extracting the value labelled "CI lower", which is the value of the lower boundary of the 95% confidence interval.
• ci_upp = ggplot2::mean_cl_normal(footspeed)\$ymax creates a variable called ci_upp by applying the function mean_cl_normal() to the variable footspeed and extracting the value labelled "CI upper", which is the value of the upper boundary of the 95% confidence interval.

Copy the command into the code box and run it. Also, run the name of the speed_sum tibble that you create to see what it contains. Hopefully you'll see a tibble with the mean and 95% confidence interval boundaries split by employee and sex. Try to create tibbles called strength_sum and vis_sum that tabulate the mean and confidence intervals for the variables strength and vision respectively.

speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed),
ci_low = ggplot2::mean_cl_normal(footspeed)\$ymin,
ci_upp = ggplot2::mean_cl_normal(footspeed)\$ymax
)
speed_sum

Plotting confidence intervals

We've seen in previous tutorials how to plot means. For example we created a plot of the strength data split by JIG:SAW employees and non-employees. We used this code:

strength_plot <- ggplot2::ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()

Let's remind ourselves of what each part of this command is doing:

• strength_plot <- ggplot(jig_tib, aes(employee, strength)) creates an object called strength_plot that contains the plot. The ggplot() function is then used to specify that the plot uses the jig_tib tibble and plots the variable employee on the x-axis and the variable strength on the y-axis.
• strength_plot + stat_summary(fun = "mean", geom = "point", size = 4) takes the object strength_plot and adds a statistics summary to it. The stat_summary() uses fun = "mean" to plot the mean using a dot (geom = "point") of size 4 (size = 4).
• labs(x = "Employee status", y = "Maximal push force (n)") applies descriptive labels to the x and y axes.
• coord_cartesian(ylim = c(1000, 1800)) sets the limits of the y-axis
• scale_y_continuous(breaks = seq(1000, 1800, 100)) sets the breaks along the y-axis
• theme_bw() applies a black and white theme to the plot

If we want to plot confidence intervals around the means all we need to do is to edit the details of stat_summary(). Specifically there are three changes that we need to make:

1. Change fun = "mean" to fun.data = "mean_cl_normal". This creates the data to be plotted for the mean and confidence intervals.
2. Change geom = "point" to geom = "pointrange". This changes the geom to one that can display a confidence interval.
3. Delete size = 4 because the default size for the 'pointrange' geom will look better than size 4.

The code box contains the code above and shows the resulting plot (that we created in previous tutorials). Change the two things listed above and run the code to see how the plot changes.

strength_plot <- ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4) +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()
strength_plot <- ggplot(jig_tib, aes(employee, strength))
strength_plot +
stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
labs(x = "Employee status", y = "Maximal push force (n)") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
theme_bw()

Hopefully your plot how looks like dots showing the means with vertical lines showing the confidence intervals. (Incidentally, you could layer a stat_summary() function underneath the confidence intervals that uses a bar geom to plot the mean. Something like stat_summary(fun = "mean", geom = "bar", alpha = 0.6) + placed on the line above the existing stat_summary() function would work. The result would be a bar graph of means with error bars layered on top. However, adding the bars does not add any new information, but does add a lot of unnecessary ink, so I wouldn't bother.)

What about more complex plots? In a previous tutorial we also created a similar plot but differentiated males and females by colours. The code box below replicates this code and shows the plot to remind you of how it looks. To add error bars we make exactly the same changes to stat_summary() as we did above:

1. Change fun = "mean" to fun.data = "mean_cl_normal".
2. Change geom = "point" to geom = "pointrange".
3. Delete size = 4.

Make these changes and run the code. You should see the same plot but with confidence intervals as well as means.

strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
stat_summary(fun = "mean", geom = "point", size = 4, position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()
strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()

In the box below can you create a similar plot of the vision scores split by job_type on the x-axis and employees and non-employees in different colours.

vis_plot <- ggplot(jig_tib, aes(job_type, vision, colour = employee))
vis_plot +
stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
labs(x = "Type of job", y = "Visual acuity", colour = "Employee status") +
coord_cartesian(ylim = c(0, 1)) +
scale_y_continuous(breaks = seq(0, 1, 0.1)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()

Robust means

In Chapter 9, Zach steals the data we have been analysing that relates to JIG:SAW employees. Milton teaches him how the mean and confidence intervals can be biased by skew, kurtosis and outliers. Milton introduces the idea of the trimmed mean.

To get the trimmed mean for a variable we can adapt the code we have already used. For example, to get the mean footspeed split by employee and sex we used the code below. We discovered before that the mean has an option to specify a trim (trim = 0), so to produce a summary of trimmed means we can simply add this option to the command that gets the mean. For example, to get a 20% trimmed mean we might replace mean_speed = mean(footspeed) with:

tr_mean_speed = mean(footspeed, trim = 0.2)

Try adding this line to the summarize() function in the code box so that we store both the mean and the 20% trimmed mean (don't forget to include a comma after the first command within summarize())

speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed)
)
speed_sum
speed_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_speed = mean(footspeed),
tr_mean_speed = mean(footspeed, trim = 0.2)
)
speed_sum

Try this for strength as well:

strength_sum <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength),
tr_mean_strength = mean(strength, trim = 0.2)
)
strength_sum

Robust confidence intervals

Milton also introduces Zach to the idea of bootstrapping as a way to create robust confidence intervals.

quiz(
question("Bootstrapping is a technique from which the sampling distribution of a statistic is estimated by ...",
answer("Taking repeated samples (with replacement) from the data set.", correct = TRUE),
answer("Taking repeated samples from the population.", message = "Samples are not taken from the population because we don't have access to it."),
answer("Adjusting the standard error to compensate for heteroscedasticity.", message = "Bootstrapping is a *sampling* process."),
answer("Tying my shoelaces together so that I fall over.", message = "Now you're just being silly."),
correct = "Correct - well done!",
allow_retry = T
),
question("Which of these statements about bootstrap confidence intervals is **not** true?",
answer("Bootstrap confidence intervals have the same values each time you compute them.", correct = TRUE, message = "This stament *is* false. Because bootstrapping relies on repeated sampling, the results can vary slighty each time you impliment the process."),
answer("Bootstrap confidence intervals are robust to violations of homoscedasticity.", message = "This statement is true."),
answer("Bootstrap confidence intervals do not assume a normal sampling distribution.", message = "This statement is true: bootstrapping is a technique from which the sampling distribution of a statistic is estimated *empirically* from the data so no assumptions about its shape are made."),
answer("Bootstrap confidence intervals are most useful in small samples.", message = "Technically this statement is true because bootstrapping was designed for small sample situations (where the central limit theorem can't be invoked). However, there is evidence that the central limit theorem may not hold up in samples that are quite large so there is a case to be made that bootstrapping is still useful in larger samples."),
correct = "Correct - well done!",
allow_retry = T
)
)

We can get bootstrap confidence intervals in R by using the boot package, which is installed by default. However, you do need to load it before using it if you are working outside of this tutorial environment. The process of getting the bootstrap intervals is - and I'm not going to lie - complicated. You should not feel bad if this section makes no sense whatsoever (but by all means feel a sense of pride if you get through it in one piece).

The boot() function works in the following way: it takes a bootstrap sample from the data you specify, sends it out to a function (that the user defines) that returns the statistic of interest (in this case the mean), it stores this information and the repeats the process lots of times. The general format of the boot() function is:

boot::boot(data, statistic, R)

It has other options too, but the three key ones are data which specifies the data from which to take bootstrap samples, statistic which specifies a function that computes the information you want (e.g., the mean), and R which is the number of times you want to repeat the process. So, to get bootstrap samples of the strength data we could specify something like this:

boot::boot(jig_tib\$strength, statistic, R = 1000)

This specifies that the data from which to take bootstrap samples is jig_tib\$strength and that we want to repeat the process 1000 times (i.e. use 1000 bootstrap samples). The question is what do we replace statistic with. Given we're interested in the mean, you might reasonably think we can replace statistic with mean() or something. However, we can't do this because the function mean() will not return information about which bootstrap sample was used. Instead we need to write out own function that will include information about the particular bootstrap sample that is being used to compute the mean. We do this by executing:

mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)

This creates a new function called mean_boot(). This function takes data and an index value as input function(data, index) and it returns the mean of the data with a particular index value after removing missing values mean(data[index], na.rm = TRUE). In the main boot() function we would replace statistic with this function that we have created:

boot::boot(jig_tib\$strength, mean_boot, R = 1000)

To sum up, the boot() function is going to send data (the bootstrap samples) with an associated index value (from 1 to 1000) to the function called mean_boot. For each data set with each index value mean_boot will return the mean.

The code box contains these commands. Run the code.

mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
boot::boot(jig_tib\$strength, mean_boot, R = 1000)

The output doesn't give us the confidence intervals. Instead we're told the original mean and an estimate of bias. To get the confidence intervals there is one more step, which is to use the boot.mean_cl_normal() function, which takes the general form:

boot::boot.ci(boot.object, conf = 0.95, type = "all")

There are other options but these ones are the main ones. First you input an object created with the boot() function, then specify the width of confidence interval that you want (the default is 0.95 so can leave this option out if that's what you want), and the type that you want. By default it will compute 5 different types of interval. For most purposes you don't need five different intervals so it would be best to change this option to specify a particular type. There's much to recommend Bias Corrected and Accelerated intervals (type = "bca") or if those fail then a percentile interval (type = "perc").

The code box below combines what we have learnt. First it creates the function to get the mean. The second line inserts our boot() function into the boot.mean_cl_normal() function and asked for BCa confidence intervals. Run the code.

mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
boot::boot.ci(boot(jig_tib\$strength, mean_boot, R = 1000), type = "bca")

There is too much output to embed into a summarize function. We want to discard everything except for the numeric values that represent the lower and upper bound of the confidence interval. These values are stored in a variable called bca within the boot.mean_cl_normal() object. (Obviously if you chose a type of interval other than bca, the variable name will be different; for example, if you opted for type = "Perc" then the variable will be called perc). Like any variable, we can access it using \$. For example, to see the information relating to the BCa confidence intervals we could execute:

boot::boot.ci(boot(jig_tib\$strength, mean_boot, R = 1000), type = "bca")\$bca

Try it in the code box above and you'll see 5 values appear.Values 4 and 5 are the lower and upper limits of the interval. We can access these specific values using:

boot::boot.ci(boot(jig_tib\$strength, mean_boot, R = 1000), type = "bca")\$bca boot::boot.ci(boot(jig_tib\$strength, mean_boot, R = 1000), type = "bca")\$bca

Try these commands in the code box above. We can include these lines in the summarize function of the pipe that we used at the start of the tutorial. Remember that we can drop the jig_tib\$ from the command because that tibble is specified earlier in the pipe. The code box below shows the commands.

mean_boot <- function(data, index) mean(data[index], na.rm = TRUE)
strength_mean <-
jig_tib %>%
dplyr::group_by(employee, sex) %>%
dplyr::summarize(
mean_strength = mean(strength),
ci_low_boot = boot::boot.ci(boot(strength, mean_boot, R = 1000), type = "bca")\$bca,
ci_upp_boot = boot::boot.ci(boot(strength, mean_boot, R = 1000), type = "bca")\$bca
)
strength_mean

Run this code and you should see that it creates a tibble with the means, the lower limit of the 95% bootstrap confidence interval for the mean (ci_low_boot) and the upper limit of the 95% bootstrap confidence interval for the mean (ci_upp_boot) split b all combinations of employee and sex.

Plotting bootstrap confidence intervals

The last section was, I imagine, not the most fun that you have ever had. I expect you are dreading the added complexity of plotting these bootstrap intervals. Well, fear not. Earlier on we plotted this graph:

strength_plot <- ggplot(jig_tib, aes(employee, strength, colour = sex))
strength_plot +
stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
labs(x = "Employee status", y = "Maximal push force (n)", colour = "Biological sex") +
coord_cartesian(ylim = c(1000, 1800)) +
scale_y_continuous(breaks = seq(1000, 1800, 100)) +
scale_colour_manual(values = c("#56B4E9", "#E69F00")) +
theme_bw()

To convert these confidence intervals to bootstrap confidence intervals requires only one change to the command:

• In the stat_summary() function change mean_cl_normal to mean_cl_boot.

It's a simple as that. Make the change in the code box above and re-run the code to observe the change on the plot.

Other resources

Statistics

• The tutorials typically follow examples described in detail in @RN10163, so for most of them there's a thorough account in there. You might also find @RN4832 useful for the R stuff.
• There are free lectures and screen casts on my YouTube channel
• There are free statistical resources on my website www.discoveringstatistics.com