knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE ) #necessary to render tutorial correctly library(learnr) library(htmltools) #tidyverse library(dplyr) library(ggplot2) library(magrittr) library(purrr) library(tibble) #non tidyverse library(Hmisc) library(datawizard) library(knitr) source("./www/discovr_helpers.R") #Read data files needed for the tutorial insta_tib <- tibble::tibble( followers = c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22) )
# Create bib file for R packages here::here("inst/tutorials/discovr_03/packages.bib") |> knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'tibble', 'datawizard', 'knitr'), file = _)
r cat_space(fill = blu)
Welcome to the discovr
space pirate academyHi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()
s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj()
, and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.
On your way you will face many challenges, but follow Mae's system to keep yourself on track:
r bmu(height = 1.5)
This icon flags materials for teleporters. That's what we like to call the new cat-dets, you know, the ones who have just teleported into the academy. This material is the core knowledge that everyone arriving at space academy must learn and practice. For accessibility, these sections will also be labelled with [(1)]{.alt}.r user_visor(height = 1.5)
Once you have been at space pirate academy for a while, you get your own funky visor. It has various modes. My favourite is the one that allows you to see everything as a large plate of tuna. More important, sections marked for cat-dets with visors goes beyond the core material but is still important and should be studied by all cat-dets. However, try not to be disheartened if you find it difficult. For accessibility, these sections will also be labelled with [(2)]{.alt}.r user_astronaut(height = 1.5)
Those almost as brilliant as Mae (because no-one is quite as brilliant as her) get their own space suits so that they can go on space pirate adventures. They get to shout RRRRRR really loudly too. Actually, everyone here gets to should RRRRRR really loudly. Try it now. Go on. It feels good. Anyway, this material is the most advanced and you can consider it optional unless you are a postgraduate cat-det. For accessibility, these sections will also be labelled with [(3)]{.alt}.It's not just me that's here to help though, you will meet other characters along the way:
r alien(height = 1.5)
aliens love dropping down onto the planet and probing humanoids. Unfortunately you'll find them probing you quite a lot with little coding challenges. Helps is at hand though. r robot(height = 1.5)
bend-R is our coding robot. She will help you to try out bits of r rproj()
by writing the code for you before you encounter each coding challenge.r bug(height = 1.5)
we also have our friendly alien bugs that will, erm, help you to avoid bugs in your code by highlighting common mistakes that even Mae Jemstone sometimes makes (but don't tell her I said that or my tuna supply will end). Also, use hints and solutions to guide you through the exercises (Figure 1).
By for now and good luck - you'll be amazing!
Before attempting this tutorial it's a good idea to work through this tutorial on how to install, set up and work within r rproj()
and r rstudio()
.
The tutorials are self-contained (you practice code in code boxes). However, so you get practice at working in r rstudio()
I strongly recommend that you create an Quarto document within an r rstudio()
project and practice everything you do in the tutorial in the Quarto document, make notes on things that confused you or that you want to remember, and save it. Within this Quarto document you will need to load the relevant packages and data.
This tutorial uses the following package:
datawizard
[@R-datawizard]here
[@R-here]knitr
[@R-knitr]It also uses these tidyverse
packages [@R-tidyverse; @tidyverse2019]: readr
[@R-readr], dplyr
[@R-dplyr], ggplot2
[@wickhamGgplot2ElegantGraphics2016] and tibble
[@R-tibble].
There are (broadly) two styles of coding:
Explicit: Using this style you declare the package when using a function: package::function()
. For example, if I want to use the mutate()
function from the package dplyr
, I will type dplyr::mutate()
. If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).
Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name)
, and then refer to functions without their package. For example, if I want to use the mutate()
function from the package dplyr
, I will use library(dplyr)
in my first code chunk and type the function as mutate()
when I use it subsequently.
Coding style is a personal choice. The Google r rproj()
style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse
because the dplyr
and ggplot2
packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2::
before every function from ggplot2
.
You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse
package (and any others if you're using a concise style) at the beginning of your Quarto document:
library(tidyverse)
r bmu()
What is a confidence interval? [(1)]{.alt}As a brief recap, we usually use a sample value as an estimate of a parameter (e.g., the mean) in the population. The estimate of a parameter (e.g., the mean) will differ across samples, and we can use the standard error to get some idea of the extent to which these estimates differ across samples. We can also use this information to calculate boundaries within which we believe the population value will fall. Such boundaries are called [confidence intervals]{.alt}. Although what I'm about to describe applies to any parameter, we'll stick with the mean to keep things consistent with what you have already learnt.
In the book we look at an example of how many followers someone has on their social media account. I think I used Facebook, but that's because I haven't yet realised that Facebook is populated only by people over the age of 40. I'm told that Instagram is the place to be, but no-one wants to see my wrinkly old scrotum of a face so I never use it. I pretty much don't use Facebook either. Did I mention I like statistics? Draw your own conclusions.
Anyway, we looked at the number of followers people have on [insert name of currently popular social media gizmo]. We're interested in the average number of followers. There is a true mean (the mean in the population of all users). Let's imagine it's 120 (because most people are not statisticians). We take a sample of users and we find the mean number of followers is 100 Because we don't know what the true value of the mean is (the population value), we don't know how good (or bad) our sample value of 100 followers is as an estimate of it. So rather than fixating on a single value from the sample (the point estimate), we could use an interval estimate instead: we use our sample value as the midpoint, but set a lower and upper limit as well. So, we might say, we think the true value of the mean number of followers is somewhere between 70 and 130 followers (note that 100, our point estimate, falls exactly between these values). Of course, in this case the true value (100 followers) does falls within these limits. However, what if we'd set smaller limits – what if we'd said we think the true value falls between 90 and 110 followers (again, note that 100, our point estimate, is in the middle)? In this case the interval does not contain the population value of the mean.
Let's imagine that you were particularly fixated with Japanese quail sperm, and you took 100 different samples of social media users. In each sample, you calculate the mean and constructed an interval around it as I've just described. The crucial thing is to construct the intervals in such a way that they tell us something useful. For example, perhaps we might want to know how often, in the long run, an interval contains the true value of the parameter we're trying to estimate (in this case, the population mean). This is what a confidence interval does. Typically, we look at 95% confidence intervals, and sometimes 99% confidence intervals, but they all have a similar interpretation: they are limits constructed such that for a certain percentage of samples (be that 95% or 99%) the true value of the population parameter falls within the limits (on average). So, when you see a 95% confidence interval for a mean, think of it like this: if we'd collected 100 samples, and for each sample calculated the mean and a confidence interval for it then for approximately 95 of these samples, the confidence interval contains the value of the mean in the population, and in approximately 5 of the samples the confidence interval does not contain the population mean. The trouble is, you do not know whether the confidence interval from a particular sample is one of the 95% that contain the true value or one of the 5% that do not.
The parameter estimate (in this case the mean) is always in the centre of the confidence interval. We know that 95% of confidence intervals contain the population value, so we might assume that this confidence interval contains the true mean (but remember this is an assumption and 5% of the time we will be incorrect). Under this assumption, if the interval is small, the sample mean must be very close to the true mean and if the confidence interval is very wide then the sample mean could be very different from the true mean, indicating that the sample estimate is a poor representation of the population. We'll now use an interactive activity to explore this explanation of a confidence interval.
r bmu()
Confidence interval explorer [(1)]{.alt}fluidPage( # Application title titlePanel(""), fluidRow( wellPanel( h3("Instructions"), p("Use this interactive app to explore and understand confidence intervals. The idea is that you set up a population with a particular mean. When you click the", tags$code("Take samples"), "button 100 random samples will be taken from that population. In each sample we will calculate the mean and construct a confidence interval around it. We will then plot the mean as a dot and the confidence interval as a horizontal bar ranging from the lower limit of the interval to the upper limit. A vertical line, will show the population mean. By default, the data appears for each sample one at a time."), p("Your main job is to pay attention to how often each confidence interval crosses the vertical line showing the population mean because in these situations the confidence interval contains the population value (i.e. contains the value we're trying to estimate). To help you out, samples that have confidence intervals that don't contain the population mean are plotted in a different colour to those that do."), p("By default, the sliders are set up to mimic the example in the book chapter. Use these default settings first."), p("1. Note that in each of the 100 samples we take, we are trying to estimate a population parameter (in this case the mean) using a point estimate (the sample mean) and an interval estimate (the 95% confidence interval). Normally, we don't know the value in the population, but in this simulation we do. By default it is 15. We want our samples to 'hit' this value, which is shown on the plot as a vertical line."), p("2. Click", tags$code("Take samples"), "and watch each sample appear on the plot. Note that (1) the samples yield different estimates (both the sample mean and the corresponding confidence interval differ from sample to sample); (2) some samples yield confidence intervals that contain the population value, others don't; (3) Note how many samples have confidence intervals that contain the population value (i.e. cross the grey, vertical, line)."), p("3. Click", tags$code("Take samples"), "again to take another 100 samples. Again, note the percentage of samples that have confidence intervals that contain the population value. (You can de-select the", tags$code("Animate?"), "checkbox before clicking", tags$code("Take samples"), "to display the final plot rather than having to wait for each sample to appear individually.)"), p("4. Keeping everything else the same, change the slider for the confidence interval percentage to and repeat the process. What do you notice about the percentage of samples that have confidence intervals that contain the population value?"), p("5. Keeping everything else the same, change the slider for the size of each sample and repeat the process. What do you notice about the width of the confidence intervals as the sample size increases and decreases?") ) ), fluidRow( column(6, align="center", sliderInput("mu", "Population mean", min = 0, max = 100, step = 5, value = 15), sliderInput("pop_sd", "Population standard deviation", min = 1, max = 30, step = 1, value = 5), actionButton("gen_dat", "Take samples") ), column(6, align="center", sliderInput("n_samp", "The size of each sample", min = 10, max = 200, step = 10, value = 30), sliderInput("ci_level", "Confidence interval %", min = 50, max = 100, step = 1, value = 95), checkboxInput("animate_cbx", label = "Animate?", value = TRUE) ), ), hr(), fluidRow(align="center", h4(textOutput("coverage"), align = "centre") ), fluidRow(align="center", plotOutput("ci_plot", height = "1000px") ) )
blu <- "#5c97bf" ong <- "#d47500" counter = 0 invalidateLaterNew <- function (millis, session = getDefaultReactiveDomain(), update = TRUE) { if(update){ ctx <- shiny:::.getReactiveEnvironment()$currentContext() shiny:::timerCallbacks$schedule(millis, function() { if (!is.null(session) && session$isClosed()) { return(invisible()) } ctx$invalidate() }) invisible() } } unlockBinding("invalidateLater", as.environment("package:shiny")) assign("invalidateLater", invalidateLaterNew, "package:shiny") get_cis <- function(){ tibble::tibble(.rows = 100) %>% dplyr::mutate( sample = as.numeric(rownames(.)), data = purrr::map(sample, ~rnorm(n = input$n_samp, mean = input$mu, sd = input$pop_sd)), mean_ci = purrr::map(data, ~ggplot2::mean_cl_normal(., conf.int = input$ci_level/100)), mean = purrr::map_dbl(mean_ci, ~.$y), ci_low = purrr::map_dbl(mean_ci, ~.$ymin), ci_upp = purrr::map_dbl(mean_ci, ~.$ymax), contains_mu = ifelse(input$mu >= ci_low & input$mu <= ci_upp, TRUE, FALSE) ) } ci_tib <- reactiveValues( data = tibble::tibble(.rows = 100) %>% dplyr::mutate( sample = as.numeric(rownames(.)), data = purrr::map(sample, ~rnorm(n = 30, mean = 15, sd = 5)), mean_ci = purrr::map(data, ~ggplot2::mean_cl_normal(., conf.int = 0.95)), mean = purrr::map_dbl(mean_ci, ~.$y), ci_low = purrr::map_dbl(mean_ci, ~.$ymin), ci_upp = purrr::map_dbl(mean_ci, ~.$ymax), contains_mu = ifelse(15 >= ci_low & 15 <= ci_upp, TRUE, FALSE) ), show = FALSE ) set_x <- function(x_min, x_max){ ci = input$ci_level/100 area = (1 + ci)/2 width = round(4*qnorm(area)*(input$pop_sd/sqrt(10))) step <- floor(width/5) if(step > 0){ x_breaks <- input$mu + step*seq(-5, 5, 1) } else { x_breaks <- round(input$mu + (width/5)*seq(-5, 5, 1), 2) } return(x_breaks) } get_plot <- function(tib, x_min, x_max){ p = ggplot(tib, aes(mean, sample, colour = factor(contains_mu))) + geom_vline(xintercept = input$mu, linetype = 1, colour = "grey45", size = 1) + coord_cartesian(xlim = c(min(set_x()), max(set_x())), ylim = c(0, 100)) + scale_y_continuous(breaks = 1:100) + scale_x_continuous(breaks = set_x(x_min, x_max)) + scale_colour_manual(values = c("TRUE" = blu, "FALSE" = ong)) + labs(x = "", y = "Sample number") + theme_minimal() + theme(legend.position = "none") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) #plot data when button clicked if(ci_tib$show == TRUE){ p + geom_errorbarh(xmax = tib$ci_upp, xmin = tib$ci_low, size = 0.75, height = 1) + geom_point(size = 4) } else { p } } output$ci_plot <- renderPlot(height = 1000, { ci_tib$data %>% get_plot(., min(.$ci_low), max(.$ci_upp)) }) # observe events observeEvent(input$gen_dat, { counter <<- 0 ci_tib$data = get_cis() ci_tib$show = TRUE x_min = min(ci_tib$data$ci_low) x_max = max(ci_tib$data$ci_upp) if(input$animate_cbx){ output$ci_plot <- renderPlot(height = 1000, { counter <<- counter + 1 invalidateLater(1000, session, counter < 100) isolate(ci_tib$data) %>% dplyr::filter(sample <= counter) %>% get_plot(., x_min, x_max) }) } else { output$ci_plot <- renderPlot(height = 1000, { isolate(ci_tib$data) %>% get_plot(., x_min, x_max) }) } #output$ci_plot = get_plot() output$coverage <- renderText({ paste0("Confidence intervals containing the population value = ", round(100*(sum(ci_tib$data$contains_mu)/100), 2), "%") }) })
Based on your explorations of confidence intervals, please try to answer these questions.
quiz(caption = "Confidence interval quiz", question("When the confidence interval width was 95% what did you observe?", answer("The percentage of samples with confidence intervals that included the population value was always 95%", message = "This statement could be true if you tried pressing the 'Take samples' only a few times, but in general this staement is false. The percentage of samples with confidence intervals that included the population value should not *always* be 95%. Go back to the interactive demo and take some more samples."), answer("The percentage of samples with confidence intervals that included the population value was never 95%", message = "This statement could be true if you tried pressing the 'Take samples' only a few times, but in general this staement is false. The percentage of samples with confidence intervals that included the population value should in the long run be 95%, so you'd expect a few of your sets of 100 samples to hit a 95% hit rate. Go back to the interactive demo and take some more samples."), answer("The percentage of samples with confidence intervals that included the population value seemed to fluctuate around 95%. Sometimes it was a little more, sometimes a little less.", message = "This statement should reflect what you'd see. In the long run (i.e. over infinite numbers of samples), the hit rate for 95% confidence intervals will be 95% but in the short run (i.e. for 100 samples) it might be slightly off that value. Try re-doing the exercise 5-10 times but this time note the percentage of samples with confidence intervals that included the population value each time you press 'Take samples'. You should find that if you average these values you get 95% (or close to it) showing that as you atke larger numbers of samples, the percentage of samples with confidence intervals that included the population value converge on 95%.", correct = TRUE), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ), question("What happened to the percentage of samples with confidence intervals containing the population value as you increased and decreased the confidence interval percentage?", answer("The percentage of samples with confidence intervals that included the population value seemed to fluctuate around the percentage I set.", correct = TRUE, message = "This statement should reflect what you'd see. In the long run (i.e. over infinite numbers of samples), the hit rate for confidence intervals will be the percentage you set but in the short run (i.e. for 100 samples) it might be slightly off that value. Note that as the percentage get higher you get less fluctuation, meaning that you have more certainty that your particular sample contains the population value."), answer("The percentage of samples with confidence intervals that included the population value was always the value that I set.", message = "This statement could be true if you tried pressing the 'Take samples' only a few times, but in general this staement is false. The percentage of samples with confidence intervals that included the population value should in the long run match the % value that you set on the slider."), answer("The percentage of samples with confidence intervals that included the population value was always 95%.", message = "This statement could be true if you didn't move the slider from the default position! Go back to the interactive demo and take some more samples but change the percentage for the confidence interval."), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ), question("What happened to the width of the confidence intervals as the sample size increased?", answer("The confidence intervals got narrower.", correct = TRUE, message = "This statement should reflect what you'd see. Large samples typically provide estimates of population values that more closely match the true values. Large sample also typically provide extimates of population values that vary less from sample to sample. Hence you get narrow intervals. Small samples tend to be more different from each other, and have more potential to provide estimates of population values that are more distant from the true values. Hence confidence intervals in small samples tend to be wider than large samples."), answer("The confidence intervals got wider.", message = "This should not have been what you saw. Re-do the exercise and pay particular attention to the values on the x-axis (which tell you about the width of the intervals)"), answer("The width of the confidence intervals stayed the same.", message = "This statement could be true if you didn't move the slider very far from the default positions. Go back to the interactive demo and take some more samples but set the sample size slider to the extremes."), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ) )
r bmu()
Confidence intervals with the tidyverse [(1)]{.alt}r bmu()
A basic confidence interval [(1)]{.alt}Let's compute a confidence interval using r rproj()
. In the book we have the number of followers for 11 instagram users:
57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22
r alien()
Alien coding challengeUse what you have learnt in previous tutorials to enter these scores into a variable called followers stored within a tibble called [insta_tib]{.alt}.
# There are numerous ways to do this task, this is the most efficient (probably) # You want to name the object insta_tib so start with ... insta_tib <- ...
# You are creating a tibble so ... insta_tib <- tibble::tibble(...) # Within this function create the variable followers
# Create the variable followers insta_tib <- tibble::tibble( followers = ... )
insta_tib <- tibble::tibble( followers = c(57, 40, 103, 234, 93, 53, 116, 98, 108, 121, 22) ) insta_tib #this line shows the tibble
As is often the case with r rproj()
, you have a few options for obtaining confidence intervals. We'll predominantly use two functions called mean_cl_normal()
and mean_cl_boot()
from the ggplot2
package (which is installed with tidyverse
). The function mean_cl_normal()
produces a standard confidence interval, whereas mean_cl_boot()
produces something called a bootstrap confidence interval (see [discovr_06]{.alt}).
r robot()
Code exampleThese functions take the general form:
ggplot2::mean_cl_normal(object, conf.int = 0.95, na.rm = TRUE) ggplot2::mean_cl_boot(object, conf.int = 0.95, na.rm = TRUE)
in which [object]{.alt} is a model being passed into the function, [conf.int]{.alt} sets the probability level for the interval (95% by default), and (as in other functions) [na.rm]{.alt} determines whether to remove missing values before computing the interval (by default, they are removed, which is helpful because that's usually what you'd want to do).
r alien()
Alien coding challengeIn this example, we want the 95% confidence interval for the variable followers within the tibble [insta_tib]{.alt}, so we could use the example code and replace [object]{.alt} with [insta_tib$followers]{.alt}. Try this out:
ggplot2::mean_cl_normal(insta_tib$followers) # Note: we don't need to include conf.int = 0.95 and na.rm = TRUE # because the defaults of a 95% CI and removing missing values are what # we want
This function produces a tibble containing three variables: the mean (with the variable name y), the lower boundary of the confidence interval (ymin), and the upper boundary of the confidence interval (ymax). To get something other than a 95% confidence interval, we change the default of [conf.int = 0.95]{.alt} within the function.
r alien()
Alien coding challengeAdapt the code from the previous challenge to get the 90% CI of the variable followers.
ggplot2::mean_cl_normal(insta_tib$followers, conf.int = 0.9)
r bmu()
Adding confidence intervals to summary tables [(1)]{.alt}In [discovr_02]{.alt} we explored code to compile a table of summary statistics for a variable. To recap, we could produce a summary table for these Instagram scores by executing:
insta_sum <- insta_tib |> dplyr::summarize( median = median(followers), mean = mean(followers), `trimmed mean 10%` = mean(followers, trim = 0.1), range = max(followers) - min(followers), `lower quartile` = quantile(followers, probs = 0.25), `upper quartile` = quantile(followers, probs = 0.75), IQR = IQR(followers), var = var(followers), sd = sd(followers) )
If we want to add the 95% confidence interval to this summary table we can do so by extracting values from the output of mean_cl_normal()
. The output of the function is a tibble within which each value is stored as a variable (called y for the estimate of the mean and ymin and ymax for the lower and upper boundary of the CI). We can, therefore, use the $
symbol to access each variable.
r robot()
Code exampleFor example, to obtain the lower boundary of the 95% confidence interval we'd execute:
ggplot2::mean_cl_normal(insta_tib$followers)$ymin
Therefore, we can add the upper and lower bound of the 95% confidence interval to the summary by extracting the relevant values from the mean_cl_normal()
function. We could also obtain the mean this way instead of using mean()
. For example, to create a table including the mean and its 95% confidence interval we could execute:
insta_sum <- insta_tib |> dplyr::summarize( Mean = ggplot2::mean_cl_normal(followers)$y, `95% CI Lower` = ggplot2::mean_cl_normal(followers)$ymin, `95% CI Upper` = ggplot2::mean_cl_normal(followers)$ymax )
You might wonder what the point of this is when the function itself spits out a summary table, but it is because it allows you to include other information in the table. For example, we could include all of the other summary statistics that we previously computed.
r alien()
Alien coding challengeAdapt the code below (which produces various summary statistics for the number of followers people have on social media) to include the mean, and lower and upper boundary of the 95% confidence interval.
insta_sum <- insta_tib |> dplyr::summarize( median = median(followers), range = max(followers) - min(followers), `lower quartile` = quantile(followers, probs = 0.25), `upper quartile` = quantile(followers, probs = 0.75), IQR = IQR(followers), var = var(followers), sd = sd(followers) ) insta_sum |> knitr::kable(caption = "Summary statistics for the Instagram data", align = 'c', #this argument centre aligns the columns digits = 2)
insta_sum <- insta_tib |> dplyr::summarize( Mean = ggplot2::mean_cl_normal(followers)$y, `95% CI Lower` = ggplot2::mean_cl_normal(followers)$ymin, `95% CI Upper` = ggplot2::mean_cl_normal(followers)$ymax, median = median(followers), range = max(followers) - min(followers), `lower quartile` = quantile(followers, probs = 0.25), `upper quartile` = quantile(followers, probs = 0.75), IQR = IQR(followers), var = var(followers), sd = sd(followers) ) insta_sum |> knitr::kable(caption = "Summary statistics for the Instagram data", align = 'c', #this argument centre aligns the columns digits = 2)
question("The 95% confidence interval for followers ranges from 56.85 to 133.15. What does this tell us?", answer("If this confidence interval is one of the 95% that contains the population value then the mean number of followers in the population lies between 56.85 and 133.15.", correct = TRUE), answer("There is a 95% chance that the mean number of followers in the population lies between 56.85 and 133.15.", message = "You cannot make probability statements from a confidence interval. We don't know whether this particular CI is one of the 95% that contains the population value of the mean number of followers."), answer("The probability of this confidence interval containing the population value is 0.95.", message = "The probability of this confidence interval containing the population value is either 0 (it doesn't) or 1 (it does) but it's impossible to know which."), answer("I can be 95% confident that the mean number of followers in the population lies between 56.85 and 133.15.", message = "Confidence intervals do not quantify your subjective confidence."), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T )
r bmu()
Robust confidence intervals [(1)]{.alt}To get a robust confidence interval (based on a bootstrap), we simply replace mean_cl_normal
with mean_cl_boot
.
r robot()
Code exampleFor example, to get the robust confidence interval for the mean number of followers we could execute:
ggplot2::mean_cl_boot(insta_tib$followers)
r alien()
Alien coding challengeAdapt the code below (which produces various summary statistics for the number of followers people have on social media) to include the mean, and lower and upper boundary of the 95% [bootstrap]{.alt} confidence interval.
insta_sum <- insta_tib |> dplyr::summarize( median = median(followers), range = max(followers) - min(followers), `lower quartile` = quantile(followers, probs = 0.25), `upper quartile` = quantile(followers, probs = 0.75), IQR = IQR(followers), var = var(followers), sd = sd(followers) ) insta_sum |> knitr::kable(caption = "Summary statistics for the Instagram data", align = 'c', #this argument centre aligns the columns digits = 2)
insta_sum <- insta_tib |> dplyr::summarize( Mean = ggplot2::mean_cl_boot(followers)$y, `95% CI Lower` = ggplot2::mean_cl_boot(followers)$ymin, `95% CI Upper` = ggplot2::mean_cl_boot(followers)$ymax, median = median(followers), range = max(followers) - min(followers), `lower quartile` = quantile(followers, probs = 0.25), `upper quartile` = quantile(followers, probs = 0.75), IQR = IQR(followers), var = var(followers), sd = sd(followers) ) insta_sum |> knitr::kable(caption = "Summary statistics for the Instagram data", align = 'c', #this argument centre aligns the columns digits = 2)
r user_visor()
Confidence intervals using datawizard
[(2)]{.alt}In the previous tutorial we saw that creating a bespoke table of summary statistics is useful and helps us to practice our tidyverse
skills. However, a quicker method is to use the describe_distribution()
function from the datawizard
package. Remember, it has the following arguments:
datawizard::describe_distribution(x = my_data, select = NULL, exclude = NULL, centrality = "mean", dispersion = TRUE, iqr = TRUE, range = TRUE, quartiles = FALSE, include_factors = FALSE, ci = NULL, iterations = 100)
These arguments were explained in discovr_02
, but in that tutorial we ignored the argument to compute confidence intervals. Now we'll use it. To include a confidence interval we include the argument [ci = proportion]{.alt} in which we replace proportion with a value that represents the percentage of the interval, just like we did with mean_cl_normal()
earlier on. To recap, for a 95% confidence interval the proportion is 0.95 so you'd include [ci = 0.95]{.alt} within describe_distribution()
. This function always uses a bootstrap method with a default of 100 bootstrap samples ([iterations = 100]{.alt}). This is possibly a little low (but keeps computations fast), so consider increasing it to 500 or even 1000.
r alien()
Alien coding challengeUsing the default options, get some descriptive statistics including a 95% confidence interval for the number of Instagram followers.
#replace the xxx datawizard::describe_distribution(xxx)
#replace the xxx datawizard::describe_distribution(insta_tib) # now add the argument to set a 95% confidence interval and the number of samples/iterations
#replace the xxx datawizard::describe_distribution(insta_tib, ci = 0.95, iterations = 500) # now use kable() to round the values to 2dp
datawizard::describe_distribution(insta_tib, ci = 0.95, iterations = 500) |> knitr::kable(digits = 2, align = 'c')
r rproj()
r rproj()
and r rstudio()
.r rstudio()
cheat sheets.r rstudio()
list of online resources.I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.