```{css, echo=FALSE} body { counter-reset: li; / initialize counter named li / }

ol { margin-left:0; / Remove the default left margin / padding-left:0; / Remove the default left padding / } ol > li { position:relative; / Create a positioning context / margin:0 0 10px 2em; / Give each list item a left margin to make room for the numbers / padding:10px 80px; / Add some spacing around the content / list-style:none; / Disable the normal item numbering / border-top:2px solid #317EAC; background:rgba(49, 126, 172, 0.1); } ol > li:before { content:"Exercise " counter(li); / Use the counter as content / counter-increment:li; / Increment the counter by 1 / / Position and style the number / position:absolute; top:-2px; left:-2em; -moz-box-sizing:border-box; -webkit-box-sizing:border-box; box-sizing:border-box; width:7em; / Some space between the number and the content in browsers that support generated content but not positioning it (Camino 2 is one example) / margin-right:8px; padding:4px; border-top:2px solid #317EAC; color:#fff; background:#317EAC; font-weight:bold; font-family:"Helvetica Neue", Arial, sans-serif; text-align:center; } li ol, li ul {margin-top:6px;} ol ol li:last-child {margin-bottom:0;}

.oyo ul { list-style-type:decimal; }

.oyo ul { list-style-type:decimal; }

hr { border: 1px solid #357FAA; }

div#boxedtext { background-color: rgba(86, 155, 189, 0.2); padding: 20px; margin-bottom: 20px; font-size: 10pt; }

div#template { margin-top: 30px; margin-bottom: 30px; color: #808080; border:1px solid #808080; padding: 10px 10px; background-color: rgba(128, 128, 128, 0.2); border-radius: 5px; }

div#license { margin-top: 30px; margin-bottom: 30px; color: #4C721D; border:1px solid #4C721D; padding: 10px 10px; background-color: rgba(76, 114, 29, 0.2); border-radius: 5px; }

/------------ Fancy blocks --------------/

.noteblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #042e6b; background: 1.5em center/2.5em no-repeat; background-image: url("images/info-circle.svg"); }

.tipblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #d5d5ce; background: 1.5em center/2em no-repeat; background-image: url("images/lightbulb.svg"); }

.warningblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #be4e00; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-triangle.svg"); }

.importantblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #c30000; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-circle.svg"); }

.exerciseblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #317EAC; background: rgba(49, 126, 172, 0.1) 1.5em center/2.5em no-repeat; background-image: url("images/flag.svg"); }

```r
# packages
library(MA22004labs)
library(tidyverse)
library(infer)
library(learnr)
library(gradethis)
library(knitr)

# tutorial options
tutorial_options(
  # code running in exercise times out after 30 seconds
  # if it is taking more than 30 s something is wrong 
  exercise.timelimit = 30
  )
# use gradethis for checking
gradethis_setup()
options(width = 60)

# hide non-exercise code chunks
knitr::opts_chunk$set(echo = FALSE, error = TRUE, 
                      comment = "", fig.align = "center", fig.width = 6, 
                      fig.asp = 0.75, out.width = "100%")

data(yrbss)

obs_stat <- yrbss |> 
  specify(response = weight) |> 
  calculate(stat = "mean")

null_dist1 <- yrbss |> 
  specify(response = weight) |> 
  hypothesise(null = "point", mu = 68) |>
  generate(reps = 200, type = "bootstrap") |>
  calculate(stat = "mean") 

yrbss_exercise <- yrbss |> 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no")) |>
  filter(!is.na(weight)) |> filter(!is.na(physically_active_7d))

yrbss_exercise$physical_3plus <- factor(yrbss_exercise$physical_3plus)

obs_diff <- yrbss_exercise |>
  specify(weight ~ physical_3plus) |>
  calculate(stat = "diff in means", order = c("yes", "no"))

null_dist2 <- yrbss_exercise |>
  specify(weight ~ physical_3plus) |>
  hypothesize(null = "independence") |>
  generate(reps = 200, type = "permute") |>
  calculate(stat = "diff in means", order = c("yes", "no"))

data(yrbss_samp)

Getting started

This lab considers inferences for numerical data.

:::{.noteblock} The report for Lab 5 consists of answers to exercises 1-5 in the last section of this tutorial. Upload the submission to Gradescope as a PDF. To create your new lab report, in RStudio, go to New File > R Markdown, then choose From Template and select MA22004 Lab Report from the list of templates. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the first Section, "RStudio and lab reports", of Lab 1. :::

Packages

In this lab, we will explore and visualise the data using the tidyverse suite of packages and perform statistical inference using infer.

Recall that all the necessary packages can be loaded using the library command.

library(MA22004labs)
library(tidyverse)
library(infer)

The YRBSS Survey Data

:::{.tipblock} Every two years, the Centers for Disease Control and Prevention in the United States conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects response data from high school-aged youth (ca. 14 to 18-year-olds). The YRBSS survey is carried out to analyse health patterns.

The YRBSS Survey Data includes a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted. The data frame yrbss contains observations for 13 variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file ?yrbss. :::

Load the yrbss data set into your workspace by calling the data command.

data(yrbss)

:::{.exerciseblock} Consider yrbss. What are the variables in the YRBSS Survey Data? How many records are there in the sample?


# look, but don't include this output in your final lab report 
glimpse(yrbss) 
question_text(
  "Enter your observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Exploratory data analysis

You will first start with analysing the weight of the participants in kilograms: weight.

Using visualisation and summary statistics, describe the distribution of weights. The summary function can be helpful.

summary(yrbss$weight)
question(
  "How many observations are we missing weights from?",
  answer("1004", correct = TRUE),
  answer("none", message = "Unfortunately, not everyone fills out a survey completely."),
  answer("29.94"),
  answer("67"),
  answer("500"),
  allow_retry = TRUE,
  random_answer_order = TRUE
)

Next, consider the possible relationship between a youth's weight and physical activity. Plotting the data is an excellent first step because it helps us quickly visualise trends, identify strong associations, and develop research questions.

First, let's create a new variable, physical_3plus, which will be coded as either "yes" if they are physically active for at least three days a week and "no" if not. This can be done using the following code (note: since the variable physical_3plus is not part of the yrbss, you will need to recreate this variable if used in your lab report).

yrbss_exercise <- yrbss |> 
  mutate(physical_3plus = ifelse(yrbss$physically_active_7d > 2, "yes", "no")) |>
  filter(!is.na(weight)) |> filter(!is.na(physically_active_7d))

:::{.exerciseblock} Make a boxplot (using geom_boxplot()) of weight by the new categorical variable physical_3plus. Is there a relationship between engaging in physical activity for at least three days a week and weight? What did you expect and why?


ggplot(`___`, aes(x = `___`, y = `___`)) + geom_boxplot()
question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

The box plots show how the medians of the two distributions compare. Still, we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.

yrbss_exercise |>
  group_by(physical_3plus) |>
  summarise(mean_weight = mean(weight, na.rm = TRUE))

The summary above indicates an observed difference in the mean weight, but is this difference statistically significant? To answer this question, we will conduct a hypothesis test.

Inference

In our lecture notes, we considered a process for determining which of two contradictory claims, or hypotheses, about a population parameter is correct. The decision to reject or fail to reject the null hypothesis involved computing a test statistic that was then compared to a critical value from a reference distribution. Choosing the reference distribution for the test relies on some conditions being satisfied.

:::{.exerciseblock} Consult the sections of the lecture notes on hypothesis testing (or your favourite statistics text). Are all conditions necessary for making an inference satisfied? Comment on each. You can compute the group sizes with the summarise command as above by defining a new variable with the definition n().


yrbss_exercise |>
  group_by(`___`) |>
  summarise(mean_weight = `___`, samp_size = `___`)
yrbss_exercise |>
  group_by(physical_3plus) |>
  summarise(mean_weight = mean(weight, na.rm = TRUE), samp_size = n())
question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

If we are satisfied that the sample satisfies the conditions, we may carry out the inference procedures we learned in class. However, are there more flexible approaches?

Simulation methods, such as bootstrapping and permutation, do not assume any underlying data distribution. With the help of computers, these methods can be used to construct the null distribution, that is, the probability distribution of the test statistic, assuming the null hypothesis is true. The null distribution can then be used to conclude the hypothesis test.

Example Test 1: Is the true mean weight different from 68 kg?

Write the hypotheses for testing if the true mean weight differs from 68 kg.

question_text(
  "Enter the null hypothesis.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
question_text(
  "Enter the alternative hypothesis.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

We will use a new function, hypothesise, from the infer package. Recall the infer package is a pedagogical tool. The infer workflow can calculate test statistics, build the null distribution using sampling, and visualise associated objects.

To conduct the hypothesis test using infer, we first need to initialise the test by calculating the observed statistic obs_stat.

# calculating the observed mean from the sample
obs_stat <- yrbss |> 
  specify(response = weight) |> 
  calculate(stat = "mean")

Notice how you can use the functions specify and calculate again as you did for calculating confidence intervals.

Next, we use bootstrapping to simulate the null distribution. A bootstrap sample is drawn from the input sample data for each replicate, with replacement.

# use bootstrapping to calculate the 
# null distribution of weight, under the null hypothesis 
# that the mean weight is 68
null_dist1 <- yrbss |> 
  specify(response = weight) |> 
  hypothesise(null = "point", mu = 68) |>
  generate(reps = 200, type = "bootstrap") |>
  calculate(stat = "mean") 

The null distribution can be compared to the observed statistic.

# visualise the simulation-based null distribution
null_dist1 |>
  visualise() + shade_p_value(obs_stat, direction = "two-sided")

$P$-values can also be extracted from the null distribution.

# visualise the simulation-based null distribution
null_dist1 |>
  get_p_value(obs_stat, direction = "two-sided")

Since the $P$-value is large, the data provide weak evidence against the null hypothesis. We fail to reject the null hypothesis that the mean is different from 68 kg. We reasonably retain the view that true mean weight is 68 kg.

Example Test 2: Are mean weights are different for those who exercise at least times a week and those who do not?

Now consider testing if the mean weights are different for those who exercise at least times a week and those who do not. Record the hypotheses for this test.

question_text(
  "Enter the null hypothesis.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
question_text(
  "Enter the alternative hypothesis.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

To conduct the hypothesis test using infer, we first calculate the observed statistic, in this case, the difference in means obs_diff between the two groups.

obs_diff <- yrbss_exercise |>
  specify(weight ~ physical_3plus) |>
  calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff

Here, the statistic you are searching for is the difference in means, $\mu_{yes} - \mu_{no} \neq 0$, and it is necessary to indicate the order as being "yes" and then "no".

After initialising the test, you can use infer to simulate the null distribution. Below, we use permutation to simulate the null distribution. In permutation, for each replicate, each input value is randomly reassigned (without replacement) to a new output value in the sample.

null_dist2 <- yrbss_exercise |>
  specify(weight ~ physical_3plus) |>
  hypothesise(null = "independence") |>
  generate(reps = 200, type = "permute") |>
  calculate(stat = "diff in means", order = c("yes", "no"))

Here, hypothesise is used to set the null hypothesis as a test for independence. Compare this to the previous example, where the null argument was set to "point" to test a hypothesis relative to a point estimate.

We can visualise this null distribution with the following code:

null_dist2 |>
   visualise() + shade_p_value(obs_diff, direction = "two-sided")

How many of these null permutations have a difference of at least obs_diff$stat?


null_dist2$stat >= obs_diff$stat
sum(null_dist2$stat >= obs_diff$stat)

Now that the test is initialised and the null distribution formed, you can calculate the $P$-value for your hypothesis test using the function get_p_value.

null_dist2 |>
  get_p_value(obs_stat = obs_diff, direction = "two_sided")

The $P$-value is vanishingly small! The data provide strong evidence to reject the null hypothesis on independence. The mean weight among the two groups, those that exercise three or more times a week and those that do not, differ; this difference is statistically significant at any reasonable level.

This is the standard workflow for performing hypothesis tests using simulation!

:::{.exerciseblock} Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who do not. Interpret this interval estimate in the context of the data.


:::

Exercises

:::{.warningblock} Please work through the interactive tutorial for this lab. Your lab report should provide answers to the exercises below. For full marks, please answer the exercises in complete sentences, be mindful of appropriate usage, spelling, and grammar, and use $\LaTeX$ for mathematics where applicable. Use the MA22004 Lab Report template to create your lab report. Upload your final submission as a PDF to Gradescope. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the interactive tutorial for Lab 1. :::

The following exercises refer to variables found in or derived from the YRBSS Survey Data in the data frame yrbss, which is packaged with MA22004labs.

:::{.tipblock} Every two years, the Centers for Disease Control and Prevention in the United States conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects response data from high school-aged youth (ca. 14 to 18-year-olds). The YRBSS survey is carried out to analyse health patterns.

The YRBSS Survey Data includes a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted. The data frame yrbss contains observations for 13 variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file ?yrbss. :::

  1. Create a new categorical variable, physical_3plus, denoting those who engage in physical activity for at least three days a week (e.g., appropriate levels would be "no", "yes", and "NA"). Using ggplot, make a boxplot of height by physical_3plus. Is there a relationship between engaging in physical activity for at least three days a week and height? What did you expect and why? [3 marks]

  2. Calculate a 90% confidence interval for the average respondent height in meters (height) and interpret it in context. Use both the infer package and normality theory to generate interval estimates. Compare the two interval estimates. [5 mark]

  1. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who do not, at a confidence level of $0.01$. Please use the infer package to conduct the test. [5 marks]

  2. Now, a non-inference task: determine the number of different levels or categories for the categorical variable hours_tv_per_school_day. [2 marks]

  3. Come up with a research question evaluating the relationship between height (or weight) and sleep. Formulate the question so it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results and also explain in plain language. Be sure to check all assumptions, state your confidence level (i.e., $\alpha$), and place your conclusion in context. [5 marks]


Creative Commons License{style="border-width:0"}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This work is derivative of OpenIntro Labs and was modified by Eric Hall of the University of Dundee Mathematics Division. The original labs were adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from labs written by Mark Hansen of UCLA Statistics.



dundeemath/MA22004labs documentation built on Sept. 18, 2024, 10:51 p.m.