In dundeemath/MA22004labs: MA22004 (University of Dundee)

```{css, echo=FALSE} body { counter-reset: li; / initialize counter named li / }

ol { margin-left:0; / Remove the default left margin / padding-left:0; / Remove the default left padding / } ol > li { position:relative; / Create a positioning context / margin:0 0 10px 2em; / Give each list item a left margin to make room for the numbers / padding:10px 80px; / Add some spacing around the content / list-style:none; / Disable the normal item numbering / border-top:2px solid #317EAC; background:rgba(49, 126, 172, 0.1); } ol > li:before { content:"Exercise " counter(li); / Use the counter as content / counter-increment:li; / Increment the counter by 1 / / Position and style the number / position:absolute; top:-2px; left:-2em; -moz-box-sizing:border-box; -webkit-box-sizing:border-box; box-sizing:border-box; width:7em; / Some space between the number and the content in browsers that support generated content but not positioning it (Camino 2 is one example) / margin-right:8px; padding:4px; border-top:2px solid #317EAC; color:#fff; background:#317EAC; font-weight:bold; font-family:"Helvetica Neue", Arial, sans-serif; text-align:center; } li ol, li ul {margin-top:6px;} ol ol li:last-child {margin-bottom:0;}

.oyo ul { list-style-type:decimal; }

hr { border: 1px solid #357FAA; }

div#boxedtext { background-color: rgba(86, 155, 189, 0.2); padding: 20px; margin-bottom: 20px; font-size: 10pt; }

div#template { margin-top: 30px; margin-bottom: 30px; color: #808080; border:1px solid #808080; padding: 10px 10px; background-color: rgba(128, 128, 128, 0.2); border-radius: 5px; }

div#license { margin-top: 30px; margin-bottom: 30px; color: #4C721D; border:1px solid #4C721D; padding: 10px 10px; background-color: rgba(76, 114, 29, 0.2); border-radius: 5px; }

/------------ Fancy blocks --------------/

.noteblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #042e6b; background: 1.5em center/2.5em no-repeat; background-image: url("images/info-circle.svg"); }

.tipblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #d5d5ce; background: 1.5em center/2em no-repeat; background-image: url("images/lightbulb.svg"); }

.warningblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #be4e00; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-triangle.svg"); }

.importantblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #c30000; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-circle.svg"); }

.exerciseblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #317EAC; background: rgba(49, 126, 172, 0.1) 1.5em center/2.5em no-repeat; background-image: url("images/flag.svg"); }

```r
# packages
library(MA22004labs)
library(tidyverse)
library(janitor)
library(infer)
library(learnr)
library(gradethis)
library(knitr)

# tutorial options
tutorial_options(
  # code running in exercise times out after 30 seconds
  # if it is taking more than 30 s something is wrong 
  exercise.timelimit = 30
  )
# use gradethis for checking
gradethis_setup()
options(width = 60)

# hide non-exercise code chunks
knitr::opts_chunk$set(echo = FALSE, error = TRUE, 
                      comment = "", fig.align = "center", fig.width = 6, 
                      fig.asp = 0.75, out.width = "100%")

data(aps_sco)
data(aps_sco_synth)
set.seed(8675309)
aps_sco_synth_tiny <- slice_sample(aps_sco_synth, prop=0.2)

Getting started

This lab considers inferences for categorical data.

:::{.noteblock} The report for Lab 6 consists of answers to exercises 1-4 in the last section of this tutorial. Upload the submission to Gradescope as a PDF. To create your new lab report, in RStudio, go to New File > R Markdown, then choose From Template and select MA22004 Lab Report from the list of templates. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the first Section, "RStudio and lab reports", of Lab 1. :::

Packages

In this lab, we will explore and visualise the data using the tidyverse suite of packages and perform statistical inference using infer.

Recall that all the necessary packages can be loaded using the library command.

library(MA22004labs)
library(tidyverse)
library(janitor)
library(infer)

The APS Unemployment in Scotland Data

:::{.tipblock} The Annual Population Survey (APS) is a continuous annual household survey of the UK population, with a sample size of 320,000, carried out by the Office of National Statistics (ONS). :::

Using the APS survey, the ONS estimates an unemployment rate for people (aged 16 and older) by ethnic group in Scotland. These estimates are reported quarterly and refer to the preceding twelve-month period. Note that the unemployment rate here only relates to people in the labour force, i.e., who are "economically active". Unemployment refers to people without a job, who can start work within two weeks following their survey interview, and who had either looked for work in the previous four weeks or were waiting to start a job. People that have stopped looking for work are not included in this unemployment figure. The APS Unemployment in Scotland Data in this lab covers the reporting period from 2012 Q4 through 2022 Q1 and is in a data frame named aps_sco. This data (and other ONS official census and labour market statistics) can be accessed through NOMIS (www.nomisweb.co.uk).

The APS Unemployment in Scotland Data reports the estimated number of people (aged 16 and over) unemployed in Scotland from 2012 Q4 through 2022 Q1. The total unemployment rate and the rate for two ethnic groups are given: white and minority (all ethnic groups excluding white).

:::{.exerciseblock} Take a peak at the data.

head(`___`)

head(aps_sco)
# examine variables, types, and the first few observations
# also, try help(aps_sco)

:::

Let's plot the estimated unemployment rates in Scotland over time by ethnicity. (Note that the APS also reports uncertainty estimates in their rate estimates; however, we exclude these from our investigation.)

dat <- aps_sco |>
 select(yq, white, minority) |>
 pivot_longer(cols = !yq, names_to = "ethnicity", values_to = "rate")

ggplot(dat, aes(x = yq)) + geom_line(aes(y = rate, color = ethnicity)) +
 scale_x_yearqtr(format = "%Y Q%q", n = 6) + 
 labs(x = "Date", y = "Unemployment rate (aged 16+)")

The ONS estimates that 3.7% of white and 6.9% of minority people were unemployed in Scotland for the year up to 2019 Q1. We would like to understand whether the difference in these estimated proportions is representative of a difference that would be observed if we could sample the whole of Scotland's labour force or whether the observed difference is due to chance variation (due to the individuals sampled). Is there a statistically significant difference between the unemployment rates of different ethnic groups in Scotland?

:::{.exerciseblock} What additional information would you need to know to test this?

question_text(
  "Note your thoughts on what additional information would be needed as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Inference on proportions

The ONS seeks insight into population parameters using survey data.

For example, the ONS surveys a sample of people in Scotland and tries to infer the proportion of unemployed people in Scotland. The question "What proportion of people in your sample of the labour force are unemployed?" is answered by a statistic. However, the question "What proportion of people in the Scottish labour force are unemployed?" is answered with an estimate of the population parameter. Here the statistic is the sample proportion of unemployed people in the labour force, the parameter is the proportion of unemployed people, and the population is the Scottish labour force.

The inferential tools for estimating population proportions are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.

The ONS does not publish the full APS results or the number of people surveyed in Scotland; however, let us suppose that the survey samples 26,137 people aged 16 and older in the Scottish labour force, of which 24,884 identify as white and 1,253 identify as a (non-white) minority. A data set of fictitious results from a survey on Scottish labour force participation and ethnicity is available in aps_sco_synth.

Is the Scottish labour force's unemployment rate statistically different for white and minority people?

This question refers to a hypothesis test on the difference in proportions. Let (p_m) and (p_w) denote the proportions of unemployed minority and white people aged 16 and over in Scotland. Our null hypothesis is then, [H_0 : \quad p_m-p_w =0\,,] and we test this against the alternative, [H_a : \quad p_m-p_w \neq 0\,.] One could also argue that from prior experience, e.g., citing evidence of similar analysis from other locations and contexts, we anticipate the unemployment rate for minority people to exceed white people. In that case, it is also reasonable to test against the alternative, [H_a : \quad p_m-p_w > 0\,.]

:::{.warningblock} The critical thing to remember is that the decision about which alternative to test against must be made honestly, e.g., before looking at the data and making calculations (why?). :::

Hypothesis testing "by hand"

Consider the data set aps_sco_synth.

help(aps_sco_synth)

To make calculating "by hand" easier, use the tabyl command from the janitor package can be used to summarise the data. For example, a table of sample proportions can be quickly obtained.

p <- aps_sco_synth |> 
 tabyl(ethnicity, unemployed) |> 
 adorn_percentages()
p

You can observe from the table generated above that the synthetic data was created to replicate the reported estimates from 2019 Q1.

:::{.exerciseblock} Try using tabyl to create a summary of counts with totals.

aps_sco_synth |> tabyl(`___`, `___`) |> 
  `___`

n <- aps_sco_synth |> 
 tabyl(ethnicity, unemployed) |> 
 adorn_totals(where = c("row","col"))
n

:::

:::{.exerciseblock} Recalling the procedure in §4.3 of the lecture notes, calculate the test statistic and hence the (P)-value for a two-sided hypothesis test concerning the equality of the population proportions: [H_0 : \quad p_m-p_w =0\,,] [H_a : \quad p_m-p_w \neq 0\,.]

n <- aps_sco_synth |> tabyl(ethnicity, unemployed) |> 
  adorn_totals(where = c("row","col"))
p <- aps_sco_synth |> tabyl(ethnicity, unemployed) |> 
  adorn_percentages()

phat <- (n$Total[1] * p$yes[1] + n$Total[2] * p$yes[2]) / n$Total[3]

# test statistic
z <- (p$yes[1] - p$yes[2]) / 
  sqrt(phat*(1-phat)* (1/n$Total[1] + 1/n$Total[2]))

# two-sided test
Pval <- 2*(1-pnorm(z))

:::

:::{.exerciseblock} Based on the (P)-value found above, would you accept or reject the null hypothesis? How would you place the result of the hypothesis test in the context of the problem and (synthetic) data?

question_text(
  "Place the outcome of the hypothesis text in the context using free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

The null distribution

Now we will use the package infer to investigate the null distribution.

The infer workflow allows us to specify a statistic and calculate that statistic. In this instance, we are interested in whether or not the proportions (p_m) and (p_w) are equal (this is another way of writing the null hypothesis (H_0 : \quad p_m - p_w =0). In the infer package, this functional relationship between unemployment and ethnicity is specified using the notation unemployed ~ ethnicity. (You'll see this notation again when we consider linear models.) Note it's necessary to include a success argument within specify to indicate which response in unemployed we view as a "success". Putting this all together, we can calculate the observed difference in proportions (\hat{p}_m - \hat{p}_w).

# calculate statistic
obs_diff <- aps_sco_synth |>
  # unemployed is independent of ethnicity, since "yes" unemployed
  specify(unemployed ~ ethnicity, success = "yes") |> 
  #specify p_minority - p_white
  calculate(stat = "diff in props", order = c("minority", "white")) 
obs_diff

To conduct a hypothesis test "by hand", you calculated a single realisation of the test statistic. Suppose we could survey other people in the Scottish labour force. In that case, we could use these repeated observations of the test statistic to build a histogram of the null distribution (the distribution of the test statistic, assuming the null hypothesis is true). We can use the infer package to generate realisations of the test statistic using sampling techniques (permutations and bootstrapping).

# calculate the distribution of observed differences 
null_dist <- aps_sco_synth |> 
  specify(unemployed ~ ethnicity, success = "yes") |>
  # null hypothesis is on independence, or p_minority - p_white = 0
  hypothesise(null = "independence") |>
  # based on permuting samples 
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("minority", "white"))
head(null_dist)

obs_diff <- aps_sco_synth |>
  specify(unemployed ~ ethnicity, success = "yes") |> 
  calculate(stat = "diff in props", order = c("minority", "white")) 
null_dist <- aps_sco_synth |> 
  specify(unemployed ~ ethnicity, success = "yes") |>
  hypothesise(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("minority", "white"))

The null distribution can then be visualised and the (P)-value calculated. The visualisation is the histogram of the null distribution, that is, the histogram of repeated observations of the test statistic, assuming the null hypothesis is true.

# plot sampled/bootstrapped null distribution
null_dist |>
 visualize() +
 shade_p_value(obs_stat = obs_diff, direction = "two-sided")

# calculate two-sided p-value (compare to p1)
p_val <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
p_val

How does the (P)-value compare to the value computed "by hand"?

Margin of error

Imagine you've set out to survey 1000 people on two questions: are you at least 6 feet tall? and are you left-handed? Since both sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error changes with sample size, the proportion also affects the margin of error.

The margin of error for a 95% confidence interval is given by, [\mathsf{ME} = 1.96 \cdot \mathsf{se} = 1.96 \cdot \sqrt{p(1-p)/n} \,,] where $\mathsf{se}$ is the standard error. Since the population proportion, $p$, is in this $\mathsf{ME}$ formula, it should make sense that the margin of error depends on the population proportion. We can visualise this relationship by creating a plot of $\mathsf{ME}$ vs $p$.

Since the sample size is irrelevant to this discussion, let's just set it to some value ($n = 1000$) and use the fixed value in the following calculations. The first step is to make a variable p that is a sequence from 0 to 1, with each number incremented by 0.01. You can then create a variable of the margin of error (me) associated with each of these values of p using the familiar approximate formula ($\mathsf{ME} = 2 \times \mathsf{se}$).

n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)

Lastly, you can plot the two variables against each other to reveal their relationship. To do so, we need to first put these variables in a data frame (or tibble) that you can call in the ggplot function. Construct a plot of margin of error vs population proportion $p$.

dd <- tibble(p = p, me = me)
`___`

dd <- tibble(p = p, me = me)
ggplot(dd, aes(x = `___`, y = `___`)) + 
  `___` +
  labs(x = "Population Proportion", y = "Margin of Error")

dd <- tibble(p = p, me = me)
ggplot(dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population proportion", y = "Margin of error")

:::{.exerciseblock} Describe the relationship between a population proportion and the margin of error. For a given sample size, for which value of $p$ is the margin of error maximised?

question_text(
  "Enter your observations and thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Success-failure condition

We have emphasised that you must always check conditions before making an inference. For inference on the difference of two proportions (\hat{p}_X - \hat{p}_Y), the test statistic can be assumed to be nearly normal if it is based upon a random sample of independent observations and if (n_X \hat{p}_X \geq 10), (n_X(1 - \hat{p}_X) \geq 10), (n_Y \hat{p}_Y \geq 10), and (n_Y(1 - \hat{p}_Y) \geq 10). Similarly, for inferences about a single proportion, (\hat{p}) we required (n \hat{p} \geq 10) and (n(1 - \hat{p}) \geq 10). These rules of thumb are easy enough to follow, but it makes you wonder: what's so special about the number 10?

The short answer is: nothing. You could argue that you would be fine with nine or should be using 11. The "best" value for such a rule of thumb is, at least to some degree, arbitrary. However, when $n\hat{p}$ and $n(1-\hat{p})$ reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.

You can investigate the interplay between (n) and (p) and the shape of the sampling distribution for (\hat{p}) by using simulations. Play around with the following app to investigate how the shape, centre, and spread of the distribution of $\hat{p}$ changes as $n$ and $p$ change. Try changing (n) and (p) individually.

# Change n and p, then rerun code
n <- 100     
p <- 0.1    

# Plot command (no need to change)
obs_dist <- tibble(p_hat = rep(0, 5000))
for(i in 1:5000){
 samp <- sample(c(TRUE, FALSE), n, replace = TRUE, 
                prob = c(p, 1 - p))
 obs_dist$p_hat[i] <- sum(samp == TRUE) / n
}
ggplot(obs_dist, aes(x = p_hat)) +
 geom_histogram(binwidth = 1.06*sd(obs_dist$p_hat)*n^(-1/5)) +
 scale_x_continuous(limits = c(0,1)) + 
 ggtitle(paste0("Simulated distribution of p_hat, for p = ", p, ", n = ", n))

:::{.exerciseblock} Describe the sampling distribution of sample proportions at $n = 300$ and $p = 0.1$. Be sure to note the centre, spread, and shape.

question_text(
  "Note your thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

:::{.exerciseblock} Keep $n$ constant and change $p$. How does the sampling distribution's shape, centre, and spread vary as $p$ changes? You might want to adjust the min and max for the $x$-axis for a better view of the distribution.

question_text(
  "Note your thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

:::{.exerciseblock} Now also change $n$. How does $n$ affect the distribution of $\hat{p}$?

question_text(
  "Note your thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Exercises

:::{.warningblock} Please work through the interactive tutorial for this lab. Your lab report should provide answers to the exercises below. For full marks, please answer the exercises in complete sentences, be mindful of appropriate usage, spelling, and grammar, and use $\LaTeX$ for mathematics where applicable. Use the MA22004 Lab Report template to create your lab report. Upload your final submission as a PDF to Gradescope. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the interactive tutorial for Lab 1. :::

The following exercises refer to variables found in or derived from the APS Unemployment in Scotland Data in the data frames aps_sco and aps_sco_synth that are packaged with MA22004labs.

The APS Unemployment in Scotland Data from 2022 Q1 estimates that 7.8% of white and 9.5% of minority people (aged 16 and older) in Scotland are unemployed. Based on the overall size of the survey and the proportion of the population who belong to each ethnic group, suppose that 22,623 white and 1,291 minority people were surveyed. Does the data support that unemployment rates for white and minority people in the Scottish labour force are significantly different at level 0.10? Identify your hypothesis, place the outcome of your hypothesis test in the context of the problem, and check applicable rules of thumb for the inference you are conducting. [7 marks]
Does the answer to exercise 1 prove that employers are biased? Justify your answer, for example, by considering what other factors might influence the difference in unemployment rates between white and minority people in Scotland. [3 marks]
Many factors could contribute to the difference in unemployment rates between ethnic groups. We can control for education level by looking at the unemployment rates among minority and white people with the same education level. Suppose that among those with qualifications at or above degree level or equivalent, the unemployment rate is 1.9% for white and 3.9% for minority people. Based on the qualifications by ethnicity in Scotland, suppose that 5,722 white and 615 minority people were surveyed. Is the unemployment rate among people with a degree level qualification (or higher) significantly different at level 0.1 for white and minority people in Scotland? Identify your hypothesis, place the outcome of your hypothesis test in the context of the problem, and check applicable rules of thumb for the inference you are conducting. [7 marks]
Using ggplot, construct a plot of population proportion $p$ vs margin of error for a fixed sample size. Remember to label your plot. Describe the relationship between a population proportion and the margin of error. For a given sample size, for what value of $p$ is the margin of error maximised? [3 marks]

Creative Commons License {style="border-width:0"}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This work is derivative of OpenIntro Labs and was modified by Eric Hall of the University of Dundee Mathematics Division. The original labs were adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from labs written by Mark Hansen of UCLA Statistics.

Ideas regarding the social justice component of this lab are taken from Abra Brisbin, "I Need a Job!" Analyzing Unemployment Rates in College Algebra and Introductory Statistics, in Mathematics for Social Justice (eds. Gizem Karaali and Lily S. Khadjavi), AMS/MAA, Providence, RI, 2021.

dundeemath/MA22004labs documentation built on Sept. 18, 2024, 10:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com