```{css, echo=FALSE} body { counter-reset: li; / initialize counter named li / }

ol { margin-left:0; / Remove the default left margin / padding-left:0; / Remove the default left padding / } ol > li { position:relative; / Create a positioning context / margin:0 0 10px 2em; / Give each list item a left margin to make room for the numbers / padding:10px 80px; / Add some spacing around the content / list-style:none; / Disable the normal item numbering / border-top:2px solid #317EAC; background:rgba(49, 126, 172, 0.1); } ol > li:before { content:"Exercise " counter(li); / Use the counter as content / counter-increment:li; / Increment the counter by 1 / / Position and style the number / position:absolute; top:-2px; left:-2em; -moz-box-sizing:border-box; -webkit-box-sizing:border-box; box-sizing:border-box; width:7em; / Some space between the number and the content in browsers that support generated content but not positioning it (Camino 2 is one example) / margin-right:8px; padding:4px; border-top:2px solid #317EAC; color:#fff; background:#317EAC; font-weight:bold; font-family:"Helvetica Neue", Arial, sans-serif; text-align:center; } li ol, li ul {margin-top:6px;} ol ol li:last-child {margin-bottom:0;}

.oyo ul { list-style-type:decimal; }

.oyo ul { list-style-type:decimal; }

hr { border: 1px solid #357FAA; }

div#boxedtext { background-color: rgba(86, 155, 189, 0.2); padding: 20px; margin-bottom: 20px; font-size: 10pt; }

div#template { margin-top: 30px; margin-bottom: 30px; color: #808080; border:1px solid #808080; padding: 10px 10px; background-color: rgba(128, 128, 128, 0.2); border-radius: 5px; }

div#license { margin-top: 30px; margin-bottom: 30px; color: #4C721D; border:1px solid #4C721D; padding: 10px 10px; background-color: rgba(76, 114, 29, 0.2); border-radius: 5px; }

/------------ Fancy blocks --------------/

.noteblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #042e6b; background: 1.5em center/2.5em no-repeat; background-image: url("images/info-circle.svg"); }

.tipblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #d5d5ce; background: 1.5em center/2em no-repeat; background-image: url("images/lightbulb.svg"); }

.warningblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #be4e00; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-triangle.svg"); }

.importantblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #c30000; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-circle.svg"); }

.exerciseblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #317EAC; background: rgba(49, 126, 172, 0.1) 1.5em center/2.5em no-repeat; background-image: url("images/flag.svg"); }

```r
# packages
library(MA22004labs)
library(tidyverse)
library(infer)
library(learnr)
library(gradethis)
library(knitr)

# tutorial options
tutorial_options(
  # code running in exercise times out after 30 seconds
  # if it is taking more than 30 s something is wrong 
  exercise.timelimit = 30
  )
# use gradethis for checking
gradethis_setup()
options(width = 60)

# hide non-exercise code chunks
knitr::opts_chunk$set(echo = FALSE, error = TRUE, 
                      comment = "", fig.align = "center", fig.width = 6, 
                      fig.asp = 0.75, out.width = "100%")

data(ames)
population <- ames
samp <- sample_n(ames, 60)

Getting started

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it's straightforward to answer questions like, "How big is the typical house in Ames?" and "How much variation is there in sizes of houses?". If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This situation requires you to use your sample to make inferences about what your population looks like.

:::{.noteblock} The report for Lab 4 consists of answers to exercises 1-5 in the last section of this tutorial. Upload the submission to Gradescope as a PDF; an rmarkdown template called MA22004 Lab Report is provided. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in Section "RStudio and lab reports" of Lab 1. :::

Packages

In this lab, we will explore and visualise the data using the tidyverse suite of packages and perform statistical inference using infer.

Recall that all the necessary packages can be loaded using the library command.

library(MA22004labs)
library(tidyverse)
library(infer)

The data

:::{.tipblock} In the previous lab, "Lab 3: Sampling Distributions", we looked at the Ames Housing Data, the population data of houses from Ames, Iowa. Recall that this data can be loaded by calling:

data(ames)

:::

In this lab, we'll start with a simple random sample of size 60 from the population. Specifically, this is a random sample of 60 rows (i.e. all the response variables) from the data frame ames (which we save as population for pedagogical purposes). Note that the data set has information on many housing variables, but for the first portion of the lab, we'll focus on the size of the house, represented by the variable area.

population <- ames
samp <- sample_n(population, 60)

Explore the population and sample distributions by plotting histograms and creating summary tables.


summary(`___`)
ggplot(`___`) + geom_histogram()
summary(samp$area)
ggplot(samp, aes(x = area)) + geom_histogram()

:::{.exerciseblock} Describe the distribution of your sample samp$area. What would you say is the "typical" size within your sample? Also, state precisely what you interpreted "typical" to mean (e.g., reference a summary statistic).


question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

:::{.exerciseblock} Would you expect another student's distribution for samp$area to be identical to yours? Would you expect it to be similar? Why or why not?

question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Confidence intervals

One of the most common ways to describe a distribution's typical or central value is to use the mean. Calculate the mean of the sample area using the mean function.


mean(`___`)
mean(samp$area)
grade_this({
  if (.result[1] == mean(samp$area)){
    pass()
  }
  fail()
})
sample_mean <- mean(samp$area)

Return to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as $\bar{x}$ (here, we're calling it sample_mean). That serves as a reasonable point estimate but it would be helpful also to communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.

Normality theory

One way of calculating a confidence interval for a parameter is based on the Central Limit Theorem (normality theory). More precisely, we can calculate a 95% confidence interval for a population mean by adding and subtracting 1.96 standard errors to the point estimate for the sample mean (see Section 3.1.2 of the Course Notes)

[\overline{x} \pm 1.96 \cdot \mathsf{se}\,. ]

Compute the standard error se for the 60 samples and save it to a vector se. Also, compute upper and lower limits, lower and upper, respectively, based on 1.96 deviations of the standard error from the sample_mean. Print the confidence interval using the vector command c(lower, upper).


se <- `___`
lower <- `___`
upper <- `___`
c(lower, upper)
se <- sd(`___`) / sqrt(`___`)
lower <- `___` - 1.96 * `___`
upper <- `___` + 1.96 * `___`
c(lower, upper)
grade_this({
  if (all( signif(.result[1], digits = 8) == signif(mean(samp$area) - 1.96 * sd(samp$area) / sqrt(60), digits = 8) & signif(.result[2], digits = 8) == signif(mean(samp$area) + 1.96 * sd(samp$area) / sqrt(60), digits = 8))) {
    pass()
  }
    fail()
})
se <- sd(samp$area) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

This is an important inference that we've just made: even though we don't know what the entire population looks like, we're 95% confident that the true average size of houses in Ames lies between the values lower and upper. A few conditions must be met for this interval to be valid.

:::{.exerciseblock} For the confidence interval [\overline{x} \pm 1.96 \frac{s}{\sqrt{m}}] computed in Section "Confidence intervals" above to be valid, the sample mean must be normally distributed and have standard error $s / \sqrt{m}$ where $s$ is the sample standard deviation and $m$ is the sample size. What conditions must be met for this to be true?

question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Bootstrapping

Another way of calculating a confidence interval for a parameter is using simulation, or more specifically, using bootstrapping. The term bootstrapping comes from the phrase "pulling oneself up by one's bootstraps", which is a metaphor for accomplishing an impossible task without outside help. In this case, the impossible task is estimating a population parameter (e.g., the unknown population mean), and we'll accomplish it using data from only the given sample. Note that this notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference; it is not limited to bootstrapping.

In essence, bootstrapping assumes that there are more observations in the populations like the ones in the observed sample. So we "reconstruct" the population by resampling from our sample with replacement. The bootstrapping scheme is as follows:

Instead of coding up each of these steps, one can use the infer package to construct confidence intervals.

Below is an overview of the functions we will use to construct this confidence interval:

+-------------+--------------------------------------------------------------------------------------------------------------------------------------------+ | Function | Purpose | +:============+:===========================================================================================================================================+ | specify | Identify your variable of interest | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+ | generate | The number of samples you want to generate | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+ | calculate | The sample statistic you want to do inference with, or you can also think of this as the population parameter you want to do inference for | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+ | get_ci | Find the confidence interval | +-------------+--------------------------------------------------------------------------------------------------------------------------------------------+

This code will find the 95% confidence interval for the area.

samp |>
 specify(response = area) |>
 generate(reps = 1000, type = "bootstrap") |>
 calculate(stat = "mean") |>
 get_ci(level = 0.95)

Feel free to test out the rest of the arguments for these functions since these commands will be used to calculate confidence intervals and solve inference problems for the rest of the semester. But we will also walk you through more examples in future chapters.

:::{.exerciseblock} Compare the confidence interval for the true mean area produced by bootstrapping (using the infer package) and the normality theory.

question_text(
  "Enter your thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

:::{.tipblock} To recap: we have produced confident intervals for the true mean area of the Ames Housing Data using two methods: (1) normality theory and (2) bootstrapping. If we didn't already know the true population mean, then we would be 95% confident that the true population parameter lies within the computed interval estimates. :::

Confidence levels

In this case, we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:

mean(population$area)

:::{.exerciseblock} What does "95% confidence" mean in relation to a confidence interval? If you're unsure, see Section 2.2 of the Course Notes. Does your confidence interval capture the true average size of houses in Ames?

question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Using R, we're going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here.

Here is the rough outline:

Create a loop that calculates the means and standard deviations of 50 random samples. Save the means to a vector called samp_mean and the standard deviations to samp_sd. After the loop, Construct a vector of lower limits called lower_vector and upper limits called upper_vector based on the large-sample 95% confidence intervals for the population mean.

samp_mean <- rep(NA, 50)   # create empty vector for means
samp_sd <- rep(NA, 50)     # create empty vector for sds
m <- 60                    # sample size

for(i in 1:50){
`___`
}

`___`
`___`

summary(lower_vector)
summary(upper_vector)
for(i in 1:50){
  x <- `___`                      # obtain a sample of size 60 from the population
  samp_mean[i] <- `___`           # save sample mean in ith element of samp_mean
  samp_sd[i] <- `___`             # save sample sd in ith element of samp_sd
}

lower_vector <- `___`  
upper_vector <- `___` 
for(i in 1:50){
  x <- sample(`___`, m)          # obtain a sample of size m from the population
  samp_mean[i] <- mean(`___`)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(`___`)        # save sample sd in ith element of samp_sd
}

lower_vector <- `___` - 1.96 * `___` / `___`
upper_vector <- `___` + 1.96 * `___` / `___`

summary(lower_vector)
summary(upper_vector)
samp_mean <- rep(NA, 50)   # create empty vector for means
samp_sd <- rep(NA, 50)     # create empty vector for sds
m <- 60                    # sample size

for(i in 1:50){
  x <- sample(population$area, m) # obtain a sample of size m from the population
  samp_mean[i] <- mean(x)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(x)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(m) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(m) 

summary(lower_vector)
summary(upper_vector)
samp_mean <- rep(NA, 50)   # create empty vector for means
samp_sd <- rep(NA, 50)     # create empty vector for sds
m <- 60                    # sample size
for(i in 1:50){
  x <- sample(population$area, m) # obtain a sample of size m = 60 from the population
  samp_mean[i] <- mean(x)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(x)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(m) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(m) 

Included with the lab is a non-standard function called plot_ci; use the command plot_ci(lower_vector, upper_vector, mean(population)) to plot fifty (50) 95% confidence intervals for the true mean area from the Ames House Data. This tutorial is set up to rerun the loop above in the background every time you call the code chunk below; try rerunning the code.

plot_ci(lower_vector, upper_vector, mean(population$area))

:::{.exerciseblock} Included with the lab is a non-standard function called plot_ci; use the command plot_ci(lower_vector, upper_vector, mean(population)) to plot fifty (50) 95% confidence intervals for the true mean area from the Ames House Data. What proportion of your confidence intervals include the population mean? Is this proportion exactly equal to the confidence level? If not, explain why.


question_text(
  "Enter your thoughts and observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Exercises

:::{.warningblock} Please work through the interactive tutorial for this lab. Your lab report should provide answers to the exercises below. For full marks, please answer the exercises in complete sentences, be mindful of appropriate usage, spelling, and grammar, and use $\LaTeX$ for mathematics where applicable. Use the MA22004 Lab Report template to create your lab report. Upload your final submission as a PDF to Gradescope. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the interactive tutorial for Lab 1. :::

The following exercises refer to variables found in or derived from the Ames Housing Data in the data frame ames which is packaged with MA22004labs. In particular, we focus on two features: the above-ground living area of the house in square feet (area, i.e. ames$area) and the sale price (price, i.e. ames$price). Please use ggplot when creating plots and visualisations and use tidy methods for manipulating data.

:::{.tipblock} The Ames Housing Data: real estate data from the city of Ames, Iowa, USA. The City Assessor's office records the details of every real estate transaction in Ames. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab, we would like to learn about these home sales by taking smaller samples from the entire population. :::

:::{.tipblock} To fix a seed value in an R chunk (replace n by your favourite integer or your student ID):

set.seed(n)

:::

  1. Using ggplot, plot and describe the distribution of the sample sample(ames$area, 60). What would you say is the "typical" size within your sample? Also, state precisely what you interpreted "typical" to mean (e.g., reference a summary statistic). [3 marks]

  2. Would you expect another student's distribution for sample(ames$area, 60) to be identical to yours? Would you expect it to be similar? Why or why not? [3 mark]

  3. In Section "Confidence intervals", the confidence interval [\overline{x} \pm 1.96 \frac{s}{\sqrt{m}}\,,] with sample standard deviation $s$ and sample size $m$, is used. What conditions must be met for this confidence interval to be valid? [3 marks]

  1. Included with the lab is a non-standard function called plot_ci; use plot_ci to plot fifty 90% confidence intervals for the true mean area from the Ames House Data. What proportion of your confidence intervals includes the population mean? Is this proportion exactly equal to the confidence level? If not, explain why. [5 marks]
  1. Using code from the infer package and data from the one sample of size 60 from ames, find a confidence interval for the proportion of Ames residents that live in a single-family home with a 90% confidence level and interpret it. [6 marks]

Creative Commons License{style="border-width:0"}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This work is derivative of OpenIntro Labs and was modified by Eric Hall of the University of Dundee Mathematics Division. The original labs were adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from labs written by Mark Hansen of UCLA Statistics.



dundeemath/MA22004labs documentation built on Sept. 18, 2024, 10:51 p.m.