```{css, echo=FALSE} body { counter-reset: li; / initialize counter named li / }

ol { margin-left:0; / Remove the default left margin / padding-left:0; / Remove the default left padding / } ol > li { position:relative; / Create a positioning context / margin:0 0 10px 2em; / Give each list item a left margin to make room for the numbers / padding:10px 80px; / Add some spacing around the content / list-style:none; / Disable the normal item numbering / border-top:2px solid #317EAC; background:rgba(49, 126, 172, 0.1); } ol > li:before { content:"Exercise " counter(li); / Use the counter as content / counter-increment:li; / Increment the counter by 1 / / Position and style the number / position:absolute; top:-2px; left:-2em; -moz-box-sizing:border-box; -webkit-box-sizing:border-box; box-sizing:border-box; width:7em; / Some space between the number and the content in browsers that support generated content but not positioning it (Camino 2 is one example) / margin-right:8px; padding:4px; border-top:2px solid #317EAC; color:#fff; background:#317EAC; font-weight:bold; font-family:"Helvetica Neue", Arial, sans-serif; text-align:center; } li ol, li ul {margin-top:6px;} ol ol li:last-child {margin-bottom:0;}

.oyo ul { list-style-type:decimal; }

.oyo ul { list-style-type:decimal; }

hr { border: 1px solid #357FAA; }

div#boxedtext { background-color: rgba(86, 155, 189, 0.2); padding: 20px; margin-bottom: 20px; font-size: 10pt; }

div#template { margin-top: 30px; margin-bottom: 30px; color: #808080; border:1px solid #808080; padding: 10px 10px; background-color: rgba(128, 128, 128, 0.2); border-radius: 5px; }

div#license { margin-top: 30px; margin-bottom: 30px; color: #4C721D; border:1px solid #4C721D; padding: 10px 10px; background-color: rgba(76, 114, 29, 0.2); border-radius: 5px; }

/------------ Fancy blocks --------------/

.noteblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #042e6b; background: 1.5em center/2.5em no-repeat; background-image: url("images/info-circle.svg"); }

.tipblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #d5d5ce; background: 1.5em center/2em no-repeat; background-image: url("images/lightbulb.svg"); }

.warningblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #be4e00; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-triangle.svg"); }

.importantblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #c30000; background: 1.5em center/2.5em no-repeat; background-image: url("images/exclamation-circle.svg"); }

.exerciseblock { padding: 1em 1em 1em 6em; margin-bottom: 20px; margin-top: 20px; border-left: 6px solid #317EAC; background: rgba(49, 126, 172, 0.1) 1.5em center/2.5em no-repeat; background-image: url("images/flag.svg"); }

```r
# packages
library(MA22004labs)
library(tidyverse)
library(janitor)
library(cowplot)
library(learnr)
library(gradethis)
library(knitr)

# tutorial options
tutorial_options(
  # code running in exercise times out after 30 seconds
  # if it is taking more than 30 s something is wrong 
  exercise.timelimit = 30
  )
# use gradethis for checking
gradethis_setup()
options(width = 60)

# hide non-exercise code chunks
knitr::opts_chunk$set(echo = FALSE, error = TRUE, 
                      comment = "", fig.align = "center", fig.width = 6, 
                      fig.asp = 0.75, out.width = "100%")

data(ames)
samp1 <- sample(ames$area, 50)

sample_means10 <- rep(NA, 5000)
sample_means50 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(ames$area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(ames$area, 50)
  sample_means50[i] <- mean(samp)
  samp <- sample(ames$area, 100)
  sample_means100[i] <- mean(samp)
}

Getting started

In this lab, we investigate how the statistics from a random sample of data can serve as point estimates for population parameters. We're interested in formulating a sampling distribution of our estimate to learn about the properties of the estimate, such as its distribution.

:::{.noteblock} The report for Lab 3 consists of answers to exercises 1-5 in the last section of this tutorial. Upload the submission to Gradescope as a PDF; an rmarkdown template called MA22004 Lab Report is provided. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in Section "RStudio and lab reports" of Lab 1. :::

Packages

In this lab, we will explore and visualise the data using the tidyverse suite of packages. The package janitor is for data cleaning and provides the function tabyl for summarising data. The package cowplot provides the plot_grid function for nicely arranging saved ggplot plots.

Recall that all the necessary packages can be loaded using the library command.

library(MA22004labs)
library(tidyverse)
library(janitor)
library(cowplot)

The Ames Housing Data

:::{.tipblock} The Ames Housing Data: real estate data from the city of Ames, Iowa, USA. The City Assessor's office records the details of every real estate transaction in Ames. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab, we would like to learn about these home sales by taking smaller samples from the entire population. :::

Load the Ames Housing Data data frame by calling ames.


data(ames)

There are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we'll restrict our attention to just two variables: the above-ground living area of the house in square feet (area) and the sale price (price).
Let's look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.


summary(`___`)
ggplot(`___`) + geom_histogram()
summary(ames$area)
ggplot(ames, aes(x = area)) + geom_histogram(bins = 25)
question_text(
  "How would you describe this population distribution?",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

The unknown sampling distribution

In this lab, we have access to the entire population, which is rarely the case in real life. Gathering information on a whole population is often extremely costly or impossible. Because of this, we usually take a sample of the population and use that to understand the properties of the population.

If we were interested in estimating the mean living area in Ames based on a sample, we use the following command to sample a subset of the population.

samp1 <- sample(ames$area, 50) # (1 as in one not l as in lucy)

This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor's database and pulling up the files on 50 random home sales. Working with these 50 files would be more straightforward than working with all 2930 home sales.

:::{.exerciseblock} Describe the distribution of the sample samp1 <- sample(ames$area, 50). How does it compare to the distribution of the population?


summary(`___`)
`___`
question_text(
  "Enter your observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

If we're interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean. Calculate the sample mean for samp1.


mean(`___`)
grade_this({
  if(.result == mean(samp1)){
    pass()
  }
  fail()
})

Depending on which 50 homes you selected, your estimate could be slightly above or below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3\% of the population.

:::{.exerciseblock} Take a second sample of the area data, also of size 50, and call it samp2. How does the mean of samp2 compare with samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?


samp2 <- `___`
question_text(
  "Enter your observations as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Not surprisingly, we get a different sample mean every time we take another random sample. It's helpful to understand how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples using a for loop and compute the sample mean of each.

sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
   samp <- sample(ames$area, 50)
   sample_means50[i] <- mean(samp)
   }
ggplot(NULL, aes(x = sample_means50)) + geom_histogram()

Here, there is no data frame --- just a vector of sample means --- therefore, we set data = NULL. Adjust the bin width of your histogram to show a little more detail (recall, you can do so by changing the bins argument).


# Here, there is no data frame -- just a vector of sample means
ggplot(NULL, aes(x = sample_means50)) + geom_histogram()
ggplot(NULL, aes(x = sample_means50)) + geom_histogram(bins = 40) # bins >= 30

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. In the next section, we'll review how this code works.

:::{.exerciseblock} How many elements are there in sample_means50? Describe the sampling distribution, and be sure to note its centre. How would you expect the distribution to change if we collected 50,000 sample means instead?


question_text(
  "Enter your observations and thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Interlude: The for loop

Let's take a break from the statistics for a moment to let that last block of code sink in. Maybe you have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code multiple times without typing out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area and then save the mean of that sample into the sample_means50 vector. Without the for loop, this would be painful:

sample_means50 <- rep(NA, 5000)
samp <- sample(ames$area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(ames$area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(ames$area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(ames$area, 50)
sample_means50[4] <- mean(samp)

and so on...

With the for loop, these thousands of lines of code are compressed into a handful of lines. We've added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.

sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
   samp <- sample(ames$area, 50)
   sample_means50[i] <- mean(samp)
   print(i)
   }

Let's consider this code line by line to figure out what it does. In the first line, we initialised a vector. In this case, we created a vector of 5000 zeros called sample_means50. This vector will store values generated within the for loop.

The second line calls the for loop itself. The syntax can be loosely read as, "for every element i from 1 to 5000, run the following lines of code". You can think of i as the counter that keeps track of which loop you're on. Therefore, more precisely, the loop will run once when i = 1, then once when i = 2, and so on up to i = 5000.

The body of the for loop is the part inside the curly braces, and this code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the $i$th element of sample_means50.

To display this, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what's happening while the for loop is running.

The for loop allows us to run the code 5000 times and to package the results neatly, element by element, into the empty vector we initialised at the outset.

:::{.exerciseblock} Try running a smaller version to ensure you understand what you've done in this loop. Initialise a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Present your results in a histogram.


:::

Sample size v distribution

Mechanics aside, let's return to the reason we used a for loop: to compute a sampling distribution, specifically, the following.

ggplot(NULL, aes(x = sample_means50)) + geom_histogram(bins = 25)

The sampling distribution we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centred at the population's true average living area, and the distribution spread indicates how much variability is induced by sampling only 50 home sales.

To get a sense of the effect that sample size has on our distribution, build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
`___`
}
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(`___`){
  samp <- `___`
  sample_means10[`___`] <- `___`
  samp <- `___`
  sample_means100[`___`] <- `___`
}
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(ames$area, `___`)
  sample_means10[i] <- mean(`___`)
  samp <- sample(ames$area, `___`)
  sample_means100[i] <- mean(`___`)
}
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(ames$area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(ames$area, 100)
  sample_means100[i] <- mean(samp)
}

Here we can use a single for loop to build two distributions by adding additional lines inside the curly braces. Don't worry that samp is used to name two different objects. In the second command of the for loop, the mean of samp is saved to the appropriate place in the vector sample_means10. With the mean saved, we're free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name already in use, the old object will be replaced with the new one.

To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.

xlimits <- range(sample_means10)
n10  <- ggplot(NULL, aes(x = sample_means10)) + geom_histogram(bins = 30) + xlim(xlimits)
n50  <- ggplot(NULL, aes(x = sample_means50)) + geom_histogram(bins = 30) + xlim(xlimits)
n100 <- ggplot(NULL, aes(x = sample_means100)) + geom_histogram(bins = 30) + xlim(xlimits)

plot_grid(n10, n50, n100, nrow = 3)

In this case, we save each of the plots to a variable name and then join the plots into a grid using the plot_grid function (provided by the cowplot package). The argument nrow = 3 specifies that we want to stack the three plots. The bins argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the histogram's x-axis, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.

:::{.exerciseblock} Recreate sampling distribution histograms for the mean area for three different sample sizes. As the sample size increases, what happens to the centre of mass of the distribution? What happens to the spread (dispersion) as the sample size increases?


question_text(
  "Enter your observations and thoughts as free text below.",
  answer("C0rrect", correct = TRUE),
  incorrect = "Okay! [even if this button is red]",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

:::

Exercises

:::{.warningblock} Please work through the interactive tutorial for this lab. Your lab report should provide answers to the exercises below. For full marks, please answer the exercises in complete sentences, be mindful of appropriate usage, spelling, and grammar, and use $\LaTeX$ for mathematics where applicable. Use the MA22004 Lab Report template to create your lab report. Upload your final submission as a PDF to Gradescope. Further instructions on using the class template to produce a nice lab report containing data analysis and text are in the interactive tutorial for Lab 1. :::

The following exercises refer to variables found in or derived from the Ames Housing Data in the data frame ames which is packaged with MA22004labs. In particular, we focus on two features: the above-ground living area of the house in square feet (area, i.e. ames$area) and the sale price (price, i.e. ames$price). Please use ggplot when creating plots and visualisations and use tidy methods for manipulating data.

:::{.tipblock} The Ames Housing Data: real estate data from the city of Ames, Iowa, USA. The City Assessor's office records the details of every real estate transaction in Ames. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab, we would like to learn about these home sales by taking smaller samples from the entire population. :::

:::{.tipblock} To fix a seed value in an R chunk (replace n by your favourite integer or your student ID):

set.seed(n)

:::

  1. Describe the distribution of the sample samp1 <- sample(ames$area, 50). How does it compare to the distribution of the population? [2 mark]

  2. Take two samples each of size 50 from the area data and call them samp1 and samp2, respectively. How does the mean of samp1 compare with samp2? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean? [3 marks]

  3. In Section "The unknown sampling distribution", you created a vector sample_means50. How many elements are there in sample_means50? Describe the sampling distribution, and be sure to note its centre. How would you expect the distribution to change if we collected 50,000 sample means instead? [3 marks]

  1. Recreate sampling distribution histograms for the mean area for samples of three different sizes (that each differs by at least 30). As the sample size increases, what happens to the centre of mass of the distribution? Describe what happens to the spread (dispersion) as the sample size increases. [6 marks]
  1. Take 1000 samples each of size 15 from price/1e3, and calculate the variance of each sample. Present the distribution of the sample variances in a histogram. Take 1000 samples of size 30 from price/1e3 and calculate the variance of each sample. Present the distribution of the sample variances in a histogram. Comment on the underlying distribution for the sample variance. What happens to the mean of the sample variance as the sample size increases? [6 marks]

Creative Commons License{style="border-width:0"}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This work is derivative of OpenIntro Labs and was modified by Eric Hall of the University of Dundee Mathematics Division. The original labs were adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from labs written by Mark Hansen of UCLA Statistics.



dundeemath/MA22004labs documentation built on Sept. 18, 2024, 10:51 p.m.