
$\\$

In the following homework, you will gain more experience examining sampling, sampling distributions and confidence intervals. Please submit a compiled pdf with your answers to the exercises to Gradescope by 11pm on Sunday February 18th. Five points will be taken off if you do not mark the pages for each problem on Gradescope.

A list of functions we have used in class can be found on Canvas. You might find the following symbols useful: $\mu$, $\bar{x}$, $\pi$, $\hat{p}$, $\rho$, $\hat{y}$, $\pm$, and $\approx$, as well as the functions qnorm(), do_it() and sample(). For full credit on problems involving R, be sure to label all your figures and to "show your work" by making sure all values that are answers to questions are printed/visible in your R Markdown pdf.

As always, if you need help with any of the homework assignments, please attend the TA office hours, which are listed on Canvas, and/or ask questions on the Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, reviewing the class slides and videos will be helpful if you run into difficulties.

Note: The "seed" to generate random numbers is set at the top of this R Markdown document. This will allow you to get the same answers every time you knit the document, and thus to answer questions that involve randomness in a consistent way (i.e., you can read out results generated by random functions from the knitted pdf and then comment on these results). In order to make it easier to grade the homework, please do not change this seed, and do not call the set.seed function anywhere else in the R Markdown document.
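To see why a fixed seed makes random results repeatable, here is a minimal sketch. It is meant to be run in the console only, not added to your homework document (which should not call set.seed anywhere else). The seed value 100 matches the setup chunk; the object names first_draw and second_draw are made up for this illustration.

```r
# Illustration only: resetting the seed replays the same "random" draws
set.seed(100)
first_draw <- sample(1:10, 3)

set.seed(100)
second_draw <- sample(1:10, 3)

# TRUE, because both draws started from the same seed
identical(first_draw, second_draw)
```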

download.file("https://yale.box.com/shared/static/ey6ahs284lhoye1hgqloe44aqbb2h0qd.rda", "cars_small.Rda")

  options(scipen=999)
  set.seed(100)
  library(SDS100)

$\\$

Part 1: Confidence intervals

Let's start by doing a few exercises to make sure you have a firm understanding of confidence intervals.

$\\$

Exercise 1.1 (6 points) Which of the following statements are true about 90% confidence intervals? Indicate all statements that are true by copying the full sentence(s) to the answer section:

a. 90 percent of the sample data is contained in the intervals produced by this method.

b. 90 percent of the population data is contained in the intervals produced by this method.

c. 90 percent of the intervals produced by this method contain the parameter of interest.

d. 90 percent of the intervals produced by this method contain the statistic of interest.

Answers:

$\\$

Exercise 1.2 (10 points): In class we discussed that the formula for computing a 95% confidence interval for the mean $\mu$ can be written as: $\bar{x} \pm 2 \cdot SE$. Please describe, as clearly as you can, the logic of why an interval of this form is able to capture the parameter $\mu$ 95% of the time. You can assume that the sampling distribution is normal and that there is no bias.

If you are unsure of your answer, rewatching some of the class videos could be useful. If you think it will help your explanation, you are welcome to create drawings and embed them into your R Markdown document.
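If it helps to see the 95% figure numerically, the following is a minimal simulation sketch rather than a required part of the answer. It assumes a made-up normal population (mean 5, standard deviation 1), a sample size of 100, and a known standard error; all object names are hypothetical, and base R's replicate() is used for the repetition.

```r
# Assumed (hypothetical) population: normal with mean 5 and sd 1
mu <- 5
sigma <- 1
n <- 100
SE <- sigma / sqrt(n)   # SE of the mean when sigma is known

# Draw 10,000 samples and compute each sample mean
x_bars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# For each sample mean, check whether x_bar +/- 2 * SE contains mu
captured <- (x_bars - 2 * SE <= mu) & (mu <= x_bars + 2 * SE)

# Proportion of intervals that capture mu (should be close to 0.95)
mean(captured)
```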

Answer:

$\\$

Exercise 1.3 (8 points): Please write the formula that can be used to compute an 82% confidence interval for a proportion. You can assume the standard error, SE, is known and that the sampling distribution is normal. You are welcome to use the R chunk below as well to help come up with the formula.
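As a hint about how qnorm() relates to the multiplier in a confidence interval, here is a small sketch for two other confidence levels. The helper name z_star is hypothetical (it is not part of the SDS100 package); it assumes a normal sampling distribution, as the exercise does.

```r
# Hypothetical helper: multiplier that leaves (1 - level) / 2 in each tail
z_star <- function(level) {
  qnorm(1 - (1 - level) / 2)
}

z_star(0.95)  # about 1.96, which the formula in Exercise 1.2 rounds to 2
z_star(0.90)  # about 1.64
```

The same idea carries over to any confidence level between 0 and 1.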

Answer:

$\\$

Part 2: Exploring sampling distributions

In the following exercises you will use a web application to explore sampling and bootstrap distributions. You will also explore standard errors and confidence intervals computed from these distributions. The application to use is located at: https://emeyers.shinyapps.io/sampling_bootstrap_distribution_app/. Note, the application is a little buggy, so if it acts strangely please reload the web page.

To complete these exercises, open up the web application in another window. When the application is open you will see the following three plots:

a. The plot on the upper left shows the distribution of data in a population. The shape of the population distribution can be changed using the population distribution dropdown box, where the choices are: right skewed, bimodal and normal.

b. The plot on the lower left shows a histogram of the data from one single sample of size n. The size of the sample can be changed using the sample size input box.

c. The plot on the upper right shows an (approximate) sampling distribution (histogram) of statistics based on 5,000 samples (i.e., each statistic in the histogram was computed from one of the 5,000 samples). The type of statistic that goes into this distribution can be changed using the 'statistic' drop down box. Options for the statistics you can use are: the mean, the median and the standard deviation.
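What the upper right plot does can be sketched in a few lines of base R. This is only an approximation of the idea, not the app's actual code: the right skewed population here is assumed to be exponential with mean 1, and the object names are made up.

```r
# Assumed right-skewed population: exponential with mean 1
# (not the app's actual population)
n <- 100

# Each entry is the mean of one random sample of size n
sampling_dist <- replicate(5000, mean(rexp(n, rate = 1)))

# Histogram of the 5,000 sample means
hist(sampling_dist,
     breaks = 50,
     main = "Approximate sampling distribution of the mean",
     xlab = "Sample mean")

# The standard error is the standard deviation of the sampling distribution
sd(sampling_dist)
```

Swapping mean() for median() or sd() inside replicate() gives the sampling distribution of a different statistic.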

$\\$

Exercise 2.1 (6 points): Using the mean statistic and a sample size of n = 100, describe the shape of the sampling distribution for:

a. the right skewed population distribution

b. the bimodal population distribution

c. the normal population distribution

Answer in the conclusion section below: does the shape of the sampling distribution change a lot depending on the shape of the population distribution?

Answers:

a. shape of the sampling distribution is:

b. shape of the sampling distribution is:

c. shape of the sampling distribution is:

Conclusion:

$\\$

Exercise 2.2 (6 points): Using a sample size of n = 100 and a population distribution that is right skewed, describe the shape of the sampling distribution for the statistics of:

a. mean ($\bar{x}$)

b. standard deviation (s)

c. median

Answer in the conclusion section below: does the shape of the distribution change a lot depending on the statistic chosen?

Answers:

a. shape of the sampling distribution is:

b. shape of the sampling distribution is:

c. shape of the sampling distribution is:

Conclusion:

$\\$

Exercise 2.3 (6 points): Keeping the population distribution right skewed and the statistic set to the sample mean $\bar{x}$, compare the sample sizes of:

a. n = 20

b. n = 80

c. n = 320

In the answer section below describe the shape of the sampling distribution for these different sample sizes and what the standard errors (SE) are. Is this what you would expect? Why? Also, do you notice any relationship between the different standard errors?
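If you would like to check the app's standard errors against a simulation, the sketch below shows one way to do it for a different set of sample sizes. It assumes an exponential population with mean 1 and standard deviation 1 (not the app's population), and it compares the simulated SE to the standard formula for the SE of a mean, $\sigma / \sqrt{n}$. All object names are hypothetical.

```r
# Assumed population: exponential with mean 1 and sd 1 (hypothetical)
sample_sizes <- c(25, 100, 400)

# Simulated SE: sd of 5,000 sample means at each sample size
simulated_SE <- sapply(sample_sizes, function(n) {
  sd(replicate(5000, mean(rexp(n, rate = 1))))
})

# Theoretical SE of the mean: sigma / sqrt(n), with sigma = 1 here
theoretical_SE <- 1 / sqrt(sample_sizes)

round(rbind(simulated_SE, theoretical_SE), 3)
```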

Answers:

a. shape is: , SE =

b. shape is: , SE =

c. shape is: , SE =

Conclusion:

$\\$

Part 3: Explore bootstrap distributions

Let's continue to use our web application, but now let's explore bootstrap distributions.

For the next two questions check the box that says 'Display Bootstrap Distribution'. You should notice a new plot in the lower right that shows the bootstrap distribution that is created from the one sample in the lower left plot.
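The idea behind the lower right plot can be sketched in base R as follows. This is an illustration under stated assumptions, not the app's code: the single sample is simulated from a made-up normal population, and replicate() with sample(..., replace = TRUE) stands in for the resampling step.

```r
# Hypothetical single sample of size 100
# (stands in for the app's lower-left plot)
one_sample <- rnorm(100, mean = 5, sd = 1)

# Each bootstrap statistic is the mean of a resample drawn with replacement
boot_dist <- replicate(5000, mean(sample(one_sample, replace = TRUE)))

# Center of the bootstrap distribution and the statistic from the one sample
mean(boot_dist)
mean(one_sample)
```

Note that the bootstrap distribution is built from the one sample alone; the population is never sampled again.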

$\\$

Exercise 3.1 (8 points): The centers of these distributions are marked by red lines on the plots, and the center values are noted at the top of the 4 plots. To get a sense of how the center values relate across the plots, set the statistic to the mean and try changing the sample size to values around 100. Each change will generate: a) a new sample, b) a new sampling distribution and c) the resulting bootstrap distribution from the sample. Then answer the following questions:

a. How does the population mean ($\mu$) relate to the center of the sampling distribution ($E[\bar{x}]$)? Is there bias here?

b. How does the center of the sample ($\bar{x}$) relate to the population mean ($\mu$)? Is this expected?

c. How does the center of the bootstrap distribution ($E[\bar{x}^*]$) relate to the center of the sample ($\bar{x}$)?

d. How does the center of the bootstrap distribution relate to the center of the population ($\mu$)?

Answers:

$\\$

Exercise 3.2 (10 points): How do the standard error and confidence interval of the sampling distribution compare to the standard error and confidence interval created from the bootstrap distribution? To explore this, set the distribution to normal and the statistic to the mean, and run the following steps twice (i.e., my answers from running the steps once are in the first row of the table; fill in the second and third rows by running the steps below twice):

a. Set the sample size to n = 99 and then reset it to n = 100. This will rerun the app to create a new sample and a new bootstrap distribution.

b. Write down the SE from the sampling distribution and the SE* from the bootstrap distribution.

c. Fill in the table below by reporting the SE, $SE^*$, and the value of $\bar{x}$. Also, using the statistic value from the one sample ($\bar{x}$), compute 95% confidence intervals from the sampling distribution (using SE) and from the bootstrap distribution (using $SE^*$) with the formulas: $CI = \bar{x} \pm 2 \cdot SE$ and $CI_b = \bar{x} \pm 2 \cdot SE^*$, and add these intervals to the table.

Do all these intervals capture the population parameter?
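As a check on the arithmetic, the filled-in row of the table for this exercise can be reproduced as follows. The three input values are taken from that row; the object names are made up for this sketch.

```r
# Values from the row of the table that is already filled in
x_bar   <- 5.07
SE      <- 0.10
SE_boot <- 0.13   # SE* from the bootstrap distribution

# CI = x_bar +/- 2 * SE
CI <- x_bar + c(-2, 2) * SE

# CI_b = x_bar +/- 2 * SE*
CI_boot <- x_bar + c(-2, 2) * SE_boot

CI       # 4.87 5.27
CI_boot  # 4.81 5.33
```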

Answers:

| $\bar{x}$ | SE   | CI          | SE*  | CI-bootstrap |
|-----------|------|-------------|------|--------------|
| 5.07      | 0.10 | [4.87 5.27] | 0.13 | [4.81 5.33]  |
|           |      |             |      |              |
|           |      |             |      |              |

Write your answer here whether all three intervals capture the population parameter...

$\\$

Exercise 3.3 (10 points): Repeat exercise 3.2 but using the right skewed population. Do all three intervals capture the population parameter?

a. Set the sample size to n = 99 and then reset it to n = 100. This will rerun the app to create a new sample and a new bootstrap distribution.

b. Write down the SE from the sampling distribution and the SE* from the bootstrap distribution.

c. Report the SE, $SE^*$, and the value of $\bar{x}$. Also, using the statistic value from the one sample ($\bar{x}$), compute 95% confidence intervals from the sampling distribution (using SE) and from the bootstrap distribution (using $SE^*$) with the formulas: $CI = \bar{x} \pm 2 \cdot SE$ and $CI_b = \bar{x} \pm 2 \cdot SE^*$.

Do all these intervals capture the population parameter?

Answers:

| $\bar{x}$ | SE   | CI          | SE*  | CI-bootstrap |
|-----------|------|-------------|------|--------------|
| 1.91      | 0.14 | [1.63 2.19] | 0.13 | [1.65 2.17]  |
|           |      |             |      |              |
|           |      |             |      |              |

Write your answer here whether all three intervals capture the population parameter...

$\\$

Part 4: Calculating confidence intervals using the bootstrap in R

In the first set of exercises you will calculate confidence intervals for the mean price that Hondas were sold for on Edmunds.com. As mentioned in homework 1, Edmunds has made this data available for educational purposes only, so please do not distribute this data.

$\\$

Exercise 4.1 (8 points): Let's start by looking at the price of Hondas that were sold on Edmunds.com. The code below loads the data on the prices of cars and extracts a sample of prices that different Hondas were sold for. Calculate the size of the sample and store the answer in an object called n_sample_size. Also calculate the mean price that Hondas were sold for and store the result in an object called the_stat. Report the value for the sample size and the mean price Hondas were sold for and use the appropriate symbol. Finally, plot a histogram of the car prices and don't forget to label your axes!

# load the car price data
load("cars_small.Rda")

# extract the sample of Honda prices
price_sample <- subset(car_data$price, car_data$brand == "Honda")


# get the sample size and store it in an object called n_sample_size


# get the mean price and store it in an object called the_stat 


# plot a histogram of the Honda prices

Answer:

$\\$

Exercise 4.2 (14 points): Now use the do_it() function to create a bootstrap distribution by doing the following steps:

Sample with replacement n_sample_size car prices from the price_sample vector.
Calculate the mean of the bootstrap sample.
Repeat this process 10,000 times and store the results in a vector called bootstrap_dist.

Plot a histogram of the bootstrap distribution, calculate the bootstrap standard error ($SE^*$), and print the bootstrap standard error. Finally, calculate a 95% confidence interval using the formula: $CI_b = \bar{x} \pm 2 \cdot SE^*$ and report what the interval is below.
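The steps above can be sketched on toy data as follows. This is not the homework solution: the prices are invented (they are not the Edmunds data), and base R's replicate() is used as a stand-in for do_it(), so you will need to adapt the repetition step to the do_it() syntax shown in class. All object names are hypothetical.

```r
# Hypothetical toy data: NOT the Edmunds prices
toy_prices <- c(12500, 15900, 9800, 21000, 17400, 13250, 18900, 11100)
n_toy <- length(toy_prices)

# Stand-in for do_it(): repeat the resample-and-average step 10,000 times
toy_boot_dist <- replicate(10000, {
  boot_sample <- sample(toy_prices, n_toy, replace = TRUE)
  mean(boot_sample)
})

hist(toy_boot_dist,
     main = "Bootstrap distribution (toy data)",
     xlab = "Bootstrap sample mean")

# Bootstrap standard error and the 95% interval x_bar +/- 2 * SE*
SE_boot <- sd(toy_boot_dist)
CI_boot <- mean(toy_prices) + c(-2, 2) * SE_boot
SE_boot
CI_boot
```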

Answer:

$\\$

Exercise 4.3 (5 points) If you were to create a confidence interval for the price of Hyundais using the bootstrap method on the Edmunds data, you would get a 95% confidence interval of [\$19,505 \$19,778] (you can verify this yourself by replacing the line above that gets the car prices to price_sample <- subset(car_data$price, car_data$brand == "Hyundai")). Based on the confidence interval you calculated for Hondas above, and the confidence interval for Hyundais I just gave you, is it likely that the population mean price for Hondas ($\mu_{honda}$) is the same as the population mean price for Hyundais ($\mu_{hyundai}$)? Why or why not?
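One rough way to compare two confidence intervals is to check whether they share any plausible values. The sketch below shows such a check; the helper intervals_overlap is hypothetical, the second and third intervals are invented for illustration, and overlap is only a heuristic rather than a formal test, so your written reasoning still matters.

```r
# Hypothetical helper: TRUE if two intervals [lower, upper] share any values
intervals_overlap <- function(ci_1, ci_2) {
  ci_1[1] <= ci_2[2] && ci_2[1] <= ci_1[2]
}

hyundai_ci <- c(19505, 19778)   # interval given in the exercise
made_up_ci <- c(19700, 20100)   # invented interval, for illustration only

intervals_overlap(hyundai_ci, made_up_ci)        # TRUE
intervals_overlap(hyundai_ci, c(21000, 21500))   # FALSE
```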

Answers:

$\\$