$\$

In the following homework, you will gain more experience examining sampling, sampling distributions and confidence intervals. Please submit a compiled pdf with your answers to the exercises to Gradescope by 11pm on Sunday February 11th. Please also make sure to indicate the pages to each question's answers on Gradescope. Five points will be taken off if you do not mark the pages for each problem on Gradescope.

A list of functions we have used in class can be found on Canvas. You might also find the following symbols useful: $\mu$, $\bar{x}$, $\pi$, $\hat{p}$, $\rho$, $\hat{y}$, $\pm$, and $\approx$. For full credit on problems involving R, be sure to label all your figures and to "show your work" by making sure all values that are answers to questions are printed/visible in your R Markdown pdf.

As always, if you need help with any of the homework assignments, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, reviewing the class slides and videos will be helpful if you run into difficulties.

  library(knitr)
  # makes sure the code is wrapped to fit when it creates a pdf
  opts_chunk$set(tidy.opts=list(width.cutoff=60))   


  SDS100::download_image("Lock5_3_6.png")
  SDS100::download_image("hw3_instate_tuition.png")

$\$

Part 1: Questions examining sampling and bias

The following questions are intended to improve your understanding of sampling, bias, sampling distributions and confidence intervals. I also recommend trying additional questions from the Lock5 textbook to get more practice.

$\$

Exercise 1.1 (6 points): In a sample survey of professors at the University of Nebraska, 94% of them described themselves as "above average" teachers. Please answer the following questions:

a. What is the sample? What is the population?

b. Based on the information provided, can we conclude that the study suffers from sampling bias?

c. Is 94% a good estimate for the percentage of above-average teachers at the University of Nebraska? If not, why not?

Answers:

a.

b.

c.

$\$

Exercise 1.2 (7 points): Employment statistics in the US are often based on two nationwide monthly surveys: the Current Population Survey (CPS) and the Current Employment Statistics (CES) survey. The CPS samples approximately 60,000 US households and collects the employment status, job type, and demographic information of each resident in the household. The CES survey samples 140,000 nonfarm businesses and government agencies and collects the number of payroll jobs, pay rates, and related information for each firm.

a. What is the population in the CPS survey?

b. What is the population in the CES survey?

For each of the following statistical questions, state whether the results from the CPS or CES survey would be more relevant.

i. Do larger companies tend to have higher salaries?

ii. What percentage of Americans are self-employed?

iii. Are married men more or less likely to be employed than single men?

Answers:

a.

b.

i.

ii.

iii.

$\$

Part 2: Questions examining sampling distributions and confidence intervals

Exercise 2.1 (8 points): A household consists of all people occupying a housing unit as their primary place of residence. The latest US Census lists the average household size in the United States as 2.61 people. Figure 1 shows possible distributions of means for 1000 samples of household sizes. The scale on the horizontal axis is the same in all four cases.

a. Assume that two of the distributions show results from 1000 random samples, while two others show distributions from a sampling method that is biased. Which two dotplots appear to show samples produced using a biased sampling method? Explain your reasoning. Pick one of the distributions that you listed as biased and describe a sampling method that might produce this bias.

b. For the two distributions that appear to show results from random samples, suppose that one comes from 1000 samples of size n = 100 and one comes from 1000 samples of size n = 500. Which distribution goes with which sample size? Explain.

Distribution of means of samples of household sizes

Answers:

a.

b.

$\$

Exercise 2.2 (8 points): In preparing for a test on a set of topics, is it better to study topics mixed together or topics one at a time? To examine this question, a study took a sample of fourth grade students and split them into two groups. The first group learned by studying mixed problem sets that included examples of all four types of calculations together. The second group studied repeated examples of one equation at a time. A day later, all the students were given a test on the material. The mixed practice group had an average grade of 77 while the students in the one-at-a-time group had an average grade of 38.

Please answer the following questions about the study:

a. Using the symbols discussed in class, give notation (as two symbols separated by a minus sign) for the quantity we would ideally like to know in order to tell if there was a difference between the average scores of these two groups.

b. Again using the symbols discussed in class, give notation (as two symbols separated by a minus sign) for the quantity that gives our best estimate of the ideal quantity based on the data collected.

c. Give the value for the best estimate of the ideal quantity, based on the actual data reported by the study.

For parts a and b, sure to clearly define what all symbols mean in the context of what the research study was examining.

Answers:

a.

b.

c.

$\$

Exercise 2.3 (12 points): The data set CollegeScores4yr contains information on over 2,000 4-year US colleges and universities. One of the variables in this data set is the in-state tuition cost which is the price that students who live in the state pay to attend the college/university. Figure 2 shows two box plots. One represents the in-state tuition cost data for one random sample of n = 30 individual colleges/universities. The other represents a sampling distribution of 1,000 means of in-state tuition cost data for samples of size 30.

a. Which is which? Explain.

b. From the box plot showing the data from one random sample, what does one value in the sample represent? How many values are included in the data to make the box plot? Estimate the minimum and maximum values. Give a rough estimate of the mean of the values and use appropriate notation for your answer.

c. From the box plot showing the data from a sampling distribution, what does one value in the sampling distribution represent? How many values are included in the data to make the box plot? Estimate the minimum and maximum values. Give a rough estimate of the value of the population parameter and use appropriate notation for your answer.

Box plots related to in-state tuition costs

Answers:

a.

b.

c.

$\$

Exercise 2.4 (10 points): A report from a Gallup poll in 2001 started by saying "Forty-five percent of American adults reported getting their health insurance from an employer..." Later in the article we find information on the sampling method, "a random sample of 147,291 adults, aged 18 and over, living in the US," and a sentence about the accuracy of the results, "the maximum margin of sampling error is $\pm$ 1 percentage point".

a. What is the population? What is the sample? What is the population parameter of interest? What is the relevant statistic? Please be sure to use the appropriate symbols we discussed in class in your answers.

b. Use the margin of error to give an interval estimate for the parameter of interest. Interpret this interval in the context of the problem.

Answers:

a.

b.

$\$

Exercise 2.5 (10 points): Overeating for just four weeks can increase fat mass and weight over two years later, a Swedish study shows. Researchers recruited 18 healthy and normal-weight people with an average age of 26. For a four-week period, participants increased calorie intake by 70% (mostly by eating fast food) and limited daily activity to a maximum of 5000 steps per day (considered sedentary). Not surprisingly, width and body fat of the participants went up significantly during the study and then decreased after the study ended. Participants are believed to have returned to the diet and lifestyle they had before the experiment. However, two and a half years after the experiment, the mean weight gain for participants was 6.8 lbs with a standard error of 1.2 lbs. A control group that did not binge had no change in weight.

a. What is the relevant parameter and what symbol should we use to denote it?

b. In a theoretical world where you could force people to do anything, how could you find the actual exact value of the parameter?

c. Give a 95% confidence interval for the parameter and interpret it.

d. Give the margin of error and interpret it.

Answers:

a.

b.

c.

d.

$\$

Part 3: Practice examining sampling distributions and calucating standard errors in R

In homework 0, exercise 2.3-2.5 you used the function get_sprinkle_sample(n) to get a (virtual) sample of sprinkle colors and then you calculated the proportion of red sprinkles in the sample. You then repeated this process for different sample sizes of n. In the exercises below we will now create a sampling distribution for the proportion of red sprinkles.

$\$

A note about random numbers: In this problem we are going to be using a computer to generate random data (i.e., random sprinkle colors). Thus, each time the code is run we would expect to get different data. While this is useful when running simulations, it makes it hard to report particular results since they numbers we get will change each time the code is run.

To deal with this issue, we can "set the seed" of the random number generator using the set.seed() function. What calling the set.seed() function does is it restarts the sequence of random number generation at the same point, so we can repeatedly get the same sequence of "random numbers".

One way to think of this imagine we had a 500 page book filled with random numbers. Calling the function set.seed(100) would be analogous to reading off the random numbers starting at page 100. Thus, we could get a long sequence of random numbers (from page 100 to the end of the book) and if we wanted to get the same repeated sequence of random numbers we could call set.seed(100) again which would return us to page 100 so we can read off the same "random" sequence of numbers again.

When doing this problem set, if you ever want to restart the sequence of random numbers just call the set.seed() function. Also, when reporting the numbers you get on this homework, we would recommend that you knit your document and look at the output from your simulations in the resulting pdf. Then go back and write down the numbers in the answer section based on the results that are in the pdf. If you set the seed in your markdown document (as we have done below for you) every time you knit the document you should get the same "random" numbers so that you can accurately report the results you get by reading them off the knitted pdf.

If you have more questions about this, please ask on Ed Discussions!

$\$

Exercise 3.1 (6 points): For the first exercise, let's review calculating the proportion of red sprinkles. Use the get_sprinkle_sample() function to get a sample of size n = 100 and calculate the proportion of red sprinkles. Report this statistic using the proper symbol. Note: the table() and the prop.table() functions will be useful here.

# do not change these two lines of code
library(SDS100)

# we are setting the seed here so we get the same sequence of random numbers in our simulations
set.seed(100) 

Answer:

$\$

Exercise 3.2 (3 points): In order to make the subsequent exercises a little easier, let's use the function get_proportion() from the SDS100 package. This function directly calculates the proportion of items of a specific category in a sample, so we don't need to call the table() and prop.table() functions.

The get_proportion(data, category_name) takes the following arguments:

  1. data: a vector of data.

  2. category_name: a string specifying the category name one wants to get the proportion of.

Please use the get_proportion() function on the same one_sample that you created in part 3.1 to calculate the proportion of red sprinkles. Verify that the get_proportion() function returns the same $\hat{p}$ value that you calculated in part 3.1.

# check that the get_proportion() returns the same proportion you got in question 3.1

$\$

Exercise 3.3 (6 points): Now use the do_it() function to create a sampling distribution of the proportion of red sprinkles using a sample size of n = 100 sprinkles. To do this, inside the do_it() loop do the following:

  1. Generate a sample of 100 random sprinkles
  2. Use the get_proportion() to calculate the proportion of red sprinkles, like you did in Exercise 3.2.

Make sure your do_it() function repeats this process of generating samples and calculating the proportion of red sprinkles 10,000 times so that there are 10,000 red proportions in your sampling distribution, and save your sampling distribution to an object called sampling_distribution.

Once you have created your sampling distribution in the object sampling_distribution, create a histogram of the sampling distribution and in the answer section report the standard error and what the shape of the sampling distribution looks like.

# create the sampling distribution 







# create a histogram to visualize the sampling distribution 






# calculate the SE

Answer:

$\$

Exercise 3.4 (8 points): Now repeat exercise 3.3 for sample sizes of n = 25 and n = 400. Each sampling distribution should still have 10,000 sample statistics. In the answer section report how the standard error change as function of the sample size n.


Answers:

$\$

Exercise 3.5 (10 points): Finally, draw one additional single sample of size n = 100 and calculate the 95% confidence interval based on this sample and the standard error calculated in exercise 3.3. Then repeat this process for n = 25, and n = 400 using the standard errors calculated in exercise 3.4. In the answer section report all these 95% confidence intervals and whether they change as you expected. Also, what do you think the value of the parameter is, and do you think any of these confidence intervals do not contain this parameter?


Answer: The proportion of red sprinkles is ...

a. The confidence interval for n = 100 is:

b. The confidence interval for n = 25 is:

c. The confidence interval for n = 400 is:

My best guess at the parameter is...

$\$

Reflection (3 points)

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 3.



emeyers/SDS100 documentation built on April 28, 2024, 5:07 p.m.