$\$

In the following homework, you will gain more experience examining quantitative data using histograms, box plots, and z-scores, and you will practice examining relationships between quantitative variables. Please submit a compiled PDF with your answers to the exercises to Gradescope by 11pm on Sunday February 4th. Please also indicate the correct pages for each question's answers on Gradescope so that we can grade them.

Some useful functions for completing this worksheet are: dim(), length(), hist(), mean(), median(), max(), sd(), sum(), fivenum(), boxplot(), plot(), cor(), lm(), abline(), and coef(). Also see the "R function" page on Canvas for a list of functions we have used in class.

You might also find the following symbols useful: $\mu$, $\bar{x}$, $\pi$, $\hat{p}$, $\rho$, $\hat{y}$. For full credit on problems involving R, be sure to label all your figures and to "show your work" by making sure all values that are answers to questions are printed/visible in your R Markdown pdf.

For all questions that involve R, please use the R chunks (gray sections) to create relevant plots and calculations, and then answer the questions in the "answer" section below each chunk. Be sure to "show your work" by printing out information that answers the questions that have been asked (although please do not print out pages of irrelevant data). Also, make sure to add appropriate labels to all plots.

If you need help with any of the homework assignments, please attend the TA office hours, which are listed on Canvas, and/or ask questions on the Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, reviewing the class slides and videos will be helpful if you run into difficulties.

  install.packages('Lahman')
  install.packages("Lock5Data")

$\$

Part 1: More practice analyzing one quantitative variable

$\$

For the first set of exercises you will get more practice analyzing a single quantitative variable by analyzing data about the heights of Major League Baseball (MLB) players.

$\$

Exercise 1.1 (4 points): The code below loads information about every MLB player who has played through the 2019 season in a data frame called People. Examine the data frame and answer the questions:

a. What does a case correspond to in this data frame?

b. How many cases are there in the data frame?

# load a data frame called 'People' with baseball player information 
library(Lahman)

# A data frame with information about baseball players
People <- People

Answers:

a.

b.

$\$

Exercise 1.2 (5 points): The code below extracts a vector of heights of all the baseball players in inches. Create a histogram of their heights, and be sure to use an appropriate number of bins and to put appropriate labels on your axes. Describe the shape of the histogram. Are there any noticeably small or large outliers?

# extract a vector with baseball player heights
player_heights <- People$height

Answers:

$\$

Exercise 1.3 (8 points): Now create a box plot of the heights, and report the 5 number summary. Do you notice any extreme outliers? Describe what you should do when you see an extreme outlier and then do it! Hint: Google could be helpful here.


Answers:

$\$

Exercise 1.4 (7 points): Next, calculate the mean and standard deviation of the heights of baseball players. Note that you will need to use the na.rm = TRUE option since there is missing data. Then report the range of heights that one would expect the middle 95% of the heights to fall in if the heights were normally distributed. Also calculate the actual 2.5th and 97.5th percentile values for the heights, which correspond to the end points of the middle 95% of the data. Are these end points close to the end points you calculated under the assumption that the heights have a normal distribution?
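As a rough illustration of the comparison described above, the sketch below uses simulated values rather than the baseball data. The vector `toy_values` and its parameters are hypothetical and chosen only for demonstration; the "within about 2 standard deviations" rule is the usual normal-model approximation for the middle 95%.

```r
# Hypothetical data: 1,000 simulated values, with a few set to NA
set.seed(1)
toy_values <- rnorm(1000, mean = 50, sd = 10)
toy_values[c(5, 50, 500)] <- NA

# Mean and standard deviation, ignoring missing values
toy_mean <- mean(toy_values, na.rm = TRUE)
toy_sd <- sd(toy_values, na.rm = TRUE)

# Under a normal model, about 95% of values fall within 2 SDs of the mean
normal_range <- c(toy_mean - 2 * toy_sd, toy_mean + 2 * toy_sd)

# Empirical 2.5th and 97.5th percentiles of the observed values
empirical_range <- quantile(toy_values, probs = c(0.025, 0.975), na.rm = TRUE)

print(normal_range)
print(empirical_range)
```

Because these toy values were simulated from a normal distribution, the two ranges should roughly agree; with real data they may not.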


Answers:

$\$

Exercise 1.5 (7 points): Finally calculate the z-score for the minimum value in the data set (again use the na.rm = TRUE argument when calculating the mean, the standard deviation, and the minimum height in the data). Report what this value is, and what this value means.
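For reference, a z-score measures how many standard deviations a value lies from the mean: $z = (x - \bar{x}) / s$. The minimal sketch below applies this to a small made-up vector (`toy_values` is hypothetical, not the height data).

```r
# Hypothetical data with one missing value
toy_values <- c(4, 8, 6, NA, 2, 10)

# Summary statistics, ignoring the missing value
toy_mean <- mean(toy_values, na.rm = TRUE)
toy_sd <- sd(toy_values, na.rm = TRUE)
toy_min <- min(toy_values, na.rm = TRUE)

# z-score of the minimum: (value - mean) / standard deviation
z_min <- (toy_min - toy_mean) / toy_sd
print(z_min)
```

A negative z-score indicates a value below the mean; its magnitude says how many standard deviations below.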


Answers:

$\$

Part 2: Scatter plots and correlation

$\$

The following questions are based on questions from the Lock5 textbook and are intended to improve your understanding of scatter plots and correlation. I recommend you try additional questions from the textbook to get extra practice.

$\$

Exercise 2.1 (4 points): Let's explore two quantitative variables related to cars. Describe whether you expect a positive or negative association between the distance a car was driven since the last fill-up of the gas tank, and the amount of gas left in the tank. In 1-3 sentences, please explain the rationale behind your answer.

Answer:

$\$

Exercise 2.2 (12 points): Suppose we record the husband's age and the wife's age for 100 randomly selected heterosexual married couples. Please answer the following questions:

a. What would it mean about ages of couples if these two variables had a negative relationship?

b. What would it mean about ages of couples if these two variables had a positive relationship?

c. Which do you think is more likely, a negative or a positive relationship?

d. Do you expect a strong or weak relationship in the data? Why?

e. Would a strong correlation imply there is an association between husband age and wife age?

f. What symbol (of the symbols discussed in class) should we use to denote the correlation discussed in this problem?

Answers:

a.

b.

c.

d.

e.

f.

$\$

Exercise 2.3 (6 points): The table below shows the relationship between two quantitative variables X and Y (you might want to knit the document to see the table better). Please use R to create a scatter plot and calculate the correlation between these variables. Hint: the function to create vectors, c(), might be useful here. Challenge (0 points): try to calculate the correlation without using the cor() function.

|   |    |    |    |    |    |    |    |    |
|---|----|----|----|----|----|----|----|----|
| X | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 |
| Y | 532 | 466 | 478 | 320 | 303 | 349 | 275 | 221 |
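One possible way to approach the challenge is to use the definition of the sample correlation as the sum of the products of z-scores divided by $n - 1$. The sketch below tries this on two small hypothetical vectors (`toy_x` and `toy_y` are made up and are not the values in the table), and checks the result against cor().

```r
# Hypothetical paired data (not the X and Y from the table)
toy_x <- c(1, 2, 3, 4, 5)
toy_y <- c(2, 1, 4, 3, 5)
n <- length(toy_x)

# Convert each variable to z-scores
z_x <- (toy_x - mean(toy_x)) / sd(toy_x)
z_y <- (toy_y - mean(toy_y)) / sd(toy_y)

# Correlation: sum of products of z-scores, divided by n - 1
r_manual <- sum(z_x * z_y) / (n - 1)

print(r_manual)
print(cor(toy_x, toy_y))  # should match r_manual
```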


$\$

Exercise 2.4 (6 points): When honeybee scouts find a food source or a nice site for a new home, they communicate the location to the rest of the swarm by doing a "waggle dance". They point in the direction of the site and dance longer for sites farther away. The rest of the bees use the duration of the dance to predict distance to the site. The code below loads a data frame called HoneybeeWaggle that contains the distance (in meters), and the duration of the dance (in seconds) for seven honeybee scouts. Using this data:

a. Create a scatter plot of the data, and be sure to label the plot appropriately. Also describe whether there appears to be a linear trend in the data. If so, is it positive or negative?

b. Use the cor() function to find the correlation between the two variables.

# download the data and load it into R
library(Lock5Data)
data(HoneybeeWaggle)


# Plot the data here




# Calculate the correlation coefficient here

Answers:

a.

b.

$\$

Part 3: Simple linear regression

$\$

In class we examined data about the number of hot dogs eaten by the winner of Nathan's Famous hot dog eating contest. An additional question we can ask using this data set is whether it has become harder to win this hot dog eating contest in more recent years.

The code below loads data on the winners of Nathan's Famous hot dog eating contest from the years 2002 to 2019. Please use the R chunk below to analyze data related to that contest and answer the following questions in the answer section below. Also, be sure to "show your work" by making sure your answers are printed out in the R chunk below.

$\$

Exercise 3.1 (4 points): In the R chunk below, create a scatter plot of the number of hot dogs eaten as a function of year. Is the trend in the data mostly positive or negative?

# load the data on the hot dog contest winners in a data frame called HotDogs
HotDogs <- HotDogs2019


# Plot the number of hot dogs eaten for the contest winner as a function of the year

Answer

$\$

Exercise 3.2 (7 points): Fit a linear model to this data predicting the number of hot dogs eaten as a function of the year, and save the output of your fitted model to an object called hotdog_fit. Then recreate the scatter plot from 3.1 and add the fitted regression line to your plot (bonus: see if you can make the line color red). Then, just from looking at the resulting plot, answer the following questions: 1) Is the residual larger in 2007 or in 2008? 2) Is the residual positive or negative in 2010?
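As a general illustration of the workflow (scatter plot, fitted model, regression line), the sketch below uses the `cars` data set that ships with R instead of the hot dog data. The object name `cars_fit` is only an example.

```r
# 'cars' is built into R: speed (mph) and stopping distance (ft)
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")

# Fit a linear model predicting distance from speed
cars_fit <- lm(dist ~ speed, data = cars)

# Add the fitted regression line to the existing plot, in red
abline(cars_fit, col = "red")
```

Points above the line have positive residuals and points below it have negative residuals, which is what the plot-reading questions rely on.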

# Recreate the plot above




# Fit a linear regression model to the data




# Add the regression line to the plot

Answer

$\$

Exercise 3.3 (4 points): By looking at the scatter plot, guess the correlation between year and number of hot dogs eaten. Then use R to find the correlation value and report what it is. Is the actual correlation value approximately what you expected it to be?

# Calculate the correlation between the number of hot dogs eaten by the winner and the year

Answer

$\$

Exercise 3.4 (6 points): Extract the coefficients from your linear model in the R chunk below. Then in the answers section, write out what the regression equation is; i.e., write the equation with the values of the coefficients plugged in.

# Print out the regression coefficients here. 
# Then write the equation for the prediction line in the answers section below.

Answer

$\$

Exercise 3.5 (6 points): In the answer section, write an interpretation of what the coefficients mean in terms of years and the number of hot dogs predicted to be eaten.

Answer

$\$

Exercise 3.6 (6 points): Use your regression equation to predict the number of hot dogs expected to be eaten in 2016. Show your work in R. Then find the actual number of hot dogs eaten by the winner in 2016 on the internet and report the residual.
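The general pattern of predicting from a fitted model and computing a residual (observed minus predicted) can be sketched with R's built-in `cars` data set, which stands in here for the hot dog data. The names `cars_fit`, `predicted_dist`, and `manual_pred` are illustrative only.

```r
# Fit a linear model on the built-in 'cars' data
cars_fit <- lm(dist ~ speed, data = cars)

# Take the first observation as the case to predict
obs_speed <- cars$speed[1]
obs_dist <- cars$dist[1]

# Option 1: use predict() with a new data frame
predicted_dist <- predict(cars_fit, newdata = data.frame(speed = obs_speed))

# Option 2: plug the value into the regression equation by hand
b <- coef(cars_fit)
manual_pred <- b[["(Intercept)"]] + b[["speed"]] * obs_speed

# Residual = observed value - predicted value
residual_1 <- obs_dist - unname(predicted_dist)

print(predicted_dist)
print(manual_pred)
print(residual_1)
```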

# Make predictions for how many hot dogs are expected to be eaten in 2016 and calculate the residual


# Or get the prediction for 2016 slightly more manually
# (either way to answer this question is fine) 



# The actual number of hot dogs eaten in 2016 was 70. 
# Thus the residual for the number of hot dogs eaten in 2016 is:

Answer

$\$

Exercise 3.7 (5 points): Would it be appropriate to use this regression line to predict the winner's number of hot dogs in 2050? Please describe why or why not.

# The predicted number of hot dogs to be eaten in 2050.

Answer

$\$

Reflection (3 points)

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 2.



emeyers/SDS100 documentation built on April 28, 2024, 5:07 p.m.