In emeyers/SDS100: Tools for the class Introductory Statistics

$\$

In the following homework, you will explore more descriptive statistics and visualizations of categorical and quantitative data. Please submit a compiled pdf with your answers to the exercises and Lock 5 questions to Gradescope by 11pm on Sunday January 28th. Please also make sure to indicate the correct pages to each question's answers on Gradescope to make sure that we can grade them.

Some useful functions for completing this worksheet are: dim(), length(), table(), prop.table(), barplot(), hist(), mean(), median(), max(), sd(), fivenum(), and boxplot(). You might also find the following symbols useful: $\mu$, $\bar{x}$, $\pi$, $\hat{p}$

To get started on the homework, please run the first R chunk once by pressing the green play button. This will download a data file and an image that you will need for the homework. Once you have run this chunk once you will not need to run it again. After you have downloaded these files, I recommend you try to knit the document to make sure everything is working, and then you will be ready to start on the homework.

If you need help with any of the homework assignments, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, if you have conceptual questions, reviewing the class slides and videos will be helpful.

$\$

SDS100::download_image("beta_carotene.png")

download.file("https://yale.box.com/shared/static/ey6ahs284lhoye1hgqloe44aqbb2h0qd.rda", 
              "cars_small.Rda")

$\$

Part 1: Lock 5 questions

Please answer the following questions taken from the Lock5 textbook. Some symbols that might be useful for answering these questions include $\hat{p}$, $\bar{x}$, $\pi$, and $\mu$.

$\$

Exercise 1.1 (6 points): A study by Hamscher et al. (Environmental Health Perspectives, 2003) shows that antibiotics added to animal feed to accelerate growth can become airborne. Some of these drugs can be toxic if inhaled and may increase the evolution of antibiotic-resistant bacteria. The researchers analyzed 20 samples of dust particles from animal farms. Tylosin, an antibiotic used in animal feed that is chemically related to erythromycin, showed up in 16 samples. Based on this study please report:

a. What are the individual cases in this study?

b. What is the variable of interest in this study?

c. What is the statistic of interest in this study, what symbol should be used for this statistic, and what is the value of this statistic? (Your choice of statistic should be of the statistics we discussed in class).

Answers:

$\$

Exercise 1.2 (6 points): The plasma beta-carotene level (concentration of beta-carotene in the blood), in ng/ml was measured for a sample of n = 315 individuals. A histogram of this sample is shown in image below. Please do the following:

a. Describe the shape of this distribution using terms we have discussed in class, and state whether there are any obvious outliers.

b. Estimate the median of this sample.

c. Estimate the mean of this sample.

Note, you might need to knit the document and look at the pdf to see the image of the beta carotene levels.

$\$ concentration of beta-carotene in blood {width=80%} $\$

Answers:

$\$

Exercise 1.3 (6 points): A set of lucky numbers are: 41, 53, 38, 32, 115, 47, and 50. For these lucky numbers find:

a. the mean $\bar{x}$

b. the median m

c. indicate whether there appear to be any outliers and if so, what they are.

Answers:

$\$

Exercise 1.4 (6 points): Using only the whole numbers 1 through 9 as possible data values, create a data set with n = 6, and $\bar{x}$ = 5 with:

a. The standard deviation as small as possible.

b. The standard deviation as large as possible.

Note: There can be repeated numbers in your data set (e.g., it's ok to use the value 3 twice, etc.).

Answers:

$\$

Exercise 1.5 (9 points): The number of days it took 8 different rowers to row solo across the Atlantic Ocean is: 40, 87, 78, 106, 67, 70, 153, and 81. Use the R chunk below to calculate the z-score for the shortest and longest row times and interpret what these values tell us in terms of the mean and standard deviation of the sample.

Hint: it will be useful to create objects that hold intermediate values before calculating your final answers. For example, start by creating a vector called row_times that has the times for the different rowers. Then create objects called mean_time and sd_time which have the mean and standard deviations of rowers' times. Finally, calculate the z-scores for the shortest and longest row times.

Answer:

$\$

Part 2: Analyzing categorical data from cars sold on Edmunds.com

Let's get practice analyzing data in R by examining information about cars that were sold on Edmunds.com which is a website that helps people shop for cars. This data was used in the 2015 DataFest. Edmunds has made this data available for education purposes only, so please do not distribute this data.

$\$

Exercise 2.1 (6 points): The code below loads data on 5 different brands of cars (Toyota Corollas, Subaru Imprezas, Honda Civics, Hyundai Elantras and the Mazda 3's) in a data frame called car_data.

The dim() function in R works on a data frame and returns the number of rows and columns in the data frame. Use the dim() function on the car_data data frame to answer the following questions in the answer section below:

a. How many cases are in the data frame?

b. Report what the variables are and whether each one is categorical or quantitative.

c. Report what the population for this data set is.

Hint: to view the data frame you can load the data on the console and then use the View() on the console as well. Remember that you can't use the View() function in an R Markdown document and also that your R Markdown document does not have access to the objects in your R environment. Instead, R can only access objects that are created inside the R Markdown document itself.

# load the car data (don't change this line)
load("cars_small.Rda")

Answer:

$\$

Exercise 2.2 (6 points): Let's now get a little practice exploring categorical data. Let's start by extracting a vector that has the names of the car brands for each car sold and assigning this data to an object called brands (hint use the $ symbol). Then use the table() function on the brands vector to create a count of the number of different cars sold for each brand and assign the results to an object called brand_counts. Report which car brand sold the most cars.

Answer:

$\$

Exercise 2.3 (5 points): Now get the relative frequencies of how much each car brand sold from the brand_counts table. What proportion of cars listed were from the brand that sold the most cars? Which brand has the second most cars sold?

Answer:

$\$

Exercise 2.4 (5 points): Next create a bar chart of the brand counts. Suppose we had a new data set collected 2 years later, do you think the proportion of cars sold by the top selling brand would change? Why or why not?

Answers:

$\$

Part 3: Analyzing quantitative data from cars sold on Edmunds.com

$\$

In the next exercises we will analyze some quantitative data by examining the prices of Subaru Imprezas, Toyota Corollas and Mazda 3's.

$\$

Exercise 3.1 (5 points): The code below creates a vector called subaru_prices, which contains the prices for the Subaru Imprezas that were sold. Use the length() function to get the sample size n and verify that the number of Subarus in this vector matches the number that you found when you ran the table() function above.

Note: the length() function tells you how many elements are in a vector while the dim() function tells you how many rows and columns are in a data frame. Since we are finding the number of elements in a vector, we are using the length() function here.

# This line gets a vector of only the Subaru prices (do not change it)
subaru_prices <- subset(car_data$price, car_data$brand == "Subaru")

$\$

Exercise 3.2 (6 points): Now use the mean() and median() functions on the subaru_prices vector. Is the mean or the median higher? Does this indicate that the data is left or right skewed?

Answers:

$\$

Exercise 3.3 (7 points): Next use the hist(my_vector, breaks) function to create a histogram of the prices for the Subarus. Try using different values for the breaks argument to create different number of bins in the histogram and be sure to label your axes appropriately. Then describe the shape of the data, and whether there seem to be any noticeable outliers.

Answers:

$\$

Exercise 3.4 (7 points): Let's now compare the Subarbu prices to the prices of Toyota Corollas and Mazda 3's. The code below creates vectors of prices called mazda_prices and toyota_prices. Start by reporting the samples sizes for these two types of cars. Then create histograms to get a sense of how the Mazda and Toyota car prices are distributed (and again choose a reasonable number of bins) and describe the shape of these distributions.

# do not change the lines below 
mazda_prices <- subset(car_data$price, car_data$brand == "Mazda")
toyota_prices <- subset(car_data$price, car_data$brand == "Toyota")

Answer:

$\$

Exercise 3.5 (7 points): Now calculate the mean and standard deviation of the prices for all three brands. Report which brand has the highest mean and which brand has the largest standard deviation.

Answers:

$\$

Exercise 3.6 (8 points): Finally, report the range of values that approximately 95% of the Mazda prices would be in if it was the case that the Mazda prices were normally (bell-shaped) distributed (this range of values should be two numbers for the lower and upper ends of the range). Then calculate the actual 2.5th and 97.5th percentile for the Mazda prices (which captures the end points such that 95% of the data is between these numbers). Are the numbers found using the percentiles close to what you calculated based on assuming the prices had a normal distribution?

Answers:

$\$