$\$
In the following homework, you will explore more descriptive statistics and visualizations of categorical and quantitative data. Please submit a compiled pdf with your answers to the exercises and Lock 5 questions to Gradescope by 11pm on Sunday January 28th. Please also make sure to indicate the correct pages to each question's answers on Gradescope to make sure that we can grade them.
Some useful functions for completing this worksheet are: dim()
, length()
,
table()
, prop.table()
, barplot()
, hist()
, mean()
, median()
, max()
,
sd()
, fivenum()
, and boxplot()
. You might also find the following symbols
useful: $\mu$, $\bar{x}$, $\pi$, $\hat{p}$
To get started on the homework, please run the first R chunk once by pressing the green play button. This will download a data file and an image that you will need for the homework. Once you have run this chunk once you will not need to run it again. After you have downloaded these files, I recommend you try to knit the document to make sure everything is working, and then you will be ready to start on the homework.
If you need help with any of the homework assignments, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions forum. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions. Finally, if you have conceptual questions, reviewing the class slides and videos will be helpful.
$\$
SDS100::download_image("beta_carotene.png") download.file("https://yale.box.com/shared/static/ey6ahs284lhoye1hgqloe44aqbb2h0qd.rda", "cars_small.Rda")
$\$
Please answer the following questions taken from the Lock5 textbook. Some symbols that might be useful for answering these questions include $\hat{p}$, $\bar{x}$, $\pi$, and $\mu$.
$\$
Exercise 1.1 (6 points): A study by Hamscher et al. (Environmental Health Perspectives, 2003) shows that antibiotics added to animal feed to accelerate growth can become airborne. Some of these drugs can be toxic if inhaled and may increase the evolution of antibiotic-resistant bacteria. The researchers analyzed 20 samples of dust particles from animal farms. Tylosin, an antibiotic used in animal feed that is chemically related to erythromycin, showed up in 16 samples. Based on this study please report:
a. What are the individual cases in this study?
b. What is the variable of interest in this study?
c. What is the statistic of interest in this study, what symbol should be used for this statistic, and what is the value of this statistic? (Your choice of statistic should be of the statistics we discussed in class).
Answers:
a.
b.
c.
$\$
Exercise 1.2 (6 points): The plasma beta-carotene level (concentration of beta-carotene in the blood), in ng/ml was measured for a sample of n = 315 individuals. A histogram of this sample is shown in image below. Please do the following:
a. Describe the shape of this distribution using terms we have discussed in class, and state whether there are any obvious outliers.
b. Estimate the median of this sample.
c. Estimate the mean of this sample.
Note, you might need to knit the document and look at the pdf to see the image of the beta carotene levels.
$\$
{width=80%}
$\$
Answers:
a.
b.
c.
$\$
Exercise 1.3 (6 points): A set of lucky numbers are: 41, 53, 38, 32, 115, 47, and 50. For these lucky numbers find:
a. the mean $\bar{x}$
b. the median m
c. indicate whether there appear to be any outliers and if so, what they are.
Answers:
a.
b.
c.
$\$
Exercise 1.4 (6 points): Using only the whole numbers 1 through 9 as possible data values, create a data set with n = 6, and $\bar{x}$ = 5 with:
a. The standard deviation as small as possible.
b. The standard deviation as large as possible.
Note: There can be repeated numbers in your data set (e.g., it's ok to use the value 3 twice, etc.).
Answers:
a.
b.
$\$
Exercise 1.5 (9 points): The number of days it took 8 different rowers to row solo across the Atlantic Ocean is: 40, 87, 78, 106, 67, 70, 153, and 81. Use the R chunk below to calculate the z-score for the shortest and longest row times and interpret what these values tell us in terms of the mean and standard deviation of the sample.
Hint: it will be useful to create objects that hold intermediate values before
calculating your final answers. For example, start by creating a vector called
row_times
that has the times for the different rowers. Then create objects
called mean_time
and sd_time
which have the mean and standard deviations of
rowers' times. Finally, calculate the z-scores for the shortest and longest row
times.
Answer:
$\$
Let's get practice analyzing data in R by examining information about cars that were sold on Edmunds.com which is a website that helps people shop for cars. This data was used in the 2015 DataFest. Edmunds has made this data available for education purposes only, so please do not distribute this data.
$\$
Exercise 2.1 (6 points): The code below loads data on 5 different brands of
cars (Toyota Corollas, Subaru Imprezas, Honda Civics, Hyundai Elantras and the
Mazda 3's) in a data frame called car_data
.
The dim()
function in R works on a data frame and returns the number of rows and
columns in the data frame. Use the dim()
function on the car_data
data frame
to answer the following questions in the answer section below:
a. How many cases are in the data frame?
b. Report what the variables are and whether each one is categorical or quantitative.
c. Report what the population for this data set is.
Hint: to view the data frame you can load the data on the console and then use
the View()
on the console as well. Remember that you can't use the View()
function in an R Markdown document and also that your R Markdown document does
not have access to the objects in your R environment. Instead, R can only access
objects that are created inside the R Markdown document itself.
# load the car data (don't change this line) load("cars_small.Rda")
Answer:
a.
b.
c.
$\$
Exercise 2.2 (6 points): Let's now get a little practice exploring
categorical data. Let's start by extracting a vector that has the names of the
car brands for each car sold and assigning this data to an object called
brands
(hint use the $ symbol). Then use the table()
function on the brands
vector to create a count of the number of different cars sold for each brand and
assign the results to an object called brand_counts
. Report which car brand
sold the most cars.
Answer:
$\$
Exercise 2.3 (5 points): Now get the relative frequencies of how much each
car brand sold from the brand_counts
table. What proportion of cars listed
were from the brand that sold the most cars? Which brand has the second most
cars sold?
Answer:
$\$
Exercise 2.4 (5 points): Next create a bar chart of the brand counts. Suppose we had a new data set collected 2 years later, do you think the proportion of cars sold by the top selling brand would change? Why or why not?
Answers:
$\$
$\$
In the next exercises we will analyze some quantitative data by examining the prices of Subaru Imprezas, Toyota Corollas and Mazda 3's.
$\$
Exercise 3.1 (5 points): The code below creates a vector called
subaru_prices
, which contains the prices for the Subaru Imprezas that were
sold. Use the length()
function to get the sample size n and verify that the
number of Subarus in this vector matches the number that you found when you ran
the table()
function above.
Note: the length()
function tells you how many elements are in a vector while
the dim()
function tells you how many rows and columns are in a data frame.
Since we are finding the number of elements in a vector, we are using the
length()
function here.
# This line gets a vector of only the Subaru prices (do not change it) subaru_prices <- subset(car_data$price, car_data$brand == "Subaru")
$\$
Exercise 3.2 (6 points): Now use the mean()
and median()
functions on the
subaru_prices
vector. Is the mean or the median higher? Does this indicate that the
data is left or right skewed?
Answers:
$\$
Exercise 3.3 (7 points): Next use the hist(my_vector, breaks)
function to
create a histogram of the prices for the Subarus. Try using different values for
the breaks
argument to create different number of bins in the histogram and be
sure to label your axes appropriately. Then describe the shape of the data, and
whether there seem to be any noticeable outliers.
Answers:
$\$
Exercise 3.4 (7 points): Let's now compare the Subarbu prices to the prices
of Toyota Corollas and Mazda 3's. The code below creates vectors of prices called
mazda_prices
and toyota_prices
. Start by reporting the samples sizes for
these two types of cars. Then create histograms to get a sense of how the Mazda
and Toyota car prices are distributed (and again choose a reasonable number of
bins) and describe the shape of these distributions.
# do not change the lines below mazda_prices <- subset(car_data$price, car_data$brand == "Mazda") toyota_prices <- subset(car_data$price, car_data$brand == "Toyota")
Answer:
$\$
Exercise 3.5 (7 points): Now calculate the mean and standard deviation of the prices for all three brands. Report which brand has the highest mean and which brand has the largest standard deviation.
Answers:
$\$
Exercise 3.6 (8 points): Finally, report the range of values that approximately 95% of the Mazda prices would be in if it was the case that the Mazda prices were normally (bell-shaped) distributed (this range of values should be two numbers for the lower and upper ends of the range). Then calculate the actual 2.5th and 97.5th percentile for the Mazda prices (which captures the end points such that 95% of the data is between these numbers). Are the numbers found using the percentiles close to what you calculated based on assuming the prices had a normal distribution?
Answers:
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 1
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.