$\$
The purpose of this homework is to better understand the theory of hypothesis
testing, and to practice using dplyr
to transform data and ggplot2
to visualize
data. Please fill in the appropriate code and write answers to all questions in
the answer sections, then submit a compiled pdf with your answers through
Gradescope by 11pm on Sunday October 8th.
As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.
For written response, also please clearly report the value or make sure your codes properly display the output.
The team that made dplyr
and ggplot
has also created many learning
resources. Some of them are listed below:
# makes sure you have all the packages we have used in class installed SDS230::download_data("freshman-15.txt") SDS230::download_data("forecast_ne_joined.rda") SDS230::download_image("grumpy_cat.jpg")
library(knitr) library(latex2exp) library(dplyr) library(gapminder) library(ggplot2) library(ggridges) # turn off some dplyr message options(dplyr.summarise.inform = FALSE) options(scipen=999) opts_chunk$set(tidy.opts=list(width.cutoff=60)) set.seed(230) # set the random number generator to always give the same random numbers
$\$
As discussed in class, if the Neyman-Pearson paradigm was followed perfectly, in a world where the null hypothesis was alwasy true, we would expect that only $\alpha$ proportion of hypothesis tests that were run would result in type I errors. We can use this fact to evaluate whether a given type of hypothesis test is "robust" to violations of the assumptions/conditions that are supposedly required to use a method.
In particular, we can use simulated data, where we know what the parameters values are, to assess whether a particular type of hypothesis test produces type I errors at the rate that is expected. We can do this by repeatedly generating random data, and then calculating the p-values from these samples. Once we have a large collection of p-values, we can assess the whether proportion of times that the test rejects the null hypothesis matches the specified significance level $\alpha$. If it matches, then the method is working properly, otherwise, it is not reliable.
Let's briefly explore this below by assessing how robust the independent sample t-test is to violations of the condition that the data in each sample comes from a normal distribution.
$\$
Part 1.1 (3 points):
To make our lives easier, we will fuse the t.test()
function to get p-values.
When we run the t.test()
function, the returned result is a data structure
that contains many values related to the t-test (i.e., the object that is
returned from running the t.test
function is an object similar to a list and it
contains the t-statistic, the degrees of freedom, the p-value, etc.). If we
assign the output of running the t.test()
function to an object called
result
, we can use the syntax result$p.value
to get the p-value from running
the hypothesis test.
Let's start this analysis by reminding ourselves how the t.test()
function
works. In particular, use the t.test()
function to run an independent samples
t-test to test whether the mean initial weight of freshman is
different than the mean final weight using the "Freshman 15" data from
homework 4 (please refer back to this homework for a description of the data).
Then extract the p-value and save it to an object called pval
. Finally, print
what is stored in your pval
object to show you have extracted the correct
p-value.
freshman <- read.table("freshman-15.txt", header = TRUE)
$\$
Part 1.2 (5 points):
One condition for running an independent samples t-test is that the data from
the samples come from normal distributions. Let's start by assessing the type I
error rate that occurs when this assumption is met. To do this we can create a
for loop which continually add results to a vector called rejections
. In each
iteration of the for loop we will:
Create a first sample of 10 random values from a standard normal distribution
and save this data to an object called sample_1
.
Create a second sample of 10 random values from a standard normal
distribution and save this data to an object called sample_2
.
Use the t.test()
method to get a p-value from running a two-tailed
independent samples t-test between sample_1
and sample_2
.
Assess whether the p-value is less than a specified $\alpha = 0.05$ level and
store the resulting Boolean in the i
th place in our rejection
object; i.e.,
a value of TRUE
means the null hypothesis was rejected and a value of FALSE
means it was not.
Once we have run the for loop for 10,000 iterations, we can assess the proportion of rejections and see whether it matches our $\alpha = 0.05$ level.
Please run the analysis below. Report the proportion of null hypotheses that were rejected and consequently if it appears that t-test is indeed giving the expected type I error rate.
Answer
$\$
Part 1.3 (5 points):
Above we assessed with ideal conditions where the data comes from normal distributions. A more interesting question is what happens when some of the assumptions of the t-test are violated. Let's explore this by recreating the code above, but rather than sampling 10 points for each sample from a normal distribution, instead sample the points from a standard exponential distribution (i.e., an exponential distribution with a rate parameter of 1), which is a distribution that has long tails.
Please estimate and report the type I error rate using the code chunk below, and answer whether the t-test appear to be robust to the assumption that the data comes from normal distributions.
Hint: You can use the rexp(k)
function to generate k
random points from a
standard exponential distribution.
Answer
Note: this simulation approach can be used to assess the robustness of many properties and methods including how outliers impact particular methods, and also to assess whether confidence intervals are capturing the parameters the correct proportion of the time (e.g., this method could be used to assess whether bootstrap confidence intervals are working correctly, etc.).
$\$
Travel by airplane can be convenient because airplanes fly very fast. However, even though the airplanes themselves are fast, their scheduled departure times are often delayed, which can significantly add to ones travel time, and can be frustrating.
Let's analyze data on flights to gain insight into how best to avoid flight delays. In particular, we will look at airplane flights that left the airports in New York City, since these airports are some of the closest major airports to New Haven, and we will use dplyr to do some quick explorations of the data to see if there are some ways to potentially avoid flight delays.
To begin, let's load the data for flights leaving New York City in 2013 using the code below. To get more information on this data set, use ? flights (you don't need to modify anything on the code below).
# install.packages("nycflights13") # get the flight delays data and load dplyr require("nycflights13") data(flights) data(airlines) # the names of the airline carriers
$\$
Part 2.1 (5 points): One way to avoid being delayed would be to avoid the
worst airlines. Which airline had the longest arrival delays on average, and how
long was this average delay? Use the airlines
data frame to figure out which
airline each carrier code corresponds to.
Note: For the analysis on this homework, just ignore missing values in any summary statistics you create.
library(dplyr) # get the average delay for each airline
Answers:
$\$
Part 2.2 (5 points): Flights that start off with a delay might end up making up some time during the course of the flight. Examine whether this is true on average by reporting relevant descriptive statistics.
Hint: we only use flights that have positive departure delay, since a flight needs to be delayed in order to "make up time".
Answers:
$\$
Part 2.3 (5 points): Another way to avoid flight delays would be to avoid particularly bad times to fly. Which month of the year had the longest departure delays? Also report which hour of the day had the longest departure delays. Finally, report how many flights left at the hour of the day with the longest delay and what the average delay was at that hour.
Hint: You can use n()
within summarize to get the size of each group specified
by the call to group_by()
.
Answers:
$\$
{width=50%}
$\$
In the next set of exercises you will use ggplot2
to compare different
visualizations and see which gives the clearest insights. As mentioned above, a
useful resource for ggplot and other tidyverse code is the book R for Data
Science.
$\$
Part 3.1 (8 points): Let's start by comparing some visualizations on the
gapminder data which contains information about
different countries in the world over time. Use ggplot
and the gapminder
data to
compare the GDP per capita of Japan, the United States and China. Plot a line
graph of GDP per capita as a function of the year, with each country in a
different color. Also, create a plot that compares these countries' GDP per
capita as a function of the year using facets, where the data from each country
is on a separate subplot. As always, make sure to label your axes in this plot
and in all other plots in this homework. Do you think one type of plot is better
than another in comparing these countries? Explain why.
Hint: first use the dplyr filter()
function to get the subset of data you
need, then plot it.
Answers: [Explain whether you think of of these plots is more informative than the other].
$\$
Part 3.2 (8 points): DataExpo is a Statistics event at the Joint Statistical Meetings where different researchers compare data analysis methods applied to a common data set. In 2018, the data set consisted of weather predictions made between 2014 and 2017. In this exercise, let's look at the data from this event and try to visualize the prediction accuracies for predictions made 0 to 6 days in the future.
The code below loads a data frame called forecast_ne_joined
that has the
prediction errors for the maximum temperature for the 9 cities in New England,
along with several other variables. First, create a new data frame called
new_haven_preds
that has only the predictions from New Haven, and has only the
variables cityID
, city
, num_days_out_prediction_made
and
max_temp_prediction_error
. Also, convert the variable
num_days_out_prediction_made
to a factor using the as.factor()
function
inside of the mutate()
function. Then use ggplot to create plots that compare
the prediction accuracy as a function of the number of days in advance that a
prediction was made using the following geoms:
geom_boxplot()
geom_violin()
geom_density_ridges()
Note that the geom geom_density_ridges()
comes from the ggridges packages that
was loaded at the top of the worksheet, and that the x and y aesthetic mapping
needs to be in the opposite order as the mapping used for the geom_boxplot()
and
geom_violin()
geoms.
After you created these plots, briefly discuss which plot you believe most
clearly shows how the prediction accuracy decreases as a function of days in the
future. Also, don't forget to label your axes using the xlab()
and ylab()
functions.
# load the data that has the weather prediction errors load('forecast_ne_joined.rda')
Answers:
$\$
Part 3.3 (5 points): Create an interesting plot using one of the data sets we have discussed in class or another data set you can find. Try exploring other features of ggplot we have not discussed in class using the ggplot cheat sheet. See if you can find something interesting in the data and explain why you find it interesting.
Answers:
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 5
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.