$\$

The purpose of this homework is to better understand the theory of hypothesis testing, and to practice using dplyr to transform data and ggplot2 to visualize data. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday October 8th.

As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.

For written response, also please clearly report the value or make sure your codes properly display the output.

The team that made dplyr and ggplot has also created many learning resources. Some of them are listed below:

# makes sure you have all the packages we have used in class installed

SDS230::download_data("freshman-15.txt")
SDS230::download_data("forecast_ne_joined.rda")
SDS230::download_image("grumpy_cat.jpg")
library(knitr)
library(latex2exp)
library(dplyr)
library(gapminder)
library(ggplot2)
library(ggridges)

# turn off some dplyr message
options(dplyr.summarise.inform = FALSE)

options(scipen=999)

opts_chunk$set(tidy.opts=list(width.cutoff=60)) 
set.seed(230)  # set the random number generator to always give the same random numbers

$\$

Part 1: Assessing the t-tests type I error rate

As discussed in class, if the Neyman-Pearson paradigm was followed perfectly, in a world where the null hypothesis was alwasy true, we would expect that only $\alpha$ proportion of hypothesis tests that were run would result in type I errors. We can use this fact to evaluate whether a given type of hypothesis test is "robust" to violations of the assumptions/conditions that are supposedly required to use a method.

In particular, we can use simulated data, where we know what the parameters values are, to assess whether a particular type of hypothesis test produces type I errors at the rate that is expected. We can do this by repeatedly generating random data, and then calculating the p-values from these samples. Once we have a large collection of p-values, we can assess the whether proportion of times that the test rejects the null hypothesis matches the specified significance level $\alpha$. If it matches, then the method is working properly, otherwise, it is not reliable.

Let's briefly explore this below by assessing how robust the independent sample t-test is to violations of the condition that the data in each sample comes from a normal distribution.

$\$

Part 1.1 (3 points):

To make our lives easier, we will fuse the t.test() function to get p-values. When we run the t.test() function, the returned result is a data structure that contains many values related to the t-test (i.e., the object that is returned from running the t.testfunction is an object similar to a list and it contains the t-statistic, the degrees of freedom, the p-value, etc.). If we assign the output of running the t.test() function to an object called result, we can use the syntax result$p.value to get the p-value from running the hypothesis test.

Let's start this analysis by reminding ourselves how the t.test() function works. In particular, use the t.test() function to run an independent samples t-test to test whether the mean initial weight of freshman is different than the mean final weight using the "Freshman 15" data from homework 4 (please refer back to this homework for a description of the data). Then extract the p-value and save it to an object called pval. Finally, print what is stored in your pval object to show you have extracted the correct p-value.

freshman <- read.table("freshman-15.txt", header = TRUE)

$\$

Part 1.2 (5 points):

One condition for running an independent samples t-test is that the data from the samples come from normal distributions. Let's start by assessing the type I error rate that occurs when this assumption is met. To do this we can create a for loop which continually add results to a vector called rejections. In each iteration of the for loop we will:

  1. Create a first sample of 10 random values from a standard normal distribution and save this data to an object called sample_1.

  2. Create a second sample of 10 random values from a standard normal distribution and save this data to an object called sample_2.

  3. Use the t.test() method to get a p-value from running a two-tailed independent samples t-test between sample_1 and sample_2.

  4. Assess whether the p-value is less than a specified $\alpha = 0.05$ level and store the resulting Boolean in the ith place in our rejection object; i.e., a value of TRUE means the null hypothesis was rejected and a value of FALSE means it was not.

Once we have run the for loop for 10,000 iterations, we can assess the proportion of rejections and see whether it matches our $\alpha = 0.05$ level.

Please run the analysis below. Report the proportion of null hypotheses that were rejected and consequently if it appears that t-test is indeed giving the expected type I error rate.


Answer

$\$

Part 1.3 (5 points):

Above we assessed with ideal conditions where the data comes from normal distributions. A more interesting question is what happens when some of the assumptions of the t-test are violated. Let's explore this by recreating the code above, but rather than sampling 10 points for each sample from a normal distribution, instead sample the points from a standard exponential distribution (i.e., an exponential distribution with a rate parameter of 1), which is a distribution that has long tails.

Please estimate and report the type I error rate using the code chunk below, and answer whether the t-test appear to be robust to the assumption that the data comes from normal distributions.

Hint: You can use the rexp(k) function to generate k random points from a standard exponential distribution.


Answer

Note: this simulation approach can be used to assess the robustness of many properties and methods including how outliers impact particular methods, and also to assess whether confidence intervals are capturing the parameters the correct proportion of the time (e.g., this method could be used to assess whether bootstrap confidence intervals are working correctly, etc.).

$\$

Part 2: Data transformations with dplyr

Travel by airplane can be convenient because airplanes fly very fast. However, even though the airplanes themselves are fast, their scheduled departure times are often delayed, which can significantly add to ones travel time, and can be frustrating.

Let's analyze data on flights to gain insight into how best to avoid flight delays. In particular, we will look at airplane flights that left the airports in New York City, since these airports are some of the closest major airports to New Haven, and we will use dplyr to do some quick explorations of the data to see if there are some ways to potentially avoid flight delays.

To begin, let's load the data for flights leaving New York City in 2013 using the code below. To get more information on this data set, use ? flights (you don't need to modify anything on the code below).

# install.packages("nycflights13")

# get the flight delays data and load dplyr
require("nycflights13")
data(flights)
data(airlines)   # the names of the airline carriers

$\$

Part 2.1 (5 points): One way to avoid being delayed would be to avoid the worst airlines. Which airline had the longest arrival delays on average, and how long was this average delay? Use the airlines data frame to figure out which airline each carrier code corresponds to.

Note: For the analysis on this homework, just ignore missing values in any summary statistics you create.

library(dplyr)


# get the average delay for each airline

Answers:

$\$

Part 2.2 (5 points): Flights that start off with a delay might end up making up some time during the course of the flight. Examine whether this is true on average by reporting relevant descriptive statistics.

Hint: we only use flights that have positive departure delay, since a flight needs to be delayed in order to "make up time".


Answers:

$\$

Part 2.3 (5 points): Another way to avoid flight delays would be to avoid particularly bad times to fly. Which month of the year had the longest departure delays? Also report which hour of the day had the longest departure delays. Finally, report how many flights left at the hour of the day with the longest delay and what the average delay was at that hour.

Hint: You can use n() within summarize to get the size of each group specified by the call to group_by().


Answers:

$\$

{width=50%}

$\$

Part 3: Data visualization

In the next set of exercises you will use ggplot2 to compare different visualizations and see which gives the clearest insights. As mentioned above, a useful resource for ggplot and other tidyverse code is the book R for Data Science.

$\$

Part 3.1 (8 points): Let's start by comparing some visualizations on the gapminder data which contains information about different countries in the world over time. Use ggplot and the gapminder data to compare the GDP per capita of Japan, the United States and China. Plot a line graph of GDP per capita as a function of the year, with each country in a different color. Also, create a plot that compares these countries' GDP per capita as a function of the year using facets, where the data from each country is on a separate subplot. As always, make sure to label your axes in this plot and in all other plots in this homework. Do you think one type of plot is better than another in comparing these countries? Explain why.

Hint: first use the dplyr filter() function to get the subset of data you need, then plot it.


Answers: [Explain whether you think of of these plots is more informative than the other].

$\$

Part 3.2 (8 points): DataExpo is a Statistics event at the Joint Statistical Meetings where different researchers compare data analysis methods applied to a common data set. In 2018, the data set consisted of weather predictions made between 2014 and 2017. In this exercise, let's look at the data from this event and try to visualize the prediction accuracies for predictions made 0 to 6 days in the future.

The code below loads a data frame called forecast_ne_joined that has the prediction errors for the maximum temperature for the 9 cities in New England, along with several other variables. First, create a new data frame called new_haven_preds that has only the predictions from New Haven, and has only the variables cityID, city, num_days_out_prediction_made and max_temp_prediction_error. Also, convert the variable num_days_out_prediction_made to a factor using the as.factor() function inside of the mutate() function. Then use ggplot to create plots that compare the prediction accuracy as a function of the number of days in advance that a prediction was made using the following geoms:

  1. Create a box plot using geom_boxplot()
  2. Create a violin plot using geom_violin()
  3. Create a joy plot using geom_density_ridges()

Note that the geom geom_density_ridges() comes from the ggridges packages that was loaded at the top of the worksheet, and that the x and y aesthetic mapping needs to be in the opposite order as the mapping used for the geom_boxplot() and geom_violin() geoms.

After you created these plots, briefly discuss which plot you believe most clearly shows how the prediction accuracy decreases as a function of days in the future. Also, don't forget to label your axes using the xlab() and ylab() functions.

# load the data that has the weather prediction errors
load('forecast_ne_joined.rda')

Answers:

$\$

Part 3.3 (5 points): Create an interesting plot using one of the data sets we have discussed in class or another data set you can find. Try exploring other features of ggplot we have not discussed in class using the ggplot cheat sheet. See if you can find something interesting in the data and explain why you find it interesting.


Answers:

$\$

Reflection (2 points)

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 5



emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.