In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

The purpose of this homework is to better understand the theory of hypothesis testing, and to practice using dplyr to transform data and ggplot2 to visualize data. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday October 6th.

As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.

For written response, also please clearly report the value or make sure your codes properly display the output.

The team that made dplyr and ggplot has also created many learning resources. Some of them are listed below:

SDS230::download_data("ne_forecasts.rda")
SDS230::download_data("freshman-15.txt")

library(knitr)
library(latex2exp)
library(dplyr)
library(gapminder)
library(ggplot2)
library(ggridges)

# turn off some dplyr message
options(dplyr.summarise.inform = FALSE)

options(scipen=999)

opts_chunk$set(tidy.opts=list(width.cutoff=60)) 
set.seed(230)  # set the random number generator to always give the same random numbers

$\$

Part 1: Assessing the t-tests type I error rate

As discussed in class, if the Neyman-Pearson paradigm was followed perfectly, in a world where the null hypothesis was always true, we would expect that only $\alpha$ proportion of hypothesis tests that were run would result in type I errors. We can use this fact to evaluate whether a given type of hypothesis test is "robust" to violations of the assumptions/conditions that are supposedly required to use the test.

In particular, we can use simulated data, where we know what the parameters values are, to assess whether a particular type of hypothesis test produces type I errors at the rate that is expected. We can do this by repeatedly generating random data, and then calculating the p-values from these samples. Once we have a large collection of p-values, we can assess the whether proportion of times that the test rejects the null hypothesis matches the specified significance level $\alpha$. If it matches, then the method is working properly, otherwise, it is not reliable.

Let's briefly explore this below by assessing how robust the independent samples t-test is to violations of the condition that the data in each sample comes from a normal distribution.

$\$

Part 1.1 (3 points):

To make our lives easier, we will use the t.test() function to get p-values. When we run the t.test() function, the returned result is a data structure that contains many values related to the t-test (i.e., the object that is returned from running the t.testfunction is an object similar to a list and it contains the t-statistic, the degrees of freedom, the p-value, etc.). If we assign the output of running the t.test() function to an object called result, we can use the syntax result$p.value to get the p-value from running the hypothesis test.

Let's start this analysis by reminding ourselves how the t.test() function works. In particular, use the t.test() function to run an independent samples t-test to test whether the mean initial weight of freshman is different than the mean final weight using the "Freshman 15" data from homework 4 (please refer back to this homework for a description of the data). Then extract the p-value and save it to an object called pval. Finally, print what is stored in your pval object to show you have extracted the correct p-value.

freshman <- read.table("freshman-15.txt", header = TRUE)

$\$

Part 1.2 (5 points):

One condition for running an independent samples t-test is that the data from the samples come from normal distributions. Let's start by assessing the type I error rate that occurs when this assumption is met. To do this we can create a for loop which continually adds results to a vector called rejections. In each iteration of the for loop we will:

Create a first sample of 10 random values from a standard normal distribution and save this data to an object called sample_1.
Create a second sample of 10 random values from a standard normal distribution and save this data to an object called sample_2.
Use the t.test() method to get a p-value from running a two-tailed independent samples t-test between sample_1 and sample_2.
Assess whether the p-value is less than a specified $\alpha = 0.05$ level and store the resulting Boolean in the ith place in our rejection object; i.e., a value of TRUE means the null hypothesis was rejected and a value of FALSE means it was not.

Once we have run the for loop for 10,000 iterations, we can assess the proportion of rejections and see whether it matches our $\alpha = 0.05$ level.

Please run the analysis below. In the answer section below, report the proportion of null hypotheses that were rejected and consequently if it appears that t-test is indeed giving the expected type I error rate.

Answer

$\$

Part 1.3 (5 points):

Above we assessed with ideal conditions where the data comes from normal distributions. A more interesting question is what happens when some of the assumptions of the t-test are violated. Let's explore this by recreating the code above, but rather than sampling 10 points for each sample from a normal distribution, instead sample the points from a standard exponential distribution (i.e., an exponential distribution with a rate parameter of 1), which is a distribution that has long tails.

Please estimate and report the type I error rate using the code chunk below, and answer whether the t-test appear to be robust to the assumption that the data comes from normal distributions. Note: a hypothesis tests is robust to type I errors if it has an error at that is as low or lower than promised by the significance level, and is not robust if the error rate is higher than what is promised.

Hint: You can use the rexp(k) function to generate k random points from a standard exponential distribution.

Answer

$\$

Part 2: Data transformations with dplyr

The DataExpo is a Statistics event at the Joint Statistical Meetings where different researchers compare data analysis methods applied to a common data set. In 2018, the data set consisted of weather predictions made between 2014 and 2017. In this exercise, let's look at the data from this event and try to understand how accurate whether predictions are.

The code below loads the dplyr library and also a data frame called ne_forecasts that has predicted weather forecasts and the actual weather for 8 cities in New England. Some variables in this data frame that are of particular interest are:

city: The city where the prediction is being made
date_to_predict: The date that is trying to be predicted
date_prediction_made: The date when a prediction was made
predicted max temp: The predicted maximum temperature
Min_TemperatureF: The actual minimum temperature on the date that is being predicted
Max_TemperatureF: The actual maximum temperature on the date that is being predicted

library(dplyr)


# load the data that has the weather prediction errors
load('ne_forecasts.rda')

$\$

Part 2.1 (5 points): To start, create a data frame that has the lowest minimum temperature reported for each city (use the actual minimum temperature, not the predicted minimum temperature). Display this data frame sorted from the city with lowest minimum temperature to the city with highest minimum temperature; i.e., you should print out a data frame that has 8 rows (one for each city) and one column, which has the minimum temperature for each city, where the data frame is sorted from the city with the lowest minimum temperature to the city with the highest minimum temperature. Please ignore any missing values when calculating the minimum.

In the answer section, explain whether these results match what you would expect in terms of which cities had the lowest temperatures.

Answers:

$\$

Part 2.2 (8 points): Now let's start exploring how accurate weather forecasts are. To do this, create a data frame called new_haven_preds which is the same as ne_forecasts but only has the data from New Haven and also has two additional columns which are:

max_temp_prediction_error: This column should have the difference between the actual maximum temperature and predicted maximum temperature for all predictions.
num_days_out_prediction_made: This should have the difference between the date being predicted and the date the prediction was made. Also, make this variable a factor using the as.factor() function.

Once you have created this data frame, show that you have created it correctly by printing out just thedate_to_predict, max_temp_prediction_error and the num_days_out_prediction_made variables for the first 5 rows of the new_haven_preds data frame.

$\$

Part 2.3 (8 points): Now let's finally look at how accurate weather predictions are for New Haven. In particular, create a data frame where:

Each row corresponds to the number of days in the future the prediction was made (hint: there should be 7 rows total).
There is one column called mean_abs_error which has mean value of the absolute errors for predicting what the maximum temperature was.

In the answer section, report if these results match your expectations.

Answers:

$\$

Part 3: Data visualization

In the next set of exercises you will use ggplot2 to compare different visualizations and see which gives the clearest insights. As mentioned above, a useful resource for ggplot and other tidyverse code is the book R for Data Science.

$\$

Part 3.1 (8 points): Let's start by comparing some visualizations on the gapminder data which contains information about different countries in the world over time. Use ggplot and the gapminder data to compare the GDP per capita of Japan, the United States and Singapore. Plot a line graph of GDP per capita as a function of the year, with each country in a different color. Also, create a plot that compares these countries' GDP per capita as a function of the year using facets, where the data from each country is on a separate subplot. As always, make sure to label your axes in this plot and in all other plots in this homework. Do you think one type of plot is better than another in comparing these countries? Explain why.

Hint: first use the dplyr filter() function to get the subset of data you need, then plot it.

library(gapminder)
# head(gapminder)

Answers: [Explain whether you think of of these plots is more informative than the other].

$\$

Part 3.2 (8 points): In part 2.3 above, you plotted the average absolute prediction accuracy as a function of the number of days in advance that a prediction was made. Starting with the new_haven_preds data frame, let's now use ggplot to visualize the whole distribution of New Haven prediction accuracies as a function of the number of days in advance that a prediction was made using the following geoms:

Create a box plot using geom_boxplot()
Create a violin plot using geom_violin()
Create a joy plot using geom_density_ridges()

Note that the geom geom_density_ridges() comes from the ggridges packages that was loaded at the top of the homework, and that the x and y aesthetic mapping needs to be in the opposite order as the mapping used for the geom_boxplot() and geom_violin() geoms.

After you created these plots, briefly discuss which plot you believe most clearly shows how the prediction accuracy decreases as a function of days in the future. Also, don't forget to label your axes using the xlab() and ylab() functions.

Answers:

$\$

Part 3.3 (5 points): Create an interesting plot using one of the data sets we have discussed in class or another data set you can find. Try exploring other features of ggplot we have not discussed in class using the ggplot cheat sheet. See if you can find something interesting in the data and explain why you find it interesting.

Answers:

$\$