$\$
The purpose of this homework is to better understand the theory of hypothesis testing, and to practice using dplyr to transform data and ggplot2 to visualize data. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled PDF with your answers through Gradescope by 11pm on Sunday October 6th.
As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.
For written responses, please clearly report the relevant values or make sure your code properly displays the output.
The team that made dplyr and ggplot2 has also created many learning resources, including the book R for Data Science and the ggplot cheat sheet.
```r
SDS230::download_data("ne_forecasts.rda")
SDS230::download_data("freshman-15.txt")
```
```r
library(knitr)
library(latex2exp)
library(dplyr)
library(gapminder)
library(ggplot2)
library(ggridges)

# turn off some dplyr messages
options(dplyr.summarise.inform = FALSE)
options(scipen = 999)

opts_chunk$set(tidy.opts = list(width.cutoff = 60))

# set the random number generator to always give the same random numbers
set.seed(230)
```
$\$
As discussed in class, if the Neyman-Pearson paradigm was followed perfectly, in a world where the null hypothesis was always true, we would expect that only $\alpha$ proportion of hypothesis tests that were run would result in type I errors. We can use this fact to evaluate whether a given type of hypothesis test is "robust" to violations of the assumptions/conditions that are supposedly required to use the test.
In particular, we can use simulated data, where we know what the parameter values are, to assess whether a particular type of hypothesis test produces type I errors at the rate that is expected. We can do this by repeatedly generating random data and then calculating the p-values from these samples. Once we have a large collection of p-values, we can assess whether the proportion of times that the test rejects the null hypothesis matches the specified significance level $\alpha$. If it matches, then the method is working properly; otherwise, it is not reliable.
Let's briefly explore this below by assessing how robust the independent samples t-test is to violations of the condition that the data in each sample comes from a normal distribution.
$\$
Part 1.1 (3 points):

To make our lives easier, we will use the t.test() function to get p-values. When we run the t.test() function, the returned result is a data structure that contains many values related to the t-test (i.e., the object that is returned from running the t.test() function is similar to a list and contains the t-statistic, the degrees of freedom, the p-value, etc.). If we assign the output of running the t.test() function to an object called result, we can use the syntax result$p.value to get the p-value from running the hypothesis test.

Let's start this analysis by reminding ourselves how the t.test() function works. In particular, use the t.test() function to run an independent samples t-test to test whether the mean initial weight of freshmen is different from the mean final weight, using the "Freshman 15" data from homework 4 (please refer back to that homework for a description of the data). Then extract the p-value and save it to an object called pval. Finally, print what is stored in your pval object to show you have extracted the correct p-value.
```r
freshman <- read.table("freshman-15.txt", header = TRUE)
```
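A minimal sketch of one way to complete this step is below. The column names initial_weight and final_weight are assumptions about the freshman data frame; substitute the actual column names from the homework 4 data.

```r
# independent samples t-test comparing initial and final weights
# (the column names initial_weight and final_weight are assumed here)
result <- t.test(freshman$initial_weight, freshman$final_weight)

# extract and print the p-value from the returned object
pval <- result$p.value
pval
```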
$\$
Part 1.2 (5 points):

One condition for running an independent samples t-test is that the data from the samples come from normal distributions. Let's start by assessing the type I error rate that occurs when this assumption is met. To do this we can create a for loop which continually adds results to a vector called rejections. In each iteration of the for loop we will:

- Create a first sample of 10 random values from a standard normal distribution and save this data to an object called sample_1.
- Create a second sample of 10 random values from a standard normal distribution and save this data to an object called sample_2.
- Use the t.test() function to get a p-value from running a two-tailed independent samples t-test between sample_1 and sample_2.
- Assess whether the p-value is less than a specified $\alpha = 0.05$ level and store the resulting Boolean in the ith place in our rejections object; i.e., a value of TRUE means the null hypothesis was rejected and a value of FALSE means it was not.

Once we have run the for loop for 10,000 iterations, we can assess the proportion of rejections and see whether it matches our $\alpha = 0.05$ level.

Please run the analysis below. In the answer section below, report the proportion of null hypotheses that were rejected and, consequently, whether it appears that the t-test is indeed giving the expected type I error rate.
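A sketch of this simulation, assuming the setup described above (10,000 iterations, samples of size 10, $\alpha = 0.05$):

```r
n_iterations <- 10000
rejections <- rep(NA, n_iterations)

for (i in 1:n_iterations) {

  # two samples of 10 points each from a standard normal distribution
  sample_1 <- rnorm(10)
  sample_2 <- rnorm(10)

  # two-tailed independent samples t-test
  curr_p_value <- t.test(sample_1, sample_2)$p.value

  # record whether the null hypothesis was rejected at the alpha = 0.05 level
  rejections[i] <- curr_p_value < 0.05
}

# proportion of null hypotheses rejected (the type I error rate)
mean(rejections)
```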
Answer
$\$
Part 1.3 (5 points):

Above we assessed the ideal condition where the data comes from normal distributions. A more interesting question is what happens when some of the assumptions of the t-test are violated. Let's explore this by recreating the code above, but rather than sampling 10 points for each sample from a normal distribution, instead sample the points from a standard exponential distribution (i.e., an exponential distribution with a rate parameter of 1), which is a distribution with a long right tail.

Please estimate and report the type I error rate using the code chunk below, and answer whether the t-test appears to be robust to violations of the assumption that the data comes from normal distributions. Note: a hypothesis test is robust to type I errors if it has an error rate that is as low as or lower than promised by the significance level, and is not robust if the error rate is higher than what is promised.

Hint: You can use the rexp(k) function to generate k random points from a standard exponential distribution.
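A sketch of the modified simulation, assuming the same structure as Part 1.2 but with exponential samples:

```r
n_iterations <- 10000
rejections <- rep(NA, n_iterations)

for (i in 1:n_iterations) {

  # two samples of 10 points each from a standard exponential distribution (rate = 1)
  sample_1 <- rexp(10)
  sample_2 <- rexp(10)

  # two-tailed independent samples t-test
  curr_p_value <- t.test(sample_1, sample_2)$p.value

  # record whether the null hypothesis was rejected at the alpha = 0.05 level
  rejections[i] <- curr_p_value < 0.05
}

# estimated type I error rate when the normality condition is violated
mean(rejections)
```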
Answer
$\$
The DataExpo is a statistics event at the Joint Statistical Meetings where different researchers compare data analysis methods applied to a common data set. In 2018, the data set consisted of weather predictions made between 2014 and 2017. In this exercise, let's look at the data from this event and try to understand how accurate weather predictions are.
The code below loads the dplyr library and also a data frame called ne_forecasts that has predicted weather forecasts and the actual weather for 8 cities in New England. Some variables in this data frame that are of particular interest are:

- city: The city where the prediction is being made
- date_to_predict: The date that is trying to be predicted
- date_prediction_made: The date when a prediction was made
- predicted max temp: The predicted maximum temperature
- Min_TemperatureF: The actual minimum temperature on the date that is being predicted
- Max_TemperatureF: The actual maximum temperature on the date that is being predicted

```r
library(dplyr)

# load the data that has the weather prediction errors
load('ne_forecasts.rda')
```
$\$
Part 2.1 (5 points): To start, create a data frame that has the lowest minimum temperature reported for each city (use the actual minimum temperature, not the predicted minimum temperature). Display this data frame sorted from the city with the lowest minimum temperature to the city with the highest minimum temperature; i.e., you should print out a data frame that has 8 rows (one for each city) and a column with the lowest minimum temperature for each city, sorted from lowest to highest. Please ignore any missing values when calculating the minimum.
In the answer section, explain whether these results match what you would expect in terms of which cities had the lowest temperatures.
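One possible dplyr approach is sketched below; it uses the city and Min_TemperatureF columns described above.

```r
# lowest actual minimum temperature for each city, sorted from coldest to warmest
ne_forecasts %>%
  group_by(city) %>%
  summarize(lowest_min_temp = min(Min_TemperatureF, na.rm = TRUE)) %>%
  arrange(lowest_min_temp)
```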
Answers:
$\$
Part 2.2 (8 points): Now let's start exploring how accurate weather forecasts are. To do this, create a data frame called new_haven_preds which is the same as ne_forecasts but only has the data from New Haven and also has two additional columns:

- max_temp_prediction_error: This column should have the difference between the actual maximum temperature and the predicted maximum temperature for all predictions.
- num_days_out_prediction_made: This column should have the difference between the date being predicted and the date the prediction was made. Also, make this variable a factor using the as.factor() function.

Once you have created this data frame, show that you have created it correctly by printing out just the date_to_predict, max_temp_prediction_error and num_days_out_prediction_made variables for the first 5 rows of the new_haven_preds data frame.
$\$
Part 2.3 (8 points): Now let's finally look at how accurate weather predictions are for New Haven. In particular, create a data frame where:

- Each row corresponds to the number of days in the future the prediction was made (hint: there should be 7 rows total).
- There is one column called mean_abs_error which has the mean value of the absolute errors for predicting what the maximum temperature was.

In the answer section, report whether these results match your expectations.
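A sketch building on the new_haven_preds data frame created in Part 2.2:

```r
# mean absolute prediction error for each number of days out
new_haven_preds %>%
  group_by(num_days_out_prediction_made) %>%
  summarize(mean_abs_error = mean(abs(max_temp_prediction_error), na.rm = TRUE))
```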
Answers:
$\$
In the next set of exercises you will use ggplot2 to compare different visualizations and see which gives the clearest insights. As mentioned above, a useful resource for ggplot and other tidyverse code is the book R for Data Science.
$\$
Part 3.1 (8 points): Let's start by comparing some visualizations on the gapminder data, which contains information about different countries in the world over time. Use ggplot and the gapminder data to compare the GDP per capita of Japan, the United States and Singapore. Plot a line graph of GDP per capita as a function of the year, with each country in a different color. Also, create a plot that compares these countries' GDP per capita as a function of the year using facets, where the data from each country is on a separate subplot. As always, make sure to label your axes in this plot and in all other plots in this homework. Do you think one type of plot is better than the other for comparing these countries? Explain why.

Hint: first use the dplyr filter() function to get the subset of data you need, then plot it.
```r
library(gapminder)
# head(gapminder)
```
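A sketch of both plots, using the gapminder columns country, year and gdpPercap:

```r
# subset the data to the three countries of interest
three_countries <- gapminder %>%
  filter(country %in% c("Japan", "United States", "Singapore"))

# line graph with one colored line per country
ggplot(three_countries, aes(x = year, y = gdpPercap, color = country)) +
  geom_line() +
  xlab("Year") +
  ylab("GDP per capita")

# the same comparison using one facet (subplot) per country
ggplot(three_countries, aes(x = year, y = gdpPercap)) +
  geom_line() +
  facet_wrap(~ country) +
  xlab("Year") +
  ylab("GDP per capita")
```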
Answers: [Explain whether you think one of these plots is more informative than the other].
$\$
Part 3.2 (8 points): In part 2.3 above, you computed the average absolute prediction error as a function of the number of days in advance that a prediction was made. Starting with the new_haven_preds data frame, let's now use ggplot to visualize the whole distribution of New Haven prediction accuracies as a function of the number of days in advance that a prediction was made, using the following geoms (a sketch of all three appears after this paragraph):

- geom_boxplot()
- geom_violin()
- geom_density_ridges()

Note that geom_density_ridges() comes from the ggridges package that was loaded at the top of the homework, and that its x and y aesthetic mapping needs to be in the opposite order from the mapping used for the geom_boxplot() and geom_violin() geoms.

After you have created these plots, briefly discuss which plot you believe most clearly shows how the prediction accuracy decreases as a function of days in the future. Also, don't forget to label your axes using the xlab() and ylab() functions.
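A sketch of the three plots, again building on the new_haven_preds columns created in Part 2.2:

```r
# boxplots of prediction errors for each number of days out
ggplot(new_haven_preds,
       aes(x = num_days_out_prediction_made, y = max_temp_prediction_error)) +
  geom_boxplot() +
  xlab("Days in advance prediction was made") +
  ylab("Prediction error (degrees F)")

# violin plots of the same distributions
ggplot(new_haven_preds,
       aes(x = num_days_out_prediction_made, y = max_temp_prediction_error)) +
  geom_violin() +
  xlab("Days in advance prediction was made") +
  ylab("Prediction error (degrees F)")

# ridgeline plot: note the x and y mapping is flipped relative to the geoms above
ggplot(new_haven_preds,
       aes(x = max_temp_prediction_error, y = num_days_out_prediction_made)) +
  geom_density_ridges() +
  xlab("Prediction error (degrees F)") +
  ylab("Days in advance prediction was made")
```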
Answers:
$\$
Part 3.3 (5 points): Create an interesting plot using one of the data sets we have discussed in class or another data set you can find. Try exploring other features of ggplot we have not discussed in class using the ggplot cheat sheet. See if you can find something interesting in the data and explain why you find it interesting.
Answers:
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 5.