knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(STAT302PACKAGE)
The STAT302PACKAGE contains data and four main functions (in addition to some other minor ones):
my_t.test
, a function which computes a one sample t-test my_lm
, a function which fits a linear model my_knn_cv
, a function which trains a n-nearest neighbors model using
k-fold cross-validationmy_rf_cv
, a function which trains a random forest model using k-fold
cross-validationTo install and use this package, run the following commands in console:
# install and load package devtools::install_github("dzeng8/STAT302PACKAGE") library(STAT302PACKAGE)
my_t.test
TutorialWe will demonstrate the my_t.test
function with lifeExp
data from
the my_gapminder
dataset.
The input data are measurements of people's life expectancy at birth, in years.
We run a t-test with the null hypothesis being that the mean life expectancy is 60 years and with the alternative hypothesis being that the life expectancy is not 60 years.
# run two-sided t-test with null hypothesis of mean as 60 results <- my_t.test(my_gapminder$lifeExp, "two.sided", 60) results
Using a p-value cut-off at $\alpha = 0.05$, we can see that a p-value of
r results[[4]]
implies that we cannot
reject the null hypothesis. In other words, there is not enough evidence to reject the
null hypothesis that the mean life expectancy is 60 in support of the alternative
hypothesis that the mean life expectancy is not 60.
We run a t-test with the null hypothesis being that the mean life expectancy is 60 years and with the alternative hypothesis being that the life expectancy is less than 60 years.
# run one-sided t-test with null hypothesis of mean as 60, alternative less than 60 results <- my_t.test(my_gapminder$lifeExp, "less", 60) results
Using a p-value cut-off at $\alpha = 0.05$, we can see that a p-value of
r results[[4]]
implies that we can reject the null hypothesis.
In other words, we have sufficient evidence to reject the null hypothesis that
the mean life expectancy is 60 years in support of the alternative hypothesis
that the mean life expectancy is less than 60 years.
We run a t-test with the null hypothesis being that the mean life expectancy is 60 years and with the alternative hypothesis being that the life expectancy is greater than 60 years.
# run one-sided t-test with null hypothesis of mean as 60, alternative greater than 60 results <- my_t.test(my_gapminder$lifeExp, "greater", 60) results
Using a p-value cut-off at $\alpha = 0.05$, we can see that a p-value of
r results[[4]]
implies that we cannot reject the null hypothesis.
In other words, we do not have sufficient evidence to reject the null hypothesis that
the mean life expectancy is 60 years in support of the alternative hypothesis
that the mean life expectancy is greater than 60 years.
my_lm
TutorialWe demonstrate the my_lm
function by fitting a model from the
my_gapminder
dataset. The model will predict lifeExp
using
gdpPercap
and continent
as the explanatory variables.
# fit the linear model lifeExp ~ gdpPercap + continent on the my_gapminder dataset results <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder) results
See that the coefficient on the gdpPercap
variable is r results[2, 1]
,
which can be interpreted as meaning life expectancy is expected to increase
by r results[2, 1]
years per unit increase in gdpPercap
.
We can see that the hypothesis test associated with the gdpPercap
covariate is
$H_0 : \beta_{gdpPercap} = 0$ and $H_a : \beta_{gdpPercap} \neq 0$. In other words,
the null hypothesis is that the coefficient of gdpPercap
is equal to 0
while the alternative hypothesis is that the coefficient is not equal to 0.
Using a p-value cut-off of $\alpha = 0.05$, we can see that the p-value of
r results[2, 4]
allows us to reject the null hypothesis. In other words,
the gdpPercap
predictor is statistically significant.
We plot the actual vs fitted values of this linear model.
# create data frame of actual vs fitted values, also adding continent column X <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) beta <- results[, 1] fitted_values <- X %*% beta actual_vs_fitted_df <- data.frame(fitted = fitted_values, actual = my_gapminder$lifeExp, continent = my_gapminder$continent) # plot actual vs fitted values ggplot2::ggplot(actual_vs_fitted_df, ggplot2::aes(x = fitted, y = actual, col = continent)) + ggplot2::labs(title = "Actual vs Fitted Values", x = "Fitted Values", y = "Actual values") + ggplot2::geom_point() + ggplot2::geom_abline() + ggplot2::theme_bw()
When we plot the actual vs fitted values with a reference line y = x
, we
can conclude that points on the line are points where our model correctly predicts
the actual value. We can see that our model does not do well in predicting the
life expectancy for the continents Africa, Americas, and Asia but does somewhat
well in predicting the life expectancy for the continents Europe and Oceania.
We conclude that the model fits well to the data for Europe and Oceania, but
not to the data for Africa, Americas, and Asia.
my_knn_cv
TutorialHere we demonstrate the my_knn_cv
function using the my_penguins
dataset.
We will predict the output class species
using covariates
bill_length_mm
, bill_depth_mm
, flipper_length_mm
, and body_mass_g
using 5-fold cross validation. We train 10 different models using k-nearest
neighbors ranging from 1 to 10. We display the training misclassification rate
and the CV misclassification rate in a table.
# load my_penguins data and extract relevant variables penguin_data <- my_penguins[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "species")] penguin_data <- as.data.frame(penguin_data) # remove rows containing NA values data_noNA <- na.omit(penguin_data) # split data into predictors and outcome train <- data_noNA[, 1:4] cl <- data_noNA[, 5] # demonstrate function with 5-fold cv with k-nearest neighbors ranging from # 1 to 10 training_misclass_rate <- rep(NA, 10) cv_misclass_rate <- rep(NA, 10) for (i in (1: 10)) { results <- my_knn_cv(train, cl, i, 5) # compute training misclassification rate predicted <- results[[1]] misclass_rate <- sum(predicted != cl) / length(cl) training_misclass_rate[i] = misclass_rate # compute cv misclassification rate cv_misclass_rate[i] <- results[[2]] } # create table of training misclassification rate and cv misclassification rate # for k-nearest neighbors ranging from 1 to 10 results_table <- cbind(training_misclass_rate, cv_misclass_rate) colnames(results_table) <- c("Training Misclassification Rate", "CV Misclassification Rate") rownames(results_table) <- c("1-nn", "2-nn", "3-nn", "4-nn", "5-nn", "6-nn", "7-nn", "8-nn", "9-nn", "10-nn") results_table
Based on the training misclassification rate, we would choose the 1-nn model because this has the lowest misclassification rate. Based on the CV misclassification rate, we would choose the 1-nn model because this also has the lowest misclassification rate.
In practice, we would want to choose the model with the lowest CV misclassification rate. This is because the process of cross-validation involves training a model on a subset of the data and then testing the model on the other part of the data which the trained model has not seen before. This process repeats $k$ times (in k-fold cross validation), each time on different subsets of the data. This is useful because it allows us to evaluate the performance of the model and ability to generalize on unseen data. We do not want to choose a model based on the training misclassification rate because evaluating the model on the same data set used to train it would give a more optimistic (and unrealistic) evaluation of the model. Therefore we would choose the 1-nn model in practice based on cross-validation misclassification rate.
my_rf_cv
TutorialThis function uses the my_penguins
data internally and trains a random
forest model which predicts body_mass_g
using covariates bill_length_mm
,
bill_depth_mm
, and flipper_length_mm
. It utilizes k-fold cross validation,
so we will run our function using 2-fold, 5-fold, and 10-fold cross validation.
We generate 30 iterations of a random forest model for each value of $k$.
# store results in a data frame mse_df <- data.frame(matrix(nrow = 0, ncol = length(2))) # run model on 2, 5, and 10 folds for (folds in c(2, 5, 10)) { # generate model 30 times and store mse in a vector mse_vector <- rep(NA, 30) for (i in (1:30)) { avg_mse <- my_rf_cv(folds) mse_vector[i] <- avg_mse } # store results in aggregate data frame new_df <- data.frame(mse = mse_vector, fold = rep(as.character(folds), 30)) mse_df <- rbind(mse_df, new_df) }
We generate boxplots of the data to visualize the distribution of the average mean squared error for each of the chosen number of folds.
# create boxplot, 2-fold cv ggplot2::ggplot(data = mse_df[1:30,], ggplot2::aes(x = fold, y = mse)) + ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", x = "Number of Folds", y = "Mean Squared Error") + ggplot2::geom_boxplot(fill = "lightblue") + ggplot2::theme_bw() # create boxplot, 5-fold cv ggplot2::ggplot(data = mse_df[31:60,], ggplot2::aes(x = fold, y = mse)) + ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", x = "Number of Folds", y = "Mean Squared Error") + ggplot2::geom_boxplot(fill = "lightblue") + ggplot2::theme_bw() # create boxplot, 10-fold cv ggplot2::ggplot(data = mse_df[61:90,], ggplot2::aes(x = fold, y = mse)) + ggplot2::labs(title = "Average Mean Squared Error, 30 Simulations", x = "Number of Folds", y = "Mean Squared Error") + ggplot2::geom_boxplot(fill = "lightblue") + ggplot2::theme_bw()
We can see that as the number of folds increases, the range of values for the average mean squared error decreases.
We also create a table which displays the average CV estimate and the standard deviation of the CV estimates for each number of folds used
# create table of mean and standard deviation of MSE across each fold avg_col <- rbind(mean(mse_df[1:30,1]), mean(mse_df[31:60,1]), mean(mse_df[61:90,1])) sd_col <- rbind(sd(mse_df[1:30,1]), sd(mse_df[31:60,1]), sd(mse_df[61:90,1])) results_table <- cbind(avg_col, sd_col) rownames(results_table) <- c("2-fold CV", "5-fold CV", "10-fold CV") colnames(results_table) <- c("Mean of MSE", "Standard Deviation of MSE") results_table
We can see that as the number of folds increases, the mean MSE decreases and the standard deviation of the MSE decreases. This is most likely due to the fact that as we increase the number of folds, we are increasing the size of our training set and decreasing the size of our test set, leading to more precise and accurate predictions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.